I wanted an AI assistant that could actually execute work across my real tools, not just answer in a chat window. OpenClaw was the first setup that felt operationally useful day to day, from monitoring jobs to handling personal workflows.

Key Takeaways

  • Tool integration and autonomy matter more than chat quality alone.
  • The biggest jump came from splitting the system into “Hands” and “Brain.”
  • Hierarchical memory reduced token usage by roughly 60-70% in practice.
  • Model choice depends on task style, not just benchmark scores.

I use a lot of AI. I’m constantly comparing the latest LLMs, benchmarking ASR models, and testing every new coding tool that hits the market. But OpenClaw was the first time I really felt that “Agentic AI” had arrived.

Unlike closed environments like ChatGPT’s agents or the Microsoft Copilot ecosystem, OpenClaw is a fully hackable, open-source framework. While tools like Claude Code or Codex focus primarily on the terminal and IDE, OpenClaw lives where you live: it communicates via Telegram or Signal and connects directly to the tools you actually use (Notion, Gmail, custom Python scripts). It’s not an assistant trapped in a browser tab; it’s an autonomous agent running on your own infrastructure.

Meet Leonardo: My Agentic Partner

I call my instance Leonardo. He isn’t just sitting there waiting for prompts; he’s integrated into my life.

Currently, Leonardo runs on a modest Hetzner VM (8 GB RAM) for just €4 a month. He’s powered by Kimi 2.5 via NVIDIA’s GPU-accelerated endpoints, which provides an incredibly fast and “human” interface for daily interactions.

The full power only emerged once I split OpenClaw into two units. Leonardo acts as the “Hands” (Kimi 2.5), handling agentic execution and day-to-day operations. For complex coding refactors and deep technical research, he escalates to the “Brain” (Codex CLI). This division of labor made the system dramatically more capable.
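The Hands/Brain split above can be sketched as a tiny router. This is purely illustrative: the keyword list, function names, and unit labels are my assumptions for the example, not OpenClaw’s actual escalation mechanism.

```python
# Hypothetical sketch of the "Hands"/"Brain" split: a lightweight router
# decides whether a request stays with the fast conversational model or
# escalates to a heavier coding/research model. All names are illustrative.

# Keywords that hint a task is heavy enough to escalate (an assumption).
ESCALATION_KEYWORDS = {"refactor", "debug", "architecture", "research"}

def route(task: str) -> str:
    """Return which unit should handle the task: 'hands' or 'brain'."""
    words = set(task.lower().split())
    if words & ESCALATION_KEYWORDS:
        return "brain"   # heavy model, e.g. Codex CLI
    return "hands"       # fast day-to-day model, e.g. Kimi 2.5
```

In practice the routing decision could also be made by the Hands model itself via an instruction in its memory files; the hard-coded keyword check here just keeps the sketch deterministic.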

Telegram Interaction with Leonardo

A real-time conversation with Leonardo via Telegram, where he processes food logs and updates my status.

Beyond the Chatbox

What makes OpenClaw different is the autonomy. It was the first time AI felt “real” to me: the very first search it performed was for cheap houses in Basel (hint: there are NONE), but the way it returned a curated list of real URLs with logical reasoning was the spark. Since then it has managed my life:

  • Health Tracking: I just tell Leonardo what I ate, and he estimates the macros and updates my Notion DB.
  • Coding via Telegram: I can trigger backtests or git commits while I’m at the gym.
  • Proactive Monitoring: When I’m training a model, Leonardo watches the logs and pings me if something looks weird.
  • Personal Tasks: Leonardo booked a dentist appointment for me, researched a gift idea for my wife, and sent her an email. He’s also constantly scouting the (admittedly non-existent) “cheap” house market in Basel, keeping an eye out for anything nice that pops up. It’s these small, proactive touches that make the agentic experience so different.
Notion Calorie Tracking

Leonardo manages my shopping list and tracks my calories, automatically syncing everything to Notion.
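The Notion syncing behind the health tracking can be sketched with the official REST API’s page-creation payload. The property names (“Meal”, “Calories”, “Protein”) are assumptions about my database schema, and the numbers in the usage note are made-up example values, not Leonardo’s actual estimates.

```python
# Minimal sketch of pushing a meal estimate into a Notion database via
# POST https://api.notion.com/v1/pages. Property names are assumptions
# about the schema; the database ID is a placeholder.

NOTION_DB_ID = "YOUR_DATABASE_ID"  # placeholder, not a real ID

def meal_page_payload(meal: str, kcal: int, protein_g: float) -> dict:
    """Build the JSON body for Notion's create-page endpoint."""
    return {
        "parent": {"database_id": NOTION_DB_ID},
        "properties": {
            # A Notion database's title property holds the page name.
            "Meal": {"title": [{"text": {"content": meal}}]},
            "Calories": {"number": kcal},
            "Protein": {"number": protein_g},
        },
    }
```

A body like `meal_page_payload("240g tuna roll from Coop", 380, 42.0)` would then be sent with an `Authorization: Bearer <token>` header and a `Notion-Version` header, which the Notion API requires.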

Notion Data View

A look at the raw data entries Leonardo generates.

The Benchmark & The Models

I started a small, ever-expanding Vincenzo OpenClaw Benchmark to track how different models perform on real agent tasks. Right now it covers only 9 tasks, so it is not yet representative and is still evolving.

The current setup is intentionally easy to verify. Each run is checked with a second LLM call using binary questions such as:

  • “Is this row present in the table?”
  • “Is the collected number actually present?”

Each task can thus be scored as a clean True / False outcome. For now I also verify the results manually, but the goal is to eventually automate that step as well.
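The binary verification step above can be sketched as follows. The `ask_llm` callable is a stand-in for whatever client performs the second LLM call; it is not a real OpenClaw or benchmark API.

```python
# Sketch of the binary verification step: a second model is asked a
# yes/no question about the task outcome, and its free-text answer is
# reduced to a True/False score. `ask_llm` is a hypothetical stand-in.

def verify(question: str, ask_llm) -> bool:
    """Score one task: the judge is constrained to answer yes or no."""
    prompt = f"Answer strictly with 'yes' or 'no'. {question}"
    answer = ask_llm(prompt).strip().lower()
    # Anything that doesn't start with "yes" counts as a failure,
    # which keeps the scoring conservative.
    return answer.startswith("yes")
```

Constraining the judge to a yes/no answer is what keeps each task verifiable; fuzzier rubric-style grading would reintroduce the ambiguity the benchmark is trying to avoid.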

Task examples include:

  • “Add my lunch from this description: 240g tuna roll from coop.”
  • “Use gog to write me an email with the following statement: …”
  • “Guess based on the image the calories, fibers and protein in this meal and add it to my Notion DB.”

Model coverage is also incomplete for now. I mostly test models that are free online, plus models available through Codex / GitHub subscriptions.

Vincenzo OpenClaw benchmark on 9 verifiable tasks

Early 9-task benchmark snapshot (binary verifiable outcomes, evolving dataset).

Current snapshot:

  • Opus 4.5/4.6: 100%
  • Codex 5.3: 100%
  • Kimi 2.5: 100%
  • MiniMax-2.5: 89%
  • GPT-5.2: 89%
  • GPT-5-mini: 77%
  • Gemini 3 Flash Preview: 77%
  • Grok Code Fast 1: 55%

Qualitative notes from daily use (same Soul/Memory/Identity setup, so every model receives the same prompt):

  • Gemini 3 Flash Preview: very capable at reasoning and usually smart, but often too cautious in agent mode. Instead of acting, it frequently asked for more context, confirmation, or permission even when the task was already clear. It could probably reach 100% with more prompt engineering and tuning of the markdown files.
  • Grok Code Fast 1: often got confused, entered loops, and recovered poorly. I know this pattern all too well from coding tasks, where it repeatedly damaged my Python files and then struggled to repair basic indentation errors.
  • Kimi 2.5: the strongest personality and the most fun to use. It is the only model in my tests that naturally brings expressive style into Telegram interactions (including symbols), while still staying useful. As soon as I switch to other models, the tone becomes more robotic and less “human,” which is a shame because I really like the vibe Kimi brings to the table.

The Efficiency Engine: Hierarchical Memory

One of the biggest technical wins was implementing Hierarchical Memory. Instead of feeding the model a massive, flat history, I organized it like a file system:

  1. Main Memory: The evergreen “soul” and core identity.
  2. Areas: Category-specific knowledge (e.g., a “Coding” area that instructs Leonardo to use Codex for heavy lifting).
  3. Projects: Deep-dive context for specific repos or tasks.

By loading only the “nodes” Leonardo needs for a specific task, I’ve managed to reduce token usage by a massive 60-70%. It makes the assistant feel like it has a long-term “soul” without burning through API credits.
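The node-loading idea can be sketched like this. The directory layout (`MAIN.md`, `areas/*.md`) and the crude keyword-based relevance check are my illustrative assumptions, not OpenClaw’s real memory implementation.

```python
# Illustrative sketch of hierarchical memory loading: rather than one flat
# history, context lives in files (main -> areas -> projects) and only the
# nodes relevant to the current task are concatenated into the prompt.
# Paths and the matching rule are assumptions, not OpenClaw's real layout.
from pathlib import Path

def load_context(task: str, root: Path) -> str:
    """Concatenate only the memory nodes relevant to this task."""
    parts = [(root / "MAIN.md").read_text()]           # evergreen "soul", always loaded
    for area in sorted((root / "areas").glob("*.md")):
        # Crude relevance check: load an area only if its name appears
        # in the task (e.g. a "coding" task pulls in areas/coding.md).
        if area.stem.lower() in task.lower():
            parts.append(area.read_text())
    return "\n\n".join(parts)
```

A real implementation would use a smarter relevance signal (the model itself, or embeddings) and a third “projects” level, but the savings come from the same principle: most nodes stay out of the prompt on most turns.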

The Future: Privacy & Power

I’m excited about what’s next. I’m planning to move Leonardo to a Mac Mini M4 to give him full browser access and even more local horsepower.

I’m also looking to move away from Telegram. While I’m careful not to share secrets, I do discuss my private life, my health, and my business, so I’m looking forward to moving our communication to Signal for that extra layer of privacy. OpenClaw has access to the things that matter to me (my shopping list, my research, my data), and I want to make sure that “partnership” is as secure as possible.

Updated: