OpenClaw Hybrid Routing: The Hallucination Sandbox and the Reality of VRAM

Have you ever thought about dumping a 32-billion-parameter model into a 24GB GPU and magically spawning a flawless autonomous agent? I did… It didn’t work.

In my previous post, we established the baseline for pushing past the VRAM Iron Curtain. The goal for this phase was simple on paper: implement a hybrid agentic routing system. We wanted a smart traffic cop. Send the simple, high-frequency tool executions to snappy local models. Reserve the heavy lifting for the cloud. We designed a brilliant theoretical pipeline. The reality of maintaining it was a nightmare of silent fallbacks, API mismatches, and hardware bottlenecks. But our story has a happy ending, and the lessons along the way will hopefully spare you a catastrophic failure in production.

For this round, we used a slightly varied setup from our last post:

  • Ollama:
    • CPU: AMD Ryzen 9 9900X
    • RAM: 32GB DDR5 (6000 MT/s / PC5-48000)
    • GPU: MSI GeForce RTX 3090 Ti Suprim X (24GB VRAM)
    • Power: MSI MAG A1000G PCIE5 (1000W)
    • OS: Windows 11
    • Context Window: 16K
  • LiteLLM (proxy for Ollama):
    • CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    • RAM: 12 GB
    • OS: Ubuntu 24.04
  • OpenClaw:
    • CPU: Intel(R) Core(TM) i5 @ 1.4GHz
    • RAM: 4GB
    • OS: Debian GNU/Linux 13 (trixie)

A curious thing also changed (courtesy of a Windows update): the GPU enumeration shifted and Ollama started loading models directly into the system’s RAM. We had to set environment variables to disable the AMD Ryzen’s embedded Radeon GPU and pin the NVIDIA GPU by its UUID.
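
The fix boiled down to two environment variables on the Windows box. A sketch of the idea (the UUID placeholder comes from your own nvidia-smi -L output, and HIP_VISIBLE_DEVICES=-1 is one common way of hiding ROCm devices, not necessarily the exact values we used):

    :: list the GPUs and copy the RTX 3090 Ti's UUID
    nvidia-smi -L

    :: pin Ollama's CUDA runtime to that card, by UUID (placeholder shown)
    setx CUDA_VISIBLE_DEVICES "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

    :: hide the Ryzen's integrated Radeon from the ROCm runtime
    setx HIP_VISIBLE_DEVICES -1

    :: then restart the Ollama service/tray app so the variables take effect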

The Gateway Illusion

Everything seemed ready for this new set of experiments, and after some minor updates to OpenClaw’s configuration files for the new models, our initial testing phase felt like a massive success. We fired off complex multi-tool prompts to our local quantized models. They nailed every single task. They read files, wrote summaries, and executed shell commands flawlessly.

It was all a lie!

We had fallen into the Agent’s Blindfold. OpenClaw’s Gateway is designed for resilience: if a requested sub-agent fails or isn’t “perfectly configured”, the system silently falls back to the primary LLM to ensure the user gets an answer. Our default was Gemini. Gemini was doing the homework. Our local models were just taking the credit.

To make sure we were testing what we were supposed to test, we built a quarantine. We bypassed OpenClaw’s Gateway entirely and mapped the canonical IDs from our LiteLLM proxy in our openclaw.json to strictly match the upstream provider. Then we dropped to the bare metal, using the exec tool to run local CLI commands while manually watching the nvidia-smi output. Since the OpenClaw host is secluded, it could not reach Ollama’s API to query which model was loaded for each test (and we did not want to expose Ollama via APISIX for security reasons), so we reverted to the plain old method: watching 14GB of VRAM physically light up, and the GPU spike on the 3090 Ti, as the ground truth.
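
For the record, the “plain old method” is nothing fancier than polling nvidia-smi on the Ollama host while a test runs:

    :: print VRAM usage and GPU load once per second
    nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1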

Ollama can keep several models loaded at the same time (when it calculates they fit in the available memory), but we made sure each test ran with only one model resident, to avoid artifacts like latency effects or inconsistencies in how Ollama schedules the calls.
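
Ollama exposes a knob for exactly this. Roughly (the model tag is illustrative, and the variable needs an Ollama restart to stick):

    :: keep at most one model resident in memory
    setx OLLAMA_MAX_LOADED_MODELS 1

    :: between runs: check what is resident and evict it explicitly
    ollama ps
    ollama stop qwen3-coder:30b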

The Ollama Curveball

Just as we locked down the testing environment, the ground shifted. LiteLLM talks to Ollama through two prefixes: the standard ollama/ route, which hits the legacy /api/generate endpoint, and ollama_chat/, which hits /api/chat and fundamentally changes how native tool schemas are handled. Our LiteLLM configuration was using the standard ollama/ route. This meant LiteLLM was trying to hack tool-calling by injecting system prompts telling the model to “reply in JSON”.

The result was a mess. Models would spit out raw JSON strings instead of triggering actual functions.

We had to rewrite the LiteLLM model definitions to force the ollama_chat/ endpoint, which passes the raw OpenAI-compatible tool arrays directly to the models.
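
The shape of that change in LiteLLM’s config.yaml looks something like this (the alias, tag, and api_base are placeholders, not our literal entries):

    model_list:
      - model_name: qwen3-coder-30b            # the alias the agent requests
        litellm_params:
          # before: ollama/qwen3-coder:30b  -> /api/generate + "reply in JSON" hack
          model: ollama_chat/qwen3-coder:30b   # after: /api/chat, native tool schemas
          api_base: http://ollama-host:11434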

This configuration change acted as a brutal filter. Half our roster immediately threw 500 errors: Ollama rightfully refused to pass tool schemas to models that lacked a native tool-calling template in its registry.
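
On recent Ollama builds you can spot the survivors before sending a single request, because a model that accepts tool schemas advertises it in its metadata (model tag illustrative, output trimmed):

    ollama show qwen3-coder:30b

      Capabilities
        completion
        tools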

The Trial by Fire

With the pipeline finally secure and the roster of models reduced, we ran the survivors through a gauntlet. We tested three vectors: a chain of tools (read, write, execute), a “needle in a haystack” workspace search, and a strict structural constraint test.

The failures were spectacular.

Granite 3.3 was fast, but it proved to be too smart for its own good. When asked to read a file about our Vault Bridge email setup, it hallucinated an entire tutorial on Gmail and Postfix servers. It prioritized its generic training weights over the physical file it just read.

Qwen 3 Coder (30B) nailed the technical execution. It found hidden IDs in archived files. But when asked to format its output into a strict Markdown table without conversational filler, it suffered a complete breakdown. It fell into an infinite repetition loop, regurgitating the same rows until it hit the token limit.

Then we hit the wall of Quantization Gravity. Qwen 2.5 (32B) passed the structural tests beautifully. But the 4-bit quant required 25GB of memory, and the arithmetic is unforgiving: 32 billion parameters at roughly 4.5 bits apiece is already ~18GB of weights, before the 16K-token KV cache and runtime buffers pile on. It spilled over into system RAM (the cognitive lobotomy). A simple file read took 119 seconds of agonizing page swapping.

The New Roster

After the dust settled, we purged the dead weight. We deleted the hallucinating models and the syntax-failing endpoints to reclaim disk space (and VRAM), and defined strict, role-based boundaries for the survivors.
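
The purge itself is mundane; the tags below are illustrative stand-ins for whatever ollama list reports on your host:

    :: inventory, then delete the deprecated weights
    ollama list
    ollama rm granite3.3
    ollama rm mistral-small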

The Agentic Workhorses (Tools Enabled):

  • GLM 4.7 Flash: Our new Tier 1 champion. It passed every test. It exhibited the “Phantom Spike”: loading and executing so efficiently that the GPU compute curve barely registered.
  • Qwen 3 Coder (30B): The Technical Lead. We restrict it from strict formatting tasks, but it owns the CLI.
  • Qwen 3.5 (8B): The Efficiency King. Perfect for rapid, simple tool chains.

The Thinkers (No Tools Allowed):

  • DeepSeek-R1 (32B) & Phi-4: These models failed the tool tests, so we stripped their tool privileges entirely and route them through the legacy ollama/ endpoint (see the sketch below). They are now pure reasoning engines, isolated in a sandbox to synthesize data without the risk of hallucinating a shell command.
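
In LiteLLM terms the thinkers simply keep the legacy prefix; a sketch with placeholder names:

    model_list:
      - model_name: deepseek-r1-reasoner       # reasoning only, never offered tools
        litellm_params:
          model: ollama/deepseek-r1:32b        # legacy /api/generate route
          api_base: http://ollama-host:11434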

The following table merges the initial failures with our definitive benchmarks.

Model Name           | Test A (Chain) | Test B (Needle) | Test C (Structure) | Peak Memory | Final Verdict
GLM 4.7 Flash        | Pass           | Pass            | Pass               | 19.6 GB     | Champion
Qwen 3 Coder (30B)   | Pass           | Pass            | Fail (loop)        | 20.1 GB     | Technical Lead
Qwen 3.5 (8B)        | Pass           | –               | –                  | 9.3 GB      | Efficiency King
Qwen 2.5 (32B)       | –              | –               | Pass               | 25.0 GB     | Structural Lead
Qwen 2.5 Coder (14B) | –              | –               | –                  | 13.8 GB     | Deprecated
Granite 3.3          | Fail           | –               | –                  | 19.4 GB     | Deprecated
DeepSeek-R1 (32B)    | Fail           | –               | –                  | 22.4 GB     | Reasoning Only
Phi-4 / Reasoning    | Fail           | –               | –                  | 18.3 GB     | Reasoning Only
Gemma-2 (27B)        | –              | –               | –                  | 19.2 GB     | Synthesis Only
Mistral-Small        | –              | –               | –                  | 18.0 GB     | Deprecated
Llama3-ChatQA        | –              | –               | –                  | 19.4 GB     | Deprecated
Mistral (7B)         | –              | –               | –                  | 7.6 GB      | Deprecated
GPT-OSS              | –              | –               | –                  | 14.0 GB     | Deprecated
SLLM GLM-z1-9b       | –              | –               | –                  | 8.6 GB      | Deprecated
DeepSeek-Coder-V2    | –              | –               | –                  | 16.3 GB     | Deprecated
Phi-3.5              | –              | –               | –                  | N/A         | Deprecated

Engineering Over Vibes

We are no longer guessing which model is doing the work. We are no longer relying on prompt engineering to force a generic model to act like a coder. We designed the physics; the AI executes within the parameters we built.

The current state of our OpenClaw setup is lean. The VRAM is optimized. The models have explicit, hardware-enforced roles. In the next phase of this journey, we will wire these roles into the OpenClaw heartbeat, allowing the system to autonomously select the right model for the right task based on these very benchmarks.

Intent is useless without execution. Parameters beat vibes. Every single time.