OpenClaw Benchmarks: Engineering Past the VRAM Iron Curtain

Outcomes are not the same thing as progress…

The industry is currently obsessed with “vibe coding”: throwing loose prompts at an LLM and hoping for a usable script. This is the equivalent of letting contractors build a house without a blueprint. Transitioning from this factory-worker mindset to true software architecture requires rigor, intentionality, and a system that actually measures capability. We recently set up a dedicated OpenClaw rig, complete with strict backup protocols and execution limits, to find out exactly what consumer hardware can handle when forced to do actual engineering.

The Problem: The VRAM Iron Curtain

Balancing local and third-party cloud models is an exercise in managing trade-offs. Cloud models offer immense reasoning depth and zero infrastructure overhead, but they come with latency, privacy concerns, and recurring costs. Moving to local models gives you absolute control and data sovereignty, but you immediately hit the VRAM Iron Curtain.

You do not get engineering for free. If you want to run capable, agentic models locally, you are constrained by your hardware footprint. The challenge isn’t just fitting a model into memory; it’s finding a model small enough to load yet capable enough to reason autonomously without collapsing into probabilistic drift. When a model exceeds the available VRAM on a card, it spills into system RAM. This isn’t just a slowdown; it is a cognitive lobotomy. Latency spikes, the KV cache fragments, and the model begins to hallucinate tool calls or loop its logic simply to avoid the massive computational friction of its own weights.
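
The back-of-the-envelope arithmetic behind that limit is simple: weight memory scales with parameter count times bytes per parameter. A minimal sketch of that estimate follows; the 20% runtime overhead factor is an assumption for illustration, not a measured constant.

```python
# Rough estimate of whether a model's weights fit in VRAM.
# Quantization levels and the +20% overhead (runtime buffers, cache
# headroom) are illustrative assumptions.

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}  # common quantizations

def weight_footprint_gb(params_billion: float, quant: str = "q4") -> float:
    """Approximate weight memory in GB for a given quantization."""
    bytes_total = params_billion * 1e9 * BYTES_PER_PARAM[quant]
    return bytes_total * 1.2 / 1e9  # +20% runtime overhead (assumed)

def fits_in_vram(params_billion: float, vram_gb: float = 24.0,
                 quant: str = "q4") -> bool:
    """Does the weight footprint fit on the card without spilling to RAM?"""
    return weight_footprint_gb(params_billion, quant) <= vram_gb

# A 22B model at 4-bit lands around 13 GB and fits a 24 GB card;
# a 70B model at 4-bit needs roughly 42 GB and spills into system RAM.
print(fits_in_vram(22), fits_in_vram(70))
```

By this arithmetic, the 22B-and-under bracket is exactly where a 24GB card stops swapping, which is what the benchmarks below bear out.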

The Setup: Defining the Constraints

To test this, we built a deliberate, contained environment. The host machine represents “high-end” consumer hardware, specifically chosen to test the limits of what an individual architect or small QA team can deploy:

  • CPU: AMD Ryzen 9 9900X
  • System Board: X870E AORUS MASTER
  • RAM: 32GB DDR5 (6000 MT/s / PC5-48000)
  • GPU: MSI GeForce RTX 3090 Ti Suprim X (24GB VRAM)
  • Power: MSI MAG A1000G PCIE5 (1000W)
  • Infrastructure: OpenClaw + LiteLLM running seamlessly on WSL2 Ubuntu 24.04.

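The stack on that last line is what stitches the rig together: OpenClaw talks to LiteLLM, which proxies requests to the local models. A minimal LiteLLM proxy config sketch is shown below, assuming an Ollama backend on its default port; the model aliases and tags are illustrative, not our actual configuration.

```yaml
# Hypothetical litellm_config.yaml: map friendly aliases to local models.
model_list:
  - model_name: local-coder            # alias the agent requests
    litellm_params:
      model: ollama/qwen2.5-coder:14b  # illustrative Ollama tag
      api_base: http://localhost:11434
  - model_name: local-architect
    litellm_params:
      model: ollama/mistral-small:22b  # illustrative Ollama tag
      api_base: http://localhost:11434
```
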
This is not a massive server rack; it is a realistic workstation designed for an SDET who needs to move fast without abdicating design.

The Benchmark: Measuring the ‘SDET Architect’

We didn’t test for boilerplate generation. We tested for agentic, “SDET Architect” capabilities. An architect doesn’t just write code; they design systems, map dependencies, and troubleshoot environments. Our benchmark required the models to:

  • Design a scalable regression testing strategy for microservices.
  • Prioritize tests based on change impact analysis.
  • Explain the role of a Vector Database in selecting relevant test cases.

This last point is critical: in a microservices environment, you cannot run 10,000 tests on every commit. There are various techniques for optimizing your testing space; one of them is to use a Vector DB to store “embeddings” (mathematical representations) of your test cases and your code changes. By performing a similarity search, the system selects only the tests that “mathematically resemble” the changed code. This is an elegant way to achieve testing scalability. We wanted to see if our local models understood this architectural nuance.
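
A toy sketch of that selection step, using plain cosine similarity over pre-computed embeddings. In production the vectors would come from an embedding model and live in a vector DB; the test names, 3-dimensional vectors, and the diff embedding here are all invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_tests(change_vec, test_index, top_k=2):
    """Rank test cases by similarity to the change embedding, keep top_k."""
    ranked = sorted(test_index.items(),
                    key=lambda kv: cosine(change_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Illustrative 3-d embeddings; real ones have hundreds of dimensions.
test_index = {
    "test_checkout_flow":  [0.9, 0.1, 0.0],
    "test_login":          [0.1, 0.9, 0.0],
    "test_inventory_sync": [0.0, 0.2, 0.95],
}
change_embedding = [0.85, 0.15, 0.05]  # e.g. a diff touching checkout code

print(select_tests(change_embedding, test_index))
# -> ['test_checkout_flow', 'test_login']
```

The commit only triggers the tests that mathematically resemble it; the inventory suite stays cold.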

The Results: Punching Above the Weight Class

We pitted local mid-weights (specifically Mistral Small 22B and Qwen 2.5 Coder 14B) against the 120B+ parameter behemoths in the cloud. The results were revealing:

| Model | Result | Runtime | VRAM | Remarks | Observations / Hypotheses |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | PASS | 20s | N/A | Gold Standard | Best architectural logic & safety. High context awareness. |
| GPT-OSS-120b-cloud | PASS | 19s | N/A | The Architect | Exceptional reasoning. RL feedback logic handles edge cases well. |
| Gemini 3.1 Flash Lite | PASS | 8s | N/A | The Engineer | Best speed/accuracy ratio for agentic flows. |
| Qwen 2.5 Coder (14B) | PASS | 35s | 11GB | The Daily Driver | Perfect tool accuracy. Best-performing local model for code engineering. |
| Mistral Small (22B) | PASS | 39s | 18GB | The Local Architect | Highest reasoning depth for local design. Follows strategy better than Qwen. |
| DeepSeek Coder V2 (16B) | PASS | 40s | 14GB | The MoE Specialist | Solid technical documentation. MoE architecture prevents the “logic collapse” seen in dense 30B models. |
| Gemma 2 (27B) | PASS | 39s | 20GB | Tier-1 Strategist | Concise and technically accurate. Understands semantic search concepts deeply. |
| Phi-4 (Standard) | PASS | 21s | 10GB | Efficiency Hero | Surprising competence for its size. Good structured output control. |
| Phi-3.5 | PASS | 25s | 8GB | The Drawing-Challenged Professor | High theoretical logic. Excellent explanations but cannot generate correct Mermaid syntax. |
| DeepSeek-R1-32b | PASS | 45s | 22GB | Reasoning Specialist | High depth; prone to logic loops on first tool call. |
| Qwen 2.5 (32B) | FAIL | 40s | 22GB | Tool Hallucinator | Hallucinated a design_regression_testing_strategy tool instead of reasoning. |
| Command-R (35B) | FAIL | 12s | 24GB | Information Hungry | Attempted to read an unrelated tmux skill. Shows “Tool-Hunger” bias: instinct is to find information rather than attempt reasoning. |
| Phi-4 Reasoning | FAIL | 30s | 12GB | Logic Loop | Caught in internal reasoning cycles; failed to output a final artifact. |
| GLM-Z1 (9B) | FAIL | 10s | 7GB | Fragmented Logic | Output broken JSON snippets instead of an architectural design. |
| Qwen 3.5 (35B) | KILL | N/A | 27GB | OOM Collapse | Reasoning collapsed during RAM swap. Too large for the 3090 Ti. |
| Llama 4 | KILL | N/A | 78GB | VRAM Overflow | Caused 504 Gateway Timeouts. Too large for the 3090 Ti. |
| Nemotron | KILL | N/A | 57GB | Hardware Mismatch | Footprint too large for the 3090 Ti. |
| GPT-OSS-20b | FAIL | 16s | 14GB | Zero Output | Reasoning floor issue. |
| Mistral (Local) | FAIL | 6s | 8GB | Logic Disconnect | Hallucinated unrelated tool calls. |
| Granite 3.3 | FAIL | 8s | 12GB | Task Avoider | Hallucinated custom tools to avoid design work. |

Conclusions

Models whose footprint exceeds 22-24GB are unsuitable for sub-agent tasks on this hardware. When a model spills into system RAM, the “illusion of velocity” is shattered by a logic loop: the model encounters an error, misinterprets the context due to memory fragmentation, and repeatedly attempts the same flawed execution. You cannot bluff your way past the hardware limit. Mistral Small (22B) and Qwen 2.5 Coder (14B) represent the current performance peak for this GPU: they leave enough room for a robust KV cache (the “memory” of the current conversation) without triggering a swap.
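
That KV-cache headroom argument can be made concrete: per generated token, the cache stores one key and one value vector for every layer. A rough sizing sketch follows; the layer count, head counts, and head dimension below are hypothetical architecture numbers for a mid-weight model, not any specific model’s.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens,
                bytes_per_elem=2):
    """Approximate KV cache size: 2 tensors (K and V) per layer per token.

    bytes_per_elem=2 assumes an fp16 cache; quantized caches are smaller.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

# Hypothetical mid-weight model: 48 layers, 8 KV heads (GQA), head_dim 128.
# At a 32k context in fp16 this adds roughly 6.4 GB on top of the weights.
print(round(kv_cache_gb(48, 8, 128, 32_768), 2))
```

A model that already fills 22GB of a 24GB card leaves almost nothing for this cache, which is exactly when the swap-induced logic loops begin.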

Next Steps: Moving Beyond Static Routing

The path forward is systemic. Our next phase involves moving away from static model assignment to automated routing (e.g. via model-matrix.json). We will route philosophical “Architect” tasks to models that excel at reasoning (Mistral Small) and syntax-heavy “Jarvis” tasks to specialized coders (Qwen 14B).
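
A sketch of what that routing could look like is below. The model-matrix.json schema, the task labels, and the fallback behavior are all hypothetical, invented here for illustration; the source only names the file.

```python
import json

# Hypothetical model-matrix.json contents; the real schema may differ.
MODEL_MATRIX = json.loads("""
{
  "architect": {"model": "mistral-small:22b",  "why": "reasoning depth"},
  "jarvis":    {"model": "qwen2.5-coder:14b",  "why": "tool accuracy"},
  "default":   {"model": "qwen2.5-coder:14b",  "why": "daily driver"}
}
""")

def route(task_type: str) -> str:
    """Pick a local model for a task type, falling back to the default."""
    entry = MODEL_MATRIX.get(task_type, MODEL_MATRIX["default"])
    return entry["model"]

print(route("architect"))  # design/strategy work -> the reasoning model
print(route("jarvis"))     # syntax-heavy coding  -> the coder model
```

The point of the matrix is that the routing decision lives in data, not code: retiring a model or promoting a new benchmark winner is a one-line JSON edit.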

Future work may include fine-tuning some of these mid-weight models on successful OpenClaw execution traces to embed the rigorous, phase-based workflow directly into their weights, reducing “Tool-Hunger” and ensuring that our local agents stop acting like typists and start acting like Architects.