Outcomes are not the same thing as progress…
The industry is currently obsessed with “vibe coding”: throwing loose prompts at an LLM and hoping for a usable script. This is the equivalent of letting contractors build a house without a blueprint. Transitioning from this factory-worker mindset to true software architecture requires rigor, intentionality, and a system that actually measures capability. We recently set up a dedicated OpenClaw rig, complete with strict backup protocols and execution limits, to find out exactly what consumer hardware can handle when forced to do actual engineering.
The Problem: The VRAM Iron Curtain
Balancing local and third-party cloud models is an exercise in managing trade-offs. Cloud models offer immense reasoning depth and zero infrastructure overhead, but they come with latency, privacy concerns, and recurring costs. Moving to local models gives you absolute control and data sovereignty, but you immediately hit the VRAM Iron Curtain.
You do not get engineering for free. If you want to run capable, agentic models locally, you are constrained by your hardware footprint. The challenge isn’t just fitting a model into memory; it’s finding a model small enough to load but capable enough to reason autonomously without collapsing into probabilistic drift. When a model exceeds the available VRAM on a card, it spills into System RAM. This isn’t just a slowdown; it is a cognitive lobotomy. The latency spikes, the KV cache fragments, and the model begins to hallucinate tool calls or loop its logic simply to avoid the massive computational friction of its own weights.
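That spillover threshold can be estimated before a model is ever pulled. Here is a minimal back-of-envelope sketch; the ~4.5 effective bits per weight (typical of a Q4_K_M-style quantization) and the fixed runtime overhead are assumptions for illustration, not measurements from our rig:

```python
def model_vram_gb(params_b: float, bits_per_weight: float = 4.5,
                  overhead_gb: float = 1.5) -> float:
    """Rough VRAM needed to hold a quantized model.

    params_b: parameter count in billions (e.g. 22 for a 22B model).
    bits_per_weight: effective bits after quantization (~4.5 assumed).
    overhead_gb: rough allowance for activations and runtime buffers.
    """
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gb + overhead_gb

# A 22B model at ~4.5 bits needs roughly 13 GB, leaving context headroom
# on a 24 GB card; a 70B model spills into System RAM no matter what.
for size_b in (14, 22, 70):
    est = model_vram_gb(size_b)
    print(f"{size_b}B -> ~{est:.1f} GB (fits 24GB card: {est <= 24})")
```

The exact numbers shift with quantization and context length, but the shape of the constraint does not: the weights alone decide whether you are on the right side of the Iron Curtain.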
The Setup: Defining the Constraints
To test this, we built a deliberate, contained environment. The host machine represents “high-end” consumer hardware, specifically chosen to test the limits of what an individual architect or small QA team can deploy:
- CPU: AMD Ryzen 9 9900X
- System Board: X870E AORUS MASTER
- RAM: 32GB DDR5 (6000 MHz / PC5-48000)
- GPU: MSI GeForce RTX 3090 Ti Suprim X (24GB VRAM)
- Power: MSI MAG A1000G PCIE5 (1000W)
- Infrastructure: OpenClaw + LiteLLM running seamlessly on WSL2 Ubuntu 24.04.
This is not a massive server rack; it is a realistic workstation designed for an SDET who needs to move fast without abdicating design responsibility.
The Benchmark: Measuring the ‘SDET Architect’
We didn’t test for boilerplate generation. We tested for agentic, “SDET Architect” capabilities. An architect doesn’t just write code; they design systems, map dependencies, and troubleshoot environments. Our benchmark required the models to:
- Design a scalable regression testing strategy for microservices.
- Prioritize tests based on change impact analysis.
- Explain the role of a Vector Database in selecting relevant test cases.
This last point is critical: in a microservices environment, you cannot run 10,000 tests on every commit. There are various techniques for shrinking your testing space; one of them is to use a Vector DB to store “embeddings” (mathematical representations) of your test cases and your code changes. By performing a similarity search, the system selects only the tests that “mathematically resemble” the changed code. This is an elegant way to achieve testing scalability. We wanted to see if our local models understood this architectural nuance.
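To make the mechanism concrete (this is not the implementation we run — a real setup would use a trained embedding model and a proper vector store), here is a toy similarity search over hand-rolled bag-of-words “embeddings”:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use a trained model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_tests(change_summary: str, test_index: dict, top_k: int = 1):
    """Rank stored test-case embeddings against a code change."""
    change_vec = embed(change_summary)
    ranked = sorted(test_index.items(),
                    key=lambda kv: cosine(change_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical test suite descriptions, embedded once and stored.
test_index = {name: embed(desc) for name, desc in {
    "test_checkout_flow": "checkout payment gateway order total",
    "test_login_oauth": "login oauth token refresh session",
    "test_inventory_sync": "inventory stock sync warehouse",
}.items()}

print(select_tests("refactor payment gateway retry logic in checkout",
                   test_index))  # the checkout test ranks first
```

Swap the toy `embed` for a real embedding model and the dict for a vector DB, and the selection logic stays exactly this simple.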
The Results: Punching Above the Weight Class
We pitted local mid-weights (specifically Mistral Small 22B and Qwen 2.5 Coder 14B) against the 120B+ parameter behemoths in the cloud. The results were revealing:
| Model | Result | Runtime | VRAM | Remarks | Observations / Hypotheses |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | PASS | 20s | N/A | Gold Standard. | Best architectural logic & safety. High context awareness. |
| GPT-OSS-120b-cloud | PASS | 19s | N/A | The Architect. | Exceptional reasoning. RL feedback logic handles edge cases well. |
| Gemini 3.1 Flash Lite | PASS | 8s | N/A | The Engineer. | Best speed/accuracy ratio for agentic flows. |
| Qwen 2.5 Coder (14B) | PASS | 35s | 11GB | The Daily Driver. | Perfect tool accuracy. Best-performing model for local code engineering. |
| Mistral Small (22B) | PASS | 39s | 18GB | The Local Architect. | Highest reasoning depth for local design. Follows strategy better than Qwen. |
| DeepSeek Coder V2 (16B) | PASS | 40s | 14GB | The MoE Specialist. | Solid technical documentation. MoE architecture prevents “logic collapse” seen in dense 30B models. |
| Gemma 2 (27B) | PASS | 39s | 20GB | Tier-1 Strategist. | Concise and technically accurate. Understands semantic search concepts deeply. |
| Phi-4 (Standard) | PASS | 21s | 10GB | Efficiency Hero. | Surprising competence for size. Good structured output control. |
| Phi-3.5 | PASS | 25s | 8GB | The Drawing-Challenged Professor. | High theoretical logic. Excellent explanations but cannot generate correct Mermaid syntax. |
| DeepSeek-R1-32b | PASS | 45s | 22GB | Reasoning Specialist. | High depth; prone to logic loops on first tool call. |
| Qwen 2.5 (32B) | FAIL | 40s | 22GB | Tool Hallucinator. | Hallucinated design_regression_testing_strategy tool instead of reasoning. |
| Command-R (35B) | FAIL | 12s | 24GB | Information Hungry. | Attempted to read an unrelated tmux skill. Shows a “Tool-Hunger” bias: its instinct is to find information rather than attempt reasoning. |
| Phi-4 Reasoning | FAIL | 30s | 12GB | Logic Loop. | Caught in internal reasoning cycles; failed to output final artifact. |
| GLM-Z1 (9B) | FAIL | 10s | 7GB | Fragmented Logic. | Output broken JSON snippets instead of architectural design. |
| Qwen 3.5 (35B) | KILL | N/A | 27GB | OOM Collapse. | Reasoning collapsed during RAM swap. Too large for 3090Ti. |
| Llama 4 | KILL | N/A | 78GB | VRAM Overflow. | Caused 504 Gateway Timeouts. Too large for 3090Ti. |
| Nemotron | KILL | N/A | 57GB | Hardware Mismatch. | Footprint too large for 3090Ti. |
| GPT-OSS-20b | FAIL | 16s | 14GB | Zero Output. | Reasoning floor issue. |
| Mistral (Local) | FAIL | 6s | 8GB | Logic Disconnect. | Hallucinated unrelated tool calls. |
| Granite 3.3 | FAIL | 8s | 12GB | Task Avoider. | Hallucinated custom tools to avoid design work. |
Conclusions:
Models exceeding 22GB-24GB are unsuitable for sub-agent tasks on this hardware. When a model spills into System RAM, the “illusion of velocity” is shattered by a “logic loop”: the model encounters an error, misinterprets the context due to memory fragmentation, and repeatedly attempts the same flawed execution. You cannot bluff your way past the hardware limit. Mistral Small (22B) and Qwen 2.5 Coder (14B) represent the current performance peak for this GPU. They offer enough room for a robust KV cache (the “memory” of the current conversation) without triggering a swap.
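The KV-cache budget itself can be sized with the standard formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per value. The layer/head configuration below is an assumed GQA-style shape for illustration, not a measured spec of Mistral Small:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV-cache size: 2 (K and V) x layers x KV heads x head_dim
    x context length x bytes per value (fp16 = 2)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_val) / 1024**3

# Assumed GQA shape: 56 layers, 8 KV heads, head_dim 128, 32k context.
print(f"~{kv_cache_gb(56, 8, 128, 32_768):.2f} GB of KV cache")
```

Under those assumptions a full 32k context costs around 7 GB on top of the weights, which is exactly why a 22B model at ~18GB is near the practical ceiling on a 24GB card while anything heavier tips into the swap.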
Next Steps: Moving Beyond Static Routing
The path forward is systemic. Our next phase involves moving away from static model assignment to automated routing (e.g. via model-matrix.json). We will route philosophical “Architect” tasks to models that excel at reasoning (Mistral Small) and syntax-heavy “Jarvis” tasks to specialized coders (Qwen 14B).
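A minimal sketch of what such a router could look like — the model-matrix schema and the keyword heuristic here are hypothetical illustrations, not OpenClaw’s actual format:

```python
import json

# Hypothetical model-matrix.json contents (assumed schema, for illustration).
MODEL_MATRIX = json.loads("""
{
  "architect": {"model": "mistral-small:22b", "why": "reasoning depth"},
  "jarvis":    {"model": "qwen2.5-coder:14b", "why": "tool accuracy"}
}
""")

DESIGN_KEYWORDS = ("strategy", "architecture", "design", "trade-off")

def route(task: str) -> str:
    """Naive keyword router: design-flavored tasks go to the local
    reasoner, everything else to the specialized coder."""
    role = ("architect"
            if any(k in task.lower() for k in DESIGN_KEYWORDS)
            else "jarvis")
    return MODEL_MATRIX[role]["model"]

print(route("Design a regression testing strategy"))  # mistral-small:22b
print(route("Fix the flaky pytest fixture"))          # qwen2.5-coder:14b
```

In practice the keyword check would give way to a classifier or an LLM-based dispatcher, but the contract stays the same: the matrix owns the model assignments, and the agent code never hard-codes a model name.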
Future work may include fine-tuning some of these mid-weight models on successful OpenClaw execution traces to embed the rigorous, phase-based workflow directly into their weights, reducing “Tool-Hunger” and ensuring that our local agents stop acting like typists and start acting like Architects.