Into the Abyss of Local Hybrid GraphRAG: What 500 Million Tokens Buy You

If you have ever stared at a terminal window, watching log files scroll by at the speed of light while your GPUs hum at maximum thermal capacity, the fans run at full speed, and time slows to a crawl, you know the rush of building local AI infrastructure. It is equal parts exhilarating and terrifying.

For the past few weeks, we embarked on a highly ambitious research and engineering journey. The objective sounded straightforward on paper: construct a fully localized, enterprise-grade Hybrid RAG (Retrieval-Augmented Generation) pipeline. We wanted to serve high-fidelity, hallucination-free context to a swarm of local AI agents orchestrated agnostically by an agentic framework.

Our constraints were uncompromising. We would not use external providers. We would not rely on managed cloud databases. There would be no vendor lock-in. Every token of generation, every vector embedding, and every topological edge had to be computed and stored on our own hardware using open-weight models (like Qwen, Gemma, and mxbai-embed-large).

This is the story of how we processed half a billion tokens, broke our tools, audited our databases, and ultimately forced an ecosystem built for the cloud to survive in the rugged terrain of local infrastructure.

The Architecture: Why Three Databases?

When designing a system to ingest different types of documents, you quickly realize that standard vector RAG is not enough. Vector databases are fantastic at finding semantic similarity (they catch the “vibe” of a query), but they are not as good at answering questions like “What are all the dependencies of this specific software design pattern?” or “When was this pattern introduced in this code base?” To solve this, we needed a hybrid architecture. We settled on a tri-storage design in which each database plays a highly specialized role. This was not overkill or overengineering: it was necessary for the system to answer “odd” queries over huge amounts of data with high accuracy and low latency. Ultimately, we want to provide our AI agents with quality information, not slop. Here is how we planned the data plane.

1. PostgreSQL (The Single Source of Truth)

When you are orchestrating a multi-day ingestion across local GPUs, things will crash. Networks blink, memory maxes out, and containers restart. We needed a bulletproof state machine. PostgreSQL held our knowledge_registry and knowledge_chunks tables. It was the absolute source of truth. It didn’t hold the vectors or the graphs; it held the metadata and the status (graph_status, vector_status). If a worker died, Postgres knew exactly which chunk to requeue. It worked flawlessly.
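The requeue behavior can be sketched in a few lines. This is a minimal illustration that uses the standard-library sqlite3 module as a stand-in for PostgreSQL so it runs anywhere. The knowledge_chunks table and graph_status column are the ones named above; the status values ('pending', 'processing', 'done'), the claimed_at column, and the timeout are assumptions made for the example.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 900  # assumption: a claim older than this means the worker died


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the Postgres state table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE knowledge_chunks (
               id INTEGER PRIMARY KEY,
               graph_status TEXT NOT NULL DEFAULT 'pending',
               claimed_at REAL
           )"""
    )
    return conn


def requeue_stale(conn: sqlite3.Connection, now: float) -> int:
    """Return chunks stuck in 'processing' to 'pending'; report how many moved."""
    cur = conn.execute(
        """UPDATE knowledge_chunks
              SET graph_status = 'pending', claimed_at = NULL
            WHERE graph_status = 'processing'
              AND claimed_at < ?""",
        (now - STALE_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
now = time.time()
conn.executemany(
    "INSERT INTO knowledge_chunks (graph_status, claimed_at) VALUES (?, ?)",
    [("processing", now - 3600), ("processing", now - 10), ("done", None)],
)
requeued = requeue_stale(conn, now)  # only the hour-old claim counts as stale
```

Because the status lives in one table, a crashed worker leaves nothing behind except a row that a periodic sweep like this one can hand back to the queue.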

2. Qdrant (The Semantic Net)

Qdrant was our vector store. Its job was to hold 600-character blocks of overlapped text. This size is large enough to retain the narrative context of a paragraph, but small enough to fit within the 512-token limit of our local embedding model (mxbai-embed-large). Qdrant’s role during retrieval was to cast a wide net and pull in the paragraphs surrounding a concept, providing the broader thematic context that an LLM needs to “sound natural”.
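A minimal sketch of fixed-size chunking with overlap. The 600-character size comes from the setup above; the 100-character overlap and the function name are assumptions, since the actual overlap we used is not stated here.

```python
def chunk_text(text: str, size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "x" * 1500
chunks = chunk_text(sample)  # windows start at offsets 0, 500, and 1000
```

Character-based windows are the simplest option; a token-aware splitter would track the embedding model's limit more precisely, at the cost of running the tokenizer during chunking.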

3. FalkorDB (The Crystalline Graph)

While Qdrant held the paragraphs, FalkorDB held the hard logic. We took those 600-character chunks and used an n8n workflow to chop them into 10 smaller micro-chunks. We then fed those micro-chunks to a high-context local LLM (Qwen) via the Graphiti framework to extract discrete entities (TechnologyConcept, Person, Publication) and the edges connecting them. FalkorDB’s role was to provide absolute, immutable precision. If node A requires node B, the graph knows it deterministically, regardless of how the words were phrased. FalkorDB also stored the validity period of the information. If document A later conflicted with document B on an entity definition (perhaps a new revision or a new edition), FalkorDB would not just overwrite it; it would record when the old knowledge stops being valid and the new concept kicks in. This is fundamental for research and highly specialized knowledge bases.
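The "do not overwrite, close the validity window" idea can be illustrated with plain Python objects. This is only a sketch of the concept: the Fact class, its field names, and the example entity are hypothetical and do not reproduce Graphiti's or FalkorDB's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means "still valid"


def supersede(facts: list[Fact], new: Fact) -> None:
    """Close the validity window of conflicting facts instead of deleting them."""
    for old in facts:
        same_slot = (old.subject, old.predicate) == (new.subject, new.predicate)
        if same_slot and old.invalid_at is None and old.obj != new.obj:
            old.invalid_at = new.valid_at
    facts.append(new)


def facts_at(facts: list[Fact], when: datetime) -> list[Fact]:
    """Return the facts that were valid at a given moment."""
    return [
        f for f in facts
        if f.valid_at <= when and (f.invalid_at is None or when < f.invalid_at)
    ]


history: list[Fact] = []
supersede(history, Fact("PatternX", "DEFINED_AS", "first edition wording", datetime(2020, 1, 1)))
supersede(history, Fact("PatternX", "DEFINED_AS", "second edition wording", datetime(2023, 1, 1)))
```

Both definitions stay in the history, so a question about the state of knowledge in 2021 and a question about today can each get the right answer.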

The heavy lifting of moving data between these databases was handled by n8n and Docker. I cannot overstate how beautifully this part of the stack performed. n8n decoupled the document transformation from its original format to text, the graph extraction, and the vectorization, allowing all the asynchronous pipelines to chew through the original data repository independently while Postgres kept score.

The 4-Day Grind and The Retrieval Crash

For four days, the ingestion pipeline ran non-stop. It processed over 500 million tokens, with over 260 thousand LLM calls and over 350 thousand embedding queries. When the queues finally cleared, we queried Postgres: over 27,000 entities and 120,000 relationships were safely tucked away in FalkorDB, and 1,447 vectors were sitting in Qdrant, all from just 3 large documents out of a set of more than 600. Note that these numbers represent the computational extraction overhead of the chosen technologies (the agent reasoning, disambiguation, and edge-mapping), not the raw word count of the source documents.

At a rate of less than one document per day, we had not built the local Library of Alexandria, but we were eager to test it. We fired up our custom Python FastAPI bridge (the expert-search endpoint that our agents would use) and submitted a simple query for a known concept in our dataset: “Logits Masking”.

The response came back in 1.3 seconds:

{"status": "success", "data": []}

Zero hits. No graph nodes. No vector chunks. Nothing.

As with any research team, initial panic set in:

Had the pipeline silently failed?

Did mxbai truncate the context windows?

Were the n8n workflows desynchronized?

We spent hours chasing ghosts in the three ingestion pipelines, only to realize they were working flawlessly and the data was perfectly fine. So, what was the problem?

Well, the problem wasn’t our data; it was our abstraction layer and some of the tools and technologies we had integrated.

The Bias in “Open” Ecosystems

We were using graphiti_core as the library to interface with FalkorDB. Graphiti is a brilliant piece of software, but like many tools in the current AI landscape, it was built, tested, and optimized almost exclusively around OpenAI’s stack (specifically text-embedding-3).

OpenAI’s embedding models are highly normalized and aggressively clustered. If you query “Logits Masking” against a database embedded by OpenAI, the cosine similarity score will naturally be very high (often 0.75 or above). Because Graphiti’s developers expected this, they hardcoded vector similarity thresholds deep inside the library’s source code to filter out “noise.”

But we weren’t using OpenAI. We were using mxbai-embed-large.

Local, open-weight embedding models distribute their similarity scores differently. Furthermore, mxbai is an asymmetric model: it requires specific instructional prefixes (like “Represent this sentence for searching…”) to align queries with stored documents.

When Graphiti’s internal .search() method vectorized our query, the mxbai math returned a cosine similarity of perhaps 0.45 or 0.55. This was actually a perfect semantic match in local-model space, but Graphiti’s hardcoded OpenAI-biased threshold saw a score below 0.70 and silently filtered our perfectly valid data into the garbage.

The exact same thing happened in Qdrant. We had set a conservative score_threshold=0.50. Because a 2-word query (“Logits Masking”) mathematically dilutes when compared against a dense 600-character chunk, the score dropped to ~0.40, and Qdrant quietly closed the door.

We had built a library, but the librarian was being “too picky”.

Getting Our Hands Dirty: The Patches

We realized we couldn’t just plug-and-play generic libraries into a sovereign, local infrastructure. We had to tear open the engine and start patching. Here is exactly what we had to modify to make the system survive our use case:

1. The Cypher Override (Bypassing the Black Box)

We completely abandoned Graphiti’s native .search() method. Instead, we wrote a direct Cypher override in our FastAPI bridge. We used Python to extract the core keywords from the user’s prompt, connected directly to FalkorDB via Redis, and executed a raw MATCH (n)-[r]->(m) WHERE toLower(n.name) CONTAINS... query. We bypassed vector thresholds entirely for the graph, relying on deterministic string matching to find the topological center of gravity.
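A minimal sketch of the keyword-to-Cypher step. It only builds the query text and its parameters; the stopword list, the RETURN clause, the LIMIT, and the function names are assumptions for illustration, and the exact query our bridge runs is not reproduced here.

```python
import re

# Assumption: a tiny stopword list; a real bridge would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "is", "are", "what", "how", "in", "to", "for", "and"}


def extract_keywords(prompt: str, max_keywords: int = 5) -> list[str]:
    """Lowercase the prompt and keep distinct non-stopword tokens, in order."""
    seen: list[str] = []
    for token in re.findall(r"[a-z0-9]+", prompt.lower()):
        if token not in STOPWORDS and token not in seen:
            seen.append(token)
    return seen[:max_keywords]


def build_cypher(keywords: list[str], limit: int = 25) -> tuple[str, dict[str, str]]:
    """Build a parameterized CONTAINS query; keywords travel as parameters, not text."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    params = {f"kw{i}": kw for i, kw in enumerate(keywords)}
    where = " OR ".join(f"toLower(n.name) CONTAINS ${name}" for name in params)
    query = (
        "MATCH (n)-[r]->(m) "
        f"WHERE {where} "
        "RETURN n.name, type(r), m.name "
        f"LIMIT {int(limit)}"
    )
    return query, params


query, params = build_cypher(extract_keywords("What is Logits Masking?"))
```

Passing the keywords as parameters rather than interpolating them into the string keeps stray quotes in a user prompt from breaking the query.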

2. The Memory Pointer Optimization: The O(n x m) crisis

As our graph grew to 120,000 edges, we noticed Graphiti’s native Cypher queries were causing massive memory spikes in FalkorDB (especially during ingestion). The library was executing queries like:

MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)

In a massive local graph, this triggers an O(n x m) Cartesian product scan, which is computationally devastating. We wrote a Docker startup script to physically patch the graphiti_core Python files inside the container at boot, replacing that logic with optimized memory pointers:

WITH rel AS e, score, startNode(rel) AS n, endNode(rel) AS m

This simple patch dropped our query times from seconds to milliseconds. We also added a “sub-chunking” node into our graph n8n workflow to reduce the time spent on a single HTTP request to the Graphiti bridge and reduce the possibility of timeout events for any given request.
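The boot-time patch boils down to an idempotent text replacement. The sketch below shows the idea in Python; the two Cypher fragments are the ones quoted above, while the function names and the way the target file is located are assumptions (our actual startup script is not reproduced here).

```python
OLD = "MATCH (n:Entity)-[e:RELATES_TO {uuid: rel.uuid}]->(m:Entity)"
NEW = "WITH rel AS e, score, startNode(rel) AS n, endNode(rel) AS m"


def patch_source(source: str, old: str = OLD, new: str = NEW) -> tuple[str, int]:
    """Replace every occurrence of `old`; return the new text and the hit count."""
    hits = source.count(old)
    return source.replace(old, new), hits


def patch_file(path: str) -> int:
    """Patch one file in place; a second run finds nothing and changes nothing."""
    with open(path, encoding="utf-8") as fh:
        original = fh.read()
    patched, hits = patch_source(original)
    if hits:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(patched)
    return hits


# Hypothetical stand-in for the text of a library module containing the slow query.
sample = "some code before\n" + OLD + "\nsome code after"
patched, hits = patch_source(sample)
```

Returning the hit count matters in practice: if a library upgrade changes the original line, the count drops to zero and the startup script can fail loudly instead of silently running unpatched code.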

3. RediSearch Armor & The Jaccard Threshold

Local LLMs are brilliant despite their current limitations, but they can be “slightly” more erratic than GPT-4 when extracting JSON entities. Sometimes they hallucinate strange characters. If a weird character made its way into a Graphiti full-text search, the database would throw an error. We patched the driver to inject regex sanitization (RediSearch Armor) right before execution.
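A sanitizer of this kind can be very small. The sketch below is an assumption about what such a filter might look like, not the exact patch: it keeps letters, digits, underscores, and whitespace, and replaces everything else with a space before the term reaches the full-text index.

```python
import re

# Assumption: anything that is not a letter, digit, underscore, or whitespace is
# treated as potentially meaningful to the full-text query parser and removed.
_UNSAFE = re.compile(r"[^\w\s]", flags=re.UNICODE)
_SPACES = re.compile(r"\s+")


def sanitize_fulltext(term: str) -> str:
    """Replace unsafe characters with spaces, then collapse whitespace."""
    cleaned = _UNSAFE.sub(" ", term)
    return _SPACES.sub(" ", cleaned).strip()


clean = sanitize_fulltext('Logits@Masking {"weird"}|~*')
# clean == "Logits Masking weird"
```

An allow-list like this is blunter than escaping each special character, but it cannot miss a character the author of the filter did not think of.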

Additionally, because local LLMs extract entity names with slight variations (e.g., “Logits Masking” vs “Logit Masking”), we had to dive into Graphiti’s deduplication helpers and patch the _FUZZY_JACCARD_THRESHOLD from its strict default down to 0.78 to allow our local models some breathing room during ingestion deduplication.
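To make the threshold concrete, here is a small Jaccard similarity sketch over character n-grams. It is not Graphiti's implementation: the shingling scheme and the function names are assumptions, and the library's helper may tokenize differently, so scores from this sketch are not comparable to the library's.

```python
def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= n:
        return {norm} if norm else set()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """|A ∩ B| / |A ∪ B| over the two shingle sets; 1.0 for two empty strings."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_duplicate(a: str, b: str, threshold: float = 0.78, n: int = 3) -> bool:
    """Treat two names as the same entity when their similarity reaches the threshold."""
    return jaccard(a, b, n) >= threshold


trigram_score = jaccard("Logits Masking", "Logit Masking", n=3)  # 9 shared of 14
bigram_score = jaccard("Logits Masking", "Logit Masking", n=2)   # 11 shared of 14
```

With trigrams the pair scores 9/14 (about 0.64), while with bigrams it scores 11/14 (about 0.79). A single dropped letter can land on either side of a fixed cutoff depending on the tokenization, which is why the threshold has to be tuned against the names your own extraction model actually produces.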

4. Pure k-NN Vector Retrieval

To solve the Qdrant dilution issue, we completely removed the score_threshold parameter. We forced Qdrant to operate as a pure k-Nearest Neighbor (k-NN) database. We told it: “I don’t care how low the math says the score is, just give me the absolute closest n chunks.”
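The difference between thresholded search and pure k-NN is easy to see on toy data. The sketch below uses plain Python rather than the Qdrant client; the vectors, point names, and the search function are made up for illustration.

```python
import heapq
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query, points, k=3, score_threshold=None):
    """Rank points by cosine score; optionally drop those under a threshold."""
    scored = [(cosine(query, vec), pid) for pid, vec in points.items()]
    if score_threshold is not None:
        scored = [s for s in scored if s[0] >= score_threshold]
    return heapq.nlargest(k, scored)


# Toy 2-D vectors; real embeddings have hundreds of dimensions.
points = {"chunk-a": [1.0, 2.0], "chunk-b": [1.0, 4.0], "chunk-c": [-1.0, 1.0]}
query = [1.0, 0.0]

with_threshold = search(query, points, k=2, score_threshold=0.50)  # nothing survives
pure_knn = search(query, points, k=2)  # best two, whatever their scores
```

The best candidate here scores about 0.45, so a 0.50 cutoff returns an empty list even though a clear ranking exists. Pure k-NN always returns the closest candidates and leaves the decision about relevance to the downstream step.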

The Final Result

When we finally issued the curl command after applying these deep, architectural patches, the result was staggering.

The Graph engine instantly returned the immutable facts: Logits Masking is related to hyperparameter tuning, Diffusers, and grammar-constrained processors. Simultaneously, Qdrant returned the exact textbook paragraphs explaining the broader domain of evaluating and controlling LLM outputs.

Our custom Python bridge took these two distinct paradigms, deduplicated them (so we didn’t feed the LLM redundant text), truncated them to fit exactly within our local model’s context window, and returned a perfectly synthesized XML-tagged <Knowledge_Base_Context>.
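The merge step can be sketched as follows. This is an assumption-heavy illustration: the function name, the character-based budget (a real bridge would more likely count tokens), and the "graph facts first" ordering are all hypothetical; only the Knowledge_Base_Context tag comes from our setup.

```python
def build_context(graph_facts: list[str], vector_chunks: list[str], max_chars: int = 8000) -> str:
    """Merge both result sets, drop repeats, and cap the combined item length."""
    seen: set[str] = set()
    parts: list[str] = []
    used = 0
    for item in [*graph_facts, *vector_chunks]:  # graph facts first, then passages
        text = item.strip()
        key = " ".join(text.lower().split())  # ignore case and spacing differences
        if not key or key in seen:
            continue
        if used + len(text) > max_chars:
            break  # budget exhausted; remaining items are dropped
        seen.add(key)
        parts.append(text)
        used += len(text)
    body = "\n".join(parts)
    return f"<Knowledge_Base_Context>\n{body}\n</Knowledge_Base_Context>"


context = build_context(
    ["Logits Masking RELATES_TO Diffusers", "logits masking relates_to diffusers"],
    ["A paragraph about controlling LLM outputs."],
)
```

Putting the graph facts first means that when the budget runs out, it is the longer vector passages that get dropped rather than the compact relationships.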

Lessons Learned for the Sovereign Engineer

Open source is great, but it’s not a silver bullet. Our patches to Graphiti’s core were needed mainly because of our specific constraints and objectives, but also because its code was not “parametrizable”. However, our use case does not justify pull requests to their repo: their scope and constraints do not necessarily align with ours, and they may have a different “philosophy” on what to parametrize and how to pass those parameters around. Additionally, we could not wait for a PR to be reviewed and merged just to find out whether our 4 days of work had been worthwhile.

If you are building localized AI infrastructure, here is a harsh truth: the generic tutorials will fail you. The current crop of RAG and graph tools assumes you have infinite cloud compute, symmetrical OpenAI embeddings, perfectly sanitized data, and the same software/hardware stack as their authors. When you bring these tools into a local environment, you must become an active participant in their architecture. You must understand how your specific embedding model scales its math. You must know how to read your database’s schema natively. You must be willing to understand the code’s architecture and to write regex patches for open-source libraries that may be choking on your data, install or update dependencies, add validations, write tests, debug, and tune your infrastructure; in other words, “the whole software development enchilada”.

The choice of embedding model impacts various aspects of your design (e.g., max input length affects chunk size, chunk size affects meaningfulness and processing speed, similarity thresholds need tuning, etc.).

Although the embedding model “must be the same” for the vector ingestion, the graph ingestion, and the retrieval modules, the LLM that extracts the entities can be different from the model actively querying the knowledge base. This distinction helps in two main ways:

  • It can enhance quality and accelerate the ingestion: a stricter (less imaginative) model can be smaller, faster, and less prone to “inventing” facts and entities.
  • It can enhance the contextual items before consumption: a more capable (reasoning) model can make better sense of the items retrieved from both the graph and vector databases and blend them in any way you need your agent to consume them. As different models and agent roles may need different formats or phrasings of the context for optimal performance, our retrieval endpoint has an “agent_role” parameter that lets the query be dynamically routed to a model better suited to the combination of the actual query and the calling agent’s role.
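A role-based router can be as simple as a lookup table with a fallback. The sketch below is purely illustrative: the role names, model names, and the word-count rule are hypothetical and are not taken from our endpoint.

```python
# Hypothetical role-to-model table; none of these names come from the pipeline.
MODEL_BY_ROLE = {
    "architect": "large-reasoning-model",
    "coder": "mid-size-code-model",
}
DEFAULT_MODEL = "small-general-model"
LONG_QUERY_WORDS = 40  # assumption: long questions get the reasoning model


def pick_model(agent_role: str, query: str) -> str:
    """Choose a synthesis model from the caller's role and the query length."""
    if len(query.split()) >= LONG_QUERY_WORDS:
        return MODEL_BY_ROLE["architect"]
    return MODEL_BY_ROLE.get(agent_role.strip().lower(), DEFAULT_MODEL)


chosen = pick_model("Coder", "How is Logits Masking implemented?")
```

Keeping the routing rule in one small function makes it easy to swap the heuristic later without touching the retrieval code.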

For most use cases, a vector search or a graph search on its own will suffice. But if you have strict requirements on facts and on relationships between the entities in the docs, and you also need temporal information (all at the same time), this is the current cost:

Half a billion tokens for 3 documents (averaging 350 pages each)… is just massive!!!

Having a large collection of documents ingested by a third-party paid service will break the bank for almost any small business or research team that needs this type of knowledge base. We are talking about more than 100 billion tokens for a 600-document collection.

Even using cheap models from cloud providers, we are talking about several tens of thousands of dollars.
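The back-of-the-envelope math behind those figures can be written out explicitly. The token and document counts are the ones reported above; the price per million tokens is a placeholder chosen only to show the order of magnitude, not a quote from any provider.

```python
def projected_tokens(tokens_so_far: int, docs_done: int, docs_total: int) -> int:
    """Linear extrapolation: same average token cost for every document."""
    return tokens_so_far * docs_total // docs_done


def projected_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost at a flat blended price per million tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


total = projected_tokens(500_000_000, docs_done=3, docs_total=600)
# 0.30 USD per million tokens is a placeholder price, not a real quote.
cost = projected_cost_usd(total, usd_per_million_tokens=0.30)
```

With these inputs the projection is 100 billion tokens and roughly 30,000 USD. The extrapolation assumes the remaining documents are as large and as dense as the first three, which may not hold for the full collection.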

GraphRAG is not just a parsing exercise; it is an active reasoning loop. To extract an edge between two entities, the LLM must read the micro-chunk, identify the nouns, evaluate their relationship against the ontology, and output structured JSON. That reasoning cycle is why 3 documents burn half a billion tokens: you are trading upfront compute for permanent retrieval precision. The use of Graphiti adds a huge overhead to the entity and edge extraction (it makes calls to both the LLM and the embedding model, and handles the disambiguation, de-duping, etc.), but the quality of the results after processing speaks for itself.

However, as we are not strictly constrained by time, we can wait 3 months (that’s our current expected finishing date), and our agents can use the knowledge base incrementally as it grows. If you are constrained by time, and money and security restrictions are not an issue, use a model from a cloud provider: it will be more than 10 times faster, but also very, very expensive and much less private (data in transit, indirect logging and storage, unscrupulous or non-compliant vendors, etc.).

Closing thoughts

This experimental research was intense (as most things worth pursuing are), but when we finally broke through the barriers, the reward was immense. We now have a fully sovereign, blazing-fast, high-fidelity Hybrid GraphRAG pipeline, and our data never left our servers. We still have things to tweak, other entity-extraction methods to test, and new features to add, but the knowledge ingestion and retrieval are now functional, tested, and production-ready.

The resulting knowledge base will likely not be that big, so we are adding automated backup jobs. (You should also test your backups periodically.) If you are not updating the contents too often, many cloud providers offer cheap “cold data storage” options to keep it safe (after proper local compression and encryption).

Additionally, one often overlooked benefit of having your document repository transformed into a knowledge base is that you can use it for many purposes; it’s not just for your agents to consume! You can query and analyze it in many novel ways, visualize contents and relationships to find patterns or gaps, contribute it to the community, or even use it as training or fine-tuning data for your own custom models (which will be our case; more on this soon…).

Now with the infrastructure ready, our next steps are to keep ingesting the rest of the documents from the data store while in parallel we hand the keys over to the agent framework and its swarms to see what our Architect Agent and Team can build when they have “the right context”.