Decoupling GraphRAG: Breaking the Monolithic Ingestion Loop

In my last post, we presented the workflow that lets us start building our “curated” knowledge base for Agentic RAG. We are now starting to leverage it by optimizing our current infrastructure for speed, resiliency and accuracy.

We are still far from ingesting the whole stack of documents we want, but we prioritized certain domains, and with about 3% of the ingestion done we queried our knowledge base for ways to enhance the current workflow. The results could not be better :). This is where we are at.

The VRAM Thrashing Inevitability

Building a local Agentic AI pipeline on consumer or homelab hardware forces a hard collision with physical memory limits. When constructing a Graph-based Retrieval-Augmented Generation (GraphRAG) system, the standard architectural approach is a synchronous monolith: read the document, chunk it, extract graph entities via a massive LLM (like Gemma 3 27B), swap to an embedding model (like mxbai-embed-large), write to a vector database, and write to a graph database.

This synchronous approach may work for casual document ingestion, but for a homelab, a big pile of documents, or a production environment it is structurally flawed. Forcing a single pipeline to juggle generative extraction, mathematical embedding, and topological database commits guarantees failure at scale: VRAM swap times cripple throughput, graph deduplication queries time out, and if an API call drops, the entire chunk is lost.

To process a 1,000-document library without melting the GPU or waiting a year for sequential execution, the strictly linear architecture must be shattered and rebuilt into decoupled, asynchronous state machines.

Here is how separating ingestion into three distinct pipelines backed by three specialized databases (PostgreSQL, FalkorDB, and Qdrant) creates a fault-tolerant, high-throughput GraphRAG mesh.

Phase 1: The PostgreSQL Golden Master (State & Temporal Truth)

The foundation of a decoupled system is a bulletproof ledger. In this architecture, PostgreSQL does not just store data; it dictates the state machine for the entire pipeline.

Instead of passing massive text arrays through memory, the ingestion pre-processor simply shreds documents into overlapping chunks and dumps them into a knowledge_chunks table, linked relationally to a knowledge_registry.
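A minimal sketch of the overlapping-chunk step, assuming character-based windows. The function name chunk_text and the size and overlap defaults are hypothetical; the post does not state the actual chunk size or tokenization.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# The list position doubles as chunk_index when rows are written to knowledge_chunks.
rows = list(enumerate(chunk_text("abcdefghij", size=4, overlap=2)))
```

Keeping the position of each chunk is what later allows the queue to replay a document in strict order.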

The knowledge_registry must also hold the overall document status across all the pipelines so you can recover from processing errors (that will inevitably occur).

The Temporal Anchor

GraphRAG requires temporal reasoning. If a system cannot differentiate between when a document was written and when it was processed, the LLM will hallucinate causal timelines. PostgreSQL handles this decoupling at the root:

  • created_at: The system ingestion time.
  • canonical_date: The strictly enforced ISO historical timestamp extracted from the document.

Note that not all documents have a publication date, so we created a specific skill for an agent to “infer” the canonical date of any document, or to set a “default timestamp” if no date can be inferred. This skill makes use of various tools (e.g. API calls and web search) and is a core part of the ingestion pipeline, as the graph database requires a date to mark each known entity as valid from a specific timestamp.
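The fallback order described above (extracted date, then inferred date, then default) can be sketched as follows. This is an illustration only: resolve_canonical_date, the 1970-01-01 default, and the UTC assumption for naive dates are all hypothetical, and the agent skill that produces the inferred value is not shown.

```python
from datetime import datetime, timezone

# Hypothetical fallback; the post does not say which default timestamp is used.
DEFAULT_CANONICAL_DATE = datetime(1970, 1, 1, tzinfo=timezone.utc)


def resolve_canonical_date(extracted, inferred=None):
    """Return (canonical_date, source), preferring extracted over inferred over default."""
    for source, raw in (("extracted", extracted), ("inferred", inferred)):
        if not raw:
            continue
        try:
            parsed = datetime.fromisoformat(raw)  # enforce ISO formatting
        except ValueError:
            continue  # not a valid ISO date, try the next candidate
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive dates are UTC
        return parsed, source
    return DEFAULT_CANONICAL_DATE, "default"
```

Returning the source alongside the date makes it easy to audit later which documents ended up with an inferred or default timestamp.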

The CTE Queue

To feed the heavy compute pipelines without degrading database performance, the queue must be optimized. At first we used a “naive” ORDER BY clause, but measuring its performance across the first 300 chunks made it evident that the query would eventually choke database I/O. Instead, we use a two-stage Common Table Expression (CTE) to locate the newest document first, lock onto it, and exhaust its chunks sequentially (chunk_index ASC). This prevents interleaving chunks from different books, which would destroy the LLM’s coreference resolution capabilities.
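A sketch of what such a two-stage query might look like. To keep it runnable without a server it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, so the placeholder is ? rather than a Postgres driver's style. The column names registry_id and graph_status, and the choice of canonical_date to define "newest", are assumptions rather than the actual schema.

```python
import sqlite3

NEXT_CHUNKS_SQL = """
WITH target_doc AS (              -- stage 1: pick one document with pending chunks
    SELECT r.id
    FROM knowledge_registry r
    WHERE EXISTS (
        SELECT 1 FROM knowledge_chunks c
        WHERE c.registry_id = r.id AND c.graph_status = 'pending'
    )
    ORDER BY r.canonical_date DESC
    LIMIT 1
)
SELECT c.id, c.registry_id, c.chunk_index   -- stage 2: drain that document in order
FROM knowledge_chunks c
JOIN target_doc t ON t.id = c.registry_id
WHERE c.graph_status = 'pending'
ORDER BY c.chunk_index ASC
LIMIT ?
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knowledge_registry (id INTEGER PRIMARY KEY, canonical_date TEXT NOT NULL);
CREATE TABLE knowledge_chunks (
    id INTEGER PRIMARY KEY,
    registry_id INTEGER NOT NULL REFERENCES knowledge_registry(id),
    chunk_index INTEGER NOT NULL,
    graph_status TEXT NOT NULL DEFAULT 'pending'
);
""")
conn.executemany("INSERT INTO knowledge_registry VALUES (?, ?)",
                 [(1, "2019-01-01"), (2, "2023-06-01")])
# Chunks are inserted deliberately out of order to show that the query restores it.
conn.executemany("INSERT INTO knowledge_chunks (registry_id, chunk_index) VALUES (?, ?)",
                 [(1, 0), (2, 1), (1, 1), (2, 0)])


def next_chunks(conn, limit=10):
    """Return up to `limit` pending chunks, all from a single document, in chunk order."""
    return conn.execute(NEXT_CHUNKS_SQL, (limit,)).fetchall()
```

Because stage 1 resolves to a single document id, stage 2 only has to sort the chunks of that one document. With several concurrent workers on PostgreSQL, a row lock such as FOR UPDATE SKIP LOCKED would be one option to consider, though the post does not say how locking is done.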

Phase 2: The Graph & Topological Extraction

With the text safely buffered in Postgres, the workflow triggers the next pipeline asynchronously. This pipeline is exclusively dedicated to the hardest computational task: topological entity extraction.

The chunks are routed to the Graphiti library, powered by Gemma 3 27B. Because this pipeline only handles graph extraction, the VRAM is strictly pinned to the LLM.

As mentioned before, the chunk order matters: you cannot ingest random chunks from different documents. You must enforce strict ascending-order ingestion of the chunks per document to guarantee graph consistency.

The Deduplication Threat

As the graph grows past a few hundred documents, LLM-extracted entities (e.g., “software development pattern”) trigger massive full-text index lookups against the graph to prevent duplicate nodes. RediSearch, which powers FalkorDB’s full-text engine, will aggressively time out on heavy OR-based stop-word queries.

This was mitigated by directly overriding the internal module timeouts at the database level, giving the C-engine the necessary runway to calculate complex topological merges. The winning timeout setting for our configuration was 6 minutes.

The Episodic Anchor

When FalkorDB commits the chunk, it generates an Episodic node containing the historical valid_at timestamp. Most crucially, the UUID of this specific node is returned to PostgreSQL, which updates the chunk ledger in the knowledge_chunks table with graph_status = 'completed' and graph_episode_id = [UUID], linking everything together.

Phase 3: Vector Embeddings & The Cryptographic Bind

This pipeline operates entirely blind to the Graph Pipeline, governed only by the PostgreSQL state machine. However, this state dependency forces the Vectorizing pipeline to wait until the Graph pipeline is done for a document, which guarantees that the vector database has the exact cryptographic key needed to traverse into the knowledge graph.

A query constantly polls Postgres for chunks where graph_status = 'completed' AND vector_status = 'pending'.
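The gate between the two pipelines can be sketched as a pair of small functions: one records the episode UUID when the graph commit succeeds, the other is the polling query. sqlite3 again stands in for PostgreSQL, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE knowledge_chunks (
    id INTEGER PRIMARY KEY,
    graph_status TEXT NOT NULL DEFAULT 'pending',
    vector_status TEXT NOT NULL DEFAULT 'pending',
    graph_episode_id TEXT
)""")
conn.executemany("INSERT INTO knowledge_chunks (id) VALUES (?)", [(1,), (2,), (3,)])


def mark_graph_completed(conn, chunk_id, episode_id):
    """Graph pipeline side: store the Episodic node UUID and flip the status."""
    conn.execute(
        "UPDATE knowledge_chunks SET graph_status = 'completed', graph_episode_id = ? "
        "WHERE id = ?",
        (episode_id, chunk_id),
    )
    conn.commit()


def fetch_vector_ready(conn, limit=50):
    """Vector pipeline side: only chunks that already carry an episode UUID qualify."""
    return conn.execute(
        "SELECT id, graph_episode_id FROM knowledge_chunks "
        "WHERE graph_status = 'completed' AND vector_status = 'pending' "
        "ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
```

The vector pipeline never talks to the graph pipeline directly; the two status columns are the only contract between them.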

The chunk text is routed to the embedding model (mxbai-embed-large), and the resulting high-dimensional array is upserted to the embeddings database.

The Embedding’s JSON payload is strictly mapped to include the graph_episode_id.
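As an illustration of that mapping, here is a sketch that builds a point in the id / vector / payload shape that Qdrant's point upsert expects. It only builds plain dictionaries and does not call any client. The helper names and the text payload field are assumptions; graph_episode_id is the field named above.

```python
import uuid


def build_point(chunk_id: int, vector: list, text: str, graph_episode_id: str) -> dict:
    """Build one point; refuses to proceed without a well-formed episode UUID."""
    uuid.UUID(graph_episode_id)  # raises ValueError if the string is not a UUID
    return {
        "id": chunk_id,  # Qdrant accepts unsigned integers or UUIDs as point ids
        "vector": vector,
        "payload": {
            "text": text,
            "graph_episode_id": graph_episode_id,  # the link back into the graph
        },
    }


def build_upsert_body(points: list) -> dict:
    """Wrap points in the request body shape used for a batch upsert."""
    return {"points": points}


point = build_point(7, [0.1, 0.2, 0.3], "example chunk text",
                    "123e4567-e89b-12d3-a456-426614174000")
body = build_upsert_body([point])
```

Validating the UUID before the upsert means a chunk can never land in the vector store without a usable pointer into the graph.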

Trade-offs and the Hybrid Advantage

One may assume that maintaining three separate databases and orchestrating HTTP callbacks introduces unnecessary latency compared to a simple, synchronous Python/TypeScript script. However, for our current setup a synchronous script would not scale. If a 10-minute LLM extraction fails on chunk 40,000 due to a malformed character, a monolithic script that is not carefully crafted to be exception-safe may crash and take the vector data and state down with it. In this asynchronous triad architecture, if one pipeline fails, the chunk remains pending and is retried automatically, or is properly marked as an error for debugging and auditing purposes. Pipelines can be triggered periodically and progress independently even if one of them fails, up until errors pile up and a pipeline runs out of upstream input (e.g. no more documents can be chunked, or no more chunks are available for entity extraction or vectorizing).
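The "stay pending, retry, then mark as error" behaviour can be sketched as a tiny state transition. The retry budget of 3, the ChunkState fields, and the process_chunk name are hypothetical; the real pipelines keep this state in PostgreSQL rather than in memory.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # hypothetical retry budget, not a value from the post


@dataclass
class ChunkState:
    chunk_id: int
    status: str = "pending"
    attempts: int = 0
    last_error: Optional[str] = None


def process_chunk(state: ChunkState, worker: Callable[[int], None],
                  max_attempts: int = MAX_ATTEMPTS) -> ChunkState:
    """Run `worker` once; a failure leaves the chunk pending until attempts run out."""
    if state.status != "pending":
        return state  # completed or errored chunks are never picked up again
    state.attempts += 1
    try:
        worker(state.chunk_id)
    except Exception as exc:  # one bad chunk must not take the pipeline down
        state.last_error = repr(exc)
        state.status = "error" if state.attempts >= max_attempts else "pending"
    else:
        state.status = "completed"
        state.last_error = None
    return state
```

Keeping last_error next to the status is what makes the error state useful for debugging and auditing rather than just a dead end.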

The True Value: Hybrid Agentic Routing

By separating the workloads, we achieve a concise and coherent Hybrid RAG environment.

When an AI Agent receives a prompt, it does not have to choose between vector similarity and graph topology. It queries the embeddings space for semantic relevance, then it returns the text, but more importantly, it returns the graph_episode_id. The Agent uses that UUID to jump directly into FalkorDB, landing perfectly on the chronological Episodic node, and traverses the graph edges to pull all causally related entities.

  • The hardware is protected from “model thrashing”.
  • The historical timelines are immutable.
  • The ingestion queue can be scaled horizontally across multiple nodes.
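The routing described in this section can be sketched as a two-hop lookup. The vector search and the graph lookup are injected as plain callables so the example stays independent of any client library. The Cypher string is an assumption based on Graphiti's Episodic / MENTIONS / Entity naming and has not been checked against the graph described in this post.

```python
from typing import Callable, Iterable, List

# Assumed query shape: Episodic node -> MENTIONS -> Entity nodes.
EPISODE_ENTITIES_CYPHER = (
    "MATCH (e:Episodic {uuid: $uuid})-[:MENTIONS]->(n:Entity) "
    "RETURN n.name AS name"
)


def hybrid_retrieve(
    query_vector: List[float],
    vector_search: Callable[[List[float], int], List[dict]],
    graph_lookup: Callable[[str], Iterable[str]],
    top_k: int = 5,
) -> List[dict]:
    """Hop 1: semantic hits. Hop 2: follow each hit's graph_episode_id into the graph."""
    results = []
    for hit in vector_search(query_vector, top_k):
        payload = hit["payload"]
        episode_id = payload["graph_episode_id"]
        results.append({
            "text": payload["text"],
            "graph_episode_id": episode_id,
            "entities": list(graph_lookup(episode_id)),
        })
    return results


# Stand-ins for the real clients, for illustration only.
def fake_vector_search(vector, k):
    return [{"payload": {"text": "chunk text", "graph_episode_id": "ep-1"}}][:k]


def fake_graph_lookup(episode_id):
    return {"ep-1": ["Entity A", "Entity B"]}.get(episode_id, [])


context = hybrid_retrieve([0.0, 1.0], fake_vector_search, fake_graph_lookup)
```

In a real deployment, graph_lookup would run a query such as EPISODE_ENTITIES_CYPHER against FalkorDB with the episode UUID as its parameter.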

The Hidden Cost of Asynchronicity

Architect for stability first; speed is simply a byproduct of a system that (almost) never crashes.

This new architecture is far more stable, but it may also be slower. As the main input is the chunks, the graph and vector pipelines will starve if none are available. Our initial solution is to chunk all documents first and then run the graph and vector pipelines periodically. If new knowledge is incorporated, the document gets chunked and is prioritized based on its inferred canonical date.

In a local setup, resources are scarce and the GPU can choke if both the graph and vector pipelines trigger at the same time; however, it’s difficult to choose a precise time to start each pipeline, as the processing time per chunk may vary significantly. Specifically for the graph pipeline, we measured the following times (n=200):

Cloud (Gemini Flash), per chunk:

  • AVG: 0:02:29
  • STD: 0:00:49

Local setup (Gemma 3 27B), per chunk:

  • AVG: 0:06:27
  • STD: 0:04:58

For now, we are using a formula to prevent the date inference, graph and vector workflows from triggering at overlapping times. We are also evaluating the use of additional CPUs/GPUs or leveraging cloud models (without abandoning our privacy policies or incurring stratospheric costs) for date inference and embeddings.
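The post does not give the scheduling formula itself, so the following is only one plausible way to turn the measured timings into a non-overlapping schedule: reserve avg + k * std per chunk before the next workflow is allowed to start. The function names, the safety factor k = 2, and reading the timings as h:mm:ss are all assumptions.

```python
from datetime import timedelta


def parse_hms(value: str) -> timedelta:
    """Parse an h:mm:ss string such as '0:06:27'."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def batch_budget(avg: timedelta, std: timedelta, chunks: int, k: float = 2.0) -> timedelta:
    """Conservative wall-clock budget for a batch: chunks * (avg + k * std)."""
    return chunks * (avg + k * std)


local = batch_budget(parse_hms("0:06:27"), parse_hms("0:04:58"), chunks=10)
cloud = batch_budget(parse_hms("0:02:29"), parse_hms("0:00:49"), chunks=10)
print(local, cloud)  # 2:43:50 0:41:10
```

Under these assumptions, a 10-chunk graph batch on the local setup would reserve the GPU for roughly 2 hours 44 minutes before the vector workflow is triggered. The high local standard deviation is what makes the budget so much larger than the average alone would suggest.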

Additionally, while decoupling fixes VRAM thrashing, we’ve effectively traded memory limits for network and I/O overhead. Our system is now constantly polling Postgres for state transitions (completed vs pending), and at enterprise scale this naive polling can become its own I/O bottleneck. In subsequent iterations we will explore using message queues or a workflow-shared mutex to avoid this constant state polling.

We’ll keep you posted.