Four RAG architectures and when to use each one

This week, I dove deep into RAG systems. Some architectures optimize for speed, others for precision, and still others for synthesizing information across multiple sources. Here’s a framework for thinking about which RAG architecture fits which problem.

Four RAG Architectures

1. Query-based RAG: The simple default

This is the simplest and most common pattern. The system retrieves relevant documents, concatenates them with the user’s query as plain text, and feeds everything to the generator.

It works with any language model, requires no special architecture, and makes debugging straightforward since we can see exactly what context the model receives. Tools like Perplexity AI and ChatGPT’s web search use this pattern.

The main limitation is the context window: with many or lengthy documents, the concatenated text may simply not fit.
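To make the pattern concrete, here’s a minimal sketch in Python. The `retrieve` and `generate` functions are hypothetical placeholders for whatever vector store and language model a given system uses; the prompt assembly is the point.

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k most relevant document chunks (placeholder for a real vector store)."""
    ...

def generate(prompt: str) -> str:
    """Call the language model with a plain-text prompt (placeholder)."""
    ...

def query_based_rag(query: str) -> str:
    docs = retrieve(query, k=3)
    # Everything the model sees is plain text, which is what makes this
    # pattern easy to debug: just print the prompt.
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs))
    prompt = f"Answer the question using the context below.\n\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Everything lives in the prompt, which is also why the context window becomes the bottleneck as the number of documents grows.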

2. Latent representation RAG: For multi-source synthesis

Instead of concatenating text, this approach encodes the query and each retrieved document separately as embeddings, then fuses these representations before generation.

A product like Notion AI is a natural fit for latent representation RAG: it synthesizes information across Slack messages, Google Drive documents, GitHub pull requests, and project management tickets simultaneously. Latent RAG can encode all these sources in parallel and combine their representations efficiently.

The trade-off is complexity. It requires specialized model architectures and is harder to interpret than simple text concatenation.
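As a rough illustration of what “fusing representations” means, here’s a toy fusion step in NumPy. Real systems do this inside the model (for example via cross-attention over encoded documents); the similarity-weighted average below is only a stand-in for that idea, and the embeddings are random placeholders.

```python
import numpy as np

def fuse(query_emb: np.ndarray, doc_embs: list[np.ndarray]) -> np.ndarray:
    docs = np.stack(doc_embs)                        # (num_docs, dim)
    scores = docs @ query_emb                        # relevance of each source to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over documents
    return weights @ docs                            # one fused vector the generator conditions on

query_emb = np.random.rand(768)
doc_embs = [np.random.rand(768) for _ in range(4)]   # e.g. Slack, Drive, GitHub, tickets
fused = fuse(query_emb, doc_embs)                    # shape: (768,)
```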

3. Logit-based RAG: For precision and exact terminology

This provides fine-grained control during generation. Retrieved information directly adjusts the probability scores (logits) for each potential next token.

This matters when exact wording is critical. Examples include medical coding systems that must use precise ICD-10 codes, legal document generation that requires specific contractual language, or API documentation that needs to preserve exact function names.

In this case, we’re trading natural language fluency for precision and fidelity to source material. The survey paper I read specifically highlighted this approach’s effectiveness for code summarization and image captioning tasks.
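A toy way to picture “adjusting the logits”: boost the scores of tokens that appear in the retrieved source before sampling. The vocabulary size, logits, and bias value below are made up for illustration; production systems compute this adjustment in more principled ways.

```python
import numpy as np

def bias_logits(logits: np.ndarray, source_token_ids: set[int], bias: float = 4.0) -> np.ndarray:
    adjusted = logits.copy()
    for token_id in source_token_ids:
        adjusted[token_id] += bias          # push generation toward exact source terminology
    return adjusted

logits = np.random.randn(10)                # model's raw next-token scores over a tiny vocabulary
source_token_ids = {2, 7}                   # e.g. the tokens of an exact ICD-10 code
probs = np.exp(bias_logits(logits, source_token_ids))
probs /= probs.sum()                        # biased next-token distribution
```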

4. Speculative RAG: Optimizing for speed

This method decouples the generator and retriever. As the generator produces tokens, the retriever independently searches for exact matches in source documents. When it finds one, the system uses the retrieved sequence instead of continuing token-by-token generation, saving computation time.

It works particularly well for code autocomplete and similar sequential tasks where answers often exist verbatim in the knowledge base. The approach requires that relevant content can be retrieved quickly enough to beat generation speed.
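Here’s a simplified sketch of that splicing behavior. `next_token` stands in for a real model call, and the match length and copy length are arbitrary choices for the example; the point is that matching spans get copied from the source rather than generated token by token.

```python
def next_token(draft: list[str]) -> str:
    """Return the model's next token given the draft so far (placeholder)."""
    ...

def speculative_generate(prompt: list[str], source: list[str], max_len: int = 50) -> list[str]:
    draft = list(prompt)
    while len(draft) < max_len:
        tail = draft[-3:]                                    # last few tokens of the draft
        matched = False
        for i in range(len(source) - len(tail) + 1):
            if source[i:i + len(tail)] == tail:
                continuation = source[i + len(tail): i + len(tail) + 10]
                if continuation:
                    # Found the tail verbatim in the source: copy the continuation
                    # instead of generating it one token at a time.
                    draft += continuation
                    matched = True
                    break
        if not matched:
            draft.append(next_token(draft))                  # no match, decode normally
    return draft
```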

Five Ways to Enhance RAG Systems

Beyond choosing the right architecture, there are five categories of enhancements.

1. Input enhancement

This focuses on preparing queries and data before retrieval. Techniques like query transformation generate pseudo-documents from queries or decompose ambiguous queries into clearer sub-queries. Data augmentation cleans source material by removing irrelevant information and updating outdated documents.
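Query decomposition, for instance, can be as simple as one extra model call before retrieval. `generate` is a hypothetical LLM call and the prompt wording is an assumption; each sub-query is then retrieved separately and the results are merged.

```python
def generate(prompt: str) -> str:
    """Call a language model and return its text output (placeholder)."""
    ...

def decompose_query(query: str) -> list[str]:
    prompt = (
        "Break the following question into simpler, self-contained sub-questions, "
        f"one per line:\n{query}"
    )
    # One sub-query per non-empty line of the model's response.
    return [line.strip() for line in generate(prompt).splitlines() if line.strip()]
```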

2. Retriever enhancement

This improves the quality of retrieved content. Methods include recursive retrieval (multiple search passes for richer results), chunk optimization (adjusting granularity using principles like “small-to-big”), hybrid retrieval combining sparse and dense methods, and re-ranking to improve diversity and relevance.
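Hybrid retrieval, for example, just blends two scores per document. The keyword-overlap and cosine functions below are deliberately simplistic stand-ins for BM25 and a neural encoder, and the 50/50 weighting is an arbitrary default.

```python
import numpy as np

def sparse_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)          # crude keyword overlap (stand-in for BM25)

def dense_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    return float(query_emb @ doc_emb / (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)))

def hybrid_rank(query, query_emb, docs, doc_embs, alpha=0.5):
    scored = [
        (alpha * sparse_score(query, doc) + (1 - alpha) * dense_score(query_emb, emb), doc)
        for doc, emb in zip(docs, doc_embs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # re-rank by the blended score
    return [doc for _, doc in scored]
```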

3. Generator enhancement

This optimizes the language model itself through prompt engineering, decoding parameter tuning for better control, and fine-tuning on domain-specific data when the model needs specialized knowledge.
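Decoding-parameter tuning is the easiest of these to picture: the same logits yield a sharper or flatter next-token distribution depending on temperature. The numbers below are arbitrary.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    scaled = scaled - scaled.max()              # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(logits, 0.2))    # near-deterministic: good for factual answers
print(softmax_with_temperature(logits, 1.5))    # flatter: more varied phrasing
```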

4. Result enhancement

This post-processes generated outputs to meet downstream task requirements. Examples include refining code-related outputs or reformatting responses for specific use cases.
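A small example of what this can look like in practice: checking that the model’s output is well-formed JSON before handing it to a downstream system. The fallback behavior here is an illustrative assumption.

```python
import json

def to_structured(raw_output: str) -> dict:
    try:
        return json.loads(raw_output)            # downstream task expects structured data
    except json.JSONDecodeError:
        # Flag the failure instead of passing malformed text along.
        return {"error": "unparseable model output", "raw": raw_output}
```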

5. Pipeline enhancement

This optimizes the overall RAG process. Adaptive retrieval techniques determine when retrieval actually helps versus when the model’s inherent knowledge suffices. Iterative RAG cycles through retrieval and generation phases to progressively refine results.
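Here’s a sketch of adaptive retrieval, assuming the model can report some confidence score for its own answer (itself a strong assumption). `generate_with_confidence` and `retrieve` are placeholders, and the threshold is arbitrary.

```python
def generate_with_confidence(prompt: str) -> tuple[str, float]:
    """Return an answer and a 0-1 confidence score (placeholder)."""
    ...

def retrieve(query: str) -> str:
    """Return relevant context for the query (placeholder)."""
    ...

def adaptive_rag(query: str, threshold: float = 0.8) -> str:
    answer, confidence = generate_with_confidence(query)
    if confidence >= threshold:
        return answer                            # the model's own knowledge suffices
    context = retrieve(query)                    # otherwise, retrieve and try again
    answer, _ = generate_with_confidence(f"{context}\n\n{query}")
    return answer
```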

The Key Insight

The right RAG approach depends entirely on the problem at hand. Most applications work fine with query-based RAG. It’s simple, transparent, and effective.

The more sophisticated architectures exist for specific challenges:

  • Latent representation for multi-source synthesis
  • Logit-based for precision requirements
  • Speculative for speed optimization

Understanding these patterns helps us evaluate AI products differently. When we see a tool’s capabilities and constraints, we can reason backward about the architectural trade-offs its builders likely made.

That’s the real value of studying these systems: not just knowing the techniques, but understanding when each one matters.