Skip to content

This is a community translation of the original Chinese text. The translation may contain inaccuracies. When in doubt, please refer to the original Chinese version.

The Essence of RAG

Stripping Away the Packaging

RAG (Retrieval-Augmented Generation) is one of the most over-packaged concepts in the LLM application space. The frameworks, toolchains, and best-practice documents built around it create the impression that RAG is an independent paradigm requiring its own dedicated technology stack.

Strip away the framework wrappers and the core operation of RAG is simply: stuff relevant information into the context before calling the LLM.

python
def rag_in_ten_lines(query: str, documents: list[str]) -> str:
    # 1. Retrieve: find documents relevant to the query
    relevant_docs = retrieve(query, documents, top_k=3)

    # 2. Augment: inject retrieved results into the context
    context = "\n\n".join(relevant_docs)
    prompt = f"Answer the question based on the following references.\n\nReferences:\n{context}\n\nQuestion: {query}"

    # 3. Generate: call the LLM
    return call_llm(prompt)

These ten lines contain the entire essence of RAG. Everything that follows -- vector databases, embedding models, chunking strategies, reranking -- is optimization of these three steps.

Once you understand this essence, you can judge when to use RAG and when not to. If your "relevant information" can be obtained through simple keyword matching, you do not need vector retrieval. If your total document volume fits within a single context window, you do not need RAG -- just put all the documents into the prompt. RAG solves a specific problem: when the total volume of relevant information exceeds the context window, and you need to dynamically select which information to include.

Retrieval Quality Determines Everything

The quality bottleneck in a RAG system is almost never on the generation side -- as long as the context is sufficient, the LLM usually gives a reasonable answer. The bottleneck is on the retrieval side: whether the information you stuffed into the context is actually relevant to the question.

This is a critical judgment obscured by framework abstractions. When developers use LangChain or LlamaIndex to build a RAG pipeline, enormous effort goes into framework configuration and tuning -- which vector database, which embedding model, what chunk size. But the question that actually determines system quality is: given the user's query, do the retrieved document fragments contain the information needed to answer it?

If irrelevant documents are retrieved, the LLM either ignores them (wasting precious context window space) or gets misled by them (generating answers based on irrelevant information). If key documents are missed, the LLM can only answer based on incomplete information -- this is more dangerous than having no RAG at all, because the user assumes the system "consulted the references."

So when developing a RAG system, evaluating retrieval quality is the most important step. Common metrics include:

  • Recall: of all relevant documents, what proportion was retrieved.
  • Precision: of the retrieved documents, what proportion is actually relevant.
  • MRR (Mean Reciprocal Rank): the rank position of the first relevant document.

These metrics have nothing to do with the LLM -- they measure purely the quality of the information retrieval system. RAG is first a retrieval problem, and only second a generation problem.

The Engineering Trade-offs of Chunking

Splitting long documents into chunks is one of the most critical engineering decisions in RAG. The chunking strategy directly affects both retrieval quality and generation quality.

Chunks too large: low semantic signal-to-noise ratio. A 2000-token chunk may have only two or three sentences relevant to the query; the rest is noise. Large chunks also consume more context window space, limiting how many chunks can be included.

Chunks too small: context fragmentation. A 100-token chunk may lose paragraph-level semantic completeness. The premise of an argument sits in one chunk, the conclusion in another -- if only one is retrieved, the LLM gets incomplete information.

This trade-off has no universally optimal solution, because the optimal chunk size depends on the document's structure and the granularity of queries. Technical documentation (clear hierarchical structure, self-contained paragraphs) is well suited to splitting by paragraph or section; narrative documents (strong contextual dependency) need larger chunks or overlapping splits.

A practical rule of thumb: chunk boundaries should align with semantic boundaries. Splitting by paragraph is better than splitting by fixed token count; splitting by section headings is better than splitting by paragraph. This means the chunking strategy should leverage the document's structural information (heading hierarchy, list structure, code block boundaries).

RAG Is Not a Silver Bullet

RAG is overused. In the following scenarios, RAG is not the right solution:

Real-time data queries. RAG's retrieval source is a pre-indexed document store. If the user needs real-time data (current stock prices, latest news, live system status), RAG cannot provide it -- call the appropriate API directly and inject the result into the context.

Precise computation. The user asks "what is the total sales for the past three months." Even if RAG retrieves documents containing monthly sales figures, the LLM's arithmetic is unreliable. Use code to compute, then inject the result into the context.

Full-corpus analysis. The user requests "analyze common themes across all customer feedback." RAG's top-k retrieval means you are only seeing a subset -- full-corpus analysis is impossible. Process the full dataset first with traditional methods (clustering, topic models), then use the LLM to interpret the results.

Knowledge already in the model. The user asks a common-sense question or one likely covered in the model's training data. RAG retrieval may introduce noise. The criterion: if the LLM's answer quality is already good enough without RAG, adding RAG only increases latency and cost.

Determine whether the problem is suited to RAG before optimizing the RAG implementation details.