RAG FROM FIRST PRINCIPLES · PART 10 OF 20

2026-06-16

Advanced RAG Architectures

The leap from a fixed pipeline that runs the same way every time to a dynamic, decision-making loop that can choose whether to retrieve, judge what came back, and try again. Part 10 of a from-scratch series on Retrieval-Augmented Generation: a guided tour of Agentic RAG, Corrective RAG (CRAG), Self-RAG, GraphRAG, and Multi-Modal RAG, what control flow each one adds, and the sober cost of reaching for any of them.

What you’ll learn

For nine parts we built and tuned a pipeline that, however clever its retrieval, still runs as a single fixed path: retrieve, then generate, the same way for every query. This part is where that pipeline grows a brain. We trade the straight line for a dynamic, decision-making loop: a system that can decide whether to retrieve at all, judge whether what came back is any good, retrieve again when it is not, reason in steps, and traverse structured knowledge. This is the map of the advanced architectures (Agentic RAG, Corrective RAG, Self-RAG, GraphRAG, and Multi-Modal RAG) and an honest account of what each one costs.

Prerequisites

Parts 1 through 9, and especially the last two: Making Retrieval Smarter (Part 8, reranking, query transformation, and query decomposition) and Advanced Retrieval Patterns (Part 9). You should also have the hand-built app from Build Your First RAG (Part 6) in your head, because every architecture here is a modification of that same retrieve-augment-generate loop. No new math. This part is intuition and architecture, with one short illustrative sketch.

From pipeline to loop

Here is exactly where Part 9 left us. We have a well-tuned retriever: hybrid search, a reranker, query decomposition, parent-document retrieval, every trick in the book. And yet the shape of the system has not changed since Part 6. A query comes in, we retrieve once, we stuff the chunks into the prompt, we generate. One path. The same path for “what is our refund window?” and for a question our corpus cannot possibly answer. The pipeline never asks itself whether it should retrieve, never checks whether the chunks it got are relevant, never tries a second time. It is a straight line, run identically every time.

The leap in this part is to add control flow: the branches, loops, and conditions that let a program decide what to do next based on what it has seen so far. A naive RAG system has none. An advanced one decides whether to retrieve, judges what came back, retrieves again from a different source if the first attempt was bad, reasons across several steps, or walks a structured graph of knowledge instead of grabbing isolated chunks. That single idea, fixed pipeline becomes decision-making loop, is the umbrella over every named architecture below. They are not five unrelated products. Four of them, Agentic RAG, Corrective RAG, Self-RAG, and GraphRAG, are different ways of adding the same thing: the ability to decide. The fifth, Multi-Modal RAG, sits on a separate axis; it does not change whether or how the system decides, it changes what the system can retrieve over. We tour the four control-flow shapes first, then that orthogonal one.

The clearest way to feel the difference is to watch one query travel both shapes. Our store assistant gets asked: “What is the battery life of the X1 wireless earbuds?” The corpus, remember, holds only store policies (refunds, shipping, warranty), not product specs.

Fig 1 The same query down two shapes. The naive pipeline retrieves once from the policy index no matter what and is forced to answer from chunks that hold no battery spec. The agentic loop decides to retrieve, grades the chunks, sees they are irrelevant, corrects by routing to a different source, retrieves again, confirms the new chunks are relevant, and only then generates a grounded answer.

That bad-retrieval, then correct, then retry branch is the whole story of this part in miniature. Everything that follows is a way of building that branch, or a richer version of it.

Now that we have the spine, here is the map of the four control-flow shapes we are about to tour, drawn so you can see at a glance what control flow each one bolts onto the basic loop.

Four panels around a shared retrieve-generate core. Panel one, Agentic RAG, shows an agent node with arrows to several tools (a vector store, a web-search icon, a database) and a reason-act-observe loop arrow. Panel two, Corrective RAG, shows retrieval feeding an evaluator that grades chunks good or bad, with a bad branch looping back to a web-search fallback. Panel three, Self-RAG, shows the model itself emitting reflection markers that gate retrieval and check groundedness. Panel four, GraphRAG, shows a small graph of entity nodes joined by labelled edges being traversed instead of a flat list of chunks.
Fig 2 The map of advanced architectures. Agentic RAG makes retrieval a tool an agent chooses to call; Corrective RAG adds an evaluator that grades chunks and triggers a fallback; Self-RAG has the model reflect on whether to retrieve and whether it is grounded; GraphRAG retrieves over a knowledge graph instead of isolated chunks. Each adds control flow to the same retrieve-generate core.

These four share one theme: control flow. Multi-Modal RAG is the fifth idea, and a different kind: it widens what the loop can read rather than adding a decision, so it sits outside this map and we cover it last.

Agentic RAG

Start with the most general shape, the one the other three are special cases of.

What an agent is

An agent, in this setting, is a language model that reasons about a task, chooses an action or a tool to use, observes the result, and then decides what to do next, in a loop, until it judges the task done. The contrast is with a fixed script. Our Part 6 app is a script: it always calls retrieve, always calls build_prompt, always calls generate, in that order, no matter the question. An agent is not handed that order. It is handed a goal and a set of capabilities and left to sequence them itself.

That loop has a name you will see everywhere: the ReAct loop, short for reason and act. The model thinks (reason), takes an action such as calling a tool (act), reads what the action returned (observe), and repeats. Reason, act, observe, repeat, until it has enough to answer. The point is that the number and order of steps are decided at run time, by the model, not fixed in advance by you.
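The reason, act, observe cycle can be sketched in a few lines. This is a minimal illustration, not a real agent: `react_loop`, `scripted_policy`, and the `tools` dictionary are hypothetical names, the "model" is a scripted function, and the refund text is made-up sample data.

```python
from typing import Callable

def react_loop(question: str,
               policy: Callable[[str, list[str]], tuple[str, str]],
               tools: dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    """Run reason -> act -> observe until the policy answers or the cap is hit."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, argument = policy(question, observations)   # reason
        if action == "answer":
            return argument
        observations.append(tools[action](argument))        # act, then observe
    return "I don't know based on the available sources."


# Hypothetical stand-in for the model: search once, then answer.
def scripted_policy(question: str, observations: list[str]) -> tuple[str, str]:
    if not observations:
        return ("search", question)
    return ("answer", f"Based on: {observations[-1]}")

# Made-up tool output; a real tool would query an index.
tools = {"search": lambda q: "Refunds are accepted within 30 days."}

answer = react_loop("What is our refund window?", scripted_policy, tools)
print(answer)
```

The number of steps is not written anywhere in `react_loop`; it falls out of what the policy returns, with `max_steps` as the only hard limit.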

What makes RAG agentic

Agentic RAG is what you get when retrieval stops being a mandatory first step and becomes one tool the agent can choose to call. Tool use is exactly that: giving the model a set of named capabilities it can invoke, each with a description of what it does, so the model can pick the right one for the moment. Retrieval is now one such tool. The agent decides whether to retrieve at all (a greeting or a pure-arithmetic question needs no documents), what to retrieve, and how many times.

And retrieval is rarely the only tool. A realistic agent has several: more than one vector store (a policy index, a product-spec index), a web-search tool for current events, an API or SQL database for live data like order status, even a calculator so it does not fumble arithmetic the way language models famously do. The agent calls whichever tools the task needs, in whatever order, then synthesizes the results into one answer.
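One way to picture tool use is as a small registry of named, described capabilities. The sketch below is an assumption-laden toy: the `Tool` class, the tool names, and the bracketed return strings are all invented for illustration, and the calculator only understands "a + b".

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str              # what the model reads when choosing a tool
    run: Callable[[str], str]

def calculator(expression: str) -> str:
    # Toy calculator: only handles "a + b" style input.
    left, right = expression.split("+")
    return str(float(left) + float(right))

TOOLS = {
    t.name: t for t in [
        Tool("policy_index", "Search store policies (refunds, shipping, warranty).",
             lambda q: f"[policy chunks for: {q}]"),
        Tool("order_db", "Look up live order status by order id.",
             lambda q: f"[status for order {q}]"),
        Tool("calculator", "Add two numbers, e.g. '2 + 3'.", calculator),
    ]
}

def tool_menu() -> str:
    """Render the descriptions that would go into the agent's prompt."""
    return "\n".join(f"- {t.name}: {t.description}" for t in TOOLS.values())

def call_tool(name: str, argument: str) -> str:
    """Dispatch by name; an unknown name becomes an observation, not a crash."""
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name].run(argument)

print(tool_menu())
print(call_tool("calculator", "2 + 3"))
```

Returning an error string for an unknown tool is a design choice: the agent can read it as an observation and try something else.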

Routing and multi-hop

Two patterns fall out of this naturally. The first is query routing: deciding which source a query should go to. An HR question goes to the HR index, a question about the codebase goes to the code index, a question about today’s news goes to web search. Routing is a small, cheap decision (often a single classification call) that prevents the system from searching a store that could not possibly hold the answer. Our flagship example was a routing failure waiting to happen: the naive pipeline sent a product-spec question to the policy index because it had no choice; an agent routes it to the spec source instead.
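A router can be very small. The sketch below uses keyword overlap as a stand-in for the single classification call mentioned above; the source names and keyword sets in `ROUTES` are hypothetical, and a production router would more likely use a model or an embedding classifier.

```python
# Hypothetical sources and trigger words; tune these to your own corpus.
ROUTES = {
    "policy_index": {"refund", "return", "shipping", "warranty"},
    "spec_index": {"battery", "weight", "bluetooth", "dimensions"},
    "web_search": {"news", "today", "latest"},
}

def route(query: str, default: str = "policy_index") -> str:
    """Pick the source whose keywords overlap the query most; fall back to a default."""
    words = {w.strip("?.,!").lower() for w in query.split()}
    scores = {source: len(words & keywords) for source, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route("What is the battery life of the X1 wireless earbuds?"))
print(route("What is our refund window?"))
```

With these keyword sets the battery question goes to `spec_index` rather than the policy index, which is the routing decision the naive pipeline could not make.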

The second is multi-hop retrieval: chaining retrievals where one answer feeds the next. Consider “what is the warranty on the earbuds made by the company that acquired Acme?” You cannot answer that in one lookup. You retrieve to learn who acquired Acme, then retrieve again using that name to find the warranty. This is query decomposition from Part 8 made dynamic: instead of splitting the question up front into a fixed set of sub-questions, the agent discovers the next hop from the result of the last one.
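The Acme question can be traced as two chained lookups. This is only a sketch of the data flow: `FACTS` is a made-up lookup table standing in for a retriever, and "Globex" and "2 years" are invented values.

```python
# Made-up facts standing in for a real index.
FACTS = {
    "who acquired acme": "Globex",
    "warranty on globex earbuds": "2 years",
}

def retrieve(query: str) -> str:
    """Toy retriever: exact lookup on the lowercased query."""
    return FACTS[query.lower()]

def multi_hop_warranty(company: str) -> tuple[str, str]:
    acquirer = retrieve(f"who acquired {company}")           # hop 1
    warranty = retrieve(f"warranty on {acquirer} earbuds")   # hop 2 is built from hop 1's answer
    return acquirer, warranty

print(multi_hop_warranty("Acme"))
```

The second query cannot be written until the first has returned, which is what separates multi-hop retrieval from splitting the question up front.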

The trade-off is the honest catch. Agents are powerful and flexible, and they are also slower (several model calls instead of one), costlier (you pay for every reasoning step and tool call), less predictable (the same question can take a different path twice), and much harder to debug (a wrong answer might come from any step in a chain you did not design). An agent can loop without converging or wander down an irrelevant tool call. Power and unpredictability are two sides of the same coin.

Corrective RAG (CRAG)

If Agentic RAG is the whole open-ended loop, Corrective RAG is one disciplined slice of it, and often all you actually need.

Corrective RAG (CRAG) adds a single component to the basic pipeline: a lightweight retrieval evaluator, a small model or classifier whose only job is to grade the chunks that retrieval returned. The grade is coarse, typically relevant, ambiguous, or irrelevant, and the system acts on it before generating. If the chunks are relevant, proceed as normal. If they are irrelevant, do not blindly generate on bad context; instead trigger a corrective action, such as falling back to web search, or reformulating the query and retrieving again. Ambiguous sits in between and usually combines both.

The principle is simple and worth saying plainly: a naive pipeline generates from whatever retrieval hands it, even garbage. CRAG inserts a check between retrieve and generate so the system can catch a bad retrieval and fix it rather than confidently answering from chunks that do not contain the answer. That is precisely the corrective branch you watched in Fig 1.

Here is the control flow as a small sketch. It is conceptual, not a runnable build, but it shows the shape: retrieve, grade, then either correct and retry or generate.

# retrieve, evaluator, generate, rewrite_for_web, web_search, and reformulate
# are placeholders for your own components.
def corrective_rag(query, max_tries=2):
    search_query = query                        # what we search with; may be rewritten
    for _ in range(max_tries):
        chunks = retrieve(search_query)         # search our own index first
        grade = evaluator.grade(query, chunks)  # relevant / ambiguous / irrelevant

        if grade == "relevant":
            return generate(query, chunks)      # good context: answer now

        # bad context: do NOT generate on it. take a corrective action.
        if grade == "irrelevant":
            web_query = rewrite_for_web(query)  # reformulate for an outside source
            chunks = web_search(web_query)      # fall back to a different source
            return generate(query, chunks)      # a real CRAG would grade these too

        # ambiguous: tweak the search query and loop to try our index again
        search_query = reformulate(search_query)

    # ran out of tries without good context: refuse honestly (Part 6's grounding)
    return "I don't know based on the available sources."

CRAG shines in open-domain settings, where your own index simply may not hold the answer and an outside fallback is the difference between a useful reply and a confident guess. It is far cheaper and more predictable than a full agent: one extra evaluation step, one well-defined branch, no open-ended wandering.
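To watch the three branches fire, the conceptual sketch above can be filled in with toy components. Everything here is a stand-in under stated assumptions: the index, the keyword-count grader and its thresholds, the canned web result, and the reformulation rule are all hypothetical, and the policy and battery figures are made up.

```python
# Each local chunk carries hand-picked keywords; a real system would use embeddings.
LOCAL_INDEX = [
    ({"refund", "window"}, "Refunds are accepted within 30 days of delivery."),
    ({"shipping", "delivery"}, "Standard shipping takes 3 to 5 business days."),
]

def retrieve(query):
    """Return (score, text) pairs, best first; score = chunk keywords found in the query."""
    q = query.lower()
    scored = [(sum(kw in q for kw in kws), text) for kws, text in LOCAL_INDEX]
    return sorted(scored, reverse=True)

def grade(chunks):
    """Toy evaluator: grade on the best chunk's keyword count."""
    best = chunks[0][0] if chunks else 0
    return "relevant" if best >= 2 else "ambiguous" if best == 1 else "irrelevant"

def web_search(query):
    # Canned, made-up result standing in for an outside source.
    return [(2, "Web result: the X1 earbuds run about 6 hours per charge.")]

def generate(question, chunks):
    return f"{question} -> {chunks[0][1]}"

def corrective_rag(question, max_tries=2):
    search_query = question
    for _ in range(max_tries):
        chunks = retrieve(search_query)
        verdict = grade(chunks)
        if verdict == "relevant":
            return "local", generate(question, chunks)
        if verdict == "irrelevant":
            return "web", generate(question, web_search(search_query))
        search_query = question + " delivery"   # toy reformulation for the ambiguous case
    return "none", "I don't know based on the available sources."

for q in ["What is the refund window?",
          "What is the battery life of the X1 earbuds?",
          "How long does shipping take?"]:
    print(corrective_rag(q))
```

The first question is answered locally, the second falls back to the web, and the third is graded ambiguous, reformulated, and then answered locally on the second try.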

Self-RAG (self-reflective RAG)

CRAG bolts an evaluator onto the outside of the model. Self-RAG moves the judgment inside the model itself.

Self-RAG, short for self-reflective RAG, trains the model to emit special reflection tokens as it works: little control signals woven into its own output that decide three things on the fly. First, whether retrieval is even needed for this part of the answer (a token such as Retrieve versus No-Retrieve). Second, whether each retrieved passage is actually relevant and supportive of the point being made (a relevance judgment like Relevant versus Irrelevant). Third, and most distinctively, whether its own statement is genuinely grounded in the evidence it retrieved, a self-check that the sentence it just wrote is supported by the passage rather than invented (a support judgment such as Supported versus No-support).

So the model interleaves retrieval, generation, and assessment in one pass: it decides to retrieve, pulls passages, judges them, writes a sentence, checks that the sentence is grounded, and continues. The critique is built in rather than bolted on.
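A toy way to see what those signals buy you is to treat each generated segment as carrying its own reflection labels and let a controller act on them. This is a loose illustration under heavy assumptions: the `Segment` class and `keep` rule are hypothetical, the label spellings follow the text above rather than any model's real vocabulary, the segments are hand-written instead of generated, and the hard filter is a simplification of my own rather than the paper's decoding procedure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    retrieve: str               # "Retrieve" or "No-Retrieve"
    relevance: Optional[str]    # "Relevant" / "Irrelevant"; None when nothing was retrieved
    text: str
    support: Optional[str]      # "Supported" / "No-support"; None when nothing was retrieved

def keep(segment: Segment) -> bool:
    """Keep a sentence if it needed no evidence, or its evidence was relevant and supports it."""
    if segment.retrieve == "No-Retrieve":
        return True
    return segment.relevance == "Relevant" and segment.support == "Supported"

# Hand-written stand-in for what a reflective model might emit.
mock_generation = [
    Segment("No-Retrieve", None, "Happy to help.", None),
    Segment("Retrieve", "Relevant", "Refunds are accepted within 30 days.", "Supported"),
    Segment("Retrieve", "Irrelevant", "Refunds are paid in store credit.", "No-support"),
]

answer = " ".join(s.text for s in mock_generation if keep(s))
print(answer)
```

The unsupported third sentence is dropped, while the greeting survives because the model judged that it needed no retrieval at all.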

The contrast with CRAG is the crisp thing to hold onto. CRAG uses a separate evaluator that sits outside the model and gates retrieval before generation. Self-RAG has the model itself reflect, throughout generation, on both whether to retrieve and whether each of its own claims is grounded. CRAG asks “are these chunks good?” once, from the outside. Self-RAG asks “do I even need to look this up, and is what I just said actually supported?” continuously, from the inside. Both fight the same enemy, generating confidently from thin air, from opposite ends.

GraphRAG (knowledge-graph RAG)

Every architecture so far still retrieves the same thing our pipeline always has: isolated chunks, ranked by similarity. That works beautifully when the answer lives in one passage. It struggles badly with two kinds of question.

The first is the multi-hop, connect-the-dots question whose answer is scattered across many documents, where no single chunk holds it and similarity search keeps returning pieces that are individually relevant but never assembled. The second is the “global” question, like “what are the main themes across this entire corpus?” There is no chunk that is the answer; the answer is a property of the whole collection, and top-k retrieval, which by design returns a handful of the most similar pieces, cannot see the forest for the trees.

GraphRAG addresses this by changing what you retrieve over. The key structure is a knowledge graph: a representation of information as entities (the nodes, things like people, products, companies) joined by relationships (the edges, labelled connections like “made by,” “acquired,” “warranty covers”). Where vector search gives you a flat bag of chunks, a graph gives you the explicit connections between facts.

The recipe runs in two phases. At build time, you use a language model to read the corpus and extract entities and the relationships between them, assembling them into a graph. Optionally you then cluster the graph into communities, tightly connected groups of entities, and have the model write a summary of each one, a step called community summarization that gives you a hierarchy of pre-digested answers to “what is this part of the corpus about?” At query time, instead of (or alongside) vector similarity, you traverse the graph, following edges to gather connected facts, or you read the relevant community summaries for a holistic question.
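The query-time traversal can be sketched with nothing more than a dictionary and a breadth-first search. The triples below are made-up data in the shape an extraction pass might produce, and `build_graph` and `find_path` are hypothetical helper names, not any library's API.

```python
from collections import deque

# (subject, relation, object) triples; invented for illustration.
TRIPLES = [
    ("X1 earbuds", "made by", "Acme"),
    ("Acme", "acquired by", "Globex"),
    ("Globex", "warranty covers", "2 years"),
    ("X1 earbuds", "category", "audio"),
]

def build_graph(triples):
    """Adjacency list: node -> list of (relation, neighbor)."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph

def find_path(graph, start, goal_relation):
    """Breadth-first walk from `start` until an edge labelled `goal_relation` is found."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        for relation, neighbor in graph.get(node, []):
            step = path + [(node, relation, neighbor)]
            if relation == goal_relation:
                return step
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, step))
    return None

graph = build_graph(TRIPLES)
path = find_path(graph, "X1 earbuds", "warranty covers")
for hop in path:
    print(hop)
```

The returned path is the connected evidence you would hand to the generator: three facts from what could be three different documents, already assembled in order.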

On the left, vector-only retrieval shown as three separate chunk cards with no connections between them. On the right, a knowledge graph of entity nodes (a product, a company, a warranty) joined by labelled edges reading made by and warranty covers; a highlighted path walks from the product node to the company node to the warranty node, illustrating multi-hop traversal. Below the graph, a dashed box labelled community summary groups several nodes and holds a one-line description, illustrating community summarization for global questions.
Fig 3 GraphRAG retrieves over structure, not a flat list. Vector-only RAG returns three isolated chunks that never connect. The graph links entities by labelled relationships, so a multi-hop question can be answered by walking edges from one entity to the next, and a community summary answers a global question no single chunk could.

GraphRAG shines exactly where flat retrieval fails: multi-hop reasoning, connecting facts that live in different documents, and holistic or summary questions about a whole body of text. The trade-off is steep. Building the graph means running a model over your entire corpus to extract entities and relationships, which is slow and expensive, and the graph must be maintained as documents change. In practice GraphRAG is rarely used alone; it is usually combined with vector search, the graph for the connected and global questions, vectors for the everyday “find me the passage about X.”

Inside the Microsoft GraphRAG pipeline

The two-phase recipe above is the idea. The version that made “GraphRAG” a household word in the RAG world is Microsoft’s, and it is worth tracing concretely, because the pipeline tells you exactly where the cost lives and what you get for it.

Indexing runs as a chain of LLM passes over the corpus. First, entity extraction: you chunk the documents as usual, then prompt a language model on each chunk to pull out the entities it mentions (people, organizations, products, places) along with a short description of each. Second, relationship mapping: the same pass, or a following one, asks the model to name the relationships between those entities (“acquired,” “made by,” “reports to”), which become the labelled edges. The result is a single knowledge graph stitched together from every chunk, with duplicate mentions of the same entity merged into one node.

Then comes the step that makes the global questions answerable. You run community detection on the graph, specifically the Leiden algorithm, which partitions the nodes into nested communities: tightly interlinked clusters at several levels of granularity, from broad themes down to small tight-knit groups. For each community, you prompt the model again to write a community summary, a prose digest of what that cluster of entities and relationships is about. Because the communities are hierarchical, you end up with summaries at multiple zoom levels, a pre-computed table of contents for the whole corpus.

That structure buys you two query modes. Local search answers a question about a specific entity by starting at its node, gathering its neighbors, their relationships, and the raw text chunks they came from, then generating from that focused neighborhood: this is the multi-hop, connect-the-dots case. Global search answers a corpus-wide question (“what are the main themes?”) by fanning the question out across the community summaries, generating a partial answer from each, then reducing those partials into one final answer: this is the holistic case that top-k similarity simply cannot reach.
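The fan-out-then-reduce shape of global search is easy to see with a fake model. This sketch is not Microsoft's implementation: `fake_llm` merely echoes the last line of its prompt so the code runs offline, and the community summaries are invented.

```python
# Invented community summaries standing in for the pre-computed ones.
COMMUNITY_SUMMARIES = [
    "Returns and refunds: customers mostly ask about the 30-day window.",
    "Shipping: delays and tracking dominate the tickets.",
    "Warranty: claims cluster around battery wear.",
]

llm_calls = 0

def fake_llm(prompt: str) -> str:
    """Stand-in for a model call: counts the call and echoes the prompt's last line."""
    global llm_calls
    llm_calls += 1
    return prompt.splitlines()[-1]

def global_search(question: str, summaries: list[str]) -> str:
    # Map: one partial answer per community summary.
    partials = [fake_llm(f"Q: {question}\n{summary}") for summary in summaries]
    # Reduce: one more call merges the partials into a final answer.
    return fake_llm(f"Q: {question}\n" + " | ".join(partials))

final = global_search("What are the main themes?", COMMUNITY_SUMMARIES)
print(final)
print("LLM calls:", llm_calls)
```

Even in this toy, a single global question costs one call per summary plus one for the reduce step, which is why global search reads summaries at a chosen level rather than every community at every level.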

Notice where the money goes. Every one of those indexing passes is an LLM call, and you make them over the entire corpus, not just the few chunks a query touches. Entity extraction alone can mean a model call per chunk for thousands or millions of chunks; community summarization adds another call per community per level. Graph construction is expensive in LLM tokens, often orders of magnitude more so than simply embedding the same corpus once, and that cost is paid up front, before a single user question arrives. So the calculus is sharp: GraphRAG pays off when the corpus is private, connected, and narrative (an internal wiki, a body of research, a case file, a set of interlinked reports), where questions genuinely span many documents or ask about the whole, and where you will ask enough such questions to amortize the build. It is overkill for a corpus of independent, self-contained passages whose questions each live in one chunk: there, you are paying graph-construction prices for answers a reranker would have found in one hop.
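A back-of-the-envelope counter makes the indexing bill concrete. The formula simply restates the paragraph above (one extraction call per chunk, one summary call per community per level); the function name and the example numbers are hypothetical, and real pipelines may add further passes.

```python
def indexing_calls(num_chunks: int, communities_per_level: list[int]) -> int:
    """LLM calls for one full build: one per chunk, plus one per community at every level."""
    extraction = num_chunks
    summarization = sum(communities_per_level)
    return extraction + summarization

# Hypothetical corpus: 10,000 chunks, three community levels from broad to fine.
total = indexing_calls(10_000, [50, 200, 800])
print(total)
```

Embedding the same corpus would need no generation calls at all, and this whole count is paid again on every full refresh.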

Multi-Modal RAG

The last shape relaxes a different assumption, one we have made since Part 2: that everything is text.

Multi-Modal RAG is retrieval and generation over more than text: images, tables, charts, audio, video, and the figure-rich PDFs that Part 5 warned us were so easy to mangle. There are two high-level approaches, and real systems often mix them.

The first is multimodal embeddings: models that place text and images (or other modalities) into a single shared vector space, so an image of a product and the phrase “wireless earbuds” land near each other and you can search across modalities directly. The canonical example is CLIP, a model trained on huge numbers of image-and-caption pairs so that a picture and the text describing it embed to nearby points; with CLIP-style embeddings you can retrieve images using a text query, or text using an image, with the same cosine similarity from Part 3.
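Once everything lives in one vector space, cross-modal search is the same cosine ranking as before. In this sketch the three-dimensional vectors are hand-written stand-ins for what a CLIP-style encoder would produce; no real model is called, and the file names are invented.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (modality, name) -> toy embedding in the shared space.
ITEMS = {
    ("image", "earbuds_photo.jpg"): [0.9, 0.1, 0.0],
    ("image", "shipping_box.jpg"): [0.1, 0.9, 0.1],
    ("text", "Refund policy paragraph"): [0.0, 0.2, 0.9],
}

query_vec = [0.8, 0.2, 0.1]   # pretend embedding of the text "wireless earbuds"

ranked = sorted(ITEMS, key=lambda key: cosine(query_vec, ITEMS[key]), reverse=True)
for modality, name in ranked:
    print(modality, name)
```

A text query ranks an image first, and nothing in the ranking code knows or cares which modality each vector came from.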

The second is translate-to-text: rather than embed the image directly, you use a model to describe it (caption an image, transcribe audio, summarize a chart into prose), then embed and retrieve those text descriptions with the ordinary pipeline you already have. At generation time you hand the retrieved figure to a multimodal language model that can actually “see” it, so the answer is grounded in the real image, not just its caption.
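The bookkeeping for translate-to-text is mostly a matter of keeping the description and the original asset together. In this sketch the `MediaRecord` class, the file paths, and the descriptions are hypothetical, and word overlap stands in for the embedding search you would normally run over the descriptions.

```python
from dataclasses import dataclass

@dataclass
class MediaRecord:
    asset_path: str    # original file, handed to a multimodal model at generation time
    description: str   # text from a captioning or transcription step (hand-written here)

RECORDS = [
    MediaRecord("figures/battery_chart.png", "Bar chart of battery life by earbud model"),
    MediaRecord("photos/box_label.jpg", "Photo of the shipping box label"),
]

def _words(text: str) -> set[str]:
    return {w.strip("?.,!").lower() for w in text.split()}

def search(query: str) -> MediaRecord:
    """Return the record whose description shares the most words with the query."""
    return max(RECORDS, key=lambda r: len(_words(query) & _words(r.description)))

hit = search("Which chart shows battery life?")
print(hit.asset_path)
```

Retrieval runs over `description`, but what travels to the generator is `asset_path`, so the model answers from the figure itself.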

Tie this back to Part 5. There, a figure or a scanned table was a problem: flatten it into the text stream and you got nonsense that polluted retrieval. Multi-Modal RAG turns that liability into an asset. Instead of destroying the figure, we can embed it, retrieve it, and reason over it.

How to choose

Five shapes: four that add decision-making to the loop, and one (Multi-Modal RAG) that widens what the loop can retrieve over. Here is the comparison at a glance.

| Architecture | What it adds | When to reach for it | Main cost |
| --- | --- | --- | --- |
| Agentic RAG | An LLM agent that reasons and chooses tools/actions in a loop | Tasks needing multiple sources, routing, or multi-hop reasoning | Latency, cost, unpredictability, debugging difficulty |
| Corrective RAG (CRAG) | An evaluator that grades chunks and triggers a fallback before generating | Open-domain queries where your index may not hold the answer | One extra evaluation step; a fallback source to maintain |
| Self-RAG | The model reflects on need-to-retrieve and on its own groundedness | When grounding and selective retrieval matter and you can use a reflective model | Needs a trained/specialized model; more involved to set up |
| GraphRAG | Retrieval over a knowledge graph of entities and relationships | Multi-hop, connect-the-dots, and holistic/summary questions | Expensive to build and maintain; usually paired with vectors |
| Multi-Modal RAG | Retrieval and generation across images, tables, audio, and more | Corpora where the answer lives in figures, not just text | Multimodal models and embeddings; richer ingestion pipeline |

It helps to keep the inverse of that table in your head too, the one-line “when it is overkill” for each, because the failure mode here is reaching for power you do not need:

  • Agentic RAG is overkill when one retrieval against one index answers the question. A loop that can choose between tools is wasted on a system that only ever has one tool to choose.
  • CRAG is overkill when your index is closed and authoritative (it holds the answer or the answer does not exist), because there is no useful outside source to fall back to: a plain grounded refusal is the right behavior.
  • Self-RAG is overkill when you cannot or will not run a model trained to emit reflection tokens; bolting an external grader on (that is just CRAG) is simpler and gets you most of the safety.
  • GraphRAG is overkill when your questions each live in a single passage; you would pay graph-construction prices for answers flat retrieval already finds.
  • Multi-Modal RAG is overkill when your corpus is genuinely text, or when the figures only restate the prose around them.

Get a concrete feel for the cost too, because “more expensive” is too vague to budget against. The unit that dominates a RAG bill is LLM calls, and the architectures multiply them very differently. Single-pass RAG is one generation call per query (plus a cheap embedding for the query). CRAG adds roughly one small evaluation call per query, and a second generation only on the fallback path, so call it 1 to 2x in the common case. Self-RAG folds its judgments into the generation itself, so the call count stays close to one, but you pay in a specialized model and longer outputs. An agentic loop is the expensive one: each reason-act-observe cycle is a model call, and a multi-hop task can easily run three, five, or more cycles before it answers, so agents commonly cost 3 to 10x the LLM calls of a single-pass query, with wall-clock latency to match. GraphRAG inverts the timing: query-time cost is modest (local search reads a neighborhood, global search a handful of summaries), but the indexing build is a large fixed LLM bill paid once over the whole corpus and again on every refresh.
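Those multipliers are easy to turn into a rough budget. The estimator below is a simplification under stated assumptions: it counts calls rather than tokens, treats every CRAG query as one evaluation plus one generation, and takes the agent's average cycle count as an input you would have to measure; the function and parameter names are hypothetical.

```python
def llm_calls(architecture: str, queries: int, agent_cycles: int = 4) -> dict[str, int]:
    """Rough per-architecture call counts for a batch of queries."""
    if architecture == "single_pass":
        return {"generation": queries, "evaluation": 0}
    if architecture == "crag":
        # One small grading call plus one generation per query.
        return {"generation": queries, "evaluation": queries}
    if architecture == "agentic":
        # Every reason-act-observe cycle is a full model call.
        return {"generation": queries * agent_cycles, "evaluation": 0}
    raise ValueError(f"unknown architecture: {architecture}")

for arch in ("single_pass", "crag", "agentic"):
    print(arch, llm_calls(arch, queries=1000))
```

At 1,000 queries and an assumed four cycles per task, the agent makes 4,000 full model calls where single-pass makes 1,000, and CRAG adds 1,000 small grading calls on top of its 1,000 generations.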

Now the reality check, and I mean it as much as anything in this series. Every one of these adds latency, money, new failure modes, and debugging pain. The honest default, for the overwhelming majority of production systems, is well-tuned boring RAG: the hybrid search, reranking, and advanced retrieval patterns of Parts 6 through 9, done carefully. Add at most one of these architectures, and add it only where the problem genuinely demands it, proven by failure analysis and measurement, never because agents are exciting. The fact that you can build a five-tool agent that traverses a knowledge graph over multimodal embeddings does not mean your refund-policy bot should.

The discipline this requires (default to simple, add complexity only on evidence) depends entirely on having evidence. Which is exactly where the next two parts go: measuring whether your system is any good (Part 11) and running it in production (Part 12).

Try it yourself

The cleanest way to feel the difference between a grader-gated branch and an open-ended loop is to count the calls. agentic_loop.py is the smallest honest sketch: a mocked policy-only local index, a web “fallback,” a retrieval evaluator that grades the best chunk, and two control flows over them, CRAG (retrieve, grade, fall back only when the grade is bad) and an agentic ReAct loop (reason-act-observe, with a hard step cap). No real LLM, so it runs in a second on numpy alone; if you have sentence-transformers the grader actually routes by meaning.

Run it and watch the llm_calls counter. The in-corpus refund question is answered locally; the out-of-corpus battery question forces the corrective branch to the web. Then notice the gap: CRAG spends one generation call, while the agent spends two or three for the same answers, a miniature of the 3 to 10x multiplier estimated in the cost paragraph above. Three experiments worth doing:

  • Raise the grader’s irrelevant_below threshold and re-run. The web fallback starts triggering on good local answers (over-triggering): more calls, more latency, and every query leaking to an outside source.
  • Set max_steps=1 on the agent, then very high. Watch how the cap is the only thing standing between a stuck loop and a runaway bill.
  • Add a question your local index can answer well and confirm the agent never bothers with the web tool: routing falling out of the grade, for free.

⚠️ Common pitfalls

  • Agents that loop without converging. An agent that keeps grading its own retrieval as “not quite good enough” will reason-act-observe forever, one model call per turn, until your bill (or your patience) runs out. Always set a hard step cap and a token/cost budget, and define what the agent does when it hits them (answer from what it has, or refuse honestly), the same way the sketch stops at max_steps. An uncapped agent is not a feature, it is an incident waiting to happen.
  • CRAG over-triggering its web fallback. Set the evaluator’s “irrelevant” bar too aggressively and it will reject perfectly good local chunks and fall back to web search on queries your own index answered fine. That is slower, costlier, and quietly exfiltrates every query to an outside service. Tune the grading thresholds on real traffic, and log how often the fallback fires: if it is firing on most queries, the grader is miscalibrated, not the index.
  • GraphRAG going stale. The graph is a snapshot of your corpus at build time. As documents change, the extracted entities, relationships, and community summaries silently drift out of date, and nothing in the query path tells you. Treat the graph like any other index that needs refreshing: budget for re-extraction (it is the expensive LLM pass, so you cannot do it on every edit), track what has changed since the last build, and remember that a confidently wrong answer from a stale graph looks exactly like a correct one.
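The advice to log how often the fallback fires can be as simple as a counter. This is a minimal sketch: the `FallbackMonitor` class is hypothetical, and the 0.5 alert level is just one reading of "firing on most queries", not a recommended value.

```python
class FallbackMonitor:
    """Track how often CRAG's fallback fires and flag a suspicious rate."""

    def __init__(self, alert_above: float = 0.5):
        self.total = 0
        self.fallbacks = 0
        self.alert_above = alert_above   # assumed threshold; tune on real traffic

    def record(self, used_fallback: bool) -> None:
        self.total += 1
        self.fallbacks += int(used_fallback)

    @property
    def rate(self) -> float:
        return self.fallbacks / self.total if self.total else 0.0

    def looks_miscalibrated(self) -> bool:
        return self.rate > self.alert_above


monitor = FallbackMonitor()
for used_fallback in [True, True, False, True]:   # made-up traffic
    monitor.record(used_fallback)

print(f"fallback rate: {monitor.rate:.2f}")
print("check the grader:", monitor.looks_miscalibrated())
```

The same counter pattern extends to agent step counts and token spend, which gives the hard caps described above something to be tuned against.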

Key takeaways

  • The leap of this part is from a fixed pipeline (retrieve, then generate, the same way every time) to a decision-making loop with control flow: decide whether to retrieve, judge what came back, retry, reason in steps, traverse structure. Four of the architectures here (Agentic, CRAG, Self-RAG, GraphRAG) are different ways of adding that decision-making; Multi-Modal RAG is orthogonal, relaxing the assumption that everything is text.
  • Agentic RAG makes retrieval one tool an agent can choose in a reason-act-observe (ReAct) loop, enabling query routing and multi-hop retrieval, at the cost of latency, money, and unpredictability.
  • Corrective RAG (CRAG) adds a retrieval evaluator that grades chunks and, when they are bad, triggers a fallback such as web search before generating. Cheap, predictable, and great for open-domain gaps.
  • Self-RAG moves the judgment inside the model via reflection tokens that decide whether to retrieve and whether each claim is grounded; CRAG gates from the outside, Self-RAG reflects from the inside.
  • GraphRAG retrieves over a knowledge graph of entities and relationships (with optional community summarization) to win on multi-hop and global questions isolated chunks cannot answer, while Multi-Modal RAG retrieves over images, tables, and audio via multimodal embeddings (a shared CLIP-style space) or by translating media to text. Both are powerful, specialized, and pricier to build.
  • Default to well-tuned simple RAG and add one advanced architecture only when measurement, not excitement, proves the problem needs it. This field moves fast, so treat every named method here as a snapshot and verify the current state.

References

Each named architecture in this part traces to a specific paper. If you want the primary sources rather than my summaries:

  • Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629. https://arxiv.org/abs/2210.03629
  • Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv:2310.11511. https://arxiv.org/abs/2310.11511
  • Yan, S.-Q., Gu, J.-C., Zhu, Y., & Ling, Z.-H. (2024). Corrective Retrieval Augmented Generation. arXiv:2401.15884. https://arxiv.org/abs/2401.15884
  • Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130. https://arxiv.org/abs/2404.16130
  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020. https://arxiv.org/abs/2103.00020

Glossary

  • Agent: a language model that reasons about a task, chooses an action or tool, observes the result, and decides the next step in a loop, rather than following a fixed script.
  • Agentic RAG: a RAG system in which retrieval is one tool an agent can choose to call, so the model decides whether, what, and how many times to retrieve.
  • ReAct loop: the reason, act, observe, repeat cycle an agent runs, deciding its steps at run time instead of in a predefined order.
  • Tool use: giving a model a set of named, described capabilities (retrieval, web search, a database, a calculator) that it can invoke as needed.
  • Query routing: deciding which source or index a query should be sent to, so the system does not search a store that cannot hold the answer.
  • Multi-hop retrieval: chaining retrievals where the result of one feeds the next, used when a question cannot be answered in a single lookup.
  • Corrective RAG (CRAG): a pipeline that grades retrieved chunks with an evaluator and, when they are bad, takes a corrective action such as a web-search fallback before generating.
  • Retrieval evaluator: the lightweight model or classifier in CRAG that grades retrieved chunks as relevant, ambiguous, or irrelevant.
  • Self-RAG: a self-reflective RAG approach where the model emits reflection tokens to decide on the fly whether to retrieve and whether its own statements are grounded.
  • Reflection tokens: special control signals the model emits in Self-RAG to gate retrieval and to judge passage relevance and the groundedness of its own claims.
  • Knowledge graph: a representation of information as entities (nodes) joined by labelled relationships (edges), making connections between facts explicit.
  • GraphRAG: a RAG approach that builds a knowledge graph from the corpus and, at query time, traverses it or uses community summaries instead of or alongside vector similarity.
  • Community summarization: clustering a knowledge graph into tightly connected communities and summarizing each, giving pre-digested answers to holistic, corpus-wide questions.
  • Multi-modal RAG: retrieval and generation over more than text, including images, tables, charts, audio, and video.
  • Multimodal embeddings: embeddings that place different modalities (text, images) into one shared vector space so you can search across them.
  • CLIP: a model trained on image-and-caption pairs so that an image and the text describing it embed to nearby points, enabling text-to-image and image-to-text search.

Next up, Part 11: Evaluating RAG. We add control flow and architectures freely, but how do we know any of it helped? Next we measure: faithfulness, answer relevance, context precision and recall, and the frameworks that quantify them. Naive or agentic, you cannot improve, or justify, a system you cannot measure.
