Introduction
Retrieval-Augmented Generation (RAG) is the pattern that connects external knowledge to LLM reasoning. Instead of relying solely on what the model learned during training, RAG retrieves relevant information at query time and includes it in the prompt. For agents, RAG enables memory systems that can scale far beyond context window limits.
RAG in One Sentence: Find relevant information, add it to the prompt, generate a grounded response.
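Stripped of the infrastructure, that sentence fits in a few lines of code. The sketch below uses a toy word-overlap scorer in place of real embedding search and builds only the grounded prompt (the model call itself is omitted); every name here is illustrative:

```python
def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Add the retrieved text to the prompt so the answer is grounded."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = [
    "The vector store holds document embeddings.",
    "Agents use RAG to extend memory beyond the context window.",
    "Chunking splits documents before embedding.",
]
question = "How do agents extend memory?"
prompt = build_prompt(question, retrieve(question, docs))
# The chunk about agent memory ranks first and lands in the prompt
```

Real systems replace the overlap scorer with embedding similarity, which the rest of this section builds up.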
RAG Architecture
The basic RAG pipeline has three stages:
rag_architecture.txt

RAG PIPELINE

┌──────────────────────────────────────────────────────────────┐
│ INDEXING (Offline)                                           │
│                                                              │
│ Documents → Chunking → Embedding → Vector Store              │
│                                                              │
│ ┌────────┐   ┌────────┐   ┌────────┐   ┌────────────┐        │
│ │ Source │ → │ Split  │ → │ Embed  │ → │ Store in   │        │
│ │ Docs   │   │ Chunks │   │ Chunks │   │ Vector DB  │        │
│ └────────┘   └────────┘   └────────┘   └────────────┘        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ RETRIEVAL (Online)                                           │
│                                                              │
│ Query → Embed → Search → Rank → Top-K Chunks                 │
│                                                              │
│ ┌────────┐   ┌────────┐   ┌────────┐   ┌────────────┐        │
│ │ User   │ → │ Embed  │ → │ Vector │ → │ Return Top │        │
│ │ Query  │   │ Query  │   │ Search │   │ K Results  │        │
│ └────────┘   └────────┘   └────────┘   └────────────┘        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ GENERATION (Online)                                          │
│                                                              │
│ Retrieved Chunks + Query → LLM → Response                    │
│                                                              │
│ ┌──────────────────────┐                                     │
│ │ System: You are...   │                                     │
│ │                      │                                     │
│ │ Context:             │    ┌─────┐    ┌──────────┐          │
│ │ [Retrieved Chunk 1]  │ →  │ LLM │ →  │ Response │          │
│ │ [Retrieved Chunk 2]  │    └─────┘    └──────────┘          │
│ │ ...                  │                                     │
│ │                      │                                     │
│ │ Question: {query}    │                                     │
│ └──────────────────────┘                                     │
└──────────────────────────────────────────────────────────────┘

Basic RAG Implementation
basic_rag.py

import anthropic
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    content: str
    source: str
    score: float

class BasicRAG:
    """Simple RAG implementation."""

    def __init__(
        self,
        vector_store,
        embedding_model,
        llm_model: str = "claude-sonnet-4-20250514"
    ):
        self.vectors = vector_store
        self.embedder = embedding_model
        # Async client, since query() and its helpers are coroutines
        self.client = anthropic.AsyncAnthropic()
        self.llm_model = llm_model

    async def query(
        self,
        question: str,
        top_k: int = 5,
        max_context_tokens: int = 4000
    ) -> str:
        # Step 1: Retrieve relevant chunks
        chunks = await self._retrieve(question, top_k)

        # Step 2: Build context from chunks
        context = self._build_context(chunks, max_context_tokens)

        # Step 3: Generate response with context
        return await self._generate(question, context)

    async def _retrieve(
        self,
        query: str,
        top_k: int
    ) -> list[RetrievedChunk]:
        # Embed the query
        query_embedding = await self.embedder.embed(query)

        # Search vector store
        results = await self.vectors.search(
            vector=query_embedding,
            limit=top_k
        )

        return [
            RetrievedChunk(
                content=r["content"],
                source=r["metadata"].get("source", "unknown"),
                score=r["score"]
            )
            for r in results
        ]

    def _build_context(
        self,
        chunks: list[RetrievedChunk],
        max_tokens: int
    ) -> str:
        context_parts = []
        current_tokens = 0

        for chunk in chunks:
            # Rough token estimate: ~4 characters per token
            chunk_tokens = len(chunk.content) // 4

            if current_tokens + chunk_tokens > max_tokens:
                break

            context_parts.append(
                f"[Source: {chunk.source}]\n{chunk.content}"
            )
            current_tokens += chunk_tokens

        return "\n\n---\n\n".join(context_parts)

    async def _generate(
        self,
        question: str,
        context: str
    ) -> str:
        prompt = f"""Use the following context to answer the question.
If the context doesn't contain relevant information, say so.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

        response = await self.client.messages.create(
            model=self.llm_model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return response.content[0].text

Chunking Strategies
How you split documents into chunks significantly impacts retrieval quality:
Fixed-Size Chunking
fixed_chunking.py

def fixed_size_chunks(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> list[str]:
    """Split text into fixed-size overlapping chunks."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Overlap with previous chunk

    return chunks

# Example
text = "Long document text..."
chunks = fixed_size_chunks(text, chunk_size=500, overlap=50)
# ['First 500 chars...', 'Chars 450-950...', ...]

Semantic Chunking
semantic_chunking.py

import re

def semantic_chunks(
    text: str,
    max_chunk_size: int = 1000
) -> list[str]:
    """Split by semantic boundaries (paragraphs, sections)."""

    # Split by double newlines (paragraphs)
    paragraphs = re.split(r'\n\n+', text)

    chunks = []
    current_chunk = []
    current_size = 0

    for para in paragraphs:
        para_size = len(para)

        if current_size + para_size > max_chunk_size and current_chunk:
            # Save current chunk and start a new one
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_size = para_size
        else:
            current_chunk.append(para)
            current_size += para_size

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks


def markdown_chunks(text: str) -> list[dict]:
    """Split markdown by headers, preserving structure."""
    sections = re.split(r'(^#{1,3} .+$)', text, flags=re.MULTILINE)

    chunks = []
    current_header = ""

    for section in sections:
        if re.match(r'^#{1,3} ', section):
            current_header = section.strip()
        elif section.strip():
            chunks.append({
                "header": current_header,
                "content": section.strip(),
                "full": f"{current_header}\n\n{section.strip()}"
            })

    return chunks

Recursive Chunking
recursive_chunking.py

class RecursiveChunker:
    """Recursively split text using multiple separators."""

    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 100
    ):
        self.chunk_size = chunk_size
        self.overlap = chunk_overlap
        self.separators = [
            "\n\n",  # Paragraphs
            "\n",    # Lines
            ". ",    # Sentences
            ", ",    # Clauses
            " ",     # Words
            ""       # Characters
        ]

    def split(self, text: str) -> list[str]:
        return self._split_recursive(text, self.separators)

    def _split_recursive(
        self,
        text: str,
        separators: list[str]
    ) -> list[str]:
        if not separators:
            # Base case: split by size
            return self._split_by_size(text)

        separator = separators[0]
        remaining_separators = separators[1:]

        if separator:
            splits = text.split(separator)
        else:
            splits = list(text)

        chunks = []
        current = []
        current_size = 0

        for split in splits:
            split_size = len(split) + len(separator)

            if current_size + split_size > self.chunk_size:
                if current:
                    chunk_text = separator.join(current)

                    # If the chunk is still too large, recurse with finer separators
                    if len(chunk_text) > self.chunk_size:
                        chunks.extend(
                            self._split_recursive(chunk_text, remaining_separators)
                        )
                    else:
                        chunks.append(chunk_text)

                current = [split]
                current_size = split_size
            else:
                current.append(split)
                current_size += split_size

        if current:
            chunks.append(separator.join(current))

        return chunks

    def _split_by_size(self, text: str) -> list[str]:
        return [
            text[i:i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size - self.overlap)
        ]

| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed-size | Simple docs | Consistent size | May break mid-sentence |
| Semantic | Structured docs | Preserves meaning | Variable sizes |
| Recursive | Mixed content | Flexible, adaptive | More complex |
| By headers | Markdown/docs | Keeps structure | Depends on formatting |
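The trade-offs in the table can be seen directly by running a minimal fixed-size splitter and a paragraph splitter over the same text. These are deliberately simplified stand-ins for the fuller implementations above, not replacements for them:

```python
import re

def fixed_chunks(text: str, size: int = 40) -> list[str]:
    # Fixed-size: every chunk the same length, but cuts can land mid-sentence
    return [text[i:i + size] for i in range(0, len(text), size)]

def para_chunks(text: str) -> list[str]:
    # Semantic: split on paragraph boundaries, so sizes vary but meaning survives
    return [p for p in re.split(r"\n\n+", text) if p.strip()]

text = (
    "RAG retrieves relevant chunks at query time.\n\n"
    "Chunking strategy affects retrieval quality."
)

fixed = fixed_chunks(text)       # 3 chunks; the first cuts off mid-word
semantic = para_chunks(text)     # 2 chunks, one per paragraph
```

Fixed-size chunking gives uniform inputs to the embedder at the cost of broken sentences; semantic chunking keeps each statement intact at the cost of variable chunk sizes.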
Retrieval Techniques
Beyond basic vector search, several techniques improve retrieval quality:
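The baseline all of these techniques build on is plain nearest-neighbor search over embeddings. A minimal brute-force version makes that baseline concrete; production vector stores use approximate indexes (e.g. HNSW) instead of scanning everything, but the cosine scoring is the same. The two-dimensional vectors and chunk IDs here are purely illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    top_k: int = 2
) -> list[tuple[str, float]]:
    """Brute-force nearest neighbors: score every entry, keep the top k."""
    scored = [(chunk_id, cosine(query_vec, emb)) for chunk_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Tiny 2-dimensional "embeddings" for illustration
index = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
results = vector_search([1.0, 0.1], index)
# "a" ranks first: its vector points in nearly the same direction as the query
```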
Query Expansion
query_expansion.py

import json

class QueryExpander:
    """Expand queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    async def expand(self, query: str) -> list[str]:
        """Generate multiple query variations."""

        prompt = f"""Generate 3 alternative phrasings of this query.
Return as a JSON array of strings.

Query: {query}

Consider:
- Synonyms
- More specific versions
- More general versions
- Different perspectives
"""

        response = await self.llm.generate(prompt)
        variations = json.loads(response)

        return [query] + variations

    async def retrieve_with_expansion(
        self,
        query: str,
        vector_store,
        embedder,
        top_k: int = 5
    ) -> list[dict]:
        # Get query variations
        queries = await self.expand(query)

        # Search with each variation
        all_results = []
        for q in queries:
            embedding = await embedder.embed(q)
            results = await vector_store.search(embedding, limit=top_k)
            all_results.extend(results)

        # Deduplicate by ID, keeping the first occurrence
        seen_ids = set()
        unique_results = []
        for r in all_results:
            if r["id"] not in seen_ids:
                seen_ids.add(r["id"])
                unique_results.append(r)

        # Sort by score
        unique_results.sort(key=lambda x: x["score"], reverse=True)

        return unique_results[:top_k]

Reranking
reranking.py

class Reranker:
    """Rerank initial results for better precision."""

    def __init__(self, cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(cross_encoder_model)

    def rerank(
        self,
        query: str,
        documents: list[str],
        top_k: int = 5
    ) -> list[tuple[int, float]]:
        """Rerank documents, return indices and scores."""

        # Create query-document pairs
        pairs = [(query, doc) for doc in documents]

        # Score with cross-encoder
        scores = self.model.predict(pairs)

        # Sort by score, descending
        ranked = sorted(
            enumerate(scores),
            key=lambda x: x[1],
            reverse=True
        )

        return ranked[:top_k]


class TwoStageRetriever:
    """First stage: fast vector search. Second stage: accurate reranking."""

    def __init__(self, vector_store, embedder, reranker):
        self.vectors = vector_store
        self.embedder = embedder
        self.reranker = reranker

    async def retrieve(
        self,
        query: str,
        initial_k: int = 20,
        final_k: int = 5
    ) -> list[dict]:
        # Stage 1: Fast vector retrieval
        query_embedding = await self.embedder.embed(query)
        candidates = await self.vectors.search(
            vector=query_embedding,
            limit=initial_k
        )

        # Stage 2: Accurate reranking
        documents = [c["content"] for c in candidates]
        reranked = self.reranker.rerank(query, documents, top_k=final_k)

        # Return reranked results
        return [candidates[idx] for idx, _score in reranked]

Contextual Retrieval
contextual_retrieval.py

class ContextualRetriever:
    """Add context to chunks before embedding (Anthropic's approach)."""

    def __init__(self, llm):
        self.llm = llm

    async def add_context(
        self,
        chunk: str,
        document: str
    ) -> str:
        """Add document context to chunk for better embedding."""

        prompt = f"""<document>
{document[:2000]}...
</document>

Here is a chunk from the document:
<chunk>
{chunk}
</chunk>

Provide a short (1-2 sentence) context that situates this chunk
within the overall document. Focus on key entities and topics
that would help retrieve this chunk for relevant queries.
"""

        context = await self.llm.generate(prompt)

        # Prepend context to chunk
        return f"{context}\n\n{chunk}"

    async def index_with_context(
        self,
        document: str,
        chunker,
        embedder,
        vector_store
    ):
        """Index document with contextual embeddings."""

        chunks = chunker.split(document)

        for chunk in chunks:
            # Add context
            contextualized = await self.add_context(chunk, document)

            # Embed the contextualized chunk
            embedding = await embedder.embed(contextualized)

            # Store with original chunk as content
            await vector_store.add(
                content=chunk,            # Original chunk for retrieval
                embedding=embedding,      # Contextual embedding
                metadata={"contextualized": contextualized}
            )

Context Integration
How you integrate retrieved context into the prompt matters:
Simple Concatenation
simple_context.py

def build_prompt_simple(query: str, chunks: list[str]) -> str:
    """Basic context concatenation."""
    context = "\n\n".join(chunks)

    return f"""Answer the question based on the following context.

Context:
{context}

Question: {query}

Answer:"""

# Works but can be improved with better structure

Structured Context
structured_context.py

def build_prompt_structured(
    query: str,
    chunks: list[dict]  # Each contains 'content', 'source', 'score'
) -> str:
    """Structured context with sources and relevance."""

    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(f"""<source id="{i}" relevance="{chunk['score']:.2f}">
{chunk['content']}
</source>""")

    context = "\n\n".join(context_parts)

    return f"""You are answering questions based on provided sources.

INSTRUCTIONS:
1. Answer based ONLY on the provided sources
2. Cite sources using [Source N] notation
3. If sources don't contain the answer, say so

SOURCES:
{context}

QUESTION: {query}

ANSWER (cite your sources):"""

# Better: structured, citations, clear instructions

Dynamic Context Selection
dynamic_context.py

class DynamicContextBuilder:
    """Dynamically select and format context based on query type."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    async def build(
        self,
        query: str,
        chunks: list[dict],
        query_type: str | None = None
    ) -> str:
        # Determine query type if not provided
        if not query_type:
            query_type = self._classify_query(query)

        # Select relevant chunks based on query type
        selected = self._select_chunks(chunks, query_type)

        # Format based on query type
        if query_type == "factual":
            return self._format_factual(query, selected)
        elif query_type == "comparison":
            return self._format_comparison(query, selected)
        elif query_type == "how_to":
            return self._format_howto(query, selected)
        else:
            return self._format_general(query, selected)

    def _classify_query(self, query: str) -> str:
        query_lower = query.lower()
        if any(w in query_lower for w in ["compare", "difference", "vs"]):
            return "comparison"
        if any(w in query_lower for w in ["how to", "steps", "process"]):
            return "how_to"
        if any(w in query_lower for w in ["what is", "define", "explain"]):
            return "factual"
        return "general"

    def _format_comparison(self, query: str, chunks: list[dict]) -> str:
        return f"""Compare based on these sources:

{self._format_chunks(chunks)}

Comparison Question: {query}

Provide a balanced comparison citing specific sources."""

    # _select_chunks, _format_chunks, and the remaining _format_* helpers
    # follow the same pattern and are omitted for brevity.

Advanced RAG Patterns
Sophisticated patterns for production RAG systems:
Agentic RAG
agentic_rag.py

import json

class AgenticRAG:
    """RAG with multi-step retrieval and reasoning."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    async def query(self, question: str) -> str:
        # Step 1: Decompose complex questions
        sub_questions = await self._decompose(question)

        # Step 2: Retrieve for each sub-question
        all_context = []
        for sub_q in sub_questions:
            chunks = await self.retriever.retrieve(sub_q)
            all_context.append({
                "question": sub_q,
                "chunks": chunks
            })

        # Step 3: Synthesize answer
        return await self._synthesize(question, all_context)

    async def _decompose(self, question: str) -> list[str]:
        prompt = f"""Break this question into simpler sub-questions
that can be answered independently:

Question: {question}

Return as JSON array. If the question is already simple,
return just the original question."""

        response = await self.llm.generate(prompt)
        return json.loads(response)

    async def _synthesize(
        self,
        question: str,
        sub_answers: list[dict]
    ) -> str:
        context_parts = []
        for sa in sub_answers:
            chunks_text = "\n".join(c["content"] for c in sa["chunks"])
            context_parts.append(
                f"Sub-question: {sa['question']}\n"
                f"Evidence:\n{chunks_text}"
            )

        evidence = "\n\n".join(context_parts)
        prompt = f"""Answer the main question by synthesizing
information from the sub-question analyses.

{evidence}

Main Question: {question}

Synthesized Answer:"""

        return await self.llm.generate(prompt)

Self-Correcting RAG
self_correcting_rag.py

import json

class SelfCorrectingRAG:
    """RAG that verifies and corrects its answers."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    async def query(self, question: str, max_iterations: int = 3) -> str:
        chunks = await self.retriever.retrieve(question)
        context = self._format_context(chunks)

        for _ in range(max_iterations):
            # Generate answer (builds a grounded prompt and calls the LLM,
            # as in BasicRAG._generate)
            answer = await self._generate_answer(question, context)

            # Verify answer against sources
            verification = await self._verify(question, answer, chunks)

            if verification["is_supported"]:
                return answer

            # If not supported, try to retrieve better context
            if verification["missing_info"]:
                additional = await self.retriever.retrieve(
                    verification["missing_info"]
                )
                chunks.extend(additional)
                context = self._format_context(chunks)

        return f"[Low confidence] {answer}"

    def _format_context(self, chunks: list[dict]) -> str:
        return "\n\n".join(c["content"] for c in chunks)

    async def _verify(
        self,
        question: str,
        answer: str,
        chunks: list[dict]
    ) -> dict:
        prompt = f"""Verify if this answer is fully supported by the sources.

Sources:
{self._format_context(chunks)}

Question: {question}
Answer: {answer}

Return JSON:
{{
    "is_supported": true/false,
    "unsupported_claims": ["claim1", "claim2"],
    "missing_info": "what additional info would help"
}}"""

        response = await self.llm.generate(prompt)
        return json.loads(response)

Anthropic's Contextual Retrieval
Anthropic's research shows that adding context to chunks before embedding reduces failed retrievals by 35%. Combined with a BM25 hybrid search over the same contextualized chunks, failures drop by 49%, and adding a reranking stage brings the reduction to 67%.
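The hybrid half of that result can be sketched with reciprocal rank fusion (RRF), a common way to merge a BM25 ranking with a vector ranking without having to calibrate their incompatible score scales. The input rankings and the conventional `k = 60` default below are illustrative:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); high placement in any list adds up
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["c3", "c1", "c7"]    # lexical (BM25) order
vector_ranking = ["c1", "c3", "c9"]  # embedding-similarity order
merged = rrf_merge([bm25_ranking, vector_ranking])
# c1 and c3 surface first: both retrievers rank them near the top
```

Because RRF only looks at ranks, it works even when one retriever emits BM25 scores in the tens and the other emits cosine similarities below 1.0.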
Summary
Key concepts for RAG in agent memory:
- Three stages: Index (offline), Retrieve (online), Generate (online)
- Chunking matters: Semantic chunking often beats fixed-size
- Enhance retrieval: Query expansion, reranking, contextual embeddings
- Structure context: Format context clearly with sources and citations
- Advanced patterns: Agentic RAG for complex queries, self-correction for reliability
Next: We'll explore conversation memory management and the unique challenges of multi-turn dialogue.