Introduction
Retrieval-Augmented Generation (RAG) is the pattern that connects external knowledge to LLM reasoning. Instead of relying solely on what the model learned during training, RAG retrieves relevant information at query time and includes it in the prompt. For agents, RAG enables memory systems that can scale far beyond context window limits.
RAG in One Sentence: Find relevant information, add it to the prompt, generate a grounded response.
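Stripped of the infrastructure, that sentence fits in a few lines of code. The sketch below uses a toy word-overlap scorer in place of real embedding search and builds only the grounded prompt (the model call itself is omitted); every name here is illustrative:

```python
def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Add the retrieved text to the prompt so the answer is grounded."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = [
    "The vector store holds document embeddings.",
    "Agents use RAG to extend memory beyond the context window.",
    "Chunking splits documents before embedding.",
]
question = "How do agents extend memory?"
prompt = build_prompt(question, retrieve(question, docs))
# The chunk about agent memory ranks first and lands in the prompt
```

Real systems replace the overlap scorer with embedding similarity, which the rest of this section builds up.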
RAG Architecture
The basic RAG pipeline has three stages:
rag_architecture.txt

RAG PIPELINE

┌──────────────────────────────────────────────────────────────┐
│ INDEXING (Offline)                                           │
│                                                              │
│ Documents → Chunking → Embedding → Vector Store              │
│                                                              │
│ ┌────────┐   ┌────────┐   ┌────────┐   ┌────────────┐        │
│ │ Source │ → │ Split  │ → │ Embed  │ → │ Store in   │        │
│ │ Docs   │   │ Chunks │   │ Chunks │   │ Vector DB  │        │
│ └────────┘   └────────┘   └────────┘   └────────────┘        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ RETRIEVAL (Online)                                           │
│                                                              │
│ Query → Embed → Search → Rank → Top-K Chunks                 │
│                                                              │
│ ┌────────┐   ┌────────┐   ┌────────┐   ┌────────────┐        │
│ │ User   │ → │ Embed  │ → │ Vector │ → │ Return Top │        │
│ │ Query  │   │ Query  │   │ Search │   │ K Results  │        │
│ └────────┘   └────────┘   └────────┘   └────────────┘        │
└──────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────┐
│ GENERATION (Online)                                          │
│                                                              │
│ Retrieved Chunks + Query → LLM → Response                    │
│                                                              │
│ ┌──────────────────────┐                                     │
│ │ System: You are...   │                                     │
│ │                      │                                     │
│ │ Context:             │    ┌─────┐    ┌──────────┐          │
│ │ [Retrieved Chunk 1]  │ →  │ LLM │ →  │ Response │          │
│ │ [Retrieved Chunk 2]  │    └─────┘    └──────────┘          │
│ │ ...                  │                                     │
│ │                      │                                     │
│ │ Question: {query}    │                                     │
│ └──────────────────────┘                                     │
└──────────────────────────────────────────────────────────────┘

Basic RAG Implementation
basic_rag.py

import anthropic
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    content: str
    source: str
    score: float

class BasicRAG:
    """Simple RAG implementation."""

    def __init__(
        self,
        vector_store,
        embedding_model,
        llm_model: str = "claude-sonnet-4-20250514"
    ):
        self.vectors = vector_store
        self.embedder = embedding_model
        # Async client, since query() and its helpers are coroutines
        self.client = anthropic.AsyncAnthropic()
        self.llm_model = llm_model

    async def query(
        self,
        question: str,
        top_k: int = 5,
        max_context_tokens: int = 4000
    ) -> str:
        # Step 1: Retrieve relevant chunks
        chunks = await self._retrieve(question, top_k)

        # Step 2: Build context from chunks
        context = self._build_context(chunks, max_context_tokens)

        # Step 3: Generate response with context
        return await self._generate(question, context)

    async def _retrieve(
        self,
        query: str,
        top_k: int
    ) -> list[RetrievedChunk]:
        # Embed the query
        query_embedding = await self.embedder.embed(query)

        # Search vector store
        results = await self.vectors.search(
            vector=query_embedding,
            limit=top_k
        )

        return [
            RetrievedChunk(
                content=r["content"],
                source=r["metadata"].get("source", "unknown"),
                score=r["score"]
            )
            for r in results
        ]

    def _build_context(
        self,
        chunks: list[RetrievedChunk],
        max_tokens: int
    ) -> str:
        context_parts = []
        current_tokens = 0

        for chunk in chunks:
            # Rough token estimate: ~4 characters per token
            chunk_tokens = len(chunk.content) // 4

            if current_tokens + chunk_tokens > max_tokens:
                break

            context_parts.append(
                f"[Source: {chunk.source}]\n{chunk.content}"
            )
            current_tokens += chunk_tokens

        return "\n\n---\n\n".join(context_parts)

    async def _generate(
        self,
        question: str,
        context: str
    ) -> str:
        prompt = f"""Use the following context to answer the question.
If the context doesn't contain relevant information, say so.

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

        response = await self.client.messages.create(
            model=self.llm_model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

        return response.content[0].text

Chunking Strategies
How you split documents into chunks significantly impacts retrieval quality:
Fixed-Size Chunking
fixed_chunking.py

def fixed_size_chunks(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50
) -> list[str]:
    """Split text into fixed-size overlapping chunks."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # Overlap with previous chunk

    return chunks

# Example
text = "Long document text..."
chunks = fixed_size_chunks(text, chunk_size=500, overlap=50)
# ['First 500 chars...', 'Chars 450-950...', ...]

Semantic Chunking
semantic_chunking.py

import re

def semantic_chunks(
    text: str,
    max_chunk_size: int = 1000
) -> list[str]:
    """Split by semantic boundaries (paragraphs, sections)."""

    # Split by double newlines (paragraphs)
    paragraphs = re.split(r'\n\n+', text)

    chunks = []
    current_chunk = []
    current_size = 0

    for para in paragraphs:
        para_size = len(para)

        if current_size + para_size > max_chunk_size and current_chunk:
            # Save current chunk and start a new one
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_size = para_size
        else:
            current_chunk.append(para)
            current_size += para_size

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks


def markdown_chunks(text: str) -> list[dict]:
    """Split markdown by headers, preserving structure."""
    sections = re.split(r'(^#{1,3} .+$)', text, flags=re.MULTILINE)

    chunks = []
    current_header = ""

    for section in sections:
        if re.match(r'^#{1,3} ', section):
            current_header = section.strip()
        elif section.strip():
            chunks.append({
                "header": current_header,
                "content": section.strip(),
                "full": f"{current_header}\n\n{section.strip()}"
            })

    return chunks

Recursive Chunking
recursive_chunking.py

class RecursiveChunker:
    """Recursively split text using multiple separators."""

    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 100
    ):
        self.chunk_size = chunk_size
        self.overlap = chunk_overlap
        self.separators = [
            "\n\n",  # Paragraphs
            "\n",    # Lines
            ". ",    # Sentences
            ", ",    # Clauses
            " ",     # Words
            ""       # Characters
        ]

    def split(self, text: str) -> list[str]:
        return self._split_recursive(text, self.separators)

    def _split_recursive(
        self,
        text: str,
        separators: list[str]
    ) -> list[str]:
        if not separators:
            # Base case: split by size
            return self._split_by_size(text)

        separator = separators[0]
        remaining_separators = separators[1:]

        if separator:
            splits = text.split(separator)
        else:
            splits = list(text)

        chunks = []
        current = []
        current_size = 0

        for split in splits:
            split_size = len(split) + len(separator)

            if current_size + split_size > self.chunk_size:
                if current:
                    chunk_text = separator.join(current)

                    # If the chunk is still too large, recurse with finer separators
                    if len(chunk_text) > self.chunk_size:
                        chunks.extend(
                            self._split_recursive(chunk_text, remaining_separators)
                        )
                    else:
                        chunks.append(chunk_text)

                current = [split]
                current_size = split_size
            else:
                current.append(split)
                current_size += split_size

        if current:
            chunks.append(separator.join(current))

        return chunks

    def _split_by_size(self, text: str) -> list[str]:
        return [
            text[i:i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size - self.overlap)
        ]

| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed-size | Simple docs | Consistent size | May break mid-sentence |
| Semantic | Structured docs | Preserves meaning | Variable sizes |
| Recursive | Mixed content | Flexible, adaptive | More complex |
| By headers | Markdown/docs | Keeps structure | Depends on formatting |
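The trade-offs in the table can be seen directly by running a minimal fixed-size splitter and a paragraph splitter over the same text. These are deliberately simplified stand-ins for the fuller implementations above, not replacements for them:

```python
import re

def fixed_chunks(text: str, size: int = 40) -> list[str]:
    # Fixed-size: every chunk the same length, but cuts can land mid-sentence
    return [text[i:i + size] for i in range(0, len(text), size)]

def para_chunks(text: str) -> list[str]:
    # Semantic: split on paragraph boundaries, so sizes vary but meaning survives
    return [p for p in re.split(r"\n\n+", text) if p.strip()]

text = (
    "RAG retrieves relevant chunks at query time.\n\n"
    "Chunking strategy affects retrieval quality."
)

fixed = fixed_chunks(text)       # 3 chunks; the first cuts off mid-word
semantic = para_chunks(text)     # 2 chunks, one per paragraph
```

Fixed-size chunking gives uniform inputs to the embedder at the cost of broken sentences; semantic chunking keeps each statement intact at the cost of variable chunk sizes.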
Retrieval Techniques
Beyond basic vector search, several techniques improve retrieval quality:
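The baseline all of these techniques build on is plain nearest-neighbor search over embeddings. A minimal brute-force version makes that baseline concrete; production vector stores use approximate indexes (e.g. HNSW) instead of scanning everything, but the cosine scoring is the same. The two-dimensional vectors and chunk IDs here are purely illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    top_k: int = 2
) -> list[tuple[str, float]]:
    """Brute-force nearest neighbors: score every entry, keep the top k."""
    scored = [(chunk_id, cosine(query_vec, emb)) for chunk_id, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Tiny 2-dimensional "embeddings" for illustration
index = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
results = vector_search([1.0, 0.1], index)
# "a" ranks first: its vector points in nearly the same direction as the query
```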
Query Expansion
query_expansion.py

import json

class QueryExpander:
    """Expand queries for better retrieval."""

    def __init__(self, llm):
        self.llm = llm

    async def expand(self, query: str) -> list[str]:
        """Generate multiple query variations."""

        prompt = f"""Generate 3 alternative phrasings of this query.
Return as a JSON array of strings.

Query: {query}

Consider:
- Synonyms
- More specific versions
- More general versions
- Different perspectives
"""

        response = await self.llm.generate(prompt)
        variations = json.loads(response)

        return [query] + variations

    async def retrieve_with_expansion(
        self,
        query: str,
        vector_store,
        embedder,
        top_k: int = 5
    ) -> list[dict]:
        # Get query variations
        queries = await self.expand(query)

        # Search with each variation
        all_results = []
        for q in queries:
            embedding = await embedder.embed(q)
            results = await vector_store.search(embedding, limit=top_k)
            all_results.extend(results)

        # Deduplicate by ID, keeping the first occurrence
        seen_ids = set()
        unique_results = []
        for r in all_results:
            if r["id"] not in seen_ids:
                seen_ids.add(r["id"])
                unique_results.append(r)

        # Sort by score
        unique_results.sort(key=lambda x: x["score"], reverse=True)

        return unique_results[:top_k]

Reranking
reranking.py

class Reranker:
    """Rerank initial results for better precision."""

    def __init__(self, cross_encoder_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(cross_encoder_model)

    def rerank(
        self,
        query: str,
        documents: list[str],
        top_k: int = 5
    ) -> list[tuple[int, float]]:
        """Rerank documents, return indices and scores."""

        # Create query-document pairs
        pairs = [(query, doc) for doc in documents]

        # Score with cross-encoder
        scores = self.model.predict(pairs)

        # Sort by score, descending
        ranked = sorted(
            enumerate(scores),
            key=lambda x: x[1],
            reverse=True
        )

        return ranked[:top_k]


class TwoStageRetriever:
    """First stage: fast vector search. Second stage: accurate reranking."""

    def __init__(self, vector_store, embedder, reranker):
        self.vectors = vector_store
        self.embedder = embedder
        self.reranker = reranker

    async def retrieve(
        self,
        query: str,
        initial_k: int = 20,
        final_k: int = 5
    ) -> list[dict]:
        # Stage 1: Fast vector retrieval
        query_embedding = await self.embedder.embed(query)
        candidates = await self.vectors.search(
            vector=query_embedding,
            limit=initial_k
        )

        # Stage 2: Accurate reranking
        documents = [c["content"] for c in candidates]
        reranked = self.reranker.rerank(query, documents, top_k=final_k)

        # Return reranked results
        return [candidates[idx] for idx, _score in reranked]

Contextual Retrieval
contextual_retrieval.py

class ContextualRetriever:
    """Add context to chunks before embedding (Anthropic's approach)."""

    def __init__(self, llm):
        self.llm = llm

    async def add_context(
        self,
        chunk: str,
        document: str
    ) -> str:
        """Add document context to chunk for better embedding."""

        prompt = f"""<document>
{document[:2000]}...
</document>

Here is a chunk from the document:
<chunk>
{chunk}
</chunk>

Provide a short (1-2 sentence) context that situates this chunk
within the overall document. Focus on key entities and topics
that would help retrieve this chunk for relevant queries.
"""

        context = await self.llm.generate(prompt)

        # Prepend context to chunk
        return f"{context}\n\n{chunk}"

    async def index_with_context(
        self,
        document: str,
        chunker,
        embedder,
        vector_store
    ):
        """Index document with contextual embeddings."""

        chunks = chunker.split(document)

        for chunk in chunks:
            # Add context
            contextualized = await self.add_context(chunk, document)

            # Embed the contextualized chunk
            embedding = await embedder.embed(contextualized)

            # Store with original chunk as content
            await vector_store.add(
                content=chunk,            # Original chunk for retrieval
                embedding=embedding,      # Contextual embedding
                metadata={"contextualized": contextualized}
            )

Context Integration
How you integrate retrieved context into the prompt matters:
Simple Concatenation
simple_context.py

def build_prompt_simple(query: str, chunks: list[str]) -> str:
    """Basic context concatenation."""
    context = "\n\n".join(chunks)

    return f"""Answer the question based on the following context.

Context:
{context}

Question: {query}

Answer:"""

# Works but can be improved with better structure

Structured Context
structured_context.py

def build_prompt_structured(
    query: str,
    chunks: list[dict]  # Each contains 'content', 'source', 'score'
) -> str:
    """Structured context with sources and relevance."""

    context_parts = []
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(f"""<source id="{i}" relevance="{chunk['score']:.2f}">
{chunk['content']}
</source>""")

    context = "\n\n".join(context_parts)

    return f"""You are answering questions based on provided sources.

INSTRUCTIONS:
1. Answer based ONLY on the provided sources
2. Cite sources using [Source N] notation
3. If sources don't contain the answer, say so

SOURCES:
{context}

QUESTION: {query}

ANSWER (cite your sources):"""

# Better: structured, citations, clear instructions

Dynamic Context Selection
dynamic_context.py

class DynamicContextBuilder:
    """Dynamically select and format context based on query type."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    async def build(
        self,
        query: str,
        chunks: list[dict],
        query_type: str | None = None
    ) -> str:
        # Determine query type if not provided
        if not query_type:
            query_type = self._classify_query(query)

        # Select relevant chunks based on query type
        selected = self._select_chunks(chunks, query_type)

        # Format based on query type
        if query_type == "factual":
            return self._format_factual(query, selected)
        elif query_type == "comparison":
            return self._format_comparison(query, selected)
        elif query_type == "how_to":
            return self._format_howto(query, selected)
        else:
            return self._format_general(query, selected)

    def _classify_query(self, query: str) -> str:
        query_lower = query.lower()
        if any(w in query_lower for w in ["compare", "difference", "vs"]):
            return "comparison"
        if any(w in query_lower for w in ["how to", "steps", "process"]):
            return "how_to"
        if any(w in query_lower for w in ["what is", "define", "explain"]):
            return "factual"
        return "general"

    def _format_comparison(self, query: str, chunks: list[dict]) -> str:
        return f"""Compare based on these sources:

{self._format_chunks(chunks)}

Comparison Question: {query}

Provide a balanced comparison citing specific sources."""

    # _select_chunks, _format_chunks, and the remaining _format_* helpers
    # follow the same pattern and are omitted for brevity.

Advanced RAG Patterns
Sophisticated patterns for production RAG systems:
Agentic RAG
agentic_rag.py

import json

class AgenticRAG:
    """RAG with multi-step retrieval and reasoning."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    async def query(self, question: str) -> str:
        # Step 1: Decompose complex questions
        sub_questions = await self._decompose(question)

        # Step 2: Retrieve for each sub-question
        all_context = []
        for sub_q in sub_questions:
            chunks = await self.retriever.retrieve(sub_q)
            all_context.append({
                "question": sub_q,
                "chunks": chunks
            })

        # Step 3: Synthesize answer
        return await self._synthesize(question, all_context)

    async def _decompose(self, question: str) -> list[str]:
        prompt = f"""Break this question into simpler sub-questions
that can be answered independently:

Question: {question}

Return as JSON array. If the question is already simple,
return just the original question."""

        response = await self.llm.generate(prompt)
        return json.loads(response)

    async def _synthesize(
        self,
        question: str,
        sub_answers: list[dict]
    ) -> str:
        context_parts = []
        for sa in sub_answers:
            chunks_text = "\n".join(c["content"] for c in sa["chunks"])
            context_parts.append(
                f"Sub-question: {sa['question']}\n"
                f"Evidence:\n{chunks_text}"
            )

        evidence = "\n\n".join(context_parts)
        prompt = f"""Answer the main question by synthesizing
information from the sub-question analyses.

{evidence}

Main Question: {question}

Synthesized Answer:"""

        return await self.llm.generate(prompt)

Self-Correcting RAG
self_correcting_rag.py

import json

class SelfCorrectingRAG:
    """RAG that verifies and corrects its answers."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    async def query(self, question: str, max_iterations: int = 3) -> str:
        chunks = await self.retriever.retrieve(question)
        context = self._format_context(chunks)

        for _ in range(max_iterations):
            # Generate answer (builds a grounded prompt and calls the LLM,
            # as in BasicRAG._generate)
            answer = await self._generate_answer(question, context)

            # Verify answer against sources
            verification = await self._verify(question, answer, chunks)

            if verification["is_supported"]:
                return answer

            # If not supported, try to retrieve better context
            if verification["missing_info"]:
                additional = await self.retriever.retrieve(
                    verification["missing_info"]
                )
                chunks.extend(additional)
                context = self._format_context(chunks)

        return f"[Low confidence] {answer}"

    def _format_context(self, chunks: list[dict]) -> str:
        return "\n\n".join(c["content"] for c in chunks)

    async def _verify(
        self,
        question: str,
        answer: str,
        chunks: list[dict]
    ) -> dict:
        prompt = f"""Verify if this answer is fully supported by the sources.

Sources:
{self._format_context(chunks)}

Question: {question}
Answer: {answer}

Return JSON:
{{
    "is_supported": true/false,
    "unsupported_claims": ["claim1", "claim2"],
    "missing_info": "what additional info would help"
}}"""

        response = await self.llm.generate(prompt)
        return json.loads(response)

Anthropic's Contextual Retrieval
Anthropic's research shows that adding context to chunks before embedding reduces failed retrievals by 35%. Combined with a BM25 hybrid search over the same contextualized chunks, failures drop by 49%, and adding a reranking stage brings the reduction to 67%.
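The hybrid half of that result can be sketched with reciprocal rank fusion (RRF), a common way to merge a BM25 ranking with a vector ranking without having to calibrate their incompatible score scales. The input rankings and the conventional `k = 60` default below are illustrative:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); high placement in any list adds up
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["c3", "c1", "c7"]    # lexical (BM25) order
vector_ranking = ["c1", "c3", "c9"]  # embedding-similarity order
merged = rrf_merge([bm25_ranking, vector_ranking])
# c1 and c3 surface first: both retrievers rank them near the top
```

Because RRF only looks at ranks, it works even when one retriever emits BM25 scores in the tens and the other emits cosine similarities below 1.0.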
Summary
Key concepts for RAG in agent memory:
- Three stages: Index (offline), Retrieve (online), Generate (online)
- Chunking matters: Semantic chunking often beats fixed-size
- Enhance retrieval: Query expansion, reranking, contextual embeddings
- Structure context: Format context clearly with sources and citations
- Advanced patterns: Agentic RAG for complex queries, self-correction for reliability
Next: We'll explore conversation memory management and the unique challenges of multi-turn dialogue.