Building RAG Systems: Retrieval Augmented Generation from Scratch

Jared Chung

Introduction

Ask ChatGPT about your company's internal policies, and it will confidently make something up. Ask it about events after its training cutoff, and it has no idea. These limitations—hallucination and knowledge cutoffs—are inherent to how LLMs work.

Retrieval Augmented Generation (RAG) solves both problems elegantly: before generating a response, the system retrieves relevant information from your own documents and includes it in the prompt. The LLM then generates answers grounded in actual source material.

This post explains the methodology behind RAG systems—what makes them work, what makes them fail, and how to build effective ones.

Why RAG Matters

LLM Limitation          | How RAG Solves It
Hallucinations          | Answers are grounded in retrieved documents
Knowledge cutoff        | Your documents can contain current information
No private data access  | Indexes your proprietary content
Can't cite sources      | Can reference specific documents
Expensive long contexts | Only includes relevant content

RAG isn't just a workaround—it's often the right architecture even when you could fit everything in context. Retrieved chunks are more relevant than dumping entire documents, and the system naturally scales to millions of documents.

How RAG Works

Understanding the architecture is essential for building effective systems.

[Figure: RAG architecture overview]

The Two Phases

RAG systems operate in two distinct phases:

Indexing Phase (Offline)

Before users can query, you must prepare your knowledge base:

  1. Load documents - Read PDFs, web pages, databases, etc.
  2. Chunk - Split documents into smaller pieces (typically 200-1000 characters)
  3. Embed - Convert each chunk to a vector using an embedding model
  4. Store - Save vectors and text in a vector database

This is done once per document (and again when documents update).

Query Phase (Online)

When a user asks a question:

  1. Embed the query - Convert the question to a vector
  2. Search - Find the most similar document chunks
  3. Augment - Add retrieved chunks to the LLM prompt
  4. Generate - LLM produces an answer using the context

The key insight: embedding models are trained to produce similar vectors for semantically similar text. "What is machine learning?" and "ML is a type of AI that learns from data" will have similar vectors despite using different words.
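
You can see this directly with a small embedding model. The sketch below uses sentence-transformers with the all-MiniLM-L6-v2 model (the same one used in the implementation later in this post); the exact scores depend on the model, but the related pair should score clearly higher than the unrelated one.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is machine learning?"
related = "ML is a type of AI that learns from data"
unrelated = "The recipe calls for two cups of flour"

# Encode all three texts into vectors
embeddings = model.encode([query, related, unrelated])

# Cosine similarity: higher means more semantically similar
print(float(util.cos_sim(embeddings[0], embeddings[1])))  # noticeably higher
print(float(util.cos_sim(embeddings[0], embeddings[2])))  # noticeably lower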

Core Concepts

Chunking Strategy

How you split documents dramatically affects quality. Consider this trade-off:

Chunk Size             | Trade-off
Small (100-300 chars)  | Precise retrieval, but may lose context
Medium (300-800 chars) | Balanced; a good default
Large (800-1500 chars) | More context, but less precise retrieval

The Overlap Principle

Adjacent chunks should overlap by 10-20% to avoid cutting ideas in half. If you split "Machine learning uses algorithms. These algorithms learn from data." at the period, one chunk has "algorithms" without explanation, the other has "algorithms" without knowing what it refers to.

Chunking Strategies

Fixed-size with overlap: Simple, predictable. Split every N characters with M overlap.

Recursive splitting: Try splitting by paragraphs first, then sentences, then characters. Preserves natural boundaries.

Semantic chunking: Use embedding similarity to detect topic changes. More complex but preserves meaning.

For most use cases: start with recursive splitting at 500 characters and a 50-character overlap, as sketched below.
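
If you use LangChain, its RecursiveCharacterTextSplitter implements exactly this strategy. A minimal sketch follows; note that the import path is an assumption (it has moved between LangChain versions, with older releases exposing it as langchain.text_splitter) and the sample text is a placeholder.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try paragraph breaks first, then lines, then sentences, then words
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target characters per chunk
    chunk_overlap=50,  # ~10% overlap between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],
)

sample_text = "Machine learning uses algorithms. These algorithms learn from data. " * 30
chunks = splitter.split_text(sample_text)
print(len(chunks), [len(c) for c in chunks[:3]])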

Embedding Models

The embedding model converts text to vectors. Quality varies significantly:

Model                         | Dimensions | Quality   | Speed
OpenAI text-embedding-3-small | 1536       | Excellent | Fast
OpenAI text-embedding-3-large | 3072       | Best      | Medium
all-MiniLM-L6-v2              | 384        | Good      | Very fast
nomic-embed-text              | 768        | Very good | Fast
BGE-large-en                  | 1024       | Excellent | Medium

Key principle: Use the same embedding model for indexing and querying. Different models produce incompatible vectors.

Local vs API (both options are sketched below):

  • OpenAI embeddings are convenient but require API calls
  • Local models (via sentence-transformers or Ollama) work offline with no per-call cost
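
A minimal sketch of both options; the API path assumes an OPENAI_API_KEY is set in the environment, while the local path downloads the model once and then runs offline.

from openai import OpenAI
from sentence_transformers import SentenceTransformer

text = "Machine learning is a type of AI that learns from data."

# API-based embedding
client = OpenAI()
api_vector = client.embeddings.create(
    model="text-embedding-3-small", input=text
).data[0].embedding

# Local embedding
local_model = SentenceTransformer("all-MiniLM-L6-v2")
local_vector = local_model.encode(text)

print(len(api_vector), len(local_vector))  # 1536 vs 384 dimensions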

Vector Databases

Once you have vectors, you need somewhere to store and search them:

Database | Type                 | Best For
ChromaDB | Embedded             | Prototyping, single-machine
Pinecone | Cloud                | Production, scalability
Weaviate | Self-hosted          | Production with control
pgvector | PostgreSQL extension | Existing Postgres infra
Qdrant   | Self-hosted          | High performance

For learning and prototyping, ChromaDB is perfect—it runs in-memory or on disk with no setup.

Retrieval Techniques

Similarity Search

Find the k most similar chunks to the query. Simple and effective for many use cases.

Similarity metrics (compared in the snippet after this list):

  • Cosine similarity: Most common; measures angle between vectors
  • Euclidean distance: Actual distance in vector space
  • Dot product: Fast but assumes normalized vectors
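
Here's how the three metrics above look in plain NumPy. Vector databases compute these for you, so this is purely illustrative.

import numpy as np

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.65, 0.05])

# Cosine similarity: angle between vectors, independent of their length
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance; smaller means more similar
euclidean = np.linalg.norm(a - b)

# Dot product: fast; matches cosine similarity when vectors are unit-length
dot = np.dot(a, b)

print(cosine, euclidean, dot)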

Hybrid Search

Combine semantic search with keyword matching:

  • BM25 (keyword): Great for exact matches like names, codes, or technical terms
  • Vector (semantic): Great for conceptual similarity

Hybrid search with alpha=0.5 (50% each) often outperforms either alone; a blending sketch follows below.
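
A minimal blending sketch, assuming the rank_bm25 package for the keyword side and a sentence-transformers model for the semantic side; the min-max normalization and the alpha weighting are the illustrative parts.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

documents = [
    "Error code E42 means the coolant pump has failed.",
    "The pump can overheat under sustained heavy load.",
]
query = "what does E42 mean"

# Keyword side: BM25 over whitespace-tokenized text
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
bm25_scores = np.array(bm25.get_scores(query.lower().split()))

# Semantic side: cosine similarity between embeddings
embed = SentenceTransformer("all-MiniLM-L6-v2").encode
doc_vecs = np.array([embed(doc) for doc in documents])
query_vec = embed(query)
vector_scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)

def normalize(scores):
    # Rescale each score list to [0, 1] so the two are comparable
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)

alpha = 0.5  # weight on the semantic score
hybrid = alpha * normalize(vector_scores) + (1 - alpha) * normalize(bm25_scores)
print(documents[int(hybrid.argmax())])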

Re-ranking

After initial retrieval, re-rank results with a more powerful model:

  1. Retrieve top 20 with fast vector search
  2. Re-rank to top 5 with cross-encoder model

Cross-encoders are more accurate but slower—they process query and document together rather than independently.
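
With sentence-transformers this is only a few lines. The model name below is one commonly used re-ranking checkpoint, and the two candidate chunks stand in for the top 20 returned by the first-stage vector search.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I reset my password?"
candidates = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Password history is retained for 90 days for compliance reasons.",
]

# Score each (query, chunk) pair jointly, then keep the highest-scoring chunks
scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
top_chunks = reranked[:5]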

Building a Basic RAG System

Here's a minimal but complete RAG implementation:

import chromadb
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Initialize components
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
llm_client = OpenAI()

# Index documents
def index_document(text: str, source: str):
    # Simple fixed-size chunking: 500-character windows with 50 characters of overlap
    chunks = [text[i:i+500] for i in range(0, len(text), 450)]

    for i, chunk in enumerate(chunks):
        embedding = embed_model.encode(chunk).tolist()
        collection.add(
            ids=[f"{source}_{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{"source": source}]
        )

# Query the system
def ask(question: str) -> str:
    # Embed question and find similar chunks
    q_embedding = embed_model.encode(question).tolist()
    results = collection.query(query_embeddings=[q_embedding], n_results=3)

    # Build context from retrieved chunks
    context = "\n\n".join(results['documents'][0])

    # Generate answer
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

# Usage
index_document("Python was created by Guido van Rossum in 1991...", "python_history.txt")
print(ask("Who created Python?"))

This is ~30 lines of actual logic. The complexity comes from handling edge cases, scaling, and improving quality.

Why RAG Fails (And How to Fix It)

Understanding failure modes is crucial for building reliable systems.

Problem 1: Wrong Chunks Retrieved

Symptoms: The answer ignores relevant information that exists in your documents.

Causes:

  • Query and document use different terminology
  • Chunk size doesn't match query granularity
  • Important context spans multiple chunks

Solutions:

  • Try hybrid search (BM25 + vector)
  • Adjust chunk size and overlap
  • Use query expansion (generate alternative phrasings)
  • Add metadata filtering when applicable

Problem 2: LLM Ignores Context

Symptoms: The model gives general answers instead of using retrieved content.

Causes:

  • Prompt doesn't emphasize using the context
  • Too much context dilutes important information
  • Context placed too far from the question

Solutions (a prompt sketch follows the list):

  • Explicit instructions: "Only use the provided context"
  • Reduce to fewer, more relevant chunks
  • Put context immediately before the question
  • Use a stronger LLM
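
One way to apply these fixes in code is a prompt builder that states the grounding rule explicitly and puts the context immediately before the question. The wording below is a sketch, not a canonical template.

def build_prompt(context: str, question: str) -> list:
    # Explicit grounding instruction, context placed right before the question
    system = (
        "Answer using ONLY the context provided. "
        "If the context does not contain the answer, say you don't know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]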

Problem 3: Hallucination Despite Context

Symptoms: The answer contains information not in the retrieved chunks.

Causes:

  • Retrieved chunks don't actually answer the question
  • Model's training data "fills in" perceived gaps
  • Ambiguous or contradictory context

Solutions:

  • Add "If the answer isn't in the context, say so"
  • Increase retrieval count to improve coverage
  • Use confidence scoring and filtering
  • Enable source citations

Problem 4: Poor Chunking

Symptoms: Retrieved chunks are incomplete or lack necessary context.

Causes:

  • Sentences cut mid-thought
  • Related information scattered across chunks
  • No overlap between chunks

Solutions:

  • Add 10-20% overlap
  • Use sentence-aware splitting
  • Try semantic chunking
  • Include document metadata with each chunk

Evaluation Methodology

How do you know if your RAG system is working well?

Key Metrics

Retrieval Quality (computed by hand in the snippet below):

  • Recall@K: What percentage of relevant documents are in the top K?
  • MRR (Mean Reciprocal Rank): How high is the first relevant result?
  • NDCG: Accounts for position of all relevant results
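
Recall@K and MRR are easy to compute yourself once you know, for each test question, which document IDs are relevant and which IDs the retriever returned in order; a minimal sketch:

def recall_at_k(relevant_ids: set, retrieved_ids: list, k: int) -> float:
    # Fraction of the relevant documents that appear in the top k results
    hits = len(relevant_ids & set(retrieved_ids[:k]))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(relevant_ids: set, retrieved_ids: list) -> float:
    # Reciprocal rank of the first relevant result; 0 if none was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

print(recall_at_k({"doc1", "doc7"}, ["doc3", "doc1", "doc9"], k=3))  # 0.5
print(mrr({"doc1", "doc7"}, ["doc3", "doc1", "doc9"]))               # 0.5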

Generation Quality:

  • Faithfulness: Is the answer supported by the context?
  • Relevance: Does it actually answer the question?
  • Coherence: Is it well-structured and clear?

Using RAGAS

RAGAS is a framework specifically for RAG evaluation:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# ragas expects a Hugging Face Dataset; context_precision also needs a reference answer
# (the column is "ground_truth" in recent versions, "ground_truths" in older ones).
# The metrics themselves call an LLM, so an API key must be configured.
dataset = Dataset.from_dict({
    "question": ["What is machine learning?"],
    "answer": ["Machine learning is..."],
    "contexts": [["ML is a type of AI..."]],
    "ground_truth": ["Machine learning is a type of AI that learns from data."],
})

result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)

Building Test Sets

For reliable evaluation, create:

  1. Query set: Representative questions users will ask
  2. Ground truth: Expected answers or relevant document IDs
  3. Edge cases: Questions with no answer, ambiguous queries

Test against your set whenever you change chunking, embeddings, or prompts.

Advanced Techniques

Query Expansion

Generate multiple versions of the query to improve recall:

Original: "How do I reset my password?"
Expanded: ["password reset", "forgot password", "change login credentials"]

Search all variations and combine results.
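
A sketch of this on top of the basic implementation above, reusing its llm_client, embed_model, and collection objects; the prompt wording and the de-duplication by chunk ID are illustrative choices.

def expand_query(question: str, n: int = 3) -> list:
    # Ask the LLM for alternative phrasings of the same question
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rewrite this question {n} different ways, one per line:\n{question}",
        }],
    )
    variants = [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]
    return [question] + variants

# Search every variant and merge results, de-duplicating by chunk ID
seen, merged = set(), []
for q in expand_query("How do I reset my password?"):
    results = collection.query(query_embeddings=[embed_model.encode(q).tolist()], n_results=3)
    for chunk_id, doc in zip(results["ids"][0], results["documents"][0]):
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(doc)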

Contextual Compression

After retrieval, extract only the relevant sentences from each chunk. Reduces noise and fits more information in context.
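
A rough version of this reuses the embedding model from the basic example: split each retrieved chunk into sentences and keep only those whose similarity to the query clears a threshold (the 0.4 cut-off below is an arbitrary illustrative value).

import numpy as np

def compress_chunk(chunk: str, query: str, threshold: float = 0.4) -> str:
    # Keep only the sentences that are semantically close to the query
    sentences = [s.strip() for s in chunk.split(". ") if s.strip()]
    query_vec = embed_model.encode(query)
    sentence_vecs = embed_model.encode(sentences)
    scores = sentence_vecs @ query_vec / (
        np.linalg.norm(sentence_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    kept = [s for s, score in zip(sentences, scores) if score >= threshold]
    return ". ".join(kept)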

Conversation Memory

For follow-up questions, reformulate with context:

User: "What is machine learning?"
Assistant: "Machine learning is..."
User: "How is it different from deep learning?"

Before searching, rewrite as: "How is machine learning different from deep learning?"
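
A minimal rewriting step, again reusing llm_client from the basic example; the prompt wording is illustrative.

def rewrite_followup(history: list, followup: str) -> str:
    # Fold the conversation so far into a standalone question for retrieval
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the follow-up question so it can be understood on its own.\n\n"
                f"Conversation:\n{transcript}\n\nFollow-up: {followup}\n\nStandalone question:"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

history = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."},
]
standalone = rewrite_followup(history, "How is it different from deep learning?")
# Roughly: "How is machine learning different from deep learning?"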

Metadata Filtering

Combine vector search with structured filters, as sketched after this list:

  • Only search documents from a specific date range
  • Filter by department or document type
  • Restrict to user's access permissions
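
With ChromaDB, this is a where filter on the metadata stored at indexing time; the source value below is a hypothetical file name.

# Restrict the vector search to chunks from one document
results = collection.query(
    query_embeddings=[embed_model.encode("What is the travel policy?").tolist()],
    n_results=3,
    where={"source": "hr_handbook.txt"},  # hypothetical metadata value
)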

Production Considerations

Caching

Cache embeddings for repeated queries. Most RAG traffic follows a power-law distribution: a small number of queries account for most of the volume.
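
A simple sketch of embedding-level caching with functools.lru_cache, reusing embed_model from the basic example; production systems would more likely use an external cache such as Redis.

from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query_cached(question: str) -> tuple:
    # Tuples are hashable, so results can be memoized; convert back to a list when querying
    return tuple(embed_model.encode(question).tolist())

q_embedding = list(embed_query_cached("What is machine learning?"))  # repeat calls are cache hits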

Streaming

Stream LLM responses for a better user experience. Time to first token matters more for perceived responsiveness than total latency.
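
With the OpenAI client used above, streaming means passing stream=True and printing content deltas as they arrive. This sketch is the streaming variant of the generation step inside ask(), where context and question are already defined.

stream = llm_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": question},
    ],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)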

Monitoring

Track:

  • Retrieval latency and result counts
  • LLM generation time and token usage
  • User satisfaction signals (thumbs up/down)
  • Queries with no good results

Document Updates

When documents change (see the sketch below):

  • Re-chunk and re-embed updated documents
  • Delete old chunks before adding new
  • Consider versioning for audit trails
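
With the ChromaDB setup from the basic example, an update can delete a document's old chunks (matched by their source metadata) before re-indexing the new version.

def reindex_document(text: str, source: str):
    # Drop stale chunks for this source so outdated content can't be retrieved
    collection.delete(where={"source": source})
    # Re-chunk and re-embed using index_document() from the basic example
    index_document(text, source)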

Conclusion

RAG transforms LLMs from general-purpose text generators into knowledgeable assistants grounded in your specific content. The core architecture is simple, but the details matter:

Start simple:

  1. Use recursive chunking with overlap
  2. Use a good embedding model
  3. Retrieve 3-5 chunks
  4. Write clear prompts that emphasize using the context

Iterate based on failures:

  1. Test with real queries
  2. Identify failure patterns
  3. Apply targeted fixes (hybrid search, reranking, better chunking)
  4. Measure improvement

The gap between a working prototype and a production system lies in handling edge cases, evaluation, and continuous improvement based on user feedback.
