Advanced RAG: Beyond Basic Retrieval

Jared Chung
Introduction

Basic RAG is straightforward: embed documents, store them in a vector database, retrieve similar chunks, and generate a response. But production RAG systems face challenges that simple implementations can't handle: dense technical content, ambiguous queries, multi-hop reasoning, and the need for high precision at scale.

This guide covers advanced techniques that separate prototype RAG from production-grade systems.

[Figure: Advanced RAG pipeline]

The Limitations of Basic RAG

Before diving into solutions, let's understand what goes wrong:

Problem               | Symptom                               | Root Cause
Poor recall           | Misses relevant documents             | Simple similarity isn't enough
Low precision         | Returns irrelevant chunks             | No reranking or filtering
Context fragmentation | Loses important context               | Naive chunking strategies
Query mismatch        | User query doesn't match doc language | No query transformation
Hallucination         | Makes up information                  | Retrieved context too sparse

Query Transformation Techniques

The user's query is rarely optimal for retrieval. Transform it first.

Query Expansion

Generate multiple query variants to improve recall:

import ast

def expand_query(query: str, llm) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this query for search.
    Include technical synonyms and related concepts.

    Query: {query}

    Return as a Python list of strings."""

    response = llm.invoke(prompt)
    # Parse the returned list literal safely rather than calling eval on LLM output
    variants = ast.literal_eval(response.content)
    return [query] + variants

# Example
query = "How do I fine-tune LLaMA?"
expanded = expand_query(query, llm)
# ["How do I fine-tune LLaMA?",
#  "LLaMA model training customization",
#  "Adapting LLaMA weights for specific tasks",
#  "Fine-tuning open source large language models"]

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, then search for documents similar to that answer:

def hyde_transform(query: str, llm) -> str:
    prompt = f"""Write a detailed paragraph that would answer this question.
    Write as if you're an expert, but don't make up specific facts.

    Question: {query}"""

    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc

# Search using the hypothetical document embedding
# instead of the query embedding
query = "What causes transformer attention to be slow?"
hyde_doc = hyde_transform(query, llm)
results = vectorstore.similarity_search(hyde_doc, k=5)

When to use HyDE: Technical queries where user language differs significantly from document language.

Step-Back Prompting

For specific questions, first ask a more general question:

def stepback_query(query: str, llm) -> str:
    prompt = f"""Given this specific question, generate a more general
    question that would help answer it.

    Specific: {query}
    General:"""

    return llm.invoke(prompt).content

# Example
specific = "What's the learning rate for fine-tuning BERT on NER?"
general = stepback_query(specific, llm)
# "What are best practices for fine-tuning BERT models?"

# Search for both and combine results
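
One simple way to combine them is to retrieve for both the specific and the general query, then merge and deduplicate the results. A minimal sketch, assuming a LangChain-style vectorstore with similarity_search (deduplicating by chunk text is an illustrative choice):

def search_specific_and_general(specific: str, general: str, vectorstore, k: int = 5):
    # Retrieve for both phrasings of the question
    docs = vectorstore.similarity_search(specific, k=k)
    docs += vectorstore.similarity_search(general, k=k)

    # Deduplicate by chunk text, preserving order
    seen, merged = set(), []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged

docs = search_specific_and_general(specific, general, vectorstore)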

Advanced Chunking Strategies

How you split documents dramatically affects retrieval quality.

Semantic Chunking

Split based on semantic shifts, not arbitrary lengths:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = splitter.split_text(document)

Parent-Child Chunking

Store small chunks for retrieval, but return larger context:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for precise matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Searches small chunks, returns parent chunks
docs = retriever.get_relevant_documents(query)

Proposition-Based Chunking

Extract atomic facts from documents:

def extract_propositions(text: str, llm) -> list[str]:
    prompt = f"""Extract atomic facts from this text.
    Each fact should be self-contained and understandable without context.

    Text: {text}

    Facts (one per line):"""

    response = llm.invoke(prompt)
    # One fact per line; strip bullets and blank lines
    return [
        line.strip("-* ").strip()
        for line in response.content.splitlines()
        if line.strip()
    ]

# Each proposition becomes a chunk
# "BERT uses 12 transformer layers"
# "BERT was trained on BookCorpus and Wikipedia"

Hybrid Search

Combine dense (vector) and sparse (keyword) search for better results.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retrieval
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector-based retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # Tune based on your data
)

When Hybrid Beats Pure Vector

Query Type                           | Best Approach
Exact terms (API names, error codes) | BM25 heavy
Conceptual questions                 | Vector heavy
Mixed (concept + specific term)      | Balanced hybrid
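
If your traffic mixes these query types, the ensemble weights can be chosen per query. A minimal sketch of one possible routing heuristic (the regex and the weight values are illustrative assumptions, not tuned defaults):

import re

def choose_weights(query: str) -> list[float]:
    # Queries containing code-like tokens (snake_case, CamelCase, error codes)
    # are treated as exact-term lookups and lean on BM25; everything else
    # leans on vector search.
    looks_exact = bool(re.search(r"\w+_\w+|[a-z]+[A-Z]\w+|[A-Z]{2,}-?\d+", query))
    return [0.7, 0.3] if looks_exact else [0.3, 0.7]

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=choose_weights(query),
)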

Reranking

Initial retrieval prioritizes recall. Reranking improves precision.

Cross-Encoder Reranking

Cross-encoders are more accurate than bi-encoders but slower:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: list, top_k: int = 3):
    # Score each document against the query
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_docs[:top_k]]

# Retrieve many, rerank to few
initial_docs = vectorstore.similarity_search(query, k=20)
final_docs = rerank_results(query, initial_docs, top_k=5)

Cohere Rerank API

Production-ready reranking as a service:

import cohere

co = cohere.Client(api_key)

def cohere_rerank(query: str, documents: list, top_k: int = 5):
    response = co.rerank(
        query=query,
        documents=[doc.page_content for doc in documents],
        top_n=top_k,
        model="rerank-english-v2.0"
    )

    # Map reranked results back to the original document objects
    return [documents[r.index] for r in response.results]

LLM-Based Reranking

Use the LLM itself to judge relevance:

import json

def llm_rerank(query: str, documents: list, llm, top_k: int = 3):
    # Label each document so scores can be mapped back to positions
    formatted = "\n\n".join(
        f"doc_{i}: {doc.page_content}" for i, doc in enumerate(documents)
    )

    prompt = f"""Rate the relevance of each document to the query.
    Score 1-10 where 10 is highly relevant.

    Query: {query}

    Documents:
    {formatted}

    Return scores as JSON: {{"doc_0": score, "doc_1": score, ...}}"""

    scores = json.loads(llm.invoke(prompt).content)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    return [documents[int(k.split("_")[1])] for k, v in ranked[:top_k]]

Multi-Vector Retrieval

Represent documents with multiple embeddings for richer matching.

Summary + Content Embeddings

def create_multi_vector_doc(doc, llm, embed):
    # embed is your embedding function, e.g. OpenAIEmbeddings().embed_query

    # Generate summary
    summary = llm.invoke(f"Summarize: {doc.page_content}").content

    # Generate questions this doc answers
    questions = llm.invoke(
        f"What questions does this answer? {doc.page_content}"
    ).content

    return {
        "content": doc.page_content,
        "content_embedding": embed(doc.page_content),
        "summary_embedding": embed(summary),
        "questions_embedding": embed(questions),
    }

# Search across all embedding types
# Return the original content
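
The search side can then score a query against all three embeddings and return the original content for the best matches. A minimal sketch using in-memory cosine similarity (numpy-based and illustrative, rather than any particular vector-store API; embed is the same embedding function as above):

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multi_vector_search(query: str, docs: list[dict], embed, k: int = 5):
    q = embed(query)
    scored = []
    for d in docs:
        # Take the best score across content, summary, and question embeddings
        score = max(
            cosine(q, d["content_embedding"]),
            cosine(q, d["summary_embedding"]),
            cosine(q, d["questions_embedding"]),
        )
        scored.append((score, d["content"]))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [content for _, content in scored[:k]]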

Contextual Compression

Reduce noise by extracting only relevant portions:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only the relevant sentences from each document
docs = compression_retriever.get_relevant_documents(query)

Self-Query Retrieval

Let the LLM write the query filters:

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="category", type="string", description="Document category"),
    AttributeInfo(name="date", type="date", description="Publication date"),
    AttributeInfo(name="author", type="string", description="Author name"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_content_description="Technical blog posts about ML",
    metadata_field_info=metadata_field_info,
)

# User: "Articles about transformers from 2024"
# Auto-generates: filter={"date": {"$gte": "2024-01-01"}}

Evaluation and Iteration

You can't improve what you don't measure.

Key Metrics

def evaluate_rag(test_set, retriever, generator):
    results = {
        "retrieval_precision": [],
        "retrieval_recall": [],
        "answer_relevance": [],
        "faithfulness": [],
    }

    for item in test_set:
        query = item["query"]
        ground_truth_docs = item["relevant_docs"]
        ground_truth_answer = item["answer"]

        # Retrieval metrics
        retrieved = retriever.get_relevant_documents(query)
        precision = calculate_precision(retrieved, ground_truth_docs)
        recall = calculate_recall(retrieved, ground_truth_docs)

        # Generation metrics
        answer = generator.generate(query, retrieved)
        relevance = judge_relevance(answer, query)
        faithfulness = check_faithfulness(answer, retrieved)

        results["retrieval_precision"].append(precision)
        # ... etc

    return {k: sum(v)/len(v) for k, v in results.items()}
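
The metric helpers above are left abstract. Retrieval precision and recall against a labelled set of relevant documents can be as simple as set overlap; a minimal sketch, assuming relevant_docs is a list of chunk texts (document IDs are more robust in practice):

def calculate_precision(retrieved, relevant_docs) -> float:
    # Fraction of retrieved documents that are actually relevant
    relevant = set(relevant_docs)
    hits = sum(1 for doc in retrieved if doc.page_content in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def calculate_recall(retrieved, relevant_docs) -> float:
    # Fraction of relevant documents that were retrieved
    relevant = set(relevant_docs)
    hits = sum(1 for doc in retrieved if doc.page_content in relevant)
    return hits / len(relevant) if relevant else 0.0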

Building Test Sets

Create diverse test cases:

test_set = [
    {
        "query": "What is the attention mechanism?",
        "type": "factual",
        "difficulty": "easy"
    },
    {
        "query": "Compare BERT and GPT architectures",
        "type": "comparison",
        "difficulty": "medium"
    },
    {
        "query": "How would you fine-tune for low-resource NER?",
        "type": "reasoning",
        "difficulty": "hard"
    },
]

Putting It All Together

A production RAG pipeline might look like:

class AdvancedRAGPipeline:
    def __init__(self, vectorstore, llm, reranker):
        self.vectorstore = vectorstore
        self.llm = llm
        self.reranker = reranker

    def query(self, user_query: str) -> str:
        # 1. Query expansion
        queries = self.expand_query(user_query)

        # 2. Hybrid retrieval
        all_docs = []
        for q in queries:
            vector_docs = self.vectorstore.similarity_search(q, k=10)
            bm25_docs = self.bm25_search(q, k=10)
            all_docs.extend(vector_docs + bm25_docs)

        # 3. Deduplicate
        unique_docs = self.deduplicate(all_docs)

        # 4. Rerank
        top_docs = self.reranker.rerank(user_query, unique_docs, k=5)

        # 5. Compress context
        compressed = self.compress_context(user_query, top_docs)

        # 6. Generate
        response = self.generate(user_query, compressed)

        return response
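
    # The expand_query, bm25_search, deduplicate, compress_context, and
    # generate methods correspond to the techniques covered earlier. As one
    # illustrative example (keying on chunk text; vector-store IDs work
    # equally well), deduplication might look like:
    def deduplicate(self, docs: list) -> list:
        seen, unique = set(), []
        for doc in docs:
            if doc.page_content not in seen:
                seen.add(doc.page_content)
                unique.append(doc)
        return unique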

Conclusion

Advanced RAG is about combining multiple techniques thoughtfully. Start with the basics, measure your results, identify failure modes, and add complexity only where it helps.

The best RAG system isn't the most sophisticated one; it's the one that reliably answers your users' questions with accurate, grounded responses.