Advanced RAG: Beyond Basic Retrieval
Author: Jared Chung
Introduction
Basic RAG is straightforward: embed documents, store them in a vector database, retrieve similar chunks, and generate a response. But production RAG systems face challenges that simple implementations can't handle: dense technical content, ambiguous queries, multi-hop reasoning, and the need for high precision at scale.
This guide covers advanced techniques that separate prototype RAG from production-grade systems.
The Limitations of Basic RAG
Before diving into solutions, let's understand what goes wrong:
| Problem | Symptom | Root Cause |
|---|---|---|
| Poor recall | Misses relevant documents | Simple similarity isn't enough |
| Low precision | Returns irrelevant chunks | No reranking or filtering |
| Context fragmentation | Loses important context | Naive chunking strategies |
| Query mismatch | User query doesn't match doc language | No query transformation |
| Hallucination | Makes up information | Retrieved context too sparse |
Query Transformation Techniques
The user's query is rarely optimal for retrieval. Transform it first.
Query Expansion
Generate multiple query variants to improve recall:
```python
import ast

def expand_query(query: str, llm) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this query for search.
Include technical synonyms and related concepts.
Query: {query}
Return as a Python list of strings."""
    response = llm.invoke(prompt)
    # Parse the model's list output safely instead of eval()-ing it
    variants = ast.literal_eval(response.content)
    return [query] + variants

# Example
query = "How do I fine-tune LLaMA?"
expanded = expand_query(query, llm)
# ["How do I fine-tune LLaMA?",
#  "LLaMA model training customization",
#  "Adapting LLaMA weights for specific tasks",
#  "Fine-tuning open source large language models"]
```
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then search for documents similar to that answer:
```python
def hyde_transform(query: str, llm) -> str:
    prompt = f"""Write a detailed paragraph that would answer this question.
Write as if you're an expert, but don't make up specific facts.
Question: {query}"""
    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc

# Search using the hypothetical document embedding
# instead of the query embedding
query = "What causes transformer attention to be slow?"
hyde_doc = hyde_transform(query, llm)
results = vectorstore.similarity_search(hyde_doc, k=5)
```
When to use HyDE: Technical queries where user language differs significantly from document language.
Step-Back Prompting
For specific questions, first ask a more general question:
def stepback_query(query: str, llm) -> str:
prompt = f"""Given this specific question, generate a more general
question that would help answer it.
Specific: {query}
General:"""
return llm.invoke(prompt).content
# Example
specific = "What's the learning rate for fine-tuning BERT on NER?"
general = stepback_query(specific, llm)
# "What are best practices for fine-tuning BERT models?"
# Search for both and combine results
Advanced Chunking Strategies
How you split documents dramatically affects retrieval quality.
Semantic Chunking
Split based on semantic shifts, not arbitrary lengths:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = splitter.split_text(document)
```
Parent-Child Chunking
Store small chunks for retrieval, but return larger context:
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for precise matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# Larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Searches small chunks, returns parent chunks
docs = retriever.get_relevant_documents(query)
```
Proposition-Based Chunking
Extract atomic facts from documents:
```python
def extract_propositions(text: str, llm) -> list[str]:
    prompt = f"""Extract atomic facts from this text.
Each fact should be self-contained and understandable without context.
Text: {text}
Facts:"""
    response = llm.invoke(prompt)
    # One fact per line in the model output
    return [line.strip("- ").strip() for line in response.content.splitlines() if line.strip()]

# Each proposition becomes a chunk, e.g.:
# "BERT uses 12 transformer layers"
# "BERT was trained on BookCorpus and Wikipedia"
```
Hybrid Search
Combine dense (vector) and sparse (keyword) search for better results.
BM25 + Vector Search
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retrieval
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector-based retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # Tune based on your data
)
```
When Hybrid Beats Pure Vector
| Query Type | Best Approach |
|---|---|
| Exact terms (API names, error codes) | BM25 heavy |
| Conceptual questions | Vector heavy |
| Mixed (concept + specific term) | Balanced hybrid |
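One way to act on this table is to pick the ensemble weights per query rather than globally. A rough sketch; the classification heuristic and the weight values are illustrative assumptions, not a rule from the original:

```python
import re

def choose_weights(query: str) -> list[float]:
    # Rough heuristic: ALL-CAPS terms, long numbers, snake_case or call syntax
    # suggest exact identifiers, which favor keyword (BM25) retrieval
    has_exact_terms = bool(re.search(r"[A-Z]{2,}|\d{3,}|_|\(\)", query))
    return [0.7, 0.3] if has_exact_terms else [0.3, 0.7]  # [bm25, vector]

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=choose_weights(query),
)
```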
Reranking
Initial retrieval prioritizes recall. Reranking improves precision.
Cross-Encoder Reranking
Cross-encoders are more accurate than bi-encoders but slower:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: list, top_k: int = 3):
    # Score each document against the query
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in scored_docs[:top_k]]

# Retrieve many, rerank to few
initial_docs = vectorstore.similarity_search(query, k=20)
final_docs = rerank_results(query, initial_docs, top_k=5)
```
Cohere Rerank API
Production-ready reranking as a service:
```python
import cohere

co = cohere.Client(api_key)

def cohere_rerank(query: str, documents: list, top_k: int = 5):
    results = co.rerank(
        query=query,
        documents=[doc.page_content for doc in documents],
        top_n=top_k,
        model="rerank-english-v2.0",
    )
    return [documents[r.index] for r in results]
```
LLM-Based Reranking
Use the LLM itself to judge relevance:
```python
import json

def llm_rerank(query: str, documents: list, llm, top_k: int = 3):
    # format_documents() should label each document as doc_0, doc_1, ...
    prompt = f"""Rate the relevance of each document to the query.
Score 1-10 where 10 is highly relevant.
Query: {query}
Documents:
{format_documents(documents)}
Return scores as JSON: {{"doc_0": score, "doc_1": score, ...}}"""
    scores = json.loads(llm.invoke(prompt).content)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [documents[int(k.split("_")[1])] for k, v in ranked[:top_k]]
```
Multi-Vector Retrieval
Represent documents with multiple embeddings for richer matching.
Summary + Content Embeddings
```python
def create_multi_vector_doc(doc, llm):
    # Generate a summary
    summary = llm.invoke(f"Summarize: {doc.page_content}").content

    # Generate questions this doc answers
    questions = llm.invoke(
        f"What questions does this answer? {doc.page_content}"
    ).content

    # embed() is your embedding function (e.g. an embedding model's embed_query)
    return {
        "content": doc.page_content,
        "content_embedding": embed(doc.page_content),
        "summary_embedding": embed(summary),
        "questions_embedding": embed(questions),
    }

# Search across all embedding types
# Return the original content
```
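The last two comments describe the retrieval side. A minimal sketch of that step, assuming a plain list of the records produced above and the same `embed()` placeholder, rather than any particular vector-store API:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multi_vector_search(query: str, records: list[dict], k: int = 5):
    q = embed(query)  # same placeholder embedding function as above
    scored = []
    for rec in records:
        # Best score across the content, summary, and question embeddings
        score = max(
            cosine(q, rec["content_embedding"]),
            cosine(q, rec["summary_embedding"]),
            cosine(q, rec["questions_embedding"]),
        )
        scored.append((score, rec["content"]))
    scored.sort(key=lambda x: x[0], reverse=True)
    # Return the original content, not the summary or questions
    return [content for _, content in scored[:k]]
```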
Contextual Compression
Reduce noise by extracting only relevant portions:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Returns only the relevant sentences from each document
docs = compression_retriever.get_relevant_documents(query)
```
Self-Query Retrieval
Let the LLM write the query filters:
```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(name="category", type="string", description="Document category"),
    AttributeInfo(name="date", type="date", description="Publication date"),
    AttributeInfo(name="author", type="string", description="Author name"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_content_description="Technical blog posts about ML",
    metadata_field_info=metadata_field_info,
)

# User: "Articles about transformers from 2024"
# Auto-generates: filter={"date": {"$gte": "2024-01-01"}}
```
Evaluation and Iteration
You can't improve what you don't measure.
Key Metrics
```python
def evaluate_rag(test_set, retriever, generator):
    results = {
        "retrieval_precision": [],
        "retrieval_recall": [],
        "answer_relevance": [],
        "faithfulness": [],
    }

    for item in test_set:
        query = item["query"]
        ground_truth_docs = item["relevant_docs"]
        ground_truth_answer = item["answer"]

        # Retrieval metrics
        retrieved = retriever.get_relevant_documents(query)
        precision = calculate_precision(retrieved, ground_truth_docs)
        recall = calculate_recall(retrieved, ground_truth_docs)

        # Generation metrics
        answer = generator.generate(query, retrieved)
        relevance = judge_relevance(answer, query)
        faithfulness = check_faithfulness(answer, retrieved)

        results["retrieval_precision"].append(precision)
        results["retrieval_recall"].append(recall)
        results["answer_relevance"].append(relevance)
        results["faithfulness"].append(faithfulness)

    return {k: sum(v) / len(v) for k, v in results.items()}
```
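The helpers above are placeholders. A rough sketch of what a few of them might look like, matching retrieved chunks to ground truth by content and using the LLM as a judge (the matching rule and the judging prompt are assumptions):

```python
def calculate_precision(retrieved, ground_truth_docs) -> float:
    # Fraction of retrieved chunks that appear in the ground-truth set
    truth = set(ground_truth_docs)
    hits = sum(1 for doc in retrieved if doc.page_content in truth)
    return hits / len(retrieved) if retrieved else 0.0

def calculate_recall(retrieved, ground_truth_docs) -> float:
    # Fraction of ground-truth chunks that were retrieved
    retrieved_texts = {doc.page_content for doc in retrieved}
    hits = sum(1 for text in ground_truth_docs if text in retrieved_texts)
    return hits / len(ground_truth_docs) if ground_truth_docs else 0.0

def check_faithfulness(answer: str, retrieved) -> float:
    # LLM-as-judge: is every claim in the answer supported by the context?
    context = "\n\n".join(doc.page_content for doc in retrieved)
    prompt = f"""Context:
{context}

Answer:
{answer}

Score from 0 to 1 how well every claim in the answer is supported by the context.
Return only the number."""
    return float(llm.invoke(prompt).content.strip())
```

`judge_relevance` would follow the same LLM-as-judge pattern, scoring the answer against the query instead of the retrieved context.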
Building Test Sets
Create diverse test cases:
```python
test_set = [
    {
        "query": "What is the attention mechanism?",
        "type": "factual",
        "difficulty": "easy",
    },
    {
        "query": "Compare BERT and GPT architectures",
        "type": "comparison",
        "difficulty": "medium",
    },
    {
        "query": "How would you fine-tune for low-resource NER?",
        "type": "reasoning",
        "difficulty": "hard",
    },
]
```
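To feed `evaluate_rag`, each item also needs the ground truth it is scored against. A fuller item might look like this; the field names follow the function above and the content is illustrative:

```python
complete_item = {
    "query": "What is the attention mechanism?",
    "type": "factual",
    "difficulty": "easy",
    # Ground truth consumed by evaluate_rag above
    "relevant_docs": ["Attention computes a weighted sum of value vectors ..."],
    "answer": "Attention lets the model weight each token by its relevance ...",
}
```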
Putting It All Together
A production RAG pipeline might look like:
```python
class AdvancedRAGPipeline:
    def __init__(self, vectorstore, llm, reranker):
        self.vectorstore = vectorstore
        self.llm = llm
        self.reranker = reranker

    def query(self, user_query: str) -> str:
        # 1. Query expansion
        queries = self.expand_query(user_query)

        # 2. Hybrid retrieval
        all_docs = []
        for q in queries:
            vector_docs = self.vectorstore.similarity_search(q, k=10)
            bm25_docs = self.bm25_search(q, k=10)
            all_docs.extend(vector_docs + bm25_docs)

        # 3. Deduplicate
        unique_docs = self.deduplicate(all_docs)

        # 4. Rerank
        top_docs = self.reranker.rerank(user_query, unique_docs, k=5)

        # 5. Compress context
        compressed = self.compress_context(user_query, top_docs)

        # 6. Generate
        response = self.generate(user_query, compressed)
        return response
```
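The helper methods mirror the techniques covered above; `deduplicate` is the only step not shown yet. A minimal version, written as a method of the class and using content hashing as an illustrative choice:

```python
    def deduplicate(self, docs):
        # Keep the first occurrence of each distinct chunk text
        seen, unique = set(), []
        for doc in docs:
            key = hash(doc.page_content)
            if key not in seen:
                seen.add(key)
                unique.append(doc)
        return unique
```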
Conclusion
Advanced RAG is about combining multiple techniques thoughtfully. Start with the basics, measure your results, identify failure modes, and add complexity only where it helps.
The best RAG system isn't the most sophisticated one; it's the one that reliably answers your users' questions with accurate, grounded responses.