Prompt Caching: Optimizing LLM API Costs and Latency

Jared Chung

Introduction

Every time you call an LLM API, you're paying for tokens and waiting for processing. When your prompts share common prefixes, such as system prompts, few-shot examples, or document context, you're paying repeatedly for the same computation.

Prompt caching solves this by reusing the computed representations of repeated prompt content. The result: dramatically lower costs and faster responses.

[Figure: Prompt Caching Flow]

Understanding the Cost Problem

Consider a typical RAG application:

System prompt:     ~500 tokens (same every request)
Document context: ~3000 tokens (same for related queries)
User question:      ~50 tokens (unique)
─────────────────────────────────
Total:             3550 tokens per request

In this example, roughly 98% of the tokens are identical from request to request, so the vast majority of your spend goes to recomputing the same prefix.
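
As a back-of-the-envelope check, here is the arithmetic for the request above, using illustrative rates (an assumed $3 per million input tokens, with cache reads billed at 10% of that):

# Rough per-request cost for the 3550-token breakdown above.
# Assumed rates: $3 per 1M input tokens, cached reads billed at 10% of base.
BASE_RATE = 3.00 / 1_000_000       # dollars per input token
CACHED_RATE = 0.10 * BASE_RATE     # dollars per cached input token

repeated_tokens = 500 + 3000       # system prompt + document context
unique_tokens = 50                 # the user question

without_caching = (repeated_tokens + unique_tokens) * BASE_RATE
with_caching = repeated_tokens * CACHED_RATE + unique_tokens * BASE_RATE

print(f"Without caching: ${without_caching:.5f}/request")
print(f"With caching:    ${with_caching:.5f}/request")
print(f"Savings:         {1 - with_caching / without_caching:.0%}")   # ~89%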

Real-world impact:

Scenario                           Without Caching   With Caching    Savings
RAG with 4K context                $0.06/query       $0.008/query    87%
Agent with long instructions       $0.04/call        $0.006/call     85%
Code assistant with repo context   $0.15/query       $0.02/query     87%

Anthropic Prompt Caching

Anthropic offers native prompt caching for Claude models.

How It Works

Mark content for caching with cache_control:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert Python developer...",  # Long system prompt
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "How do I use asyncio?"}
    ]
)

# Check cache usage
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")

Caching Large Documents

Perfect for RAG and document Q&A:

def query_with_cached_context(document: str, question: str):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": f"""You are a helpful assistant. Answer questions
                based on the following document:

                {document}""",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[
            {"role": "user", "content": question}
        ]
    )
    return response

# First query: Creates cache (cache_creation_input_tokens charged)
# Subsequent queries: Uses cache (cache_read_input_tokens at 10% cost)
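
For example, running two questions against the same document (legal_doc is a placeholder for any document long enough to clear the minimum cacheable size):

first = query_with_cached_context(legal_doc, "What is the termination clause?")
print(first.usage.cache_creation_input_tokens)   # > 0: the document prefix was written to the cache

second = query_with_cached_context(legal_doc, "Who are the parties to the agreement?")
print(second.usage.cache_read_input_tokens)      # > 0: the prefix was read back from the cache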

Caching Few-Shot Examples

FEW_SHOT_EXAMPLES = """
Example 1:
Input: Convert temperature from Celsius to Fahrenheit
Output: def celsius_to_fahrenheit(c): return c * 9/5 + 32

Example 2:
Input: Check if a number is prime
Output: def is_prime(n): return n > 1 and all(n % i for i in range(2, int(n**0.5) + 1))
"""

def code_generation(task: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": f"You are a Python code generator. Follow these examples:\n\n{FEW_SHOT_EXAMPLES}",
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": f"Input: {task}\nOutput:"}]
    )

Cache Behavior

  • TTL: 5 minutes of inactivity (the timer resets each time the cached prefix is read)
  • Minimum size: 1024 tokens (2048 for Claude 3.5 Haiku)
  • Pricing: Cache writes cost roughly 25% more than base input tokens; cache reads cost about 10% of base
  • Breakpoints: Up to 4 cache breakpoints per request (see the sketch below)
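
Here is a minimal sketch of multiple breakpoints; LONG_INSTRUCTIONS and REFERENCE_DOCUMENT are placeholders for your own content. Because caching works on prefixes, a later request that reuses only the instructions still hits the first breakpoint even if the document changes:

# Hypothetical prompt content; each cached block must exceed the minimum cacheable size.
LONG_INSTRUCTIONS = "You are a contract analyst. Follow these rules..."   # imagine ~2K tokens
REFERENCE_DOCUMENT = "FULL TEXT OF THE MASTER SERVICE AGREEMENT..."       # imagine ~10K tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"}   # breakpoint 1: instructions only
        },
        {
            "type": "text",
            "text": REFERENCE_DOCUMENT,
            "cache_control": {"type": "ephemeral"}   # breakpoint 2: instructions + document
        }
    ],
    messages=[{"role": "user", "content": "Summarize the termination terms."}]
)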

OpenAI Prompt Caching

OpenAI caches prompt prefixes automatically for supported models; no special parameters are required, and prompts of 1024 tokens or longer are eligible.

Automatic Caching

from openai import OpenAI

client = OpenAI()

# OpenAI automatically caches repeated prompt prefixes
system_prompt = """You are a helpful coding assistant specialized in Python.
You follow best practices and write clean, maintainable code..."""  # Long prompt

# Multiple calls with the same prefix benefit from caching
questions = ["How do I read a CSV file?", "How do I parse JSON?"]  # example queries
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ]
    )
    # Cached tokens shown in usage.prompt_tokens_details.cached_tokens

Checking Cache Usage

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

if hasattr(response.usage, 'prompt_tokens_details'):
    cached = response.usage.prompt_tokens_details.cached_tokens
    total = response.usage.prompt_tokens
    print(f"Cache hit rate: {cached/total:.1%}")

Optimizing for Cache Hits

OpenAI caches based on exact prefix matching:

# Good: Consistent prefix structure
def create_messages(context, question):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # Always same
        {"role": "user", "content": f"Context:\n{context}"},  # Same context = cached
        {"role": "user", "content": question}  # Variable part last
    ]

# Bad: Variable content breaks cache
def create_messages_bad(context, question, timestamp):
    return [
        {"role": "system", "content": f"Time: {timestamp}\n{SYSTEM_PROMPT}"},  # Timestamp breaks cache!
        {"role": "user", "content": question}
    ]
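
If you genuinely need per-request data such as a timestamp, append it to the final user message instead of the shared prefix so the prefix stays byte-identical (a sketch reusing the SYSTEM_PROMPT name from the example above):

# Variable data goes last; the system prompt and context prefix remain cacheable.
def create_messages_with_timestamp(context, question, timestamp):
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}"},
        {"role": "user", "content": f"(Current time: {timestamp})\n{question}"},
    ]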

Custom Caching Strategies

For providers without native caching, or for additional optimization.

Response Caching with Redis

Cache complete responses for identical queries:

import redis
import hashlib
import json
from functools import wraps

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def cache_response(ttl=3600):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Create cache key from arguments
            key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True)
            cache_key = f"llm:{hashlib.sha256(key_data.encode()).hexdigest()}"

            # Check cache
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Call function and cache result
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result

        return wrapper
    return decorator

@cache_response(ttl=3600)
def get_embedding(text: str) -> list:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
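
A quick usage check (this assumes a local Redis instance is running on the default port):

# The first call hits the embeddings API and stores the result in Redis;
# the second identical call is served from the cache without an API request.
vec_a = get_embedding("prompt caching lowers cost and latency")
vec_b = get_embedding("prompt caching lowers cost and latency")
assert vec_a == vec_b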

Semantic Caching

Cache based on meaning, not exact match:

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = {}  # query_embedding -> response
        self.threshold = similarity_threshold

    def _get_embedding(self, text: str) -> np.ndarray:
        return self.encoder.encode(text, normalize_embeddings=True)

    def get(self, query: str):
        query_emb = self._get_embedding(query)

        for cached_emb, response in self.cache.items():
            similarity = np.dot(query_emb, cached_emb)
            if similarity >= self.threshold:
                return response

        return None

    def set(self, query: str, response: str):
        query_emb = tuple(self._get_embedding(query))
        self.cache[query_emb] = response

# Usage
cache = SemanticCache(similarity_threshold=0.92)

def query_with_semantic_cache(question: str) -> str:
    # Check cache
    cached = cache.get(question)
    if cached:
        return cached

    # Call LLM
    response = call_llm(question)

    # Cache for similar future queries
    cache.set(question, response)
    return response

Hierarchical Caching

Combine multiple caching strategies:

class HierarchicalCache:
    def __init__(self):
        self.l1_cache = {}  # In-memory, exact match
        self.l2_cache = SemanticCache()  # Semantic similarity
        self.l3_cache = redis_client  # Persistent storage

    def get(self, query: str):
        # L1: Exact match (fastest)
        if query in self.l1_cache:
            return self.l1_cache[query]

        # L2: Semantic similarity
        semantic_result = self.l2_cache.get(query)
        if semantic_result:
            self.l1_cache[query] = semantic_result  # Promote to L1
            return semantic_result

        # L3: Persistent storage
        # Use a stable hash; the built-in hash() varies between processes
        cache_key = f"llm:{hashlib.sha256(query.encode()).hexdigest()}"
        persistent_result = self.l3_cache.get(cache_key)
        if persistent_result:
            result = json.loads(persistent_result)
            self.l1_cache[query] = result  # Promote to L1
            return result

        return None

    def set(self, query: str, response: str, ttl: int = 3600):
        self.l1_cache[query] = response
        self.l2_cache.set(query, response)
        key = f"llm:{hashlib.sha256(query.encode()).hexdigest()}"
        self.l3_cache.setex(key, ttl, json.dumps(response))
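
Tying it together (call_llm is the same placeholder used in the semantic caching example above):

hcache = HierarchicalCache()

def cached_query(question: str) -> str:
    hit = hcache.get(question)
    if hit is not None:
        return hit

    answer = call_llm(question)
    hcache.set(question, answer)
    return answer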

Caching Patterns for Common Use Cases

RAG Applications

class CachedRAG:
    def __init__(self, vectorstore, llm_client):
        self.vectorstore = vectorstore
        self.client = llm_client
        self.context_cache = {}  # document_ids -> cached context

    def query(self, question: str, k: int = 5):
        # Retrieve documents
        docs = self.vectorstore.similarity_search(question, k=k)
        doc_ids = tuple(doc.id for doc in docs)

        # Check if this exact document set is cached
        if doc_ids in self.context_cache:
            context = self.context_cache[doc_ids]
        else:
            context = "\n\n".join(doc.page_content for doc in docs)
            self.context_cache[doc_ids] = context

        # Use prompt caching for the context
        return self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system=[{
                "type": "text",
                "text": f"Answer based on this context:\n\n{context}",
                "cache_control": {"type": "ephemeral"}
            }],
            messages=[{"role": "user", "content": question}]
        )
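
A usage sketch; vectorstore is whatever retriever your application already provides:

rag = CachedRAG(vectorstore, client)

first = rag.query("What does the contract say about termination?")
followup = rag.query("And what notice period does it require?")

# If both questions retrieve the same top-k documents, the second call reuses the
# cached context block; check followup.usage.cache_read_input_tokens to confirm.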

Multi-Turn Conversations

class CachedConversation:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.messages = []

    def chat(self, user_message: str):
        self.messages.append({"role": "user", "content": user_message})

        # The system prompt prefix is cached across turns. Only content up to the
        # cache_control breakpoint is cached; to also cache the growing history,
        # add a breakpoint to the most recent message as well.
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": self.system_prompt,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=self.messages
        )

        assistant_message = response.content[0].text
        self.messages.append({"role": "assistant", "content": assistant_message})

        return assistant_message

Agent Tool Descriptions

TOOL_DESCRIPTIONS = """
Available tools:

1. search_web(query: str) -> str
   Search the internet for current information.

2. execute_python(code: str) -> str
   Execute Python code in a sandboxed environment.

3. query_database(sql: str) -> str
   Query the PostgreSQL database with read-only SQL.

... (many more tools)
"""

def agent_step(state: dict):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": f"You are an AI assistant with these tools:\n\n{TOOL_DESCRIPTIONS}",
            "cache_control": {"type": "ephemeral"}
        }],
        messages=state["messages"]
    )
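
In an agent loop, the tool description prefix is written to the cache on the first step and read back on every later step. A rough sketch:

state = {"messages": [{"role": "user", "content": "Find last quarter's revenue figures."}]}

first_step = agent_step(state)
print(first_step.usage.cache_creation_input_tokens)   # tool descriptions written to the cache

state["messages"].append({"role": "assistant", "content": first_step.content[0].text})
state["messages"].append({"role": "user", "content": "Now summarize them in one sentence."})

second_step = agent_step(state)
print(second_step.usage.cache_read_input_tokens)      # tool descriptions read from the cache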

Measuring Cache Effectiveness

class CacheMetrics:
    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.tokens_saved = 0
        self.cost_saved = 0

    def record(self, response, cached_tokens: int, total_tokens: int):
        self.total_requests += 1
        if cached_tokens > 0:
            self.cache_hits += 1
            self.tokens_saved += cached_tokens

            # Calculate cost savings (example rates)
            # Cached tokens cost 10% of regular tokens
            regular_cost = cached_tokens * 0.000003  # $3/1M tokens
            cached_cost = cached_tokens * 0.0000003  # $0.30/1M tokens
            self.cost_saved += (regular_cost - cached_cost)

    def report(self):
        hit_rate = self.cache_hits / self.total_requests if self.total_requests > 0 else 0
        return {
            "total_requests": self.total_requests,
            "cache_hit_rate": f"{hit_rate:.1%}",
            "tokens_saved": self.tokens_saved,
            "estimated_savings": f"${self.cost_saved:.2f}"
        }
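
Wiring this into the Anthropic calls from earlier might look like the following; the usage fields are the ones shown in the first example, and DOCUMENT is a placeholder for your cached context:

metrics = CacheMetrics()

def tracked_query(question: str):
    response = query_with_cached_context(DOCUMENT, question)
    usage = response.usage
    metrics.record(
        response,
        cached_tokens=usage.cache_read_input_tokens,
        # Approximate total prompt tokens: uncached input plus cached reads
        total_tokens=usage.input_tokens + usage.cache_read_input_tokens,
    )
    return response

print(metrics.report())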

Best Practices

  1. Structure prompts for caching: Put static content first, variable content last
  2. Use consistent formatting: Any difference breaks cache matches
  3. Monitor cache metrics: Track hit rates and savings
  4. Set appropriate TTLs: Balance freshness vs. cache efficiency
  5. Warm the cache: Pre-populate the cache for common queries during low-traffic periods (see the sketch after this list)
  6. Version your prompts: When prompts change, cache naturally refreshes
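
Practice 5 can be as simple as replaying a handful of common queries against the cached-context helper from earlier; COMMON_QUESTIONS and DOCUMENT below are placeholders for your own data:

# Run during low-traffic periods so the first real user request is already a cache hit.
COMMON_QUESTIONS = [
    "What is the refund policy?",
    "How do I reset my password?",
]

def warm_cache():
    for question in COMMON_QUESTIONS:
        # Each call writes (or refreshes) the cached document prefix and can also
        # pre-populate any response or semantic cache layered on top.
        query_with_cached_context(DOCUMENT, question)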

Conclusion

Prompt caching is one of the highest-ROI optimizations for LLM applications. With native support from major providers and straightforward custom implementations, there's no reason not to implement caching.

Start by identifying your repeated content (system prompts, context documents, few-shot examples), then structure your prompts to maximize cache hits. The savings in cost and latency compound with every request.