Prompt Engineering: Advanced Techniques for Better LLM Outputs
Author: Jared Chung
Introduction
Prompt engineering is the art and science of crafting inputs that elicit the best possible outputs from Large Language Models. As LLMs become more capable, the skill of prompting becomes increasingly valuable. In this post, we'll cover advanced techniques that consistently produce better results with complete, runnable code examples.
Prerequisites
pip install openai pydantic tenacity tiktoken
# Setup for all examples
from openai import OpenAI
import json
from typing import List, Dict, Optional
client = OpenAI() # Uses OPENAI_API_KEY env var
def chat(messages: List[Dict], model: str = "gpt-4o-mini", **kwargs) -> str:
"""Helper function for chat completions."""
response = client.chat.completions.create(
model=model,
messages=messages,
**kwargs
)
return response.choices[0].message.content
Why Prompt Engineering Matters
| Approach | Quality | Cost | Effort |
|---|---|---|---|
| Bad prompts | Low | High (retries) | Low |
| Good prompts | High | Low | Medium |
| Fine-tuning | Highest | Very High | Very High |
For many tasks, well-crafted prompts can match or even exceed fine-tuned model performance at a fraction of the cost and effort.
The Anatomy of a Good Prompt
A well-structured prompt typically includes the following components (a sketch combining them follows the list):
- System context: Who is the AI and what's its role?
- Task description: What should it do?
- Input data: What information to work with?
- Output format: How should the response be structured?
- Constraints: What to avoid or ensure?
- Examples: Demonstrations of desired behavior
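Putting these together, a minimal sketch of a complete prompt might look like this (the support-ticket task, JSON keys, and example message are illustrative; it reuses the chat helper from Prerequisites):
anatomy_example = [
    {
        "role": "system",
        "content": (
            "You are a support assistant for an e-commerce company. "          # system context
            "Classify the customer message and draft a one-sentence reply. "   # task description
            'Respond as JSON: {"category": "...", "reply": "..."}. '           # output format
            "Never promise refunds or quote internal policy. "                 # constraints
            'Example: "Where is my order?" -> {"category": "shipping", ...}'   # example
        ),
    },
    {"role": "user", "content": "My package arrived damaged."},  # input data
]
print(chat(anatomy_example))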
Technique 1: Zero-Shot Prompting
Direct instruction without examples. Works well for simple, clear tasks.
from openai import OpenAI
client = OpenAI()
def zero_shot(task: str, input_text: str) -> str:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": task},
{"role": "user", "content": input_text}
]
)
return response.choices[0].message.content
# Example
result = zero_shot(
task="You are a sentiment analyzer. Classify the sentiment as positive, negative, or neutral.",
input_text="The product exceeded my expectations! Highly recommend."
)
print(result) # "positive"
Technique 2: Few-Shot Learning
Provide examples to guide the model's behavior.
def few_shot_classify(text: str) -> str:
messages = [
{"role": "system", "content": "Classify customer feedback into categories."},
{"role": "user", "content": "The delivery was 3 days late."},
{"role": "assistant", "content": "Category: Shipping"},
{"role": "user", "content": "Your app crashes every time I try to checkout."},
{"role": "assistant", "content": "Category: Technical Issue"},
{"role": "user", "content": "The customer service rep was very helpful!"},
{"role": "assistant", "content": "Category: Customer Service"},
{"role": "user", "content": text}
]
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages
)
return response.choices[0].message.content
result = few_shot_classify("I was charged twice for the same item")
print(result) # "Category: Billing"
How Many Examples?
| Task Complexity | Recommended Examples |
|---|---|
| Simple classification | 2-3 |
| Complex reasoning | 4-6 |
| Creative writing | 1-2 (style examples) |
| Data extraction | 3-5 |
Technique 3: Chain-of-Thought (CoT)
Encourage step-by-step reasoning for complex problems.
def chain_of_thought(problem: str) -> str:
prompt = f"""Solve this problem step by step. Show your reasoning before giving the final answer.
Problem: {problem}
Solution:
Step 1:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Example
problem = """
A store sells apples for $0.50 each and oranges for $0.75 each.
If Sarah buys 6 apples and 4 oranges, and pays with a $10 bill,
how much change does she receive?
"""
print(chain_of_thought(problem))
Output:
Step 1: Calculate the cost of apples
6 apples × $0.50 = $3.00
Step 2: Calculate the cost of oranges
4 oranges × $0.75 = $3.00
Step 3: Calculate total cost
$3.00 + $3.00 = $6.00
Step 4: Calculate change
$10.00 - $6.00 = $4.00
Final Answer: Sarah receives $4.00 in change.
Zero-Shot CoT
Simply adding "Let's think step by step" can improve reasoning:
def zero_shot_cot(question: str) -> str:
prompt = f"{question}\n\nLet's think step by step."
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
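A usage sketch (the word problem is illustrative):
# Illustrative usage: a short word problem where the step-by-step nudge helps
print(zero_shot_cot(
    "A bakery sells muffins in boxes of 6. It baked 75 muffins. "
    "How many full boxes can it sell, and how many muffins are left over?"
))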
Technique 4: Structured Outputs
Force consistent, parseable responses.
Using JSON Mode
import json
def extract_entities_json(text: str) -> dict:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": """Extract entities from the text. Return JSON with this structure:
{
"people": ["name1", "name2"],
"organizations": ["org1"],
"locations": ["loc1"],
"dates": ["date1"]
}"""
},
{"role": "user", "content": text}
],
response_format={"type": "json_object"}
)
return json.loads(response.choices[0].message.content)
result = extract_entities_json(
"Apple CEO Tim Cook announced new products at the Cupertino event on March 15, 2024."
)
print(json.dumps(result, indent=2))
Using Pydantic for Validation
from pydantic import BaseModel
from typing import List, Optional
class ProductReview(BaseModel):
sentiment: str
rating: int
pros: List[str]
cons: List[str]
summary: str
def analyze_review(review_text: str) -> ProductReview:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"""Analyze the product review and return JSON matching this schema:
{ProductReview.model_json_schema()}"""
},
{"role": "user", "content": review_text}
],
response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
return ProductReview(**data)
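A usage sketch (the review text is made up; in practice you may also want to catch pydantic's ValidationError in case the model's JSON drifts from the schema):
review = analyze_review(
    "Battery life is great and the screen is sharp, but the keyboard feels mushy. 4/5 overall."
)
print(review.sentiment, review.rating)  # parsed, validated fields
print("Pros:", review.pros)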
Technique 5: Role Prompting
Assign a specific persona to the model.
def expert_analysis(topic: str, question: str) -> str:
personas = {
"security": "You are a senior cybersecurity expert with 20 years of experience at major tech companies.",
"legal": "You are a corporate lawyer specializing in technology and intellectual property law.",
"finance": "You are a CFA charterholder and senior financial analyst at an investment bank.",
"medical": "You are a board-certified physician with expertise in internal medicine."
}
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": personas.get(topic, "You are a knowledgeable expert.")},
{"role": "user", "content": question}
]
)
return response.choices[0].message.content
result = expert_analysis(
"security",
"What are the main vulnerabilities in JWT authentication?"
)
Technique 6: Self-Consistency
Generate multiple responses and aggregate for better accuracy.
from collections import Counter
def self_consistent_answer(question: str, n_samples: int = 5) -> str:
"""Generate multiple answers and return the most common one."""
answers = []
for _ in range(n_samples):
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": f"{question}\n\nThink step by step, then provide your final answer on a new line starting with 'ANSWER:'"}
],
temperature=0.7 # Add some variance
)
# Extract final answer
content = response.choices[0].message.content
if "ANSWER:" in content:
answer = content.split("ANSWER:")[-1].strip()
answers.append(answer)
# Return the most common answer; fail loudly if no 'ANSWER:' line was found
counter = Counter(answers)
if not counter:
    raise ValueError("No sample produced an 'ANSWER:' line")
return counter.most_common(1)[0][0]
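Usage sketch (the question is illustrative; majority voting works best when the final answer is short and directly comparable, such as a number or a label):
# Illustrative usage: short, comparable answers vote well
answer = self_consistent_answer(
    "A jacket costs $80 after a 20% discount. What was the original price?",
    n_samples=5
)
print(answer)  # most samples should converge on $100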
Technique 7: Prompt Chaining
Break complex tasks into a pipeline of simpler prompts.
def research_and_summarize(topic: str) -> dict:
"""Multi-step prompt chain for research."""
# Step 1: Generate research questions
questions_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": f"Generate 5 key research questions about: {topic}"}
]
)
questions = questions_response.choices[0].message.content
# Step 2: Answer each question
answers_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": f"Answer these questions concisely:\n\n{questions}"}
]
)
answers = answers_response.choices[0].message.content
# Step 3: Synthesize into summary
summary_response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "user", "content": f"Synthesize this research into a coherent 3-paragraph summary:\n\n{answers}"}
]
)
summary = summary_response.choices[0].message.content
return {
"questions": questions,
"answers": answers,
"summary": summary
}
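Usage sketch (the topic is illustrative):
# Illustrative usage: each step's output feeds the next prompt in the chain
report = research_and_summarize("retrieval-augmented generation for enterprise search")
print(report["questions"])
print(report["summary"])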
Technique 8: Constrained Generation
Set clear boundaries and rules.
def generate_with_constraints(task: str, constraints: list[str]) -> str:
constraints_text = "\n".join(f"- {c}" for c in constraints)
prompt = f"""Task: {task}
You MUST follow these constraints:
{constraints_text}
Generate your response:"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}]
)
return response.choices[0].message.content
# Example
result = generate_with_constraints(
task="Write a product description for a laptop",
constraints=[
"Maximum 100 words",
"Include exactly 3 bullet points",
"Do not mention price",
"Focus on productivity features",
"Use professional tone"
]
)
Common Pitfalls
1. Vague Instructions
# Bad
"Summarize this article"
# Good
"Summarize this article in exactly 3 bullet points, each under 20 words, focusing on key findings."
2. Missing Output Format
# Bad
"Extract the dates from this text"
# Good
"Extract all dates from this text. Return as a JSON array in ISO format: ['YYYY-MM-DD', ...]"
3. Ambiguous Context
# Bad
"Fix the code"
# Good
"Fix the Python code below. The issue is: IndexError on line 15. Explain what caused it and provide the corrected code."
Temperature and Sampling
| Use Case | Temperature | Top-p |
|---|---|---|
| Code generation | 0.0-0.2 | 0.1 |
| Factual Q&A | 0.0-0.3 | 0.1 |
| Creative writing | 0.7-1.0 | 0.9 |
| Brainstorming | 0.8-1.2 | 0.95 |
# Deterministic output
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[...],
temperature=0,
seed=42 # For reproducibility
)
# Creative output
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[...],
temperature=0.9,
top_p=0.95
)
Technique 9: Tree of Thoughts (ToT)
For complex problems, explore multiple reasoning paths:
def tree_of_thoughts(problem: str, n_thoughts: int = 3) -> str:
"""Explore multiple reasoning paths and select the best."""
# Step 1: Generate multiple initial thoughts
thoughts_prompt = f"""Problem: {problem}
Generate {n_thoughts} different approaches to solve this problem.
For each approach, explain the reasoning briefly.
Format:
Approach 1: [description]
Approach 2: [description]
Approach 3: [description]"""
thoughts = chat([{"role": "user", "content": thoughts_prompt}])
# Step 2: Evaluate each approach
eval_prompt = f"""Problem: {problem}
Possible approaches:
{thoughts}
For each approach, rate its likelihood of success (1-10) and explain why.
Then select the best approach and solve the problem using it.
Format:
Evaluation:
- Approach 1: [score] - [reason]
- Approach 2: [score] - [reason]
- Approach 3: [score] - [reason]
Best approach: [number]
Solution using best approach:
[detailed solution]"""
result = chat([{"role": "user", "content": eval_prompt}])
return result
# Example: Complex reasoning problem
problem = """
A farmer needs to cross a river with a wolf, a goat, and a cabbage.
The boat can only carry the farmer and one item at a time.
If left alone, the wolf will eat the goat, and the goat will eat the cabbage.
How can the farmer get everything across safely?
"""
print(tree_of_thoughts(problem))
Technique 10: ReAct (Reasoning + Acting)
Combine reasoning with tool use for complex tasks:
def react_agent(question: str, tools: Dict[str, callable], max_steps: int = 5) -> str:
"""ReAct-style reasoning with tool use."""
tool_descriptions = "\n".join([
f"- {name}: {func.__doc__}"
for name, func in tools.items()
])
system_prompt = f"""You are a helpful assistant that answers questions by reasoning and using tools.
Available tools:
{tool_descriptions}
For each step, use this format:
Thought: [your reasoning about what to do next]
Action: [tool_name]
Action Input: [input to the tool]
After receiving tool output, continue reasoning.
When you have the final answer, respond with:
Thought: I now have enough information
Final Answer: [your answer]"""
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": question}
]
for step in range(max_steps):
response = chat(messages, temperature=0)
messages.append({"role": "assistant", "content": response})
# Check for final answer
if "Final Answer:" in response:
return response.split("Final Answer:")[-1].strip()
# Parse action
if "Action:" in response and "Action Input:" in response:
action_line = [l for l in response.split("\n") if l.startswith("Action:")][0]
input_line = [l for l in response.split("\n") if l.startswith("Action Input:")][0]
action = action_line.replace("Action:", "").strip()
action_input = input_line.replace("Action Input:", "").strip()
if action in tools:
try:
result = tools[action](action_input)
messages.append({
"role": "user",
"content": f"Observation: {result}"
})
except Exception as e:
messages.append({
"role": "user",
"content": f"Error: {str(e)}"
})
return "Max steps reached without conclusion"
# Example tools
def calculator(expression: str) -> str:
"""Evaluates mathematical expressions. Input: math expression like '2 + 2'"""
try:
return str(eval(expression)) # Note: use safer eval in production
except Exception:
return "Invalid expression"
def search(query: str) -> str:
"""Searches for information. Input: search query"""
# Simulated search results
data = {
"population of france": "67 million (2023)",
"capital of japan": "Tokyo",
"speed of light": "299,792,458 meters per second"
}
for key, value in data.items():
if key in query.lower():
return value
return "No results found"
# Usage
result = react_agent(
"What is the population of France divided by 10?",
tools={"calculator": calculator, "search": search}
)
print(result)
Technique 11: Meta-Prompting
Use the LLM to generate better prompts:
def generate_optimized_prompt(task_description: str, examples: Optional[List[str]] = None) -> str:
"""Use LLM to generate an optimized prompt for a task."""
meta_prompt = f"""You are an expert prompt engineer. Generate an optimal prompt for the following task.
Task: {task_description}
{"Example inputs that the prompt should handle well:" + chr(10) + chr(10).join(f"- {ex}" for ex in examples) if examples else ""}
Requirements for the prompt:
1. Be specific and unambiguous
2. Include output format instructions
3. Add relevant constraints
4. Include 2-3 few-shot examples if helpful
5. Use chain-of-thought if the task requires reasoning
Generate the complete prompt:"""
return chat([{"role": "user", "content": meta_prompt}])
# Example
task = "Classify customer support tickets into categories and priority levels"
examples = [
"My account was charged twice for the same order",
"How do I reset my password?",
"Your service has been down for 3 hours and we're losing money"
]
optimized_prompt = generate_optimized_prompt(task, examples)
print(optimized_prompt)
Technique 12: Prompt Templates
Create reusable, parameterized prompts:
from dataclasses import dataclass
from string import Template
from typing import Any
@dataclass
class PromptTemplate:
"""Reusable prompt template with validation."""
template: str
required_vars: List[str]
system_prompt: Optional[str] = None
def format(self, **kwargs) -> List[Dict]:
"""Format the template with provided variables."""
# Validate required variables
missing = set(self.required_vars) - set(kwargs.keys())
if missing:
raise ValueError(f"Missing required variables: {missing}")
# Format template
formatted = Template(self.template).safe_substitute(**kwargs)
messages = []
if self.system_prompt:
messages.append({"role": "system", "content": self.system_prompt})
messages.append({"role": "user", "content": formatted})
return messages
# Define reusable templates
CODE_REVIEW_TEMPLATE = PromptTemplate(
template="""Review this $language code for:
1. Bugs and potential issues
2. Performance improvements
3. Code style and best practices
Code:
```$language
$code
```

Provide your review in this format:
Issues Found
[list issues with severity: critical/warning/info]
Suggested Improvements
[list improvements]
Overall Assessment
[brief summary]""",
    required_vars=["language", "code"],
    system_prompt="You are an expert code reviewer with deep knowledge of software engineering best practices."
)
SUMMARIZE_TEMPLATE = PromptTemplate(
    template="""Summarize the following $content_type in $length.
Content: $content
Focus on: $focus_areas""",
    required_vars=["content_type", "length", "content", "focus_areas"]
)
Usage
messages = CODE_REVIEW_TEMPLATE.format(
    language="python",
    code="""
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n)
"""
)
review = chat(messages)
print(review)
Evaluation and Testing
A/B Testing Prompts
import random
from dataclasses import dataclass
from typing import Callable
@dataclass
class PromptExperiment:
"""A/B test different prompts."""
name: str
prompts: Dict[str, str]
evaluator: Callable[[str, str], float] # (input, output) -> score
def run(self, test_inputs: List[str], n_runs: int = 3) -> Dict:
results = {name: [] for name in self.prompts}
for input_text in test_inputs:
for _ in range(n_runs):
# Randomly select prompt variant
variant = random.choice(list(self.prompts.keys()))
prompt = self.prompts[variant]
# Run prompt
messages = [{"role": "user", "content": prompt.format(input=input_text)}]
output = chat(messages)
# Evaluate
score = self.evaluator(input_text, output)
results[variant].append(score)
# Calculate statistics
stats = {}
for variant, scores in results.items():
stats[variant] = {
"mean": sum(scores) / len(scores),
"min": min(scores),
"max": max(scores),
"n": len(scores)
}
return stats
# Example: Test different summarization prompts
def length_evaluator(input_text: str, output: str) -> float:
"""Score based on how close to 50 words the summary is."""
word_count = len(output.split())
target = 50
return 1.0 - min(abs(word_count - target) / target, 1.0)
experiment = PromptExperiment(
name="summarization_test",
prompts={
"simple": "Summarize this in about 50 words: {input}",
"detailed": "Create a concise summary of approximately 50 words. Focus on key points only: {input}",
"structured": "Summarize in exactly 50 words (+/- 5). Include: main topic, key finding, conclusion. Text: {input}"
},
evaluator=length_evaluator
)
# Run experiment
# results = experiment.run(test_articles, n_runs=5)
# print(json.dumps(results, indent=2))
Regression Testing
from dataclasses import dataclass
@dataclass
class PromptTestCase:
"""A test case for prompt regression testing."""
name: str
prompt: str
expected_contains: List[str] # Output should contain these
expected_not_contains: Optional[List[str]] = None # Output should NOT contain these
def run_prompt_tests(test_cases: List[PromptTestCase]) -> Dict:
"""Run regression tests on prompts."""
results = {"passed": [], "failed": []}
for test in test_cases:
messages = [{"role": "user", "content": test.prompt}]
output = chat(messages, temperature=0) # Deterministic
# Check expected content
passed = True
failures = []
for expected in test.expected_contains:
if expected.lower() not in output.lower():
passed = False
failures.append(f"Missing: {expected}")
if test.expected_not_contains:
for forbidden in test.expected_not_contains:
if forbidden.lower() in output.lower():
passed = False
failures.append(f"Contains forbidden: {forbidden}")
if passed:
results["passed"].append(test.name)
else:
results["failed"].append({"name": test.name, "failures": failures, "output": output[:200]})
return results
# Example test suite
test_cases = [
PromptTestCase(
name="sentiment_positive",
prompt="Classify the sentiment of 'I love this product!' as positive, negative, or neutral. Reply with just the sentiment.",
expected_contains=["positive"],
expected_not_contains=["negative"]
),
PromptTestCase(
name="json_format",
prompt='Return a JSON object with keys "name" and "age" for a 25 year old named John.',
expected_contains=['"name"', '"age"', "John", "25"]
),
]
# results = run_prompt_tests(test_cases)
# print(results)
Production Best Practices
1. Retry with Exponential Backoff
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=60)
)
def robust_chat(messages: List[Dict], **kwargs) -> str:
"""Chat completion with automatic retry."""
return chat(messages, **kwargs)
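As written, this retries on any exception. A sketch that limits retries to transient failures, assuming the v1 openai SDK's exception classes:
from tenacity import retry_if_exception_type
import openai

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    retry=retry_if_exception_type(
        (openai.RateLimitError, openai.APITimeoutError, openai.APIConnectionError)
    )
)
def robust_chat_transient(messages: List[Dict], **kwargs) -> str:
    """Retry only on rate limits, timeouts, and connection errors."""
    return chat(messages, **kwargs)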
2. Cost Tracking
import tiktoken
def estimate_cost(messages: List[Dict], model: str = "gpt-4o-mini") -> Dict:
"""Estimate API call cost."""
encoding = tiktoken.encoding_for_model(model)
input_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
# Approximate pricing (check OpenAI for current rates)
pricing = {
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}, # per 1K tokens
"gpt-4o": {"input": 0.005, "output": 0.015},
}
rates = pricing.get(model, pricing["gpt-4o-mini"])
estimated_cost = (input_tokens / 1000) * rates["input"]  # input-side cost only; output tokens aren't known until after the call
return {
"input_tokens": input_tokens,
"estimated_cost_usd": round(estimated_cost, 6),
"model": model
}
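Usage sketch (the per-token prices above are placeholders; check OpenAI's pricing page for current rates):
# Illustrative usage with the same message format as the chat helper
messages = [
    {"role": "system", "content": "You are a sentiment analyzer."},
    {"role": "user", "content": "The product exceeded my expectations!"},
]
print(estimate_cost(messages, model="gpt-4o-mini"))
# e.g. {'input_tokens': ..., 'estimated_cost_usd': ..., 'model': 'gpt-4o-mini'}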
3. Prompt Versioning
from datetime import datetime
import hashlib
class PromptRegistry:
"""Version and track prompts."""
def __init__(self):
self.prompts = {}
self.history = []
def register(self, name: str, prompt: str, metadata: Dict = None) -> str:
"""Register a prompt and return its version hash."""
version = hashlib.md5(prompt.encode()).hexdigest()[:8]
self.prompts[name] = {
"prompt": prompt,
"version": version,
"metadata": metadata or {},
"registered_at": datetime.now().isoformat()
}
self.history.append({
"name": name,
"version": version,
"timestamp": datetime.now().isoformat()
})
return version
def get(self, name: str) -> str:
"""Get a registered prompt."""
if name not in self.prompts:
raise KeyError(f"Prompt '{name}' not found")
return self.prompts[name]["prompt"]
def get_version(self, name: str) -> str:
"""Get the version hash of a prompt."""
return self.prompts[name]["version"]
# Usage
registry = PromptRegistry()
registry.register(
"sentiment_classifier",
"Classify the sentiment as positive, negative, or neutral: {text}",
metadata={"author": "team", "task": "classification"}
)
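Retrieving a registered prompt later, and recording which version produced a given output (sketch):
# Look up the prompt and its version hash at call time
template = registry.get("sentiment_classifier")
version = registry.get_version("sentiment_classifier")
output = chat([{"role": "user", "content": template.format(text="The product exceeded my expectations!")}])
print(version, output)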
Conclusion
Prompt engineering is about clear communication with LLMs. Key takeaways:
- Be specific: Vague prompts get vague responses
- Show examples: Few-shot learning is powerful for consistent outputs
- Encourage reasoning: Chain-of-Thought and Tree of Thoughts prompting improve accuracy
- Structure output: JSON and schemas ensure parseable, consistent responses
- Test systematically: Treat prompts like code with regression tests
- Version and track: Log prompt versions for reproducibility
- Iterate: The best prompts evolve through experimentation
The best prompt engineers treat prompts like code - version controlled, tested, and continuously improved.
References
- Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022)
- Wang et al. "Self-Consistency Improves Chain of Thought Reasoning" (2023)
- Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (2023)
- Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models" (2023)
- OpenAI Prompt Engineering Guide: https://platform.openai.com/docs/guides/prompt-engineering
- Anthropic Prompt Engineering: https://docs.anthropic.com/claude/docs/prompt-engineering