Prompt Engineering: Getting Better Results from LLMs
By Jared Chung
Introduction
Prompt engineering is the practice of crafting inputs that reliably produce the outputs you need from Large Language Models. As LLMs become central to applications ranging from chatbots to code generation, the ability to communicate effectively with these models becomes a valuable skill.
The difference between a mediocre and an excellent prompt can mean the difference between:
- Inconsistent results vs. reliable, reproducible outputs
- Multiple retry attempts vs. first-time success
- Unparseable text vs. structured data ready for your application
This guide covers the fundamental techniques that work across all major LLMs.
The Core Techniques
Understanding the Spectrum
Prompt engineering techniques exist on a spectrum from simple to complex:
| Technique | When to Use | Token Cost | Best For |
|---|---|---|---|
| Zero-Shot | Clear, simple tasks | Low | Classification, extraction |
| Few-Shot | Custom formats or categories | Medium | Domain-specific tasks |
| Chain-of-Thought | Multi-step reasoning | Higher | Math, logic, analysis |
| Structured Output | Application integration | Medium | APIs, data pipelines |
The key insight: start simple and add complexity only when needed. Zero-shot prompts work surprisingly well for many tasks, and you should only escalate to more sophisticated techniques when simpler approaches fail.
Zero-Shot Prompting
Zero-shot prompting asks the model to perform a task without providing any examples. The model relies entirely on its training to understand what you want.
When Zero-Shot Works Well
Zero-shot is effective when:
- The task is unambiguous (sentiment classification, summarization)
- The output format is natural (text, simple labels)
- The domain is general (not industry-specific jargon)
Anatomy of a Good Zero-Shot Prompt
A well-structured prompt includes:
[Role/Context] - Who is the AI?
[Task] - What should it do?
[Input] - What to process?
[Format] - How to structure output?
[Constraints] - What to avoid?
Example:
prompt = """You are a customer service classifier.
Classify the following customer message into exactly one category:
- billing
- technical
- shipping
- general
Customer message: "I was charged twice for my subscription"
Respond with only the category name, nothing else."""
The explicit constraint ("Respond with only the category name") prevents verbose explanations that make parsing difficult.
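As a minimal sketch of running this prompt, assuming the OpenAI Python SDK (the same `client.chat.completions.create` interface used in the structured-output examples later in this guide):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output suits classification
)

category = response.choices[0].message.content.strip().lower()
print(category)  # expected: "billing"
```
Setting temperature to 0 keeps repeated runs consistent, which matters when the label feeds directly into application logic.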
Common Zero-Shot Mistakes
- Too vague: "Summarize this" → Better: "Summarize in 3 bullet points under 15 words each"
- No format guidance: "Extract the dates" → Better: "Extract dates as ISO format: YYYY-MM-DD"
- Ambiguous scope: "Fix the code" → Better: "Fix the IndexError on line 15 and explain the cause"
Few-Shot Learning
When zero-shot produces inconsistent results, few-shot learning provides examples that demonstrate the desired behavior. The model learns the pattern from your examples and applies it to new inputs.
How Examples Guide the Model
Few-shot works through pattern recognition. The model identifies:
- Input structure: What kind of data am I receiving?
- Output format: How should I structure my response?
- Decision logic: What reasoning connects input to output?
Example: Custom Classification
```python
messages = [
    {"role": "system", "content": "Classify support tickets."},
    {"role": "user", "content": "My payment failed three times"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The app crashes on startup"},
    {"role": "assistant", "content": "technical"},
    {"role": "user", "content": "Package hasn't arrived in 2 weeks"},
    {"role": "assistant", "content": "shipping"},
    {"role": "user", "content": "I was charged in wrong currency"},
]
# Model learns pattern → outputs "billing"
```
How Many Examples?
| Task Complexity | Recommended | Why |
|---|---|---|
| Simple classification | 2-3 | Pattern is obvious |
| Custom categories | 3-5 | Need to show boundaries |
| Complex reasoning | 4-6 | Multiple steps to demonstrate |
| Creative/style | 1-2 | Showing tone, not logic |
More examples aren't always better—they consume tokens and can cause the model to overfit to specific patterns rather than generalizing.
Example Selection Matters
Choose examples that:
- Cover edge cases: Include borderline cases that define category boundaries
- Are diverse: Don't repeat similar examples
- Match expected inputs: Use realistic data similar to production
- Demonstrate the hard cases: Easy cases don't teach the model much
Chain-of-Thought Prompting
Chain-of-thought (CoT) prompting encourages the model to reason step-by-step before providing an answer. This dramatically improves performance on tasks requiring:
- Mathematical calculations
- Logical reasoning
- Multi-step analysis
- Complex decision-making
Why Step-by-Step Reasoning Helps
LLMs generate text token by token. Without explicit reasoning:
- The model might jump to a conclusion before considering all factors
- Errors in early reasoning steps compound without correction
- The model can't "backtrack" once tokens are generated
Chain-of-thought forces the model to show its work, which:
- Surfaces reasoning errors that can be caught
- Breaks complex problems into manageable steps
- Grounds the final answer in explicit logic
Zero-Shot CoT: The Magic Phrase
Simply adding "Let's think step by step" to your prompt triggers reasoning:
```python
# Without CoT - often fails on complex math
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?"""

# With CoT - much higher accuracy
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step."""
```
The model will then work through:
- 15% of 80 = 12 items sold Monday
- 80 - 12 = 68 items remaining
- 20% of 68 = 13.6 ≈ 14 items sold Tuesday
- 68 - 14 = 54 items left
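Putting this into practice, here is a sketch using the same OpenAI client as in the zero-shot example; asking for a final "Answer:" line is just one convenient convention (an assumption, not a standard) for extracting the number after the reasoning:
```python
import re

from openai import OpenAI

client = OpenAI()

cot_prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step, then give the final count on a line starting with "Answer:"."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0,
)

reasoning = response.choices[0].message.content
match = re.search(r"Answer:\s*(\d+)", reasoning)  # pull the final number out of the reasoning text
answer = int(match.group(1)) if match else None
print(answer)  # with the rounding shown above, this is 54
```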
When to Use Chain-of-Thought
| Use CoT | Avoid CoT |
|---|---|
| Math problems | Simple classification |
| Logic puzzles | Direct extraction |
| Multi-step analysis | Summarization |
| Decisions with tradeoffs | Translation |
| Debugging/troubleshooting | Formatting tasks |
CoT adds latency and cost. For simple tasks, it's unnecessary overhead.
Structured Output
For applications that consume LLM outputs, unstructured text is problematic. Structured output techniques ensure responses follow a predictable format.
JSON Mode
Most modern APIs support JSON mode, which constrains the model to output valid JSON:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": """Extract entities from text. Return JSON:
{
  "people": ["name1", "name2"],
  "organizations": ["org1"],
  "locations": ["loc1"],
  "dates": ["YYYY-MM-DD"]
}""",
        },
        {"role": "user", "content": text},
    ],
    response_format={"type": "json_object"},
)
```
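JSON mode guarantees syntactically valid JSON, but not that your exact keys are present, so it is still worth parsing defensively; a brief sketch continuing from the response above:
```python
import json

data = json.loads(response.choices[0].message.content)

# Treat every key as optional rather than assuming the model followed the template exactly
people = data.get("people", [])
dates = data.get("dates", [])
```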
Schema Enforcement with Pydantic
For type safety and validation, define schemas that the model must follow:
```python
from pydantic import BaseModel
from typing import List

class ExtractedData(BaseModel):
    sentiment: str        # positive, negative, neutral
    confidence: float     # 0.0 to 1.0
    key_topics: List[str]
    summary: str

# Include schema in prompt
schema_json = ExtractedData.model_json_schema()
```
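A sketch of wiring the schema into a request and validating the reply, assuming the same OpenAI client as in the JSON-mode example (the prompt wording and sample input are illustrative):
```python
from openai import OpenAI
from pydantic import ValidationError

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": f"Analyze the user's text. Respond with JSON matching this schema:\n{schema_json}",
        },
        {"role": "user", "content": "The new release is fast, but setup was confusing."},
    ],
    response_format={"type": "json_object"},
)

try:
    result = ExtractedData.model_validate_json(response.choices[0].message.content)
except ValidationError as err:
    # Retry, fall back to defaults, or log here so schema drift is caught before it propagates
    print(err)
```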
This gives you:
- Automatic validation of LLM output
- Type hints in your IDE
- Clear documentation of expected format
Best Practices for Structured Output
- Always provide the schema in the prompt - Don't assume the model knows your format
- Use simple types - Arrays, objects, strings, numbers work reliably
- Include example output - Show exactly what valid JSON looks like
- Validate before using - Even with JSON mode, validate against your schema
Advanced Techniques
Self-Consistency
For high-stakes decisions, generate multiple responses and aggregate:
- Run the same prompt 3-5 times with temperature > 0
- Extract the final answer from each response
- Take the majority vote
This reduces single-run errors and provides a confidence signal (how often did answers agree?).
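A minimal sketch of self-consistency for the arithmetic problem above, assuming the OpenAI client and the `cot_prompt` string from the chain-of-thought sketch; `n=5` requests five samples in one call and a simple majority vote picks the answer:
```python
import re
from collections import Counter

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": cot_prompt}],  # CoT prompt defined earlier
    temperature=0.7,  # sampling diversity is what makes the votes informative
    n=5,              # five independent completions
)

answers = []
for choice in response.choices:
    match = re.search(r"Answer:\s*(\d+)", choice.message.content)
    if match:
        answers.append(int(match.group(1)))

if answers:
    winner, votes = Counter(answers).most_common(1)[0]
    print(winner, votes / len(answers))  # agreement rate doubles as a rough confidence signal
```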
Prompt Chaining
Complex tasks often work better as a sequence of simpler prompts:
Task: Research report on a topic
Chain:
1. Generate research questions → questions
2. Answer each question → raw_findings
3. Identify themes → themes
4. Synthesize into report → final_report
Each step has a focused task, making debugging easier and quality higher.
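A sketch of that chain as a plain sequence of calls; the `ask` helper and the prompts are illustrative, not a fixed recipe:
```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Single-turn helper: send one prompt, return the text reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

topic = "how retrieval-augmented generation affects chatbot accuracy"

questions = ask(f"List 5 focused research questions about {topic}.")
raw_findings = ask(f"Answer each of these questions in 2-3 sentences:\n{questions}")
themes = ask(f"Identify the 3 main themes in these findings:\n{raw_findings}")
final_report = ask(
    f"Write a short report on {topic}, organized around these themes:\n{themes}\n\nFindings:\n{raw_findings}"
)
```
Because each intermediate result is a plain string, you can log, inspect, or swap out individual steps without touching the rest of the chain.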
Role Prompting
Assigning a specific persona influences the model's vocabulary, depth, and perspective:
```python
personas = {
    "expert": "You are a senior software architect with 20 years experience.",
    "beginner": "You are explaining to someone new to programming.",
    "skeptic": "You are a critical reviewer looking for flaws.",
}
```
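The persona typically goes in the system message; a brief sketch, assuming the client from the earlier examples:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": personas["expert"]},
        {"role": "user", "content": "Should a three-person startup adopt microservices?"},
    ],
)
```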
Role prompting is especially useful for:
- Technical depth (expert roles)
- Accessibility (teacher roles)
- Quality assurance (reviewer roles)
Common Patterns
The RISEN Framework
A structured approach to prompt construction:
- Role: Who is the AI?
- Instructions: What should it do?
- Situation: What's the context?
- Examples: Demonstrations of desired behavior
- Narrowing: Constraints and format
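One way to turn the framework into a reusable template; the helper below is a hypothetical sketch, and its field names simply mirror the five RISEN parts:
```python
def risen_prompt(role: str, instructions: str, situation: str, examples: str, narrowing: str) -> str:
    """Assemble a prompt following the RISEN structure."""
    return (
        f"{role}\n\n"
        f"{instructions}\n\n"
        f"Context: {situation}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Constraints:\n{narrowing}"
    )

prompt = risen_prompt(
    role="You are a support triage assistant.",
    instructions="Classify the incoming message as billing, technical, shipping, or general.",
    situation="Messages come from customers of a consumer subscription product.",
    examples='"My card was declined" -> billing\n"The app will not open" -> technical',
    narrowing="Respond with only the category name.",
)
```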
Temperature and Sampling
| Use Case | Temperature | Why |
|---|---|---|
| Code generation | 0.0-0.2 | Determinism matters |
| Factual Q&A | 0.0-0.3 | Accuracy over creativity |
| Creative writing | 0.7-1.0 | Variety and novelty |
| Brainstorming | 0.8-1.2 | Maximum divergence |
For reproducible results, use temperature=0 and set a seed parameter if available.
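For example (a sketch, assuming the client from earlier; `seed` is supported by some providers, including recent OpenAI chat models, and should be treated as best-effort):
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0,  # determinism matters for code generation
    seed=42,        # best-effort reproducibility where the provider supports it
)
```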
Negative Constraints
Telling the model what NOT to do is often as important as what to do:
constraints = """
- Do NOT include code examples
- Do NOT use bullet points
- Do NOT exceed 100 words
- Do NOT use technical jargon
"""
Negative constraints prevent common failure modes and keep outputs focused.
Testing and Iteration
Treat Prompts Like Code
Good prompt engineering practices:
- Version control - Track prompt changes over time
- Test suites - Define expected outputs for given inputs
- Regression testing - Ensure changes don't break existing cases
- A/B testing - Compare prompt variations on real data
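A sketch of what a regression test for the ticket classifier might look like; `classify` and `CASES` are hypothetical names standing in for your own prompt wrapper and fixture data, and the test runs under pytest:
```python
from openai import OpenAI

client = OpenAI()

CASES = [
    ("I was charged twice for my subscription", "billing"),
    ("The app crashes on startup", "technical"),
    ("My package is two weeks late", "shipping"),
]

def classify(message: str) -> str:
    """Wrap the classification prompt and return the predicted label."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify support tickets as billing, technical, shipping, or general. "
                           "Respond with only the category name.",
            },
            {"role": "user", "content": message},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def test_known_cases():
    for message, expected in CASES:
        assert classify(message) == expected
```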
Evaluation Metrics
How to measure prompt quality:
| Metric | Measures | How to Calculate |
|---|---|---|
| Accuracy | Correctness | % matching expected output |
| Consistency | Reliability | Variance across multiple runs |
| Format compliance | Parseability | % valid JSON/schema matches |
| Latency | Speed | Response time in ms |
| Cost | Efficiency | Tokens used per request |
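Accuracy and consistency are straightforward to compute once you have a small labeled set; a sketch reusing the hypothetical `classify` helper from the test example above:
```python
labeled = [
    ("I was charged twice for my subscription", "billing"),
    ("The login page shows a 500 error", "technical"),
    ("Where is my order?", "shipping"),
]

# Accuracy: fraction of predictions matching the expected label
predictions = [classify(message) for message, _ in labeled]
accuracy = sum(pred == expected for pred, (_, expected) in zip(predictions, labeled)) / len(labeled)

# Consistency: run one input several times and measure agreement with the most common answer
runs = [classify("My invoice is wrong") for _ in range(5)]
consistency = runs.count(max(set(runs), key=runs.count)) / len(runs)

print(f"accuracy={accuracy:.2f}, consistency={consistency:.2f}")
```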
Iterative Improvement
When a prompt isn't working:
- Identify failure mode - What specifically is wrong?
- Add constraints - Explicitly forbid the bad behavior
- Add examples - Show the correct behavior
- Simplify - Maybe the task is too complex for one prompt
- Escalate - Try a more capable model
Conclusion
Effective prompt engineering comes down to clear communication:
- Be specific - Vague prompts produce vague results
- Start simple - Only add complexity when needed
- Show examples - Few-shot learning is surprisingly powerful
- Request structure - JSON mode enables reliable parsing
- Encourage reasoning - Chain-of-thought improves accuracy on hard problems
- Test systematically - Treat prompts as code that needs testing
The best prompts evolve through experimentation. Start with a simple approach, measure what's working, and iterate.
References
- Wei, J., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". NeurIPS 2022.
- Wang, X., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models". ICLR 2023.
- OpenAI Prompt Engineering Guide - Official best practices.
- Anthropic Prompt Engineering - Claude-specific guidance.
- Learn Prompting - Comprehensive community resource.