LLM Fine-tuning Fundamentals: Understanding When and How to Fine-tune
By Jared Chung
Introduction
Fine-tuning Large Language Models transforms general-purpose models into specialized tools for your specific needs. Whether you want a model that follows particular instructions, adopts a specific writing style, or excels at domain-specific tasks, fine-tuning is how you get there.
But fine-tuning is not a single technique—it's a family of approaches with different tradeoffs. This guide explains when and why to use each approach, helping you make informed decisions before investing compute resources.
The Learning Hierarchy
From General to Specialized
LLMs develop capabilities in stages:
| Stage | What Happens | Typical Scale |
|---|---|---|
| Pre-training | Learn language from billions of tokens | Weeks on GPU clusters |
| Fine-tuning | Learn specific tasks/behaviors | Hours on consumer GPUs |
| Alignment | Learn human preferences | An additional fine-tuning stage on preference data |
Pre-training teaches the model:
- Grammar, syntax, and semantic relationships
- General world knowledge
- Broad reasoning patterns
Fine-tuning teaches:
- Task-specific performance (coding, summarization)
- Domain expertise (medical, legal, technical)
- Specific interaction patterns and preferences
The key insight: you almost never pre-train from scratch. You leverage pre-trained models and fine-tune them for your needs.
Core Fine-tuning Approaches
1. Supervised Fine-tuning (SFT)
The most straightforward approach: provide input-output pairs and train the model to produce the desired outputs.
Training data format:
```json
{
  "instruction": "Explain quantum computing in simple terms",
  "output": "Quantum computing uses quantum mechanics principles to process information differently than regular computers. Instead of bits that are either 0 or 1, quantum computers use 'qubits' that can be both simultaneously..."
}
```
When to use SFT:
| Scenario | Why SFT Works |
|---|---|
| Teaching new tasks | Direct demonstration of desired behavior |
| Domain adaptation | Exposure to domain-specific language |
| Instruction following | Examples of following instructions |
| Style transfer | Examples in the target style |
Considerations:
- Quality matters more than quantity—1,000 excellent examples often beat 10,000 mediocre ones
- Format consistency is crucial—maintain the same structure throughout
- Diversity ensures generalization—cover the full range of expected inputs
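To make the mechanics concrete, here is a minimal single-step SFT sketch using Hugging Face transformers. The model ID, prompt template, and learning rate are placeholder choices; a real run adds batching, shuffling, a learning-rate schedule, and multiple epochs.

```python
# Minimal single-step SFT sketch with Hugging Face transformers.
# The model ID and prompt template are placeholders, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

example = {
    "instruction": "Explain quantum computing in simple terms",
    "output": "Quantum computing uses qubits that can represent 0 and 1 at once...",
}

# Concatenate instruction and response into one training sequence.
text = (
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Response:\n{example['output']}{tokenizer.eos_token}"
)
inputs = tokenizer(text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Next-token prediction: labels are the input ids; the model shifts them
# internally and computes cross-entropy loss.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice, libraries such as TRL wrap this loop (for example with its SFTTrainer), adding packing, evaluation, and checkpointing on top of the same cross-entropy objective.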
2. Parameter-Efficient Fine-tuning (PEFT/LoRA)
Instead of updating all model parameters (billions for modern LLMs), PEFT methods update only a small subset.
LoRA (Low-Rank Adaptation) is the most popular PEFT method:
- Adds small "adapter" matrices to existing layers
- Trains only these adapters (0.1-1% of total parameters)
- Original model weights stay frozen
Benefits:
| Benefit | Impact |
|---|---|
| Far less training memory | Fine-tune much larger models on a single GPU (70B-class becomes feasible with quantized QLoRA variants) |
| Faster training | Fewer parameters to update |
| Multiple adapters | Different tasks without storing full model copies |
| Less forgetting | Original capabilities preserved in frozen weights |
When to use PEFT:
- Limited GPU memory
- Need to support multiple fine-tuned versions
- Want to preserve general capabilities
- Resource efficiency matters
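As a rough illustration of how little is actually trained, here is a hedged sketch using the Hugging Face PEFT library. The base model, rank, and target modules are illustrative placeholders; appropriate values depend on your architecture and task.

```python
# Hedged LoRA sketch with the PEFT library. Model ID, rank, and target modules
# are placeholders; suitable values vary by model and task.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names vary by architecture
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # reports adapter params as a small fraction of the total
# Training then proceeds exactly as in the SFT sketch above, but gradients only
# flow into the adapter weights; the base model stays frozen.
```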
3. Reinforcement Learning from Human Feedback (RLHF)
A multi-stage process that aligns models with human preferences, not just demonstrated behaviors.
The three stages:
Supervised Fine-tuning (SFT)
- Initial instruction-following capability
- Provides a starting point for preference learning
Reward Modeling
- Collect human preferences: "Which response is better?"
- Train a model to predict these preferences
- The trained reward model can then score any candidate output (a loss sketch for this stage appears at the end of this subsection)
Reinforcement Learning
- Use PPO or similar algorithms
- Optimize the LLM to maximize reward model scores
- Include a KL penalty to keep the policy from drifting too far from the SFT model
Why RLHF works:
- Captures nuances hard to demonstrate in examples
- Optimizes for what humans actually prefer
- Used by ChatGPT, Claude, and most major assistants
Challenges:
- Complex three-stage pipeline
- Reward model can be gamed
- Expensive to collect preference data
- Training can be unstable
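To ground stage 2: the reward model is typically trained with a pairwise, Bradley-Terry style loss on preference pairs, so that the preferred response scores higher than the rejected one. Below is a hedged sketch of that loss in isolation; the tensor names are assumed placeholders, and libraries such as TRL provide full trainers for the reward-modeling and RL stages.

```python
# Pairwise reward-model loss sketch (Bradley-Terry style): the reward model
# should score the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # chosen_scores / rejected_scores: scalar reward-model outputs per preference pair
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with made-up scores for two preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
print(loss.item())
```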
4. Direct Preference Optimization (DPO)
A newer approach that directly optimizes for preferences without a separate reward model.
Key insight: the RLHF objective has a closed-form optimal policy in terms of the reward, so the reward model can be folded into a simple classification-style loss on preference pairs and the policy trained on them directly.
Advantages over RLHF:
| Aspect | RLHF | DPO |
|---|---|---|
| Pipeline complexity | 3 stages | 1 stage |
| Reward model needed | Yes | No |
| Training stability | Challenging | More stable |
| Implementation | Complex | Simpler |
| Results | Excellent | Comparable |
When to use DPO:
- Want RLHF benefits with simpler implementation
- Have preference pairs (chosen/rejected responses)
- Training stability is a concern
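The core of DPO fits in a few lines. Below is a hedged sketch of the loss operating on per-sequence log probabilities (summed over tokens); the tensor names are assumed placeholders, and TRL's DPOTrainer implements the full training loop around this objective.

```python
# DPO loss sketch. Inputs are per-sequence log probabilities (summed over tokens)
# of the chosen/rejected responses under the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    # Log-ratios of policy vs. reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Prefer the chosen response by a margin, scaled by beta (Rafailov et al., 2023).
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```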
Choosing Your Approach
Decision Framework
| Your Goal | Recommended Approach |
|---|---|
| Teach new tasks/domains | SFT |
| Limited GPU resources | SFT + LoRA |
| Align with human preferences | RLHF or DPO |
| Multiple specialized versions | LoRA (swap adapters) |
| Maximum quality | SFT → RLHF/DPO pipeline |
| Quick experimentation | SFT + LoRA |
The Typical Pipeline
Most production LLMs follow this sequence:
```
Pre-trained Base Model
        ↓
Supervised Fine-tuning (SFT)
        ↓
Preference Alignment (RLHF or DPO)
        ↓
Aligned, Instruction-following Model
```
For resource-constrained scenarios, LoRA can be applied at any stage.
Key Concepts
Loss Functions
Cross-Entropy Loss (SFT):
- Measures how well predicted probabilities match target tokens
- Standard for next-token prediction training
DPO Loss:
- Raises the log probability of the preferred response relative to a frozen reference model
- Lowers the (reference-relative) log probability of the rejected response
- The reference model acts as implicit KL regularization
RLHF Policy Loss:
- Maximizes expected reward
- Includes KL divergence penalty to stay close to reference model
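Written out (following Rafailov et al., 2023 and Ouyang et al., 2022), the two preference objectives are:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

$$
\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]
$$

Here $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\text{ref}}$ is the frozen reference (SFT) model, $r_\phi$ is the reward model, $\sigma$ is the sigmoid, and $\beta$ controls how strongly the policy is kept near the reference.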
Catastrophic Forgetting
Problem: Model loses general capabilities while learning new tasks.
Solutions:
| Technique | How It Helps |
|---|---|
| Lower learning rate | Gentler updates preserve knowledge |
| LoRA/PEFT | Frozen weights retain capabilities |
| Mixed data | Include general examples in training |
| Regularization | Penalize divergence from base model |
Overfitting
Problem: Model memorizes training data but fails to generalize.
Signs of overfitting:
- Training loss continues decreasing
- Validation loss increases or plateaus
- Model repeats training examples verbatim
Prevention:
- Validation set monitoring
- Early stopping
- Dropout and weight decay
- More diverse training data
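As a hedged sketch of how validation monitoring, early stopping, and weight decay from the list above look with the Hugging Face Trainer: the argument values below are illustrative, `model`, `train_ds`, and `val_ds` are assumed to be a prepared model and tokenized datasets, and the evaluation-strategy argument name varies slightly across transformers versions.

```python
# Early stopping + weight decay sketch with the Hugging Face Trainer.
# `model`, `train_ds`, and `val_ds` are assumed placeholders prepared elsewhere.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    eval_strategy="epoch",             # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,       # restore the best checkpoint, not the last one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=10,
    weight_decay=0.01,                 # mild regularization
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no improvement
)
trainer.train()
```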
Data Quality Principles
Quality Over Quantity
| Dataset | Typical Result |
|---|---|
| 100 excellent examples | Basic capability |
| 1,000 excellent examples | Good performance |
| 10,000 excellent examples | Robust capability |
| 100,000 mediocre examples | Often worse than 1,000 excellent |
What Makes "Excellent" Data
- Accurate: Correct, factual, error-free
- Diverse: Covers range of expected inputs
- Consistent: Same format throughout
- Representative: Matches production distribution
- Challenging: Includes edge cases
Data Formatting
Maintain consistent structure:
```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful coding assistant." },
    { "role": "user", "content": "How do I read a file in Python?" },
    { "role": "assistant", "content": "You can use the built-in open() function..." }
  ]
}
```
This format works with most modern training frameworks.
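For instance, most modern tokenizers can render this messages format into the single training string the model actually sees via a chat template; the model ID below is a placeholder.

```python
# Rendering the chat-format example above into a flat training string.
# The model ID is a placeholder; any model with a chat template works similarly.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
    {"role": "assistant", "content": "You can use the built-in open() function..."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # the flattened prompt/response string used for training
```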
Evaluation Strategies
Automated Metrics
| Metric | Measures | Use Case |
|---|---|---|
| Perplexity | How "surprised" the model is by held-out text | Language modeling quality |
| BLEU/ROUGE | N-gram overlap with reference | Translation, summarization |
| Accuracy | Correct predictions | Classification tasks |
| Pass@k | Code that passes tests | Code generation |
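As a concrete example of the first metric in the table, perplexity is just the exponential of the model's mean cross-entropy on held-out text. The model and sentence below are placeholders.

```python
# Perplexity = exp(mean cross-entropy) of the model on held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Fine-tuning adapts a pre-trained model to a narrower task."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean cross-entropy per token
print(math.exp(loss.item()))  # lower perplexity = the model finds the text less "surprising"
```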
Human Evaluation
Automated metrics don't capture everything. Human evaluation includes:
- Preference rankings: Which response is better?
- Quality ratings: 1-5 scale on helpfulness, accuracy
- Safety evaluation: Harmful content detection
- Factuality checks: Are claims correct?
A/B Testing
For production models:
- Split traffic between models
- Measure user engagement metrics
- Statistical significance testing
- Gradual rollout
Getting Started
Before Fine-tuning
Define objectives clearly
- What specific behavior do you want?
- How will you measure success?
- What are your compute constraints?
Evaluate base model first
- Prompting alone may already be sufficient
- Identify specific gaps to address
- Establish baseline metrics
Prepare quality data
- Curate, don't just collect
- Format consistently
- Split into train/validation/test (a minimal split sketch follows this list)
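Here is a hedged sketch of that split using the datasets library; the file name and split proportions are placeholders.

```python
# Splitting a JSONL instruction dataset into train/validation/test.
# The file name and split sizes are placeholders.
from datasets import load_dataset

ds = load_dataset("json", data_files="examples.jsonl", split="train")
train_rest = ds.train_test_split(test_size=0.2, seed=42)                 # 80% train, 20% held out
val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)   # split held-out half/half

splits = {
    "train": train_rest["train"],
    "validation": val_test["train"],
    "test": val_test["test"],
}
print({name: len(d) for name, d in splits.items()})
```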
Recommended Starting Point
For most use cases, start with:
- SFT + LoRA on a strong base model (Llama, Mistral, Qwen)
- Small, high-quality dataset (1,000-10,000 examples)
- Careful evaluation on held-out test set
- Add DPO if preference alignment needed
This gives you:
- Efficient resource usage
- Fast iteration
- Clear path to improvement
Conclusion
Fine-tuning transforms general LLMs into specialized tools. Key takeaways:
Choose your approach wisely:
- SFT for teaching new tasks
- LoRA for resource efficiency
- RLHF/DPO for preference alignment
Prioritize data quality:
- 1,000 excellent examples > 10,000 mediocre ones
- Consistency and diversity matter
- Match your training data to production use
Start simple, then iterate:
- Begin with SFT + LoRA
- Evaluate thoroughly
- Add complexity only when needed
The subsequent posts in this series will dive deep into implementation, covering environment setup, hands-on training, and advanced techniques.
References
- Hu et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models". ICLR 2022.
- Ouyang et al. (2022). "Training Language Models to Follow Instructions with Human Feedback". InstructGPT/RLHF paper.
- Rafailov et al. (2023). "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model". NeurIPS 2023.
- Hugging Face PEFT Documentation - Parameter-efficient fine-tuning library.
- TRL Documentation - Transformer Reinforcement Learning library.