LLM Fine-tuning Fundamentals: Understanding When and How to Fine-tune

Author: Jared Chung

Introduction

Fine-tuning Large Language Models transforms general-purpose models into specialized tools for your specific needs. Whether you want a model that follows particular instructions, adopts a specific writing style, or excels at domain-specific tasks, fine-tuning is how you get there.

But fine-tuning is not a single technique—it's a family of approaches with different tradeoffs. This guide explains when and why to use each approach, helping you make informed decisions before investing compute resources.

The Learning Hierarchy

[Figure: Overview of LLM fine-tuning approaches]

From General to Specialized

LLMs develop capabilities in stages:

| Stage | What Happens | Scale |
| --- | --- | --- |
| Pre-training | Learn language from billions of tokens | Weeks on GPU clusters |
| Fine-tuning | Learn specific tasks/behaviors | Hours on consumer GPUs |
| Alignment | Learn human preferences | Additional fine-tuning stage |

Pre-training teaches the model:

  • Grammar, syntax, and semantic relationships
  • General world knowledge
  • Broad reasoning patterns

Fine-tuning teaches:

  • Task-specific performance (coding, summarization)
  • Domain expertise (medical, legal, technical)
  • Specific interaction patterns and preferences

The key insight: you almost never pre-train from scratch. You leverage pre-trained models and fine-tune them for your needs.

Core Fine-tuning Approaches

1. Supervised Fine-tuning (SFT)

The most straightforward approach: provide input-output pairs and train the model to produce the desired outputs.

Training data format:

{
  "instruction": "Explain quantum computing in simple terms",
  "output": "Quantum computing uses quantum mechanics principles to process information differently than regular computers. Instead of bits that are either 0 or 1, quantum computers use 'qubits' that can be both simultaneously..."
}
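
Before training, each record is flattened into a single text string that the model learns to continue. Here is a minimal Python sketch; the prompt template and field names are illustrative assumptions, not a fixed standard:

def format_example(record: dict) -> str:
    """Flatten an instruction/output pair into one training string.

    The template below is an assumption for illustration; use whatever
    prompt format your base model and training framework expect.
    """
    return (
        "### Instruction:\n"
        f"{record['instruction']}\n\n"
        "### Response:\n"
        f"{record['output']}"
    )

example = {
    "instruction": "Explain quantum computing in simple terms",
    "output": "Quantum computing uses 'qubits' that can be 0 and 1 at once...",
}
print(format_example(example))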

When to use SFT:

| Scenario | Why SFT Works |
| --- | --- |
| Teaching new tasks | Direct demonstration of desired behavior |
| Domain adaptation | Exposure to domain-specific language |
| Instruction following | Examples of following instructions |
| Style transfer | Examples in the target style |

Considerations:

  • Quality matters more than quantity—1,000 excellent examples often beat 10,000 mediocre ones
  • Format consistency is crucial—maintain the same structure throughout
  • Diversity ensures generalization—cover the full range of expected inputs

2. Parameter-Efficient Fine-tuning (PEFT/LoRA)

Instead of updating all model parameters (billions for modern LLMs), PEFT methods update only a small subset.

LoRA (Low-Rank Adaptation) is the most popular PEFT method:

  • Adds small "adapter" matrices to existing layers
  • Trains only these adapters (0.1-1% of total parameters)
  • Original model weights stay frozen
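
Here is a rough sketch of what this looks like with the Hugging Face peft library; the base model, rank, and target module names are illustrative choices, not fixed requirements:

# Assumes the transformers and peft packages are installed; values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling factor applied to the adapters
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters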

Benefits:

| Benefit | Impact |
| --- | --- |
| 10-100x less memory | Fine-tune 70B models on consumer GPUs |
| Faster training | Fewer parameters to update |
| Multiple adapters | Different tasks without storing full model copies |
| Less forgetting | Original capabilities preserved in frozen weights |

When to use PEFT:

  • Limited GPU memory
  • Need to support multiple fine-tuned versions
  • Want to preserve general capabilities
  • Resource efficiency matters

3. Reinforcement Learning from Human Feedback (RLHF)

A multi-stage process that aligns models with human preferences, not just demonstrated behaviors.

The three stages:

  1. Supervised Fine-tuning (SFT)

    • Initial instruction-following capability
    • Provides a starting point for preference learning
  2. Reward Modeling

    • Collect human preferences: "Which response is better?"
    • Train a model to predict these preferences
    • The reward model scores any output (sketched just below this list)
  3. Reinforcement Learning

    • Use PPO or similar algorithms
    • Optimize the LLM to maximize reward model scores
    • Include KL penalty to prevent diverging too far from SFT model
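
To make the reward modeling stage concrete, here is a minimal PyTorch sketch of the standard pairwise ranking loss: the reward model is pushed to score the human-preferred response higher than the rejected one. The tensors and values are placeholders.

import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Placeholder scalar scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, -1.0])
print(reward_ranking_loss(chosen, rejected))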

Why RLHF works:

  • Captures nuances hard to demonstrate in examples
  • Optimizes for what humans actually prefer
  • Used by ChatGPT, Claude, and most major assistants

Challenges:

  • Complex three-stage pipeline
  • Reward model can be gamed
  • Expensive to collect preference data
  • Training can be unstable

4. Direct Preference Optimization (DPO)

A newer approach that directly optimizes for preferences without a separate reward model.

Key insight: The optimal policy can be derived directly from preference data using a clever mathematical reformulation.
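
Concretely, for a prompt $x$ with chosen response $y_w$ and rejected response $y_l$, DPO trains the policy $\pi_\theta$ against a frozen reference model $\pi_{\text{ref}}$ (usually the SFT model) with:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\sigma$ is the sigmoid function and $\beta$ controls how strongly the policy is kept close to the reference model.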

Advantages over RLHF:

| Aspect | RLHF | DPO |
| --- | --- | --- |
| Pipeline complexity | 3 stages | 1 stage |
| Reward model needed | Yes | No |
| Training stability | Challenging | More stable |
| Implementation | Complex | Simpler |
| Results | Excellent | Comparable |

When to use DPO:

  • Want RLHF benefits with simpler implementation
  • Have preference pairs of chosen and rejected responses (an example record appears below)
  • Training stability is a concern
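
The preference pairs are usually stored as prompt/chosen/rejected triples; exact field names vary by training framework, but a typical record looks like this:

{
  "prompt": "Explain why the sky is blue.",
  "chosen": "Sunlight is scattered by air molecules, and shorter blue wavelengths scatter the most, so the sky appears blue.",
  "rejected": "The sky is blue because it reflects the ocean."
}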

Choosing Your Approach

Decision Framework

| Your Goal | Recommended Approach |
| --- | --- |
| Teach new tasks/domains | SFT |
| Limited GPU resources | SFT + LoRA |
| Align with human preferences | RLHF or DPO |
| Multiple specialized versions | LoRA (swap adapters) |
| Maximum quality | SFT → RLHF/DPO pipeline |
| Quick experimentation | SFT + LoRA |

The Typical Pipeline

Most production LLMs follow this sequence:

Pre-trained Base Model
  → Supervised Fine-tuning (SFT)
  → Preference Alignment (RLHF or DPO)
  → Aligned, Instruction-following Model

For resource-constrained scenarios, LoRA can be applied at any stage.

Key Concepts

Loss Functions

Cross-Entropy Loss (SFT):

  • Measures how well predicted probabilities match target tokens
  • Standard for next-token prediction training
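
For a token sequence $x_1, \dots, x_T$, this is the standard next-token objective:

$$
\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$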

DPO Loss:

  • Maximizes log probability of preferred response
  • Minimizes log probability of rejected response
  • Includes reference model regularization

RLHF Policy Loss:

  • Maximizes expected reward
  • Includes KL divergence penalty to stay close to reference model
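
In symbols, with reward model $r_\phi$, policy $\pi_\theta$, and reference (SFT) model $\pi_{\text{ref}}$, the objective optimized with PPO is typically:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)
$$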

Catastrophic Forgetting

Problem: Model loses general capabilities while learning new tasks.

Solutions:

| Technique | How It Helps |
| --- | --- |
| Lower learning rate | Gentler updates preserve knowledge |
| LoRA/PEFT | Frozen weights retain capabilities |
| Mixed data | Include general examples in training |
| Regularization | Penalize divergence from base model |

Overfitting

Problem: Model memorizes training data but fails to generalize.

Signs of overfitting:

  • Training loss keeps decreasing while validation loss increases or plateaus
  • Model repeats training examples verbatim

Prevention:

  • Validation set monitoring
  • Early stopping (sketched after this list)
  • Dropout and weight decay
  • More diverse training data
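
Early stopping is straightforward to implement: track the validation loss after each evaluation and stop once it has not improved for a set number of checks. A minimal sketch with placeholder values:

def should_stop(val_losses: list[float], patience: int = 3) -> bool:
    """Return True once validation loss has not improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before

# Example: loss improves, then plateaus and rises, so training stops.
history = [2.10, 1.85, 1.70, 1.72, 1.74, 1.73]
print(should_stop(history, patience=3))  # True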

Data Quality Principles

Quality Over Quantity

| Dataset | Typical Result |
| --- | --- |
| 100 excellent examples | Basic capability |
| 1,000 excellent examples | Good performance |
| 10,000 excellent examples | Robust capability |
| 100,000 mediocre examples | Often worse than 1,000 excellent |

What Makes "Excellent" Data

  1. Accurate: Correct, factual, error-free
  2. Diverse: Covers range of expected inputs
  3. Consistent: Same format throughout
  4. Representative: Matches production distribution
  5. Challenging: Includes edge cases

Data Formatting

Maintain consistent structure:

{
  "messages": [
    { "role": "system", "content": "You are a helpful coding assistant." },
    { "role": "user", "content": "How do I read a file in Python?" },
    { "role": "assistant", "content": "You can use the built-in open() function..." }
  ]
}

This format works with most modern training frameworks.
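
In the Hugging Face ecosystem, tokenizers for chat-tuned models can render this message structure into the exact prompt string the model expects. A rough sketch (the model name is only an example; any tokenizer that ships a chat template works):

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
    {"role": "assistant", "content": "You can use the built-in open() function..."},
]

# Example chat model; its tokenizer includes a chat template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Renders the messages into the single training string the model expects.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)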

Evaluation Strategies

Automated Metrics

| Metric | Measures | Use Case |
| --- | --- | --- |
| Perplexity | How "surprised" by text | Language modeling quality |
| BLEU/ROUGE | N-gram overlap with reference | Translation, summarization |
| Accuracy | Correct predictions | Classification tasks |
| Pass@k | Code that passes tests | Code generation |
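
As an example of how the simplest of these is computed, perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A small PyTorch sketch with random placeholder tensors:

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """exp(mean negative log-likelihood) over all label tokens."""
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    return torch.exp(nll).item()

# Random placeholder tensors just to show the shapes involved.
logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
print(perplexity(logits, labels))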

Human Evaluation

Automated metrics don't capture everything. Human evaluation includes:

  • Preference rankings: Which response is better?
  • Quality ratings: 1-5 scale on helpfulness, accuracy
  • Safety evaluation: Harmful content detection
  • Factuality checks: Are claims correct?

A/B Testing

For production models:

  • Split traffic between models
  • Measure user engagement metrics
  • Statistical significance testing
  • Gradual rollout

Getting Started

Before Fine-tuning

  1. Define objectives clearly

    • What specific behavior do you want?
    • How will you measure success?
    • What are your compute constraints?
  2. Evaluate base model first

    • Maybe prompting is sufficient
    • Identify specific gaps to address
    • Establish baseline metrics
  3. Prepare quality data

    • Curate, don't just collect
    • Format consistently
    • Split into train/validation/test

For most use cases, start with:

  1. SFT + LoRA on a strong base model (Llama, Mistral, Qwen)
  2. Small, high-quality dataset (1,000-10,000 examples)
  3. Careful evaluation on held-out test set
  4. Add DPO if preference alignment needed

This gives you:

  • Efficient resource usage
  • Fast iteration
  • Clear path to improvement

Conclusion

Fine-tuning transforms general LLMs into specialized tools. Key takeaways:

Choose your approach wisely:

  • SFT for teaching new tasks
  • LoRA for resource efficiency
  • RLHF/DPO for preference alignment

Prioritize data quality:

  • 1,000 excellent examples > 10,000 mediocre ones
  • Consistency and diversity matter
  • Match your training data to production use

Start simple, then iterate:

  • Begin with SFT + LoRA
  • Evaluate thoroughly
  • Add complexity only when needed

The subsequent posts in this series will dive deep into implementation, covering environment setup, hands-on training, and advanced techniques.
