Setting Up Your LLM Fine-tuning Environment: Hardware, Software, and Best Practices
Author: Jared Chung
Before diving into fine-tuning your first Large Language Model, you need a properly configured environment. This post will walk you through everything from hardware considerations to software setup, ensuring you have a solid foundation for the upcoming hands-on tutorials.
Hardware Requirements
GPU Considerations
Fine-tuning LLMs is computationally intensive, and your choice of hardware will significantly impact your experience:
Minimum Requirements:
- NVIDIA GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti)
- RAM: 16GB system memory
- Storage: 100GB+ free space for models and datasets
Recommended Setup:
- NVIDIA GPU: 16GB+ VRAM (RTX 4080, RTX 4090, A4000, A5000)
- RAM: 32GB+ system memory
- Storage: 500GB+ NVMe SSD
Professional/Research Setup:
- NVIDIA GPU: 24GB+ VRAM (RTX 6000 Ada, A6000, H100)
- RAM: 64GB+ system memory
- Storage: 1TB+ NVMe SSD
Memory Planning by Model Size
Understanding memory requirements helps you choose appropriate models and techniques:
# Rough VRAM needed just to hold the model weights in FP16 (~2 bytes per parameter).
# Full fine-tuning also stores gradients and optimizer states, which multiply this several times over.
model_memory = {
    "7B parameters": "~14GB VRAM",    # LLaMA-7B, Mistral-7B
    "13B parameters": "~26GB VRAM",   # LLaMA-13B, Vicuna-13B
    "70B parameters": "~140GB VRAM",  # LLaMA-70B (multi-GPU required)
}

# With LoRA (Low-Rank Adaptation), typically on an 8-bit or 4-bit quantized base model
lora_memory = {
    "7B parameters": "6-8GB VRAM",     # Significant reduction
    "13B parameters": "10-12GB VRAM",
    "70B parameters": "24-32GB VRAM",  # Still requires high-end hardware
}
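For model sizes not listed above, the weight-only figure is roughly parameters × 2 bytes in FP16. Here is a minimal sketch of that rule of thumb; the helper function is purely illustrative, not part of any library:
# estimate_vram.py - back-of-the-envelope helper (illustrative only)
def fp16_weight_memory_gb(num_params_billions: float) -> float:
    """Approximate GB of VRAM to hold FP16 weights: ~2 bytes per parameter."""
    return num_params_billions * 2.0  # 1e9 params * 2 bytes = ~2 GB per billion params

if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B model: ~{fp16_weight_memory_gb(size):.0f} GB of FP16 weights")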
Cloud Alternatives
If local hardware is insufficient, consider cloud options:
Google Colab Pro/Pro+: Good for experimentation and learning
- T4 (16GB), A100 (40GB) available
- Limited session time but cost-effective
AWS/Azure/GCP: For serious projects
- More control and longer sessions
- Higher cost but better for production workflows
Core Software Installation
Step 1: Python Environment Setup
Start with a clean Python environment to avoid dependency conflicts:
# Install miniconda if you haven't already
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Create dedicated environment for LLM fine-tuning
conda create -n llm-finetuning python=3.10
conda activate llm-finetuning
Step 2: PyTorch with CUDA Support
Install PyTorch with appropriate CUDA version:
# Check your CUDA version first
nvidia-smi
# Install PyTorch (adjust CUDA version as needed)
# For CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Verify installation
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name())"
Step 3: Transformers and Essential Libraries
Install the core libraries for LLM fine-tuning:
# Core libraries (quote the version specifiers so the shell doesn't treat ">" as a redirect)
pip install "transformers>=4.35.0"
pip install "datasets>=2.14.0"
pip install "accelerate>=0.24.0"
pip install "peft>=0.6.0"  # For LoRA and other PEFT methods
# Training utilities
pip install deepspeed # For distributed training and memory optimization
pip install bitsandbytes # For quantization
pip install wandb # For experiment tracking
# Data processing
pip install pandas numpy scikit-learn
pip install tokenizers
# Optional but useful
pip install jupyter # For interactive development
pip install matplotlib seaborn # For visualizations
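If you prefer to capture these dependencies in one place, the same installs can live in a requirements.txt (the entries mirror the commands above; pin exact versions once your setup is stable):
# requirements.txt
transformers>=4.35.0
datasets>=2.14.0
accelerate>=0.24.0
peft>=0.6.0
deepspeed
bitsandbytes
wandb
pandas
numpy
scikit-learn
tokenizers
jupyter
matplotlib
seaborn
Install everything at once with pip install -r requirements.txt.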
Step 4: Verify Your Installation
Create a simple verification script:
# verify_setup.py
import torch
import transformers
import datasets
import peft
import accelerate

def check_setup():
    print("=== System Information ===")
    print(f"PyTorch version: {torch.__version__}")
    print(f"Transformers version: {transformers.__version__}")
    print(f"Datasets version: {datasets.__version__}")
    print(f"PEFT version: {peft.__version__}")
    print(f"Accelerate version: {accelerate.__version__}")

    print("\n=== GPU Information ===")
    if torch.cuda.is_available():
        print(f"CUDA available: {torch.cuda.is_available()}")
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU count: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB")
    else:
        print("CUDA not available - you'll be limited to CPU training")

    print("\n=== Quick Model Test ===")
    try:
        from transformers import AutoTokenizer, AutoModelForCausalLM

        # Test with a small model
        model_name = "microsoft/DialoGPT-small"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        if torch.cuda.is_available():
            model = model.cuda()

        # Simple inference test
        inputs = tokenizer.encode("Hello, how are you?", return_tensors="pt")
        if torch.cuda.is_available():
            inputs = inputs.cuda()
        with torch.no_grad():
            outputs = model.generate(inputs, max_length=20, pad_token_id=tokenizer.eos_token_id)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Test successful! Model response: {response}")
    except Exception as e:
        print(f"Test failed: {e}")

if __name__ == "__main__":
    check_setup()
Run the verification:
python verify_setup.py
Data Preparation Workflow
Directory Structure
Organize your projects with a consistent structure:
llm-finetuning/
├── data/
│   ├── raw/              # Original datasets
│   ├── processed/        # Cleaned and formatted data
│   └── splits/           # Train/validation/test splits
├── models/
│   ├── base/             # Pre-trained models
│   ├── checkpoints/      # Training checkpoints
│   └── final/            # Final fine-tuned models
├── scripts/
│   ├── data_prep.py      # Data preprocessing
│   ├── train.py          # Training scripts
│   └── evaluate.py       # Evaluation scripts
├── configs/              # Training configurations
├── logs/                 # Training logs and metrics
└── notebooks/            # Jupyter notebooks for analysis
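Rather than creating these folders by hand, you can bootstrap the whole skeleton from the shell using bash brace expansion (adjust the root directory name to your project):
mkdir -p llm-finetuning/{data/{raw,processed,splits},models/{base,checkpoints,final},scripts,configs,logs,notebooks}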
Data Format Standards
Establish consistent data formats early:
# Standard format for instruction-following data
{
    "instruction": "Explain the concept of machine learning",
    "input": "",  # Optional context or additional input
    "output": "Machine learning is a subset of artificial intelligence..."
}

# Format for conversational data
{
    "messages": [
        {"role": "user", "content": "What is the weather like?"},
        {"role": "assistant", "content": "I don't have access to real-time weather data..."}
    ]
}

# Format for preference data (for RLHF/DPO)
{
    "prompt": "Write a short story about a robot",
    "chosen": "In a world where circuits hummed with consciousness...",
    "rejected": "Robot go beep boop and walk around..."
}
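Whichever format you choose, each record eventually gets flattened into a single training string. Below is one possible template; the Alpaca-style section headers are an assumption for illustration, so match whatever prompt format your base model actually expects:
# prompt_format.py - one possible instruction template (Alpaca-style headers are illustrative)
def format_instruction_example(example: dict) -> dict:
    """Turn an instruction/input/output record into a single training string."""
    if example.get("input"):
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    else:
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return {"text": text}
Once your data is loaded as a Hugging Face Dataset, you can apply this with dataset.map(format_instruction_example).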
Data Processing Pipeline
Create reusable data processing functions:
# data_utils.py
import json
import pandas as pd
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

def load_and_format_data(file_path, format_type="instruction"):
    """Load data and convert to standard format"""
    if file_path.endswith('.json'):
        with open(file_path, 'r') as f:
            data = json.load(f)
    elif file_path.endswith('.csv'):
        data = pd.read_csv(file_path).to_dict('records')
    else:
        raise ValueError("Unsupported file format")

    if format_type == "instruction":
        # Ensure all entries have required fields
        formatted_data = []
        for item in data:
            if 'instruction' in item and 'output' in item:
                formatted_data.append({
                    'instruction': item['instruction'],
                    'input': item.get('input', ''),
                    'output': item['output']
                })
        return formatted_data

    # Add other format handlers as needed
    return data

def create_train_val_split(data, test_size=0.2, random_state=42):
    """Create train/validation splits"""
    train_data, val_data = train_test_split(
        data, test_size=test_size, random_state=random_state
    )
    return DatasetDict({
        'train': Dataset.from_list(train_data),
        'validation': Dataset.from_list(val_data)
    })

def save_processed_data(dataset_dict, output_dir):
    """Save processed data for reuse"""
    dataset_dict.save_to_disk(output_dir)
    print(f"Saved processed data to {output_dir}")
Configuration Management
Training Configuration Template
Create a template for training configurations:
# config.py
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class ModelConfig:
    model_name: str = "microsoft/DialoGPT-medium"
    tokenizer_name: Optional[str] = None
    max_length: int = 512
    use_fast_tokenizer: bool = True

@dataclass
class LoRAConfig:
    r: int = 16           # Rank of adaptation
    lora_alpha: int = 32  # LoRA scaling parameter
    target_modules: List[str] = field(default_factory=lambda: ["q_proj", "v_proj"])
    lora_dropout: float = 0.1
    bias: str = "none"

@dataclass
class TrainingConfig:
    output_dir: str = "./results"
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 4
    per_device_eval_batch_size: int = 4
    warmup_steps: int = 500
    weight_decay: float = 0.01
    learning_rate: float = 5e-5
    fp16: bool = True  # Use mixed precision
    gradient_checkpointing: bool = True
    dataloader_num_workers: int = 4
    save_steps: int = 500
    eval_steps: int = 500
    logging_steps: int = 100
    load_best_model_at_end: bool = True

@dataclass
class DataConfig:
    data_path: str = "./data/processed"
    max_samples: Optional[int] = None
    shuffle: bool = True
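These dataclasses are plain configuration holders; at training time you map them onto the library objects. Here is a sketch of that wiring, assuming the field names above (note that load_best_model_at_end needs an evaluation strategy that matches the save strategy):
# build_training_objects.py - sketch of mapping the dataclasses onto library objects
from dataclasses import asdict
from transformers import TrainingArguments
from peft import LoraConfig as PeftLoraConfig

from config import LoRAConfig, TrainingConfig

lora_cfg = LoRAConfig()
train_cfg = TrainingConfig()

peft_config = PeftLoraConfig(
    r=lora_cfg.r,
    lora_alpha=lora_cfg.lora_alpha,
    target_modules=lora_cfg.target_modules,
    lora_dropout=lora_cfg.lora_dropout,
    bias=lora_cfg.bias,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    evaluation_strategy="steps",  # required so load_best_model_at_end can compare checkpoints
    **asdict(train_cfg),          # field names above match TrainingArguments parameters
)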
Environment Optimization Tips
Memory Optimization
# Techniques for reducing memory usage

# 1. Gradient checkpointing (trades compute for memory)
training_args.gradient_checkpointing = True

# 2. Mixed precision training
training_args.fp16 = True  # or bf16 for newer GPUs

# 3. DeepSpeed ZeRO for large models
# deepspeed_config.json
{
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 200000000,
        "reduce_scatter": true,
        "reduce_bucket_size": 200000000,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "fp16": {
        "enabled": true
    }
}
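To actually use the DeepSpeed config with the Hugging Face Trainer, pass its path to TrainingArguments (the file name below matches the example above; any path works):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    fp16=True,
    deepspeed="deepspeed_config.json",  # Path to the ZeRO config shown above
)
You then start the script with the deepspeed launcher (for example deepspeed train.py) rather than plain python, so the distributed backend is initialized.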
Monitoring and Logging
Set up proper monitoring:
# wandb_setup.py
import wandb
import os

def setup_wandb(project_name, config):
    """Initialize Weights & Biases for experiment tracking"""
    wandb.login()  # Make sure to set WANDB_API_KEY
    run = wandb.init(
        project=project_name,
        config=config,
        name=f"{config.model_name.split('/')[-1]}-{config.learning_rate}"
    )
    return run

# Usage in training script (config must expose model_name and learning_rate attributes)
wandb_run = setup_wandb("llm-finetuning", training_config)
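If you train with the Hugging Face Trainer, you can also have it report metrics to the same W&B project by setting report_to in TrainingArguments (the run_name below is just an illustrative label):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",                     # Send Trainer metrics to Weights & Biases
    run_name="llm-finetuning-experiment",  # Illustrative run name
    logging_steps=100,
)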
Troubleshooting Common Issues
CUDA Out of Memory
# Solutions for OOM errors:
# 1. Reduce batch size
training_args.per_device_train_batch_size = 1
training_args.gradient_accumulation_steps = 8 # Maintain effective batch size
# 2. Enable gradient checkpointing
training_args.gradient_checkpointing = True
# 3. Use LoRA instead of full fine-tuning
from peft import LoraConfig, get_peft_model
# 4. Use 8-bit or 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # or load_in_4bit=True
    device_map="auto"
)
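Expanding on option 3, here is a minimal sketch of wrapping a loaded model with LoRA via PEFT; the target_modules names match LLaMA/Mistral-style attention projections and may differ for other architectures:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Adjust to your model's layer names
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
# For 8-bit/4-bit base models, also see peft.prepare_model_for_kbit_training
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Typically well under 1% of the weights are trainable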
Performance Issues
# Optimization strategies:
# 1. Increase dataloader workers
training_args.dataloader_num_workers = 4
# 2. Use faster tokenizers
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# 3. Pre-tokenize your dataset
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset.save_to_disk("./tokenized_data")  # Save for reuse
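On later runs you can load the saved dataset directly instead of re-tokenizing:
from datasets import load_from_disk

tokenized_dataset = load_from_disk("./tokenized_data")  # Reuse the pre-tokenized data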
Next Steps
With your environment properly set up, you're ready to dive into practical fine-tuning. The next post in this series will walk through your first supervised fine-tuning experiment, where we'll:
- Prepare a real dataset for instruction-following
- Configure and run a training job
- Monitor training progress and metrics
- Evaluate the fine-tuned model
Make sure to test your setup with the verification script and familiarize yourself with the directory structure. Having a solid foundation will make the upcoming hands-on tutorials much smoother.
Quick Setup Checklist:
- ✅ GPU drivers and CUDA installed
- ✅ Python environment with all required packages
- ✅ Verification script runs successfully
- ✅ Directory structure created
- ✅ Configuration templates ready
- ✅ Monitoring tools (wandb) configured
Ready to fine-tune your first model? The next post will put all this setup to work!