LLM Quantization: GPTQ, AWQ, GGUF and When to Use Each
Jared Chung
Introduction
Large Language Models are resource-intensive: a 70B-parameter model at 16-bit precision needs ~140GB of memory, far beyond consumer GPUs. Quantization addresses this by lowering precision from 16-bit floats to 8-bit or 4-bit integers (or even less), cutting memory requirements dramatically with minimal quality loss.
What is Quantization?
Quantization maps continuous floating-point values to discrete integer values:
FP16 (16-bit float) → INT8 (8-bit integer) → INT4 (4-bit integer)
| Precision | Memory/Param | 7B Model | 70B Model |
|---|---|---|---|
| FP32 | 4 bytes | 28 GB | 280 GB |
| FP16/BF16 | 2 bytes | 14 GB | 140 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| INT4 | 0.5 bytes | 3.5 GB | 35 GB |
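To make the mapping concrete, here is a minimal illustrative sketch of symmetric 8-bit quantization of a single weight tensor in PyTorch. Real methods (GPTQ, AWQ, GGUF k-quants) use per-group scales and calibration data rather than one scale for the whole tensor:
import torch
# Quantize a random FP32 weight matrix to INT8 with one shared scale (illustrative only)
w = torch.randn(4096, 4096)                                    # FP32: ~67 MB
scale = w.abs().max() / 127                                    # map the largest magnitude to 127
w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)   # INT8: ~17 MB
w_dequant = w_int8.float() * scale                             # values used at compute time
print("max abs error:", (w - w_dequant).abs().max().item())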
Quantization Methods Comparison
| Method | Speed | Quality | GPU Required | Best For |
|---|---|---|---|---|
| GPTQ | Fast | Good | Yes (inference) | GPU deployment |
| AWQ | Fast | Better | Yes (inference) | GPU deployment |
| GGUF/GGML | Medium | Good | No (CPU/GPU) | Local/edge |
| bitsandbytes | Fast | Good | Yes | Training/fine-tuning |
| EETQ | Fastest | Good | Yes | High throughput |
GPTQ: GPU Quantization
GPTQ (from the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers") is a one-shot, post-training weight quantization method designed for GPU inference with minimal accuracy loss.
Using Pre-quantized Models
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=True
)
# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
Quantizing Your Own Model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Prepare calibration dataset
calibration_data = [
"Machine learning is a field of artificial intelligence",
"Deep learning uses neural networks with many layers",
"Natural language processing deals with text data",
# Add more diverse examples (typically 128-512 samples)
]
# Quantization config
quantization_config = GPTQConfig(
bits=4,
dataset=calibration_data,
tokenizer=tokenizer,
group_size=128,
desc_act=True
)
# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config,
device_map="auto"
)
# Save quantized model
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")
With AutoGPTQ Library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
model_id = "meta-llama/Llama-2-7b-hf"
quantized_model_dir = "llama-2-7b-4bit-gptq"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Calibration texts (use 128+ diverse samples in practice)
calibration_texts = [
    "Machine learning is a field of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text data",
]
# Prepare tokenized calibration examples
examples = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts
]
# Quantize config
quantize_config = BaseQuantizeConfig(
bits=4,
group_size=128,
desc_act=True,
damp_percent=0.1
)
# Load model
model = AutoGPTQForCausalLM.from_pretrained(
model_id,
quantize_config=quantize_config
)
# Quantize
model.quantize(examples)
# Save
model.save_quantized(quantized_model_dir)
AWQ: Activation-aware Quantization
AWQ (Activation-aware Weight Quantization) preserves important weights based on activation patterns, often achieving better quality than GPTQ.
Using Pre-quantized AWQ Models
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "TheBloke/Llama-2-7B-AWQ"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
Quantizing with AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization config
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
"version": "GEMM"
}
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
AWQ with vLLM
from vllm import LLM, SamplingParams
llm = LLM(
model="TheBloke/Llama-2-13B-AWQ",
quantization="awq",
dtype="half"
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
GGUF: CPU-Friendly Quantization
GGUF (GPT-Generated Unified Format) is designed for efficient CPU inference with optional GPU acceleration. It's the format used by llama.cpp and Ollama.
Quantization Levels
| Quant Type | Bits | Size (7B) | Quality | Speed |
|---|---|---|---|---|
| Q2_K | 2 | 2.5 GB | Low | Fastest |
| Q3_K_M | 3 | 3.3 GB | Fair | Fast |
| Q4_K_M | 4 | 4.0 GB | Good | Fast |
| Q5_K_M | 5 | 4.8 GB | Very Good | Medium |
| Q6_K | 6 | 5.5 GB | Excellent | Medium |
| Q8_0 | 8 | 7.0 GB | Near FP16 | Slower |
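Pre-quantized GGUF files are usually published as one file per quantization level, so you only download the level you need. A minimal sketch using huggingface_hub (the repo_id and filename below are examples; check the actual repository for exact filenames):
from huggingface_hub import hf_hub_download
# Download a single quantization level (repo_id and filename are illustrative)
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    local_dir="./models"
)
print(model_path)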
Using with llama-cpp-python
from llama_cpp import Llama
# Load GGUF model
llm = Llama(
model_path="./models/llama-2-7b.Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=35, # Offload layers to GPU
n_threads=8
)
# Generate
output = llm(
"Explain machine learning in simple terms:",
max_tokens=200,
temperature=0.7,
stop=["User:", "\n\n"]
)
print(output["choices"][0]["text"])
Converting to GGUF
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Install Python requirements for the conversion script
pip install -r requirements.txt
# Build the quantization tool
cmake -B build && cmake --build build --config Release
# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/model --outfile model.gguf
# Quantize to Q4_K_M
./build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M
Using with Ollama
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
# Create Ollama model
ollama create my-model -f Modelfile
# Run
ollama run my-model
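Once created, the model can also be queried programmatically through Ollama's local REST API (assuming the daemon is running on the default port 11434):
import requests
# Call the locally served model via Ollama's generate endpoint
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "my-model", "prompt": "Explain quantization briefly.", "stream": False},
)
print(response.json()["response"])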
bitsandbytes: Training-Friendly Quantization
bitsandbytes is ideal when you need to fine-tune or train with quantized models.
8-bit Loading
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
4-bit Loading (QLoRA)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # or "fp4"
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True # Nested quantization
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
# Ready for QLoRA fine-tuning!
QLoRA Fine-tuning
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Prepare model
model = prepare_model_for_kbit_training(model)
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 and four target modules on a 7B model, expect roughly 16.8M trainable parameters (well under 1% of the total)
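From here, training proceeds as with any other PEFT model. A minimal sketch with transformers.Trainer, assuming tokenizer is the matching AutoTokenizer and train_dataset is a hypothetical pre-tokenized dataset with input_ids:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    bf16=True,
)
trainer = Trainer(
    model=model,                      # 4-bit base model with LoRA adapters attached
    args=training_args,
    train_dataset=train_dataset,      # placeholder: your tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("qlora-adapter")  # saves only the LoRA adapter weights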
Choosing the Right Method
Decision Tree
Need to fine-tune?
├── Yes → bitsandbytes (QLoRA)
└── No
├── GPU available?
│ ├── Yes
│ │ ├── Need best quality? → AWQ
│ │ ├── Need fastest inference? → GPTQ
│ │ └── Using vLLM? → AWQ or GPTQ
│ └── No → GGUF (llama.cpp/Ollama)
└── Edge/Mobile? → GGUF with aggressive quantization
Quality Comparison
Testing on common benchmarks (lower perplexity = better):
| Model | FP16 | GPTQ-4bit | AWQ-4bit | GGUF Q4_K_M |
|---|---|---|---|---|
| Llama-2-7B | 5.47 | 5.62 | 5.58 | 5.65 |
| Llama-2-13B | 4.88 | 5.01 | 4.97 | 5.05 |
| Mistral-7B | 5.25 | 5.38 | 5.34 | 5.41 |
Practical Tips
1. Memory Estimation
def estimate_memory(params_billions, bits=16, overhead=1.2):
"""Estimate GPU memory for a model."""
bytes_per_param = bits / 8
memory_gb = params_billions * bytes_per_param * overhead
return f"{memory_gb:.1f} GB"
print(estimate_memory(7, 16)) # 16.8 GB (FP16)
print(estimate_memory(7, 4)) # 4.2 GB (4-bit)
print(estimate_memory(70, 4)) # 42.0 GB (4-bit)
2. Batch Size Optimization
# With quantization, you can often increase batch size
# FP16: batch_size=1, 14GB VRAM
# 4-bit: batch_size=4, 14GB VRAM
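For example, several prompts can go through a single generate call; a quick sketch assuming model and tokenizer are the 4-bit model and tokenizer loaded earlier:
prompts = [
    "Summarize the benefits of quantization.",
    "List three uses of 4-bit models.",
    "Explain perplexity in one sentence.",
]
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"   # left-pad so generation starts right after each prompt
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=50)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)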
3. Mixed Precision
# Keep some layers in higher precision
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16
)
4. Layer Offloading
# GGUF: Offload some layers to GPU
llm = Llama(
model_path="model.gguf",
n_gpu_layers=20,  # Offload 20 layers to the GPU (tune to your VRAM)
)
Benchmarking Your Setup
import time
import torch
def benchmark_model(model, tokenizer, prompt, num_runs=10):
"""Benchmark inference speed."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Warmup
for _ in range(3):
with torch.no_grad():
model.generate(**inputs, max_new_tokens=50)
# Benchmark
times = []
for _ in range(num_runs):
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()
times.append(time.perf_counter() - start)
tokens = len(outputs[0]) - len(inputs.input_ids[0])
return {
"avg_time": sum(times) / len(times),
"tokens_per_sec": tokens / (sum(times) / len(times)),
"memory_gb": torch.cuda.max_memory_allocated() / 1e9
}
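For example, with a model and tokenizer already loaded:
stats = benchmark_model(model, tokenizer, "Explain quantization in one paragraph.")
print(f"{stats['tokens_per_sec']:.1f} tok/s, {stats['memory_gb']:.2f} GB peak")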
Complete Benchmarking Suite
Here's a comprehensive benchmarking script to evaluate quantized models:
import torch
import time
import gc
from dataclasses import dataclass
from typing import List, Dict, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
@dataclass
class BenchmarkResult:
"""Results from a single benchmark run."""
model_name: str
quantization: str
prompt_tokens: int
generated_tokens: int
time_seconds: float
memory_gb: float
tokens_per_second: float
class QuantizationBenchmark:
"""Benchmark quantized models for speed and quality."""
def __init__(self, prompts: Optional[List[str]] = None):
self.prompts = prompts or [
"Write a Python function to calculate the Fibonacci sequence.",
"Explain the theory of relativity in simple terms.",
"What are the key differences between SQL and NoSQL databases?",
]
self.results: List[BenchmarkResult] = []
def benchmark_model(
self,
model_name: str,
quantization: Optional[str] = None,
max_new_tokens: int = 100,
n_runs: int = 3
) -> List[BenchmarkResult]:
"""Benchmark a model with specified quantization."""
print(f"\nBenchmarking: {model_name} ({quantization or 'fp16'})")
# Load model based on quantization type
load_kwargs = {"device_map": "auto", "trust_remote_code": True}
if quantization == "gptq":
load_kwargs["torch_dtype"] = torch.float16
elif quantization == "awq":
load_kwargs["torch_dtype"] = torch.float16
elif quantization == "4bit":
from transformers import BitsAndBytesConfig
load_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
elif quantization == "8bit":
from transformers import BitsAndBytesConfig
load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
else:
load_kwargs["torch_dtype"] = torch.float16
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
run_results = []
for prompt in self.prompts:
for run in range(n_runs):
# Clear cache
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_tokens = inputs.input_ids.shape[1]
# Generate with timing
torch.cuda.synchronize() if torch.cuda.is_available() else None
start_time = time.perf_counter()
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
pad_token_id=tokenizer.pad_token_id
)
torch.cuda.synchronize() if torch.cuda.is_available() else None
elapsed = time.perf_counter() - start_time
generated_tokens = outputs.shape[1] - prompt_tokens
# Get memory usage
if torch.cuda.is_available():
memory_gb = torch.cuda.max_memory_allocated() / 1e9
else:
memory_gb = 0
result = BenchmarkResult(
model_name=model_name,
quantization=quantization or "fp16",
prompt_tokens=prompt_tokens,
generated_tokens=generated_tokens,
time_seconds=elapsed,
memory_gb=memory_gb,
tokens_per_second=generated_tokens / elapsed
)
run_results.append(result)
print(f" Run {run+1}/{n_runs}: {result.tokens_per_second:.1f} tok/s, {result.memory_gb:.2f} GB")
# Cleanup
del model
del tokenizer
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
self.results.extend(run_results)
return run_results
def get_summary(self) -> Dict:
"""Get aggregated summary of all benchmark runs."""
from collections import defaultdict
from statistics import mean, stdev
grouped = defaultdict(list)
for r in self.results:
key = (r.model_name, r.quantization)
grouped[key].append(r)
summary = []
for (model, quant), runs in grouped.items():
tps = [r.tokens_per_second for r in runs]
mem = [r.memory_gb for r in runs]
summary.append({
"model": model,
"quantization": quant,
"avg_tokens_per_sec": mean(tps),
"std_tokens_per_sec": stdev(tps) if len(tps) > 1 else 0,
"avg_memory_gb": mean(mem),
"n_runs": len(runs)
})
return summary
def save_results(self, filepath: str):
"""Save results to JSON."""
data = {
"results": [vars(r) for r in self.results],
"summary": self.get_summary()
}
with open(filepath, 'w') as f:
json.dump(data, f, indent=2)
# Example usage
if __name__ == "__main__":
benchmark = QuantizationBenchmark()
# Compare different quantizations of the same model
models_to_test = [
("TheBloke/Llama-2-7B-fp16", None),
("TheBloke/Llama-2-7B-GPTQ", "gptq"),
("TheBloke/Llama-2-7B-AWQ", "awq"),
# For bitsandbytes, use the base model
("meta-llama/Llama-2-7b-hf", "4bit"),
("meta-llama/Llama-2-7b-hf", "8bit"),
]
for model_name, quant in models_to_test:
try:
benchmark.benchmark_model(model_name, quant, n_runs=3)
except Exception as e:
print(f"Failed to benchmark {model_name}: {e}")
# Print summary
print("\n=== Summary ===")
for s in benchmark.get_summary():
print(f"{s['model']} ({s['quantization']}): "
f"{s['avg_tokens_per_sec']:.1f} tok/s, "
f"{s['avg_memory_gb']:.2f} GB")
benchmark.save_results("benchmark_results.json")
Quality Evaluation
Quantization reduces model quality. Here's how to measure the impact:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math
def calculate_perplexity(model, tokenizer, texts: list, max_length: int = 512) -> float:
"""Calculate perplexity on a set of texts."""
model.eval()
total_loss = 0
total_tokens = 0
with torch.no_grad():
for text in texts:
inputs = tokenizer(
text,
return_tensors="pt",
truncation=True,
max_length=max_length
).to(model.device)
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss.item()
total_loss += loss * inputs["input_ids"].shape[1]
total_tokens += inputs["input_ids"].shape[1]
avg_loss = total_loss / total_tokens
perplexity = math.exp(avg_loss)
return perplexity
# Test texts (use a proper evaluation dataset in practice)
eval_texts = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is transforming how we interact with technology.",
"In the beginning, there was nothing but vast emptiness.",
]
# Compare perplexity across quantizations
# Lower perplexity = better
results = {}
# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
results["fp16"] = calculate_perplexity(model_fp16, tokenizer, eval_texts)
del model_fp16
# GPTQ
model_gptq = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto"
)
results["gptq"] = calculate_perplexity(model_gptq, tokenizer, eval_texts)
del model_gptq
print("Perplexity comparison:")
for quant, ppl in results.items():
print(f" {quant}: {ppl:.2f}")
Troubleshooting Common Issues
CUDA Out of Memory
# Solution 1: Generate fewer tokens per call (or shrink the batch)
model.generate(**inputs, max_new_tokens=50)  # Instead of 512
# Solution 2: Use more aggressive quantization
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,  # FP16 compute (same size as BF16; use whichever your GPU supports)
bnb_4bit_use_double_quant=True # Nested quantization
)
# Solution 3: Offload to CPU
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
offload_folder="offload",
offload_state_dict=True
)
GPTQ Loading Errors
# Install correct packages
pip install auto-gptq optimum
# For CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
AWQ Compatibility
# AWQ requires specific GPU architectures (Ampere+)
# For older GPUs, use GPTQ instead
# Check GPU compatibility
import torch
if torch.cuda.is_available():
capability = torch.cuda.get_device_capability()
print(f"CUDA capability: {capability}")
if capability[0] < 8:
print("Warning: AWQ may not work optimally on this GPU")
Decision Guide
Need to fine-tune?
├── Yes → bitsandbytes 4-bit (QLoRA)
└── No
├── Have GPU?
│ ├── Yes
│ │ ├── Ampere+ GPU (RTX 30xx, 40xx)?
│ │ │ ├── Yes → AWQ (best quality + speed)
│ │ │ └── No → GPTQ (wider compatibility)
│ │ └── Need multiple models?
│ │ └── Use vLLM with quantization
│ └── No → GGUF (llama.cpp/Ollama)
└── Edge deployment? → GGUF with aggressive quantization
Conclusion
Quantization makes powerful LLMs accessible on consumer hardware:
- GPTQ: Best for GPU inference with good speed and wide compatibility
- AWQ: Better quality than GPTQ on modern GPUs (Ampere+)
- GGUF: Best for CPU or mixed CPU/GPU inference
- bitsandbytes: Essential for fine-tuning (QLoRA)
Key recommendations:
- Start with pre-quantized models from TheBloke on Hugging Face
- Use Q4_K_M for GGUF - best balance of size and quality
- Benchmark on your hardware - results vary significantly
- Evaluate quality - measure perplexity on your use case
- Consider AWQ for production GPU deployments
References
- GPTQ Paper: "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
- AWQ Paper: "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"
- llama.cpp: https://github.com/ggerganov/llama.cpp
- bitsandbytes: https://github.com/TimDettmers/bitsandbytes
- AutoGPTQ: https://github.com/PanQiWei/AutoGPTQ
- AutoAWQ: https://github.com/casper-hansen/AutoAWQ