LLM Quantization: GPTQ, AWQ, GGUF and When to Use Each

By Jared Chung

Introduction

Large Language Models are incredibly powerful, but also incredibly resource-hungry. A 70-billion-parameter model in 16-bit precision requires roughly 140GB of memory, far exceeding what any consumer GPU can handle. So how do people run these models on laptops and gaming PCs?

The answer is quantization: a technique that dramatically reduces memory requirements while preserving most of the model's capabilities. In this post, we'll demystify quantization, understand the different methods, and learn when to use each approach.

What is Quantization?

At its core, quantization is about trading precision for memory. Instead of storing model weights as high-precision floating-point numbers (32 or 16 bits), we store them as smaller integers (8 or 4 bits).

[Figure: Quantization: Precision vs Memory]

The Math Behind It

A neural network weight like 0.3847293 would normally be stored as a 32-bit floating-point number. Through quantization:

| Format | Storage | How It Works |
| --- | --- | --- |
| FP32 | 4 bytes | Exact value stored as floating point |
| FP16 | 2 bytes | Reduced precision, stores ~0.3848 |
| INT8 | 1 byte | Mapped to integer 98 (in range 0-255) |
| INT4 | 0.5 bytes | Mapped to integer 6 (in range 0-15) |

The key insight: Neural networks are remarkably tolerant of reduced precision. The small errors introduced by quantization tend to average out, and the model still produces coherent, useful outputs.
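
To make the mapping concrete, here is a minimal sketch of the arithmetic, assuming the weight lies in a known range (here [0, 1] to match the table above); real quantizers compute this range per tensor or per group of weights:

def quantize(value, num_bits, val_min=0.0, val_max=1.0):
    """Map a float in [val_min, val_max] onto an integer grid with 2**num_bits levels."""
    levels = 2 ** num_bits - 1                 # 255 for INT8, 15 for INT4
    scale = (val_max - val_min) / levels       # step size between adjacent integer codes
    code = round((value - val_min) / scale)    # nearest integer code
    return code, code * scale + val_min        # the code and the value it reconstructs to

weight = 0.3847293
print(quantize(weight, 8))   # (98, ~0.3843) -- the INT8 row above
print(quantize(weight, 4))   # (6, 0.4)      -- the INT4 row, with visibly more error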

Memory Savings at Scale

The savings become dramatic for large models:

| Model Size | FP32 | FP16 | INT8 | INT4 |
| --- | --- | --- | --- | --- |
| 7B params | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B params | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B params | 280 GB | 140 GB | 70 GB | 35 GB |

A 4-bit quantized 70B model (about 35 GB of weights) can run on a single 48GB GPU, or split across two 24GB consumer cards, something impossible at full precision.
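
The arithmetic behind these numbers is just parameter count times bytes per parameter; a quick sketch (weights only, ignoring activations and the KV cache):

def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory needed for the model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.1f} GB")
# 70B at 32-bit: 280.0 GB ... 70B at 4-bit: 35.0 GB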

Understanding the Methods

Different quantization methods take different approaches to the precision-quality trade-off. Each has strengths for specific use cases.

The Landscape

| Method | Best For | GPU Required | Quality |
| --- | --- | --- | --- |
| GPTQ | GPU inference | Yes | Good |
| AWQ | GPU inference (modern) | Yes (Ampere+) | Better |
| GGUF | CPU/mixed inference | No | Good |
| bitsandbytes | Fine-tuning | Yes | Good |

GPTQ: The GPU Workhorse

GPTQ, introduced in the paper "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers", was one of the first practical methods for 4-bit LLM quantization. It remains popular due to its wide compatibility.

How GPTQ Works

The methodology behind GPTQ involves three key steps:

1. Calibration: The algorithm runs a small set of sample inputs through the model to understand how weights are actually used. This helps identify which weights matter most.

2. Optimal Quantization: Using the calibration data, GPTQ quantizes each layer's weights a few columns at a time, applying approximate second-order information to adjust the not-yet-quantized weights and compensate for the error just introduced, keeping the layer's output close to the original.

3. Group Quantization: Weights are quantized in groups (typically 128 weights per group). Each group has its own scale and zero-point, allowing better precision where needed.
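
To make step 3 concrete, here is a minimal sketch of group-wise 4-bit quantization. It shows only the grouping and per-group scaling idea, not GPTQ's calibration-driven, error-compensating weight updates:

import numpy as np

def group_quantize(weights, group_size=128, bits=4):
    """Quantize a 1-D weight array in groups, each with its own scale and zero-point."""
    levels = 2 ** bits - 1
    codes, scales, zero_points = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = float(group.min()), float(group.max())
        scale = (hi - lo) / levels if hi > lo else 1.0   # per-group step size
        codes.append(np.round((group - lo) / scale).astype(np.uint8))
        scales.append(scale)
        zero_points.append(lo)
    return codes, scales, zero_points

w = np.random.randn(4096).astype(np.float32)
codes, scales, zeros = group_quantize(w)
recovered = np.concatenate([c * s + z for c, s, z in zip(codes, scales, zeros)])
print("mean absolute error:", np.abs(w - recovered).mean())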

When to Use GPTQ

Pros:

  • Works on any CUDA GPU
  • Fast inference
  • Mature ecosystem with many pre-quantized models
  • Good quality at 4-bit

Cons:

  • Slightly lower quality than AWQ on modern GPUs
  • Requires calibration data for custom quantization
  • GPU-only (no CPU inference)

Using GPTQ in Practice

Most users download pre-quantized models rather than quantizing themselves:

from transformers import AutoModelForCausalLM, AutoTokenizer

# TheBloke provides many pre-quantized models
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-GPTQ")

# Use like any other model
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

AWQ: Better Quality on Modern GPUs

AWQ (Activation-aware Weight Quantization) is a newer method that often achieves better quality than GPTQ by being smarter about which weights matter most.

How AWQ Works

The key insight behind AWQ: not all weights are equally important. Some weights, when quantized poorly, cause large output errors. Others can be quantized aggressively with minimal impact.

AWQ identifies these important weights by looking at activation patterns—how inputs actually flow through the network. Weights that frequently interact with large activations are treated more carefully during quantization.

The methodology:

  1. Activation analysis: Run sample inputs through the model and record the activation magnitudes each weight channel sees
  2. Importance scoring: A channel's importance is proportional to the average magnitude of the activations it processes
  3. Adaptive quantization: Important channels are scaled up before quantization (with the inverse scale folded into neighboring operations), giving them higher effective precision
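
A simplified sketch of the scaling idea (this illustrates activation-driven per-channel scaling only; the real AWQ implementation also searches for the best scaling exponent and fuses the inverse scale into neighboring layers):

import numpy as np

def awq_style_scaling(W, X, alpha=0.5):
    """Scale salient input channels of W before quantization.

    W: (out_features, in_features) weight matrix
    X: (num_samples, in_features) calibration activations feeding this layer
    """
    importance = np.abs(X).mean(axis=0)        # average activation magnitude per input channel
    scales = importance ** alpha               # temper the effect (alpha is found by a small search)
    scales /= scales.mean()                    # keep overall weight magnitude roughly unchanged
    W_scaled = W * scales                      # salient channels now occupy a finer quantization grid
    # To keep the layer's output unchanged, the inverse scale is applied to the
    # incoming activations (or folded into the previous layer): y = (W * s) @ (x / s)
    return W_scaled, scales

W = np.random.randn(512, 1024).astype(np.float32)
X = np.random.randn(64, 1024).astype(np.float32)
W_scaled, scales = awq_style_scaling(W, X)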

When to Use AWQ

Pros:

  • Better quality than GPTQ at same bit width
  • Fast inference on modern GPUs
  • Growing ecosystem of pre-quantized models

Cons:

  • Requires Ampere or newer GPU (RTX 30xx, 40xx)
  • Less mature than GPTQ
  • GPU-only

AWQ vs GPTQ Quality

In benchmarks, AWQ typically sits noticeably closer to the FP16 baseline than GPTQ at 4-bit (lower perplexity is better):

| Model | FP16 (baseline) | GPTQ 4-bit | AWQ 4-bit |
| --- | --- | --- | --- |
| Llama-2-7B | 5.47 | 5.62 (+2.7%) | 5.58 (+2.0%) |
| Llama-2-13B | 4.88 | 5.01 (+2.7%) | 4.97 (+1.8%) |
| Mistral-7B | 5.25 | 5.38 (+2.5%) | 5.34 (+1.7%) |

The difference is meaningful for production applications where every bit of quality matters.

GGUF: The Flexible Format

GGUF (GPT-Generated Unified Format) is designed for the llama.cpp ecosystem. Unlike GPTQ and AWQ, which are GPU-focused, GGUF enables efficient inference on CPUs with optional GPU acceleration.

How GGUF Works

GGUF uses a different approach called k-quant that's optimized for CPU inference:

Mixed-precision blocks: Instead of uniform quantization, GGUF uses different precision levels within the same layer. Some blocks get 6 bits, others get 4 or even 2 bits.

CPU-optimized kernels: The format includes highly optimized code paths for x86 (Intel/AMD) and ARM (Apple Silicon) CPUs.

Flexible offloading: You can offload some layers to GPU while keeping others on CPU, enabling larger models than your GPU memory alone would allow.
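
As a sketch of what offloading looks like in practice, here is the llama-cpp-python binding (the model path is a placeholder; adjust n_gpu_layers to however many layers fit in your VRAM, or set it to -1 to offload everything):

from llama_cpp import Llama  # pip install llama-cpp-python

# Load a GGUF file, offloading 20 transformer layers to the GPU and keeping the rest on CPU.
llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",   # placeholder path to a downloaded GGUF file
    n_gpu_layers=20,
    n_ctx=4096,
)

output = llm("Q: What is quantization? A:", max_tokens=64)
print(output["choices"][0]["text"])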

GGUF Quantization Levels

GGUF offers fine-grained control over the quality-size trade-off:

| Quant Type | Bits | Size (7B) | Quality | Use Case |
| --- | --- | --- | --- | --- |
| Q2_K | 2 | 2.5 GB | Low | Extreme memory constraints |
| Q3_K_M | 3 | 3.3 GB | Fair | Low-memory systems |
| Q4_K_M | 4 | 4.0 GB | Good | Recommended default |
| Q5_K_M | 5 | 4.8 GB | Very Good | Quality-focused |
| Q6_K | 6 | 5.5 GB | Excellent | Near-lossless |
| Q8_0 | 8 | 7.0 GB | Near FP16 | Maximum quality |

Recommendation: Start with Q4_K_M for the best balance of size and quality. Move to Q5_K_M if you notice quality issues.

When to Use GGUF

Pros:

  • Runs on CPU (no GPU required)
  • Works on Mac, Windows, Linux
  • Excellent for Apple Silicon (M1/M2/M3)
  • Fine-grained size/quality control
  • Used by Ollama, LM Studio, and other popular tools

Cons:

  • Slower than GPU-optimized formats
  • More complex ecosystem (many quant variants)

Using GGUF with Ollama

The easiest way to use GGUF models:

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model (automatically downloads GGUF)
ollama run llama3.2

# Or request a specific quantization tag (exact tags vary by model; see ollama.com)
ollama run llama3.2:3b-instruct-q4_K_M
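
Once the server is running, you can also call it from Python through its local HTTP API; a minimal sketch (the model name and prompt are just examples):

import requests

# Ollama serves a local HTTP API on port 11434 once the server or desktop app is running.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Explain quantization in one sentence.", "stream": False},
)
print(resp.json()["response"])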

bitsandbytes: The Training-Friendly Option

Unlike the other methods, bitsandbytes is designed primarily for training and fine-tuning, not just inference.

How bitsandbytes Works

bitsandbytes loads models in 4-bit using NF4 (Normal Float 4-bit), a data type whose quantization levels are tailored to the roughly normal distribution of neural network weights:

On-the-fly dequantization: Weights are stored in 4-bit and dequantized to the compute dtype as each layer runs, during both the forward and backward passes.

Nested (double) quantization: The quantization constants themselves can be quantized, saving additional memory.

Gradient-friendly: Unlike static export formats, bitsandbytes lets gradients flow through the quantized layers, so small trainable adapters attached to them can still be updated.

When to Use bitsandbytes

The primary use case: Fine-tuning large models with QLoRA.

QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 70B model on a single GPU by:

  1. Loading the base model in 4-bit using bitsandbytes
  2. Adding small trainable LoRA adapters
  3. Only training the adapters while the frozen base model stays quantized
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit config for QLoRA
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=config,
    device_map="auto"
)
# Now add LoRA adapters and train!
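
To round out the QLoRA picture, attaching the adapters with the peft library might look roughly like this (rank, alpha, and target modules are illustrative defaults; tune them for your model):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training (gradient checkpointing, upcasting layer norms, etc.)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank adapter matrices
    lora_alpha=32,                         # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],   # which linear layers receive adapters (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable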

Choosing the Right Method

Use this decision tree to pick the right quantization approach:

[Figure: Quantization decision tree]

Quick Reference

| Scenario | Method | Why |
| --- | --- | --- |
| Fine-tuning a model | bitsandbytes | Only option that supports training |
| GPU inference, best quality | AWQ | Better quality than GPTQ |
| GPU inference, older GPU | GPTQ | Works on any CUDA GPU |
| CPU/Mac inference | GGUF | Optimized for non-GPU hardware |
| Resource-constrained | GGUF Q2/Q3 | Smallest possible size |
| Local app (Ollama) | GGUF | Native format |

Practical Considerations

Quality vs Speed vs Memory

Every quantization decision involves trade-offs:

Lower bits = smaller size but potentially lower quality and sometimes faster inference (less memory bandwidth needed).

Higher bits = better quality but larger memory footprint and potentially slower (more data to transfer).

For most use cases, 4-bit quantization hits the sweet spot—models fit on consumer hardware with minimal quality loss.

Pre-quantized vs DIY

Use pre-quantized models when:

  • You're using a popular base model (Llama, Mistral, etc.)
  • You want to get started quickly
  • Quality is sufficient for your needs

Quantize yourself when:

  • You've fine-tuned a custom model
  • You need a specific quantization configuration
  • Pre-quantized versions aren't available

Finding Pre-quantized Models

One of the best-known sources of pre-quantized models is TheBloke on Hugging Face, who has published GPTQ, AWQ, and GGUF versions of most popular models.

Model naming convention:

  • Llama-2-7B-GPTQ = GPTQ 4-bit
  • Llama-2-7B-AWQ = AWQ 4-bit
  • llama-2-7b.Q4_K_M.gguf = GGUF 4-bit (K-quant Medium)

Common Issues and Solutions

Out of Memory

If you're hitting OOM errors:

  1. Try a smaller quantization: Q4 → Q3 or Q2
  2. Reduce context length: Less memory for KV cache
  3. Offload layers: Split between GPU and CPU
  4. Use a smaller model: 7B instead of 13B
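
For the layer-offloading option, a minimal sketch using transformers' device_map with explicit memory caps (the model name and limits are illustrative; set them to match your hardware):

import torch
from transformers import AutoModelForCausalLM

# Fill GPU 0 up to the cap, then spill the remaining layers to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                  # example checkpoint; swap in the model you are loading
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},     # illustrative limits for GPU 0 and system RAM
)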

Slow Inference

If inference is too slow:

  1. For CPU: Use GGUF with GPU layer offloading
  2. For GPU: Ensure you're using the right format (GPTQ/AWQ, not GGUF)
  3. Check memory bandwidth: Quantization helps here by reducing data movement

Quality Issues

If outputs seem degraded:

  1. Try higher bit quantization: Q4 → Q5 or Q6
  2. Try a different method: AWQ often beats GPTQ
  3. Check calibration: Custom quantization may need better calibration data

Conclusion

Quantization democratizes access to large language models. What once required data center GPUs now runs on laptops and gaming PCs.

Key takeaways:

  1. Start with pre-quantized models: TheBloke and Hugging Face have you covered
  2. Use Q4_K_M GGUF for local inference: Best balance for most users
  3. Use AWQ for GPU production: Best quality on modern hardware
  4. Use bitsandbytes for fine-tuning: The only option that supports training
  5. Quality loss is minimal: 4-bit models are typically 95-98% as capable as full precision

The field continues to evolve rapidly. New methods regularly improve the quality-size trade-off, and the gap between quantized and full-precision models continues to shrink.
