Running LLMs Locally: A Complete Guide to Ollama and vLLM
Author: Jared Chung
Introduction
While cloud-based LLMs like GPT-4 and Claude are powerful, running models locally offers significant advantages: complete privacy, no API costs, no network latency, and offline capability. In this post, we'll explore how to run open-source LLMs locally, using Ollama for ease of use and vLLM for high-performance production deployments.
Why Run LLMs Locally?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No per-token API fees |
| Latency | No network round trips |
| Availability | Works offline |
| Customization | Fine-tune for your use case |
| Control | No rate limits or policy restrictions |
Part 1: Ollama - The Easy Way
Ollama makes running local LLMs as simple as running Docker containers.
Installation
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from https://ollama.ai for Windows/macOS
Basic Usage
# Pull a model
ollama pull llama3.2
# Run interactively
ollama run llama3.2
# List installed models
ollama list
# Remove a model
ollama rm llama3.2
Available Models
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| llama3.2 | 3B | 4GB | Fast, general use |
| llama3.1 | 8B | 8GB | Balanced performance |
| llama3.1:70b | 70B | 48GB | High quality |
| mistral | 7B | 6GB | Efficient, fast |
| mixtral | 8x7B | 32GB | MoE, high quality |
| codellama | 7B-34B | 6-24GB | Code generation |
| phi3 | 3.8B | 4GB | Compact, efficient |
| gemma2 | 9B/27B | 8-20GB | Google's open model |
| qwen2.5 | 7B-72B | 6-48GB | Multilingual |
Python Integration
import ollama
# Simple completion
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing in simple terms'}
    ]
)
print(response['message']['content'])

# Streaming
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about programming'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Using with OpenAI SDK
Ollama provides an OpenAI-compatible API:
from openai import OpenAI
# Point to Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
print(response.choices[0].message.content)
Custom Models with Modelfile
Create specialized models by modifying system prompts and parameters:
# Save as Modelfile
FROM llama3.2
# Set system prompt
SYSTEM You are a senior Python developer. You write clean, well-documented code following PEP 8 guidelines. Always include type hints and docstrings.
# Adjust parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Create custom model
ollama create python-dev -f Modelfile
# Use it
ollama run python-dev
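Once created, the custom model can be called from Python exactly like any stock model. A quick sketch, assuming the python-dev model built above (the prompt is just an example):
import ollama

# The system prompt and parameters baked into the Modelfile apply automatically
response = ollama.chat(
    model='python-dev',
    messages=[{'role': 'user', 'content': 'Write a function that validates an email address.'}]
)
print(response['message']['content'])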
Embeddings with Ollama
import ollama
# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of artificial intelligence'
)
embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}") # 768
LangChain Integration
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create chain
prompt = ChatPromptTemplate.from_template("Explain {topic} to a 5-year-old")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "gravity"})
print(result)
Part 2: vLLM - Production Performance
vLLM is optimized for high-throughput inference. Its two headline features are PagedAttention, which stores the KV cache in fixed-size blocks (much like virtual-memory paging) to reduce memory fragmentation, and continuous batching, which slots incoming requests into the running batch instead of waiting for the current batch to finish.
Installation
pip install vllm
Basic Server
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 8000 \
--tensor-parallel-size 1
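Assuming the server above is running on port 8000, a quick sanity check against its OpenAI-compatible /v1/models route confirms the model loaded:
import requests

# Lists the models the running vLLM server is serving
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])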
Python API
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate
prompts = [
    "Write a Python function to sort a list",
    "Explain the theory of relativity",
    "What are the benefits of exercise?"
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}\n")
OpenAI-Compatible Server
# Start server
# python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)
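The same client can also stream tokens as they are generated: pass stream=True and read the deltas. A minimal sketch against the server started above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a limerick about GPUs"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end='', flush=True)
print()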
Performance Optimization
Tensor Parallelism (Multi-GPU)
# Use 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 2
Quantization
from vllm import LLM
# Use AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq"
)

# Use GPTQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",
    quantization="gptq"
)
Memory Optimization
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096,          # Limit context length
    enforce_eager=True           # Disable CUDA graphs for memory savings
)
Part 3: Model Comparison
Benchmark: Llama 3.2 vs Mistral vs Phi-3
Testing on common tasks with Ollama:
import ollama
import time
models = ['llama3.2', 'mistral', 'phi3']
prompt = "Write a Python function to find the nth Fibonacci number with memoization."
for model in models:
    start = time.time()
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}]
    )
    elapsed = time.time() - start

    tokens = response.get('eval_count', 0)
    print(f"\n{model}:")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Tokens: {tokens}")
    print(f"  Tokens/sec: {tokens/elapsed:.1f}")
Choosing the Right Model
| Use Case | Recommended Model | Reason |
|---|---|---|
| General chat | llama3.2, mistral | Good balance |
| Code generation | codellama, deepseek-coder | Specialized |
| Fast responses | phi3, llama3.2:1b | Small, quick |
| High quality | llama3.1:70b, mixtral | More capable |
| Multilingual | qwen2.5, aya | Better language support |
| Long context | yarn-mistral, llama3.1 | Extended context |
Part 4: Building Applications
RAG with Local Models
import ollama
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
# Index documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning uses algorithms to learn from data.",
    "Neural networks are inspired by the human brain."
]
embeddings = embed_model.encode(documents).tolist()

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# RAG query
def rag_query(question: str) -> str:
    # Retrieve
    query_embedding = embed_model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=2)
    context = "\n".join(results['documents'][0])

    # Generate
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer based on this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']
answer = rag_query("What is machine learning?")
print(answer)
Streaming Chat Interface
import ollama
def chat():
    messages = []
    print("Chat with Llama (type 'quit' to exit)\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({'role': 'user', 'content': user_input})

        print("Assistant: ", end='', flush=True)
        response_text = ""
        stream = ollama.chat(
            model='llama3.2',
            messages=messages,
            stream=True
        )
        for chunk in stream:
            text = chunk['message']['content']
            print(text, end='', flush=True)
            response_text += text
        print("\n")

        messages.append({'role': 'assistant', 'content': response_text})

if __name__ == "__main__":
    chat()
FastAPI Server
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama
app = FastAPI()
class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"

@app.post("/chat")
async def chat(request: ChatRequest):
    response = ollama.chat(
        model=request.model,
        messages=[{'role': 'user', 'content': request.message}]
    )
    return {"response": response['message']['content']}

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        stream = ollama.chat(
            model=request.model,
            messages=[{'role': 'user', 'content': request.message}],
            stream=True
        )
        for chunk in stream:
            yield chunk['message']['content']

    return StreamingResponse(generate(), media_type="text/plain")
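To exercise both endpoints, run the app with uvicorn (assumed here to be serving on port 8000) and call it with any HTTP client. A short sketch using requests, with placeholder prompts:
import requests

BASE = "http://localhost:8000"  # adjust to wherever uvicorn is serving the app

# Non-streaming endpoint
resp = requests.post(f"{BASE}/chat", json={"message": "What is FastAPI?"})
print(resp.json()["response"])

# Streaming endpoint: print chunks as they arrive
with requests.post(f"{BASE}/chat/stream",
                   json={"message": "Tell me a short story"}, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end='', flush=True)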
Hardware Recommendations
| Model Size | Minimum RAM | Recommended GPU | Notes |
|---|---|---|---|
| 1-3B | 4GB | None (CPU OK) | Fast on M1/M2 |
| 7-8B | 8GB | 8GB VRAM | RTX 3070+ |
| 13B | 16GB | 12GB VRAM | RTX 3080+ |
| 30-34B | 32GB | 24GB VRAM | RTX 4090 |
| 70B | 64GB | 48GB+ VRAM | Multi-GPU or quantized |
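The numbers in this table follow from simple arithmetic: memory is roughly parameter count times bytes per weight, plus headroom for the KV cache and runtime. Ollama's default downloads are typically 4-bit quantized, which is why an 8B model fits comfortably in 8GB. A rough back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a measured constant):
def estimate_memory_gb(params_billions: float, bits_per_weight: int = 4,
                       overhead: float = 1.2) -> float:
    """Rough estimate: weights plus ~20% for KV cache and runtime overhead.
    Actual usage depends on context length, batch size, and backend."""
    weight_gb = params_billions * (bits_per_weight / 8)
    return weight_gb * overhead

print(f"8B  @ 4-bit: ~{estimate_memory_gb(8):.1f} GB")       # ~4.8 GB
print(f"8B  @ FP16 : ~{estimate_memory_gb(8, 16):.1f} GB")   # ~19 GB
print(f"70B @ 4-bit: ~{estimate_memory_gb(70):.1f} GB")      # ~42 GB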
Complete Production Setup
Docker Deployment for Ollama
# Dockerfile
FROM ollama/ollama:latest
# Pre-pull models on build (optional)
# RUN ollama pull llama3.2
EXPOSE 11434
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
# Start with docker compose
docker compose up -d
# Pull model
docker exec -it ollama-ollama-1 ollama pull llama3.2
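Once the container is up, the host can talk to it over the published port just like a native install. A quick check using the ollama Python client, assuming the default 11434 port mapping from the compose file above:
import ollama

# Point the client at the container's published port
client = ollama.Client(host="http://localhost:11434")

print([m['name'] for m in client.list()['models']])  # models pulled into the container
reply = client.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'ping'}])
print(reply['message']['content'])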
Production vLLM with Nginx
# nginx.conf
upstream vllm {
    server 127.0.0.1:8000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name llm.example.com;

    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location /v1/ {
        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For streaming
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
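Clients then talk to the proxy over HTTPS instead of hitting vLLM directly. A sketch with the OpenAI SDK, assuming the llm.example.com name and certificates from the config above:
from openai import OpenAI

# TLS terminates at nginx, which forwards /v1/ requests to the local vLLM server
client = OpenAI(base_url="https://llm.example.com/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from behind the proxy"}]
)
print(response.choices[0].message.content)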
Complete Local LLM Service
import ollama
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import List
import json
import time
from contextlib import asynccontextmanager
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Request/Response models
class Message(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str

class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: List[Message]
    stream: bool = False
    temperature: float = 0.7
    max_tokens: int = 2048

class ChatResponse(BaseModel):
    model: str
    response: str
    tokens_used: int

# Health check and model management
class ModelManager:
    def __init__(self):
        self.available_models = []

    async def refresh_models(self):
        """Refresh list of available models."""
        try:
            result = ollama.list()
            self.available_models = [m['name'] for m in result['models']]
            logger.info(f"Available models: {self.available_models}")
        except Exception as e:
            logger.error(f"Failed to list models: {e}")

    async def ensure_model(self, model_name: str):
        """Ensure a model is available, pull if needed."""
        if model_name not in self.available_models:
            logger.info(f"Pulling model: {model_name}")
            ollama.pull(model_name)
            await self.refresh_models()

model_manager = ModelManager()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await model_manager.refresh_models()
    yield
    # Shutdown
    pass

app = FastAPI(
    title="Local LLM API",
    description="OpenAI-compatible API for local LLMs",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "healthy", "models": model_manager.available_models}

@app.get("/v1/models")
async def list_models():
    """List available models (OpenAI-compatible)."""
    return {
        "object": "list",
        "data": [
            {"id": m, "object": "model", "created": int(time.time()), "owned_by": "local"}
            for m in model_manager.available_models
        ]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    """Chat completions endpoint (OpenAI-compatible)."""
    try:
        messages = [{"role": m.role, "content": m.content} for m in request.messages]

        if request.stream:
            return StreamingResponse(
                stream_response(request.model, messages),
                media_type="text/event-stream"
            )

        response = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )

        return {
            "id": "chatcmpl-local",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": response['message']['content']
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": response.get('prompt_eval_count', 0),
                "completion_tokens": response.get('eval_count', 0),
                "total_tokens": response.get('prompt_eval_count', 0) + response.get('eval_count', 0)
            }
        }
    except Exception as e:
        logger.error(f"Chat error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

async def stream_response(model: str, messages: list):
    """Stream chat response as server-sent events."""
    try:
        stream = ollama.chat(model=model, messages=messages, stream=True)
        for chunk in stream:
            content = chunk['message']['content']
            # json.dumps produces valid JSON for the SSE payload
            data = json.dumps({"choices": [{"delta": {"content": content}}]})
            yield f"data: {data}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as e:
        yield f"data: {json.dumps({'error': str(e)})}\n\n"

@app.post("/pull/{model_name}")
async def pull_model(model_name: str, background_tasks: BackgroundTasks):
    """Pull a model in the background."""
    background_tasks.add_task(model_manager.ensure_model, model_name)
    return {"status": "pulling", "model": model_name}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
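With the service running on port 8080, any HTTP client can drive it. A quick smoke test with requests (the prompt is a placeholder):
import requests

BASE = "http://localhost:8080"

print(requests.get(f"{BASE}/health").json())

resp = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Give me one tip for writing clean Python."}]
})
print(resp.json()["choices"][0]["message"]["content"])
Because the routes mirror the OpenAI API, the OpenAI SDK pointed at http://localhost:8080/v1 works as well, just as in the earlier Ollama example.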
Troubleshooting
Common Issues
1. Out of Memory
# Reduce context length
ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 2048}  # Default is 4096
)
# Or use a smaller quantization
# Pull a smaller model variant
# ollama pull llama3.2:1b
2. Slow Performance on Mac
# Ensure Metal is being used (should show in Ollama logs)
# Check GPU usage
sudo powermetrics --samplers gpu_power
# For M1/M2, ensure enough memory is allocated
# Models use unified memory, so more RAM = better
3. CUDA Errors on Linux
# Check NVIDIA driver
nvidia-smi
# Check CUDA version
nvcc --version
# Ensure compatible versions
# vLLM requires CUDA 11.8+ or 12.1+
# For Ollama, set GPU explicitly
CUDA_VISIBLE_DEVICES=0 ollama serve
4. Model Loading Slow
# Pre-load model on startup
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up model
    ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    yield
Performance Benchmarking
import ollama
import time
from statistics import mean, stdev
def benchmark_model(model: str, prompts: list, n_runs: int = 3):
    """Benchmark a model's performance."""
    results = []

    for prompt in prompts:
        run_times = []
        tokens_per_sec = []

        for _ in range(n_runs):
            start = time.perf_counter()
            response = ollama.chat(
                model=model,
                messages=[{'role': 'user', 'content': prompt}]
            )
            elapsed = time.perf_counter() - start

            tokens = response.get('eval_count', 0)
            run_times.append(elapsed)
            if elapsed > 0:
                tokens_per_sec.append(tokens / elapsed)

        results.append({
            'prompt': prompt[:50],
            'avg_time': mean(run_times),
            'std_time': stdev(run_times) if len(run_times) > 1 else 0,
            'avg_tokens_per_sec': mean(tokens_per_sec) if tokens_per_sec else 0
        })

    return results

# Run benchmark
prompts = [
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms",
    "What are the benefits of renewable energy?"
]

print("Benchmarking llama3.2...")
results = benchmark_model('llama3.2', prompts)
for r in results:
    print(f"  {r['prompt']}: {r['avg_time']:.2f}s ({r['avg_tokens_per_sec']:.1f} tok/s)")
Comparison: Ollama vs vLLM vs llama.cpp
| Feature | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Ease of use | Easiest | Medium | Manual |
| Performance | Good | Best | Good |
| GPU support | Auto | Excellent | Manual |
| CPU inference | Good | Limited | Excellent |
| Quantization | Built-in | AWQ/GPTQ | GGUF |
| API | OpenAI-compatible | OpenAI-compatible | Custom |
| Best for | Development | Production | Edge/Custom |
Conclusion
Running LLMs locally is now accessible to everyone:
- Ollama: Perfect for development, prototyping, and personal use
- vLLM: Ideal for production deployments with high throughput requirements
- Quantization: Enables larger models on consumer hardware
Start with Ollama for experimentation, then move to vLLM when you need production performance.
References
- Ollama: https://ollama.ai
- vLLM: https://docs.vllm.ai
- Hugging Face Models: https://huggingface.co/models