Running LLMs Locally: A Complete Guide to Ollama and vLLM
Author: Jared Chung
Introduction
While cloud-based LLMs like GPT-4 and Claude are powerful, running models locally offers significant advantages: complete privacy, no API costs, no network latency, and offline capability. In this post, we'll explore how to run open-source LLMs locally, using Ollama for ease of use and vLLM for high-performance production deployments.
Why Run LLMs Locally?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No per-token API fees |
| Latency | No network round trips |
| Availability | Works offline |
| Customization | Fine-tune for your use case |
| Control | No rate limits or policy restrictions |
Part 1: Ollama - The Easy Way
Ollama makes running local LLMs as simple as running Docker containers.
Installation
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from https://ollama.ai for Windows/macOS
Basic Usage
# Pull a model
ollama pull llama3.2
# Run interactively
ollama run llama3.2
# List installed models
ollama list
# Remove a model
ollama rm llama3.2
Available Models
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| llama3.2 | 3B | 4GB | Fast, general use |
| llama3.1 | 8B | 8GB | Balanced performance |
| llama3.1:70b | 70B | 48GB | High quality |
| mistral | 7B | 6GB | Efficient, fast |
| mixtral | 8x7B | 32GB | MoE, high quality |
| codellama | 7B-34B | 6-24GB | Code generation |
| phi3 | 3.8B | 4GB | Compact, efficient |
| gemma2 | 9B/27B | 8-20GB | Google's open model |
| qwen2.5 | 7B-72B | 6-48GB | Multilingual |
Python Integration
import ollama
# Simple completion
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing in simple terms'}
    ]
)
print(response['message']['content'])

# Streaming
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about programming'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Using with OpenAI SDK
Ollama provides an OpenAI-compatible API:
from openai import OpenAI
# Point to Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
print(response.choices[0].message.content)
Custom Models with Modelfile
Create specialized models by modifying system prompts and parameters:
# Save as Modelfile
FROM llama3.2
# Set system prompt
SYSTEM You are a senior Python developer. You write clean, well-documented code following PEP 8 guidelines. Always include type hints and docstrings.
# Adjust parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Create custom model
ollama create python-dev -f Modelfile
# Use it
ollama run python-dev
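Once created, the custom model can be called from Python exactly like any stock model. A quick sketch, assuming the python-dev model built above (the prompt is just an example):
import ollama

# The system prompt and parameters baked into the Modelfile apply automatically
response = ollama.chat(
    model='python-dev',
    messages=[{'role': 'user', 'content': 'Write a function that validates an email address.'}]
)
print(response['message']['content'])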
Embeddings with Ollama
import ollama
# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of artificial intelligence'
)
embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}") # 768
LangChain Integration
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# Create chain
prompt = ChatPromptTemplate.from_template("Explain {topic} to a 5-year-old")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "gravity"})
print(result)
Part 2: vLLM - Production Performance
vLLM is optimized for high-throughput inference. Its two headline features are PagedAttention, which stores the KV cache in fixed-size blocks (much like virtual-memory paging) to reduce memory fragmentation, and continuous batching, which slots incoming requests into the running batch instead of waiting for the current batch to finish.
Installation
pip install vllm
Basic Server
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 8000 \
--tensor-parallel-size 1
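Assuming the server above is running on port 8000, a quick sanity check against its OpenAI-compatible /v1/models route confirms the model loaded:
import requests

# Lists the models the running vLLM server is serving
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])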
Python API
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate
prompts = [
    "Write a Python function to sort a list",
    "Explain the theory of relativity",
    "What are the benefits of exercise?"
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}\n")
OpenAI-Compatible Server
# Start server
# python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)
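The same client can also stream tokens as they are generated: pass stream=True and read the deltas. A minimal sketch against the server started above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a limerick about GPUs"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end='', flush=True)
print()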
Performance Optimization
Tensor Parallelism (Multi-GPU)
# Use 2 GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 2
Quantization
from vllm import LLM
# Use AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq"
)

# Use GPTQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",
    quantization="gptq"
)
Memory Optimization
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096,          # Limit context length
    enforce_eager=True           # Disable CUDA graphs for memory savings
)
Part 3: Model Comparison
Benchmark: Llama 3.2 vs Mistral vs Phi-3
Testing on common tasks with Ollama:
import ollama
import time
models = ['llama3.2', 'mistral', 'phi3']
prompt = "Write a Python function to find the nth Fibonacci number with memoization."
for model in models:
    start = time.time()
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}]
    )
    elapsed = time.time() - start

    tokens = response.get('eval_count', 0)
    print(f"\n{model}:")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Tokens: {tokens}")
    print(f"  Tokens/sec: {tokens/elapsed:.1f}")
Choosing the Right Model
| Use Case | Recommended Model | Reason |
|---|---|---|
| General chat | llama3.2, mistral | Good balance |
| Code generation | codellama, deepseek-coder | Specialized |
| Fast responses | phi3, llama3.2:1b | Small, quick |
| High quality | llama3.1:70b, mixtral | More capable |
| Multilingual | qwen2.5, aya | Better language support |
| Long context | yarn-mistral, llama3.1 | Extended context |
Part 4: Building Applications
RAG with Local Models
import ollama
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
# Index documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning uses algorithms to learn from data.",
    "Neural networks are inspired by the human brain."
]
embeddings = embed_model.encode(documents).tolist()

collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# RAG query
def rag_query(question: str) -> str:
    # Retrieve
    query_embedding = embed_model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=2)
    context = "\n".join(results['documents'][0])

    # Generate
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer based on this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']
answer = rag_query("What is machine learning?")
print(answer)
Streaming Chat Interface
import ollama
def chat():
    messages = []
    print("Chat with Llama (type 'quit' to exit)\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({'role': 'user', 'content': user_input})

        print("Assistant: ", end='', flush=True)
        response_text = ""
        stream = ollama.chat(
            model='llama3.2',
            messages=messages,
            stream=True
        )
        for chunk in stream:
            text = chunk['message']['content']
            print(text, end='', flush=True)
            response_text += text
        print("\n")

        messages.append({'role': 'assistant', 'content': response_text})

if __name__ == "__main__":
    chat()
FastAPI Server
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama
app = FastAPI()
class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"

@app.post("/chat")
async def chat(request: ChatRequest):
    response = ollama.chat(
        model=request.model,
        messages=[{'role': 'user', 'content': request.message}]
    )
    return {"response": response['message']['content']}

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        stream = ollama.chat(
            model=request.model,
            messages=[{'role': 'user', 'content': request.message}],
            stream=True
        )
        for chunk in stream:
            yield chunk['message']['content']

    return StreamingResponse(generate(), media_type="text/plain")
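To exercise both endpoints, run the app with uvicorn (assumed here to be serving on port 8000) and call it with any HTTP client. A short sketch using requests, with placeholder prompts:
import requests

BASE = "http://localhost:8000"  # adjust to wherever uvicorn is serving the app

# Non-streaming endpoint
resp = requests.post(f"{BASE}/chat", json={"message": "What is FastAPI?"})
print(resp.json()["response"])

# Streaming endpoint: print chunks as they arrive
with requests.post(f"{BASE}/chat/stream",
                   json={"message": "Tell me a short story"}, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end='', flush=True)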
Hardware Recommendations
| Model Size | Minimum RAM | Recommended GPU | Notes |
|---|---|---|---|
| 1-3B | 4GB | None (CPU OK) | Fast on M1/M2 |
| 7-8B | 8GB | 8GB VRAM | RTX 3070+ |
| 13B | 16GB | 12GB VRAM | RTX 3080+ |
| 30-34B | 32GB | 24GB VRAM | RTX 4090 |
| 70B | 64GB | 48GB+ VRAM | Multi-GPU or quantized |
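The numbers in this table follow from simple arithmetic: memory is roughly parameter count times bytes per weight, plus headroom for the KV cache and runtime. Ollama's default downloads are typically 4-bit quantized, which is why an 8B model fits comfortably in 8GB. A rough back-of-the-envelope sketch (the 20% overhead factor is an assumption, not a measured constant):
def estimate_memory_gb(params_billions: float, bits_per_weight: int = 4,
                       overhead: float = 1.2) -> float:
    """Rough estimate: weights plus ~20% for KV cache and runtime overhead.
    Actual usage depends on context length, batch size, and backend."""
    weight_gb = params_billions * (bits_per_weight / 8)
    return weight_gb * overhead

print(f"8B  @ 4-bit: ~{estimate_memory_gb(8):.1f} GB")       # ~4.8 GB
print(f"8B  @ FP16 : ~{estimate_memory_gb(8, 16):.1f} GB")   # ~19 GB
print(f"70B @ 4-bit: ~{estimate_memory_gb(70):.1f} GB")      # ~42 GB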
Complete Production Setup
Docker Deployment for Ollama
# Dockerfile
FROM ollama/ollama:latest
# Pre-pull models on build (optional)
# RUN ollama pull llama3.2
EXPOSE 11434
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
# Start with docker compose
docker compose up -d
# Pull model
docker exec -it ollama-ollama-1 ollama pull llama3.2
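Once the container is up, the host can talk to it over the published port just like a native install. A quick check using the ollama Python client, assuming the default 11434 port mapping from the compose file above:
import ollama

# Point the client at the container's published port
client = ollama.Client(host="http://localhost:11434")

print([m['name'] for m in client.list()['models']])  # models pulled into the container
reply = client.chat(model='llama3.2', messages=[{'role': 'user', 'content': 'ping'}])
print(reply['message']['content'])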
Production vLLM with Nginx
# nginx.conf
upstream vllm {
    server 127.0.0.1:8000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name llm.example.com;

    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location /v1/ {
        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For streaming
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
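Clients then talk to the proxy over HTTPS instead of hitting vLLM directly. A sketch with the OpenAI SDK, assuming the llm.example.com name and certificates from the config above:
from openai import OpenAI

# TLS terminates at nginx, which forwards /v1/ requests to the local vLLM server
client = OpenAI(base_url="https://llm.example.com/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from behind the proxy"}]
)
print(response.choices[0].message.content)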
Complete Local LLM Service
import ollama
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import List
import json
import time
from contextlib import asynccontextmanager
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Request/Response models
class Message(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str

class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: List[Message]
    stream: bool = False
    temperature: float = 0.7
    max_tokens: int = 2048

class ChatResponse(BaseModel):
    model: str
    response: str
    tokens_used: int

# Health check and model management
class ModelManager:
    def __init__(self):
        self.available_models = []

    async def refresh_models(self):
        """Refresh list of available models."""
        try:
            result = ollama.list()
            self.available_models = [m['name'] for m in result['models']]
            logger.info(f"Available models: {self.available_models}")
        except Exception as e:
            logger.error(f"Failed to list models: {e}")

    async def ensure_model(self, model_name: str):
        """Ensure a model is available, pull if needed."""
        if model_name not in self.available_models:
            logger.info(f"Pulling model: {model_name}")
            ollama.pull(model_name)
            await self.refresh_models()

model_manager = ModelManager()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await model_manager.refresh_models()
    yield
    # Shutdown
    pass

app = FastAPI(
    title="Local LLM API",
    description="OpenAI-compatible API for local LLMs",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "healthy", "models": model_manager.available_models}

@app.get("/v1/models")
async def list_models():
    """List available models (OpenAI-compatible)."""
    return {
        "object": "list",
        "data": [
            {"id": m, "object": "model", "created": int(time.time()), "owned_by": "local"}
            for m in model_manager.available_models
        ]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    """Chat completions endpoint (OpenAI-compatible)."""
    try:
        messages = [{"role": m.role, "content": m.content} for m in request.messages]

        if request.stream:
            return StreamingResponse(
                stream_response(request.model, messages),
                media_type="text/event-stream"
            )

        response = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )

        return {
            "id": "chatcmpl-local",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": request.model,
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": response['message']['content']
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": response.get('prompt_eval_count', 0),
                "completion_tokens": response.get('eval_count', 0),
                "total_tokens": response.get('prompt_eval_count', 0) + response.get('eval_count', 0)
            }
        }
    except Exception as e:
        logger.error(f"Chat error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

async def stream_response(model: str, messages: list):
    """Stream chat response as server-sent events."""
    try:
        stream = ollama.chat(model=model, messages=messages, stream=True)
        for chunk in stream:
            content = chunk['message']['content']
            # json.dumps produces valid JSON for the SSE payload
            data = json.dumps({"choices": [{"delta": {"content": content}}]})
            yield f"data: {data}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as e:
        yield f"data: {json.dumps({'error': str(e)})}\n\n"

@app.post("/pull/{model_name}")
async def pull_model(model_name: str, background_tasks: BackgroundTasks):
    """Pull a model in the background."""
    background_tasks.add_task(model_manager.ensure_model, model_name)
    return {"status": "pulling", "model": model_name}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
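With the service running on port 8080, any HTTP client can drive it. A quick smoke test with requests (the prompt is a placeholder):
import requests

BASE = "http://localhost:8080"

print(requests.get(f"{BASE}/health").json())

resp = requests.post(f"{BASE}/v1/chat/completions", json={
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Give me one tip for writing clean Python."}]
})
print(resp.json()["choices"][0]["message"]["content"])
Because the routes mirror the OpenAI API, the OpenAI SDK pointed at http://localhost:8080/v1 works as well, just as in the earlier Ollama example.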
Troubleshooting
Common Issues
1. Out of Memory
# Reduce context length
ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 2048}  # Default is 4096
)
# Or use a smaller quantization
# Pull a smaller model variant
# ollama pull llama3.2:1b
2. Slow Performance on Mac
# Ensure Metal is being used (should show in Ollama logs)
# Check GPU usage
sudo powermetrics --samplers gpu_power
# For M1/M2, ensure enough memory is allocated
# Models use unified memory, so more RAM = better
3. CUDA Errors on Linux
# Check NVIDIA driver
nvidia-smi
# Check CUDA version
nvcc --version
# Ensure compatible versions
# vLLM requires CUDA 11.8+ or 12.1+
# For Ollama, set GPU explicitly
CUDA_VISIBLE_DEVICES=0 ollama serve
4. Model Loading Slow
# Pre-load model on startup
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up model
    ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    yield
Performance Benchmarking
import ollama
import time
from statistics import mean, stdev
def benchmark_model(model: str, prompts: list, n_runs: int = 3):
    """Benchmark a model's performance."""
    results = []

    for prompt in prompts:
        run_times = []
        tokens_per_sec = []

        for _ in range(n_runs):
            start = time.perf_counter()
            response = ollama.chat(
                model=model,
                messages=[{'role': 'user', 'content': prompt}]
            )
            elapsed = time.perf_counter() - start

            tokens = response.get('eval_count', 0)
            run_times.append(elapsed)
            if elapsed > 0:
                tokens_per_sec.append(tokens / elapsed)

        results.append({
            'prompt': prompt[:50],
            'avg_time': mean(run_times),
            'std_time': stdev(run_times) if len(run_times) > 1 else 0,
            'avg_tokens_per_sec': mean(tokens_per_sec) if tokens_per_sec else 0
        })

    return results

# Run benchmark
prompts = [
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms",
    "What are the benefits of renewable energy?"
]

print("Benchmarking llama3.2...")
results = benchmark_model('llama3.2', prompts)
for r in results:
    print(f"  {r['prompt']}: {r['avg_time']:.2f}s ({r['avg_tokens_per_sec']:.1f} tok/s)")
Comparison: Ollama vs vLLM vs llama.cpp
| Feature | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Ease of use | Easiest | Medium | Manual |
| Performance | Good | Best | Good |
| GPU support | Auto | Excellent | Manual |
| CPU inference | Good | Limited | Excellent |
| Quantization | Built-in | AWQ/GPTQ | GGUF |
| API | OpenAI-compatible | OpenAI-compatible | Custom |
| Best for | Development | Production | Edge/Custom |
Conclusion
Running LLMs locally is now accessible to everyone:
- Ollama: Perfect for development, prototyping, and personal use
- vLLM: Ideal for production deployments with high throughput requirements
- Quantization: Enables larger models on consumer hardware
Start with Ollama for experimentation, then move to vLLM when you need production performance.
References
- Ollama: https://ollama.ai
- vLLM: https://docs.vllm.ai
- Hugging Face Models: https://huggingface.co/models