Running LLMs Locally: A Complete Guide to Ollama and vLLM

Jared Chung

Introduction

While cloud-based LLMs like GPT-4 and Claude are powerful, running models locally offers significant advantages: complete privacy, no API costs, low latency, and offline capability. In this post, we'll explore how to run open-source LLMs locally using Ollama for ease of use and vLLM for high-performance production deployments.

Why Run LLMs Locally?

| Benefit | Description |
| --- | --- |
| Privacy | Data never leaves your machine |
| Cost | No per-token API fees |
| Latency | No network round trips |
| Availability | Works offline |
| Customization | Fine-tune for your use case |
| Control | No rate limits or policy restrictions |

Part 1: Ollama - The Easy Way

Ollama makes running local LLMs as simple as running Docker containers.

Installation

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or download from https://ollama.ai for Windows/macOS

Basic Usage

# Pull a model
ollama pull llama3.2

# Run interactively
ollama run llama3.2

# List installed models
ollama list

# Remove a model
ollama rm llama3.2

Available Models

| Model | Parameters | RAM Required | Best For |
| --- | --- | --- | --- |
| llama3.2 | 3B | 4GB | Fast, general use |
| llama3.1 | 8B | 8GB | Balanced performance |
| llama3.1:70b | 70B | 48GB | High quality |
| mistral | 7B | 6GB | Efficient, fast |
| mixtral | 8x7B | 32GB | MoE, high quality |
| codellama | 7B-34B | 6-24GB | Code generation |
| phi3 | 3.8B | 4GB | Compact, efficient |
| gemma2 | 9B/27B | 8-20GB | Google's open model |
| qwen2.5 | 7B-72B | 6-48GB | Multilingual |

Python Integration

import ollama

# Simple completion
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain quantum computing in simple terms'}
    ]
)
print(response['message']['content'])

# Streaming
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about programming'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Using with OpenAI SDK

Ollama provides an OpenAI-compatible API:

from openai import OpenAI

# Point to Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ]
)
print(response.choices[0].message.content)
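
Streaming also works through the same compatible endpoint. A minimal sketch reusing the client configured above:

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Give me three tips for writing tests"}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)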

Custom Models with Modelfile

Create specialized models by modifying system prompts and parameters:

# Save as Modelfile
FROM llama3.2

# Set system prompt
SYSTEM You are a senior Python developer. You write clean, well-documented code following PEP 8 guidelines. Always include type hints and docstrings.

# Adjust parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Create custom model
ollama create python-dev -f Modelfile

# Use it
ollama run python-dev
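
The custom model behaves like any other tag, so it can also be called from Python. A small sketch, assuming the `ollama create python-dev -f Modelfile` step above has been run:

import ollama

# 'python-dev' is the custom model created from the Modelfile above
response = ollama.chat(
    model='python-dev',
    messages=[{'role': 'user', 'content': 'Write a function that merges two sorted lists.'}]
)
print(response['message']['content'])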

Embeddings with Ollama

import ollama

# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of artificial intelligence'
)

embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}")  # 768

LangChain Integration

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize
llm = Ollama(model="llama3.2")
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create chain
prompt = ChatPromptTemplate.from_template("Explain {topic} to a 5-year-old")
chain = prompt | llm | StrOutputParser()

result = chain.invoke({"topic": "gravity"})
print(result)
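
Chains built this way also stream token by token, which is useful for chat UIs. A minimal sketch using the same chain:

# Stream the chain output instead of waiting for the full response
for token in chain.stream({"topic": "rainbows"}):
    print(token, end="", flush=True)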

Part 2: vLLM - Production Performance

vLLM is optimized for high-throughput inference with features like PagedAttention and continuous batching.

Installation

pip install vllm

Basic Server

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8000 \
    --tensor-parallel-size 1

Python API

from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate
prompts = [
    "Write a Python function to sort a list",
    "Explain the theory of relativity",
    "What are the benefits of exercise?"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}\n")

OpenAI-Compatible Server

# Start server
# python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response.choices[0].message.content)

Performance Optimization

Tensor Parallelism (Multi-GPU)

# Use 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2

Quantization

from vllm import LLM

# Use AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq"
)

# Use GPTQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",
    quantization="gptq"
)

Memory Optimization

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=4096,  # Limit context length
    enforce_eager=True  # Disable CUDA graphs for memory savings
)

Part 3: Model Comparison

Benchmark: Llama 3.2 vs Mistral vs Phi-3

Testing on common tasks with Ollama:

import ollama
import time

models = ['llama3.2', 'mistral', 'phi3']
prompt = "Write a Python function to find the nth Fibonacci number with memoization."

for model in models:
    start = time.time()

    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': prompt}]
    )

    elapsed = time.time() - start
    tokens = response.get('eval_count', 0)

    print(f"\n{model}:")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Tokens: {tokens}")
    print(f"  Tokens/sec: {tokens/elapsed:.1f}")

Choosing the Right Model

| Use Case | Recommended Model | Reason |
| --- | --- | --- |
| General chat | llama3.2, mistral | Good balance |
| Code generation | codellama, deepseek-coder | Specialized |
| Fast responses | phi3, llama3.2:1b | Small, quick |
| High quality | llama3.1:70b, mixtral | More capable |
| Multilingual | qwen2.5, aya | Better language support |
| Long context | yarn-mistral, llama3.1 | Extended context |

Part 4: Building Applications

RAG with Local Models

import ollama
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

# Index documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning uses algorithms to learn from data.",
    "Neural networks are inspired by the human brain."
]

embeddings = embed_model.encode(documents).tolist()
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

# RAG query
def rag_query(question: str) -> str:
    # Retrieve
    query_embedding = embed_model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=2)
    context = "\n".join(results['documents'][0])

    # Generate
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer based on this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']

answer = rag_query("What is machine learning?")
print(answer)

Streaming Chat Interface

import ollama

def chat():
    messages = []

    print("Chat with Llama (type 'quit' to exit)\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            break

        messages.append({'role': 'user', 'content': user_input})

        print("Assistant: ", end='', flush=True)
        response_text = ""

        stream = ollama.chat(
            model='llama3.2',
            messages=messages,
            stream=True
        )

        for chunk in stream:
            text = chunk['message']['content']
            print(text, end='', flush=True)
            response_text += text

        print("\n")
        messages.append({'role': 'assistant', 'content': response_text})

if __name__ == "__main__":
    chat()

FastAPI Server

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import ollama

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.2"

@app.post("/chat")
async def chat(request: ChatRequest):
    response = ollama.chat(
        model=request.model,
        messages=[{'role': 'user', 'content': request.message}]
    )
    return {"response": response['message']['content']}

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    def generate():
        stream = ollama.chat(
            model=request.model,
            messages=[{'role': 'user', 'content': request.message}],
            stream=True
        )
        for chunk in stream:
            yield chunk['message']['content']

    return StreamingResponse(generate(), media_type="text/plain")
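
To consume the streaming endpoint from another process, read the response incrementally. A quick sketch, assuming the app above is saved as `main.py` and started with `uvicorn main:app --port 8000` (the filename and port are illustrative):

import requests

with requests.post(
    "http://localhost:8000/chat/stream",
    json={"message": "Explain list comprehensions", "model": "llama3.2"},
    stream=True,
) as resp:
    resp.encoding = "utf-8"
    # Print tokens as they arrive instead of waiting for the full body
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)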

Hardware Recommendations

| Model Size | Minimum RAM | Recommended GPU | Notes |
| --- | --- | --- | --- |
| 1-3B | 4GB | None (CPU OK) | Fast on M1/M2 |
| 7-8B | 8GB | 8GB VRAM | RTX 3070+ |
| 13B | 16GB | 12GB VRAM | RTX 3080+ |
| 30-34B | 32GB | 24GB VRAM | RTX 4090 |
| 70B | 64GB | 48GB+ VRAM | Multi-GPU or quantized |
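
The rough arithmetic behind these numbers: weights need about (parameters × bytes per parameter), plus headroom for the KV cache and activations. Ollama's default tags are typically 4-bit quantized, which is why an 8B model fits comfortably in the 8GB row. A hypothetical helper to ballpark it (the 1.2× overhead factor is an assumption, not a measured figure):

def estimate_memory_gb(params_billion: float, bits_per_param: int = 16, overhead: float = 1.2) -> float:
    """Rough estimate of memory needed to hold a model's weights plus overhead."""
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return weight_bytes * overhead / 1e9

# 8B model in fp16 vs 4-bit quantized
print(f"8B fp16:  ~{estimate_memory_gb(8, 16):.0f} GB")
print(f"8B 4-bit: ~{estimate_memory_gb(8, 4):.0f} GB")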

Complete Production Setup

Docker Deployment for Ollama

# Dockerfile
FROM ollama/ollama:latest

# Pre-pull models on build (optional)
# RUN ollama pull llama3.2

EXPOSE 11434

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

# Start with docker compose
docker compose up -d

# Pull model
docker exec -it ollama-ollama-1 ollama pull llama3.2

Production vLLM with Nginx

# nginx.conf
upstream vllm {
    server 127.0.0.1:8000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name llm.example.com;

    ssl_certificate /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;

    location /v1/ {
        proxy_pass http://vllm;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # For streaming
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
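
With the proxy in place, clients point the OpenAI SDK at the public hostname instead of localhost. A sketch assuming the configuration above (llm.example.com is the placeholder hostname from the config):

from openai import OpenAI

client = OpenAI(base_url="https://llm.example.com/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Ping"}]
)
print(response.choices[0].message.content)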

Complete Local LLM Service

import ollama
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import List
from contextlib import asynccontextmanager
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Request/Response models
class Message(BaseModel):
    role: str = Field(..., pattern="^(system|user|assistant)$")
    content: str

class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: List[Message]
    stream: bool = False
    temperature: float = 0.7
    max_tokens: int = 2048

class ChatResponse(BaseModel):
    model: str
    response: str
    tokens_used: int

# Health check and model management
class ModelManager:
    def __init__(self):
        self.available_models = []

    async def refresh_models(self):
        """Refresh list of available models."""
        try:
            result = ollama.list()
            self.available_models = [m['name'] for m in result['models']]
            logger.info(f"Available models: {self.available_models}")
        except Exception as e:
            logger.error(f"Failed to list models: {e}")

    async def ensure_model(self, model_name: str):
        """Ensure a model is available, pull if needed."""
        if model_name not in self.available_models:
            logger.info(f"Pulling model: {model_name}")
            ollama.pull(model_name)
            await self.refresh_models()

model_manager = ModelManager()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    await model_manager.refresh_models()
    yield
    # Shutdown
    pass

app = FastAPI(
    title="Local LLM API",
    description="OpenAI-compatible API for local LLMs",
    lifespan=lifespan
)

@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "healthy", "models": model_manager.available_models}

@app.get("/v1/models")
async def list_models():
    """List available models (OpenAI-compatible)."""
    return {
        "object": "list",
        "data": [
            {"id": m, "object": "model", "owned_by": "local"}
            for m in model_manager.available_models
        ]
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    """Chat completions endpoint (OpenAI-compatible)."""
    try:
        messages = [{"role": m.role, "content": m.content} for m in request.messages]

        if request.stream:
            return StreamingResponse(
                stream_response(request.model, messages),
                media_type="text/event-stream"
            )

        response = ollama.chat(
            model=request.model,
            messages=messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )

        return {
            "id": "chatcmpl-local",
            "object": "chat.completion",
            "model": request.model,
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": response['message']['content']
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": response.get('prompt_eval_count', 0),
                "completion_tokens": response.get('eval_count', 0),
                "total_tokens": response.get('prompt_eval_count', 0) + response.get('eval_count', 0)
            }
        }

    except Exception as e:
        logger.error(f"Chat error: {e}")
        raise HTTPException(status_code=500, detail=str(e))

async def stream_response(model: str, messages: list):
    """Stream chat response as server-sent events."""
    try:
        stream = ollama.chat(model=model, messages=messages, stream=True)
        for chunk in stream:
            content = chunk['message']['content']
            # Serialize with json.dumps so quotes and newlines in the content don't break the event
            payload = {"choices": [{"delta": {"content": content}}]}
            yield f"data: {json.dumps(payload)}\n\n"
        yield "data: [DONE]\n\n"
    except Exception as e:
        yield f"data: {json.dumps({'error': str(e)})}\n\n"

@app.post("/pull/{model_name}")
async def pull_model(model_name: str, background_tasks: BackgroundTasks):
    """Pull a model in the background."""
    background_tasks.add_task(model_manager.ensure_model, model_name)
    return {"status": "pulling", "model": model_name}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
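
A quick way to exercise the service above once it is running on port 8080 (assumes llama3.2 has already been pulled):

import requests

# Health check lists the models the server currently knows about
print(requests.get("http://localhost:8080/health").json())

# OpenAI-style chat completion against the local endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in five words"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])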

Troubleshooting

Common Issues

1. Out of Memory

# Reduce context length
ollama.chat(
    model='llama3.2',
    messages=[...],
    options={'num_ctx': 2048}  # Use a smaller context window to cut memory use
)

# Or use a smaller quantization
# Pull a smaller model variant
# ollama pull llama3.2:1b

2. Slow Performance on Mac

# Ensure Metal is being used (should show in Ollama logs)
# Check GPU usage
sudo powermetrics --samplers gpu_power

# For M1/M2, ensure enough memory is allocated
# Models use unified memory, so more RAM = better

3. CUDA Errors on Linux

# Check NVIDIA driver
nvidia-smi

# Check CUDA version
nvcc --version

# Ensure compatible versions
# vLLM requires CUDA 11.8+ or 12.1+

# For Ollama, set GPU explicitly
CUDA_VISIBLE_DEVICES=0 ollama serve

4. Model Loading Slow

# Pre-load model on startup
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up model
    ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    yield

Performance Benchmarking

import ollama
import time
from statistics import mean, stdev

def benchmark_model(model: str, prompts: list, n_runs: int = 3):
    """Benchmark a model's performance."""
    results = []

    for prompt in prompts:
        run_times = []
        tokens_per_sec = []

        for _ in range(n_runs):
            start = time.perf_counter()

            response = ollama.chat(
                model=model,
                messages=[{'role': 'user', 'content': prompt}]
            )

            elapsed = time.perf_counter() - start
            tokens = response.get('eval_count', 0)

            run_times.append(elapsed)
            if elapsed > 0:
                tokens_per_sec.append(tokens / elapsed)

        results.append({
            'prompt': prompt[:50],
            'avg_time': mean(run_times),
            'std_time': stdev(run_times) if len(run_times) > 1 else 0,
            'avg_tokens_per_sec': mean(tokens_per_sec) if tokens_per_sec else 0
        })

    return results

# Run benchmark
prompts = [
    "Write a Python function to calculate factorial",
    "Explain quantum computing in simple terms",
    "What are the benefits of renewable energy?"
]

print("Benchmarking llama3.2...")
results = benchmark_model('llama3.2', prompts)
for r in results:
    print(f"  {r['prompt']}: {r['avg_time']:.2f}s ({r['avg_tokens_per_sec']:.1f} tok/s)")

Comparison: Ollama vs vLLM vs llama.cpp

| Feature | Ollama | vLLM | llama.cpp |
| --- | --- | --- | --- |
| Ease of use | Easiest | Medium | Manual |
| Performance | Good | Best | Good |
| GPU support | Auto | Excellent | Manual |
| CPU inference | Good | Limited | Excellent |
| Quantization | Built-in | AWQ/GPTQ | GGUF |
| API | OpenAI-compatible | OpenAI-compatible | Custom |
| Best for | Development | Production | Edge/Custom |

Conclusion

Running LLMs locally is now accessible to everyone:

  • Ollama: Perfect for development, prototyping, and personal use
  • vLLM: Ideal for production deployments with high throughput requirements
  • Quantization: Enables larger models on consumer hardware

Start with Ollama for experimentation, then move to vLLM when you need production performance.
