Running LLMs Locally: A Complete Guide to Ollama and vLLM

Jared Chung

Introduction

Every API call to OpenAI or Anthropic sends your data to their servers, costs money per token, and depends on internet connectivity. But there's another way: running LLMs directly on your own hardware.

Local LLM deployment has become remarkably accessible. This guide covers the "why" and "how" of running models locally, from simple experimentation with Ollama to production deployment with vLLM.

Why Run LLMs Locally?

Before diving into implementation, let's understand when local deployment makes sense.

The Benefits

Benefit        | What It Means
Privacy        | Your data never leaves your machine; critical for sensitive information
Cost           | No per-token fees; just electricity and hardware
Latency        | No network round trips; responses can be faster for interactive use
Availability   | Works offline, no API outages to worry about
Customization  | Fine-tune models for your specific domain
Control        | No rate limits, content filtering, or policy changes

The Trade-offs

Local deployment isn't always better:

  • Quality: GPT-4 and Claude still outperform open-source models on complex tasks
  • Hardware: Larger models need significant GPU memory
  • Maintenance: You're responsible for updates, security, and reliability
  • Context Length: Many local models have shorter context windows

When to Choose Local

Local is ideal for:

  • Processing sensitive/confidential data
  • High-volume applications where API costs add up
  • Embedding local models into products
  • Learning and experimentation
  • Air-gapped environments

APIs are better for:

  • Maximum capability tasks (complex reasoning, coding)
  • Variable workloads (pay only for what you use)
  • When you need the latest models immediately
  • Teams without ML infrastructure expertise

Understanding the Landscape

Three main tools dominate local LLM deployment, each with different strengths.

[Figure: Local LLM runtimes comparison]

Ollama: The User-Friendly Choice

Ollama makes running local LLMs as simple as running Docker containers. It handles model downloads, quantization selection, and GPU/CPU allocation automatically.

Target audience: Developers, researchers, hobbyists who want to experiment quickly.

vLLM: The Production Powerhouse

vLLM is optimized for serving LLMs at scale. Its PagedAttention algorithm and continuous batching deliver 2-4x higher throughput than naive implementations.

Target audience: Teams deploying LLMs for production applications with multiple concurrent users.

llama.cpp: The Flexible Foundation

The C++ inference engine that powers Ollama and many other tools. Direct use gives maximum control but requires more setup.

Target audience: Developers building custom integrations or deploying to edge devices.

Part 1: Getting Started with Ollama

Ollama is the fastest path from zero to running a local LLM.

Installation

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: Download from https://ollama.ai

Basic Usage

# Pull a model (downloads once, then cached)
ollama pull llama3.2

# Start an interactive chat
ollama run llama3.2

# List downloaded models
ollama list

# Get model information
ollama show llama3.2

That's it—you're now running a 3B parameter language model locally.

Understanding Model Choices

Ollama hosts many open-source models. Here's how to choose:

Model        | Parameters | RAM Needed | Best For
llama3.2     | 3B         | 4GB        | Quick tasks, constrained resources
llama3.1     | 8B         | 8GB        | Balanced quality/speed
llama3.1:70b | 70B        | 48GB+      | Maximum open-source quality
mistral      | 7B         | 6GB        | Efficient, good at code
codellama    | 7B-34B     | 6-24GB     | Specialized for programming
phi3         | 3.8B       | 4GB        | Surprisingly capable for size
qwen2.5      | 7B-72B     | 6-48GB     | Strong multilingual

The naming convention: model:variant (e.g., llama3.1:70b-q4_K_M)

  • The number after the colon indicates size or quantization
  • q4_K_M means 4-bit quantization (smaller but slightly lower quality)

Using Ollama Programmatically

Python Integration

import ollama

# Simple completion
response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain quantum computing briefly'}]
)
print(response['message']['content'])
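
Beyond chat, the client also mirrors the CLI's model-management commands. A minimal sketch, assuming the same local Ollama daemon is running:

import ollama

# Pull a specific tag programmatically (downloads once, then cached)
ollama.pull('llama3.2')

# Inspect installed models, mirroring `ollama list` and `ollama show`
print(ollama.list())
print(ollama.show('llama3.2'))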

Streaming Responses

For a better user experience with longer outputs:

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a short story about AI'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

OpenAI SDK Compatibility

Ollama exposes an OpenAI-compatible API, making migration easy:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)

This means existing code using OpenAI can switch to local models with just a URL change.
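
Because the endpoint follows the OpenAI API, the rest of the SDK works unchanged as well. For example, streaming looks the same as it would against the hosted service (a sketch, assuming Ollama is running on its default port):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Streaming works exactly as it does against the hosted API
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)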

Creating Custom Models

Ollama's Modelfile lets you create specialized model variants:

# Save as Modelfile
FROM llama3.2

# Set a custom system prompt
SYSTEM You are a senior Python developer who writes clean, documented code.

# Adjust generation parameters
PARAMETER temperature 0.2
PARAMETER top_p 0.9

# Create and use your custom model (run these in your shell)
ollama create python-expert -f Modelfile
ollama run python-expert

This is perfect for domain-specific assistants without fine-tuning.
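
The custom model behaves like any other Ollama model, so it can also be called from the Python client. A small sketch, assuming the python-expert model above has been created:

import ollama

# The baked-in system prompt and parameters are applied automatically
response = ollama.chat(
    model='python-expert',
    messages=[{'role': 'user', 'content': 'Write a function that merges two sorted lists.'}]
)
print(response['message']['content'])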

Part 2: vLLM for Production

When you need high throughput and professional deployment, vLLM is the answer.

How vLLM Achieves Speed

vLLM's performance comes from two key innovations:

PagedAttention: Traditional LLM inference reserves contiguous GPU memory for each request's key-value cache. PagedAttention uses non-contiguous memory blocks (like virtual memory in operating systems), reducing waste from 60-80% to near zero.

Continuous Batching: Instead of waiting for a batch to complete before starting new requests, vLLM dynamically adds and removes requests from the batch. This keeps the GPU consistently busy.

The result: 2-4x higher throughput than naive serving.
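
The same machinery is available through vLLM's offline Python API: pass it a list of prompts and it schedules them together with continuous batching rather than one at a time. A minimal sketch, assuming vLLM is installed and the example model fits on your GPU:

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages the KV cache with PagedAttention
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "What is PagedAttention?",
    "List three uses for a local LLM.",
]

# All prompts are batched dynamically, keeping the GPU busy
outputs = llm.generate(prompts, sampling)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())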

Setting Up vLLM

pip install vllm

# Start an OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --port 8000

Then use it like any OpenAI API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

When to Use vLLM Over Ollama

Scenario                    | Choose
Learning/experimenting      | Ollama
Single user, occasional use | Ollama
Multiple concurrent users   | vLLM
Production API backend      | vLLM
Maximum throughput needed   | vLLM
CPU inference               | Ollama (uses llama.cpp)
Quick model switching       | Ollama

Part 3: Practical Applications

Building a Local RAG System

Combine local LLMs with retrieval for a private knowledge base:

[Figure: RAG architecture]

The key components:

  1. Embedding model for converting documents to vectors
  2. Vector database for similarity search
  3. Local LLM for generating answers

A minimal version of this pipeline using Ollama and ChromaDB:

import ollama
import chromadb

# Initialize vector database
client = chromadb.Client()
collection = client.create_collection("docs")

# Index your documents
documents = [
    "Python was created by Guido van Rossum in 1991.",
    "Machine learning uses algorithms to learn from data.",
    "Docker containers package applications with dependencies."
]

# Simple embedding using Ollama
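# (assumes the embedding model has been pulled first: ollama pull nomic-embed-text)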
for i, doc in enumerate(documents):
    response = ollama.embeddings(model='nomic-embed-text', prompt=doc)
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[response['embedding']],
        documents=[doc]
    )

# Query the knowledge base
def ask(question):
    # Get question embedding
    q_embedding = ollama.embeddings(model='nomic-embed-text', prompt=question)

    # Find relevant documents
    results = collection.query(
        query_embeddings=[q_embedding['embedding']],
        n_results=2
    )
    context = "\n".join(results['documents'][0])

    # Generate answer with context
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer based on this context:\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']

print(ask("Who created Python?"))

This entire system runs locally—your documents never leave your machine.

Hardware Considerations

Understanding memory requirements helps choose the right model:

Model Size | Minimum RAM | Recommended GPU | Notes
1-3B       | 4GB         | None (CPU OK)   | Fast on M1/M2/M3 Macs
7-8B       | 8GB         | 8GB VRAM        | RTX 3070 or better
13B        | 16GB        | 12GB VRAM       | RTX 3080+
30-34B     | 32GB        | 24GB VRAM       | RTX 4090
70B        | 64GB        | 48GB+ VRAM      | Multi-GPU or quantized

For Apple Silicon Macs: Ollama automatically uses Metal acceleration. A 7B model runs smoothly on M1 with 16GB unified memory.

For NVIDIA GPUs: Most models run well on RTX 3080 (10GB) or better. The RTX 4090 (24GB) is the sweet spot for enthusiasts.

CPU-only: Possible but slower. A 7B model might generate 5-10 tokens/second on a modern CPU vs. 50+ on GPU.
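
A rough way to sanity-check these numbers: the weights take about parameters × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime. A back-of-the-envelope sketch (the 1.2 overhead factor is an assumption, not a measurement):

def estimate_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough estimate: parameter count x bytes per weight, scaled by an
    overhead factor for the KV cache and runtime buffers (the 1.2 is a guess)."""
    return params_billion * (bits_per_weight / 8) * overhead

# Compare a 7B model at 16-bit vs. 4-bit, and a 70B model at 4-bit
for name, params, bits in [("7B fp16", 7, 16), ("7B q4", 7, 4), ("70B q4", 70, 4)]:
    print(f"{name}: ~{estimate_memory_gb(params, bits):.1f} GB")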

Common Issues and Solutions

Out of Memory

Symptoms: Model fails to load or crashes during generation.

Solutions:

  1. Use a smaller quantization: ollama run llama3.2:q4_0 instead of default
  2. Reduce context length in Modelfile: PARAMETER num_ctx 2048
  3. Try a smaller model: 7B instead of 13B
  4. For vLLM: --gpu-memory-utilization 0.9 and --max-model-len 4096
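
If you use vLLM's offline Python API instead of the server, the same limits are available as constructor arguments; a small sketch, under the assumption that the example model matches your hardware:

from vllm import LLM

# Cap GPU memory use and context length to avoid out-of-memory errors
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.9,  # fraction of VRAM vLLM may claim
    max_model_len=4096,          # shorter context means a smaller KV cache
)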

Slow Performance

Check these in order:

  1. GPU detected? Run nvidia-smi or check Ollama logs
  2. Correct driver? The NVIDIA driver and CUDA runtime must be compatible with your GPU
  3. Thermal throttling? Monitor GPU temperature
  4. Memory swapping? Model may be too large for VRAM
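
For step 1, a quick check from Python (assuming PyTorch is available in your environment) confirms whether a CUDA GPU is visible at all:

import torch

# Confirm a CUDA GPU is visible before blaming the model for slow output
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU: {torch.cuda.get_device_name(0)}, "
          f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected; inference will fall back to CPU.")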

Model Not Found

Ollama can't find the model you requested:

  1. Check the exact name and tag in the Ollama model library; a typo in the tag is the most common cause
  2. Pull the model first: ollama pull llama3.2
  3. Confirm it downloaded: ollama list

Conclusion

Running LLMs locally is now accessible to anyone with a modern computer. The choice between approaches is clear:

Start with Ollama for learning, development, and personal projects. Its simplicity is hard to beat.

Graduate to vLLM when you need production performance with multiple users or maximum throughput.

Consider llama.cpp directly for edge deployment, embedded systems, or when you need maximum control.

The gap between local and cloud-hosted models continues to narrow. Each generation of open-source models gets closer to proprietary quality, while tools like Ollama make them increasingly accessible.
