Multimodal AI: Building with Vision-Language Models

Author: Jared Chung

Introduction

What if an AI could look at a photo and have a conversation about it? That's exactly what Vision-Language Models (VLMs) enable. These multimodal AI systems combine the visual understanding of image models with the reasoning and language capabilities of LLMs, opening up applications that were impossible just a few years ago.

In this post, we'll demystify how these models work, explore the key players in the field, and learn when to use which approach.

Understanding Vision-Language Models

The Core Challenge

Traditional language models only understand text. Traditional vision models can classify images or detect objects, but can't explain what they see in natural language. VLMs bridge this gap by creating a unified system that can:

  • See: Process and understand visual content
  • Reason: Think about what's in an image
  • Communicate: Describe, answer questions, and have conversations about images

How VLMs Work

The architecture of a VLM consists of three main components working together:

[Figure: Vision-Language Model Architecture]

Let's break down each component:

1. Vision Encoder

The vision encoder converts an image into a sequence of numerical representations (embeddings) that capture visual information. Most modern VLMs use a Vision Transformer (ViT), typically one pretrained with CLIP's contrastive image-text objective.

How it works:

  • The image is divided into small patches (typically 14x14 or 16x16 pixels)
  • Each patch is converted to an embedding
  • Self-attention mechanisms help the model understand relationships between patches
  • The output is a sequence of "visual tokens" that represent different parts of the image

Key insight: CLIP was trained on 400 million image-text pairs from the internet, learning to associate images with their descriptions. This means the vision encoder already "knows" concepts like "dog," "sunset," or "office building" from its pretraining.
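
As a concrete illustration, here is a minimal sketch of extracting patch embeddings from a CLIP vision encoder with the Hugging Face transformers library. The checkpoint name and shapes are assumptions; each VLM wires this step up slightly differently.

from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint: the CLIP ViT-L/14 @ 336px encoder used by LLaVA-style models
checkpoint = "openai/clip-vit-large-patch14-336"
processor = CLIPImageProcessor.from_pretrained(checkpoint)
encoder = CLIPVisionModel.from_pretrained(checkpoint)

image = Image.open("photo.jpg")
inputs = processor(images=image, return_tensors="pt")
outputs = encoder(**inputs)

# last_hidden_state: (batch, 1 + num_patches, hidden_dim)
# Position 0 is the [CLS] token; the rest are per-patch "visual tokens".
patch_tokens = outputs.last_hidden_state[:, 1:, :]
print(patch_tokens.shape)  # (1, 576, 1024) for a 336x336 input with 14x14 patches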

2. Projection Layer (The Key Innovation)

Here's the crucial insight that makes VLMs work: the projection layer maps visual embeddings into the same space as text embeddings.

Think of it like a translator. The vision encoder speaks "vision language" and the LLM speaks "text language." The projection layer translates between them, allowing the LLM to process visual information as if it were just more text tokens.

This is typically just one or two linear layers, a few million parameters, which is tiny compared to the rest of the model. But this small component is what enables the magic of multimodal understanding.
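
To make this concrete, here is a minimal PyTorch sketch of such a projector. The dimensions are assumptions (a 1024-dim CLIP ViT-L encoder feeding a 4096-dim LLM); real models differ in exact sizes and may use a small MLP instead of a single linear layer.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps visual embeddings into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(visual_tokens)

projector = VisionProjector()
dummy_visual_tokens = torch.randn(1, 576, 1024)
print(projector(dummy_visual_tokens).shape)  # torch.Size([1, 576, 4096])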

3. Language Model Backbone

The language model (typically LLaMA, Mistral, or similar) receives a combined sequence of:

  • Visual tokens from the projection layer
  • Text tokens from the user's question

From the LLM's perspective, it's just processing tokens and generating a response. It doesn't "know" some tokens came from an image—it simply learned during training that these particular tokens carry visual meaning.
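
A rough sketch of how the combined sequence is formed. The shapes are illustrative only; real models insert the visual tokens at an image placeholder position in the prompt rather than always at the front.

import torch

projected_visual = torch.randn(1, 576, 4096)  # output of the projection layer
text_embeds = torch.randn(1, 32, 4096)        # embedded tokens of the user's question

# The LLM just sees one long sequence of embeddings and predicts the next token.
inputs_embeds = torch.cat([projected_visual, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 608, 4096])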

The Training Methodology

VLMs aren't trained in one step. The process involves multiple stages, each building on the previous:

[Figure: VLM Training Pipeline]

Stage 1: Pretrain the Vision Encoder

Before building a VLM, you need a vision encoder that understands images. CLIP training involves:

  • Showing the model millions of image-text pairs
  • Training it to match images with their correct descriptions
  • Learning a shared embedding space where similar concepts cluster together

Result: A vision encoder that can extract meaningful representations from any image.
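
The heart of this stage is a contrastive objective. Below is a simplified sketch; real CLIP training uses very large batches, a learnable temperature, and distributed gathering of embeddings.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    logits = image_embeds @ text_embeds.T / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0))               # matching pairs lie on the diagonal

    # Symmetric loss: each image should match its text, and vice versa.
    loss_img = F.cross_entropy(logits, targets)
    loss_txt = F.cross_entropy(logits.T, targets)
    return (loss_img + loss_txt) / 2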

Stage 2: Train the Projection Layer

With frozen vision encoder and frozen LLM, only the small projection layer is trained:

  • Use image-caption pairs as training data
  • The model learns to predict the caption given the image
  • The projection layer learns the "translation" between vision and language spaces

Why freeze everything else? Both the vision encoder and LLM were trained on massive datasets. We want to preserve those capabilities while just learning how to connect them.
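
In code, this stage mostly amounts to freezing everything except the projector. The modules below are stand-ins for the real pretrained components, not a specific library's API.

import torch
import torch.nn as nn

# Placeholder modules standing in for the real pretrained components.
vision_encoder = nn.Linear(3, 1024)   # stands in for a CLIP ViT
llm = nn.Linear(4096, 32000)          # stands in for the language model
projector = nn.Linear(1024, 4096)     # the only part trained in this stage

for param in vision_encoder.parameters():
    param.requires_grad = False
for param in llm.parameters():
    param.requires_grad = False

# Only the projection layer receives gradient updates.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)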

Stage 3: Visual Instruction Tuning

This is where the model learns to be helpful. Using datasets like:

  • Visual question-answering pairs
  • Image description requests
  • Multi-turn conversations about images
  • Complex reasoning tasks

The model learns not just to describe images, but to follow instructions, answer specific questions, and engage in dialogue.
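
An instruction-tuning example typically pairs an image with a multi-turn conversation. The record below is illustrative, loosely modeled on the LLaVA conversation format; the field names and values are made up for this sketch.

training_sample = {
    "image": "receipt_0042.jpg",  # hypothetical file name
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the total amount on this receipt?"},
        {"from": "gpt", "value": "The total is $42.17, shown at the bottom of the receipt."},
        {"from": "human", "value": "Which line item was the most expensive?"},
        {"from": "gpt", "value": "The most expensive item is 'Ribeye Steak' at $24.00."},
    ],
}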

Stage 4: Alignment (Optional)

Commercial models like GPT-4V and Claude 3 undergo additional training with human feedback (RLHF) to:

  • Reduce hallucinations (making up things not in the image)
  • Improve safety (refusing harmful requests)
  • Better follow user intent

The Multimodal Landscape

Commercial Models

| Model             | Provider  | Strengths                      | Context | Best For                                  |
|-------------------|-----------|--------------------------------|---------|-------------------------------------------|
| GPT-4V/GPT-4o     | OpenAI    | Best overall quality, reliable | 128K    | Production apps requiring highest quality |
| Claude 3          | Anthropic | Strong reasoning, long context | 200K    | Complex analysis, documents               |
| Gemini Pro Vision | Google    | Fast, good value               | 32K     | Cost-sensitive applications               |

Open Source Models

| Model     | Parameters | License      | Strengths                       |
|-----------|------------|--------------|---------------------------------|
| LLaVA 1.6 | 7B-34B     | Apache 2.0   | Best general-purpose open model |
| Qwen-VL   | 7B         | Qwen License | Excellent multilingual support  |
| CogVLM    | 17B        | Apache 2.0   | High quality, research-focused  |
| InternVL  | 6B-26B     | MIT          | Strong Chinese + English        |

Choosing the Right Model

Use commercial models (GPT-4V, Claude 3) when:

  • Quality is the top priority
  • You need reliable production performance
  • Your data can be sent to external APIs
  • Budget allows for per-token pricing

Use open source models (LLaVA, Qwen-VL) when:

  • Data must stay on your infrastructure (privacy/compliance)
  • You need to fine-tune for your specific domain
  • Running costs must be minimized
  • You want to experiment and learn

Key Concepts and Methodology

How Visual Tokens Work

When a VLM processes an image, it doesn't see pixels—it sees tokens. For a typical model:

  1. A 336x336 image is divided into 576 patches (24x24 grid)
  2. Each patch becomes one visual token
  3. These 576 tokens are prepended to the text tokens
  4. The LLM processes everything together

Implication: Larger images mean more tokens, which means higher cost and longer processing. Most APIs let you choose "low" or "high" detail modes to control this trade-off.
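
For example, with the OpenAI-style message format you can request the cheaper low-detail mode explicitly. The URL here is a placeholder, and exact token accounting varies by provider and model.

content = [
    {"type": "text", "text": "Give me a one-sentence summary of this image."},
    {
        "type": "image_url",
        "image_url": {
            "url": "https://example.com/photo.jpg",  # placeholder URL
            "detail": "low",  # "low" caps visual tokens; "high" tiles the image for more detail
        },
    },
]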

The Importance of Prompting

Just like text LLMs, VLMs respond better to well-crafted prompts. Key strategies:

Be specific about what you want:

  • Instead of: "What's in this image?"
  • Try: "List all the objects visible in this image, including their approximate positions."

Provide context:

  • Instead of: "Analyze this."
  • Try: "This is a medical X-ray. Describe any abnormalities you observe."

Ask for structured output:

  • Instead of: "Extract the data."
  • Try: "Extract the invoice data as JSON with fields: invoice_number, date, total, line_items."

Understanding Limitations

VLMs have important limitations to understand:

Hallucinations: The model may confidently describe things that aren't in the image. Always verify critical information.

OCR imperfections: While VLMs can read text in images, they may make errors with unusual fonts, handwriting, or low resolution.

Spatial reasoning: Understanding precise spatial relationships ("is X to the left of Y?") can be challenging.

Small details: Fine details may be missed, especially in complex or cluttered images.

Knowledge cutoff: The model knows about concepts from its training data, not real-time information.

Practical Applications

Document Understanding

VLMs excel at extracting information from documents, forms, receipts, and invoices. The methodology:

  1. Capture high-quality images - Resolution matters for text extraction
  2. Use structured prompts - Tell the model exactly what fields to extract
  3. Request JSON output - Makes parsing reliable
  4. Validate results - Cross-check extracted values when possible
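
Steps 3 and 4 translate into a small amount of glue code. Here is a sketch of parsing and sanity-checking the model's reply, assuming the prompt asked for JSON only as in the example above.

import json

def parse_invoice_response(raw: str):
    """Parse a JSON reply from the VLM and flag suspicious extractions."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: retry with a stricter prompt or flag for review

    # Cross-check: line items should roughly sum to the stated total.
    items = data.get("line_items") or []
    total = data.get("total")
    if items and total:
        computed = sum(i["quantity"] * i["unit_price"] for i in items)
        if abs(computed - total) > 0.01 * max(abs(total), 1):
            data["needs_review"] = True
    return data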

Visual Question Answering

Building interactive systems where users can ask questions about images:

  1. First message includes the image - Establishes the visual context
  2. Subsequent messages are text-only - The model remembers the image
  3. Maintain conversation history - For context-aware follow-ups
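
A minimal sketch of this pattern with the OpenAI-style chat API. It assumes `client` is an OpenAI client and `image_data` is a base64-encoded image, as in the quick-start example later in this post.

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is a photo I'd like to discuss."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
    ],
}]

def ask(question: str) -> str:
    """Send a follow-up question; the image stays in the conversation history."""
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

# ask("What objects are on the desk?")
# ask("Which of them looks oldest?")  # follow-up refers back to the same image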

Chart and Graph Analysis

VLMs can interpret charts, graphs, and data visualizations:

  • Identify chart type and axes
  • Extract approximate data values
  • Describe trends and patterns
  • Compare multiple data series

Best practice: Ask for specific numbers when they are visible, but treat any value read off a graph as an estimate.

Image-Based Search (RAG for Images)

Use CLIP embeddings to build visual search systems:

  1. Index: Convert images to embeddings using CLIP
  2. Store: Save embeddings in a vector database
  3. Search: Convert text query to embedding, find similar images
  4. Analyze: Send retrieved images to VLM for detailed analysis

This combines the speed of vector search with the reasoning capability of VLMs.
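
A compact sketch of steps 1-3 with Hugging Face CLIP, using in-memory cosine similarity in place of a real vector database. The checkpoint name and file paths are assumptions.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

def embed_images(paths):
    """Index step: embed and L2-normalize a batch of images."""
    inputs = processor(images=[Image.open(p) for p in paths], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return F.normalize(feats, dim=-1)

def search(query: str, paths, image_index, top_k: int = 3):
    """Search step: embed the text query and rank images by cosine similarity."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = F.normalize(model.get_text_features(**inputs), dim=-1)
    scores = (image_index @ q.T).squeeze(-1)
    best = scores.argsort(descending=True)[:top_k]
    return [(paths[int(i)], scores[int(i)].item()) for i in best]

# paths = ["dog.jpg", "beach.jpg", "office.jpg"]
# index = embed_images(paths)
# print(search("a dog playing outside", paths, index))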

Production Considerations

Cost Optimization

VLM API calls are more expensive than text-only calls due to image processing. Strategies:

| Strategy                | Savings  | Trade-off                         |
|-------------------------|----------|-----------------------------------|
| Use "low" detail mode   | ~70%     | Less accuracy on fine details     |
| Use GPT-4o-mini         | ~90%     | Lower quality for complex tasks   |
| Cache repeated analyses | 100%     | Stale results for changing images |
| Batch similar requests  | Variable | Added latency                     |

Latency

Image processing adds latency. For real-time applications:

  • Consider local models (LLaVA via Ollama) for sub-second response
  • Use lower resolution when speed matters
  • Implement streaming for better perceived performance

Privacy

If your images contain sensitive data:

  • Use local models (LLaVA, Qwen-VL)
  • Run on your own infrastructure
  • Avoid cloud APIs that may retain data

Getting Started

Quick Start with OpenAI

The simplest way to start experimenting:

from openai import OpenAI
import base64

client = OpenAI()

def analyze_image(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content

# Example
result = analyze_image("chart.png", "What trends do you see in this chart?")

Quick Start with Ollama (Local)

For local, private image analysis:

# Install Ollama and pull LLaVA
ollama pull llava

# Analyze an image
ollama run llava "Describe this image: ./photo.jpg"

The same model can also be called from Python with the ollama client library:

import ollama

response = ollama.chat(
    model='llava',
    messages=[{
        'role': 'user',
        'content': 'What objects are in this image?',
        'images': ['./photo.jpg']
    }]
)
print(response['message']['content'])

Conclusion

Vision-Language Models represent a fundamental shift in AI capabilities. By combining visual understanding with language reasoning, they enable applications from document processing to visual search to interactive image analysis.

Key takeaways:

  1. Architecture matters: Understanding the vision encoder → projection → LLM pipeline helps you use these models effectively

  2. Choose wisely: Commercial models for quality, open source for privacy and cost

  3. Prompt well: Structured, specific prompts dramatically improve results

  4. Know the limits: Hallucinations and OCR errors require validation for critical applications

  5. Start simple: Begin with GPT-4V for prototyping, then optimize based on your needs

The field is evolving rapidly. Open source models are closing the gap with commercial offerings, and new architectures continue to improve efficiency and capability.
