Understanding Text Embeddings: From Words to Meaning
Author: Jared Chung
Introduction
Text embeddings transform words, sentences, or documents into numerical vectors that capture semantic meaning. They're the foundation of modern NLP—nearly every application from search to chatbots relies on them.
The key insight: similar concepts have similar vectors. This enables computers to understand that "happy" and "joyful" are related, even though they share no characters.
This guide traces the evolution of embeddings and helps you choose the right approach for your use case.
The Evolution of Text Representations
Why Traditional Methods Fall Short
Before embeddings, text was represented using sparse methods:
One-Hot Encoding:
Vocabulary: [cat, dog, bird]
cat = [1, 0, 0]
dog = [0, 1, 0]
bird = [0, 0, 1]
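A quick numpy sketch (illustrative only) shows why this representation carries no notion of similarity: every pair of distinct one-hot vectors is orthogonal.
import numpy as np

cat, dog, bird = np.eye(3)  # the three one-hot vectors from above
print(np.dot(cat, dog))     # 0.0 -- "cat" is no closer to "dog" than to any other word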
Problems:
- Vocabulary of 100,000 words = 100,000-dimensional vectors
- No similarity: cosine(cat, dog) = 0, same as cosine(cat, quantum)
- Cannot handle words not in the vocabulary
TF-IDF:
- Weights words by importance (term frequency × inverse document frequency)
- Better than one-hot, but still high-dimensional and sparse
- Treats "happy" and "joyful" as completely unrelated
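A minimal scikit-learn sketch (the two example documents are made up) illustrates the limitation: two phrases that mean the same thing but share no words get a similarity of exactly zero under TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["a happy child", "a joyful kid"]      # same meaning, no shared words
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))  # [[0.]] -- no overlapping terms, no similarity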
The Embedding Solution
Embeddings learn dense, low-dimensional vectors where semantically similar items cluster together:
| Word | Dense Embedding (conceptual) |
|---|---|
| cat | [0.2, -0.4, 0.1, ..., 0.8] |
| dog | [0.3, -0.3, 0.2, ..., 0.7] |
| car | [-0.5, 0.2, -0.1, ..., -0.3] |
Now cosine(cat, dog) ≈ 0.8 (similar animals), while cosine(cat, car) ≈ 0.1 (unrelated).
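Here is how that similarity is computed; the vectors below are made-up toy values standing in for learned embeddings, not output from a real model.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.2, -0.4, 0.1, 0.8])
dog = np.array([0.3, -0.3, 0.2, 0.7])
car = np.array([-0.5, 0.2, -0.1, -0.3])

print(cosine(cat, dog))  # high: the vectors point in a similar direction
print(cosine(cat, car))  # much lower: the vectors point in different directions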
Word Embeddings
Word2Vec: The Pioneer (2013)
Word2Vec, from Mikolov et al. at Google, demonstrated that neural networks could learn meaningful word representations from raw text.
The Core Insight: Distributional Hypothesis
"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words appearing in similar contexts have similar meanings. Word2Vec learns by predicting context:
Skip-gram (predict context from word):
Sentence: "The cat sat on the mat"
Target: "sat"
Predict: ["The", "cat", "on", "the"]
Training creates embeddings where "sat" is similar to other verbs
that appear in similar contexts ("stood", "lay", "slept")
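A minimal gensim sketch of skip-gram training (toy corpus, so the learned similarities are not meaningful; real models train on billions of words):
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]
# sg=1 selects the skip-gram objective (predict context words from the target word)
model = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # a number, but meaningless on a toy corpus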
Word Analogies
Word2Vec famously captured relationships through vector arithmetic:
king - man + woman ≈ queen
paris - france + japan ≈ tokyo
This works because relationships are encoded as directions in the vector space.
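With pretrained vectors this is a one-liner in gensim; the sketch below assumes internet access for gensim's downloader and uses the glove-wiki-gigaword-100 vectors as an example.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)]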
Limitations
| Limitation | Example |
|---|---|
| Static embeddings | "bank" has one vector for both "river bank" and "bank account" |
| No OOV handling | Unknown words have no representation |
| Word-level only | No sentence or document embeddings |
GloVe: Global Vectors (2014)
Stanford's GloVe takes a different approach: directly factorize the word co-occurrence matrix.
Key insight: The ratio of co-occurrence probabilities encodes meaning.
| Word k | P(k|ice) | P(k|steam) | P(k|ice) / P(k|steam) |
|---|---|---|---|
| solid | high | low | ≫ 1 |
| gas | low | high | ≪ 1 |
| water | medium | medium | ≈ 1 |
Words related to "ice" but not "steam" have high ratios. GloVe learns embeddings that capture these ratios.
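A toy calculation makes the ratio intuition concrete; the probabilities below are illustrative numbers roughly in the spirit of the GloVe paper's ice/steam example, not measured values.
# Illustrative co-occurrence probabilities (made up for this sketch)
p_given_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3}
p_given_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3}

for k in p_given_ice:
    print(k, p_given_ice[k] / p_given_steam[k])
# solid -> ~8.6 (>> 1), gas -> ~0.08 (<< 1), water -> ~1.4 (~ 1)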
Comparison to Word2Vec:
- Similar quality for most tasks
- GloVe uses global statistics, Word2Vec uses local context windows
- Both produce static word embeddings
FastText: Subword Embeddings (2016)
Facebook's FastText represents words as bags of character n-grams.
The word "where" with n=3 becomes:
<wh, whe, her, ere, re>, <where>
Key benefits:
| Benefit | Why It Matters |
|---|---|
| OOV handling | Can generate vectors for unseen words |
| Morphology | Captures prefixes, suffixes, roots |
| Rare words | Better representations through subword sharing |
# FastText can handle words it's never seen
fasttext_model["deeplearning"] # Works! (combines subwords)
word2vec_model["deeplearning"] # KeyError!
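A runnable version of the same idea with gensim's FastText implementation (toy corpus; min_n and max_n control the character n-gram lengths):
from gensim.models import FastText

corpus = [
    ["deep", "learning", "models", "learn", "representations"],
    ["machine", "learning", "is", "fun"],
]
model = FastText(corpus, vector_size=50, min_count=1, min_n=3, max_n=5)

print(model.wv["learnings"].shape)            # works even though "learnings" never appeared
print("learnings" in model.wv.key_to_index)   # False -- the vector is built from n-grams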
Sentence Embeddings
The Context Problem
Word embeddings assign one vector per word regardless of context:
"I deposited money in the bank" → bank = [0.5, -0.3, ...]
"I sat by the river bank" → bank = [0.5, -0.3, ...] # Same!
This is wrong—the meaning of "bank" depends on context.
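One way to see that contextual models fix this is to pull the token embedding for "bank" out of a plain BERT model for both sentences and compare them. This is a sketch using the Hugging Face transformers library (assumes bert-base-uncased, where "bank" is a single token):
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]  # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank")
    )
    return hidden[position]

v1 = bank_vector("I deposited money in the bank")
v2 = bank_vector("I sat by the river bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: different senses, different vectors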
Sentence Transformers (SBERT, 2019)
Sentence-BERT generates sentence-level embeddings that:
- Understand context within the sentence
- Can be compared with simple cosine similarity
- Are efficient enough for real-time applications
Why Not Use BERT Directly?
BERT wasn't designed for sentence similarity. Using BERT naively requires:
- Passing both sentences through BERT together
- O(n²) comparisons for n sentences
For 10,000 sentences, that's 10,000 × 9,999 / 2 ≈ 50 million pairwise forward passes!
SBERT generates independent embeddings: O(n) forward passes, then fast vector comparison.
Practical Usage
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Machine learning is a subset of AI",
    "AI systems can learn from experience",
    "The weather is nice today",
]
embeddings = model.encode(sentences)
# Similar concepts cluster together
print(cosine_similarity([embeddings[0]], [embeddings[1]])) # ~0.75
print(cosine_similarity([embeddings[0]], [embeddings[2]])) # ~0.15
Model Selection
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | General purpose, production |
| all-mpnet-base-v2 | 768 | Medium | Best | When quality matters most |
| all-MiniLM-L12-v2 | 384 | Fast | Better | Balance of speed/quality |
| paraphrase-multilingual-* | 384 | Fast | Good | 50+ languages |
Rule of thumb: Start with all-MiniLM-L6-v2. Only upgrade if you need better quality and can accept slower inference.
When to Use Each Approach
Decision Framework
| Your Need | Recommended Approach |
|---|---|
| Simple word similarity | Word2Vec or GloVe |
| Handle unknown words | FastText |
| Sentence/paragraph similarity | Sentence Transformers |
| Semantic search | Sentence Transformers |
| Clustering documents | Sentence Transformers |
| Word analogies | Word2Vec |
| Very limited compute | Word2Vec (pretrained) |
Modern Default
For most NLP tasks in 2024+, start with Sentence Transformers. Only use word embeddings for specific use cases like word analogy tasks or when computational resources are extremely limited.
Practical Patterns
Semantic Search
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.corpus = []
        self.corpus_embeddings = None

    def index(self, documents):
        self.corpus = documents
        self.corpus_embeddings = self.model.encode(
            documents, convert_to_tensor=True
        )

    def search(self, query, top_k=5):
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]
        top_indices = torch.topk(scores, k=min(top_k, len(self.corpus))).indices
        return [
            {"text": self.corpus[idx], "score": scores[idx].item()}
            for idx in top_indices
        ]
# Usage
search = SemanticSearch()
search.index(["Python programming", "Machine learning basics", "Web development"])
results = search.search("How do I learn AI?")
Batch Processing
Always encode in batches for efficiency:
# Slow: one at a time
embeddings = [model.encode(s) for s in sentences]
# Fast: batch encode
embeddings = model.encode(sentences, batch_size=32)
GPU Acceleration
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
GPU acceleration provides 5-10x speedup for batch encoding.
Handling Long Documents
Models have maximum sequence lengths (typically 256-512 tokens):
| Strategy | When to Use |
|---|---|
| Truncation | Summary information is at the start |
| Chunking + averaging | Information spread throughout |
| Chunking + max pooling | Key info in specific sections |
| Long-context models | Need full document understanding |
def embed_long_document(text, model, chunk_size=256, overlap=50):
    """Embed long documents by chunking and averaging."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)  # Average pooling
Domain Adaptation
Pre-trained models work well for general text but may underperform on specialized domains (medical, legal, financial).
When to Fine-tune
| Scenario | Action |
|---|---|
| General text, good results | Use pretrained |
| Domain terms, acceptable results | Use pretrained |
| Domain terms, poor results | Fine-tune |
| Critical accuracy needs | Fine-tune |
Fine-tuning Approaches
- Similarity pairs: Pairs of texts with similarity scores (0-1)
- Triplets: (anchor, positive, negative) examples
- Contrastive: Similar/dissimilar pairs
Typically need 1,000-10,000 examples for meaningful improvement.
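As a rough sketch of the similarity-pair approach using the sentence-transformers training API (the two medical-style pairs below are invented placeholders, not real training data):
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical domain pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"], label=0.9),
    InputExample(texts=["myocardial infarction", "broken arm"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)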
Best Practices Summary
Do
- Batch encode for efficiency
- Normalize vectors before cosine similarity (see the snippet after this list)
- Use GPU when available
- Start with pretrained models before fine-tuning
- Test on your actual data before choosing a model
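For the normalization point, encode() can do it for you (reusing the model and sentences from the earlier examples); with unit-length vectors a plain dot product equals cosine similarity:
embeddings = model.encode(sentences, normalize_embeddings=True)
scores = embeddings @ embeddings.T  # cosine similarities, since every row has length 1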
Don't
- Don't encode one sentence at a time in loops
- Don't assume pretrained models work for specialized domains
- Don't ignore sequence length limits for long documents
- Don't mix embeddings from different models in the same vector space
Conclusion
Text embeddings have evolved from sparse representations to powerful dense vectors that capture meaning:
- Word2Vec/GloVe: Pioneered dense word representations, limited by static nature
- FastText: Added subword information, handles unknown words
- Sentence Transformers: Current state-of-the-art for semantic similarity
For modern NLP applications, start with Sentence Transformers (all-MiniLM-L6-v2) and only use word embeddings for specific use cases where they excel.
References
- Mikolov et al. (2013). "Efficient Estimation of Word Representations in Vector Space". Word2Vec paper.
- Pennington et al. (2014). "GloVe: Global Vectors for Word Representation". GloVe paper.
- Bojanowski et al. (2017). "Enriching Word Vectors with Subword Information". FastText paper.
- Reimers & Gurevych (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". SBERT paper.
- Sentence Transformers Documentation - Comprehensive library documentation.