Introduction

Named Entity Recognition (NER) is a fundamental NLP task that extracts structured information from unstructured text. It identifies and classifies mentions of real-world entities into categories like people, organizations, locations, and dates.

NER powers many practical applications:

Email processing: Extract contacts, meetings, and action items
News analysis: Identify companies, people, and events mentioned
Resume parsing: Extract skills, education, and work history
Data anonymization: Detect and redact PII (personally identifiable information)
Knowledge graphs: Build structured databases from documents

This guide explains how NER works, compares different approaches, and shows how to handle domain-specific entities.

How NER Works

The NER Pipeline

A typical NER system processes text in stages:

Tokenization: Split text into individual words or subwords
Encoding: Convert tokens to numerical representations
Classification: Predict entity labels for each token
Span extraction: Combine labeled tokens into entity spans

BIO Tagging Scheme

NER operates at the token level, but entities often span multiple tokens. The BIO scheme handles this:

B-XXX: Beginning of entity type XXX
I-XXX: Inside (continuation) of entity type XXX
O: Outside any entity

Example:

Token	Label	Meaning
Apple	B-ORG	Start of organization
Inc.	I-ORG	Continuation of organization
CEO	O	Not an entity
Tim	B-PER	Start of person
Cook	I-PER	Continuation of person
visited	O	Not an entity
Paris	B-LOC	Location (single token)

The BIO scheme ensures multi-word entities like "Apple Inc." and "Tim Cook" are correctly grouped.
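
The span extraction step can be sketched in a few lines of plain Python. This is a minimal illustration rather than any library's implementation; the function name `bio_to_spans` is an assumption, and stray `I-` tags are treated as the start of a new entity, which is one of several possible conventions.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labeled tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label == "O":
            flush()
        elif label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and label[2:] == current_type:
            current_tokens.append(token)
        else:
            # Stray I- tag with no matching B-: treat it as a new entity.
            flush()
            current_tokens, current_type = [token], label[2:]
    flush()
    return spans


tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "visited", "Paris"]
labels = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, labels))
# [('Apple Inc.', 'ORG'), ('Tim Cook', 'PER'), ('Paris', 'LOC')]
```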

Standard Entity Types

Different datasets define different entity categories. The most common:

OntoNotes 5.0 (used by spaCy's English models; 10 of its 18 types shown)

Type	Description	Examples
PERSON	People, including fictional	Barack Obama, Sherlock Holmes
ORG	Companies, agencies, institutions	Google, FBI, Stanford
GPE	Countries, cities, states	France, New York
LOC	Non-GPE locations	Mount Everest, Pacific Ocean
DATE	Dates and periods	June 2023, yesterday
TIME	Times of day	3:00 PM, morning
MONEY	Monetary values	$500, fifty euros
PERCENT	Percentages	25%, three percent
PRODUCT	Objects, vehicles	iPhone, Boeing 747
EVENT	Named events	Olympics, World War II

CoNLL-2003 (Standard benchmark)

Type	Description
PER	Person names
ORG	Organizations
LOC	Locations
MISC	Miscellaneous

The entity types you need depend on your application. Custom domains (medical, legal, financial) typically require custom entity types.
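
Because label sets differ, comparing or combining models often requires mapping one scheme onto another. The sketch below shows one way to do that. The mapping itself (`ONTONOTES_TO_CONLL`) and the helper `normalize_label` are hypothetical choices made for illustration, not a standard conversion.

```python
# Hypothetical mapping from OntoNotes-style labels to CoNLL-2003-style labels.
# The choices below are assumptions for illustration; adjust them to your data.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
}


def normalize_label(label, mapping=ONTONOTES_TO_CONLL):
    """Return the mapped label, or None for types with no counterpart
    (for example DATE or MONEY, which CoNLL-2003 does not annotate)."""
    return mapping.get(label)


entities = [("Barack Obama", "PERSON"), ("France", "GPE"), ("June 2023", "DATE")]
normalized = [
    (text, normalize_label(label))
    for text, label in entities
    if normalize_label(label) is not None
]
print(normalized)  # [('Barack Obama', 'PER'), ('France', 'LOC')]
```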

Approaches to NER

There are three main approaches, each with different tradeoffs:

1. Rule-Based (Pattern Matching)

Match explicit patterns like regular expressions or keyword lists.

When to use:

Known, finite set of entities (product names, internal codes)
High precision required (legal compliance)
No training data available

Characteristics:

Aspect	Rating
Accuracy on known patterns	Excellent
Generalization to unseen text	Poor
Speed	Very Fast
Maintenance effort	High

spaCy's EntityRuler lets you define exact patterns for entity recognition. This is ideal when you have a known vocabulary (like product names or internal codes) that statistical models might miss. Patterns can match exact strings or use token-level attributes like lowercase text or digit detection.

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
    {"label": "TECH", "pattern": [{"LOWER": {"IN": ["pytorch", "tensorflow"]}}]},
]
ruler.add_patterns(patterns)

doc = nlp("I use PyTorch on my iPhone 15")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")

2. Statistical Models (spaCy)

Machine learning models trained on labeled data, using features like word shape, context, and embeddings.

When to use:

General-purpose NER on standard entity types
Production systems requiring speed
Balance between accuracy and performance

Characteristics:

Aspect	Rating
Accuracy	Good
Generalization	Good
Speed	Fast
Resource usage	Low

spaCy's pre-trained models provide a quick way to get started with NER. Loading a model and processing text is straightforward—the model handles tokenization, feature extraction, and classification automatically. The code below shows how to extract entities from a text snippet and display them with their labels.

import spacy

nlp = spacy.load("en_core_web_sm")  # or en_core_web_trf for better accuracy

text = """
Elon Musk, CEO of Tesla, announced a $5 billion investment in Berlin.
The press conference was held at Tesla headquarters in Palo Alto.
"""

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:20} → {ent.label_}")

Typical output (exact predictions can vary with the model version):

Elon Musk            → PERSON
Tesla                → ORG
$5 billion           → MONEY
Berlin               → GPE
Tesla                → ORG
Palo Alto            → GPE

3. Transformer Models (BERT, RoBERTa)

Deep learning models that use self-attention to interpret each token in the context of the whole sentence.

When to use:

Highest accuracy required
Ambiguous entities that need context
Domain-specific fine-tuning

Characteristics:

Aspect	Rating
Accuracy	Excellent
Generalization	Very Good
Speed	Slower
Resource usage	High (GPU preferred)

Hugging Face's pipeline API provides a high-level interface for transformer-based NER. The aggregation_strategy="simple" parameter groups subword tokens back into complete entity spans, so each result is a whole entity rather than a word piece. This gives you a strong pre-trained model with just a few lines of code.

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Microsoft CEO Satya Nadella announced a partnership with OpenAI."
entities = ner(text)

for ent in entities:
    print(f"{ent['word']:20} → {ent['entity_group']} ({ent['score']:.3f})")
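
To make the grouping of subword tokens concrete, here is a simplified sketch of how labeled word pieces can be merged into entity spans. It illustrates the idea only and is not the library's implementation; the function name `group_subwords` and the way "Satya Nadella" is split into pieces are assumptions.

```python
def group_subwords(pieces, labels):
    """Merge labeled WordPiece tokens into (text, type) entity spans."""
    spans = []
    inside = False  # True while the previous piece belonged to an entity
    for piece, label in zip(pieces, labels):
        if label == "O":
            inside = False
            continue
        ent_type = label[2:]
        if inside and piece.startswith("##"):
            spans[-1][0] += piece[2:]        # Continuation of the same word
        elif inside and label.startswith("I-") and spans[-1][1] == ent_type:
            spans[-1][0] += " " + piece      # Next word of the same entity
        else:
            spans.append([piece, ent_type])  # Start of a new entity
        inside = True
    return [tuple(span) for span in spans]


# Hypothetical word pieces and per-piece predictions
pieces = ["Sat", "##ya", "Na", "##della", "announced"]
labels = ["B-PER", "I-PER", "I-PER", "I-PER", "O"]
print(group_subwords(pieces, labels))  # [('Satya Nadella', 'PER')]
```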

Choosing an Approach

Requirement	Best Approach
Known patterns, high precision	Rule-based
General NER, production speed	spaCy statistical
Maximum accuracy	Transformer
Mixed (known + general)	Rule-based + Statistical
Domain-specific entities	Fine-tuned Transformer

Hybrid approach: Combine rule-based patterns for known entities with statistical models for general coverage. spaCy's EntityRuler can run before or after the statistical NER.
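
Outside a spaCy pipeline, the same hybrid idea can be expressed as a merge with priority: keep every rule-based span and add statistical spans only where they do not overlap a rule. The sketch below assumes entities are `(start, end, label)` tuples with character offsets; `merge_entities` and the sample model output are hypothetical.

```python
def merge_entities(rule_ents, model_ents):
    """Combine two entity lists; rule-based spans win on overlap.

    Each entity is a (start, end, label) tuple with character offsets.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    merged = list(rule_ents)
    for ent in model_ents:
        if not any(overlaps(ent, rule) for rule in rule_ents):
            merged.append(ent)
    return sorted(merged)


text = "Order PRD-123456 for iPhone 15 Pro shipped to John Smith"
rule_ents = [(6, 16, "INTERNAL_CODE"), (21, 34, "PRODUCT")]
# Hypothetical model output, including a span that conflicts with a rule
model_ents = [(21, 27, "ORG"), (46, 56, "PERSON")]

for start, end, label in merge_entities(rule_ents, model_ents):
    print(f"{text[start:end]:15} {label}")
```

The conflicting `ORG` span is dropped because it overlaps the rule-based `PRODUCT` span, while the non-overlapping `PERSON` span is kept.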

Training Custom NER Models

Pre-trained models don't know your domain-specific entities. Custom training is needed for:

Medical terms (drug names, conditions, procedures)
Legal entities (case citations, contract clauses)
Financial data (ticker symbols, financial instruments)
Product catalogs (your company's products)

Training Data Requirements

NER training requires labeled data, either as token-level BIO tags or as character-offset spans (the format spaCy uses). As a rough guide:

Dataset Size	Typical Results
50-100 examples	Basic recognition, many errors
200-500 examples	Reasonable accuracy for common patterns
1000+ examples	Good generalization
5000+ examples	Production-quality for most use cases

Quality matters more than quantity. 500 diverse, well-labeled examples often beat 2000 noisy ones.

Training with spaCy

spaCy training uses annotated examples with character offsets. Each example pairs a text string with a dictionary containing entity spans defined by start position, end position, and label. This format allows precise annotation of exactly where entities appear in your text.
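
Character-offset annotations and BIO tags carry the same information, and converting between them is a common preprocessing step. The sketch below is a minimal illustration that assumes whitespace tokenization, which is a simplification; the function name `offsets_to_bio` is hypothetical.

```python
import re


def offsets_to_bio(text, entities):
    """Convert (start, end, label) character spans to per-token BIO tags.

    Tokens are whitespace-separated here, which is a simplification;
    a real pipeline would use the model's own tokenizer.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity if it lies fully inside the span
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = offsets_to_bio(
    "Download Xcode from the App Store",
    [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")],
)
for token, tag in zip(tokens, tags):
    print(f"{token:10} {tag}")
```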

The training process involves several steps:

Create a blank model and add the NER component
Register all entity labels from your training data
Convert annotations to spaCy's Example format
Run multiple training iterations with dropout for regularization

import spacy
from spacy.training import Example
import random

# Training data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("iPhone 15 Pro has a titanium frame", {"entities": [(0, 13, "PRODUCT")]}),
    ("The M3 chip delivers great performance", {"entities": [(4, 11, "HARDWARE")]}),
    ("macOS Sonoma includes new features", {"entities": [(0, 12, "SOFTWARE")]}),
    ("Download Xcode from the App Store", {"entities": [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")]}),
    # Add 50-200+ examples per entity type
]

def train_ner(train_data, n_iter=30):
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")

    # Add labels
    for _, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            ner.add_label(label)

    # Convert to Example objects
    examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]
    nlp.initialize(lambda: examples)

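    # Training loop: reshuffle every iteration and update one example at a time
    for i in range(n_iter):
        random.shuffle(examples)
        losses = {}
        for example in examples:
            nlp.update([example], drop=0.35, losses=losses)
        print(f"Iteration {i + 1}: loss = {losses['ner']:.3f}")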

    return nlp

nlp = train_ner(TRAIN_DATA)
nlp.to_disk("custom_ner")

Fine-tuning Transformers

For higher accuracy, fine-tune a transformer model:

Prepare data in token-level format with BIO labels
Tokenize carefully - align labels with subword tokens
Fine-tune using the Hugging Face Trainer

Key considerations:

Subword alignment: When "iPhone" becomes ["i", "##Phone"], a common convention is to label only the first subword and set the others to -100 so the loss function ignores them
Learning rate: Use 2e-5 to 5e-5 for fine-tuning
Epochs: 3-10 depending on dataset size
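
Subword alignment is the step that most often goes wrong, so a small sketch may help. It assumes you already have a `word_ids` list of the kind returned by the `word_ids()` method of Hugging Face fast tokenizers (one entry per subword, `None` for special tokens). The function name `align_labels` and the example tokenization are assumptions for illustration.

```python
IGNORE_INDEX = -100  # Label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_labels, word_ids):
    """Expand word-level label ids to subword level.

    word_ids maps each subword position to the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP]. Only the
    first subword of each word keeps the label; the rest are masked.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # Special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # First subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # Later subword of the same word
        previous = word_id
    return aligned


# Hypothetical tokenization of "iPhone ships": [CLS] i ##Phone ships [SEP]
label2id = {"O": 0, "B-PRODUCT": 1}
word_labels = [label2id["B-PRODUCT"], label2id["O"]]
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_labels, word_ids))  # [-100, 1, -100, 0, -100]
```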

Evaluation Metrics

NER uses entity-level metrics, not token-level:

Metric	Definition
Precision	Of predicted entities, what % are correct?
Recall	Of actual entities, what % did we find?
F1 Score	Harmonic mean of precision and recall

Strict matching: Both entity boundaries AND type must be exactly correct.
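
Written out, with TP as the number of predicted entities that exactly match a gold entity, FP as predicted entities with no exact match, and FN as gold entities that were not found, the metrics are as follows. The worked numbers assume a document with two gold entities where the model predicts exactly one of them and nothing else.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
\]
\[
TP = 1,\; FP = 0,\; FN = 1
\;\Rightarrow\;
P = 1,\quad R = \tfrac{1}{2},\quad
F_1 = \frac{2 \cdot 1 \cdot \tfrac{1}{2}}{1 + \tfrac{1}{2}} = \tfrac{2}{3} \approx 0.67
\]
```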

The seqeval library provides proper entity-level evaluation metrics that account for the BIO tagging scheme. Unlike token-level accuracy, these metrics only count an entity as correct if both the boundaries and type match exactly. This gives you a realistic picture of model performance.

from seqeval.metrics import classification_report

y_true = [["O", "B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]]  # Missed ORG

print(classification_report(y_true, y_pred))

The report lists precision, recall, and F1 for each entity type:

PER: 1.00 (correctly identified)
ORG: 0.00 (missed entirely)

Common Evaluation Mistakes

Token-level accuracy misleads: 95% token accuracy can mean 60% entity F1 if boundaries are wrong
O-class dominates: Most tokens aren't entities, so token accuracy looks artificially high
Partial matches: Identifying "Tim" instead of "Tim Cook" counts as a complete miss in strict evaluation
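
The gap between token accuracy and entity F1 is easy to reproduce on a tiny example. The sketch below is a simplified stand-in for a library such as seqeval, written in plain Python; the helper names are assumptions.

```python
def extract_entities(labels):
    """Return a set of (start, end, type) spans from one BIO sequence."""
    spans, start, ent_type = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # Sentinel closes a trailing entity
        continues = label.startswith("I-") and label[2:] == ent_type
        if start is not None and not continues:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, ent_type = i, label[2:]
    return spans


def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose label matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def entity_f1(y_true, y_pred):
    """Strict entity-level F1: boundaries and type must both match."""
    gold, pred = extract_entities(y_true), extract_entities(y_pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# "Tim Cook visited Paris": the prediction finds only "Tim"
y_true = ["B-PER", "I-PER", "O", "B-LOC"]
y_pred = ["B-PER", "O", "O", "B-LOC"]
print(f"token accuracy: {token_accuracy(y_true, y_pred):.2f}")  # 0.75
print(f"entity F1:      {entity_f1(y_true, y_pred):.2f}")       # 0.50
```

One wrong token costs a quarter of the token accuracy but half of the entity F1, because the truncated "Tim" counts as both a false positive and a missed entity.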

Common Challenges

1. Ambiguous Entities

The same text can be different entity types depending on context:

"Apple released new products"  → Apple = ORG
"I ate an apple for lunch"     → apple = not an entity
"Apple Martin is an actress"   → Apple Martin = PERSON

Solution: Transformer models handle context better than rule-based systems. They consider surrounding words when classifying.

2. Nested Entities

Some entities contain other entities:

"Bank of America headquarters in Charlotte"
- "Bank of America" → ORG
- "America" → GPE (nested inside ORG)
- "Charlotte" → GPE

Standard BIO tagging can't represent nesting. Solutions:

Use only the outermost entity
Use spaCy's SpanCategorizer for overlapping spans
Use specialized nested NER models
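
The first option, keeping only the outermost entity, is simple to apply once entities are stored as spans rather than BIO tags. The sketch below assumes `(start, end, label)` tuples with character offsets; `keep_outermost` is a hypothetical helper name.

```python
def keep_outermost(spans):
    """Drop any span that lies inside a strictly longer span.

    Each span is a (start, end, label) tuple with character offsets.
    """
    outermost = []
    for span in spans:
        contained = any(
            other[0] <= span[0]
            and span[1] <= other[1]
            and (other[1] - other[0]) > (span[1] - span[0])
            for other in spans
        )
        if not contained:
            outermost.append(span)
    return outermost


text = "Bank of America headquarters in Charlotte"
spans = [
    (0, 15, "ORG"),   # "Bank of America"
    (8, 15, "GPE"),   # "America", nested inside the ORG span
    (32, 41, "GPE"),  # "Charlotte"
]
print(keep_outermost(spans))  # [(0, 15, 'ORG'), (32, 41, 'GPE')]
```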

3. Long Documents

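Transformer models have input limits (typically 512 tokens), so documents that exceed the limit must be split into chunks. The challenge is that an entity can straddle a chunk boundary and be missed or truncated. Overlapping chunks reduce this risk: each chunk starts a little before the previous one ends, so an entity near a boundary appears whole in at least one chunk, as long as it is shorter than the overlap. Note that the function below measures chunk_size and overlap in characters, not tokens, so choose values that stay safely under the model's token limit. Truncated fragments at chunk edges can still be reported and may need filtering.

def process_long_doc(text, nlp, chunk_size=500, overlap=50):
    """Process a long document in overlapping character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    seen = set()
    entities = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        doc = nlp(text[start:end])

        for ent in doc.ents:
            # Map chunk-relative offsets back to document offsets
            key = (start + ent.start_char, start + ent.end_char, ent.label_)
            if key not in seen:  # Skip duplicates found in the overlap region
                seen.add(key)
                entities.append({
                    "text": ent.text,
                    "label": ent.label_,
                    "start": key[0],
                    "end": key[1]
                })

        if end == len(text):
            break  # Last chunk done; stepping back again would loop forever
        start = end - overlap  # Step back so boundary entities are seen whole

    return entities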

4. Domain-Specific Entities

Pre-trained models don't recognize specialized terminology:

Domain	Custom Entities Needed
Medical	Drug names, conditions, procedures
Legal	Case citations, contract terms
Finance	Ticker symbols, financial products
E-commerce	Product SKUs, brand names

Solutions:

Add rule-based patterns for known terms
Fine-tune on domain-specific data
Use domain-specific pre-trained models (BioBERT, LegalBERT)

Production Considerations

Batch Processing

For production workloads, processing one document at a time is inefficient. spaCy's nlp.pipe() method processes texts in batches, which reduces per-call overhead and lets a GPU work on several texts at once when one is available. The speedup depends on the model, hardware, and text length, but it is often several-fold compared with calling nlp() in a loop.

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Text 1...", "Text 2...", "Text 3..."]

# nlp.pipe streams the texts through the pipeline in batches
results = []
for doc in nlp.pipe(texts, batch_size=50):
    results.append([(ent.text, ent.label_) for ent in doc.ents])

Model Selection

Model	Speed	Accuracy	Approx. package size
`en_core_web_sm`	Fast	Good	12MB
`en_core_web_md`	Medium	Better	40MB
`en_core_web_lg`	Medium	Better	560MB
`en_core_web_trf`	Slow	Best	440MB
Custom BERT	Slow	Domain-best	~500MB

For production:

Start with en_core_web_sm for speed
Upgrade to trf if accuracy is insufficient
Fine-tune only if pre-trained models don't cover your entities

Combining Approaches

A practical production system often combines multiple approaches:

EntityRuler first: Catch known patterns with high confidence
Statistical NER: Handle general entities
Post-processing: Apply business rules (validation, deduplication)

Adding the EntityRuler before the NER component gives your patterns priority: the statistical model keeps entities that are already set and only labels the remaining tokens.

import spacy

nlp = spacy.load("en_core_web_sm")

# Add rule-based patterns BEFORE statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "INTERNAL_CODE", "pattern": [{"TEXT": {"REGEX": "PRD-\\d{6}"}}]},
])

# Now both custom patterns and statistical NER run
doc = nlp("Order PRD-123456 for iPhone 15 Pro shipped to John Smith")
for ent in doc.ents:
    print(f"{ent.text:20} → {ent.label_}")
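
The post-processing step is application-specific, but a minimal sketch might look like the following. The validation rule, the `CODE_PATTERN` format, the `postprocess` helper, and the sample entities are all assumptions for illustration.

```python
import re

CODE_PATTERN = re.compile(r"PRD-\d{6}")  # Assumed format of an internal code


def postprocess(entities):
    """Validate and deduplicate (text, label) entity pairs.

    - INTERNAL_CODE entities must match CODE_PATTERN exactly.
    - Repeated (text, label) pairs are reported once, in first-seen order.
    """
    seen = set()
    cleaned = []
    for text, label in entities:
        text = text.strip()
        if label == "INTERNAL_CODE" and not CODE_PATTERN.fullmatch(text):
            continue  # Fails validation
        key = (text.lower(), label)
        if key in seen:
            continue  # Duplicate mention (case-insensitive)
        seen.add(key)
        cleaned.append((text, label))
    return cleaned


# Hypothetical extracted entities
raw = [
    ("PRD-123456", "INTERNAL_CODE"),
    ("PRD-12", "INTERNAL_CODE"),
    ("iPhone 15 Pro", "PRODUCT"),
    ("iphone 15 pro", "PRODUCT"),
    ("John Smith", "PERSON"),
]
print(postprocess(raw))
```

Here the malformed code and the repeated product mention are dropped, leaving one entry per distinct entity.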

Conclusion

NER extracts structured entities from unstructured text. Key takeaways:

Understanding NER:

BIO tagging handles multi-token entities
Entity types depend on your use case
Evaluation must be entity-level, not token-level

Choosing an approach:

Rule-based for known patterns
spaCy statistical for general production use
Transformers for maximum accuracy
Combine approaches for best results

Custom training:

Needed for domain-specific entities
Quality of labels matters more than quantity
200-500 diverse examples is a good starting point

Production tips:

Use batch processing (nlp.pipe)
Start simple, add complexity as needed
Monitor and iterate on real data

References

spaCy NER Documentation - Production NLP library.
Hugging Face Token Classification - Transformer-based NER.
seqeval - Sequence labeling evaluation.
Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers". NAACL 2019.
OntoNotes 5.0 Entity Types - Standard entity definitions.