Named Entity Recognition: Extracting Structured Information from Text
- Author: Jared Chung
Introduction
Named Entity Recognition (NER) is a fundamental NLP task that extracts structured information from unstructured text. It identifies and classifies mentions of real-world entities into categories like people, organizations, locations, and dates.
NER powers many practical applications:
- Email processing: Extract contacts, meetings, and action items
- News analysis: Identify companies, people, and events mentioned
- Resume parsing: Extract skills, education, and work history
- Data anonymization: Detect and redact PII (personally identifiable information)
- Knowledge graphs: Build structured databases from documents
This guide explains how NER works, compares different approaches, and shows how to handle domain-specific entities.
How NER Works
The NER Pipeline
A typical NER system processes text in stages:
- Tokenization: Split text into individual words or subwords
- Encoding: Convert tokens to numerical representations
- Classification: Predict entity labels for each token
- Span extraction: Combine labeled tokens into entity spans
BIO Tagging Scheme
NER operates at the token level, but entities often span multiple tokens. The BIO scheme handles this:
- B-XXX: Beginning of entity type XXX
- I-XXX: Inside (continuation) of entity type XXX
- O: Outside any entity
Example:
| Token | Label | Meaning |
|---|---|---|
| Apple | B-ORG | Start of organization |
| Inc. | I-ORG | Continuation of organization |
| CEO | O | Not an entity |
| Tim | B-PER | Start of person |
| Cook | I-PER | Continuation of person |
| visited | O | Not an entity |
| Paris | B-LOC | Location (single token) |
The BIO scheme ensures multi-word entities like "Apple Inc." and "Tim Cook" are correctly grouped.
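The span-extraction step can be sketched in plain Python. The helper name `bio_to_spans` is hypothetical and not part of any library. It treats an `I-` tag that does not continue the open entity as the start of a new one, which is one common convention rather than the only one.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labeled tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity currently being built, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label == "O":
            flush()
            continue
        prefix, _, ent_type = label.partition("-")
        # A B- tag, or an I- tag that does not continue the open entity,
        # starts a new span.
        if prefix == "B" or ent_type != current_type:
            flush()
            current_type = ent_type
        current_tokens.append(token)
    flush()
    return spans


tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "visited", "Paris"]
labels = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, labels))
# [('Apple Inc.', 'ORG'), ('Tim Cook', 'PER'), ('Paris', 'LOC')]
```

Joining tokens with a space is a simplification. A production system would usually keep character offsets so the original text can be recovered exactly.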
Standard Entity Types
Different datasets define different entity categories. The most common:
OntoNotes 5.0 (spaCy default)
| Type | Description | Examples |
|---|---|---|
| PERSON | People, including fictional | Barack Obama, Sherlock Holmes |
| ORG | Companies, agencies, institutions | Google, FBI, Stanford |
| GPE | Countries, cities, states | France, New York |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| DATE | Dates and periods | June 2023, yesterday |
| TIME | Times of day | 3:00 PM, morning |
| MONEY | Monetary values | $500, fifty euros |
| PERCENT | Percentages | 25%, three percent |
| PRODUCT | Objects, vehicles | iPhone, Boeing 747 |
| EVENT | Named events | Olympics, World War II |
CoNLL-2003 (Standard benchmark)
| Type | Description |
|---|---|
| PER | Person names |
| ORG | Organizations |
| LOC | Locations |
| MISC | Miscellaneous |
The entity types you need depend on your application. Custom domains (medical, legal, financial) typically require custom entity types.
Approaches to NER
There are three main approaches, each with different tradeoffs:
1. Rule-Based (Pattern Matching)
Match explicit patterns like regular expressions or keyword lists.
When to use:
- Known, finite set of entities (product names, internal codes)
- High precision required (legal compliance)
- No training data available
Characteristics:
| Aspect | Rating |
|---|---|
| Accuracy on known patterns | Excellent |
| Generalization to unseen text | Poor |
| Speed | Very Fast |
| Maintenance effort | High |
import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
    {"label": "TECH", "pattern": [{"LOWER": {"IN": ["pytorch", "tensorflow"]}}]},
]
ruler.add_patterns(patterns)
doc = nlp("I use PyTorch on my iPhone 15")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
2. Statistical Models (spaCy)
Machine learning models trained on labeled data, using features like word shape, context, and embeddings.
When to use:
- General-purpose NER on standard entity types
- Production systems requiring speed
- Balance between accuracy and performance
Characteristics:
| Aspect | Rating |
|---|---|
| Accuracy | Good |
| Generalization | Good |
| Speed | Fast |
| Resource usage | Low |
import spacy
nlp = spacy.load("en_core_web_sm") # or en_core_web_trf for better accuracy
text = """
Elon Musk, CEO of Tesla, announced a $5 billion investment in Berlin.
The press conference was held at Tesla headquarters in Palo Alto.
"""
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:20} → {ent.label_}")
Typical output (exact predictions can vary with the model version):
Elon Musk            → PERSON
Tesla                → ORG
$5 billion           → MONEY
Berlin               → GPE
Tesla                → ORG
Palo Alto            → GPE
3. Transformer Models (BERT, RoBERTa)
Deep learning models that use self-attention to build a context-aware representation of every token.
When to use:
- Highest accuracy required
- Ambiguous entities that need context
- Domain-specific fine-tuning
Characteristics:
| Aspect | Rating |
|---|---|
| Accuracy | Excellent |
| Generalization | Very Good |
| Speed | Slower |
| Resource usage | High (GPU preferred) |
from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Microsoft CEO Satya Nadella announced a partnership with OpenAI."
entities = ner(text)
for ent in entities:
    print(f"{ent['word']:20} → {ent['entity_group']} ({ent['score']:.3f})")
Choosing an Approach
| Requirement | Best Approach |
|---|---|
| Known patterns, high precision | Rule-based |
| General NER, production speed | spaCy statistical |
| Maximum accuracy | Transformer |
| Mixed (known + general) | Rule-based + Statistical |
| Domain-specific entities | Fine-tuned Transformer |
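Hybrid approach: Combine rule-based patterns for known entities with statistical models for general coverage. spaCy's EntityRuler can run before or after the statistical NER. Placed before, its matches are kept and the model predicts around them. Placed after, it only adds entities that do not overlap the model's predictions, unless overwrite_ents is enabled.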
Training Custom NER Models
Pre-trained models don't know your domain-specific entities. Custom training is needed for:
- Medical terms (drug names, conditions, procedures)
- Legal entities (case citations, contract clauses)
- Financial data (ticker symbols, financial instruments)
- Product catalogs (your company's products)
Training Data Requirements
NER training requires text with labeled entity spans, stored either as token-level BIO tags or as character offsets. Rough guidance on how much data is needed:
| Dataset Size | Typical Results |
|---|---|
| 50-100 examples | Basic recognition, many errors |
| 200-500 examples | Reasonable accuracy for common patterns |
| 1000+ examples | Good generalization |
| 5000+ examples | Production-quality for most use cases |
Quality matters more than quantity. 500 diverse, well-labeled examples often beat 2000 noisy ones.
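Annotation tools often export entities as character offsets rather than BIO tags. The following is a minimal sketch of converting one format to the other, assuming whitespace tokenization and entity spans that line up with token boundaries. The function name `offsets_to_bio` is hypothetical, and real tokenizers also split punctuation.

```python
import re


def offsets_to_bio(text, entities):
    """Convert (start, end, label) character spans to per-token BIO tags.

    Tokens are whitespace-separated, which is a simplification.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to a span if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                # The first token of the span gets B-, later ones get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = offsets_to_bio(
    "Download Xcode from the App Store",
    [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")],
)
for token, tag in zip(tokens, tags):
    print(f"{token:10} {tag}")
```

Tokens that only partly overlap a span are tagged `O` here. In practice such cases usually signal an annotation error that is worth reviewing by hand.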
Training with spaCy
spaCy training uses annotated examples with character offsets:
import random
import spacy
from spacy.training import Example
# Training data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("iPhone 15 Pro has a titanium frame", {"entities": [(0, 13, "PRODUCT")]}),
    ("The M3 chip delivers great performance", {"entities": [(4, 11, "HARDWARE")]}),
    ("macOS Sonoma includes new features", {"entities": [(0, 12, "SOFTWARE")]}),
    ("Download Xcode from the App Store", {"entities": [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")]}),
    # Add 50-200+ examples per entity type
]
def train_ner(train_data, n_iter=30):
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    # Register every label that appears in the training data
    for _, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            ner.add_label(label)
    # Convert to Example objects
    examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]
    # initialize() returns the optimizer used for the updates below
    optimizer = nlp.initialize(lambda: examples)
    # Training loop
    for i in range(n_iter):
        random.shuffle(examples)
        losses = {}
        for example in examples:
            nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)
        print(f"Iteration {i + 1}: NER loss {losses.get('ner', 0.0):.3f}")
    return nlp
nlp = train_ner(TRAIN_DATA)
nlp.to_disk("custom_ner")
Fine-tuning Transformers
For higher accuracy, fine-tune a transformer model:
- Prepare data in token-level format with BIO labels
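- Tokenize carefully and align labels with subword tokens
- Fine-tune using the Hugging Face Trainer
Key considerations:
- Subword alignment: When a word such as "iPhone" is split into subwords like ["i", "##Phone"], only the first subword gets the label; the remaining subwords are commonly assigned the ignore index -100 so they do not contribute to the loss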
- Learning rate: Use 2e-5 to 5e-5 for fine-tuning
- Epochs: 3-10 depending on dataset size
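The subword alignment step is the part that most often goes wrong, so here is a minimal sketch of it in plain Python. It assumes you already have a `word_ids` list of the kind Hugging Face fast tokenizers return from `word_ids()`: one entry per subword giving the index of its source word, or `None` for special tokens. The function name `align_labels`, the example tokenization, and the label ids are hypothetical.

```python
IGNORE_INDEX = -100  # Label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_label_ids):
    """Give each word's label to its first subword and mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id != previous_word:
            aligned.append(word_label_ids[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous_word = word_id
    return aligned


# Hypothetical tokenization of the words ["iPhone", "ships", "today"]:
# [CLS] i ##Phone ships today [SEP]
word_ids = [None, 0, 0, 1, 2, None]
label2id = {"O": 0, "B-PRODUCT": 1, "I-PRODUCT": 2}
word_labels = [label2id["B-PRODUCT"], label2id["O"], label2id["O"]]
print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 0, 0, -100]
```

An alternative convention labels continuation subwords with the matching `I-` tag instead of masking them. Either can work as long as training and evaluation use the same choice.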
Evaluation Metrics
NER uses entity-level metrics, not token-level:
| Metric | Definition |
|---|---|
| Precision | Of predicted entities, what % are correct? |
| Recall | Of actual entities, what % did we find? |
| F1 Score | Harmonic mean of precision and recall |
Strict matching: Both entity boundaries AND type must be exactly correct.
from seqeval.metrics import classification_report
y_true = [["O", "B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]] # Missed ORG
print(classification_report(y_true, y_pred))
Output shows per-entity-type metrics, using the label names from the tags:
- PER: precision, recall, and F1 of 1.00 (correctly identified)
- ORG: 0.00 (missed entirely)
Common Evaluation Mistakes
- Token-level accuracy misleads: 95% token accuracy can mean 60% entity F1 if boundaries are wrong
- O-class dominates: Most tokens aren't entities, so token accuracy looks artificially high
- Partial matches: Identifying "Tim" instead of "Tim Cook" counts as a complete miss in strict evaluation
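The gap between token accuracy and strict entity-level scores can be reproduced by hand. The sketch below is a simplified stand-in for what a library such as seqeval computes, not its actual implementation. The helper names are hypothetical and it handles a single sentence only.

```python
def extract_entities(labels):
    """Return the set of (start, end, type) spans in one BIO sequence."""
    entities, start, ent_type = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        prefix, _, current = label.partition("-")
        continues = prefix == "I" and current == ent_type
        if start is not None and not continues:
            entities.add((start, i, ent_type))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, current
    return entities


def strict_scores(y_true, y_pred):
    """Entity-level precision, recall, F1 with exact boundary and type match."""
    gold, pred = extract_entities(y_true), extract_entities(y_pred)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Gold: "Tim Cook" is one PER entity. Prediction: only "Tim" is tagged.
y_true = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "O"]
y_pred = ["B-PER", "O", "O", "O", "O", "O", "O", "O", "O", "O"]

token_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"token accuracy: {token_acc:.2f}")                    # 0.90
print("entity P/R/F1:", strict_scores(y_true, y_pred))       # (0.0, 0.0, 0.0)
```

Nine of ten tokens are labeled correctly, yet the only entity in the sentence counts as a miss because its boundary is wrong.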
Common Challenges
1. Ambiguous Entities
The same text can be different entity types depending on context:
"Apple released new products" → Apple = ORG
"I ate an apple for lunch" → apple = not an entity
"Apple Martin is an actress" → Apple Martin = PERSON
Solution: Transformer models handle context better than rule-based systems. They consider surrounding words when classifying.
2. Nested Entities
Some entities contain other entities:
"Bank of America headquarters in Charlotte"
- "Bank of America" → ORG
- "America" → GPE (nested inside ORG)
- "Charlotte" → GPE
Standard BIO tagging can't represent nesting. Solutions:
- Use only the outermost entity
- Use spaCy's SpanCategorizer for overlapping spans
- Use specialized nested NER models
3. Long Documents
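Transformer models have input limits (typically 512 tokens for BERT-style models), so long documents must be split into overlapping chunks. The function below splits by character count, which only approximates the token limit: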
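def process_long_doc(text, nlp, chunk_size=500, overlap=50):
    """Process long documents with overlapping chunks.

    chunk_size and overlap are measured in characters, not tokens,
    and overlap must be smaller than chunk_size.
    """
    entities = []
    seen = set()  # (start, end, label) keys already collected from an earlier chunk
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        doc = nlp(text[start:end])
        for ent in doc.ents:
            key = (start + ent.start_char, start + ent.end_char, ent.label_)
            if key in seen:
                continue  # Skip duplicates found in the overlap region
            seen.add(key)
            entities.append({
                "text": ent.text,
                "label": ent.label_,
                "start": key[0],
                "end": key[1],
            })
        if end == len(text):
            break  # Last chunk done; without this check the loop never terminates
        start = end - overlap  # Overlap to catch split entities
    return entities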
4. Domain-Specific Entities
Pre-trained models don't recognize specialized terminology:
| Domain | Custom Entities Needed |
|---|---|
| Medical | Drug names, conditions, procedures |
| Legal | Case citations, contract terms |
| Finance | Ticker symbols, financial products |
| E-commerce | Product SKUs, brand names |
Solutions:
- Add rule-based patterns for known terms
- Fine-tune on domain-specific data
- Use domain-specific pre-trained models (BioBERT, LegalBERT)
Production Considerations
Batch Processing
Process multiple texts efficiently:
texts = ["Text 1...", "Text 2...", "Text 3..."]
# spaCy's nlp.pipe is much faster than individual calls
for doc in nlp.pipe(texts, batch_size=50):
    entities = [(ent.text, ent.label_) for ent in doc.ents]
Model Selection
| Model | Speed | Accuracy | Memory |
|---|---|---|---|
| en_core_web_sm | Fast | Good | 12MB |
| en_core_web_md | Medium | Better | 40MB |
| en_core_web_lg | Medium | Better | 560MB |
| en_core_web_trf | Slow | Best | 440MB |
| Custom BERT | Slow | Domain-best | ~500MB |
For production:
- Start with en_core_web_sm for speed
- Upgrade to en_core_web_trf if accuracy is insufficient
- Fine-tune only if pre-trained models don't cover your entities
Combining Approaches
A practical production system often combines multiple approaches:
- EntityRuler first: Catch known patterns with high confidence
- Statistical NER: Handle general entities
- Post-processing: Apply business rules (validation, deduplication)
import spacy
nlp = spacy.load("en_core_web_sm")
# Add rule-based patterns BEFORE statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "INTERNAL_CODE", "pattern": [{"TEXT": {"REGEX": "PRD-\\d{6}"}}]},
])
# Now both custom patterns and statistical NER run
doc = nlp("Order PRD-123456 for iPhone 15 Pro shipped to John Smith")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
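The post-processing step is application-specific, so the following is only an illustration. It assumes entities have been collected as (text, label) pairs and that a valid internal code is exactly `PRD-` followed by six digits. The function name `postprocess` and the sample data are hypothetical.

```python
import re

# Assumed code format, mirroring the INTERNAL_CODE pattern used with the ruler.
CODE_PATTERN = re.compile(r"PRD-\d{6}")


def postprocess(entities):
    """Drop malformed internal codes and repeated (text, label) pairs."""
    seen = set()
    cleaned = []
    for text, label in entities:
        # Validation: require the whole string to match, not just a substring.
        if label == "INTERNAL_CODE" and not CODE_PATTERN.fullmatch(text):
            continue
        # Deduplication: keep the first occurrence and preserve order.
        if (text, label) in seen:
            continue
        seen.add((text, label))
        cleaned.append((text, label))
    return cleaned


raw = [
    ("PRD-123456", "INTERNAL_CODE"),
    ("iPhone 15 Pro", "PRODUCT"),
    ("PRD-1234567", "INTERNAL_CODE"),  # seven digits: rejected
    ("iPhone 15 Pro", "PRODUCT"),      # duplicate: dropped
    ("John Smith", "PERSON"),
]
print(postprocess(raw))
```

Keeping these rules outside the spaCy pipeline makes them easy to unit-test, though the same logic could also live in a custom pipeline component.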
Conclusion
NER extracts structured entities from unstructured text. Key takeaways:
Understanding NER:
- BIO tagging handles multi-token entities
- Entity types depend on your use case
- Evaluation must be entity-level, not token-level
Choosing an approach:
- Rule-based for known patterns
- spaCy statistical for general production use
- Transformers for maximum accuracy
- Combine approaches for best results
Custom training:
- Needed for domain-specific entities
- Quality of labels matters more than quantity
- 200-500 diverse examples is a good starting point
Production tips:
- Use batch processing (nlp.pipe)
- Start simple, add complexity as needed
- Monitor and iterate on real data