Named Entity Recognition: Extracting Information from Text
Every day, NLP systems automatically extract structured information from billions of documents: news articles, contracts, emails, medical records, social media posts. The engine driving this extraction is Named Entity Recognition (NER) — a task that identifies and classifies named entities in text: persons, organizations, locations, dates, monetary values, and much more.
NER is the first step in many information extraction pipelines: without knowing who does what, where and when, we cannot build knowledge graphs, feed RAG systems, automate contract analysis, or parse financial news. In this article we build NER systems from a spaCy baseline to BERT fine-tuning, with specific attention to Italian-language text.
What You Will Learn
- What NER is and the main entity categories (PER, ORG, LOC, DATE, MONEY...)
- The BIO (Beginning-Inside-Outside) format for token annotation
- NER with spaCy: pre-trained models and customization
- Fine-tuning BERT for NER with HuggingFace Transformers
- Metrics: span-level F1, precision, recall with seqeval
- Handling WordPiece tokenization for label alignment
- NER for Italian with spaCy it_core_news and Italian BERT models
- NER on long documents: sliding window and post-processing
- Advanced architectures: CRF layer, RoBERTa, DeBERTa for NER
- Production pipeline, visualization and end-to-end case study
1. What is Named Entity Recognition
NER is a token classification task: for each token in the text, the model must predict whether it is part of a named entity and what type it belongs to. Unlike sentence classification (which produces one output per sentence), NER produces one output per token — this makes it more complex both architecturally and in post-processing.
NER Example
Input: "Elon Musk founded Tesla in 2003 in San Carlos, California."
Annotated output:
- Elon Musk → PER (person)
- Tesla → ORG (organization)
- 2003 → DATE
- San Carlos → LOC (location)
- California → LOC (location)
1.1 The BIO Format
NER annotation uses the BIO (Beginning-Inside-Outside) format:
- B-TYPE: first token of an entity of type TYPE
- I-TYPE: token inside an entity of type TYPE
- O: token outside any named entity
# BIO format example
sentence = "Elon Musk founded Tesla in San Carlos in 2003"
bio_labels = [
("Elon", "B-PER"), # beginning of person
("Musk", "I-PER"), # inside person
("founded", "O"),
("Tesla", "B-ORG"), # beginning of organization
("in", "O"),
("San", "B-LOC"), # beginning of location
("Carlos", "I-LOC"), # inside location
("in", "O"),
("2003", "B-DATE"), # date
]
# BIOES format (extended): adds S-TYPE for single-token entities
# and E-TYPE for the final token of multi-token entities.
# Here "Tesla" would be tagged S-ORG instead of B-ORG.
# Plain BIO remains the most common scheme in modern NER datasets.
# Label set for CoNLL-2003 (most widely used NER benchmark):
CONLL_LABELS = [
'O',
'B-PER', 'I-PER', # persons
'B-ORG', 'I-ORG', # organizations
'B-LOC', 'I-LOC', # locations
'B-MISC', 'I-MISC', # miscellaneous
]
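Going from BIO tags back to entity spans is a small but error-prone step. A minimal decoder, written here in pure Python with no library assumed, makes the scheme concrete:

```python
def bio_to_spans(tokens, labels):
    """Decode (token, BIO-label) pairs into (entity_text, type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:                      # close the previous entity
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)            # continue the current entity
        else:                                       # "O" or an inconsistent I- tag
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                              # flush a trailing entity
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["Elon", "Musk", "founded", "Tesla", "in", "San", "Carlos", "in", "2003"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O", "B-DATE"]
print(bio_to_spans(tokens, labels))
# [('Elon Musk', 'PER'), ('Tesla', 'ORG'), ('San Carlos', 'LOC'), ('2003', 'DATE')]
```

Note that an I- tag whose type does not match the open entity is treated as O here; stricter decoders (and seqeval) handle this edge case differently.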
1.2 NER Benchmarks and Datasets
Standard NER Datasets for Benchmarking
| Dataset | Language | Entities | Train size | Best F1 |
|---|---|---|---|---|
| CoNLL-2003 | EN | PER, ORG, LOC, MISC | 14,041 sent | ~94% (DeBERTa) |
| OntoNotes 5.0 | EN | 18 types | ~75K sent | ~92% |
| Evalita 2009 NER | IT | PER, ORG, LOC, GPE | ~10K sent | ~88% |
| WikiNEuRal IT | IT | PER, ORG, LOC, MISC | ~40K sent | ~90% |
| I2B2 2014 | EN (medical) | PHI (de-identification) | 27K sent | ~97% |
2. NER with spaCy
spaCy offers pre-trained NER models for many languages, including Italian. It is the fastest starting point for a production NER system.
2.1 Out-of-the-Box NER with spaCy
import spacy
from spacy import displacy
# Load Italian model with NER
# python -m spacy download it_core_news_lg
nlp_it = spacy.load("it_core_news_lg")
# English model for comparison
# python -m spacy download en_core_web_trf
nlp_en = spacy.load("en_core_web_trf") # Transformer-based, more accurate
# NER on Italian text
text_it = """
Il presidente Sergio Mattarella ha incontrato ieri a Roma il CEO di Stellantis
Carlos Tavares per discutere del piano industriale 2025-2030.
L'incontro è avvenuto al Quirinale e ha riguardato investimenti per 5 miliardi di euro.
"""
doc_it = nlp_it(text_it)
print("=== Italian NER ===")
for ent in doc_it.ents:
print(f" '{ent.text}' -> {ent.label_} ({spacy.explain(ent.label_)})")
# NER on English text
text_en = "Apple CEO Tim Cook announced a new $3 billion investment in Austin, Texas on Monday."
doc_en = nlp_en(text_en)
print("\n=== English NER ===")
for ent in doc_en.ents:
print(f" '{ent.text}' -> {ent.label_}")
# HTML visualization (useful in Jupyter)
html = displacy.render(doc_en, style="ent", page=False)
with open("ner_visualization.html", "w") as f:
f.write(html)
2.2 spaCy Entity Categories for Italian
| Label | Type | Example |
|---|---|---|
| PER | Person | Mario Draghi, Sophia Loren |
| ORG | Organization | ENI, Juventus, Banca d'Italia |
| LOC | Generic location | Alpi, Mar Mediterraneo |
| GPE | Geopolitical entity | Italia, Roma, Lombardia |
| DATE | Date/period | 3 marzo, estate 2024 |
| MONEY | Currency | 5 miliardi di euro |
| MISC | Miscellaneous | Coppa del Mondo, COVID-19 |
2.3 Training a Custom spaCy NER Model
import spacy
from spacy.training import Example
import random
# Annotated training data (with character offsets)
TRAIN_DATA = [
(
"La startup Satispay ha raccolto 320 milioni dalla BAFIN.",
{"entities": [(11, 19, "ORG"), (32, 43, "MONEY"), (50, 55, "ORG")]}
),
(
"Andrea Pirlo allena la Juve a Torino.",
{"entities": [(0, 12, "PER"), (23, 27, "ORG"), (30, 36, "LOC")]}
),
(
"Ferrari ha presentato la nuova SF-23 al Gran Premio di Monza.",
{"entities": [(0, 7, "ORG"), (31, 36, "MISC"), (40, 60, "MISC")]}
),
]
def train_custom_ner(train_data, n_iter=30):
"""Train a custom spaCy NER component."""
nlp = spacy.blank("it")
ner = nlp.add_pipe("ner")
# Add labels
for _, annotations in train_data:
for _, _, label in annotations.get("entities", []):
ner.add_label(label)
# Training loop
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.select_pipes(disable=other_pipes):  # spaCy 3 replacement for disable_pipes
        optimizer = nlp.initialize()  # spaCy 3 replacement for begin_training()
for i in range(n_iter):
random.shuffle(train_data)
losses = {}
for text, annotations in train_data:
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
nlp.update([example], sgd=optimizer, losses=losses)
if (i + 1) % 10 == 0:
print(f"Iteration {i+1}: losses = {losses}")
return nlp
custom_nlp = train_custom_ner(TRAIN_DATA)
# Test
test_text = "Enel ha investito 2 miliardi a Milano."
doc = custom_nlp(test_text)
for ent in doc.ents:
print(f" '{ent.text}' -> {ent.label_}")
3. NER with BERT and HuggingFace Transformers
Transformer models outperform spaCy on most NER benchmarks, especially on complex text or when entities are ambiguous. They require more data and training time, but deliver significantly higher precision and recall on challenging entity types.
3.1 CoNLL-2003 Dataset
from datasets import load_dataset
# CoNLL-2003: standard English NER benchmark
dataset = load_dataset("conll2003")  # recent datasets versions may require the "eriktks/conll2003" mirror
print(dataset)
# train: 14,041 | validation: 3,250 | test: 3,453
# Dataset structure
example = dataset['train'][0]
print("Tokens:", example['tokens'])
print("NER tags:", example['ner_tags'])
# Tokens: ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
# NER tags: [3, 0, 7, 0, 0, 0, 7, 0, 0]
# (3=B-ORG, 0=O, 7=B-MISC)
# ID to label mapping
label_names = dataset['train'].features['ner_tags'].feature.names
print("Labels:", label_names)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
3.2 The Token-Label Alignment Problem
BERT uses WordPiece tokenization: a word can be split into multiple subtokens. We must align word-level NER labels with BERT subtokens. This is one of the transformer-specific challenges in NER that does not exist with spaCy.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Example: word "Johannesburg" and its labels
words = ["Johannesburg", "is", "the", "largest", "city"]
word_labels = ["B-LOC", "O", "O", "O", "O"]
# WordPiece tokenization
tokenized = tokenizer(
words,
is_split_into_words=True, # input already word-tokenized
return_offsets_mapping=True
)
print("Subword tokens:", tokenizer.convert_ids_to_tokens(tokenized['input_ids']))
# ['[CLS]', 'Johann', '##es', '##burg', 'is', 'the', 'largest', 'city', '[SEP]']
# Label alignment (strategy: -100 for non-first subtokens)
def align_labels(tokenized, word_labels, label2id):
word_ids = tokenized.word_ids()
label_ids = []
prev_word_id = None
for word_id in word_ids:
if word_id is None:
# Special token [CLS] or [SEP]
label_ids.append(-100)
elif word_id != prev_word_id:
# First subtoken of word: use the real label
label_ids.append(label2id[word_labels[word_id]])
else:
# Subsequent subtokens: -100 (ignored in loss)
label_ids.append(-100)
prev_word_id = word_id
return label_ids
label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2, "B-PER": 3, "I-PER": 4,
"B-ORG": 5, "I-ORG": 6, "B-MISC": 7, "I-MISC": 8}
aligned = align_labels(tokenized, word_labels, label2id)
tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])
for tok, lab in zip(tokens, aligned):
print(f" {tok:15s}: {lab}")
# [CLS] : -100
# Johann : 1 (B-LOC)
# ##es : -100 (ignored)
# ##burg : -100 (ignored)
# is : 0 (O)
# ...
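An alternative to masking continuation subtokens with -100 is to propagate the word label to every subtoken, converting B- into I- for the continuation pieces (this is what the `label_all_tokens` option does in HuggingFace example scripts). A sketch over a precomputed word_ids list, with hypothetical values so no tokenizer is needed:

```python
def propagate_labels(word_ids, word_labels):
    """Give every subtoken a label; continuation pieces get the I- variant."""
    label_ids = []
    prev_word_id = None
    for word_id in word_ids:
        if word_id is None:                 # special tokens [CLS] / [SEP]
            label_ids.append(None)
        elif word_id != prev_word_id:       # first subtoken: keep the word label
            label_ids.append(word_labels[word_id])
        else:                               # continuation subtoken: B-X becomes I-X
            label = word_labels[word_id]
            label_ids.append("I-" + label[2:] if label.startswith("B-") else label)
        prev_word_id = word_id
    return label_ids

# word_ids as tokenized.word_ids() would return for
# ['[CLS]', 'Johann', '##es', '##burg', 'is', '[SEP]']
word_ids = [None, 0, 0, 0, 1, None]
print(propagate_labels(word_ids, ["B-LOC", "O"]))
# [None, 'B-LOC', 'I-LOC', 'I-LOC', 'O', None]
```

With this strategy every subtoken contributes to the loss; empirically the two approaches give very similar F1, so the -100 masking shown above remains the more common default.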
3.3 Complete BERT Fine-tuning for NER
from transformers import (
AutoModelForTokenClassification,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForTokenClassification
)
from datasets import load_dataset
import evaluate
import numpy as np
# Configuration
MODEL_NAME = "bert-base-cased"
DATASET_NAME = "conll2003"
MAX_LENGTH = 128
dataset = load_dataset(DATASET_NAME)
label_names = dataset['train'].features['ner_tags'].feature.names
num_labels = len(label_names)
id2label = {i: l for i, l in enumerate(label_names)}
label2id = {l: i for i, l in enumerate(label_names)}
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
def tokenize_and_align_labels(examples):
tokenized = tokenizer(
examples["tokens"],
truncation=True,
max_length=MAX_LENGTH,
is_split_into_words=True
)
all_labels = []
for i, labels in enumerate(examples["ner_tags"]):
word_ids = tokenized.word_ids(batch_index=i)
label_ids = []
prev_word_id = None
for word_id in word_ids:
if word_id is None:
label_ids.append(-100)
elif word_id != prev_word_id:
label_ids.append(labels[word_id])
else:
label_ids.append(-100)
prev_word_id = word_id
all_labels.append(label_ids)
tokenized["labels"] = all_labels
return tokenized
tokenized_datasets = dataset.map(
tokenize_and_align_labels,
batched=True,
remove_columns=dataset["train"].column_names
)
# Model
model = AutoModelForTokenClassification.from_pretrained(
MODEL_NAME,
num_labels=num_labels,
id2label=id2label,
label2id=label2id
)
# Data collator with dynamic padding for NER
data_collator = DataCollatorForTokenClassification(tokenizer)
# seqeval metrics for span-level NER evaluation
seqeval = evaluate.load("seqeval")
def compute_metrics(p):
predictions, labels = p
predictions = np.argmax(predictions, axis=2)
true_predictions = [
[label_names[p] for (p, l) in zip(pred, label) if l != -100]
for pred, label in zip(predictions, labels)
]
true_labels = [
[label_names[l] for (p, l) in zip(pred, label) if l != -100]
for pred, label in zip(predictions, labels)
]
results = seqeval.compute(predictions=true_predictions, references=true_labels)
return {
"precision": results["overall_precision"],
"recall": results["overall_recall"],
"f1": results["overall_f1"],
"accuracy": results["overall_accuracy"],
}
# Training
args = TrainingArguments(
output_dir="./results/bert-ner-conll",
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=64,
learning_rate=2e-5,
warmup_ratio=0.1,
weight_decay=0.01,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers versions
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="f1",
fp16=True,
report_to="none"
)
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
trainer.train()
# Expected F1 on CoNLL-2003 test: ~91-92% (BERT-base-cased)
# With RoBERTa-large: ~93-94%
4. Advanced NER Architectures
Beyond classic BERT fine-tuning, there are architectural variants that improve NER performance, particularly for capturing dependencies between BIO labels.
4.1 BERT + CRF Layer
A CRF (Conditional Random Field) applied on top of BERT imposes
structural constraints on label sequences: for example, an I-ORG
token cannot follow a B-PER. This reduces common sequence errors
in purely neural architectures.
# BERT + CRF with pytorch-crf
# pip install pytorch-crf
import torch
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel
from torchcrf import CRF
class BertCRFForNER(BertPreTrainedModel):
"""BERT fine-tuned with a CRF layer for NER."""
def __init__(self, config, num_labels):
super().__init__(config)
self.bert = BertModel(config)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.classifier = nn.Linear(config.hidden_size, num_labels)
self.crf = CRF(num_labels, batch_first=True)
self.init_weights()
def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids
)
sequence_output = self.dropout(outputs[0])
emissions = self.classifier(sequence_output) # (batch, seq_len, num_labels)
        if labels is not None:
            # Training: CRF negative log-likelihood.
            # Note: the CRF cannot handle -100 labels; map ignored positions
            # to the 'O' index (and rely on the mask) before calling it.
            loss = -self.crf(emissions, labels, mask=attention_mask.bool(), reduction='mean')
            return {'loss': loss, 'logits': emissions}
else:
# Inference: Viterbi decoding
predictions = self.crf.decode(emissions, mask=attention_mask.bool())
return {'predictions': predictions, 'logits': emissions}
# Advantages of CRF:
# + Guarantees valid BIO sequences (no I-X without B-X before)
# + Improves F1 by ~0.5-1.5 points on CoNLL
# Disadvantages:
# - Slower at inference (Viterbi decoding O(n * L^2))
# - More complex to implement
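The constraints the CRF enforces can be stated explicitly. As a small pure-Python sketch, independent of any CRF library, a validity check over BIO transitions shows exactly which label pairs a decoder should never emit:

```python
def is_valid_bio_transition(prev_label, label):
    """Return True if `label` may follow `prev_label` in a BIO sequence."""
    if label.startswith("I-"):
        # I-X is only valid right after B-X or I-X of the same type
        return prev_label in ("B-" + label[2:], "I-" + label[2:])
    return True  # O and B-X may follow anything

def is_valid_bio_sequence(labels):
    prev = "O"  # a sequence implicitly starts outside any entity
    for label in labels:
        if not is_valid_bio_transition(prev, label):
            return False
        prev = label
    return True

print(is_valid_bio_sequence(["B-PER", "I-PER", "O"]))   # True
print(is_valid_bio_sequence(["B-PER", "I-ORG", "O"]))   # False: I-ORG after B-PER
print(is_valid_bio_sequence(["O", "I-LOC"]))            # False: I-LOC without B-LOC
```

A CRF learns (or can be hard-coded with) transition scores that assign minus infinity to exactly these invalid pairs, which is what guarantees well-formed output at Viterbi decoding time.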
4.2 More Recent Models: RoBERTa and DeBERTa for NER
from transformers import pipeline
# RoBERTa-large: ~1.5% more F1 than BERT-base on CoNLL-2003
# Use the same code but change MODEL_NAME
# Best English NER model on CoNLL benchmark:
model_name = "Jean-Baptiste/roberta-large-ner-english"
ner_pipeline = pipeline(
"ner",
model=model_name,
aggregation_strategy="simple"
)
text = "Elon Musk's Tesla announced a new Gigafactory in Berlin, Germany, with a 5B EUR investment."
entities = ner_pipeline(text)
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} (score={ent['score']:.3f})")
# Benchmark comparison (CoNLL-2003 test set F1):
# BERT-base-cased: ~92.0%
# RoBERTa-large: ~93.5%
# DeBERTa-v3-large: ~94.0%
# XLNet-large: ~93.0%
5. Inference and Post-processing for NER
After training, inference requires post-processing to reconstruct named entities from token spans.
from transformers import pipeline
import torch
# HuggingFace Pipeline (handles post-processing automatically)
ner_pipeline = pipeline(
"ner",
model="./results/bert-ner-conll",
tokenizer="./results/bert-ner-conll",
aggregation_strategy="simple" # groups subtokens of the same entity
)
texts = [
"Tim Cook presented Apple's new iPhone 16 in Cupertino last September.",
"The European Central Bank in Frankfurt raised rates by 25 basis points.",
"Enel Green Power signed a deal worth 2.5 billion euros with the Italian government.",
]
for text in texts:
entities = ner_pipeline(text)
print(f"\nText: {text}")
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} "
f"(score={ent['score']:.3f}, start={ent['start']}, end={ent['end']})")
# Available aggregation strategies:
# "none": returns all tokens with their label
# "simple": groups consecutive tokens with the same entity group
# "first": uses the label of the first subtoken of each word
# "average": averages the scores across a word's subtokens
# "max": takes the label of the highest-scoring subtoken
5.1 NER on Long Documents (over 512 tokens)
def ner_long_document(text, ner_pipeline, max_length=400, stride=50):
"""
NER on documents longer than 512 tokens using a sliding window.
max_length: maximum tokens per window
stride: overlap between consecutive windows (avoids boundary artifacts)
"""
words = text.split()
all_entities = []
processed_positions = set()
for start_idx in range(0, len(words), max_length - stride):
end_idx = min(start_idx + max_length, len(words))
chunk = ' '.join(words[start_idx:end_idx])
entities = ner_pipeline(chunk)
# Adjust offset for position in original text
chunk_offset = len(' '.join(words[:start_idx])) + (1 if start_idx > 0 else 0)
for ent in entities:
abs_start = ent['start'] + chunk_offset
abs_end = ent['end'] + chunk_offset
# Avoid duplicates from overlap
if abs_start not in processed_positions:
all_entities.append({
'word': ent['word'],
'entity_group': ent['entity_group'],
'score': ent['score'],
'start': abs_start,
'end': abs_end
})
processed_positions.add(abs_start)
if end_idx == len(words):
break
return sorted(all_entities, key=lambda x: x['start'])
# Alternative: use Longformer (supports up to 4096 tokens natively)
# from allenai/longformer-base-4096
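The window arithmetic in ner_long_document is worth checking in isolation. A hypothetical helper that mirrors the stride logic above enumerates the (start, end) word ranges and lets us confirm that consecutive windows overlap and that every word is covered:

```python
def window_bounds(n_words, max_length=400, stride=50):
    """(start, end) word ranges for a sliding window with `stride` words of overlap."""
    bounds = []
    for start in range(0, n_words, max_length - stride):
        end = min(start + max_length, n_words)
        bounds.append((start, end))
        if end == n_words:      # last window reached the end of the document
            break
    return bounds

print(window_bounds(1000, max_length=400, stride=50))
# [(0, 400), (350, 750), (700, 1000)]
```

Each window starts `stride` words before the previous one ended, so entities that straddle a boundary appear whole in at least one window; that is what the duplicate check in ner_long_document then resolves.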
6. NER for Italian
Italian has morphological characteristics that make NER more challenging: gender and number agreement, clitic forms, proper names with definite articles ("la Roma", "il Milan"). Here are the best available options.
import spacy
from transformers import pipeline
# spaCy NER for Italian
nlp_it = spacy.load("it_core_news_lg")
italian_texts = [
"Il primo ministro Giorgia Meloni ha incontrato il presidente francese Macron a Parigi.",
"Fiat Chrysler Automobiles ha annunciato fusione con PSA Group per 50 miliardi.",
"L'AS Roma ha battuto la Lazio per 2-1 allo Stadio Olimpico domenica sera.",
"Il Tribunale di Milano ha condannato Mediaset a pagare 300 milioni a Vivendi.",
]
print("=== Italian NER with spaCy it_core_news_lg ===")
for text in italian_texts:
doc = nlp_it(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(f"\nText: {text[:70]}")
print(f"Entities: {entities}")
# BERT NER for Italian
try:
it_ner = pipeline(
"ner",
model="osiria/bert-base-italian-uncased-ner",
aggregation_strategy="simple"
)
text = "Matteo Renzi ha fondato Italia Viva a Firenze nel 2019."
entities = it_ner(text)
print("\n=== BERT NER Italian ===")
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} ({ent['score']:.3f})")
except Exception as e:
print(f"Model not available: {e}")
# Italian NER options summary:
print("\nItalian NER Options:")
print(" 1. spaCy it_core_news_lg (fastest, F1 ~85%)")
print(" 2. osiria/bert-base-italian-uncased-ner (more accurate, F1 ~88%)")
print(" 3. Custom fine-tuning on WikiNEuRal IT (highest quality)")
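The definite articles mentioned above often end up inside the extracted span ("la Roma", "il Milan"). A hypothetical post-processing normalizer can strip them when downstream systems expect the bare name; apply it selectively, since for club names the article can itself be distinctive:

```python
# Italian definite articles that may prefix an entity mention
ITALIAN_ARTICLES = {"il", "lo", "la", "i", "gli", "le"}

def strip_leading_article(entity_text):
    """Remove a leading Italian definite article from an entity mention."""
    if entity_text.lower().startswith("l'"):        # elided article: l'Inter
        return entity_text[2:].strip()
    first, _, rest = entity_text.partition(" ")
    if first.lower() in ITALIAN_ARTICLES and rest:
        return rest
    return entity_text

print(strip_leading_article("la Roma"))     # Roma
print(strip_leading_article("L'Inter"))     # Inter
print(strip_leading_article("Juventus"))    # Juventus
```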
7. Evaluation and NER Metrics
from seqeval.metrics import (
classification_report,
f1_score,
precision_score,
recall_score
)
# seqeval evaluates at span level (entire entity)
# more appropriate than token-level accuracy
true_sequences = [
['O', 'B-PER', 'I-PER', 'O', 'B-ORG', 'O'],
['B-LOC', 'I-LOC', 'O', 'O', 'B-DATE', 'O'],
]
pred_sequences = [
['O', 'B-PER', 'I-PER', 'O', 'O', 'O'], # misses ORG
['B-LOC', 'I-LOC', 'O', 'O', 'B-DATE', 'O'], # perfect
]
print("=== NER Evaluation (span-level) ===")
print(classification_report(true_sequences, pred_sequences))
print(f"Overall F1: {f1_score(true_sequences, pred_sequences):.4f}")
print(f"Overall Precision: {precision_score(true_sequences, pred_sequences):.4f}")
print(f"Overall Recall: {recall_score(true_sequences, pred_sequences):.4f}")
# Types of NER errors:
# 1. False Negative (Missed): entity not recognized
# 2. False Positive (Spurious): entity invented where there is none
# 3. Wrong Type: entity found but wrong type (PER instead of ORG)
# 4. Wrong Boundary: entity found but span partially incorrect
# Key difference:
# Token-level accuracy: counts correct tokens / total tokens
# Span-level F1 (seqeval): an entity is correct ONLY if
# ALL its tokens have the right label
# -> much stricter and more realistic
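The gap between the two metrics is easy to demonstrate without any library. The sketch below re-implements span extraction and span-level F1 in a few lines, as a simplified stand-in for seqeval, and compares it with token accuracy on the same prediction:

```python
def extract_spans(labels):
    """(start, end, type) spans from a BIO sequence (simplified decoder)."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel flushes the last span
        if label.startswith("B-") or label == "O" or (etype and label[2:] != etype):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = (i, label[2:]) if label.startswith("B-") else (None, None)
    return spans

true_seq = ["O", "B-PER", "I-PER", "O", "B-ORG", "O"]
pred_seq = ["O", "B-PER", "I-PER", "O", "O", "O"]  # misses the ORG entity

token_acc = sum(t == p for t, p in zip(true_seq, pred_seq)) / len(true_seq)
true_spans, pred_spans = set(extract_spans(true_seq)), set(extract_spans(pred_seq))
correct = len(true_spans & pred_spans)
precision = correct / len(pred_spans) if pred_spans else 0.0
recall = correct / len(true_spans)
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"Token accuracy: {token_acc:.2f}")  # 0.83: looks fine
print(f"Span F1:        {f1:.2f}")         # 0.67: reveals the missed entity
```

One wrong token out of six barely dents accuracy, but losing the whole ORG span halves recall; this is why seqeval, not accuracy, is the metric to report.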
8. Case Study: NER on Financial News Articles
Let us build a complete NER pipeline to extract entities from financial articles: companies, key people, monetary values, and dates.
from transformers import pipeline
from collections import defaultdict
class FinancialNERExtractor:
"""
NER extractor specialized for financial news.
Extracts: companies, key people, monetary values and dates.
"""
def __init__(self, model_name="dslim/bert-large-NER"):
self.ner = pipeline(
"ner",
model=model_name,
aggregation_strategy="simple"
)
        # Note: CoNLL-trained models such as dslim/bert-large-NER emit only
        # PER, ORG, LOC, MISC; the MONEY/DATE/GPE mappings below only apply
        # with an OntoNotes-style model (or spaCy)
        self.entity_types = {
            'ORG': 'companies',
            'PER': 'people',
            'MONEY': 'values',
            'DATE': 'dates',
            'LOC': 'locations',
            'GPE': 'locations'
        }
def extract(self, text: str) -> dict:
"""Extract and organize entities by type."""
entities = self.ner(text)
result = defaultdict(list)
for ent in entities:
group = ent['entity_group']
mapped = self.entity_types.get(group)
if mapped and ent['score'] > 0.8:
result[mapped].append({
'text': ent['word'],
'score': round(ent['score'], 3),
'position': (ent['start'], ent['end'])
})
return dict(result)
def analyze_article(self, title: str, body: str) -> dict:
"""Full analysis of a financial article."""
full_text = f"{title}. {body}"
raw_entities = self.extract(full_text)
# Deduplicate (same text, different positions)
for etype, ents in raw_entities.items():
seen = set()
deduped = []
for e in ents:
if e['text'] not in seen:
seen.add(e['text'])
deduped.append(e)
raw_entities[etype] = deduped
return {
'title': title,
'entities': raw_entities,
'entity_count': sum(len(v) for v in raw_entities.values())
}
# Test
extractor = FinancialNERExtractor()
articles = [
{
"title": "Amazon acquires Whole Foods for $13.7 billion",
"body": "Jeff Bezos announced the acquisition in Seattle on June 16, 2017. Whole Foods CEO John Mackey will remain in his role."
},
{
"title": "Tesla opens new Gigafactory in Germany",
"body": "Elon Musk inaugurated the Berlin factory in March 2022. The facility in Gruenheide will employ 12,000 people and produce 500,000 vehicles per year."
},
]
for article in articles:
result = extractor.analyze_article(article['title'], article['body'])
print(f"Title: {result['title']}")
print(f"Total entities: {result['entity_count']}")
for etype, ents in result['entities'].items():
if ents:
texts = [e['text'] for e in ents]
print(f" {etype:12s}: {', '.join(texts)}")
print()
9. Optimized NER Pipeline for Production
A production NER system must balance precision, speed, and computational cost. Below is an optimized pipeline combining a lexical pre-filter, batch inference, and result caching for high-volume scenarios.
from transformers import pipeline
import hashlib
import time
from typing import List, Dict
class OptimizedNERPipeline:
"""
Production-optimized NER pipeline:
- LRU-style result caching
- Adaptive batch processing
- Confidence filtering
- Latency and accuracy monitoring
"""
def __init__(
self,
model_name: str = "dslim/bert-large-NER",
batch_size: int = 8,
min_confidence: float = 0.75,
cache_size: int = 1024
):
self.ner = pipeline(
"ner",
model=model_name,
aggregation_strategy="simple",
batch_size=batch_size,
device=0 # -1 for CPU, 0 for first GPU
)
self.min_confidence = min_confidence
self._cache: Dict[str, list] = {}
self._cache_size = cache_size
self._stats = {"hits": 0, "misses": 0, "total_time_ms": 0.0}
def _text_hash(self, text: str) -> str:
return hashlib.md5(text.encode()).hexdigest()
def extract(self, texts: List[str]) -> List[List[Dict]]:
"""NER extraction with caching and batch processing."""
results = [None] * len(texts)
uncached_indices = []
uncached_texts = []
# Check cache
for i, text in enumerate(texts):
key = self._text_hash(text)
if key in self._cache:
results[i] = self._cache[key]
self._stats["hits"] += 1
else:
uncached_indices.append(i)
uncached_texts.append(text)
self._stats["misses"] += 1
# Process texts not in cache
if uncached_texts:
start = time.perf_counter()
raw_results = self.ner(uncached_texts)
elapsed_ms = (time.perf_counter() - start) * 1000
self._stats["total_time_ms"] += elapsed_ms
            # The pipeline returns a list of entity lists for list input;
            # wrap defensively in case a flat list of dicts comes back
            if raw_results and isinstance(raw_results[0], dict):
                raw_results = [raw_results]
for idx, raw in zip(uncached_indices, raw_results):
# Filter by confidence and clean
filtered = [
{
'word': e['word'].replace(' ##', '').strip(),
'entity_group': e['entity_group'],
'score': round(e['score'], 4),
'start': e['start'],
'end': e['end']
}
for e in raw
if e['score'] >= self.min_confidence
]
key = self._text_hash(texts[idx])
# Simple FIFO cache eviction
if len(self._cache) >= self._cache_size:
oldest_key = next(iter(self._cache))
del self._cache[oldest_key]
self._cache[key] = filtered
results[idx] = filtered
return results
def get_stats(self) -> Dict:
"""Return pipeline performance statistics."""
total = self._stats["hits"] + self._stats["misses"]
return {
"cache_hit_rate": self._stats["hits"] / total if total > 0 else 0.0,
"avg_latency_ms": self._stats["total_time_ms"] / max(self._stats["misses"], 1),
"cache_size": len(self._cache),
**self._stats
}
# Usage
ner_pipe = OptimizedNERPipeline(min_confidence=0.80)
batch_texts = [
"Mario Draghi led the ECB from 2011 to 2019.",
"Amazon acquired MGM Studios for $8.45 billion.",
"MIT researchers published a study on GPT-4 capabilities.",
"Sergio Mattarella is the President of the Italian Republic.",
]
# First call: full processing
results1 = ner_pipe.extract(batch_texts)
# Second call: all from cache!
results2 = ner_pipe.extract(batch_texts)
print("NER Pipeline Statistics:")
for k, v in ner_pipe.get_stats().items():
print(f" {k}: {v}")
print("\nExtraction results:")
for text, entities in zip(batch_texts, results1):
print(f"\n Text: {text[:60]}")
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} ({ent['score']:.3f})")
9.1 NER Model Comparison: Practical Benchmark
NER Benchmark: Speed vs Accuracy (2024-2025)
| Model | CoNLL F1 | Speed (CPU) | Params | Language | Use Case |
|---|---|---|---|---|---|
| spaCy en_core_web_sm | ~84% | Very fast (<5ms) | 12M | EN | Rapid prototyping |
| spaCy en_core_web_trf | ~89% | Slow on CPU (100ms+) | 125M | EN | High accuracy, spaCy API |
| dslim/bert-base-NER | ~91% | Medium (50-100ms) | 110M | EN | GPU production |
| dslim/bert-large-NER | ~92% | Slow (100-200ms) | 340M | EN | High accuracy |
| Jean-Baptiste/roberta-large-ner-english | ~93.5% | Slow (150-250ms) | 355M | EN | State of the art EN |
| osiria/bert-base-italian-uncased-ner | ~88% | Medium (50-100ms) | 110M | IT | Best Italian model |
9.2 Text Anonymization with NER
A critical use case in legal, medical, and GDPR contexts is automatic anonymization of personal data. NER can automatically identify PER, ORG, LOC, and DATE entities for pseudonymization or redaction of sensitive documents.
from transformers import pipeline
class TextAnonymizer:
"""
NER-based text anonymizer.
Replaces sensitive entities with typed placeholders.
Useful for GDPR-compliant data processing and training dataset creation.
"""
    # Note: the DATE/MONEY/GPE placeholders only fire with OntoNotes-style
    # models; CoNLL-trained models emit just PER, ORG, LOC, MISC
    REPLACEMENT_MAP = {
'PER': '<PERSON>',
'ORG': '<ORGANIZATION>',
'LOC': '<LOCATION>',
'GPE': '<LOCATION>',
'DATE': '<DATE>',
'MONEY': '<AMOUNT>',
'MISC': '<OTHER>',
}
def __init__(self, model_name="dslim/bert-large-NER"):
self.ner = pipeline(
"ner",
model=model_name,
aggregation_strategy="simple"
)
def anonymize(self, text: str, entity_types: list = None) -> dict:
"""
Anonymize text by replacing entities.
entity_types: list of types to anonymize (None = all)
"""
entities = self.ner(text)
# Filter by type if specified
if entity_types:
entities = [e for e in entities if e['entity_group'] in entity_types]
# Sort by position descending to replace from the end
entities_sorted = sorted(entities, key=lambda e: e['start'], reverse=True)
anonymized = text
replacements = []
for ent in entities_sorted:
placeholder = self.REPLACEMENT_MAP.get(ent['entity_group'], '<ENTITY>')
original = text[ent['start']:ent['end']]
anonymized = anonymized[:ent['start']] + placeholder + anonymized[ent['end']:]
replacements.append({
'original': original,
'replacement': placeholder,
'type': ent['entity_group'],
'confidence': round(ent['score'], 3)
})
return {
'original': text,
'anonymized': anonymized,
'replacements': replacements,
'num_entities': len(replacements)
}
# Test
anonymizer = TextAnonymizer()
sensitive_texts = [
"Patient John Smith, born March 15, 1978, was admitted to Massachusetts General Hospital on January 3, 2024 with a diagnosis of pneumonia.",
"Accenture plc, headquartered at 161 North Clark Street in Chicago, reported revenues of $64.9 billion in fiscal year 2023.",
"Attorney Sarah Johnson from Skadden Arps represented Apple Inc. in the appeal to the Federal Circuit Court.",
]
print("=== Text Anonymization with NER ===\n")
for text in sensitive_texts:
result = anonymizer.anonymize(text, entity_types=['PER', 'ORG', 'LOC', 'GPE', 'DATE', 'MONEY'])
print(f"Original: {result['original'][:100]}")
print(f"Anonymized: {result['anonymized'][:100]}")
print(f"Replaced: {result['num_entities']} entities")
print()
10. NER Production Best Practices
Anti-Pattern: Ignoring Post-processing
NER models output raw BIO token-level predictions. In production, you must always reconstruct spans, handle WordPiece subtokens, and filter low-confidence entities. Never expose raw token predictions to end users.
Anti-Pattern: Evaluating with Token Accuracy Only
Token accuracy on CoNLL-2003 is typically 98-99% even for mediocre models, because most tokens carry the label O. Always evaluate with seqeval: span-level F1 is the standard metric for NER.
Production NER Checklist
- Evaluate with seqeval (span F1), not just token accuracy
- Set confidence thresholds (typically 0.7-0.85) to filter false positives
- Handle overlapping entities (rare but possible)
- Normalize extracted entities (deduplication, canonicalization)
- Monitor entity distribution over time to detect domain shift
- Use visualization (displacy) for debugging predictions
- Test across different text domains: news, contracts, social media behave very differently
- For Italian: use it_core_news_lg (fast) or BERT fine-tuned on WikiNEuRal IT (accurate)
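The normalization item in the checklist can be as simple as case- and whitespace-insensitive deduplication that keeps a canonical surface form. A minimal sketch with a hypothetical helper (real systems usually add alias dictionaries and entity linking on top):

```python
def canonicalize_entities(entities):
    """Deduplicate entity mentions by normalized key, keeping the best-scored form."""
    canonical = {}
    for ent in entities:
        # normalize: collapse whitespace, lowercase; key also includes the type
        key = (" ".join(ent["text"].split()).lower(), ent["type"])
        if key not in canonical or ent["score"] > canonical[key]["score"]:
            canonical[key] = ent
    return sorted(canonical.values(), key=lambda e: -e["score"])

mentions = [
    {"text": "Banca d'Italia", "type": "ORG", "score": 0.97},
    {"text": "banca  d'italia", "type": "ORG", "score": 0.91},  # same entity, noisier form
    {"text": "Mario Draghi", "type": "PER", "score": 0.99},
]
for ent in canonicalize_entities(mentions):
    print(ent["text"], ent["type"], ent["score"])
# Mario Draghi PER 0.99
# Banca d'Italia ORG 0.97
```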
Conclusions and Next Steps
NER is one of the most useful NLP tasks in real-world applications: information extraction, knowledge graph construction, feeding RAG systems, data anonymization. With spaCy for simple cases and fine-tuned BERT for high precision, you have all the tools to build robust NER pipelines for both Italian and English.
The key to excellent performance in a specific domain is always fine-tuning on annotated data from your context: even a few hundred domain-specific examples can significantly improve performance over the generic model.
Continue the Series
- Next: Multi-label Text Classification — classifying texts with multiple simultaneous labels
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API, Model Hub, optimization
- Article 8: LoRA Fine-tuning — train LLMs locally on consumer GPU
- Article 9: Semantic Similarity
- Related series: AI Engineering/RAG — NER as an extraction step in RAG pipelines