BERT Explained: Architecture, Pretraining and Fine-tuning
2018 marked a turning point in the history of Natural Language Processing. With the publication of BERT (Bidirectional Encoder Representations from Transformers), the Google AI team introduced a model that redefined the state of the art on 11 NLP benchmarks simultaneously. For the first time, a single pre-trained model could be adapted to very different tasks (classification, question answering, NER) and outperform the best previous task-specific systems on each of them.
What makes BERT so revolutionary? The answer lies in three fundamental innovations: deep bidirectionality, large-scale pre-training, and the simplicity of fine-tuning for downstream tasks. In this article we analyze every aspect of the BERT architecture, from the internal mechanics of attention to practical implementation with HuggingFace, including the many variants that followed.
This is the second article in the Modern NLP: from BERT to LLMs series. If you have not yet read the first article on fundamentals (tokenization, embeddings and the NLP pipeline), I recommend doing so first — many concepts here build directly on those foundations.
What You Will Learn
- Why BERT represented a revolution in NLP and the limits of previous models
- The encoder-only Transformer architecture underlying BERT
- How multi-head self-attention works with the mathematical formulas
- Input representation: token, segment and position embeddings
- The two pre-training strategies: Masked Language Model (MLM) and Next Sentence Prediction (NSP)
- WordPiece tokenization and the special tokens [CLS], [SEP], [MASK]
- How to fine-tune for classification, NER, and question answering
- How to extract contextual embeddings from any BERT layer
- Complete practical implementation with HuggingFace Transformers
- BERT variants: RoBERTa, ALBERT, DistilBERT, DeBERTa, ELECTRA
- BERT for the Italian language: available models and comparison
- BERT's limitations and how later models overcame them
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | NLP Fundamentals | Tokenization, Embeddings, Pipeline |
| 2 | You are here — BERT and Transformers | Attention, Pre-training |
| 3 | Sentiment Analysis | BERT classifiers in production |
| 4 | Italian NLP | feel-it, AlBERTo, Italian-specific models |
| 5 | Named Entity Recognition | Extracting entities from text |
| 6 | Text Classification | Multi-label and zero-shot |
| 7 | HuggingFace Transformers | Trainer API, Datasets, Hub |
| 8 | LoRA Fine-tuning | Local training on consumer GPU |
| 9 | Semantic Similarity | Sentence embeddings and FAISS |
| 10 | NLP Monitoring | Drift detection and retraining |
1. Why BERT Revolutionized NLP
To appreciate BERT's impact, it helps to look at the NLP landscape before 2018 and understand which fundamental problems previous models could not solve.
1.1 The Limits of Word2Vec and GloVe
As we covered in the first article, Word2Vec (2013) and GloVe (2014) were a genuine breakthrough: words mapped to dense vectors in a continuous space where semantic relationships like "king - man + woman = queen" emerged naturally from the geometry of that space.
The fundamental limitation: these models produce static representations. Every word has exactly one vector, regardless of the context it appears in. Consider the word "bank":
The Static Representation Problem
- "I deposited the money at the bank." — financial institution
- "We sat on the river bank and watched the sunset." — land beside water
- "A bank of fog rolled in from the sea." — dense mass of fog
- "A bank of computers filled the entire room." — an array or row
Word2Vec assigns a single averaged vector to "bank" — a significant loss of semantic precision across all four senses.
1.2 ELMo: The First Step Toward Contextual Representations
In early 2018, ELMo (Embeddings from Language Models) from AllenAI addressed this with a bidirectional LSTM. ELMo generated different representations for the same word depending on context, achieving meaningful improvements on many benchmarks.
But ELMo had two important limitations. First, its bidirectionality was shallow: two separate LSTM passes (one forward, one backward) were run independently and their outputs concatenated, so the two directions never interacted during processing. Second, LSTMs suffer from an information bottleneck: long-range dependencies get progressively "forgotten" as sequences grow longer.
1.3 GPT: Unidirectional Pretraining
Also in 2018, OpenAI published GPT, which used the Transformer architecture for language modeling. GPT demonstrated that large-corpus pretraining followed by fine-tuning works extremely well for NLP tasks.
However, GPT processes text only left-to-right. This is natural for text generation but suboptimal for understanding: to classify a sentence, answer a question, or extract entities, you need the full bidirectional context — both what came before and what comes after.
1.4 BERT: Deep Bidirectionality
BERT resolves both problems through the self-attention mechanism of the Transformer. Every token in a sequence can "look at" every other token simultaneously — both left and right. This bidirectionality is not a concatenation of two separate models (as in ELMo), but a deep interaction at every single layer of the architecture.
Bidirectionality Approaches Compared
| Model | Type | Context | Key Limitation |
|---|---|---|---|
| Word2Vec/GloVe | Static | No context | One vector per word regardless of meaning |
| GPT-1 | Unidirectional (L→R) | Left context only | Cannot see future tokens |
| ELMo | Shallow bidir. | Concatenated contexts | No interaction between directions |
| BERT | Deep bidir. | Full context at every layer | Encoder-only (no generation) |
This deep bidirectionality is why BERT established new records on 11 NLP benchmarks simultaneously at publication, including GLUE, SQuAD 1.1, SQuAD 2.0, and MultiNLI.
2. BERT Architecture: The Transformer Encoder
BERT is built on the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017). The original Transformer has two components: an encoder (processes input) and a decoder (generates output). BERT uses only the encoder, which makes it ideal for understanding tasks rather than generation.
2.1 Architecture Overview
BERT's architecture can be visualized as a stack of Transformer encoder blocks arranged vertically. Each block contains a multi-head self-attention layer followed by a position-wise feed-forward network, with residual connections and layer normalization at each sub-layer.
Input: [CLS] The cat sits on the mat [SEP]
| | | | | | |
+-----+-----+----+----+---+---+----+-----+
| Token Embeddings |
| + Segment Embeddings |
| + Position Embeddings |
+-----+-----+----+----+---+---+----+-----+
| | | | | | |
+-----------------------------------------------+
| Transformer Encoder Block 1 |
| +-------------------------------------------+|
| | Multi-Head Self-Attention ||
| | Q = XWq K = XWk V = XWv ||
| | Attention(Q,K,V) = softmax(QK^T/sqrt(dk))V||
| +-------------------------------------------+|
| | Add & Layer Norm ||
| +-------------------------------------------+|
| | Feed-Forward Network (GELU) ||
| | FFN(x) = GELU(xW1 + b1)W2 + b2 ||
| +-------------------------------------------+|
| | Add & Layer Norm ||
+-----------------------------------------------+
| | | | | | |
... ... ... ... ... ... ...
| | | | | | |
+-----------------------------------------------+
| Transformer Encoder Block L (12 or 24) |
+-----------------------------------------------+
| | | | | | |
[CLS]out T1 T2 T3 T4 T5 [SEP]out
|
Pooling --> Classification / Task-specific output
2.2 BERT-base vs BERT-large
The original paper proposes two configurations, described with notation L/H/A (Layers / Hidden size / Attention heads):
BERT Model Configurations
| Parameter | BERT-base | BERT-large |
|---|---|---|
| Transformer Layers (L) | 12 | 24 |
| Hidden Size (H) | 768 | 1024 |
| Attention Heads (A) | 12 | 16 |
| Total Parameters | 110M | 340M |
| Dim per Head (d_k) | 768/12 = 64 | 1024/16 = 64 |
| Feed-Forward Dim | 3072 (4 × 768) | 4096 (4 × 1024) |
| Max Sequence Length | 512 | 512 |
| Vocabulary Size | 30,522 | 30,522 |
| Pre-training Time | 4 days (16 TPU chips) | 4 days (64 TPU chips) |
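The parameter totals in the table can be reproduced from the architecture sizes alone. Here is a minimal sketch of the arithmetic for BERT-base; the constants come from the table above, and the breakdown into embeddings, attention, FFN, LayerNorms, and pooler follows the standard HuggingFace layout:

```python
V, P, S = 30522, 512, 2      # vocab size, max positions, segment types
H, L, FF = 768, 12, 3072     # hidden size, layers, feed-forward size

emb = H * (V + P + S) + 2 * H            # three embedding tables + embedding LayerNorm
attn = 4 * (H * H + H)                    # W_q, W_k, W_v, W_o, each with bias
ffn = (H * FF + FF) + (FF * H + H)        # two linear layers with biases
norms = 2 * 2 * H                         # two LayerNorms per block (gamma + beta)
block = attn + ffn + norms
pooler = H * H + H                        # linear + tanh pooler on [CLS]

total = emb + L * block + pooler
print(f"{total:,}")                       # 109,482,240 — the "110M" in the table
```

The same arithmetic with H=1024, L=24, FF=4096 and 16 heads lands near the 340M quoted for BERT-large.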
2.3 Input Representation: Three Embeddings
One of BERT's design choices is its input representation, formed by summing three distinct embedding types. The final input vector for each token is:
Input: [CLS] The cat sleeps [SEP] The dog runs [SEP]
| | | | | | | | |
Token Emb: E[CLS] E_the E_cat E_sl E[SEP] E_the E_dog E_r E[SEP]
+ + + + + + + + +
Segment Emb: EA EA EA EA EA EB EB EB EB
+ + + + + + + + +
Position Emb: E0 E1 E2 E3 E4 E5 E6 E7 E8
= = = = = = = = =
Final Input: I0 I1 I2 I3 I4 I5 I6 I7 I8
- Token Embeddings: Dense vectors for each vocabulary token (30,522 × 768). Learned during pre-training; at roughly 23.4M parameters, this matrix alone accounts for about a fifth of BERT-base.
- Segment Embeddings: Indicate which sentence each token belongs to (A for the first sentence, B for the second). Essential for NSP tasks and any task requiring reasoning over sentence pairs such as Natural Language Inference or Paraphrase Detection.
- Position Embeddings: Encode the absolute position of each token in the sequence (0–511). Unlike the original Transformer (which used sinusoidal functions), BERT uses learned position embeddings — each position has its own trainable vector, allowing the model to develop a richer position-dependent representation.
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Inspect the three embedding layers directly
token_emb = model.embeddings.word_embeddings
position_emb = model.embeddings.position_embeddings
segment_emb = model.embeddings.token_type_embeddings
print(f"Token embedding matrix: {token_emb.weight.shape}")
# torch.Size([30522, 768])
print(f"Position embedding matrix: {position_emb.weight.shape}")
# torch.Size([512, 768])
print(f"Segment embedding matrix: {segment_emb.weight.shape}")
# torch.Size([2, 768])
# Encode a sentence pair
text_a = "The cat sleeps"
text_b = "The dog runs"
encoded = tokenizer(text_a, text_b, return_tensors='pt')
print("Token IDs:", encoded['input_ids'])
print("Token Type IDs (segments):", encoded['token_type_ids'])
print("Attention Mask:", encoded['attention_mask'])
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print("Tokens:", tokens)
# ['[CLS]', 'the', 'cat', 'sleeps', '[SEP]', 'the', 'dog', 'runs', '[SEP]']
2.4 Multi-Head Self-Attention: The Core Mechanism
The central innovation of the Transformer — and therefore of BERT — is self-attention: a mechanism that allows each token to "look at" every other token in the sequence and decide how much to weight each when computing its own representation.
Scaled Dot-Product Attention
For each token, the attention mechanism computes three vectors through learned linear projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I offer to other tokens?"
- Value (V): "What information do I carry?"
The attention formula is:
Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) * V
Where:
Q: (seq_len, d_k) — query matrix
K: (seq_len, d_k) — key matrix
V: (seq_len, d_v) — value matrix
d_k: dimension of keys (64 for BERT-base, i.e. 768/12 heads)
The 1/sqrt(d_k) scaling factor prevents dot products from growing
so large that softmax produces near-zero gradients (saturation).
For d_k=64: sqrt(64) = 8.0 — dividing by 8 keeps gradients healthy.
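The effect of the 1/sqrt(d_k) factor is easy to see numerically. A small NumPy sketch with toy random vectors (not real BERT activations): unscaled dot products of 64-dim vectors have standard deviation near sqrt(64) = 8, so softmax collapses toward a one-hot distribution; dividing by 8 keeps the weights spread out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)           # one query vector
keys = rng.standard_normal((10, d_k))  # ten key vectors

raw = keys @ q                # dot products: std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # BERT's scaling brings the variance back to ~1

p_raw, p_scaled = softmax(raw), softmax(scaled)
print(f"largest attention weight, unscaled: {p_raw.max():.3f}")  # typically near 1
print(f"largest attention weight, scaled:   {p_scaled.max():.3f}")  # much flatter
```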
Multi-Head Attention
Instead of computing one attention function, BERT runs h attention functions in parallel. Each head can capture different relationship types simultaneously: one head might specialize in syntactic dependencies, another in coreference, another in semantic similarity.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
BERT-base: h = 12 heads, d_model = 768, d_k = d_v = 64
BERT-large: h = 16 heads, d_model = 1024, d_k = d_v = 64
Total attention parameters per layer (BERT-base):
W_Q: 768 x 768 = 589,824
W_K: 768 x 768 = 589,824
W_V: 768 x 768 = 589,824
W_O: 768 x 768 = 589,824
Total: ~2.36M per layer x 12 layers = ~28.3M
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention as in BERT."""
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 for BERT-base
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()
        # Linear projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape for multi-head: (batch, heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention per head
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, seq_len, self.d_model)
        return self.W_o(attn_output), attn_weights
# Quick verification
mha = MultiHeadAttention(d_model=768, num_heads=12)
x = torch.randn(2, 64, 768) # batch=2, seq_len=64, dim=768
output, weights = mha(x)
print(f"Output shape: {output.shape}") # torch.Size([2, 64, 768])
print(f"Weights shape: {weights.shape}") # torch.Size([2, 12, 64, 64])
2.5 Feed-Forward Network and Activations
After each attention layer, BERT applies a position-wise feed-forward network — the same network applied independently to each position in the sequence:
FFN(x) = GELU(x * W1 + b1) * W2 + b2
Dimensions:
W1: d_model × d_ff = 768 × 3072
W2: d_ff × d_model = 3072 × 768
BERT uses GELU activation (not ReLU):
GELU(x) = x * Phi(x) where Phi is the standard normal CDF
GELU provides a smoother transition and better gradients than ReLU.
FFN parameters per layer (BERT-base):
W1 + b1: 768 * 3072 = 2,359,296 + 3,072 biases
W2 + b2: 3072 * 768 = 2,359,296 + 768 biases
Total: ~4.72M per layer × 12 layers = ~56.7M
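The GELU defined above can be computed exactly with the error function from the standard library. A quick sketch:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"GELU({x:+.1f}) = {gelu(x):+.4f}")
# GELU(-3.0) = -0.0040   (almost, but not exactly, zero — unlike ReLU)
# GELU(-1.0) = -0.1587
# GELU(+0.0) = +0.0000
# GELU(+1.0) = +0.8413
# GELU(+3.0) = +2.9960   (approaches the identity for large x)
```

Note how small negative inputs are not clipped to zero as in ReLU: they pass through attenuated, which gives the smoother gradient behavior mentioned above.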
2.6 Residual Connections and Layer Normalization
Each sub-layer (attention and FFN) in BERT is wrapped with a residual connection followed by layer normalization:
output = LayerNorm(x + Sublayer(x))
Residual connections allow gradients to flow directly through the network
during backpropagation, preventing vanishing gradients in deep architectures.
LayerNorm normalizes activations across the feature dimension, stabilizing
training independently of batch size (unlike BatchNorm).
import torch
import torch.nn as nn
class TransformerEncoderBlock(nn.Module):
    """A single encoder block as in BERT."""
    def __init__(self, d_model=768, num_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        # Multi-Head Attention
        self.attention = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True
        )
        # Feed-Forward Network with GELU activation
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Multi-Head Attention + Residual + LayerNorm
        attn_output, _ = self.attention(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sub-layer 2: FFN + Residual + LayerNorm
        x = self.norm2(x + self.ffn(x))
        return x

class BERTEncoder(nn.Module):
    """Stack of encoder blocks as in BERT-base."""
    def __init__(self, num_layers=12, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
# Verify BERT-base dimensions
encoder = BERTEncoder(num_layers=12, d_model=768, num_heads=12, d_ff=3072)
x = torch.randn(2, 128, 768) # batch=2, seq_len=128
output = encoder(x)
print(f"Output: {output.shape}") # torch.Size([2, 128, 768])
total_params = sum(p.numel() for p in encoder.parameters())
print(f"Encoder parameters: {total_params:,}")
# ~85M (encoder only, excluding embeddings)
3. WordPiece Tokenization
BERT uses WordPiece tokenization — a subword algorithm that balances vocabulary efficiency with linguistic coverage. WordPiece was originally developed at Google for speech recognition and was later used in Google's neural machine translation system before being adopted for BERT.
3.1 How WordPiece Works
The algorithm starts with a vocabulary of individual characters and iteratively merges pairs of tokens that maximize the likelihood of the training corpus. The process continues until the vocabulary reaches a target size (30,522 for BERT-base).
The resulting vocabulary contains:
- Common whole words ("the", "of", "and", "is")
- Common prefixes and stems ("un", "re", "pre", "inter")
- Suffixes and endings marked with ## ("##ing", "##tion", "##ed", "##ness")
- Individual characters to handle any out-of-vocabulary word
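At inference time a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation: repeatedly take the longest prefix that is in the vocabulary, marking non-initial pieces with ##. A minimal sketch with a hypothetical toy vocabulary (the real one has 30,522 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as used at inference time."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # mark non-initial pieces
            if sub in vocab:
                piece = sub
                break
            end -= 1                      # shrink the candidate and retry
        if piece is None:
            return ["[UNK]"]              # no segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}   # toy vocabulary
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

This is a simplified sketch of the matching step only; learning the vocabulary itself is the likelihood-driven merge process described above.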
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# WordPiece handles rare and complex words gracefully
examples = [
"The cat sat on the mat",
"Electroencephalography is fascinating",
"Unstructured data preprocessing pipeline",
"Transformers revolutionized NLP in 2018"
]
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text}")
    print(f" Tokens {len(tokens)}: {tokens}")
    print()
# Text: Electroencephalography is fascinating
# Tokens 7: ['electro', '##ence', '##pha', '##log', '##raphy', 'is', 'fascinating']
# → The rare word is split into 5 subword units, common words stay whole
# Vocabulary size
print(f"Vocabulary size: {tokenizer.vocab_size}") # 30522
# Encoding vs. tokenization
encoded = tokenizer("Hello NLP!", return_tensors="pt")
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"Encoded -> Decoded: {decoded}")
# [CLS] hello nlp! [SEP]
3.2 Special Tokens
Beyond the WordPiece vocabulary tokens, BERT uses several special tokens that carry structural meaning:
BERT Special Tokens
| Token | ID | Purpose |
|---|---|---|
| [PAD] | 0 | Padding to make sequences the same length within a batch |
| [UNK] | 100 | Unknown token: used when a word cannot be composed from any vocabulary subwords |
| [CLS] | 101 | Start of sequence; its final-layer embedding serves as the sentence representation |
| [SEP] | 102 | Separator between the two input sentences (also marks end of sequence) |
| [MASK] | 103 | Masked token placeholder used during Masked Language Model pre-training |
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Sentence pair encoding for NSP and NLI tasks
pair = tokenizer(
"What is the capital of France?",
"Paris is the capital of France.",
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
tokens = tokenizer.convert_ids_to_tokens(pair['input_ids'][0])
print("Tokens:", tokens)
# ['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'france', '?',
# '[SEP]', 'paris', 'is', 'the', 'capital', 'of', 'france', '.', '[SEP]']
# Token type IDs: 0 = first sentence, 1 = second sentence
print("Segments:", pair['token_type_ids'][0].tolist())
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
4. Pre-training: How BERT Learns Language
BERT is pre-trained on enormous text corpora using two complementary objectives. The fundamental idea: the model learns rich language representations from these self-supervised tasks, and this learned knowledge can then be transferred to specific downstream tasks via fine-tuning.
Pre-training Data
- BooksCorpus: ~800M words (11,000 unpublished books) — long, coherent prose
- English Wikipedia: ~2,500M words (text only, no tables or lists)
- Total: ~3.3 billion words of high-quality English text
- Training hardware: 16 TPU chips (BERT-base) / 64 TPU chips (BERT-large)
- Training time: 4 days for BERT-base, 4 days for BERT-large
4.1 Masked Language Model (MLM)
The Masked Language Model is BERT's primary pre-training innovation. Traditional language modeling (predicting the next word left-to-right) is intrinsically unidirectional. MLM introduces an elegant trick: randomly mask a portion of input tokens and ask the model to predict them, forcing it to use both left and right context simultaneously.
The 80/10/10 Masking Strategy
For each training sequence, 15% of tokens are selected for prediction. Of these selected tokens:
- 80%: Replaced with the special [MASK] token
- 10%: Replaced with a random vocabulary token
- 10%: Kept unchanged (original token)
This mixed strategy solves a subtle problem: during fine-tuning and inference, [MASK] never appears. If the model only ever saw [MASK] during pre-training, there would be a pre-training/fine-tuning mismatch. By replacing 10% with random tokens and leaving 10% unchanged, the model is forced to maintain useful representations for all tokens — not just the masked ones.
import random
import torch
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def apply_mlm_masking(token_ids, tokenizer, mask_prob=0.15):
    """Apply BERT's 80/10/10 masking strategy."""
    masked_tokens = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignore in loss
    for i in range(len(token_ids)):
        # Skip special tokens: [CLS], [SEP], [PAD]
        if token_ids[i] in [tokenizer.cls_token_id,
                            tokenizer.sep_token_id,
                            tokenizer.pad_token_id]:
            continue
        if random.random() < mask_prob:
            labels[i] = token_ids[i]  # Record original as label
            rand = random.random()
            if rand < 0.8:
                # 80%: replace with [MASK]
                masked_tokens[i] = tokenizer.mask_token_id
            elif rand < 0.9:
                # 10%: replace with random vocabulary token
                masked_tokens[i] = random.randint(0, tokenizer.vocab_size - 1)
            # else: 10% → keep original token unchanged
    return masked_tokens, labels
# Demonstration
text = "The quick brown fox jumps over the lazy dog"
token_ids = tokenizer.encode(text)
print("Original:", tokenizer.decode(token_ids))
masked, labels = apply_mlm_masking(token_ids, tokenizer)
print("Masked: ", tokenizer.decode(masked))
print("Labels (tokens to predict):",
[(i, tokenizer.decode([labels[i]])) for i in range(len(labels)) if labels[i] != -100])
from transformers import BertForMaskedLM, BertTokenizer
import torch
# Use BERT to fill in masked tokens
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

mask_pos = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0]
logits = outputs.logits[0, mask_pos, :]
top5 = torch.topk(logits, 5, dim=1)
print("Top 5 predictions for [MASK]:")
for token_id, score in zip(top5.indices[0], top5.values[0]):
    print(f" {tokenizer.decode([token_id])}: {score:.3f}")
# capital: 12.847
# city: 9.123
# heart: 7.891
# centre: 7.234
# port: 6.512
4.2 Next Sentence Prediction (NSP)
The second objective teaches BERT to understand sentence-level relationships. Given two sentences A and B, the model must predict whether B actually follows A in the original corpus (IsNext) or is a random sentence from elsewhere (NotNext). The dataset is balanced 50/50.
Positive example (IsNext):
Input: [CLS] The dog ran in the park. [SEP] It jumped over the puddle. [SEP]
Label: IsNext
Negative example (NotNext):
Input: [CLS] The dog ran in the park. [SEP] Quantum computers use qubits. [SEP]
Label: NotNext
The [CLS] token's final representation is passed through a binary classifier
(linear + softmax) to produce the IsNext/NotNext prediction.
from transformers import BertForNextSentencePrediction, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
def predict_nsp(sent_a, sent_b):
    inputs = tokenizer(sent_a, sent_b, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    return probs[0, 0].item(), probs[0, 1].item()  # IsNext, NotNext
# Coherent continuation
a = "The dog ran in the park."
b_coherent = "It jumped over the puddle and splashed through the mud."
is_next, not_next = predict_nsp(a, b_coherent)
print(f"Coherent: IsNext={is_next:.3f}, NotNext={not_next:.3f}")
# Coherent: IsNext=0.892, NotNext=0.108
# Random, unrelated sentence
b_random = "Quantum computers revolutionize modern cryptography."
is_next, not_next = predict_nsp(a, b_random)
print(f"Random: IsNext={is_next:.3f}, NotNext={not_next:.3f}")
# Random: IsNext=0.041, NotNext=0.959
NSP: A Controversial Design Choice
NSP was challenged by subsequent research. RoBERTa (2019) showed that removing NSP and training only on MLM (with more data and longer training) actually improves performance on most benchmarks. The hypothesis: NSP is too simple a task, and its training procedure — using sentence pairs that may span multiple documents — inadvertently encourages the model to use coarse document-level features rather than fine-grained token-level ones. Later models like ALBERT replaced NSP with Sentence Order Prediction (SOP), a harder variant.
4.3 Combined Pre-training Loss
BERT's total pre-training loss is the sum of both objectives:
L_total = L_MLM + L_NSP
L_MLM = CrossEntropy over the 15% masked tokens only
(positions where labels != -100)
L_NSP = Binary CrossEntropy for IsNext (0) vs NotNext (1)
Both losses are computed simultaneously on each batch.
The model must learn to handle both tasks with the same
set of Transformer parameters — this shared training
produces a rich, multi-task representation of language.
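The two terms can be sketched with a single cross-entropy helper that skips unmasked positions, mirroring the -100 label convention used in the masking code above. Random logits stand in for model outputs here; only the loss bookkeeping is the point:

```python
import numpy as np

def cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label != ignore_index."""
    keep = labels != ignore_index
    logits, labels = logits[keep], labels[keep]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
mlm_logits = rng.standard_normal((8, 30522))  # 8 token positions, vocab-sized logits
mlm_labels = np.array([-100, 42, -100, -100, 7, -100, -100, 99])  # 3 masked positions
nsp_logits = rng.standard_normal((1, 2))      # one prediction from [CLS]
nsp_labels = np.array([0])                    # 0 = IsNext

loss = cross_entropy(mlm_logits, mlm_labels) + cross_entropy(nsp_logits, nsp_labels)
print(f"L_total = L_MLM + L_NSP = {loss:.3f}")
```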
5. Fine-tuning BERT for Downstream Tasks
The "pre-train, then fine-tune" paradigm introduced by BERT is remarkably simple: add a small task-specific head on top of the pre-trained model, then train the entire system together on labeled data. This requires far fewer labeled examples and far less compute than training from scratch.
Typically, 2–4 epochs with a low learning rate (2e-5 to 5e-5) are sufficient. The pre-trained parameters already encode rich language knowledge — fine-tuning just specializes this knowledge for the target task.
5.1 Text Classification (Sentiment Analysis)
For sequence classification, the [CLS] token embedding from the final layer serves as the sentence representation. A linear classification head maps it to class logits.
from transformers import (
BertForSequenceClassification, BertTokenizer,
TrainingArguments, Trainer
)
from datasets import load_dataset
import numpy as np, evaluate
# Load pre-trained model with classification head
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=2,
id2label={0: "NEGATIVE", 1: "POSITIVE"},
label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Prepare IMDB dataset
dataset = load_dataset('imdb')
def tokenize(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': f1.compute(predictions=preds, references=labels)['f1']
    }
training_args = TrainingArguments(
output_dir='./bert-sentiment',
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_ratio=0.1, # warmup for first 10% of steps
weight_decay=0.01, # L2 regularization
learning_rate=2e-5,
evaluation_strategy='epoch',
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='f1',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized['train'],
eval_dataset=tokenized['test'],
compute_metrics=compute_metrics,
)
trainer.train()
# Expected: Accuracy ~93%, F1 ~0.93 on IMDB after 3 epochs
5.2 Named Entity Recognition (NER)
For NER, instead of using only [CLS], BERT outputs a representation for every token. A classification layer assigns an entity label to each position in the sequence.
from transformers import BertForTokenClassification, BertTokenizer, pipeline
import torch
# Using a pre-fine-tuned NER model
ner = pipeline(
"ner",
model="dbmdz/bert-large-cased-finetuned-conll03-english",
aggregation_strategy="simple"
)
text = "Elon Musk founded Tesla in Palo Alto, California."
entities = ner(text)
for e in entities:
    print(f" {e['word']} → {e['entity_group']} ({e['score']:.3f})")
# Elon Musk → PER (0.998)
# Tesla → ORG (0.997)
# Palo Alto → LOC (0.994)
# California → LOC (0.999)
# --- Custom NER fine-tuning from scratch ---
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
model = BertForTokenClassification.from_pretrained(
'bert-base-cased',
num_labels=len(label_list),
id2label={i: l for i, l in enumerate(label_list)},
label2id={l: i for i, l in enumerate(label_list)}
)
# → Then fine-tune with CoNLL-2003 using the Trainer API
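One preprocessing step deserves attention before training: CoNLL labels are per word, but BERT predicts per wordpiece, so labels must be realigned after tokenization. The helper below is illustrative (not part of the transformers API); `word_ids` mirrors the mapping a fast HuggingFace tokenizer returns via `word_ids()`, where `None` marks special tokens and a repeated index marks continuation pieces:

```python
def align_labels(word_labels, word_ids, label2id, ignore_index=-100):
    """Map word-level NER labels onto wordpiece sub-tokens.

    Only the first sub-token of each word keeps the real label;
    continuation pieces and special tokens get ignore_index so the
    loss skips them (the same -100 convention as in MLM)."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                         # [CLS] / [SEP] / [PAD]
            aligned.append(ignore_index)
        elif wid != prev:                       # first piece of a new word
            aligned.append(label2id[word_labels[wid]])
        else:                                   # '##' continuation piece
            aligned.append(ignore_index)
        prev = wid
    return aligned

label2id = {'O': 0, 'B-PER': 1, 'B-LOC': 5}
word_labels = ['B-PER', 'O', 'B-LOC']           # one label per word
word_ids = [None, 0, 0, 1, 2, None]             # first word split into two pieces
print(align_labels(word_labels, word_ids, label2id))
# [-100, 1, -100, 0, 5, -100]
```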
5.3 Extractive Question Answering
For extractive QA (like SQuAD), BERT receives a question and a context separated by [SEP]. The model must predict the start and end positions of the answer span within the context.
Architecture for Extractive QA:
Input: [CLS] When was BERT published? [SEP] BERT was published in 2018 by Google AI. [SEP]
| | | | | | | | | | | | | |
BERT: 12 Transformer Encoder layers
| | | | | | | | | | | | | |
Start: 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.90 0.01 0.01 0.01
End: 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.88
Extracted answer: "2018"
Two vectors W_start and W_end (768-dim each) are learned.
For each token i: P_start(i) = softmax(W_start . h_i)
P_end(i) = softmax(W_end . h_i)
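At inference time the answer is the span (i, j) with j ≥ i that maximizes the summed start and end scores, usually with a cap on span length. A small sketch of that decoding step (toy logits, brute-force search; real implementations vectorize this):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=10):
    """Pick (i, j) with j >= i maximizing start_logits[i] + end_logits[j]."""
    best, best_score = (0, 0), -np.inf
    for i in range(len(start_logits)):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Toy logits: start peaks at token 2, end peaks at token 3
start = np.array([0.1, 0.2, 5.0, 0.1, 0.3])
end = np.array([0.1, 0.2, 0.1, 4.0, 0.3])
print(best_span(start, end))  # (2, 3)
```

The j ≥ i constraint is what prevents nonsensical spans like an end position before the start, which independent argmaxes over the two distributions would not rule out.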
from transformers import pipeline
qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")
context = """
BERT (Bidirectional Encoder Representations from Transformers) is a language model
developed by Google AI in 2018. It was trained on Wikipedia and BooksCorpus,
totaling approximately 3.3 billion words. BERT-base has 110 million parameters
and 12 Transformer encoder layers.
"""
questions = [
"Who developed BERT?",
"How many parameters does BERT-base have?",
"What data was BERT trained on?",
"In what year was BERT published?",
]
for question in questions:
    result = qa(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']} (confidence: {result['score']:.3f})")
    print()
5.4 Sentence Pair Classification (NLI)
For Natural Language Inference and paraphrase detection, BERT receives two sentences and classifies their relationship: entailment, contradiction, or neutral.
from transformers import pipeline
nli = pipeline(
"text-classification",
model="cross-encoder/nli-deberta-v3-small"
)
pairs = [
("The cat sleeps on the sofa", "An animal is resting"),
("All students passed the exam", "No student failed"),
("The cat sleeps on the sofa", "The dog is running in the park"),
]
for premise, hypothesis in pairs:
    # Cross-encoders expect a sentence pair, not a manually [SEP]-joined string:
    # the pipeline inserts the separator itself from the text/text_pair fields
    result = nli({"text": premise, "text_pair": hypothesis})
    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print(f"Relation: {result[0]['label']} ({result[0]['score']:.3f})")
    print()
5.5 Fine-tuning Best Practices
Recommended Hyperparameters (from the original BERT paper)
| Parameter | Recommended Range | Notes |
|---|---|---|
| Learning Rate | 2e-5, 3e-5, 5e-5 | Linear warmup + linear decay |
| Batch Size | 16, 32 | Larger = more stable gradients |
| Epochs | 2–4 | More epochs risk catastrophic forgetting |
| Max Sequence Length | 128–512 | Shorter = faster; 128 often sufficient |
| Warmup Ratio | 10% of total steps | Stabilizes early training |
| Weight Decay | 0.01 | L2 on all params except biases |
| Dropout | 0.1 (default) | Sometimes reduce to 0.0 for small datasets |
| Adam epsilon | 1e-8 | Numerical stability in Adam optimizer |
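The warmup-plus-linear-decay schedule from the table is easy to write as a small helper (a pure-Python sketch; `get_linear_schedule_with_warmup` in `transformers` produces the same shape):

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.1, peak_lr=2e-5):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 over the remaining steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total = 1000
print(linear_warmup_decay(0, total))     # 0.0 (start of warmup)
print(linear_warmup_decay(100, total))   # 2e-05 (peak, end of warmup)
print(linear_warmup_decay(1000, total))  # 0.0 (end of training)
```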
6. Extracting Contextual Embeddings from BERT
Beyond fine-tuning for specific tasks, BERT is extremely useful as a feature extractor. The representations produced by its internal layers capture linguistic information at different levels of abstraction — from morphology in the lower layers to high-level semantics in the upper layers.
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

text = "Natural language processing is fascinating and powerful"
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# outputs contains:
# - last_hidden_state: final layer output (batch, seq_len, 768)
# - pooler_output: [CLS] after a linear + tanh (batch, 768)
# - hidden_states: tuple of 13 tensors (embedding layer + 12 Transformer layers)
last_hidden = outputs.last_hidden_state
pooler = outputs.pooler_output
all_layers = outputs.hidden_states

print(f"Last hidden state: {last_hidden.shape}")  # (1, N_tokens, 768)
print(f"Pooler output: {pooler.shape}")           # (1, 768)
print(f"Number of layers: {len(all_layers)}")     # 13

# Strategy 1: [CLS] token of the last layer
cls_embedding = last_hidden[:, 0, :]  # (1, 768)

# Strategy 2: Mean pooling over the last layer (often better than CLS)
attention_mask = inputs['attention_mask'].unsqueeze(-1).float()
mean_embedding = (last_hidden * attention_mask).sum(1) / attention_mask.sum(1)
print(f"Mean pooling shape: {mean_embedding.shape}")  # (1, 768)

# Strategy 3: Concatenate last 4 layers (captures richer information)
last_4 = torch.cat([all_layers[i] for i in [-1, -2, -3, -4]], dim=-1)
print(f"Last 4 layers concatenated: {last_4.shape}")  # (1, N_tokens, 3072)

# Strategy 4: Sum of last 4 layers (preserves 768-dim)
sum_last_4 = torch.stack([all_layers[i] for i in [-1, -2, -3, -4]]).sum(0)
print(f"Sum of last 4 layers: {sum_last_4.shape}")  # (1, N_tokens, 768)
```
Which Layer to Use?
Research has shown that different BERT layers capture different types of information:
- Layers 1–4 (low): Morphological and basic syntactic features (POS tags, subword patterns)
- Layers 5–8 (middle): Complex syntactic features, dependency relations, coreference
- Layers 9–12 (high): High-level semantic information, world knowledge
- Last layer: Best for semantic tasks (NLI, similarity, classification)
- Second-to-last layer: Often slightly better for NER than the very last layer
- Middle layers: Best for syntactic tasks (POS tagging, constituent parsing)
7. BERT Variants: A Rich Ecosystem
After BERT's publication, numerous research groups proposed variants that improve different aspects of the original design. Here is a comprehensive overview of the most important ones.
BERT Family Comparison Table
| Model | Year | Key Innovation | Parameters | vs BERT-base |
|---|---|---|---|---|
| BERT-base | 2018 | MLM + NSP, deep bidirectionality | 110M | Baseline |
| RoBERTa | 2019 | No NSP, dynamic masking, 10× more data | 125M | +1–2% GLUE |
| ALBERT | 2019 | Parameter sharing, factorized embedding | 12M–235M | Comparable, smaller |
| DistilBERT | 2019 | Knowledge distillation, 60% faster | 66M | 97% retained |
| ELECTRA | 2020 | Replaced token detection (more efficient than MLM) | 110M | +1% with same compute |
| DeBERTa | 2020 | Disentangled attention (content + position) | 134M–5B | SotA on SuperGLUE |
| XLNet | 2019 | Permutation language modeling | 340M | +1–2% select tasks |
| SpanBERT | 2019 | Contiguous span masking + SBO objective | 110M | +4–8% QA tasks |
7.1 RoBERTa — Robustly Optimized BERT
Facebook AI (2019) showed BERT was significantly undertrained. RoBERTa makes five changes to the training procedure, with no architectural modifications:
- Remove NSP: Training only on MLM improves downstream performance
- Dynamic masking: Masking patterns are regenerated each epoch (not fixed during preprocessing)
- More data: 160GB of text (vs ~16GB for BERT) including CC-News, OpenWebText, Stories
- Longer training with much larger batches: 500K steps at batch size 8K (BERT: 1M steps at batch size 256)
- Full 512-token sequences throughout: BERT trained 90% of its steps on 128-token sequences
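Dynamic masking is simple to sketch: instead of fixing the mask during preprocessing, a fresh pattern is drawn every time a sequence is served (a pure-Python illustration; HuggingFace's `DataCollatorForLanguageModeling` achieves the same effect by masking at batch-collation time):

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=random):
    """Return a freshly masked copy of tokens; a new pattern on every call."""
    masked = list(tokens)
    n_mask = max(1, int(len(tokens) * mask_prob))
    # Sample new positions each call, so every epoch sees a different mask
    for i in rng.sample(range(len(tokens)), n_mask):
        masked[i] = mask_token
    return masked

tokens = "the cat sat on the mat and purred softly today".split()
rng = random.Random(0)
# Two "epochs" over the same sequence can produce different masking patterns
epoch1 = dynamic_mask(tokens, rng=rng)
epoch2 = dynamic_mask(tokens, rng=rng)
print(epoch1)
print(epoch2)
```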
7.2 ALBERT — Efficient Parameter Sharing
ALBERT (A Lite BERT) from Google addresses parameter count growth with two techniques:
- Factorized embedding parameterization: Separates vocabulary embedding size (E=128) from the hidden layer size (H=768). Instead of a V×H matrix, ALBERT uses V×E and E×H, drastically reducing embedding parameters.
- Cross-layer parameter sharing: All 12 Transformer layers share the same weights. This reduces parameters without reducing the number of forward passes (depth is preserved).
ALBERT-xxlarge achieves better performance than BERT-large with only 235M effective parameters (vs 340M). However, inference speed does not improve since the number of forward passes through the shared layers remains the same.
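The saving from factorization is easy to verify with back-of-the-envelope arithmetic, using the sizes cited above (V=30K vocabulary, H=768 hidden size, E=128 embedding size):

```python
V, H, E = 30_000, 768, 128

bert_embedding_params = V * H            # one V x H embedding matrix
albert_embedding_params = V * E + E * H  # V x E lookup plus E x H projection

print(f"BERT:   {bert_embedding_params:,}")    # 23,040,000
print(f"ALBERT: {albert_embedding_params:,}")  # 3,938,304
print(f"Reduction: {bert_embedding_params / albert_embedding_params:.1f}x")
```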
7.3 DistilBERT — Knowledge Distillation
HuggingFace (2019) used knowledge distillation to create a smaller, faster model. A 6-layer "student" model is trained to mimic both the output probabilities and the internal hidden states of a 12-layer "teacher" (BERT-base).
- 40% fewer parameters: 66M vs 110M
- 60% faster inference: 6 layers vs 12
- 97% of BERT-base performance: Retains nearly all quality
```python
from transformers import pipeline
import time

# DistilBERT: ideal for latency-sensitive production
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

texts = [
    "This product is absolutely fantastic!",
    "Terrible experience, never buying again.",
    "It works as described, nothing special.",
]

start = time.time()
results = classifier(texts)
latency = (time.time() - start) * 1000

for text, result in zip(texts, results):
    print(f"{result['label']} ({result['score']:.3f}): {text[:40]}...")
print(f"\nBatch latency: {latency:.0f}ms for {len(texts)} samples")
```
7.4 ELECTRA — Replaced Token Detection
ELECTRA (Clark et al., 2020) replaces MLM with a more efficient objective inspired by Generative Adversarial Networks:
- A small generator (a compact BERT) produces plausible replacement tokens for masked positions
- A larger discriminator (the main model) must identify which tokens were replaced
The key advantage: the discriminator learns from every token in the sequence, not just the 15% masked in BERT's MLM. This makes ELECTRA substantially more computationally efficient — it matches RoBERTa's performance using only 25% of the compute budget.
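The discriminator's training signal can be illustrated in a few lines: given the original tokens and the generator's output, every position gets a binary "was this replaced?" label (a toy sketch with hand-written tokens, not the actual ELECTRA pipeline):

```python
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator's plausible swap

# Discriminator target: 1 where the token was replaced, 0 where it is original.
# Unlike MLM (loss on ~15% of positions), this loss covers every position.
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0]
```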
7.5 DeBERTa — Disentangled Attention
DeBERTa (Decoding-enhanced BERT with disentangled Attention) from Microsoft introduces two key innovations:
- Disentangled attention: Represents content and position as two separate vectors rather than summing them. The attention score between two tokens then decomposes into four terms — content-to-content, content-to-position, position-to-content, and position-to-position (the published implementation drops the last, which contributes little).
- Enhanced mask decoder: Injects absolute position information into the final softmax layer for MLM predictions (positions are needed to distinguish, for example, whether a masked word is a subject or object).
DeBERTa v3 is currently one of the highest-performing encoder models on many benchmarks, frequently outperforming models that are 5–10× larger.
8. BERT for the Italian Language
The original BERT is pre-trained on English text. For Italian NLP, models pre-trained on Italian corpora consistently outperform multilingual BERT on Italian tasks — they understand Italian morphology, morphosyntax, and vocabulary far better.
Italian BERT Models
| Model | Base | Training Data | Vocabulary | Best Use Case |
|---|---|---|---|---|
| mBERT | BERT-base | Wikipedia in 104 languages | 110K multilingual | Cross-lingual baseline |
| dbmdz/bert-base-italian-xxl-cased | BERT-base | OPUS + Italian Wikipedia (~13GB) | 30K Italian tokens | General Italian NLP |
| AlBERTo | BERT-base | Italian Twitter corpus | 30K Italian tokens | Social media analysis |
| UmBERTo | RoBERTa | OSCAR Italian corpus (~69GB) | 32K SentencePiece | Highest accuracy on Italian |
| XLM-RoBERTa-large | RoBERTa-large | CC-100 in 100 languages | 250K multilingual | Cross-lingual, zero-shot |
```python
from transformers import pipeline, BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# --- 1. Italian BERT (dbmdz) for fill-mask ---
fill_it = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

results = fill_it("Roma è la [MASK] d'Italia.")
for r in results[:3]:
    print(f"  {r['token_str']}: {r['score']:.4f}")
# capitale: 0.8234
# città: 0.0521
# cuore: 0.0312

# --- 2. UmBERTo (RoBERTa-based) for fill-mask ---
fill_umb = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1"
)

results = fill_umb("L'intelligenza artificiale <mask> il futuro.")
for r in results[:3]:
    print(f"  {r['token_str']}: {r['score']:.4f}")

# --- 3. Extract Italian sentence embeddings ---
tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
model = BertModel.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
model.eval()

sentences = [
    "La tecnologia sta cambiando il modo in cui lavoriamo",
    "Il cambiamento tecnologico trasforma il mercato del lavoro",
    "Il gatto dorme sul tappeto",
]

def get_sentence_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling with attention mask
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

embeddings = [get_sentence_embedding(s, tokenizer, model) for s in sentences]

# Compute cosine similarities
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = F.cosine_similarity(embeddings[i], embeddings[j]).item()
        print(f"Similarity {i}-{j}: {sim:.3f}")
# Similarity 0-1: 0.892 (related meaning)
# Similarity 0-2: 0.432 (unrelated)
# Similarity 1-2: 0.438 (unrelated)
```
Which Italian Model to Choose?
- General NLP (NER, classification, QA): `dbmdz/bert-base-italian-xxl-cased` or UmBERTo — both excellent
- Social media / Twitter analysis: AlBERTo (pre-trained specifically on Italian tweets)
- Maximum accuracy: Fine-tune `xlm-roberta-large` on Italian data
- Cross-lingual tasks: XLM-RoBERTa or multilingual DeBERTa v3
- Fast inference in production: DistilBERT multilingual or quantized dbmdz model
9. BERT's Limitations and What Came Next
Despite its enormous success, BERT has several structural limitations that constrain its applicability. Understanding them is crucial for choosing the right model for each problem.
9.1 The 512-Token Context Limit
BERT can process sequences of at most 512 tokens. For long documents (legal contracts, scientific papers, technical manuals), this is a severe constraint. Workarounds include:
- Truncation: Simply cut the document to 512 tokens. Loss of information.
- Sliding window: Process overlapping chunks and aggregate predictions. Costly.
- Hierarchical pooling: Split into chunks, encode separately, then pool sentence-level representations.
- Dedicated models: Longformer (sparse attention, 4096 tokens), BigBird (8192 tokens).
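The sliding-window workaround can be sketched without any model: split a long token sequence into overlapping chunks so no token loses all of its surrounding context (plain Python; HuggingFace tokenizers expose the same idea via `return_overflowing_tokens` and `stride`):

```python
def sliding_windows(tokens, window=512, stride=128):
    """Split tokens into chunks of at most `window` tokens.
    Consecutive chunks overlap by `stride` tokens so context is not cut abruptly."""
    step = window - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

doc = list(range(1200))  # stand-in for 1200 token ids
chunks = sliding_windows(doc, window=512, stride=128)
print([len(c) for c in chunks])  # [512, 512, 432]
```

Per-chunk predictions are then aggregated, e.g. by averaging classification logits or taking the best-scoring QA span across chunks.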
9.2 Quadratic Attention Complexity
Self-attention has complexity O(n²) with respect to sequence length n. Doubling the input length quadruples compute cost and memory usage. Solutions:
- Linformer: Projects K and V to a lower-dimensional space, achieving O(n) complexity
- Performer: Uses random feature approximations to compute attention in O(n)
- Flash Attention: Exact attention with IO-aware memory tiling — same result, much faster in practice
- Longformer: Combines local sliding window attention with global attention for selected tokens
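The quadratic cost is easy to make concrete: the attention score matrix alone holds n × n entries per head, so doubling n quadruples its memory (back-of-the-envelope arithmetic, fp32 and 12 heads as in BERT-base):

```python
def attn_matrix_mb(n, heads=12, bytes_per_float=4):
    """Memory of the n x n attention score matrices for one layer, in MB."""
    return heads * n * n * bytes_per_float / (1024 ** 2)

# Each doubling of sequence length multiplies attention memory by 4
for n in (512, 1024, 2048, 4096):
    print(f"n={n:5d}: {attn_matrix_mb(n):8.1f} MB")
```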
9.3 Encoder-Only: No Generative Capability
BERT is an encoder-only model: it produces rich representations of text but cannot generate new text. For generation tasks (summarization, translation, dialogue), you need a decoder (GPT-style) or an encoder-decoder architecture (T5, BART).
9.4 [MASK] Token Mismatch
The [MASK] token appears during pre-training but never during fine-tuning or inference. Even with the 80/10/10 strategy mitigating this, the mismatch remains a theoretical weakness. ELECTRA eliminates this entirely by never using [MASK] at all.
9.5 Independent Masking Assumption
When BERT masks multiple tokens, it predicts them independently, ignoring dependencies between them. For a sentence like "New [MASK] [MASK]" where the answer is "New York City", BERT predicts "York" and "City" independently — neither prediction conditions on the other. XLNet's permutation language modeling addresses this.
Limitations and Modern Solutions
| Limitation | Impact | Solution |
|---|---|---|
| Max 512 tokens | Cannot process long documents | Longformer, BigBird, sliding window |
| O(n²) attention | High compute for long sequences | Flash Attention, Linformer, Performer |
| Encoder-only | No text generation | T5, BART, GPT, LLaMA |
| [MASK] mismatch | Pre-training/inference gap | ELECTRA (replaced token detection) |
| Independent masking | Cannot model token correlations | XLNet (permutation LM) |
| Static after training | Cannot update from new facts | RAG (retrieval-augmented generation) |
Post-BERT Timeline: The Path to LLMs
| Model | Year | Key Innovation | Impact |
|---|---|---|---|
| XLNet | 2019 | Permutation language modeling | Overcomes BERT's masking assumptions |
| T5 | 2019 | Text-to-text, encoder-decoder | Unifies all NLP tasks as generation |
| GPT-3 | 2020 | 175B parameters, in-context learning | Few-shot without fine-tuning |
| DeBERTa | 2020–21 | Disentangled attention | State of the art on SuperGLUE |
| FLAN-T5 | 2022 | Instruction fine-tuning at scale | Better zero/few-shot generalization |
| LLaMA 2/3 | 2023/24 | Open-source efficient LLMs | Democratized LLM research |
| Mistral | 2023 | Sliding window + grouped query attention | Efficient LLM inference |
| Gemma | 2024 | Google's open efficient model | Best-in-class at 2B/7B scale |
Conclusions and Next Steps
BERT represents a paradigm shift in NLP: a single pre-trained model that can be efficiently adapted to a wide range of tasks with minimal labeled data. The key concepts to retain:
- The bidirectional Transformer encoder enables deep contextual understanding — both directions simultaneously
- Three input embedding types — token, segment, and position — sum to form each token's representation
- MLM and NSP as self-supervised pre-training objectives enabling learning from unlabeled text
- How to fine-tune for classification (CLS token), NER (per-token), QA (span extraction)
- The variant ecosystem: RoBERTa (better training), DistilBERT (faster), ALBERT (smaller), DeBERTa (state-of-the-art)
- Italian BERT models for Italian NLP: dbmdz, AlBERTo, UmBERTo, XLM-RoBERTa
- BERT's structural limitations and how modern architectures address them
Continue the Series
- Next: Sentiment Analysis with Transformers — build and deploy a production-grade classifier with uncertainty estimation
- Article 4: Italian NLP with feel-it and Custom Models — challenges specific to Italian morphology and resources
- Article 5: Named Entity Recognition (NER) — extract structured entities from text at production scale
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API, custom training loops, Hub, ONNX
- Article 8: LoRA Fine-tuning — adapt LLMs locally on a single consumer GPU
Cross-Series Connections
- AI Engineering / RAG (Series 6): BERT embeddings are the foundation for semantic search in RAG pipelines. The contextual representations you extract from BERT's hidden states directly fuel the embedding columns of your vector database.
- Advanced Deep Learning (Series 7): Learn gradient checkpointing, mixed precision (fp16/bf16), and ZeRO optimizer sharding for memory-efficient training of large BERT-family models on limited hardware.
- MLOps (Series 5): Monitor BERT models in production — detect embedding drift, distribution shift in classification confidence, and automated retraining triggers covered in article 10 of this series.