BERT Explained: Architecture, Pretraining and Fine-tuning
2018 marked a turning point in the history of Natural Language Processing. With the publication of BERT (Bidirectional Encoder Representations from Transformers), the Google AI team introduced a model that redefined the state of the art on 11 NLP benchmarks simultaneously. For the first time, a single pre-trained model could be adapted to very different tasks (classification, question answering, NER) and outperform the best previous task-specific systems on each of them.
What makes BERT so revolutionary? The answer lies in three fundamental innovations: deep bidirectionality, large-scale pre-training, and the simplicity of fine-tuning for downstream tasks. In this article we analyze every aspect of the BERT architecture, from the internal mechanics of attention to practical implementation with HuggingFace, including the many variants that followed.
This is the second article in the Modern NLP: from BERT to LLMs series. If you have not yet read the first article on fundamentals (tokenization, embeddings and the NLP pipeline), I recommend doing so first — many concepts here build directly on those foundations.
What You Will Learn
- Why BERT represented a revolution in NLP and the limits of previous models
- The encoder-only Transformer architecture underlying BERT
- How multi-head self-attention works with the mathematical formulas
- Input representation: token, segment and position embeddings
- The two pre-training strategies: Masked Language Model (MLM) and Next Sentence Prediction (NSP)
- WordPiece tokenization and the special tokens [CLS], [SEP], [MASK]
- How to fine-tune for classification, NER, and question answering
- How to extract contextual embeddings from any BERT layer
- Complete practical implementation with HuggingFace Transformers
- BERT variants: RoBERTa, ALBERT, DistilBERT, DeBERTa, ELECTRA
- BERT for the Italian language: available models and comparison
- BERT's limitations and how later models overcame them
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | NLP Fundamentals | Tokenization, Embeddings, Pipeline |
| 2 | You are here — BERT and Transformers | Attention, Pre-training |
| 3 | Sentiment Analysis | BERT classifiers in production |
| 4 | Italian NLP | feel-it, AlBERTo, Italian-specific models |
| 5 | Named Entity Recognition | Extracting entities from text |
| 6 | Text Classification | Multi-label and zero-shot |
| 7 | HuggingFace Transformers | Trainer API, Datasets, Hub |
| 8 | LoRA Fine-tuning | Local training on consumer GPU |
| 9 | Semantic Similarity | Sentence embeddings and FAISS |
| 10 | NLP Monitoring | Drift detection and retraining |
1. Why BERT Revolutionized NLP
To appreciate BERT's impact, it helps to look at the NLP landscape before 2018 and understand which fundamental problems previous models could not solve.
1.1 The Limits of Word2Vec and GloVe
As we covered in the first article, Word2Vec (2013) and GloVe (2014) were a genuine breakthrough: words mapped to dense vectors in a continuous space where semantic relationships like "king - man + woman = queen" emerged naturally from the geometry of that space.
The fundamental limitation: these models produce static representations. Every word has exactly one vector, regardless of the context it appears in. Consider the word "bank":
The Static Representation Problem
- "I deposited the money at the bank." — financial institution
- "We sat on the river bank and watched the sunset." — land beside water
- "A bank of fog rolled in from the sea." — dense mass of fog
- "A bank of computers filled the entire room." — an array or row
Word2Vec assigns a single averaged vector to "bank" — a significant loss of semantic precision across all four senses.
1.2 ELMo: The First Step Toward Contextual Representations
In early 2018, ELMo (Embeddings from Language Models) from AllenAI addressed this with a bidirectional LSTM. ELMo generated different representations for the same word depending on context, achieving meaningful improvements on many benchmarks.
But ELMo had two important limitations. First, its bidirectionality was shallow: two separate LSTM passes (one forward, one backward) were run independently and their outputs concatenated, so the two directions never interacted during processing. Second, LSTMs suffer from an information bottleneck: long-range dependencies get progressively "forgotten" as sequences grow longer.
1.3 GPT: Unidirectional Pretraining
Also in 2018, OpenAI published GPT, which used the Transformer architecture for language modeling. GPT demonstrated that large-corpus pretraining followed by fine-tuning works extremely well for NLP tasks.
However, GPT processes text only left-to-right. This is natural for text generation but suboptimal for understanding: to classify a sentence, answer a question, or extract entities, you need the full bidirectional context — both what came before and what comes after.
1.4 BERT: Deep Bidirectionality
BERT resolves both problems through the self-attention mechanism of the Transformer. Every token in a sequence can "look at" every other token simultaneously — both left and right. This bidirectionality is not a concatenation of two separate models (as in ELMo), but a deep interaction at every single layer of the architecture.
Bidirectionality Approaches Compared
| Model | Type | Context | Key Limitation |
|---|---|---|---|
| Word2Vec/GloVe | Static | No context | One vector per word regardless of meaning |
| GPT-1 | Unidirectional (L→R) | Left context only | Cannot see future tokens |
| ELMo | Shallow bidir. | Concatenated contexts | No interaction between directions |
| BERT | Deep bidir. | Full context at every layer | Encoder-only (no generation) |
This deep bidirectionality is why BERT established new records on 11 NLP benchmarks simultaneously at publication, including GLUE, SQuAD 1.1, SQuAD 2.0, and MultiNLI.
2. BERT Architecture: The Transformer Encoder
BERT is built on the Transformer architecture from "Attention Is All You Need" (Vaswani et al., 2017). The original Transformer has two components: an encoder (processes input) and a decoder (generates output). BERT uses only the encoder, which makes it ideal for understanding tasks rather than generation.
2.1 Architecture Overview
BERT's architecture can be visualized as a stack of Transformer encoder blocks arranged vertically. Each block contains a multi-head self-attention layer followed by a position-wise feed-forward network, with residual connections and layer normalization at each sub-layer.
Input: [CLS] The cat sits on the mat [SEP]
| | | | | | |
+-----+-----+----+----+---+---+----+-----+
| Token Embeddings |
| + Segment Embeddings |
| + Position Embeddings |
+-----+-----+----+----+---+---+----+-----+
| | | | | | |
+-----------------------------------------------+
| Transformer Encoder Block 1 |
| +-------------------------------------------+|
| | Multi-Head Self-Attention ||
| | Q = XWq K = XWk V = XWv ||
| | Attention(Q,K,V) = softmax(QK^T/sqrt(dk))V||
| +-------------------------------------------+|
| | Add & Layer Norm ||
| +-------------------------------------------+|
| | Feed-Forward Network (GELU) ||
| | FFN(x) = GELU(xW1 + b1)W2 + b2 ||
| +-------------------------------------------+|
| | Add & Layer Norm ||
+-----------------------------------------------+
| | | | | | |
... ... ... ... ... ... ...
| | | | | | |
+-----------------------------------------------+
| Transformer Encoder Block L (12 or 24) |
+-----------------------------------------------+
| | | | | | |
[CLS]out T1 T2 T3 T4 T5 [SEP]out
|
Pooling --> Classification / Task-specific output
2.2 BERT-base vs BERT-large
The original paper proposes two configurations, described with notation L/H/A (Layers / Hidden size / Attention heads):
BERT Model Configurations
| Parameter | BERT-base | BERT-large |
|---|---|---|
| Transformer Layers (L) | 12 | 24 |
| Hidden Size (H) | 768 | 1024 |
| Attention Heads (A) | 12 | 16 |
| Total Parameters | 110M | 340M |
| Dim per Head (d_k) | 768/12 = 64 | 1024/16 = 64 |
| Feed-Forward Dim | 3072 (4 × 768) | 4096 (4 × 1024) |
| Max Sequence Length | 512 | 512 |
| Vocabulary Size | 30,522 | 30,522 |
| Pre-training Time | 4 days (16 TPU chips) | 4 days (64 TPU chips) |
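The parameter totals in the table can be reproduced from the architecture sizes alone. Here is a minimal sketch of the arithmetic for BERT-base; the constants come from the table above, and the breakdown into embeddings, attention, FFN, LayerNorms, and pooler follows the standard HuggingFace layout:

```python
V, P, S = 30522, 512, 2      # vocab size, max positions, segment types
H, L, FF = 768, 12, 3072     # hidden size, layers, feed-forward size

emb = H * (V + P + S) + 2 * H            # three embedding tables + embedding LayerNorm
attn = 4 * (H * H + H)                    # W_q, W_k, W_v, W_o, each with bias
ffn = (H * FF + FF) + (FF * H + H)        # two linear layers with biases
norms = 2 * 2 * H                         # two LayerNorms per block (gamma + beta)
block = attn + ffn + norms
pooler = H * H + H                        # linear + tanh pooler on [CLS]

total = emb + L * block + pooler
print(f"{total:,}")                       # 109,482,240 — the "110M" in the table
```

The same arithmetic with H=1024, L=24, FF=4096 and 16 heads lands near the 340M quoted for BERT-large.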
2.3 Input Representation: Three Embeddings
One of BERT's design choices is its input representation, formed by summing three distinct embedding types. The final input vector for each token is:
Input: [CLS] The cat sleeps [SEP] The dog runs [SEP]
| | | | | | | | |
Token Emb: E[CLS] E_the E_cat E_sl E[SEP] E_the E_dog E_r E[SEP]
+ + + + + + + + +
Segment Emb: EA EA EA EA EA EB EB EB EB
+ + + + + + + + +
Position Emb: E0 E1 E2 E3 E4 E5 E6 E7 E8
= = = = = = = = =
Final Input: I0 I1 I2 I3 I4 I5 I6 I7 I8
- Token Embeddings: Dense vectors for each vocabulary token (30,522 × 768). Learned during pre-training; at roughly 23.4M parameters, this matrix alone accounts for about a fifth of BERT-base.
- Segment Embeddings: Indicate which sentence each token belongs to (A for the first sentence, B for the second). Essential for NSP tasks and any task requiring reasoning over sentence pairs such as Natural Language Inference or Paraphrase Detection.
- Position Embeddings: Encode the absolute position of each token in the sequence (0–511). Unlike the original Transformer (which used sinusoidal functions), BERT uses learned position embeddings — each position has its own trainable vector, allowing the model to develop a richer position-dependent representation.
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Inspect the three embedding layers directly
token_emb = model.embeddings.word_embeddings
position_emb = model.embeddings.position_embeddings
segment_emb = model.embeddings.token_type_embeddings
print(f"Token embedding matrix: {token_emb.weight.shape}")
# torch.Size([30522, 768])
print(f"Position embedding matrix: {position_emb.weight.shape}")
# torch.Size([512, 768])
print(f"Segment embedding matrix: {segment_emb.weight.shape}")
# torch.Size([2, 768])
# Encode a sentence pair
text_a = "The cat sleeps"
text_b = "The dog runs"
encoded = tokenizer(text_a, text_b, return_tensors='pt')
print("Token IDs:", encoded['input_ids'])
print("Token Type IDs (segments):", encoded['token_type_ids'])
print("Attention Mask:", encoded['attention_mask'])
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print("Tokens:", tokens)
# ['[CLS]', 'the', 'cat', 'sleeps', '[SEP]', 'the', 'dog', 'runs', '[SEP]']
2.4 Multi-Head Self-Attention: The Core Mechanism
The central innovation of the Transformer — and therefore of BERT — is self-attention: a mechanism that allows each token to "look at" every other token in the sequence and decide how much to weight each when computing its own representation.
Scaled Dot-Product Attention
For each token, the attention mechanism computes three vectors through learned linear projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I offer to other tokens?"
- Value (V): "What information do I carry?"
The attention formula is:
Attention(Q, K, V) = softmax( QK^T / sqrt(d_k) ) * V
Where:
Q: (seq_len, d_k) — query matrix
K: (seq_len, d_k) — key matrix
V: (seq_len, d_v) — value matrix
d_k: dimension of keys (64 for BERT-base, i.e. 768/12 heads)
The 1/sqrt(d_k) scaling factor prevents dot products from growing
so large that softmax produces near-zero gradients (saturation).
For d_k=64: sqrt(64) = 8.0 — dividing by 8 keeps gradients healthy.
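The effect of the 1/sqrt(d_k) factor is easy to see numerically. A small NumPy sketch with toy random vectors (not real BERT activations): unscaled dot products of 64-dim vectors have standard deviation near sqrt(64) = 8, so softmax collapses toward a one-hot distribution; dividing by 8 keeps the weights spread out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)           # one query vector
keys = rng.standard_normal((10, d_k))  # ten key vectors

raw = keys @ q                # dot products: std grows like sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # BERT's scaling brings the variance back to ~1

p_raw, p_scaled = softmax(raw), softmax(scaled)
print(f"largest attention weight, unscaled: {p_raw.max():.3f}")  # typically near 1
print(f"largest attention weight, scaled:   {p_scaled.max():.3f}")  # much flatter
```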
Multi-Head Attention
Instead of computing one attention function, BERT runs h attention functions in parallel. Each head can capture different relationship types simultaneously: one head might specialize in syntactic dependencies, another in coreference, another in semantic similarity.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
BERT-base: h = 12 heads, d_model = 768, d_k = d_v = 64
BERT-large: h = 16 heads, d_model = 1024, d_k = d_v = 64
Total attention parameters per layer (BERT-base):
W_Q: 768 x 768 = 589,824
W_K: 768 x 768 = 589,824
W_V: 768 x 768 = 589,824
W_O: 768 x 768 = 589,824
Total: ~2.36M per layer x 12 layers = ~28.3M
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

class MultiHeadAttention(nn.Module):
    """Multi-Head Attention as in BERT."""
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 64 for BERT-base
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()
        # Linear projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape for multi-head: (batch, heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention per head
        attn_output, attn_weights = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(batch_size, seq_len, self.d_model)
        return self.W_o(attn_output), attn_weights
# Quick verification
mha = MultiHeadAttention(d_model=768, num_heads=12)
x = torch.randn(2, 64, 768) # batch=2, seq_len=64, dim=768
output, weights = mha(x)
print(f"Output shape: {output.shape}") # torch.Size([2, 64, 768])
print(f"Weights shape: {weights.shape}") # torch.Size([2, 12, 64, 64])
2.5 Feed-Forward Network and Activations
After each attention layer, BERT applies a position-wise feed-forward network — the same network applied independently to each position in the sequence:
FFN(x) = GELU(x * W1 + b1) * W2 + b2
Dimensions:
W1: d_model × d_ff = 768 × 3072
W2: d_ff × d_model = 3072 × 768
BERT uses GELU activation (not ReLU):
GELU(x) = x * Phi(x) where Phi is the standard normal CDF
GELU provides a smoother transition and better gradients than ReLU.
FFN parameters per layer (BERT-base):
W1 + b1: 768 * 3072 = 2,359,296 + 3,072 biases
W2 + b2: 3072 * 768 = 2,359,296 + 768 biases
Total: ~4.72M per layer × 12 layers = ~56.7M
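The GELU defined above can be computed exactly with the error function from the standard library. A quick sketch:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"GELU({x:+.1f}) = {gelu(x):+.4f}")
# GELU(-3.0) = -0.0040   (almost, but not exactly, zero — unlike ReLU)
# GELU(-1.0) = -0.1587
# GELU(+0.0) = +0.0000
# GELU(+1.0) = +0.8413
# GELU(+3.0) = +2.9960   (approaches the identity for large x)
```

Note how small negative inputs are not clipped to zero as in ReLU: they pass through attenuated, which gives the smoother gradient behavior mentioned above.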
2.6 Residual Connections and Layer Normalization
Each sub-layer (attention and FFN) in BERT is wrapped with a residual connection followed by layer normalization:
output = LayerNorm(x + Sublayer(x))
Residual connections allow gradients to flow directly through the network
during backpropagation, preventing vanishing gradients in deep architectures.
LayerNorm normalizes activations across the feature dimension, stabilizing
training independently of batch size (unlike BatchNorm).
import torch
import torch.nn as nn
class TransformerEncoderBlock(nn.Module):
    """A single encoder block as in BERT."""
    def __init__(self, d_model=768, num_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        # Multi-Head Attention
        self.attention = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True
        )
        # Feed-Forward Network with GELU activation
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Multi-Head Attention + Residual + LayerNorm
        attn_output, _ = self.attention(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sub-layer 2: FFN + Residual + LayerNorm
        x = self.norm2(x + self.ffn(x))
        return x

class BERTEncoder(nn.Module):
    """Stack of encoder blocks as in BERT-base."""
    def __init__(self, num_layers=12, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
# Verify BERT-base dimensions
encoder = BERTEncoder(num_layers=12, d_model=768, num_heads=12, d_ff=3072)
x = torch.randn(2, 128, 768) # batch=2, seq_len=128
output = encoder(x)
print(f"Output: {output.shape}") # torch.Size([2, 128, 768])
total_params = sum(p.numel() for p in encoder.parameters())
print(f"Encoder parameters: {total_params:,}")
# ~85M (encoder only, excluding embeddings)
3. WordPiece Tokenization
BERT uses WordPiece tokenization — a subword algorithm that balances vocabulary efficiency with linguistic coverage. WordPiece was originally developed at Google for speech recognition and was later used in Google's neural machine translation system before being adopted for BERT.
3.1 How WordPiece Works
The algorithm starts with a vocabulary of individual characters and iteratively merges pairs of tokens that maximize the likelihood of the training corpus. The process continues until the vocabulary reaches a target size (30,522 for BERT-base).
The resulting vocabulary contains:
- Common whole words ("the", "of", "and", "is")
- Common prefixes and stems ("un", "re", "pre", "inter")
- Suffixes and endings marked with ## ("##ing", "##tion", "##ed", "##ness")
- Individual characters to handle any out-of-vocabulary word
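At inference time a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation: repeatedly take the longest prefix that is in the vocabulary, marking non-initial pieces with ##. A minimal sketch with a hypothetical toy vocabulary (the real one has 30,522 entries):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as used at inference time."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # mark non-initial pieces
            if sub in vocab:
                piece = sub
                break
            end -= 1                      # shrink the candidate and retry
        if piece is None:
            return ["[UNK]"]              # no segmentation exists
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}   # toy vocabulary
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

This is a simplified sketch of the matching step only; learning the vocabulary itself is the likelihood-driven merge process described above.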
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# WordPiece handles rare and complex words gracefully
examples = [
"The cat sat on the mat",
"Electroencephalography is fascinating",
"Unstructured data preprocessing pipeline",
"Transformers revolutionized NLP in 2018"
]
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text}")
    print(f" Tokens {len(tokens)}: {tokens}")
    print()
# Text: Electroencephalography is fascinating
# Tokens 7: ['electro', '##ence', '##pha', '##log', '##raphy', 'is', 'fascinating']
# → The rare word is split into 5 subword units, common words stay whole
# Vocabulary size
print(f"Vocabulary size: {tokenizer.vocab_size}") # 30522
# Encoding vs. tokenization
encoded = tokenizer("Hello NLP!", return_tensors="pt")
decoded = tokenizer.decode(encoded['input_ids'][0])
print(f"Encoded -> Decoded: {decoded}")
# [CLS] hello nlp! [SEP]
3.2 Special Tokens
Beyond the WordPiece vocabulary tokens, BERT uses several special tokens that carry structural meaning:
BERT Special Tokens
| Token | ID | Purpose |
|---|---|---|
| [PAD] | 0 | Padding to make sequences the same length within a batch |
| [UNK] | 100 | Unknown token: used when a word cannot be composed from any vocabulary subwords |
| [CLS] | 101 | Start of sequence; its final-layer embedding serves as the sentence representation |
| [SEP] | 102 | Separator between the two input sentences (also marks end of sequence) |
| [MASK] | 103 | Masked token placeholder used during Masked Language Model pre-training |
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Sentence pair encoding for NSP and NLI tasks
pair = tokenizer(
"What is the capital of France?",
"Paris is the capital of France.",
return_tensors='pt',
padding=True,
truncation=True,
max_length=128
)
tokens = tokenizer.convert_ids_to_tokens(pair['input_ids'][0])
print("Tokens:", tokens)
# ['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'france', '?',
# '[SEP]', 'paris', 'is', 'the', 'capital', 'of', 'france', '.', '[SEP]']
# Token type IDs: 0 = first sentence, 1 = second sentence
print("Segments:", pair['token_type_ids'][0].tolist())
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
4. Pre-training: How BERT Learns Language
BERT is pre-trained on enormous text corpora using two complementary objectives. The fundamental idea: the model learns rich language representations from these self-supervised tasks, and this learned knowledge can then be transferred to specific downstream tasks via fine-tuning.
Pre-training Data
- BooksCorpus: ~800M words (11,000 unpublished books) — long, coherent prose
- English Wikipedia: ~2,500M words (text only, no tables or lists)
- Total: ~3.3 billion words of high-quality English text
- Training hardware: 16 TPU chips (BERT-base) / 64 TPU chips (BERT-large)
- Training time: 4 days for BERT-base, 4 days for BERT-large
4.1 Masked Language Model (MLM)
The Masked Language Model is BERT's primary pre-training innovation. Traditional language modeling (predicting the next word left-to-right) is intrinsically unidirectional. MLM introduces an elegant trick: randomly mask a portion of input tokens and ask the model to predict them, forcing it to use both left and right context simultaneously.
The 80/10/10 Masking Strategy
For each training sequence, 15% of tokens are selected for prediction. Of these selected tokens:
- 80%: Replaced with the special [MASK] token
- 10%: Replaced with a random vocabulary token
- 10%: Kept unchanged (original token)
This mixed strategy solves a subtle problem: during fine-tuning and inference, [MASK] never appears. If the model only ever saw [MASK] during pre-training, there would be a pre-training/fine-tuning mismatch. By replacing 10% with random tokens and leaving 10% unchanged, the model is forced to maintain useful representations for all tokens — not just the masked ones.
import random
import torch
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def apply_mlm_masking(token_ids, tokenizer, mask_prob=0.15):
    """Apply BERT's 80/10/10 masking strategy."""
    masked_tokens = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignore in loss
    for i in range(len(token_ids)):
        # Skip special tokens: [CLS], [SEP], [PAD]
        if token_ids[i] in [tokenizer.cls_token_id,
                            tokenizer.sep_token_id,
                            tokenizer.pad_token_id]:
            continue
        if random.random() < mask_prob:
            labels[i] = token_ids[i]  # Record original as label
            rand = random.random()
            if rand < 0.8:
                # 80%: replace with [MASK]
                masked_tokens[i] = tokenizer.mask_token_id
            elif rand < 0.9:
                # 10%: replace with random vocabulary token
                masked_tokens[i] = random.randint(0, tokenizer.vocab_size - 1)
            # else: 10% → keep original token unchanged
    return masked_tokens, labels
# Demonstration
text = "The quick brown fox jumps over the lazy dog"
token_ids = tokenizer.encode(text)
print("Original:", tokenizer.decode(token_ids))
masked, labels = apply_mlm_masking(token_ids, tokenizer)
print("Masked: ", tokenizer.decode(masked))
print("Labels (tokens to predict):",
[(i, tokenizer.decode([labels[i]])) for i in range(len(labels)) if labels[i] != -100])
from transformers import BertForMaskedLM, BertTokenizer
import torch
# Use BERT to fill in masked tokens
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

mask_pos = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0]
logits = outputs.logits[0, mask_pos, :]
top5 = torch.topk(logits, 5, dim=1)
print("Top 5 predictions for [MASK]:")
for token_id, score in zip(top5.indices[0], top5.values[0]):
    print(f" {tokenizer.decode([token_id])}: {score:.3f}")
# capital: 12.847
# city: 9.123
# heart: 7.891
# centre: 7.234
# port: 6.512
4.2 Next Sentence Prediction (NSP)
The second objective teaches BERT to understand sentence-level relationships. Given two sentences A and B, the model must predict whether B actually follows A in the original corpus (IsNext) or is a random sentence from elsewhere (NotNext). The dataset is balanced 50/50.
Positive example (IsNext):
Input: [CLS] The dog ran in the park. [SEP] It jumped over the puddle. [SEP]
Label: IsNext
Negative example (NotNext):
Input: [CLS] The dog ran in the park. [SEP] Quantum computers use qubits. [SEP]
Label: NotNext
The [CLS] token's final representation is passed through a binary classifier
(linear + softmax) to produce the IsNext/NotNext prediction.
from transformers import BertForNextSentencePrediction, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
def predict_nsp(sent_a, sent_b):
    inputs = tokenizer(sent_a, sent_b, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    return probs[0, 0].item(), probs[0, 1].item()  # IsNext, NotNext
# Coherent continuation
a = "The dog ran in the park."
b_coherent = "It jumped over the puddle and splashed through the mud."
is_next, not_next = predict_nsp(a, b_coherent)
print(f"Coherent: IsNext={is_next:.3f}, NotNext={not_next:.3f}")
# Coherent: IsNext=0.892, NotNext=0.108
# Random, unrelated sentence
b_random = "Quantum computers revolutionize modern cryptography."
is_next, not_next = predict_nsp(a, b_random)
print(f"Random: IsNext={is_next:.3f}, NotNext={not_next:.3f}")
# Random: IsNext=0.041, NotNext=0.959
NSP: A Controversial Design Choice
NSP was challenged by subsequent research. RoBERTa (2019) showed that removing NSP and training only on MLM (with more data and longer training) actually improves performance on most benchmarks. The hypothesis: NSP is too simple a task, and its training procedure — using sentence pairs that may span multiple documents — inadvertently encourages the model to use coarse document-level features rather than fine-grained token-level ones. Later models like ALBERT replaced NSP with Sentence Order Prediction (SOP), a harder variant.
4.3 Combined Pre-training Loss
BERT's total pre-training loss is the sum of both objectives:
L_total = L_MLM + L_NSP
L_MLM = CrossEntropy over the 15% masked tokens only
(positions where labels != -100)
L_NSP = Binary CrossEntropy for IsNext (0) vs NotNext (1)
Both losses are computed simultaneously on each batch.
The model must learn to handle both tasks with the same
set of Transformer parameters — this shared training
produces a rich, multi-task representation of language.
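The two terms can be sketched with a single cross-entropy helper that skips unmasked positions, mirroring the -100 label convention used in the masking code above. Random logits stand in for model outputs here; only the loss bookkeeping is the point:

```python
import numpy as np

def cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label != ignore_index."""
    keep = labels != ignore_index
    logits, labels = logits[keep], labels[keep]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
mlm_logits = rng.standard_normal((8, 30522))  # 8 token positions, vocab-sized logits
mlm_labels = np.array([-100, 42, -100, -100, 7, -100, -100, 99])  # 3 masked positions
nsp_logits = rng.standard_normal((1, 2))      # one prediction from [CLS]
nsp_labels = np.array([0])                    # 0 = IsNext

loss = cross_entropy(mlm_logits, mlm_labels) + cross_entropy(nsp_logits, nsp_labels)
print(f"L_total = L_MLM + L_NSP = {loss:.3f}")
```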
5. Fine-tuning BERT for Downstream Tasks
The "pre-train, then fine-tune" paradigm introduced by BERT is remarkably simple: add a small task-specific head on top of the pre-trained model, then train the entire system together on labeled data. This requires far fewer labeled examples and far less compute than training from scratch.
Typically, 2–4 epochs with a low learning rate (2e-5 to 5e-5) are sufficient. The pre-trained parameters already encode rich language knowledge — fine-tuning just specializes this knowledge for the target task.
5.1 Text Classification (Sentiment Analysis)
For sequence classification, the [CLS] token embedding from the final layer serves as the sentence representation. A linear classification head maps it to class logits.
from transformers import (
BertForSequenceClassification, BertTokenizer,
TrainingArguments, Trainer
)
from datasets import load_dataset
import numpy as np, evaluate
# Load pre-trained model with classification head
model = BertForSequenceClassification.from_pretrained(
'bert-base-uncased',
num_labels=2,
id2label={0: "NEGATIVE", 1: "POSITIVE"},
label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Prepare IMDB dataset
dataset = load_dataset('imdb')
def tokenize(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': f1.compute(predictions=preds, references=labels)['f1']
    }
training_args = TrainingArguments(
output_dir='./bert-sentiment',
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_ratio=0.1, # warmup for first 10% of steps
weight_decay=0.01, # L2 regularization
learning_rate=2e-5,
evaluation_strategy='epoch',
save_strategy='epoch',
load_best_model_at_end=True,
metric_for_best_model='f1',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized['train'],
eval_dataset=tokenized['test'],
compute_metrics=compute_metrics,
)
trainer.train()
# Expected: Accuracy ~93%, F1 ~0.93 on IMDB after 3 epochs
5.2 Named Entity Recognition (NER)
For NER, instead of using only [CLS], BERT outputs a representation for every token. A classification layer assigns an entity label to each position in the sequence.
from transformers import BertForTokenClassification, BertTokenizer, pipeline
import torch
# Using a pre-fine-tuned NER model
ner = pipeline(
"ner",
model="dbmdz/bert-large-cased-finetuned-conll03-english",
aggregation_strategy="simple"
)
text = "Elon Musk founded Tesla in Palo Alto, California."
entities = ner(text)
for e in entities:
    print(f" {e['word']} → {e['entity_group']} ({e['score']:.3f})")
# Elon Musk → PER (0.998)
# Tesla → ORG (0.997)
# Palo Alto → LOC (0.994)
# California → LOC (0.999)
# --- Custom NER fine-tuning from scratch ---
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
model = BertForTokenClassification.from_pretrained(
'bert-base-cased',
num_labels=len(label_list),
id2label={i: l for i, l in enumerate(label_list)},
label2id={l: i for i, l in enumerate(label_list)}
)
# → Then fine-tune with CoNLL-2003 using the Trainer API
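One preprocessing step deserves attention before training: CoNLL labels are per word, but BERT predicts per wordpiece, so labels must be realigned after tokenization. The helper below is illustrative (not part of the transformers API); `word_ids` mirrors the mapping a fast HuggingFace tokenizer returns via `word_ids()`, where `None` marks special tokens and a repeated index marks continuation pieces:

```python
def align_labels(word_labels, word_ids, label2id, ignore_index=-100):
    """Map word-level NER labels onto wordpiece sub-tokens.

    Only the first sub-token of each word keeps the real label;
    continuation pieces and special tokens get ignore_index so the
    loss skips them (the same -100 convention as in MLM)."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                         # [CLS] / [SEP] / [PAD]
            aligned.append(ignore_index)
        elif wid != prev:                       # first piece of a new word
            aligned.append(label2id[word_labels[wid]])
        else:                                   # '##' continuation piece
            aligned.append(ignore_index)
        prev = wid
    return aligned

label2id = {'O': 0, 'B-PER': 1, 'B-LOC': 5}
word_labels = ['B-PER', 'O', 'B-LOC']           # one label per word
word_ids = [None, 0, 0, 1, 2, None]             # first word split into two pieces
print(align_labels(word_labels, word_ids, label2id))
# [-100, 1, -100, 0, 5, -100]
```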
5.3 Extractive Question Answering
For extractive QA (like SQuAD), BERT receives a question and a context separated by [SEP]. The model must predict the start and end positions of the answer span within the context.
Architecture for Extractive QA:
Input: [CLS] When was BERT published? [SEP] BERT was published in 2018 by Google AI. [SEP]
| | | | | | | | | | | | | |
BERT: 12 Transformer Encoder layers
| | | | | | | | | | | | | |
Start: 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.90 0.01 0.01 0.01
End: 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.88
Extracted answer: "2018"
Two vectors W_start and W_end (768-dim each) are learned.
For each token i: P_start(i) = softmax(W_start . h_i)
P_end(i) = softmax(W_end . h_i)
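At inference time the answer is the span (i, j) with j ≥ i that maximizes the summed start and end scores, usually with a cap on span length. A small sketch of that decoding step (toy logits, brute-force search; real implementations vectorize this):

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=10):
    """Pick (i, j) with j >= i maximizing start_logits[i] + end_logits[j]."""
    best, best_score = (0, 0), -np.inf
    for i in range(len(start_logits)):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best

# Toy logits: start peaks at token 2, end peaks at token 3
start = np.array([0.1, 0.2, 5.0, 0.1, 0.3])
end = np.array([0.1, 0.2, 0.1, 4.0, 0.3])
print(best_span(start, end))  # (2, 3)
```

The j ≥ i constraint is what prevents nonsensical spans like an end position before the start, which independent argmaxes over the two distributions would not rule out.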
from transformers import pipeline
qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")
context = """
BERT (Bidirectional Encoder Representations from Transformers) is a language model
developed by Google AI in 2018. It was trained on Wikipedia and BooksCorpus,
totaling approximately 3.3 billion words. BERT-base has 110 million parameters
and 12 Transformer encoder layers.
"""
questions = [
"Who developed BERT?",
"How many parameters does BERT-base have?",
"What data was BERT trained on?",
"In what year was BERT published?",
]
for question in questions:
    result = qa(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']} (confidence: {result['score']:.3f})")
    print()
5.4 Sentence Pair Classification (NLI)
For Natural Language Inference and paraphrase detection, BERT receives two sentences and classifies their relationship: entailment, contradiction, or neutral.
from transformers import pipeline
nli = pipeline(
"text-classification",
model="cross-encoder/nli-deberta-v3-small"
)
pairs = [
("The cat sleeps on the sofa", "An animal is resting"),
("All students passed the exam", "No student failed"),
("The cat sleeps on the sofa", "The dog is running in the park"),
]
for premise, hypothesis in pairs:
    # Cross-encoders expect a sentence pair, not a manually [SEP]-joined string:
    # the pipeline inserts the separator itself from the text/text_pair fields
    result = nli({"text": premise, "text_pair": hypothesis})
    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print(f"Relation: {result[0]['label']} ({result[0]['score']:.3f})")
    print()
5.5 Fine-tuning Best Practices
Recommended Hyperparameters (from the original BERT paper)
| Parameter | Recommended Range | Notes |
|---|---|---|
| Learning Rate | 2e-5, 3e-5, 5e-5 | Linear warmup + linear decay |
| Batch Size | 16, 32 | Larger = more stable gradients |
| Epochs | 2–4 | More epochs risk catastrophic forgetting |
| Max Sequence Length | 128–512 | Shorter = faster; 128 often sufficient |
| Warmup Ratio | 10% of total steps | Stabilizes early training |
| Weight Decay | 0.01 | L2 on all params except biases |
| Dropout | 0.1 (default) | Sometimes reduce to 0.0 for small datasets |
| Adam epsilon | 1e-8 | Numerical stability in Adam optimizer |
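The warmup-plus-linear-decay schedule from the table is easy to write as a small helper (a pure-Python sketch; `get_linear_schedule_with_warmup` in `transformers` produces the same shape):

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.1, peak_lr=2e-5):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 over the remaining steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total = 1000
print(linear_warmup_decay(0, total))     # 0.0 (start of warmup)
print(linear_warmup_decay(100, total))   # 2e-05 (peak, end of warmup)
print(linear_warmup_decay(1000, total))  # 0.0 (end of training)
```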
6. Extracting Contextual Embeddings from BERT
Beyond fine-tuning for specific tasks, BERT is extremely useful as a feature extractor. The representations produced by its internal layers capture linguistic information at different levels of abstraction — from morphology in the lower layers to high-level semantics in the upper layers.
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

text = "Natural language processing is fascinating and powerful"
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# outputs contains:
# - last_hidden_state: final layer output (batch, seq_len, 768)
# - pooler_output: [CLS] after a linear + tanh (batch, 768)
# - hidden_states: tuple of 13 tensors (embedding layer + 12 Transformer layers)
last_hidden = outputs.last_hidden_state
pooler = outputs.pooler_output
all_layers = outputs.hidden_states

print(f"Last hidden state: {last_hidden.shape}")  # (1, N_tokens, 768)
print(f"Pooler output: {pooler.shape}")           # (1, 768)
print(f"Number of layers: {len(all_layers)}")     # 13

# Strategy 1: [CLS] token of the last layer
cls_embedding = last_hidden[:, 0, :]  # (1, 768)

# Strategy 2: Mean pooling over the last layer (often better than CLS)
attention_mask = inputs['attention_mask'].unsqueeze(-1).float()
mean_embedding = (last_hidden * attention_mask).sum(1) / attention_mask.sum(1)
print(f"Mean pooling shape: {mean_embedding.shape}")  # (1, 768)

# Strategy 3: Concatenate last 4 layers (captures richer information)
last_4 = torch.cat([all_layers[i] for i in [-1, -2, -3, -4]], dim=-1)
print(f"Last 4 layers concatenated: {last_4.shape}")  # (1, N_tokens, 3072)

# Strategy 4: Sum of last 4 layers (preserves 768-dim)
sum_last_4 = torch.stack([all_layers[i] for i in [-1, -2, -3, -4]]).sum(0)
print(f"Sum of last 4 layers: {sum_last_4.shape}")  # (1, N_tokens, 768)
```
Which Layer to Use?
Research has shown that different BERT layers capture different types of information:
- Layers 1–4 (low): Morphological and basic syntactic features (POS tags, subword patterns)
- Layers 5–8 (middle): Complex syntactic features, dependency relations, coreference
- Layers 9–12 (high): High-level semantic information, world knowledge
- Last layer: Best for semantic tasks (NLI, similarity, classification)
- Second-to-last layer: Often slightly better for NER than the very last layer
- Middle layers: Best for syntactic tasks (POS tagging, constituent parsing)
7. BERT Variants: A Rich Ecosystem
After BERT's publication, numerous research groups proposed variants that improve different aspects of the original design. Here is a comprehensive overview of the most important ones.
BERT Family Comparison Table
| Model | Year | Key Innovation | Parameters | vs BERT-base |
|---|---|---|---|---|
| BERT-base | 2018 | MLM + NSP, deep bidirectionality | 110M | Baseline |
| RoBERTa | 2019 | No NSP, dynamic masking, 10× more data | 125M | +1–2% GLUE |
| ALBERT | 2019 | Parameter sharing, factorized embedding | 12M–235M | Comparable, smaller |
| DistilBERT | 2019 | Knowledge distillation, 60% faster | 66M | 97% retained |
| ELECTRA | 2020 | Replaced token detection (more efficient than MLM) | 110M | +1% with same compute |
| DeBERTa | 2020 | Disentangled attention (content + position) | 134M–5B | SotA on SuperGLUE |
| XLNet | 2019 | Permutation language modeling | 340M | +1–2% select tasks |
| SpanBERT | 2019 | Contiguous span masking + SBO objective | 110M | +4–8% QA tasks |
7.1 RoBERTa — Robustly Optimized BERT
Facebook AI (2019) showed BERT was significantly undertrained. RoBERTa makes five changes to the training procedure, with no architectural modifications:
- Remove NSP: Training only on MLM improves downstream performance
- Dynamic masking: Masking patterns are regenerated each epoch (not fixed during preprocessing)
- More data: 160GB of text (vs ~16GB for BERT) including CC-News, OpenWebText, Stories
- Longer training with much larger batches: 500K steps at batch size 8K (BERT: 1M steps at batch size 256)
- Full 512-token sequences throughout: BERT trained 90% of its steps on 128-token sequences
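Dynamic masking is simple to sketch: instead of fixing the mask during preprocessing, a fresh pattern is drawn every time a sequence is served (a pure-Python illustration; HuggingFace's `DataCollatorForLanguageModeling` achieves the same effect by masking at batch-collation time):

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]", rng=random):
    """Return a freshly masked copy of tokens; a new pattern on every call."""
    masked = list(tokens)
    n_mask = max(1, int(len(tokens) * mask_prob))
    # Sample new positions each call, so every epoch sees a different mask
    for i in rng.sample(range(len(tokens)), n_mask):
        masked[i] = mask_token
    return masked

tokens = "the cat sat on the mat and purred softly today".split()
rng = random.Random(0)
# Two "epochs" over the same sequence can produce different masking patterns
epoch1 = dynamic_mask(tokens, rng=rng)
epoch2 = dynamic_mask(tokens, rng=rng)
print(epoch1)
print(epoch2)
```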
7.2 ALBERT — Efficient Parameter Sharing
ALBERT (A Lite BERT) from Google addresses parameter count growth with two techniques:
- Factorized embedding parameterization: Separates vocabulary embedding size (E=128) from the hidden layer size (H=768). Instead of a V×H matrix, ALBERT uses V×E and E×H, drastically reducing embedding parameters.
- Cross-layer parameter sharing: All 12 Transformer layers share the same weights. This reduces parameters without reducing the number of forward passes (depth is preserved).
ALBERT-xxlarge achieves better performance than BERT-large with only 235M effective parameters (vs 340M). However, inference speed does not improve since the number of forward passes through the shared layers remains the same.
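The saving from factorization is easy to verify with back-of-the-envelope arithmetic, using the sizes cited above (V=30K vocabulary, H=768 hidden size, E=128 embedding size):

```python
V, H, E = 30_000, 768, 128

bert_embedding_params = V * H            # one V x H embedding matrix
albert_embedding_params = V * E + E * H  # V x E lookup plus E x H projection

print(f"BERT:   {bert_embedding_params:,}")    # 23,040,000
print(f"ALBERT: {albert_embedding_params:,}")  # 3,938,304
print(f"Reduction: {bert_embedding_params / albert_embedding_params:.1f}x")
```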
7.3 DistilBERT — Knowledge Distillation
HuggingFace (2019) used knowledge distillation to create a smaller, faster model. A 6-layer "student" model is trained to mimic both the output probabilities and the internal hidden states of a 12-layer "teacher" (BERT-base).
- 40% fewer parameters: 66M vs 110M
- 60% faster inference: 6 layers vs 12
- 97% of BERT-base performance: Retains nearly all quality
```python
from transformers import pipeline
import time

# DistilBERT: ideal for latency-sensitive production
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

texts = [
    "This product is absolutely fantastic!",
    "Terrible experience, never buying again.",
    "It works as described, nothing special.",
]

start = time.time()
results = classifier(texts)
latency = (time.time() - start) * 1000

for text, result in zip(texts, results):
    print(f"{result['label']} ({result['score']:.3f}): {text[:40]}...")
print(f"\nBatch latency: {latency:.0f}ms for {len(texts)} samples")
```
7.4 ELECTRA — Replaced Token Detection
ELECTRA (Clark et al., 2020) replaces MLM with a more efficient objective inspired by Generative Adversarial Networks:
- A small generator (a compact BERT) produces plausible replacement tokens for masked positions
- A larger discriminator (the main model) must identify which tokens were replaced
The key advantage: the discriminator learns from every token in the sequence, not just the 15% masked in BERT's MLM. This makes ELECTRA substantially more computationally efficient — it matches RoBERTa's performance using only 25% of the compute budget.
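The discriminator's training signal can be illustrated in a few lines: given the original tokens and the generator's output, every position gets a binary "was this replaced?" label (a toy sketch with hand-written tokens, not the actual ELECTRA pipeline):

```python
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator's plausible swap

# Discriminator target: 1 where the token was replaced, 0 where it is original.
# Unlike MLM (loss on ~15% of positions), this loss covers every position.
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # [0, 0, 1, 0, 0]
```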
7.5 DeBERTa — Disentangled Attention
DeBERTa (Decoding-enhanced BERT with disentangled Attention) from Microsoft introduces two key innovations:
- Disentangled attention: Represents content and position as two separate vectors rather than summing them. The attention score between two tokens then decomposes into four terms — content-to-content, content-to-position, position-to-content, and position-to-position (the published implementation drops the last, which contributes little).
- Enhanced mask decoder: Injects absolute position information into the final softmax layer for MLM predictions (positions are needed to distinguish, for example, whether a masked word is a subject or object).
DeBERTa v3 is currently one of the highest-performing encoder models on many benchmarks, frequently outperforming models that are 5–10× larger.
8. BERT for the Italian Language
The original BERT is pre-trained on English text. For Italian NLP, models pre-trained on Italian corpora consistently outperform multilingual BERT on Italian tasks — they understand Italian morphology, morphosyntax, and vocabulary far better.
Italian BERT Models
| Model | Base | Training Data | Vocabulary | Best Use Case |
|---|---|---|---|---|
| mBERT | BERT-base | Wikipedia in 104 languages | 110K multilingual | Cross-lingual baseline |
| dbmdz/bert-base-italian-xxl-cased | BERT-base | OPUS + Italian Wikipedia (~13GB) | 30K Italian tokens | General Italian NLP |
| AlBERTo | BERT-base | Italian Twitter corpus | 30K Italian tokens | Social media analysis |
| UmBERTo | RoBERTa | OSCAR Italian corpus (~69GB) | 32K SentencePiece | Highest accuracy on Italian |
| XLM-RoBERTa-large | RoBERTa-large | CC-100 in 100 languages | 250K multilingual | Cross-lingual, zero-shot |
```python
from transformers import pipeline, BertTokenizer, BertModel
import torch
import torch.nn.functional as F

# --- 1. Italian BERT (dbmdz) for fill-mask ---
fill_it = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

results = fill_it("Roma è la [MASK] d'Italia.")
for r in results[:3]:
    print(f"  {r['token_str']}: {r['score']:.4f}")
# capitale: 0.8234
# città: 0.0521
# cuore: 0.0312

# --- 2. UmBERTo (RoBERTa-based) for fill-mask ---
fill_umb = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1"
)

results = fill_umb("L'intelligenza artificiale <mask> il futuro.")
for r in results[:3]:
    print(f"  {r['token_str']}: {r['score']:.4f}")

# --- 3. Extract Italian sentence embeddings ---
tokenizer = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
model = BertModel.from_pretrained('dbmdz/bert-base-italian-xxl-cased')
model.eval()

sentences = [
    "La tecnologia sta cambiando il modo in cui lavoriamo",
    "Il cambiamento tecnologico trasforma il mercato del lavoro",
    "Il gatto dorme sul tappeto",
]

def get_sentence_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling with attention mask
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)

embeddings = [get_sentence_embedding(s, tokenizer, model) for s in sentences]

# Compute cosine similarities
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = F.cosine_similarity(embeddings[i], embeddings[j]).item()
        print(f"Similarity {i}-{j}: {sim:.3f}")
# Similarity 0-1: 0.892 (related meaning)
# Similarity 0-2: 0.432 (unrelated)
# Similarity 1-2: 0.438 (unrelated)
```
Which Italian Model to Choose?
- General NLP (NER, classification, QA): `dbmdz/bert-base-italian-xxl-cased` or UmBERTo — both excellent
- Social media / Twitter analysis: AlBERTo (pre-trained specifically on Italian tweets)
- Maximum accuracy: Fine-tune `xlm-roberta-large` on Italian data
- Cross-lingual tasks: XLM-RoBERTa or multilingual DeBERTa v3
- Fast inference in production: DistilBERT multilingual or quantized dbmdz model
9. BERT's Limitations and What Came Next
Despite its enormous success, BERT has several structural limitations that constrain its applicability. Understanding them is crucial for choosing the right model for each problem.
9.1 The 512-Token Context Limit
BERT can process sequences of at most 512 tokens. For long documents (legal contracts, scientific papers, technical manuals), this is a severe constraint. Workarounds include:
- Truncation: Simply cut the document to 512 tokens. Loss of information.
- Sliding window: Process overlapping chunks and aggregate predictions. Costly.
- Hierarchical pooling: Split into chunks, encode separately, then pool sentence-level representations.
- Dedicated models: Longformer (sparse attention, 4096 tokens), BigBird (8192 tokens).
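The sliding-window workaround can be sketched without any model: split a long token sequence into overlapping chunks so no token loses all of its surrounding context (plain Python; HuggingFace tokenizers expose the same idea via `return_overflowing_tokens` and `stride`):

```python
def sliding_windows(tokens, window=512, stride=128):
    """Split tokens into chunks of at most `window` tokens.
    Consecutive chunks overlap by `stride` tokens so context is not cut abruptly."""
    step = window - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

doc = list(range(1200))  # stand-in for 1200 token ids
chunks = sliding_windows(doc, window=512, stride=128)
print([len(c) for c in chunks])  # [512, 512, 432]
```

Per-chunk predictions are then aggregated, e.g. by averaging classification logits or taking the best-scoring QA span across chunks.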
9.2 Quadratic Attention Complexity
Self-attention has complexity O(n²) with respect to sequence length n. Doubling the input length quadruples compute cost and memory usage. Solutions:
- Linformer: Projects K and V to a lower-dimensional space, achieving O(n) complexity
- Performer: Uses random feature approximations to compute attention in O(n)
- Flash Attention: Exact attention with IO-aware memory tiling — same result, much faster in practice
- Longformer: Combines local sliding window attention with global attention for selected tokens
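The quadratic cost is easy to make concrete: the attention score matrix alone holds n × n entries per head, so doubling n quadruples its memory (back-of-the-envelope arithmetic, fp32 and 12 heads as in BERT-base):

```python
def attn_matrix_mb(n, heads=12, bytes_per_float=4):
    """Memory of the n x n attention score matrices for one layer, in MB."""
    return heads * n * n * bytes_per_float / (1024 ** 2)

# Each doubling of sequence length multiplies attention memory by 4
for n in (512, 1024, 2048, 4096):
    print(f"n={n:5d}: {attn_matrix_mb(n):8.1f} MB")
```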
9.3 Encoder-Only: No Generative Capability
BERT is an encoder-only model: it produces rich representations of text but cannot generate new text. For generation tasks (summarization, translation, dialogue), you need a decoder (GPT-style) or an encoder-decoder architecture (T5, BART).
9.4 [MASK] Token Mismatch
The [MASK] token appears during pre-training but never during fine-tuning or inference. Even with the 80/10/10 strategy mitigating this, the mismatch remains a theoretical weakness. ELECTRA eliminates this entirely by never using [MASK] at all.
9.5 Independent Masking Assumption
When BERT masks multiple tokens, it predicts them independently, ignoring dependencies between them. For a sentence like "New [MASK] [MASK]" where the answer is "New York City", BERT predicts "York" and "City" independently — neither prediction conditions on the other. XLNet's permutation language modeling addresses this.
Limitations and Modern Solutions
| Limitation | Impact | Solution |
|---|---|---|
| Max 512 tokens | Cannot process long documents | Longformer, BigBird, sliding window |
| O(n²) attention | High compute for long sequences | Flash Attention, Linformer, Performer |
| Encoder-only | No text generation | T5, BART, GPT, LLaMA |
| [MASK] mismatch | Pre-training/inference gap | ELECTRA (replaced token detection) |
| Independent masking | Cannot model token correlations | XLNet (permutation LM) |
| Static after training | Cannot update from new facts | RAG (retrieval-augmented generation) |
Post-BERT Timeline: The Path to LLMs
| Model | Year | Key Innovation | Impact |
|---|---|---|---|
| XLNet | 2019 | Permutation language modeling | Overcomes BERT's masking assumptions |
| T5 | 2019 | Text-to-text, encoder-decoder | Unifies all NLP tasks as generation |
| GPT-3 | 2020 | 175B parameters, in-context learning | Few-shot without fine-tuning |
| DeBERTa | 2020–21 | Disentangled attention | State of the art on SuperGLUE |
| FLAN-T5 | 2022 | Instruction fine-tuning at scale | Better zero/few-shot generalization |
| LLaMA 2/3 | 2023/24 | Open-source efficient LLMs | Democratized LLM research |
| Mistral | 2023 | Sliding window + grouped query attention | Efficient LLM inference |
| Gemma | 2024 | Google's open efficient model | Best-in-class at 2B/7B scale |
Conclusions and Next Steps
BERT represents a paradigm shift in NLP: a single pre-trained model that can be efficiently adapted to a wide range of tasks with minimal labeled data. The key concepts to retain:
- The bidirectional Transformer encoder enables deep contextual understanding — both directions simultaneously
- Three input embedding types — token, segment, and position — sum to form each token's representation
- MLM and NSP as self-supervised pre-training objectives enabling learning from unlabeled text
- How to fine-tune for classification (CLS token), NER (per-token), QA (span extraction)
- The variant ecosystem: RoBERTa (better training), DistilBERT (faster), ALBERT (smaller), DeBERTa (state-of-the-art)
- Italian BERT models for Italian NLP: dbmdz, AlBERTo, UmBERTo, XLM-RoBERTa
- BERT's structural limitations and how modern architectures address them
Continue the Series
- Next: Sentiment Analysis with Transformers — build and deploy a production-grade classifier with uncertainty estimation
- Article 4: Italian NLP with feel-it and Custom Models — challenges specific to Italian morphology and resources
- Article 5: Named Entity Recognition (NER) — extract structured entities from text at production scale
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API, custom training loops, Hub, ONNX
- Article 8: LoRA Fine-tuning — adapt LLMs locally on a single consumer GPU
Cross-Series Connections
- AI Engineering / RAG (Series 6): BERT embeddings are the foundation for semantic search in RAG pipelines. The contextual representations you extract from BERT's hidden states directly fuel the embedding columns of your vector database.
- Advanced Deep Learning (Series 7): Learn gradient checkpointing, mixed precision (fp16/bf16), and ZeRO optimizer sharding for memory-efficient training of large BERT-family models on limited hardware.
- MLOps (Series 5): Monitor BERT models in production — detect embedding drift, distribution shift in classification confidence, and automated retraining triggers covered in article 10 of this series.