NLP Fundamentals: Tokenization, Embeddings and the Modern Pipeline
Every time you ask a voice assistant a question, translate text with Google Translate, filter spam from your inbox, or read auto-generated subtitles on a video, you are using Natural Language Processing (NLP). This discipline, at the intersection of linguistics, computer science, and artificial intelligence, focuses on teaching machines to understand, interpret, and generate human language.
In recent years, NLP has undergone a radical transformation. We have moved from hand-written rules and static dictionaries to neural models capable of understanding nuance, context, and even irony. BERT, GPT, LLaMA, and next-generation models would not be possible without the foundations we will explore in this article: tokenization, embeddings, and the modern NLP pipeline.
This is the first article in the Modern NLP: from BERT to LLMs series. We will start from absolute basics, building step by step the intuitions needed to understand the most advanced language models. We will also pay special attention to the specifics of the Italian language, often overlooked in English-centric resources.
What You Will Learn
- What NLP is and why it underpins almost every modern AI application
- How text is preprocessed: lowercasing, stopwords, stemming, and lemmatization
- Different tokenization approaches: word-level, character-level, and subword (BPE, WordPiece, SentencePiece)
- Classic text representations: Bag of Words and TF-IDF
- Word embeddings: Word2Vec, GloVe, and the geometric intuition of meaning
- Contextual embeddings: from static representations to BERT
- Sentence embeddings and their practical applications
- The modern NLP pipeline: from raw text to prediction
- Italian language preprocessing specifics
- A complete end-to-end example with Python code
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | You are here - NLP Fundamentals | Tokenization, Embeddings, Pipeline |
| 2 | BERT and Transformers | Attention Architecture, Pre-training |
| 3 | Sentiment Analysis | Text Classification with BERT |
| 4 | Named Entity Recognition | Extracting Entities from Text |
| 5 | HuggingFace Transformers | Library and Pre-trained Models |
| 6 | Model Fine-Tuning | Adapting BERT to Your Domain |
| 7 | NLP for Italian | Models and Resources for the Italian Language |
| 8 | From BERT to LLMs | GPT, LLaMA and Text Generation |
1. Text Preprocessing: Preparing the Data
Before any NLP model can work with text, it must be cleaned and normalized. Raw text is full of noise: punctuation, capitalization, abbreviations, emojis, HTML, URLs. Preprocessing transforms this chaos into a structured and consistent format.
1.1 Lowercasing and Normalization
The first step is converting all text to lowercase. For a computer, "House", "house", and "HOUSE" are three completely different strings. Lowercasing unifies them.
import re
import unicodedata
def normalize_text(text: str) -> str:
    """Basic text normalization."""
    # Lowercasing
    text = text.lower()
    # Remove accents (optional, NOT recommended for Italian)
    # text = unicodedata.normalize('NFKD', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
text = "NLP is AMAZING! It analyzes text in 50+ languages."
print(normalize_text(text))
# Output: "nlp is amazing it analyzes text in 50 languages"
Watch Out for Accents in Italian
In many English-centric NLP pipelines, accent removal is a standard step. In Italian, however, accents change word meaning: "però" (conjunction: however) vs "pero" (noun: pear tree), "e" (conjunction: and) vs "è" (verb: is). Never remove accents when working with Italian text.
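As a minimal sketch of accent-safe normalization (the function name normalize_italian is illustrative, not from any library), we can lowercase and strip punctuation while letting accented letters survive, since Python 3's \w is Unicode-aware:

```python
import re
import unicodedata

def normalize_italian(text: str) -> str:
    """Lowercase and strip punctuation while preserving accents."""
    text = text.lower()
    # NFC composes "e" + combining accent into a single "è" character
    text = unicodedata.normalize('NFC', text)
    # \w matches accented letters in Python 3, so "è" and "ò" survive
    text = re.sub(r"[^\w\s']", '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_italian("È però vero!"))  # "è però vero"
```

Note the regex also keeps apostrophes, which Italian elisions like "l'uomo" depend on.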
1.2 Stopwords: Words Without Informational Content
Stopwords are very frequent words that carry little semantic meaning: articles, prepositions, conjunctions. Removing them reduces data dimensionality and helps models focus on meaningful words.
Italian vs English Stopwords
| Language | Stopword Examples | Typical Count |
|---|---|---|
| English | the, is, at, which, on, a, an, and, or | ~180 words |
| Italian | il, lo, la, di, a, da, in, con, su, per, che, e, non, un | ~300 words |
Italian has more stopwords than English due to its richer set of articles (il, lo, la, i, gli, le), articulated prepositions (del, dello, della, nei, negli, nelle), and auxiliary verb forms.
# Approach 1: NLTK
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_en = set(stopwords.words('english'))
stop_it = set(stopwords.words('italian'))
print(f"English stopwords: {len(stop_en)}") # ~179
print(f"Italian stopwords: {len(stop_it)}") # ~279
text = "the cat eats the fish on the table"
tokens = text.split()
filtered = [t for t in tokens if t not in stop_en]
print(filtered)
# Output: ['cat', 'eats', 'fish', 'table']
# Approach 2: spaCy (more complete)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("the cat eats the fish on the table")
filtered_spacy = [token.text for token in doc if not token.is_stop]
print(filtered_spacy)
# Output: ['cat', 'eats', 'fish', 'table']
1.3 Stemming vs Lemmatization
Both techniques reduce words to their base form, but they do so in very different ways.
Stemming vs Lemmatization - Comparison
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Chops suffixes with heuristic rules | Uses a dictionary and morphological analysis |
| Result | Stem (not always a real word) | Lemma (a real dictionary word) |
| English Example | "running" -> "run", "better" -> "better" | "running" -> "run", "better" -> "good" |
| Italian Example | "mangiando" -> "mangi" | "mangiando" -> "mangiare" |
| Speed | Very fast | Slower (requires dictionary lookup) |
| Accuracy | Low (over-stemming is common) | High (correct forms) |
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
import spacy
# English Stemming with Porter
stemmer_en = PorterStemmer()
words_en = ["running", "runs", "ran", "easily", "fairly"]
stems_en = [stemmer_en.stem(w) for w in words_en]
print(dict(zip(words_en, stems_en)))
# {'running': 'run', 'runs': 'run', 'ran': 'ran',
# 'easily': 'easili', 'fairly': 'fairli'}
# Italian Stemming with Snowball
stemmer_it = SnowballStemmer("italian")
words_it = ["mangiando", "mangiare", "mangiato", "bellissimo"]
stems_it = [stemmer_it.stem(w) for w in words_it]
print(dict(zip(words_it, stems_it)))
# {'mangiando': 'mang', 'mangiare': 'mang',
# 'mangiato': 'mang', 'bellissimo': 'bellissim'}
# Lemmatization with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The girls were eating the most beautiful apples")
for token in doc:
    print(f"  {token.text:15s} -> {token.lemma_:15s} ({token.pos_})")
# The -> the (DET)
# girls -> girl (NOUN)
# were -> be (AUX)
# eating -> eat (VERB)
# the -> the (DET)
# most -> most (ADV)
# beautiful -> beautiful (ADJ)
# apples -> apple (NOUN)
For languages with rich morphology like Italian, lemmatization with spaCy
is almost always preferable to stemming. The it_core_news_lg model
contains 500,000 word vectors and supports tokenization, POS tagging, dependency
parsing, NER, and lemmatization.
2. Tokenization: How Machines Read Text
Tokenization is the process of splitting text into discrete units called tokens. It is the first and most critical step in any NLP pipeline: the quality of tokenization directly influences the performance of every downstream model.
There are three fundamental approaches, each with different trade-offs.
2.1 Word-Level Tokenization
The most intuitive approach: each word becomes a token.
# Naive approach: split by whitespace
text = "Artificial intelligence is changing the world"
tokens_naive = text.split()
print(tokens_naive)
# ['Artificial', 'intelligence', 'is', 'changing', 'the', 'world']
# Better approach: spaCy (handles contractions, punctuation)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("It's a well-known fact that NLP isn't trivial")
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)
# ['It', "'s", 'a', 'well', '-', 'known', 'fact', 'that',
# 'NLP', 'is', "n't", 'trivial']
# spaCy correctly handles contractions like "isn't" -> "is" + "n't"
Limitations of Word-Level Tokenization
- Huge vocabulary: Every unique word requires a vocabulary entry. English alone has hundreds of thousands of word forms; Italian has even more due to verb conjugations
- Out-of-vocabulary (OOV) words: Words never seen during training become <UNK> (unknown)
- No morphological sharing: "eat", "eating", "eaten" are three completely separate tokens with no relationship
2.2 Character-Level Tokenization
At the other extreme, each character becomes a token. The vocabulary is tiny (26 letters + digits + punctuation), but sequences become very long.
Text: "hello world"
Word-level: ["hello", "world"] -> 2 tokens
Char-level: ["h","e","l","l","o"," ","w","o","r","l","d"] -> 11 tokens
A 1,000-word text:
Word-level: ~1,000 tokens
Char-level: ~5,000 tokens (5x longer!)
Character-level tokenization solves the unknown word problem (any word can be represented), but the very long sequences make it difficult for models to capture long-range dependencies in the text.
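To make the trade-off concrete, here is a toy comparison of the two extremes in plain Python, no libraries needed:

```python
text = "hello world"

word_tokens = text.split()   # word-level
char_tokens = list(text)     # character-level

print(len(word_tokens), word_tokens)  # 2 ['hello', 'world']
print(len(char_tokens))               # 11

# The vocabulary sizes move in opposite directions:
print(len(set(word_tokens)))  # 2 distinct words
print(len(set(char_tokens)))  # 8 distinct characters (h, e, l, o, space, w, r, d)
```

Short sequences with a huge vocabulary, or long sequences with a tiny one: subword tokenization exists to sit between these extremes.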
2.3 Subword Tokenization: The Optimal Compromise
Subword tokenization is the method used by all modern models (BERT, GPT, LLaMA, T5). The idea is elegant: common words remain whole, while rare words are split into sub-units (subwords) that the model has already seen.
Subword Tokenization Algorithms
| Algorithm | Used By | Strategy | Direction |
|---|---|---|---|
| BPE (Byte Pair Encoding) | GPT-2, GPT-3, GPT-4, LLaMA, RoBERTa | Iterative merge of most frequent pairs | Bottom-up |
| WordPiece | BERT, DistilBERT, ELECTRA | Merge that maximizes likelihood | Bottom-up |
| SentencePiece | T5, ALBERT, XLNet, mBART | Treats text as raw character stream | Language-independent |
| Unigram | SentencePiece (optional), ALBERT | Starts large, removes least useful tokens | Top-down |
How BPE (Byte Pair Encoding) Works
BPE starts from individual characters and iteratively merges the most frequent pairs until the desired vocabulary size is reached.
Corpus: "low low lower lowest"
Step 0 - Initial vocabulary (characters):
l, o, w, e, r, s, t
Step 1 - Most frequent pair: (l, o) -> "lo"
lo w lo w lo w e r lo w e s t
Step 2 - Most frequent pair: (lo, w) -> "low"
low low low e r low e s t
Step 3 - Most frequent pair: (low, e) -> "lowe"
low low lowe r lowe s t
Step 4 - Most frequent pair: (lowe, r) -> "lower"
low low lower lowe s t
Final vocabulary: [l, o, w, e, r, s, t, lo, low, lowe, lower]
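The merge loop above fits in a few lines. This sketch learns merges from the same toy corpus; on ties it simply takes the first pair seen, which here reproduces the sequence shown:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

corpus = "low low lower lowest".split()
vocab = {tuple(w): f for w, f in Counter(corpus).items()}

merges = []
for _ in range(4):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 'r')]
```

Production tokenizers add frequency thresholds, end-of-word markers, and byte-level fallbacks, but the core algorithm is exactly this greedy merge loop.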
WordPiece vs BPE
WordPiece uses an approach similar to BPE, but instead of choosing the most
frequent pair, it chooses the one that maximizes the likelihood
of the training corpus. In practice, WordPiece prefers merges that produce tokens
more useful for the language model, not simply the most common ones. Tokens that
do not start a word are prefixed with ##.
SentencePiece: True Language Independence
The key difference with SentencePiece is that it does not require pre-tokenization. BPE and WordPiece assume the text is already split into words (typically by whitespace), which works well for English and Italian but fails for languages like Chinese, Japanese, or Thai that do not use spaces between words. SentencePiece instead treats the text as a raw character stream, whitespace included, making it truly language-independent.
3. Practical Example: Tokenization with HuggingFace
Let us see concretely how BERT and GPT-2 tokenize the same text.
We will use the transformers library from HuggingFace.
from transformers import AutoTokenizer
# Example text
text = "Artificial intelligence is revolutionizing the world"
# --- BERT (WordPiece) ---
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tok.tokenize(text)
bert_ids = bert_tok.encode(text)
print("BERT tokens:", bert_tokens)
print("BERT IDs: ", bert_ids)
# BERT tokens: ['artificial', 'intelligence', 'is',
# 'revolution', '##izing', 'the', 'world']
# Note: "revolutionizing" is split into "revolution" + "##izing"
# --- GPT-2 (BPE) ---
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tok.tokenize(text)
gpt2_ids = gpt2_tok.encode(text)
print("\nGPT-2 tokens:", gpt2_tokens)
print("GPT-2 IDs: ", gpt2_ids)
# GPT-2 tokens: ['Art', 'ificial', ' intelligence', ' is',
# ' revolution', 'izing', ' the', ' world']
# --- Comparison ---
print(f"\nBERT: {len(bert_tokens)} tokens")
print(f"GPT-2: {len(gpt2_tokens)} tokens")
Key Observations
- BERT uses the ## prefix for continuation subwords (e.g., ##izing), while GPT-2 uses a space-aware encoding (the leading space is part of the token)
- The same word gets split differently depending on the tokenizer's training data and algorithm
- A tokenizer trained on the target language produces fewer tokens, meaning more context in the attention window and lower per-token costs
- BERT adds special tokens: [CLS] at the start and [SEP] at the end. GPT-2 does not
from transformers import AutoTokenizer
# Italian BERT tokenizer
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
# Vocabulary size
print(f"Italian BERT vocabulary: {tok.vocab_size} tokens")
# Output: Italian BERT vocabulary: 31102 tokens
# Special tokens
print(f"[CLS] = {tok.cls_token} (ID: {tok.cls_token_id})")
print(f"[SEP] = {tok.sep_token} (ID: {tok.sep_token_id})")
print(f"[UNK] = {tok.unk_token} (ID: {tok.unk_token_id})")
print(f"[MASK] = {tok.mask_token} (ID: {tok.mask_token_id})")
# Compare English vs Italian tokenization
text_it = "L'intelligenza artificiale sta rivoluzionando il mondo"
text_en = "Artificial intelligence is revolutionizing the world"
it_tokens = tok.tokenize(text_it)
en_tokens = tok.tokenize(text_en)
print(f"\nItalian text: {len(it_tokens)} tokens -> {it_tokens}")
print(f"English text: {len(en_tokens)} tokens -> {en_tokens}")
# Italian BERT handles Italian text efficiently (fewer tokens)
# but splits English words into more subwords
4. Bag of Words and TF-IDF: Classic Representations
Before word embeddings, text was represented as sparse vectors based on word frequencies. These methods are still used in many contexts, and understanding their limitations helps explain why embeddings were revolutionary.
4.1 Bag of Words (BoW)
The Bag of Words model represents a document as a vector where each position corresponds to a word in the vocabulary and the value is the number of times that word appears in the document.
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat', 'chases', 'dog', 'eats', 'fish', 'meat', 'the']
print("\nBoW Matrix:")
print(bow_matrix.toarray())
# [[1, 0, 0, 1, 1, 0, 2], # doc 1
# [0, 0, 1, 1, 0, 1, 2], # doc 2
# [1, 1, 1, 0, 0, 0, 2]] # doc 3
4.2 TF-IDF (Term Frequency - Inverse Document Frequency)
TF-IDF improves BoW by weighting words by their relative importance. Words that are frequent in a document but rare across the corpus receive a higher weight. Words common everywhere (like "the") receive a low weight.
TF-IDF(t, d) = TF(t, d) x IDF(t)
where:
TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t))
N = total number of documents
df(t) = number of documents containing term t
Example:
Word "cat" in document 1:
TF = 1/5 = 0.2 (1 occurrence out of 5 words)
IDF = log(3/2) = 0.405 (appears in 2 out of 3 documents)
TF-IDF = 0.2 x 0.405 = 0.081
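The arithmetic above can be verified with a few lines of standard-library Python (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ from this textbook formula):

```python
import math

def tf(term: str, doc: str) -> float:
    """Term frequency: occurrences of term / total words in the document."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term: str, corpus: list[str]) -> float:
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / df)

corpus = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]

tf_cat = tf("cat", corpus[0])     # 1/5 = 0.2
idf_cat = idf("cat", corpus)      # log(3/2) ~ 0.405
print(round(tf_cat * idf_cat, 3))  # 0.081
```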
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
documents = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
print("Features:", tfidf.get_feature_names_out())
print("\nTF-IDF Matrix (rounded):")
print(np.round(tfidf_matrix.toarray(), 3))
# "fish" and "meat" have higher weights because they appear
# in only one document (more discriminative)
Limitations of BoW and TF-IDF
- No semantics: "dog" and "canine" are completely different; "bank" (river) and "bank" (financial) are identical
- No word order: "the cat eats the mouse" and "the mouse eats the cat" have the same representation
- High dimensionality: A vocabulary of 100,000 words produces 100,000-dimensional vectors, almost all zeros (sparse vectors)
- No generalization: They do not capture relationships between words ("king" and "queen" have no proximity)
5. Word Embeddings: Meaning as Geometry
Word embeddings revolutionized NLP by transforming words into dense, low-dimensional vectors (typically 100-300 dimensions) that capture semantic relationships between words. Two words with similar meaning will have nearby vectors in the embedding space.
5.1 Word2Vec: The Invention that Changed Everything
Introduced by Tomas Mikolov and colleagues (Google, 2013), Word2Vec learns word vectors from the context in which words appear. The fundamental intuition is the distributional hypothesis: "a word is characterized by the company it keeps" (J.R. Firth, 1957).
Two Word2Vec Architectures
| Architecture | Input | Output | Intuition |
|---|---|---|---|
| CBOW (Continuous Bag of Words) | Context (surrounding words) | Target word | Given context "the ___ eats", predict "cat" |
| Skip-gram | Target word | Context (surrounding words) | Given word "cat", predict "the", "eats", etc. |
In practice, Skip-gram works better with small datasets and captures rare words better. CBOW is faster and works well with frequent words.
Sentence: "the black cat eats the fresh fish"
                      ^
                  target word

With window_size = 2, Skip-gram learns:
  cat -> the   (left context, distance 2)
  cat -> black (left context, distance 1)
  cat -> eats  (right context, distance 1)
  cat -> the   (right context, distance 2)
After millions of sentences, words appearing in similar
contexts will have similar vectors:
cat ~ feline ~ kitten (similar contexts: "the ___ eats")
dog ~ canine ~ puppy (similar contexts: "the ___ runs")
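Generating those (target, context) training pairs is straightforward; a sketch using the same sentence and window size:

```python
def skipgram_pairs(tokens: list[str], window: int = 2):
    """Build (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the black cat eats the fresh fish".split()
cat_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "cat"]
print(cat_pairs)
# [('cat', 'the'), ('cat', 'black'), ('cat', 'eats'), ('cat', 'the')]
```

Word2Vec feeds millions of such pairs to a shallow network; the learned input weights become the word vectors.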
5.2 The Arithmetic of Words
The most surprising property of word embeddings is that semantic relationships become algebraic operations on vectors. The famous analogy:
king - man + woman = queen
This is not a coincidence: the vector that transforms "man" into "woman" is the same one that transforms "king" into "queen". This works for many relationships: country-capital, verb-past tense, adjective-superlative.
import gensim.downloader as api
# Load pre-trained word embeddings
model = api.load("word2vec-google-news-300")
# Analogy: king - man + woman = ?
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=3
)
print("king - man + woman =")
for word, score in result:
    print(f"  {word}: {score:.4f}")
# king - man + woman =
# queen: 0.7118
# monarch: 0.6189
# princess: 0.5902
# Word similarity
print(f"\ncat ~ dog: {model.similarity('cat', 'dog'):.4f}")
print(f"cat ~ car: {model.similarity('cat', 'car'):.4f}")
# cat ~ dog: 0.7609
# cat ~ car: 0.2004
5.3 GloVe: Global Vectors for Word Representation
GloVe (Stanford, 2014) takes a different approach: instead of predicting context word by word like Word2Vec, GloVe first builds a global co-occurrence matrix of the entire corpus, then factorizes this matrix to obtain the vectors. It combines the advantages of global statistics-based methods with the local learning of Word2Vec.
Word2Vec vs GloVe
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Method | Predictive (neural network) | Count-based (matrix factorization) |
| Context | Local window | Global corpus statistics |
| Training | Online (streams through text) | Batch (full matrix) |
| Common Dimensions | 100, 200, 300 | 50, 100, 200, 300 |
| Performance | Excellent for analogies | Excellent for similarity |
6. Contextual Embeddings: Same Word, Different Meanings
Word2Vec and GloVe have a fundamental limitation: they assign a single vector to each word, regardless of context. But language is full of ambiguity: the word "bank" has a completely different meaning in "river bank" vs "bank account".
Contextual embeddings solve this problem: the vector for a word depends on the entire sentence in which it appears. This is the approach used by BERT, GPT, and all Transformer-based models.
Static vs Contextual Embeddings
| Aspect | Static (Word2Vec, GloVe) | Contextual (BERT, GPT) |
|---|---|---|
| Vector per word | One fixed, always the same | Different depending on context |
| Polysemy | Not handled ("bank" has one vector) | Handled ("bank" gets different vectors per meaning) |
| Model | Lookup table | Deep neural network (Transformer) |
| Model size | A few MB | Hundreds of MB or GB |
| Speed | Instantaneous | Requires forward pass |
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
def get_word_embedding(sentence: str, word: str):
    """Get the contextual embedding of a word in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.tokenize(sentence)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state: [batch, seq_len, hidden_dim]
    embeddings = outputs.last_hidden_state[0]  # [seq_len, 768]
    # Find the index of the target token
    # (assumes the word survives as a single token; subword splits
    # would require averaging the pieces)
    word_idx = tokens.index(word) + 1  # +1 for [CLS]
    return embeddings[word_idx]
# "bank" in different contexts
emb_river = get_word_embedding("I walked along the river bank", "bank")
emb_money = get_word_embedding("I deposited money at the bank", "bank")
emb_data = get_word_embedding("The data bank stores records", "bank")
# Calculate cosine similarity
sim_12 = F.cosine_similarity(emb_river.unsqueeze(0), emb_money.unsqueeze(0))
sim_13 = F.cosine_similarity(emb_river.unsqueeze(0), emb_data.unsqueeze(0))
sim_23 = F.cosine_similarity(emb_money.unsqueeze(0), emb_data.unsqueeze(0))
print(f"bank(river) ~ bank(money): {sim_12.item():.4f}")
print(f"bank(river) ~ bank(data): {sim_13.item():.4f}")
print(f"bank(money) ~ bank(data): {sim_23.item():.4f}")
# Vectors will be DIFFERENT because BERT understands context!
This is the fundamental conceptual leap: with BERT, the word "bank" no longer has a fixed meaning. Its vector changes based on what surrounds it, just as it does in human language understanding.
7. Sentence Embeddings: One Vector for an Entire Sentence
Often we do not need the embedding of a single word, but of an entire sentence or paragraph. Sentence embeddings compress the meaning of a text of any length into a single fixed-dimension vector.
The reference model is Sentence-BERT (SBERT), which modifies the
BERT architecture to produce sentence embeddings optimized for similarity comparison.
For multilingual use, paraphrase-multilingual-MiniLM-L12-v2 supports
50+ languages with 384-dimensional vectors.
Sentence Embedding Applications
| Application | Description | How It Works |
|---|---|---|
| Semantic Search | Search by meaning, not keywords | Embed the query, find closest documents |
| Clustering | Group similar texts automatically | K-means or HDBSCAN on embeddings |
| Duplicate Detection | Find duplicates or near-duplicates | Cosine similarity threshold > 0.9 |
| Zero-Shot Classification | Classify without training data | Compare text embedding with label embeddings |
from sentence_transformers import SentenceTransformer, util
# Multilingual model (supports 50+ languages including Italian)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# Sentences in English
sentences = [
    "The cat sleeps on the couch",
    "A feline rests on the sofa",
    "The stock market went up today",
    "Financial markets are growing",
    "I bought a new laptop computer"
]
# Generate embeddings (1 vector per sentence, 384 dimensions)
embeddings = model.encode(sentences, convert_to_tensor=True)
print(f"Shape: {embeddings.shape}") # [5, 384]
# Calculate similarity matrix
cosine_scores = util.cos_sim(embeddings, embeddings)
print("\nSimilarity Matrix:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"  {sentences[i][:40]:40s} <-> {sentences[j][:40]:40s}")
        print(f"    Similarity: {cosine_scores[i][j]:.4f}")
# Expected results:
# "cat sleeps" <-> "feline rests" : ~0.85 (very similar)
# "stock market" <-> "markets growing" : ~0.70 (correlated)
# "cat sleeps" <-> "stock market" : ~0.10 (unrelated)
8. The Modern NLP Pipeline
We have seen the individual components. Now let us put them together to understand how a modern end-to-end NLP pipeline works -- the one used by BERT, GPT, and all Transformer-based models.
Raw Text
|
v
[1. TOKENIZATION]
Input: "Artificial intelligence is amazing"
Output: ["artificial", "intelligence", "is", "amazing"]
|
v
[2. ENCODING (Token -> ID)]
Input: ["artificial", "intelligence", "is", "amazing"]
Output: [101, 7976, 4454, 2003, 6429, 102]
^[CLS] ^[SEP]
|
v
[3. EMBEDDING LAYER]
Input: [101, 7976, 4454, 2003, 6429, 102]
Output: Matrix [6 x 768] - one 768-dim vector per token
|
v
[4. TRANSFORMER ENCODER/DECODER]
Self-Attention: every token "looks at" all others
Input: Matrix [6 x 768]
Output: Matrix [6 x 768] (contextualized vectors)
|
v
[5. TASK HEAD]
Classification: [CLS] embedding -> softmax -> class
NER: each token -> entity label
Generation: last token -> next token
QA: start/end answer position
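Step 3 is nothing more than a table lookup: each token ID selects one row of a learned matrix. A toy sketch (8 dimensions instead of BERT's 768, and random values standing in for learned weights):

```python
import random

random.seed(0)
vocab_size, hidden_dim = 30522, 8  # BERT's vocab size; 8 dims for illustration

# The embedding layer is a lookup table: one learned vector per token ID.
# Random values stand in for the trained weights here.
embedding_table = [[random.random() for _ in range(hidden_dim)]
                   for _ in range(vocab_size)]

token_ids = [101, 7976, 4454, 2003, 6429, 102]  # output of step 2
embedded = [embedding_table[tid] for tid in token_ids]

print(len(embedded), len(embedded[0]))  # 6 8 -> a [6 x 8] matrix, one row per token
```

Real models implement this as a single matrix multiplication against one-hot vectors (or an indexed lookup), and the table is updated by backpropagation like any other layer.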
The Role of the [CLS] Token
In BERT, the special [CLS] token is inserted at the beginning of every
input. After passing through all Transformer layers, its embedding represents the
entire sequence. It is used as input for classification tasks
(sentiment analysis, spam detection, etc.).
9. NLP for Italian: Specifics and Tools
The Italian language presents unique NLP challenges that distinguish it from English and other languages. Understanding these specifics is essential for building effective NLP systems for Italian.
9.1 Italian Linguistic Challenges
Italian-Specific NLP Challenges
| Challenge | Description | Example |
|---|---|---|
| Rich morphology | Each verb has dozens of conjugated forms | "mangiare" (to eat) has 50+ forms |
| Elisions and apostrophes | Articles and prepositions merge | "l'uomo" (the man), "dell'arte" (of art) |
| Articulated prepositions | Preposition + article in one word | "del" (of the), "nello" (in the), "sulla" (on the) |
| Meaningful accents | Change word meaning | "e" (and) vs "è" (is), "da" (from) vs "dà" (gives) |
| Clitic pronouns | Attach to the verb | "dammelo" (give+me+it), "portarglielo" (bring+him+it) |
| Free word order | SVO is not mandatory | "La torta la mangia Marco" = "Marco mangia la torta" |
9.2 Pre-trained Models for Italian
Main Italian NLP Models
| Model | Base | Task | Repository |
|---|---|---|---|
| dbmdz/bert-base-italian-cased | BERT | General-purpose Italian NLP | HuggingFace |
| AlBERTo | BERT | Italian social media (Twitter) | HuggingFace |
| feel-it-italian-sentiment | UmBERTo | Italian sentiment analysis | MilaNLProc |
| feel-it-italian-emotion | UmBERTo | Emotion detection (joy, anger, fear, sadness) | MilaNLProc |
| Italian-Legal-BERT | BERT | Italian legal texts | dlicari |
| DeepMount00/Italian_NER_XXL | BERT | Italian Named Entity Recognition | HuggingFace |
| it_core_news_lg | spaCy CNN | Tokenization, POS, NER, lemma, parsing | spaCy |
9.3 Italian-Specific Preprocessing
import spacy
import re
class ItalianPreprocessor:
    """Preprocessing pipeline specific to the Italian language."""

    def __init__(self):
        self.nlp = spacy.load("it_core_news_lg")
        # Additional regional/informal stopwords
        self.custom_stops = {
            "cioè", "quindi", "comunque", "praticamente",
            "allora", "insomma", "magari", "ecco", "tipo",
            "boh", "mah", "vabbè", "ok", "okay"
        }

    def preprocess(self, text: str, remove_stops: bool = True,
                   lemmatize: bool = True) -> list[str]:
        """Complete preprocessing for Italian text."""
        # 1. Basic normalization
        text = text.lower()
        text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
        text = re.sub(r"[^\w\s']", '', text)          # keep apostrophes
        text = re.sub(r'\d+', '', text)               # remove numbers
        text = re.sub(r'\s+', ' ', text).strip()
        # 2. spaCy analysis
        doc = self.nlp(text)
        # 3. Filtering and lemmatization
        tokens = []
        for token in doc:
            # Skip punctuation and spaces
            if token.is_punct or token.is_space:
                continue
            # Skip stopwords if requested
            if remove_stops and (token.is_stop or
                                 token.text in self.custom_stops):
                continue
            # Lemmatize or use original form
            word = token.lemma_ if lemmatize else token.text
            if len(word) > 1:  # skip single characters
                tokens.append(word)
        return tokens
# Usage example
prep = ItalianPreprocessor()
text = """L'intelligenza artificiale sta rivoluzionando
il modo in cui le aziende italiane gestiscono i loro
processi, cioè praticamente tutto sta cambiando."""
result = prep.preprocess(text)
print("Processed tokens:", result)
# ['intelligenza', 'artificiale', 'rivoluzionare', 'modo',
# 'azienda', 'italiano', 'gestire', 'processo', 'cambiare']
10. End-to-End Example: Semantic Search
Let us put everything we have learned together in a complete example: a semantic search engine over a corpus of texts. Given a set of documents and a user query, we will find the most relevant documents using sentence embeddings.
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearch:
    """Semantic search engine using sentence embeddings."""

    def __init__(self, model_name: str =
                 "paraphrase-multilingual-MiniLM-L12-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents: list[str] = []
        self.embeddings = None

    def index_documents(self, documents: list[str]) -> None:
        """Index documents by computing their embeddings."""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")
        print(f"Embeddings shape: {self.embeddings.shape}")

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        """Search for the most relevant documents for the query."""
        query_embedding = self.model.encode(
            query, convert_to_tensor=True
        )
        scores = util.cos_sim(query_embedding, self.embeddings)[0]
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": round(score.item(), 4),
                "index": idx.item()
            })
        return results
# --- Usage Example ---
corpus = [
    "Python is a versatile and easy-to-learn programming language",
    "Machine learning allows computers to learn from data",
    "Angular is a framework for building modern web applications",
    "Carbonara pasta is a typical dish of Roman cuisine",
    "Relational databases use SQL to query data",
    "Deep learning uses neural networks with many hidden layers",
    "Rome is the capital of Italy with a thousand-year history",
    "REST APIs enable communication between web services",
    "Natural Language Processing analyzes and understands text",
    "Neapolitan pizza became a UNESCO heritage in 2017"
]
# Create search engine and index documents
search_engine = SemanticSearch()
search_engine.index_documents(corpus)
# Run some searches
queries = [
    "how to analyze natural language",
    "frontend development framework",
    "traditional Italian food"
]
for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = search_engine.search(query, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"  {i}. [{r['score']:.4f}] {r['document']}")
Expected Results
Semantic search understands meaning, not just words. For example:
- "how to analyze natural language" will find the NLP document even if it does not contain those exact words
- "frontend development framework" will find Angular, even though "frontend" does not appear in the document (but "modern web applications" is semantically related)
- "traditional Italian food" will find both carbonara and pizza, because the model understands the semantic relationship
Roadmap: From Here to LLMs
In this article, we have built the foundations of modern NLP, starting from text preprocessing through contextual embeddings and the complete pipeline. Let us summarize the journey we have taken:
Concept Summary
| Concept | What It Does | Evolution |
|---|---|---|
| Preprocessing | Cleans and normalizes raw text | Manual rules -> spaCy pipeline |
| Tokenization | Splits text into discrete units | Word -> Char -> Subword (BPE/WordPiece) |
| BoW / TF-IDF | Represents text as sparse vectors | Simple but no semantics |
| Word Embeddings | Dense vectors capturing meaning | Word2Vec -> GloVe -> FastText |
| Contextual Embeddings | Vectors that depend on context | ELMo -> BERT -> GPT |
| Sentence Embeddings | One vector for an entire sentence | Mean pooling -> Sentence-BERT |
In the next article, we will make the leap to the architecture that revolutionized everything: the Transformer. We will explore the Self-Attention mechanism in detail, understand why BERT was a turning point, and learn how to use it for real tasks like text classification and question answering.
Resources for Further Study
- spaCy Italian Models: Official documentation for Italian spaCy models (spacy.io/models/it)
- HuggingFace Models: Repository of pre-trained Italian models (huggingface.co/models?language=it)
- Sentence-BERT: sentence-transformers documentation (sbert.net)
- FEEL-IT: Sentiment analysis and emotion classification for Italian (MilaNLProc)
- Word2Vec Paper: "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
- GloVe Paper: "Global Vectors for Word Representation" (Pennington et al., 2014)
- BERT Paper: "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019)