NLP Fundamentals: Tokenization, Embeddings and the Modern Pipeline
Every time you ask a voice assistant a question, translate text with Google Translate, filter spam from your inbox, or read auto-generated subtitles on a video, you are using Natural Language Processing (NLP). This discipline, at the intersection of linguistics, computer science, and artificial intelligence, focuses on teaching machines to understand, interpret, and generate human language.
In recent years, NLP has undergone a radical transformation. We have moved from hand-written rules and static dictionaries to neural models capable of understanding nuance, context, and even irony. BERT, GPT, LLaMA, and next-generation models would not be possible without the foundations we will explore in this article: tokenization, embeddings, and the modern NLP pipeline.
This is the first article in the Modern NLP: from BERT to LLMs series. We will start from absolute basics, building step by step the intuitions needed to understand the most advanced language models. We will also pay special attention to the specifics of the Italian language, often overlooked in English-centric resources.
What You Will Learn
- What NLP is and why it underpins almost every modern AI application
- How text is preprocessed: lowercasing, stopwords, stemming, and lemmatization
- Different tokenization approaches: word-level, character-level, and subword (BPE, WordPiece, SentencePiece)
- Classic text representations: Bag of Words and TF-IDF
- Word embeddings: Word2Vec, GloVe, and the geometric intuition of meaning
- Contextual embeddings: from static representations to BERT
- Sentence embeddings and their practical applications
- The modern NLP pipeline: from raw text to prediction
- Italian language preprocessing specifics
- A complete end-to-end example with Python code
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | You are here - NLP Fundamentals | Tokenization, Embeddings, Pipeline |
| 2 | BERT and Transformers | Attention Architecture, Pre-training |
| 3 | Sentiment Analysis | Text Classification with BERT |
| 4 | Named Entity Recognition | Extracting Entities from Text |
| 5 | HuggingFace Transformers | Library and Pre-trained Models |
| 6 | Model Fine-Tuning | Adapting BERT to Your Domain |
| 7 | NLP for Italian | Models and Resources for the Italian Language |
| 8 | From BERT to LLMs | GPT, LLaMA and Text Generation |
1. Text Preprocessing: Preparing the Data
Before any NLP model can work with text, it must be cleaned and normalized. Raw text is full of noise: punctuation, capitalization, abbreviations, emojis, HTML, URLs. Preprocessing transforms this chaos into a structured and consistent format.
1.1 Lowercasing and Normalization
The first step is converting all text to lowercase. For a computer, "House", "house", and "HOUSE" are three completely different strings. Lowercasing unifies them.
import re
import unicodedata
def normalize_text(text: str) -> str:
    """Basic text normalization."""
    # Lowercasing
    text = text.lower()
    # Remove accents (optional, NOT recommended for Italian)
    # text = unicodedata.normalize('NFKD', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
text = "NLP is AMAZING! It analyzes text in 50+ languages."
print(normalize_text(text))
# Output: "nlp is amazing it analyzes text in 50 languages"
Watch Out for Accents in Italian
In many English-centric NLP pipelines, accent removal is a standard step. In Italian, however, accents change word meaning: "però" (conjunction: however) vs "pero" (noun: pear tree), "e" (conjunction: and) vs "è" (verb: is). Never remove accents when working with Italian text.
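As a minimal sketch of accent-safe normalization (the function name normalize_italian is illustrative, not from any library), we can lowercase and strip punctuation while letting accented letters survive, since Python 3's \w is Unicode-aware:

```python
import re
import unicodedata

def normalize_italian(text: str) -> str:
    """Lowercase and strip punctuation while preserving accents."""
    text = text.lower()
    # NFC composes "e" + combining accent into a single "è" character
    text = unicodedata.normalize('NFC', text)
    # \w matches accented letters in Python 3, so "è" and "ò" survive
    text = re.sub(r"[^\w\s']", '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_italian("È però vero!"))  # "è però vero"
```

Note the regex also keeps apostrophes, which Italian elisions like "l'uomo" depend on.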
1.2 Stopwords: Words Without Informational Content
Stopwords are very frequent words that carry little semantic meaning: articles, prepositions, conjunctions. Removing them reduces data dimensionality and helps models focus on meaningful words.
Italian vs English Stopwords
| Language | Stopword Examples | Typical Count |
|---|---|---|
| English | the, is, at, which, on, a, an, and, or | ~180 words |
| Italian | il, lo, la, di, a, da, in, con, su, per, che, e, non, un | ~300 words |
Italian has more stopwords than English due to its richer set of articles (il, lo, la, i, gli, le), articulated prepositions (del, dello, della, nei, negli, nelle), and auxiliary verb forms.
# Approach 1: NLTK
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_en = set(stopwords.words('english'))
stop_it = set(stopwords.words('italian'))
print(f"English stopwords: {len(stop_en)}") # ~179
print(f"Italian stopwords: {len(stop_it)}") # ~279
text = "the cat eats the fish on the table"
tokens = text.split()
filtered = [t for t in tokens if t not in stop_en]
print(filtered)
# Output: ['cat', 'eats', 'fish', 'table']
# Approach 2: spaCy (more complete)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("the cat eats the fish on the table")
filtered_spacy = [token.text for token in doc if not token.is_stop]
print(filtered_spacy)
# Output: ['cat', 'eats', 'fish', 'table']
1.3 Stemming vs Lemmatization
Both techniques reduce words to their base form, but they do so in very different ways.
Stemming vs Lemmatization - Comparison
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Chops suffixes with heuristic rules | Uses a dictionary and morphological analysis |
| Result | Stem (not always a real word) | Lemma (a real dictionary word) |
| English Example | "running" -> "run", "better" -> "better" | "running" -> "run", "better" -> "good" |
| Italian Example | "mangiando" -> "mangi" | "mangiando" -> "mangiare" |
| Speed | Very fast | Slower (requires dictionary lookup) |
| Accuracy | Low (over-stemming is common) | High (correct forms) |
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
import spacy
# English Stemming with Porter
stemmer_en = PorterStemmer()
words_en = ["running", "runs", "ran", "easily", "fairly"]
stems_en = [stemmer_en.stem(w) for w in words_en]
print(dict(zip(words_en, stems_en)))
# {'running': 'run', 'runs': 'run', 'ran': 'ran',
# 'easily': 'easili', 'fairly': 'fairli'}
# Italian Stemming with Snowball
stemmer_it = SnowballStemmer("italian")
words_it = ["mangiando", "mangiare", "mangiato", "bellissimo"]
stems_it = [stemmer_it.stem(w) for w in words_it]
print(dict(zip(words_it, stems_it)))
# {'mangiando': 'mang', 'mangiare': 'mang',
# 'mangiato': 'mang', 'bellissimo': 'bellissim'}
# Lemmatization with spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The girls were eating the most beautiful apples")
for token in doc:
    print(f"  {token.text:15s} -> {token.lemma_:15s} ({token.pos_})")
# The -> the (DET)
# girls -> girl (NOUN)
# were -> be (AUX)
# eating -> eat (VERB)
# the -> the (DET)
# most -> most (ADV)
# beautiful -> beautiful (ADJ)
# apples -> apple (NOUN)
For languages with rich morphology like Italian, lemmatization with spaCy
is almost always preferable to stemming. The it_core_news_lg model
contains 500,000 word vectors and supports tokenization, POS tagging, dependency
parsing, NER, and lemmatization.
2. Tokenization: How Machines Read Text
Tokenization is the process of splitting text into discrete units called tokens. It is the first and most critical step in any NLP pipeline: the quality of tokenization directly influences the performance of every downstream model.
There are three fundamental approaches, each with different trade-offs.
2.1 Word-Level Tokenization
The most intuitive approach: each word becomes a token.
# Naive approach: split by whitespace
text = "Artificial intelligence is changing the world"
tokens_naive = text.split()
print(tokens_naive)
# ['Artificial', 'intelligence', 'is', 'changing', 'the', 'world']
# Better approach: spaCy (handles contractions, punctuation)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("It's a well-known fact that NLP isn't trivial")
tokens_spacy = [token.text for token in doc]
print(tokens_spacy)
# ['It', "'s", 'a', 'well', '-', 'known', 'fact', 'that',
# 'NLP', 'is', "n't", 'trivial']
# spaCy correctly handles contractions like "isn't" -> "is" + "n't"
Limitations of Word-Level Tokenization
- Huge vocabulary: Every unique word requires a vocabulary entry. English alone has hundreds of thousands of word forms; Italian has even more due to verb conjugations
- Out-of-vocabulary (OOV) words: Words never seen during training become <UNK> (unknown)
- No morphological sharing: "eat", "eating", "eaten" are three completely separate tokens with no relationship
2.2 Character-Level Tokenization
At the other extreme, each character becomes a token. The vocabulary is tiny (26 letters + digits + punctuation), but sequences become very long.
Text: "hello world"
Word-level: ["hello", "world"] -> 2 tokens
Char-level: ["h","e","l","l","o"," ","w","o","r","l","d"] -> 11 tokens
A 1,000-word text:
Word-level: ~1,000 tokens
Char-level: ~5,000 tokens (5x longer!)
Character-level tokenization solves the unknown word problem (any word can be represented), but the very long sequences make it difficult for models to capture long-range dependencies in the text.
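To make the trade-off concrete, here is a toy comparison of the two extremes in plain Python, no libraries needed:

```python
text = "hello world"

word_tokens = text.split()   # word-level
char_tokens = list(text)     # character-level

print(len(word_tokens), word_tokens)  # 2 ['hello', 'world']
print(len(char_tokens))               # 11

# The vocabulary sizes move in opposite directions:
print(len(set(word_tokens)))  # 2 distinct words
print(len(set(char_tokens)))  # 8 distinct characters (h, e, l, o, space, w, r, d)
```

Short sequences with a huge vocabulary, or long sequences with a tiny one: subword tokenization exists to sit between these extremes.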
2.3 Subword Tokenization: The Optimal Compromise
Subword tokenization is the method used by all modern models (BERT, GPT, LLaMA, T5). The idea is elegant: common words remain whole, while rare words are split into sub-units (subwords) that the model has already seen.
Subword Tokenization Algorithms
| Algorithm | Used By | Strategy | Direction |
|---|---|---|---|
| BPE (Byte Pair Encoding) | GPT-2, GPT-3, GPT-4, LLaMA, RoBERTa | Iterative merge of most frequent pairs | Bottom-up |
| WordPiece | BERT, DistilBERT, ELECTRA | Merge that maximizes likelihood | Bottom-up |
| SentencePiece | T5, ALBERT, XLNet, mBART | Treats text as raw character stream | Language-independent |
| Unigram | SentencePiece (optional), ALBERT | Starts large, removes least useful tokens | Top-down |
How BPE (Byte Pair Encoding) Works
BPE starts from individual characters and iteratively merges the most frequent pairs until the desired vocabulary size is reached.
Corpus: "low low lower lowest"
Step 0 - Initial vocabulary (characters):
l, o, w, e, r, s, t
Step 1 - Most frequent pair: (l, o) -> "lo"
lo w lo w lo w e r lo w e s t
Step 2 - Most frequent pair: (lo, w) -> "low"
low low low e r low e s t
Step 3 - Most frequent pair: (low, e) -> "lowe"
low low lowe r lowe s t
Step 4 - Most frequent pair: (lowe, r) -> "lower"
low low lower lowe s t
Final vocabulary: [l, o, w, e, r, s, t, lo, low, lowe, lower]
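The merge loop above fits in a few lines. This sketch learns merges from the same toy corpus; on ties it simply takes the first pair seen, which here reproduces the sequence shown:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

corpus = "low low lower lowest".split()
vocab = {tuple(w): f for w, f in Counter(corpus).items()}

merges = []
for _ in range(4):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)
# [('l', 'o'), ('lo', 'w'), ('low', 'e'), ('lowe', 'r')]
```

Production tokenizers add frequency thresholds, end-of-word markers, and byte-level fallbacks, but the core algorithm is exactly this greedy merge loop.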
WordPiece vs BPE
WordPiece uses an approach similar to BPE, but instead of choosing the most
frequent pair, it chooses the one that maximizes the likelihood
of the training corpus. In practice, WordPiece prefers merges that produce tokens
more useful for the language model, not simply the most common ones. Tokens that
do not start a word are prefixed with ##.
SentencePiece: True Language Independence
The key difference with SentencePiece is that it does not require pre-tokenization. BPE and WordPiece assume the text is already split into words (typically by whitespace), which works well for English and Italian but fails for languages like Chinese, Japanese, or Thai that do not use spaces between words. SentencePiece instead treats the text as a raw character stream, whitespace included, making it truly language-independent.
3. Practical Example: Tokenization with HuggingFace
Let us see concretely how BERT and GPT-2 tokenize the same text.
We will use the transformers library from HuggingFace.
from transformers import AutoTokenizer
# Example text
text = "Artificial intelligence is revolutionizing the world"
# --- BERT (WordPiece) ---
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tok.tokenize(text)
bert_ids = bert_tok.encode(text)
print("BERT tokens:", bert_tokens)
print("BERT IDs: ", bert_ids)
# BERT tokens: ['artificial', 'intelligence', 'is',
# 'revolution', '##izing', 'the', 'world']
# Note: "revolutionizing" is split into "revolution" + "##izing"
# --- GPT-2 (BPE) ---
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tok.tokenize(text)
gpt2_ids = gpt2_tok.encode(text)
print("\nGPT-2 tokens:", gpt2_tokens)
print("GPT-2 IDs: ", gpt2_ids)
# GPT-2 tokens: ['Art', 'ificial', ' intelligence', ' is',
# ' revolution', 'izing', ' the', ' world']
# --- Comparison ---
print(f"\nBERT: {len(bert_tokens)} tokens")
print(f"GPT-2: {len(gpt2_tokens)} tokens")
Key Observations
- BERT uses the ## prefix for continuation subwords (e.g., ##izing), while GPT-2 uses a space-aware encoding (the leading space is part of the token)
- The same word gets split differently depending on the tokenizer's training data and algorithm
- A tokenizer trained on the target language produces fewer tokens, meaning more context in the attention window and lower per-token costs
- BERT adds special tokens: [CLS] at the start and [SEP] at the end. GPT-2 does not
from transformers import AutoTokenizer
# Italian BERT tokenizer
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
# Vocabulary size
print(f"Italian BERT vocabulary: {tok.vocab_size} tokens")
# Output: Italian BERT vocabulary: 31102 tokens
# Special tokens
print(f"[CLS] = {tok.cls_token} (ID: {tok.cls_token_id})")
print(f"[SEP] = {tok.sep_token} (ID: {tok.sep_token_id})")
print(f"[UNK] = {tok.unk_token} (ID: {tok.unk_token_id})")
print(f"[MASK] = {tok.mask_token} (ID: {tok.mask_token_id})")
# Compare English vs Italian tokenization
text_it = "L'intelligenza artificiale sta rivoluzionando il mondo"
text_en = "Artificial intelligence is revolutionizing the world"
it_tokens = tok.tokenize(text_it)
en_tokens = tok.tokenize(text_en)
print(f"\nItalian text: {len(it_tokens)} tokens -> {it_tokens}")
print(f"English text: {len(en_tokens)} tokens -> {en_tokens}")
# Italian BERT handles Italian text efficiently (fewer tokens)
# but splits English words into more subwords
4. Bag of Words and TF-IDF: Classic Representations
Before word embeddings, text was represented as sparse vectors based on word frequencies. These methods are still used in many contexts, and understanding their limitations helps explain why embeddings were revolutionary.
4.1 Bag of Words (BoW)
The Bag of Words model represents a document as a vector where each position corresponds to a word in the vocabulary and the value is the number of times that word appears in the document.
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat', 'chases', 'dog', 'eats', 'fish', 'meat', 'the']
print("\nBoW Matrix:")
print(bow_matrix.toarray())
# [[1, 0, 0, 1, 1, 0, 2], # doc 1
# [0, 0, 1, 1, 0, 1, 2], # doc 2
# [1, 1, 1, 0, 0, 0, 2]] # doc 3
4.2 TF-IDF (Term Frequency - Inverse Document Frequency)
TF-IDF improves BoW by weighting words by their relative importance. Words that are frequent in a document but rare across the corpus receive a higher weight. Words common everywhere (like "the") receive a low weight.
TF-IDF(t, d) = TF(t, d) x IDF(t)
where:
TF(t, d) = frequency of term t in document d
IDF(t) = log(N / df(t))
N = total number of documents
df(t) = number of documents containing term t
Example:
Word "cat" in document 1:
TF = 1/5 = 0.2 (1 occurrence out of 5 words)
IDF = log(3/2) = 0.405 (appears in 2 out of 3 documents)
TF-IDF = 0.2 x 0.405 = 0.081
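The arithmetic above can be verified with a few lines of standard-library Python (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ from this textbook formula):

```python
import math

def tf(term: str, doc: str) -> float:
    """Term frequency: occurrences of term / total words in the document."""
    words = doc.split()
    return words.count(term) / len(words)

def idf(term: str, corpus: list[str]) -> float:
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(term in doc.split() for doc in corpus)
    return math.log(len(corpus) / df)

corpus = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]

tf_cat = tf("cat", corpus[0])     # 1/5 = 0.2
idf_cat = idf("cat", corpus)      # log(3/2) ~ 0.405
print(round(tf_cat * idf_cat, 3))  # 0.081
```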
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
documents = [
    "the cat eats the fish",
    "the dog eats the meat",
    "the cat chases the dog"
]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)
print("Features:", tfidf.get_feature_names_out())
print("\nTF-IDF Matrix (rounded):")
print(np.round(tfidf_matrix.toarray(), 3))
# "fish" and "meat" have higher weights because they appear
# in only one document (more discriminative)
Limitations of BoW and TF-IDF
- No semantics: "dog" and "canine" are completely different; "bank" (river) and "bank" (financial) are identical
- No word order: "the cat eats the mouse" and "the mouse eats the cat" have the same representation
- High dimensionality: A vocabulary of 100,000 words produces 100,000-dimensional vectors, almost all zeros (sparse vectors)
- No generalization: They do not capture relationships between words ("king" and "queen" have no proximity)
5. Word Embeddings: Meaning as Geometry
Word embeddings revolutionized NLP by transforming words into dense, low-dimensional vectors (typically 100-300 dimensions) that capture semantic relationships between words. Two words with similar meaning will have nearby vectors in the embedding space.
5.1 Word2Vec: The Invention that Changed Everything
Introduced by Tomas Mikolov and colleagues (Google, 2013), Word2Vec learns word vectors from the context in which words appear. The fundamental intuition is the distributional hypothesis: "a word is characterized by the company it keeps" (J.R. Firth, 1957).
Two Word2Vec Architectures
| Architecture | Input | Output | Intuition |
|---|---|---|---|
| CBOW (Continuous Bag of Words) | Context (surrounding words) | Target word | Given context "the ___ eats", predict "cat" |
| Skip-gram | Target word | Context (surrounding words) | Given word "cat", predict "the", "eats", etc. |
In practice, Skip-gram works better with small datasets and captures rare words better. CBOW is faster and works well with frequent words.
Sentence: "the black cat eats the fresh fish"
                      ^
                  target word

With window_size = 2, Skip-gram learns:
  cat -> the   (left context, distance 2)
  cat -> black (left context, distance 1)
  cat -> eats  (right context, distance 1)
  cat -> the   (right context, distance 2)
After millions of sentences, words appearing in similar
contexts will have similar vectors:
cat ~ feline ~ kitten (similar contexts: "the ___ eats")
dog ~ canine ~ puppy (similar contexts: "the ___ runs")
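Generating those (target, context) training pairs is straightforward; a sketch using the same sentence and window size:

```python
def skipgram_pairs(tokens: list[str], window: int = 2):
    """Build (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target itself is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the black cat eats the fresh fish".split()
cat_pairs = [p for p in skipgram_pairs(sentence) if p[0] == "cat"]
print(cat_pairs)
# [('cat', 'the'), ('cat', 'black'), ('cat', 'eats'), ('cat', 'the')]
```

Word2Vec feeds millions of such pairs to a shallow network; the learned input weights become the word vectors.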
5.2 The Arithmetic of Words
The most surprising property of word embeddings is that semantic relationships become algebraic operations on vectors. The famous analogy:
king - man + woman = queen
This is not a coincidence: the vector that transforms "man" into "woman" is the same one that transforms "king" into "queen". This works for many relationships: country-capital, verb-past tense, adjective-superlative.
import gensim.downloader as api
# Load pre-trained word embeddings
model = api.load("word2vec-google-news-300")
# Analogy: king - man + woman = ?
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=3
)
print("king - man + woman =")
for word, score in result:
    print(f"  {word}: {score:.4f}")
# king - man + woman =
# queen: 0.7118
# monarch: 0.6189
# princess: 0.5902
# Word similarity
print(f"\ncat ~ dog: {model.similarity('cat', 'dog'):.4f}")
print(f"cat ~ car: {model.similarity('cat', 'car'):.4f}")
# cat ~ dog: 0.7609
# cat ~ car: 0.2004
5.3 GloVe: Global Vectors for Word Representation
GloVe (Stanford, 2014) takes a different approach: instead of predicting context word by word like Word2Vec, GloVe first builds a global co-occurrence matrix of the entire corpus, then factorizes this matrix to obtain the vectors. It combines the advantages of global statistics-based methods with the local learning of Word2Vec.
Word2Vec vs GloVe
| Aspect | Word2Vec | GloVe |
|---|---|---|
| Method | Predictive (neural network) | Count-based (matrix factorization) |
| Context | Local window | Global corpus statistics |
| Training | Online (streams through text) | Batch (full matrix) |
| Common Dimensions | 100, 200, 300 | 50, 100, 200, 300 |
| Performance | Excellent for analogies | Excellent for similarity |
6. Contextual Embeddings: Same Word, Different Meanings
Word2Vec and GloVe have a fundamental limitation: they assign a single vector to each word, regardless of context. But language is full of ambiguity: the word "bank" has a completely different meaning in "river bank" vs "bank account".
Contextual embeddings solve this problem: the vector for a word depends on the entire sentence in which it appears. This is the approach used by BERT, GPT, and all Transformer-based models.
Static vs Contextual Embeddings
| Aspect | Static (Word2Vec, GloVe) | Contextual (BERT, GPT) |
|---|---|---|
| Vector per word | One fixed, always the same | Different depending on context |
| Polysemy | Not handled ("bank" has one vector) | Handled ("bank" gets different vectors per meaning) |
| Model | Lookup table | Deep neural network (Transformer) |
| Model size | A few MB | Hundreds of MB or GB |
| Speed | Instantaneous | Requires forward pass |
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
def get_word_embedding(sentence: str, word: str):
    """Get the contextual embedding of a word in a sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.tokenize(sentence)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state: [batch, seq_len, hidden_dim]
    embeddings = outputs.last_hidden_state[0]  # [seq_len, 768]
    # Find the index of the target token
    # (assumes the word survives as a single token; subword splits
    # would require averaging the pieces)
    word_idx = tokens.index(word) + 1  # +1 for [CLS]
    return embeddings[word_idx]
# "bank" in different contexts
emb_river = get_word_embedding("I walked along the river bank", "bank")
emb_money = get_word_embedding("I deposited money at the bank", "bank")
emb_data = get_word_embedding("The data bank stores records", "bank")
# Calculate cosine similarity
sim_12 = F.cosine_similarity(emb_river.unsqueeze(0), emb_money.unsqueeze(0))
sim_13 = F.cosine_similarity(emb_river.unsqueeze(0), emb_data.unsqueeze(0))
sim_23 = F.cosine_similarity(emb_money.unsqueeze(0), emb_data.unsqueeze(0))
print(f"bank(river) ~ bank(money): {sim_12.item():.4f}")
print(f"bank(river) ~ bank(data): {sim_13.item():.4f}")
print(f"bank(money) ~ bank(data): {sim_23.item():.4f}")
# Vectors will be DIFFERENT because BERT understands context!
This is the fundamental conceptual leap: with BERT, the word "bank" no longer has a fixed meaning. Its vector changes based on what surrounds it, just as it does in human language understanding.
7. Sentence Embeddings: One Vector for an Entire Sentence
Often we do not need the embedding of a single word, but of an entire sentence or paragraph. Sentence embeddings compress the meaning of a text of any length into a single fixed-dimension vector.
The reference model is Sentence-BERT (SBERT), which modifies the
BERT architecture to produce sentence embeddings optimized for similarity comparison.
For multilingual use, paraphrase-multilingual-MiniLM-L12-v2 supports
50+ languages with 384-dimensional vectors.
Sentence Embedding Applications
| Application | Description | How It Works |
|---|---|---|
| Semantic Search | Search by meaning, not keywords | Embed the query, find closest documents |
| Clustering | Group similar texts automatically | K-means or HDBSCAN on embeddings |
| Duplicate Detection | Find duplicates or near-duplicates | Cosine similarity threshold > 0.9 |
| Zero-Shot Classification | Classify without training data | Compare text embedding with label embeddings |
from sentence_transformers import SentenceTransformer, util
# Multilingual model (supports 50+ languages including Italian)
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# Sentences in English
sentences = [
    "The cat sleeps on the couch",
    "A feline rests on the sofa",
    "The stock market went up today",
    "Financial markets are growing",
    "I bought a new laptop computer"
]
# Generate embeddings (1 vector per sentence, 384 dimensions)
embeddings = model.encode(sentences, convert_to_tensor=True)
print(f"Shape: {embeddings.shape}") # [5, 384]
# Calculate similarity matrix
cosine_scores = util.cos_sim(embeddings, embeddings)
print("\nSimilarity Matrix:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"  {sentences[i][:40]:40s} <-> {sentences[j][:40]:40s}")
        print(f"    Similarity: {cosine_scores[i][j]:.4f}")
# Expected results:
# "cat sleeps" <-> "feline rests" : ~0.85 (very similar)
# "stock market" <-> "markets growing" : ~0.70 (correlated)
# "cat sleeps" <-> "stock market" : ~0.10 (unrelated)
8. The Modern NLP Pipeline
We have seen the individual components. Now let us put them together to understand how a modern end-to-end NLP pipeline works -- the one used by BERT, GPT, and all Transformer-based models.
Raw Text
|
v
[1. TOKENIZATION]
Input: "Artificial intelligence is amazing"
Output: ["artificial", "intelligence", "is", "amazing"]
|
v
[2. ENCODING (Token -> ID)]
Input: ["artificial", "intelligence", "is", "amazing"]
Output: [101, 7976, 4454, 2003, 6429, 102]
^[CLS] ^[SEP]
|
v
[3. EMBEDDING LAYER]
Input: [101, 7976, 4454, 2003, 6429, 102]
Output: Matrix [6 x 768] - one 768-dim vector per token
|
v
[4. TRANSFORMER ENCODER/DECODER]
Self-Attention: every token "looks at" all others
Input: Matrix [6 x 768]
Output: Matrix [6 x 768] (contextualized vectors)
|
v
[5. TASK HEAD]
Classification: [CLS] embedding -> softmax -> class
NER: each token -> entity label
Generation: last token -> next token
QA: start/end answer position
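Step 3 is nothing more than a table lookup: each token ID selects one row of a learned matrix. A toy sketch (8 dimensions instead of BERT's 768, and random values standing in for learned weights):

```python
import random

random.seed(0)
vocab_size, hidden_dim = 30522, 8  # BERT's vocab size; 8 dims for illustration

# The embedding layer is a lookup table: one learned vector per token ID.
# Random values stand in for the trained weights here.
embedding_table = [[random.random() for _ in range(hidden_dim)]
                   for _ in range(vocab_size)]

token_ids = [101, 7976, 4454, 2003, 6429, 102]  # output of step 2
embedded = [embedding_table[tid] for tid in token_ids]

print(len(embedded), len(embedded[0]))  # 6 8 -> a [6 x 8] matrix, one row per token
```

Real models implement this as a single matrix multiplication against one-hot vectors (or an indexed lookup), and the table is updated by backpropagation like any other layer.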
The Role of the [CLS] Token
In BERT, the special [CLS] token is inserted at the beginning of every
input. After passing through all Transformer layers, its embedding represents the
entire sequence. It is used as input for classification tasks
(sentiment analysis, spam detection, etc.).
9. NLP for Italian: Specifics and Tools
The Italian language presents unique NLP challenges that distinguish it from English and other languages. Understanding these specifics is essential for building effective NLP systems for Italian.
9.1 Italian Linguistic Challenges
Italian-Specific NLP Challenges
| Challenge | Description | Example |
|---|---|---|
| Rich morphology | Each verb has dozens of conjugated forms | "mangiare" (to eat) has 50+ forms |
| Elisions and apostrophes | Articles and prepositions merge | "l'uomo" (the man), "dell'arte" (of art) |
| Articulated prepositions | Preposition + article in one word | "del" (of the), "nello" (in the), "sulla" (on the) |
| Meaningful accents | Change word meaning | "e" (and) vs "è" (is), "da" (from) vs "dà" (gives) |
| Clitic pronouns | Attach to the verb | "dammelo" (give+me+it), "portarglielo" (bring+him+it) |
| Free word order | SVO is not mandatory | "La torta la mangia Marco" = "Marco mangia la torta" |
9.2 Pre-trained Models for Italian
Main Italian NLP Models
| Model | Base | Task | Repository |
|---|---|---|---|
| dbmdz/bert-base-italian-cased | BERT | General-purpose Italian NLP | HuggingFace |
| AlBERTo | BERT | Italian social media (Twitter) | HuggingFace |
| feel-it-italian-sentiment | UmBERTo | Italian sentiment analysis | MilaNLProc |
| feel-it-italian-emotion | UmBERTo | Emotion detection (joy, anger, fear, sadness) | MilaNLProc |
| Italian-Legal-BERT | BERT | Italian legal texts | dlicari |
| DeepMount00/Italian_NER_XXL | BERT | Italian Named Entity Recognition | HuggingFace |
| it_core_news_lg | spaCy CNN | Tokenization, POS, NER, lemma, parsing | spaCy |
9.3 Italian-Specific Preprocessing
import spacy
import re
class ItalianPreprocessor:
    """Preprocessing pipeline specific to the Italian language."""

    def __init__(self):
        self.nlp = spacy.load("it_core_news_lg")
        # Additional regional/informal stopwords
        self.custom_stops = {
            "cioè", "quindi", "comunque", "praticamente",
            "allora", "insomma", "magari", "ecco", "tipo",
            "boh", "mah", "vabbè", "ok", "okay"
        }

    def preprocess(self, text: str, remove_stops: bool = True,
                   lemmatize: bool = True) -> list[str]:
        """Complete preprocessing for Italian text."""
        # 1. Basic normalization
        text = text.lower()
        text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
        text = re.sub(r"[^\w\s']", '', text)          # keep apostrophes
        text = re.sub(r'\d+', '', text)               # remove numbers
        text = re.sub(r'\s+', ' ', text).strip()
        # 2. spaCy analysis
        doc = self.nlp(text)
        # 3. Filtering and lemmatization
        tokens = []
        for token in doc:
            # Skip punctuation and spaces
            if token.is_punct or token.is_space:
                continue
            # Skip stopwords if requested
            if remove_stops and (token.is_stop or
                                 token.text in self.custom_stops):
                continue
            # Lemmatize or use original form
            word = token.lemma_ if lemmatize else token.text
            if len(word) > 1:  # skip single characters
                tokens.append(word)
        return tokens
# Usage example
prep = ItalianPreprocessor()
text = """L'intelligenza artificiale sta rivoluzionando
il modo in cui le aziende italiane gestiscono i loro
processi, cioè praticamente tutto sta cambiando."""
result = prep.preprocess(text)
print("Processed tokens:", result)
# ['intelligenza', 'artificiale', 'rivoluzionare', 'modo',
# 'azienda', 'italiano', 'gestire', 'processo', 'cambiare']
10. End-to-End Example: Semantic Search
Let us put everything we have learned together in a complete example: a semantic search engine over a corpus of texts. Given a set of documents and a user query, we will find the most relevant documents using sentence embeddings.
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearch:
    """Semantic search engine using sentence embeddings."""

    def __init__(self, model_name: str =
                 "paraphrase-multilingual-MiniLM-L12-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents: list[str] = []
        self.embeddings = None

    def index_documents(self, documents: list[str]) -> None:
        """Index documents by computing their embeddings."""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")
        print(f"Embeddings shape: {self.embeddings.shape}")

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        """Search for the most relevant documents for the query."""
        query_embedding = self.model.encode(
            query, convert_to_tensor=True
        )
        scores = util.cos_sim(query_embedding, self.embeddings)[0]
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": round(score.item(), 4),
                "index": idx.item()
            })
        return results
# --- Usage Example ---
corpus = [
    "Python is a versatile and easy-to-learn programming language",
    "Machine learning allows computers to learn from data",
    "Angular is a framework for building modern web applications",
    "Carbonara pasta is a typical dish of Roman cuisine",
    "Relational databases use SQL to query data",
    "Deep learning uses neural networks with many hidden layers",
    "Rome is the capital of Italy with a thousand-year history",
    "REST APIs enable communication between web services",
    "Natural Language Processing analyzes and understands text",
    "Neapolitan pizza became a UNESCO heritage in 2017"
]
# Create search engine and index documents
search_engine = SemanticSearch()
search_engine.index_documents(corpus)
# Run some searches
queries = [
    "how to analyze natural language",
    "frontend development framework",
    "traditional Italian food"
]
for query in queries:
    print(f"\nQuery: '{query}'")
    print("-" * 60)
    results = search_engine.search(query, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"  {i}. [{r['score']:.4f}] {r['document']}")
Expected Results
Semantic search understands meaning, not just words. For example:
- "how to analyze natural language" will find the NLP document even if it does not contain those exact words
- "frontend development framework" will find Angular, even though "frontend" does not appear in the document (but "modern web applications" is semantically related)
- "traditional Italian food" will find both carbonara and pizza, because the model understands the semantic relationship
Roadmap: From Here to LLMs
In this article, we have built the foundations of modern NLP, starting from text preprocessing through contextual embeddings and the complete pipeline. Let us summarize the journey we have taken:
Concept Summary
| Concept | What It Does | Evolution |
|---|---|---|
| Preprocessing | Cleans and normalizes raw text | Manual rules -> spaCy pipeline |
| Tokenization | Splits text into discrete units | Word -> Char -> Subword (BPE/WordPiece) |
| BoW / TF-IDF | Represents text as sparse vectors | Simple but no semantics |
| Word Embeddings | Dense vectors capturing meaning | Word2Vec -> GloVe -> FastText |
| Contextual Embeddings | Vectors that depend on context | ELMo -> BERT -> GPT |
| Sentence Embeddings | One vector for an entire sentence | Mean pooling -> Sentence-BERT |
In the next article, we will make the leap to the architecture that revolutionized everything: the Transformer. We will explore the Self-Attention mechanism in detail, understand why BERT was a turning point, and learn how to use it for real tasks like text classification and question answering.
Resources for Further Study
- spaCy Italian Models: Official documentation for Italian spaCy models (spacy.io/models/it)
- HuggingFace Models: Repository of pre-trained Italian models (huggingface.co/models?language=it)
- Sentence-BERT: sentence-transformers documentation (sbert.net)
- FEEL-IT: Sentiment analysis and emotion classification for Italian (MilaNLProc)
- Word2Vec Paper: "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
- GloVe Paper: "Global Vectors for Word Representation" (Pennington et al., 2014)
- BERT Paper: "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019)