NLP for Italian Language: Specific Challenges and Solutions
Italian is one of the most morphologically rich Romance languages: grammatical gender, noun and adjective inflection, adjective-noun agreement, irregular verb forms, and flexible syntax make NLP preprocessing and modeling significantly more challenging than English. Yet the vast majority of NLP tutorials are in English, and the most well-known models are often optimized for English.
This article fills that gap. We explore the specific challenges of Italian, available datasets, Italian BERT models (feel-it, AlBERTo, dbmdz BERT), Italian-specific preprocessing, and how to build a complete sentiment analysis system for Italian step by step.
This is the fourth article in the Modern NLP: from BERT to LLMs series, and the one that covers Italian NLP preprocessing and modeling end to end.
What You Will Learn
- Morphological challenges of Italian: gender, inflection, irregular verbs
- Italian-specific preprocessing: stopwords, spaCy lemmatization, normalization
- Italian BERT models: feel-it-italian-sentiment, AlBERTo, dbmdz BERT, GilBERTo
- Italian datasets: SENTIPOLC, TweetSent-IT, ItalianSentiment
- Fine-tuning feel-it on custom domain data
- Handling colloquial language, dialects, and Italian neologisms
- Complete production pipeline for Italian sentiment analysis
- Comparing Italian models vs multilingual BERT
1. Specific Challenges of Italian in NLP
Italian has linguistic characteristics that make NLP more complex than English. Understanding these challenges is fundamental to building effective systems.
1.1 Rich Morphology
Unlike English, Italian has very rich morphology: the same verb root generates dozens of inflected forms, and adjectives must agree in gender and number with nouns. This creates data sparsity problems.
Example: The Italian Verb "Andare" (to go)
- vado, vai, va, andiamo, andate, vanno (present)
- andavo, andavi, andava, andavamo, andavate, andavano (imperfect)
- andrò, andrai, andrà, andremo, andrete, andranno (future)
- andai, andasti, andò, andammo, andaste, andarono (passato remoto)
- sia andato/a, siano andati/e (subjunctive past)
In English, "to go" has very few forms. For an NLP model, each form is initially a different token.
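The sparsity effect can be seen concretely in a toy sketch. The mini lemma dictionary below is hand-built for this example (a real system would use spaCy's lemmatizer): many distinct surface tokens collapse to a single vocabulary entry.

```python
# Toy illustration of Italian data sparsity: many surface forms, one lemma.
forms = ["vado", "vai", "va", "andiamo", "andavo", "andrò", "andai"]

# Hypothetical mini-lexicon mapping inflected forms to their lemma
mini_lexicon = {f: "andare" for f in forms}

surface_vocab = set(forms)
lemma_vocab = {mini_lexicon[f] for f in forms}

print(f"Surface forms: {len(surface_vocab)}")  # 7 distinct tokens
print(f"Lemmas: {len(lemma_vocab)}")           # 1 after lemmatization
```

Lemmatization trades some nuance (tense, person) for a denser vocabulary, which is why it often helps classical models more than subword-based transformers.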
1.2 Enclitic Pronouns and Compound Words
In Italian, pronouns can be attached to verbs (enclitic), creating complex tokens that standard tokenizers may handle poorly.
# Common issues with tokenizers for Italian
# Enclitic pronouns attached to verbs
examples = [
    "Dimmelo",      # dimmi + lo
    "Portarmelo",   # portare + mi + lo
    "Fallo",        # fa' + lo
    "Dateglielo",   # date + glie + lo
]
# Incorrect tokenization with non-Italian tokenizers
from transformers import BertTokenizer
tokenizer_en = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_it = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-cased')
word = "Dimmelo"
print(f"EN tokenizer: {tokenizer_en.tokenize(word)}")
# ['dim', '##mel', '##o'] - misses the structure
print(f"IT tokenizer: {tokenizer_it.tokenize(word)}")
# ['Dim', '##me', '##lo'] - better but not perfect
# The optimal solution is lemmatization before tokenization
1.3 Informal Orthography and Dialects
Italian online text (social media, reviews) commonly features:
- Accents replaced by apostrophes: "puo'" instead of "può"
- Repeated characters: "bellissimoooo!!!"
- Common abbreviations: "cmq" (comunque = anyway), "nn" (non = not), "xke" (perché = because)
- Code-switching with English: "Il prodotto è davvero top quality"
- Regional dialectalisms: "mizzica" (Sicilian), "mannaggia" (Southern Italian)
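The first pattern, apostrophes standing in for accents, can be normalized with a small lookup. This is a sketch covering only a handful of frequent words (the map is illustrative, not exhaustive; full coverage needs a lexicon):

```python
import re

# Common apostrophe-for-accent substitutions seen in informal Italian.
# Illustrative map: only a few frequent words are covered.
APOSTROPHE_ACCENTS = {
    "puo'": "può",
    "e'": "è",
    "perche'": "perché",
    "piu'": "più",
    "gia'": "già",
}

def fix_apostrophe_accents(text: str) -> str:
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in APOSTROPHE_ACCENTS) + r")"
    def repl(match):
        return APOSTROPHE_ACCENTS[match.group(0).lower()]
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(fix_apostrophe_accents("Il telefono e' bello ma puo' migliorare"))
# → Il telefono è bello ma può migliorare
```

The `\b` anchor keeps the substitution from firing inside longer words (e.g. the "e'" in "che'" is left alone).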
2. Italian-Specific Preprocessing
2.1 spaCy for Italian
spaCy offers Italian models (it_core_news_sm/md/lg) with lemmatization, POS tagging, and dependency parsing.
# Install Italian model: python -m spacy download it_core_news_lg
import spacy
nlp = spacy.load("it_core_news_lg")
def preprocess_italian(text: str,
                       remove_stopwords: bool = True,
                       lemmatize: bool = True) -> str:
    """Complete preprocessing for Italian texts."""
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Skip punctuation and whitespace tokens
        if token.is_punct or token.is_space:
            continue
        # Remove Italian stopwords
        if remove_stopwords and token.is_stop:
            continue
        # Lemmatize, otherwise keep the lowercased surface form
        if lemmatize:
            word = token.lemma_.lower()
        else:
            word = token.text.lower()
        tokens.append(word)
    return ' '.join(tokens)
# Test
texts = [
    "I prodotti sono stati consegnati rapidamente e tutto funzionava perfettamente",
    "Ho comprato questo telefono tre mesi fa e sono rimasto deluso dalla batteria",
    "PRODOTTO FANTASTICO! Lo consiglio assolutamente a tutti voi amici!!!"
]
for text in texts:
    processed = preprocess_italian(text)
    print(f"Original: {text}")
    print(f"Processed: {processed}")
    print()
2.2 Normalizing Informal Italian Text
import re
import unicodedata
def normalize_italian_text(text: str) -> str:
    """
    Normalization for informal Italian texts (social media, reviews).
    """
    # 1. Normalize unicode (accents)
    text = unicodedata.normalize('NFC', text)
    # 2. Expand common Italian abbreviations
    abbreviations = {
        r'\bcmq\b': 'comunque',
        r'\bnn\b': 'non',
        r'\bxke\b': 'perché',
        r'\bxche\b': 'perché',
        r'\bx\b': 'per',
        r'\bke\b': 'che',
        r'\bkm\b': 'come',
        r'\bqs\b': 'questo',
        r'\btv\b': 'televisione',
        r'\bgg\b': 'giorni',
        r'\bprof\b': 'professore',
    }
    for abbr, expanded in abbreviations.items():
        text = re.sub(abbr, expanded, text, flags=re.IGNORECASE)
    # 3. Reduce excessive character repetitions (max 2)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # 4. Normalize multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Test
informal_texts = [
    "cmq il prodotto e' fantasticooo!!!",
    "nn mi e piaciuto x niente, sto cercando di restituirlo xke nn funziona",
    "Amici... COMPRATE QUESTOOO!!! e' il TOP del TOP!!!",
]
for text in informal_texts:
    normalized = normalize_italian_text(text)
    print(f"Original: {text}")
    print(f"Normalized: {normalized}")
    print()
3. Italian BERT Models
Several BERT models pre-trained on Italian corpora are available. The choice depends on the domain and the specific task.
3.1 feel-it-italian-sentiment
feel-it is a dataset and model specifically for sentiment analysis and emotion detection in Italian. It is Twitter-based and was trained on manual annotations for sentiment (positive/negative) and emotions (joy, sadness, anger, fear, disgust, surprise).
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
# feel-it for sentiment (positive/negative)
sentiment_model = pipeline(
    "text-classification",
    model="MilaNLProc/feel-it-italian-sentiment",
    tokenizer="MilaNLProc/feel-it-italian-sentiment"
)
# feel-it for emotions (joy, sadness, anger, fear, disgust, surprise)
emotion_model = pipeline(
    "text-classification",
    model="MilaNLProc/feel-it-italian-emotion",
    tokenizer="MilaNLProc/feel-it-italian-emotion"
)
# Test on Italian texts
texts = [
    "Sono molto felice del mio acquisto, qualità eccellente!",
    "Ho perso tutto il mio lavoro, sono devastato.",
    "Questa e la situazione più ridicola che abbia mai visto.",
    "Non credevo che potesse funzionare cosi bene, sono stupito!",
]
print("=== SENTIMENT ===")
for text in texts:
    result = sentiment_model(text)[0]
    print(f"  [{result['label']}: {result['score']:.3f}] {text[:60]}")
print("\n=== EMOTION ===")
for text in texts:
    result = emotion_model(text)[0]
    print(f"  [{result['label']}: {result['score']:.3f}] {text[:60]}")
3.2 AlBERTo: BERT for Italian Social Media
AlBERTo was pre-trained on a corpus of Italian tweets (over 200 million tweets). It is particularly effective for informal text, social media, and colloquial Italian.
from transformers import AutoTokenizer, AutoModel
import torch
# AlBERTo - uncased BERT for Italian Twitter
alberto_name = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_Italian_alb3rt0"
tokenizer = AutoTokenizer.from_pretrained(alberto_name)
model = AutoModel.from_pretrained(alberto_name)
# Test tokenization on colloquial text
informal_texts = [
    "PRODOTTO TOP! ma la spedizione ha fatto schifo cmq",
    "mizzica quanto e bello sto telefono!! ci ho messo 2gg ma ne valeva la pena",
    "ok mi avete rotto... non lo compro più #delusione",
]
for text in informal_texts:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text[:50]}")
    print(f"Tokens ({len(tokens)}): {tokens[:10]}...")
    print()
# Embedding extraction
def get_sentence_embedding(text, model, tokenizer, pooling='cls'):
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    if pooling == 'cls':
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token
    elif pooling == 'mean':
        # Mean pooling over non-padding tokens
        mask = inputs['attention_mask'].unsqueeze(-1)
        return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    raise ValueError(f"Unknown pooling: {pooling}")

emb = get_sentence_embedding(informal_texts[0], model, tokenizer)
print(f"Embedding shape: {emb.shape}")  # (1, 768)
3.3 dbmdz BERT Italian
dbmdz/bert-base-italian-cased was pre-trained on an Italian Wikipedia dump and OPUS corpora. It is a strong starting point for formal text (news, legal documents, academic writing).
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset
import torch
# Base model for Italian
MODEL = "dbmdz/bert-base-italian-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL)
# Create a sentiment classifier for Italian
model = BertForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# Sample Italian training dataset
train_data = {
    "text": [
        "Il prodotto e arrivato in perfette condizioni, molto soddisfatto",
        "qualità pessima, si e rotto dopo due giorni",
        "Eccellente rapporto qualità/prezzo, lo consiglio",
        "Imballaggio scarso, prodotto danneggiato alla consegna",
        "Supera le aspettative, ottimo acquisto",
        "Servizio clienti inesistente, rimborso impossibile",
        "Materiali di qualità, costruzione solida",
        "Non corrisponde alla descrizione, immagine ingannevole",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0]
}

def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=128)
dataset = Dataset.from_dict(train_data)
tokenized = dataset.map(tokenize_fn, batched=True)
# Quick training run (tiny dataset, so only a few epochs)
args = TrainingArguments(
    output_dir="./models/bert-italian-sentiment",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    save_steps=100,
    logging_steps=10,
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
)
trainer.train()
3.4 Comparing Italian Models
Which Italian Model to Use?
| Model | Best Domain | Best Tasks | Size |
|---|---|---|---|
| feel-it-sentiment | Social media, opinions | Sentiment, emotion detection | ~440MB |
| feel-it-emotion | Social media, opinions | 6 basic emotions | ~440MB |
| AlBERTo | Twitter, chat, SMS | Sentiment, NER, classification | ~420MB |
| dbmdz BERT cased | News, formal documents | NER, classification, QA | ~420MB |
| GilBERTo | General Italian text | General NLU tasks | ~440MB |
| mBERT | Cross-lingual | Multilingual transfer learning | ~670MB |
4. Italian Datasets for Sentiment Analysis
from datasets import load_dataset
# SENTIPOLC 2016 - Italian dataset for polarity detection on Twitter
# Available at: http://www.di.unito.it/~tutreeb/sentipolc-evalita16/
# Labels: OBJ (objective), POS (positive), NEG (negative), MIX
# Example of loading an Italian dataset from the HuggingFace Hub
# (ItaCoLA contains acceptability judgments, not sentiment labels)
try:
    dataset = load_dataset("gsarti/itacola")
    print("ItaCoLA dataset:", dataset)
except Exception:
    print("Dataset not directly available, use manual URL")
# Building a custom dataset from CSV
import pandas as pd
from datasets import Dataset
# Expected format: columns 'text' and 'label'
def load_italian_dataset(csv_path):
    df = pd.read_csv(csv_path)
    # Validation
    assert 'text' in df.columns, "Missing 'text' column"
    assert 'label' in df.columns, "Missing 'label' column"
    # Remove rows with empty text
    df = df.dropna(subset=['text', 'label'])
    df = df[df['text'].str.strip() != '']
    # Normalize labels
    label_map = {
        'positivo': 1, 'pos': 1, '1': 1, 1: 1,
        'negativo': 0, 'neg': 0, '0': 0, 0: 0
    }
    df['label'] = df['label'].map(label_map)
    df = df.dropna(subset=['label'])
    df['label'] = df['label'].astype(int)
    return Dataset.from_pandas(df[['text', 'label']])
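The label-normalization step inside load_italian_dataset can be checked in isolation. A minimal sketch with an in-memory DataFrame (the sample rows are invented; column names text/label match the function's expected format):

```python
import pandas as pd

# In-memory stand-in for a CSV with mixed label spellings
df = pd.DataFrame({
    "text": ["ottimo", "pessimo", "buono", "da buttare", "boh"],
    "label": ["positivo", "neg", "1", 0, "sconosciuto"],
})

# Same mapping used in load_italian_dataset
label_map = {
    'positivo': 1, 'pos': 1, '1': 1, 1: 1,
    'negativo': 0, 'neg': 0, '0': 0, 0: 0
}
df['label'] = df['label'].map(label_map)
df = df.dropna(subset=['label'])   # unmapped labels become NaN and are dropped
df['label'] = df['label'].astype(int)

print(df['label'].tolist())  # [1, 0, 1, 0]
```

Note that the row with the unknown label "sconosciuto" is silently dropped; in production you may prefer to log such rows for inspection.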
5. Complete Italian Sentiment Pipeline
Let's integrate everything into a production-ready pipeline for Italian sentiment analysis.
import re
import spacy
from transformers import pipeline as hf_pipeline
from typing import Optional
import unicodedata
class ItalianSentimentPipeline:
    """
    Complete pipeline for Italian sentiment analysis.
    Combines Italian-specific preprocessing with feel-it for sentiment.
    """
    def __init__(self,
                 sentiment_model: str = "MilaNLProc/feel-it-italian-sentiment",
                 emotion_model: Optional[str] = "MilaNLProc/feel-it-italian-emotion",
                 use_spacy: bool = True,
                 confidence_threshold: float = 0.6):
        # Load sentiment and emotion models
        self.sentiment = hf_pipeline(
            "text-classification",
            model=sentiment_model,
            truncation=True,
            max_length=128
        )
        self.emotion = hf_pipeline(
            "text-classification",
            model=emotion_model,
            truncation=True,
            max_length=128
        ) if emotion_model else None
        # spaCy for advanced preprocessing
        if use_spacy:
            try:
                self.nlp = spacy.load("it_core_news_sm")
            except OSError:
                print("spaCy model 'it_core_news_sm' not found.")
                print("Install with: python -m spacy download it_core_news_sm")
                self.nlp = None
        else:
            self.nlp = None
        self.threshold = confidence_threshold

    def preprocess(self, text: str) -> str:
        """Italian-specific preprocessing."""
        if not text or not text.strip():
            return ""
        # Normalize unicode
        text = unicodedata.normalize('NFC', text)
        # Common Italian abbreviations
        abbr_map = {
            r'\bcmq\b': 'comunque',
            r'\bnn\b': 'non',
            r'\bxke\b': 'perché',
            r'\bx\b': 'per',
        }
        for pattern, replacement in abbr_map.items():
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        # Reduce repeated characters
        text = re.sub(r'(.)\1{2,}', r'\1\1', text)
        # Normalize spaces
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def analyze(self, text: str) -> dict:
        """Full analysis: sentiment + emotion + preprocessing."""
        if not text or not text.strip():
            return {"error": "Empty text"}
        preprocessed = self.preprocess(text)
        # Sentiment
        sent_result = self.sentiment(preprocessed)[0]
        result = {
            "original_text": text,
            "preprocessed_text": preprocessed,
            "sentiment": sent_result['label'],
            "sentiment_score": round(sent_result['score'], 4),
            "confident": sent_result['score'] >= self.threshold
        }
        # Emotion (if available)
        if self.emotion:
            em_result = self.emotion(preprocessed)[0]
            result["emotion"] = em_result['label']
            result["emotion_score"] = round(em_result['score'], 4)
        return result

    def analyze_batch(self, texts: list) -> list:
        return [self.analyze(t) for t in texts]
# Usage
pipeline = ItalianSentimentPipeline()
test_texts = [
    "Il prodotto e arrivato in perfette condizioni, sono molto soddisfatto dell'acquisto!",
    "Pessima esperienza. Il pacco era danneggiato e il servizio clienti non risponde.",
    "Mah, diciamo che si poteva fare meglio. Non e ne buono ne cattivo.",
    "INCREDIBILE! Non avrei mai pensato che fosse cosi bello!!! Sto piangendo di gioia",
    "Nn ci credo... mi ha di nuovo fregato sto negozio di schifo",
]
for text in test_texts:
    result = pipeline.analyze(text)
    print(f"Text: {text[:60]}...")
    print(f"Sentiment: {result['sentiment']} ({result['sentiment_score']:.3f})")
    if 'emotion' in result:
        print(f"Emotion: {result['emotion']} ({result['emotion_score']:.3f})")
    print(f"Confident: {result['confident']}")
    print()
6. Domain-Specific Fine-tuning
feel-it was trained on Twitter. For specific domains such as product reviews, medical comments, or legal text, additional fine-tuning is often necessary.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import Dataset
import evaluate
import numpy as np
# Strategy 1: Fine-tune feel-it on domain data
def finetune_for_domain(
    base_model: str,
    train_texts: list,
    train_labels: list,
    val_texts: list,
    val_labels: list,
    output_dir: str,
    num_epochs: int = 3
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=2,
        ignore_mismatched_sizes=True  # for already fine-tuned models
    )

    def tokenize(examples):
        return tokenizer(examples["text"], truncation=True,
                         padding="max_length", max_length=128)

    train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
    val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels})
    train_tok = train_ds.map(tokenize, batched=True)
    val_tok = val_ds.map(tokenize, batched=True)

    accuracy = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return accuracy.compute(predictions=preds, references=labels)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        report_to="none"
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_tok,
        eval_dataset=val_tok,
        compute_metrics=compute_metrics
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return trainer
# Strategy 2: Compare Italian models
from transformers import pipeline as hf_pipeline

def compare_italian_models(texts, true_labels):
    """Automatic comparison of different Italian BERT models."""
    models = {
        "feel-it": "MilaNLProc/feel-it-italian-sentiment",
        "AlBERTo-base": "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_Italian_alb3rt0",
        "mBERT": "bert-base-multilingual-cased"
    }
    results = {}
    for name, model_id in models.items():
        try:
            clf = hf_pipeline("text-classification", model=model_id,
                              truncation=True, max_length=128)
            preds = clf(texts)
            # Note: models without a fine-tuned sentiment head (base AlBERTo,
            # mBERT) return untrained LABEL_0/LABEL_1 predictions here
            pred_labels = [1 if p['label'].upper().startswith('POS') else 0
                           for p in preds]
            acc = sum(p == t for p, t in zip(pred_labels, true_labels)) / len(true_labels)
            results[name] = acc
            print(f"{name}: accuracy={acc:.4f}")
        except Exception as e:
            print(f"{name}: error - {e}")
    return results
7. Handling Dialects and Regional Varieties
Italy has a strong dialectal tradition. Social media posts, reviews, and informal messages often mix standard Italian and dialect, especially southern dialects (Neapolitan, Sicilian, Barese, Calabrian).
Strategies for Dialectal Text
- Light normalization: convert the most common dialectal forms to standard Italian (e.g., "maje" → "mai" in Neapolitan)
- Use AlBERTo: trained on Twitter, it includes many dialectal forms given the nature of Italian social media
- Multilingual BERT: sometimes handles dialects better as "unknown languages" compared to Italian-specific models that expect standard Italian
- Domain-specific data collection: if your dataset contains many dialectalisms, collect annotated examples for fine-tuning
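The first strategy, light normalization, can be sketched as a lookup table. The dialect-to-standard pairs below are illustrative examples only, not a vetted lexicon; real coverage requires annotated dialect data:

```python
import re

# Illustrative dialect-to-standard substitutions (tiny hand-made list,
# not a vetted lexicon)
DIALECT_MAP = {
    r"\bmaje\b": "mai",           # Neapolitan
    r"\bguaglione\b": "ragazzo",  # Neapolitan
    r"\bpicciriddu\b": "bambino", # Sicilian
}

def normalize_dialect(text: str) -> str:
    for pattern, standard in DIALECT_MAP.items():
        text = re.sub(pattern, standard, text, flags=re.IGNORECASE)
    return text

print(normalize_dialect("nun te scurdà maje"))  # → nun te scurdà mai
```

A table like this only dents the problem (the surrounding dialectal grammar is untouched), which is why the fine-tuning strategy below usually matters more.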
8. Benchmarking and Metrics for Italian
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
def benchmark_italian_sentiment(model_pipeline, test_data):
    """
    Complete benchmark for Italian sentiment models.
    test_data: list of tuples (text, label)
    """
    texts = [d[0] for d in test_data]
    true_labels = [d[1] for d in test_data]
    predictions = model_pipeline(texts)
    pred_labels = []
    for pred in predictions:
        label = pred['label'].upper()
        if label in ['POSITIVE', 'POSITIVO', 'POS']:
            pred_labels.append(1)
        else:
            pred_labels.append(0)

    print("=== CLASSIFICATION REPORT ===")
    print(classification_report(
        true_labels, pred_labels,
        target_names=['NEGATIVE', 'POSITIVE'],
        digits=4
    ))

    # Analysis by text length (a rough proxy for register)
    categories = {
        'long': [i for i, t in enumerate(texts) if len(t.split()) > 20],
        'short': [i for i, t in enumerate(texts) if len(t.split()) <= 20],
    }
    for cat_name, indices in categories.items():
        if indices:
            cat_true = [true_labels[i] for i in indices]
            cat_pred = [pred_labels[i] for i in indices]
            report = classification_report(cat_true, cat_pred, output_dict=True)
            acc = report['accuracy']
            print(f"\nCategory '{cat_name}' ({len(indices)} samples): accuracy={acc:.4f}")
    return pred_labels
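A quick way to sanity-check this harness without downloading a model is a stub pipeline. The stub below is hypothetical; it only mimics the list-of-dicts format that HuggingFace text-classification pipelines return:

```python
# Stub "pipeline" mimicking the output format of a HuggingFace
# text-classification pipeline: a list of {'label', 'score'} dicts.
def stub_pipeline(texts):
    return [{"label": "POSITIVE" if "ottimo" in t else "NEGATIVE",
             "score": 0.99} for t in texts]

test_data = [
    ("Prodotto ottimo, lo consiglio", 1),
    ("Esperienza pessima", 0),
]

# Same label-mapping convention used by benchmark_italian_sentiment
preds = stub_pipeline([t for t, _ in test_data])
pred_labels = [1 if p["label"].upper() in ("POSITIVE", "POSITIVO", "POS") else 0
               for p in preds]
print(pred_labels)  # [1, 0]
```

Swapping the stub for a real pipeline exercises the same code path, so the mapping logic is verified before any GPU time is spent.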
9. Fine-tuning feel-it on Custom Data
feel-it is an excellent starting point, but best performance is always achieved by adapting the model to your specific domain. Here is a complete workflow for fine-tuning on custom Italian data — for example, Italian e-commerce reviews.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import Dataset
import numpy as np
import evaluate
# 1. Custom dataset (e.g., Italian e-commerce reviews)
custom_data = {
    "text": [
        "Prodotto eccellente, consegna velocissima. Consigliatissimo!",
        "Qualità scadente, si è rotto dopo una settimana. Molto deluso.",
        "Insomma, niente di speciale. Se ne può fare a meno.",
        "Fantastico! Esattamente come descritto, molto soddisfatto.",
        "Spedizione veloce ma il prodotto non corrisponde alla descrizione.",
        "Materiale economico, non vale il prezzo. Non lo ricomprerò.",
        "Ottimo rapporto qualità/prezzo, lo consiglio a tutti.",
        "Funziona perfettamente, esattamente quello che cercavo.",
    ],
    "label": [1, 0, 0, 1, 0, 0, 1, 1]  # 0=negative, 1=positive
}
# 2. Load feel-it tokenizer
model_name = "MilaNLProc/feel-it-italian-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

dataset = Dataset.from_dict(custom_data)
dataset = dataset.train_test_split(test_size=0.2, seed=42)
tokenized = dataset.map(tokenize, batched=True)
# 3. Load model with new classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    ignore_mismatched_sizes=True,  # original head has different labels
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# 4. Evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_metric.compute(predictions=preds, references=labels)["f1"]
    }
# 5. Training arguments calibrated for small datasets
training_args = TrainingArguments(
    output_dir="./feel-it-finetuned",
    num_train_epochs=5,              # more epochs for small datasets
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.2,                # longer warmup for stability
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")
print(f"F1: {results['eval_f1']:.4f}")
# 6. Save and optionally push to HuggingFace Hub
trainer.save_model("./feel-it-custom-ecommerce")
tokenizer.save_pretrained("./feel-it-custom-ecommerce")
10. Model Selection Guide for Italian NLP
Choosing the Right Italian Model
| Use Case | Recommended Model | Rationale | Alternative |
|---|---|---|---|
| Binary sentiment (pos/neg) | feel-it | Explicitly trained for Italian sentiment | Fine-tuned UmBERTo |
| Emotion detection (6 classes) | feel-it | Only Italian model with 6 emotions | XLM-RoBERTa multilabel |
| Social media / Twitter | AlBERTo | Trained on ~200M Italian tweets | feel-it with normalization |
| Formal text (news, documents) | dbmdz/bert-base-italian-xxl-cased | Academic and news corpora | UmBERTo |
| Italian NER | dbmdz/bert-base-italian-xxl-cased + NER head | Richer Italian vocabulary coverage | spaCy it_core_news_lg |
| Multilingual tasks (IT+EN+...) | xlm-roberta-large | Top-1 on XNLI, supports 100 languages | mDeBERTa-v3-base |
| Low-latency production | Quantized multilingual DistilBERT | Distillation trades a small quality drop for much lower latency | feel-it + ONNX export |
Conclusions and Next Steps
Italian NLP requires specific attention: rich morphology, colloquial language, regional dialects, and the scarcity of annotated resources make this domain challenging but also very interesting. Models like feel-it and AlBERTo have significantly improved the landscape in recent years.
Key Takeaways
- Use feel-it as a starting point for Italian sentiment and emotion detection
- For social media and informal text, AlBERTo is often superior
- For formal text (news, documents), use dbmdz BERT cased
- Italian-specific preprocessing (abbreviation normalization, lemmatization) improves results
- Always fine-tune on your specific domain data for best results
- Collect continuous feedback: Italian evolves rapidly (neologisms, anglicisms)
Continue the Series
- Next: Named Entity Recognition — extract entities from text with spaCy and BERT
- Article 6: Multi-label Text Classification — when text belongs to multiple categories
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API and Model Hub
- Article 8: LoRA Fine-tuning — train large models on consumer GPUs
- Related series: AI Engineering/RAG — Italian embeddings for semantic search