NLP for Italian Language: Specific Challenges and Solutions
Italian is one of the most morphologically rich Romance languages: grammatical gender, noun and adjective inflection, adjective-noun agreement, irregular verb forms, and flexible syntax make NLP preprocessing and modeling significantly more challenging than English. Yet the vast majority of NLP tutorials are in English, and the most well-known models are often optimized for English.
This article fills that gap. We explore the specific challenges of Italian, available datasets, Italian BERT models (feel-it, AlBERTo, dbmdz BERT), Italian-specific preprocessing, and how to build a complete sentiment analysis system for Italian step by step.
This is the fourth article in the Modern NLP: from BERT to LLMs series, and the one that covers Italian NLP preprocessing and modeling end to end.
What You Will Learn
- Morphological challenges of Italian: gender, inflection, irregular verbs
- Italian-specific preprocessing: stopwords, spaCy lemmatization, normalization
- Italian BERT models: feel-it-italian-sentiment, AlBERTo, dbmdz BERT, GilBERTo
- Italian datasets: SENTIPOLC, TweetSent-IT, ItalianSentiment
- Fine-tuning feel-it on custom domain data
- Handling colloquial language, dialects, and Italian neologisms
- Complete production pipeline for Italian sentiment analysis
- Comparing Italian models vs multilingual BERT
1. Specific Challenges of Italian in NLP
Italian has linguistic characteristics that make NLP more complex than English. Understanding these challenges is fundamental to building effective systems.
1.1 Rich Morphology
Unlike English, Italian has very rich morphology: the same verb root generates dozens of inflected forms, and adjectives must agree in gender and number with nouns. This creates data sparsity problems.
Example: The Italian Verb "Andare" (to go)
- vado, vai, va, andiamo, andate, vanno (present)
- andavo, andavi, andava, andavamo, andavate, andavano (imperfect)
- andrò, andrai, andrà, andremo, andrete, andranno (future)
- andai, andasti, andò, andammo, andaste, andarono (passato remoto)
- sia andato/a, siano andati/e (subjunctive past)
In English, "to go" has very few forms. For an NLP model, each form is initially a different token.
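The sparsity effect can be seen concretely in a toy sketch. The mini lemma dictionary below is hand-built for this example (a real system would use spaCy's lemmatizer): many distinct surface tokens collapse to a single vocabulary entry.

```python
# Toy illustration of Italian data sparsity: many surface forms, one lemma.
forms = ["vado", "vai", "va", "andiamo", "andavo", "andrò", "andai"]

# Hypothetical mini-lexicon mapping inflected forms to their lemma
mini_lexicon = {f: "andare" for f in forms}

surface_vocab = set(forms)
lemma_vocab = {mini_lexicon[f] for f in forms}

print(f"Surface forms: {len(surface_vocab)}")  # 7 distinct tokens
print(f"Lemmas: {len(lemma_vocab)}")           # 1 after lemmatization
```

Lemmatization trades some nuance (tense, person) for a denser vocabulary, which is why it often helps classical models more than subword-based transformers.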
1.2 Enclitic Pronouns and Compound Words
In Italian, pronouns can be attached to verbs (enclitic), creating complex tokens that standard tokenizers may handle poorly.
# Common issues with tokenizers for Italian
# Enclitic pronouns attached to verbs
examples = [
    "Dimmelo",      # dimmi + lo
    "Portarmelo",   # portare + mi + lo
    "Fallo",        # fa' + lo
    "Dateglielo",   # date + glie + lo
]
# Incorrect tokenization with non-Italian tokenizers
from transformers import BertTokenizer
tokenizer_en = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_it = BertTokenizer.from_pretrained('dbmdz/bert-base-italian-cased')
word = "Dimmelo"
print(f"EN tokenizer: {tokenizer_en.tokenize(word)}")
# ['dim', '##mel', '##o'] - misses the structure
print(f"IT tokenizer: {tokenizer_it.tokenize(word)}")
# ['Dim', '##me', '##lo'] - better but not perfect
# The optimal solution is lemmatization before tokenization
1.3 Informal Orthography and Dialects
Italian online text (social media, reviews) commonly features:
- Accents replaced by apostrophes: "puo'" instead of "può"
- Repeated characters: "bellissimoooo!!!"
- Common abbreviations: "cmq" (comunque = anyway), "nn" (non = not), "xke" (perché = because)
- Code-switching with English: "Il prodotto è davvero top quality"
- Regional dialectalisms: "mizzica" (Sicilian), "mannaggia" (Southern Italian)
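The first pattern, apostrophes standing in for accents, can be normalized with a small lookup. This is a sketch covering only a handful of frequent words (the map is illustrative, not exhaustive; full coverage needs a lexicon):

```python
import re

# Common apostrophe-for-accent substitutions seen in informal Italian.
# Illustrative map: only a few frequent words are covered.
APOSTROPHE_ACCENTS = {
    "puo'": "può",
    "e'": "è",
    "perche'": "perché",
    "piu'": "più",
    "gia'": "già",
}

def fix_apostrophe_accents(text: str) -> str:
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in APOSTROPHE_ACCENTS) + r")"
    def repl(match):
        return APOSTROPHE_ACCENTS[match.group(0).lower()]
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(fix_apostrophe_accents("Il telefono e' bello ma puo' migliorare"))
# → Il telefono è bello ma può migliorare
```

The `\b` anchor keeps the substitution from firing inside longer words (e.g. the "e'" in "che'" is left alone).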
2. Italian-Specific Preprocessing
2.1 spaCy for Italian
spaCy offers Italian models (it_core_news_sm/md/lg) with lemmatization, POS tagging, and dependency parsing.
# Install Italian model: python -m spacy download it_core_news_lg
import spacy
nlp = spacy.load("it_core_news_lg")
def preprocess_italian(text: str,
                       remove_stopwords: bool = True,
                       lemmatize: bool = True) -> str:
    """Complete preprocessing for Italian texts."""
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Skip punctuation and whitespace tokens
        if token.is_punct or token.is_space:
            continue
        # Remove Italian stopwords
        if remove_stopwords and token.is_stop:
            continue
        # Lemmatize, otherwise keep the lowercased surface form
        if lemmatize:
            word = token.lemma_.lower()
        else:
            word = token.text.lower()
        tokens.append(word)
    return ' '.join(tokens)
# Test
texts = [
    "I prodotti sono stati consegnati rapidamente e tutto funzionava perfettamente",
    "Ho comprato questo telefono tre mesi fa e sono rimasto deluso dalla batteria",
    "PRODOTTO FANTASTICO! Lo consiglio assolutamente a tutti voi amici!!!"
]
for text in texts:
    processed = preprocess_italian(text)
    print(f"Original: {text}")
    print(f"Processed: {processed}")
    print()
2.2 Normalizing Informal Italian Text
import re
import unicodedata
def normalize_italian_text(text: str) -> str:
    """
    Normalization for informal Italian texts (social media, reviews).
    """
    # 1. Normalize unicode (accents)
    text = unicodedata.normalize('NFC', text)
    # 2. Expand common Italian abbreviations
    abbreviations = {
        r'\bcmq\b': 'comunque',
        r'\bnn\b': 'non',
        r'\bxke\b': 'perché',
        r'\bxche\b': 'perché',
        r'\bx\b': 'per',
        r'\bke\b': 'che',
        r'\bkm\b': 'come',
        r'\bqs\b': 'questo',
        r'\btv\b': 'televisione',
        r'\bgg\b': 'giorni',
        r'\bprof\b': 'professore',
    }
    for abbr, expanded in abbreviations.items():
        text = re.sub(abbr, expanded, text, flags=re.IGNORECASE)
    # 3. Reduce excessive character repetitions (max 2)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)
    # 4. Normalize multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Test
informal_texts = [
    "cmq il prodotto e' fantasticooo!!!",
    "nn mi e piaciuto x niente, sto cercando di restituirlo xke nn funziona",
    "Amici... COMPRATE QUESTOOO!!! e' il TOP del TOP!!!",
]
for text in informal_texts:
    normalized = normalize_italian_text(text)
    print(f"Original: {text}")
    print(f"Normalized: {normalized}")
    print()
3. Italian BERT Models
Several BERT models pre-trained on Italian corpora are available. The choice depends on the domain and the specific task.
3.1 feel-it-italian-sentiment
feel-it is a dataset and model specifically for sentiment analysis and emotion detection in Italian. It is Twitter-based and was trained on manual annotations for sentiment (positive/negative) and emotions (joy, sadness, anger, fear, disgust, surprise).
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
# feel-it for sentiment (positive/negative)
sentiment_model = pipeline(
    "text-classification",
    model="MilaNLProc/feel-it-italian-sentiment",
    tokenizer="MilaNLProc/feel-it-italian-sentiment"
)
# feel-it for emotions (joy, sadness, anger, fear, disgust, surprise)
emotion_model = pipeline(
    "text-classification",
    model="MilaNLProc/feel-it-italian-emotion",
    tokenizer="MilaNLProc/feel-it-italian-emotion"
)
# Test on Italian texts
texts = [
    "Sono molto felice del mio acquisto, qualità eccellente!",
    "Ho perso tutto il mio lavoro, sono devastato.",
    "Questa e la situazione più ridicola che abbia mai visto.",
    "Non credevo che potesse funzionare cosi bene, sono stupito!",
]
print("=== SENTIMENT ===")
for text in texts:
    result = sentiment_model(text)[0]
    print(f"  [{result['label']}: {result['score']:.3f}] {text[:60]}")
print("\n=== EMOTION ===")
for text in texts:
    result = emotion_model(text)[0]
    print(f"  [{result['label']}: {result['score']:.3f}] {text[:60]}")
3.2 AlBERTo: BERT for Italian Social Media
AlBERTo was pre-trained on a corpus of Italian tweets (over 200 million tweets). It is particularly effective for informal text, social media, and colloquial Italian.
from transformers import AutoTokenizer, AutoModel
import torch
# AlBERTo - uncased BERT for Italian Twitter
alberto_name = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_Italian_alb3rt0"
tokenizer = AutoTokenizer.from_pretrained(alberto_name)
model = AutoModel.from_pretrained(alberto_name)
# Test tokenization on colloquial text
informal_texts = [
    "PRODOTTO TOP! ma la spedizione ha fatto schifo cmq",
    "mizzica quanto e bello sto telefono!! ci ho messo 2gg ma ne valeva la pena",
    "ok mi avete rotto... non lo compro più #delusione",
]
for text in informal_texts:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text[:50]}")
    print(f"Tokens ({len(tokens)}): {tokens[:10]}...")
    print()
# Embedding extraction
def get_sentence_embedding(text, model, tokenizer, pooling='cls'):
    inputs = tokenizer(text, return_tensors='pt',
                       truncation=True, max_length=128, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    if pooling == 'cls':
        return outputs.last_hidden_state[:, 0, :]  # [CLS] token
    elif pooling == 'mean':
        # Mean pooling over non-padding tokens
        mask = inputs['attention_mask'].unsqueeze(-1)
        return (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    raise ValueError(f"Unknown pooling: {pooling}")

emb = get_sentence_embedding(informal_texts[0], model, tokenizer)
print(f"Embedding shape: {emb.shape}")  # (1, 768)
3.3 dbmdz BERT Italian
dbmdz/bert-base-italian-cased was pre-trained on an Italian Wikipedia dump and OPUS corpora. It is a strong starting point for formal text (news, legal documents, academic writing).
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
from datasets import Dataset
import torch
# Base model for Italian
MODEL = "dbmdz/bert-base-italian-cased"
tokenizer = BertTokenizer.from_pretrained(MODEL)
# Create a sentiment classifier for Italian
model = BertForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# Sample Italian training dataset
train_data = {
    "text": [
        "Il prodotto e arrivato in perfette condizioni, molto soddisfatto",
        "qualità pessima, si e rotto dopo due giorni",
        "Eccellente rapporto qualità/prezzo, lo consiglio",
        "Imballaggio scarso, prodotto danneggiato alla consegna",
        "Supera le aspettative, ottimo acquisto",
        "Servizio clienti inesistente, rimborso impossibile",
        "Materiali di qualità, costruzione solida",
        "Non corrisponde alla descrizione, immagine ingannevole",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0]
}

def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=128)
dataset = Dataset.from_dict(train_data)
tokenized = dataset.map(tokenize_fn, batched=True)
# Quick training run (tiny dataset, so only a few epochs)
args = TrainingArguments(
    output_dir="./models/bert-italian-sentiment",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    save_steps=100,
    logging_steps=10,
    report_to="none"
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
)
trainer.train()
3.4 Comparing Italian Models
Which Italian Model to Use?
| Model | Best Domain | Best Tasks | Size |
|---|---|---|---|
| feel-it-sentiment | Social media, opinions | Sentiment, emotion detection | ~440MB |
| feel-it-emotion | Social media, opinions | 6 basic emotions | ~440MB |
| AlBERTo | Twitter, chat, SMS | Sentiment, NER, classification | ~420MB |
| dbmdz BERT cased | News, formal documents | NER, classification, QA | ~420MB |
| GilBERTo | General Italian text | General NLU tasks | ~440MB |
| mBERT | Cross-lingual | Multilingual transfer learning | ~670MB |
4. Italian Datasets for Sentiment Analysis
from datasets import load_dataset
# SENTIPOLC 2016 - Italian dataset for polarity detection on Twitter
# Available at: http://www.di.unito.it/~tutreeb/sentipolc-evalita16/
# Labels: OBJ (objective), POS (positive), NEG (negative), MIX
# Example of loading an Italian dataset from the HuggingFace Hub
# (ItaCoLA contains acceptability judgments, not sentiment labels)
try:
    dataset = load_dataset("gsarti/itacola")
    print("ItaCoLA dataset:", dataset)
except Exception:
    print("Dataset not directly available, use manual URL")
# Building a custom dataset from CSV
import pandas as pd
from datasets import Dataset
# Expected format: columns 'text' and 'label'
def load_italian_dataset(csv_path):
    df = pd.read_csv(csv_path)
    # Validation
    assert 'text' in df.columns, "Missing 'text' column"
    assert 'label' in df.columns, "Missing 'label' column"
    # Remove rows with empty text
    df = df.dropna(subset=['text', 'label'])
    df = df[df['text'].str.strip() != '']
    # Normalize labels
    label_map = {
        'positivo': 1, 'pos': 1, '1': 1, 1: 1,
        'negativo': 0, 'neg': 0, '0': 0, 0: 0
    }
    df['label'] = df['label'].map(label_map)
    df = df.dropna(subset=['label'])
    df['label'] = df['label'].astype(int)
    return Dataset.from_pandas(df[['text', 'label']])
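The label-normalization step inside load_italian_dataset can be checked in isolation. A minimal sketch with an in-memory DataFrame (the sample rows are invented; column names text/label match the function's expected format):

```python
import pandas as pd

# In-memory stand-in for a CSV with mixed label spellings
df = pd.DataFrame({
    "text": ["ottimo", "pessimo", "buono", "da buttare", "boh"],
    "label": ["positivo", "neg", "1", 0, "sconosciuto"],
})

# Same mapping used in load_italian_dataset
label_map = {
    'positivo': 1, 'pos': 1, '1': 1, 1: 1,
    'negativo': 0, 'neg': 0, '0': 0, 0: 0
}
df['label'] = df['label'].map(label_map)
df = df.dropna(subset=['label'])   # unmapped labels become NaN and are dropped
df['label'] = df['label'].astype(int)

print(df['label'].tolist())  # [1, 0, 1, 0]
```

Note that the row with the unknown label "sconosciuto" is silently dropped; in production you may prefer to log such rows for inspection.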
5. Complete Italian Sentiment Pipeline
Let's integrate everything into a production-ready pipeline for Italian sentiment analysis.
import re
import spacy
from transformers import pipeline as hf_pipeline
from typing import Optional
import unicodedata
class ItalianSentimentPipeline:
    """
    Complete pipeline for Italian sentiment analysis.
    Combines Italian-specific preprocessing with feel-it for sentiment.
    """
    def __init__(self,
                 sentiment_model: str = "MilaNLProc/feel-it-italian-sentiment",
                 emotion_model: Optional[str] = "MilaNLProc/feel-it-italian-emotion",
                 use_spacy: bool = True,
                 confidence_threshold: float = 0.6):
        # Load sentiment and emotion models
        self.sentiment = hf_pipeline(
            "text-classification",
            model=sentiment_model,
            truncation=True,
            max_length=128
        )
        self.emotion = hf_pipeline(
            "text-classification",
            model=emotion_model,
            truncation=True,
            max_length=128
        ) if emotion_model else None
        # spaCy for advanced preprocessing
        if use_spacy:
            try:
                self.nlp = spacy.load("it_core_news_sm")
            except OSError:
                print("spaCy model 'it_core_news_sm' not found.")
                print("Install with: python -m spacy download it_core_news_sm")
                self.nlp = None
        else:
            self.nlp = None
        self.threshold = confidence_threshold

    def preprocess(self, text: str) -> str:
        """Italian-specific preprocessing."""
        if not text or not text.strip():
            return ""
        # Normalize unicode
        text = unicodedata.normalize('NFC', text)
        # Common Italian abbreviations
        abbr_map = {
            r'\bcmq\b': 'comunque',
            r'\bnn\b': 'non',
            r'\bxke\b': 'perché',
            r'\bx\b': 'per',
        }
        for pattern, replacement in abbr_map.items():
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        # Reduce repeated characters
        text = re.sub(r'(.)\1{2,}', r'\1\1', text)
        # Normalize spaces
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def analyze(self, text: str) -> dict:
        """Full analysis: sentiment + emotion + preprocessing."""
        if not text or not text.strip():
            return {"error": "Empty text"}
        preprocessed = self.preprocess(text)
        # Sentiment
        sent_result = self.sentiment(preprocessed)[0]
        result = {
            "original_text": text,
            "preprocessed_text": preprocessed,
            "sentiment": sent_result['label'],
            "sentiment_score": round(sent_result['score'], 4),
            "confident": sent_result['score'] >= self.threshold
        }
        # Emotion (if available)
        if self.emotion:
            em_result = self.emotion(preprocessed)[0]
            result["emotion"] = em_result['label']
            result["emotion_score"] = round(em_result['score'], 4)
        return result

    def analyze_batch(self, texts: list) -> list:
        return [self.analyze(t) for t in texts]
# Usage
pipeline = ItalianSentimentPipeline()
test_texts = [
    "Il prodotto e arrivato in perfette condizioni, sono molto soddisfatto dell'acquisto!",
    "Pessima esperienza. Il pacco era danneggiato e il servizio clienti non risponde.",
    "Mah, diciamo che si poteva fare meglio. Non e ne buono ne cattivo.",
    "INCREDIBILE! Non avrei mai pensato che fosse cosi bello!!! Sto piangendo di gioia",
    "Nn ci credo... mi ha di nuovo fregato sto negozio di schifo",
]
for text in test_texts:
    result = pipeline.analyze(text)
    print(f"Text: {text[:60]}...")
    print(f"Sentiment: {result['sentiment']} ({result['sentiment_score']:.3f})")
    if 'emotion' in result:
        print(f"Emotion: {result['emotion']} ({result['emotion_score']:.3f})")
    print(f"Confident: {result['confident']}")
    print()
6. Domain-Specific Fine-tuning
feel-it was trained on Twitter. For specific domains such as product reviews, medical comments, or legal text, additional fine-tuning is often necessary.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import Dataset
import evaluate
import numpy as np
# Strategy 1: Fine-tune feel-it on domain data
def finetune_for_domain(
    base_model: str,
    train_texts: list,
    train_labels: list,
    val_texts: list,
    val_labels: list,
    output_dir: str,
    num_epochs: int = 3
):
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=2,
        ignore_mismatched_sizes=True  # for already fine-tuned models
    )

    def tokenize(examples):
        return tokenizer(examples["text"], truncation=True,
                         padding="max_length", max_length=128)

    train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})
    val_ds = Dataset.from_dict({"text": val_texts, "label": val_labels})
    train_tok = train_ds.map(tokenize, batched=True)
    val_tok = val_ds.map(tokenize, batched=True)

    accuracy = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return accuracy.compute(predictions=preds, references=labels)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        report_to="none"
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_tok,
        eval_dataset=val_tok,
        compute_metrics=compute_metrics
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return trainer
# Strategy 2: Compare Italian models
from transformers import pipeline as hf_pipeline

def compare_italian_models(texts, true_labels):
    """Automatic comparison of different Italian BERT models."""
    models = {
        "feel-it": "MilaNLProc/feel-it-italian-sentiment",
        "AlBERTo-base": "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_Italian_alb3rt0",
        "mBERT": "bert-base-multilingual-cased"
    }
    results = {}
    for name, model_id in models.items():
        try:
            clf = hf_pipeline("text-classification", model=model_id,
                              truncation=True, max_length=128)
            preds = clf(texts)
            # Note: models without a fine-tuned sentiment head (base AlBERTo,
            # mBERT) return untrained LABEL_0/LABEL_1 predictions here
            pred_labels = [1 if p['label'].upper().startswith('POS') else 0
                           for p in preds]
            acc = sum(p == t for p, t in zip(pred_labels, true_labels)) / len(true_labels)
            results[name] = acc
            print(f"{name}: accuracy={acc:.4f}")
        except Exception as e:
            print(f"{name}: error - {e}")
    return results
7. Handling Dialects and Regional Varieties
Italy has a strong dialectal tradition. Social media posts, reviews, and informal messages often mix standard Italian and dialect, especially southern dialects (Neapolitan, Sicilian, Barese, Calabrian).
Strategies for Dialectal Text
- Light normalization: convert the most common dialectal forms to standard Italian (e.g., "maje" → "mai" in Neapolitan)
- Use AlBERTo: trained on Twitter, it includes many dialectal forms given the nature of Italian social media
- Multilingual BERT: sometimes handles dialects better as "unknown languages" compared to Italian-specific models that expect standard Italian
- Domain-specific data collection: if your dataset contains many dialectalisms, collect annotated examples for fine-tuning
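The first strategy, light normalization, can be sketched as a lookup table. The dialect-to-standard pairs below are illustrative examples only, not a vetted lexicon; real coverage requires annotated dialect data:

```python
import re

# Illustrative dialect-to-standard substitutions (tiny hand-made list,
# not a vetted lexicon)
DIALECT_MAP = {
    r"\bmaje\b": "mai",           # Neapolitan
    r"\bguaglione\b": "ragazzo",  # Neapolitan
    r"\bpicciriddu\b": "bambino", # Sicilian
}

def normalize_dialect(text: str) -> str:
    for pattern, standard in DIALECT_MAP.items():
        text = re.sub(pattern, standard, text, flags=re.IGNORECASE)
    return text

print(normalize_dialect("nun te scurdà maje"))  # → nun te scurdà mai
```

A table like this only dents the problem (the surrounding dialectal grammar is untouched), which is why the fine-tuning strategy below usually matters more.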
8. Benchmarking and Metrics for Italian
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
def benchmark_italian_sentiment(model_pipeline, test_data):
    """
    Complete benchmark for Italian sentiment models.
    test_data: list of tuples (text, label)
    """
    texts = [d[0] for d in test_data]
    true_labels = [d[1] for d in test_data]
    predictions = model_pipeline(texts)
    pred_labels = []
    for pred in predictions:
        label = pred['label'].upper()
        if label in ['POSITIVE', 'POSITIVO', 'POS']:
            pred_labels.append(1)
        else:
            pred_labels.append(0)

    print("=== CLASSIFICATION REPORT ===")
    print(classification_report(
        true_labels, pred_labels,
        target_names=['NEGATIVE', 'POSITIVE'],
        digits=4
    ))

    # Analysis by text length (a rough proxy for register)
    categories = {
        'long': [i for i, t in enumerate(texts) if len(t.split()) > 20],
        'short': [i for i, t in enumerate(texts) if len(t.split()) <= 20],
    }
    for cat_name, indices in categories.items():
        if indices:
            cat_true = [true_labels[i] for i in indices]
            cat_pred = [pred_labels[i] for i in indices]
            report = classification_report(cat_true, cat_pred, output_dict=True)
            acc = report['accuracy']
            print(f"\nCategory '{cat_name}' ({len(indices)} samples): accuracy={acc:.4f}")
    return pred_labels
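A quick way to sanity-check this harness without downloading a model is a stub pipeline. The stub below is hypothetical; it only mimics the list-of-dicts format that HuggingFace text-classification pipelines return:

```python
# Stub "pipeline" mimicking the output format of a HuggingFace
# text-classification pipeline: a list of {'label', 'score'} dicts.
def stub_pipeline(texts):
    return [{"label": "POSITIVE" if "ottimo" in t else "NEGATIVE",
             "score": 0.99} for t in texts]

test_data = [
    ("Prodotto ottimo, lo consiglio", 1),
    ("Esperienza pessima", 0),
]

# Same label-mapping convention used by benchmark_italian_sentiment
preds = stub_pipeline([t for t, _ in test_data])
pred_labels = [1 if p["label"].upper() in ("POSITIVE", "POSITIVO", "POS") else 0
               for p in preds]
print(pred_labels)  # [1, 0]
```

Swapping the stub for a real pipeline exercises the same code path, so the mapping logic is verified before any GPU time is spent.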
9. Fine-tuning feel-it on Custom Data
feel-it is an excellent starting point, but best performance is always achieved by adapting the model to your specific domain. Here is a complete workflow for fine-tuning on custom Italian data — for example, Italian e-commerce reviews.
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import Dataset
import numpy as np
import evaluate
# 1. Custom dataset (e.g., Italian e-commerce reviews)
custom_data = {
    "text": [
        "Prodotto eccellente, consegna velocissima. Consigliatissimo!",
        "Qualità scadente, si è rotto dopo una settimana. Molto deluso.",
        "Insomma, niente di speciale. Se ne può fare a meno.",
        "Fantastico! Esattamente come descritto, molto soddisfatto.",
        "Spedizione veloce ma il prodotto non corrisponde alla descrizione.",
        "Materiale economico, non vale il prezzo. Non lo ricomprerò.",
        "Ottimo rapporto qualità/prezzo, lo consiglio a tutti.",
        "Funziona perfettamente, esattamente quello che cercavo.",
    ],
    "label": [1, 0, 0, 1, 0, 0, 1, 1]  # 0=negative, 1=positive
}
# 2. Load feel-it tokenizer
model_name = "MilaNLProc/feel-it-italian-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

dataset = Dataset.from_dict(custom_data)
dataset = dataset.train_test_split(test_size=0.2, seed=42)
tokenized = dataset.map(tokenize, batched=True)
# 3. Load model with new classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    ignore_mismatched_sizes=True,  # original head has different labels
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# 4. Evaluation metrics
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_metric.compute(predictions=preds, references=labels)["f1"]
    }
# 5. Training arguments calibrated for small datasets
training_args = TrainingArguments(
    output_dir="./feel-it-finetuned",
    num_train_epochs=5,              # more epochs for small datasets
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.2,                # longer warmup for stability
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")
print(f"F1: {results['eval_f1']:.4f}")
# 6. Save and optionally push to HuggingFace Hub
trainer.save_model("./feel-it-custom-ecommerce")
tokenizer.save_pretrained("./feel-it-custom-ecommerce")
10. Model Selection Guide for Italian NLP
Choosing the Right Italian Model
| Use Case | Recommended Model | Rationale | Alternative |
|---|---|---|---|
| Binary sentiment (pos/neg) | feel-it | Explicitly trained for Italian sentiment | Fine-tuned UmBERTo |
| Emotion detection (6 classes) | feel-it | Only Italian model with 6 emotions | XLM-RoBERTa multilabel |
| Social media / Twitter | AlBERTo | Trained on ~200M Italian tweets | feel-it with normalization |
| Formal text (news, documents) | dbmdz/bert-base-italian-xxl-cased | Academic and news corpora | UmBERTo |
| Italian NER | dbmdz/bert-base-italian-xxl-cased + NER head | Richer Italian vocabulary coverage | spaCy it_core_news_lg |
| Multilingual tasks (IT+EN+...) | xlm-roberta-large | Top-1 on XNLI, supports 100 languages | mDeBERTa-v3-base |
| Low-latency production | Quantized multilingual DistilBERT | Distillation trades a small quality drop for much lower latency | feel-it + ONNX export |
Conclusions and Next Steps
Italian NLP requires specific attention: rich morphology, colloquial language, regional dialects, and the scarcity of annotated resources make this domain challenging but also very interesting. Models like feel-it and AlBERTo have significantly improved the landscape in recent years.
Key Takeaways
- Use feel-it as a starting point for Italian sentiment and emotion detection
- For social media and informal text, AlBERTo is often superior
- For formal text (news, documents), use dbmdz BERT cased
- Italian-specific preprocessing (abbreviation normalization, lemmatization) improves results
- Always fine-tune on your specific domain data for best results
- Collect continuous feedback: Italian evolves rapidly (neologisms, anglicisms)
Continue the Series
- Next: Named Entity Recognition — extract entities from text with spaCy and BERT
- Article 6: Multi-label Text Classification — when text belongs to multiple categories
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API and Model Hub
- Article 8: LoRA Fine-tuning — train large models on consumer GPUs
- Related series: AI Engineering/RAG — Italian embeddings for semantic search