Sentiment Analysis with Transformers: Techniques and Implementation
Sentiment analysis is one of the most in-demand NLP tasks in enterprise settings. Companies of every size analyze product reviews, social media posts, support tickets, and customer feedback at scale to understand what people really think. With the advent of BERT and other Transformer models, the quality of these systems has improved radically compared to classic dictionary-based or TF-IDF approaches.
In this article we build a complete sentiment analysis system: from dataset preparation to production deployment, including HuggingFace fine-tuning, handling class imbalance, evaluating metrics, and strategies for edge cases like irony, negation, and ambiguous language.
This is the third article in the Modern NLP: from BERT to LLMs series. It assumes familiarity with BERT fundamentals (article 2). For Italian-specific models, see article 4 on feel-it and AlBERTo.
What You Will Learn
- Classical vs BERT approaches: VADER, lexicon-based, fine-tuned Transformers
- Public sentiment datasets: SST-2, IMDb, Amazon Reviews, SemEval
- Complete implementation with HuggingFace Transformers and Trainer API
- Handling class imbalance in sentiment datasets
- Metrics: accuracy, F1, precision, recall, AUC-ROC
- Fine-grained sentiment: Aspect-Based Sentiment Analysis (ABSA) and intensity
- Hard cases: irony, negation, ambiguous language
- Production pipeline with FastAPI and batch inference
- Latency optimization: quantization and ONNX export
1. Evolution of Approaches: from VADER to BERT
Before diving into Transformer implementation, it is useful to understand the historical path of sentiment analysis approaches — in production you often use the simplest method that meets the requirements.
1.1 Dictionary-Based Approaches: VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based analyzer optimized for social media. It requires no training, is extremely fast, and works surprisingly well on informal text.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Basic examples
texts = [
    "This product is absolutely AMAZING!!!",           # strong positive
    "The service was okay I guess",                    # ambiguous neutral
    "Worst purchase I've ever made. Complete waste.",  # negative
    "The food wasn't bad at all",                      # tricky negation
    "Yeah right, as if this would work :)",            # sarcasm
]

for text in texts:
    scores = analyzer.polarity_scores(text)
    print(f"Text: {text[:50]}")
    print(f"  neg={scores['neg']:.3f}, neu={scores['neu']:.3f}, "
          f"pos={scores['pos']:.3f}, compound={scores['compound']:.3f}")
    label = ('POSITIVE' if scores['compound'] >= 0.05
             else 'NEGATIVE' if scores['compound'] <= -0.05
             else 'NEUTRAL')
    print(f"  Label: {label}\n")

# VADER handles well: capitalization, punctuation, emoji
# Struggles with: sarcasm, complex context
1.2 Classical Machine Learning Approaches
Before Transformers, TF-IDF features with Logistic Regression or an SVM were the most common approach. They remain useful as fast baselines, or when labeled data is very scarce.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sample dataset (a real baseline needs far more examples)
train_texts = [
    "Excellent product, highly recommend to everyone",
    "Terrible experience, will not buy again",
    "Great quality, fast shipping",
    "Complete waste of money",
    "Impeccable customer service",
    "Defective product, very disappointed"
]
train_labels = [1, 0, 1, 0, 1, 0]

# TF-IDF + Logistic Regression pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams and bigrams
        max_features=50000,
        sublinear_tf=True    # log(1+tf) to dampen high frequencies
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000))
])
pipe.fit(train_texts, train_labels)

# Evaluation
test_texts = ["Fantastic product!", "Terrible, it doesn't work"]
preds = pipe.predict(test_texts)
probs = pipe.predict_proba(test_texts)
for text, pred, prob in zip(test_texts, preds, probs):
    label = 'POSITIVE' if pred == 1 else 'NEGATIVE'
    confidence = max(prob)
    print(f"{text}: {label} ({confidence:.2f})")
1.3 Why BERT Is Superior
Sentiment Analysis Approaches Comparison
| Approach | Accuracy (SST-2) | Latency | Training Data | Hard Cases |
|---|---|---|---|---|
| VADER | ~71% | <1ms | None | Poor |
| TF-IDF + LR | ~85% | ~5ms | Required | Fair |
| DistilBERT | ~91% | ~50ms | Required | Good |
| BERT-base | ~93% | ~100ms | Required | Very Good |
| RoBERTa | ~96% | ~100ms | Required | Excellent |
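The table suggests a practical pattern hinted at earlier: route easy texts through a cheap lexicon scorer and reserve the Transformer for ambiguous ones. Below is a minimal sketch of such a cascade; the `band` threshold and the stub scorers are illustrative, not from any library, and would be replaced by VADER and a fine-tuned pipeline in practice.

```python
def cascade_sentiment(text, fast_scorer, slow_scorer, band=0.3):
    """Two-stage cascade: trust the cheap lexicon score when it is
    decisive, escalate ambiguous texts to the expensive model.

    fast_scorer: text -> compound score in [-1, 1] (e.g. VADER)
    slow_scorer: text -> 'POSITIVE' | 'NEGATIVE'   (e.g. a BERT pipeline)
    """
    compound = fast_scorer(text)
    if abs(compound) >= band:
        # Cheap path: decisive lexicon score, no model call needed
        return ("POSITIVE" if compound > 0 else "NEGATIVE", "lexicon")
    # Ambiguous: pay the latency cost of the Transformer
    return (slow_scorer(text), "transformer")

# Stub scorers just to show the routing logic
fast = lambda t: 0.8 if "amazing" in t else 0.0
slow = lambda t: "NEGATIVE"

print(cascade_sentiment("This is amazing", fast, slow))   # ('POSITIVE', 'lexicon')
print(cascade_sentiment("It is what it is", fast, slow))  # ('NEGATIVE', 'transformer')
```

The width of the ambiguity band is the knob: widen it for quality (more texts hit the Transformer), narrow it for latency.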
2. Datasets for Sentiment Analysis
The quality of fine-tuning depends heavily on the quality and size of the dataset. Here are the most important English datasets, with Italian resources covered in the next article.
from datasets import load_dataset
# SST-2: Stanford Sentiment Treebank (binary: positive/negative)
sst2 = load_dataset("glue", "sst2")
print(sst2)
# train: 67,349 examples, validation: 872, test: 1,821
# IMDb Reviews (binary: positive/negative)
imdb = load_dataset("imdb")
print(imdb)
# train: 25,000, test: 25,000
# Amazon Polarity (binary: positive/negative, derived from 1-5 star reviews)
amazon = load_dataset("amazon_polarity")
print(amazon)
# train: 3,600,000, test: 400,000
# Dataset exploration
print("\nSST-2 examples:")
for example in sst2['train'].select(range(3)):
    label = 'POSITIVE' if example['label'] == 1 else 'NEGATIVE'
    print(f"  [{label}] {example['sentence']}")
# Class distribution analysis
from collections import Counter
labels = sst2['train']['label']
print("\nSST-2 train distribution:", Counter(labels))
# Counter({1: 37569, 0: 29780}) - slight imbalance
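The same care applies when carving a validation set out of your own labeled data: a naive random split can skew an already imbalanced label distribution. A small sketch with scikit-learn's `train_test_split` (the toy corpus below is illustrative):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Illustrative labeled corpus: 60% positive, 40% negative
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

train_x, val_x, train_y, val_y = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,   # preserve the 60/40 ratio in both splits
    random_state=42
)
print(Counter(train_y))  # Counter({1: 48, 0: 32})
print(Counter(val_y))    # Counter({1: 12, 0: 8})
```

Without `stratify`, small validation sets can end up with a class ratio quite different from the training set, which distorts every metric you compute on them.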
3. Complete Fine-tuning with HuggingFace
Let's build a complete sentiment classifier, from data preparation to saving the trained model.
3.1 Data Preparation
from transformers import AutoTokenizer
from datasets import load_dataset

# Using DistilBERT for speed (~97% of BERT's accuracy, ~60% faster)
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load SST-2 from GLUE
dataset = load_dataset("glue", "sst2")

def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors=None  # returns lists, not tensors
    )

# Tokenize the full dataset (cached by datasets)
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    remove_columns=["sentence", "idx"]  # drop columns the model doesn't need
)

# PyTorch format
tokenized.set_format("torch")
print(tokenized)
print("Train columns:", tokenized['train'].column_names)
# ['input_ids', 'attention_mask', 'label']
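Note that `padding="max_length"` pads every example to 128 tokens even though most SST-2 sentences are far shorter. Back-of-the-envelope arithmetic shows what per-batch dynamic padding saves (the batch lengths below are illustrative); in transformers this is done by tokenizing with `padding=False` and passing a `DataCollatorWithPadding` to the `Trainer`.

```python
# Token counts of one illustrative batch (most SST-2 sentences are short)
lengths = [12, 45, 9, 88, 23, 17]

fixed_tokens = 128 * len(lengths)             # padding="max_length"
dynamic_tokens = max(lengths) * len(lengths)  # pad only to the longest in batch
savings = 1 - dynamic_tokens / fixed_tokens
print(f"Wasted compute avoided: {savings:.1%}")  # 31.2%

# In practice, tokenize with padding=False and let the Trainer pad per batch:
# from transformers import DataCollatorWithPadding
# trainer = Trainer(..., data_collator=DataCollatorWithPadding(tokenizer=tokenizer))
```

Fixed-length padding keeps the example simple; dynamic padding is the usual choice once training time matters.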
3.2 Model Definition and Training
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import evaluate
import numpy as np

# Model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Evaluation metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(
            predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(
            predictions=predictions, references=labels,
            average="binary")["f1"]
    }
# Training configuration
training_args = TrainingArguments(
    output_dir="./results/distilbert-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",  # "eval_strategy" in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=100,
    fp16=True,                    # mixed precision (GPU with Tensor Cores)
    dataloader_num_workers=4,
    report_to="none",             # disable wandb/tensorboard for simplicity
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)
# Start training
train_result = trainer.train()
print(f"Training loss: {train_result.training_loss:.4f}")
# Final evaluation
metrics = trainer.evaluate()
print(f"Validation accuracy: {metrics['eval_accuracy']:.4f}")
print(f"Validation F1: {metrics['eval_f1']:.4f}")
# Save model and tokenizer together
trainer.save_model("./models/distilbert-sst2")
tokenizer.save_pretrained("./models/distilbert-sst2")
3.3 Handling Class Imbalance
In many real-world datasets (e.g., customer support reviews), classes are heavily imbalanced: 90% negative, 10% positive. Without adjustments, the model will learn to always predict the majority class.
import torch
from torch import nn
from transformers import Trainer

# Solution 1: weighted loss function
class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (e.g. num_items_in_batch)
        # passed by newer Trainer versions
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # CrossEntropy with weights inversely proportional to class frequency
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device)
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Compute weights from dataset frequencies
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

labels = tokenized['train']['label'].numpy()
weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(labels),
    y=labels
)
print("Class weights:", weights)  # e.g. [2.3, 0.7] if the negative class is rare

# Solution 2: oversampling with imbalanced-learn
# pip install imbalanced-learn
# from imblearn.over_sampling import RandomOverSampler
# (applicable to feature matrices, not directly to tokenized tensors)

# Solution 3: appropriate metrics for imbalanced data
# Use macro F1 or the minority-class F1, not just accuracy
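A tiny, self-contained illustration of why accuracy misleads here: on a 90/10 split, a degenerate model that always predicts the majority class scores 90% accuracy while its macro F1 collapses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 90/10 imbalanced ground truth; a model that always predicts class 0
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
minority = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy:    {acc:.2f}")       # 0.90 -- looks fine
print(f"macro F1:    {macro:.2f}")     # 0.47 -- exposes the problem
print(f"minority F1: {minority:.2f}")  # 0.00 -- the class we care about
```

This is why the evaluation section below reports per-class metrics rather than a single accuracy number.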
4. Fine-grained Sentiment: Aspect-Based (ABSA)
Binary sentiment analysis (positive/negative) does not capture the complexity of real opinions. A customer can be satisfied with the product but unhappy with the shipping. Aspect-Based Sentiment Analysis (ABSA) identifies the sentiment for each mentioned aspect.
from transformers import pipeline

# Zero-shot classification for ABSA
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

review = ("The product is excellent but shipping took three weeks. "
          "Customer service never responded.")

# Classify the sentiment of each aspect separately
aspects = ["product", "shipping", "customer service"]
sentiments_per_aspect = {}

for aspect in aspects:
    result = classifier(
        review,
        candidate_labels=["positive", "negative", "neutral"],
        # the "{}" placeholder is filled with each candidate label by the pipeline
        hypothesis_template=f"In this review, the sentiment regarding {aspect} is {{}}."
    )
    sentiments_per_aspect[aspect] = result['labels'][0]
    print(f"{aspect}: {result['labels'][0]} ({result['scores'][0]:.2f})")

# Expected output (scores are indicative):
# product: positive (0.89)
# shipping: negative (0.92)
# customer service: negative (0.87)
5. Hard Cases: Irony, Negation, Ambiguity
BERT models handle many difficult cases better than classical methods, but they are not infallible. Here is how to analyze and mitigate the most common failure modes.
5.1 Handling Negation
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Test negation cases
negation_examples = [
    "This is not bad at all",        # double negation = positive
    "I wouldn't say it's terrible",  # attenuating negation
    "Not the worst, but not great",  # ambiguous
    "Far from perfect",              # implicit negation
    "Could have been worse",         # negative-positive comparative
]

for text in negation_examples:
    result = classifier(text)[0]
    print(f"'{text}'")
    print(f"  -> {result['label']} ({result['score']:.3f})\n")

# BERT handles "not bad" -> POSITIVE correctly,
# but may struggle with complex and indirect negations
5.2 Error Analysis
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

def analyze_errors(texts, true_labels, predicted_labels, probs):
    """Detailed model error analysis."""
    results = pd.DataFrame({
        'text': texts,
        'true_label': true_labels,
        'pred_label': predicted_labels,
        'confidence': [max(p) for p in probs],
        'correct': [t == p for t, p in zip(true_labels, predicted_labels)]
    })

    # False positives: model says POSITIVE but ground truth is NEGATIVE
    fp = results[(results['true_label'] == 0) & (results['pred_label'] == 1)]
    print(f"False Positives ({len(fp)}):")
    for _, row in fp.head(5).iterrows():
        print(f"  Conf={row['confidence']:.2f}: {row['text'][:80]}")

    # False negatives: model says NEGATIVE but ground truth is POSITIVE
    fn = results[(results['true_label'] == 1) & (results['pred_label'] == 0)]
    print(f"\nFalse Negatives ({len(fn)}):")
    for _, row in fn.head(5).iterrows():
        print(f"  Conf={row['confidence']:.2f}: {row['text'][:80]}")

    # Confusion matrix and classification report
    cm = confusion_matrix(true_labels, predicted_labels)
    print(f"\nConfusion matrix:\n{cm}")
    print("\nClassification Report:")
    print(classification_report(true_labels, predicted_labels,
                                target_names=['NEGATIVE', 'POSITIVE']))
    return results
6. Production Deployment with FastAPI
A sentiment analysis model has value only if it is accessible in production. Here is how to build a fast and scalable REST endpoint with FastAPI.
# sentiment_api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator  # pydantic v1; use field_validator in v2
from transformers import pipeline
from typing import List
import time

app = FastAPI(title="Sentiment Analysis API", version="1.0")

# Load the model only once, at startup
MODEL_PATH = "./models/distilbert-sst2"
sentiment_pipeline = pipeline(
    "text-classification",
    model=MODEL_PATH,
    device=-1,      # -1 = CPU, 0 = first GPU
    batch_size=32,  # batch inference for efficiency
    truncation=True,
    max_length=128
)

class SentimentRequest(BaseModel):
    texts: List[str]

    @validator('texts')
    def validate_texts(cls, texts):
        if not texts:
            raise ValueError("Text list cannot be empty")
        if len(texts) > 100:
            raise ValueError("Maximum 100 texts per request")
        for text in texts:
            if len(text) > 5000:
                raise ValueError("Text too long (max 5000 characters)")
        return texts

class SentimentResult(BaseModel):
    text: str
    label: str
    score: float
    processing_time_ms: float

@app.post("/predict", response_model=List[SentimentResult])
async def predict_sentiment(request: SentimentRequest):
    start = time.time()
    try:
        results = sentiment_pipeline(request.texts)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    elapsed = (time.time() - start) * 1000
    per_text = elapsed / len(request.texts)
    return [
        SentimentResult(
            text=text,
            label=r['label'],
            score=r['score'],
            processing_time_ms=per_text
        )
        for text, r in zip(request.texts, results)
    ]

@app.get("/health")
def health_check():
    return {"status": "ok", "model": MODEL_PATH}

# Start with: uvicorn sentiment_api:app --host 0.0.0.0 --port 8000
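Once the server is up, the endpoint can be exercised from any HTTP client. A sketch using only the standard library; the URL matches the uvicorn command above, and the actual call is commented out since it requires the server to be running.

```python
import json
import urllib.request

payload = {"texts": ["Fantastic product!", "Terrible, it doesn't work"]}
req = urllib.request.Request(
    "http://localhost:8000/predict",  # matches the uvicorn command above
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST"
)

# With the server running:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     for r in json.load(resp):
#         print(r["label"], round(r["score"], 3), "-", r["text"])

print(req.method, req.full_url)
```

Sending the whole list in one request lets the server exploit the `batch_size=32` setting instead of paying per-call overhead text by text.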
7. Latency Optimization
In production, latency is often critical. Here are the main techniques to reduce inference time without losing too much quality.
7.1 Dynamic Quantization
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./models/distilbert-sst2")
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2")

# Dynamic quantization (INT8): smaller model, faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize only Linear layers
    dtype=torch.qint8
)

# Size comparison
import os

def model_size(m):
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / (1024 * 1024)
    os.remove("tmp.pt")
    return size

print(f"Original model:  {model_size(model):.1f} MB")
print(f"Quantized model: {model_size(quantized_model):.1f} MB")
# Original: ~250 MB, Quantized: ~65 MB

# Speed benchmark
import time

def benchmark(m, tokenizer, texts, n_runs=50):
    inputs = tokenizer(texts, return_tensors='pt',
                       padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        # Warm-up
        for _ in range(5):
            _ = m(**inputs)
        # Benchmark
        start = time.time()
        for _ in range(n_runs):
            _ = m(**inputs)
    elapsed = (time.time() - start) / n_runs * 1000
    return elapsed

texts = ["This product is amazing!"] * 8  # batch of 8
t_orig = benchmark(model, tokenizer, texts)
t_quant = benchmark(quantized_model, tokenizer, texts)
print(f"Original: {t_orig:.1f}ms, Quantized: {t_quant:.1f}ms")
print(f"Speedup: {t_orig/t_quant:.2f}x")
7.2 ONNX Export for Deployment
# pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch
import time

# Convert to ONNX with HuggingFace Optimum
model_onnx = ORTModelForSequenceClassification.from_pretrained(
    "./models/distilbert-sst2",
    export=True,  # exports to ONNX on first load
    provider="CPUExecutionProvider"
)
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2")

# Inference with ONNX Runtime
text = "This product exceeded all my expectations!"
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

start = time.time()
outputs = model_onnx(**inputs)
latency = (time.time() - start) * 1000

probs = torch.softmax(outputs.logits, dim=-1)
label = model_onnx.config.id2label[probs.argmax().item()]
confidence = probs.max().item()

print(f"Label: {label}")
print(f"Confidence: {confidence:.3f}")
print(f"Latency: {latency:.1f}ms")
# ONNX is typically 2-4x faster than PyTorch on CPU
8. Complete Evaluation and Reporting
import torch
import numpy as np
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    average_precision_score
)

def evaluate_sentiment_model(model, tokenizer, test_texts, test_labels,
                             batch_size=64):
    """Complete evaluation of the sentiment model."""
    all_probs = []
    all_preds = []
    for i in range(0, len(test_texts), batch_size):
        batch = test_texts[i:i+batch_size]
        inputs = tokenizer(
            batch, return_tensors='pt', padding=True,
            truncation=True, max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1).numpy()
        preds = np.argmax(probs, axis=1)
        all_probs.extend(probs[:, 1])  # positive-class probability
        all_preds.extend(preds)

    all_probs = np.array(all_probs)
    all_preds = np.array(all_preds)
    test_labels = np.array(test_labels)

    # Main report
    print("=== Classification Report ===")
    print(classification_report(
        test_labels, all_preds,
        target_names=['NEGATIVE', 'POSITIVE'],
        digits=4
    ))

    # Additional metrics
    auc = roc_auc_score(test_labels, all_probs)
    ap = average_precision_score(test_labels, all_probs)
    print(f"AUC-ROC: {auc:.4f}")
    print(f"Average Precision: {ap:.4f}")

    # Accuracy by confidence band (confidence = max class probability)
    conf = np.maximum(all_probs, 1 - all_probs)
    for threshold in [0.5, 0.7, 0.9]:
        high_conf = conf >= threshold
        if high_conf.sum() > 0:
            acc_high = (all_preds[high_conf] == test_labels[high_conf]).mean()
            print(f"Accuracy (conf >= {threshold}): {acc_high:.4f} "
                  f"({high_conf.sum()} examples)")

    return all_probs, all_preds
9. Choosing an Optimization Strategy
Section 7 introduced quantization and ONNX export individually. Here we compare the main strategies side by side and benchmark with latency percentiles (p95/p99), which matter more than the mean for production SLOs. Several of them can dramatically reduce inference time while preserving model quality.
Optimization Strategies Comparison
| Strategy | Latency Reduction | Model Size Reduction | Quality Loss | Complexity |
|---|---|---|---|---|
| ONNX Export | 2-4x | ~10% | <0.1% | Low |
| Dynamic Quantization (INT8) | 2-3x | 75% | 0.5-1% | Low |
| Static Quantization (INT8) | 3-5x | 75% | 0.3-0.8% | Medium |
| DistilBERT (KD) | 2x | 40% | 3% | Medium |
| TorchScript | 1.5-2x | None | <0.1% | Low |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification
import numpy as np
import torch
import time

# ---- ONNX export with Optimum ----
model_path = "./models/distilbert-sst2"

# Export and optimize in one step
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_path,
    export=True,  # automatically export to ONNX
    provider="CPUExecutionProvider"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Save the ONNX model
ort_model.save_pretrained("./models/distilbert-sst2-onnx")

# ---- Benchmark: PyTorch vs ONNX ----
def benchmark_model(predict_fn, texts, n_runs=100):
    """Measure latency percentiles over n_runs inferences."""
    for _ in range(10):  # warm-up
        predict_fn(texts[0])
    times = []
    for text in texts[:n_runs]:
        start = time.perf_counter()
        predict_fn(text)
        times.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": round(np.mean(times), 2),
        "p50_ms": round(np.percentile(times, 50), 2),
        "p95_ms": round(np.percentile(times, 95), 2),
        "p99_ms": round(np.percentile(times, 99), 2),
    }

pt_model = AutoModelForSequenceClassification.from_pretrained(model_path)
pt_model.eval()

def pt_predict(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        return pt_model(**inputs).logits

def onnx_predict(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    return ort_model(**inputs).logits

test_texts = ["Excellent product, highly recommended!"] * 100
pt_stats = benchmark_model(pt_predict, test_texts)
onnx_stats = benchmark_model(onnx_predict, test_texts)
print("PyTorch:", pt_stats)
print("ONNX:   ", onnx_stats)
print(f"Speedup: {pt_stats['p95_ms'] / onnx_stats['p95_ms']:.1f}x")
# Dynamic INT8 quantization (no calibration data needed)
import os
import torch
from transformers import AutoModelForSequenceClassification

def quantize_bert_dynamic(model_path: str, output_path: str):
    """Dynamic INT8 quantization for CPU inference."""
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    # Quantize only nn.Linear layers dynamically
    quantized = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )
    torch.save(quantized.state_dict(), f"{output_path}/quantized_model.pt")

    # Compare on-disk sizes (weight files may be .bin or .safetensors)
    original_size = sum(
        os.path.getsize(os.path.join(model_path, f))
        for f in os.listdir(model_path)
        if f.endswith(('.bin', '.safetensors'))
    ) / 1024 / 1024
    print(f"Original size: ~{original_size:.0f} MB")
    print(f"Estimated reduction: ~75% → ~{original_size * 0.25:.0f} MB")
    return quantized
10. Production Best Practices
Anti-Pattern: Deploying Without Domain Validation
A model trained on SST-2 (movie reviews) can perform poorly on technical support tickets or social media posts. Always validate on your specific domain before deploying.
Production Deployment Checklist
- Evaluate the model on target domain data — not just public benchmarks
- Set confidence thresholds: return "uncertain" below threshold (e.g., 0.6)
- Monitor the confidence score distribution over time
- Implement a feedback mechanism to collect incorrect predictions
- Version model and tokenizer together
- Test behavior on edge cases: empty text, special characters, extreme lengths
- Implement rate limiting and timeouts for the API
- Log all predictions for post-hoc analysis
from transformers import pipeline

class ProductionSentimentClassifier:
    """Production-ready sentiment classifier."""

    def __init__(self, model_path: str, confidence_threshold: float = 0.7):
        self.pipeline = pipeline(
            "text-classification",
            model=model_path,
            truncation=True,
            max_length=128
        )
        self.threshold = confidence_threshold

    def _postprocess(self, result: dict) -> dict:
        # Uncertainty handling: don't force a prediction at low confidence
        if result['score'] < self.threshold:
            return {
                "label": "UNCERTAIN",
                "score": result['score'],
                "raw_label": result['label'],
                "reason": "below_confidence_threshold"
            }
        return {"label": result['label'], "score": result['score'], "reason": "ok"}

    def predict(self, text: str) -> dict:
        # Input validation
        if not text or not text.strip():
            return {"label": "UNKNOWN", "score": 0.0, "reason": "empty_input"}
        text = text.strip()[:5000]  # truncate overly long texts
        return self._postprocess(self.pipeline(text)[0])

    def predict_batch(self, texts: list) -> list:
        # Clean texts preserving positions; run the non-empty ones in one batch
        cleaned = [t.strip()[:5000] if t and t.strip() else "" for t in texts]
        raw = iter(self.pipeline([t for t in cleaned if t]))
        return [
            self._postprocess(next(raw)) if t
            else {"label": "UNKNOWN", "score": 0.0, "reason": "empty_input"}
            for t in cleaned
        ]
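The checklist item about monitoring the confidence distribution can be made concrete with a Population Stability Index (PSI) check between a baseline window and recent traffic. A minimal sketch; the bin count and the 0.2 alert threshold are conventional rules of thumb, not fixed values.

```python
import numpy as np

def confidence_psi(baseline, recent, bins=10):
    """Population Stability Index between two confidence-score samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    eps = 1e-6  # avoid log(0) on empty bins
    p = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    q = np.histogram(recent, bins=edges)[0] / len(recent) + eps
    return float(np.sum((p - q) * np.log(p / q)))

baseline = np.full(500, 0.92)  # the model used to be very confident
same = np.full(500, 0.92)
shifted = np.full(500, 0.55)   # confidence collapsed: likely data drift

print(f"no drift: PSI = {confidence_psi(baseline, same):.3f}")
print(f"drifted:  PSI = {confidence_psi(baseline, shifted):.3f}")
```

Stored prediction logs (see the checklist) provide exactly the inputs this check needs, so it can run as a scheduled job rather than in the request path.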
Conclusions and Next Steps
We have covered the complete lifecycle of a sentiment analysis system: from classical approaches (VADER, TF-IDF) to Transformer fine-tuning, from imbalanced data handling to production deployment with FastAPI and latency optimization.
Key Takeaways
- Choose the approach based on requirements: VADER for speed, BERT for quality
- Always evaluate on your specific domain, not just benchmarks
- Handle class imbalance with weighted loss or oversampling
- Use confidence thresholds in production instead of forced predictions
- DistilBERT offers an excellent speed/quality trade-off for production
- Monitor predictions over time to detect data drift
Continue the Series
- Next: Italian NLP — feel-it, AlBERTo and Italian-specific challenges
- Article 5: Named Entity Recognition — extract entities from text
- Article 6: Multi-label Text Classification — when text belongs to multiple categories
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API, Datasets, Hub
- Article 10: NLP Monitoring in Production — drift detection and automatic retraining