Sentiment Analysis with Transformers: Techniques and Implementation
Sentiment analysis is one of the most in-demand NLP tasks in enterprise settings. Companies of every size analyze product reviews, social media posts, support tickets, and customer feedback at scale to understand what people really think. With the advent of BERT and other Transformer models, the quality of these systems has improved radically compared to classic dictionary-based or TF-IDF approaches.
In this article we build a complete sentiment analysis system: from dataset preparation to production deployment, including HuggingFace fine-tuning, handling class imbalance, evaluating metrics, and strategies for edge cases like irony, negation, and ambiguous language.
This is the third article in the Modern NLP: from BERT to LLMs series. It assumes familiarity with BERT fundamentals (article 2). For Italian-specific models, see article 4 on feel-it and AlBERTo.
What You Will Learn
- Classical vs BERT approaches: VADER, lexicon-based, fine-tuned Transformers
- Public sentiment datasets: SST-2, IMDb, Amazon Reviews, SemEval
- Complete implementation with HuggingFace Transformers and Trainer API
- Handling class imbalance in sentiment datasets
- Metrics: accuracy, F1, precision, recall, AUC-ROC
- Fine-grained sentiment: Aspect-Based Sentiment Analysis (ABSA) and intensity
- Hard cases: irony, negation, ambiguous language
- Production pipeline with FastAPI and batch inference
- Latency optimization: quantization and ONNX export
1. Evolution of Approaches: from VADER to BERT
Before diving into Transformer implementation, it is useful to understand the historical path of sentiment analysis approaches — in production you often use the simplest method that meets the requirements.
1.1 Dictionary-Based Approaches: VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based analyzer optimized for social media. It requires no training, is extremely fast, and works surprisingly well on informal text.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Basic examples
texts = [
    "This product is absolutely AMAZING!!!",           # strong positive
    "The service was okay I guess",                    # ambiguous neutral
    "Worst purchase I've ever made. Complete waste.",  # negative
    "The food wasn't bad at all",                      # tricky negation
    "Yeah right, as if this would work :)",            # sarcasm
]

for text in texts:
    scores = analyzer.polarity_scores(text)
    print(f"Text: {text[:50]}")
    print(f"  neg={scores['neg']:.3f}, neu={scores['neu']:.3f}, "
          f"pos={scores['pos']:.3f}, compound={scores['compound']:.3f}")
    label = ('POSITIVE' if scores['compound'] >= 0.05
             else 'NEGATIVE' if scores['compound'] <= -0.05
             else 'NEUTRAL')
    print(f"  Label: {label}\n")

# VADER handles well: capitalization, punctuation, emoji
# Struggles with: sarcasm, complex context
1.2 Classical Machine Learning Approaches
Before Transformers, TF-IDF features with Logistic Regression or an SVM were the most common approach. They remain useful as fast baselines, or when labeled data is very scarce.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Sample dataset (a real baseline needs far more examples)
train_texts = [
    "Excellent product, highly recommend to everyone",
    "Terrible experience, will not buy again",
    "Great quality, fast shipping",
    "Complete waste of money",
    "Impeccable customer service",
    "Defective product, very disappointed"
]
train_labels = [1, 0, 1, 0, 1, 0]

# TF-IDF + Logistic Regression pipeline
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams and bigrams
        max_features=50000,
        sublinear_tf=True    # log(1+tf) to dampen high frequencies
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000))
])
pipe.fit(train_texts, train_labels)

# Evaluation
test_texts = ["Fantastic product!", "Terrible, it doesn't work"]
preds = pipe.predict(test_texts)
probs = pipe.predict_proba(test_texts)
for text, pred, prob in zip(test_texts, preds, probs):
    label = 'POSITIVE' if pred == 1 else 'NEGATIVE'
    confidence = max(prob)
    print(f"{text}: {label} ({confidence:.2f})")
1.3 Why BERT Is Superior
Sentiment Analysis Approaches Comparison
| Approach | Accuracy (SST-2) | Latency | Training Data | Hard Cases |
|---|---|---|---|---|
| VADER | ~71% | <1ms | None | Poor |
| TF-IDF + LR | ~85% | ~5ms | Required | Fair |
| DistilBERT | ~91% | ~50ms | Required | Good |
| BERT-base | ~93% | ~100ms | Required | Very Good |
| RoBERTa | ~96% | ~100ms | Required | Excellent |
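The table suggests a practical pattern hinted at earlier: route easy texts through a cheap lexicon scorer and reserve the Transformer for ambiguous ones. Below is a minimal sketch of such a cascade; the `band` threshold and the stub scorers are illustrative, not from any library, and would be replaced by VADER and a fine-tuned pipeline in practice.

```python
def cascade_sentiment(text, fast_scorer, slow_scorer, band=0.3):
    """Two-stage cascade: trust the cheap lexicon score when it is
    decisive, escalate ambiguous texts to the expensive model.

    fast_scorer: text -> compound score in [-1, 1] (e.g. VADER)
    slow_scorer: text -> 'POSITIVE' | 'NEGATIVE'   (e.g. a BERT pipeline)
    """
    compound = fast_scorer(text)
    if abs(compound) >= band:
        # Cheap path: decisive lexicon score, no model call needed
        return ("POSITIVE" if compound > 0 else "NEGATIVE", "lexicon")
    # Ambiguous: pay the latency cost of the Transformer
    return (slow_scorer(text), "transformer")

# Stub scorers just to show the routing logic
fast = lambda t: 0.8 if "amazing" in t else 0.0
slow = lambda t: "NEGATIVE"

print(cascade_sentiment("This is amazing", fast, slow))   # ('POSITIVE', 'lexicon')
print(cascade_sentiment("It is what it is", fast, slow))  # ('NEGATIVE', 'transformer')
```

The width of the ambiguity band is the knob: widen it for quality (more texts hit the Transformer), narrow it for latency.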
2. Datasets for Sentiment Analysis
The quality of fine-tuning depends heavily on the quality and size of the dataset. Here are the most important English datasets, with Italian resources covered in the next article.
from datasets import load_dataset
# SST-2: Stanford Sentiment Treebank (binary: positive/negative)
sst2 = load_dataset("glue", "sst2")
print(sst2)
# train: 67,349 examples, validation: 872, test: 1,821
# IMDb Reviews (binary: positive/negative)
imdb = load_dataset("imdb")
print(imdb)
# train: 25,000, test: 25,000
# Amazon Polarity (binary: positive/negative, derived from 1-5 star reviews)
amazon = load_dataset("amazon_polarity")
print(amazon)
# train: 3,600,000, test: 400,000
# Dataset exploration
print("\nSST-2 examples:")
for example in sst2['train'].select(range(3)):
    label = 'POSITIVE' if example['label'] == 1 else 'NEGATIVE'
    print(f"  [{label}] {example['sentence']}")
# Class distribution analysis
from collections import Counter
labels = sst2['train']['label']
print("\nSST-2 train distribution:", Counter(labels))
# Counter({1: 37569, 0: 29780}) - slight imbalance
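The same care applies when carving a validation set out of your own labeled data: a naive random split can skew an already imbalanced label distribution. A small sketch with scikit-learn's `train_test_split` (the toy corpus below is illustrative):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Illustrative labeled corpus: 60% positive, 40% negative
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

train_x, val_x, train_y, val_y = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,   # preserve the 60/40 ratio in both splits
    random_state=42
)
print(Counter(train_y))  # Counter({1: 48, 0: 32})
print(Counter(val_y))    # Counter({1: 12, 0: 8})
```

Without `stratify`, small validation sets can end up with a class ratio quite different from the training set, which distorts every metric you compute on them.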
3. Complete Fine-tuning with HuggingFace
Let's build a complete sentiment classifier, from data preparation to saving the trained model.
3.1 Data Preparation
from transformers import AutoTokenizer
from datasets import load_dataset

# Using DistilBERT for speed (~97% of BERT's accuracy, ~60% faster)
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load SST-2 from GLUE
dataset = load_dataset("glue", "sst2")

def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
        return_tensors=None  # returns lists, not tensors
    )

# Tokenize the full dataset (cached by datasets)
tokenized = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    remove_columns=["sentence", "idx"]  # drop columns the model doesn't need
)

# PyTorch format
tokenized.set_format("torch")
print(tokenized)
print("Train columns:", tokenized['train'].column_names)
# ['input_ids', 'attention_mask', 'label']
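Note that `padding="max_length"` pads every example to 128 tokens even though most SST-2 sentences are far shorter. Back-of-the-envelope arithmetic shows what per-batch dynamic padding saves (the batch lengths below are illustrative); in transformers this is done by tokenizing with `padding=False` and passing a `DataCollatorWithPadding` to the `Trainer`.

```python
# Token counts of one illustrative batch (most SST-2 sentences are short)
lengths = [12, 45, 9, 88, 23, 17]

fixed_tokens = 128 * len(lengths)             # padding="max_length"
dynamic_tokens = max(lengths) * len(lengths)  # pad only to the longest in batch
savings = 1 - dynamic_tokens / fixed_tokens
print(f"Wasted compute avoided: {savings:.1%}")  # 31.2%

# In practice, tokenize with padding=False and let the Trainer pad per batch:
# from transformers import DataCollatorWithPadding
# trainer = Trainer(..., data_collator=DataCollatorWithPadding(tokenizer=tokenizer))
```

Fixed-length padding keeps the example simple; dynamic padding is the usual choice once training time matters.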
3.2 Model Definition and Training
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import evaluate
import numpy as np

# Model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Evaluation metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(
            predictions=predictions, references=labels)["accuracy"],
        "f1": f1.compute(
            predictions=predictions, references=labels,
            average="binary")["f1"]
    }
# Training configuration
training_args = TrainingArguments(
    output_dir="./results/distilbert-sst2",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_ratio=0.1,
    weight_decay=0.01,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",  # "eval_strategy" in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_dir="./logs",
    logging_steps=100,
    fp16=True,                    # mixed precision (GPU with Tensor Cores)
    dataloader_num_workers=4,
    report_to="none",             # disable wandb/tensorboard for simplicity
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)
# Start training
train_result = trainer.train()
print(f"Training loss: {train_result.training_loss:.4f}")
# Final evaluation
metrics = trainer.evaluate()
print(f"Validation accuracy: {metrics['eval_accuracy']:.4f}")
print(f"Validation F1: {metrics['eval_f1']:.4f}")
# Save model and tokenizer together
trainer.save_model("./models/distilbert-sst2")
tokenizer.save_pretrained("./models/distilbert-sst2")
3.3 Handling Class Imbalance
In many real-world datasets (e.g., customer support reviews), classes are heavily imbalanced: 90% negative, 10% positive. Without adjustments, the model will learn to always predict the majority class.
import torch
from torch import nn
from transformers import Trainer

# Solution 1: weighted loss function
class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments (e.g. num_items_in_batch)
        # passed by newer Trainer versions
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # CrossEntropy with weights inversely proportional to class frequency
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device)
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                        labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Compute weights from dataset frequencies
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

labels = tokenized['train']['label'].numpy()
weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(labels),
    y=labels
)
print("Class weights:", weights)  # e.g. [2.3, 0.7] if the negative class is rare

# Solution 2: oversampling with imbalanced-learn
# pip install imbalanced-learn
# from imblearn.over_sampling import RandomOverSampler
# (applicable to feature matrices, not directly to tokenized tensors)

# Solution 3: appropriate metrics for imbalanced data
# Use macro F1 or the minority-class F1, not just accuracy
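A tiny, self-contained illustration of why accuracy misleads here: on a 90/10 split, a degenerate model that always predicts the majority class scores 90% accuracy while its macro F1 collapses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 90/10 imbalanced ground truth; a model that always predicts class 0
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
minority = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy:    {acc:.2f}")       # 0.90 -- looks fine
print(f"macro F1:    {macro:.2f}")     # 0.47 -- exposes the problem
print(f"minority F1: {minority:.2f}")  # 0.00 -- the class we care about
```

This is why the evaluation section below reports per-class metrics rather than a single accuracy number.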
4. Fine-grained Sentiment: Aspect-Based (ABSA)
Binary sentiment analysis (positive/negative) does not capture the complexity of real opinions. A customer can be satisfied with the product but unhappy with the shipping. Aspect-Based Sentiment Analysis (ABSA) identifies the sentiment for each mentioned aspect.
from transformers import pipeline

# Zero-shot classification for ABSA
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

review = ("The product is excellent but shipping took three weeks. "
          "Customer service never responded.")

# Classify the sentiment of each aspect separately
aspects = ["product", "shipping", "customer service"]
sentiments_per_aspect = {}

for aspect in aspects:
    result = classifier(
        review,
        candidate_labels=["positive", "negative", "neutral"],
        # the "{}" placeholder is filled with each candidate label by the pipeline
        hypothesis_template=f"In this review, the sentiment regarding {aspect} is {{}}."
    )
    sentiments_per_aspect[aspect] = result['labels'][0]
    print(f"{aspect}: {result['labels'][0]} ({result['scores'][0]:.2f})")

# Expected output (scores are indicative):
# product: positive (0.89)
# shipping: negative (0.92)
# customer service: negative (0.87)
5. Hard Cases: Irony, Negation, Ambiguity
BERT models handle many difficult cases better than classical methods, but they are not infallible. Here is how to analyze and mitigate the most common failure modes.
5.1 Handling Negation
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Test negation cases
negation_examples = [
    "This is not bad at all",        # double negation = positive
    "I wouldn't say it's terrible",  # attenuating negation
    "Not the worst, but not great",  # ambiguous
    "Far from perfect",              # implicit negation
    "Could have been worse",         # negative-positive comparative
]

for text in negation_examples:
    result = classifier(text)[0]
    print(f"'{text}'")
    print(f"  -> {result['label']} ({result['score']:.3f})\n")

# BERT handles "not bad" -> POSITIVE correctly,
# but may struggle with complex and indirect negations
5.2 Error Analysis
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

def analyze_errors(texts, true_labels, predicted_labels, probs):
    """Detailed model error analysis."""
    results = pd.DataFrame({
        'text': texts,
        'true_label': true_labels,
        'pred_label': predicted_labels,
        'confidence': [max(p) for p in probs],
        'correct': [t == p for t, p in zip(true_labels, predicted_labels)]
    })

    # False positives: model says POSITIVE but ground truth is NEGATIVE
    fp = results[(results['true_label'] == 0) & (results['pred_label'] == 1)]
    print(f"False Positives ({len(fp)}):")
    for _, row in fp.head(5).iterrows():
        print(f"  Conf={row['confidence']:.2f}: {row['text'][:80]}")

    # False negatives: model says NEGATIVE but ground truth is POSITIVE
    fn = results[(results['true_label'] == 1) & (results['pred_label'] == 0)]
    print(f"\nFalse Negatives ({len(fn)}):")
    for _, row in fn.head(5).iterrows():
        print(f"  Conf={row['confidence']:.2f}: {row['text'][:80]}")

    # Confusion matrix and classification report
    cm = confusion_matrix(true_labels, predicted_labels)
    print(f"\nConfusion matrix:\n{cm}")
    print("\nClassification Report:")
    print(classification_report(true_labels, predicted_labels,
                                target_names=['NEGATIVE', 'POSITIVE']))
    return results
6. Production Deployment with FastAPI
A sentiment analysis model has value only if it is accessible in production. Here is how to build a fast and scalable REST endpoint with FastAPI.
# sentiment_api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator  # pydantic v1; use field_validator in v2
from transformers import pipeline
from typing import List
import time

app = FastAPI(title="Sentiment Analysis API", version="1.0")

# Load the model only once, at startup
MODEL_PATH = "./models/distilbert-sst2"
sentiment_pipeline = pipeline(
    "text-classification",
    model=MODEL_PATH,
    device=-1,      # -1 = CPU, 0 = first GPU
    batch_size=32,  # batch inference for efficiency
    truncation=True,
    max_length=128
)

class SentimentRequest(BaseModel):
    texts: List[str]

    @validator('texts')
    def validate_texts(cls, texts):
        if not texts:
            raise ValueError("Text list cannot be empty")
        if len(texts) > 100:
            raise ValueError("Maximum 100 texts per request")
        for text in texts:
            if len(text) > 5000:
                raise ValueError("Text too long (max 5000 characters)")
        return texts

class SentimentResult(BaseModel):
    text: str
    label: str
    score: float
    processing_time_ms: float

@app.post("/predict", response_model=List[SentimentResult])
async def predict_sentiment(request: SentimentRequest):
    start = time.time()
    try:
        results = sentiment_pipeline(request.texts)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    elapsed = (time.time() - start) * 1000
    per_text = elapsed / len(request.texts)
    return [
        SentimentResult(
            text=text,
            label=r['label'],
            score=r['score'],
            processing_time_ms=per_text
        )
        for text, r in zip(request.texts, results)
    ]

@app.get("/health")
def health_check():
    return {"status": "ok", "model": MODEL_PATH}

# Start with: uvicorn sentiment_api:app --host 0.0.0.0 --port 8000
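Once the server is up, the endpoint can be exercised from any HTTP client. A sketch using only the standard library; the URL matches the uvicorn command above, and the actual call is commented out since it requires the server to be running.

```python
import json
import urllib.request

payload = {"texts": ["Fantastic product!", "Terrible, it doesn't work"]}
req = urllib.request.Request(
    "http://localhost:8000/predict",  # matches the uvicorn command above
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST"
)

# With the server running:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     for r in json.load(resp):
#         print(r["label"], round(r["score"], 3), "-", r["text"])

print(req.method, req.full_url)
```

Sending the whole list in one request lets the server exploit the `batch_size=32` setting instead of paying per-call overhead text by text.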
7. Latency Optimization
In production, latency is often critical. Here are the main techniques to reduce inference time without losing too much quality.
7.1 Dynamic Quantization
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./models/distilbert-sst2")
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2")

# Dynamic quantization (INT8): smaller model, faster CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # quantize only Linear layers
    dtype=torch.qint8
)

# Size comparison
import os

def model_size(m):
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / (1024 * 1024)
    os.remove("tmp.pt")
    return size

print(f"Original model:  {model_size(model):.1f} MB")
print(f"Quantized model: {model_size(quantized_model):.1f} MB")
# Original: ~250 MB, Quantized: ~65 MB

# Speed benchmark
import time

def benchmark(m, tokenizer, texts, n_runs=50):
    inputs = tokenizer(texts, return_tensors='pt',
                       padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        # Warm-up
        for _ in range(5):
            _ = m(**inputs)
        # Benchmark
        start = time.time()
        for _ in range(n_runs):
            _ = m(**inputs)
    elapsed = (time.time() - start) / n_runs * 1000
    return elapsed

texts = ["This product is amazing!"] * 8  # batch of 8
t_orig = benchmark(model, tokenizer, texts)
t_quant = benchmark(quantized_model, tokenizer, texts)
print(f"Original: {t_orig:.1f}ms, Quantized: {t_quant:.1f}ms")
print(f"Speedup: {t_orig/t_quant:.2f}x")
7.2 ONNX Export for Deployment
# pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
import torch
import time

# Convert to ONNX with HuggingFace Optimum
model_onnx = ORTModelForSequenceClassification.from_pretrained(
    "./models/distilbert-sst2",
    export=True,  # exports to ONNX on first load
    provider="CPUExecutionProvider"
)
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2")

# Inference with ONNX Runtime
text = "This product exceeded all my expectations!"
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)

start = time.time()
outputs = model_onnx(**inputs)
latency = (time.time() - start) * 1000

probs = torch.softmax(outputs.logits, dim=-1)
label = model_onnx.config.id2label[probs.argmax().item()]
confidence = probs.max().item()

print(f"Label: {label}")
print(f"Confidence: {confidence:.3f}")
print(f"Latency: {latency:.1f}ms")
# ONNX is typically 2-4x faster than PyTorch on CPU
8. Complete Evaluation and Reporting
import torch
import numpy as np
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    average_precision_score
)

def evaluate_sentiment_model(model, tokenizer, test_texts, test_labels,
                             batch_size=64):
    """Complete evaluation of the sentiment model."""
    all_probs = []
    all_preds = []
    for i in range(0, len(test_texts), batch_size):
        batch = test_texts[i:i+batch_size]
        inputs = tokenizer(
            batch, return_tensors='pt', padding=True,
            truncation=True, max_length=128
        )
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1).numpy()
        preds = np.argmax(probs, axis=1)
        all_probs.extend(probs[:, 1])  # positive-class probability
        all_preds.extend(preds)

    all_probs = np.array(all_probs)
    all_preds = np.array(all_preds)
    test_labels = np.array(test_labels)

    # Main report
    print("=== Classification Report ===")
    print(classification_report(
        test_labels, all_preds,
        target_names=['NEGATIVE', 'POSITIVE'],
        digits=4
    ))

    # Additional metrics
    auc = roc_auc_score(test_labels, all_probs)
    ap = average_precision_score(test_labels, all_probs)
    print(f"AUC-ROC: {auc:.4f}")
    print(f"Average Precision: {ap:.4f}")

    # Accuracy by confidence band (confidence = max class probability)
    conf = np.maximum(all_probs, 1 - all_probs)
    for threshold in [0.5, 0.7, 0.9]:
        high_conf = conf >= threshold
        if high_conf.sum() > 0:
            acc_high = (all_preds[high_conf] == test_labels[high_conf]).mean()
            print(f"Accuracy (conf >= {threshold}): {acc_high:.4f} "
                  f"({high_conf.sum()} examples)")

    return all_probs, all_preds
9. Choosing an Optimization Strategy
Section 7 introduced quantization and ONNX export individually. Here we compare the main strategies side by side and benchmark with latency percentiles (p95/p99), which matter more than the mean for production SLOs. Several of them can dramatically reduce inference time while preserving model quality.
Optimization Strategies Comparison
| Strategy | Latency Reduction | Model Size Reduction | Quality Loss | Complexity |
|---|---|---|---|---|
| ONNX Export | 2-4x | ~10% | <0.1% | Low |
| Dynamic Quantization (INT8) | 2-3x | 75% | 0.5-1% | Low |
| Static Quantization (INT8) | 3-5x | 75% | 0.3-0.8% | Medium |
| DistilBERT (KD) | 2x | 40% | 3% | Medium |
| TorchScript | 1.5-2x | None | <0.1% | Low |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification
import numpy as np
import torch
import time

# ---- ONNX export with Optimum ----
model_path = "./models/distilbert-sst2"

# Export and optimize in one step
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_path,
    export=True,  # automatically export to ONNX
    provider="CPUExecutionProvider"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Save the ONNX model
ort_model.save_pretrained("./models/distilbert-sst2-onnx")

# ---- Benchmark: PyTorch vs ONNX ----
def benchmark_model(predict_fn, texts, n_runs=100):
    """Measure latency percentiles over n_runs inferences."""
    for _ in range(10):  # warm-up
        predict_fn(texts[0])
    times = []
    for text in texts[:n_runs]:
        start = time.perf_counter()
        predict_fn(text)
        times.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": round(np.mean(times), 2),
        "p50_ms": round(np.percentile(times, 50), 2),
        "p95_ms": round(np.percentile(times, 95), 2),
        "p99_ms": round(np.percentile(times, 99), 2),
    }

pt_model = AutoModelForSequenceClassification.from_pretrained(model_path)
pt_model.eval()

def pt_predict(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        return pt_model(**inputs).logits

def onnx_predict(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    return ort_model(**inputs).logits

test_texts = ["Excellent product, highly recommended!"] * 100
pt_stats = benchmark_model(pt_predict, test_texts)
onnx_stats = benchmark_model(onnx_predict, test_texts)
print("PyTorch:", pt_stats)
print("ONNX:   ", onnx_stats)
print(f"Speedup: {pt_stats['p95_ms'] / onnx_stats['p95_ms']:.1f}x")
# Dynamic INT8 quantization (no calibration data needed)
import os
import torch
from transformers import AutoModelForSequenceClassification

def quantize_bert_dynamic(model_path: str, output_path: str):
    """Dynamic INT8 quantization for CPU inference."""
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    # Quantize only nn.Linear layers dynamically
    quantized = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )
    torch.save(quantized.state_dict(), f"{output_path}/quantized_model.pt")

    # Compare on-disk sizes (weight files may be .bin or .safetensors)
    original_size = sum(
        os.path.getsize(os.path.join(model_path, f))
        for f in os.listdir(model_path)
        if f.endswith(('.bin', '.safetensors'))
    ) / 1024 / 1024
    print(f"Original size: ~{original_size:.0f} MB")
    print(f"Estimated reduction: ~75% → ~{original_size * 0.25:.0f} MB")
    return quantized
10. Production Best Practices
Anti-Pattern: Deploying Without Domain Validation
A model trained on SST-2 (movie reviews) can perform poorly on technical support tickets or social media posts. Always validate on your specific domain before deploying.
Production Deployment Checklist
- Evaluate the model on target domain data — not just public benchmarks
- Set confidence thresholds: return "uncertain" below threshold (e.g., 0.6)
- Monitor the confidence score distribution over time
- Implement a feedback mechanism to collect incorrect predictions
- Version model and tokenizer together
- Test behavior on edge cases: empty text, special characters, extreme lengths
- Implement rate limiting and timeouts for the API
- Log all predictions for post-hoc analysis
from transformers import pipeline

class ProductionSentimentClassifier:
    """Production-ready sentiment classifier."""

    def __init__(self, model_path: str, confidence_threshold: float = 0.7):
        self.pipeline = pipeline(
            "text-classification",
            model=model_path,
            truncation=True,
            max_length=128
        )
        self.threshold = confidence_threshold

    def _postprocess(self, result: dict) -> dict:
        # Uncertainty handling: don't force a prediction at low confidence
        if result['score'] < self.threshold:
            return {
                "label": "UNCERTAIN",
                "score": result['score'],
                "raw_label": result['label'],
                "reason": "below_confidence_threshold"
            }
        return {"label": result['label'], "score": result['score'], "reason": "ok"}

    def predict(self, text: str) -> dict:
        # Input validation
        if not text or not text.strip():
            return {"label": "UNKNOWN", "score": 0.0, "reason": "empty_input"}
        text = text.strip()[:5000]  # truncate overly long texts
        return self._postprocess(self.pipeline(text)[0])

    def predict_batch(self, texts: list) -> list:
        # Clean texts preserving positions; run the non-empty ones in one batch
        cleaned = [t.strip()[:5000] if t and t.strip() else "" for t in texts]
        raw = iter(self.pipeline([t for t in cleaned if t]))
        return [
            self._postprocess(next(raw)) if t
            else {"label": "UNKNOWN", "score": 0.0, "reason": "empty_input"}
            for t in cleaned
        ]
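The checklist item about monitoring the confidence distribution can be made concrete with a Population Stability Index (PSI) check between a baseline window and recent traffic. A minimal sketch; the bin count and the 0.2 alert threshold are conventional rules of thumb, not fixed values.

```python
import numpy as np

def confidence_psi(baseline, recent, bins=10):
    """Population Stability Index between two confidence-score samples.
    Common rule of thumb: < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    eps = 1e-6  # avoid log(0) on empty bins
    p = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    q = np.histogram(recent, bins=edges)[0] / len(recent) + eps
    return float(np.sum((p - q) * np.log(p / q)))

baseline = np.full(500, 0.92)  # the model used to be very confident
same = np.full(500, 0.92)
shifted = np.full(500, 0.55)   # confidence collapsed: likely data drift

print(f"no drift: PSI = {confidence_psi(baseline, same):.3f}")
print(f"drifted:  PSI = {confidence_psi(baseline, shifted):.3f}")
```

Stored prediction logs (see the checklist) provide exactly the inputs this check needs, so it can run as a scheduled job rather than in the request path.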
Conclusions and Next Steps
We have covered the complete lifecycle of a sentiment analysis system: from classical approaches (VADER, TF-IDF) to Transformer fine-tuning, from imbalanced data handling to production deployment with FastAPI and latency optimization.
Key Takeaways
- Choose the approach based on requirements: VADER for speed, BERT for quality
- Always evaluate on your specific domain, not just benchmarks
- Handle class imbalance with weighted loss or oversampling
- Use confidence thresholds in production instead of forced predictions
- DistilBERT offers an excellent speed/quality trade-off for production
- Monitor predictions over time to detect data drift
Continue the Series
- Next: Italian NLP — feel-it, AlBERTo and Italian-specific challenges
- Article 5: Named Entity Recognition — extract entities from text
- Article 6: Multi-label Text Classification — when text belongs to multiple categories
- Article 7: HuggingFace Transformers: Complete Guide — Trainer API, Datasets, Hub
- Article 10: NLP Monitoring in Production — drift detection and automatic retraining