HuggingFace Transformers: Practical Guide to the Ecosystem
HuggingFace has become the reference platform for modern machine learning.
With over 500,000 pre-trained models, more than 100,000 datasets, and libraries such as
`transformers`, `datasets`, `peft`, `accelerate`, and `optimum`, it provides the
infrastructure underlying most current NLP and Computer Vision research and development.
In this article we explore the HuggingFace ecosystem in a practical and systematic way: from choosing the right model in the Hub, to the Trainer API for fine-tuning, to managing large datasets, to model optimization and deployment. We will also cover advanced patterns such as custom training loops, personalized callbacks, production-optimized inference, and integration with MLOps systems.
This is the seventh article in the Modern NLP: from BERT to LLMs series. It assumes familiarity with BERT (article 2) and sentiment analysis (article 3).
What You Will Learn
- The HuggingFace ecosystem: main libraries and when to use them
- Model Hub: searching, filtering, and loading models
- AutoClass API: AutoModel, AutoTokenizer, AutoConfig
- Pipeline API: zero-config inference for common tasks
- Datasets library: loading, manipulation, and streaming of large datasets
- Trainer API: complete fine-tuning with logging, callbacks, and checkpointing
- Custom training loops with native PyTorch
- PEFT and LoRA: efficient fine-tuning with few parameters
- Accelerate: distributed training and mixed precision
- Inference optimization: ONNX, BitsAndBytes, quantization
- Push to Hub: sharing models and datasets publicly
- Integration with WandB, MLflow, and MLOps systems
1. The HuggingFace Ecosystem
The HuggingFace ecosystem is composed of many separate but integrated libraries. Understanding which to use for each scenario is fundamental to avoid reinventing the wheel and maximizing the benefit of community work.
Main Libraries and Use Cases
| Library | Purpose | Typical Scenario | Installation |
|---|---|---|---|
| `transformers` | Models, tokenizers, training | BERT fine-tuning, inference pipeline | `pip install transformers` |
| `datasets` | Dataset management | Loading, preprocessing, streaming | `pip install datasets` |
| `peft` | Efficient fine-tuning | LoRA, Prefix Tuning, P-Tuning | `pip install peft` |
| `accelerate` | Distributed training | Multi-GPU, TPU, mixed precision | `pip install accelerate` |
| `optimum` | Inference optimization | ONNX export, quantization, TensorRT | `pip install optimum` |
| `evaluate` | Standard NLP metrics | BLEU, ROUGE, F1, accuracy, seqeval | `pip install evaluate` |
| `trl` | RLHF and SFT for LLMs | Instruction-following, reward modeling | `pip install trl` |
| `safetensors` | Secure format for weights | Fast and secure save/load | `pip install safetensors` |
| `sentence-transformers` | Sentence embeddings | Semantic similarity, clustering, RAG | `pip install sentence-transformers` |
| `tokenizers` | Fast tokenization | Custom BPE, WordPiece, Unigram | `pip install tokenizers` |
Library selection depends on context. For rapid prototyping, use `pipeline()` from
`transformers`. For production-grade fine-tuning, use `Trainer` with `datasets`.
For large models on limited GPUs, use `peft`. For optimized inference, use `optimum`.
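The rapid-prototyping path can be sketched in a few lines (the checkpoint name is just a common public model used for illustration):

```python
from transformers import pipeline

# pipeline() resolves the checkpoint, downloads tokenizer and weights,
# and wires tokenization -> inference -> post-processing into one object
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
preds = clf(["The new release fixed every bug I reported."])
print(preds)  # a list of {'label': ..., 'score': ...} dicts
```

Section 4 covers the Pipeline API in depth, including batching and device placement.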
2. Model Hub: Finding the Right Model
The HuggingFace Hub hosts over 500,000 models. Finding the right one
requires understanding available filters and naming conventions. A model is identified
by username/model-name, with tags for language, task, framework, and dataset.
from huggingface_hub import HfApi, list_models

api = HfApi()
# Search models by task and language
# (recent huggingface_hub versions take filters as keyword arguments;
#  the older ModelFilter class has been removed)
models = list(list_models(
    task="text-classification",
    language="en",  # English
    sort="downloads",
    direction=-1,  # descending
    limit=10
))
print("Top 10 English models for text-classification:")
for i, model in enumerate(models, 1):
print(f" {i}. {model.modelId} "
f"(downloads: {model.downloads:,}, likes: {model.likes})")
# Search for Italian BERT models
italian_models = list(list_models(
search="bert italian",
sort="downloads",
direction=-1,
limit=5
))
# Load detailed model info
model_info = api.model_info("dbmdz/bert-base-italian-cased")
print(f"\nModel: {model_info.modelId}")
print(f"Task: {model_info.pipeline_tag}")
print(f"Tags: {model_info.tags}")
print(f"Downloads/month: {model_info.downloads:,}")
# Recommended Italian models by task
italian_models_map = {
"sentiment": [
"neuraly/bert-base-italian-cased-sentiment",
"MilaNLProc/feel-it-italian-sentiment",
"morenolq/bert-base-italian-cased-sentiment"
],
"ner": [
"osiria/bert-base-italian-uncased-ner",
"Babelscape/wikineural-multilingual-ner"
],
"embeddings": [
"nickprock/sentence-bert-base-italian-uncased",
        "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
],
"base_models": [
"dbmdz/bert-base-italian-cased",
"dbmdz/bert-base-italian-uncased"
]
}
for task, models_list in italian_models_map.items():
print(f"\n{task.upper()}:")
for m in models_list:
print(f" - {m}")
Criteria for Choosing a Model from the Hub
- Monthly downloads: indicator of community adoption and reliability
- Task tag: verify the model has the task-specific head (e.g., `text-classification`)
- Model card: documentation of training, datasets used, benchmarks, limitations
- Language: ensure it supports the target language (`en`, `it`, `multilingual`)
- Size: balance performance vs. speed. `base` models (~110M params) are 3-4x faster than `large` (~340M)
- Training date: recent models often outperform older ones even on the same architecture
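Several of these checks can be automated against the Hub API. A sketch with `huggingface_hub` (the `vet_model` helper is hypothetical, and field availability can vary by model):

```python
from huggingface_hub import HfApi

def vet_model(repo_id: str) -> dict:
    """Gather the checklist signals for a single Hub repo (hypothetical helper)."""
    info = HfApi().model_info(repo_id)
    return {
        "task": info.pipeline_tag,            # advertised task-specific head
        "downloads": info.downloads,          # adoption / reliability signal
        "language_tags": [t for t in (info.tags or [])
                          if t in ("en", "it", "multilingual")],
        "last_modified": info.last_modified,  # recency of training/update
    }

print(vet_model("distilbert-base-uncased-finetuned-sst-2-english"))
```

This only automates the quantitative signals; the model card still needs to be read by a human for limitations and training-data caveats.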
3. AutoClass API: Flexible Loading
The AutoClass API allows loading any HuggingFace model
with the same code, regardless of the underlying architecture.
This is enabled by the config.json file that accompanies every model
and specifies the exact class to instantiate.
from transformers import (
AutoModel,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
AutoModelForQuestionAnswering,
AutoModelForMaskedLM,
AutoTokenizer,
AutoConfig
)
import torch
# Load configuration without downloading weights (very fast)
config = AutoConfig.from_pretrained("bert-base-uncased")
print(f"Architecture: {config.architectures}")
print(f"Hidden size: {config.hidden_size}")
print(f"Num layers: {config.num_hidden_layers}")
print(f"Num attention heads: {config.num_attention_heads}")
print(f"Vocab size: {config.vocab_size}")
# Tokenizer (works for any model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Encode text
inputs = tokenizer(
"HuggingFace is amazing!",
return_tensors="pt", # "pt" for PyTorch, "tf" for TensorFlow
truncation=True,
max_length=128,
padding="max_length"
)
print(f"\nToken IDs shape: {inputs['input_ids'].shape}") # [1, 128]
# Base model (no task-specific head) - for feature extraction
model_base = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
outputs = model_base(**inputs)
hidden_states = outputs.last_hidden_state # [1, 128, 768]
cls_embedding = hidden_states[:, 0, :] # CLS token [1, 768]
print(f"CLS embedding shape: {cls_embedding.shape}")
# Model with classification head
model_clf = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
print(f"\nClassification labels: {model_clf.config.id2label}")
# AutoClass to task mapping
autoclass_map = {
"AutoModelForSequenceClassification": "Text classification, sentiment analysis",
"AutoModelForTokenClassification": "NER, POS tagging, chunking",
"AutoModelForQuestionAnswering": "Extractive QA (SQuAD-style)",
"AutoModelForCausalLM": "Text generation (GPT-style)",
"AutoModelForSeq2SeqLM": "Translation, summarization (T5/mBART)",
"AutoModelForMaskedLM": "Masked language modeling (BERT)",
"AutoModelForMultipleChoice": "Multiple choice (SWAG, HellaSwag)",
}
# Advanced loading options
model_optimized = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
torch_dtype=torch.float16, # fp16 saves ~50% GPU memory
device_map="auto", # auto-distribute across available GPUs
low_cpu_mem_usage=True, # load parameters progressively
attn_implementation="flash_attention_2" # Flash Attention 2 if available
)
4. Pipeline API: Fast Inference
The Pipeline API is the simplest way to use a HuggingFace model. It handles tokenization, inference, and post-processing automatically. It is ideal for prototyping but can also be used in production with batch processing.
from transformers import pipeline
import torch
# =========================================
# Task 1: Text Classification / Sentiment
# =========================================
sentiment = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english",
device=0 if torch.cuda.is_available() else -1 # GPU if available
)
results = sentiment(["I love this product!", "This is terrible.", "It's okay I guess."])
for r in results:
print(f" Label: {r['label']}, Score: {r['score']:.3f}")
# =========================================
# Task 2: Named Entity Recognition
# =========================================
ner = pipeline(
"ner",
model="dslim/bert-base-NER",
aggregation_strategy="simple" # aggregate tokens of the same entity
)
entities = ner("Apple CEO Tim Cook announced a new iPhone in Cupertino.")
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} ({ent['score']:.3f})")
# =========================================
# Task 3: Question Answering
# =========================================
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
question="Who co-founded Tesla?",
context="Elon Musk co-founded Tesla Motors in 2003 along with Martin Eberhard."
)
print(f"\nQA Answer: '{result['answer']}' (score={result['score']:.3f})")
# =========================================
# Task 4: Text Generation
# =========================================
generator = pipeline(
"text-generation",
model="gpt2",
max_new_tokens=80,
num_return_sequences=2,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2
)
outputs = generator("The future of artificial intelligence is")
for i, out in enumerate(outputs):
print(f"\nGeneration {i+1}: {out['generated_text']}")
# =========================================
# Task 5: Zero-Shot Classification
# =========================================
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot(
"The Italian economy contracted by 0.5% in Q3 2024.",
candidate_labels=["economics", "politics", "sports", "technology", "health"]
)
print("\nZero-shot classification:")
for label, score in zip(result['labels'][:3], result['scores'][:3]):
print(f" {label}: {score:.3f}")
# =========================================
# Task 6: Summarization
# =========================================
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",
min_length=30, max_length=130)
text = """
Generative AI has revolutionized the technology sector in 2024. Large language models
like GPT-4, Claude, and Gemini have demonstrated surprising capabilities in reasoning,
creative writing, and solving complex problems. Companies worldwide are integrating
these technologies into their production processes, from customer service to data analysis,
from code generation to multimedia content creation.
"""
summary = summarizer(text)[0]['summary_text']
print(f"\nSummary: {summary}")
# =========================================
# Task 7: Translation
# =========================================
translator = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en")
italian_text = "Il machine learning sta trasformando il mondo moderno."
translated = translator(italian_text)[0]['translation_text']
print(f"\nTranslation: {translated}")
# =========================================
# Batch Processing for Performance
# =========================================
texts = ["Great product, highly recommended!",
"Poor quality, won't buy again.",
"Average, nothing special."] * 100 # 300 texts
# Batch inference is much more efficient than a single-item loop
sentiment_batch = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english",
batch_size=32 # process 32 texts at a time
)
results_batch = sentiment_batch(texts)
print(f"\nProcessed {len(results_batch)} texts")
5. Datasets Library: Efficient Data Management
The datasets library uses Apache Arrow as its backend, making it
extremely efficient for large data. All operations are lazy and memory-mapped,
allowing you to work with datasets that don't fit in RAM.
from datasets import (
load_dataset,
Dataset,
DatasetDict,
concatenate_datasets,
interleave_datasets,
Features,
Value,
ClassLabel
)
import pandas as pd
# =========================================
# Loading from HuggingFace Hub
# =========================================
# Public dataset with splits
sst2 = load_dataset("glue", "sst2")
print("SST-2 dataset:", sst2)
print("Training size:", len(sst2["train"]))
print("Features:", sst2["train"].features)
# With streaming (for huge datasets - does not download everything)
wiki_stream = load_dataset(
    "wikimedia/wikipedia",  # Parquet mirror; the script-based "wikipedia" dataset is deprecated
    "20231101.en",
    split="train",
    streaming=True
)
# Get only 5 examples without downloading the entire dataset
for i, example in enumerate(wiki_stream.take(5)):
print(f"Title: {example['title']} - Length: {len(example['text'])} chars")
# =========================================
# Creating from local sources
# =========================================
# From Python dictionary
data = Dataset.from_dict({
"text": [
"Great product, highly recommend!",
"Poor quality, not worth the price.",
"Average product, nothing exceptional.",
"Fantastic! Exceeded all expectations."
],
"label": [1, 0, 0, 1]
})
# From pandas DataFrame with explicit types
df = pd.DataFrame({
"text": ["Sample text 1", "Sample text 2"],
"label": [0, 1]
})
dataset_from_df = Dataset.from_pandas(df)
# From files with explicit schema
features = Features({
"text": Value("string"),
"label": ClassLabel(names=["negative", "positive"]),
"confidence": Value("float32")
})
dataset_json = load_dataset("json", data_files="data.jsonl", features=features)
# =========================================
# Advanced manipulation
# =========================================
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples, max_length=128):
return tokenizer(
examples["text"],
truncation=True,
padding="max_length",
max_length=max_length,
return_token_type_ids=False
)
tokenized = data.map(
tokenize_function,
batched=True, # process batches for efficiency
batch_size=1000,
num_proc=4, # use 4 parallel processes
remove_columns=["text"]
)
tokenized.set_format("torch") # convert to PyTorch tensors
# filter: remove short texts
long_texts = data.filter(
lambda x: len(x["text"].split()) > 5,
num_proc=4
)
# Balancing: oversample minority class
class_0 = data.filter(lambda x: x["label"] == 0)
class_1 = data.filter(lambda x: x["label"] == 1)
if len(class_0) < len(class_1):
factor = len(class_1) // len(class_0)
class_0_repeated = concatenate_datasets([class_0] * factor)
balanced = concatenate_datasets([class_0_repeated, class_1]).shuffle(seed=42)
# train_test_split with stratification
# (stratify_by_column requires a ClassLabel column, so cast it first)
data = data.cast_column("label", ClassLabel(names=["negative", "positive"]))
splits = data.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
train_ds = splits["train"]
test_ds = splits["test"]
print(f"\nTrain: {len(train_ds)}, Test: {len(test_ds)}")
# =========================================
# DatasetDict: multi-split management
# =========================================
dataset_dict = DatasetDict({
"train": train_ds,
"test": test_ds,
"validation": data.select(range(2))
})
# Save and reload (efficient Arrow format)
dataset_dict.save_to_disk("./data/my_dataset")
loaded = DatasetDict.load_from_disk("./data/my_dataset")
# Dataset statistics
print("\n=== Dataset Statistics ===")
print(f"Number of examples: {len(data)}")
print(f"Label distribution: {data.to_pandas()['label'].value_counts().to_dict()}")
print(f"Average text length: {data.to_pandas()['text'].str.len().mean():.0f} chars")
6. Trainer API: Complete Fine-tuning
The Trainer API is the high-level abstraction for training in HuggingFace. It handles training loops, evaluation, checkpointing, logging, and much more. It supports out-of-the-box mixed precision, gradient accumulation, distributed training, and integration with WandB, TensorBoard, and MLflow.
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TrainingArguments,
Trainer,
EarlyStoppingCallback,
TrainerCallback,
TrainerControl,
TrainerState
)
from datasets import load_dataset
import evaluate
import numpy as np
import torch
MODEL = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
# Dataset preparation
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(
examples["sentence"],
truncation=True,
padding="max_length",
max_length=128
)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
# Composite metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
acc = accuracy.compute(predictions=preds, references=labels)["accuracy"]
f1_score = f1.compute(predictions=preds, references=labels, average="binary")["f1"]
return {"accuracy": acc, "f1": f1_score}
# TrainingArguments: complete configuration
args = TrainingArguments(
# I/O and checkpointing
output_dir="./results/distilbert-sst2",
logging_dir="./logs",
logging_steps=50,
logging_strategy="steps",
# Epochs and batch size
num_train_epochs=5,
per_device_train_batch_size=32,
per_device_eval_batch_size=64,
# Learning rate schedule
learning_rate=2e-5,
lr_scheduler_type="cosine", # cosine, linear, polynomial
warmup_ratio=0.1, # 10% warm-up steps
weight_decay=0.01, # L2 regularization
# Evaluation and checkpointing
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="f1",
greater_is_better=True,
    save_total_limit=3,  # keep at most 3 checkpoints (oldest deleted first; the best is preserved via load_best_model_at_end)
# Computational optimization
fp16=True, # mixed precision FP16
dataloader_num_workers=4,
gradient_accumulation_steps=2, # effective batch = 32*2 = 64
max_grad_norm=1.0, # gradient clipping
# Reporting
report_to="none", # "wandb", "tensorboard", "mlflow"
seed=42,
data_seed=42,
)
# =========================================
# Custom Callback: advanced monitoring
# =========================================
class TrainingMonitorCallback(TrainerCallback):
def __init__(self, patience: int = 3):
self.patience = patience
self.best_metric = None
self.steps_without_improvement = 0
def on_evaluate(self, args, state: TrainerState, control: TrainerControl, metrics, **kwargs):
current_metric = metrics.get("eval_f1", 0)
if self.best_metric is None or current_metric > self.best_metric:
self.best_metric = current_metric
self.steps_without_improvement = 0
print(f"\n[Callback] New best F1: {current_metric:.4f}")
else:
self.steps_without_improvement += 1
print(f"\n[Callback] No improvement ({self.steps_without_improvement}/{self.patience})")
def on_log(self, args, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
if logs and "loss" in logs and state.global_step % 200 == 0:
print(f" Step {state.global_step}: loss={logs['loss']:.4f}")
# Trainer with multiple callbacks
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
compute_metrics=compute_metrics,
callbacks=[
EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001),
TrainingMonitorCallback(patience=3)
]
)
# Training and evaluation
train_result = trainer.train()
print(f"\nTraining complete!")
print(f"Train loss: {train_result.training_loss:.4f}")
print(f"Samples/sec: {train_result.metrics['train_samples_per_second']:.1f}")
metrics = trainer.evaluate(eval_dataset=tokenized["validation"])
print(f"Validation F1: {metrics['eval_f1']:.4f}")
print(f"Validation Acc: {metrics['eval_accuracy']:.4f}")
trainer.save_model("./models/distilbert-sst2-final")
tokenizer.save_pretrained("./models/distilbert-sst2-final")
7. Custom Training Loop with PyTorch
For advanced cases where the Trainer API is insufficient, we can write a custom training loop while maintaining all optimizations. This gives maximum control over custom loss functions, sampling strategies, curriculum learning, and more.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
import torch
import numpy as np
from tqdm import tqdm
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCHS = 3
BATCH_SIZE = 32
LR = 2e-5
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2).to(DEVICE)
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
train_loader = DataLoader(tokenized["train"], batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(tokenized["validation"], batch_size=64, num_workers=4)
# Selective weight decay (no bias and LayerNorm)
no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
optimizer_grouped = [
{"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": 0.01},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}
]
optimizer = AdamW(optimizer_grouped, lr=LR, eps=1e-8)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(total_steps * 0.1), num_training_steps=total_steps)
scaler = torch.amp.GradScaler("cuda", enabled=torch.cuda.is_available())  # torch.cuda.amp.GradScaler is deprecated
best_f1 = 0.0
for epoch in range(EPOCHS):
model.train()
total_loss = 0
progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
for batch in progress_bar:
batch = {k: v.to(DEVICE) for k, v in batch.items()}
# Mixed precision forward pass
        with torch.amp.autocast(device_type="cuda", enabled=torch.cuda.is_available()):
outputs = model(**batch)
loss = outputs.loss
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
scheduler.step()
total_loss += loss.item()
progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})
# Validation
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
for batch in val_loader:
batch = {k: v.to(DEVICE) for k, v in batch.items()}
preds = torch.argmax(model(**batch).logits, dim=-1)
all_preds.extend(preds.cpu().numpy())
all_labels.extend(batch["labels"].cpu().numpy())
from sklearn.metrics import f1_score, accuracy_score
f1 = f1_score(all_labels, all_preds)
acc = accuracy_score(all_labels, all_preds)
print(f"\nEpoch {epoch+1}: val_f1={f1:.4f}, val_acc={acc:.4f}")
if f1 > best_f1:
best_f1 = f1
model.save_pretrained("./models/best_model")
print(f" Saved new best model (F1={f1:.4f})")
8. PEFT: Parameter-Efficient Fine-tuning with LoRA
The PEFT (Parameter-Efficient Fine-Tuning) library allows fine-tuning large models by updating only a small fraction of the parameters. LoRA (Low-Rank Adaptation) is the most widely used method: it decomposes weight updates as the product of two low-rank matrices, reducing trainable parameters by approximately 99%.
PEFT Methods Compared
| Method | Trainable Params | Memory | Performance | Use Case |
|---|---|---|---|---|
| Full Fine-tuning | 100% | High (>40GB for 7B) | Maximum | Large dataset, enterprise GPU |
| LoRA (r=16) | ~0.5% | Low (-70%) | Near full FT | Consumer GPU (8-24GB) |
| QLoRA | ~0.5% (4-bit model) | Very low (-85%) | Slightly lower | 8-16GB GPU, large models |
| Prefix Tuning | ~0.1% | Very low | Lower | Generation, LLMs |
| Prompt Tuning | ~0.01% | Minimal | Variable | Large LLMs (>10B) |
| Adapter Layers | ~1-3% | Low | Good | Multi-task, modular |
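The parameter arithmetic behind the table can be checked directly. Below is a minimal sketch of the LoRA idea in plain PyTorch (illustrative dimensions, not tied to any specific model): the pretrained weight W stays frozen, only the two low-rank factors A and B are trained, and the update is ΔW = (α/r)·B·A.

```python
import torch

d, k, r, alpha = 768, 768, 16, 32     # weight dims, rank, scaling factor

W = torch.randn(d, k)                 # frozen pretrained weight
A = torch.randn(r, k) * 0.01          # trainable down-projection
B = torch.zeros(d, r)                 # trainable up-projection (zero-init)

delta_W = (alpha / r) * (B @ A)       # low-rank update, same shape as W
W_adapted = W + delta_W               # merged weight used at inference

full_params = W.numel()               # 589,824
lora_params = A.numel() + B.numel()   # 24,576
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 4.17%
```

Per matrix the trainable fraction is a few percent; across a whole model, where embeddings and the many untouched layers dominate the parameter count, the overall share drops well below 1%, consistent with the ~0.5% in the table. Zero-initializing B makes ΔW = 0 at the start, so training begins from the unmodified pretrained model.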
from peft import (
LoraConfig,
get_peft_model,
TaskType,
PeftModel,
prepare_model_for_kbit_training
)
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
import torch
# =========================================
# Standard LoRA
# =========================================
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
lora_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
r=16, # decomposition rank
lora_alpha=32, # scaling factor (alpha/r = scaling ratio)
lora_dropout=0.1, # dropout on LoRA layers
target_modules=["query", "value", "key"], # layers to modify
bias="none",
inference_mode=False
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 888,578 || all params: 125,535,234 || trainable%: 0.71%
# =========================================
# QLoRA: LoRA with 4-bit quantization
# =========================================
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 (best for language models)
bnb_4bit_compute_dtype=torch.bfloat16
)
# 4-bit quantization pays off mainly for large models (>1B parameters);
# bert-large (340M) is used here only as a runnable demo
model_4bit = AutoModelForSequenceClassification.from_pretrained(
"bert-large-uncased",
quantization_config=bnb_config,
device_map="auto",
num_labels=3
)
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config_qlora = LoraConfig(
task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32,
lora_dropout=0.05, target_modules=["query", "value"], bias="none"
)
peft_4bit_model = get_peft_model(model_4bit, lora_config_qlora)
peft_4bit_model.print_trainable_parameters()
# =========================================
# Saving and loading LoRA
# =========================================
# Save only LoRA weights (very lightweight ~1-5MB)
peft_model.save_pretrained("./models/roberta-lora-classification")
# Loading: base model + LoRA adapter
base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
model_with_lora = PeftModel.from_pretrained(base, "./models/roberta-lora-classification")
model_with_lora.eval()
# Merge for faster inference (eliminates LoRA overhead)
merged = model_with_lora.merge_and_unload()
merged.save_pretrained("./models/roberta-merged")
9. Accelerate: Distributed Training
Accelerate automatically handles the complexity of training on diverse hardware configurations: single CPU, single GPU, multi-GPU, multi-node, TPU, with mixed precision. Code changes are minimal.
# accelerate_training.py
# Launch: accelerate launch accelerate_training.py
# Multi-GPU: accelerate launch --num_processes 4 accelerate_training.py
# Config: accelerate config (interactive wizard)
from accelerate import Accelerator
from accelerate.utils import set_seed, ProjectConfiguration
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_cosine_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
import torch
from tqdm import tqdm
project_config = ProjectConfiguration(
project_dir="./accelerate_project",
logging_dir="./logs"
)
accelerator = Accelerator(
mixed_precision="fp16",
gradient_accumulation_steps=4,
log_with="tensorboard",
project_config=project_config
)
accelerator.print(f"Training on: {accelerator.device}")
accelerator.print(f"Num processes: {accelerator.num_processes}")
set_seed(42)
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
train_loader = DataLoader(tokenized["train"], batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(tokenized["validation"], batch_size=64, num_workers=4)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
total_steps = (len(train_loader) // accelerator.gradient_accumulation_steps) * 3
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps)
# Prepare ALL objects with Accelerate
model, optimizer, train_loader, val_loader, scheduler = accelerator.prepare(
model, optimizer, train_loader, val_loader, scheduler
)
for epoch in range(3):
model.train()
total_loss = 0
for step, batch in enumerate(tqdm(train_loader, desc=f"Epoch {epoch+1}")):
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
if accelerator.sync_gradients:
accelerator.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
accelerator.print(f"\nEpoch {epoch+1}: avg_loss={avg_loss:.4f}")
accelerator.save_state(f"./checkpoints/epoch_{epoch+1}")
# Save final model (remove Accelerate wrapper)
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("./models/final_model", save_function=accelerator.save)
10. Production Inference Optimization
In production, inference must be fast, efficient, and scalable. HuggingFace offers several optimization strategies: ONNX Runtime, static/dynamic quantization, TorchScript, and dedicated inference servers (TGI/TEI).
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.exporters.onnx import main_export
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import time
import numpy as np
# =========================================
# ONNX Export with optimization
# =========================================
main_export(
model_name_or_path="./models/distilbert-sst2-final",
output="./models/onnx-optimized",
task="text-classification",
optimize="O2" # O1: basic, O2: extended, O3: layout, O4: full + fp16
)
# Load and use ONNX model
ort_model = ORTModelForSequenceClassification.from_pretrained(
"./models/onnx-optimized",
provider="CPUExecutionProvider" # "CUDAExecutionProvider" for GPU
)
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2-final")
# Benchmark: PyTorch vs ONNX
texts = ["This product is absolutely amazing!"] * 200
def benchmark(model, tokenizer, texts, batch_size=32, num_runs=5):
times = []
for _ in range(num_runs):
start = time.perf_counter()
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=128)
with torch.no_grad():
_ = model(**inputs)
times.append(time.perf_counter() - start)
return np.mean(times), np.std(times)
onnx_avg, onnx_std = benchmark(ort_model, tokenizer, texts)
print(f"ONNX (200 texts): {onnx_avg*1000:.1f}ms ± {onnx_std*1000:.1f}ms")
# =========================================
# Dynamic quantization INT8
# =========================================
pt_model = AutoModelForSequenceClassification.from_pretrained("./models/distilbert-sst2-final")
pt_model.eval()
quantized_model = torch.quantization.quantize_dynamic(
pt_model,
{torch.nn.Linear},
dtype=torch.qint8
)
pt_size = sum(p.numel() * p.element_size() for p in pt_model.parameters()) / 1e6
print(f"\nFP32 size: {pt_size:.1f}MB")
print("Quantized Linear layers are ~4x smaller (INT8 vs FP32); embeddings remain FP32")
# =========================================
# Text Embeddings Inference (TEI) client
# =========================================
import requests
def get_embeddings_tei(texts: list, url: str = "http://localhost:8080") -> np.ndarray:
"""Call Text Embeddings Inference server."""
response = requests.post(
f"{url}/embed",
json={"inputs": texts, "normalize": True}
)
response.raise_for_status()
return np.array(response.json())
# Docker: docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest
# --model-id BAAI/bge-base-en-v1.5 --pooling mean
print("\nTEI endpoint: POST http://localhost:8080/embed")
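Because the server is asked for normalized embeddings (`normalize=True`), cosine similarity between two returned vectors reduces to a plain dot product. A sketch with stand-in vectors (hypothetical values; real usage would call `get_embeddings_tei`, which requires the Docker container above to be running):

```python
import numpy as np

# Stand-in for two TEI embeddings (hypothetical 2-D values for illustration)
emb = np.array([[3.0, 4.0], [4.0, 3.0]])
# TEI applies this L2 normalization server-side when normalize=True
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

similarity = float(emb[0] @ emb[1])  # cosine similarity via dot product
print(f"cosine similarity: {similarity:.2f}")  # → 0.96
```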
11. MLOps Integration: WandB and MLflow
In a production-grade context, training must be monitored, versioned, and reproducible. HuggingFace Trainer natively integrates with the main MLOps tools.
import wandb
import mlflow
import mlflow.pytorch
from transformers import TrainingArguments
# =========================================
# WandB Integration
# =========================================
wandb.init(
project="bert-nlp-experiments",
name="distilbert-sst2-run1",
config={
"model": "distilbert-base-uncased",
"learning_rate": 2e-5,
"epochs": 3,
"batch_size": 32,
"dataset": "SST-2"
},
tags=["bert", "sentiment", "fine-tuning"]
)
# Trainer automatically uses WandB when available
args_wandb = TrainingArguments(
output_dir="./results",
report_to="wandb",
run_name="distilbert-run1",
num_train_epochs=3,
)
# =========================================
# MLflow Integration
# =========================================
mlflow.set_tracking_uri("./mlflow_runs")
mlflow.set_experiment("bert-nlp-experiments")
with mlflow.start_run(run_name="distilbert-sst2"):
mlflow.log_params({
"model": "distilbert-base-uncased",
"lr": 2e-5,
"epochs": 3,
"batch_size": 32
})
args_mlflow = TrainingArguments(
output_dir="./results",
report_to="mlflow",
num_train_epochs=3,
)
# After training:
mlflow.log_metrics({"eval_f1": 0.924, "eval_accuracy": 0.931})
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
mlflow.pytorch.log_model(
model,
"model",
registered_model_name="bert-sentiment-classifier"
)
print(f"Run ID: {mlflow.active_run().info.run_id}")
Common Anti-Patterns with HuggingFace
- No batch processing: calling the pipeline in a loop on single texts is 10-50x slower than passing a full batch
- Reloading the model on every call: keep the model loaded in memory and reuse it; loading from disk takes seconds
- Ignoring max_length: without truncation, attention memory grows quadratically with sequence length; always set truncation=True
- Not using torch.no_grad() in inference: PyTorch builds the autograd graph unnecessarily, wasting memory
- Not calling model.eval(): BatchNorm and Dropout behave differently in training vs inference
- Always using padding="max_length": fixed-length padding wastes computation; in inference use dynamic padding
- Ignoring vocabulary size: embedding layers with large vocabularies can dominate model memory
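Several of these points combine into a single pattern: load once, batch, truncate, and let the pipeline handle `eval()` and `no_grad` internally. A sketch using a small public sentiment checkpoint (the model name is illustrative; substitute your own fine-tuned one):

```python
from transformers import pipeline

# Load once, outside any request loop (avoids reload-per-call)
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

texts = ["Great product!", "Terrible experience.", "It was fine."]
# One batched call with truncation instead of a Python loop over single texts
results = clf(texts, batch_size=32, truncation=True, max_length=128)
for text, res in zip(texts, results):
    print(f"{res['label']:>8}  {res['score']:.3f}  {text}")
```

The pipeline tokenizes each batch with dynamic padding, so short inputs are never padded out to `max_length`.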
Conclusions and Next Steps
The HuggingFace ecosystem has become the de facto standard for modern NLP.
Knowing its main libraries — transformers, datasets,
peft, accelerate — is fundamental for any
NLP engineer or ML engineer in 2025.
The strength of the ecosystem lies in the integration between components: datasets provides data, transformers the model, peft parameter optimization, accelerate hardware scalability, and optimum production inference. Each library can be used independently or in combination with the others.
Key Takeaways
- Use AutoClass to load any architecture with the same code
- The Pipeline API is perfect for rapid prototyping; use batch_size for production
- The Trainer API handles 90% of standard fine-tuning cases with customizable callbacks
- Custom training loops are necessary for custom loss functions, curriculum learning, and advanced training
- PEFT/LoRA drastically reduces memory: ~0.5% of parameters with nearly equivalent performance
- Accelerate enables distributed training without changing code
- ONNX offers 2-5x speedup in CPU inference compared to native PyTorch
- Integrate WandB or MLflow from the start for experiment tracking
Continue the Modern NLP Series
- Previous: Text Classification: Multi-label and Multi-class — BERT-based text classification
- Next: Fine-tuning LLMs Locally: LoRA on Consumer GPU — advanced QLoRA for LLMs 7B+
- Article 9: Semantic Similarity and Text Matching — SBERT, FAISS, dense retrieval
- Article 10: Monitoring NLP Models in Production — drift detection, automatic retraining
- Related series: AI Engineering/RAG — HuggingFace Embeddings for RAG pipelines
- Related series: Deep Learning Advanced — model quantization and optimization