HuggingFace Transformers: Practical Guide to the Ecosystem
HuggingFace has become the reference platform for modern machine learning.
With over 500,000 pre-trained models, more than 100,000 datasets, and libraries such as
`transformers`, `datasets`, `peft`, `accelerate`, and `optimum`, it provides the
infrastructure underlying most current NLP and Computer Vision research and development.
In this article we explore the HuggingFace ecosystem in a practical and systematic way: from choosing the right model in the Hub, to the Trainer API for fine-tuning, to managing large datasets, to model optimization and deployment. We will also cover advanced patterns such as custom training loops, personalized callbacks, production-optimized inference, and integration with MLOps systems.
This is the seventh article in the Modern NLP: from BERT to LLMs series. It assumes familiarity with BERT (article 2) and sentiment analysis (article 3).
What You Will Learn
- The HuggingFace ecosystem: main libraries and when to use them
- Model Hub: searching, filtering, and loading models
- AutoClass API: AutoModel, AutoTokenizer, AutoConfig
- Pipeline API: zero-config inference for common tasks
- Datasets library: loading, manipulation, and streaming of large datasets
- Trainer API: complete fine-tuning with logging, callbacks, and checkpointing
- Custom training loops with native PyTorch
- PEFT and LoRA: efficient fine-tuning with few parameters
- Accelerate: distributed training and mixed precision
- Inference optimization: ONNX, BitsAndBytes, quantization
- Push to Hub: sharing models and datasets publicly
- Integration with WandB, MLflow, and MLOps systems
1. The HuggingFace Ecosystem
The HuggingFace ecosystem is composed of many separate but integrated libraries. Understanding which to use for each scenario is fundamental to avoid reinventing the wheel and maximizing the benefit of community work.
Main Libraries and Use Cases
| Library | Purpose | Typical Scenario | Installation |
|---|---|---|---|
| `transformers` | Models, tokenizers, training | BERT fine-tuning, inference pipeline | `pip install transformers` |
| `datasets` | Dataset management | Loading, preprocessing, streaming | `pip install datasets` |
| `peft` | Efficient fine-tuning | LoRA, Prefix Tuning, P-Tuning | `pip install peft` |
| `accelerate` | Distributed training | Multi-GPU, TPU, mixed precision | `pip install accelerate` |
| `optimum` | Inference optimization | ONNX export, quantization, TensorRT | `pip install optimum` |
| `evaluate` | Standard NLP metrics | BLEU, ROUGE, F1, accuracy, seqeval | `pip install evaluate` |
| `trl` | RLHF and SFT for LLMs | Instruction-following, reward modeling | `pip install trl` |
| `safetensors` | Secure format for weights | Fast and secure save/load | `pip install safetensors` |
| `sentence-transformers` | Sentence embeddings | Semantic similarity, clustering, RAG | `pip install sentence-transformers` |
| `tokenizers` | Fast tokenization | Custom BPE, WordPiece, Unigram | `pip install tokenizers` |
Library selection depends on context. For rapid prototyping, use `pipeline()` from
`transformers`. For production-grade fine-tuning, use `Trainer` with `datasets`.
For large models on limited GPUs, use `peft`. For optimized inference, use `optimum`.
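The rapid-prototyping path can be sketched in a few lines (the checkpoint name is just a common public model used for illustration):

```python
from transformers import pipeline

# pipeline() resolves the checkpoint, downloads tokenizer and weights,
# and wires tokenization -> inference -> post-processing into one object
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
preds = clf(["The new release fixed every bug I reported."])
print(preds)  # a list of {'label': ..., 'score': ...} dicts
```

Section 4 covers the Pipeline API in depth, including batching and device placement.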
2. Model Hub: Finding the Right Model
The HuggingFace Hub hosts over 500,000 models. Finding the right one
requires understanding available filters and naming conventions. A model is identified
by username/model-name, with tags for language, task, framework, and dataset.
from huggingface_hub import HfApi, list_models

api = HfApi()
# Search models by task and language
# (recent huggingface_hub versions take filters as keyword arguments;
#  the older ModelFilter class has been removed)
models = list(list_models(
    task="text-classification",
    language="en",  # English
    sort="downloads",
    direction=-1,  # descending
    limit=10
))
print("Top 10 English models for text-classification:")
for i, model in enumerate(models, 1):
print(f" {i}. {model.modelId} "
f"(downloads: {model.downloads:,}, likes: {model.likes})")
# Search for Italian BERT models
italian_models = list(list_models(
search="bert italian",
sort="downloads",
direction=-1,
limit=5
))
# Load detailed model info
model_info = api.model_info("dbmdz/bert-base-italian-cased")
print(f"\nModel: {model_info.modelId}")
print(f"Task: {model_info.pipeline_tag}")
print(f"Tags: {model_info.tags}")
print(f"Downloads/month: {model_info.downloads:,}")
# Recommended Italian models by task
italian_models_map = {
"sentiment": [
"neuraly/bert-base-italian-cased-sentiment",
"MilaNLProc/feel-it-italian-sentiment",
"morenolq/bert-base-italian-cased-sentiment"
],
"ner": [
"osiria/bert-base-italian-uncased-ner",
"Babelscape/wikineural-multilingual-ner"
],
"embeddings": [
"nickprock/sentence-bert-base-italian-uncased",
        "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
],
"base_models": [
"dbmdz/bert-base-italian-cased",
"dbmdz/bert-base-italian-uncased"
]
}
for task, models_list in italian_models_map.items():
print(f"\n{task.upper()}:")
for m in models_list:
print(f" - {m}")
Criteria for Choosing a Model from the Hub
- Monthly downloads: indicator of community adoption and reliability
- Task tag: verify the model has the task-specific head (e.g., `text-classification`)
- Model card: documentation of training, datasets used, benchmarks, limitations
- Language: ensure it supports the target language (`en`, `it`, `multilingual`)
- Size: balance performance vs. speed. `base` models (~110M params) are 3-4x faster than `large` (~340M)
- Training date: recent models often outperform older ones even on the same architecture
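Several of these checks can be automated against the Hub API. A sketch with `huggingface_hub` (the `vet_model` helper is hypothetical, and field availability can vary by model):

```python
from huggingface_hub import HfApi

def vet_model(repo_id: str) -> dict:
    """Gather the checklist signals for a single Hub repo (hypothetical helper)."""
    info = HfApi().model_info(repo_id)
    return {
        "task": info.pipeline_tag,            # advertised task-specific head
        "downloads": info.downloads,          # adoption / reliability signal
        "language_tags": [t for t in (info.tags or [])
                          if t in ("en", "it", "multilingual")],
        "last_modified": info.last_modified,  # recency of training/update
    }

print(vet_model("distilbert-base-uncased-finetuned-sst-2-english"))
```

This only automates the quantitative signals; the model card still needs to be read by a human for limitations and training-data caveats.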
3. AutoClass API: Flexible Loading
The AutoClass API allows loading any HuggingFace model
with the same code, regardless of the underlying architecture.
This is enabled by the config.json file that accompanies every model
and specifies the exact class to instantiate.
from transformers import (
AutoModel,
AutoModelForSequenceClassification,
AutoModelForTokenClassification,
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
AutoModelForQuestionAnswering,
AutoModelForMaskedLM,
AutoTokenizer,
AutoConfig
)
import torch
# Load configuration without downloading weights (very fast)
config = AutoConfig.from_pretrained("bert-base-uncased")
print(f"Architecture: {config.architectures}")
print(f"Hidden size: {config.hidden_size}")
print(f"Num layers: {config.num_hidden_layers}")
print(f"Num attention heads: {config.num_attention_heads}")
print(f"Vocab size: {config.vocab_size}")
# Tokenizer (works for any model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Encode text
inputs = tokenizer(
"HuggingFace is amazing!",
return_tensors="pt", # "pt" for PyTorch, "tf" for TensorFlow
truncation=True,
max_length=128,
padding="max_length"
)
print(f"\nToken IDs shape: {inputs['input_ids'].shape}") # [1, 128]
# Base model (no task-specific head) - for feature extraction
model_base = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
outputs = model_base(**inputs)
hidden_states = outputs.last_hidden_state # [1, 128, 768]
cls_embedding = hidden_states[:, 0, :] # CLS token [1, 768]
print(f"CLS embedding shape: {cls_embedding.shape}")
# Model with classification head
model_clf = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english"
)
print(f"\nClassification labels: {model_clf.config.id2label}")
# AutoClass to task mapping
autoclass_map = {
"AutoModelForSequenceClassification": "Text classification, sentiment analysis",
"AutoModelForTokenClassification": "NER, POS tagging, chunking",
"AutoModelForQuestionAnswering": "Extractive QA (SQuAD-style)",
"AutoModelForCausalLM": "Text generation (GPT-style)",
"AutoModelForSeq2SeqLM": "Translation, summarization (T5/mBART)",
"AutoModelForMaskedLM": "Masked language modeling (BERT)",
"AutoModelForMultipleChoice": "Multiple choice (SWAG, HellaSwag)",
}
# Advanced loading options
model_optimized = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=3,
torch_dtype=torch.float16, # fp16 saves ~50% GPU memory
device_map="auto", # auto-distribute across available GPUs
low_cpu_mem_usage=True, # load parameters progressively
attn_implementation="flash_attention_2" # Flash Attention 2 if available
)
4. Pipeline API: Fast Inference
The Pipeline API is the simplest way to use a HuggingFace model. It handles tokenization, inference, and post-processing automatically. It is ideal for prototyping but can also be used in production with batch processing.
from transformers import pipeline
import torch
# =========================================
# Task 1: Text Classification / Sentiment
# =========================================
sentiment = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english",
device=0 if torch.cuda.is_available() else -1 # GPU if available
)
results = sentiment(["I love this product!", "This is terrible.", "It's okay I guess."])
for r in results:
print(f" Label: {r['label']}, Score: {r['score']:.3f}")
# =========================================
# Task 2: Named Entity Recognition
# =========================================
ner = pipeline(
"ner",
model="dslim/bert-base-NER",
aggregation_strategy="simple" # aggregate tokens of the same entity
)
entities = ner("Apple CEO Tim Cook announced a new iPhone in Cupertino.")
for ent in entities:
print(f" '{ent['word']}' -> {ent['entity_group']} ({ent['score']:.3f})")
# =========================================
# Task 3: Question Answering
# =========================================
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
question="Who co-founded Tesla?",
context="Elon Musk co-founded Tesla Motors in 2003 along with Martin Eberhard."
)
print(f"\nQA Answer: '{result['answer']}' (score={result['score']:.3f})")
# =========================================
# Task 4: Text Generation
# =========================================
generator = pipeline(
"text-generation",
model="gpt2",
max_new_tokens=80,
num_return_sequences=2,
do_sample=True,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.2
)
outputs = generator("The future of artificial intelligence is")
for i, out in enumerate(outputs):
print(f"\nGeneration {i+1}: {out['generated_text']}")
# =========================================
# Task 5: Zero-Shot Classification
# =========================================
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = zero_shot(
"The Italian economy contracted by 0.5% in Q3 2024.",
candidate_labels=["economics", "politics", "sports", "technology", "health"]
)
print("\nZero-shot classification:")
for label, score in zip(result['labels'][:3], result['scores'][:3]):
print(f" {label}: {score:.3f}")
# =========================================
# Task 6: Summarization
# =========================================
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",
min_length=30, max_length=130)
text = """
Generative AI has revolutionized the technology sector in 2024. Large language models
like GPT-4, Claude, and Gemini have demonstrated surprising capabilities in reasoning,
creative writing, and solving complex problems. Companies worldwide are integrating
these technologies into their production processes, from customer service to data analysis,
from code generation to multimedia content creation.
"""
summary = summarizer(text)[0]['summary_text']
print(f"\nSummary: {summary}")
# =========================================
# Task 7: Translation
# =========================================
translator = pipeline("translation_it_to_en", model="Helsinki-NLP/opus-mt-it-en")
italian_text = "Il machine learning sta trasformando il mondo moderno."
translated = translator(italian_text)[0]['translation_text']
print(f"\nTranslation: {translated}")
# =========================================
# Batch Processing for Performance
# =========================================
texts = ["Great product, highly recommended!",
"Poor quality, won't buy again.",
"Average, nothing special."] * 100 # 300 texts
# Batch inference is much more efficient than a single-item loop
sentiment_batch = pipeline(
"sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english",
batch_size=32 # process 32 texts at a time
)
results_batch = sentiment_batch(texts)
print(f"\nProcessed {len(results_batch)} texts")
5. Datasets Library: Efficient Data Management
The datasets library uses Apache Arrow as its backend, making it
extremely efficient for large data. All operations are lazy and memory-mapped,
allowing you to work with datasets that don't fit in RAM.
from datasets import (
load_dataset,
Dataset,
DatasetDict,
concatenate_datasets,
interleave_datasets,
Features,
Value,
ClassLabel
)
import pandas as pd
# =========================================
# Loading from HuggingFace Hub
# =========================================
# Public dataset with splits
sst2 = load_dataset("glue", "sst2")
print("SST-2 dataset:", sst2)
print("Training size:", len(sst2["train"]))
print("Features:", sst2["train"].features)
# With streaming (for huge datasets - does not download everything)
wiki_stream = load_dataset(
    "wikimedia/wikipedia",  # Parquet mirror; the script-based "wikipedia" dataset is deprecated
    "20231101.en",
    split="train",
    streaming=True
)
# Get only 5 examples without downloading the entire dataset
for i, example in enumerate(wiki_stream.take(5)):
print(f"Title: {example['title']} - Length: {len(example['text'])} chars")
# =========================================
# Creating from local sources
# =========================================
# From Python dictionary
data = Dataset.from_dict({
"text": [
"Great product, highly recommend!",
"Poor quality, not worth the price.",
"Average product, nothing exceptional.",
"Fantastic! Exceeded all expectations."
],
"label": [1, 0, 0, 1]
})
# From pandas DataFrame with explicit types
df = pd.DataFrame({
"text": ["Sample text 1", "Sample text 2"],
"label": [0, 1]
})
dataset_from_df = Dataset.from_pandas(df)
# From files with explicit schema
features = Features({
"text": Value("string"),
"label": ClassLabel(names=["negative", "positive"]),
"confidence": Value("float32")
})
dataset_json = load_dataset("json", data_files="data.jsonl", features=features)
# =========================================
# Advanced manipulation
# =========================================
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(examples, max_length=128):
return tokenizer(
examples["text"],
truncation=True,
padding="max_length",
max_length=max_length,
return_token_type_ids=False
)
tokenized = data.map(
tokenize_function,
batched=True, # process batches for efficiency
batch_size=1000,
num_proc=4, # use 4 parallel processes
remove_columns=["text"]
)
tokenized.set_format("torch") # convert to PyTorch tensors
# filter: remove short texts
long_texts = data.filter(
lambda x: len(x["text"].split()) > 5,
num_proc=4
)
# Balancing: oversample minority class
class_0 = data.filter(lambda x: x["label"] == 0)
class_1 = data.filter(lambda x: x["label"] == 1)
if len(class_0) < len(class_1):
factor = len(class_1) // len(class_0)
class_0_repeated = concatenate_datasets([class_0] * factor)
balanced = concatenate_datasets([class_0_repeated, class_1]).shuffle(seed=42)
# train_test_split with stratification
# (stratify_by_column requires a ClassLabel column, so cast it first)
data = data.cast_column("label", ClassLabel(names=["negative", "positive"]))
splits = data.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
train_ds = splits["train"]
test_ds = splits["test"]
print(f"\nTrain: {len(train_ds)}, Test: {len(test_ds)}")
# =========================================
# DatasetDict: multi-split management
# =========================================
dataset_dict = DatasetDict({
"train": train_ds,
"test": test_ds,
"validation": data.select(range(2))
})
# Save and reload (efficient Arrow format)
dataset_dict.save_to_disk("./data/my_dataset")
loaded = DatasetDict.load_from_disk("./data/my_dataset")
# Dataset statistics
print("\n=== Dataset Statistics ===")
print(f"Number of examples: {len(data)}")
print(f"Label distribution: {data.to_pandas()['label'].value_counts().to_dict()}")
print(f"Average text length: {data.to_pandas()['text'].str.len().mean():.0f} chars")
6. Trainer API: Complete Fine-tuning
The Trainer API is the high-level abstraction for training in HuggingFace. It handles training loops, evaluation, checkpointing, logging, and much more. It supports out-of-the-box mixed precision, gradient accumulation, distributed training, and integration with WandB, TensorBoard, and MLflow.
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
TrainingArguments,
Trainer,
EarlyStoppingCallback,
TrainerCallback,
TrainerControl,
TrainerState
)
from datasets import load_dataset
import evaluate
import numpy as np
import torch
MODEL = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
# Dataset preparation
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(
examples["sentence"],
truncation=True,
padding="max_length",
max_length=128
)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
# Composite metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
logits, labels = eval_pred
preds = np.argmax(logits, axis=-1)
acc = accuracy.compute(predictions=preds, references=labels)["accuracy"]
f1_score = f1.compute(predictions=preds, references=labels, average="binary")["f1"]
return {"accuracy": acc, "f1": f1_score}
# TrainingArguments: complete configuration
args = TrainingArguments(
# I/O and checkpointing
output_dir="./results/distilbert-sst2",
logging_dir="./logs",
logging_steps=50,
logging_strategy="steps",
# Epochs and batch size
num_train_epochs=5,
per_device_train_batch_size=32,
per_device_eval_batch_size=64,
# Learning rate schedule
learning_rate=2e-5,
lr_scheduler_type="cosine", # cosine, linear, polynomial
warmup_ratio=0.1, # 10% warm-up steps
weight_decay=0.01, # L2 regularization
# Evaluation and checkpointing
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="f1",
greater_is_better=True,
    save_total_limit=3,  # keep at most 3 checkpoints (oldest deleted first; the best is preserved via load_best_model_at_end)
# Computational optimization
fp16=True, # mixed precision FP16
dataloader_num_workers=4,
gradient_accumulation_steps=2, # effective batch = 32*2 = 64
max_grad_norm=1.0, # gradient clipping
# Reporting
report_to="none", # "wandb", "tensorboard", "mlflow"
seed=42,
data_seed=42,
)
# =========================================
# Custom Callback: advanced monitoring
# =========================================
class TrainingMonitorCallback(TrainerCallback):
def __init__(self, patience: int = 3):
self.patience = patience
self.best_metric = None
self.steps_without_improvement = 0
def on_evaluate(self, args, state: TrainerState, control: TrainerControl, metrics, **kwargs):
current_metric = metrics.get("eval_f1", 0)
if self.best_metric is None or current_metric > self.best_metric:
self.best_metric = current_metric
self.steps_without_improvement = 0
print(f"\n[Callback] New best F1: {current_metric:.4f}")
else:
self.steps_without_improvement += 1
print(f"\n[Callback] No improvement ({self.steps_without_improvement}/{self.patience})")
def on_log(self, args, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
if logs and "loss" in logs and state.global_step % 200 == 0:
print(f" Step {state.global_step}: loss={logs['loss']:.4f}")
# Trainer with multiple callbacks
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
compute_metrics=compute_metrics,
callbacks=[
EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.001),
TrainingMonitorCallback(patience=3)
]
)
# Training and evaluation
train_result = trainer.train()
print(f"\nTraining complete!")
print(f"Train loss: {train_result.training_loss:.4f}")
print(f"Samples/sec: {train_result.metrics['train_samples_per_second']:.1f}")
metrics = trainer.evaluate(eval_dataset=tokenized["validation"])
print(f"Validation F1: {metrics['eval_f1']:.4f}")
print(f"Validation Acc: {metrics['eval_accuracy']:.4f}")
trainer.save_model("./models/distilbert-sst2-final")
tokenizer.save_pretrained("./models/distilbert-sst2-final")
7. Custom Training Loop with PyTorch
For advanced cases where the Trainer API is insufficient, we can write a custom training loop while maintaining all optimizations. This gives maximum control over custom loss functions, sampling strategies, curriculum learning, and more.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
import torch
import numpy as np
from tqdm import tqdm
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCHS = 3
BATCH_SIZE = 32
LR = 2e-5
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2).to(DEVICE)
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
train_loader = DataLoader(tokenized["train"], batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(tokenized["validation"], batch_size=64, num_workers=4)
# Selective weight decay (no bias and LayerNorm)
no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]
optimizer_grouped = [
{"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], "weight_decay": 0.01},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0}
]
optimizer = AdamW(optimizer_grouped, lr=LR, eps=1e-8)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(total_steps * 0.1), num_training_steps=total_steps)
scaler = torch.amp.GradScaler("cuda", enabled=torch.cuda.is_available())  # torch.cuda.amp.GradScaler is deprecated
best_f1 = 0.0
for epoch in range(EPOCHS):
model.train()
total_loss = 0
progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{EPOCHS}")
for batch in progress_bar:
batch = {k: v.to(DEVICE) for k, v in batch.items()}
# Mixed precision forward pass
        with torch.amp.autocast(device_type="cuda", enabled=torch.cuda.is_available()):
outputs = model(**batch)
loss = outputs.loss
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
scheduler.step()
total_loss += loss.item()
progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})
# Validation
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
for batch in val_loader:
batch = {k: v.to(DEVICE) for k, v in batch.items()}
preds = torch.argmax(model(**batch).logits, dim=-1)
all_preds.extend(preds.cpu().numpy())
all_labels.extend(batch["labels"].cpu().numpy())
from sklearn.metrics import f1_score, accuracy_score
f1 = f1_score(all_labels, all_preds)
acc = accuracy_score(all_labels, all_preds)
print(f"\nEpoch {epoch+1}: val_f1={f1:.4f}, val_acc={acc:.4f}")
if f1 > best_f1:
best_f1 = f1
model.save_pretrained("./models/best_model")
print(f" Saved new best model (F1={f1:.4f})")
8. PEFT: Parameter-Efficient Fine-tuning with LoRA
The PEFT (Parameter-Efficient Fine-Tuning) library allows fine-tuning large models by updating only a small fraction of the parameters. LoRA (Low-Rank Adaptation) is the most widely used method: it decomposes weight updates as the product of two low-rank matrices, reducing trainable parameters by approximately 99%.
PEFT Methods Compared
| Method | Trainable Params | Memory | Performance | Use Case |
|---|---|---|---|---|
| Full Fine-tuning | 100% | High (>40GB for 7B) | Maximum | Large dataset, enterprise GPU |
| LoRA (r=16) | ~0.5% | Low (-70%) | Near full FT | Consumer GPU (8-24GB) |
| QLoRA | ~0.5% (4-bit model) | Very low (-85%) | Slightly lower | 8-16GB GPU, large models |
| Prefix Tuning | ~0.1% | Very low | Lower | Generation, LLMs |
| Prompt Tuning | ~0.01% | Minimal | Variable | Large LLMs (>10B) |
| Adapter Layers | ~1-3% | Low | Good | Multi-task, modular |
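The parameter arithmetic behind the table can be checked directly. Below is a minimal sketch of the LoRA idea in plain PyTorch (illustrative dimensions, not tied to any specific model): the pretrained weight W stays frozen, only the two low-rank factors A and B are trained, and the update is ΔW = (α/r)·B·A.

```python
import torch

d, k, r, alpha = 768, 768, 16, 32     # weight dims, rank, scaling factor

W = torch.randn(d, k)                 # frozen pretrained weight
A = torch.randn(r, k) * 0.01          # trainable down-projection
B = torch.zeros(d, r)                 # trainable up-projection (zero-init)

delta_W = (alpha / r) * (B @ A)       # low-rank update, same shape as W
W_adapted = W + delta_W               # merged weight used at inference

full_params = W.numel()               # 589,824
lora_params = A.numel() + B.numel()   # 24,576
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 4.17%
```

Per matrix the trainable fraction is a few percent; across a whole model, where embeddings and the many untouched layers dominate the parameter count, the overall share drops well below 1%, consistent with the ~0.5% in the table. Zero-initializing B makes ΔW = 0 at the start, so training begins from the unmodified pretrained model.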
from peft import (
LoraConfig,
get_peft_model,
TaskType,
PeftModel,
prepare_model_for_kbit_training
)
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
import torch
# =========================================
# Standard LoRA
# =========================================
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
lora_config = LoraConfig(
task_type=TaskType.SEQ_CLS,
r=16, # decomposition rank
lora_alpha=32, # scaling factor (alpha/r = scaling ratio)
lora_dropout=0.1, # dropout on LoRA layers
target_modules=["query", "value", "key"], # layers to modify
bias="none",
inference_mode=False
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 888,578 || all params: 125,535,234 || trainable%: 0.71%
# =========================================
# QLoRA: LoRA with 4-bit quantization
# =========================================
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 (best for language models)
bnb_4bit_compute_dtype=torch.bfloat16
)
# 4-bit quantization pays off mainly for large models (>1B parameters);
# bert-large (340M) is used here only as a runnable demo
model_4bit = AutoModelForSequenceClassification.from_pretrained(
"bert-large-uncased",
quantization_config=bnb_config,
device_map="auto",
num_labels=3
)
model_4bit = prepare_model_for_kbit_training(model_4bit)
lora_config_qlora = LoraConfig(
task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32,
lora_dropout=0.05, target_modules=["query", "value"], bias="none"
)
peft_4bit_model = get_peft_model(model_4bit, lora_config_qlora)
peft_4bit_model.print_trainable_parameters()
# =========================================
# Saving and loading LoRA
# =========================================
# Save only LoRA weights (very lightweight ~1-5MB)
peft_model.save_pretrained("./models/roberta-lora-classification")
# Loading: base model + LoRA adapter
base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)
model_with_lora = PeftModel.from_pretrained(base, "./models/roberta-lora-classification")
model_with_lora.eval()
# Merge for faster inference (eliminates LoRA overhead)
merged = model_with_lora.merge_and_unload()
merged.save_pretrained("./models/roberta-merged")
9. Accelerate: Distributed Training
Accelerate automatically handles the complexity of training on diverse hardware configurations: single CPU, single GPU, multi-GPU, multi-node, TPU, with mixed precision. Code changes are minimal.
# accelerate_training.py
# Launch: accelerate launch accelerate_training.py
# Multi-GPU: accelerate launch --num_processes 4 accelerate_training.py
# Config: accelerate config (interactive wizard)
from accelerate import Accelerator
from accelerate.utils import set_seed, ProjectConfiguration
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_cosine_schedule_with_warmup
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
import torch
from tqdm import tqdm
project_config = ProjectConfiguration(
project_dir="./accelerate_project",
logging_dir="./logs"
)
accelerator = Accelerator(
mixed_precision="fp16",
gradient_accumulation_steps=4,
log_with="tensorboard",
project_config=project_config
)
accelerator.print(f"Training on: {accelerator.device}")
accelerator.print(f"Num processes: {accelerator.num_processes}")
set_seed(42)
MODEL = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
dataset = load_dataset("glue", "sst2")
def tokenize(examples):
return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["sentence", "idx"])
tokenized.set_format("torch")
train_loader = DataLoader(tokenized["train"], batch_size=32, shuffle=True, num_workers=4)
val_loader = DataLoader(tokenized["validation"], batch_size=64, num_workers=4)
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
total_steps = (len(train_loader) // accelerator.gradient_accumulation_steps) * 3
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=total_steps//10, num_training_steps=total_steps)
# Prepare ALL objects with Accelerate
model, optimizer, train_loader, val_loader, scheduler = accelerator.prepare(
model, optimizer, train_loader, val_loader, scheduler
)
for epoch in range(3):
model.train()
total_loss = 0
for step, batch in enumerate(tqdm(train_loader, desc=f"Epoch {epoch+1}")):
with accelerator.accumulate(model):
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
if accelerator.sync_gradients:
accelerator.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
accelerator.print(f"\nEpoch {epoch+1}: avg_loss={avg_loss:.4f}")
accelerator.save_state(f"./checkpoints/epoch_{epoch+1}")
# Save final model (remove Accelerate wrapper)
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("./models/final_model", save_function=accelerator.save)
10. Production Inference Optimization
In production, inference must be fast, efficient, and scalable. HuggingFace offers several optimization strategies: ONNX Runtime, static/dynamic quantization, TorchScript, and dedicated inference servers (TGI/TEI).
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.exporters.onnx import main_export
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import time
import numpy as np
# =========================================
# ONNX Export with optimization
# =========================================
main_export(
model_name_or_path="./models/distilbert-sst2-final",
output="./models/onnx-optimized",
task="text-classification",
optimize="O2" # O1: basic, O2: extended, O3: layout, O4: full + fp16
)
# Load and use ONNX model
ort_model = ORTModelForSequenceClassification.from_pretrained(
"./models/onnx-optimized",
provider="CPUExecutionProvider" # "CUDAExecutionProvider" for GPU
)
tokenizer = AutoTokenizer.from_pretrained("./models/distilbert-sst2-final")
# Benchmark: PyTorch vs ONNX
texts = ["This product is absolutely amazing!"] * 200
def benchmark(model, tokenizer, texts, batch_size=32, num_runs=5):
times = []
for _ in range(num_runs):
start = time.perf_counter()
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=128)
with torch.no_grad():
_ = model(**inputs)
times.append(time.perf_counter() - start)
return np.mean(times), np.std(times)
onnx_avg, onnx_std = benchmark(ort_model, tokenizer, texts)
print(f"ONNX (200 texts): {onnx_avg*1000:.1f}ms ± {onnx_std*1000:.1f}ms")
# =========================================
# Dynamic quantization INT8
# =========================================
pt_model = AutoModelForSequenceClassification.from_pretrained("./models/distilbert-sst2-final")
pt_model.eval()
quantized_model = torch.quantization.quantize_dynamic(
pt_model,
{torch.nn.Linear},
dtype=torch.qint8
)
pt_size = sum(p.numel() * p.element_size() for p in pt_model.parameters()) / 1e6
print(f"\nFP32 size: {pt_size:.1f}MB")
print("Quantized Linear layers are ~4x smaller (INT8 vs FP32); embeddings remain FP32")
# =========================================
# Text Embeddings Inference (TEI) client
# =========================================
import requests
def get_embeddings_tei(texts: list, url: str = "http://localhost:8080") -> np.ndarray:
"""Call Text Embeddings Inference server."""
response = requests.post(
f"{url}/embed",
json={"inputs": texts, "normalize": True}
)
response.raise_for_status()
return np.array(response.json())
# Docker: docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:latest
# --model-id BAAI/bge-base-en-v1.5 --pooling mean
print("\nTEI endpoint: POST http://localhost:8080/embed")
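Because the server is asked for normalized embeddings (`normalize=True`), cosine similarity between two returned vectors reduces to a plain dot product. A sketch with stand-in vectors (hypothetical values; real usage would call `get_embeddings_tei`, which requires the Docker container above to be running):

```python
import numpy as np

# Stand-in for two TEI embeddings (hypothetical 2-D values for illustration)
emb = np.array([[3.0, 4.0], [4.0, 3.0]])
# TEI applies this L2 normalization server-side when normalize=True
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

similarity = float(emb[0] @ emb[1])  # cosine similarity via dot product
print(f"cosine similarity: {similarity:.2f}")  # → 0.96
```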
11. MLOps Integration: WandB and MLflow
In a production-grade context, training must be monitored, versioned, and reproducible. HuggingFace Trainer natively integrates with the main MLOps tools.
import wandb
import mlflow
import mlflow.pytorch
from transformers import TrainingArguments
# =========================================
# WandB Integration
# =========================================
wandb.init(
project="bert-nlp-experiments",
name="distilbert-sst2-run1",
config={
"model": "distilbert-base-uncased",
"learning_rate": 2e-5,
"epochs": 3,
"batch_size": 32,
"dataset": "SST-2"
},
tags=["bert", "sentiment", "fine-tuning"]
)
# Trainer automatically uses WandB when available
args_wandb = TrainingArguments(
output_dir="./results",
report_to="wandb",
run_name="distilbert-run1",
num_train_epochs=3,
)
# =========================================
# MLflow Integration
# =========================================
mlflow.set_tracking_uri("./mlflow_runs")
mlflow.set_experiment("bert-nlp-experiments")
with mlflow.start_run(run_name="distilbert-sst2"):
mlflow.log_params({
"model": "distilbert-base-uncased",
"lr": 2e-5,
"epochs": 3,
"batch_size": 32
})
args_mlflow = TrainingArguments(
output_dir="./results",
report_to="mlflow",
num_train_epochs=3,
)
# After training:
mlflow.log_metrics({"eval_f1": 0.924, "eval_accuracy": 0.931})
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
mlflow.pytorch.log_model(
model,
"model",
registered_model_name="bert-sentiment-classifier"
)
print(f"Run ID: {mlflow.active_run().info.run_id}")
Common Anti-Patterns with HuggingFace
- No batch processing: calling the pipeline in a loop on single texts is 10-50x slower than passing a full batch
- Reloading the model on every call: keep the model loaded in memory and reuse it; loading from disk takes seconds
- Ignoring max_length: without truncation, attention memory grows quadratically with sequence length; always set truncation=True
- Not using torch.no_grad() in inference: PyTorch builds the autograd graph unnecessarily, wasting memory
- Not calling model.eval(): BatchNorm and Dropout behave differently in training vs inference
- Always using padding="max_length": fixed-length padding wastes computation; in inference use dynamic padding
- Ignoring vocabulary size: embedding layers with large vocabularies can dominate model memory
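Several of these points combine into a single pattern: load once, batch, truncate, and let the pipeline handle `eval()` and `no_grad` internally. A sketch using a small public sentiment checkpoint (the model name is illustrative; substitute your own fine-tuned one):

```python
from transformers import pipeline

# Load once, outside any request loop (avoids reload-per-call)
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

texts = ["Great product!", "Terrible experience.", "It was fine."]
# One batched call with truncation instead of a Python loop over single texts
results = clf(texts, batch_size=32, truncation=True, max_length=128)
for text, res in zip(texts, results):
    print(f"{res['label']:>8}  {res['score']:.3f}  {text}")
```

The pipeline tokenizes each batch with dynamic padding, so short inputs are never padded out to `max_length`.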
Conclusions and Next Steps
The HuggingFace ecosystem has become the de facto standard for modern NLP.
Knowing its main libraries — transformers, datasets,
peft, accelerate — is fundamental for any
NLP engineer or ML engineer in 2025.
The strength of the ecosystem lies in the integration between components: datasets provides data, transformers the model, peft parameter optimization, accelerate hardware scalability, and optimum production inference. Each library can be used independently or in combination with the others.
Key Takeaways
- Use AutoClass to load any architecture with the same code
- The Pipeline API is perfect for rapid prototyping; use batch_size for production
- The Trainer API handles 90% of standard fine-tuning cases with customizable callbacks
- Custom training loops are necessary for custom loss functions, curriculum learning, and advanced training
- PEFT/LoRA drastically reduces memory: ~0.5% of parameters with nearly equivalent performance
- Accelerate enables distributed training without changing code
- ONNX offers 2-5x speedup in CPU inference compared to native PyTorch
- Integrate WandB or MLflow from the start for experiment tracking
Continue the Modern NLP Series
- Previous: Text Classification: Multi-label and Multi-class — BERT-based text classification
- Next: Fine-tuning LLMs Locally: LoRA on Consumer GPU — advanced QLoRA for LLMs 7B+
- Article 9: Semantic Similarity and Text Matching — SBERT, FAISS, dense retrieval
- Article 10: Monitoring NLP Models in Production — drift detection, automatic retraining
- Related series: AI Engineering/RAG — HuggingFace Embeddings for RAG pipelines
- Related series: Deep Learning Advanced — model quantization and optimization