Transfer Learning: Reusing Pre-trained Models for Computer Vision
Imagine teaching a child to recognize dog breeds. If that child already knows how to identify shapes, colors, textures, and general anatomical structures, the task becomes enormously simpler. They do not need to start from scratch: they can transfer their existing knowledge to the new task. This is exactly what Transfer Learning does in deep learning.
In this second article of the Computer Vision with Deep Learning series, we will explore Transfer Learning in depth: why it works, which strategies exist, how to choose the right pre-trained model, and how to implement complete pipelines in PyTorch. We will walk through a real industrial case study and the advanced techniques that professionals use every day.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | CNN Fundamentals | Architecture, training, deployment |
| 2 | Transfer Learning and Fine-Tuning (you are here) | Pre-trained models, domain adaptation |
| 3 | Object Detection with YOLO | Real-time detection |
| 4 | Semantic Segmentation | Pixel-level classification |
| 5 | Image Generation with GAN and Diffusion | Synthetic image generation |
| 6 | Edge Deployment and Optimization | Models on embedded devices |
What You Will Learn
- What Transfer Learning is and why it works (feature hierarchies in CNNs)
- Main strategies: feature extraction, fine-tuning, domain adaptation
- How to choose the right pre-trained model (ResNet, EfficientNet, ViT, ConvNeXt)
- Complete PyTorch implementation: from data preparation to deployment
- Advanced techniques: discriminative learning rates, gradual unfreezing, LR warmup
- Data augmentation optimized for Transfer Learning
- Practical case study: industrial defect classification with ResNet-50
- Transfer Learning applied to object detection (Faster R-CNN, YOLO)
- Common mistakes and how to avoid them
1. What is Transfer Learning
Transfer Learning is a machine learning technique where a model trained on one task (the source task) is reused as the starting point for a different task (the target task). Instead of training a neural network from scratch on millions of images, we take a model already trained (typically on ImageNet: 1.2 million images across 1000 classes) and adapt it to our specific problem.
1.1 The Human Analogy
Our brain operates in Transfer Learning mode constantly. A surgeon learning a new procedure does not need to relearn anatomy, physiology, or basic motor skills. A classical musician switching to jazz transfers their instrument technique, score reading, and harmonic theory. A Python developer learning Rust transfers programming concepts, debugging mentality, and algorithmic thinking. In every case, prior knowledge dramatically accelerates learning in a new domain.
1.2 Why It Works: The Feature Hierarchy
The fundamental reason Transfer Learning works in CNNs lies in the feature hierarchy they learn. Research has demonstrated that CNNs trained on ImageNet organize features in increasing levels of abstraction:
Feature Hierarchy in CNNs
| Layers | Features Learned | Specificity | Transferability |
|---|---|---|---|
| Layer 1-2 | Edges, corners, color gradients | Generic (task-agnostic) | Very High |
| Layer 3-4 | Textures, repeated patterns, geometric motifs | Semi-generic | High |
| Layer 5-6 | Object parts (eyes, wheels, windows) | Semi-specific | Medium |
| Layer 7+ | Complete objects, scenes, compositions | Task-specific | Low |
Early layers learn universal features: edges, textures, and gradients useful for any visual task. Middle layers capture more complex but still reasonably generic patterns. Only the final layers are highly specific to the original task. This means we can reuse most of the network as a powerful feature extractor and only adapt the final parts to our task.
ImageNet Pre-trained CNN (e.g. ResNet-50):
Input Image
     |
     v
[Layers 1-2] ---> Horizontal, vertical, diagonal edges
     |            Color gradients, blobs
     |            UNIVERSAL: useful for ANY image domain
     v
[Layers 3-4] ---> Textures (fur, metal, wood, fabric)
     |            Geometric patterns, grids
     |            SEMI-GENERIC: transferable to many domains
     v
[Layers 5-6] ---> Object parts (eyes, wheels, wings)
     |            Local compositions
     |            SEMI-SPECIFIC: domain-dependent
     v
[Layers 7+]  ---> Full classes (cat, car, bird)
     |            Highly ImageNet-specific features
     |            TASK-SPECIFIC: replace or fine-tune
     v
[Classifier] ---> 1000 ImageNet classes
                  ALWAYS replace for your task
Formal Definition
Given a source domain D_s with task T_s and a target domain D_t with task T_t, Transfer Learning aims to improve the learning function f_t in the target domain using knowledge extracted from D_s and T_s, where D_s != D_t or T_s != T_t. In practice, the weights theta learned from the source task are used as initialization (theta_0) for training on the target task, instead of random initialization. This warm start dramatically accelerates convergence.
2. Transfer Learning Strategies
There is no single way to apply Transfer Learning. The optimal strategy depends on the size of the target dataset, its similarity to the source dataset, and available compute. Let us examine the four main strategies.
2.1 Feature Extraction (Frozen Backbone)
The simplest strategy: the entire pre-trained backbone is frozen and used as a fixed feature extractor. Only a new classifier head is trained on top. Backbone weights never change.
When to use: Small dataset (hundreds to a few thousand images) and domain similar to the source (e.g., classifying dog breeds with a model pre-trained on ImageNet, which contains many dog images).
Pre-trained ResNet-50 (ImageNet):
+-----------------------------------------------------+
| [Conv layers] --> [Res blocks] --> [Global AvgPool] |  FROZEN
| ~23.5M parameters - NOT updated                     |  requires_grad = False
+-----------------------------------------------------+
                          |
                          v
                Feature vector (2048-dim)
                          |
                          v
                 +--------------------+
                 | [Linear 2048 -> N] |  TRAINABLE
                 | N = your classes   |  requires_grad = True
                 +--------------------+
                          |
                          v
                  Output: N classes
Advantages:
+ Very fast training (few parameters to optimize)
+ No powerful GPU required
+ Minimal overfitting risk
+ Works with small datasets
Disadvantages:
- Less flexible (features are fixed)
- Limited performance if domain is very different
2.2 Fine-Tuning (Unfreeze Some or All Layers)
In fine-tuning, after initializing the network with pre-trained weights, we unfreeze some or all layers and retrain the network (or part of it) with a very low learning rate. Pre-trained layers are slightly updated to adapt to the new domain, preserving previously acquired knowledge.
When to use: Medium to large target dataset (thousands to tens of thousands of images) and/or moderately different domain from the source.
Progressive Fine-Tuning Strategy:
Phase 1 - Feature Extraction (5-10 epochs):
[Backbone FROZEN] --> [New Classifier] TRAINED (lr=1e-3)
Phase 2 - Partial Fine-Tuning (10-20 epochs):
[Layers 1-3 FROZEN] --> [Layer 4 UNFROZEN lr=1e-5] --> [Classifier lr=1e-4]
Phase 3 - Full Fine-Tuning (optional, 5-10 epochs):
[ALL layers UNFROZEN lr=1e-6] --> [Classifier lr=1e-5]
Progressive learning rates:
Initial layers: lr = 1e-6 (generic features, change very little)
Middle layers: lr = 1e-5 (adapt gradually)
Final layers: lr = 1e-4 (adapt to new domain)
Classifier: lr = 1e-3 (learn from scratch)
2.3 Domain Adaptation
Domain Adaptation is a specialized form of Transfer Learning used when the source domain and target domain share the same classes but have different data distributions. For example, a model trained on professional product photos that must work on factory images with variable lighting. Techniques like DANN (Domain-Adversarial Neural Network) add a domain discriminator that forces the network to learn domain-invariant features.
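The core of DANN is a gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass, so the shared feature extractor learns to fool the domain discriminator. A minimal PyTorch sketch (the class name and the `lam` coefficient are illustrative, not DANN's official API):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in forward; multiplies the incoming gradient by -lam in backward."""

    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the feature extractor
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# Sanity check: the gradient reaching x is sign-flipped
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lam=1.0).sum().backward()
print(x.grad)  # tensor([-1., -1., -1.])
```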
2.4 Zero-Shot and Few-Shot Transfer
With models like CLIP (Contrastive Language-Image Pre-training), it is possible to classify images into categories never seen during training (zero-shot) or with very few examples (few-shot). CLIP learns a joint text-image representation: given a textual prompt like "a photo of a welding defect", the model can classify images without any specific training. DINOv2, trained with self-supervised learning on 142M images, provides extremely transferable generic features.
Transfer Learning Strategy Comparison
| Strategy | Data Required | Training Time | Performance | Overfitting Risk |
|---|---|---|---|---|
| Feature Extraction | 100-1,000 | Minutes | Good | Very Low |
| Partial Fine-Tuning | 1,000-10,000 | Hours | Very Good | Low |
| Full Fine-Tuning | 10,000+ | Hours-Days | Excellent | Medium |
| Domain Adaptation | Variable | Hours | Good-Excellent | Medium |
| Zero-Shot (CLIP) | 0 | None | Variable | None |
3. Pre-trained Models for Computer Vision
Choosing the right pre-trained model is a critical decision. Each architecture has different tradeoffs between accuracy, inference speed, model size, and memory requirements. Here is an overview of the most widely used models in 2025-2026.
Pre-trained Model Comparison (ImageNet Top-1 Accuracy)
| Model | Parameters | Top-1 Acc | Type | Ideal Use |
|---|---|---|---|---|
| ResNet-50 | 25.6M | 80.9% (V2) | CNN | Solid baseline, easy deployment |
| EfficientNet-B0 | 5.3M | 77.1% | CNN | Mobile, edge, limited resources |
| EfficientNet-B4 | 19.3M | 84.0% | CNN | Best accuracy/efficiency ratio |
| EfficientNet-B7 | 66.3M | 86.3% | CNN | Maximum CNN accuracy |
| ConvNeXt-T | 28.6M | 82.1% | Modern CNN | Best accuracy/speed tradeoff |
| ConvNeXt-B | 88.6M | 85.8% | Modern CNN | High accuracy with CNN simplicity |
| ViT-B/16 | 86.6M | 86.0% | Transformer | Large datasets, global attention |
| Swin-T | 28.3M | 81.3% | Transformer | Detection and segmentation |
| CLIP ViT-B/32 | 151M (total) | 63.2% zero-shot | Multimodal | Zero-shot, visual search |
| DINOv2 ViT-S/14 | 22M | 81.1% linear probe | Self-supervised | Generic features, few labeled samples |
| MobileNetV3-Large | 5.5M | 75.2% | CNN | Edge and mobile deployment |
3.1 ResNet-50: The Workhorse
ResNet-50 remains the most popular model for Transfer Learning thanks to its simplicity, training stability, and broad ecosystem support. Skip connections (introduced in the previous article) allow training deep networks without vanishing gradient problems. The V2 weights (IMAGENET1K_V2), trained with modern techniques like Mixup, CutMix, and Random Erasing, achieve an impressive 80.9% top-1 accuracy.
3.2 EfficientNet: Compound Scaling
EfficientNet (Tan & Le, 2019) introduced compound scaling: rather than increasing only depth (more layers), width (more channels), or resolution alone, it scales all three dimensions simultaneously using a fixed coefficient. This produces a family of models (B0 through B7) that dominate the accuracy/efficiency Pareto frontier. EfficientNet-B4 is the sweet spot for most production use cases.
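Compound scaling picks a coefficient phi and multiplies depth, width, and resolution by alpha^phi, beta^phi, and gamma^phi, with the constraint alpha * beta^2 * gamma^2 ~= 2 so that each increment of phi roughly doubles FLOPs. A sketch of the arithmetic, using the base constants from the EfficientNet paper:

```python
# EfficientNet compound-scaling constants (Tan & Le, 2019)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: int) -> tuple:
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs scale roughly with depth * width^2 * resolution^2, so the constraint
# alpha * beta^2 * gamma^2 ~= 2 approximately doubles cost per step of phi
flops_factor = ALPHA * BETA**2 * GAMMA**2
print(f"{flops_factor:.2f}")  # ~1.92, close to the target of 2
```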
3.3 Vision Transformer (ViT) and Swin Transformer
Vision Transformers apply the Transformer architecture (originally created for NLP) to computer vision. The image is divided into patches (e.g., 16x16 pixels), each patch treated as a token and processed with self-attention. ViT excels when pre-trained on large datasets (ImageNet-21k, JFT-300M) but can underperform CNNs on small datasets. Swin Transformer introduces shifted window attention, making it more efficient and particularly suitable for dense prediction tasks like detection and segmentation.
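The patch tokenization is simple arithmetic: a 224x224 image with 16x16 patches yields (224/16)^2 = 196 tokens, which is the sequence length the self-attention layers operate on. A sketch (the helper name is illustrative):

```python
def num_patches(img_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens a ViT sees for a square image (excluding the [CLS] token)."""
    assert img_size % patch_size == 0, "image size must be divisible by patch size"
    return (img_size // patch_size) ** 2

print(num_patches(224, 16))  # 196 tokens for ViT-B/16
print(num_patches(518, 14))  # 1369 tokens for a patch-14 model at 518px
```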
3.4 ConvNeXt: Modernized CNN
ConvNeXt demonstrates that CNNs can compete with Transformers if modernized with the same training techniques (AdamW, Mixup, layer scale, Stochastic Depth). ConvNeXt-T achieves 82.1% with only 28.6M parameters, offering an excellent tradeoff between accuracy, speed, and deployment simplicity. It is increasingly the default choice when you want Transformer-level performance without Transformer deployment complexity.
3.5 DINOv2: Self-Supervised Learning
DINOv2 is trained with self-supervised learning (without labels) on an enormous curated dataset (LVD-142M images). The extracted features are extremely generic and transferable: a simple linear classifier added on top achieves competitive results with full supervised fine-tuning. It is particularly useful when you have very few labeled examples in the target domain, making it ideal for industrial inspection, medical imaging, and remote sensing applications.
4. When to Use Transfer Learning: The Decision Matrix
Strategy selection depends on two key factors: the size of the target dataset and the similarity between the source and target domains. This generates four decision quadrants.
Decision matrix: dataset size vs. similarity to the source domain

| Dataset Size | High Similarity to Source | Low Similarity to Source |
|---|---|---|
| Large (10k+) | Quadrant 1: Full fine-tuning. Unfreeze all layers, low learning rate, high performance. Example: dog breeds (ImageNet contains many dog images). | Quadrant 2: Careful fine-tuning. Unfreeze only the final layers, very low LR for the backbone, strong augmentation. Example: medical images (very different from ImageNet). |
| Small (100-1k) | Quadrant 3: Feature extraction. Freeze the backbone and train only the classifier; no overfitting, fast training. Example: 200 flower photos (similar to ImageNet). | Quadrant 4: Limited options. Try feature extraction, very aggressive augmentation, collect more data, or DINOv2/CLIP features (self-supervised). |
Practical Rule
In 2025-2026, the answer to "Should I use Transfer Learning?" is almost always yes. Training a CNN from scratch is only justified in very specific cases: enormous datasets (millions of images), domains radically different from natural images (e.g., spectrograms, radar signals), or particular architectural constraints.
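The four quadrants can be encoded as a small triage helper (the function and its thresholds are illustrative, following the indicative values from the matrix above, not hard rules):

```python
def recommend_strategy(num_images: int, domain_similar: bool) -> str:
    """Map (dataset size, domain similarity) to a Transfer Learning strategy."""
    if num_images >= 10_000:
        # Quadrants 1 and 2: enough data to adapt the backbone itself
        return "full fine-tuning" if domain_similar else "careful fine-tuning (final layers, low LR)"
    if domain_similar:
        # Quadrant 3: pre-trained features already match the domain
        return "feature extraction (frozen backbone)"
    # Quadrant 4: small and dissimilar - lean on augmentation or self-supervised features
    return "feature extraction + heavy augmentation, or DINOv2/CLIP features"

print(recommend_strategy(200, True))      # feature extraction (frozen backbone)
print(recommend_strategy(50_000, True))   # full fine-tuning
```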
5. PyTorch Implementation
Let us move to practice. We will implement Transfer Learning step by step in PyTorch, starting from loading a pre-trained model through complete training with best practices.
5.1 Loading a Pre-trained Model
PyTorch offers two APIs for loading pre-trained models. The modern API (introduced in torchvision 0.13+) uses the Weights enum, which provides detailed information about the weights, including the required preprocessing transformations. Always prefer this API.
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models import ResNet50_Weights
def create_feature_extractor(num_classes: int, device: str = 'cuda') -> nn.Module:
    """
    Creates a frozen-backbone feature extractor based on ResNet-50.
    Only the final classifier is trained.

    Args:
        num_classes: Number of target classes
        device: Device to load the model onto

    Returns:
        Model with frozen backbone and custom classifier
    """
    # Load pre-trained weights (V2 = 80.9% top-1 on ImageNet)
    weights = ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights)

    # Freeze ALL backbone parameters
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final classifier - this is the ONLY part we train
    # (newly created layers have requires_grad=True by default)
    in_features = model.fc.in_features  # 2048 for ResNet-50
    model.fc = nn.Sequential(
        nn.Dropout(p=0.5),
        nn.Linear(in_features, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.3),
        nn.Linear(512, num_classes)
    )

    # Report trainable vs frozen parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.1f}%)")
    # For num_classes=6: Trainable: 1,052,166 / 24,560,198 (4.3%)

    return model.to(device)

# Usage
model = create_feature_extractor(num_classes=6)  # e.g. 6 defect types
5.2 Full Fine-Tuning with Discriminative Learning Rates
All weights are updated, but with different learning rates per layer group. Early layers that encode universal features should be updated very slowly (they are already well trained), while the classifier can use a much higher rate. This technique, called discriminative learning rates, is key for preventing catastrophic forgetting.
def create_discriminative_optimizer(model: nn.Module, base_lr: float = 1e-4) -> torch.optim.Optimizer:
    """
    Different learning rates per layer group:
    - Stem and early layers (conv1, bn1, layer1, layer2): 10x lower - preserve universal features
    - Middle layers (layer3): 5x lower
    - Late layers (layer4): 2x lower
    - Classifier (fc): full base_lr - learn task-specific features
    """
    param_groups = [
        {
            # startswith avoids matching block-level names like 'layer4.2.conv1.weight'
            # and ensures the stem (conv1/bn1) is not left out of the optimizer
            'params': [p for n, p in model.named_parameters()
                       if n.startswith(('conv1', 'bn1', 'layer1', 'layer2'))],
            'lr': base_lr / 10
        },
        {
            'params': [p for n, p in model.named_parameters() if n.startswith('layer3')],
            'lr': base_lr / 5
        },
        {
            'params': [p for n, p in model.named_parameters() if n.startswith('layer4')],
            'lr': base_lr / 2
        },
        {
            'params': model.fc.parameters(),
            'lr': base_lr
        }
    ]
    return torch.optim.AdamW(param_groups, weight_decay=1e-4)

def gradual_unfreeze(model: nn.Module, epoch: int) -> None:
    """
    Unfreeze layers gradually to prevent catastrophic forgetting.
    Epoch 0-4:   only classifier trainable
    Epoch 5-9:   + layer4
    Epoch 10-14: + layer3
    Epoch 15+:   all layers
    """
    layers_to_unfreeze = []
    if epoch >= 5:
        layers_to_unfreeze.append(model.layer4)
    if epoch >= 10:
        layers_to_unfreeze.append(model.layer3)
    if epoch >= 15:
        layers_to_unfreeze.extend([model.layer1, model.layer2])
    for layer in layers_to_unfreeze:
        for param in layer.parameters():
            param.requires_grad = True
6. Data Pipeline for Transfer Learning
The data pipeline for Transfer Learning has a critical requirement: since the backbone expects ImageNet-normalized inputs, you must apply the correct normalization (mean and std from ImageNet). Using wrong normalization is one of the most common mistakes and can silently degrade performance by several percentage points.
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from pathlib import Path
# ImageNet normalization - MANDATORY for pre-trained models
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
def get_train_transforms(img_size: int = 224) -> transforms.Compose:
    return transforms.Compose([
        transforms.RandomResizedCrop(img_size, scale=(0.7, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.2),
        transforms.ColorJitter(brightness=0.3, contrast=0.3,
                               saturation=0.3, hue=0.1),
        transforms.RandomRotation(degrees=15),
        transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
        transforms.RandomErasing(p=0.2, scale=(0.02, 0.15))
    ])

def get_val_transforms(img_size: int = 224) -> transforms.Compose:
    """Validation: no augmentation, only deterministic preprocessing."""
    return transforms.Compose([
        transforms.Resize(int(img_size * 256 / 224)),  # 256 for img_size=224
        transforms.CenterCrop(img_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])

def create_data_loaders(
    data_dir: str,
    batch_size: int = 32,
    img_size: int = 224,
    num_workers: int = 4
) -> tuple:
    """
    Creates train/val data loaders.
    Expects directory structure:
        data_dir/train/class_a/*.jpg
        data_dir/val/class_a/*.jpg
    """
    data_path = Path(data_dir)
    train_ds = ImageFolder(str(data_path / 'train'),
                           transform=get_train_transforms(img_size))
    val_ds = ImageFolder(str(data_path / 'val'),
                         transform=get_val_transforms(img_size))

    train_loader = DataLoader(
        train_ds, batch_size=batch_size, shuffle=True,
        num_workers=num_workers, pin_memory=True, drop_last=True
    )
    val_loader = DataLoader(
        val_ds, batch_size=batch_size * 2, shuffle=False,
        num_workers=num_workers, pin_memory=True
    )

    print(f"Classes: {train_ds.classes}")
    print(f"Train: {len(train_ds)} | Val: {len(val_ds)}")
    return train_loader, val_loader, train_ds.classes
7. Learning Rate Warmup and Cosine Scheduling
Learning rate warmup gradually increases the LR during the first few epochs instead of starting at the target value directly. This prevents destabilizing the pre-trained weights early in training and is one of the most impactful tricks for fine-tuning. Combined with cosine annealing, it provides a smooth LR decay over the rest of training.
import math
from torch.optim.lr_scheduler import _LRScheduler
class WarmupCosineScheduler(_LRScheduler):
    """
    Linear warmup followed by cosine annealing.
    Ideal for fine-tuning pre-trained models.
    """

    def __init__(self, optimizer, warmup_epochs: int, total_epochs: int,
                 min_lr: float = 1e-6, last_epoch: int = -1):
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.min_lr = min_lr
        super().__init__(optimizer, last_epoch)

    def get_lr(self) -> list:
        if self.last_epoch < self.warmup_epochs:
            # Linear warmup towards base_lr; the +1 keeps the first epoch's LR above zero
            factor = (self.last_epoch + 1) / max(1, self.warmup_epochs)
            return [base_lr * factor for base_lr in self.base_lrs]
        else:
            # Cosine annealing: base_lr -> min_lr
            progress = (self.last_epoch - self.warmup_epochs) / max(
                1, self.total_epochs - self.warmup_epochs
            )
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return [
                self.min_lr + (base_lr - self.min_lr) * cosine
                for base_lr in self.base_lrs
            ]

# Usage (train_one_epoch and evaluate are standard train/eval loop helpers)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = WarmupCosineScheduler(
    optimizer, warmup_epochs=5, total_epochs=50, min_lr=1e-6
)

for epoch in range(50):
    train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_acc = evaluate(model, val_loader, device)
    scheduler.step()
    lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1:2d} | LR: {lr:.2e} | Val Acc: {val_acc:.2f}%")
8. EfficientNet-B4: Two-Phase Training
EfficientNet-B4 provides the best accuracy/efficiency ratio for most production use cases. Two-phase training (classifier warmup followed by backbone fine-tuning) is the standard approach recommended by the literature.
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models import EfficientNet_B4_Weights
class EfficientNetClassifier(nn.Module):
    """EfficientNet-B4 with custom classifier head for fine-tuning."""

    def __init__(self, num_classes: int, dropout_rate: float = 0.4):
        super().__init__()
        backbone = models.efficientnet_b4(weights=EfficientNet_B4_Weights.IMAGENET1K_V1)

        # Keep feature extraction backbone and pooling
        self.features = backbone.features
        self.avgpool = backbone.avgpool

        # Custom classifier head
        in_features = backbone.classifier[1].in_features  # 1792 for B4
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout_rate, inplace=True),
            nn.Linear(in_features, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

    def freeze_backbone(self) -> None:
        """Freeze all backbone parameters."""
        for param in self.features.parameters():
            param.requires_grad = False

    def unfreeze_last_n_blocks(self, n: int = 3) -> None:
        """Unfreeze the last n blocks of EfficientNet-B4 (9 blocks total)."""
        for i, block in enumerate(self.features):
            should_train = i >= (len(self.features) - n)
            for param in block.parameters():
                param.requires_grad = should_train

def two_phase_training(
    model: EfficientNetClassifier,
    train_loader,
    val_loader,
    device: torch.device
) -> dict:
    """
    Phase 1 (5 epochs): frozen backbone, warm up classifier
    Phase 2 (15 epochs): unfreeze last 3 blocks, cosine LR schedule
    """
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    history = {'train_acc': [], 'val_acc': []}

    # ---- Phase 1: classifier warmup ----
    model.freeze_backbone()
    opt1 = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=3e-3, weight_decay=1e-4
    )
    print("Phase 1: Classifier warmup (frozen backbone)")
    for epoch in range(5):
        t_acc = train_one_epoch(model, train_loader, opt1, criterion, device)
        v_acc = evaluate(model, val_loader, device)
        print(f"  Warmup {epoch+1}/5 | Train: {t_acc:.1f}% Val: {v_acc:.1f}%")

    # ---- Phase 2: fine-tuning ----
    print("\nPhase 2: Fine-tuning (last 3 blocks unfrozen)")
    model.unfreeze_last_n_blocks(n=3)
    opt2 = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=1e-4, weight_decay=1e-4
    )
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt2, T_max=15, eta_min=1e-6)

    for epoch in range(15):
        t_acc = train_one_epoch(model, train_loader, opt2, criterion, device)
        v_acc = evaluate(model, val_loader, device)
        sched.step()
        history['train_acc'].append(t_acc)
        history['val_acc'].append(v_acc)
        print(f"  FT {epoch+1}/15 | Train: {t_acc:.1f}% Val: {v_acc:.1f}% "
              f"LR: {sched.get_last_lr()[0]:.2e}")

    return history
9. Case Study: Industrial Defect Classification
Let us apply everything learned to a real scenario: classifying surface defects on steel sheets using the NEU Steel Surface Defects Dataset (6 defect classes, ~300 images each at 200x200px). This is a representative industrial use case with a small dataset and a specific domain.
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader
class IndustrialDefectClassifier:
    """
    End-to-end system for surface defect classification using ResNet-50 fine-tuning.
    Includes two-phase training, early stopping, and ONNX export.
    """

    CLASSES = ['crazing', 'inclusion', 'patches',
               'pitted_surface', 'rolled_in_scale', 'scratches']

    def __init__(self, device: str = 'auto'):
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(device)
        self.model = self._build_model(len(self.CLASSES)).to(self.device)
        print(f"Model on: {self.device}")

    def _build_model(self, n_cls: int) -> nn.Module:
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        # Multi-layer classifier head with BatchNorm for better regularization
        model.fc = nn.Sequential(
            nn.Linear(2048, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
            nn.Linear(512, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(128, n_cls)
        )
        return model

    def train(self, train_loader: DataLoader, val_loader: DataLoader,
              total_epochs: int = 30) -> dict:
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
        best_acc = 0.0
        patience = 8
        no_improve = 0
        history = {'val_acc': []}

        # Phase 1: classifier warmup (5 epochs, frozen backbone)
        for param in self.model.parameters():
            param.requires_grad = False
        for param in self.model.fc.parameters():
            param.requires_grad = True

        opt1 = torch.optim.AdamW(self.model.fc.parameters(), lr=3e-3)
        print("Phase 1: Classifier warmup")
        for ep in range(5):
            t_acc = self._run_epoch(train_loader, opt1, criterion, training=True)
            v_acc = self._run_epoch(val_loader, None, criterion, training=False)
            print(f"  Warmup {ep+1}/5 | T: {t_acc:.1f}% V: {v_acc:.1f}%")

        # Phase 2: full fine-tuning with OneCycleLR
        for param in self.model.parameters():
            param.requires_grad = True
        opt2 = torch.optim.AdamW(self.model.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt2, max_lr=1e-4, epochs=total_epochs,
            steps_per_epoch=len(train_loader)
        )

        print("\nPhase 2: Full fine-tuning")
        for ep in range(total_epochs):
            t_acc = self._run_epoch(train_loader, opt2, criterion,
                                    training=True, sched=sched)
            v_acc = self._run_epoch(val_loader, None, criterion, training=False)
            history['val_acc'].append(v_acc)

            if v_acc > best_acc:
                best_acc = v_acc
                torch.save(self.model.state_dict(), 'best_defect_model.pth')
                no_improve = 0
            else:
                no_improve += 1

            print(f"Epoch {ep+1:2d}/{total_epochs} | "
                  f"T: {t_acc:.1f}% V: {v_acc:.1f}% Best: {best_acc:.1f}%")

            if no_improve >= patience:
                print(f"Early stopping at epoch {ep+1}")
                break

        return history

    def _run_epoch(self, loader, optimizer, criterion,
                   training: bool, sched=None) -> float:
        self.model.train(training)
        correct = total = 0
        with torch.set_grad_enabled(training):
            for images, labels in loader:
                images, labels = images.to(self.device), labels.to(self.device)
                if training:
                    optimizer.zero_grad(set_to_none=True)
                outputs = self.model(images)
                loss = criterion(outputs, labels)
                if training:
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                    optimizer.step()
                    if sched:
                        sched.step()
                correct += outputs.argmax(1).eq(labels).sum().item()
                total += labels.size(0)
        return 100. * correct / total

    def export_onnx(self, path: str = 'defect_classifier.onnx') -> None:
        """Export to ONNX for dependency-free production inference."""
        self.model.eval()
        dummy = torch.randn(1, 3, 224, 224, device=self.device)
        torch.onnx.export(
            self.model, dummy, path,
            opset_version=17,
            input_names=['image'],
            output_names=['logits'],
            dynamic_axes={'image': {0: 'batch'}, 'logits': {0: 'batch'}}
        )
        print(f"Model exported to {path}")
10. Best Practices and Anti-patterns
Proven Recommendations
- Always use the latest weights: Use `weights=Model_Weights.DEFAULT` for the best available pre-trained weights in torchvision. V2 weights trained with modern recipes significantly outperform V1.
- Match input preprocessing exactly: Different models use different normalization. EfficientNet and ResNet use ImageNet statistics. Always use the `weights.transforms()` method to get the correct transforms automatically.
- Two-phase training: Always warm up the classifier for a few epochs with a frozen backbone before full fine-tuning. This prevents the random classifier from destabilizing the pre-trained features.
- Discriminative LRs: Early layers need a 10-100x lower LR than the classifier head. The backbone already knows universal features; you only want to gently adapt them.
- Label smoothing: `CrossEntropyLoss(label_smoothing=0.1)` prevents overconfidence and improves generalization, especially with small datasets.
- Early stopping with patience: Transfer Learning converges faster than from-scratch training. Monitor validation accuracy and stop when it plateaus (patience 5-15 epochs).
- Test Time Augmentation (TTA): Average predictions over multiple augmented views (horizontal flip, scale) for +1-2% accuracy at inference time with no additional training.
- BatchNorm in eval mode: When freezing backbone layers that include BatchNorm, explicitly set those layers to eval mode to use running statistics instead of batch statistics.
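The TTA recommendation above can be sketched with a single horizontal-flip pass; production pipelines often add multi-scale crops as well (the function name is illustrative):

```python
import torch

@torch.no_grad()
def tta_predict(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Average softmax predictions over the original and horizontally flipped batch."""
    model.eval()
    probs = model(images).softmax(dim=1)
    probs += model(torch.flip(images, dims=[3])).softmax(dim=1)  # flip the width axis
    return probs / 2

# Toy check with a stand-in "model" that maps (N, 3, H, W) -> (N, 3) logits
toy = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
out = tta_predict(toy, torch.randn(4, 3, 224, 224))
print(out.shape)  # torch.Size([4, 3])
```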
Common Mistakes to Avoid
- Too high LR for backbone: Values above 1e-4 for the backbone will destroy pre-trained knowledge through what is called "catastrophic forgetting". Start at 1e-5 or 1e-6.
- Forgetting ImageNet normalization: Without proper normalization the pre-trained model receives inputs from an entirely different distribution. This is one of the most silent and damaging mistakes.
- Unfreezing everything at once: Unfreezing the entire backbone at once with too high a learning rate erases all pre-trained knowledge. Always use gradual unfreezing.
- Tuning on the test set: Never make architectural or hyperparameter decisions based on test set performance. Always keep a clean hold-out set that you evaluate only once at the end.
- Ignoring class imbalance: With small datasets, class imbalance is common. Use weighted losses (`CrossEntropyLoss(weight=class_weights)`) or oversampling strategies like `WeightedRandomSampler`.
- Wrong model for the domain: ViT models require more data to fine-tune effectively than CNNs. For datasets with fewer than 5k images, prefer ResNet or EfficientNet over ViT.
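For the class-imbalance point, inverse-frequency weights and a `WeightedRandomSampler` can be built from the label list in a few lines (a sketch; the toy `labels` tensor stands in for something like ImageFolder's `dataset.targets`):

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 1, 1, 2])  # toy imbalanced label list

# Inverse-frequency class weights, usable as CrossEntropyLoss(weight=class_weights)
counts = torch.bincount(labels).float()
class_weights = counts.sum() / (len(counts) * counts)

# Per-sample weights so each class is drawn roughly equally often
sample_weights = class_weights[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

print(class_weights)  # the rarest class gets the largest weight
```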
10.1 Performance Benchmark
Results on NEU Steel Surface Defects (~300 images/class, 6 classes)
| Approach | Val Accuracy | Training Time | Epochs to Convergence |
|---|---|---|---|
| Small CNN from scratch | 67.3% | ~45 min | 100+ |
| ResNet-50 Feature Extraction | 88.7% | ~8 min | 20-30 |
| ResNet-50 Fine-Tune (last 2 blocks) | 93.4% | ~15 min | 30-40 |
| ResNet-50 Full Fine-Tuning | 95.1% | ~25 min | 40-50 |
| EfficientNet-B4 Fine-Tuning | 96.8% | ~20 min | 35-45 |
| ConvNeXt-T Fine-Tuning | 96.2% | ~22 min | 35-45 |
11. Transfer Learning for Object Detection
Transfer Learning is not limited to classification. In object detection, the backbone (the feature extractor) is almost always initialized with ImageNet pre-trained weights, while the detection head is trained from scratch on the target dataset. YOLO, Faster R-CNN, and RetinaNet all use this approach.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def create_detection_model(num_classes: int) -> torch.nn.Module:
"""
Faster R-CNN with ResNet-50-FPN backbone pre-trained on COCO.
Replace the box predictor head for custom class detection.
Args:
num_classes: number of target classes + 1 (background)
"""
# Load model pre-trained on COCO (91 classes)
model = fasterrcnn_resnet50_fpn_v2(
weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
)
# Get input size of the existing classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Replace with new predictor for our classes
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
return model
# Detection training loop
def train_detection_epoch(model, data_loader, optimizer, device):
model.train()
total_loss = 0.0
for images, targets in data_loader:
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
# Faster R-CNN returns a dict of losses during training
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
optimizer.zero_grad(set_to_none=True)
losses.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += losses.item()
return total_loss / len(data_loader)
# Transfer Learning stages for detection:
# Stage 1: freeze backbone, train head only (3-5 epochs)
# Stage 2: unfreeze FPN layers, lower LR (5-10 epochs)
# Stage 3: full fine-tuning with very low LR (5-10 epochs)
model = create_detection_model(num_classes=7) # 6 defect classes + background
print(f"Detection model: {sum(p.numel() for p in model.parameters()):,} parameters")
12. Parameter-Efficient Fine-Tuning: LoRA for Vision Models
Full fine-tuning of large models (ViT-L/16: 307M params, ConvNeXt-XL: 350M params) is expensive in GPU memory and compute. LoRA (Low-Rank Adaptation) and related PEFT techniques achieve comparable accuracy by training only a tiny fraction of parameters - typically 0.1-1% of the original model. Originally developed for LLMs (Hu et al., 2021), LoRA transfers directly to Vision Transformers.
Parameter-Efficient Methods for Vision Models
| Method | Trainable Params | Memory vs Full FT | Accuracy (ImageNet) | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | 1x (baseline) | Highest | Large dataset, ample GPU |
| Linear Probe | ~0.1% | 0.15x | -3 to -5% | Similar domain, few data |
| LoRA (r=16) | ~0.5% | 0.4x | -0.5 to -1% | Large ViT, limited GPU |
| Adapter Tuning | ~2-5% | 0.5x | -0.5 to -1.5% | Multi-task fine-tuning |
| Prompt Tuning | <0.1% | 0.15x | -2 to -4% | ViT, many tasks simultaneously |
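The trainable-parameter figures in the table can be sanity-checked with quick arithmetic. For ViT-B/16 (embed dim 768, 12 blocks) with rank-16 LoRA on the Q and V projections, each adapted projection contributes an A matrix of shape (rank, in) and a B matrix of shape (out, rank):

```python
embed_dim, blocks, rank = 768, 12, 16  # ViT-B/16 with LoRA r=16

# One LoRA layer holds A in R^(rank x in) plus B in R^(out x rank)
params_per_lora = rank * embed_dim + embed_dim * rank

# A LoRA pair (Q and V projections) in each of the 12 transformer blocks
lora_params = blocks * 2 * params_per_lora
print(f"LoRA params: {lora_params:,}")  # 589,824

vit_b_total = 86_000_000  # ViT-B/16 parameter count, approximate
print(f"Fraction of model: {100 * lora_params / vit_b_total:.2f}%")  # ~0.69%
```

The slightly larger ~598K figure reported by PEFT later in this article also counts the new classification head, which this back-of-the-envelope check leaves out.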
import torch
import torch.nn as nn
from timm import create_model
# pip install peft (PEFT library from Hugging Face)
from peft import get_peft_model, LoraConfig, TaskType
# LoRA implementation from scratch for understanding
class LoRALayer(nn.Module):
"""
LoRA: Low-Rank Adaptation of a Linear layer.
Original weight W is frozen; only A and B are trained.
For an original Linear(in, out):
W_fine-tuned = W_original + B @ A where B in R^(out x r), A in R^(r x in)
Key insight: the update delta_W = B @ A has rank r << min(in, out).
We initialize A ~ N(0, sigma), B = 0 (so delta_W starts at 0 = no change).
"""
def __init__(self, in_features: int, out_features: int,
rank: int = 16, alpha: float = 32.0):
super().__init__()
# Trainable low-rank matrices
self.lora_A = nn.Parameter(
torch.randn(rank, in_features) * (0.01) # small init
)
self.lora_B = nn.Parameter(
torch.zeros(out_features, rank) # B=0 means delta_W starts at 0
)
# Scaling: alpha/rank controls effective learning rate of the update
# Higher alpha/rank = larger effective LR for LoRA params
self.scale = alpha / rank
self.rank = rank
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""LoRA delta: adds B @ A @ x to the frozen base layer output."""
# delta_W @ x = (B @ A) @ x = B @ (A @ x) - sequential for efficiency
lora_out = (x @ self.lora_A.T) @ self.lora_B.T
return lora_out * self.scale
class ViTWithLoRA(nn.Module):
"""
Vision Transformer with LoRA applied to attention Q, K, V projections.
Uses timm for the base ViT model.
LoRA is inserted into every transformer block's attention layers.
Only LoRA parameters and the classification head are trained (~0.5% of total).
"""
def __init__(self, model_name: str = 'vit_base_patch16_224',
num_classes: int = 10,
lora_rank: int = 16,
lora_alpha: float = 32.0,
pretrained: bool = True):
super().__init__()
# Load pre-trained ViT (all parameters frozen initially)
self.vit = create_model(model_name, pretrained=pretrained,
num_classes=0) # Remove head
# Freeze all parameters
for param in self.vit.parameters():
param.requires_grad = False
        # Apply LoRA to the fused QKV projection in every attention block.
        # timm's ViT uses a single Linear producing [q | k | v] concatenated,
        # so we wrap its forward and add the LoRA deltas to the Q and V slices.
        self.lora_layers = nn.ModuleList()
        embed_dim = self.vit.embed_dim
        for block in self.vit.blocks:
            attn = block.attn
            # Q projection LoRA
            lora_q = LoRALayer(embed_dim, embed_dim, lora_rank, lora_alpha)
            # V projection LoRA
            lora_v = LoRALayer(embed_dim, embed_dim, lora_rank, lora_alpha)
            self.lora_layers.extend([lora_q, lora_v])
            # Patch the qkv forward so the LoRA deltas are actually applied
            base_forward = attn.qkv.forward  # bound original forward
            def qkv_with_lora(x, base=base_forward, lq=lora_q, lv=lora_v):
                out = base(x)                      # [..., 3 * embed_dim]
                dq, dv = lq(x), lv(x)
                # Add deltas to the Q and V slices; K is left unchanged
                return out + torch.cat([dq, torch.zeros_like(dq), dv], dim=-1)
            attn.qkv.forward = qkv_with_lora
# Trainable classification head
self.classifier = nn.Sequential(
nn.LayerNorm(embed_dim),
nn.Linear(embed_dim, num_classes)
)
def count_trainable_params(self) -> dict:
"""Count trainable vs frozen parameters."""
trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
total = sum(p.numel() for p in self.parameters())
return {
'trainable': trainable,
'total': total,
'trainable_pct': 100 * trainable / total
}
def forward(self, x: torch.Tensor) -> torch.Tensor:
features = self.vit(x) # [B, embed_dim]
return self.classifier(features)
# Using PEFT library (simpler approach for production)
def create_lora_model_with_peft(model_name: str = 'vit_base_patch16_224',
num_classes: int = 10,
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.1) -> nn.Module:
"""
Create LoRA model using HuggingFace PEFT library.
More production-ready than manual LoRA injection.
Requires: pip install peft timm transformers
"""
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model
# Load ViT from transformers
base_model = ViTForImageClassification.from_pretrained(
        'google/vit-base-patch16-224-in21k',
num_labels=num_classes,
ignore_mismatched_sizes=True
)
# Configure LoRA
config = LoraConfig(
r=lora_r, # Rank of the low-rank matrices
lora_alpha=lora_alpha, # Scaling factor
target_modules=['query', 'value'], # Apply LoRA to Q and V projections
lora_dropout=lora_dropout,
bias='none', # Don't train bias terms
        modules_to_save=['classifier']  # PEFT's TaskType has no vision entry; keep the new head trainable instead
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
# Output example: trainable params: 598,530 || all params: 86,568,194 || 0.69%
return peft_model
# Training LoRA model - identical to standard training
def train_lora_epoch(model, loader, optimizer, device):
"""LoRA training loop - only LoRA params and head are updated."""
model.train()
total_loss = 0.0
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
# Only LoRA params have requires_grad=True, so only they get updated
logits = model(images)
if hasattr(logits, 'logits'): # HuggingFace model
logits = logits.logits
loss = criterion(logits, labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
return total_loss / len(loader)
# Memory and speed comparison
def compare_fine_tuning_strategies():
"""
Practical comparison of fine-tuning strategies
on ViT-B/16 with 10-class classification.
"""
model_full = create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
params_full = sum(p.numel() for p in model_full.parameters() if p.requires_grad)
# Frozen backbone (linear probe)
model_linear = create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
for name, param in model_linear.named_parameters():
if 'head' not in name:
param.requires_grad = False
params_linear = sum(p.numel() for p in model_linear.parameters() if p.requires_grad)
print("Fine-tuning Strategy Comparison (ViT-B/16, 10 classes):")
print(f" Full fine-tuning: {params_full:,} trainable params (100%)")
print(f" Linear probe: {params_linear:,} trainable params "
f"({100*params_linear/params_full:.1f}%)")
print(f" LoRA (r=16): ~598K trainable params (~0.7%)")
print()
print("Memory per batch (batch=32, A100 GPU):")
print(" Full fine-tuning: ~22 GB VRAM")
print(" Linear probe: ~4 GB VRAM (no backbone gradients)")
print(" LoRA: ~8 GB VRAM (tiny gradient matrices)")
13. Continual Learning: Avoiding Catastrophic Forgetting
When you fine-tune a model on a new task, it can "forget" previous tasks - this is catastrophic forgetting. For production systems that need to handle multiple domains or adapt to new product lines without losing performance on existing ones, continual learning strategies are essential.
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy
class EWC:
"""
Elastic Weight Consolidation (Kirkpatrick et al., 2017).
Prevents catastrophic forgetting by adding a regularization term
that penalizes changes to parameters important for previous tasks.
Importance is measured by the Fisher Information Matrix diagonal:
F_i = E[ (d log p(y|x; theta) / d theta_i)^2 ]
L_total = L_new_task + lambda * sum_i F_i * (theta_i - theta_i_old)^2
"""
def __init__(self, model: nn.Module, dataloader,
device: torch.device, n_samples: int = 200):
"""
Compute Fisher Information for important parameters on the old task.
model: model already fine-tuned on old task (task A)
dataloader: dataloader for old task validation data (small subset ok)
n_samples: number of samples for Fisher estimation
"""
self.model = model
self.device = device
# Save old task optimal parameters
self.params_old = {
n: p.detach().clone()
for n, p in model.named_parameters()
if p.requires_grad
}
# Compute Fisher Information (diagonal approximation)
self.fisher = self._compute_fisher(dataloader, n_samples)
def _compute_fisher(self, dataloader,
n_samples: int) -> dict[str, torch.Tensor]:
"""Compute diagonal Fisher Information Matrix."""
fisher = {
n: torch.zeros_like(p)
for n, p in self.model.named_parameters()
if p.requires_grad
}
self.model.eval()
processed = 0
for images, labels in dataloader:
if processed >= n_samples:
break
images = images.to(self.device)
labels = labels.to(self.device)
self.model.zero_grad()
output = self.model(images) # [B, num_classes]
# Use log-likelihood (log softmax) as loss for Fisher computation
log_probs = F.log_softmax(output, dim=1)
# Sample from the model's own predictions (online variant)
# More stable than using true labels for Fisher estimation
sampled_labels = torch.distributions.Categorical(
logits=output.detach()
).sample()
loss = F.nll_loss(log_probs, sampled_labels)
loss.backward()
# Accumulate squared gradients (Fisher diagonal approximation)
batch_size = images.size(0)
for n, p in self.model.named_parameters():
if p.requires_grad and p.grad is not None:
fisher[n] += p.grad.detach() ** 2 * batch_size
processed += images.size(0)
# Normalize by number of samples
for n in fisher:
fisher[n] /= processed
return fisher
def penalty(self, model: nn.Module) -> torch.Tensor:
"""
EWC regularization penalty.
Add this to your new task loss: L_total = L_new + lambda * ewc.penalty(model)
"""
penalty = torch.tensor(0.0, device=self.device)
for n, p in model.named_parameters():
if n in self.fisher and p.requires_grad:
# Penalize deviation from old task optimum, weighted by Fisher
penalty += (self.fisher[n] * (p - self.params_old[n]) ** 2).sum()
return penalty
# Complete continual learning training loop
def train_with_ewc(model, new_task_loader, old_task_loader,
epochs: int = 30, ewc_lambda: float = 5000.0,
device: str = 'cuda') -> None:
"""
Train model on new task while preserving old task performance.
ewc_lambda: weight of the EWC regularization term.
Typical values: 1000 - 50000 (depends on task similarity).
Higher lambda = stronger protection of old task, less adaptation to new.
"""
device = torch.device(device)
# Compute Fisher on old task BEFORE modifying the model
print("Computing Fisher Information for EWC...")
ewc = EWC(model, old_task_loader, device, n_samples=200)
# Setup optimizer - only new task head and last few layers
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.layer4.parameters(), 'lr': 1e-5},
        {'params': model.classifier.parameters(), 'lr': 1e-4},
    ], weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=epochs, eta_min=1e-6
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
for epoch in range(epochs):
model.train()
total_loss = 0.0
total_task_loss = 0.0
total_ewc_loss = 0.0
for images, labels in new_task_loader:
images, labels = images.to(device), labels.to(device)
logits = model(images)
task_loss = criterion(logits, labels)
# EWC regularization - penalizes forgetting
ewc_loss = ewc.penalty(model)
loss = task_loss + ewc_lambda * ewc_loss
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
total_task_loss += task_loss.item()
total_ewc_loss += ewc_loss.item()
scheduler.step()
if (epoch + 1) % 5 == 0:
print(f"Epoch {epoch+1}/{epochs}: "
f"total={total_loss/len(new_task_loader):.4f} "
f"task={total_task_loss/len(new_task_loader):.4f} "
f"ewc={total_ewc_loss/len(new_task_loader):.6f}")
Continual Learning Strategies Comparison
| Strategy | Approach | Old Task Performance | New Task Performance | Memory Cost |
|---|---|---|---|---|
| Fine-tune (naive) | Train on new task only | Catastrophic forgetting (may drop to 0%) | Best | Low |
| Replay buffer | Mix old and new task data | Good (depends on buffer size) | Slightly worse | Medium (stores old images) |
| EWC | Fisher-weighted regularization | Good (+/-2% vs baseline) | Slight reduction (<1%) | Low (only Fisher diag) |
| LoRA + Adapter | Separate adapters per task | Perfect (task A adapters frozen) | Near-optimal | Low per adapter |
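The replay-buffer row in the table can be sketched simply: keep a small reservoir of old-task samples and mix a few into each new-task batch. The buffer capacity below is an illustrative choice:

```python
import random
import torch

class ReplayBuffer:
    """Reservoir-sampled buffer of (image, label) pairs from old tasks."""
    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, image: torch.Tensor, label: torch.Tensor) -> None:
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((image, label))
        else:
            # Reservoir sampling: every seen sample kept with prob capacity/seen
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = (image, label)

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        images, labels = zip(*batch)
        return torch.stack(images), torch.stack(labels)
```

During new-task training, append a replayed mini-batch (for example 25% of the batch size) to each new-task batch before computing the loss, so gradients keep covering the old distribution.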
Conclusions
Transfer Learning is one of the most powerful tools in a CV engineer's toolkit. In this article we covered:
- The theoretical foundation: feature hierarchies in CNNs and why knowledge transfers across domains
- The main strategies: feature extraction, partial fine-tuning, and full fine-tuning
- Domain adaptation and zero-shot transfer with CLIP and DINOv2
- The decision matrix: how to choose the right strategy based on dataset size and domain similarity
- A comprehensive overview of pre-trained models (ResNet, EfficientNet, ViT, ConvNeXt, DINOv2)
- Complete PyTorch implementation with discriminative learning rates and gradual unfreezing
- Warmup + cosine annealing scheduling for stable fine-tuning
- Industrial defect classification achieving 96.8% accuracy with EfficientNet-B4
- ONNX export for dependency-free production inference
- Transfer Learning for object detection with Faster R-CNN
- Parameter-Efficient Fine-Tuning: LoRA for training only 0.5-1% of parameters with comparable accuracy to full fine-tuning, saving 50-60% GPU memory
- Continual Learning with Elastic Weight Consolidation (EWC): add new tasks without catastrophic forgetting of old ones
Transfer Learning Quick Reference
| Scenario | Dataset Size | Domain Similarity | Strategy | Expected Accuracy Gain vs Scratch |
|---|---|---|---|---|
| Medical imaging (CT/MRI) | <1K | Low | Feature extraction + domain adaptation | +20-30% |
| Product classification (e-commerce) | 5K-50K | Medium | Partial fine-tune (last 2 blocks) | +15-25% |
| Industrial defect detection | 500-5K | Low-Medium | EfficientNet full fine-tune + Discriminative LR | +25-35% |
| Satellite imagery | 10K-100K | Low | Full fine-tune or LoRA + domain adaptation | +10-20% |
| Natural images (similar to ImageNet) | Any size | High | Feature extraction only | +5-15% + 10x faster convergence |
| Large ViT, limited GPU | 1K-50K | Any | LoRA (r=16) + classifier head | Similar to full fine-tune, 50% less VRAM |
Cross-Series Resources
- MLOps: Model Serving in Production - deploy your fine-tuned model at scale with FastAPI and Docker
- Deep Learning Advanced: EfficientNet and Compound Scaling - deep dive into the architecture
- Data Augmentation for Computer Vision - maximize your dataset with advanced techniques