Transfer Learning: Reusing Pre-trained Models for Computer Vision
Imagine teaching a child to recognize dog breeds. If that child already knows how to identify shapes, colors, textures, and general anatomical structures, the task becomes enormously simpler. They do not need to start from scratch: they can transfer their existing knowledge to the new task. This is exactly what Transfer Learning does in deep learning.
In this second article of the Computer Vision with Deep Learning series, we will explore Transfer Learning in depth: why it works, which strategies exist, how to choose the right pre-trained model, and how to implement complete pipelines in PyTorch. We will walk through a real industrial case study and the advanced techniques that professionals use every day.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | CNN Fundamentals | Architecture, training, deployment |
| 2 | Transfer Learning and Fine-Tuning (you are here) | Pre-trained models, domain adaptation |
| 3 | Object Detection with YOLO | Real-time detection |
| 4 | Semantic Segmentation | Pixel-level classification |
| 5 | Image Generation with GAN and Diffusion | Synthetic image generation |
| 6 | Edge Deployment and Optimization | Models on embedded devices |
What You Will Learn
- What Transfer Learning is and why it works (feature hierarchies in CNNs)
- Main strategies: feature extraction, fine-tuning, domain adaptation
- How to choose the right pre-trained model (ResNet, EfficientNet, ViT, ConvNeXt)
- Complete PyTorch implementation: from data preparation to deployment
- Advanced techniques: discriminative learning rates, gradual unfreezing, LR warmup
- Data augmentation optimized for Transfer Learning
- Practical case study: industrial defect classification with ResNet-50
- Transfer Learning applied to object detection (Faster R-CNN, YOLO)
- Common mistakes and how to avoid them
1. What is Transfer Learning
Transfer Learning is a machine learning technique where a model trained on one task (the source task) is reused as the starting point for a different task (the target task). Instead of training a neural network from scratch on millions of images, we take a model already trained (typically on ImageNet: 1.2 million images across 1000 classes) and adapt it to our specific problem.
1.1 The Human Analogy
Our brain operates in Transfer Learning mode constantly. A surgeon learning a new procedure does not need to relearn anatomy, physiology, or basic motor skills. A classical musician switching to jazz transfers their instrument technique, score reading, and harmonic theory. A Python developer learning Rust transfers programming concepts, debugging mentality, and algorithmic thinking. In every case, prior knowledge dramatically accelerates learning in a new domain.
1.2 Why It Works: The Feature Hierarchy
The fundamental reason Transfer Learning works in CNNs lies in the feature hierarchy they learn. Research has demonstrated that CNNs trained on ImageNet organize features in increasing levels of abstraction:
Feature Hierarchy in CNNs
| Layers | Features Learned | Specificity | Transferability |
|---|---|---|---|
| Layer 1-2 | Edges, corners, color gradients | Generic (task-agnostic) | Very High |
| Layer 3-4 | Textures, repeated patterns, geometric motifs | Semi-generic | High |
| Layer 5-6 | Object parts (eyes, wheels, windows) | Semi-specific | Medium |
| Layer 7+ | Complete objects, scenes, compositions | Task-specific | Low |
Early layers learn universal features: edges, textures, and gradients useful for any visual task. Middle layers capture more complex but still reasonably generic patterns. Only the final layers are highly specific to the original task. This means we can reuse most of the network as a powerful feature extractor and only adapt the final parts to our task.
ImageNet Pre-trained CNN (e.g. ResNet-50):
Input Image
     |
     v
[Layers 1-2] ---> Horizontal, vertical, diagonal edges
     |            Color gradients, blobs
     |            UNIVERSAL: useful for ANY image domain
     v
[Layers 3-4] ---> Textures (fur, metal, wood, fabric)
     |            Geometric patterns, grids
     |            SEMI-GENERIC: transferable to many domains
     v
[Layers 5-6] ---> Object parts (eyes, wheels, wings)
     |            Local compositions
     |            SEMI-SPECIFIC: domain-dependent
     v
[Layers 7+]  ---> Full classes (cat, car, bird)
     |            Highly ImageNet-specific features
     |            TASK-SPECIFIC: replace or fine-tune
     v
[Classifier] ---> 1000 ImageNet classes
                  ALWAYS replace for your task
Formal Definition
Given a source domain D_s with task T_s and a target domain D_t with task T_t, Transfer Learning aims to improve the learning function f_t in the target domain using knowledge extracted from D_s and T_s, where D_s != D_t or T_s != T_t. In practice, the weights theta learned from the source task are used as initialization (theta_0) for training on the target task, instead of random initialization. This warm start dramatically accelerates convergence.
2. Transfer Learning Strategies
There is no single way to apply Transfer Learning. The optimal strategy depends on the size of the target dataset, its similarity to the source dataset, and available compute. Let us examine the four main strategies.
2.1 Feature Extraction (Frozen Backbone)
The simplest strategy: the entire pre-trained backbone is frozen and used as a fixed feature extractor. Only a new classifier head is trained on top. Backbone weights never change.
When to use: Small dataset (hundreds to a few thousand images) and domain similar to the source (e.g., classifying dog breeds with a model pre-trained on ImageNet, which contains many dog images).
Pre-trained ResNet-50 (ImageNet):
+-----------------------------------------------------+
| [Conv layers] --> [Res blocks] --> [Global AvgPool] |  FROZEN
| ~23.5M parameters - NOT updated                     |  requires_grad = False
+-----------------------------------------------------+
                          |
                          v
                Feature vector (2048-dim)
                          |
                          v
                 +--------------------+
                 | [Linear 2048 -> N] |  TRAINABLE
                 | N = your classes   |  requires_grad = True
                 +--------------------+
                          |
                          v
                  Output: N classes
Advantages:
+ Very fast training (few parameters to optimize)
+ No powerful GPU required
+ Minimal overfitting risk
+ Works with small datasets
Disadvantages:
- Less flexible (features are fixed)
- Limited performance if domain is very different
2.2 Fine-Tuning (Unfreeze Some or All Layers)
In fine-tuning, after initializing the network with pre-trained weights, we unfreeze some or all layers and retrain the network (or part of it) with a very low learning rate. Pre-trained layers are slightly updated to adapt to the new domain, preserving previously acquired knowledge.
When to use: Medium to large target dataset (thousands to tens of thousands of images) and/or moderately different domain from the source.
Progressive Fine-Tuning Strategy:
Phase 1 - Feature Extraction (5-10 epochs):
[Backbone FROZEN] --> [New Classifier] TRAINED (lr=1e-3)
Phase 2 - Partial Fine-Tuning (10-20 epochs):
[Layers 1-3 FROZEN] --> [Layer 4 UNFROZEN lr=1e-5] --> [Classifier lr=1e-4]
Phase 3 - Full Fine-Tuning (optional, 5-10 epochs):
[ALL layers UNFROZEN lr=1e-6] --> [Classifier lr=1e-5]
Progressive learning rates:
Initial layers: lr = 1e-6 (generic features, change very little)
Middle layers: lr = 1e-5 (adapt gradually)
Final layers: lr = 1e-4 (adapt to new domain)
Classifier: lr = 1e-3 (learn from scratch)
2.3 Domain Adaptation
Domain Adaptation is a specialized form of Transfer Learning used when the source domain and target domain share the same classes but have different data distributions. For example, a model trained on professional product photos that must work on factory images with variable lighting. Techniques like DANN (Domain-Adversarial Neural Network) add a domain discriminator that forces the network to learn domain-invariant features.
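The core of DANN is a gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass, so the shared feature extractor learns to fool the domain discriminator. A minimal PyTorch sketch (the class name and the `lam` coefficient are illustrative, not DANN's official API):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in forward; multiplies the incoming gradient by -lam in backward."""

    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the feature extractor
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# Sanity check: the gradient reaching x is sign-flipped
x = torch.ones(3, requires_grad=True)
grad_reverse(x, lam=1.0).sum().backward()
print(x.grad)  # tensor([-1., -1., -1.])
```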
2.4 Zero-Shot and Few-Shot Transfer
With models like CLIP (Contrastive Language-Image Pre-training), it is possible to classify images into categories never seen during training (zero-shot) or with very few examples (few-shot). CLIP learns a joint text-image representation: given a textual prompt like "a photo of a welding defect", the model can classify images without any specific training. DINOv2, trained with self-supervised learning on 142M images, provides extremely transferable generic features.
Transfer Learning Strategy Comparison
| Strategy | Data Required | Training Time | Performance | Overfitting Risk |
|---|---|---|---|---|
| Feature Extraction | 100-1,000 | Minutes | Good | Very Low |
| Partial Fine-Tuning | 1,000-10,000 | Hours | Very Good | Low |
| Full Fine-Tuning | 10,000+ | Hours-Days | Excellent | Medium |
| Domain Adaptation | Variable | Hours | Good-Excellent | Medium |
| Zero-Shot (CLIP) | 0 | None | Variable | None |
3. Pre-trained Models for Computer Vision
Choosing the right pre-trained model is a critical decision. Each architecture has different tradeoffs between accuracy, inference speed, model size, and memory requirements. Here is an overview of the most widely used models in 2025-2026.
Pre-trained Model Comparison (ImageNet Top-1 Accuracy)
| Model | Parameters | Top-1 Acc | Type | Ideal Use |
|---|---|---|---|---|
| ResNet-50 | 25.6M | 80.9% (V2) | CNN | Solid baseline, easy deployment |
| EfficientNet-B0 | 5.3M | 77.1% | CNN | Mobile, edge, limited resources |
| EfficientNet-B4 | 19.3M | 84.0% | CNN | Best accuracy/efficiency ratio |
| EfficientNet-B7 | 66.3M | 86.3% | CNN | Maximum CNN accuracy |
| ConvNeXt-T | 28.6M | 82.1% | Modern CNN | Best accuracy/speed tradeoff |
| ConvNeXt-B | 88.6M | 85.8% | Modern CNN | High accuracy with CNN simplicity |
| ViT-B/16 | 86.6M | 86.0% | Transformer | Large datasets, global attention |
| Swin-T | 28.3M | 81.3% | Transformer | Detection and segmentation |
| CLIP ViT-B/32 | 151M (total) | 63.2% zero-shot | Multimodal | Zero-shot, visual search |
| DINOv2 ViT-S/14 | 22M | 81.1% linear probe | Self-supervised | Generic features, few labeled samples |
| MobileNetV3-Large | 5.5M | 75.2% | CNN | Edge and mobile deployment |
3.1 ResNet-50: The Workhorse
ResNet-50 remains the most popular model for Transfer Learning thanks to its simplicity, training stability, and broad ecosystem support. Skip connections (introduced in the previous article) allow training deep networks without vanishing gradient problems. The V2 weights (IMAGENET1K_V2), trained with modern techniques like Mixup, CutMix, and Random Erasing, achieve an impressive 80.9% top-1 accuracy.
3.2 EfficientNet: Compound Scaling
EfficientNet (Tan & Le, 2019) introduced compound scaling: rather than increasing only depth (more layers), width (more channels), or resolution alone, it scales all three dimensions simultaneously using a fixed coefficient. This produces a family of models (B0 through B7) that dominate the accuracy/efficiency Pareto frontier. EfficientNet-B4 is the sweet spot for most production use cases.
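Compound scaling picks a coefficient phi and multiplies depth, width, and resolution by alpha^phi, beta^phi, and gamma^phi, with the constraint alpha * beta^2 * gamma^2 ~= 2 so that each increment of phi roughly doubles FLOPs. A sketch of the arithmetic, using the base constants from the EfficientNet paper:

```python
# EfficientNet compound-scaling constants (Tan & Le, 2019)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: int) -> tuple:
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs scale roughly with depth * width^2 * resolution^2, so the constraint
# alpha * beta^2 * gamma^2 ~= 2 approximately doubles cost per step of phi
flops_factor = ALPHA * BETA**2 * GAMMA**2
print(f"{flops_factor:.2f}")  # ~1.92, close to the target of 2
```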
3.3 Vision Transformer (ViT) and Swin Transformer
Vision Transformers apply the Transformer architecture (originally created for NLP) to computer vision. The image is divided into patches (e.g., 16x16 pixels), each patch treated as a token and processed with self-attention. ViT excels when pre-trained on large datasets (ImageNet-21k, JFT-300M) but can underperform CNNs on small datasets. Swin Transformer introduces shifted window attention, making it more efficient and particularly suitable for dense prediction tasks like detection and segmentation.
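The patch tokenization is simple arithmetic: a 224x224 image with 16x16 patches yields (224/16)^2 = 196 tokens, which is the sequence length the self-attention layers operate on. A sketch (the helper name is illustrative):

```python
def num_patches(img_size: int = 224, patch_size: int = 16) -> int:
    """Number of tokens a ViT sees for a square image (excluding the [CLS] token)."""
    assert img_size % patch_size == 0, "image size must be divisible by patch size"
    return (img_size // patch_size) ** 2

print(num_patches(224, 16))  # 196 tokens for ViT-B/16
print(num_patches(518, 14))  # 1369 tokens for a patch-14 model at 518px
```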
3.4 ConvNeXt: Modernized CNN
ConvNeXt demonstrates that CNNs can compete with Transformers if modernized with the same training techniques (AdamW, Mixup, layer scale, Stochastic Depth). ConvNeXt-T achieves 82.1% with only 28.6M parameters, offering an excellent tradeoff between accuracy, speed, and deployment simplicity. It is increasingly the default choice when you want Transformer-level performance without Transformer deployment complexity.
3.5 DINOv2: Self-Supervised Learning
DINOv2 is trained with self-supervised learning (without labels) on an enormous curated dataset (LVD-142M images). The extracted features are extremely generic and transferable: a simple linear classifier added on top achieves competitive results with full supervised fine-tuning. It is particularly useful when you have very few labeled examples in the target domain, making it ideal for industrial inspection, medical imaging, and remote sensing applications.
4. When to Use Transfer Learning: The Decision Matrix
Strategy selection depends on two key factors: the size of the target dataset and the similarity between the source and target domains. This generates four decision quadrants.
Decision matrix: dataset size vs. similarity to the source domain

| Dataset Size | High Similarity to Source | Low Similarity to Source |
|---|---|---|
| Large (10k+) | Quadrant 1: Full fine-tuning. Unfreeze all layers, low learning rate, high performance. Example: dog breeds (ImageNet contains many dog images). | Quadrant 2: Careful fine-tuning. Unfreeze only the final layers, very low LR for the backbone, strong augmentation. Example: medical images (very different from ImageNet). |
| Small (100-1k) | Quadrant 3: Feature extraction. Freeze the backbone and train only the classifier; no overfitting, fast training. Example: 200 flower photos (similar to ImageNet). | Quadrant 4: Limited options. Try feature extraction, very aggressive augmentation, collect more data, or DINOv2/CLIP features (self-supervised). |
Practical Rule
In 2025-2026, the answer to "Should I use Transfer Learning?" is almost always yes. Training a CNN from scratch is only justified in very specific cases: enormous datasets (millions of images), domains radically different from natural images (e.g., spectrograms, radar signals), or particular architectural constraints.
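The four quadrants can be encoded as a small triage helper (the function and its thresholds are illustrative, following the indicative values from the matrix above, not hard rules):

```python
def recommend_strategy(num_images: int, domain_similar: bool) -> str:
    """Map (dataset size, domain similarity) to a Transfer Learning strategy."""
    if num_images >= 10_000:
        # Quadrants 1 and 2: enough data to adapt the backbone itself
        return "full fine-tuning" if domain_similar else "careful fine-tuning (final layers, low LR)"
    if domain_similar:
        # Quadrant 3: pre-trained features already match the domain
        return "feature extraction (frozen backbone)"
    # Quadrant 4: small and dissimilar - lean on augmentation or self-supervised features
    return "feature extraction + heavy augmentation, or DINOv2/CLIP features"

print(recommend_strategy(200, True))      # feature extraction (frozen backbone)
print(recommend_strategy(50_000, True))   # full fine-tuning
```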
5. PyTorch Implementation
Let us move to practice. We will implement Transfer Learning step by step in PyTorch, starting from loading a pre-trained model through complete training with best practices.
5.1 Loading a Pre-trained Model
PyTorch offers two APIs for loading pre-trained models. The modern API (introduced in torchvision 0.13+) uses the Weights enum, which provides detailed information about the weights, including the required preprocessing transformations. Always prefer this API.
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models import ResNet50_Weights
def create_feature_extractor(num_classes: int, device: str = 'cuda') -> nn.Module:
    """
    Creates a frozen-backbone feature extractor based on ResNet-50.
    Only the final classifier is trained.

    Args:
        num_classes: Number of target classes
        device: Device to load the model onto

    Returns:
        Model with frozen backbone and custom classifier
    """
    # Load pre-trained weights (V2 = 80.9% top-1 on ImageNet)
    weights = ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights)

    # Freeze ALL backbone parameters
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final classifier - this is the ONLY part we train
    # (newly created layers have requires_grad=True by default)
    in_features = model.fc.in_features  # 2048 for ResNet-50
    model.fc = nn.Sequential(
        nn.Dropout(p=0.5),
        nn.Linear(in_features, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.3),
        nn.Linear(512, num_classes)
    )

    # Report trainable vs frozen parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.1f}%)")
    # For num_classes=6: Trainable: 1,052,166 / 24,560,198 (4.3%)

    return model.to(device)

# Usage
model = create_feature_extractor(num_classes=6)  # e.g. 6 defect types
5.2 Full Fine-Tuning with Discriminative Learning Rates
All weights are updated, but with different learning rates per layer group. Early layers that encode universal features should be updated very slowly (they are already well trained), while the classifier can use a much higher rate. This technique, called discriminative learning rates, is key for preventing catastrophic forgetting.
def create_discriminative_optimizer(model: nn.Module, base_lr: float = 1e-4) -> torch.optim.Optimizer:
    """
    Different learning rates per layer group:
    - Stem and early layers (conv1, bn1, layer1, layer2): 10x lower - preserve universal features
    - Middle layers (layer3): 5x lower
    - Late layers (layer4): 2x lower
    - Classifier (fc): full base_lr - learn task-specific features
    """
    param_groups = [
        {
            # startswith avoids matching block-level names like 'layer4.2.conv1.weight'
            # and ensures the stem (conv1/bn1) is not left out of the optimizer
            'params': [p for n, p in model.named_parameters()
                       if n.startswith(('conv1', 'bn1', 'layer1', 'layer2'))],
            'lr': base_lr / 10
        },
        {
            'params': [p for n, p in model.named_parameters() if n.startswith('layer3')],
            'lr': base_lr / 5
        },
        {
            'params': [p for n, p in model.named_parameters() if n.startswith('layer4')],
            'lr': base_lr / 2
        },
        {
            'params': model.fc.parameters(),
            'lr': base_lr
        }
    ]
    return torch.optim.AdamW(param_groups, weight_decay=1e-4)

def gradual_unfreeze(model: nn.Module, epoch: int) -> None:
    """
    Unfreeze layers gradually to prevent catastrophic forgetting.
    Epoch 0-4:   only classifier trainable
    Epoch 5-9:   + layer4
    Epoch 10-14: + layer3
    Epoch 15+:   all layers
    """
    layers_to_unfreeze = []
    if epoch >= 5:
        layers_to_unfreeze.append(model.layer4)
    if epoch >= 10:
        layers_to_unfreeze.append(model.layer3)
    if epoch >= 15:
        layers_to_unfreeze.extend([model.layer1, model.layer2])
    for layer in layers_to_unfreeze:
        for param in layer.parameters():
            param.requires_grad = True
6. Data Pipeline for Transfer Learning
The data pipeline for Transfer Learning has a critical requirement: since the backbone expects ImageNet-normalized inputs, you must apply the correct normalization (mean and std from ImageNet). Using wrong normalization is one of the most common mistakes and can silently degrade performance by several percentage points.
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from pathlib import Path
# ImageNet normalization - MANDATORY for pre-trained models
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
def get_train_transforms(img_size: int = 224) -> transforms.Compose:
    return transforms.Compose([
        transforms.RandomResizedCrop(img_size, scale=(0.7, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.2),
        transforms.ColorJitter(brightness=0.3, contrast=0.3,
                               saturation=0.3, hue=0.1),
        transforms.RandomRotation(degrees=15),
        transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
        transforms.RandomErasing(p=0.2, scale=(0.02, 0.15))
    ])

def get_val_transforms(img_size: int = 224) -> transforms.Compose:
    """Validation: no augmentation, only deterministic preprocessing."""
    return transforms.Compose([
        transforms.Resize(int(img_size * 256 / 224)),  # 256 for img_size=224
        transforms.CenterCrop(img_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])

def create_data_loaders(
    data_dir: str,
    batch_size: int = 32,
    img_size: int = 224,
    num_workers: int = 4
) -> tuple:
    """
    Creates train/val data loaders.
    Expects directory structure:
        data_dir/train/class_a/*.jpg
        data_dir/val/class_a/*.jpg
    """
    data_path = Path(data_dir)
    train_ds = ImageFolder(str(data_path / 'train'),
                           transform=get_train_transforms(img_size))
    val_ds = ImageFolder(str(data_path / 'val'),
                         transform=get_val_transforms(img_size))

    train_loader = DataLoader(
        train_ds, batch_size=batch_size, shuffle=True,
        num_workers=num_workers, pin_memory=True, drop_last=True
    )
    val_loader = DataLoader(
        val_ds, batch_size=batch_size * 2, shuffle=False,
        num_workers=num_workers, pin_memory=True
    )

    print(f"Classes: {train_ds.classes}")
    print(f"Train: {len(train_ds)} | Val: {len(val_ds)}")
    return train_loader, val_loader, train_ds.classes
7. Learning Rate Warmup and Cosine Scheduling
Learning rate warmup gradually increases the LR during the first few epochs instead of starting at the target value directly. This prevents destabilizing the pre-trained weights early in training and is one of the most impactful tricks for fine-tuning. Combined with cosine annealing, it provides a smooth LR decay over the rest of training.
import math
from torch.optim.lr_scheduler import _LRScheduler
class WarmupCosineScheduler(_LRScheduler):
    """
    Linear warmup followed by cosine annealing.
    Ideal for fine-tuning pre-trained models.
    """

    def __init__(self, optimizer, warmup_epochs: int, total_epochs: int,
                 min_lr: float = 1e-6, last_epoch: int = -1):
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.min_lr = min_lr
        super().__init__(optimizer, last_epoch)

    def get_lr(self) -> list:
        if self.last_epoch < self.warmup_epochs:
            # Linear warmup towards base_lr; the +1 keeps the first epoch's LR above zero
            factor = (self.last_epoch + 1) / max(1, self.warmup_epochs)
            return [base_lr * factor for base_lr in self.base_lrs]
        else:
            # Cosine annealing: base_lr -> min_lr
            progress = (self.last_epoch - self.warmup_epochs) / max(
                1, self.total_epochs - self.warmup_epochs
            )
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return [
                self.min_lr + (base_lr - self.min_lr) * cosine
                for base_lr in self.base_lrs
            ]

# Usage (train_one_epoch and evaluate are standard train/eval loop helpers)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = WarmupCosineScheduler(
    optimizer, warmup_epochs=5, total_epochs=50, min_lr=1e-6
)

for epoch in range(50):
    train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_acc = evaluate(model, val_loader, device)
    scheduler.step()
    lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1:2d} | LR: {lr:.2e} | Val Acc: {val_acc:.2f}%")
8. EfficientNet-B4: Two-Phase Training
EfficientNet-B4 provides the best accuracy/efficiency ratio for most production use cases. Two-phase training (classifier warmup followed by backbone fine-tuning) is the standard approach recommended by the literature.
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models import EfficientNet_B4_Weights
class EfficientNetClassifier(nn.Module):
    """EfficientNet-B4 with custom classifier head for fine-tuning."""

    def __init__(self, num_classes: int, dropout_rate: float = 0.4):
        super().__init__()
        backbone = models.efficientnet_b4(weights=EfficientNet_B4_Weights.IMAGENET1K_V1)

        # Keep feature extraction backbone and pooling
        self.features = backbone.features
        self.avgpool = backbone.avgpool

        # Custom classifier head
        in_features = backbone.classifier[1].in_features  # 1792 for B4
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout_rate, inplace=True),
            nn.Linear(in_features, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

    def freeze_backbone(self) -> None:
        """Freeze all backbone parameters."""
        for param in self.features.parameters():
            param.requires_grad = False

    def unfreeze_last_n_blocks(self, n: int = 3) -> None:
        """Unfreeze the last n blocks of EfficientNet-B4 (9 blocks total)."""
        for i, block in enumerate(self.features):
            should_train = i >= (len(self.features) - n)
            for param in block.parameters():
                param.requires_grad = should_train

def two_phase_training(
    model: EfficientNetClassifier,
    train_loader,
    val_loader,
    device: torch.device
) -> dict:
    """
    Phase 1 (5 epochs): frozen backbone, warm up classifier
    Phase 2 (15 epochs): unfreeze last 3 blocks, cosine LR schedule
    """
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    history = {'train_acc': [], 'val_acc': []}

    # ---- Phase 1: classifier warmup ----
    model.freeze_backbone()
    opt1 = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=3e-3, weight_decay=1e-4
    )
    print("Phase 1: Classifier warmup (frozen backbone)")
    for epoch in range(5):
        t_acc = train_one_epoch(model, train_loader, opt1, criterion, device)
        v_acc = evaluate(model, val_loader, device)
        print(f"  Warmup {epoch+1}/5 | Train: {t_acc:.1f}% Val: {v_acc:.1f}%")

    # ---- Phase 2: fine-tuning ----
    print("\nPhase 2: Fine-tuning (last 3 blocks unfrozen)")
    model.unfreeze_last_n_blocks(n=3)
    opt2 = torch.optim.AdamW(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=1e-4, weight_decay=1e-4
    )
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt2, T_max=15, eta_min=1e-6)

    for epoch in range(15):
        t_acc = train_one_epoch(model, train_loader, opt2, criterion, device)
        v_acc = evaluate(model, val_loader, device)
        sched.step()
        history['train_acc'].append(t_acc)
        history['val_acc'].append(v_acc)
        print(f"  FT {epoch+1}/15 | Train: {t_acc:.1f}% Val: {v_acc:.1f}% "
              f"LR: {sched.get_last_lr()[0]:.2e}")

    return history
9. Case Study: Industrial Defect Classification
Let us apply everything learned to a real scenario: classifying surface defects on steel sheets using the NEU Steel Surface Defects Dataset (6 defect classes, ~300 images each at 200x200px). This is a representative industrial use case with a small dataset and a specific domain.
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader
class IndustrialDefectClassifier:
    """
    End-to-end system for surface defect classification using ResNet-50 fine-tuning.
    Includes two-phase training, early stopping, and ONNX export.
    """

    CLASSES = ['crazing', 'inclusion', 'patches',
               'pitted_surface', 'rolled_in_scale', 'scratches']

    def __init__(self, device: str = 'auto'):
        if device == 'auto':
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(device)
        self.model = self._build_model(len(self.CLASSES)).to(self.device)
        print(f"Model on: {self.device}")

    def _build_model(self, n_cls: int) -> nn.Module:
        model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        # Multi-layer classifier head with BatchNorm for better regularization
        model.fc = nn.Sequential(
            nn.Linear(2048, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
            nn.Linear(512, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(128, n_cls)
        )
        return model

    def train(self, train_loader: DataLoader, val_loader: DataLoader,
              total_epochs: int = 30) -> dict:
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
        best_acc = 0.0
        patience = 8
        no_improve = 0
        history = {'val_acc': []}

        # Phase 1: classifier warmup (5 epochs, frozen backbone)
        for param in self.model.parameters():
            param.requires_grad = False
        for param in self.model.fc.parameters():
            param.requires_grad = True

        opt1 = torch.optim.AdamW(self.model.fc.parameters(), lr=3e-3)
        print("Phase 1: Classifier warmup")
        for ep in range(5):
            t_acc = self._run_epoch(train_loader, opt1, criterion, training=True)
            v_acc = self._run_epoch(val_loader, None, criterion, training=False)
            print(f"  Warmup {ep+1}/5 | T: {t_acc:.1f}% V: {v_acc:.1f}%")

        # Phase 2: full fine-tuning with OneCycleLR
        for param in self.model.parameters():
            param.requires_grad = True
        opt2 = torch.optim.AdamW(self.model.parameters(), lr=1e-4, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt2, max_lr=1e-4, epochs=total_epochs,
            steps_per_epoch=len(train_loader)
        )

        print("\nPhase 2: Full fine-tuning")
        for ep in range(total_epochs):
            t_acc = self._run_epoch(train_loader, opt2, criterion,
                                    training=True, sched=sched)
            v_acc = self._run_epoch(val_loader, None, criterion, training=False)
            history['val_acc'].append(v_acc)

            if v_acc > best_acc:
                best_acc = v_acc
                torch.save(self.model.state_dict(), 'best_defect_model.pth')
                no_improve = 0
            else:
                no_improve += 1

            print(f"Epoch {ep+1:2d}/{total_epochs} | "
                  f"T: {t_acc:.1f}% V: {v_acc:.1f}% Best: {best_acc:.1f}%")

            if no_improve >= patience:
                print(f"Early stopping at epoch {ep+1}")
                break

        return history

    def _run_epoch(self, loader, optimizer, criterion,
                   training: bool, sched=None) -> float:
        self.model.train(training)
        correct = total = 0
        with torch.set_grad_enabled(training):
            for images, labels in loader:
                images, labels = images.to(self.device), labels.to(self.device)
                if training:
                    optimizer.zero_grad(set_to_none=True)
                outputs = self.model(images)
                loss = criterion(outputs, labels)
                if training:
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                    optimizer.step()
                    if sched:
                        sched.step()
                correct += outputs.argmax(1).eq(labels).sum().item()
                total += labels.size(0)
        return 100. * correct / total

    def export_onnx(self, path: str = 'defect_classifier.onnx') -> None:
        """Export to ONNX for dependency-free production inference."""
        self.model.eval()
        dummy = torch.randn(1, 3, 224, 224, device=self.device)
        torch.onnx.export(
            self.model, dummy, path,
            opset_version=17,
            input_names=['image'],
            output_names=['logits'],
            dynamic_axes={'image': {0: 'batch'}, 'logits': {0: 'batch'}}
        )
        print(f"Model exported to {path}")
10. Best Practices and Anti-patterns
Proven Recommendations
- Always use the latest weights: Use `weights=Model_Weights.DEFAULT` for the best available pre-trained weights in torchvision. V2 weights trained with modern recipes significantly outperform V1.
- Match input preprocessing exactly: Different models use different normalization. EfficientNet and ResNet use ImageNet statistics. Always use the `weights.transforms()` method to get the correct transforms automatically.
- Two-phase training: Always warm up the classifier for a few epochs with a frozen backbone before full fine-tuning. This prevents the random classifier from destabilizing the pre-trained features.
- Discriminative LRs: Early layers need a 10-100x lower LR than the classifier head. The backbone already knows universal features; you only want to gently adapt them.
- Label smoothing: `CrossEntropyLoss(label_smoothing=0.1)` prevents overconfidence and improves generalization, especially with small datasets.
- Early stopping with patience: Transfer Learning converges faster than from-scratch training. Monitor validation accuracy and stop when it plateaus (patience 5-15 epochs).
- Test Time Augmentation (TTA): Average predictions over multiple augmented views (horizontal flip, scale) for +1-2% accuracy at inference time with no additional training.
- BatchNorm in eval mode: When freezing backbone layers that include BatchNorm, explicitly set those layers to eval mode to use running statistics instead of batch statistics.
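The TTA recommendation above can be sketched with a single horizontal-flip pass; production pipelines often add multi-scale crops as well (the function name is illustrative):

```python
import torch

@torch.no_grad()
def tta_predict(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Average softmax predictions over the original and horizontally flipped batch."""
    model.eval()
    probs = model(images).softmax(dim=1)
    probs += model(torch.flip(images, dims=[3])).softmax(dim=1)  # flip the width axis
    return probs / 2

# Toy check with a stand-in "model" that maps (N, 3, H, W) -> (N, 3) logits
toy = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
out = tta_predict(toy, torch.randn(4, 3, 224, 224))
print(out.shape)  # torch.Size([4, 3])
```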
Common Mistakes to Avoid
- Too high LR for backbone: Values above 1e-4 for the backbone will destroy pre-trained knowledge through what is called "catastrophic forgetting". Start at 1e-5 or 1e-6.
- Forgetting ImageNet normalization: Without proper normalization the pre-trained model receives inputs from an entirely different distribution. This is one of the most silent and damaging mistakes.
- Unfreezing everything at once: Unfreezing the entire backbone at once with too high a learning rate erases all pre-trained knowledge. Always use gradual unfreezing.
- Tuning on the test set: Never make architectural or hyperparameter decisions based on test set performance. Always keep a clean hold-out set that you evaluate only once at the end.
- Ignoring class imbalance: With small datasets, class imbalance is common. Use weighted losses (`CrossEntropyLoss(weight=class_weights)`) or oversampling strategies like `WeightedRandomSampler`.
- Wrong model for the domain: ViT models require more data to fine-tune effectively than CNNs. For datasets with fewer than 5k images, prefer ResNet or EfficientNet over ViT.
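For the class-imbalance point, inverse-frequency weights and a `WeightedRandomSampler` can be built from the label list in a few lines (a sketch; the toy `labels` tensor stands in for something like ImageFolder's `dataset.targets`):

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 1, 1, 2])  # toy imbalanced label list

# Inverse-frequency class weights, usable as CrossEntropyLoss(weight=class_weights)
counts = torch.bincount(labels).float()
class_weights = counts.sum() / (len(counts) * counts)

# Per-sample weights so each class is drawn roughly equally often
sample_weights = class_weights[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

print(class_weights)  # the rarest class gets the largest weight
```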
10.1 Performance Benchmark
Results on NEU Steel Surface Defects (~300 images/class, 6 classes)
| Approach | Val Accuracy | Training Time | Epochs to Convergence |
|---|---|---|---|
| Small CNN from scratch | 67.3% | ~45 min | 100+ |
| ResNet-50 Feature Extraction | 88.7% | ~8 min | 20-30 |
| ResNet-50 Fine-Tune (last 2 blocks) | 93.4% | ~15 min | 30-40 |
| ResNet-50 Full Fine-Tuning | 95.1% | ~25 min | 40-50 |
| EfficientNet-B4 Fine-Tuning | 96.8% | ~20 min | 35-45 |
| ConvNeXt-T Fine-Tuning | 96.2% | ~22 min | 35-45 |
11. Transfer Learning for Object Detection
Transfer Learning is not limited to classification. In object detection, the backbone (the feature extractor) is almost always initialized with ImageNet pre-trained weights, while the detection head is trained from scratch on the target dataset. YOLO, Faster R-CNN, and RetinaNet all use this approach.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
def create_detection_model(num_classes: int) -> torch.nn.Module:
"""
Faster R-CNN with ResNet-50-FPN backbone pre-trained on COCO.
Replace the box predictor head for custom class detection.
Args:
num_classes: number of target classes + 1 (background)
"""
# Load model pre-trained on COCO (91 classes)
model = fasterrcnn_resnet50_fpn_v2(
weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
)
# Get input size of the existing classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# Replace with new predictor for our classes
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
return model
# Detection training loop
def train_detection_epoch(model, data_loader, optimizer, device):
model.train()
total_loss = 0.0
for images, targets in data_loader:
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
# Faster R-CNN returns a dict of losses during training
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
optimizer.zero_grad(set_to_none=True)
losses.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += losses.item()
return total_loss / len(data_loader)
# Transfer Learning stages for detection:
# Stage 1: freeze backbone, train head only (3-5 epochs)
# Stage 2: unfreeze FPN layers, lower LR (5-10 epochs)
# Stage 3: full fine-tuning with very low LR (5-10 epochs)
model = create_detection_model(num_classes=7) # 6 defect classes + background
print(f"Detection model: {sum(p.numel() for p in model.parameters()):,} parameters")
12. Parameter-Efficient Fine-Tuning: LoRA for Vision Models
Full fine-tuning of large models (ViT-L/16: 307M params, ConvNeXt-XL: 350M params) is expensive in GPU memory and compute. LoRA (Low-Rank Adaptation) and related PEFT techniques achieve comparable accuracy by training only a tiny fraction of parameters - typically 0.1-1% of the original model. Originally developed for LLMs (Hu et al., 2021), LoRA transfers directly to Vision Transformers.
Parameter-Efficient Methods for Vision Models
| Method | Trainable Params | Memory vs Full FT | Accuracy (ImageNet) | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 100% | 1x (baseline) | Highest | Large dataset, ample GPU |
| Linear Probe | ~0.1% | 0.15x | -3 to -5% | Similar domain, few data |
| LoRA (r=16) | ~0.5% | 0.4x | -0.5 to -1% | Large ViT, limited GPU |
| Adapter Tuning | ~2-5% | 0.5x | -0.5 to -1.5% | Multi-task fine-tuning |
| Prompt Tuning | <0.1% | 0.15x | -2 to -4% | ViT, many tasks simultaneously |
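The trainable-parameter figures in the table can be sanity-checked with quick arithmetic. For ViT-B/16 (embed dim 768, 12 blocks) with rank-16 LoRA on the Q and V projections, each adapted projection contributes an A matrix of shape (rank, in) and a B matrix of shape (out, rank):

```python
embed_dim, blocks, rank = 768, 12, 16  # ViT-B/16 with LoRA r=16

# One LoRA layer holds A in R^(rank x in) plus B in R^(out x rank)
params_per_lora = rank * embed_dim + embed_dim * rank

# A LoRA pair (Q and V projections) in each of the 12 transformer blocks
lora_params = blocks * 2 * params_per_lora
print(f"LoRA params: {lora_params:,}")  # 589,824

vit_b_total = 86_000_000  # ViT-B/16 parameter count, approximate
print(f"Fraction of model: {100 * lora_params / vit_b_total:.2f}%")  # ~0.69%
```

The slightly larger ~598K figure reported by PEFT later in this article also counts the new classification head, which this back-of-the-envelope check leaves out.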
import torch
import torch.nn as nn
from timm import create_model
# pip install peft (PEFT library from Hugging Face)
from peft import get_peft_model, LoraConfig, TaskType
# LoRA implementation from scratch for understanding
class LoRALayer(nn.Module):
"""
LoRA: Low-Rank Adaptation of a Linear layer.
Original weight W is frozen; only A and B are trained.
For an original Linear(in, out):
W_fine-tuned = W_original + B @ A where B in R^(out x r), A in R^(r x in)
Key insight: the update delta_W = B @ A has rank r << min(in, out).
We initialize A ~ N(0, sigma), B = 0 (so delta_W starts at 0 = no change).
"""
def __init__(self, in_features: int, out_features: int,
rank: int = 16, alpha: float = 32.0):
super().__init__()
# Trainable low-rank matrices
self.lora_A = nn.Parameter(
torch.randn(rank, in_features) * (0.01) # small init
)
self.lora_B = nn.Parameter(
torch.zeros(out_features, rank) # B=0 means delta_W starts at 0
)
# Scaling: alpha/rank controls effective learning rate of the update
# Higher alpha/rank = larger effective LR for LoRA params
self.scale = alpha / rank
self.rank = rank
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""LoRA delta: adds B @ A @ x to the frozen base layer output."""
# delta_W @ x = (B @ A) @ x = B @ (A @ x) - sequential for efficiency
lora_out = (x @ self.lora_A.T) @ self.lora_B.T
return lora_out * self.scale
class ViTWithLoRA(nn.Module):
"""
Vision Transformer with LoRA applied to attention Q, K, V projections.
Uses timm for the base ViT model.
LoRA is inserted into every transformer block's attention layers.
Only LoRA parameters and the classification head are trained (~0.5% of total).
"""
def __init__(self, model_name: str = 'vit_base_patch16_224',
num_classes: int = 10,
lora_rank: int = 16,
lora_alpha: float = 32.0,
pretrained: bool = True):
super().__init__()
# Load pre-trained ViT (all parameters frozen initially)
self.vit = create_model(model_name, pretrained=pretrained,
num_classes=0) # Remove head
# Freeze all parameters
for param in self.vit.parameters():
param.requires_grad = False
        # Apply LoRA to the fused QKV projection in every attention block.
        # timm's ViT uses a single Linear producing [q | k | v] concatenated,
        # so we wrap its forward and add the LoRA deltas to the Q and V slices.
        self.lora_layers = nn.ModuleList()
        embed_dim = self.vit.embed_dim
        for block in self.vit.blocks:
            attn = block.attn
            # Q projection LoRA
            lora_q = LoRALayer(embed_dim, embed_dim, lora_rank, lora_alpha)
            # V projection LoRA
            lora_v = LoRALayer(embed_dim, embed_dim, lora_rank, lora_alpha)
            self.lora_layers.extend([lora_q, lora_v])
            # Patch the qkv forward so the LoRA deltas are actually applied
            base_forward = attn.qkv.forward  # bound original forward
            def qkv_with_lora(x, base=base_forward, lq=lora_q, lv=lora_v):
                out = base(x)                      # [..., 3 * embed_dim]
                dq, dv = lq(x), lv(x)
                # Add deltas to the Q and V slices; K is left unchanged
                return out + torch.cat([dq, torch.zeros_like(dq), dv], dim=-1)
            attn.qkv.forward = qkv_with_lora
# Trainable classification head
self.classifier = nn.Sequential(
nn.LayerNorm(embed_dim),
nn.Linear(embed_dim, num_classes)
)
def count_trainable_params(self) -> dict:
"""Count trainable vs frozen parameters."""
trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
total = sum(p.numel() for p in self.parameters())
return {
'trainable': trainable,
'total': total,
'trainable_pct': 100 * trainable / total
}
def forward(self, x: torch.Tensor) -> torch.Tensor:
features = self.vit(x) # [B, embed_dim]
return self.classifier(features)
# Using PEFT library (simpler approach for production)
def create_lora_model_with_peft(model_name: str = 'vit_base_patch16_224',
num_classes: int = 10,
lora_r: int = 16,
lora_alpha: int = 32,
lora_dropout: float = 0.1) -> nn.Module:
"""
Create LoRA model using HuggingFace PEFT library.
More production-ready than manual LoRA injection.
Requires: pip install peft timm transformers
"""
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model
# Load ViT from transformers
base_model = ViTForImageClassification.from_pretrained(
        'google/vit-base-patch16-224-in21k',
num_labels=num_classes,
ignore_mismatched_sizes=True
)
# Configure LoRA
config = LoraConfig(
r=lora_r, # Rank of the low-rank matrices
lora_alpha=lora_alpha, # Scaling factor
target_modules=['query', 'value'], # Apply LoRA to Q and V projections
lora_dropout=lora_dropout,
bias='none', # Don't train bias terms
        modules_to_save=['classifier']  # PEFT's TaskType has no vision entry; keep the new head trainable instead
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
# Output example: trainable params: 598,530 || all params: 86,568,194 || 0.69%
return peft_model
# Training LoRA model - identical to standard training
def train_lora_epoch(model, loader, optimizer, device):
"""LoRA training loop - only LoRA params and head are updated."""
model.train()
total_loss = 0.0
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
# Only LoRA params have requires_grad=True, so only they get updated
logits = model(images)
if hasattr(logits, 'logits'): # HuggingFace model
logits = logits.logits
loss = criterion(logits, labels)
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
return total_loss / len(loader)
# Memory and speed comparison
def compare_fine_tuning_strategies():
"""
Practical comparison of fine-tuning strategies
on ViT-B/16 with 10-class classification.
"""
model_full = create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
params_full = sum(p.numel() for p in model_full.parameters() if p.requires_grad)
# Frozen backbone (linear probe)
model_linear = create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
for name, param in model_linear.named_parameters():
if 'head' not in name:
param.requires_grad = False
params_linear = sum(p.numel() for p in model_linear.parameters() if p.requires_grad)
print("Fine-tuning Strategy Comparison (ViT-B/16, 10 classes):")
print(f" Full fine-tuning: {params_full:,} trainable params (100%)")
print(f" Linear probe: {params_linear:,} trainable params "
f"({100*params_linear/params_full:.1f}%)")
print(f" LoRA (r=16): ~598K trainable params (~0.7%)")
print()
print("Memory per batch (batch=32, A100 GPU):")
print(" Full fine-tuning: ~22 GB VRAM")
print(" Linear probe: ~4 GB VRAM (no backbone gradients)")
print(" LoRA: ~8 GB VRAM (tiny gradient matrices)")
13. Continual Learning: Avoiding Catastrophic Forgetting
When you fine-tune a model on a new task, it can "forget" previous tasks - this is catastrophic forgetting. For production systems that need to handle multiple domains or adapt to new product lines without losing performance on existing ones, continual learning strategies are essential.
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy
class EWC:
"""
Elastic Weight Consolidation (Kirkpatrick et al., 2017).
Prevents catastrophic forgetting by adding a regularization term
that penalizes changes to parameters important for previous tasks.
Importance is measured by the Fisher Information Matrix diagonal:
F_i = E[ (d log p(y|x; theta) / d theta_i)^2 ]
L_total = L_new_task + lambda * sum_i F_i * (theta_i - theta_i_old)^2
"""
def __init__(self, model: nn.Module, dataloader,
device: torch.device, n_samples: int = 200):
"""
Compute Fisher Information for important parameters on the old task.
model: model already fine-tuned on old task (task A)
dataloader: dataloader for old task validation data (small subset ok)
n_samples: number of samples for Fisher estimation
"""
self.model = model
self.device = device
# Save old task optimal parameters
self.params_old = {
n: p.detach().clone()
for n, p in model.named_parameters()
if p.requires_grad
}
# Compute Fisher Information (diagonal approximation)
self.fisher = self._compute_fisher(dataloader, n_samples)
def _compute_fisher(self, dataloader,
n_samples: int) -> dict[str, torch.Tensor]:
"""Compute diagonal Fisher Information Matrix."""
fisher = {
n: torch.zeros_like(p)
for n, p in self.model.named_parameters()
if p.requires_grad
}
self.model.eval()
processed = 0
for images, labels in dataloader:
if processed >= n_samples:
break
images = images.to(self.device)
labels = labels.to(self.device)
self.model.zero_grad()
output = self.model(images) # [B, num_classes]
# Use log-likelihood (log softmax) as loss for Fisher computation
log_probs = F.log_softmax(output, dim=1)
# Sample from the model's own predictions (online variant)
# More stable than using true labels for Fisher estimation
sampled_labels = torch.distributions.Categorical(
logits=output.detach()
).sample()
loss = F.nll_loss(log_probs, sampled_labels)
loss.backward()
# Accumulate squared gradients (Fisher diagonal approximation)
batch_size = images.size(0)
for n, p in self.model.named_parameters():
if p.requires_grad and p.grad is not None:
fisher[n] += p.grad.detach() ** 2 * batch_size
processed += images.size(0)
# Normalize by number of samples
for n in fisher:
fisher[n] /= processed
return fisher
def penalty(self, model: nn.Module) -> torch.Tensor:
"""
EWC regularization penalty.
Add this to your new task loss: L_total = L_new + lambda * ewc.penalty(model)
"""
penalty = torch.tensor(0.0, device=self.device)
for n, p in model.named_parameters():
if n in self.fisher and p.requires_grad:
# Penalize deviation from old task optimum, weighted by Fisher
penalty += (self.fisher[n] * (p - self.params_old[n]) ** 2).sum()
return penalty
# Complete continual learning training loop
def train_with_ewc(model, new_task_loader, old_task_loader,
epochs: int = 30, ewc_lambda: float = 5000.0,
device: str = 'cuda') -> None:
"""
Train model on new task while preserving old task performance.
ewc_lambda: weight of the EWC regularization term.
Typical values: 1000 - 50000 (depends on task similarity).
Higher lambda = stronger protection of old task, less adaptation to new.
"""
device = torch.device(device)
# Compute Fisher on old task BEFORE modifying the model
print("Computing Fisher Information for EWC...")
ewc = EWC(model, old_task_loader, device, n_samples=200)
# Setup optimizer - only new task head and last few layers
    optimizer = torch.optim.AdamW([
        {'params': model.backbone.layer4.parameters(), 'lr': 1e-5},
        {'params': model.classifier.parameters(), 'lr': 1e-4},
    ], weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=epochs, eta_min=1e-6
)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
for epoch in range(epochs):
model.train()
total_loss = 0.0
total_task_loss = 0.0
total_ewc_loss = 0.0
for images, labels in new_task_loader:
images, labels = images.to(device), labels.to(device)
logits = model(images)
task_loss = criterion(logits, labels)
# EWC regularization - penalizes forgetting
ewc_loss = ewc.penalty(model)
loss = task_loss + ewc_lambda * ewc_loss
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
total_task_loss += task_loss.item()
total_ewc_loss += ewc_loss.item()
scheduler.step()
if (epoch + 1) % 5 == 0:
print(f"Epoch {epoch+1}/{epochs}: "
f"total={total_loss/len(new_task_loader):.4f} "
f"task={total_task_loss/len(new_task_loader):.4f} "
f"ewc={total_ewc_loss/len(new_task_loader):.6f}")
Continual Learning Strategies Comparison
| Strategy | Approach | Old Task Performance | New Task Performance | Memory Cost |
|---|---|---|---|---|
| Fine-tune (naive) | Train on new task only | Catastrophic forgetting (may drop to 0%) | Best | Low |
| Replay buffer | Mix old and new task data | Good (depends on buffer size) | Slightly worse | Medium (stores old images) |
| EWC | Fisher-weighted regularization | Good (+/-2% vs baseline) | Slight reduction (<1%) | Low (only Fisher diag) |
| LoRA + Adapter | Separate adapters per task | Perfect (task A adapters frozen) | Near-optimal | Low per adapter |
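The replay-buffer row in the table can be sketched simply: keep a small reservoir of old-task samples and mix a few into each new-task batch. The buffer capacity below is an illustrative choice:

```python
import random
import torch

class ReplayBuffer:
    """Reservoir-sampled buffer of (image, label) pairs from old tasks."""
    def __init__(self, capacity: int = 500):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0

    def add(self, image: torch.Tensor, label: torch.Tensor) -> None:
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((image, label))
        else:
            # Reservoir sampling: every seen sample kept with prob capacity/seen
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = (image, label)

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        images, labels = zip(*batch)
        return torch.stack(images), torch.stack(labels)
```

During new-task training, append a replayed mini-batch (for example 25% of the batch size) to each new-task batch before computing the loss, so gradients keep covering the old distribution.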
Conclusions
Transfer Learning is one of the most powerful tools in a CV engineer's toolkit. In this article we covered:
- The theoretical foundation: feature hierarchies in CNNs and why knowledge transfers across domains
- The main strategies: feature extraction, partial fine-tuning, and full fine-tuning
- Domain adaptation and zero-shot transfer with CLIP and DINOv2
- The decision matrix: how to choose the right strategy based on dataset size and domain similarity
- A comprehensive overview of pre-trained models (ResNet, EfficientNet, ViT, ConvNeXt, DINOv2)
- Complete PyTorch implementation with discriminative learning rates and gradual unfreezing
- Warmup + cosine annealing scheduling for stable fine-tuning
- Industrial defect classification achieving 96.8% accuracy with EfficientNet-B4
- ONNX export for dependency-free production inference
- Transfer Learning for object detection with Faster R-CNN
- Parameter-Efficient Fine-Tuning: LoRA for training only 0.5-1% of parameters with comparable accuracy to full fine-tuning, saving 50-60% GPU memory
- Continual Learning with Elastic Weight Consolidation (EWC): add new tasks without catastrophic forgetting of old ones
Transfer Learning Quick Reference
| Scenario | Dataset Size | Domain Similarity | Strategy | Expected Accuracy Gain vs Scratch |
|---|---|---|---|---|
| Medical imaging (CT/MRI) | <1K | Low | Feature extraction + domain adaptation | +20-30% |
| Product classification (e-commerce) | 5K-50K | Medium | Partial fine-tune (last 2 blocks) | +15-25% |
| Industrial defect detection | 500-5K | Low-Medium | EfficientNet full fine-tune + Discriminative LR | +25-35% |
| Satellite imagery | 10K-100K | Low | Full fine-tune or LoRA + domain adaptation | +10-20% |
| Natural images (similar to ImageNet) | Any size | High | Feature extraction only | +5-15% + 10x faster convergence |
| Large ViT, limited GPU | 1K-50K | Any | LoRA (r=16) + classifier head | Similar to full fine-tune, 50% less VRAM |
Cross-Series Resources
- MLOps: Model Serving in Production - deploy your fine-tuned model at scale with FastAPI and Docker
- Deep Learning Advanced: EfficientNet and Compound Scaling - deep dive into the architecture
- Data Augmentation for Computer Vision - maximize your dataset with advanced techniques