CNN Fundamentals: Architecture, Training, Deployment
How does a computer "see" a cat in a photograph? How does it tell a traffic sign from a human face? The answer lies in Convolutional Neural Networks (CNNs), the deep learning architecture that revolutionized computer vision. From self-driving cars to medical imaging diagnostics, CNNs are the invisible engine powering millions of applications we use every day.
In this first article of the Computer Vision with Deep Learning series, we will build your understanding of CNNs from scratch: what they are, how convolutional filters work, which architectures made history, and how to implement and train a complete CNN in PyTorch. By the end you will have the skills to classify real images and deploy models to production.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | You are here - CNN Fundamentals | Architecture, training, deployment |
| 2 | Transfer Learning and Fine-Tuning | Pre-trained models, domain adaptation |
| 3 | Object Detection with YOLO | Real-time object detection |
| 4 | Semantic Segmentation | Pixel-level classification |
| 5 | Image Generation with GANs and Diffusion | Synthetic image generation |
| 6 | Edge Deployment and Optimization | Models on embedded devices |
What You Will Learn
- How a computer represents images (pixels, RGB channels, tensors)
- The convolution operation: kernels, feature maps, sliding window
- Core CNN building blocks: convolutions, pooling, activations
- The evolution of architectures: from LeNet-5 to ConvNeXt
- ResNet and skip connections: solving the vanishing gradient problem
- Full PyTorch implementation: CNN for CIFAR-10 classification
- Transfer learning: leveraging models pre-trained on ImageNet
- Evaluation metrics: accuracy, precision, recall, confusion matrix
- Deployment: from trained model to production inference (ONNX, TorchScript)
1. From Pixels to Features: How a Computer Sees Images
For humans, looking at a photo is effortless. Our brain instantly recognizes shapes, colors, edges, and objects. But for a computer, an image is nothing more than a grid of numbers. Each pixel is represented by numerical values indicating light intensity.
1.1 An Image as a Matrix of Numbers
A grayscale image is a 2D matrix where each cell holds a value between 0 (black) and
255 (white). A color image consists of three channels (Red, Green, Blue),
each a separate matrix. A 224x224 color image is therefore a tensor of shape
3 x 224 x 224, totaling 150,528 numerical values.
4x4 Image (grayscale, values 0-255):
+-----+-----+-----+-----+
| 10 | 20 | 30 | 40 |
+-----+-----+-----+-----+
| 50 | 120 | 180 | 60 |
+-----+-----+-----+-----+
| 70 | 200 | 220 | 80 |
+-----+-----+-----+-----+
| 90 | 100 | 110 | 130 |
+-----+-----+-----+-----+
Color image (3 RGB channels):
R channel: [[10, 20, ...], ...] --> red intensities
G channel: [[30, 40, ...], ...] --> green intensities
B channel: [[50, 60, ...], ...] --> blue intensities
Final tensor: shape = (3, H, W) --> (channels, height, width)
1.2 Why Fully Connected Networks Fall Short
The naive approach would be to flatten the image into a 1D vector and feed it to a fully connected (dense) neural network. For a 224x224x3 image, that means an input layer with 150,528 neurons. With a 1,000-neuron hidden layer, the first layer alone would have 150 million parameters.
Problems with Dense Networks for Images
- Parameter explosion: Millions of weights in the first layer alone, computationally prohibitive
- No spatial invariance: A dense layer ties each pixel to its own weight, so if a cat shifts 10 pixels to the right the learned pattern no longer lines up and must be relearned at the new position
- Loss of 2D structure: Flattening destroys spatial relationships between neighboring pixels
- Overfitting: Too many parameters with limited data leads to memorization, not generalization
CNNs solve all of these problems by exploiting three key insights: locality (visual patterns are local), weight sharing (the same filter works everywhere in the image), and translation invariance (an edge is an edge regardless of where it appears).
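The parameter explosion is easy to verify with a few lines of arithmetic (a back-of-the-envelope sketch, not framework code):

```python
# Back-of-the-envelope parameter count for a dense first layer
# on a 224x224 RGB image (weights only, ignoring biases).
height, width, channels = 224, 224, 3
input_neurons = height * width * channels    # one input neuron per value
hidden_neurons = 1_000

# Flattening loses the 2D grid; every input feeds every hidden neuron.
dense_weights = input_neurons * hidden_neurons

print(input_neurons)   # 150528
print(dense_weights)   # 150528000 -> ~150 million weights

# Compare: a conv layer with 32 filters of 3x3x3 shares its weights
# across every position in the image.
conv_weights = 32 * 3 * 3 * 3
print(conv_weights)    # 864
```

The same 32-filter convolution needs five orders of magnitude fewer weights, which is exactly the benefit of locality and weight sharing.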
2. The Convolution Operation
Convolution is the mathematical operation at the heart of every CNN. A small filter (called a kernel) slides across the input image, computing at each position a weighted sum of the covered pixels. The result is a new matrix called a feature map, which highlights a specific pattern (vertical edge, horizontal edge, corner, texture).
2.1 Kernels and Sliding Window
A kernel is a small weight matrix (typically 3x3 or 5x5) that slides across the entire input image. At each position, kernel values are multiplied element-wise with the underlying pixels and summed to produce a single value in the output feature map.
Input (5x5): Kernel (3x3):
+---+---+---+---+---+ +----+----+----+
| 1 | 2 | 3 | 0 | 1 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 0 | 1 | 2 | 3 | 1 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 1 | 0 | 1 | 2 | 0 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 2 | 1 | 0 | 1 | 3 | (Vertical edge detector)
+---+---+---+---+---+
| 0 | 1 | 2 | 1 | 0 |
+---+---+---+---+---+
Position (0,0): apply kernel to highlighted pixels [*]
[*1][*2][*3] 0 1
[*0][*1][*2] 3 1 Calculation:
[*1][*0][*1] 2 0 (-1x1)+(0x2)+(1x3)+(-1x0)+(0x1)+(1x2)+(-1x1)+(0x0)+(1x1)
2 1 0 1 3 = -1 + 0 + 3 + 0 + 0 + 2 - 1 + 0 + 1 = 4
0 1 2 1 0
Output feature map (3x3):
+---+---+---+
| 4 | . | . | <-- The 4 we just computed
+---+---+---+
| . | . | . | The kernel slides and computes each cell
+---+---+---+
| . | . | . |
+---+---+---+
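The sliding-window computation above can be reproduced in a few lines of plain Python (a didactic sketch; real frameworks use heavily optimized implementations):

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with stride 1 and no padding,
    exactly the operation shown in the diagram above."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

image = [[1, 2, 3, 0, 1],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 0],
         [2, 1, 0, 1, 3],
         [0, 1, 2, 1, 0]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]  # vertical edge detector

feature_map = conv2d_valid(image, kernel)
print(feature_map[0][0])  # 4, matching the worked example
```

A 5x5 input with a 3x3 kernel yields the 3x3 feature map from the diagram, with the value 4 in the top-left cell.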
Classic Kernel Types
| Kernel | Purpose | Values (3x3) |
|---|---|---|
| Vertical edge | Detects vertical transitions | [-1, 0, 1] repeated |
| Horizontal edge | Detects horizontal transitions | [-1,-1,-1], [0,0,0], [1,1,1] |
| Sharpening | Increases sharpness | [0,-1,0], [-1,5,-1], [0,-1,0] |
| Gaussian blur | Gaussian smoothing | [1,2,1], [2,4,2], [1,2,1] / 16 |
The fundamental difference between CNNs and traditional image processing is that in CNNs the kernel values are not hand-crafted. The network automatically learns the optimal filters during training through backpropagation. Early layers learn to detect edges and simple textures, while deeper layers combine these features into increasingly complex patterns (eyes, wheels, faces).
3. CNN Building Blocks
A CNN is composed of several types of layers, each with a specific role. Understanding what each component does is essential for designing effective architectures.
3.1 Convolutional Layer
The convolutional layer applies multiple filters to the input, producing one feature map per filter. If you apply 32 filters of size 3x3 to an RGB image, you get 32 feature maps, each highlighting a different pattern. The key parameters are:
Convolutional Layer Parameters
| Parameter | Description | Typical Values |
|---|---|---|
| Kernel size | Filter dimensions (width x height) | 3x3, 5x5, 7x7 |
| Stride | Step size as the filter slides | 1, 2 |
| Padding | Pixels added to input borders | 0 (valid), 1 (same for 3x3) |
| Number of filters | How many output feature maps | 32, 64, 128, 256, 512 |
Output dimension formula:
output_size = (input_size - kernel_size + 2 * padding) / stride + 1
Example: input 32x32, kernel 3x3
Stride=1, Padding=0: (32 - 3 + 0) / 1 + 1 = 30x30 (shrinks)
Stride=1, Padding=1: (32 - 3 + 2) / 1 + 1 = 32x32 (same padding)
Stride=2, Padding=1: (32 - 3 + 2) / 2 + 1 = 16x16 (halves)
Stride=1, Padding=1 ("same"):
Input: [A B C D E] Output: [A B C D E] --> same dimensions
(with zero-padding at borders)
Stride=2 (halves resolution):
Input: [A B C D E F] Output: [A C E] --> half the size
(skips one pixel per step)
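The output dimension formula translates directly into a small helper for checking layer sizes (a sketch; PyTorch applies the same formula internally):

```python
def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Reproduce the three examples for a 32x32 input with a 3x3 kernel
print(conv_output_size(32, 3, padding=0, stride=1))  # 30 (shrinks)
print(conv_output_size(32, 3, padding=1, stride=1))  # 32 ("same")
print(conv_output_size(32, 3, padding=1, stride=2))  # 16 (halves)
```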
3.2 Activation Function (ReLU)
After each convolution, a non-linear activation function is applied. The most common is
ReLU (Rectified Linear Unit): f(x) = max(0, x).
ReLU zeroes out all negative values and leaves positive values unchanged. Without non-linearity,
a stack of convolutions would be equivalent to a single linear transformation, making
the network unable to learn complex patterns.
ReLU: f(x) = max(0, x) Simple, fast, standard
LeakyReLU: f(x) = x if x>0, 0.01x otherwise Avoids "dying ReLU"
GELU: f(x) = x * Phi(x) Used in Transformers, smooth
Swish: f(x) = x * sigmoid(x) Used in EfficientNet
Input feature map: After ReLU:
+----+-----+----+ +----+----+----+
| -3 | 5 | -1 | | 0 | 5 | 0 |
+----+-----+----+ +----+----+----+
| 2 | -7 | 4 | | 2 | 0 | 4 |
+----+-----+----+ +----+----+----+
| -2 | 1 | -5 | | 0 | 1 | 0 |
+----+-----+----+ +----+----+----+
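The table above is reproduced by a two-line ReLU in plain Python (a didactic sketch; in PyTorch this is simply `nn.ReLU`):

```python
def relu(feature_map):
    """Element-wise max(0, x) over a 2D feature map."""
    return [[max(0, x) for x in row] for row in feature_map]

fmap = [[-3, 5, -1],
        [ 2, -7, 4],
        [-2, 1, -5]]
print(relu(fmap))  # [[0, 5, 0], [2, 0, 4], [0, 1, 0]]
```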
3.3 Pooling Layer
Pooling reduces the spatial dimensions of feature maps, decreasing the number of parameters and making the network more robust to small variations in feature positions. The two main types are Max Pooling (takes the maximum value in each window) and Average Pooling (computes the average).
Input (4x4):              Output (2x2):
+----+----+----+----+     +----+----+
| 12 | 20 | 30 | 0  |     | 20 | 30 |   max(12,20,0,8)=20
+----+----+----+----+ --> +----+----+   max(30,0,2,14)=30
| 0  | 8  | 2  | 14 |     | 15 | 16 |
+----+----+----+----+     +----+----+
| 15 | 2  | 3  | 16 |
+----+----+----+----+     Reduces 4x4 --> 2x2
| 1  | 6  | 7  | 8  |     (halves height and width)
+----+----+----+----+
Max Pooling: takes maximum --> preserves strongest features
Avg Pooling: computes average --> smoothing effect
Global Average Pooling: average over entire feature map --> single value per channel
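The 2x2 max pooling shown above fits in a short comprehension (a sketch equivalent to PyTorch's `nn.MaxPool2d(2)` for a single channel):

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 over a 2D feature map."""
    return [
        [max(fmap[i][j], fmap[i][j + 1],
             fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

fmap = [[12, 20, 30, 0],
        [0, 8, 2, 14],
        [15, 2, 3, 16],
        [1, 6, 7, 8]]
print(max_pool_2x2(fmap))  # [[20, 30], [15, 16]]
```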
3.4 Batch Normalization
Batch Normalization (BatchNorm) normalizes the output of each layer to have zero mean and unit variance. This stabilizes training, allows higher learning rates, and acts as a mild regularizer. In practice, a BatchNorm layer is placed after each convolution, before the activation function.
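At its core, the normalization subtracts the batch mean and divides by the batch standard deviation. A minimal sketch of that computation (omitting the learnable scale/shift parameters gamma and beta, and the running statistics used at inference):

```python
import math

def batch_norm_1d(values, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance.
    eps avoids division by zero for near-constant activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normalized = batch_norm_1d([2.0, 4.0, 6.0, 8.0])
print(normalized)  # roughly [-1.34, -0.45, 0.45, 1.34]: zero mean, unit variance
```

In a real BatchNorm2d layer this runs per channel across the whole batch and spatial dimensions, followed by the learned affine transform.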
4. Typical CNN Architecture
A standard CNN follows a recurring pattern: feature extraction blocks (convolution + activation + pooling) followed by fully connected layers for final classification. As depth increases, feature maps shrink in spatial dimensions but grow in channel count, capturing increasingly abstract patterns.
Input Image (3 x 32 x 32)
|
v
[Conv2d 3->32, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 32 x 16 x 16
v
[Conv2d 32->64, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 64 x 8 x 8
v
[Conv2d 64->128, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 128 x 4 x 4
v
[Flatten] --> Vector of 128 * 4 * 4 = 2048 values
|
v
[Linear 2048 -> 256] --> [ReLU] --> [Dropout 0.5]
|
v
[Linear 256 -> 10] --> Output: 10 classes (e.g., CIFAR-10)
|
v
[Softmax] --> Probabilities per class: [0.02, 0.01, 0.85, ...]
Dimension flow:
(3, 32, 32) -> (32, 16, 16) -> (64, 8, 8) -> (128, 4, 4) -> (2048) -> (256) -> (10)
[image] [edges,texture] [parts] [objects] [decision] [class]
Hierarchy of Learned Features
- Layers 1-2 (low level): Edges, color gradients, simple textures
- Layers 3-5 (mid level): Corners, contours, object parts (eyes, wheels)
- Layers 6+ (high level): Complete objects, scenes, abstract concepts
This hierarchy emerges automatically during training. There is no need to tell the network what to look for: the filters adapt to the data.
5. Evolution of CNN Architectures
The history of CNNs is marked by groundbreaking architectures, each introducing ideas that reshaped the field. Understanding this evolution is essential for making informed architectural decisions today.
CNN Architecture Timeline
| Year | Architecture | Key Innovation | Top-1 ImageNet |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN (digit recognition) | N/A |
| 2012 | AlexNet | GPU training, ReLU, Dropout | 63.3% |
| 2014 | VGGNet | Deep networks with uniform 3x3 filters | 74.5% |
| 2014 | GoogLeNet/Inception | Inception modules, parallel multi-scale | 74.8% |
| 2015 | ResNet | Skip connections, 152+ layer networks | 78.6% |
| 2019 | EfficientNet | Compound scaling (depth+width+resolution) | 84.4% |
| 2022 | ConvNeXt | Modernized CNN inspired by Vision Transformers | 87.8% |
5.1 LeNet-5 (1998) - The Pioneer
Designed by Yann LeCun for handwritten digit recognition (MNIST), LeNet-5 was the first CNN to achieve practical success. With just 5 layers and 60,000 parameters, it proved that convolutions could learn discriminative features from images. It was used to automatically read bank checks.
5.2 AlexNet (2012) - The Revolution
AlexNet won the ImageNet 2012 competition by a massive margin, reducing the error rate from 26% to 16%. The key innovations: GPU training (two NVIDIA GTX 580s), ReLU activation instead of sigmoid, Dropout for regularization, and data augmentation. This result convinced both academia and industry that deep learning worked.
5.3 VGGNet (2014) - Depth Matters
VGG demonstrated that deeper networks yield better results. Its key insight is radical in its simplicity: use only stacked 3x3 filters. Two consecutive 3x3 layers have the same receptive field as a single 5x5 layer, but with fewer parameters and more non-linearity. VGG-16 has 16 layers and 138 million parameters.
5.4 EfficientNet (2019) - Smart Scaling
EfficientNet introduced compound scaling: instead of increasing only depth (like VGG) or width, it uniformly scales all three dimensions (depth, width, input resolution) with balanced coefficients. EfficientNet-B0 achieves 77.1% accuracy on ImageNet with just 5.3 million parameters, an unprecedented accuracy-to-parameter ratio.
5.5 ConvNeXt (2022) - The Modernized CNN
ConvNeXt demonstrates that CNNs, modernized with techniques inspired by Vision Transformers, can compete with (and outperform) transformer architectures. Key design choices include large 7x7 depthwise convolutions, LayerNorm replacing BatchNorm, GELU activation, and a hierarchical design whose stages progressively widen the channel dimension. ConvNeXt-Tiny reaches 82.1% Top-1 accuracy on ImageNet at 4.5 GFLOPs, making the family a strong choice for efficient deployment scenarios.
6. ResNet and Skip Connections
ResNet (Residual Networks), proposed by He et al. in 2015, solved one of the fundamental problems in deep learning: the degradation problem. Before ResNet, adding layers to a deep network worsened results rather than improving them, even on the training set. The solution is as elegant as it is simple.
6.1 The Vanishing Gradient Problem
During backpropagation, gradients are multiplied repeatedly through the network layers. If these multiplications yield values less than 1, the gradients "vanish" exponentially as they propagate toward the initial layers. With 50 or more layers, gradients become so small that early layers stop learning altogether. Empirical measurements show this directly: without skip connections the L2 norm of the gradient collapses in the early layers, while with skip connections it stays roughly uniform throughout the network.
Deep network WITHOUT skip connections (50 layers):
Layer 50 Layer 49 Layer 48 ... Layer 2 Layer 1
grad=1.0 * 0.8 * 0.8 ... * 0.8 * 0.8
Gradient at layer 1: 0.8^49 ≈ 0.00002 --> nearly zero!
Early layers DO NOT learn.
Deep network WITH skip connections (ResNet):
Gradients have a "direct path" through skip connections.
They are not repeatedly multiplied by small values.
Gradient at layer 1: ~0.5 --> layers learn normally!
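The numbers above are easy to check. The sketch below contrasts a chain of small per-layer factors with the near-identity factors a residual path provides (the magnitudes are illustrative, not measured gradients):

```python
layers = 49

# Without skip connections: the gradient is a product of per-layer
# factors; anything below 1 decays exponentially with depth.
factor_plain = 0.8
grad_plain = factor_plain ** layers
print(f"{grad_plain:.6f}")  # 0.000018 -> early layers barely learn

# With skip connections the local Jacobian is (1 + dF/dx): when the
# residual branch F contributes little, each factor stays near 1.
residual_contribution = 0.01
grad_residual = (1 + residual_contribution) ** layers
print(f"{grad_residual:.2f}")  # 1.63 -> the signal survives the full depth
```

The identity path keeps each multiplicative factor close to 1, which is why the gradient magnitude no longer collapses with depth.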
6.2 The Solution: Residual Learning
The genius of ResNet is simple: instead of having a block learn the complete transformation
H(x), have it learn only the difference (residual) from the
input: F(x) = H(x) - x. The block output becomes y = F(x) + x,
where x "skips" the block via a skip connection
(or shortcut connection).
Standard Block: Residual Block:
x x ------+
| | |
v v | (skip connection)
[Conv 3x3] [Conv 3x3] |
| | |
[BatchNorm] [BatchNorm] |
| | |
[ReLU] [ReLU] |
| | |
[Conv 3x3] [Conv 3x3] |
| | |
[BatchNorm] [BatchNorm] |
| | |
v v |
H(x) = output F(x) + x <--+
|
[ReLU]
|
v
output
If F(x) = 0, the output is simply x (identity).
The network can "skip" a block if it is not needed.
This makes training of deep networks stable.
Why It Works
Learning the residual F(x) = 0 (i.e., "do nothing") is much easier
than learning a complete identity transformation. If a layer is not useful, the network
simply learns F(x) = 0 and passes the input unchanged. This allows
building networks with hundreds of layers without performance degradation.
7. Training a CNN
Training a CNN means finding the optimal values for all kernels (filters) and fully connected layer weights. This happens through an iterative process of forward pass, loss computation, and gradient backpropagation.
7.1 Loss Function
For image classification, the standard loss function is Cross-Entropy Loss. It measures how much the predicted probabilities deviate from the true labels. A perfect prediction yields loss = 0; a completely wrong prediction yields loss approaching infinity.
True label (one-hot): [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] (class "cat" = index 2)
Good prediction: [0.02, 0.03, 0.85, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.01]
Loss = -log(0.85) = 0.16 --> Low loss, correct prediction
Bad prediction: [0.30, 0.25, 0.05, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
Loss = -log(0.05) = 3.00 --> High loss, wrong prediction
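The two loss values above can be verified directly (a minimal sketch of cross-entropy for a one-hot label; frameworks like PyTorch compute it from raw logits for numerical stability):

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss for a one-hot label: just the negative
    log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

good = [0.02, 0.03, 0.85, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.01]
bad = [0.30, 0.25, 0.05, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]

print(f"{cross_entropy(good, 2):.2f}")  # 0.16 -> low loss
print(f"{cross_entropy(bad, 2):.2f}")   # 3.00 -> high loss
```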
7.2 Optimizers
The optimizer updates network weights in the direction that reduces the loss. The most commonly used optimizers are:
Optimizers Compared
| Optimizer | Characteristics | When to Use |
|---|---|---|
| SGD + Momentum | Simple, robust, stable convergence | Long training, maximum final accuracy |
| Adam | Adaptive learning rate, fast convergence | Prototyping, small/medium networks |
| AdamW | Adam with corrected weight decay | Modern standard, recommended for CNNs |
7.3 Data Augmentation
Data augmentation is a fundamental technique for preventing overfitting and improving network generalization. It consists of applying random transformations to training images (rotations, flips, crops, brightness changes) to create synthetic variations without collecting new data.
Original Image: Transforms:
+-------+ [Horizontal flip] --> Mirrored image
| Cat | [Rotation +-15 deg] --> Slight rotation
| --o-- | [Random Crop] --> Random crop
| /|\ | [Color Jitter] --> Color variation
+-------+ [Gaussian Noise] --> Added noise
[Cutout/Erasing] --> Random rectangular mask
[MixUp] --> Weighted average of 2 images
[CutMix] --> Patch from one image overlaid
Effect: from 50,000 training images, each epoch sees different
variations, as if you had millions of unique images.
8. Complete PyTorch Implementation
Let us move from theory to practice. We will implement a complete CNN to classify images from the CIFAR-10 dataset (60,000 images at 32x32 in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
8.1 Setup and Dataset
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Training transforms with data augmentation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],  # CIFAR-10 mean
        std=[0.2470, 0.2435, 0.2616]    # CIFAR-10 std
    ),
])

# Test transforms (no augmentation)
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    ),
])

# Load datasets
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=train_transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=test_transform
)

# DataLoader with batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
8.2 Model Definition
class ResidualBlock(nn.Module):
    """Residual block with skip connection."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Skip connection: if dimensions change, use a 1x1 conv projection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity  # Skip connection
        out = self.relu(out)
        return out


class CIFAR10CNN(nn.Module):
    """CNN with residual blocks for CIFAR-10 classification."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # First convolutional layer
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(32)
        self.relu = nn.ReLU(inplace=True)
        # Residual blocks with increasing depth
        self.layer1 = self._make_layer(32, 64, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(64, 128, num_blocks=2, stride=2)
        self.layer3 = self._make_layer(128, 256, num_blocks=2, stride=2)
        # Final classifier
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(
        self, in_ch: int, out_ch: int, num_blocks: int, stride: int
    ) -> nn.Sequential:
        layers = [ResidualBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_ch, out_ch, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.conv1(x)))  # (B, 32, 32, 32)
        x = self.layer1(x)                      # (B, 64, 32, 32)
        x = self.layer2(x)                      # (B, 128, 16, 16)
        x = self.layer3(x)                      # (B, 256, 8, 8)
        x = self.global_avg_pool(x)             # (B, 256, 1, 1)
        x = x.view(x.size(0), -1)               # (B, 256)
        x = self.dropout(x)
        x = self.fc(x)                          # (B, 10)
        return x
8.3 Training Loop
def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    test_loader: DataLoader,
    epochs: int = 50,
    lr: float = 0.01
) -> dict:
    """Train the model and return training history."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(
        model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4
    )
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs,
        steps_per_epoch=len(train_loader)
    )

    history = {'train_loss': [], 'test_loss': [], 'test_acc': []}

    for epoch in range(epochs):
        # --- Training ---
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()

            running_loss += loss.item()

        avg_train_loss = running_loss / len(train_loader)

        # --- Evaluation ---
        model.eval()
        test_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                test_loss += loss.item()

                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        avg_test_loss = test_loss / len(test_loader)
        accuracy = 100.0 * correct / total

        history['train_loss'].append(avg_train_loss)
        history['test_loss'].append(avg_test_loss)
        history['test_acc'].append(accuracy)

        print(
            f"Epoch [{epoch+1}/{epochs}] "
            f"Train Loss: {avg_train_loss:.4f} | "
            f"Test Loss: {avg_test_loss:.4f} | "
            f"Accuracy: {accuracy:.2f}%"
        )

    return history


# Run training
model = CIFAR10CNN(num_classes=10)
history = train_model(model, train_loader, test_loader, epochs=50, lr=0.1)

# Expected output after 50 epochs with OneCycleLR:
# Test Accuracy: ~92-93%
Recommended Training Parameters for CIFAR-10
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 128 | Balance between speed and gradient stability |
| Learning rate | 0.1 (with OneCycleLR) | SGD with scheduling achieves higher accuracy |
| Weight decay | 1e-4 | L2 regularization to prevent overfitting |
| Epochs | 50 | With OneCycleLR, 50 epochs suffice for convergence |
| Dropout | 0.3 | Additional regularization before the final layer |
9. Transfer Learning
In practice, you rarely train a CNN from scratch. Transfer learning allows you to reuse models pre-trained on large datasets (such as ImageNet with 1.2 million images and 1,000 classes) and adapt them to your specific problem. This drastically reduces training time, data requirements, and improves performance.
9.1 Feature Extraction vs Fine-Tuning
Two Transfer Learning Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Feature Extraction | Freeze all pre-trained layers, train only the final classifier | Limited data (<1,000 images), domain similar to ImageNet |
| Fine-Tuning | Unfreeze the last N layers and retrain with a low learning rate | More data available, domain different from ImageNet |
import torchvision.models as models


def create_transfer_model(
    num_classes: int,
    freeze_backbone: bool = True
) -> nn.Module:
    """Create a model with transfer learning from ResNet-18."""
    # Load ResNet-18 pre-trained on ImageNet
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Strategy 1: Feature Extraction (freeze backbone)
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False

    # Replace the final classifier
    num_features = model.fc.in_features  # 512 for ResNet-18
    model.fc = nn.Sequential(
        nn.Dropout(0.3),
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(256, num_classes)
    )
    return model


# Feature extraction: only the classifier is trained
feature_model = create_transfer_model(num_classes=10, freeze_backbone=True)

# Fine-tuning: the last block and the classifier are retrained
finetune_model = create_transfer_model(num_classes=10, freeze_backbone=False)

# For fine-tuning, use a lower learning rate on the backbone layers
optimizer = optim.AdamW([
    {'params': finetune_model.layer4.parameters(), 'lr': 1e-4},
    {'params': finetune_model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)
Transfer Learning Guide:
Data similar to ImageNet Data different from ImageNet
(objects, animals, scenes) (medical, satellite, micro)
+----------------------------+----------------------------+
Few data | Feature Extraction | Feature Extraction |
(<1,000 img) | Freeze all, train FC | + Increase data augment. |
+----------------------------+----------------------------+
Lots of data | Fine-tune last layers | Full fine-tuning |
(>5,000 img) | Low LR for backbone | or train from scratch |
+----------------------------+----------------------------+
10. Evaluation Metrics
Accuracy alone is not enough to determine whether a CNN performs well. With imbalanced datasets (e.g., 95% class A, 5% class B), a model that always predicts "class A" achieves 95% accuracy but is completely useless. More granular metrics are needed.
Image Classification Metrics
| Metric | What It Measures | Formula |
|---|---|---|
| Accuracy | Percentage of correct predictions overall | (TP + TN) / Total |
| Precision | Among positive predictions, how many are correct? | TP / (TP + FP) |
| Recall | Among actual positives, how many were found? | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 * (P * R) / (P + R) |
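The formulas translate directly to code. The sketch below computes them for a hypothetical binary split of the "cat" class (the counts are made up for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "cat" class:
# 78 cats found, 9 non-cats mislabeled as cat, 22 cats missed.
p, r, f1 = precision_recall_f1(tp=78, fp=9, fn=22)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.90 recall=0.78 f1=0.83
```

Note how precision and recall diverge: the model is trustworthy when it says "cat" (0.90) but misses more than a fifth of the actual cats (0.78).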
Confusion Matrix (predictions vs reality):
Predicted: plane car bird cat deer dog frog horse ship truck
Actual:
airplane [92] 1 2 0 0 0 1 0 3 1
automobile 0 [95] 0 0 0 0 0 0 1 4
bird 3 0 [85] 3 2 2 3 1 1 0
cat 0 1 2 [78] 1 12 3 2 0 1
deer 1 0 3 2 [88] 1 2 3 0 0
Interpretation:
- Diagonal = correct predictions (higher = better)
- Off-diagonal = errors (confusions between classes)
- cat vs dog: 12 cats classified as dogs --> "confused" classes
from sklearn.metrics import (
    classification_report,
    confusion_matrix
)
import numpy as np


def evaluate_model(
    model: nn.Module,
    test_loader: DataLoader,
    device: torch.device,
    class_names: list[str]
) -> dict:
    """Evaluate the model and return detailed metrics."""
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.numpy())

    preds_array = np.array(all_preds)
    labels_array = np.array(all_labels)

    # Per-class detailed report
    report = classification_report(
        labels_array, preds_array,
        target_names=class_names, output_dict=True
    )
    # Confusion matrix
    cm = confusion_matrix(labels_array, preds_array)
    # Overall accuracy
    accuracy = np.mean(preds_array == labels_array) * 100

    print(f"Overall Accuracy: {accuracy:.2f}%")
    print(classification_report(
        labels_array, preds_array, target_names=class_names
    ))
    return {'accuracy': accuracy, 'report': report, 'confusion_matrix': cm}
11. Deployment: From Trained Model to Production
Training a model is only half the work. To bring it to production you need to export it in an optimized format, create an inference API, and containerize everything for scalable deployment.
11.1 Model Export
Export Formats
| Format | Use Case | Advantages |
|---|---|---|
| TorchScript | PyTorch inference without Python | No Python dependency, complete serialization |
| ONNX | Universal format, multi-framework | Compatible with TensorRT, OpenVINO, CoreML |
| TensorRT | Optimized inference for NVIDIA GPUs | Up to 5x faster than native PyTorch |
import torch.onnx


def export_model(model: nn.Module, export_path: str) -> None:
    """Export the model to TorchScript and ONNX."""
    model.eval()
    dummy_input = torch.randn(1, 3, 32, 32)

    # --- TorchScript (traced: safe here because forward has no
    #     data-dependent control flow; otherwise use torch.jit.script) ---
    traced_model = torch.jit.trace(model, dummy_input)
    traced_model.save(f"{export_path}/model_scripted.pt")
    print("TorchScript saved.")

    # --- ONNX ---
    torch.onnx.export(
        model,
        dummy_input,
        f"{export_path}/model.onnx",
        input_names=['image'],
        output_names=['prediction'],
        dynamic_axes={
            'image': {0: 'batch_size'},
            'prediction': {0: 'batch_size'}
        },
        opset_version=17
    )
    print("ONNX saved.")


export_model(model, './exports')
11.2 Inference API with FastAPI
# inference_api.py
from fastapi import FastAPI, UploadFile
from PIL import Image
import torch
import torchvision.transforms as transforms
import io

app = FastAPI(title="CNN Image Classifier")

# Load TorchScript model
model = torch.jit.load("./exports/model_scripted.pt")
model.eval()

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    ),
])


@app.post("/predict")
async def predict(file: UploadFile):
    """Classify an uploaded image."""
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    tensor = preprocess(image).unsqueeze(0)

    with torch.no_grad():
        outputs = model(tensor)
        probabilities = torch.softmax(outputs, dim=1)
        confidence, predicted = probabilities.max(1)

    return {
        "class": CLASSES[predicted.item()],
        "confidence": round(confidence.item() * 100, 2),
        "all_probabilities": {
            name: round(prob.item() * 100, 2)
            for name, prob in zip(CLASSES, probabilities[0])
        }
    }

# Run: uvicorn inference_api:app --host 0.0.0.0 --port 8000
11.3 Containerization with Docker
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY exports/ ./exports/
COPY inference_api.py .
EXPOSE 8000
CMD ["uvicorn", "inference_api:app", "--host", "0.0.0.0", "--port", "8000"]
# Build: docker build -t cnn-classifier .
# Run: docker run -p 8000:8000 cnn-classifier
# Test: curl -X POST -F "file=@cat.jpg" http://localhost:8000/predict
Pipeline: Training --> Export --> Container --> Deploy
[Training] [Export] [Container] [Deploy]
PyTorch + GPU --> TorchScript/ONNX --> Docker --> Kubernetes
50 epochs ~10 MB file FastAPI Auto-scaling
~92% accuracy Optimized Health checks Load balancer
No Python dependency GPU optional Monitoring
Serverless alternative:
Export ONNX --> AWS Lambda + ONNX Runtime --> API Gateway
Pro: Pay-per-use, zero infrastructure to manage
Con: Cold start (~2s), 250MB package limit
Conclusions and Next Steps
In this article we built a comprehensive understanding of Convolutional Neural Networks, starting from fundamentals (how a computer sees images, the convolution operation) through practical implementation (CNN with residual blocks in PyTorch) to production deployment (TorchScript, ONNX, FastAPI, Docker).
We traced how the evolution of architectures, from LeNet-5 in 1998 to ConvNeXt in 2022, progressively improved performance through ideas like skip connections (ResNet), compound scaling (EfficientNet), and transformer-inspired design (ConvNeXt).
Key Takeaways
- CNNs exploit locality, weight sharing, and spatial invariance to process images efficiently
- The standard architecture follows the pattern: Conv + BatchNorm + ReLU + Pooling, repeated with increasing depth
- Skip connections (ResNet) are essential for training deep networks without vanishing gradients
- Transfer learning is almost always preferable to training from scratch, especially with limited data
- Data augmentation is critical for generalization and costs nothing in terms of data collection
- For production, export to ONNX or TorchScript and containerize with Docker
In the next article we will dive deep into Transfer Learning and Fine-Tuning: how to choose the right pre-trained model, progressive fine-tuning strategies, domain adaptation, and advanced techniques like knowledge distillation. In the third article we will tackle Object Detection with YOLO, the most widely used real-time object detection system in industry.
Additional Resources
- Original ResNet paper: "Deep Residual Learning for Image Recognition" (He et al., 2015)
- ConvNeXt paper: "A ConvNet for the 2020s" (Liu et al., 2022)
- EfficientNet paper: "EfficientNet: Rethinking Model Scaling for CNNs" (Tan & Le, 2019)
- PyTorch Documentation: Tutorials on CNNs and torchvision
- CS231n Stanford: Convolutional Neural Networks for Visual Recognition (online course)
- ONNX Runtime: Documentation for optimized multi-platform inference