CNN Fundamentals: Architecture, Training, Deployment
How does a computer "see" a cat in a photograph? How does it tell a traffic sign from a human face? The answer lies in Convolutional Neural Networks (CNNs), the deep learning architecture that revolutionized computer vision. From self-driving cars to medical imaging diagnostics, CNNs are the invisible engine powering millions of applications we use every day.
In this first article of the Computer Vision with Deep Learning series, we will build your understanding of CNNs from scratch: what they are, how convolutional filters work, which architectures made history, and how to implement and train a complete CNN in PyTorch. By the end you will have the skills to classify real images and deploy models to production.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | You are here - CNN Fundamentals | Architecture, training, deployment |
| 2 | Transfer Learning and Fine-Tuning | Pre-trained models, domain adaptation |
| 3 | Object Detection with YOLO | Real-time object detection |
| 4 | Semantic Segmentation | Pixel-level classification |
| 5 | Image Generation with GANs and Diffusion | Synthetic image generation |
| 6 | Edge Deployment and Optimization | Models on embedded devices |
What You Will Learn
- How a computer represents images (pixels, RGB channels, tensors)
- The convolution operation: kernels, feature maps, sliding window
- Core CNN building blocks: convolutions, pooling, activations
- The evolution of architectures: from LeNet-5 to ConvNeXt
- ResNet and skip connections: solving the vanishing gradient problem
- Full PyTorch implementation: CNN for CIFAR-10 classification
- Transfer learning: leveraging models pre-trained on ImageNet
- Evaluation metrics: accuracy, precision, recall, confusion matrix
- Deployment: from trained model to production inference (ONNX, TorchScript)
1. From Pixels to Features: How a Computer Sees Images
For humans, looking at a photo is effortless. Our brain instantly recognizes shapes, colors, edges, and objects. But for a computer, an image is nothing more than a grid of numbers. Each pixel is represented by numerical values indicating light intensity.
1.1 An Image as a Matrix of Numbers
A grayscale image is a 2D matrix where each cell holds a value between 0 (black) and
255 (white). A color image consists of three channels (Red, Green, Blue),
each a separate matrix. A 224x224 color image is therefore a tensor of shape
3 x 224 x 224, totaling 150,528 numerical values.
4x4 Image (grayscale, values 0-255):
+-----+-----+-----+-----+
| 10 | 20 | 30 | 40 |
+-----+-----+-----+-----+
| 50 | 120 | 180 | 60 |
+-----+-----+-----+-----+
| 70 | 200 | 220 | 80 |
+-----+-----+-----+-----+
| 90 | 100 | 110 | 130 |
+-----+-----+-----+-----+
Color image (3 RGB channels):
R channel: [[10, 20, ...], ...] --> red intensities
G channel: [[30, 40, ...], ...] --> green intensities
B channel: [[50, 60, ...], ...] --> blue intensities
Final tensor: shape = (3, H, W) --> (channels, height, width)
1.2 Why Fully Connected Networks Fall Short
The naive approach would be to flatten the image into a 1D vector and feed it to a fully connected (dense) neural network. For a 224x224x3 image, that means an input layer with 150,528 neurons. With a 1,000-neuron hidden layer, the first layer alone would have 150 million parameters.
Problems with Dense Networks for Images
- Parameter explosion: Millions of weights in the first layer alone, computationally prohibitive
- No spatial invariance: A dense layer ties each pixel to its own weight, so if a cat shifts 10 pixels to the right the learned pattern no longer lines up and must be relearned at the new position
- Loss of 2D structure: Flattening destroys spatial relationships between neighboring pixels
- Overfitting: Too many parameters with limited data leads to memorization, not generalization
CNNs solve all of these problems by exploiting three key insights: locality (visual patterns are local), weight sharing (the same filter works everywhere in the image), and translation invariance (an edge is an edge regardless of where it appears).
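The parameter explosion is easy to verify with a few lines of arithmetic (a back-of-the-envelope sketch, not framework code):

```python
# Back-of-the-envelope parameter count for a dense first layer
# on a 224x224 RGB image (weights only, ignoring biases).
height, width, channels = 224, 224, 3
input_neurons = height * width * channels    # one input neuron per value
hidden_neurons = 1_000

# Flattening loses the 2D grid; every input feeds every hidden neuron.
dense_weights = input_neurons * hidden_neurons

print(input_neurons)   # 150528
print(dense_weights)   # 150528000 -> ~150 million weights

# Compare: a conv layer with 32 filters of 3x3x3 shares its weights
# across every position in the image.
conv_weights = 32 * 3 * 3 * 3
print(conv_weights)    # 864
```

The same 32-filter convolution needs five orders of magnitude fewer weights, which is exactly the benefit of locality and weight sharing.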
2. The Convolution Operation
Convolution is the mathematical operation at the heart of every CNN. A small filter (called a kernel) slides across the input image, computing at each position a weighted sum of the covered pixels. The result is a new matrix called a feature map, which highlights a specific pattern (vertical edge, horizontal edge, corner, texture).
2.1 Kernels and Sliding Window
A kernel is a small weight matrix (typically 3x3 or 5x5) that slides across the entire input image. At each position, kernel values are multiplied element-wise with the underlying pixels and summed to produce a single value in the output feature map.
Input (5x5): Kernel (3x3):
+---+---+---+---+---+ +----+----+----+
| 1 | 2 | 3 | 0 | 1 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 0 | 1 | 2 | 3 | 1 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 1 | 0 | 1 | 2 | 0 | | -1 | 0 | 1 |
+---+---+---+---+---+ +----+----+----+
| 2 | 1 | 0 | 1 | 3 | (Vertical edge detector)
+---+---+---+---+---+
| 0 | 1 | 2 | 1 | 0 |
+---+---+---+---+---+
Position (0,0): apply kernel to highlighted pixels [*]
[*1][*2][*3] 0 1
[*0][*1][*2] 3 1 Calculation:
[*1][*0][*1] 2 0 (-1x1)+(0x2)+(1x3)+(-1x0)+(0x1)+(1x2)+(-1x1)+(0x0)+(1x1)
2 1 0 1 3 = -1 + 0 + 3 + 0 + 0 + 2 - 1 + 0 + 1 = 4
0 1 2 1 0
Output feature map (3x3):
+---+---+---+
| 4 | . | . | <-- The 4 we just computed
+---+---+---+
| . | . | . | The kernel slides and computes each cell
+---+---+---+
| . | . | . |
+---+---+---+
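The sliding-window computation above can be reproduced in a few lines of plain Python (a didactic sketch; real frameworks use heavily optimized implementations):

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation with stride 1 and no padding,
    exactly the operation shown in the diagram above."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

image = [[1, 2, 3, 0, 1],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 0],
         [2, 1, 0, 1, 3],
         [0, 1, 2, 1, 0]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]  # vertical edge detector

feature_map = conv2d_valid(image, kernel)
print(feature_map[0][0])  # 4, matching the worked example
```

A 5x5 input with a 3x3 kernel yields the 3x3 feature map from the diagram, with the value 4 in the top-left cell.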
Classic Kernel Types
| Kernel | Purpose | Values (3x3) |
|---|---|---|
| Vertical edge | Detects vertical transitions | [-1, 0, 1] repeated |
| Horizontal edge | Detects horizontal transitions | [-1,-1,-1], [0,0,0], [1,1,1] |
| Sharpening | Increases sharpness | [0,-1,0], [-1,5,-1], [0,-1,0] |
| Gaussian blur | Gaussian smoothing | [1,2,1], [2,4,2], [1,2,1] / 16 |
The fundamental difference between CNNs and traditional image processing is that in CNNs the kernel values are not hand-crafted. The network automatically learns the optimal filters during training through backpropagation. Early layers learn to detect edges and simple textures, while deeper layers combine these features into increasingly complex patterns (eyes, wheels, faces).
3. CNN Building Blocks
A CNN is composed of several types of layers, each with a specific role. Understanding what each component does is essential for designing effective architectures.
3.1 Convolutional Layer
The convolutional layer applies multiple filters to the input, producing one feature map per filter. If you apply 32 filters of size 3x3 to an RGB image, you get 32 feature maps, each highlighting a different pattern. The key parameters are:
Convolutional Layer Parameters
| Parameter | Description | Typical Values |
|---|---|---|
| Kernel size | Filter dimensions (width x height) | 3x3, 5x5, 7x7 |
| Stride | Step size as the filter slides | 1, 2 |
| Padding | Pixels added to input borders | 0 (valid), 1 (same for 3x3) |
| Number of filters | How many output feature maps | 32, 64, 128, 256, 512 |
Output dimension formula:
output_size = (input_size - kernel_size + 2 * padding) / stride + 1
Example: input 32x32, kernel 3x3
Stride=1, Padding=0: (32 - 3 + 0) / 1 + 1 = 30x30 (shrinks)
Stride=1, Padding=1: (32 - 3 + 2) / 1 + 1 = 32x32 (same padding)
Stride=2, Padding=1: (32 - 3 + 2) / 2 + 1 = 16x16 (halves)
Stride=1, Padding=1 ("same"):
Input: [A B C D E] Output: [A B C D E] --> same dimensions
(with zero-padding at borders)
Stride=2 (halves resolution):
Input: [A B C D E F] Output: [A C E] --> half the size
(skips one pixel per step)
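The output dimension formula translates directly into a small helper for checking layer sizes (a sketch; PyTorch applies the same formula internally):

```python
def conv_output_size(input_size: int, kernel_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Reproduce the three examples for a 32x32 input with a 3x3 kernel
print(conv_output_size(32, 3, padding=0, stride=1))  # 30 (shrinks)
print(conv_output_size(32, 3, padding=1, stride=1))  # 32 ("same")
print(conv_output_size(32, 3, padding=1, stride=2))  # 16 (halves)
```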
3.2 Activation Function (ReLU)
After each convolution, a non-linear activation function is applied. The most common is
ReLU (Rectified Linear Unit): f(x) = max(0, x).
ReLU zeroes out all negative values and leaves positive values unchanged. Without non-linearity,
a stack of convolutions would be equivalent to a single linear transformation, making
the network unable to learn complex patterns.
ReLU: f(x) = max(0, x) Simple, fast, standard
LeakyReLU: f(x) = x if x>0, 0.01x otherwise Avoids "dying ReLU"
GELU: f(x) = x * Phi(x) Used in Transformers, smooth
Swish: f(x) = x * sigmoid(x) Used in EfficientNet
Input feature map: After ReLU:
+----+-----+----+ +----+----+----+
| -3 | 5 | -1 | | 0 | 5 | 0 |
+----+-----+----+ +----+----+----+
| 2 | -7 | 4 | | 2 | 0 | 4 |
+----+-----+----+ +----+----+----+
| -2 | 1 | -5 | | 0 | 1 | 0 |
+----+-----+----+ +----+----+----+
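The table above is reproduced by a two-line ReLU in plain Python (a didactic sketch; in PyTorch this is simply `nn.ReLU`):

```python
def relu(feature_map):
    """Element-wise max(0, x) over a 2D feature map."""
    return [[max(0, x) for x in row] for row in feature_map]

fmap = [[-3, 5, -1],
        [ 2, -7, 4],
        [-2, 1, -5]]
print(relu(fmap))  # [[0, 5, 0], [2, 0, 4], [0, 1, 0]]
```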
3.3 Pooling Layer
Pooling reduces the spatial dimensions of feature maps, decreasing the number of parameters and making the network more robust to small variations in feature positions. The two main types are Max Pooling (takes the maximum value in each window) and Average Pooling (computes the average).
Input (4x4):              Output (2x2):
+----+----+----+----+     +----+----+
| 12 | 20 | 30 | 0  |     | 20 | 30 |   max(12,20,0,8)=20
+----+----+----+----+ --> +----+----+   max(30,0,2,14)=30
| 0  | 8  | 2  | 14 |     | 15 | 16 |
+----+----+----+----+     +----+----+
| 15 | 2  | 3  | 16 |
+----+----+----+----+     Reduces 4x4 --> 2x2
| 1  | 6  | 7  | 8  |     (halves height and width)
+----+----+----+----+
Max Pooling: takes maximum --> preserves strongest features
Avg Pooling: computes average --> smoothing effect
Global Average Pooling: average over entire feature map --> single value per channel
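The 2x2 max pooling shown above fits in a short comprehension (a sketch equivalent to PyTorch's `nn.MaxPool2d(2)` for a single channel):

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 over a 2D feature map."""
    return [
        [max(fmap[i][j], fmap[i][j + 1],
             fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

fmap = [[12, 20, 30, 0],
        [0, 8, 2, 14],
        [15, 2, 3, 16],
        [1, 6, 7, 8]]
print(max_pool_2x2(fmap))  # [[20, 30], [15, 16]]
```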
3.4 Batch Normalization
Batch Normalization (BatchNorm) normalizes the output of each layer to have zero mean and unit variance. This stabilizes training, allows higher learning rates, and acts as a mild regularizer. In practice, a BatchNorm layer is placed after each convolution, before the activation function.
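At its core, the normalization subtracts the batch mean and divides by the batch standard deviation. A minimal sketch of that computation (omitting the learnable scale/shift parameters gamma and beta, and the running statistics used at inference):

```python
import math

def batch_norm_1d(values, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance.
    eps avoids division by zero for near-constant activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

normalized = batch_norm_1d([2.0, 4.0, 6.0, 8.0])
print(normalized)  # roughly [-1.34, -0.45, 0.45, 1.34]: zero mean, unit variance
```

In a real BatchNorm2d layer this runs per channel across the whole batch and spatial dimensions, followed by the learned affine transform.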
4. Typical CNN Architecture
A standard CNN follows a recurring pattern: feature extraction blocks (convolution + activation + pooling) followed by fully connected layers for final classification. As depth increases, feature maps shrink in spatial dimensions but grow in channel count, capturing increasingly abstract patterns.
Input Image (3 x 32 x 32)
|
v
[Conv2d 3->32, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 32 x 16 x 16
v
[Conv2d 32->64, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 64 x 8 x 8
v
[Conv2d 64->128, 3x3, pad=1] --> [BatchNorm] --> [ReLU] --> [MaxPool 2x2]
| Feature maps: 128 x 4 x 4
v
[Flatten] --> Vector of 128 * 4 * 4 = 2048 values
|
v
[Linear 2048 -> 256] --> [ReLU] --> [Dropout 0.5]
|
v
[Linear 256 -> 10] --> Output: 10 classes (e.g., CIFAR-10)
|
v
[Softmax] --> Probabilities per class: [0.02, 0.01, 0.85, ...]
Dimension flow:
(3, 32, 32) -> (32, 16, 16) -> (64, 8, 8) -> (128, 4, 4) -> (2048) -> (256) -> (10)
[image] [edges,texture] [parts] [objects] [decision] [class]
Hierarchy of Learned Features
- Layers 1-2 (low level): Edges, color gradients, simple textures
- Layers 3-5 (mid level): Corners, contours, object parts (eyes, wheels)
- Layers 6+ (high level): Complete objects, scenes, abstract concepts
This hierarchy emerges automatically during training. There is no need to tell the network what to look for: the filters adapt to the data.
5. Evolution of CNN Architectures
The history of CNNs is marked by groundbreaking architectures, each introducing ideas that reshaped the field. Understanding this evolution is essential for making informed architectural decisions today.
CNN Architecture Timeline
| Year | Architecture | Key Innovation | Top-1 ImageNet |
|---|---|---|---|
| 1998 | LeNet-5 | First practical CNN (digit recognition) | N/A |
| 2012 | AlexNet | GPU training, ReLU, Dropout | 63.3% |
| 2014 | VGGNet | Deep networks with uniform 3x3 filters | 74.5% |
| 2014 | GoogLeNet/Inception | Inception modules, parallel multi-scale | 74.8% |
| 2015 | ResNet | Skip connections, 152+ layer networks | 78.6% |
| 2019 | EfficientNet | Compound scaling (depth+width+resolution) | 84.4% |
| 2022 | ConvNeXt | Modernized CNN inspired by Vision Transformers | 87.8% |
5.1 LeNet-5 (1998) - The Pioneer
Designed by Yann LeCun for handwritten digit recognition (MNIST), LeNet-5 was the first CNN to achieve practical success. With just 5 layers and 60,000 parameters, it proved that convolutions could learn discriminative features from images. It was used to automatically read bank checks.
5.2 AlexNet (2012) - The Revolution
AlexNet won the ImageNet 2012 competition by a massive margin, reducing the error rate from 26% to 16%. The key innovations: GPU training (two NVIDIA GTX 580s), ReLU activation instead of sigmoid, Dropout for regularization, and data augmentation. This result convinced both academia and industry that deep learning worked.
5.3 VGGNet (2014) - Depth Matters
VGG demonstrated that deeper networks yield better results. Its key insight is radical in its simplicity: use only stacked 3x3 filters. Two consecutive 3x3 layers have the same receptive field as a single 5x5 layer, but with fewer parameters and more non-linearity. VGG-16 has 16 layers and 138 million parameters.
5.4 EfficientNet (2019) - Smart Scaling
EfficientNet introduced compound scaling: instead of increasing only depth (like VGG) or width, it uniformly scales all three dimensions (depth, width, input resolution) with balanced coefficients. EfficientNet-B0 achieves 77.1% accuracy on ImageNet with just 5.3 million parameters, an unprecedented accuracy-to-parameter ratio.
5.5 ConvNeXt (2022) - The Modernized CNN
ConvNeXt demonstrates that CNNs, modernized with techniques inspired by Vision Transformers, can compete with (and outperform) transformer architectures. Key design choices include large 7x7 depthwise convolutions, LayerNorm replacing BatchNorm, GELU activation, and a hierarchical design whose stages progressively widen the channel dimension. ConvNeXt-Tiny reaches 82.1% Top-1 accuracy on ImageNet at 4.5 GFLOPs, making the family a strong choice for efficient deployment scenarios.
6. ResNet and Skip Connections
ResNet (Residual Networks), proposed by He et al. in 2015, solved one of the fundamental problems in deep learning: the degradation problem. Before ResNet, adding layers to a deep network worsened results rather than improving them, even on the training set. The solution is as elegant as it is simple.
6.1 The Vanishing Gradient Problem
During backpropagation, gradients are multiplied repeatedly through the network layers. If these multiplications yield values less than 1, the gradients "vanish" exponentially as they propagate toward the initial layers. With 50 or more layers, gradients become so small that early layers stop learning altogether. Empirical measurements show this directly: without skip connections the L2 norm of the gradient collapses in the early layers, while with skip connections it stays roughly uniform throughout the network.
Deep network WITHOUT skip connections (50 layers):
Layer 50 Layer 49 Layer 48 ... Layer 2 Layer 1
grad=1.0 * 0.8 * 0.8 ... * 0.8 * 0.8
Gradient at layer 1: 0.8^49 ≈ 0.00002 --> nearly zero!
Early layers DO NOT learn.
Deep network WITH skip connections (ResNet):
Gradients have a "direct path" through skip connections.
They are not repeatedly multiplied by small values.
Gradient at layer 1: ~0.5 --> layers learn normally!
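The numbers above are easy to check. The sketch below contrasts a chain of small per-layer factors with the near-identity factors a residual path provides (the magnitudes are illustrative, not measured gradients):

```python
layers = 49

# Without skip connections: the gradient is a product of per-layer
# factors; anything below 1 decays exponentially with depth.
factor_plain = 0.8
grad_plain = factor_plain ** layers
print(f"{grad_plain:.6f}")  # 0.000018 -> early layers barely learn

# With skip connections the local Jacobian is (1 + dF/dx): when the
# residual branch F contributes little, each factor stays near 1.
residual_contribution = 0.01
grad_residual = (1 + residual_contribution) ** layers
print(f"{grad_residual:.2f}")  # 1.63 -> the signal survives the full depth
```

The identity path keeps each multiplicative factor close to 1, which is why the gradient magnitude no longer collapses with depth.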
6.2 The Solution: Residual Learning
The genius of ResNet is simple: instead of having a block learn the complete transformation
H(x), have it learn only the difference (residual) from the
input: F(x) = H(x) - x. The block output becomes y = F(x) + x,
where x "skips" the block via a skip connection
(or shortcut connection).
Standard Block: Residual Block:
x x ------+
| | |
v v | (skip connection)
[Conv 3x3] [Conv 3x3] |
| | |
[BatchNorm] [BatchNorm] |
| | |
[ReLU] [ReLU] |
| | |
[Conv 3x3] [Conv 3x3] |
| | |
[BatchNorm] [BatchNorm] |
| | |
v v |
H(x) = output F(x) + x <--+
|
[ReLU]
|
v
output
If F(x) = 0, the output is simply x (identity).
The network can "skip" a block if it is not needed.
This makes training of deep networks stable.
Why It Works
Learning the residual F(x) = 0 (i.e., "do nothing") is much easier
than learning a complete identity transformation. If a layer is not useful, the network
simply learns F(x) = 0 and passes the input unchanged. This allows
building networks with hundreds of layers without performance degradation.
7. Training a CNN
Training a CNN means finding the optimal values for all kernels (filters) and fully connected layer weights. This happens through an iterative process of forward pass, loss computation, and gradient backpropagation.
7.1 Loss Function
For image classification, the standard loss function is Cross-Entropy Loss. It measures how much the predicted probabilities deviate from the true labels. A perfect prediction yields loss = 0; a completely wrong prediction yields loss approaching infinity.
True label (one-hot): [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] (class "cat" = index 2)
Good prediction: [0.02, 0.03, 0.85, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.01]
Loss = -log(0.85) = 0.16 --> Low loss, correct prediction
Bad prediction: [0.30, 0.25, 0.05, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
Loss = -log(0.05) = 3.00 --> High loss, wrong prediction
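The two loss values above can be verified directly (a minimal sketch of cross-entropy for a one-hot label; frameworks like PyTorch compute it from raw logits for numerical stability):

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss for a one-hot label: just the negative
    log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

good = [0.02, 0.03, 0.85, 0.02, 0.01, 0.02, 0.01, 0.02, 0.01, 0.01]
bad = [0.30, 0.25, 0.05, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]

print(f"{cross_entropy(good, 2):.2f}")  # 0.16 -> low loss
print(f"{cross_entropy(bad, 2):.2f}")   # 3.00 -> high loss
```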
7.2 Optimizers
The optimizer updates network weights in the direction that reduces the loss. The most commonly used optimizers are:
Optimizers Compared
| Optimizer | Characteristics | When to Use |
|---|---|---|
| SGD + Momentum | Simple, robust, stable convergence | Long training, maximum final accuracy |
| Adam | Adaptive learning rate, fast convergence | Prototyping, small/medium networks |
| AdamW | Adam with corrected weight decay | Modern standard, recommended for CNNs |
7.3 Data Augmentation
Data augmentation is a fundamental technique for preventing overfitting and improving network generalization. It consists of applying random transformations to training images (rotations, flips, crops, brightness changes) to create synthetic variations without collecting new data.
Original Image: Transforms:
+-------+ [Horizontal flip] --> Mirrored image
| Cat | [Rotation +-15 deg] --> Slight rotation
| --o-- | [Random Crop] --> Random crop
| /|\ | [Color Jitter] --> Color variation
+-------+ [Gaussian Noise] --> Added noise
[Cutout/Erasing] --> Random rectangular mask
[MixUp] --> Weighted average of 2 images
[CutMix] --> Patch from one image overlaid
Effect: from 50,000 training images, each epoch sees different
variations, as if you had millions of unique images.
8. Complete PyTorch Implementation
Let us move from theory to practice. We will implement a complete CNN to classify images from the CIFAR-10 dataset (60,000 images at 32x32 in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
8.1 Setup and Dataset
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Training transforms with data augmentation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],  # CIFAR-10 mean
        std=[0.2470, 0.2435, 0.2616]    # CIFAR-10 std
    ),
])

# Test transforms (no augmentation)
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    ),
])

# Load datasets
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=train_transform
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=test_transform
)

# DataLoader with batching and shuffling
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=4)

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
8.2 Model Definition
class ResidualBlock(nn.Module):
    """Residual block with skip connection."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(
            in_channels, out_channels, kernel_size=3,
            stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3,
            stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Skip connection: if dimensions change, use a 1x1 conv projection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity  # Skip connection
        out = self.relu(out)
        return out


class CIFAR10CNN(nn.Module):
    """CNN with residual blocks for CIFAR-10 classification."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # First convolutional layer
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(32)
        self.relu = nn.ReLU(inplace=True)
        # Residual blocks with increasing depth
        self.layer1 = self._make_layer(32, 64, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(64, 128, num_blocks=2, stride=2)
        self.layer3 = self._make_layer(128, 256, num_blocks=2, stride=2)
        # Final classifier
        self.global_avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(256, num_classes)

    def _make_layer(
        self, in_ch: int, out_ch: int, num_blocks: int, stride: int
    ) -> nn.Sequential:
        layers = [ResidualBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_ch, out_ch, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.conv1(x)))  # (B, 32, 32, 32)
        x = self.layer1(x)                      # (B, 64, 32, 32)
        x = self.layer2(x)                      # (B, 128, 16, 16)
        x = self.layer3(x)                      # (B, 256, 8, 8)
        x = self.global_avg_pool(x)             # (B, 256, 1, 1)
        x = x.view(x.size(0), -1)               # (B, 256)
        x = self.dropout(x)
        x = self.fc(x)                          # (B, 10)
        return x
8.3 Training Loop
def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    test_loader: DataLoader,
    epochs: int = 50,
    lr: float = 0.01
) -> dict:
    """Train the model and return training history."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(
        model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4
    )
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs,
        steps_per_epoch=len(train_loader)
    )

    history = {'train_loss': [], 'test_loss': [], 'test_acc': []}

    for epoch in range(epochs):
        # --- Training ---
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()

            running_loss += loss.item()

        avg_train_loss = running_loss / len(train_loader)

        # --- Evaluation ---
        model.eval()
        test_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                test_loss += loss.item()

                _, predicted = outputs.max(1)
                total += labels.size(0)
                correct += predicted.eq(labels).sum().item()

        avg_test_loss = test_loss / len(test_loader)
        accuracy = 100.0 * correct / total

        history['train_loss'].append(avg_train_loss)
        history['test_loss'].append(avg_test_loss)
        history['test_acc'].append(accuracy)

        print(
            f"Epoch [{epoch+1}/{epochs}] "
            f"Train Loss: {avg_train_loss:.4f} | "
            f"Test Loss: {avg_test_loss:.4f} | "
            f"Accuracy: {accuracy:.2f}%"
        )

    return history


# Run training
model = CIFAR10CNN(num_classes=10)
history = train_model(model, train_loader, test_loader, epochs=50, lr=0.1)

# Expected output after 50 epochs with OneCycleLR:
# Test Accuracy: ~92-93%
Recommended Training Parameters for CIFAR-10
| Parameter | Value | Rationale |
|---|---|---|
| Batch size | 128 | Balance between speed and gradient stability |
| Learning rate | 0.1 (with OneCycleLR) | SGD with scheduling achieves higher accuracy |
| Weight decay | 1e-4 | L2 regularization to prevent overfitting |
| Epochs | 50 | With OneCycleLR, 50 epochs suffice for convergence |
| Dropout | 0.3 | Additional regularization before the final layer |
9. Transfer Learning
In practice, you rarely train a CNN from scratch. Transfer learning allows you to reuse models pre-trained on large datasets (such as ImageNet with 1.2 million images and 1,000 classes) and adapt them to your specific problem. This drastically reduces training time, data requirements, and improves performance.
9.1 Feature Extraction vs Fine-Tuning
Two Transfer Learning Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Feature Extraction | Freeze all pre-trained layers, train only the final classifier | Limited data (<1,000 images), domain similar to ImageNet |
| Fine-Tuning | Unfreeze the last N layers and retrain with a low learning rate | More data available, domain different from ImageNet |
import torchvision.models as models


def create_transfer_model(
    num_classes: int,
    freeze_backbone: bool = True
) -> nn.Module:
    """Create a model with transfer learning from ResNet-18."""
    # Load ResNet-18 pre-trained on ImageNet
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Strategy 1: Feature Extraction (freeze backbone)
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False

    # Replace the final classifier
    num_features = model.fc.in_features  # 512 for ResNet-18
    model.fc = nn.Sequential(
        nn.Dropout(0.3),
        nn.Linear(num_features, 256),
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(256, num_classes)
    )
    return model


# Feature extraction: only the classifier is trained
feature_model = create_transfer_model(num_classes=10, freeze_backbone=True)

# Fine-tuning: the last block and the classifier are retrained
finetune_model = create_transfer_model(num_classes=10, freeze_backbone=False)

# For fine-tuning, use a lower learning rate on the backbone layers
optimizer = optim.AdamW([
    {'params': finetune_model.layer4.parameters(), 'lr': 1e-4},
    {'params': finetune_model.fc.parameters(), 'lr': 1e-3},
], weight_decay=1e-2)
Transfer Learning Guide:
Data similar to ImageNet Data different from ImageNet
(objects, animals, scenes) (medical, satellite, micro)
+----------------------------+----------------------------+
Few data | Feature Extraction | Feature Extraction |
(<1,000 img) | Freeze all, train FC | + Increase data augment. |
+----------------------------+----------------------------+
Lots of data | Fine-tune last layers | Full fine-tuning |
(>5,000 img) | Low LR for backbone | or train from scratch |
+----------------------------+----------------------------+
10. Evaluation Metrics
Accuracy alone is not enough to determine whether a CNN performs well. With imbalanced datasets (e.g., 95% class A, 5% class B), a model that always predicts "class A" achieves 95% accuracy but is completely useless. More granular metrics are needed.
Image Classification Metrics
| Metric | What It Measures | Formula |
|---|---|---|
| Accuracy | Percentage of correct predictions overall | (TP + TN) / Total |
| Precision | Among positive predictions, how many are correct? | TP / (TP + FP) |
| Recall | Among actual positives, how many were found? | TP / (TP + FN) |
| F1 Score | Harmonic mean of precision and recall | 2 * (P * R) / (P + R) |
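The formulas translate directly to code. The sketch below computes them for a hypothetical binary split of the "cat" class (the counts are made up for illustration):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "cat" class:
# 78 cats found, 9 non-cats mislabeled as cat, 22 cats missed.
p, r, f1 = precision_recall_f1(tp=78, fp=9, fn=22)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.90 recall=0.78 f1=0.83
```

Note how precision and recall diverge: the model is trustworthy when it says "cat" (0.90) but misses more than a fifth of the actual cats (0.78).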
Confusion Matrix (predictions vs reality):
Predicted: plane car bird cat deer dog frog horse ship truck
Actual:
airplane [92] 1 2 0 0 0 1 0 3 1
automobile 0 [95] 0 0 0 0 0 0 1 4
bird 3 0 [85] 3 2 2 3 1 1 0
cat 0 1 2 [78] 1 12 3 2 0 1
deer 1 0 3 2 [88] 1 2 3 0 0
Interpretation:
- Diagonal = correct predictions (higher = better)
- Off-diagonal = errors (confusions between classes)
- cat vs dog: 12 cats classified as dogs --> "confused" classes
from sklearn.metrics import (
    classification_report,
    confusion_matrix
)
import numpy as np


def evaluate_model(
    model: nn.Module,
    test_loader: DataLoader,
    device: torch.device,
    class_names: list[str]
) -> dict:
    """Evaluate the model and return detailed metrics."""
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.numpy())

    preds_array = np.array(all_preds)
    labels_array = np.array(all_labels)

    # Per-class detailed report
    report = classification_report(
        labels_array, preds_array,
        target_names=class_names, output_dict=True
    )
    # Confusion matrix
    cm = confusion_matrix(labels_array, preds_array)
    # Overall accuracy
    accuracy = np.mean(preds_array == labels_array) * 100

    print(f"Overall Accuracy: {accuracy:.2f}%")
    print(classification_report(
        labels_array, preds_array, target_names=class_names
    ))
    return {'accuracy': accuracy, 'report': report, 'confusion_matrix': cm}
11. Deployment: From Trained Model to Production
Training a model is only half the work. To bring it to production you need to export it in an optimized format, create an inference API, and containerize everything for scalable deployment.
11.1 Model Export
Export Formats
| Format | Use Case | Advantages |
|---|---|---|
| TorchScript | PyTorch inference without Python | No Python dependency, complete serialization |
| ONNX | Universal format, multi-framework | Compatible with TensorRT, OpenVINO, CoreML |
| TensorRT | Optimized inference for NVIDIA GPUs | Up to 5x faster than native PyTorch |
import torch.onnx


def export_model(model: nn.Module, export_path: str) -> None:
    """Export the model to TorchScript and ONNX."""
    model.eval()
    dummy_input = torch.randn(1, 3, 32, 32)

    # --- TorchScript (traced: safe here because forward has no
    #     data-dependent control flow; otherwise use torch.jit.script) ---
    traced_model = torch.jit.trace(model, dummy_input)
    traced_model.save(f"{export_path}/model_scripted.pt")
    print("TorchScript saved.")

    # --- ONNX ---
    torch.onnx.export(
        model,
        dummy_input,
        f"{export_path}/model.onnx",
        input_names=['image'],
        output_names=['prediction'],
        dynamic_axes={
            'image': {0: 'batch_size'},
            'prediction': {0: 'batch_size'}
        },
        opset_version=17
    )
    print("ONNX saved.")


export_model(model, './exports')
11.2 Inference API with FastAPI
# inference_api.py
from fastapi import FastAPI, UploadFile
from PIL import Image
import torch
import torchvision.transforms as transforms
import io

app = FastAPI(title="CNN Image Classifier")

# Load TorchScript model
model = torch.jit.load("./exports/model_scripted.pt")
model.eval()

CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    ),
])


@app.post("/predict")
async def predict(file: UploadFile):
    """Classify an uploaded image."""
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    tensor = preprocess(image).unsqueeze(0)

    with torch.no_grad():
        outputs = model(tensor)
        probabilities = torch.softmax(outputs, dim=1)
        confidence, predicted = probabilities.max(1)

    return {
        "class": CLASSES[predicted.item()],
        "confidence": round(confidence.item() * 100, 2),
        "all_probabilities": {
            name: round(prob.item() * 100, 2)
            for name, prob in zip(CLASSES, probabilities[0])
        }
    }

# Run: uvicorn inference_api:app --host 0.0.0.0 --port 8000
11.3 Containerization with Docker
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY exports/ ./exports/
COPY inference_api.py .
EXPOSE 8000
CMD ["uvicorn", "inference_api:app", "--host", "0.0.0.0", "--port", "8000"]
# Build: docker build -t cnn-classifier .
# Run: docker run -p 8000:8000 cnn-classifier
# Test: curl -X POST -F "file=@cat.jpg" http://localhost:8000/predict
Pipeline: Training --> Export --> Container --> Deploy
[Training] [Export] [Container] [Deploy]
PyTorch + GPU --> TorchScript/ONNX --> Docker --> Kubernetes
50 epochs ~10 MB file FastAPI Auto-scaling
~92% accuracy Optimized Health checks Load balancer
No Python dependency GPU optional Monitoring
Serverless alternative:
Export ONNX --> AWS Lambda + ONNX Runtime --> API Gateway
Pro: Pay-per-use, zero infrastructure to manage
Con: Cold start (~2s), 250MB package limit
Conclusions and Next Steps
In this article we built a comprehensive understanding of Convolutional Neural Networks, starting from fundamentals (how a computer sees images, the convolution operation) through practical implementation (CNN with residual blocks in PyTorch) to production deployment (TorchScript, ONNX, FastAPI, Docker).
We traced how the evolution of architectures, from LeNet-5 in 1998 to ConvNeXt in 2022, progressively improved performance through ideas like skip connections (ResNet), compound scaling (EfficientNet), and transformer-inspired design (ConvNeXt).
Key Takeaways
- CNNs exploit locality, weight sharing, and spatial invariance to process images efficiently
- The standard architecture follows the pattern: Conv + BatchNorm + ReLU + Pooling, repeated with increasing depth
- Skip connections (ResNet) are essential for training deep networks without vanishing gradients
- Transfer learning is almost always preferable to training from scratch, especially with limited data
- Data augmentation is critical for generalization and costs nothing in terms of data collection
- For production, export to ONNX or TorchScript and containerize with Docker
In the next article we will dive deep into Transfer Learning and Fine-Tuning: how to choose the right pre-trained model, progressive fine-tuning strategies, domain adaptation, and advanced techniques like knowledge distillation. In the third article we will tackle Object Detection with YOLO, the most widely used real-time object detection system in industry.
Additional Resources
- Original ResNet paper: "Deep Residual Learning for Image Recognition" (He et al., 2015)
- ConvNeXt paper: "A ConvNet for the 2020s" (Liu et al., 2022)
- EfficientNet paper: "EfficientNet: Rethinking Model Scaling for CNNs" (Tan & Le, 2019)
- PyTorch Documentation: Tutorials on CNNs and torchvision
- CS231n Stanford: Convolutional Neural Networks for Visual Recognition (online course)
- ONNX Runtime: Documentation for optimized multi-platform inference