Object Detection vs Segmentation: Comparison and Use Cases
When tackling a computer vision problem, choosing the right task and architecture is fundamental. Object Detection, Semantic Segmentation, Instance Segmentation, and Panoptic Segmentation are not interchangeable alternatives: each answers different questions, has different computational requirements, and suits specific use cases. Picking the wrong approach means wasting resources or, worse, failing to solve the actual problem.
In this article we will rigorously compare the main computer vision tasks, with practical PyTorch implementations and concrete guidelines for selecting the right approach in your project.
What You Will Learn
- Fundamental differences between detection, semantic, instance, and panoptic segmentation
- When to use which approach: practical decision tree
- Main architectures for each task and their tradeoffs
- Complete multi-task pipeline implementation in PyTorch
- Evaluation metrics for each task (mAP, mIoU, PQ)
- Speed and accuracy benchmarks on real hardware
- Case studies: autonomous vehicles, medical imaging, retail analytics
1. The Main Computer Vision Tasks
Before comparing approaches, let us precisely define each task with visual examples:
Input image: a street with 3 people and 2 cars
┌─────────────────────────────────────────────────────────────────┐
│ IMAGE CLASSIFICATION: "street with vehicles and people" │
│ Output: 1 label for the entire image │
│ Does NOT say WHERE or HOW MANY objects there are │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ OBJECT DETECTION: 5 bounding boxes │
│ [person(0.95) x1,y1,x2,y2] │
│ [person(0.88) x1,y1,x2,y2] │
│ [person(0.91) x1,y1,x2,y2] │
│ [car(0.97) x1,y1,x2,y2] │
│ [car(0.94) x1,y1,x2,y2] │
│ Knows WHERE and HOW MANY, but not the precise shape │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SEMANTIC SEGMENTATION: every pixel has a class label │
│ pixel(100,200)="person", pixel(300,400)="car" │
│ Knows the EXACT SHAPE but does NOT separate instances │
│ All "people" = same category, not separate identities │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ INSTANCE SEGMENTATION: mask per object instance │
│ person_1 = {pixels: (100,200),(101,200),...} │
│ person_2 = {pixels: (250,180),(251,180),...} │
│ Knows SHAPE and distinguishes SEPARATE INSTANCES │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ PANOPTIC SEGMENTATION: semantic + instance combined │
│ "things" (countable): instance per person and car │
│ "stuff" (uncountable): semantic for road, sky, buildings │
│ Knows EVERYTHING: shape, class, identity, background │
└─────────────────────────────────────────────────────────────────┘
1.1 Detailed Technical Comparison
Computer Vision Task Comparison
| Task | Output | Complexity | Speed | GPU Memory | Metric |
|---|---|---|---|---|---|
| Classification | Label + prob | Low | Very high | Low | Top-1/5 Acc |
| Object Detection | BBox + label | Medium | High | Medium | mAP@0.5 |
| Semantic Seg. | Pixel-label map | Medium-High | Medium | High | mIoU |
| Instance Seg. | BBox + mask | High | Low-Medium | High | mAP@mask |
| Panoptic Seg. | Everything | Very high | Low | Very high | PQ |
2. Object Detection: Architectures and Implementation
2.1 Single-Stage vs Two-Stage Detectors
Single-Stage vs Two-Stage Comparison
| Feature | Single-Stage (YOLO, SSD, RetinaNet) | Two-Stage (Faster R-CNN, Mask R-CNN) |
|---|---|---|
| Pipeline | Single network, direct prediction | RPN proposes regions, then classifies |
| Speed | High (30-150+ FPS) | Low (5-15 FPS) |
| Accuracy | Slightly lower on small objects | Better, especially for small objects |
| Typical use | Real-time, edge, video | Offline analysis, maximum precision |
| Modern examples | YOLO26, RT-DETR, DINO-DETR | Faster R-CNN, Cascade R-CNN, DETR |
```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_Weights
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor


def create_faster_rcnn(num_classes: int) -> torch.nn.Module:
    """
    Faster R-CNN with pre-trained ResNet-50 + FPN backbone.
    Two-stage: RPN (Region Proposal Network) + classifier.
    """
    model = fasterrcnn_resnet50_fpn(
        weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    )
    # Replace the box predictor for a custom number of classes.
    # +1 because class 0 is reserved for "background".
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes + 1)
    return model


def train_faster_rcnn(model, data_loader, num_epochs: int = 10, lr: float = 0.005):
    """Faster R-CNN training loop with built-in loss computation."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
    for epoch in range(num_epochs):
        total_loss = 0.0
        for images, targets in data_loader:
            images = [img.to(device) for img in images]
            targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
            # In training mode, Faster R-CNN returns a dict of losses
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            optimizer.zero_grad()
            losses.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            total_loss += losses.item()
        scheduler.step()
        print(f"Epoch {epoch+1}/{num_epochs} | Avg Loss: {total_loss/len(data_loader):.4f}")
```
3. Semantic Segmentation with DeepLabv3
Semantic segmentation assigns a class label to every single pixel in the image. It does not distinguish instances: all "people" belong to the same class. Ideal for full scene analysis (autonomous driving, medical analysis, remote sensing).
DeepLabv3 (Chen et al., 2017) uses atrous convolutions (dilated convolutions): convolutions with "holes" that increase the receptive field without increasing parameters, essential for capturing multi-scale context without reducing resolution.
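The effect of dilation is easy to verify directly: a 3x3 kernel with `dilation=2` samples a 5x5 window while keeping the same nine weights per channel pair and, with matching padding, the same output resolution. A quick check:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)                # 3x3 receptive field
atrous = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field

# Identical parameter count and output size; only the sampling grid widens
assert conv.weight.numel() == atrous.weight.numel()
print(conv(x).shape, atrous(x).shape)  # both torch.Size([1, 8, 32, 32])
```

Stacking such layers with increasing dilation rates is exactly how the ASPP module gathers context at several scales without downsampling.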
```python
import torch
import torch.nn as nn
import torchvision.models.segmentation as seg_models
from torchvision.models.segmentation import DeepLabV3_ResNet50_Weights


def create_deeplabv3(num_classes: int) -> nn.Module:
    """
    DeepLabv3 with pre-trained ResNet-50 backbone.
    Uses ASPP (Atrous Spatial Pyramid Pooling) for multi-scale context.
    """
    model = seg_models.deeplabv3_resnet50(
        weights=DeepLabV3_ResNet50_Weights.DEFAULT
    )
    # Replace the final classifiers for a custom num_classes
    model.classifier[-1] = nn.Conv2d(256, num_classes, kernel_size=1)
    model.aux_classifier[-1] = nn.Conv2d(256, num_classes, kernel_size=1)
    return model


def train_semantic_segmentation(model, data_loader, num_epochs: int = 20):
    """Training loop for semantic segmentation (mIoU optimization)."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    criterion = nn.CrossEntropyLoss(ignore_index=255)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=num_epochs)
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for images, masks in data_loader:
            images = images.to(device)
            masks = masks.long().to(device)  # [B, H, W], values 0..num_classes-1
            outputs = model(images)
            main_loss = criterion(outputs['out'], masks)
            aux_loss = criterion(outputs['aux'], masks) * 0.4
            loss = main_loss + aux_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        scheduler.step()
        miou = compute_miou(model, data_loader, device)
        print(f"Epoch {epoch+1}/{num_epochs} | Loss: {total_loss/len(data_loader):.4f} | mIoU: {miou:.3f}")


def compute_miou(model, data_loader, device, num_classes: int = 21) -> float:
    """Computes Mean IoU - the standard metric for semantic segmentation."""
    model.eval()
    intersection = torch.zeros(num_classes, device=device)
    union = torch.zeros(num_classes, device=device)
    with torch.no_grad():
        for images, masks in data_loader:
            images, masks = images.to(device), masks.long().to(device)
            preds = model(images)['out'].argmax(dim=1)
            for cls in range(num_classes):
                pred_cls = preds == cls
                true_cls = masks == cls
                intersection[cls] += (pred_cls & true_cls).sum()
                union[cls] += (pred_cls | true_cls).sum()
    iou = intersection / (union + 1e-10)
    return float(iou[union > 0].mean())
```
4. Instance Segmentation with Mask R-CNN
Instance segmentation combines object detection (bounding box + class) with pixel-level segmentation for each individual instance. Each object has its own independent binary mask. Mask R-CNN (He et al., 2017) extends Faster R-CNN by adding a third parallel "head" for mask prediction.
```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection import MaskRCNN_ResNet50_FPN_Weights
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor


def create_mask_rcnn(num_classes: int) -> torch.nn.Module:
    """
    Mask R-CNN: Faster R-CNN + mask head.
    Output per instance: bbox + class + 28x28 binary mask.
    """
    model = maskrcnn_resnet50_fpn(
        weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT
    )
    # Replace the box predictor
    in_features_box = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features_box, num_classes + 1)
    # Replace the mask predictor
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask, 256, num_classes + 1
    )
    return model


def prepare_instance_target(boxes: list, labels: list, masks: list) -> dict:
    """
    Prepares the target dict required by Mask R-CNN.
    masks: list of boolean arrays [H, W], one per instance.
    """
    return {
        'boxes': torch.tensor(boxes, dtype=torch.float32),
        'labels': torch.tensor(labels, dtype=torch.int64),
        # Stack the per-instance masks into a single [N, H, W] uint8 tensor
        'masks': torch.stack([torch.as_tensor(m, dtype=torch.uint8) for m in masks])
    }
```
5. Decision Tree: Which Task to Choose?
Question: "What do I need to know about the image?"
|
├─ Just "what objects are present"?
│ └── IMAGE CLASSIFICATION
│ Architectures: ResNet, EfficientNet, ViT
│ Examples: industrial quality gate, content moderation
│
├─ "Where are objects + how many"?
│ └── OBJECT DETECTION
│ │
│ ├─ Need real-time speed (>30 FPS)?
│ │ └── Single-Stage: YOLO26, RT-DETR
│ │
│ └─ Need maximum accuracy (small objects)?
│ └── Two-Stage: Faster R-CNN, DETR
│
├─ "Class of every pixel" (no instance separation)?
│ └── SEMANTIC SEGMENTATION
│ Architectures: DeepLabv3, FCN, SegFormer
│ Examples: road analysis, medical imaging, remote sensing
│
├─ "Separate each object + exact shape"?
│ └── INSTANCE SEGMENTATION
│ Architectures: Mask R-CNN, SOLOv2, YOLACT
│ Examples: object counting, robotics, biology
│
└─ "Everything: separated objects + classified background"?
└── PANOPTIC SEGMENTATION
Architectures: Panoptic FPN, Mask2Former
Examples: full autonomous driving, scene understanding
Use Cases by Industry
| Industry | Detection | Semantic Seg. | Instance Seg. | Panoptic |
|---|---|---|---|---|
| Automotive | Pedestrian/vehicle detection | Road/lane segmentation | Separate each pedestrian | Full autonomous scene |
| Medical | Locate lesions in CT | Organ segmentation | Separate each tumor | Full anatomical analysis |
| Retail | Count shelf products | Map planogram zones | Identify each product | Complete shelf analysis |
| Industrial | Detect defects (bounding box) | Classify defective zone | Segment each defect | Full part inspection |
| Agriculture | Count fruits on tree | Segment vegetation | Separate each fruit | Complete field map |
6. Evaluation Metrics for Each Task
Each computer vision task uses specific metrics. Using the wrong metric leads to misleading conclusions: a model with high mAP for detection may have terrible mIoU for segmentation. Understanding these metrics is essential for comparing models and monitoring production systems.
```python
import torch
import numpy as np
from collections import defaultdict

# ---- mAP for Object Detection ----

def compute_ap(recall: np.ndarray, precision: np.ndarray) -> float:
    """
    Computes Average Precision (AP) as the area under the interpolated
    precision-recall curve (all-point interpolation, as in PASCAL VOC 2010+).
    mAP@0.5 = mean of per-class AP at IoU threshold 0.5.
    mAP@[0.5:0.95] = mean over thresholds [0.5, 0.55, ..., 0.95].
    """
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
    # Find recall change points
    i = np.where(mrec[1:] != mrec[:-1])[0]
    # Area under the P-R curve
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return float(ap)


def bbox_iou(box1: np.ndarray, box2: np.ndarray) -> float:
    """
    IoU between two boxes in [x1, y1, x2, y2] format.
    Used to match predictions to ground truth boxes.
    """
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    return intersection / (union + 1e-10)


def compute_map(predictions: list, ground_truths: list,
                iou_threshold: float = 0.5,
                num_classes: int = 80) -> float:
    """
    Compute mAP@IoU_threshold over all classes.
    predictions: list of [image_id, class_id, confidence, x1, y1, x2, y2]
    ground_truths: list of [image_id, class_id, x1, y1, x2, y2]
    """
    # Group by class
    class_preds = defaultdict(list)
    class_gts = defaultdict(list)
    for pred in predictions:
        img_id, cls, conf, *box = pred
        class_preds[cls].append((img_id, conf, box))
    for gt in ground_truths:
        img_id, cls, *box = gt
        class_gts[cls].append((img_id, box))
    aps = []
    for cls in range(num_classes):
        if cls not in class_gts:
            continue
        # Sort predictions by confidence (descending)
        preds = sorted(class_preds[cls], key=lambda x: -x[1])
        gt_by_image = defaultdict(list)
        for img_id, box in class_gts[cls]:
            gt_by_image[img_id].append({'box': box, 'matched': False})
        tp = np.zeros(len(preds))
        fp = np.zeros(len(preds))
        for i, (img_id, conf, pred_box) in enumerate(preds):
            best_iou = 0.0
            best_idx = -1
            for j, gt in enumerate(gt_by_image[img_id]):
                iou = bbox_iou(np.array(pred_box), np.array(gt['box']))
                if iou > best_iou:
                    best_iou = iou
                    best_idx = j
            # A prediction is a TP only if it clears the IoU threshold and
            # its best-matching ground truth has not been claimed already
            if best_iou >= iou_threshold and not gt_by_image[img_id][best_idx]['matched']:
                tp[i] = 1
                gt_by_image[img_id][best_idx]['matched'] = True
            else:
                fp[i] = 1
        cumtp = np.cumsum(tp)
        cumfp = np.cumsum(fp)
        n_gt = len(class_gts[cls])
        recall = cumtp / (n_gt + 1e-10)
        precision = cumtp / (cumtp + cumfp + 1e-10)
        aps.append(compute_ap(recall, precision))
    return float(np.mean(aps)) if aps else 0.0
```
```python
import torch

# ---- mIoU for Semantic Segmentation ----

def compute_miou_fast(preds: torch.Tensor, targets: torch.Tensor,
                      num_classes: int = 21,
                      ignore_index: int = 255) -> float:
    """
    Fast mIoU computation using a confusion matrix.
    preds: [B, H, W] - predicted class indices
    targets: [B, H, W] - ground truth class indices
    """
    mask = targets != ignore_index
    preds_flat = preds[mask].long()
    targets_flat = targets[mask].long()
    # Confusion matrix via bincount
    conf_matrix = torch.bincount(
        num_classes * targets_flat + preds_flat,
        minlength=num_classes ** 2
    ).reshape(num_classes, num_classes).float()
    intersection = conf_matrix.diag()
    union = conf_matrix.sum(1) + conf_matrix.sum(0) - intersection
    iou = intersection / (union + 1e-10)
    # Only average over classes that appear in the ground truth
    valid = conf_matrix.sum(1) > 0
    return float(iou[valid].mean())
```
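As a hand-worked sanity check of the per-class IoU definition these helpers rely on, consider a toy 2x2 image with two classes:

```python
import torch

# Ground truth and prediction for a 2x2 image with classes {0, 1}
targets = torch.tensor([[0, 1],
                        [1, 1]])
preds   = torch.tensor([[0, 1],
                        [0, 1]])

ious = []
for cls in range(2):
    inter = ((preds == cls) & (targets == cls)).sum().item()
    union = ((preds == cls) | (targets == cls)).sum().item()
    ious.append(inter / union)

# class 0: inter=1, union=2 -> 0.50 ; class 1: inter=2, union=3 -> 0.67
miou = sum(ious) / len(ious)
print(round(miou, 4))  # 0.5833
```

Note that a naive pixel accuracy on this example would be 3/4 = 0.75, higher than the mIoU: IoU penalizes false positives and false negatives per class, which is exactly why it is preferred for imbalanced segmentation data.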
7. Multi-Task Pipeline: Detection + Segmentation
In many real applications it makes sense to combine multiple tasks in a single architecture for computational efficiency. A practical example: in retail analytics we want to both localize products (detection) and segment the occupied shelf zone (semantic segmentation). A shared backbone reduces total compute by 40-60% compared to two separate models.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class MultiTaskModel(nn.Module):
    """
    Shared ResNet-50 + FPN backbone with two heads:
      - Detection head (anchor-free box prediction)
      - Semantic segmentation head (pixel-wise classification)

    FPN (Feature Pyramid Network) enables multi-scale feature extraction:
      P5 (1/32 resolution) -> large objects
      P4 (1/16 resolution) -> medium objects
      P3 (1/8  resolution) -> small objects
      P2 (1/4  resolution) -> segmentation (highest resolution)
    """
    def __init__(self, num_det_classes: int, num_seg_classes: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        self.layer1 = nn.Sequential(backbone.conv1, backbone.bn1,
                                    backbone.relu, backbone.maxpool,
                                    backbone.layer1)  # 1/4 resolution
        self.layer2 = backbone.layer2  # 1/8
        self.layer3 = backbone.layer3  # 1/16
        self.layer4 = backbone.layer4  # 1/32
        # FPN lateral connections (1x1 convolutions to normalize channels)
        self.fpn = nn.ModuleDict({
            'p5': nn.Conv2d(2048, 256, 1),
            'p4': nn.Conv2d(1024, 256, 1),
            'p3': nn.Conv2d(512, 256, 1),
            'p2': nn.Conv2d(256, 256, 1),
        })
        # Detection head on P3 (best for medium/small objects)
        self.det_head = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_det_classes * 5, 1)  # per class: 4 bbox coords + 1 objectness
        )
        # Segmentation head on P2 (highest resolution) with a small decoder
        self.seg_head = nn.Sequential(
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, num_seg_classes, 1)
        )

    def forward(self, x: torch.Tensor) -> dict:
        # Bottom-up pathway (backbone)
        c2 = self.layer1(x)   # 1/4
        c3 = self.layer2(c2)  # 1/8
        c4 = self.layer3(c3)  # 1/16
        c5 = self.layer4(c4)  # 1/32
        # Top-down pathway with lateral connections (FPN).
        # Interpolate to the lateral map's exact size so odd spatial
        # dimensions also work (scale_factor=2 would break on them).
        p5 = self.fpn['p5'](c5)
        p4 = self.fpn['p4'](c4) + F.interpolate(p5, size=c4.shape[-2:])
        p3 = self.fpn['p3'](c3) + F.interpolate(p4, size=c3.shape[-2:])
        p2 = self.fpn['p2'](c2) + F.interpolate(p3, size=c2.shape[-2:])
        # Task-specific heads
        det_out = self.det_head(p3)
        seg_out = self.seg_head(p2)
        seg_out = F.interpolate(seg_out, size=x.shape[-2:],
                                mode='bilinear', align_corners=False)
        return {'detection': det_out, 'segmentation': seg_out}


def compute_multitask_loss(outputs: dict, det_targets, seg_targets,
                           det_weight: float = 1.0,
                           seg_weight: float = 0.5) -> tuple:
    """
    Balanced multi-task loss.
    det_weight and seg_weight control the relative importance of each task.
    These values must be tuned: if the detection loss is 100x larger than
    the segmentation loss, increase seg_weight proportionally.
    """
    det_loss = nn.BCEWithLogitsLoss()(outputs['detection'], det_targets)
    seg_loss = nn.CrossEntropyLoss(ignore_index=255)(
        outputs['segmentation'], seg_targets
    )
    total = det_weight * det_loss + seg_weight * seg_loss
    return total, {'det': float(det_loss), 'seg': float(seg_loss)}
```
7.1 Advanced Multi-Task Loss Balancing
A common challenge in multi-task learning is loss scale imbalance: the detection loss might be 100x larger than the segmentation loss, causing the optimizer to focus entirely on detection and ignore segmentation. Uncertainty weighting (Kendall et al., 2018) automatically balances losses by learning task-specific uncertainty weights.
```python
import torch
import torch.nn as nn


class UncertaintyWeightedLoss(nn.Module):
    """
    Automatic loss balancing using learned uncertainty weights.
    From: "Multi-Task Learning Using Uncertainty to Weigh Losses"
    (Kendall et al., CVPR 2018)

    L_total = sum_i [ 1/(2*sigma_i^2) * L_i + log(sigma_i) ]

    log_sigma_i is a learnable parameter per task; it balances the
    losses automatically, without manual tuning.
    """
    def __init__(self, n_tasks: int):
        super().__init__()
        # Initialize log_sigma to 0 (sigma = 1, no initial weighting)
        self.log_sigmas = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses: list) -> tuple:
        assert len(losses) == len(self.log_sigmas)
        weighted_losses = []
        for loss, log_sigma in zip(losses, self.log_sigmas):
            # L_weighted = loss / (2 * sigma^2) + log(sigma)
            # Numerically stable: work with log_sigma instead of sigma
            precision = torch.exp(-2 * log_sigma)  # 1 / sigma^2
            weighted_losses.append(0.5 * precision * loss + log_sigma)
        total = sum(weighted_losses)
        weights = [float(torch.exp(-2 * ls)) for ls in self.log_sigmas]
        return total, weights


# Usage in training:
# uncertainty_loss = UncertaintyWeightedLoss(n_tasks=2)
# optimizer = torch.optim.AdamW(
#     list(model.parameters()) + list(uncertainty_loss.parameters()),
#     lr=1e-4
# )
# ...
# total, weights = uncertainty_loss([det_loss, seg_loss])
# print(f"Det weight: {weights[0]:.3f}, Seg weight: {weights[1]:.3f}")
```
8. Best Practices and Performance Benchmarks
COCO Benchmark (2025)
| Model | Task | mAP/mIoU | FPS (V100) | Params |
|---|---|---|---|---|
| YOLO26m | Detection | 57.2 mAP | 100+ | 25M |
| Faster R-CNN R50 | Detection | 40.2 mAP | 18 | 41M |
| DeepLabv3 R50 | Semantic Seg. | 74.3 mIoU | 45 | 39M |
| SegFormer-B5 | Semantic Seg. | 83.1 mIoU | 15 | 85M |
| Mask R-CNN R50 | Instance Seg. | 36.1 mAP | 14 | 44M |
| Mask2Former R50 | Panoptic | 51.9 PQ | 8 | 44M |
Common Design Mistakes
- Using segmentation when detection suffices: If you just need to count or localize objects, use detection. Segmentation is far more expensive to annotate and train.
- Ignoring real-time requirements: Mask R-CNN at 14 FPS is unacceptable for a live surveillance system. Always choose architecture based on latency requirements.
- Unbalanced datasets for segmentation: If one class covers 95% of pixels (e.g., background), the model will trivially predict it. Use weighted loss or class-balanced sampling.
- Confusing mIoU and mAP: These are different metrics. mIoU measures pixel-level overlap between predicted and ground-truth regions (segmentation); mAP measures ranked detection quality at the bounding-box level (detection).
- Multi-task without loss balancing: In multi-task architectures, loss values from different tasks can have very different scales. Use gradient normalization or uncertainty weighting.
Conclusions
We explored the full spectrum of computer vision tasks, from their fundamental differences to practical implementations, evaluation metrics, and advanced multi-task architectures:
- Classification, Detection, Semantic/Instance/Panoptic Segmentation each have distinct outputs, costs, and use cases - always choose based on the actual question being asked
- YOLO26 excels at real-time detection; Faster R-CNN at offline high accuracy where every false negative is costly
- DeepLabv3 and SegFormer for semantic segmentation; Mask R-CNN adds instance separation at the cost of speed
- mAP measures bounding box quality; mIoU measures pixel-level accuracy; PQ combines both for panoptic tasks
- Multi-task architectures with shared FPN backbone save 40-60% compute vs separate models
- Uncertainty-weighted loss (Kendall et al.) automatically balances multi-task training without manual tuning
- The decision tree guides selection of the right approach for each business problem and latency requirement