Federico Calò

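Computer Vision on the Edge: Optimization for Mobile and Embedded Devices

Deploying computer vision models on edge devices such as the Raspberry Pi, NVIDIA Jetson, smartphones, and ARM microcontrollers is an engineering challenge entirely different from deploying to the cloud or to GPU servers. Resources are limited: a few watts of power, gigabytes of RAM rather than tens of gigabytes, and either no dedicated GPU or an entry-level one. Yet millions of applications need local inference: offline surveillance, robotics, portable medical devices, and industrial automation in disconnected environments.

This article covers the optimization techniques for edge deployment: quantization, pruning, knowledge distillation, optimized formats (ONNX, TFLite, NCNN), and real-world benchmarks on the Raspberry Pi 5 and NVIDIA Jetson Orin.

What You Will Learn

Edge hardware overview: Raspberry Pi, Jetson Nano/Orin, Coral TPU, Hailo
Quantization: INT8, FP16, theory and practical implementation
Structured and unstructured pruning to reduce parameters
Knowledge distillation: training a small model from a large one
TFLite and NCNN: deployment on ARM devices
TensorRT: maximum speed on NVIDIA GPUs (Jetson)
ONNX Runtime with CPU and NPU optimizations
YOLO26 on the Raspberry Pi 5: benchmarks and full configuration
Real-time video pipeline on the Jetson Orin Nano

1. Edge Hardware for Computer Vision

Edge Hardware Comparison 2026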


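| Device | CPU | GPU/NPU | RAM | TDP | YOLOv8n FPS |
|---|---|---|---|---|---|
| Raspberry Pi 5 | ARM Cortex-A76, 4 cores | VideoCore VII | 8 GB | 15 W | ~5 FPS |
| Jetson Nano (2GB) | ARM A57, 4 cores | 128 CUDA cores | 2 GB | 10 W | ~20 FPS |
| Jetson Orin Nano | ARM Cortex-A78AE, 6 cores | 1024 CUDA cores | 8 GB | 25 W | ~80 FPS |
| Jetson AGX Orin | ARM Cortex-A78AE, 12 cores | 2048 CUDA cores + DLA | 64 GB | 60 W | ~200 FPS |
| Google Coral TPU | ARM Cortex-A53, 4 cores | 4 TOPS Edge TPU | 1 GB | 4 W | ~30 FPS (TFLite) |
| Hailo-8 | - (PCIe accelerator) | 26 TOPS neural engine | - | 5 W | ~120 FPS |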
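
As a rough way to read the comparison above, the sketch below ranks the devices by frames per second per watt. The numbers are copied from the table; TDP is an upper bound on power draw rather than a measured consumption, so the ratios are only indicative. The names `DEVICES` and `fps_per_watt` are made up for this illustration.

```python
# Approximate YOLOv8n FPS and TDP (watts), copied from the comparison table.
# TDP is a ceiling, not measured power, so treat the ratios as indicative only.
DEVICES = {
    "Raspberry Pi 5": (5, 15),
    "Jetson Nano (2GB)": (20, 10),
    "Jetson Orin Nano": (80, 25),
    "Jetson AGX Orin": (200, 60),
    "Google Coral TPU": (30, 4),
    "Hailo-8": (120, 5),
}


def fps_per_watt(devices):
    """Return (name, FPS/W) pairs sorted from most to least efficient."""
    ranked = [(name, fps / watts) for name, (fps, watts) in devices.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, efficiency in fps_per_watt(DEVICES):
        print(f"{name:<20} {efficiency:5.1f} FPS/W")
```

On these figures the dedicated accelerators (Hailo-8, Coral) come out well ahead of the general-purpose boards, which is why they tend to suit battery-powered designs.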

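2. Quantization: FP32 to INT8

Quantization reduces the numerical precision of a model's weights and activations, from float32 (32 bits) to float16 (16 bits) or int8 (8 bits). The practical effects with INT8: a model about 4x smaller, inference 2-4x faster, and lower power consumption. With modern techniques the accuracy loss is typically under 1%.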
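
To make the float32-to-int8 mapping concrete, here is a minimal sketch of per-tensor affine quantization in plain Python. It assumes the simplest scheme (one scale and one zero point for the whole tensor, range taken from the min and max of the values); real toolchains usually estimate ranges with calibration observers and often quantize per channel. The function names are illustrative, not part of any library.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of a list of floats to int8 codes."""
    # The representable range must include 0.0 so that zero maps to an exact code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0          # float width of one int8 step
    zero_point = round(-128 - lo / scale)     # int8 code that represents 0.0
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, zp = quantize_int8(weights)
    restored = dequantize_int8(codes, scale, zp)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> code {c:4d} -> {r:+.4f} (error {abs(w - r):.4f})")
```

Each value is stored in 1 byte instead of 4, which is where the 4x size reduction comes from; the round-trip error stays within one quantization step (`scale`).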

2.1 Post-Training Quantization (PTQ)

INT8 Quantization with PyTorch

import copy
import io

import torch
from torch.ao.quantization import fuse_modules, get_default_qconfig, prepare, convert
from torchvision import models

def quantize_model_ptq(
    model: torch.nn.Module,
    calibration_loader,
    backend: str = 'x86'  # 'x86' for Intel/AMD CPUs, 'qnnpack' for ARM
) -> torch.nn.Module:
    """
    Post-Training Quantization (PTQ): quantizes the model without retraining.
    Only needs a small calibration dataset (~100-1000 images).

    Note: eager-mode quantization expects the model to wrap its forward pass
    in QuantStub/DeQuantStub (as the torchvision.models.quantization variants
    do); a plain float model needs those stubs added first.

    Flow:
    1. Fuse operations (Conv+BN+ReLU -> single op)
    2. Insert observers for calibration
    3. Run calibration (forward passes over the calibration dataset)
    4. Convert to a quantized model
    """
    torch.backends.quantized.engine = backend

    model_to_quantize = copy.deepcopy(model)
    model_to_quantize.eval()

    # Step 1: fuse common layer patterns for efficiency
    # Example for ResNet: (Conv, BN, ReLU) -> single fused operation
    model_to_quantize = fuse_modules(
        model_to_quantize,
        [['conv1', 'bn1', 'relu']],  # adapt to the layer names of your model
        inplace=True
    )

    # Step 2: set the qconfig and prepare for calibration
    qconfig = get_default_qconfig(backend)
    model_to_quantize.qconfig = qconfig
    prepared_model = prepare(model_to_quantize, inplace=False)

    # Step 3: calibrate on real data
    print("Calibrating quantization...")
    prepared_model.eval()
    max_batches = 100  # 100 calibration batches are usually enough
    with torch.no_grad():
        for i, (images, _) in enumerate(calibration_loader):
            if i >= max_batches:
                break
            prepared_model(images)
            if i % 10 == 0:
                print(f"  Batch {i+1}/{max_batches}")

    # Step 4: convert to the quantized model
    quantized_model = convert(prepared_model, inplace=False)

    # Compare sizes
    def model_size_mb(m: torch.nn.Module) -> float:
        # Serialize the state_dict: quantized modules keep packed weights
        # that do not appear in parameters() or buffers()
        buffer = io.BytesIO()
        torch.save(m.state_dict(), buffer)
        return buffer.getbuffer().nbytes / (1024 ** 2)

    original_size = model_size_mb(model)
    quantized_size = model_size_mb(quantized_model)
    print(f"Original size: {original_size:.1f} MB")
    print(f"Quantized size: {quantized_size:.1f} MB")
    print(f"Reduction: {original_size / quantized_size:.1f}x")

    return quantized_model

def compare_inference_speed(original_model, quantized_model,
                             input_tensor: torch.Tensor, n_runs: int = 100) -> dict:
    """Compare inference speed of the original and the quantized model."""
    import time

    results = {}

    for name, model in [('FP32', original_model), ('INT8', quantized_model)]:
        model.eval()
        # Warmup
        with torch.no_grad():
            for _ in range(10):
                model(input_tensor)

        # Benchmark
        start = time.perf_counter()
        with torch.no_grad():
            for _ in range(n_runs):
                model(input_tensor)
        elapsed = time.perf_counter() - start

        avg_ms = (elapsed / n_runs) * 1000
        results[name] = avg_ms
        print(f"{name}: {avg_ms:.2f}ms / inference")

    speedup = results['FP32'] / results['INT8']
    print(f"Speedup INT8: {speedup:.2f}x")
    return results

2.2 Quantization with YOLO (Ultralytics)

YOLO26: Quantized Export for the Edge

from ultralytics import YOLO

model = YOLO('yolo26n.pt')  # nano variant for the edge

# ---- TFLite INT8 for Raspberry Pi / Coral TPU ----
model.export(
    format='tflite',
    imgsz=320,        # reduced resolution for the edge
    int8=True,        # INT8 quantization
    data='coco.yaml'  # dataset used for PTQ calibration
)
# Output: yolo26n_int8.tflite (inside yolo26n_saved_model/)

# ---- NCNN for ARM CPUs (Raspberry Pi, Android) ----
model.export(
    format='ncnn',
    imgsz=320,
    half=False  # keep FP32 weights in the NCNN model
)
# Output: yolo26n_ncnn_model/

# ---- TensorRT FP16 for Jetson ----
model.export(
    format='engine',
    imgsz=640,
    half=True,       # FP16
    workspace=2,     # workspace in GB (kept small for Jetson Nano)
    device=0
)
# Output: yolo26n.engine

# ---- ONNX + ONNX Runtime for CPU/NPU ----
model.export(
    format='onnx',
    imgsz=320,
    opset=17,
    simplify=True,
    dynamic=False    # fixed batch size for edge deployment
)

print("Export completed for all edge targets")

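3. YOLO on the Raspberry Pi 5

The Raspberry Pi 5, with 8 GB of RAM and an ARM Cortex-A76 processor, is the most accessible entry point to edge AI. With the right optimizations (NCNN, reduced resolution, tracking to lower the inference frequency), it can run a detection system that works in real time.

Raspberry Pi 5 Setup and Optimization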

# ============================================
# RASPBERRY PI 5 SETUP for Computer Vision
# ============================================

# 1. Install base dependencies
# sudo apt update && sudo apt install -y python3-pip libopencv-dev
# pip install ultralytics ncnn onnxruntime

# 2. Optional system tuning for AI workloads (overclocking needs active cooling)
# In /boot/firmware/config.txt:
# gpu_mem=256           # More GPU memory (VideoCore VII)
# over_voltage=6        # Mild overvolt for the overclock
# arm_freq=2800         # Max CPU frequency (stock is 2.4 GHz)

# ============================================
# INFERENCE with NCNN on the Raspberry Pi
# ============================================

import ncnn
import cv2
import numpy as np
import time

class YOLOncnn:
    """
    YOLO inference with NCNN, optimized for ARM CPUs.
    NCNN is developed by Tencent and is among the fastest runtimes on ARM CPUs.
    """

    def __init__(self, param_path: str, bin_path: str,
                 num_threads: int = 4, input_size: int = 320):
        self.net = ncnn.Net()
        self.net.opt.num_threads = num_threads  # use all cores
        self.net.opt.use_vulkan_compute = False  # no GPU compute on the RPi
        self.net.load_param(param_path)
        self.net.load_model(bin_path)
        self.input_size = input_size

    def predict(self, img_bgr: np.ndarray, conf_thresh: float = 0.4) -> list[dict]:
        """NCNN inference on the ARM CPU."""
        h, w = img_bgr.shape[:2]

        # Resize and convert to RGB for NCNN
        img_resized = cv2.resize(img_bgr, (self.input_size, self.input_size))
        img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)

        mat_in = ncnn.Mat.from_pixels(
            img_rgb, ncnn.Mat.PixelType.PIXEL_RGB, self.input_size, self.input_size
        )
        # Ultralytics YOLO models expect pixels scaled to [0, 1],
        # without ImageNet mean/std normalization
        mean_vals = [0.0, 0.0, 0.0]
        norm_vals = [1 / 255.0, 1 / 255.0, 1 / 255.0]
        mat_in.substract_mean_normalize(mean_vals, norm_vals)  # (sic) NCNN API spelling

        ex = self.net.create_extractor()
        # Blob names depend on how the model was exported; check the .param file
        ex.input("images", mat_in)
        _, mat_out = ex.extract("output0")

        return self._parse_output(mat_out, conf_thresh, w, h)

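    def _parse_output(self, mat_out, conf_thresh, orig_w, orig_h) -> list[dict]:
        """
        Parse the NCNN output into a list of detections.

        Assumes YOLOv5-style rows [cx, cy, w, h, objectness, class scores...];
        newer heads use a different layout, so adapt the indexing to your
        export. No NMS is applied here.
        """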
        detections = []
        for i in range(mat_out.h):
            row = np.array(mat_out.row(i))
            confidence = row[4]
            if confidence < conf_thresh:
                continue

            class_scores = row[5:]
            class_id = int(np.argmax(class_scores))
            class_conf = confidence * class_scores[class_id]

            if class_conf >= conf_thresh:
                # Input-size pixel coordinates -> original-frame pixels
                cx, cy, bw, bh = row[:4]
                x1 = int((cx - bw/2) * orig_w / self.input_size)
                y1 = int((cy - bh/2) * orig_h / self.input_size)
                x2 = int((cx + bw/2) * orig_w / self.input_size)
                y2 = int((cy + bh/2) * orig_h / self.input_size)

                detections.append({
                    'class_id': class_id,
                    'confidence': float(class_conf),
                    'bbox': (x1, y1, x2, y2)
                })

        return detections

def run_rpi_detection_loop(model_param: str, model_bin: str,
                            camera_id: int = 0) -> None:
    """Real-time detection loop tuned for the Raspberry Pi."""
    detector = YOLOncnn(model_param, model_bin, num_threads=4, input_size=320)
    cap = cv2.VideoCapture(camera_id)

    # Tune capture settings for the RPi
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    cap.set(cv2.CAP_PROP_FPS, 30)
    cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)

    frame_skip = 2  # skip 2 frames after each inference (process 1 frame in 3)
    frame_count = 0
    cached_dets = []
    fps_history = []

    while True:
        t0 = time.perf_counter()
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % (frame_skip + 1) == 0:
            cached_dets = detector.predict(frame, conf_thresh=0.4)

        # Draw the latest detections (reused on skipped frames)
        for det in cached_dets:
            x1, y1, x2, y2 = det['bbox']
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"{det['confidence']:.2f}",
                       (x1, y1-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,255,0), 2)

        # Time the whole iteration (capture + inference + drawing) so that
        # skipped frames do not inflate the FPS figure
        elapsed = time.perf_counter() - t0
        fps = 1.0 / elapsed if elapsed > 0 else 0
        fps_history.append(fps)

        avg_fps = sum(fps_history[-30:]) / min(len(fps_history), 30)
        cv2.putText(frame, f"FPS: {avg_fps:.1f}", (10, 30),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

        cv2.imshow('RPi Detection', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

        frame_count += 1

    cap.release()
    cv2.destroyAllWindows()
    if fps_history:
        print(f"Average FPS: {sum(fps_history)/len(fps_history):.1f}")

4. NVIDIA Jetson Orin: TensorRT and DLA

The Jetson Orin Nano (25 W) offers 1024 CUDA cores; the dedicated DLA (Deep Learning Accelerator) engines are found on the Orin NX and AGX Orin modules. With TensorRT FP16 and the YOLO26n model, it can easily exceed 100 FPS on 640x640 video.

TensorRT on Jetson: Setup and Inference

from ultralytics import YOLO
import cv2
import time

def setup_jetson_pipeline(model_path: str = 'yolo26n.pt') -> YOLO:
    """
    Recommended setup for Jetson Orin:
    1. Export to TensorRT FP16
    2. Run jetson_clocks for maximum performance
    3. Set the GPU to its performance power mode
    """
    import subprocess

    # Maximize Jetson performance (run only once, requires sudo)
    # subprocess.run(['sudo', 'jetson_clocks'], check=True)
    # subprocess.run(['sudo', 'nvpmodel', '-m', '0'], check=True)  # MAXN mode

    model = YOLO(model_path)

    print("Exporting TensorRT FP16...")
    engine_path = model.export(
        format='engine',
        imgsz=640,
        half=True,       # FP16: nearly FP32 accuracy, typically up to ~2x faster
        workspace=2,     # GPU workspace in GB (Jetson Orin Nano has 8 GB shared)
        device=0,
        batch=1,
        simplify=True
    )

    # Load the exported TensorRT engine (export() returns its path)
    trt_model = YOLO(engine_path)
    print("TensorRT model ready")
    return trt_model

def run_jetson_pipeline(model: YOLO, source=0) -> None:
    """Real-time pipeline tuned for Jetson, with performance statistics."""
    cap = cv2.VideoCapture(source)
    cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)

    fps_list = []
    frame_count = 0

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break

            t0 = time.perf_counter()
            results = model.predict(
                frame, conf=0.35, iou=0.45,
                verbose=False, half=True  # FP16 inference
            )
            elapsed = time.perf_counter() - t0
            fps = 1.0 / elapsed
            fps_list.append(fps)

            # Annotate the frame with performance information
            annotated = results[0].plot()
            avg_fps = sum(fps_list[-30:]) / min(len(fps_list), 30)

            info_text = [
                f"FPS: {fps:.0f} (avg: {avg_fps:.0f})",
                f"Detections: {len(results[0].boxes)}",
                f"Inference: {elapsed*1000:.1f}ms"
            ]
            for i, text in enumerate(info_text):
                cv2.putText(annotated, text, (10, 30 + i * 30),
                           cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)

            cv2.imshow('Jetson Pipeline', annotated)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

            frame_count += 1

    finally:
        cap.release()
        cv2.destroyAllWindows()
        if fps_list:
            print("\n=== Jetson stats ===")
            print(f"Frames: {frame_count}")
            print(f"Average FPS: {sum(fps_list)/len(fps_list):.1f}")
            print(f"Max FPS: {max(fps_list):.1f}")
            print(f"Min latency: {1000/max(fps_list):.1f}ms")

5. Pruning and Knowledge Distillation

5.1 Structured Pruning

Structured Pruning with PyTorch

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def apply_structured_pruning(model: nn.Module,
                               amount: float = 0.3,
                               n: int = 2) -> nn.Module:
    """
    Structured L_n-norm pruning: zeroes entire filters/neurons.

    torch.nn.utils.prune works through masks, so tensors keep their shape;
    inference only gets faster once the zeroed filters are physically removed
    from the network. Unstructured pruning, by contrast, leaves scattered
    zeros: the model compresses better but is not necessarily faster.

    amount: fraction of filters to prune (0.3 = 30%)
    n: order of the L_n norm used to rank the filters
    """
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # Prune the least important convolutional filters
            prune.ln_structured(
                module,
                name='weight',
                amount=amount,
                n=n,
                dim=0  # dim=0 = prune output filters
            )
        elif isinstance(module, nn.Linear):
            # Caution: this also zeroes rows of the final classifier
            # (whole class logits); exclude the output layer in practice
            prune.ln_structured(
                module,
                name='weight',
                amount=amount,
                n=n,
                dim=0
            )

    return model

def remove_pruning_masks(model: nn.Module) -> nn.Module:
    """
    Make the pruning permanent: removes the masks and the "orig" parameters,
    leaving only the pruned weights. Required before export.
    """
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            try:
                prune.remove(module, 'weight')
            except ValueError:
                pass
    return model

def prune_and_finetune(model: nn.Module, train_loader, val_loader,
                        prune_amount: float = 0.2, finetune_epochs: int = 5) -> nn.Module:
    """
    Full pipeline:
    1. Prune the model (masks a prune_amount fraction of the filters)
    2. Fine-tune to recover the lost accuracy
    3. Remove the masks and finalize
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    print(f"Applying {prune_amount*100:.0f}% structured pruning...")
    model = apply_structured_pruning(model, amount=prune_amount)

    # Short fine-tuning run to recover accuracy
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for epoch in range(finetune_epochs):
        model.train()
        total_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = model(images).argmax(1)
                correct += preds.eq(labels).sum().item()
                total += labels.size(0)

        print(f"  FT Epoch {epoch+1}/{finetune_epochs} | "
              f"Loss: {total_loss/len(train_loader):.4f} | "
              f"Acc: {100.*correct/total:.2f}%")

    # Finalize pruning
    model = remove_pruning_masks(model)
    print("Pruning completed and finalized")
    return model
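
To show what the filter ranking amounts to, here is a small framework-free sketch of L2-norm structured pruning: each filter is a flat list of weights, the filters with the smallest norm are selected, and they are zeroed in place of being removed. It mirrors the idea behind `ln_structured` with `n=2, dim=0`, but it is only an illustration; `select_filters_to_prune` and `apply_mask` are hypothetical helper names.

```python
import math


def select_filters_to_prune(filters, amount=0.3):
    """Return the indices of the `amount` fraction of filters with the smallest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    n_prune = round(amount * len(filters))
    order = sorted(range(len(filters)), key=lambda i: norms[i])
    return sorted(order[:n_prune])


def apply_mask(filters, pruned):
    """Zero the selected filters; shapes are unchanged (a mask, not a removal)."""
    pruned = set(pruned)
    return [[0.0] * len(f) if i in pruned else list(f)
            for i, f in enumerate(filters)]


if __name__ == "__main__":
    # Four toy "filters" with two weights each
    filters = [[3.0, 4.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
    pruned = select_filters_to_prune(filters, amount=0.5)
    print("Pruned filter indices:", pruned)
    print("Masked filters:", apply_mask(filters, pruned))
```

Because the masked layer still has the same number of filters, the speedup only materializes after the zeroed filters (and the matching input channels of the next layer) are physically removed.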

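6. Knowledge Distillation for Edge Models

Knowledge distillation (KD, Hinton et al., 2015) transfers the "knowledge" of a large model (the teacher) to a small model (the student). The student learns not only from the hard labels of the dataset but also from the teacher's soft predictions: probability distributions that carry information about the structure of the data space (for example, a "cat" is closer to a "tiger" than to a "car").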
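
The effect of soft predictions is easiest to see with numbers. The sketch below applies a temperature-scaled softmax to a set of hypothetical teacher logits for the classes cat, tiger, and car (the logit values are invented for illustration) and compares temperature 1 with temperature 4.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for an image of a cat
classes = ["cat", "tiger", "car"]
teacher_logits = [8.0, 5.0, 1.0]

hard = softmax(teacher_logits, temperature=1.0)  # close to a one-hot label
soft = softmax(teacher_logits, temperature=4.0)  # softened distribution

if __name__ == "__main__":
    for name, p1, p4 in zip(classes, hard, soft):
        print(f"{name:<6} T=1: {p1:.2f}   T=4: {p4:.2f}")
```

At temperature 1 almost all the probability mass sits on "cat" (roughly 0.95), so the student sees little more than the hard label. At temperature 4 the distribution becomes roughly 0.61 / 0.29 / 0.11, which exposes that the teacher considers "tiger" far more plausible than "car".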

Knowledge Distillation: Teacher-Student Training

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    """
    Combined loss for Knowledge Distillation.

    L_total = alpha * L_hard + (1 - alpha) * L_soft
    L_hard = CrossEntropyLoss(student_logits, true_labels)
    L_soft = KLDivLoss(log_softmax(student/T), softmax(teacher/T)) * T^2

    T (temperature): higher values -> softer distributions -> more structural information
    alpha: weight of the hard-label loss relative to the distillation loss
    """

    def __init__(self, temperature: float = 4.0, alpha: float = 0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha
        self.hard_loss = nn.CrossEntropyLoss()
        self.soft_loss = nn.KLDivLoss(reduction='batchmean')

    def forward(self,
                student_logits: torch.Tensor,
                teacher_logits: torch.Tensor,
                labels: torch.Tensor) -> torch.Tensor:
        # Loss on the ground-truth labels (hard labels)
        hard = self.hard_loss(student_logits, labels)

        # Loss on the teacher's soft predictions (KL divergence)
        student_soft = F.log_softmax(student_logits / self.T, dim=1)
        teacher_soft = F.softmax(teacher_logits / self.T, dim=1)
        soft = self.soft_loss(student_soft, teacher_soft) * (self.T ** 2)

        return self.alpha * hard + (1 - self.alpha) * soft

def train_with_distillation(
    teacher: nn.Module,     # large model, already trained
    student: nn.Module,     # small model to train
    train_loader,
    val_loader,
    n_epochs: int = 50,
    temperature: float = 4.0,
    alpha: float = 0.7,
    lr: float = 1e-3
) -> nn.Module:
    """
    Train the student model with KD.
    The teacher stays frozen for the whole training run.

    Indicative results:
    - MobileNetV3 without KD on ImageNet: ~67% Top-1
    - MobileNetV3 with KD from ResNet-50: ~72% Top-1
    - ResNet-50 (teacher):                ~76% Top-1
    - Delta: about +5 points with roughly 5x fewer parameters
    """
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    teacher.eval()   # the teacher always stays in eval mode
    student.to(device)
    teacher.to(device)

    criterion = DistillationLoss(temperature=temperature, alpha=alpha)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

    best_val_acc = 0.0
    best_state = None

    for epoch in range(n_epochs):
        student.train()
        total_loss = 0.0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            # Forward pass
            student_logits = student(images)
            with torch.no_grad():  # teacher: no gradients
                teacher_logits = teacher(images)

            # Combined loss
            loss = criterion(student_logits, teacher_logits, labels)

            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(student.parameters(), 1.0)
            optimizer.step()
            total_loss += loss.item()

        scheduler.step()

        # Validation
        student.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                preds = student(images).argmax(1)
                correct += preds.eq(labels).sum().item()
                total += labels.size(0)

        val_acc = 100.0 * correct / total
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_state = {k: v.cpu().clone() for k, v in student.state_dict().items()}

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{n_epochs} | "
                  f"Loss: {total_loss/len(train_loader):.4f} | "
                  f"Val Acc: {val_acc:.2f}% | "
                  f"Best: {best_val_acc:.2f}%")

    if best_state is not None:  # stays None if validation accuracy never rose above 0
        student.load_state_dict(best_state)
    print(f"\nBest validation accuracy: {best_val_acc:.2f}%")
    return student

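Comparison of Compression Strategies for the Edge

| Technique | Model size reduction | Speedup | Accuracy loss | Retraining required |
|---|---|---|---|---|
| INT8 quantization | 4x | 2-4x | <1% | No (PTQ) / Yes (QAT) |
| Structured pruning 30% | 1.4x | 1.3-1.6x | 1-3% | Yes (fine-tuning) |
| Knowledge distillation | 5-10x (model swap) | 5-10x | 3-8% | Yes (full training) |
| FP16 (TensorRT) | 2x | 1.5-2x | <0.5% | No |
| Quantization + pruning + KD | 10-20x | 8-15x | 2-5% | Yes |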

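7. ONNX Runtime: Portability Across Hardware

ONNX (Open Neural Network Exchange) is the standard format for deep learning model portability. Once exported to ONNX, the same model can run with ONNX Runtime on CPUs, NVIDIA GPUs, ARM NPUs, Intel OpenVINO, and the Apple Neural Engine without changing the inference code.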
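
In practice, portability comes from choosing an execution provider at session creation and letting the CPU act as a fallback. The sketch below shows one way to pick providers from the list that the installed build actually offers; in real code that list would come from `onnxruntime.get_available_providers()`. The `PREFERRED` mapping and the `pick_providers` helper are assumptions made for this example, while the provider names are the ones ONNX Runtime uses.

```python
# Preferred execution providers per target, most specific first.
PREFERRED = {
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "openvino": ["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    "cpu": ["CPUExecutionProvider"],
}


def pick_providers(device, available):
    """Keep only the preferred providers that are installed; fall back to CPU."""
    wanted = PREFERRED.get(device, PREFERRED["cpu"])
    chosen = [p for p in wanted if p in available]
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    # Simulated result of onnxruntime.get_available_providers() on a CPU-only build
    available = ["CPUExecutionProvider"]
    print(pick_providers("cuda", available))
```

The resulting list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, so the same script runs unchanged on a Jetson with the CUDA provider and on a Raspberry Pi with the CPU provider.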

ONNX Export and Inference with ONNX Runtime

import torch
import onnx
import onnxruntime as ort
import numpy as np
import time

def export_to_onnx(model: torch.nn.Module,
                   input_shape: tuple = (1, 3, 640, 640),
                   output_path: str = 'model.onnx',
                   opset: int = 17,
                   dynamic_batch: bool = False) -> str:
    """
    Export a PyTorch model to ONNX.

    opset: ONNX opset version (higher = more operators supported)
    dynamic_batch: allow a variable batch size (useful on servers; keep it
                   False on the edge, where a fixed shape is preferable)
    """
    model.eval()
    dummy_input = torch.zeros(input_shape)

    # Mark the batch dimension as dynamic only when requested
    dynamic_axes = (
        {'images': {0: 'batch'}, 'output': {0: 'batch'}} if dynamic_batch else None
    )

    torch.onnx.export(
        model,
        dummy_input,
        output_path,
        opset_version=opset,
        input_names=['images'],
        output_names=['output'],
        dynamic_axes=dynamic_axes,
        do_constant_folding=True,  # fold constant subgraphs at export time
        verbose=False
    )

    # Validate the exported model
    onnx_model = onnx.load(output_path)
    onnx.checker.check_model(onnx_model)
    print(f"Valid ONNX model: {output_path}")
    return output_path

class ONNXRuntimeInference:
    """
    Inference with ONNX Runtime.
    Supports CPU, CUDA GPUs, Qualcomm NPUs (QNN), and Intel OpenVINO as backends,
    provided the matching onnxruntime build is installed.
    """

    def __init__(self, model_path: str, device: str = 'cpu'):
        providers = self._get_providers(device)

        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = (
            ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        )
        # Thread counts for CPU inference
        sess_options.intra_op_num_threads = 4
        sess_options.inter_op_num_threads = 2

        self.session = ort.InferenceSession(
            model_path, sess_options, providers=providers
        )

        # Cache input/output names
        self.input_name  = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name

        # Report the provider actually in use: ONNX Runtime falls back to
        # the next entry when the requested one is unavailable
        print(f"ONNX Runtime loaded on: {self.session.get_providers()[0]}")

    def _get_providers(self, device: str) -> list:
        if device == 'cuda':
            return ['CUDAExecutionProvider', 'CPUExecutionProvider']
        elif device == 'openvino':
            return ['OpenVINOExecutionProvider', 'CPUExecutionProvider']
        else:
            return ['CPUExecutionProvider']

    def predict(self, image: np.ndarray) -> np.ndarray:
        """Run inference on a preprocessed numpy image."""
        # Ensure float32 data in [B, C, H, W] layout
        if image.ndim == 3:
            image = image[np.newaxis, ...]
        image = image.astype(np.float32)

        return self.session.run(
            [self.output_name], {self.input_name: image}
        )[0]

    def benchmark(self, input_shape: tuple = (1, 3, 640, 640),
                  n_runs: int = 100) -> dict:
        """Measure latency and throughput."""
        dummy = np.random.rand(*input_shape).astype(np.float32)

        # Warmup
        for _ in range(10):
            self.predict(dummy)

        # Benchmark
        start = time.perf_counter()
        for _ in range(n_runs):
            self.predict(dummy)
        elapsed = time.perf_counter() - start

        avg_ms = (elapsed / n_runs) * 1000
        fps = 1000.0 / avg_ms
        print(f"ONNX Runtime: {avg_ms:.2f}ms ({fps:.1f} FPS)")
        return {'avg_ms': avg_ms, 'fps': fps}

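8. Best Practices for Edge Deployment

Checklist for a Production-Ready Edge Deployment

Choose the smallest model that meets your requirements: YOLOv8n or YOLO26n for the RPi, YOLOv8m for the Jetson Orin. Avoid Large or XLarge models on the edge, and always measure on the target hardware.
Reduce the input resolution: 320x320 instead of 640x640 processes a quarter of the pixels, cutting inference time by up to about 75% with a moderate accuracy loss. For large objects, 320 is enough.
Smart frame skipping: when objects move slowly, process 1 frame out of every 3-5. Use a tracker (CSRT, ByteTrack) to interpolate positions on the skipped frames.
Optimize the capture pipeline: set CAP_PROP_BUFFERSIZE=1 to minimize latency. On Linux, use V4L2 directly, since it has less overhead than OpenCV.
TensorRT on Jetson: always. The gap between PyTorch and TensorRT FP16 is 5-8x. There is no reason to use PyTorch for production inference on a Jetson.
Thermal throttling: on the RPi and Jetson, overheating triggers throttling. Add a heatsink and monitor the temperature with vcgencmd measure_temp (RPi) or tegrastats (Jetson).
Measure energy, not just speed: FPS/watt is the key metric for battery-powered devices. A model that is 2x slower but 4x more energy-efficient is often preferable.
Watchdog and graceful restart: on production edge devices, always implement a watchdog that restarts the inference process after a crash or hang.
Edge-friendly logging: on the RPi, store events locally in SQLite instead of a remote database, and sync to the cloud in batches when connectivity is available.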
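
The last checklist item, local logging with batched sync, can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions: the table layout, the function names, and the `upload` callback (which stands in for whatever cloud client is used and must return True on success) are all hypothetical.

```python
import json
import sqlite3
import time


def open_event_store(path=":memory:"):
    """Open (or create) the local event database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ts REAL NOT NULL, "
        "payload TEXT NOT NULL, "
        "synced INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def log_event(conn, payload):
    """Store one detection event locally; works with no network at all."""
    conn.execute(
        "INSERT INTO events (ts, payload) VALUES (?, ?)",
        (time.time(), json.dumps(payload)),
    )
    conn.commit()


def sync_pending(conn, upload, batch_size=100):
    """Send unsynced events as one batch; mark them synced only on success."""
    rows = conn.execute(
        "SELECT id, ts, payload FROM events WHERE synced = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    if not rows:
        return 0
    batch = [{"ts": ts, **json.loads(payload)} for _, ts, payload in rows]
    if not upload(batch):  # upload failed: keep the rows for the next attempt
        return 0
    conn.executemany("UPDATE events SET synced = 1 WHERE id = ?",
                     [(row[0],) for row in rows])
    conn.commit()
    return len(rows)
```

On a device, `open_event_store` would be given a file path on persistent storage, and `sync_pending` would be called periodically; events simply accumulate while the link is down and are sent in order once it returns.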

Thermal Monitoring and Watchdog on the Raspberry Pi

import subprocess
import threading
import time
import logging

class ThermalMonitor:
    """
    Thermal monitor for Raspberry Pi/Jetson.
    Automatically reduces the workload when the temperature gets too high.
    """

    TEMP_WARNING = 75.0   # Celsius: reduce the frame rate
    TEMP_CRITICAL = 85.0  # Celsius: stop processing

    def __init__(self, platform: str = 'rpi',
                 check_interval: float = 5.0):
        self.platform = platform
        self.check_interval = check_interval
        self.current_temp = 0.0
        self.throttle_factor = 1.0  # 1.0 = no throttling
        self._stop = threading.Event()

    def get_temperature(self) -> float:
        """Read the SoC temperature in Celsius (0.0 if it cannot be read)."""
        try:
            if self.platform == 'rpi':
                result = subprocess.run(
                    ['vcgencmd', 'measure_temp'],
                    capture_output=True, text=True, check=True
                )
                # Output: "temp=62.1'C"
                temp_str = result.stdout.strip()
                return float(temp_str.split('=')[1].replace("'C", ''))
            if self.platform == 'jetson':
                # Read from sysfs (value is in millidegrees)
                with open('/sys/class/thermal/thermal_zone0/temp') as f:
                    return float(f.read().strip()) / 1000.0
            raise ValueError(f"Unknown platform: {self.platform}")
        except Exception as e:
            logging.warning(f"Unable to read temperature: {e}")
            return 0.0

    def get_throttle_factor(self) -> float:
        """Return the throttling factor (0.0-1.0)."""
        temp = self.current_temp
        if temp < self.TEMP_WARNING:
            return 1.0
        elif temp < self.TEMP_CRITICAL:
            # Linear throttling between the warning and critical thresholds
            factor = 1.0 - (temp - self.TEMP_WARNING) / (
                self.TEMP_CRITICAL - self.TEMP_WARNING
            )
            return max(0.2, factor)  # never below 20%
        else:
            return 0.0  # stop processing

    def monitor_loop(self) -> None:
        """Thermal monitoring thread."""
        while not self._stop.is_set():
            self.current_temp = self.get_temperature()
            self.throttle_factor = self.get_throttle_factor()

            if self.current_temp >= self.TEMP_CRITICAL:
                logging.critical(f"CRITICAL TEMP: {self.current_temp:.1f}C - "
                                 f"processing stopped!")
            elif self.current_temp >= self.TEMP_WARNING:
                logging.warning(f"HIGH TEMP: {self.current_temp:.1f}C - "
                                f"Throttle: {self.throttle_factor:.2f}")

            # wait() returns early when stop() is called
            self._stop.wait(self.check_interval)

    def start(self) -> None:
        t = threading.Thread(target=self.monitor_loop, daemon=True)
        t.start()

    def stop(self) -> None:
        self._stop.set()
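
One way an inference loop might consume the monitor's `throttle_factor` is to stretch the interval between inferences as the device heats up. The helper below is a sketch of that mapping only; `frames_to_skip`, `base_skip`, and `max_skip` are hypothetical names, and the policy (scaling the inference interval by the inverse of the factor) is an assumption, not something prescribed by the monitor itself.

```python
def frames_to_skip(throttle_factor, base_skip=2, max_skip=30):
    """Map a throttle factor in [0, 1] to the number of frames to skip per inference.

    Returns None when the factor is 0, meaning inference should pause entirely.
    """
    if throttle_factor <= 0.0:
        return None
    # At factor 1.0 run every (base_skip + 1) frames; stretch as the factor drops
    interval = (base_skip + 1) / throttle_factor
    return min(max_skip, max(base_skip, round(interval) - 1))


if __name__ == "__main__":
    for factor in (1.0, 0.5, 0.2, 0.0):
        print(f"throttle {factor:.1f} -> skip {frames_to_skip(factor)}")
```

Inside a detection loop, the returned value would replace a fixed frame-skip constant, re-read from the monitor every few frames, and a `None` result would make the loop sleep until the temperature falls back below the critical threshold.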

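Conclusions

Deploying computer vision models on edge devices requires a holistic approach that combines hardware selection, model optimization, and pipeline engineering. There is no single solution: the best combination depends on the dominant constraint (latency, energy, accuracy, cost). In this article we built a complete toolkit:

Edge hardware: Raspberry Pi 5 for budget scenarios, Jetson Orin for real-time performance, Coral TPU and Hailo-8 for ultra-low power
INT8 quantization: 4x size reduction, 2-4x speedup, <1% accuracy loss with PTQ
NCNN for ARM CPUs, TensorRT for NVIDIA GPUs, TFLite + Coral TPU for ultra-low power
Structured pruning + fine-tuning: prunes 20-30% of the filters with minimal accuracy loss
Knowledge distillation: transfers the knowledge of a large model to an embedded-sized model
ONNX Runtime: model portability across different hardware platforms
Thermal monitoring and watchdog: robust systems for 24/7 edge production
Frame skipping + tracking: reduces the compute load by 70-80% in scenes with little motion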

