Federico Calò


Benchmarking and Optimization: From a 48 GB GPU to an 8 GB RTX

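You have a model. It runs on an 80 GB A100. But you need to deploy it on a 24 GB RTX 3090, on a laptop with an 8 GB RTX 4060, or even on a Raspberry Pi. How do you know how much you lose? How far does accuracy drop going from FP32 to INT4? How much speed do you gain with Flash Attention? Is it worth quantizing, or is distillation the better choice? How much memory does gradient checkpointing save?

Without systematic benchmarking, these questions remain unanswered, and you end up making suboptimal choices based on intuition or on published benchmarks for configurations different from your own. In this final article of the series, we build a comprehensive benchmarking framework that measures every dimension of performance: memory, latency, throughput, accuracy, and power consumption.

We then apply every technique covered in the series (quantization, pruning, distillation, Flash Attention, gradient checkpointing, and mixed precision) to show how to go from a model that needs 48 GB to one that runs in 8 GB, with metrics that show exactly what you paid in terms of quality.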
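As a first back-of-the-envelope check, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below is only an illustration: the 12-billion-parameter model is a hypothetical example, and the estimate ignores activations, optimizer state, and KV cache.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: int, precision: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 12-billion-parameter model, chosen only for illustration.
N_PARAMS = 12_000_000_000
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(N_PARAMS, precision):.1f} GB")
```

Under these assumptions the same weights shrink by a factor of eight between FP32 and INT4, which is why quantization is usually the first lever for fitting a model on a small GPU. What it costs in accuracy is what the benchmarks have to measure.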

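What You Will Learn

A systematic benchmarking framework for DL models
Accurate measurement of VRAM, latency, throughput, and FLOPs
Mixed precision training: FP16 vs BF16 vs FP32
Flash Attention 2/3: how much it saves and when to use it
Gradient checkpointing: the memory/compute trade-off
Gradient accumulation: effectively larger batch sizes
torch.compile and runtime optimizations
KV cache: optimizing autoregressive LLM inference
Systematic comparison of all the techniques
Decision guide: which optimization for which scenario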

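A Systematic Benchmarking Framework

Before optimizing, you need accurate measurements. A professional benchmarking framework measures peak VRAM usage, mean and P95 latency, throughput (tokens/s or img/s), FLOPs for the specific task, energy consumption, and accuracy. The key is reproducibility: a benchmark that varies by 10% from run to run is useless.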

import torch
import torch.nn as nn
import time
import numpy as np
from dataclasses import dataclass, asdict
from typing import Optional, Callable
import gc

# ============================================================
# DATACLASS FOR BENCHMARK RESULTS
# ============================================================
@dataclass
class BenchmarkResult:
    """Complete results of one benchmark run."""
    name: str
    # Memory
    vram_allocated_mb: float
    vram_reserved_mb: float
    vram_peak_mb: float
    # Speed
    latency_ms_mean: float
    latency_ms_p50: float
    latency_ms_p95: float
    latency_ms_p99: float
    throughput_per_sec: float
    # Model
    params_total: int
    params_trainable: int
    model_size_mb: float
    # Optional
    accuracy: Optional[float] = None
    flops_total: Optional[float] = None
    power_watts: Optional[float] = None

    def print_summary(self):
        print(f"\n=== {self.name} ===")
        print(f"  VRAM: {self.vram_peak_mb:.0f} MB peak, {self.vram_allocated_mb:.0f} MB alloc")
        print(f"  Latency: {self.latency_ms_mean:.1f}ms mean, "
              f"{self.latency_ms_p95:.1f}ms p95, {self.latency_ms_p99:.1f}ms p99")
        print(f"  Throughput: {self.throughput_per_sec:.1f}/s")
        print(f"  Parameters: {self.params_total:,} ({self.model_size_mb:.1f} MB)")
        # Compare against None so that an accuracy of 0.0 is still reported
        if self.accuracy is not None:
            print(f"  Accuracy: {self.accuracy:.4f}")


# ============================================================
# MAIN BENCHMARKING CLASS
# ============================================================
class DeepLearningBenchmark:
    def __init__(self, device: str = "cuda"):
        self.device = device
        self.results = []

    def _count_params(self, model: nn.Module) -> tuple:
        total = sum(p.numel() for p in model.parameters())
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        return total, trainable

    def _model_size_mb(self, model: nn.Module) -> float:
        total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
        return total_bytes / (1024 ** 2)

    def _reset_memory(self):
        """Reset GPU memory state for a clean benchmark."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()

    def benchmark_inference(
        self,
        name: str,
        model: nn.Module,
        input_fn: Callable[[], tuple],
        n_warmup: int = 10,
        n_runs: int = 100,
        batch_size: int = 1
    ) -> BenchmarkResult:
        """
        Full inference benchmark.
        input_fn: function returning the model input (a tensor, or a dict of tensors passed as keyword arguments)
        """
        model = model.to(self.device).eval()
        self._reset_memory()

        # Warmup
        with torch.no_grad():
            for _ in range(n_warmup):
                inputs = input_fn()
                if isinstance(inputs, dict):
                    model(**{k: v.to(self.device) for k, v in inputs.items()})
                else:
                    model(inputs.to(self.device))

        # Measure memory after warmup
        if torch.cuda.is_available():
            mem_alloc = torch.cuda.memory_allocated() / (1024**2)
            mem_reserved = torch.cuda.memory_reserved() / (1024**2)

        # Actual benchmark
        torch.cuda.synchronize() if torch.cuda.is_available() else None
        latencies = []
        for _ in range(n_runs):
            inputs = input_fn()
            t0 = time.perf_counter()
            with torch.no_grad():
                if isinstance(inputs, dict):
                    _ = model(**{k: v.to(self.device) for k, v in inputs.items()})
                else:
                    _ = model(inputs.to(self.device))
            torch.cuda.synchronize() if torch.cuda.is_available() else None
            latencies.append((time.perf_counter() - t0) * 1000)

        if torch.cuda.is_available():
            mem_peak = torch.cuda.max_memory_allocated() / (1024**2)
        else:
            mem_alloc = mem_reserved = mem_peak = 0.0

        latencies = np.array(latencies)
        total_params, trainable_params = self._count_params(model)

        result = BenchmarkResult(
            name=name,
            vram_allocated_mb=mem_alloc,
            vram_reserved_mb=mem_reserved,
            vram_peak_mb=mem_peak,
            latency_ms_mean=float(np.mean(latencies)),
            latency_ms_p50=float(np.percentile(latencies, 50)),
            latency_ms_p95=float(np.percentile(latencies, 95)),
            latency_ms_p99=float(np.percentile(latencies, 99)),
            throughput_per_sec=1000 / np.mean(latencies) * batch_size,
            params_total=total_params,
            params_trainable=trainable_params,
            model_size_mb=self._model_size_mb(model)
        )
        result.print_summary()
        self.results.append(result)
        return result

    def benchmark_training_step(
        self,
        name: str,
        model: nn.Module,
        optimizer: torch.optim.Optimizer,
        loss_fn: Callable,
        input_fn: Callable,
        n_steps: int = 50
    ) -> dict:
        """Benchmark a single training step."""
        model = model.to(self.device).train()
        self._reset_memory()

        latencies = []
        for step in range(n_steps):
            inputs, labels = input_fn()
            inputs = inputs.to(self.device)
            labels = labels.to(self.device)

            t0 = time.perf_counter()
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            torch.cuda.synchronize() if torch.cuda.is_available() else None
            latencies.append((time.perf_counter() - t0) * 1000)

        return {
            "name": name,
            "vram_peak_mb": torch.cuda.max_memory_allocated() / (1024**2) if torch.cuda.is_available() else 0,
            "step_ms_mean": float(np.mean(latencies[5:])),  # Skip warmup
            "step_ms_p95": float(np.percentile(latencies[5:], 95))
        }

    def compare_results(self) -> None:
        """Print a comparison table of all results."""
        if not self.results:
            print("No results available.")
            return

        baseline = self.results[0]
        print(f"\n{'Config':<30} {'VRAM (MB)':>12} {'Latency (ms)':>14} {'Throughput':>12} {'Speedup':>10}")
        print("-" * 82)
        for r in self.results:
            speedup = baseline.latency_ms_mean / r.latency_ms_mean
            print(f"{r.name:<30} {r.vram_peak_mb:>12.0f} {r.latency_ms_mean:>14.2f} "
                  f"{r.throughput_per_sec:>12.1f} {speedup:>10.2f}x")

# Usage:
bench = DeepLearningBenchmark(device="cuda" if torch.cuda.is_available() else "cpu")
print("Benchmarking framework initialized")
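The reproducibility requirement can be checked with a few lines of standard-library Python. The sketch below is an illustration rather than part of the framework: the run values are hypothetical, and the 5% threshold is an assumed cut-off, not an established standard.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def run_to_run_variation(mean_latencies_ms):
    """Coefficient of variation (stdev / mean) across repeated benchmark runs."""
    return statistics.stdev(mean_latencies_ms) / statistics.mean(mean_latencies_ms)


# Hypothetical mean latencies (ms) from five repeated runs of the same config.
runs = [12.1, 12.4, 11.9, 12.3, 12.2]
cv = run_to_run_variation(runs)
print(f"p95 = {percentile(runs, 95):.2f} ms, variation = {cv:.1%}")
# Assumed threshold: treat more than 5% run-to-run variation as too noisy.
print("reproducible" if cv < 0.05 else "too noisy: add warmup or more runs")
```

Repeating each configuration several times and reporting the variation next to the mean makes it clear whether a measured speedup is larger than the noise.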

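Mixed Precision: FP32 vs FP16 vs BF16

Mixed precision training is the first optimization to enable: it needs almost no configuration, roughly halves memory use, and gives a 2-3x speedup on Ampere or newer hardware. torch.autocast automatically decides which operations run in reduced precision.

The main difference between FP16 and BF16 is the binary format: FP16 has 5 exponent bits and 10 mantissa bits (normal range about 6e-5 to 6.5e4), while BF16 has 8 exponent bits and 7 mantissa bits (the same range as FP32, about 1.2e-38 to 3.4e38). BF16 is therefore much more stable in training, because large gradients do not overflow and small ones do not underflow.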
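The ranges quoted above follow directly from the bit layout. As a minimal sketch, assuming the usual IEEE-style convention (the all-ones exponent is reserved for infinities and NaN), they can be reproduced with plain Python. The helper name `float_format_range` is hypothetical.

```python
def float_format_range(exp_bits: int, mantissa_bits: int):
    """Return (smallest normal, largest finite) value of an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest_finite = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    return smallest_normal, largest_finite


for name, e_bits, m_bits in [("FP16", 5, 10), ("BF16", 8, 7), ("FP32", 8, 23)]:
    smallest, largest = float_format_range(e_bits, m_bits)
    print(f"{name}: {smallest:.2e} to {largest:.2e}")
```

Because BF16 and FP32 share the same 8 exponent bits, their smallest normal value is identical. BF16 only gives up mantissa bits, that is, precision rather than range.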

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler

# ============================================================
# COMPARISON: FP32 vs FP16 vs BF16
# ============================================================
def train_step_fp32(model, optimizer, imgs, labels, criterion):
    """Standard FP32 training step."""
    optimizer.zero_grad()
    output = model(imgs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


def train_step_fp16(model, optimizer, imgs, labels, criterion, scaler: GradScaler):
    """
    Training step with FP16 AMP.
    GradScaler is required: FP16 has a limited range, and loss scaling prevents underflow.
    """
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(imgs)
        loss = criterion(output, labels)

    # Scale the loss to avoid gradient underflow in FP16
    scaler.scale(loss).backward()
    # Unscale the gradients before clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    # Update the weights (the step is skipped if gradients contain NaN/Inf)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


def train_step_bf16(model, optimizer, imgs, labels, criterion):
    """
    Training step with BF16.
    BF16 does NOT need a GradScaler: its dynamic range matches FP32.
    Available on: A100, RTX 3000+, Apple M-series.
    """
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(imgs)
        loss = criterion(output, labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()


# Comparative benchmark
from torchvision import models
import time, gc
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def compare_precisions(model_fn=models.resnet50, n_steps=100,
                        batch_size=32, img_size=224):
    """Compare FP32, FP16, and BF16 training steps."""
    criterion = nn.CrossEntropyLoss()

    configs = [
        ("FP32",  torch.float32, False),
        ("FP16",  torch.float16, True),   # Requires GradScaler
        ("BF16",  torch.bfloat16, False)  # No GradScaler
    ]

    results = {}
    for name, dtype, use_scaler in configs:
        model = model_fn(weights=None).to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        scaler = GradScaler() if use_scaler else None

        # Reset memory stats
        torch.cuda.reset_peak_memory_stats() if torch.cuda.is_available() else None
        gc.collect()
        torch.cuda.empty_cache() if torch.cuda.is_available() else None

        timings = []
        for step in range(n_steps):
            imgs = torch.randn(batch_size, 3, img_size, img_size, device=device)
            labels = torch.randint(0, 1000, (batch_size,), device=device)

            t0 = time.perf_counter()
            # Clear gradients in every configuration, including the GradScaler path
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=dtype, enabled=(dtype != torch.float32)):
                out = model(imgs)
                loss = criterion(out, labels)

            if scaler:
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                loss.backward()
                optimizer.step()

            torch.cuda.synchronize() if torch.cuda.is_available() else None
            timings.append((time.perf_counter() - t0) * 1000)

        vram_peak = torch.cuda.max_memory_allocated() / (1024**2) if torch.cuda.is_available() else 0

        # The first 10 steps are treated as warmup and excluded from the averages
        results[name] = {
            "vram_mb": round(vram_peak, 1),
            "step_ms": round(np.mean(timings[10:]), 2),
            "throughput_imgs_s": round(batch_size * 1000 / np.mean(timings[10:]), 1)
        }
        print(f"{name}: VRAM={vram_peak:.0f}MB, {np.mean(timings[10:]):.1f}ms/step, "
              f"{batch_size*1000/np.mean(timings[10:]):.0f} img/s")

    return results

# Typical results, ResNet-50, BS=32 on an RTX 4090:
# FP32: VRAM=6200MB, 95ms/step, 336 img/s
# FP16: VRAM=3100MB, 41ms/step, 780 img/s  (2.3x faster, 50% VRAM)
# BF16: VRAM=3100MB, 38ms/step, 842 img/s  (2.5x faster, 50% VRAM)
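To see why the FP16 path needs loss scaling while BF16 does not, it helps to round a tiny gradient to half precision by hand. The sketch below uses the standard-library `struct` module (format code `"e"` is IEEE half precision). The gradient value is hypothetical, and the scale factor of 2**16 is an assumption made for the illustration.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


grad = 1e-8           # hypothetical tiny gradient, below FP16's smallest subnormal
loss_scale = 65536.0  # 2**16, an assumed scale factor for this sketch

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(grad)
# Scaled before the FP16 round trip and unscaled afterwards in full precision,
# the gradient survives with a small rounding error.
recovered = to_fp16(grad * loss_scale) / loss_scale

print(f"without scaling: {unscaled}")
print(f"with scaling:    {recovered:.3e}")
```

This is the mechanism behind `scaler.scale(loss).backward()` followed by unscaling before the optimizer step: the multiplication moves small gradients into FP16's representable range, and the division restores their magnitude.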

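Flash Attention: The Optimization That Changes the Rules

Flash Attention (Dao et al., 2022) is probably the most influential optimization for Transformers of recent years. It restructures the attention computation to be IO-aware: instead of materializing the full attention matrix in HBM (O(n^2) memory), it computes attention block by block while the data stays in SRAM. The result is O(n) memory instead of O(n^2) and a 2-4x speedup on long sequences.

Flash Attention 2 (2023) further improves GPU parallelism, reaching up to 72% of theoretical FP16 FLOPS. Flash Attention 3 (2024) adds FP8 support and Hopper-specific optimizations, for up to a 2x speedup over FA2.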
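The block-wise idea can be illustrated without a GPU. The sketch below is a simplified pure-Python model of the running ("online") softmax for a single query vector, not the actual fused kernel. The function names and the toy data are hypothetical. It visits the keys in blocks, keeps only a running maximum, a running sum, and an accumulator, and still reproduces the result of the full softmax.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax over all scores at once (stores one score per key)."""
    scale = len(q) ** -0.5
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, visiting keys block by block with a running softmax."""
    scale = len(q) ** -0.5
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[c] for w, v in zip(weights, v_blk))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy data: one query, five keys of dimension 4, values of dimension 2.
q = [0.5, -1.0, 2.0, 0.1]
keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -1.0, 1.0], [2.0, -1.0, 0.0, 0.5],
        [-0.5, 0.5, 1.5, 0.0], [0.0, 1.0, 1.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]

reference = full_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

The working state of the block-wise version does not grow with the number of keys, which is the property that lets the real kernel keep its intermediate data in SRAM.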

import torch
import torch.nn as nn
import torch.nn.functional as F
import time, math

# ============================================================
# FLASH ATTENTION vs STANDARD ATTENTION: COMPARISON
# ============================================================

def standard_attention(q, k, v, scale=None):
    """
    Standard attention: materializes the full NxN matrix in GPU memory.
    Memory complexity: O(N^2) per head
    """
    if scale is None:
        scale = q.size(-1) ** -0.5
    # [B, heads, N, N] - this matrix can be HUGE for long sequences!
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v


def flash_attention_native(q, k, v):
    """
    Flash Attention through PyTorch 2.0+ scaled_dot_product_attention.
    Automatically picks the best available implementation:
    - FlashAttention-2 when available (CUDA Ampere+)
    - Memory-efficient attention (xFormers) as a fallback
    - Standard attention as the last fallback
    """
    # Backend selection is handled by PyTorch
    return F.scaled_dot_product_attention(q, k, v, is_causal=False)


def benchmark_attention_implementations(
    batch_size=4, n_heads=12, seq_lengths=(512, 1024, 2048, 4096, 8192),
    d_head=64, device="cuda"
):
    """
    Compare standard vs Flash Attention across sequence lengths.
    """
    print(f"{'Seq Len':>10} | {'Standard (ms)':>15} | {'Flash (ms)':>12} | "
          f"{'Speedup':>10} | {'VRAM Std (MB)':>15} | {'VRAM Flash (MB)':>15}")
    print("-" * 90)

    for seq_len in seq_lengths:
        q = torch.randn(batch_size, n_heads, seq_len, d_head, device=device, dtype=torch.float16)
        k = torch.randn_like(q)
        v = torch.randn_like(q)

        # Warmup
        for _ in range(5):
            standard_attention(q, k, v)
            flash_attention_native(q, k, v)

        # Benchmark Standard
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            out_std = standard_attention(q, k, v)
        torch.cuda.synchronize()
        std_ms = (time.perf_counter() - t0) / 20 * 1000
        vram_std = torch.cuda.max_memory_allocated() / (1024**2)

        # Benchmark Flash Attention
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            out_flash = flash_attention_native(q, k, v)
        torch.cuda.synchronize()
        flash_ms = (time.perf_counter() - t0) / 20 * 1000
        vram_flash = torch.cuda.max_memory_allocated() / (1024**2)

        speedup = std_ms / flash_ms
        print(f"{seq_len:>10} | {std_ms:>15.2f} | {flash_ms:>12.2f} | "
              f"{speedup:>10.2f}x | {vram_std:>15.0f} | {vram_flash:>15.0f}")

# Typical results on an RTX 4090 (FP16, B=4, heads=12, d_head=64):
# Seq Len  | Standard (ms) | Flash (ms) | Speedup  | VRAM Std (MB) | VRAM Flash (MB)
# -----------------------------------------------------------------------------------
#      512 |          0.82 |       0.31 |    2.65x |           48  |           12
#     1024 |          2.45 |       0.58 |    4.22x |          192  |           24
#     2048 |          9.12 |       1.12 |    8.14x |          768  |           48
#     4096 |         35.80 |       2.21 |   16.20x |         3072  |           96
#     8192 |        144.20 |       4.38 |   32.92x |        12288  |          192
# Flash Attention memory scales LINEARLY: at seq=8192 it uses 64x less VRAM!
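The quadratic versus linear growth in the VRAM columns can be reproduced with simple arithmetic. The sketch below rests on an assumption that is not stated in the benchmark itself: standard attention holds two N x N FP16 buffers per head (raw scores and softmax output), while the fused path only needs the Q, K, V and output tensors. Both helper names are hypothetical.

```python
MIB = 1024 ** 2


def score_buffers_mib(batch, heads, seq_len, bytes_per_el=2, n_buffers=2):
    """N x N buffers of standard attention (assumed: raw scores + softmax output)."""
    return n_buffers * batch * heads * seq_len * seq_len * bytes_per_el / MIB


def qkvo_mib(batch, heads, seq_len, d_head, bytes_per_el=2):
    """Q, K, V and output tensors: buffers whose size grows linearly with seq_len."""
    return 4 * batch * heads * seq_len * d_head * bytes_per_el / MIB


for n in (512, 1024, 2048, 4096, 8192):
    quadratic = score_buffers_mib(4, 12, n)
    linear = qkvo_mib(4, 12, n, 64)
    print(f"{n:>5}: {quadratic:>8.0f} MiB vs {linear:>4.0f} MiB ({quadratic / linear:.0f}x)")
```

Under these assumptions, doubling the sequence length quadruples the score buffers but only doubles the Q/K/V/output tensors, so the gap between the two keeps widening as sequences get longer.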

Gradient Checkpointing and Gradient Accumulation

When VRAM is the bottleneck during training, two complementary techniques let you train with larger batches without upgrading your hardware.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import gc
import time

# ============================================================
# GRADIENT CHECKPOINTING
# ============================================================
# Idea: instead of storing every intermediate activation for the backward pass,
# recompute them on demand (trade-off: about +33% compute, -50-70% memory)

class CheckpointedTransformerBlock(nn.Module):
    """Transformer block with optional gradient checkpointing."""
    def __init__(self, d_model=768, n_heads=12, use_checkpoint=True):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4), nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def _attn_block(self, x):
        attn_out, _ = self.attn(x, x, x)
        return self.norm1(x + attn_out)

    def _ff_block(self, x):
        return self.norm2(x + self.ff(x))

    def forward(self, x):
        if not self.use_checkpoint:
            # Plain forward: all activations are kept for the backward pass
            return self._ff_block(self._attn_block(x))
        # Gradient checkpointing: each sub-block is recomputed during
        # the backward pass instead of having its activations stored
        x = checkpoint(self._attn_block, x, use_reentrant=False)
        x = checkpoint(self._ff_block, x, use_reentrant=False)
        return x


def enable_gradient_checkpointing_hf(model):
    """Enable gradient checkpointing on HuggingFace models."""
    model.gradient_checkpointing_enable()
    print(f"Gradient checkpointing enabled on {type(model).__name__}")


# Gradient checkpointing benchmark
def compare_checkpointing(seq_len=2048, batch_size=8, d_model=768,
                            n_layers=12, n_heads=12, device="cuda"):
    """Compare training with and without gradient checkpointing."""

    class SimpleTransformer(nn.Module):
        def __init__(self, use_checkpoint=False):
            super().__init__()
            self.use_checkpoint = use_checkpoint
            # Blocks are built without internal checkpointing so that the baseline
            # really stores every activation; checkpointing is applied per block below.
            self.blocks = nn.ModuleList([
                CheckpointedTransformerBlock(d_model, n_heads, use_checkpoint=False)
                for _ in range(n_layers)
            ])
            self.head = nn.Linear(d_model, 1000)

        def forward(self, x):
            for block in self.blocks:
                if self.use_checkpoint:
                    x = checkpoint(block, x, use_reentrant=False)
                else:
                    x = block(x)
            return self.head(x[:, 0])

    results = {}
    for use_ckpt in [False, True]:
        name = "with checkpointing" if use_ckpt else "without checkpointing"
        gc.collect()
        torch.cuda.empty_cache() if torch.cuda.is_available() else None
        torch.cuda.reset_peak_memory_stats() if torch.cuda.is_available() else None

        model = SimpleTransformer(use_checkpoint=use_ckpt).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        x = torch.randn(batch_size, seq_len, d_model, device=device)
        labels = torch.randint(0, 1000, (batch_size,), device=device)

        # Forward + backward
        torch.cuda.synchronize() if torch.cuda.is_available() else None
        t0 = time.perf_counter()
        for _ in range(10):
            optimizer.zero_grad()
            out = model(x)
            loss = nn.CrossEntropyLoss()(out, labels)
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize() if torch.cuda.is_available() else None
        elapsed = (time.perf_counter() - t0) / 10 * 1000

        vram = torch.cuda.max_memory_allocated() / (1024**2) if torch.cuda.is_available() else 0

        results[name] = {"vram_mb": round(vram, 1), "step_ms": round(elapsed, 1)}
        print(f"{name}: VRAM={vram:.0f}MB, Step={elapsed:.1f}ms")

    return results

# Typical results (12-layer Transformer, seq=2048, BS=8, RTX 3090):
# Without checkpointing: VRAM=18.4GB, Step=285ms
# With checkpointing:    VRAM= 7.8GB, Step=378ms  (-58% VRAM, +33% compute)


# ============================================================
# GRADIENT ACCUMULATION
# ============================================================
def train_with_gradient_accumulation(
    model, optimizer, train_loader, criterion,
    accumulation_steps: int = 4,
    device: str = "cuda"
):
    """
    Gradient accumulation: simulates a batch of batch_size * accumulation_steps
    with the memory footprint of batch_size.
    Useful when the batch size that fits in memory is too small for good convergence.
    """
    model = model.to(device).train()
    optimizer.zero_grad()

    for step, (imgs, labels) in enumerate(train_loader):
        imgs, labels = imgs.to(device), labels.to(device)

        # Forward pass
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            output = model(imgs)
            # Divide the loss by the accumulation steps (keeps the gradient scale correct)
            loss = criterion(output, labels) / accumulation_steps

        loss.backward()

        # Update the weights every N steps
        if (step + 1) % accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()

            effective_batch = imgs.size(0) * accumulation_steps
            print(f"Step {(step+1)//accumulation_steps} | "
                  f"Effective batch: {effective_batch} | Loss: {loss.item()*accumulation_steps:.4f}")
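Dividing the loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. The sketch below checks this on a toy one-parameter linear model with a mean-squared-error loss. The data and the helper `mse_grad` are hypothetical, and the equality assumes micro-batches of equal size.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient with respect to w of mean((w*x - y)^2) over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data for y roughly equal to 2x.
w = 0.3
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Gradient of a single batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same 8 samples as 4 micro-batches of 2, accumulated.
accumulation_steps = 4
micro = len(xs) // accumulation_steps
accumulated = 0.0
for i in range(accumulation_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    # Dividing by accumulation_steps mirrors `loss / accumulation_steps`.
    accumulated += mse_grad(w, xs[chunk], ys[chunk]) / accumulation_steps

print(full_batch_grad, accumulated)
```

Without the division the accumulated gradient would be `accumulation_steps` times too large, which is equivalent to silently multiplying the learning rate.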

torch.compile: Graph Optimization

torch.compile (PyTorch 2.0+) compiles the model into optimized kernels through Triton or other backends. It is the simplest optimization to apply: a single line of code can speed up inference by 1.5-2.5x.

import torch
from torchvision import models
import time, numpy as np

def benchmark_torch_compile():
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # ============================================================
    # COMPILATION MODES
    # ============================================================
    # "default":         Balance between compile time and speedup
    # "reduce-overhead": Minimizes overhead, best for small batches
    # "max-autotune":    Maximum speed (much longer compile time, ~5-10 min)
    # Note: "inductor" is the default backend, not a mode
    # (it uses Triton on CUDA and C++ on CPU)

    model_fp32 = models.resnet50(weights=None).to(device).eval()

    # Compilation with the default mode
    model_compiled_default = torch.compile(
        models.resnet50(weights=None).to(device).eval(),
        mode="default"
    )

    # Compilation for maximum speed
    model_compiled_max = torch.compile(
        models.resnet50(weights=None).to(device).eval(),
        mode="max-autotune",
        fullgraph=True  # Disallow graph breaks for maximum speedup
    )

    x = torch.randn(32, 3, 224, 224, device=device)

    def time_model(model, x, n=100):
        """Benchmark with warmup."""
        # Warmup (especially important for torch.compile, which compiles on first call)
        with torch.no_grad():
            for _ in range(20):
                model(x)
        torch.cuda.synchronize() if torch.cuda.is_available() else None

        latencies = []
        with torch.no_grad():
            for _ in range(n):
                t0 = time.perf_counter()
                model(x)
                torch.cuda.synchronize() if torch.cuda.is_available() else None
                latencies.append((time.perf_counter() - t0) * 1000)
        return np.mean(latencies)

    ms_eager = time_model(model_fp32, x)
    ms_default = time_model(model_compiled_default, x)
    # ms_max = time_model(model_compiled_max, x)  # Takes a long time to compile

    print(f"Eager (FP32):    {ms_eager:.2f} ms")
    print(f"Compiled default: {ms_default:.2f} ms ({ms_eager/ms_default:.2f}x speedup)")

    # BF16 + compile: the two speedups compound
    # The model is cast to BF16 before compiling so the graph is traced in that dtype
    model_bf16_compiled = torch.compile(
        models.resnet50(weights=None).to(device).to(torch.bfloat16).eval(),
        mode="default"
    )
    x_bf16 = x.to(torch.bfloat16)
    ms_bf16_compiled = time_model(model_bf16_compiled, x_bf16)
    print(f"BF16 + Compiled: {ms_bf16_compiled:.2f} ms ({ms_eager/ms_bf16_compiled:.2f}x speedup)")

    # Typical results on an RTX 4090:
    # Eager FP32:      12.4 ms/step (BS=32)
    # Compiled default: 7.8 ms/step (1.59x)
    # BF16 + Compiled:  5.1 ms/step (2.43x)

benchmark_torch_compile()

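KV Cache: Optimizing Autoregressive LLM Inference

In an autoregressive model, each generated token must attend to all previous tokens. Without optimization, the keys (K) and values (V) of those tokens are recomputed at every step, so generating a sequence of n tokens costs O(n^2) K/V projections. The KV cache stores each layer's K and V after every step, reducing that generation cost from O(n^2) to O(n).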
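The difference in work can be made concrete by counting projection calls. The sketch below is a toy model, not a Transformer: `ToyProjector` and its arithmetic are hypothetical stand-ins for a K/V projection layer, used only to count how often a token gets projected.

```python
class ToyProjector:
    """Stand-in for a K/V projection layer that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def project(self, token: int) -> tuple:
        self.calls += 1
        return (token * 2, token * 3)  # hypothetical (key, value) pair


def generate_without_cache(tokens, proj):
    kv = []
    for step in range(1, len(tokens) + 1):
        # Every step re-projects the whole prefix seen so far.
        kv = [proj.project(t) for t in tokens[:step]]
    return kv


def generate_with_cache(tokens, proj):
    cache = []
    for token in tokens:
        # Only the newest token is projected; older pairs are reused.
        cache.append(proj.project(token))
    return cache


tokens = list(range(1, 65))  # 64 hypothetical token ids
no_cache, with_cache = ToyProjector(), ToyProjector()

# Both strategies end with the same keys and values.
assert generate_without_cache(tokens, no_cache) == generate_with_cache(tokens, with_cache)
print(no_cache.calls, with_cache.calls)
```

For n tokens the uncached loop performs n(n+1)/2 projections and the cached loop performs n. The price is memory: the cache grows linearly with the sequence length and has to be kept for every layer.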

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

# ============================================================
# TRANSFORMER WITH KV CACHE
# ============================================================
class CachedMultiHeadAttention(nn.Module):
    """
    Multi-head attention with a KV cache for autoregressive generation.
    The cache avoids recomputing K and V for past tokens.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.scale = self.d_head ** -0.5

        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(
        self,
        x: torch.Tensor,               # [B, seq_len, d_model]
        kv_cache: Optional[Tuple] = None  # (K_cache, V_cache) or None
    ) -> Tuple[torch.Tensor, Tuple]:
        B, T, D = x.shape

        # Project Q, K, V
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Concatenate with the existing cache
        if kv_cache is not None:
            k_cache, v_cache = kv_cache
            k = torch.cat([k_cache, k], dim=2)  # [B, heads, T_total, d_head]
            v = torch.cat([v_cache, v], dim=2)

        # Attention (PyTorch 2.0+ selects Flash Attention automatically when available)
        # Causal masking is only needed for the prefill; a cached step has a single query
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(kv_cache is None))
        out = out.transpose(1, 2).contiguous().view(B, T, D)

        return self.out_proj(out), (k, v)  # Return the output and the updated cache


class CachedTransformerDecoder(nn.Module):
    """Transformer decoder with a KV cache for efficient generation."""
    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(2048, d_model)
        self.layers = nn.ModuleList([
            CachedMultiHeadAttention(d_model, n_heads)
            for _ in range(n_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)
        self.n_layers = n_layers

    @torch.no_grad()
    def generate(
        self,
        input_ids: torch.Tensor,  # [B, seq_len]
        max_new_tokens: int = 100,
        temperature: float = 1.0
    ) -> torch.Tensor:
        """
        Autoregressive generation with a KV cache.
        Each step reuses the cached K/V of the previous tokens.
        Positions must stay below 2048, the size of pos_embed.
        """
        B, T = input_ids.shape
        device = input_ids.device

        # Process the prompt (prefill)
        x = self.embed(input_ids)
        positions = torch.arange(T, device=device).unsqueeze(0)
        x = x + self.pos_embed(positions)

        # Initialize one cache per layer
        kv_caches = [None] * self.n_layers

        for i, (layer, norm) in enumerate(zip(self.layers, self.norms)):
            x_norm = norm(x)
            attn_out, kv_caches[i] = layer(x_norm, kv_caches[i])
            x = x + attn_out

        # Token-by-token generation (using the cache)
        generated = []
        for step in range(max_new_tokens):
            # Sample the next token from the hidden state of the last position.
            # At step 0 this is the last prompt position, so the prompt is not
            # fed through the layers (and cached) a second time.
            logits = self.head(x[:, -1, :]) / temperature
            new_token = torch.multinomial(torch.softmax(logits, -1), 1)
            generated.append(new_token)
            if step == max_new_tokens - 1:
                break  # nothing left to predict, skip the extra forward pass

            # Feed only the new token as the query; its K/V extend the cache
            x = self.embed(new_token)
            pos = torch.tensor([[T + step]], device=device)
            x = x + self.pos_embed(pos)

            for i, (layer, norm) in enumerate(zip(self.layers, self.norms)):
                x_norm = norm(x)
                attn_out, kv_caches[i] = layer(x_norm, kv_caches[i])
                x = x + attn_out

        return torch.cat(generated, dim=1)


# Benchmark generation with the KV cache
def benchmark_generation(model, vocab_size=32000, seq_len=128,
                           max_new=50, device="cuda"):
    model = model.to(device).eval()
    input_ids = torch.randint(0, vocab_size, (1, seq_len), device=device)
    on_gpu = str(device).startswith("cuda")

    # With KV cache (the default path of generate)
    if on_gpu:
        torch.cuda.synchronize()  # do not time previously queued kernels
    t0 = time.perf_counter()
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new)
    if on_gpu:
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
    t_cached = (time.perf_counter() - t0) * 1000
    tokens_per_sec = max_new / (t_cached / 1000)

    print(f"With KV cache: {t_cached:.1f}ms total, {tokens_per_sec:.1f} tokens/s")
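
To make the complexity argument concrete, the following sketch counts how many token positions have their K/V projected per layer, with and without a cache. It is a simplified cost model rather than a measurement: the function names are hypothetical, and it assumes the no-cache variant re-projects the entire prefix at every step.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """K/V projections per layer when the whole prefix is re-projected each step."""
    total = 0
    for step in range(new_tokens):
        # Step `step` sees the prompt plus the tokens generated so far.
        total += prompt_len + step
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """K/V projections per layer when past K/V are stored and reused."""
    # Prefill projects the prompt once; every later step projects one new token.
    # The last sampled token is never fed back, hence new_tokens - 1.
    return prompt_len + max(new_tokens - 1, 0)


for n in (50, 500, 5000):
    naive = kv_projections_without_cache(128, n)
    cached = kv_projections_with_cache(128, n)
    print(f"new_tokens={n}: naive={naive}, cached={cached}, ratio={naive / cached:.1f}x")
```

The naive count grows quadratically with the number of generated tokens while the cached count grows linearly, so the gap widens as the output gets longer.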

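Systematic comparison: from 48GB down to an 8GB RTX

To wrap up, we progressively apply all the optimizations covered in the series to a base model and show the resulting accuracy/memory/speed trade-offs.

Full comparison: Llama-3.1-8B on an RTX 3090 (24GB)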

Configuration	VRAM	Throughput	HellaSwag	Perplexity	Notes
BF16 baseline	16.0GB	38 t/s	82.1%	6.14	Reference benchmark
+ Flash Attention 2	14.2GB	52 t/s	82.1%	6.14	-11% VRAM, +37% speed
+ torch.compile	14.2GB	68 t/s	82.1%	6.14	+31% over Flash Attention
INT8 (bitsandbytes)	8.5GB	35 t/s	81.8%	6.21	-47% VRAM, -0.3% acc
INT4 NF4 (bnb)	4.9GB	42 t/s	81.2%	6.47	-69% VRAM, -0.9% acc
GPTQ INT4	4.8GB	55 t/s	81.5%	6.39	-70% VRAM, -0.6% acc
AWQ INT4	4.7GB	52 t/s	81.6%	6.35	-71% VRAM, -0.5% acc
GGUF Q4_K_M (CPU)	0 VRAM (5GB RAM)	18 t/s	81.3%	6.42	No GPU required

Approximate values on an RTX 3090 (24GB VRAM). Throughput measured with batch=1, seq=512.
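
The percentages in the Notes column can be recomputed from the VRAM and throughput columns. The sketch below does that for the GPU rows, using the numbers copied from the table; the helper names `relative_change` and `summarize` are illustrative and not part of the original benchmark code.

```python
# (configuration, VRAM in GB, throughput in tokens/s), copied from the table above
BASELINE = ("BF16 baseline", 16.0, 38.0)
ROWS = [
    ("+ Flash Attention 2", 14.2, 52.0),
    ("+ torch.compile", 14.2, 68.0),
    ("INT8 (bitsandbytes)", 8.5, 35.0),
    ("INT4 NF4 (bnb)", 4.9, 42.0),
    ("GPTQ INT4", 4.8, 55.0),
    ("AWQ INT4", 4.7, 52.0),
]


def relative_change(value: float, reference: float) -> float:
    """Signed percentage change of `value` with respect to `reference`."""
    return (value - reference) / reference * 100.0


def summarize(rows=ROWS, baseline=BASELINE) -> list:
    """Return (name, VRAM change %, throughput change %) for every row."""
    _, base_vram, base_tps = baseline
    return [
        (name, relative_change(vram, base_vram), relative_change(tps, base_tps))
        for name, vram, tps in rows
    ]


for name, d_vram, d_tps in summarize():
    print(f"{name:<22} VRAM {d_vram:+6.1f}%   throughput {d_tps:+6.1f}%")
```

Note that the +31% reported for torch.compile is relative to the Flash Attention row (68 vs 52 t/s); measured against the BF16 baseline the same configuration is roughly +79%.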

Decision guide: which optimization for which scenario

# DECISION TREE FOR DL OPTIMIZATION

def recommend_optimization(
    vram_available_gb: float,
    task: str,  # "training" | "inference" | "edge"
    accuracy_critical: bool,
    hardware: str  # "server_gpu" | "consumer_gpu" | "cpu" | "edge"
) -> dict:
    """
    Recommend the most appropriate optimizations for a given scenario.
    """
    steps = []

    # === ALWAYS DO (zero or near-zero cost) ===
    steps.append("Mixed Precision (BF16/FP16): ALWAYS enable on Ampere+ GPUs")
    steps.append("Flash Attention: enable if seq_len > 512")
    steps.append("torch.compile: enable if PyTorch 2.0+, +30-50% inference speedup")
    steps.append("KV Cache: ALWAYS enable for LLM autoregressive generation")

    if task == "training":
        if vram_available_gb < 24:
            steps.append("Gradient Checkpointing: -50% VRAM, +33% compute")
            steps.append("Gradient Accumulation: simulates larger batches")
        if hardware in ["consumer_gpu", "edge"]:
            steps.append("QLoRA: fine-tuning with INT4 + LoRA on consumer GPUs")

    if task in ["inference", "edge"]:
        if not accuracy_critical:
            if hardware == "server_gpu":
                steps.append("GPTQ INT4: maximum throughput on NVIDIA GPUs")
            elif hardware in ["consumer_gpu", "cpu"]:
                steps.append("AWQ INT4 or GGUF Q4_K_M: for heterogeneous hardware")
            elif hardware == "edge":
                steps.append("GGUF Q3_K_M or Q4_K_M: for Raspberry Pi / embedded")
        else:
            steps.append("INT8 (bitsandbytes): minimal accuracy loss")

        if vram_available_gb < 16:
            steps.append("ONNX Export: reduces runtime overhead, +20-40%")
            steps.append("Consider distillation to a smaller model")

    # Number the steps at the end so the list has no gaps when a branch is skipped
    priority = [f"{n}. {step}" for n, step in enumerate(steps, start=1)]

    print("=== OPTIMIZATION RECOMMENDATIONS ===")
    for p in priority:
        print(f"  {p}")
    return {"priorities": priority}

# Examples:
print("--- Scenario 1: Fine-tuning on RTX 4080 (16GB) ---")
recommend_optimization(16, "training", True, "consumer_gpu")

print("\n--- Scenario 2: Inference on Raspberry Pi ---")
recommend_optimization(0, "inference", False, "edge")

print("\n--- Scenario 3: Production on A100 (80GB) ---")
recommend_optimization(80, "inference", True, "server_gpu")

Optimization summary: expected impact

Technique	VRAM savings	Speedup	Accuracy loss	Complexity
Mixed precision BF16	-50%	2-3x	0%	Low (1 line)
Flash Attention 2	-50-90%	2-8x	0%	Low (1 line)
torch.compile	0%	1.5-2.5x	0%	Low (1 line)
KV cache	+VRAM	10-50x (generation)	0%	Low
Gradient checkpointing	-50-70%	0.7x (slower)	0%	Low
INT8 quantization	-50%	0.9-1.1x	0-0.5%	Low
INT4 GPTQ/AWQ	-75%	1.3-1.8x	0.5-1.5%	Medium
Distillation	-70-90%	5-20x	5-15%	High
Structured pruning	-30-70%	2-5x	2-10%	High
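
The KV cache is the only row that costs VRAM instead of saving it. A rough way to size that cost is to count the stored elements: two tensors (K and V) per layer, each of shape [batch, kv_heads, seq_len, d_head]. The sketch below does this arithmetic. The function name is hypothetical, and the Llama-3.1-8B figures (32 layers, 8 KV heads with grouped-query attention, head dimension 128) are assumptions taken from the public model configuration, so verify them before relying on the result.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int, seq_len: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Size of the K and V tensors stored across all layers, in bytes."""
    # 2 tensors (K and V) per layer, each [batch, n_kv_heads, seq_len, d_head]
    return 2 * n_layers * batch_size * n_kv_heads * seq_len * d_head * bytes_per_elem


# Assumed Llama-3.1-8B configuration, BF16 cache (2 bytes per element)
llama = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=8192)
print(f"Llama-3.1-8B, 8192 tokens: {llama / 2**30:.2f} GiB")

# The small decoder defined earlier in this article with its defaults:
# 6 layers, 8 heads, d_head = 512 // 8 = 64, FP32 weights (4 bytes per element)
toy = kv_cache_bytes(n_layers=6, n_kv_heads=8, d_head=64, seq_len=2048,
                     bytes_per_elem=4)
print(f"Toy decoder, 2048 tokens: {toy / 2**20:.0f} MiB")
```

Under these assumptions the cache grows linearly with sequence length and batch size, which is why long contexts on a quantized model can end up dominated by the cache rather than by the weights.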

Series conclusions

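We have covered the entire Advanced Deep Learning and Edge Deployment series: from the attention mechanism in Transformers to fine-tuning with LoRA, from GPTQ quantization to structured pruning, from distillation to Vision Transformers, from NAS to edge deployment on Raspberry Pi and Jetson, and from Ollama to this final benchmark.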

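The key message is clear: there is no single "best" technique. The optimal choice always depends on the context (available hardware, accuracy requirements, target latency, operating costs). With the systematic benchmarking framework presented in this article, however, you can measure instead of guessing and make informed decisions.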

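The trend for 2026 is clear: models are moving toward the edge. Gartner predicts that by 2027 SLMs will be used three times more than cloud LLMs. The techniques in this series (quantization, distillation, edge deployment, Ollama) are not academic niches; they are fundamental skills for anyone who wants to work with AI in the coming years.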

Related series: MLOps | Computer Vision | AI Engineering