Hi! I'm

Federico Calò

AI and GPU Workloads on Kubernetes: Device Plugin and Training Jobs

In 2026, 66% of AI inference clusters run on Kubernetes (CNCF Survey 2026). The reason is simple: Kubernetes solves the toughest operational problems of AI workloads, including intelligent GPU scheduling, elastic scaling of training jobs, integration with distributed storage for datasets, and automatic retry after a node failure. But setting up Kubernetes for GPU workloads requires specific skills that go well beyond deploying an ordinary web application.

In this article we will see how to configure the NVIDIA Device Plugin to expose GPUs to the cluster, how to schedule distributed training jobs with PyTorch and TensorFlow, how to use Karpenter to provision spot GPU nodes (reducing costs by 40-70%), and which patterns optimize GPU usage for inference workloads in production.

What You Will Learn

Installation and configuration of the NVIDIA Device Plugin for Kubernetes
Scheduling Pods that request GPUs (nvidia.com/gpu resource)
Distributed training with PyTorchJob and TFJob (Kubeflow Training Operator)
Karpenter NodePool for automatic provisioning of spot GPU nodes
GPU time-slicing to share GPUs between multiple Pods
MIG (Multi-Instance GPU) for partitioning A100/H100 GPUs
GPU monitoring with DCGM Exporter and Grafana
Patterns for high-throughput inference with TorchServe on Kubernetes

GPU Architecture on Kubernetes

Kubernetes is not natively aware of GPUs. GPUs are exposed to the cluster through the device plugin framework: a plugin, typically deployed as a DaemonSet on every GPU node, registers with the kubelet and manages the allocation of GPUs to containers. The NVIDIA Device Plugin is the most widely used implementation of this framework.

Installing the NVIDIA Device Plugin

# Prerequisites: NVIDIA GPU drivers installed on the nodes
# Check the drivers on the nodes
kubectl get nodes -l accelerator=nvidia
kubectl describe node gpu-node-1 | grep -i nvidia

# Install the NVIDIA Device Plugin with Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.16.0 \
  --set failOnInitError=false

# Or with the static manifest
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.0/deployments/static/nvidia-device-plugin.yml

# Verify that the GPUs are visible in the cluster
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu" != null)'
# The output includes "nvidia.com/gpu": "8" for a node with 8 A100 GPUs
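
The same check can be scripted when it is part of a larger automation. The following is a minimal sketch, assuming the input is the JSON printed by kubectl get nodes -o json; the function name gpu_allocatable and the sample node names are illustrative and not part of any library.

```python
import json


def gpu_allocatable(nodes_json: str, resource: str = "nvidia.com/gpu") -> dict:
    """Map node name -> allocatable GPU count for nodes that advertise the resource."""
    result = {}
    for item in json.loads(nodes_json).get("items", []):
        allocatable = item.get("status", {}).get("allocatable", {})
        if resource in allocatable:
            # Kubernetes serializes extended-resource quantities as strings.
            result[item["metadata"]["name"]] = int(allocatable[resource])
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "96", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "16"}}},
    ]
})

print(gpu_allocatable(SAMPLE))
```

Nodes that do not advertise the resource, such as CPU-only nodes, are simply left out of the result.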

Installing NVIDIA GPU Operator (Recommended Approach)

For production clusters, the NVIDIA GPU Operator automatically manages all the necessary components: drivers, device plugin, container runtime configuration, and DCGM Exporter for monitoring:

# Install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v24.9.0 \
  --set driver.enabled=true \
  --set mig.strategy=single \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true

# Verify the installation
kubectl get pods -n gpu-operator
# Wait until the device plugin Pods are Ready
kubectl wait --for=condition=ready pod -l app=nvidia-device-plugin-daemonset -n gpu-operator --timeout=300s

# Verify allocatable GPUs
kubectl describe node gpu-node-1 | grep -A 5 "Allocatable:"
# nvidia.com/gpu: 8

Scheduling Pods with GPUs

Once the Device Plugin is active, you can request GPUs in manifests like any other Kubernetes resource, with one difference: GPUs are specified in limits. You can omit the GPU from requests, and if you do set it there, the value must equal the limit. GPUs cannot be overcommitted, so a container always receives exactly the number of whole GPUs it asks for.

# pod-gpu-basic.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ["python3", "-c"]
      args:
        - |
          import torch
          print(f"CUDA available: {torch.cuda.is_available()}")
          print(f"GPU count: {torch.cuda.device_count()}")
          print(f"GPU name: {torch.cuda.get_device_name(0)}")
          x = torch.rand(1000, 1000).cuda()
          print(f"Tensor on GPU: {x.device}")
      resources:
        limits:
          nvidia.com/gpu: "1"    # request 1 GPU
          memory: "16Gi"
          cpu: "4"
        requests:
          memory: "16Gi"
          cpu: "4"
      volumeMounts:
        - name: model-storage
          mountPath: /models
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: model-pvc
  nodeSelector:
    accelerator: "nvidia-a100"  # schedule only on A100 nodes
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
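
The rule about limits and requests is easy to get wrong in hand-written manifests, so it can help to check it before applying them. The following is a minimal sketch of such a check; check_gpu_resources is a hypothetical helper, and it takes the resources section of a container as a plain dictionary.

```python
def check_gpu_resources(resources: dict, resource: str = "nvidia.com/gpu") -> list:
    """Return a list of problems with a container's GPU settings; empty means OK."""
    problems = []
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})

    # A GPU may not appear in requests unless it also appears in limits.
    if resource in requests and resource not in limits:
        problems.append("GPU set in requests but missing from limits")

    if resource in limits:
        raw = str(limits[resource])
        # GPUs are whole devices: fractional values such as "0.5" are rejected.
        if not raw.isdigit():
            problems.append("GPU limit must be a whole number")
        # If both are given, they must match.
        if resource in requests and str(requests[resource]) != raw:
            problems.append("GPU request must equal GPU limit")
    return problems


print(check_gpu_resources({"limits": {"nvidia.com/gpu": "1", "memory": "16Gi"}}))
```

The check only covers the GPU entry; CPU and memory keep the usual requests and limits semantics.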

Distributed Training with the Kubeflow Training Operator

Training large models often requires multiple GPUs on multiple nodes. The Kubeflow Training Operator manages distributed training jobs with PyTorchJob, TFJob, MXJob and MPIJob. Install the operator first:

# Install the Training Operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.8.0"

# Verify
kubectl get pods -n kubeflow
kubectl get crd | grep kubeflow

PyTorchJob for Multi-GPU Multi-Node Training

# pytorch-distributed-training.yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetuning-job
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: company.registry.io/training:llm-v2.1
              command:
                - python3
                - -m
                - torch.distributed.run
                - --nproc_per_node=8
                - --nnodes=4
                - --node_rank=$(RANK)
                - --master_addr=$(MASTER_ADDR)
                - --master_port=23456
                - train_llm.py
                - --model=llama-7b
                - --dataset=/data/training_set
                - --batch-size=32
                - --epochs=3
                - --output=/models/finetuned
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_SOCKET_IFNAME
                  value: "eth0"
              resources:
                limits:
                  nvidia.com/gpu: "8"
                  memory: "120Gi"
                  cpu: "32"
                requests:
                  memory: "120Gi"
                  cpu: "32"
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: model-output
                  mountPath: /models
                - name: shm
                  mountPath: /dev/shm
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-dataset-pvc
            - name: model-output
              persistentVolumeClaim:
                claimName: model-output-pvc
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"  # shared memory for NCCL
          nodeSelector:
            accelerator: "nvidia-a100-80gb"
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"
    Worker:
      replicas: 3  # 3 workers + 1 master = 4 nodes, 32 GPUs total
      restartPolicy: OnFailure
      template:
        spec: # same spec as the Master...
          containers:
            - name: pytorch
              image: company.registry.io/training:llm-v2.1
              resources:
                limits:
                  nvidia.com/gpu: "8"
                  memory: "120Gi"
                  cpu: "32"

Karpenter for Spot GPU Node Provisioning

GPUs are the most expensive resource in the cloud. Spot GPU instances cost 60-70% less than on-demand instances. Karpenter manages the automatic provisioning of spot GPU nodes, falling back to on-demand when spot capacity is unavailable:

# karpenter-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    metadata:
      labels:
        role: gpu-worker
        accelerator: nvidia
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws  # karpenter.sh/v1 uses group instead of apiVersion
        kind: EC2NodeClass
        name: gpu-nodeclass
      requirements:
        # AWS GPU instance types
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p4d.24xlarge     # 8x A100 40GB
            - p3.8xlarge       # 4x V100
            - g5.12xlarge      # 4x A10G
            - g4dn.12xlarge    # 4x T4
        # Prefer spot
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand  # fallback
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 256   # at most 256 GPUs provisioned by this NodePool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30m  # remove spot GPU nodes once the job has finished
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - alias: al2@latest  # karpenter.k8s.aws/v1 requires amiSelectorTerms
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  instanceStorePolicy: RAID0  # use the local NVMe disks for ephemeral storage
  userData: |
    #!/bin/bash
    # First-boot script; NVIDIA driver installation steps would go here if the AMI does not include them
    /etc/eks/bootstrap.sh my-cluster
    nvidia-smi  # check that the GPUs are visible

GPU Time-Slicing: Sharing a GPU Between Multiple Pods

For light inference or development workloads, an entire GPU is often wasted. GPU time-slicing lets you share one physical GPU between multiple Pods, each of which sees a "virtual GPU" with a slice of the compute time:

# gpu-time-slicing-config.yaml
# Configure the Device Plugin for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU becomes 4 logical "GPUs"
---
# Apply the config to the operator. The ClusterPolicy name depends on the
# installation (commonly cluster-policy); "default" selects the key in the ConfigMap.
kubectl patch clusterpolicy gpu-cluster-policy \
  -n gpu-operator \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

# Verify: a node with 1 A100 GPU now shows 4 allocatable GPUs
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
# Allocatable:
#   nvidia.com/gpu:  4

# Pod that uses 1/4 of a GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
    - name: model-server
      image: company.registry.io/inference:v1
      resources:
        limits:
          nvidia.com/gpu: "1"  # gets a 1/4 time share of the physical GPU
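
The effect of replicas on what the node advertises is a plain multiplication. The sketch below also derives a per-Pod memory budget, with an important caveat: time-slicing shares compute time only and does not enforce any memory limit, so the budget is a convention each workload has to respect on its own. Both function names are illustrative.

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources a node advertises with time-slicing."""
    return physical_gpus * replicas


def memory_budget_per_pod_gib(gpu_memory_gib: float, replicas: int) -> float:
    """Self-imposed memory budget per Pod; time-slicing itself does not enforce it."""
    return gpu_memory_gib / replicas


# One 80 GiB GPU shared by 4 replicas, as in the ConfigMap above.
print(advertised_gpus(physical_gpus=1, replicas=4))
print(memory_budget_per_pod_gib(gpu_memory_gib=80, replicas=4))
```

If one Pod exceeds its share of memory, the other Pods on the same GPU can fail with out-of-memory errors, which is why time-slicing suits development more than production.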

MIG: Multi-Instance GPU for A100 and H100

NVIDIA A100 and H100 GPUs support MIG (Multi-Instance GPU), which partitions the GPU into hardware-isolated instances (not just time-sharing). Each MIG instance has guaranteed memory and compute and does not interfere with the others:

# Configure MIG on the node (run on the GPU node itself, not through kubectl)
# Requires: NVIDIA driver >= 525, A100 or H100 GPU

# Enable MIG mode on the GPU
sudo nvidia-smi -mig 1

# Create 7 MIG instances of 1/7 of an A100 80GB each (profile 1g.10gb, ID 19),
# together with their compute instances (-C)
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C

# Verify the instances that were created
sudo nvidia-smi mig -lgi
# +-------------------------------------------------------+
# | GPU instances:                                         |
# | GPU   Name             Profile  Instance   Placement  |
# |                        ID       ID         Start:Size |
# |=======================================================|
# |   0  MIG 1g.10gb       19       1          0:1        |
# |   0  MIG 1g.10gb       19       2          1:1        |
# |   0  MIG 1g.10gb       19       3          2:1        |
# ... (7 instances in total)

# In the Kubernetes cluster, configure the MIG strategy in the GPU Operator
kubectl patch clusterpolicy gpu-cluster-policy \
  -n gpu-operator \
  --type json \
  -p '[{"op":"replace","path":"/spec/mig/strategy","value":"mixed"}]'

# Pod that requests a specific MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: inference-mig
spec:
  containers:
    - name: model
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: "1"  # request 1 MIG 1g.10gb instance
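
With the mixed strategy, each MIG profile is exposed as its own resource named after the profile. The sketch below maps a profile to its resource name and computes how many instances a node advertises when every GPU is split into the same profile. The table lists the maximum number of instances per A100 80GB for homogeneous layouts; mixing profiles on one GPU is subject to placement rules that this sketch does not model.

```python
# Maximum instances per A100 80GB GPU when the whole GPU uses a single profile.
A100_80GB_MAX_INSTANCES = {
    "1g.10gb": 7,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "4g.40gb": 1,
    "7g.80gb": 1,
}


def mig_resource_name(profile: str) -> str:
    """Resource name exposed for a MIG profile under the mixed strategy."""
    return f"nvidia.com/mig-{profile}"


def allocatable_mig(gpus_per_node: int, profile: str) -> dict:
    """Resources a node advertises if every GPU is split into `profile`."""
    count = gpus_per_node * A100_80GB_MAX_INSTANCES[profile]
    return {mig_resource_name(profile): count}


# An 8-GPU node fully partitioned into 1g.10gb instances.
print(allocatable_mig(gpus_per_node=8, profile="1g.10gb"))
```

Small profiles maximize the number of tenants per GPU, while larger ones trade density for more memory and compute per instance.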

GPU Monitoring with DCGM Exporter

GPU monitoring is essential both to understand whether training jobs use the hardware efficiently and to support FinOps. DCGM Exporter exposes GPU metrics to Prometheus:

# DCGM Exporter is installed automatically with the GPU Operator
# Verify that the metrics are available (adjust the Service name to your installation)
kubectl port-forward svc/gpu-operator-dcgm-exporter 9400:9400 -n gpu-operator &
curl -s localhost:9400/metrics | grep DCGM_FI

# Key metrics to monitor:
# DCGM_FI_DEV_GPU_UTIL        - GPU utilization (0-100%)
# DCGM_FI_DEV_MEM_COPY_UTIL   - memory copy (bandwidth) utilization (0-100%)
# DCGM_FI_DEV_FB_USED         - GPU framebuffer memory used (MiB)
# DCGM_FI_DEV_POWER_USAGE     - power draw (W)
# DCGM_FI_DEV_SM_CLOCK        - streaming multiprocessor clock (MHz)
# DCGM_FI_DEV_GPU_TEMP        - GPU temperature (°C)

# Alert: underutilized GPU (< 50% for 30 minutes = waste)
- alert: GPUUnderutilized
  expr: DCGM_FI_DEV_GPU_UTIL < 50
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu }} on node {{ $labels.Hostname }}: utilization < 50%"
    description: "Consider whether the job can be terminated or optimized"

# Grafana dashboard: import ID 12239 (NVIDIA DCGM Exporter Dashboard)
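
The alert condition above, "below 50% for 30 minutes", can be reasoned about offline on a list of samples. The following is a simplified sketch of that logic, not a reimplementation of Prometheus rule evaluation: it only checks whether the most recent uninterrupted run of low samples spans at least the required duration. The function name and the sample data are illustrative.

```python
def underutilized(samples, threshold=50.0, for_seconds=1800):
    """samples: list of (timestamp_seconds, utilization_percent), sorted by time.

    Returns True if utilization stayed below `threshold` without interruption
    for at least `for_seconds` up to the most recent sample.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start_below = None
    # Walk backwards until the first sample that breaks the low-utilization run.
    for timestamp, utilization in reversed(samples):
        if utilization >= threshold:
            break
        start_below = timestamp
    return start_below is not None and end - start_below >= for_seconds


# One sample per minute for 31 minutes, all at 20% utilization.
idle = [(minute * 60, 20.0) for minute in range(32)]
# Same series with a burst of work at minute 20, which resets the timer.
bursty = [(t, 90.0 if t == 20 * 60 else u) for t, u in idle]

print(underutilized(idle), underutilized(bursty))
```

A single busy sample resets the window, which matches the intent of the "for" clause: the alert targets GPUs that stay idle, not short pauses between batches.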

Deploying Models for Inference with TorchServe

For production inference, you need a model server that handles load balancing across multiple model replicas, request batching, and versioning. TorchServe is the official PyTorch solution:

# torchserve-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
  namespace: ml-inference
spec:
  replicas: 3   # 3 replicas for high availability
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8082"   # TorchServe metrics port
    spec:
      containers:
        - name: torchserve
          image: pytorch/torchserve:0.11.0-gpu
          args:
            - torchserve
            - --start
            - --model-store=/models
            - --models=text-classifier=bert-classifier.mar
            - --ts-config=/config/config.properties
          ports:
            - containerPort: 8080  # inference API
            - containerPort: 8081  # management API
            - containerPort: 8082  # metrics
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "16Gi"
              cpu: "4"
            requests:
              memory: "8Gi"
              cpu: "2"
          readinessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /ping
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          volumeMounts:
            - name: model-store
              mountPath: /models
            - name: ts-config
              mountPath: /config
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc
        - name: ts-config
          configMap:
            name: torchserve-config
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: torchserve-config
  namespace: ml-inference
data:
  config.properties: |
    inference_address=http://0.0.0.0:8080
    management_address=http://0.0.0.0:8081
    metrics_address=http://0.0.0.0:8082
    number_of_gpu=1
    batch_size=32
    # max_batch_delay is in ms: wait up to 100 ms to fill a batch
    max_batch_delay=100
    max_response_size=6553500
    install_py_dep_per_model=true
---
# Latency-based autoscaling with KEDA (event-driven autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: ml-inference
spec:
  scaleTargetRef:
    name: model-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: ts_queue_latency_microseconds
        threshold: "100000"  # 100 ms of queue latency = scale up
        query: avg(ts_queue_latency_microseconds{model_name="text-classifier"})

Best Practices for AI Workloads on Kubernetes

GPU Cost Optimization

Spot for training, on-demand for inference: Training can handle interruptions with checkpoints; inference must always be available
Frequent checkpoints: Save checkpoints every 30 minutes to resume training after a spot interruption
Time-slicing for development: Use time-sliced GPUs for developers, MIG or full GPUs for production
Scale-to-zero: Karpenter removes spot GPU nodes when training finishes, so you don't pay for idle GPUs
Batching in inference: TorchServe with batch_size=32 can increase throughput 10-20x compared to single requests
Profile before deploying: Use NVIDIA Nsight to profile your training job and identify inefficiencies
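
The checkpoint advice above boils down to a loop that saves its state periodically and resumes from the last saved state after an interruption. The following sketch shows the pattern with the standard library only. The JSON file stands in for a real model checkpoint, the arithmetic stands in for a training step, checkpoints are taken every N steps rather than every 30 minutes to keep the example deterministic, and stop_at simulates a spot interruption.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so an interruption mid-write never leaves a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "loss_sum": 0.0}


def train(path, total_steps, checkpoint_every, stop_at=None):
    """Run the loop; `stop_at` simulates an interruption at that step."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated interruption: unsaved progress is lost
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "ckpt.json")
    train(checkpoint, total_steps=100, checkpoint_every=10, stop_at=37)
    print("resuming from step", load_checkpoint(checkpoint)["step"])
    print("finished at step", train(checkpoint, 100, 10)["step"])
```

The steps between the last checkpoint and the interruption are repeated after the restart, so the checkpoint interval directly bounds the amount of work a spot interruption can cost.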

Common GPU Errors on Kubernetes

Pods without a toleration for the GPU taint: GPU nodes carry the taint nvidia.com/gpu=true:NoSchedule; without a matching toleration the Pod is never scheduled on a GPU node
No memory isolation between containers: When containers share a GPU (for example with time-slicing), GPU memory is not isolated the way CPU memory is. And even with a dedicated GPU, if the model needs more memory than the GPU provides, the job crashes with a CUDA OOM error
NCCL without shared memory: PyTorch distributed training uses NCCL, which requires a large /dev/shm (typically 10-60GB); always configure an emptyDir with medium: Memory
Not monitoring GPU utilization: A GPU at 20% utilization is a huge waste. The DCGM dashboard should be the first place you look after every deployment
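
Many CUDA OOM crashes can be anticipated with a rough estimate before the job is submitted. The sketch below multiplies the parameter count by the bytes needed per parameter. The two defaults are common rules of thumb rather than exact figures: about 2 bytes per parameter for half-precision inference weights, and about 16 bytes per parameter for mixed-precision training with Adam (weights, gradients and optimizer state). Activations, KV caches and framework overhead are not included, so real usage is higher. The function names and the 90% headroom are assumptions made for illustration.

```python
def model_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory footprint in GiB, excluding activations and overhead."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def fits_on_gpu(required_gib: float, gpu_memory_gib: float, headroom: float = 0.9) -> bool:
    """True if the estimate fits within a fraction of the GPU memory."""
    return required_gib <= gpu_memory_gib * headroom


# A 7-billion-parameter model on an 80 GiB GPU.
inference = model_memory_gib(7, bytes_per_param=2)   # half-precision weights only
training = model_memory_gib(7, bytes_per_param=16)   # mixed precision with Adam
print(round(inference, 1), fits_on_gpu(inference, 80))
print(round(training, 1), fits_on_gpu(training, 80))
```

Under these assumptions the model fits comfortably on one GPU for inference but not for full fine-tuning, which is where sharding across several GPUs becomes necessary.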

Conclusions and Next Steps

Kubernetes has become the standard platform for AI/ML workloads for good reason: its resource abstraction, advanced scheduling system, and operator ecosystem (Kubeflow, Training Operator) make it well suited to orchestrating both training and inference at scale. With Karpenter automatically provisioning spot GPU nodes, the cost of a training job can be reduced by 40-70% compared to on-demand instances.

The next step is to integrate these workloads into a complete MLOps pipeline: model tracking with MLflow, dataset management with DVC, and CI/CD for automatic retraining. The FinOps for Kubernetes article (Article 9 in this series) covers how to measure and optimize the total cost of GPU workloads in the cluster.

Related Series

MLOps and Machine Learning in Production: CI/CD pipelines for ML, model registry
Deep Learning and Neural Networks: theoretical foundations of model training on GPUs
Autoscaling in Kubernetes: KEDA for scaling based on custom metrics