Federico Calò

Ollama and Local LLMs: Running Models on Your Own Hardware

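In 2023, running a large language model locally was a job for the few. It required deep technical expertise: compiling llama.cpp, converting weights, configuring GGML parameters and managing complex dependencies. Then Ollama arrived, and everything changed. With a single command (ollama run llama3), anyone can run a competitive LLM on a laptop within minutes.

The trend is explosive. Ollama passed 1 million monthly downloads in 2024, a 300% increase year over year. The market is clearly choosing privacy (data never leaves the device), zero API cost, customization (custom models, fixed system prompts) and offline availability. These advantages are driving many business workflows to migrate from cloud APIs to local deployment.

This guide goes from installation to production: how to configure Ollama, choose the right model, create custom Modelfiles, expose the REST API, integrate LangChain for an offline RAG pipeline, and tune GGUF models for specific use cases on laptops, servers and the Raspberry Pi.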

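What You Will Learn

Installing Ollama on Windows, macOS and Linux
Model selection guide: Llama, Qwen, Phi, Gemma, Mistral, DeepSeek
Modelfile: creating custom assistants with custom parameters
Ollama REST API: integration with Python, JavaScript and cURL
Ollama from Python through the official library and the OpenAI-compatible API
Offline RAG pipeline with LangChain and FAISS
Deployment on Raspberry Pi and headless servers with systemd
OpenWebUI: a ChatGPT-like interface that runs fully offline
Detailed benchmarks and choosing a quantization level
Multi-model management and optimization for production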

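How Ollama Works Under the Hood

Before using Ollama, it helps to understand how it works internally. Ollama is a wrapper around llama.cpp, the C++ inference engine that made it possible to run quantized models on commodity hardware. On top of it, Ollama adds:

Model registry: a pull/push system for GGUF models, similar to Docker Hub
REST API server: exposes a local HTTP server on port 11434
Model caching: keeps models loaded in RAM between requests
GPU detection: automatically detects NVIDIA CUDA, AMD ROCm and Apple Metal
Context management: manages the context window and the KV cache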

# Ollama architecture - simplified diagram
#
#  Client (Python/cURL/Browser)
#         |
#         v
#  [Ollama REST API - port 11434]
#         |
#         v
#  [Model Manager]  ---  ~/.ollama/models/  (storage GGUF)
#         |
#         v
#  [llama.cpp backend]
#         |
#    _____|______
#   |           |
#  [CPU]     [GPU/Metal]
#  ARM/x86   CUDA/ROCm/Metal
#
# Model format: GGUF (GPT-Generated Unified Format)
# Quantization levels: Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
#
# Where models live on disk:
# macOS/Linux: ~/.ollama/models/
# Windows:     C:\Users\USERNAME\.ollama\models\
#
# Directory layout:
# ~/.ollama/models/
# ├── blobs/       (binary GGUF files, identified by SHA256)
# └── manifests/   (metadata: which blob = which model:tag)

import subprocess

def ollama_status():
    """Print the installed models and whether the Ollama process is running."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True
    )
    print("Installed models:")
    print(result.stdout)

    # Check for the process (pgrep is available on Linux and macOS, not Windows)
    ps = subprocess.run(
        ["pgrep", "-x", "ollama"], capture_output=True, text=True
    )
    running = ps.returncode == 0
    print(f"Ollama running: {running}")

ollama_status()
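Because `pgrep` is Unix-only, a more portable check is to ask the REST API itself. The sketch below assumes the server listens on the default `http://localhost:11434` and uses `GET /api/tags`, which returns the locally installed models. The helper names `fetch_installed_models` and `format_models` are illustrative and not part of any Ollama library.

```python
import json
import urllib.request


def fetch_installed_models(base_url: str = "http://localhost:11434") -> dict:
    """Return the parsed JSON body of GET /api/tags (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def format_models(payload: dict) -> list[str]:
    """Turn an /api/tags payload into 'name  size' lines, largest model first."""
    models = sorted(payload.get("models", []),
                    key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m['name']}  {m.get('size', 0) / 1e9:.1f} GB" for m in models]


# Hand-written payload shaped like the API response (sizes are in bytes):
sample = {"models": [
    {"name": "llama3.2:1b", "size": 1_300_000_000},
    {"name": "llama3.1:8b", "size": 4_700_000_000},
]}
for line in format_models(sample):
    print(line)
```

With a server running, `format_models(fetch_installed_models())` gives the same listing for the real installation.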

Installation and First Steps

Ollama installs with a single command and needs no configuration. It supports macOS (Apple Silicon and Intel), Windows (with NVIDIA or AMD GPUs) and Linux (deb/rpm/generic).

# ================================================================
# INSTALLING OLLAMA
# ================================================================

# Linux (one command):
# curl -fsSL https://ollama.com/install.sh | sh

# macOS / Windows:
# Download the installer from https://ollama.com/download
# (on Windows, CUDA support is included automatically if an NVIDIA GPU is present)

# Verify the installation:
# ollama --version
# ollama serve  (starts the server manually if it is not already running)

# ================================================================
# BASIC COMMANDS
# ================================================================

# Run a model (downloaded automatically if not present)
# ollama run llama3.2

# List the models available locally
# ollama list

# Pull without running (to pre-download)
# ollama pull llama3.2:3b

# Show detailed information about a model
# ollama show llama3.2

# Remove a model (frees disk space)
# ollama rm llama3.2:old-version

# Copy a model under a different name
# ollama cp llama3.2 my-custom-model

# ================================================================
# USEFUL ENVIRONMENT VARIABLES
# ================================================================

# Listen on all interfaces (for access from the network)
# export OLLAMA_HOST=0.0.0.0:11434

# Custom directory for the models
# export OLLAMA_MODELS=/mnt/ssd/ollama-models

# Maximum number of parallel requests (the default depends on the Ollama version)
# export OLLAMA_NUM_PARALLEL=4

# Maximum number of models kept in memory (the default depends on the Ollama version)
# export OLLAMA_MAX_LOADED_MODELS=2

# Time before a model is unloaded from RAM (default: 5m)
# export OLLAMA_KEEP_ALIVE=30m

# ================================================================
# POPULAR MODELS AND HARDWARE REQUIREMENTS (2025)
# ================================================================

MODELS_GUIDE = {
    # SMALL models (for Raspberry Pi / 8 GB laptops)
    "qwen2.5:1.5b":      {"size": "0.9 GB", "ram": "2 GB",  "quality": 7, "rpi5_tps": 4.5},
    "llama3.2:1b":       {"size": "1.3 GB", "ram": "2 GB",  "quality": 7, "rpi5_tps": 5.1},
    "phi3.5:mini":       {"size": "2.2 GB", "ram": "4 GB",  "quality": 8, "rpi5_tps": 2.8},
    "qwen2.5:3b":        {"size": "1.9 GB", "ram": "4 GB",  "quality": 8, "rpi5_tps": 2.1},
    "gemma2:2b":         {"size": "1.6 GB", "ram": "3 GB",  "quality": 8, "rpi5_tps": 3.2},

    # MEDIUM models (16+ GB laptops / desktops)
    "llama3.2:3b":       {"size": "2.0 GB", "ram": "4 GB",  "quality": 8, "rpi5_tps": 1.8},
    "mistral:7b":        {"size": "4.1 GB", "ram": "8 GB",  "quality": 9, "rpi5_tps": 0.8},
    "llama3.1:8b":       {"size": "4.7 GB", "ram": "8 GB",  "quality": 9, "rpi5_tps": 0.6},
    "qwen2.5:7b":        {"size": "4.4 GB", "ram": "8 GB",  "quality": 9, "rpi5_tps": 0.7},
    "deepseek-r1:8b":    {"size": "4.9 GB", "ram": "8 GB",  "quality": 9, "rpi5_tps": 0.5},

    # LARGE models (24+ GB workstations / servers)
    "llama3.1:70b":      {"size": "40 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
    "qwen2.5:72b":       {"size": "41 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
    "deepseek-r1:32b":   {"size": "19 GB", "ram": "32 GB", "quality": 10, "rpi5_tps": None},
}

print("Recommended models by hardware:")
print("  Raspberry Pi 5 (8GB): qwen2.5:1.5b, llama3.2:1b, gemma2:2b")
print("  Laptop 16GB:          llama3.1:8b, qwen2.5:7b, mistral:7b")
print("  Mac M2/M3 (24GB):     llama3.1:8b, gemma2:9b, qwen2.5:14b")
print("  Workstation 48GB+:    llama3.1:70b, deepseek-r1:32b")
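The table above can also be queried programmatically. The following is a minimal sketch, assuming a dictionary in the same shape as `MODELS_GUIDE` and a hypothetical `headroom_gb` reserve for the operating system; the function names are illustrative.

```python
def parse_gb(text: str) -> float:
    """Convert a string such as '8 GB' to the float 8.0."""
    return float(text.split()[0])


def models_that_fit(guide: dict, available_ram_gb: float,
                    headroom_gb: float = 2.0) -> list[str]:
    """Return the model tags whose RAM need fits the budget, best quality first.

    headroom_gb is an assumed reserve for the OS and other processes.
    """
    budget = available_ram_gb - headroom_gb
    fitting = [(name, info) for name, info in guide.items()
               if parse_gb(info["ram"]) <= budget]
    # Sort by quality (descending), then by RAM need (ascending).
    fitting.sort(key=lambda item: (-item[1]["quality"], parse_gb(item[1]["ram"])))
    return [name for name, _ in fitting]


# A three-entry excerpt in the same shape as MODELS_GUIDE:
guide = {
    "llama3.2:1b": {"ram": "2 GB", "quality": 7},
    "phi3.5:mini": {"ram": "4 GB", "quality": 8},
    "llama3.1:8b": {"ram": "8 GB", "quality": 9},
}
print(models_that_fit(guide, available_ram_gb=8))
```

The 2 GB headroom is only a starting point; a machine that also runs a browser or an IDE may need more.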

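Quantization Levels: Which GGUF Should You Choose?

When you run ollama pull llama3.1:8b, Ollama downloads the default quantization published for that tag (usually Q4_K_M). You can also choose a quantization explicitly. Each level is a trade-off between quality, size and speed.

GGUF Quantization Level Guide

Tag / Format	Bits per weight	Size (7B)	Perplexity loss	Recommended for
Q2_K	2.63 bits	2.7 GB	+15-20%	Only when RAM is a hard constraint
Q4_K_S	4.37 bits	4.5 GB	+2-3%	Good speed/quality balance
Q4_K_M	4.58 bits	4.8 GB	+1-2%	Recommended default (the sweet spot)
Q5_K_M	5.68 bits	5.7 GB	+0.5-1%	Best quality in under 6 GB of RAM
Q6_K	6.57 bits	6.6 GB	+0.1-0.3%	Nearly identical to F16, needs more RAM
Q8_0	8.5 bits	8.5 GB	~0%	Highest quality, needs 9 GB of RAM or more
F16	16 bits	14 GB	0% (baseline)	Training and fine-tuning, not inference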
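The relationship between bits per weight and file size can be approximated with simple arithmetic: parameters times bits, divided by 8. The sketch below is a rough estimate only. It ignores file metadata and the fact that some tensors are stored at a different precision, so published file sizes (including the ones in the table) can differ from it. The function name is illustrative.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9


# Rough estimates for a model with 7 billion parameters:
for name, bpw in [("Q4_K_M", 4.58), ("Q8_0", 8.5), ("F16", 16)]:
    print(f"{name}: ~{estimate_size_gb(7e9, bpw):.1f} GB")
```

The same function gives a quick first check of whether a model of a given parameter count is likely to fit in the available RAM before downloading it.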

# Choosing the quantization explicitly in Ollama
# The available tags depend on the model; they are listed on the model's page in the Ollama library

# Default (the tag's default quantization, usually Q4_K_M):
# ollama pull llama3.1:8b

# Specify the quantization manually (the tag syntax depends on the model):
# ollama pull llama3.1:8b-instruct-q4_K_M
# ollama pull llama3.1:8b-instruct-q5_K_M
# ollama pull llama3.1:8b-instruct-q8_0

# For Hugging Face models that are not in the Ollama registry:
# download the GGUF manually and import it with a Modelfile:
IMPORT_GGUF_MODELFILE = """
FROM ./path/to/model-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "Sei un assistente utile."
"""

# Save the text of IMPORT_GGUF_MODELFILE to a file named Modelfile, then:
# ollama create mio-modello -f Modelfile
# ollama run mio-modello

# Performance comparison Q4 vs Q5 vs Q8 (Llama 3.1 8B, MacBook M3 Pro):
QUANT_BENCHMARK = {
    "Q4_K_M": {"size_gb": 4.8, "tps": 38.2, "quality_vs_f16": "98.5%"},
    "Q5_K_M": {"size_gb": 5.7, "tps": 33.1, "quality_vs_f16": "99.2%"},
    "Q6_K":   {"size_gb": 6.6, "tps": 29.4, "quality_vs_f16": "99.7%"},
    "Q8_0":   {"size_gb": 8.5, "tps": 24.8, "quality_vs_f16": "99.9%"},
}

for quant, data in QUANT_BENCHMARK.items():
    print(f"{quant}: {data['size_gb']}GB, {data['tps']}t/s, quality={data['quality_vs_f16']}")
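Since the Modelfile text above lives in a Python string, it has to be written to disk before `ollama create` can read it. A minimal sketch of that step follows; the helper names are hypothetical, and `create_model` assumes the `ollama` CLI is on the PATH.

```python
import subprocess
from pathlib import Path


def write_modelfile(content: str, path: str = "Modelfile") -> Path:
    """Write the Modelfile text to disk and return its path."""
    target = Path(path)
    target.write_text(content.strip() + "\n", encoding="utf-8")
    return target


def create_command(model_name: str, modelfile: Path) -> list[str]:
    """Build the argument list for `ollama create <name> -f <Modelfile>`."""
    return ["ollama", "create", model_name, "-f", str(modelfile)]


def create_model(model_name: str, content: str, path: str = "Modelfile") -> int:
    """Write the Modelfile, run `ollama create`, and return the exit code."""
    modelfile = write_modelfile(content, path)
    return subprocess.run(create_command(model_name, modelfile)).returncode
```

For example, `create_model("mio-modello", IMPORT_GGUF_MODELFILE)` would write the file and register the model in one call.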

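Modelfile: Creating Custom Assistants

A Modelfile is Ollama's mechanism for creating custom models. It defines the base model, the system prompt and the generation parameters (temperature, top_p, context window), and it can also extend the model with additional files. It is the equivalent of a Dockerfile, but for language models.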

# ================================================================
# PRACTICAL MODELFILE EXAMPLES
# ================================================================

# --- Modelfile 1: Italian technical assistant ---
MODEL_FILE_TECH = """
FROM qwen2.5:7b

# Generation parameters
# Low temperature = more deterministic answers
PARAMETER temperature 0.3
# Nucleus sampling
PARAMETER top_p 0.9
# Top-k sampling
PARAMETER top_k 40
# Context window (4096-32768)
PARAMETER num_ctx 8192
# Discourages repetition
PARAMETER repeat_penalty 1.1

# System prompt (defines the model's behavior)
SYSTEM \"\"\"
Sei un assistente tecnico esperto in Python, deep learning e machine learning.
Rispondi SEMPRE in italiano, in modo conciso e tecnico.
Quando mostri codice, usa sempre blocchi markdown con il linguaggio specificato.
Se non sei sicuro di qualcosa, dillo esplicitamente.
Non inventare informazioni o API che non esistono.
\"\"\"

# Example exchange (seeds the conversation history)
MESSAGE user "Ciao!"
MESSAGE assistant "Ciao! Sono il tuo assistente tecnico. Come posso aiutarti oggi con Python, deep learning o machine learning?"
"""

# Create the model (after saving the text of MODEL_FILE_TECH as Modelfile-tech-it):
# ollama create assistente-tech-it -f Modelfile-tech-it
# ollama run assistente-tech-it

# --- Modelfile 2: Code review assistant ---
MODEL_FILE_CODE = """
FROM llama3.1:8b

# Very deterministic, suited to code
PARAMETER temperature 0.1
# Large context for long files
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.05

SYSTEM \"\"\"
You are an expert code reviewer. When reviewing code:
1. Identify bugs, security issues, and performance problems
2. Suggest specific improvements with code examples
3. Follow PEP8/language standards
4. Be concise: list issues with severity (CRITICAL/HIGH/MEDIUM/LOW)
Be direct and actionable. Never hallucinate API methods.
\"\"\"
"""

# --- Modelfile 3: RAG over company documents ---
MODEL_FILE_RAG = """
FROM qwen2.5:7b

PARAMETER temperature 0.1
# Long context for documents
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1.0

SYSTEM \"\"\"
Sei un assistente che risponde SOLO basandosi sui documenti forniti nel contesto.
Se non trovi la risposta nel contesto, dì esattamente: "Non ho informazioni su questo nei documenti forniti."
Non aggiungere mai informazioni esterne. Cita sempre il documento sorgente nella risposta.
Rispondi in italiano.
\"\"\"
"""

print("Modelfiles ready. To create the models:")
print("  ollama create assistente-tech-it -f Modelfile-tech-it")
print("  ollama create code-reviewer -f Modelfile-code")
print("  ollama create rag-assistant -f Modelfile-rag")
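The three Modelfiles share the same structure: a `FROM` line, a few `PARAMETER` lines and a `SYSTEM` block. A small helper can assemble that text from Python values. This is a sketch with a hypothetical function name, and it assumes the system prompt itself contains no triple quotes.

```python
from typing import Optional


def render_modelfile(base: str, system: str = "",
                     params: Optional[dict] = None) -> str:
    """Assemble Modelfile text from a base model, parameters and a system prompt."""
    lines = [f"FROM {base}"]
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    if system:
        # Triple quotes let the system prompt span several lines.
        lines.append(f'SYSTEM """{system.strip()}"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "qwen2.5:7b",
    system="Answer only from the provided context.",
    params={"temperature": 0.1, "num_ctx": 8192},
))
```

Generating the text this way keeps variants of the same assistant (different context sizes or temperatures) in one place instead of in several hand-edited files.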

Ollama REST API: Integrating with Python

Ollama exposes two APIs: its own native API and an OpenAI-compatible one. The OpenAI-compatible endpoint lets you swap OpenAI for Ollama by changing only the base URL, leaving the rest of the application code untouched.

# pip install ollama openai requests

import ollama
import json, time
from typing import Iterator

# ================================================================
# 1. OFFICIAL OLLAMA LIBRARY (Python)
# ================================================================

# Simple chat (non-streaming)
def chat_simple(model: str, message: str) -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": message}]
    )
    return response['message']['content']

# Streaming chat (token by token)
def chat_streaming(model: str, messages: list) -> Iterator[str]:
    stream = ollama.chat(
        model=model,
        messages=messages,
        stream=True
    )
    for chunk in stream:
        if chunk['message']['content']:
            yield chunk['message']['content']

# Embeddings for RAG (use nomic-embed-text or mxbai-embed-large)
def get_embedding(model: str, text: str) -> list:
    response = ollama.embeddings(model=model, prompt=text)
    return response['embedding']

# Chat with images (multimodal models: llava, bakllava, moondream)
def chat_with_image(model: str, prompt: str, image_path: str) -> str:
    import base64
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = ollama.chat(
        model=model,  # a multimodal model, e.g. "llava:7b" or "moondream"
        messages=[{
            "role": "user",
            "content": prompt,
            "images": [image_data]
        }]
    )
    return response['message']['content']

# Chatbot that keeps the conversation history
def interactive_chat(model: str = "llama3.2:3b"):
    history = []
    print(f"Chat with {model} (type 'exit' to quit)")

    while True:
        user_input = input("Tu: ").strip()
        if user_input.lower() == "exit":
            break

        history.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)
        full_response = ""
        for chunk in chat_streaming(model, history):
            print(chunk, end="", flush=True)
            full_response += chunk
        print()

        history.append({"role": "assistant", "content": full_response})


# ================================================================
# 2. OPENAI-COMPATIBLE API (drop-in replacement)
# ================================================================
from openai import OpenAI

# Only the base_url changes; the rest of the code stays the same
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any string works; the local server ignores it
)

def chat_openai_compatible(model: str, prompt: str) -> str:
    """Same call as the OpenAI API, but served locally by Ollama."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=500
    )
    return response.choices[0].message.content

# ================================================================
# 3. RAW REST API (without the Ollama client library)
# ================================================================
import requests

def ollama_raw_api(model: str, prompt: str, stream: bool = False) -> str:
    """Call the Ollama API directly with requests."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": stream,
            "options": {
                "temperature": 0.7,
                "num_predict": 200,
                "num_ctx": 4096
            }
        },
        stream=stream,  # let requests hand over lines as they arrive
        timeout=120
    )
    resp.raise_for_status()
    if not stream:
        return resp.json()["response"]
    else:
        # Streaming: each line is a JSON object
        result = ""
        for line in resp.iter_lines():
            if line:
                data = json.loads(line)
                result += data.get("response", "")
                if data.get("done"):
                    break
        return result

# ================================================================
# 4. MODEL SPEED BENCHMARK
# ================================================================
def benchmark_model(model: str, n_runs: int = 3):
    """Measure generation speed in tokens/s (wall-clock time, includes model load)."""
    prompt = "Explain quantum computing in one paragraph."
    results = []
    for _ in range(n_runs):
        t0 = time.time()
        response = ollama.generate(
            model=model,
            prompt=prompt,
            options={"num_predict": 100}
        )
        elapsed = time.time() - t0
        eval_count = response.get('eval_count', 100)
        tps = eval_count / elapsed
        results.append(tps)

    avg_tps = sum(results) / len(results)
    print(f"{model}: {avg_tps:.1f} tokens/s (mean of {n_runs} runs)")
    return avg_tps

# Typical results on a MacBook M3 Pro 18GB:
# qwen2.5:1.5b  ~85 t/s
# llama3.2:3b   ~62 t/s
# qwen2.5:7b    ~42 t/s
# llama3.1:8b   ~38 t/s
# qwen2.5:14b   ~22 t/s
# llama3.1:70b  ~8  t/s
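The streaming format used by `/api/generate` is newline-delimited JSON, and the final chunk (the one with `done` set to true) carries server-side counters such as `eval_count` and `eval_duration` (in nanoseconds). The sketch below shows how those pieces fit together without needing a running server. The function names and the sample lines are illustrative.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> tuple[str, dict]:
    """Join the 'response' pieces of an NDJSON stream; return the text and final chunk."""
    text, final = [], {}
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk
            break
    return "".join(text), final


def tokens_per_second(final_chunk: dict) -> float:
    """Generation speed from the server-side counters (eval_duration is in ns)."""
    duration_ns = final_chunk.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return final_chunk.get("eval_count", 0) / (duration_ns / 1e9)


# Hand-written lines shaped like an /api/generate stream:
sample = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true, "eval_count": 50, "eval_duration": 2000000000}',
]
text, final = collect_stream(sample)
print(text, tokens_per_second(final))
```

Using the server's own counters avoids counting model load time and network overhead as generation time, which wall-clock timing cannot separate.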

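Ollama with LangChain: Offline RAG Pipeline

Ollama integrates with LangChain, which makes it possible to build RAG (Retrieval-Augmented Generation) pipelines that run completely offline. This is especially relevant for enterprise applications whose sensitive data cannot be sent to the cloud. The nomic-embed-text model is a good fit for local embeddings.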

# pip install langchain langchain-ollama langchain-community
# pip install faiss-cpu chromadb pypdf

from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import (
    DirectoryLoader, TextLoader, PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
import os

# ================================================================
# OFFLINE RAG PIPELINE WITH OLLAMA - PRODUCTION VERSION
# ================================================================

class OllamaRAGSystem:
    """
    Complete, offline RAG system built on Ollama.
    Supports PDF, TXT and MD files in a directory.
    Uses FAISS as the local vector store.
    """

    def __init__(
        self,
        llm_model: str = "llama3.1:8b",
        embed_model: str = "nomic-embed-text",  # ollama pull nomic-embed-text
        kb_dir: str = "./knowledge_base"
    ):
        self.llm_model = llm_model
        self.embed_model = embed_model
        self.kb_dir = kb_dir

        self.embeddings = OllamaEmbeddings(model=embed_model)
        self.llm = OllamaLLM(
            model=llm_model,
            temperature=0.1,
            num_ctx=8192,
            num_predict=512
        )
        self.vectorstore = None

    def load_documents(self, docs_dir: str) -> list:
        """Load documents from a directory (PDF, TXT, MD)."""
        docs = []

        # Load TXT and MD files (one DirectoryLoader per pattern)
        for pattern in ("**/*.txt", "**/*.md"):
            text_loader = DirectoryLoader(
                docs_dir, glob=pattern, loader_cls=TextLoader
            )
            docs.extend(text_loader.load())

        # Load PDF files from the top level of the directory
        for pdf_file in os.listdir(docs_dir):
            if pdf_file.endswith(".pdf"):
                loader = PyPDFLoader(os.path.join(docs_dir, pdf_file))
                docs.extend(loader.load())

        print(f"Loaded {len(docs)} documents from {docs_dir}")
        return docs

    def build_knowledge_base(self, docs_dir: str) -> None:
        """Build the knowledge base from a directory and save it to disk."""
        documents = self.load_documents(docs_dir)

        splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        texts = splitter.split_documents(documents)
        print(f"Created {len(texts)} chunks")

        self.vectorstore = FAISS.from_documents(texts, self.embeddings)
        self.vectorstore.save_local(self.kb_dir)
        print(f"Knowledge base saved to {self.kb_dir}")

    def load_knowledge_base(self) -> None:
        """Load an existing knowledge base from disk."""
        self.vectorstore = FAISS.load_local(
            self.kb_dir, self.embeddings,
            allow_dangerous_deserialization=True
        )
        print(f"Knowledge base loaded: {self.vectorstore.index.ntotal} vectors")

    def create_qa_chain(self) -> RetrievalQA:
        """Create the chain for question answering over the documents."""
        prompt_template = """Usa il seguente contesto per rispondere alla domanda.
Se non trovi la risposta nel contesto, dì esplicitamente che non lo sai.
Non inventare informazioni non presenti nel contesto.

Contesto:
{context}

Domanda: {question}

Risposta in italiano:"""

        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )

        retriever = self.vectorstore.as_retriever(
            search_type="mmr",  # Maximum Marginal Relevance (more diverse results)
            search_kwargs={"k": 5, "fetch_k": 20}
        )

        return RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True
        )

    def ask(self, question: str, qa_chain: RetrievalQA) -> dict:
        """Ask the RAG system a question."""
        result = qa_chain.invoke({"query": question})
        sources = list(set([
            doc.metadata.get("source", "Unknown")
            for doc in result["source_documents"]
        ]))
        return {
            "answer": result["result"],
            "sources": sources,
            "n_docs": len(result["source_documents"])
        }


# Usage:
# rag = OllamaRAGSystem(llm_model="llama3.1:8b")
# rag.build_knowledge_base("./documenti_aziendali")
# chain = rag.create_qa_chain()
# result = rag.ask("Qual e la policy ferie aziendale?", chain)
# print(result["answer"])
# print("Sources:", result["sources"])

print("RAG system ready!")
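The `chunk_size=1000` and `chunk_overlap=200` settings used above control how documents are cut before embedding. The sketch below illustrates the overlap idea with a plain fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines. The function name is hypothetical.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Small example so the overlap is visible:
for piece in chunk_text("abcdefghij", size=4, overlap=2):
    print(piece)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.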

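OpenWebUI: A ChatGPT-Style Interface for Ollama

Open WebUI (formerly Ollama WebUI) is the most widely used interface for Ollama. It offers a user experience similar to ChatGPT while running completely offline. It supports chat, document upload, conversation management, prompt sharing, built-in RAG and multimodal input for images.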

# ================================================================
# OPENWEBUI SETUP WITH DOCKER
# ================================================================

# Case 1: Ollama on the same host (on Linux, also pass --add-host=host.docker.internal:host-gateway)
# docker run -d -p 3000:8080 \
#     -v open-webui:/app/backend/data \
#     -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
#     --name open-webui \
#     ghcr.io/open-webui/open-webui:main

# Case 2: OpenWebUI with bundled Ollama (all in one)
# docker run -d -p 3000:8080 \
#     -v ollama:/root/.ollama \
#     -v open-webui:/app/backend/data \
#     --gpus all \
#     --name open-webui \
#     ghcr.io/open-webui/open-webui:ollama

# Access: http://localhost:3000

# ================================================================
# DOCKER COMPOSE (recommended for production)
# ================================================================
DOCKER_COMPOSE = """
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
    # For NVIDIA GPUs:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=True
      - WEBUI_SECRET_KEY=cambia-questa-chiave-segreta
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:
"""

# ================================================================
# OLLAMA AS A SYSTEMD SERVICE (Linux production)
# ================================================================
SYSTEMD_SERVICE = """
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/opt/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=10m"

[Install]
WantedBy=default.target
"""

# After saving the unit file:
# sudo systemctl daemon-reload
# sudo systemctl enable ollama
# sudo systemctl start ollama
# sudo journalctl -u ollama -f  # Follow the logs in real time

print("Ollama service setup complete!")
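After `systemctl start` or `docker compose up`, the server may need a moment before it accepts requests. A readiness check along the following lines can gate any script that depends on it. This is a sketch: it assumes the default address `http://localhost:11434` and relies on the server answering `GET /` with HTTP 200; the function names are illustrative.

```python
import http.client
import time
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if the server answers GET / with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, timeouts and refused connections are all OSError subclasses.
        return False


def wait_for_ollama(base_url: str = "http://localhost:11434",
                    max_wait: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until the server responds or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if is_ollama_up(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A deployment script could call `wait_for_ollama()` right after starting the service and abort with a clear message if it returns False.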

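Deploying on Raspberry Pi: Optimized Setup

A Raspberry Pi 5 with 8 GB of RAM is the most accessible edge device for local LLMs. With the right configuration, models with 1.5B parameters reach 4 to 5 tokens per second, which is enough for many non-real-time use cases: low-volume chatbots, batch text analysis and event-triggered automation.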

# ================================================================
# OLLAMA ON RASPBERRY PI 5 (optimized setup)
# ================================================================

# Installation (identical to Linux x86):
# curl -fsSL https://ollama.com/install.sh | sh

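# Suggested configuration for the RPi 5 (environment variables for the Ollama
# service; with systemd they can be set through `sudo systemctl edit ollama`):
# OLLAMA_NUM_PARALLEL=1       # One request at a time (limited RAM)
# OLLAMA_MAX_LOADED_MODELS=1  # One model in memory
# OLLAMA_KEEP_ALIVE=5m        # Unload the model after 5 minutes of inactivity
# The thread count is a per-model option rather than an environment variable:
# PARAMETER num_thread 4 in a Modelfile uses all four Cortex-A76 cores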

# Recommended models for the RPi 5 (8GB):
# ollama pull qwen2.5:1.5b    (fast: ~4.5 t/s, 1.8 GB RAM)
# ollama pull llama3.2:1b     (balanced: ~5.1 t/s, 1.4 GB RAM)
# ollama pull gemma2:2b       (quality: ~3.2 t/s, 2.5 GB RAM)

import ollama
import time, statistics, psutil

def benchmark_ollama_rpi(model: str = "qwen2.5:1.5b",
                          n_tests: int = 5):
    """Measure generation speed and consistency on a Raspberry Pi."""
    prompt = "Spiega in 3 frasi cos'è il machine learning."
    results = []
    latencies_to_first = []

    print(f"Benchmarking {model} over {n_tests} runs...")
    for i in range(n_tests):
        t0 = time.time()
        first_token = None
        full_response = ""

        for chunk in ollama.chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            options={"temperature": 0, "top_k": 1, "num_predict": 50}
        ):
            content = chunk['message']['content']
            if content and first_token is None:
                first_token = time.time() - t0
                latencies_to_first.append(first_token * 1000)
            full_response += content

        elapsed = time.time() - t0
        n_tokens = len(full_response.split())  # Rough estimate: counts words, not model tokens
        tps = n_tokens / elapsed
        results.append(tps)
        # first_token stays None if the model returned no content
        ttft_ms = first_token * 1000 if first_token is not None else float("nan")
        print(f"  Test {i+1}: {tps:.1f} t/s, TTFT: {ttft_ms:.0f}ms")

    mean_tps = statistics.mean(results)
    # statistics.mean raises on an empty list, so fall back to NaN
    mean_ttft = statistics.mean(latencies_to_first) if latencies_to_first else float("nan")
    mem = psutil.virtual_memory()

    print(f"\nResults for {model} on RPi5:")
    print(f"  Mean speed:  {mean_tps:.1f} t/s")
    print(f"  Mean TTFT:   {mean_ttft:.0f} ms")
    print(f"  RAM used:    {mem.used/(1024**3):.1f} GB / {mem.total/(1024**3):.1f} GB")

    return mean_tps

# ================================================================
# AUTOMATION: model updates and monitoring
# ================================================================
import subprocess, datetime

def update_ollama_models(models: tuple = ("qwen2.5:1.5b", "nomic-embed-text")):
    """Update the Ollama models (meant to be run from cron)."""
    log = []
    for model in models:
        print(f"Updating {model}...")
        result = subprocess.run(
            ["ollama", "pull", model],
            capture_output=True, text=True, timeout=600
        )
        status = "OK" if result.returncode == 0 else "FAIL"
        log.append({
            "model": model,
            "status": status,
            "time": datetime.datetime.now().isoformat()
        })
        print(f"  {model}: {status}")

    return log

# Suggested cron job (every Sunday at 3:00):
# 0 3 * * 0 /usr/bin/python3 /home/pi/update_models.py >> /var/log/ollama-update.log 2>&1

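Real-world case study: an offline business chatbot

A real use case: a company with confidential documents (contracts, HR policies, technical manuals) wants an internal chatbot without exposing its data to the cloud. With Ollama + RAG, a fully air-gapped system can be built in less than a day.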
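To make the retrieval half of RAG concrete, here is a minimal sketch of what a vector store does when the chatbot looks up context: it ranks stored chunks by cosine similarity to the query embedding and pastes the best ones into the prompt. The toy 3-dimensional vectors and the `cosine`, `top_k_chunks` and `build_prompt` helpers are illustrative assumptions; a real deployment would get its embeddings from a model such as nomic-embed-text and search them with an index such as FAISS.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, store, k=2):
    """Return the k chunk texts most similar to query_vec.

    store is a list of (text, vector) pairs: a hypothetical in-memory
    stand-in for a real vector index.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Prepend the retrieved chunks to the user question."""
    context = "\n\n".join(chunks)
    return f"Context from company documents:\n{context}\n\nQuestion: {question}"

# Toy embeddings, made up for illustration only
store = [
    ("Vacation policy: 26 days per year.", [0.9, 0.1, 0.0]),
    ("VPN setup guide for new laptops.", [0.0, 0.2, 0.9]),
    ("Sick leave requires a medical note.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question below
chunks = top_k_chunks(query_vec, store, k=2)
prompt = build_prompt("How many vacation days do I get?", chunks)
print(prompt)
```

Because nothing here leaves the process, the same flow works on a machine with no network access, which is the point of the air-gapped setup.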

# ================================================================
# OFFLINE CORPORATE CHATBOT - Full stack
# ================================================================

# Stack:
# - Ollama with llama3.1:8b (or qwen2.5:7b for better Italian)
# - nomic-embed-text for embeddings
# - FAISS for the vector store
# - FastAPI for the REST API
# - OpenWebUI for the user interface

# fastapi_chatbot.py
import time

from fastapi import FastAPI
from pydantic import BaseModel
import ollama

app = FastAPI(title="Corporate AI Assistant", version="2.0")

# Global state (use Redis in production)
conversation_store = {}

# Assumption: rag_system is the RAG pipeline object (exposing .vectorstore)
# defined elsewhere; leaving it as None disables retrieval in /chat
rag_system = None

class ChatRequest(BaseModel):
    session_id: str
    message: str
    model: str = "qwen2.5:7b"
    use_rag: bool = True

class ChatResponse(BaseModel):
    session_id: str
    response: str
    sources: list = []
    model: str
    tokens_per_sec: float

@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    # Retrieve the conversation history
    if request.session_id not in conversation_store:
        conversation_store[request.session_id] = []

    history = conversation_store[request.session_id]

    # Add RAG context if requested
    context = ""
    sources = []
    if request.use_rag and rag_system and rag_system.vectorstore:
        docs = rag_system.vectorstore.similarity_search(
            request.message, k=3
        )
        context = "\n\n".join([d.page_content for d in docs])
        sources = list(set([d.metadata.get("source", "") for d in docs]))

        # Inject the context into the message
        augmented_message = f"""Context from company documents:
{context}

Question: {request.message}"""
    else:
        augmented_message = request.message

    history.append({"role": "user", "content": augmented_message})

    # Generate the response
    t0 = time.time()
    response = ollama.chat(
        model=request.model,
        messages=history,
        options={"num_ctx": 8192, "temperature": 0.3}
    )
    elapsed = time.time() - t0

    assistant_msg = response['message']['content']
    history.append({"role": "assistant", "content": assistant_msg})

    # Truncate the history if it gets too long (sliding window)
    if len(history) > 20:
        history = history[-20:]
    conversation_store[request.session_id] = history

    eval_count = response.get('eval_count', 50)
    tps = eval_count / elapsed if elapsed > 0 else 0

    return ChatResponse(
        session_id=request.session_id,
        response=assistant_msg,
        sources=sources,
        model=request.model,
        tokens_per_sec=round(tps, 1)
    )

@app.delete("/chat/{session_id}")
async def clear_session(session_id: str):
    """Clear the history of a session."""
    if session_id in conversation_store:
        del conversation_store[session_id]
    return {"status": "cleared"}

@app.get("/models")
async def list_models():
    """List the models available on this Ollama server."""
    models = ollama.list()
    return {
        "models": [
            # Newer ollama-python releases expose the tag as 'model', older ones as 'name'
            {"name": m.get('model') or m.get('name'), "size_gb": m['size'] / 1e9}
            for m in models['models']
        ]
    }

# Start with: uvicorn fastapi_chatbot:app --host 0.0.0.0 --port 8080
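The `/chat` endpoint above divides `eval_count` by wall-clock time, which also includes model loading and prompt processing. Ollama's final response additionally reports `eval_duration`, the generation time in nanoseconds, so the speed can be computed from the server's own counters. The sketch below assumes the response is available as a plain dict; the `tokens_per_second` helper name and the sample values are illustrative, not part of the original service.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's own counters.

    eval_count is the number of generated tokens and eval_duration the
    generation time in nanoseconds, as reported in the final response.
    Returns 0.0 when either field is missing.
    """
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not count or not duration_ns:
        return 0.0
    return count / (duration_ns / 1e9)

# Hand-written response dict (values made up for illustration)
fake_response = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(tokens_per_second(fake_response))  # 30.0
```

This figure isolates decoding speed; time to first token is better tracked separately, as the streaming benchmark earlier does.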

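Model comparison for common use cases

Use case	Recommended model	Minimum RAM	Why
Italian chatbot	qwen2.5:7b	8GB	Excellent multilingual support, long context
Code generation	qwen2.5-coder:7b	8GB	Fine-tuned for code, 90+ languages
RAG / document Q&A	llama3.1:8b	8GB	Excellent instruction following, 128K context
Advanced reasoning	deepseek-r1:8b	8GB	Chain of thought, math, logic
Raspberry Pi (fast)	llama3.2:1b	2GB	5+ t/s, simple tasks
Raspberry Pi (quality)	qwen2.5:3b	4GB	Best quality/speed balance
Mac M-series (fast)	qwen2.5:14b	16GB	22+ t/s on M2/M3, quality close to GPT-4
Image analysis	llava:7b or moondream	8GB	Multimodal models optimized for vision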
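If the choice has to be made in code, for example in a setup script, the table can be encoded as a small lookup. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical names introduced here, while the model tags and RAM figures are the ones listed in the table.

```python
# (use case key, model tag, minimum RAM in GB), taken from the comparison table
RECOMMENDATIONS = {
    "italian_chatbot": ("qwen2.5:7b", 8),
    "code_generation": ("qwen2.5-coder:7b", 8),
    "rag_qa": ("llama3.1:8b", 8),
    "advanced_reasoning": ("deepseek-r1:8b", 8),
    "rpi_fast": ("llama3.2:1b", 2),
    "rpi_quality": ("qwen2.5:3b", 4),
    "mac_m_series": ("qwen2.5:14b", 16),
}

def pick_model(use_case: str, ram_gb: float):
    """Return the recommended model tag, or None if the use case is
    unknown or the machine has less RAM than the table's minimum."""
    entry = RECOMMENDATIONS.get(use_case)
    if entry is None:
        return None
    model, min_ram = entry
    return model if ram_gb >= min_ram else None

print(pick_model("rag_qa", 16))
print(pick_model("rag_qa", 4))
```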

Production best practices

Using Ollama in production requires a few considerations that do not arise in personal use. The most important patterns are the following.

# ================================================================
# PRODUCTION PATTERN: Load balancing across multiple Ollama instances
# ================================================================

# With more than one server running Ollama, requests can be load balanced.
# nginx.conf (upstream with least_conn):
NGINX_CONFIG = """
upstream ollama_cluster {
    least_conn;  # Route to the server with the fewest active connections
    server server1:11434;
    server server2:11434;
    server server3:11434;
}

server {
    listen 80;

    location /api/ {
        proxy_pass http://ollama_cluster;
        proxy_read_timeout 300s;   # High timeout for long generations
        proxy_connect_timeout 10s;
        proxy_set_header Host $host;
    }
}
"""

# ================================================================
# HEALTH CHECK AND MONITORING
# ================================================================
import requests, time

def monitor_ollama(host: str = "localhost", port: int = 11434) -> bool:
    """Check that Ollama is reachable and report how many models are installed."""
    try:
        # /api/tags lists the local models; used here as a liveness probe
        resp = requests.get(f"http://{host}:{port}/api/tags", timeout=5)
    except requests.exceptions.RequestException as e:
        print(f"Ollama UNREACHABLE: {e}")
        return False
    if resp.status_code != 200:
        # Previously this case fell through and returned None
        print(f"Ollama returned HTTP {resp.status_code}")
        return False
    models = resp.json().get("models", [])
    print(f"Ollama OK: {len(models)} models available")
    return True

# ================================================================
# ERROR HANDLING AND RETRY
# ================================================================
import functools, random

def with_ollama_retry(max_attempts: int = 3, backoff: float = 1.0):
    """Decorator that retries automatically on Ollama errors."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    # Re-raise once the last attempt has failed
                    if attempt == max_attempts - 1:
                        raise
                    # Exponential backoff with a little random jitter
                    wait = backoff * (2 ** attempt) + random.uniform(0, 0.5)
                    print(f"Attempt {attempt+1} failed: {e}. Retrying in {wait:.1f}s")
                    time.sleep(wait)
        return wrapper
    return decorator

@with_ollama_retry(max_attempts=3, backoff=1.0)
def robust_chat(model: str, message: str) -> str:
    """Chat with automatic retry on network errors and timeouts."""
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": message}],
        options={"num_predict": 500}
    )
    return response['message']['content']

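Limitations and considerations for production

Ollama is not multi-tenant by default: on a shared server, requests are serialized. Set OLLAMA_NUM_PARALLEL=4 to handle concurrent requests (this needs more RAM: ~8GB per request for a 7B model).
RPi timeouts with large models: llama3.1:8b takes 10-15 seconds to generate its first response on an RPi. Use num_ctx=512 to reduce prefill time in time-sensitive cases. For a TTFT under 2 s, use 1-3B models.
No automatic scaling: unlike cloud APIs, Ollama does not scale on its own. For high traffic, use load balancing across multiple Ollama instances on different servers, or consider vLLM for GPU deployments.
Continuous power draw: keeping Ollama active with a model loaded consumes ~15W on an RPi5 and ~45W on a Jetson Orin NX. Use OLLAMA_KEEP_ALIVE=0 to unload the model immediately after each request.
Security: Ollama does not authenticate requests by default. In production, always put a reverse proxy (nginx) with authentication and rate limiting in front of it. Never expose port 11434 directly to the Internet.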
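The security note recommends rate limiting in front of Ollama. At the proxy level this is typically done with nginx's `limit_req`; the sketch below shows the same idea at the application level as a token bucket, which could sit in the FastAPI layer with one bucket per client. The `TokenBucket` class, its parameters and the injectable `clock` are assumptions made for illustration, not part of Ollama or of the service shown earlier.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the demo is deterministic
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be rejected."""
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock (illustrative values)
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
first_three = [bucket.allow() for _ in range(3)]  # burst of 3 requests at t=0
fake_now[0] = 1.0                                  # one second later
after_wait = bucket.allow()
print(first_three, after_wait)
```

A rejected request would normally be answered with HTTP 429 so that clients can back off and retry.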

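Conclusions

Ollama has lowered the barrier to entry for local AI to zero. With a single command you can run a competitive LLM on your laptop, with full privacy and zero API costs. The trend towards local LLMs is unstoppable: Gartner predicts that by 2027 SLMs (Small Language Models) will exceed cloud LLMs by 3x in frequency of use, with costs reduced by 70%.

For production, Ollama is an excellent starting point, but it requires some care: concurrency management, monitoring, model updates, security, and integration with existing systems. The most powerful pattern is combining Ollama with a RAG pipeline, which gives the model access to a private knowledge base without sending data to the cloud.

The next article in the series closes the circle with benchmarks and optimization: how to systematically measure the effect of every tool seen in the series (quantization, distillation, pruning, edge deployment) and choose the best combination for your use case.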

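Next steps

Next article: Benchmarks and Optimization: From 48GB to 8GB RTX
Related: Deep Learning on Edge Devices
Related: Quantization: GPTQ, AWQ, GGUF
AI Engineering series: RAG Pipelines with Local LLMs
MLOps series: Serving Models in Production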