Ollama and Local LLMs: Running Models on Your Own Hardware
In 2023, running a Large Language Model locally was reserved for those with deep technical
expertise: compiling llama.cpp, converting weights, configuring GGML parameters, managing
complex dependencies. Then Ollama arrived and everything changed. With a
single command — ollama run llama3 — anyone can have a competitive LLM running
on their laptop in a few minutes.
The trend is explosive. Ollama reached over 1 million monthly downloads in 2024 with 300% year-over-year growth. The market is clearly choosing privacy (data never leaves the device), zero API cost, customization (custom models, fixed system prompts), and offline availability. These advantages are driving the migration of many enterprise workflows from cloud APIs to local deployment.
This guide takes you from installation to production: configuring Ollama, choosing the right model, creating custom Modelfiles, exposing REST APIs, building offline RAG pipelines with LangChain, and optimizing for Raspberry Pi and server deployment.
What You'll Learn
- Installing Ollama on Windows, macOS, and Linux
- Model selection guide: Llama, Qwen, Phi, Gemma, Mistral, DeepSeek
- Modelfile: creating custom assistants with tailored parameters
- Ollama REST API: integration with Python, JavaScript, and cURL
- Python integration via official library and OpenAI-compatible API
- Offline RAG pipelines with LangChain and FAISS
- Raspberry Pi and headless server deployment with systemd
- OpenWebUI: fully offline ChatGPT-like interface
- Detailed benchmarks and quantization level selection
- Multi-model management and production optimization
How Ollama Works Internally
Before using Ollama, it helps to understand what it does under the hood. Ollama is a wrapper around llama.cpp, the C++ inference engine that made running quantized models on commodity hardware possible. Ollama adds:
- Model registry: Docker Hub-like pull/push system for GGUF models
- REST API server: exposes a local HTTP server on port 11434
- Model caching: keeps models loaded in RAM between requests
- GPU detection: automatically detects NVIDIA CUDA, AMD ROCm, and Apple Metal
- Context management: handles the context window and KV cache
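One practical consequence of this design is that the running server can be inspected over plain HTTP. A minimal stdlib sketch (assuming the server is listening on the default port 11434; the /api/ps endpoint lists the models currently held in RAM):

```python
import json
import urllib.request

def fetch_loaded_models(host: str = "http://localhost:11434") -> list[dict]:
    """Query /api/ps on the local Ollama server.

    Returns the list of models currently loaded in RAM, each with
    metadata such as size and expiry time.
    """
    with urllib.request.urlopen(f"{host}/api/ps", timeout=5) as resp:
        return json.loads(resp.read()).get("models", [])

# Usage (requires a running server):
# for m in fetch_loaded_models():
#     print(m)
```

Because the API is just HTTP on localhost, the same check works from cURL, a browser, or a monitoring agent.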
# Ollama Architecture - simplified diagram
#
# Client (Python/cURL/Browser)
# |
# v
# [Ollama REST API - port 11434]
# |
# v
# [Model Manager] --- ~/.ollama/models/ (GGUF storage)
# |
# v
# [llama.cpp backend]
# |
# _____|______
# | |
# [CPU] [GPU/Metal]
# ARM/x86 CUDA/ROCm/Metal
#
# Model format: GGUF (GPT-Generated Unified Format)
# Quantization levels: Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
#
# Model storage locations:
# macOS/Linux: ~/.ollama/models/
# Windows: C:\Users\USERNAME\.ollama\models\
#
# Directory structure:
# ~/.ollama/models/
# ├── blobs/ (binary GGUF files, identified by SHA256)
# └── manifests/ (metadata: which blob = which model:tag)
import subprocess, json
def ollama_status():
"""Check Ollama status and loaded models."""
result = subprocess.run(
["ollama", "list"], capture_output=True, text=True
)
print("Installed models:")
print(result.stdout)
# Check process
ps = subprocess.run(
["pgrep", "-x", "ollama"], capture_output=True, text=True
)
running = ps.returncode == 0
print(f"Ollama running: {running}")
ollama_status()
Installation and First Steps
Ollama installs with a single command and requires no configuration. It supports macOS (Apple Silicon and Intel), Windows (with NVIDIA or AMD GPU), and Linux (deb/rpm/generic).
# ================================================================
# OLLAMA INSTALLATION
# ================================================================
# macOS / Linux (one command):
# curl -fsSL https://ollama.com/install.sh | sh
# Windows:
# Download installer from https://ollama.com/download
# (includes automatic CUDA support if NVIDIA GPU present)
# Verify installation:
# ollama --version
# ollama serve (start server manually if not running)
# ================================================================
# BASIC COMMANDS
# ================================================================
# Run a model (auto-download if not present)
# ollama run llama3.2
# List locally available models
# ollama list
# Pull without running (for pre-downloading)
# ollama pull llama3.2:3b
# Detailed model information
# ollama show llama3.2
# Remove a model (frees disk space)
# ollama rm llama3.2:old-version
# Copy a model with a different name
# ollama cp llama3.2 my-custom-model
# ================================================================
# USEFUL ENVIRONMENT VARIABLES
# ================================================================
# Listen on all interfaces (for network access)
# export OLLAMA_HOST=0.0.0.0:11434
# Custom model directory
# export OLLAMA_MODELS=/mnt/ssd/ollama-models
# Maximum parallel requests (default: 1)
# export OLLAMA_NUM_PARALLEL=4
# Maximum models in memory (default: 1)
# export OLLAMA_MAX_LOADED_MODELS=2
# Time before unloading a model from RAM (default: 5m)
# export OLLAMA_KEEP_ALIVE=30m
# ================================================================
# POPULAR MODELS AND HARDWARE REQUIREMENTS (2025)
# ================================================================
MODELS_GUIDE = {
# SMALL models (for Raspberry Pi / 8 GB laptop)
"qwen2.5:1.5b": {"size": "0.9 GB", "ram": "2 GB", "quality": 7, "rpi5_tps": 4.5},
"llama3.2:1b": {"size": "1.3 GB", "ram": "2 GB", "quality": 7, "rpi5_tps": 5.1},
"phi3.5:mini": {"size": "2.2 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 2.8},
"qwen2.5:3b": {"size": "1.9 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 2.1},
"gemma2:2b": {"size": "1.6 GB", "ram": "3 GB", "quality": 8, "rpi5_tps": 3.2},
# MEDIUM models (16+ GB laptop / desktop)
"llama3.2:3b": {"size": "2.0 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 1.8},
"mistral:7b": {"size": "4.1 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.8},
"llama3.1:8b": {"size": "4.7 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.6},
"qwen2.5:7b": {"size": "4.4 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.7},
"deepseek-r1:8b": {"size": "4.9 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.5},
# LARGE models (24+ GB workstation / server)
"llama3.1:70b": {"size": "40 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
"qwen2.5:72b": {"size": "41 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
"deepseek-r1:32b": {"size": "19 GB", "ram": "32 GB", "quality": 10, "rpi5_tps": None},
}
print("Recommended models by hardware:")
print(" Raspberry Pi 5 (8GB): qwen2.5:1.5b, llama3.2:1b, gemma2:2b")
print(" 16GB laptop: llama3.1:8b, qwen2.5:7b, mistral:7b")
print(" Mac M2/M3 (24GB): llama3.1:8b, gemma2:9b, qwen2.5:14b")
print(" Workstation 48GB+: llama3.1:70b, deepseek-r1:32b")
Quantization Levels: Which GGUF to Choose?
When you run ollama pull llama3.1:8b, Ollama downloads the model's default
quantization (usually Q4_K_M). You can also choose the quantization level
explicitly, with an important quality/size/speed trade-off.
GGUF Quantization Level Guide
| Tag / Format | Bits/weight | Size (7B) | Perplexity loss | Recommended for |
|---|---|---|---|---|
| Q2_K | 2.63 bit | 2.7 GB | +15-20% | Only when RAM is the absolute constraint |
| Q4_K_S | 4.37 bit | 4.5 GB | +2-3% | Good speed/quality balance |
| Q4_K_M | 4.58 bit | 4.8 GB | +1-2% | Recommended default (sweet spot) |
| Q5_K_M | 5.68 bit | 5.7 GB | +0.5-1% | Maximum quality with <6 GB RAM |
| Q6_K | 6.57 bit | 6.6 GB | +0.1-0.3% | Nearly identical to F16, needs more RAM |
| Q8_0 | 8.5 bit | 8.5 GB | ~0% | Maximum quality, requires 9+ GB RAM |
| F16 | 16 bit | 14 GB | 0% (baseline) | Training/fine-tuning, not for inference |
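A quick sanity check on the sizes in the table: a quantized model file is roughly parameters × bits-per-weight ÷ 8. A minimal sketch (the bits/weight constants come from the table above; real files run slightly larger because embedding and output layers often keep higher precision, plus metadata):

```python
# Back-of-envelope GGUF size estimate: n_params * bits_per_weight / 8 bytes.
# Treat the result as a lower bound on the actual file size.

def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for quant, bits in [("Q4_K_M", 4.58), ("Q5_K_M", 5.68), ("Q8_0", 8.5), ("F16", 16.0)]:
    size = estimate_gguf_size_gb(8.0e9, bits)  # Llama 3.1 8B
    print(f"{quant}: ~{size:.1f} GB")
```

For a 7B model at F16 this gives 7e9 × 16 ÷ 8 = 14 GB, matching the table's baseline row.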
# Explicitly choosing quantization in Ollama
# Tags depend on the model - use 'ollama show' to see options
# Default (Ollama chooses automatically, usually Q4_K_M):
# ollama pull llama3.1:8b
# Specify quantization manually:
# ollama pull llama3.1:8b-instruct-q4_K_M
# ollama pull llama3.1:8b-instruct-q5_K_M
# ollama pull llama3.1:8b-instruct-q8_0
# For HuggingFace models not in Ollama registry:
# Download GGUF manually and import with Modelfile:
IMPORT_GGUF_MODELFILE = """
FROM ./path/to/model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
"""
# Write IMPORT_GGUF_MODELFILE to a file named 'Modelfile', then:
# ollama create my-model -f Modelfile
# ollama run my-model
# Performance comparison Q4 vs Q5 vs Q8 (Llama 3.1 8B, MacBook M3 Pro):
QUANT_BENCHMARK = {
"Q4_K_M": {"size_gb": 4.8, "tps": 38.2, "quality_vs_f16": "98.5%"},
"Q5_K_M": {"size_gb": 5.7, "tps": 33.1, "quality_vs_f16": "99.2%"},
"Q6_K": {"size_gb": 6.6, "tps": 29.4, "quality_vs_f16": "99.7%"},
"Q8_0": {"size_gb": 8.5, "tps": 24.8, "quality_vs_f16": "99.9%"},
}
for quant, data in QUANT_BENCHMARK.items():
print(f"{quant}: {data['size_gb']}GB, {data['tps']}t/s, quality={data['quality_vs_f16']}")
Modelfile: Creating Custom Assistants
A Modelfile is Ollama's mechanism for creating custom models. It allows you to define the base model, system prompt, generation parameters (temperature, top_p, context window), and even extend a model with additional files. It is equivalent to a Dockerfile, but for language models.
# ================================================================
# PRACTICAL MODELFILE EXAMPLES
# ================================================================
# --- Modelfile 1: Technical English assistant ---
MODEL_FILE_TECH = """
FROM qwen2.5:7b
# Generation parameters
PARAMETER temperature 0.3 # Low = more deterministic responses
PARAMETER top_p 0.9 # Nucleus sampling
PARAMETER top_k 40 # Top-k sampling
PARAMETER num_ctx 8192 # Context window (4096-32768)
PARAMETER repeat_penalty 1.1 # Avoid repetitions
# System prompt (defines model behavior)
SYSTEM \"\"\"
You are a technical assistant expert in Python, deep learning and machine learning.
Always respond in English, concisely and technically.
When showing code, always use markdown blocks with the language specified.
If you are not sure about something, say so explicitly.
Do not invent information or APIs that don't exist.
\"\"\"
# Welcome message
MESSAGE user "Hello!"
MESSAGE assistant "Hi! I'm your technical assistant. How can I help you with Python, deep learning, or machine learning today?"
"""
# Create the model:
# Write MODEL_FILE_TECH to a file named 'Modelfile-tech-en', then:
# ollama create tech-assistant-en -f Modelfile-tech-en
# ollama run tech-assistant-en
# --- Modelfile 2: Code review assistant ---
MODEL_FILE_CODE = """
FROM llama3.1:8b
PARAMETER temperature 0.1 # Very deterministic for code
PARAMETER num_ctx 16384 # Large context for long files
PARAMETER repeat_penalty 1.05
SYSTEM \"\"\"
You are an expert code reviewer. When reviewing code:
1. Identify bugs, security issues, and performance problems
2. Suggest specific improvements with code examples
3. Follow PEP8/language standards
4. Be concise: list issues with severity (CRITICAL/HIGH/MEDIUM/LOW)
Be direct and actionable. Never hallucinate API methods.
\"\"\"
"""
# --- Modelfile 3: Document RAG assistant ---
MODEL_FILE_RAG = """
FROM qwen2.5:7b
PARAMETER temperature 0.1
PARAMETER num_ctx 32768 # Long context for documents
PARAMETER repeat_penalty 1.0
SYSTEM \"\"\"
You are an assistant that answers ONLY based on the documents provided in the context.
If you cannot find the answer in the context, say exactly: "I don't have information on this in the provided documents."
Never add external information. Always cite the source document in your answer.
\"\"\"
"""
print("Modelfiles ready. To create:")
print(" ollama create tech-assistant-en -f Modelfile-tech-en")
print(" ollama create code-reviewer -f Modelfile-code")
print(" ollama create rag-assistant -f Modelfile-rag")
Ollama REST API: Integration with Python
Ollama exposes two APIs: its own native API and an OpenAI-compatible API. The OpenAI compatibility allows replacing OpenAI APIs with Ollama simply by changing the base URL — without modifying application code.
# pip install ollama openai requests
import ollama
import json, time
from typing import Iterator
# ================================================================
# 1. OFFICIAL OLLAMA LIBRARY (Python)
# ================================================================
# Simple chat (non-streaming)
def chat_simple(model: str, message: str) -> str:
response = ollama.chat(
model=model,
messages=[{"role": "user", "content": message}]
)
return response['message']['content']
# Chat with streaming (token by token)
def chat_streaming(model: str, messages: list) -> Iterator[str]:
stream = ollama.chat(
model=model,
messages=messages,
stream=True
)
for chunk in stream:
if chunk['message']['content']:
yield chunk['message']['content']
# Embeddings for RAG (use nomic-embed-text or mxbai-embed-large)
def get_embedding(model: str, text: str) -> list:
response = ollama.embeddings(model=model, prompt=text)
return response['embedding']
# Chat with images (multimodal models: llava, bakllava, moondream)
def chat_with_image(model: str, prompt: str, image_path: str) -> str:
import base64
with open(image_path, "rb") as f:
image_data = base64.b64encode(f.read()).decode()
    response = ollama.chat(
        model=model,
messages=[{
"role": "user",
"content": prompt,
"images": [image_data]
}]
)
return response['message']['content']
# Chatbot with conversation history
def interactive_chat(model: str = "llama3.2:3b"):
history = []
print(f"Chat with {model} (type 'exit' to quit)")
while True:
user_input = input("You: ").strip()
if user_input.lower() == "exit":
break
history.append({"role": "user", "content": user_input})
print("Assistant: ", end="", flush=True)
full_response = ""
for chunk in chat_streaming(model, history):
print(chunk, end="", flush=True)
full_response += chunk
print()
history.append({"role": "assistant", "content": full_response})
# ================================================================
# 2. OPENAI-COMPATIBLE API (drop-in replacement)
# ================================================================
from openai import OpenAI
# Only change base_url: zero code changes!
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Any string
)
def chat_openai_compatible(model: str, prompt: str) -> str:
"""Identical to OpenAI API, but uses Ollama locally."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
max_tokens=500
)
return response.choices[0].message.content
# ================================================================
# 3. RAW REST API (without Python libraries)
# ================================================================
import requests
def ollama_raw_api(model: str, prompt: str, stream: bool = False) -> str:
"""Call Ollama API directly with requests."""
resp = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": stream,
"options": {
"temperature": 0.7,
"num_predict": 200,
"num_ctx": 4096
}
},
        stream=stream,  # stream the HTTP body when token streaming is enabled
        timeout=120
)
if not stream:
return resp.json()["response"]
else:
result = ""
for line in resp.iter_lines():
if line:
data = json.loads(line)
result += data.get("response", "")
if data.get("done"):
break
return result
# ================================================================
# 4. MODEL SPEED BENCHMARK
# ================================================================
def benchmark_model(model: str, n_runs: int = 3):
"""Measures generation speed in token/s."""
prompt = "Explain quantum computing in one paragraph."
results = []
for _ in range(n_runs):
t0 = time.time()
response = ollama.generate(
model=model,
prompt=prompt,
options={"num_predict": 100}
)
        elapsed = time.time() - t0
        eval_count = response.get('eval_count', 100)
        # Prefer the server-reported eval_duration (ns): wall-clock time
        # also includes prompt processing and possible model loading
        eval_ns = response.get('eval_duration')
        tps = eval_count / (eval_ns / 1e9) if eval_ns else eval_count / elapsed
results.append(tps)
avg_tps = sum(results) / len(results)
print(f"{model}: {avg_tps:.1f} token/s (average {n_runs} runs)")
return avg_tps
# Typical comparison results on MacBook M3 Pro 18GB:
# qwen2.5:1.5b ~85 t/s
# llama3.2:3b ~62 t/s
# qwen2.5:7b ~42 t/s
# llama3.1:8b ~38 t/s
# qwen2.5:14b ~22 t/s
# llama3.1:70b ~8 t/s
Ollama with LangChain: Offline RAG Pipeline
Ollama integrates natively with LangChain, enabling completely offline RAG
(Retrieval-Augmented Generation) pipelines. This is particularly relevant for enterprise
applications that cannot send sensitive data to the cloud. The nomic-embed-text
model is a strong default for local embeddings.
# pip install langchain langchain-ollama langchain-community
# pip install faiss-cpu chromadb pypdf
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import (
DirectoryLoader, TextLoader, PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os
# ================================================================
# OFFLINE RAG SYSTEM WITH OLLAMA - PRODUCTION VERSION
# ================================================================
class OllamaRAGSystem:
"""
Complete offline RAG system with Ollama.
Supports PDF, TXT, and entire directories.
Uses FAISS for local vector storage.
"""
def __init__(
self,
llm_model: str = "llama3.1:8b",
embed_model: str = "nomic-embed-text", # ollama pull nomic-embed-text
kb_dir: str = "./knowledge_base"
):
self.llm_model = llm_model
self.embed_model = embed_model
self.kb_dir = kb_dir
self.embeddings = OllamaEmbeddings(model=embed_model)
self.llm = OllamaLLM(
model=llm_model,
temperature=0.1,
num_ctx=8192,
num_predict=512
)
self.vectorstore = None
def load_documents(self, docs_dir: str) -> list:
"""Load documents from directory (PDF, TXT, MD)."""
docs = []
# Load TXT and MD
txt_loader = DirectoryLoader(
docs_dir, glob="**/*.txt", loader_cls=TextLoader
)
docs.extend(txt_loader.load())
# Load PDFs
for pdf_file in os.listdir(docs_dir):
if pdf_file.endswith(".pdf"):
loader = PyPDFLoader(os.path.join(docs_dir, pdf_file))
docs.extend(loader.load())
print(f"Loaded {len(docs)} documents from {docs_dir}")
return docs
def build_knowledge_base(self, docs_dir: str) -> None:
"""Create and save knowledge base from a directory."""
documents = self.load_documents(docs_dir)
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
texts = splitter.split_documents(documents)
print(f"Created {len(texts)} chunks")
self.vectorstore = FAISS.from_documents(texts, self.embeddings)
self.vectorstore.save_local(self.kb_dir)
print(f"Knowledge base saved to {self.kb_dir}")
def load_knowledge_base(self) -> None:
"""Load existing knowledge base from disk."""
self.vectorstore = FAISS.load_local(
self.kb_dir, self.embeddings,
allow_dangerous_deserialization=True
)
print(f"Knowledge base loaded: {self.vectorstore.index.ntotal} vectors")
def create_qa_chain(self) -> RetrievalQA:
"""Create Q&A chain over documents."""
prompt_template = """Use the following context to answer the question.
If you cannot find the answer in the context, explicitly say you don't know.
Do not make up information not present in the context.
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
retriever = self.vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance (more diverse)
search_kwargs={"k": 5, "fetch_k": 20}
)
return RetrievalQA.from_chain_type(
llm=self.llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True
)
def ask(self, question: str, qa_chain: RetrievalQA) -> dict:
"""Ask a question to the RAG system."""
result = qa_chain.invoke({"query": question})
sources = list(set([
doc.metadata.get("source", "Unknown")
for doc in result["source_documents"]
]))
return {
"answer": result["result"],
"sources": sources,
"n_docs": len(result["source_documents"])
}
# Usage:
# rag = OllamaRAGSystem(llm_model="llama3.1:8b")
# rag.build_knowledge_base("./company_documents")
# chain = rag.create_qa_chain()
# result = rag.ask("What is the company vacation policy?", chain)
# print(result["answer"])
# print("Sources:", result["sources"])
print("RAG system ready!")
OpenWebUI: ChatGPT Interface for Ollama
OpenWebUI (formerly Ollama WebUI) is the most popular interface for Ollama, with a user experience identical to ChatGPT but completely offline. It supports chat, document upload, conversation management, prompt sharing, integrated RAG, and multimodal mode for images.
# ================================================================
# OPENWEBUI SETUP WITH DOCKER
# ================================================================
# Case 1: Ollama on the same host
# docker run -d -p 3000:8080 \
# -v open-webui:/app/backend/data \
# -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
# --name open-webui \
# ghcr.io/open-webui/open-webui:main
# Case 2: OpenWebUI with integrated Ollama (all-in-one)
# docker run -d -p 3000:8080 \
# -v ollama:/root/.ollama \
# -v open-webui:/app/backend/data \
# --gpus all \
# --name open-webui \
# ghcr.io/open-webui/open-webui:ollama
# Access: http://localhost:3000
# ================================================================
# DOCKER COMPOSE (recommended for production)
# ================================================================
DOCKER_COMPOSE = """
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=2
# For NVIDIA GPU:
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
restart: unless-stopped
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
ports:
- "3000:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_AUTH=True
- WEBUI_SECRET_KEY=change-this-secret-key
depends_on:
- ollama
restart: unless-stopped
volumes:
ollama_data:
webui_data:
"""
# ================================================================
# OLLAMA AS SYSTEMD SERVICE (Linux production)
# ================================================================
SYSTEMD_SERVICE = """
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/opt/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=10m"
[Install]
WantedBy=default.target
"""
# sudo systemctl enable ollama
# sudo systemctl start ollama
# sudo journalctl -u ollama -f # Real-time logs
print("Ollama service setup complete!")
Deployment on Raspberry Pi: Optimized Setup
The Raspberry Pi 5 with 8 GB of RAM is the most accessible edge device for local LLMs. With the right configuration, 1.5B parameter models reach 4-5 token/s — sufficient for many non-real-time use cases: low-volume chatbots, batch text analysis, event-triggered automation.
# ================================================================
# OLLAMA ON RASPBERRY PI 5 (optimized setup)
# ================================================================
# Installation (identical to x86 Linux):
# curl -fsSL https://ollama.com/install.sh | sh
# Optimal configuration for RPi5 in /etc/environment:
# OLLAMA_NUM_PARALLEL=1 # One request at a time (limited RAM)
# OLLAMA_MAX_LOADED_MODELS=1 # One model in memory
# OLLAMA_KEEP_ALIVE=5m # Unload model after 5 min inactivity
# OLLAMA_NUM_THREAD=4 # All Cortex-A76 cores
# Recommended models for RPi5 (8GB):
# ollama pull qwen2.5:1.5b (fast: ~4.5 t/s, 1.8 GB RAM)
# ollama pull llama3.2:1b (balanced: ~5.1 t/s, 1.4 GB RAM)
# ollama pull gemma2:2b (quality: ~3.2 t/s, 2.5 GB RAM)
import ollama
import time, statistics, psutil
def benchmark_ollama_rpi(model: str = "qwen2.5:1.5b",
n_tests: int = 5):
"""Test speed and consistency on RPi."""
prompt = "Explain in 3 sentences what machine learning is."
results = []
latencies_to_first = []
print(f"Benchmarking {model} over {n_tests} tests...")
for i in range(n_tests):
t0 = time.time()
first_token = None
full_response = ""
for chunk in ollama.chat(
model=model,
messages=[{"role": "user", "content": prompt}],
stream=True,
options={"temperature": 0, "top_k": 1, "num_predict": 50}
):
content = chunk['message']['content']
if content and first_token is None:
first_token = time.time() - t0
latencies_to_first.append(first_token * 1000)
full_response += content
elapsed = time.time() - t0
n_tokens = len(full_response.split()) # Approximation
tps = n_tokens / elapsed
results.append(tps)
print(f" Test {i+1}: {tps:.1f} t/s, TTFT: {first_token*1000:.0f}ms")
mean_tps = statistics.mean(results)
mean_ttft = statistics.mean(latencies_to_first)
mem = psutil.virtual_memory()
print(f"\nResults {model} on RPi5:")
print(f" Mean speed: {mean_tps:.1f} t/s")
print(f" Mean TTFT: {mean_ttft:.0f} ms")
print(f" RAM used: {mem.used/(1024**3):.1f} GB / {mem.total/(1024**3):.1f} GB")
return mean_tps
# ================================================================
# AUTOMATION: Model updates and monitoring
# ================================================================
import subprocess, datetime
def update_ollama_models(models: list = ["qwen2.5:1.5b", "nomic-embed-text"]):
"""Update Ollama models (run with cron)."""
log = []
for model in models:
print(f"Updating {model}...")
result = subprocess.run(
["ollama", "pull", model],
capture_output=True, text=True, timeout=600
)
status = "OK" if result.returncode == 0 else "FAIL"
log.append({
"model": model,
"status": status,
"time": datetime.datetime.now().isoformat()
})
print(f" {model}: {status}")
return log
# Recommended cron job (every Sunday at 3:00 AM):
# 0 3 * * 0 /usr/bin/python3 /home/pi/update_models.py >> /var/log/ollama-update.log 2>&1
Real-World Case Study: Offline Enterprise Chatbot
A real use case: a company handling sensitive documents (contracts, HR policies, technical manuals) wants an internal chatbot without exposing data to the cloud. With Ollama and RAG, a completely air-gapped system can be built in less than a day.
# ================================================================
# OFFLINE ENTERPRISE CHATBOT - Full Stack
# ================================================================
# Stack:
# - Ollama with llama3.1:8b (or qwen2.5:7b for better multilingual)
# - nomic-embed-text for embeddings
# - FAISS for vector store
# - FastAPI for REST API
# - OpenWebUI for user interface
# fastapi_chatbot.py
from fastapi import FastAPI
from pydantic import BaseModel
import ollama, time
app = FastAPI(title="Corporate AI Assistant", version="2.0")
# Global state (use Redis in production)
conversation_store = {}
rag_system = None  # assign an OllamaRAGSystem instance at startup to enable RAG
class ChatRequest(BaseModel):
session_id: str
message: str
model: str = "qwen2.5:7b"
use_rag: bool = True
class ChatResponse(BaseModel):
session_id: str
response: str
sources: list = []
model: str
tokens_per_sec: float
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
# Retrieve conversation history
if request.session_id not in conversation_store:
conversation_store[request.session_id] = []
history = conversation_store[request.session_id]
# Add RAG context if requested
context = ""
sources = []
if request.use_rag and rag_system and rag_system.vectorstore:
docs = rag_system.vectorstore.similarity_search(
request.message, k=3
)
context = "\n\n".join([d.page_content for d in docs])
sources = list(set([d.metadata.get("source", "") for d in docs]))
augmented_message = f"""Context from company documents:
{context}
Question: {request.message}"""
else:
augmented_message = request.message
history.append({"role": "user", "content": augmented_message})
# Generate response
t0 = time.time()
response = ollama.chat(
model=request.model,
messages=history,
options={"num_ctx": 8192, "temperature": 0.3}
)
elapsed = time.time() - t0
assistant_msg = response['message']['content']
history.append({"role": "assistant", "content": assistant_msg})
# Truncate history if too long (sliding window)
if len(history) > 20:
history = history[-20:]
conversation_store[request.session_id] = history
eval_count = response.get('eval_count', 50)
tps = eval_count / elapsed if elapsed > 0 else 0
return ChatResponse(
session_id=request.session_id,
response=assistant_msg,
sources=sources,
model=request.model,
tokens_per_sec=round(tps, 1)
)
@app.get("/models")
async def list_models():
"""List available models on this Ollama server."""
models = ollama.list()
return {
"models": [
{"name": m['name'], "size_gb": m['size'] / 1e9}
for m in models['models']
]
}
# Start: uvicorn fastapi_chatbot:app --host 0.0.0.0 --port 8080
Model Comparison for Common Use Cases
| Use Case | Recommended Model | Min RAM | Why |
|---|---|---|---|
| English chatbot | qwen2.5:7b | 8 GB | Excellent multilingual, long context |
| Code generation | qwen2.5-coder:7b | 8 GB | Fine-tuned on code, 90+ languages |
| RAG / Q&A documents | llama3.1:8b | 8 GB | Excellent instruction following, 128K context |
| Advanced reasoning | deepseek-r1:8b | 8 GB | Chain-of-thought, math, logic |
| Raspberry Pi (fast) | llama3.2:1b | 2 GB | 5+ t/s, useful for simple tasks |
| Raspberry Pi (quality) | qwen2.5:3b | 4 GB | Optimal quality/speed balance |
| Mac M-series (fast) | qwen2.5:14b | 16 GB | 22+ t/s on M2/M3, near GPT-4 quality |
| Image analysis | llava:7b or moondream | 8 GB | Multimodal models optimized for vision |
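The table above can be turned into a simple selection helper. A hypothetical sketch (the minimum-RAM figures are copied from the table; `headroom_gb` is an assumption to leave room for the OS and KV cache):

```python
# Illustrative helper: list models that fit a given RAM budget,
# largest (usually highest-quality) first.

MODEL_MIN_RAM_GB = {
    "llama3.2:1b": 2, "qwen2.5:3b": 4, "qwen2.5:7b": 8,
    "qwen2.5-coder:7b": 8, "llama3.1:8b": 8, "deepseek-r1:8b": 8,
    "qwen2.5:14b": 16,
}

def models_for_ram(ram_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Return candidate models whose minimum RAM fits the budget.

    headroom_gb reserves memory for the OS and the KV cache.
    """
    budget = ram_gb - headroom_gb
    fitting = [m for m, need in MODEL_MIN_RAM_GB.items() if need <= budget]
    return sorted(fitting, key=lambda m: MODEL_MIN_RAM_GB[m], reverse=True)

print(models_for_ram(8))    # Raspberry Pi 5 class
print(models_for_ram(16))   # 16 GB laptop
```

On an 8 GB device this rules out all 7B-8B models, which matches the Raspberry Pi recommendations above.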
Production Best Practices
Using Ollama in production requires specific considerations compared to personal use. Here are the most important patterns.
# ================================================================
# PRODUCTION PATTERN: Health Check and Monitoring
# ================================================================
import requests, time, functools, random
def monitor_ollama(host: str = "localhost", port: int = 11434):
"""Check Ollama availability and load."""
try:
resp = requests.get(f"http://{host}:{port}/api/tags", timeout=5)
if resp.status_code == 200:
models = resp.json().get("models", [])
print(f"Ollama OK: {len(models)} models available")
return True
except requests.exceptions.RequestException as e:
print(f"Ollama UNREACHABLE: {e}")
return False
# ================================================================
# ERROR HANDLING AND RETRY
# ================================================================
def with_ollama_retry(max_attempts: int = 3, backoff: float = 1.0):
"""Decorator for automatic retry on Ollama errors."""
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_attempts):
try:
return func(*args, **kwargs)
except Exception as e:
if attempt == max_attempts - 1:
raise
wait = backoff * (2 ** attempt) + random.uniform(0, 0.5)
print(f"Attempt {attempt+1} failed: {e}. Retry in {wait:.1f}s")
time.sleep(wait)
return wrapper
return decorator
@with_ollama_retry(max_attempts=3, backoff=1.0)
def robust_chat(model: str, message: str) -> str:
"""Chat with automatic retry on network/timeout errors."""
response = ollama.chat(
model=model,
messages=[{"role": "user", "content": message}],
options={"num_predict": 500}
)
return response['message']['content']
# ================================================================
# NGINX LOAD BALANCING (multiple Ollama instances)
# ================================================================
NGINX_CONFIG = """
upstream ollama_cluster {
least_conn; # Route to connection with fewest requests
server server1:11434;
server server2:11434;
server server3:11434;
}
server {
listen 80;
location /api/ {
proxy_pass http://ollama_cluster;
proxy_read_timeout 300s; # High timeout for long generation
proxy_connect_timeout 10s;
proxy_set_header Host $host;
}
}
"""
Limitations and Production Considerations
- Ollama is not multi-tenant by default: on a shared server, requests are serialized. Set OLLAMA_NUM_PARALLEL=4 to handle concurrent requests (requires more RAM: roughly 8 GB per request with a 7B model).
- Timeout on RPi with large models: llama3.1:8b takes 10-15 seconds to generate the first response on RPi. Use num_ctx=512 to reduce prefill time in time-sensitive cases. For TTFT under 2 s, use 1-3B models.
- No automatic autoscaling: unlike cloud APIs, Ollama does not scale. For high traffic, use load balancing with multiple Ollama instances on different servers, or evaluate vLLM for GPU deployment.
- Continuous power consumption: keeping Ollama active with a loaded model consumes ~15 W on RPi5, ~45 W on Jetson Orin NX. Use OLLAMA_KEEP_ALIVE=0 to unload the model immediately after each request.
- Security: Ollama does not authenticate requests by default. In production, always put a reverse proxy (nginx) in front with authentication and rate limiting. Never expose port 11434 directly to the internet.
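Keep-alive can also be controlled per request rather than globally: the native API accepts a keep_alive field on /api/generate and /api/chat, so a low-traffic endpoint can release RAM immediately while a hot path keeps its model resident. A minimal sketch that builds such a request payload (the model name is illustrative):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive=0) -> dict:
    """Payload for POST http://localhost:11434/api/generate.

    keep_alive=0 tells the server to unload the model right after the
    response (frees RAM, cold start on the next request); a duration
    string like "30m" keeps it resident instead.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

payload = generate_payload("qwen2.5:1.5b", "ping", keep_alive=0)
print(json.dumps(payload))
# Send with: requests.post("http://localhost:11434/api/generate", json=payload)
```

This gives finer control than the server-wide OLLAMA_KEEP_ALIVE variable when different endpoints have different latency budgets.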
Conclusions
Ollama has reduced the barrier to local AI to nearly zero. With a single command you can have a competitive LLM running on your laptop, with complete privacy and zero API costs. The trend toward local LLMs is unstoppable: Gartner predicts that by 2027, SLMs (Small Language Models) will be used three times more frequently than cloud LLMs, with a 70% reduction in operational costs.
For production, Ollama is an excellent starting point but requires some considerations: concurrency management, monitoring, model updates, security, and integration with existing systems. The most powerful pattern is combining Ollama with a RAG pipeline to give the model access to private knowledge bases without sending data to the cloud.
The next article in the series closes the loop with Benchmarks and Optimization: how to systematically measure the performance of all the tools seen in the series — quantization, distillation, pruning, edge deployment — and choose the optimal combination for your use case.
Next Steps
- Next article: Benchmarks and Optimization: from 48GB to 8GB RTX
- Related: Deep Learning on Edge Devices
- Related: Quantization: GPTQ, AWQ, GGUF
- AI Engineering series: RAG Pipeline with Local LLMs
- MLOps series: Model Serving in Production