Ollama and Local LLMs: Running Models on Your Own Hardware
In 2023, running a Large Language Model locally was reserved for those with deep technical
expertise: compiling llama.cpp, converting weights, configuring GGML parameters, managing
complex dependencies. Then Ollama arrived and everything changed. With a
single command — ollama run llama3 — anyone can have a competitive LLM running
on their laptop in a few minutes.
The trend is explosive. Ollama reached over 1 million monthly downloads in 2024 with 300% year-over-year growth. The market is clearly choosing privacy (data never leaves the device), zero API cost, customization (custom models, fixed system prompts), and offline availability. These advantages are driving the migration of many enterprise workflows from cloud APIs to local deployment.
This guide takes you from installation to production: configuring Ollama, choosing the right model, creating custom Modelfiles, exposing REST APIs, building offline RAG pipelines with LangChain, and optimizing for Raspberry Pi and server deployment.
What You'll Learn
- Installing Ollama on Windows, macOS, and Linux
- Model selection guide: Llama, Qwen, Phi, Gemma, Mistral, DeepSeek
- Modelfile: creating custom assistants with tailored parameters
- Ollama REST API: integration with Python, JavaScript, and cURL
- Python integration via official library and OpenAI-compatible API
- Offline RAG pipelines with LangChain and FAISS
- Raspberry Pi and headless server deployment with systemd
- OpenWebUI: fully offline ChatGPT-like interface
- Detailed benchmarks and quantization level selection
- Multi-model management and production optimization
How Ollama Works Internally
Before using Ollama, it helps to understand what it does under the hood. Ollama is a wrapper around llama.cpp, the C++ inference engine that made running quantized models on commodity hardware possible. Ollama adds:
- Model registry: Docker Hub-like pull/push system for GGUF models
- REST API server: exposes a local HTTP server on port 11434
- Model caching: keeps models loaded in RAM between requests
- GPU detection: automatically detects NVIDIA CUDA, AMD ROCm, and Apple Metal
- Context management: handles the context window and KV cache
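One practical consequence of this design is that the running server can be inspected over plain HTTP. A minimal stdlib sketch (assuming the server is listening on the default port 11434; the /api/ps endpoint lists the models currently held in RAM):

```python
import json
import urllib.request

def fetch_loaded_models(host: str = "http://localhost:11434") -> list[dict]:
    """Query /api/ps on the local Ollama server.

    Returns the list of models currently loaded in RAM, each with
    metadata such as size and expiry time.
    """
    with urllib.request.urlopen(f"{host}/api/ps", timeout=5) as resp:
        return json.loads(resp.read()).get("models", [])

# Usage (requires a running server):
# for m in fetch_loaded_models():
#     print(m)
```

Because the API is just HTTP on localhost, the same check works from cURL, a browser, or a monitoring agent.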
# Ollama Architecture - simplified diagram
#
# Client (Python/cURL/Browser)
# |
# v
# [Ollama REST API - port 11434]
# |
# v
# [Model Manager] --- ~/.ollama/models/ (GGUF storage)
# |
# v
# [llama.cpp backend]
# |
# _____|______
# | |
# [CPU] [GPU/Metal]
# ARM/x86 CUDA/ROCm/Metal
#
# Model format: GGUF (GPT-Generated Unified Format)
# Quantization levels: Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
#
# Model storage locations:
# macOS/Linux: ~/.ollama/models/
# Windows: C:\Users\USERNAME\.ollama\models\
#
# Directory structure:
# ~/.ollama/models/
# ├── blobs/ (binary GGUF files, identified by SHA256)
# └── manifests/ (metadata: which blob = which model:tag)
import subprocess, json
def ollama_status():
"""Check Ollama status and loaded models."""
result = subprocess.run(
["ollama", "list"], capture_output=True, text=True
)
print("Installed models:")
print(result.stdout)
# Check process
ps = subprocess.run(
["pgrep", "-x", "ollama"], capture_output=True, text=True
)
running = ps.returncode == 0
print(f"Ollama running: {running}")
ollama_status()
Installation and First Steps
Ollama installs with a single command and requires no configuration. It supports macOS (Apple Silicon and Intel), Windows (with NVIDIA or AMD GPU), and Linux (deb/rpm/generic).
# ================================================================
# OLLAMA INSTALLATION
# ================================================================
# macOS / Linux (one command):
# curl -fsSL https://ollama.com/install.sh | sh
# Windows:
# Download installer from https://ollama.com/download
# (includes automatic CUDA support if NVIDIA GPU present)
# Verify installation:
# ollama --version
# ollama serve (start server manually if not running)
# ================================================================
# BASIC COMMANDS
# ================================================================
# Run a model (auto-download if not present)
# ollama run llama3.2
# List locally available models
# ollama list
# Pull without running (for pre-downloading)
# ollama pull llama3.2:3b
# Detailed model information
# ollama show llama3.2
# Remove a model (frees disk space)
# ollama rm llama3.2:old-version
# Copy a model with a different name
# ollama cp llama3.2 my-custom-model
# ================================================================
# USEFUL ENVIRONMENT VARIABLES
# ================================================================
# Listen on all interfaces (for network access)
# export OLLAMA_HOST=0.0.0.0:11434
# Custom model directory
# export OLLAMA_MODELS=/mnt/ssd/ollama-models
# Maximum parallel requests (default: 1)
# export OLLAMA_NUM_PARALLEL=4
# Maximum models in memory (default: 1)
# export OLLAMA_MAX_LOADED_MODELS=2
# Time before unloading a model from RAM (default: 5m)
# export OLLAMA_KEEP_ALIVE=30m
# ================================================================
# POPULAR MODELS AND HARDWARE REQUIREMENTS (2025)
# ================================================================
MODELS_GUIDE = {
# SMALL models (for Raspberry Pi / 8 GB laptop)
"qwen2.5:1.5b": {"size": "0.9 GB", "ram": "2 GB", "quality": 7, "rpi5_tps": 4.5},
"llama3.2:1b": {"size": "1.3 GB", "ram": "2 GB", "quality": 7, "rpi5_tps": 5.1},
"phi3.5:mini": {"size": "2.2 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 2.8},
"qwen2.5:3b": {"size": "1.9 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 2.1},
"gemma2:2b": {"size": "1.6 GB", "ram": "3 GB", "quality": 8, "rpi5_tps": 3.2},
# MEDIUM models (16+ GB laptop / desktop)
"llama3.2:3b": {"size": "2.0 GB", "ram": "4 GB", "quality": 8, "rpi5_tps": 1.8},
"mistral:7b": {"size": "4.1 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.8},
"llama3.1:8b": {"size": "4.7 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.6},
"qwen2.5:7b": {"size": "4.4 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.7},
"deepseek-r1:8b": {"size": "4.9 GB", "ram": "8 GB", "quality": 9, "rpi5_tps": 0.5},
# LARGE models (24+ GB workstation / server)
"llama3.1:70b": {"size": "40 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
"qwen2.5:72b": {"size": "41 GB", "ram": "64 GB", "quality": 10, "rpi5_tps": None},
"deepseek-r1:32b": {"size": "19 GB", "ram": "32 GB", "quality": 10, "rpi5_tps": None},
}
print("Recommended models by hardware:")
print(" Raspberry Pi 5 (8GB): qwen2.5:1.5b, llama3.2:1b, gemma2:2b")
print(" 16GB laptop: llama3.1:8b, qwen2.5:7b, mistral:7b")
print(" Mac M2/M3 (24GB): llama3.1:8b, gemma2:9b, qwen2.5:14b")
print(" Workstation 48GB+: llama3.1:70b, deepseek-r1:32b")
Quantization Levels: Which GGUF to Choose?
When you run ollama pull llama3.1:8b, Ollama downloads the model's default
quantization (usually Q4_K_M). You can also choose the quantization level
explicitly, with an important quality/size/speed trade-off.
GGUF Quantization Level Guide
| Tag / Format | Bits/weight | Size (7B) | Perplexity loss | Recommended for |
|---|---|---|---|---|
| Q2_K | 2.63 bit | 2.7 GB | +15-20% | Only when RAM is the absolute constraint |
| Q4_K_S | 4.37 bit | 4.5 GB | +2-3% | Good speed/quality balance |
| Q4_K_M | 4.58 bit | 4.8 GB | +1-2% | Recommended default (sweet spot) |
| Q5_K_M | 5.68 bit | 5.7 GB | +0.5-1% | Maximum quality with <6 GB RAM |
| Q6_K | 6.57 bit | 6.6 GB | +0.1-0.3% | Nearly identical to F16, needs more RAM |
| Q8_0 | 8.5 bit | 8.5 GB | ~0% | Maximum quality, requires 9+ GB RAM |
| F16 | 16 bit | 14 GB | 0% (baseline) | Training/fine-tuning, not for inference |
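A quick sanity check on the sizes in the table: a quantized model file is roughly parameters × bits-per-weight ÷ 8. A minimal sketch (the bits/weight constants come from the table above; real files run slightly larger because embedding and output layers often keep higher precision, plus metadata):

```python
# Back-of-envelope GGUF size estimate: n_params * bits_per_weight / 8 bytes.
# Treat the result as a lower bound on the actual file size.

def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for quant, bits in [("Q4_K_M", 4.58), ("Q5_K_M", 5.68), ("Q8_0", 8.5), ("F16", 16.0)]:
    size = estimate_gguf_size_gb(8.0e9, bits)  # Llama 3.1 8B
    print(f"{quant}: ~{size:.1f} GB")
```

For a 7B model at F16 this gives 7e9 × 16 ÷ 8 = 14 GB, matching the table's baseline row.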
# Explicitly choosing quantization in Ollama
# Tags depend on the model - use 'ollama show' to see options
# Default (Ollama chooses automatically, usually Q4_K_M):
# ollama pull llama3.1:8b
# Specify quantization manually:
# ollama pull llama3.1:8b-instruct-q4_K_M
# ollama pull llama3.1:8b-instruct-q5_K_M
# ollama pull llama3.1:8b-instruct-q8_0
# For HuggingFace models not in Ollama registry:
# Download GGUF manually and import with Modelfile:
IMPORT_GGUF_MODELFILE = """
FROM ./path/to/model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a helpful assistant."
"""
# Write IMPORT_GGUF_MODELFILE to a file named 'Modelfile', then:
# ollama create my-model -f Modelfile
# ollama run my-model
# Performance comparison Q4 vs Q5 vs Q8 (Llama 3.1 8B, MacBook M3 Pro):
QUANT_BENCHMARK = {
"Q4_K_M": {"size_gb": 4.8, "tps": 38.2, "quality_vs_f16": "98.5%"},
"Q5_K_M": {"size_gb": 5.7, "tps": 33.1, "quality_vs_f16": "99.2%"},
"Q6_K": {"size_gb": 6.6, "tps": 29.4, "quality_vs_f16": "99.7%"},
"Q8_0": {"size_gb": 8.5, "tps": 24.8, "quality_vs_f16": "99.9%"},
}
for quant, data in QUANT_BENCHMARK.items():
print(f"{quant}: {data['size_gb']}GB, {data['tps']}t/s, quality={data['quality_vs_f16']}")
Modelfile: Creating Custom Assistants
A Modelfile is Ollama's mechanism for creating custom models. It allows you to define the base model, system prompt, generation parameters (temperature, top_p, context window), and even extend a model with additional files. It is equivalent to a Dockerfile, but for language models.
# ================================================================
# PRACTICAL MODELFILE EXAMPLES
# ================================================================
# --- Modelfile 1: Technical English assistant ---
MODEL_FILE_TECH = """
FROM qwen2.5:7b
# Generation parameters
PARAMETER temperature 0.3 # Low = more deterministic responses
PARAMETER top_p 0.9 # Nucleus sampling
PARAMETER top_k 40 # Top-k sampling
PARAMETER num_ctx 8192 # Context window (4096-32768)
PARAMETER repeat_penalty 1.1 # Avoid repetitions
# System prompt (defines model behavior)
SYSTEM \"\"\"
You are a technical assistant expert in Python, deep learning and machine learning.
Always respond in English, concisely and technically.
When showing code, always use markdown blocks with the language specified.
If you are not sure about something, say so explicitly.
Do not invent information or APIs that don't exist.
\"\"\"
# Welcome message
MESSAGE user "Hello!"
MESSAGE assistant "Hi! I'm your technical assistant. How can I help you with Python, deep learning, or machine learning today?"
"""
# Create the model:
# Write MODEL_FILE_TECH to a file named 'Modelfile-tech-en', then:
# ollama create tech-assistant-en -f Modelfile-tech-en
# ollama run tech-assistant-en
# --- Modelfile 2: Code review assistant ---
MODEL_FILE_CODE = """
FROM llama3.1:8b
PARAMETER temperature 0.1 # Very deterministic for code
PARAMETER num_ctx 16384 # Large context for long files
PARAMETER repeat_penalty 1.05
SYSTEM \"\"\"
You are an expert code reviewer. When reviewing code:
1. Identify bugs, security issues, and performance problems
2. Suggest specific improvements with code examples
3. Follow PEP8/language standards
4. Be concise: list issues with severity (CRITICAL/HIGH/MEDIUM/LOW)
Be direct and actionable. Never hallucinate API methods.
\"\"\"
"""
# --- Modelfile 3: Document RAG assistant ---
MODEL_FILE_RAG = """
FROM qwen2.5:7b
PARAMETER temperature 0.1
PARAMETER num_ctx 32768 # Long context for documents
PARAMETER repeat_penalty 1.0
SYSTEM \"\"\"
You are an assistant that answers ONLY based on the documents provided in the context.
If you cannot find the answer in the context, say exactly: "I don't have information on this in the provided documents."
Never add external information. Always cite the source document in your answer.
\"\"\"
"""
print("Modelfiles ready. To create:")
print(" ollama create tech-assistant-en -f Modelfile-tech-en")
print(" ollama create code-reviewer -f Modelfile-code")
print(" ollama create rag-assistant -f Modelfile-rag")
Ollama REST API: Integration with Python
Ollama exposes two APIs: its own native API and an OpenAI-compatible API. The OpenAI compatibility allows replacing OpenAI APIs with Ollama simply by changing the base URL — without modifying application code.
# pip install ollama openai requests
import ollama
import json, time
from typing import Iterator
# ================================================================
# 1. OFFICIAL OLLAMA LIBRARY (Python)
# ================================================================
# Simple chat (non-streaming)
def chat_simple(model: str, message: str) -> str:
response = ollama.chat(
model=model,
messages=[{"role": "user", "content": message}]
)
return response['message']['content']
# Chat with streaming (token by token)
def chat_streaming(model: str, messages: list) -> Iterator[str]:
stream = ollama.chat(
model=model,
messages=messages,
stream=True
)
for chunk in stream:
if chunk['message']['content']:
yield chunk['message']['content']
# Embeddings for RAG (use nomic-embed-text or mxbai-embed-large)
def get_embedding(model: str, text: str) -> list:
response = ollama.embeddings(model=model, prompt=text)
return response['embedding']
# Chat with images (multimodal models: llava, bakllava, moondream)
def chat_with_image(model: str, prompt: str, image_path: str) -> str:
import base64
with open(image_path, "rb") as f:
image_data = base64.b64encode(f.read()).decode()
    response = ollama.chat(
        model=model,
messages=[{
"role": "user",
"content": prompt,
"images": [image_data]
}]
)
return response['message']['content']
# Chatbot with conversation history
def interactive_chat(model: str = "llama3.2:3b"):
history = []
print(f"Chat with {model} (type 'exit' to quit)")
while True:
user_input = input("You: ").strip()
if user_input.lower() == "exit":
break
history.append({"role": "user", "content": user_input})
print("Assistant: ", end="", flush=True)
full_response = ""
for chunk in chat_streaming(model, history):
print(chunk, end="", flush=True)
full_response += chunk
print()
history.append({"role": "assistant", "content": full_response})
# ================================================================
# 2. OPENAI-COMPATIBLE API (drop-in replacement)
# ================================================================
from openai import OpenAI
# Only change base_url: zero code changes!
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Any string
)
def chat_openai_compatible(model: str, prompt: str) -> str:
"""Identical to OpenAI API, but uses Ollama locally."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
max_tokens=500
)
return response.choices[0].message.content
# ================================================================
# 3. RAW REST API (without Python libraries)
# ================================================================
import requests
def ollama_raw_api(model: str, prompt: str, stream: bool = False) -> str:
"""Call Ollama API directly with requests."""
resp = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": stream,
"options": {
"temperature": 0.7,
"num_predict": 200,
"num_ctx": 4096
}
},
        stream=stream,  # stream the HTTP body when token streaming is enabled
        timeout=120
)
if not stream:
return resp.json()["response"]
else:
result = ""
for line in resp.iter_lines():
if line:
data = json.loads(line)
result += data.get("response", "")
if data.get("done"):
break
return result
# ================================================================
# 4. MODEL SPEED BENCHMARK
# ================================================================
def benchmark_model(model: str, n_runs: int = 3):
"""Measures generation speed in token/s."""
prompt = "Explain quantum computing in one paragraph."
results = []
for _ in range(n_runs):
t0 = time.time()
response = ollama.generate(
model=model,
prompt=prompt,
options={"num_predict": 100}
)
        elapsed = time.time() - t0
        eval_count = response.get('eval_count', 100)
        # Prefer the server-reported eval_duration (ns): wall-clock time
        # also includes prompt processing and possible model loading
        eval_ns = response.get('eval_duration')
        tps = eval_count / (eval_ns / 1e9) if eval_ns else eval_count / elapsed
results.append(tps)
avg_tps = sum(results) / len(results)
print(f"{model}: {avg_tps:.1f} token/s (average {n_runs} runs)")
return avg_tps
# Typical comparison results on MacBook M3 Pro 18GB:
# qwen2.5:1.5b ~85 t/s
# llama3.2:3b ~62 t/s
# qwen2.5:7b ~42 t/s
# llama3.1:8b ~38 t/s
# qwen2.5:14b ~22 t/s
# llama3.1:70b ~8 t/s
Ollama with LangChain: Offline RAG Pipeline
Ollama integrates natively with LangChain, enabling completely offline RAG
(Retrieval-Augmented Generation) pipelines. This is particularly relevant for enterprise
applications that cannot send sensitive data to the cloud. The nomic-embed-text
model is a strong default for local embeddings.
# pip install langchain langchain-ollama langchain-community
# pip install faiss-cpu chromadb pypdf
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import (
DirectoryLoader, TextLoader, PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os
# ================================================================
# OFFLINE RAG SYSTEM WITH OLLAMA - PRODUCTION VERSION
# ================================================================
class OllamaRAGSystem:
"""
Complete offline RAG system with Ollama.
Supports PDF, TXT, and entire directories.
Uses FAISS for local vector storage.
"""
def __init__(
self,
llm_model: str = "llama3.1:8b",
embed_model: str = "nomic-embed-text", # ollama pull nomic-embed-text
kb_dir: str = "./knowledge_base"
):
self.llm_model = llm_model
self.embed_model = embed_model
self.kb_dir = kb_dir
self.embeddings = OllamaEmbeddings(model=embed_model)
self.llm = OllamaLLM(
model=llm_model,
temperature=0.1,
num_ctx=8192,
num_predict=512
)
self.vectorstore = None
def load_documents(self, docs_dir: str) -> list:
"""Load documents from directory (PDF, TXT, MD)."""
docs = []
# Load TXT and MD
txt_loader = DirectoryLoader(
docs_dir, glob="**/*.txt", loader_cls=TextLoader
)
docs.extend(txt_loader.load())
# Load PDFs
for pdf_file in os.listdir(docs_dir):
if pdf_file.endswith(".pdf"):
loader = PyPDFLoader(os.path.join(docs_dir, pdf_file))
docs.extend(loader.load())
print(f"Loaded {len(docs)} documents from {docs_dir}")
return docs
def build_knowledge_base(self, docs_dir: str) -> None:
"""Create and save knowledge base from a directory."""
documents = self.load_documents(docs_dir)
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n\n", "\n", " ", ""]
)
texts = splitter.split_documents(documents)
print(f"Created {len(texts)} chunks")
self.vectorstore = FAISS.from_documents(texts, self.embeddings)
self.vectorstore.save_local(self.kb_dir)
print(f"Knowledge base saved to {self.kb_dir}")
def load_knowledge_base(self) -> None:
"""Load existing knowledge base from disk."""
self.vectorstore = FAISS.load_local(
self.kb_dir, self.embeddings,
allow_dangerous_deserialization=True
)
print(f"Knowledge base loaded: {self.vectorstore.index.ntotal} vectors")
def create_qa_chain(self) -> RetrievalQA:
"""Create Q&A chain over documents."""
prompt_template = """Use the following context to answer the question.
If you cannot find the answer in the context, explicitly say you don't know.
Do not make up information not present in the context.
Context:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
template=prompt_template,
input_variables=["context", "question"]
)
retriever = self.vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance (more diverse)
search_kwargs={"k": 5, "fetch_k": 20}
)
return RetrievalQA.from_chain_type(
llm=self.llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": PROMPT},
return_source_documents=True
)
def ask(self, question: str, qa_chain: RetrievalQA) -> dict:
"""Ask a question to the RAG system."""
result = qa_chain.invoke({"query": question})
sources = list(set([
doc.metadata.get("source", "Unknown")
for doc in result["source_documents"]
]))
return {
"answer": result["result"],
"sources": sources,
"n_docs": len(result["source_documents"])
}
# Usage:
# rag = OllamaRAGSystem(llm_model="llama3.1:8b")
# rag.build_knowledge_base("./company_documents")
# chain = rag.create_qa_chain()
# result = rag.ask("What is the company vacation policy?", chain)
# print(result["answer"])
# print("Sources:", result["sources"])
print("RAG system ready!")
OpenWebUI: ChatGPT Interface for Ollama
OpenWebUI (formerly Ollama WebUI) is the most popular interface for Ollama, with a user experience identical to ChatGPT but completely offline. It supports chat, document upload, conversation management, prompt sharing, integrated RAG, and multimodal mode for images.
# ================================================================
# OPENWEBUI SETUP WITH DOCKER
# ================================================================
# Case 1: Ollama on the same host
# docker run -d -p 3000:8080 \
# -v open-webui:/app/backend/data \
# -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
# --name open-webui \
# ghcr.io/open-webui/open-webui:main
# Case 2: OpenWebUI with integrated Ollama (all-in-one)
# docker run -d -p 3000:8080 \
# -v ollama:/root/.ollama \
# -v open-webui:/app/backend/data \
# --gpus all \
# --name open-webui \
# ghcr.io/open-webui/open-webui:ollama
# Access: http://localhost:3000
# ================================================================
# DOCKER COMPOSE (recommended for production)
# ================================================================
DOCKER_COMPOSE = """
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=2
# For NVIDIA GPU:
# deploy:
# resources:
# reservations:
# devices:
# - driver: nvidia
# count: all
# capabilities: [gpu]
restart: unless-stopped
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
ports:
- "3000:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_AUTH=True
- WEBUI_SECRET_KEY=change-this-secret-key
depends_on:
- ollama
restart: unless-stopped
volumes:
ollama_data:
webui_data:
"""
# ================================================================
# OLLAMA AS SYSTEMD SERVICE (Linux production)
# ================================================================
SYSTEMD_SERVICE = """
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/opt/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=10m"
[Install]
WantedBy=default.target
"""
# sudo systemctl enable ollama
# sudo systemctl start ollama
# sudo journalctl -u ollama -f # Real-time logs
print("Ollama service setup complete!")
Deployment on Raspberry Pi: Optimized Setup
The Raspberry Pi 5 with 8 GB of RAM is the most accessible edge device for local LLMs. With the right configuration, 1.5B parameter models reach 4-5 token/s — sufficient for many non-real-time use cases: low-volume chatbots, batch text analysis, event-triggered automation.
# ================================================================
# OLLAMA ON RASPBERRY PI 5 (optimized setup)
# ================================================================
# Installation (identical to x86 Linux):
# curl -fsSL https://ollama.com/install.sh | sh
# Optimal configuration for RPi5 in /etc/environment:
# OLLAMA_NUM_PARALLEL=1 # One request at a time (limited RAM)
# OLLAMA_MAX_LOADED_MODELS=1 # One model in memory
# OLLAMA_KEEP_ALIVE=5m # Unload model after 5 min inactivity
# OLLAMA_NUM_THREAD=4 # All Cortex-A76 cores
# Recommended models for RPi5 (8GB):
# ollama pull qwen2.5:1.5b (fast: ~4.5 t/s, 1.8 GB RAM)
# ollama pull llama3.2:1b (balanced: ~5.1 t/s, 1.4 GB RAM)
# ollama pull gemma2:2b (quality: ~3.2 t/s, 2.5 GB RAM)
import ollama
import time, statistics, psutil
def benchmark_ollama_rpi(model: str = "qwen2.5:1.5b",
n_tests: int = 5):
"""Test speed and consistency on RPi."""
prompt = "Explain in 3 sentences what machine learning is."
results = []
latencies_to_first = []
print(f"Benchmarking {model} over {n_tests} tests...")
for i in range(n_tests):
t0 = time.time()
first_token = None
full_response = ""
for chunk in ollama.chat(
model=model,
messages=[{"role": "user", "content": prompt}],
stream=True,
options={"temperature": 0, "top_k": 1, "num_predict": 50}
):
content = chunk['message']['content']
if content and first_token is None:
first_token = time.time() - t0
latencies_to_first.append(first_token * 1000)
full_response += content
elapsed = time.time() - t0
n_tokens = len(full_response.split()) # Approximation
tps = n_tokens / elapsed
results.append(tps)
print(f" Test {i+1}: {tps:.1f} t/s, TTFT: {first_token*1000:.0f}ms")
mean_tps = statistics.mean(results)
mean_ttft = statistics.mean(latencies_to_first)
mem = psutil.virtual_memory()
print(f"\nResults {model} on RPi5:")
print(f" Mean speed: {mean_tps:.1f} t/s")
print(f" Mean TTFT: {mean_ttft:.0f} ms")
print(f" RAM used: {mem.used/(1024**3):.1f} GB / {mem.total/(1024**3):.1f} GB")
return mean_tps
# ================================================================
# AUTOMATION: Model updates and monitoring
# ================================================================
import subprocess, datetime
def update_ollama_models(models: list = ["qwen2.5:1.5b", "nomic-embed-text"]):
"""Update Ollama models (run with cron)."""
log = []
for model in models:
print(f"Updating {model}...")
result = subprocess.run(
["ollama", "pull", model],
capture_output=True, text=True, timeout=600
)
status = "OK" if result.returncode == 0 else "FAIL"
log.append({
"model": model,
"status": status,
"time": datetime.datetime.now().isoformat()
})
print(f" {model}: {status}")
return log
# Recommended cron job (every Sunday at 3:00 AM):
# 0 3 * * 0 /usr/bin/python3 /home/pi/update_models.py >> /var/log/ollama-update.log 2>&1
Real-World Case Study: Offline Enterprise Chatbot
A real use case: a company handling sensitive documents (contracts, HR policies, technical manuals) wants an internal chatbot without exposing data to the cloud. With Ollama and RAG, a completely air-gapped system can be built in less than a day.
# ================================================================
# OFFLINE ENTERPRISE CHATBOT - Full Stack
# ================================================================
# Stack:
# - Ollama with llama3.1:8b (or qwen2.5:7b for better multilingual)
# - nomic-embed-text for embeddings
# - FAISS for vector store
# - FastAPI for REST API
# - OpenWebUI for user interface
# fastapi_chatbot.py
from fastapi import FastAPI
from pydantic import BaseModel
import ollama, time
app = FastAPI(title="Corporate AI Assistant", version="2.0")
# Global state (use Redis in production)
conversation_store = {}
rag_system = None  # assign an OllamaRAGSystem instance at startup to enable RAG
class ChatRequest(BaseModel):
session_id: str
message: str
model: str = "qwen2.5:7b"
use_rag: bool = True
class ChatResponse(BaseModel):
session_id: str
response: str
sources: list = []
model: str
tokens_per_sec: float
@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
# Retrieve conversation history
if request.session_id not in conversation_store:
conversation_store[request.session_id] = []
history = conversation_store[request.session_id]
# Add RAG context if requested
context = ""
sources = []
if request.use_rag and rag_system and rag_system.vectorstore:
docs = rag_system.vectorstore.similarity_search(
request.message, k=3
)
context = "\n\n".join([d.page_content for d in docs])
sources = list(set([d.metadata.get("source", "") for d in docs]))
augmented_message = f"""Context from company documents:
{context}
Question: {request.message}"""
else:
augmented_message = request.message
history.append({"role": "user", "content": augmented_message})
# Generate response
t0 = time.time()
response = ollama.chat(
model=request.model,
messages=history,
options={"num_ctx": 8192, "temperature": 0.3}
)
elapsed = time.time() - t0
assistant_msg = response['message']['content']
history.append({"role": "assistant", "content": assistant_msg})
# Truncate history if too long (sliding window)
if len(history) > 20:
history = history[-20:]
conversation_store[request.session_id] = history
eval_count = response.get('eval_count', 50)
tps = eval_count / elapsed if elapsed > 0 else 0
return ChatResponse(
session_id=request.session_id,
response=assistant_msg,
sources=sources,
model=request.model,
tokens_per_sec=round(tps, 1)
)
@app.get("/models")
async def list_models():
"""List available models on this Ollama server."""
models = ollama.list()
return {
"models": [
{"name": m['name'], "size_gb": m['size'] / 1e9}
for m in models['models']
]
}
# Start: uvicorn fastapi_chatbot:app --host 0.0.0.0 --port 8080
Model Comparison for Common Use Cases
| Use Case | Recommended Model | Min RAM | Why |
|---|---|---|---|
| English chatbot | qwen2.5:7b | 8 GB | Excellent multilingual, long context |
| Code generation | qwen2.5-coder:7b | 8 GB | Fine-tuned on code, 90+ languages |
| RAG / Q&A documents | llama3.1:8b | 8 GB | Excellent instruction following, 128K context |
| Advanced reasoning | deepseek-r1:8b | 8 GB | Chain-of-thought, math, logic |
| Raspberry Pi (fast) | llama3.2:1b | 2 GB | 5+ t/s, useful for simple tasks |
| Raspberry Pi (quality) | qwen2.5:3b | 4 GB | Optimal quality/speed balance |
| Mac M-series (fast) | qwen2.5:14b | 16 GB | 22+ t/s on M2/M3, near GPT-4 quality |
| Image analysis | llava:7b or moondream | 8 GB | Multimodal models optimized for vision |
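The table above can be turned into a simple selection helper. A hypothetical sketch (the minimum-RAM figures are copied from the table; `headroom_gb` is an assumption to leave room for the OS and KV cache):

```python
# Illustrative helper: list models that fit a given RAM budget,
# largest (usually highest-quality) first.

MODEL_MIN_RAM_GB = {
    "llama3.2:1b": 2, "qwen2.5:3b": 4, "qwen2.5:7b": 8,
    "qwen2.5-coder:7b": 8, "llama3.1:8b": 8, "deepseek-r1:8b": 8,
    "qwen2.5:14b": 16,
}

def models_for_ram(ram_gb: float, headroom_gb: float = 2.0) -> list[str]:
    """Return candidate models whose minimum RAM fits the budget.

    headroom_gb reserves memory for the OS and the KV cache.
    """
    budget = ram_gb - headroom_gb
    fitting = [m for m, need in MODEL_MIN_RAM_GB.items() if need <= budget]
    return sorted(fitting, key=lambda m: MODEL_MIN_RAM_GB[m], reverse=True)

print(models_for_ram(8))    # Raspberry Pi 5 class
print(models_for_ram(16))   # 16 GB laptop
```

On an 8 GB device this rules out all 7B-8B models, which matches the Raspberry Pi recommendations above.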
Production Best Practices
Using Ollama in production requires specific considerations compared to personal use. Here are the most important patterns.
# ================================================================
# PRODUCTION PATTERN: Health Check and Monitoring
# ================================================================
import requests, time, functools, random
def monitor_ollama(host: str = "localhost", port: int = 11434):
"""Check Ollama availability and load."""
try:
resp = requests.get(f"http://{host}:{port}/api/tags", timeout=5)
if resp.status_code == 200:
models = resp.json().get("models", [])
print(f"Ollama OK: {len(models)} models available")
return True
except requests.exceptions.RequestException as e:
print(f"Ollama UNREACHABLE: {e}")
return False
# ================================================================
# ERROR HANDLING AND RETRY
# ================================================================
def with_ollama_retry(max_attempts: int = 3, backoff: float = 1.0):
"""Decorator for automatic retry on Ollama errors."""
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_attempts):
try:
return func(*args, **kwargs)
except Exception as e:
if attempt == max_attempts - 1:
raise
wait = backoff * (2 ** attempt) + random.uniform(0, 0.5)
print(f"Attempt {attempt+1} failed: {e}. Retry in {wait:.1f}s")
time.sleep(wait)
return wrapper
return decorator
@with_ollama_retry(max_attempts=3, backoff=1.0)
def robust_chat(model: str, message: str) -> str:
"""Chat with automatic retry on network/timeout errors."""
response = ollama.chat(
model=model,
messages=[{"role": "user", "content": message}],
options={"num_predict": 500}
)
return response['message']['content']
# ================================================================
# NGINX LOAD BALANCING (multiple Ollama instances)
# ================================================================
NGINX_CONFIG = """
upstream ollama_cluster {
least_conn; # Route to connection with fewest requests
server server1:11434;
server server2:11434;
server server3:11434;
}
server {
listen 80;
location /api/ {
proxy_pass http://ollama_cluster;
proxy_read_timeout 300s; # High timeout for long generation
proxy_connect_timeout 10s;
proxy_set_header Host $host;
}
}
"""
Limitations and Production Considerations
- Ollama is not multi-tenant by default: on a shared server, requests are serialized. Set OLLAMA_NUM_PARALLEL=4 to handle concurrent requests (requires more RAM: roughly 8 GB per request with a 7B model).
- Timeout on RPi with large models: llama3.1:8b takes 10-15 seconds to generate the first response on RPi. Use num_ctx=512 to reduce prefill time in time-sensitive cases. For TTFT under 2 s, use 1-3B models.
- No automatic autoscaling: unlike cloud APIs, Ollama does not scale. For high traffic, use load balancing with multiple Ollama instances on different servers, or evaluate vLLM for GPU deployment.
- Continuous power consumption: keeping Ollama active with a loaded model consumes ~15 W on RPi5, ~45 W on Jetson Orin NX. Use OLLAMA_KEEP_ALIVE=0 to unload the model immediately after each request.
- Security: Ollama does not authenticate requests by default. In production, always put a reverse proxy (nginx) in front with authentication and rate limiting. Never expose port 11434 directly to the internet.
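Keep-alive can also be controlled per request rather than globally: the native API accepts a keep_alive field on /api/generate and /api/chat, so a low-traffic endpoint can release RAM immediately while a hot path keeps its model resident. A minimal sketch that builds such a request payload (the model name is illustrative):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive=0) -> dict:
    """Payload for POST http://localhost:11434/api/generate.

    keep_alive=0 tells the server to unload the model right after the
    response (frees RAM, cold start on the next request); a duration
    string like "30m" keeps it resident instead.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }

payload = generate_payload("qwen2.5:1.5b", "ping", keep_alive=0)
print(json.dumps(payload))
# Send with: requests.post("http://localhost:11434/api/generate", json=payload)
```

This gives finer control than the server-wide OLLAMA_KEEP_ALIVE variable when different endpoints have different latency budgets.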
Conclusions
Ollama has reduced the barrier to local AI to nearly zero. With a single command you can have a competitive LLM running on your laptop, with complete privacy and zero API costs. The trend toward local LLMs is unstoppable: Gartner predicts that by 2027, SLMs (Small Language Models) will be used three times more frequently than cloud LLMs, with a 70% reduction in operational costs.
For production, Ollama is an excellent starting point but requires some considerations: concurrency management, monitoring, model updates, security, and integration with existing systems. The most powerful pattern is combining Ollama with a RAG pipeline to give the model access to private knowledge bases without sending data to the cloud.
The next article in the series closes the loop with Benchmarks and Optimization: how to systematically measure the performance of all the tools seen in the series — quantization, distillation, pruning, edge deployment — and choose the optimal combination for your use case.
Next Steps
- Next article: Benchmarks and Optimization: from 48GB to 8GB RTX
- Related: Deep Learning on Edge Devices
- Related: Quantization: GPTQ, AWQ, GGUF
- AI Engineering series: RAG Pipeline with Local LLMs
- MLOps series: Model Serving in Production