# Case Law Search Engine with Vector Embeddings
A lawyer researching "product liability for defective goods" could easily miss a landmark ruling framed as "manufacturer responsibility for product defects under strict liability." Traditional full-text search — built on exact keyword matching — fails systematically in a domain where the same legal concept may appear in dozens of formulations across different eras, jurisdictions, and courts.
Vector embeddings and semantic search solve this at the root: instead of comparing words, they compare meaning. A query about "contract voidance for lack of mutual assent" automatically surfaces cases on "annulment for mistake in consent" because both concepts reside in adjacent regions of the embedding space. In this article we build a production-ready case law search engine from scratch using Python, domain-specific legal embedding models, and a vector database.
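To make "adjacent regions of the embedding space" concrete, here is the similarity computation at the heart of everything that follows. The three-dimensional vectors are purely illustrative stand-ins; real models emit 768 to 3072 dimensions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings"; the values are illustrative only.
contract_voidance = np.array([0.82, 0.41, 0.11])   # "contract voidance for lack of mutual assent"
annulment_mistake = np.array([0.79, 0.47, 0.15])   # "annulment for mistake in consent"
product_liability = np.array([0.10, 0.22, 0.95])   # "product liability for defective goods"

print(cosine(contract_voidance, annulment_mistake))  # close to 1.0: same legal concept
print(cosine(contract_voidance, product_liability))  # much lower: different concept
```

With normalized embeddings (as used throughout this article), cosine similarity reduces to a plain dot product, which is what makes FAISS inner-product search applicable later on.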
## What You Will Learn
- Architecture of a semantic search engine for case law
- Legal-domain embedding models (legal-BERT, ModernBERT, Voyage-law)
- Efficient indexing with FAISS and Pinecone
- Hybrid search: BM25 + vector similarity for maximum precision
- Cross-encoder re-ranking for final result refinement
- FastAPI REST service for LegalTech application integration
## System Architecture
A modern case law search engine consists of three main layers:
- **Ingestion Pipeline**: downloads, normalizes, and processes rulings from official sources (EUR-Lex, ECLI API, CourtListener). Produces chunked documents ready for embedding.
- **Indexing Engine**: generates vector embeddings for each ruling chunk and indexes them in a vector store (FAISS for self-hosted, Pinecone for managed).
- **Query Engine**: processes user queries, transforms them into embeddings, executes vector search, applies hybrid re-ranking, and returns results with verifiable citations.
```python
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import date
from enum import Enum


class JurisdictionType(Enum):
    SUPREME_COURT = "supreme_court"
    COURT_OF_APPEALS = "court_of_appeals"
    DISTRICT_COURT = "district_court"
    CONSTITUTIONAL_COURT = "constitutional_court"
    EU_COURT_OF_JUSTICE = "ecj"
    ECHR = "echr"


@dataclass
class CourtDecision:
    """Represents a court ruling indexed in the system."""
    ecli: str                # e.g., ECLI:EU:C:2024:123
    court: JurisdictionType
    date: date
    number: str
    subject_matter: str      # civil, criminal, administrative...
    keywords: List[str]
    headnotes: str           # ratio decidendi / holding
    full_text: str
    citations: List[str]
    cited_by: List[str] = field(default_factory=list)


@dataclass
class ChunkedDecision:
    """Semantically chunked ruling ready for embedding."""
    chunk_id: str
    ecli: str
    chunk_type: str          # "headnote", "facts", "reasoning", "holding"
    content: str
    embedding: Optional[List[float]] = None
    metadata: dict = field(default_factory=dict)
```
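A minimal sketch of the chunking step that produces `ChunkedDecision`-style records. The section header patterns and the word cap are assumptions for illustration; real rulings need court-specific parsing rules.

```python
import re
from typing import List

# Hypothetical section headers; real pipelines need per-court parsing rules.
HEADER_RE = re.compile(r"(?m)^(FACTS|REASONING|HOLDING)\s*$")

def chunk_decision(ecli: str, full_text: str, max_words: int = 350) -> List[dict]:
    """Split a ruling at section headers, then cap each section at max_words per chunk."""
    chunks = []
    parts = HEADER_RE.split(full_text)
    # parts = [preamble, "FACTS", facts_text, "REASONING", reasoning_text, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        words = body.split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "chunk_id": f"{ecli}#{header.lower()}-{i // max_words}",
                "ecli": ecli,
                "chunk_type": header.lower(),
                "content": " ".join(words[i:i + max_words]),
            })
    return chunks

sample = ("FACTS\nThe claimant bought a defective machine.\n"
          "REASONING\nUnder strict liability the seller is responsible.\n"
          "HOLDING\nAppeal dismissed.")
for c in chunk_decision("ECLI:EU:C:2024:123", sample):
    print(c["chunk_id"], c["chunk_type"])
```

Keeping `chunk_type` per chunk pays off later: re-ranking and filtering can privilege "holding" chunks over procedural "facts" sections.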
## Choosing the Embedding Model

The embedding model choice is critical for search quality. General-purpose models like OpenAI's `text-embedding-3-large` perform well, but models pre-trained on legal corpora significantly outperform them on specialized legal retrieval tasks.
| Model | Dimensions | Specialization | NDCG@10 (legal) | Deployment |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | General | 0.71 | API (OpenAI) |
| nlpaueb/legal-bert-base | 768 | Legal (EN) | 0.79 | HuggingFace |
| Free-Law-Project/modernbert | 768 | Case Law (EN) | 0.83 | HuggingFace |
| Voyage-law-2 | 1024 | Legal (multilingual) | 0.86 | API (Voyage AI) |
```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List


class LegalEmbeddingService:
    """
    Embedding service specialized for legal texts.
    Supports local HuggingFace models and remote APIs.
    """

    def __init__(self, model_name: str = "nlpaueb/legal-bert-base-uncased"):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = self.model.get_sentence_embedding_dimension()

    def encode_texts(
        self,
        texts: List[str],
        batch_size: int = 32,
        normalize: bool = True,
    ) -> np.ndarray:
        """Generate normalized embeddings for cosine similarity via dot product."""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            normalize_embeddings=normalize,
            show_progress_bar=len(texts) > 100,
            convert_to_numpy=True,
        )

    def encode_query(self, query: str) -> np.ndarray:
        """Encode user query with model-specific prefixes where required."""
        if "e5" in self.model_name.lower():
            query = f"query: {query}"
        elif "instructor" in self.model_name.lower():
            query = f"Represent the legal question for retrieval: {query}"
        return self.model.encode([query], normalize_embeddings=True, convert_to_numpy=True)[0]
```
## Indexing with FAISS

FAISS (Facebook AI Similarity Search) is the reference library for high-performance vector search on large datasets. For a collection of 10 million rulings, an IVF (Inverted File) index with Product Quantization (PQ) keeps response times below 100 ms on commodity CPU hardware.
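The back-of-envelope arithmetic behind that choice: with 8 PQ segments at 8 bits each, every vector compresses to 8 bytes of codes.

```python
n_vectors = 10_000_000
dim = 768

flat_bytes = n_vectors * dim * 4   # float32: 4 bytes per component, no compression
pq_bytes = n_vectors * 8           # IVF+PQ: 8 segments x 8 bits = 8 bytes per vector

print(f"Flat float32 index: {flat_bytes / 1e9:.1f} GB")
print(f"IVF+PQ codes:       {pq_bytes / 1e9:.2f} GB (plus centroids and overhead)")
```

Roughly 30 GB shrinks to well under 1 GB, at the cost of approximate (rather than exact) distances, which the re-ranking stage later compensates for.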
```python
import faiss
import numpy as np
from typing import List, Optional


class FAISSCaseLawIndex:
    """
    FAISS index optimized for case law retrieval.
    Supports flat (small datasets) and IVF+PQ (millions of rulings).
    """

    def __init__(self, embedding_dim: int, index_type: str = "ivf"):
        self.embedding_dim = embedding_dim
        self.index_type = index_type
        self.index = None
        self.id_to_metadata = {}  # FAISS position -> chunk metadata

    def build_index(
        self,
        embeddings: np.ndarray,
        metadata: Optional[List[dict]] = None,
        num_clusters: int = 1024,
    ):
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        if self.index_type == "flat":
            self.index = faiss.IndexFlatIP(self.embedding_dim)
        elif self.index_type == "ivf":
            quantizer = faiss.IndexFlatIP(self.embedding_dim)
            # embedding_dim must be divisible by the number of PQ segments
            pq_segments = min(self.embedding_dim, 8)
            self.index = faiss.IndexIVFPQ(
                quantizer, self.embedding_dim,
                num_clusters, pq_segments, 8  # 8 bits per PQ code
            )
            self.index.train(embeddings)
            self.index.nprobe = 64  # clusters probed at query time
        self.index.add(embeddings)
        if metadata:
            self.id_to_metadata = dict(enumerate(metadata))
        print(f"Index built: {self.index.ntotal} vectors")

    def search(
        self,
        query_embedding: np.ndarray,
        k: int = 20,
        score_threshold: float = 0.6,
    ) -> List[dict]:
        query = query_embedding.reshape(1, -1).astype(np.float32)
        scores, indices = self.index.search(query, k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1 and score >= score_threshold:
                results.append({**self.id_to_metadata.get(int(idx), {}), 'score': float(score)})
        return results
```
## Hybrid Search: BM25 + Vector Similarity
Pure semantic search excels at finding related concepts but can miss exact matches for precise statutory references (e.g., "42 U.S.C. § 1983", "GDPR Article 17"). BM25 (keyword-based) is excellent for exact matches but blind to semantics. The hybrid approach combines both using Reciprocal Rank Fusion (RRF).
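A toy run of the RRF formula with made-up chunk IDs shows the key property: a document ranked reasonably well by both retrievers beats a document ranked first by only one.

```python
# Two ranked lists over made-up chunk IDs; rank 0 is the best hit.
vector_ranked = ["c3", "c1", "c7"]   # semantic ranking
bm25_ranked = ["c1", "c9", "c3"]     # keyword ranking

k, w_vec, w_bm25 = 60, 0.6, 0.4
scores = {}
for rank, doc in enumerate(vector_ranked):
    scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank + 1)
for rank, doc in enumerate(bm25_ranked):
    scores[doc] = scores.get(doc, 0.0) + w_bm25 / (k + rank + 1)

fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # c1 and c3 appear in both lists and rise to the top
```

The constant `k = 60` is the conventional RRF smoothing value; it keeps a single first-place rank from dominating the fused score.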
```python
from rank_bm25 import BM25Okapi
import re
import numpy as np
from typing import List, Tuple


class HybridCaseLawSearch:
    """Hybrid BM25 + vector similarity with Reciprocal Rank Fusion."""

    def __init__(self, embedding_service, faiss_index, corpus: List[dict]):
        self.embedding_service = embedding_service
        self.faiss_index = faiss_index
        self.corpus = corpus
        tokenized_corpus = [self._tokenize_legal(doc['content']) for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def _tokenize_legal(self, text: str) -> List[str]:
        # Preserve statutory references as single tokens ("§ 1983" -> "section_1983")
        text = re.sub(r'§\s*(\d+)', r'section_\1', text)
        text = re.sub(r'art\.\s*(\d+)', r'art_\1', text, flags=re.IGNORECASE)
        tokens = re.findall(r'\b[a-zA-Z_][a-zA-Z0-9_]*\b', text.lower())
        stopwords = {'the', 'a', 'an', 'in', 'of', 'to', 'and', 'or', 'for', 'is', 'are'}
        return [t for t in tokens if t not in stopwords and len(t) > 2]

    def _reciprocal_rank_fusion(
        self,
        vector_results: List[dict],
        bm25_results: List[Tuple[int, float]],
        k: int = 60,
        vector_weight: float = 0.6,
        bm25_weight: float = 0.4,
    ) -> List[dict]:
        rrf_scores = {}
        for rank, result in enumerate(vector_results):
            doc_id = result['chunk_id']
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = {'score': 0, 'data': result}
            rrf_scores[doc_id]['score'] += vector_weight / (k + rank + 1)
        for rank, (doc_idx, _) in enumerate(bm25_results):
            doc_id = self.corpus[doc_idx]['chunk_id']
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = {'score': 0, 'data': self.corpus[doc_idx]}
            rrf_scores[doc_id]['score'] += bm25_weight / (k + rank + 1)
        sorted_results = sorted(rrf_scores.values(), key=lambda x: x['score'], reverse=True)
        return [{**r['data'], 'rrf_score': r['score']} for r in sorted_results]

    def search(self, query: str, top_k: int = 10) -> List[dict]:
        query_embedding = self.embedding_service.encode_query(query)
        vector_results = self.faiss_index.search(query_embedding, k=50)
        tokenized_query = self._tokenize_legal(query)
        bm25_scores = self.bm25.get_scores(tokenized_query)
        top_bm25_indices = np.argsort(bm25_scores)[::-1][:50]
        bm25_results = [(int(idx), bm25_scores[idx]) for idx in top_bm25_indices]
        fused = self._reciprocal_rank_fusion(vector_results, bm25_results)
        return fused[:top_k]
```
## Cross-Encoder Re-Ranking
After hybrid search, a cross-encoder re-ranking step further improves precision. Cross-encoders process the (query, document) pair jointly, producing a significantly more accurate relevance score than bi-encoders — but at higher computational cost, which is why they are only applied to the top-K candidates from the initial search.
```python
from sentence_transformers import CrossEncoder
from typing import List


class LegalReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(self, query: str, candidates: List[dict], top_k: int = 5) -> List[dict]:
        if not candidates:
            return []
        pairs = [(query, c['content'][:400]) for c in candidates]
        scores = self.model.predict(pairs)
        for candidate, score in zip(candidates, scores):
            candidate['rerank_score'] = float(score)
        reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)
        return reranked[:top_k]


class CaseLawSearchEngine:
    def __init__(self, hybrid_searcher, reranker):
        self.hybrid_searcher = hybrid_searcher
        self.reranker = reranker

    def search(self, query: str, top_k: int = 5) -> List[dict]:
        candidates = self.hybrid_searcher.search(query, top_k=20)
        results = self.reranker.rerank(query, candidates, top_k=top_k)
        return [{
            'ecli': r.get('ecli', 'N/A'),
            'court': r.get('court', 'N/A'),
            'date': r.get('date', 'N/A'),
            'headnote': r.get('headnote', ''),
            'excerpt': r['content'][:300] + "...",
            'relevance_score': r['rerank_score'],
        } for r in results]
```
## ECLI and European Standards

The European Case Law Identifier (ECLI) is the EU standard for uniquely identifying court decisions across member states. An ECLI takes the form `ECLI:{country}:{court}:{year}:{number}`. Always include the ECLI in your index metadata to ensure citations are verifiable and machine-readable.
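The format lends itself to a simple validator at ingestion time. This is a sketch: the field lengths and character classes below approximate the ECLI specification but have not been checked against every national profile.

```python
import re

ECLI_RE = re.compile(
    r"^ECLI:"
    r"(?P<country>[A-Z]{2}):"        # ISO country code, or EU
    r"(?P<court>[A-Z0-9]{1,7}):"     # court code assigned nationally
    r"(?P<year>\d{4}):"
    r"(?P<ordinal>[A-Za-z0-9.]{1,25})$"
)

def parse_ecli(ecli: str) -> dict:
    """Return the ECLI components, or raise on malformed identifiers."""
    m = ECLI_RE.match(ecli)
    if not m:
        raise ValueError(f"invalid ECLI: {ecli}")
    return m.groupdict()

print(parse_ecli("ECLI:EU:C:2024:123"))
```

Rejecting malformed identifiers at ingestion, rather than at citation time, keeps every result returned by the engine machine-resolvable.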
## Official Indexing Sources
- EUR-Lex: EU rulings with SPARQL API and bulk download
- CourtListener (Free Law Project): US case law, open source
- CURIA (ECJ): curia.europa.eu API with XML/JSON output
- ECHR: hudoc.echr.coe.int with open data exports
- Justia: US federal and state case law with free API
## Anti-Patterns to Avoid
- Embedding full ruling text without chunking: 10,000-word rulings produce "diluted" embeddings. Always chunk by section (facts, reasoning, holding).
- Score threshold too low: returning all results above 0.3 creates too much noise. Start with 0.65 and calibrate with user feedback.
- Ignoring decision date: a ruling on repealed legislation is irrelevant for current practice. Always apply temporal filters.
- Missing E5 model prefixes: E5 models require different prefixes for "query" vs "passage". Ignoring this degrades performance by ~15%.
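The temporal-filter advice can start as a cheap post-filter over result metadata. The field names mirror the CourtDecision dataclass; the cutoff date is only an example.

```python
from datetime import date
from typing import List

def filter_by_date(results: List[dict], not_before: date) -> List[dict]:
    """Drop rulings older than a cutoff, e.g. the entry into force of the current law."""
    return [r for r in results if r.get("date") and r["date"] >= not_before]

results = [
    {"ecli": "ECLI:EU:C:1995:1", "date": date(1995, 3, 1)},
    {"ecli": "ECLI:EU:C:2021:9", "date": date(2021, 6, 15)},
]
current = filter_by_date(results, not_before=date(2018, 5, 25))  # e.g. GDPR application date
print([r["ecli"] for r in current])
```

At scale, prefer pre-filtering in the vector store (FAISS ID selectors or Pinecone metadata filters) so that stale rulings never consume slots in the top-K candidate set.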
## Conclusions
A case law search engine built on vector embeddings outperforms traditional full-text search on every metric that matters for legal professionals: recall of relevant precedents, robustness to terminological variation, and ability to find conceptual analogies across different factual patterns.
The complete pipeline — specialized embeddings + FAISS + BM25 hybrid + cross-encoder — achieves production performance on datasets of millions of rulings with latencies under 200ms. The code in this article is the ideal starting point for the retrieval core of any modern LegalTech platform.
## LegalTech & AI Series
- NLP for Contract Analysis: From OCR to Understanding
- e-Discovery Platform Architecture
- Compliance Automation with Dynamic Rules Engines
- Smart Contracts for Legal Agreements: Solidity and Vyper
- Legal Document Summarization with Generative AI
- Case Law Search Engine: Vector Embeddings (this article)
- Digital Signature and Document Authentication at Scale
- Data Privacy and GDPR Compliance Systems
- Building a Legal AI Assistant (Legal Copilot)
- LegalTech Data Integration Patterns