# Case Law Search Engine with Vector Embeddings
A lawyer researching "product liability for defective goods" could easily miss a landmark ruling framed as "manufacturer responsibility for product defects under strict liability." Traditional full-text search — built on exact keyword matching — fails systematically in a domain where the same legal concept may appear in dozens of formulations across different eras, jurisdictions, and courts.
Vector embeddings and semantic search solve this at the root: instead of comparing words, they compare meaning. A query about "contract voidance for lack of mutual assent" automatically surfaces cases on "annulment for mistake in consent" because both concepts reside in adjacent regions of the embedding space. In this article we build a production-ready case law search engine from scratch using Python, domain-specific legal embedding models, and a vector database.
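To make "adjacent regions of the embedding space" concrete, here is the similarity computation at the heart of everything that follows. The three-dimensional vectors are purely illustrative stand-ins; real models emit 768 to 3072 dimensions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings"; the values are illustrative only.
contract_voidance = np.array([0.82, 0.41, 0.11])   # "contract voidance for lack of mutual assent"
annulment_mistake = np.array([0.79, 0.47, 0.15])   # "annulment for mistake in consent"
product_liability = np.array([0.10, 0.22, 0.95])   # "product liability for defective goods"

print(cosine(contract_voidance, annulment_mistake))  # close to 1.0: same legal concept
print(cosine(contract_voidance, product_liability))  # much lower: different concept
```

With normalized embeddings (as used throughout this article), cosine similarity reduces to a plain dot product, which is what makes FAISS inner-product search applicable later on.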
## What You Will Learn
- Architecture of a semantic search engine for case law
- Legal-domain embedding models (legal-BERT, ModernBERT, Voyage-law)
- Efficient indexing with FAISS and Pinecone
- Hybrid search: BM25 + vector similarity for maximum precision
- Cross-encoder re-ranking for final result refinement
- FastAPI REST service for LegalTech application integration
## System Architecture
A modern case law search engine consists of three main layers:
- **Ingestion Pipeline**: downloads, normalizes, and processes rulings from official sources (EUR-Lex, ECLI API, CourtListener). Produces chunked documents ready for embedding.
- **Indexing Engine**: generates vector embeddings for each ruling chunk and indexes them in a vector store (FAISS for self-hosted, Pinecone for managed).
- **Query Engine**: processes user queries, transforms them into embeddings, executes vector search, applies hybrid re-ranking, and returns results with verifiable citations.
```python
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import date
from enum import Enum


class JurisdictionType(Enum):
    SUPREME_COURT = "supreme_court"
    COURT_OF_APPEALS = "court_of_appeals"
    DISTRICT_COURT = "district_court"
    CONSTITUTIONAL_COURT = "constitutional_court"
    EU_COURT_OF_JUSTICE = "ecj"
    ECHR = "echr"


@dataclass
class CourtDecision:
    """Represents a court ruling indexed in the system."""
    ecli: str                # e.g., ECLI:EU:C:2024:123
    court: JurisdictionType
    date: date
    number: str
    subject_matter: str      # civil, criminal, administrative...
    keywords: List[str]
    headnotes: str           # ratio decidendi / holding
    full_text: str
    citations: List[str]
    cited_by: List[str] = field(default_factory=list)


@dataclass
class ChunkedDecision:
    """Semantically chunked ruling ready for embedding."""
    chunk_id: str
    ecli: str
    chunk_type: str          # "headnote", "facts", "reasoning", "holding"
    content: str
    embedding: Optional[List[float]] = None
    metadata: dict = field(default_factory=dict)
```
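A minimal sketch of the chunking step that produces `ChunkedDecision`-style records. The section header patterns and the word cap are assumptions for illustration; real rulings need court-specific parsing rules.

```python
import re
from typing import List

# Hypothetical section headers; real pipelines need per-court parsing rules.
HEADER_RE = re.compile(r"(?m)^(FACTS|REASONING|HOLDING)\s*$")

def chunk_decision(ecli: str, full_text: str, max_words: int = 350) -> List[dict]:
    """Split a ruling at section headers, then cap each section at max_words per chunk."""
    chunks = []
    parts = HEADER_RE.split(full_text)
    # parts = [preamble, "FACTS", facts_text, "REASONING", reasoning_text, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        words = body.split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "chunk_id": f"{ecli}#{header.lower()}-{i // max_words}",
                "ecli": ecli,
                "chunk_type": header.lower(),
                "content": " ".join(words[i:i + max_words]),
            })
    return chunks

sample = ("FACTS\nThe claimant bought a defective machine.\n"
          "REASONING\nUnder strict liability the seller is responsible.\n"
          "HOLDING\nAppeal dismissed.")
for c in chunk_decision("ECLI:EU:C:2024:123", sample):
    print(c["chunk_id"], c["chunk_type"])
```

Keeping `chunk_type` per chunk pays off later: re-ranking and filtering can privilege "holding" chunks over procedural "facts" sections.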
## Choosing the Embedding Model

The embedding model choice is critical for search quality. General-purpose models like OpenAI's `text-embedding-3-large` perform well, but models pre-trained on legal corpora significantly outperform them on specialized legal retrieval tasks.
| Model | Dimensions | Specialization | NDCG@10 (legal) | Deployment |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | General | 0.71 | API (OpenAI) |
| nlpaueb/legal-bert-base | 768 | Legal (EN) | 0.79 | HuggingFace |
| Free-Law-Project/modernbert | 768 | Case Law (EN) | 0.83 | HuggingFace |
| Voyage-law-2 | 1024 | Legal (multilingual) | 0.86 | API (Voyage AI) |
```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List


class LegalEmbeddingService:
    """
    Embedding service specialized for legal texts.
    Supports local HuggingFace models and remote APIs.
    """

    def __init__(self, model_name: str = "nlpaueb/legal-bert-base-uncased"):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = self.model.get_sentence_embedding_dimension()

    def encode_texts(
        self,
        texts: List[str],
        batch_size: int = 32,
        normalize: bool = True,
    ) -> np.ndarray:
        """Generate normalized embeddings for cosine similarity via dot product."""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            normalize_embeddings=normalize,
            show_progress_bar=len(texts) > 100,
            convert_to_numpy=True,
        )

    def encode_query(self, query: str) -> np.ndarray:
        """Encode user query with model-specific prefixes where required."""
        if "e5" in self.model_name.lower():
            query = f"query: {query}"
        elif "instructor" in self.model_name.lower():
            query = f"Represent the legal question for retrieval: {query}"
        return self.model.encode([query], normalize_embeddings=True, convert_to_numpy=True)[0]
```
## Indexing with FAISS

FAISS (Facebook AI Similarity Search) is the reference library for high-performance vector search on large datasets. For a collection of 10 million rulings, an IVF (Inverted File) index with Product Quantization (PQ) keeps response times below 100 ms on commodity CPU hardware.
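The back-of-envelope arithmetic behind that choice: with 8 PQ segments at 8 bits each, every vector compresses to 8 bytes of codes.

```python
n_vectors = 10_000_000
dim = 768

flat_bytes = n_vectors * dim * 4   # float32: 4 bytes per component, no compression
pq_bytes = n_vectors * 8           # IVF+PQ: 8 segments x 8 bits = 8 bytes per vector

print(f"Flat float32 index: {flat_bytes / 1e9:.1f} GB")
print(f"IVF+PQ codes:       {pq_bytes / 1e9:.2f} GB (plus centroids and overhead)")
```

Roughly 30 GB shrinks to well under 1 GB, at the cost of approximate (rather than exact) distances, which the re-ranking stage later compensates for.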
```python
import faiss
import numpy as np
from typing import List, Optional


class FAISSCaseLawIndex:
    """
    FAISS index optimized for case law retrieval.
    Supports flat (small datasets) and IVF+PQ (millions of rulings).
    """

    def __init__(self, embedding_dim: int, index_type: str = "ivf"):
        self.embedding_dim = embedding_dim
        self.index_type = index_type
        self.index = None
        self.id_to_metadata = {}  # FAISS position -> chunk metadata

    def build_index(
        self,
        embeddings: np.ndarray,
        metadata: Optional[List[dict]] = None,
        num_clusters: int = 1024,
    ):
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        if self.index_type == "flat":
            self.index = faiss.IndexFlatIP(self.embedding_dim)
        elif self.index_type == "ivf":
            quantizer = faiss.IndexFlatIP(self.embedding_dim)
            # embedding_dim must be divisible by the number of PQ segments
            pq_segments = min(self.embedding_dim, 8)
            self.index = faiss.IndexIVFPQ(
                quantizer, self.embedding_dim,
                num_clusters, pq_segments, 8  # 8 bits per PQ code
            )
            self.index.train(embeddings)
            self.index.nprobe = 64  # clusters probed at query time
        self.index.add(embeddings)
        if metadata:
            self.id_to_metadata = dict(enumerate(metadata))
        print(f"Index built: {self.index.ntotal} vectors")

    def search(
        self,
        query_embedding: np.ndarray,
        k: int = 20,
        score_threshold: float = 0.6,
    ) -> List[dict]:
        query = query_embedding.reshape(1, -1).astype(np.float32)
        scores, indices = self.index.search(query, k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx != -1 and score >= score_threshold:
                results.append({**self.id_to_metadata.get(int(idx), {}), 'score': float(score)})
        return results
```
## Hybrid Search: BM25 + Vector Similarity
Pure semantic search excels at finding related concepts but can miss exact matches for precise statutory references (e.g., "42 U.S.C. § 1983", "GDPR Article 17"). BM25 (keyword-based) is excellent for exact matches but blind to semantics. The hybrid approach combines both using Reciprocal Rank Fusion (RRF).
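A toy run of the RRF formula with made-up chunk IDs shows the key property: a document ranked reasonably well by both retrievers beats a document ranked first by only one.

```python
# Two ranked lists over made-up chunk IDs; rank 0 is the best hit.
vector_ranked = ["c3", "c1", "c7"]   # semantic ranking
bm25_ranked = ["c1", "c9", "c3"]     # keyword ranking

k, w_vec, w_bm25 = 60, 0.6, 0.4
scores = {}
for rank, doc in enumerate(vector_ranked):
    scores[doc] = scores.get(doc, 0.0) + w_vec / (k + rank + 1)
for rank, doc in enumerate(bm25_ranked):
    scores[doc] = scores.get(doc, 0.0) + w_bm25 / (k + rank + 1)

fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # c1 and c3 appear in both lists and rise to the top
```

The constant `k = 60` is the conventional RRF smoothing value; it keeps a single first-place rank from dominating the fused score.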
```python
from rank_bm25 import BM25Okapi
import re
import numpy as np
from typing import List, Tuple


class HybridCaseLawSearch:
    """Hybrid BM25 + vector similarity with Reciprocal Rank Fusion."""

    def __init__(self, embedding_service, faiss_index, corpus: List[dict]):
        self.embedding_service = embedding_service
        self.faiss_index = faiss_index
        self.corpus = corpus
        tokenized_corpus = [self._tokenize_legal(doc['content']) for doc in corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)

    def _tokenize_legal(self, text: str) -> List[str]:
        # Preserve statutory references as single tokens ("§ 1983" -> "section_1983")
        text = re.sub(r'§\s*(\d+)', r'section_\1', text)
        text = re.sub(r'art\.\s*(\d+)', r'art_\1', text, flags=re.IGNORECASE)
        tokens = re.findall(r'\b[a-zA-Z_][a-zA-Z0-9_]*\b', text.lower())
        stopwords = {'the', 'a', 'an', 'in', 'of', 'to', 'and', 'or', 'for', 'is', 'are'}
        return [t for t in tokens if t not in stopwords and len(t) > 2]

    def _reciprocal_rank_fusion(
        self,
        vector_results: List[dict],
        bm25_results: List[Tuple[int, float]],
        k: int = 60,
        vector_weight: float = 0.6,
        bm25_weight: float = 0.4,
    ) -> List[dict]:
        rrf_scores = {}
        for rank, result in enumerate(vector_results):
            doc_id = result['chunk_id']
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = {'score': 0, 'data': result}
            rrf_scores[doc_id]['score'] += vector_weight / (k + rank + 1)
        for rank, (doc_idx, _) in enumerate(bm25_results):
            doc_id = self.corpus[doc_idx]['chunk_id']
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = {'score': 0, 'data': self.corpus[doc_idx]}
            rrf_scores[doc_id]['score'] += bm25_weight / (k + rank + 1)
        sorted_results = sorted(rrf_scores.values(), key=lambda x: x['score'], reverse=True)
        return [{**r['data'], 'rrf_score': r['score']} for r in sorted_results]

    def search(self, query: str, top_k: int = 10) -> List[dict]:
        query_embedding = self.embedding_service.encode_query(query)
        vector_results = self.faiss_index.search(query_embedding, k=50)
        tokenized_query = self._tokenize_legal(query)
        bm25_scores = self.bm25.get_scores(tokenized_query)
        top_bm25_indices = np.argsort(bm25_scores)[::-1][:50]
        bm25_results = [(int(idx), bm25_scores[idx]) for idx in top_bm25_indices]
        fused = self._reciprocal_rank_fusion(vector_results, bm25_results)
        return fused[:top_k]
```
## Cross-Encoder Re-Ranking
After hybrid search, a cross-encoder re-ranking step further improves precision. Cross-encoders process the (query, document) pair jointly, producing a significantly more accurate relevance score than bi-encoders — but at higher computational cost, which is why they are only applied to the top-K candidates from the initial search.
```python
from sentence_transformers import CrossEncoder
from typing import List


class LegalReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.model = CrossEncoder(model_name, max_length=512)

    def rerank(self, query: str, candidates: List[dict], top_k: int = 5) -> List[dict]:
        if not candidates:
            return []
        pairs = [(query, c['content'][:400]) for c in candidates]
        scores = self.model.predict(pairs)
        for candidate, score in zip(candidates, scores):
            candidate['rerank_score'] = float(score)
        reranked = sorted(candidates, key=lambda x: x['rerank_score'], reverse=True)
        return reranked[:top_k]


class CaseLawSearchEngine:
    def __init__(self, hybrid_searcher, reranker):
        self.hybrid_searcher = hybrid_searcher
        self.reranker = reranker

    def search(self, query: str, top_k: int = 5) -> List[dict]:
        candidates = self.hybrid_searcher.search(query, top_k=20)
        results = self.reranker.rerank(query, candidates, top_k=top_k)
        return [{
            'ecli': r.get('ecli', 'N/A'),
            'court': r.get('court', 'N/A'),
            'date': r.get('date', 'N/A'),
            'headnote': r.get('headnote', ''),
            'excerpt': r['content'][:300] + "...",
            'relevance_score': r['rerank_score'],
        } for r in results]
```
## ECLI and European Standards

The European Case Law Identifier (ECLI) is the EU standard for uniquely identifying court decisions across member states. An ECLI takes the form `ECLI:{country}:{court}:{year}:{number}`. Always include the ECLI in your index metadata to ensure citations are verifiable and machine-readable.
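The format lends itself to a simple validator at ingestion time. This is a sketch: the field lengths and character classes below approximate the ECLI specification but have not been checked against every national profile.

```python
import re

ECLI_RE = re.compile(
    r"^ECLI:"
    r"(?P<country>[A-Z]{2}):"        # ISO country code, or EU
    r"(?P<court>[A-Z0-9]{1,7}):"     # court code assigned nationally
    r"(?P<year>\d{4}):"
    r"(?P<ordinal>[A-Za-z0-9.]{1,25})$"
)

def parse_ecli(ecli: str) -> dict:
    """Return the ECLI components, or raise on malformed identifiers."""
    m = ECLI_RE.match(ecli)
    if not m:
        raise ValueError(f"invalid ECLI: {ecli}")
    return m.groupdict()

print(parse_ecli("ECLI:EU:C:2024:123"))
```

Rejecting malformed identifiers at ingestion, rather than at citation time, keeps every result returned by the engine machine-resolvable.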
## Official Indexing Sources
- EUR-Lex: EU rulings with SPARQL API and bulk download
- CourtListener (Free Law Project): US case law, open source
- CURIA (ECJ): curia.europa.eu API with XML/JSON output
- ECHR: hudoc.echr.coe.int with open data exports
- Justia: US federal and state case law with free API
## Anti-Patterns to Avoid
- Embedding full ruling text without chunking: 10,000-word rulings produce "diluted" embeddings. Always chunk by section (facts, reasoning, holding).
- Score threshold too low: returning all results above 0.3 creates too much noise. Start with 0.65 and calibrate with user feedback.
- Ignoring decision date: a ruling on repealed legislation is irrelevant for current practice. Always apply temporal filters.
- Missing E5 model prefixes: E5 models require different prefixes for "query" vs "passage". Ignoring this degrades performance by ~15%.
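The temporal-filter advice can start as a cheap post-filter over result metadata. The field names mirror the CourtDecision dataclass; the cutoff date is only an example.

```python
from datetime import date
from typing import List

def filter_by_date(results: List[dict], not_before: date) -> List[dict]:
    """Drop rulings older than a cutoff, e.g. the entry into force of the current law."""
    return [r for r in results if r.get("date") and r["date"] >= not_before]

results = [
    {"ecli": "ECLI:EU:C:1995:1", "date": date(1995, 3, 1)},
    {"ecli": "ECLI:EU:C:2021:9", "date": date(2021, 6, 15)},
]
current = filter_by_date(results, not_before=date(2018, 5, 25))  # e.g. GDPR application date
print([r["ecli"] for r in current])
```

At scale, prefer pre-filtering in the vector store (FAISS ID selectors or Pinecone metadata filters) so that stale rulings never consume slots in the top-K candidate set.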
## Conclusions
A case law search engine built on vector embeddings outperforms traditional full-text search on every metric that matters for legal professionals: recall of relevant precedents, robustness to terminological variation, and ability to find conceptual analogies across different factual patterns.
The complete pipeline — specialized embeddings + FAISS + BM25 hybrid + cross-encoder — achieves production performance on datasets of millions of rulings with latencies under 200ms. The code in this article is the ideal starting point for the retrieval core of any modern LegalTech platform.
## LegalTech & AI Series
- NLP for Contract Analysis: From OCR to Understanding
- e-Discovery Platform Architecture
- Compliance Automation with Dynamic Rules Engines
- Smart Contracts for Legal Agreements: Solidity and Vyper
- Legal Document Summarization with Generative AI
- Case Law Search Engine: Vector Embeddings (this article)
- Digital Signature and Document Authentication at Scale
- Data Privacy and GDPR Compliance Systems
- Building a Legal AI Assistant (Legal Copilot)
- LegalTech Data Integration Patterns