Open Data API Design: Publishing and Consuming Public Data
How to design and implement APIs for public administration open data: DCAT-AP standards, CKAN, REST APIs with pagination and caching, open data portals, and best practices for dataset quality.
The Context: Open Data as Public Infrastructure
Open data in Italian public administration is not just a transparency principle: it is a legal obligation (Legislative Decree 36/2006, amended by D.Lgs. 200/2021 transposing EU Open Data Directive 2019/1024), an economic innovation driver, and a key component of the Piattaforma Digitale Nazionale Dati (PDND) ecosystem.
The national portal dati.gov.it, managed by AgID, collects and indexes datasets from over 5,000 Italian public administrations. Every published dataset — from public transport schedules to election results, from air quality readings to municipal council resolutions — must comply with precise metadata, quality, and accessibility standards.
For a developer, the open government data challenge is twofold: publishing data in a way that makes it truly reusable (a raw CSV file on an institutional website is not enough), and consuming heterogeneous data from multiple sources with varying standards and quality. This article addresses both dimensions.
What You Will Learn
- DCAT-AP_IT standard: catalog structure, dataset, distribution and mandatory metadata
- CKAN: configuration, Italian extensions, and REST API for dataset management
- REST API design for open data: pagination, filtering, versioning, and caching
- Distribution formats: CSV, JSON-LD, RDF, GeoJSON, Parquet
- Data quality: validation, profiling, and DCAT quality metrics
- Consuming open data: robust clients, error handling, and normalization
- PDND and interoperability: publishing and consuming PA APIs through the national platform
DCAT-AP_IT Standard: Metadata for Public Data
The Italian profile of DCAT-AP (Data Catalog Vocabulary Application Profile), known as DCAT-AP_IT, is the reference standard for publishing dataset metadata in Italian PAs. Defined by AgID, it builds on W3C DCAT specifications with Italian-specific extensions (geographies, licenses, EUROVOC themes, etc.).
The core DCAT-AP_IT structure has three main entities: Catalog (the PA's data catalog), Dataset (the informational resource with title, description, themes, update frequency, license), and Distribution (the specific format: CSV, JSON, RDF, shapefile).
```python
# DCAT-AP_IT Metadata Generator in Python
# Produces RDF/Turtle compatible with dati.gov.it
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import DCAT, DCTERMS as DCT, FOAF, RDF, XSD
from datetime import datetime, timezone

DCATAPIT = Namespace("http://dati.gov.it/onto/dcatapit#")


def create_dcat_ap_it_metadata(
    catalog_uri: str,
    dataset_id: str,
    title_en: str,
    description_en: str,
    publisher_name: str,
    publisher_uri: str,
    themes: list,
    license_uri: str,
    distributions: list,
) -> str:
    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)
    g.bind("foaf", FOAF)

    dataset_uri = URIRef(f"{catalog_uri}/dataset/{dataset_id}")
    g.add((dataset_uri, RDF.type, DCAT.Dataset))
    g.add((dataset_uri, RDF.type, DCATAPIT.Dataset))

    # Mandatory metadata
    g.add((dataset_uri, DCT.title, Literal(title_en, lang="en")))
    g.add((dataset_uri, DCT.description, Literal(description_en, lang="en")))
    g.add((dataset_uri, DCT.modified, Literal(
        datetime.now(timezone.utc).date().isoformat(), datatype=XSD.date
    )))
    g.add((dataset_uri, DCT.license, URIRef(license_uri)))

    # Publisher (mandatory in DCAT-AP_IT)
    publisher = URIRef(publisher_uri)
    g.add((publisher, RDF.type, DCATAPIT.Agent))
    g.add((publisher, FOAF.name, Literal(publisher_name, lang="en")))
    g.add((dataset_uri, DCT.publisher, publisher))

    # EU EUROVOC themes
    for theme_uri in themes:
        g.add((dataset_uri, DCAT.theme, URIRef(theme_uri)))

    # One Distribution per published format
    for i, dist in enumerate(distributions):
        dist_uri = URIRef(f"{dataset_uri}/distribution/{i}")
        g.add((dist_uri, RDF.type, DCAT.Distribution))
        g.add((dist_uri, DCAT.accessURL, URIRef(dist["url"])))
        g.add((dist_uri, DCT.format, URIRef(
            f"http://publications.europa.eu/resource/authority/file-type/{dist['format']}"
        )))
        g.add((dataset_uri, DCAT.distribution, dist_uri))

    return g.serialize(format="turtle")


# Example usage
metadata_turtle = create_dcat_ap_it_metadata(
    catalog_uri="https://opendata.example-pa.gov.it",
    dataset_id="air-quality-2024",
    title_en="Air Quality Measurements 2024",
    description_en="Hourly air quality sensor readings for the city area",
    publisher_name="Example Municipality",
    publisher_uri="http://spcdata.digitpa.gov.it/browse/page/Amministrazione/example",
    themes=["http://publications.europa.eu/resource/authority/data-theme/ENVI"],
    license_uri="https://creativecommons.org/licenses/by/4.0/",
    distributions=[
        {"url": "https://opendata.example.gov.it/air-2024.csv", "format": "CSV"},
        {"url": "https://opendata.example.gov.it/air-2024.json", "format": "JSON"},
    ]
)
```
CKAN: The Platform Behind Italian PA Open Data Portals
CKAN (Comprehensive Knowledge Archive Network) is the most widely used open-source platform for government open data portals. dati.gov.it itself runs on CKAN, and the same architecture is used by dozens of Italian municipalities, regions, and ministries.
The ckanext-dcatapit extension, developed by GeoSolutions and the Autonomous Province of Trento, adds full support for the DCAT-AP_IT profile, allowing CKAN to expose and consume metadata conforming to Italian and European standards.
```python
# Using dati.gov.it CKAN APIs
# CKAN APIs are REST with standardized JSON responses
import httpx
import asyncio
from typing import Optional


class CKANClient:
    def __init__(self, base_url: str, api_key: Optional[str] = None):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Content-Type": "application/json"}
        if api_key:
            self.headers["Authorization"] = api_key

    async def search_datasets(
        self,
        query: str,
        filters: Optional[dict] = None,
        rows: int = 20,
        start: int = 0,
    ) -> dict:
        """Search datasets using Solr full-text search with optional faceted filtering."""
        params = {"q": query, "rows": rows, "start": start}
        if filters:
            fq_parts = [f"{k}:{v}" for k, v in filters.items()]
            params["fq"] = " AND ".join(fq_parts)
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"{self.base_url}/api/3/action/package_search",
                params=params,
                headers=self.headers,
                timeout=30.0,
            )
            response.raise_for_status()
            result = response.json()
        if not result.get("success"):
            raise ValueError(f"CKAN API error: {result.get('error')}")
        return {
            "total": result["result"]["count"],
            "datasets": result["result"]["results"],
            "page": start // rows + 1,
        }

    async def get_dataset(self, dataset_id: str) -> dict:
        """Retrieve a specific dataset with all its distributions."""
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"{self.base_url}/api/3/action/package_show",
                params={"id": dataset_id},
                headers=self.headers,
                timeout=30.0,
            )
            response.raise_for_status()
            result = response.json()
        if not result.get("success"):
            raise ValueError(f"Dataset not found: {dataset_id}")
        return result["result"]


# Practical usage
async def main():
    client = CKANClient("https://www.dati.gov.it")
    results = await client.search_datasets(
        query="air quality",
        filters={"res_format": "CSV", "groups": "ambiente"},
        rows=10,
    )
    print(f"Found {results['total']} datasets")
    for ds in results["datasets"]:
        print(f"- {ds['title']} ({ds['num_resources']} resources)")

asyncio.run(main())
```
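CKAN caps results per request, so harvesting a full result set means advancing the Solr `start` offset until `total` is exhausted. A sketch of an async generator that does this for any coroutine with the `search_datasets` signature above (the page size of 100 is an arbitrary choice):

```python
# Async generator that exhausts a paginated CKAN search by advancing
# the `start` offset until all `total` results have been yielded.
# `search` is any coroutine matching CKANClient.search_datasets.
from typing import AsyncIterator, Awaitable, Callable

async def iter_all_datasets(
    search: Callable[..., Awaitable[dict]],
    query: str,
    rows: int = 100,
) -> AsyncIterator[dict]:
    start = 0
    while True:
        page = await search(query=query, rows=rows, start=start)
        for dataset in page["datasets"]:
            yield dataset
        start += rows
        if start >= page["total"]:
            break
```

Usage with the client above: `[ds async for ds in iter_all_datasets(client.search_datasets, "air quality")]`. Because it re-queries between pages, results can shift if the index changes mid-harvest; for strict snapshots, CKAN's `package_list` plus per-dataset fetches is the safer route.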
REST API Design for Open Data
When a PA publishes its data via REST API (beyond CKAN), it must follow design principles ensuring usability, stability, and scalability. AgID's Technical Interoperability Guidelines for PA define the REST patterns to follow for public services.
```python
# FastAPI: REST API for open data with AgID-compliant pagination
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse
from typing import Optional
from datetime import date
import math

app = FastAPI(
    title="Open Data API",
    description="REST API for open data publication - AgID Interoperability Guidelines compliant",
    version="1.0.0",
)


@app.get("/api/v1/datasets/air-quality")
async def get_air_quality(
    # AgID standard pagination: page + page_size
    page: int = Query(default=1, ge=1),
    page_size: int = Query(default=100, ge=1, le=1000),
    # Filtering
    station_id: Optional[str] = None,
    pollutant: Optional[str] = None,
    date_from: Optional[date] = None,
    date_to: Optional[date] = None,
    # Sorting
    sort_by: str = Query(default="timestamp"),
    sort_order: str = Query(default="desc", regex="^(asc|desc)$"),
    # Output format
    format: str = Query(default="json", regex="^(json|csv|geojson)$"),
):
    offset = (page - 1) * page_size
    # air_quality_service: the data-access layer, not shown here
    records, total_count = await air_quality_service.get_records(
        station_id=station_id, pollutant=pollutant,
        date_from=date_from, date_to=date_to,
        sort_by=sort_by, sort_order=sort_order,
        limit=page_size, offset=offset,
    )
    total_pages = math.ceil(total_count / page_size)
    return JSONResponse(content={
        "data": records,
        "meta": {
            "total_count": total_count,
            "page": page,
            "page_size": page_size,
            "total_pages": total_pages,
            "has_next": page < total_pages,
            "has_prev": page > 1,
        },
        "links": {
            "self": f"/api/v1/datasets/air-quality?page={page}",
            "first": "/api/v1/datasets/air-quality?page=1",
            "next": f"/api/v1/datasets/air-quality?page={page+1}" if page < total_pages else None,
            "prev": f"/api/v1/datasets/air-quality?page={page-1}" if page > 1 else None,
        },
        "dataset": {
            "id": "air-quality-2024",
            "title": "Air Quality Measurements",
            "license": "CC BY 4.0",
            "publisher": "Example Municipality",
        },
    })
```
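Pagination alone does not make a heavily polled open data endpoint scale; HTTP caching does much of the remaining work. A common pattern is ETag-based conditional requests: hash the payload, let clients send `If-None-Match`, and answer `304 Not Modified` when nothing has changed. A framework-agnostic sketch (in FastAPI you would read the header from `request.headers` and return an empty 304 response; the one-hour `max-age` is an illustrative value to tune against the dataset's declared update frequency):

```python
# ETag helpers for open data responses: derive a weak ETag from a
# stable JSON serialization of the payload and decide whether the
# client's cached copy is still current (i.e. a 304 can be sent).
import hashlib
import json
from typing import Optional

def compute_etag(payload: dict) -> str:
    """Weak ETag from a deterministic JSON dump of the payload."""
    body = json.dumps(payload, sort_keys=True, default=str).encode()
    return 'W/"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def not_modified(payload: dict, if_none_match: Optional[str]) -> bool:
    """True when the If-None-Match header matches the current ETag."""
    return if_none_match is not None and if_none_match == compute_etag(payload)

# Headers to attach alongside the ETag on a full 200 response
cache_headers = {
    "Cache-Control": "public, max-age=3600",  # illustrative: tune per dataset
}
```

For datasets updated on a known schedule, pairing `ETag` with `Last-Modified` and a CDN in front of the API typically cuts origin traffic dramatically, since most consumers poll far more often than the data changes.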
Data Quality: Validation and Profiling
A technically accessible dataset with poor data quality is not truly "open" in any useful sense. AgID's 2024-2026 Three-Year ICT Plan defines specific quality indicators for PA datasets, drawing on ISO/IEC 25012 quality dimensions:
| Quality Dimension | Definition | Practical Metric | AgID Minimum Threshold |
|---|---|---|---|
| Completeness | Absence of missing values | % NULL fields over total | < 5% NULL in mandatory fields |
| Accuracy | Correspondence with reality | Validation against authoritative sources | Domain-dependent |
| Consistency | Internal dataset coherence | Referential constraints, range checks | 0% constraint violations |
| Timeliness | Dataset updated per declared frequency | Days since last update vs declared frequency | Updated within 2x declared period |
| Conformity | Compliance with standards (DCAT-AP_IT) | SHACL metadata validation | 100% mandatory metadata present |
```python
# Data Quality Profiler for PA datasets
import pandas as pd
from dataclasses import dataclass, field
from typing import List, Dict, Any


@dataclass
class QualityReport:
    dataset_id: str
    total_rows: int
    total_columns: int
    quality_score: float  # 0-100
    issues: List[dict] = field(default_factory=list)
    column_stats: Dict[str, Any] = field(default_factory=dict)


class DataQualityProfiler:
    """Profiler for open data quality measuring ISO/IEC 25012 dimensions."""

    REQUIRED_FIELDS = ["id", "timestamp", "value", "station_code"]

    def profile(self, df: pd.DataFrame, dataset_id: str) -> QualityReport:
        report = QualityReport(
            dataset_id=dataset_id,
            total_rows=len(df),
            total_columns=len(df.columns),
            quality_score=100.0,
        )
        # 1. Completeness: mandatory fields
        for field_name in self.REQUIRED_FIELDS:
            if field_name not in df.columns:
                report.issues.append({
                    "severity": "critical",
                    "dimension": "completeness",
                    "field": field_name,
                    "message": f"Mandatory field '{field_name}' missing",
                })
                report.quality_score -= 20
            else:
                null_pct = df[field_name].isna().sum() / len(df) * 100
                if null_pct > 5:
                    report.issues.append({
                        "severity": "warning",
                        "dimension": "completeness",
                        "field": field_name,
                        "message": f"{null_pct:.1f}% NULL values in '{field_name}'",
                    })
                    report.quality_score -= min(10, null_pct)
        # 2. Consistency: duplicates
        duplicate_count = df.duplicated().sum()
        if duplicate_count > 0:
            dup_pct = duplicate_count / len(df) * 100
            report.issues.append({
                "severity": "warning",
                "dimension": "consistency",
                "message": f"{duplicate_count} duplicate rows ({dup_pct:.1f}%)",
            })
            report.quality_score -= min(15, dup_pct * 2)
        report.quality_score = max(0.0, report.quality_score)
        return report
```
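The profiler above covers completeness and consistency; the timeliness dimension from the table can be checked separately by comparing the newest record against the declared update frequency. A minimal sketch, assuming the frequency is declared in days (the 2x factor mirrors the AgID threshold in the table):

```python
# Timeliness check: flag a dataset as stale when its newest record is
# older than twice the declared update period (the table's threshold).
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_timely(last_updated: datetime, declared_period_days: int,
              now: Optional[datetime] = None) -> bool:
    """True if the dataset is within 2x its declared update period."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= timedelta(days=2 * declared_period_days)
```

In practice `last_updated` would come from the maximum `timestamp` column in the data (or `dct:modified` in the metadata), and `declared_period_days` from the dataset's declared `dct:accrualPeriodicity`.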
PDND: Italy's National Data Interoperability Platform
The Piattaforma Digitale Nazionale Dati (PDND), managed by PagoPA S.p.A., is the national interoperability infrastructure that allows PAs to share data securely, under controlled access, and with full audit trails. Unlike pure open data (publicly accessible to everyone), PDND also manages sensitive data whose sharing is authorized by inter-institutional agreements.
For developers, PDND integration means:
- Joining PDND as a data consumer or provider via the interop.pagopa.it portal
- Publishing APIs following AgID interoperability guidelines (OpenAPI 3.1 mandatory, PDND e-service descriptor)
- Authenticating with JWT tokens signed with X.509 certificates for every data request
- Respecting consumption vouchers: digital agreements between entities authorizing access to specific APIs
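The voucher flow is essentially an OAuth 2.0 client_credentials grant whose client assertion is a JWT signed with the key pair registered on the interop portal. The claim set below is an illustrative sketch: the `purposeId` claim follows the PDND documentation, but audiences, TTLs, and endpoint URLs are placeholders to be checked against the official docs:

```python
# Sketch of the claim set for a PDND voucher request. The signed JWT
# is then posted as a client assertion in a client_credentials grant.
# Values here are illustrative, not authoritative.
import time
import uuid

def build_voucher_claims(client_id: str, audience: str, purpose_id: str,
                         ttl_seconds: int = 600) -> dict:
    """Claims for the client-assertion JWT sent to the PDND auth server."""
    now = int(time.time())
    return {
        "iss": client_id,
        "sub": client_id,
        "aud": audience,          # PDND authorization server audience
        "purposeId": purpose_id,  # the approved purpose for the e-service
        "jti": str(uuid.uuid4()), # unique token id, required for replay protection
        "iat": now,
        "exp": now + ttl_seconds,
    }

# Signing (e.g. with PyJWT and the registered private key) would look like:
#   jwt.encode(claims, private_key, algorithm="RS256", headers={"kid": kid})
```

The returned voucher (an access token) is then sent as a Bearer token on every call to the provider's e-service, which validates it against PDND before serving data.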
Conclusions and Next Steps
Quality government open data requires much more than publishing a CSV file on an institutional website: it requires standard metadata (DCAT-AP_IT), well-designed REST APIs with pagination and caching, continuous data quality validation, catalog platforms like CKAN, and integration with national infrastructures like PDND.
The next article in this series covers accessible user interfaces for PA following WCAG 2.1 AA: an equally critical regulatory requirement that, together with open data, makes public digital services truly inclusive.
Useful Resources
- dati.gov.it - Italian national open data portal
- ckanext-dcatapit - CKAN extension for DCAT-AP_IT
- PDND Documentation - Interoperability platform
- DCAT-US - US federal open data standard (for comparison)
Related Articles in This Series
- GovTech #00: Digital Public Infrastructure - building blocks and architecture
- GovTech #04: GDPR-by-Design - architectural patterns for public services
- GovTech #06: Government API Integration - SPID, CIE and pagoPA
- GovTech #07: GovStack Building Blocks - modules for digital government