Data Governance and Data Quality: Foundations for Trustworthy AI
An estimated 72% of enterprise AI projects fail before reaching production. Not for lack of sophisticated algorithms, not for inadequate architectures, not for missing talent. They fail because of insufficient data quality, absent governance frameworks, and pipelines that silently produce incorrect results before anyone notices.
In a context where the EU AI Act mandates legally binding requirements on training data quality for high-risk AI systems, and where organizations are increasingly investing in data-driven initiatives, building a solid data governance and data quality foundation is no longer optional: it is a prerequisite for competitive advantage and regulatory compliance.
This article guides you through operational frameworks, open-source tools, and practical implementations that enable trustworthy AI systems: from defining data quality dimensions, to implementing automated tests with Great Expectations and dbt, to managing data lineage with OpenMetadata, and ensuring EU AI Act compliance.
What You'll Learn in This Article
- The 6 DAMA data quality dimensions and how to measure them automatically
- Data quality checks implementation with Great Expectations and dbt-expectations
- Data catalog and data lineage with OpenMetadata and Apache Atlas
- Data governance framework: roles, processes, and organizational structure
- EU AI Act Article 10 requirements for training data: what to do before August 2026
- Data observability with Soda Core and Monte Carlo for production pipelines
- Bias detection in ML datasets and mitigation strategies
- Case study: data governance framework for an Italian manufacturing SMB
The Data Warehouse, AI and Digital Transformation Series
| # | Article | Focus |
|---|---|---|
| 1 | Data Warehouse Evolution | From SQL Server to Data Lakehouse |
| 2 | Data Mesh Architecture | Decentralized Data Ownership |
| 3 | Modern ETL vs ELT | dbt, Airbyte and Fivetran |
| 4 | Pipeline Orchestration | Airflow, Dagster and Prefect |
| 5 | AI in Manufacturing | Predictive Maintenance and Digital Twins |
| 6 | AI in Finance | Fraud Detection and Credit Scoring |
| 7 | AI in Retail | Demand Forecasting and Recommendations |
| 8 | AI in Healthcare | Diagnostics and Drug Discovery |
| 9 | AI in Logistics | Route Optimization and Warehouse Automation |
| 10 | LLMs in Enterprise | RAG and AI Guardrails |
| 11 | Enterprise Vector Databases | pgvector, Pinecone and Weaviate |
| 12 | MLOps for Business | AI Models in Production with MLflow |
| 13 | You are here - Data Governance | Data Quality for Trustworthy AI |
| 14 | Data-Driven Roadmap for SMBs | AI and DWH Adoption |
The Data Quality Problem in the AI Era
"Garbage in, garbage out" is a principle data engineers have known for decades, but with generative AI and machine learning models in production, the consequences of poor data quality have become exponentially more severe. A fraud detection model trained on imbalanced data generates false positives that block legitimate transactions. A recommendation system trained on biased data amplifies discrimination. A demand forecasting model trained on inconsistent data generates incorrect orders with direct impact on cash flow.
The Cost of Poor Data Quality - 2025 Data
| Indicator | Value | Source |
|---|---|---|
| Average cost per company (poor data quality) | $12.9M/year | Gartner 2024 |
| AI projects failing due to data quality | 72% | McKinsey 2025 |
| Data scientist time on data cleaning | 45-60% | Multiple surveys |
| Companies with formal data quality program | ~20% | Industry reports 2025 |
| ML error reduction with data governance | up to 35% | IBM Institute 2025 |
A fundamental distinction that many organizations miss is the difference between data quality for analytics and data quality for AI. In traditional reporting, an anomalous data point produces a wrong number on a dashboard: someone notices it, corrects it, and moves on. In machine learning, an anomalous data point in the training set silently contaminates the model, which then misbehaves in production for months before the problem is identified. The latency between a problem and its manifestation is orders of magnitude longer.
The 6 DAMA Data Quality Dimensions
The DAMA-DMBOK (Data Management Body of Knowledge) framework defines 6 fundamental data quality dimensions, which in 2025 remain the standard reference for any enterprise data quality program:
DAMA Dimensions for AI - Extended Framework
| Dimension | Definition | AI-Specific Metric | Critical Threshold |
|---|---|---|---|
| Accuracy | Data correctly represents real-world entities | % correct labels in training set | >99% for critical classification |
| Completeness | All necessary data is present | % non-null values for critical features | >95% for input features |
| Consistency | Data is uniform across different systems | % concordant records between sources | >98% for shared features |
| Timeliness | Data is up-to-date and accessible when needed | Average lag: production data vs. training | <24h for real-time models |
| Validity | Data conforms to defined formats and constraints | % schema and range violations | <0.1% violations |
| Uniqueness | No unintentional duplicates | % duplicate records in training set | <0.5% duplicates |
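Before adopting a dedicated framework, several of these dimensions can be measured directly with pandas. A minimal sketch (column names and the validity rule are illustrative, not from the DAMA standard):

```python
import pandas as pd

def dama_quality_profile(df: pd.DataFrame, key_column: str) -> dict:
    """Compute simple metrics for three of the six DAMA dimensions."""
    return {
        # Completeness: share of non-null cells across the whole frame
        "completeness": float(df.notna().mean().mean()),
        # Uniqueness: share of rows not duplicated on the business key
        "uniqueness": float(1.0 - df[key_column].duplicated().mean()),
        # Validity (example rule): ages inside the 18-120 range
        "validity_age": float(df["age"].between(18, 120).mean()),
        # Accuracy, consistency, and timeliness require external references,
        # cross-system comparison, and timestamps respectively.
    }

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, 17, 51, None],
})
print(dama_quality_profile(df, key_column="customer_id"))
```

Thresholds from the table above can then be applied to the resulting metrics to decide pass/fail per dimension.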
Practical Implementation: Data Quality with Great Expectations
Great Expectations is one of the most widely adopted open-source Python frameworks for data quality testing. The approach mirrors unit tests for code: define "expectations" about the data, run them automatically in the pipeline, and auto-generate documentation. Native integrations with Airflow, Prefect, and dbt make it a natural component of any modern data stack.
# data_quality_pipeline.py
# Complete data quality framework with Great Expectations
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest
import pandas as pd
import numpy as np
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
class DataQualityFramework:
"""
Data quality framework for enterprise ML pipelines.
Implements the 6 DAMA dimensions with Great Expectations.
"""
def __init__(self, datasource_name: str = "ml_training_data"):
self.context = gx.get_context()
self.datasource_name = datasource_name
self.validation_results = {}
def build_ml_expectation_suite(
self,
suite_name: str,
target_column: str,
feature_columns: list[str]
):
"""
Creates an expectation suite for ML datasets.
Covers all 6 DAMA quality dimensions.
"""
suite = self.context.add_expectation_suite(
expectation_suite_name=suite_name
)
# === COMPLETENESS ===
for col in feature_columns:
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_not_be_null",
kwargs={
"column": col,
"mostly": 0.95 # 95% non-null
}
)
)
# Target column must be 100% non-null
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_not_be_null",
kwargs={"column": target_column}
)
)
# === VALIDITY ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_between",
kwargs={
"column": "customer_age",
"min_value": 18,
"max_value": 120,
"mostly": 0.999
}
)
)
# === UNIQUENESS ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_unique",
kwargs={"column": "transaction_id"}
)
)
# === CONSISTENCY: target must have expected values only ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_in_set",
kwargs={
"column": target_column,
"value_set": [0, 1],
}
)
)
# === TIMELINESS: no data older than 90 days ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_between",
kwargs={
"column": "event_timestamp",
"min_value": "2024-11-01",
"max_value": datetime.now().strftime("%Y-%m-%d"),
"parse_strings_as_datetimes": True,
"mostly": 0.99
}
)
)
self.context.save_expectation_suite(suite)
logger.info(f"Suite {suite_name} created with {len(suite.expectations)} expectations")
return suite
    def validate_dataset(
        self,
        df: pd.DataFrame,
        suite_name: str,
        run_name: str | None = None
    ) -> dict:
"""
Validates a DataFrame against the defined suite.
Returns structured results for monitoring.
"""
run_name = run_name or f"run_{datetime.now().isoformat()}"
batch_request = RuntimeBatchRequest(
datasource_name=self.datasource_name,
data_connector_name="runtime_connector",
data_asset_name="training_data",
runtime_parameters={"batch_data": df},
batch_identifiers={"run_id": run_name}
)
checkpoint = self.context.add_or_update_checkpoint(
name="ml_data_checkpoint",
validations=[{
"batch_request": batch_request,
"expectation_suite_name": suite_name
}]
)
results = checkpoint.run(run_name=run_name)
validation_result = results.list_validation_results()[0]
stats = validation_result.statistics
quality_report = {
"run_name": run_name,
"timestamp": datetime.now().isoformat(),
"success": results.success,
"success_rate": stats["success_percent"] / 100,
"evaluated_expectations": stats["evaluated_expectations"],
"failed_checks": [
{
"expectation": r.expectation_config.expectation_type,
"column": r.expectation_config.kwargs.get("column"),
}
for r in validation_result.results if not r.success
]
}
if not quality_report["success"]:
raise ValueError(
f"Data quality FAILED: {len(quality_report['failed_checks'])} checks failed. "
f"Pipeline halted to prevent model contamination."
)
return quality_report
Data Quality in dbt: Declarative Tests in the Transformation Layer
For teams using dbt as their transformation layer, the dbt-expectations package brings Great Expectations-style assertions directly into dbt models, with tests defined in YAML next to the SQL code. This "quality as code" approach ensures every transformation is validated on every run.
# models/schema.yml
# Data quality tests with dbt-expectations
version: 2
models:
- name: ml_features_customer
description: "Feature store for churn prediction model"
columns:
- name: customer_id
tests:
- unique
- not_null
- dbt_expectations.expect_column_values_to_be_of_type:
column_type: VARCHAR
- name: customer_age
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 18
max_value: 120
mostly: 0.999
- name: monthly_charges
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 50
max_value: 200
- name: churn_label
tests:
- not_null
- accepted_values:
values: [0, 1]
          # Verify dataset is not too imbalanced: the mean of a 0/1 label is the churn rate
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 0.01  # At least 1% churn
              max_value: 0.5   # Max 50% churn
# Table-level tests
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 10000
max_value: 10000000
- dbt_expectations.expect_compound_columns_to_be_unique:
column_list: ["customer_id", "snapshot_date"]
Data Catalog and Data Lineage with OpenMetadata
Data lineage - the ability to track the journey of data from its source to the final AI model - has become an indispensable requirement for two converging reasons: compliance with the EU AI Act (which requires documentation of training data provenance) and the practical need for debugging when a model produces unexpected results.
OpenMetadata is one of the most mature open-source platforms for data catalog and lineage in 2025, created by engineers who previously built metadata platforms at Uber and contributed to Apache Hadoop. It supports column-level lineage and native integration with dbt, Airflow, Spark, and the major data warehouses.
# openmetadata_lineage.py
# Automated data lineage registration for ML pipelines
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.data.table import Table
from metadata.generated.schema.type.entityLineage import (
ColumnLineage, EntitiesEdge, LineageDetails
)
from metadata.generated.schema.api.lineage.addLineage import AddLineageRequest
import json
from datetime import datetime
class MLPipelineLineageTracker:
"""
Automated data lineage tracker for ML pipelines.
Registers every transformation step in OpenMetadata.
Required for EU AI Act Article 10 compliance documentation.
"""
def __init__(self, server_url: str, jwt_token: str):
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
OpenMetadataConnection
)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import (
OpenMetadataJWTClientConfig
)
server_config = OpenMetadataConnection(
hostPort=server_url,
authProvider="openmetadata",
securityConfig=OpenMetadataJWTClientConfig(jwtToken=jwt_token)
)
self.metadata = OpenMetadata(server_config)
def register_training_data_lineage(
self,
source_tables: list[str],
feature_store_table: str,
ml_model_name: str,
transformation_description: str,
data_steward: str
):
"""
Registers: raw_data -> feature_store -> ml_model flow.
Critical for AI Act Art. 10 - Data and Data Governance.
"""
source_entities = []
for table_fqn in source_tables:
table = self.metadata.get_by_name(entity=Table, fqn=table_fqn)
if table:
source_entities.append(table)
feature_table = self.metadata.get_by_name(
entity=Table, fqn=feature_store_table
)
for source in source_entities:
lineage_request = AddLineageRequest(
edge=EntitiesEdge(
fromEntity={
"id": str(source.id.__root__),
"type": "table"
},
toEntity={
"id": str(feature_table.id.__root__),
"type": "table"
},
lineageDetails=LineageDetails(
description=(
f"{transformation_description} | "
f"Data Steward: {data_steward} | "
f"Reviewed: {datetime.now().strftime('%Y-%m-%d')}"
)
)
)
)
self.metadata.add_lineage(lineage_request)
# Add AI Act compliance tags
self._tag_for_ai_act_compliance(
feature_store_table, ml_model_name, data_steward
)
def _tag_for_ai_act_compliance(
self, table_fqn: str, model_name: str, steward: str
):
"""
Adds structured AI Act Article 10 compliance tags.
Required documentation for high-risk AI systems.
"""
compliance_metadata = {
"ai_act_article": "10",
"dataset_purpose": "training_and_validation",
"ai_system": model_name,
"data_steward": steward,
"governance_reviewed": "true",
"bias_assessment_completed": "true",
"next_review": "2025-07-01"
}
print(f"AI Act compliance tags registered: {json.dumps(compliance_metadata, indent=2)}")
# Usage example
def setup_churn_model_lineage():
tracker = MLPipelineLineageTracker(
server_url="http://openmetadata.internal:8585/api",
jwt_token="your-jwt-token"
)
tracker.register_training_data_lineage(
source_tables=[
"default.raw_crm.customers",
"default.raw_billing.transactions"
],
feature_store_table="default.feature_store.ml_churn_features_v3",
ml_model_name="churn_prediction_xgboost_v2",
transformation_description="Monthly aggregations, tenure calculation, "
"categorical encoding",
data_steward="Jane Smith, Data Governance Lead"
)
Data Governance Framework: Organizational Structure and Processes
Data governance is not a software tool: it is a system of people, processes, and technology that ensures data is managed as a strategic asset. For organizations approaching AI initiatives, building this structure pragmatically is essential.
Governance Structure for Mid-Market (50-500 employees)
| Role | Responsibilities | FTE Estimate | Profile |
|---|---|---|---|
| Chief Data Officer (CDO) | Data strategy, executive sponsor, AI Act compliance | 0.25 FTE (partial) | C-level or senior manager |
| Data Steward | Domain ownership, standard definition, change approval | 1 per data domain | Business + technical hybrid |
| Data Engineer | Pipelines, technical quality, monitoring tools | 1-2 FTE | Technical profile |
| Data Quality Analyst | KPI definition, audits, quality reporting | 0.5 FTE | Analytical/hybrid |
| DPO (Data Protection Officer) | GDPR, AI Act, data security, privacy by design | 0.25-0.5 FTE | Legal/technical |
EU AI Act and Training Data Requirements
The EU AI Act introduces the first legally binding requirements on training data quality for high-risk AI systems. Article 10 of the regulation is specifically dedicated to "Data and Data Governance". Obligations for general-purpose AI (GPAI) models took effect on 2 August 2025; for high-risk systems, full compliance with Article 10 is required by 2 August 2026.
EU AI Act Timeline - Actions Required Before August 2026
| Date | Milestone | Required Action |
|---|---|---|
| Feb 2025 | Prohibited AI practices operative | Audit AI systems for prohibited practices |
| Aug 2025 | GPAI models and governance operative | Governance for LLMs and foundation models |
| Aug 2026 | High-risk AI systems full compliance | Art. 10 data governance fully implemented |
| Aug 2027 | Legacy systems compliance | Existing AI systems brought into compliance |
# ai_act_compliance_checker.py
# EU AI Act Article 10 compliance verification for training datasets
from dataclasses import dataclass, field
from typing import Optional
import pandas as pd
import numpy as np
from scipy import stats
import json
from datetime import datetime
@dataclass
class DatasetComplianceReport:
"""EU AI Act Article 10 compliance report."""
dataset_name: str
assessment_date: str
assessor: str
compliant: bool = False
checks: dict = field(default_factory=dict)
recommendations: list = field(default_factory=list)
class AIActArticle10Checker:
"""
Verifies EU AI Act Art. 10 compliance for high-risk training datasets.
Art. 10 requires datasets to be:
1. Relevant and sufficiently representative
2. Free from errors as much as possible
3. Complete with respect to intended purpose
4. Having appropriate statistical properties
5. Free from biases that could discriminate against protected groups
"""
def __init__(self, dataset_name: str, assessor: str):
self.report = DatasetComplianceReport(
dataset_name=dataset_name,
assessment_date=datetime.now().isoformat(),
assessor=assessor
)
def check_representativeness(
self,
df: pd.DataFrame,
demographic_columns: list[str],
reference_distributions: dict
) -> bool:
"""
Art. 10(3): Verify demographic representativeness.
Chi-square test against reference population distribution.
"""
is_representative = True
self.report.checks["representativeness"] = {}
for col in demographic_columns:
if col not in df.columns or col not in reference_distributions:
continue
expected = reference_distributions[col]
categories = list(expected.keys())
obs_counts = [df[col].value_counts().get(c, 0) for c in categories]
            exp_props = [expected.get(c, 0.001) for c in categories]
            total_exp = sum(exp_props)
            # Scale expected frequencies to the observed total:
            # stats.chisquare requires observed and expected sums to match
            exp_counts = [p / total_exp * sum(obs_counts) for p in exp_props]
            chi2, p_value = stats.chisquare(obs_counts, exp_counts)
is_rep = p_value > 0.05
self.report.checks["representativeness"][col] = {
"chi2_statistic": chi2,
"p_value": p_value,
"is_representative": is_rep
}
if not is_rep:
is_representative = False
self.report.recommendations.append(
f"CRITICAL: Column '{col}' is not representative of the target population "
f"(chi2={chi2:.2f}, p={p_value:.4f}). "
f"Apply oversampling or collect additional data."
)
return is_representative
def check_bias_protected_attributes(
self,
df: pd.DataFrame,
target_column: str,
protected_attributes: list[str]
) -> bool:
"""
Art. 10(5): Verify absence of bias on protected attributes.
Uses disparate impact ratio (threshold: 0.8 = 80% rule).
"""
is_unbiased = True
self.report.checks["bias_assessment"] = {}
for attr in protected_attributes:
if attr not in df.columns:
continue
groups = df[attr].unique()
positive_rates = {
str(g): (df[df[attr] == g][target_column] == 1).mean()
for g in groups if len(df[df[attr] == g]) > 0
}
if len(positive_rates) < 2:
continue
max_rate = max(positive_rates.values())
min_rate = min(positive_rates.values())
disparate_impact = min_rate / max_rate if max_rate > 0 else 1.0
self.report.checks["bias_assessment"][attr] = {
"disparate_impact": disparate_impact,
"compliant": disparate_impact >= 0.8,
"group_rates": positive_rates
}
            if disparate_impact < 0.8:
                is_unbiased = False
                self.report.recommendations.append(
                    f"CRITICAL: Bias detected on protected attribute '{attr}'. "
                    f"Disparate impact = {disparate_impact:.3f} (80% rule threshold). "
                    f"Apply re-weighting, resampling, or fairness constraints."
                )
return is_unbiased
def generate_compliance_report(self) -> str:
"""Generates JSON report for EU AI Act audit."""
all_checks = (
list(self.report.checks.get("representativeness", {}).values()) +
list(self.report.checks.get("bias_assessment", {}).values()) +
list(self.report.checks.get("completeness", {}).values())
)
        self.report.compliant = all(
            check.get("is_representative", True) and check.get("compliant", True)
            for check in all_checks
        )
return json.dumps({
"dataset": self.report.dataset_name,
"assessment_date": self.report.assessment_date,
"assessor": self.report.assessor,
"eu_ai_act_article_10_compliant": self.report.compliant,
"checks": self.report.checks,
"recommendations": self.report.recommendations
}, indent=2, default=str)
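In isolation, the representativeness check above reduces to a chi-square goodness-of-fit test against the reference population. A condensed, standalone sketch on synthetic data (the 70/30 sample and 50/50 reference are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic training set skewed 70/30 vs. an assumed 50/50 reference population
df = pd.DataFrame({"gender": rng.choice(["F", "M"], size=1000, p=[0.7, 0.3])})

reference = {"F": 0.5, "M": 0.5}
categories = list(reference)
observed = [int((df["gender"] == c).sum()) for c in categories]
# Scale expected frequencies to the observed total (chisquare requires equal sums)
expected = [reference[c] * sum(observed) for c in categories]

chi2, p_value = stats.chisquare(observed, expected)
# A low p-value means the sample deviates significantly from the reference
print(f"chi2={chi2:.1f}  p={p_value:.2e}  representative={p_value > 0.05}")
```

The same pattern extends to any categorical demographic column; the hard part in practice is sourcing a trustworthy reference distribution, not the test itself.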
Data Observability: Continuous Monitoring of Production Pipelines
Data quality is not only guaranteed at ingestion time: data degrades over time. The phenomenon of data drift - where the distribution of production data progressively diverges from the training set - is one of the main causes of silent AI model degradation. Data observability addresses this with continuous monitoring and proactive alerting.
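PSI, used in the Soda checks below, can be computed in a few lines of numpy. A common (non-normative) rule of thumb reads PSI < 0.1 as stable, 0.1-0.25 as moderate drift, and > 0.25 as severe drift. A minimal sketch with synthetic data:

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a baseline (training) sample and a current (production) sample."""
    # Bin edges from baseline quantiles: avoids empty baseline bins
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor avoids log(0) on empty bins
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(100, 15, 10000)  # e.g. monthly_charges at training time
shifted = rng.normal(115, 15, 10000)   # production data after a price change
print(f"PSI (no drift):   {population_stability_index(baseline, baseline):.4f}")
print(f"PSI (mean shift): {population_stability_index(baseline, shifted):.4f}")
```

A one-standard-deviation mean shift, as in this example, lands well above the 0.25 "severe drift" threshold, which is exactly the kind of silent degradation observability tooling is meant to surface.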
Data Observability Tools Comparison 2025
| Tool | Type | Strengths | Ideal Use Case |
|---|---|---|---|
| Soda Core | Open source | YAML-based, CLI, CI/CD integration | SMBs with limited budget, dbt pipelines |
| Monte Carlo | SaaS (enterprise) | ML-powered, zero-config anomaly detection | Enterprise, high volume, small team |
| Great Expectations | Open source | Python-native, flexible, auto-documentation | Data engineering teams, Python pipelines |
| dbt tests | Open source | Quality as code, integrated in dbt workflow | Teams using dbt as primary layer |
# soda_checks_ml_pipeline.yml
# Soda Core checks for ML pipeline observability
checks for ml_features_customer:
# === FRESHNESS: data timeliness ===
- freshness(event_date) < 24h:
name: "Data not older than 24 hours"
# === VOLUME: volumetric anomalies ===
- row_count between 50000 and 5000000:
name: "Volume within expected range"
# === COMPLETENESS ===
- missing_count(customer_age) = 0:
name: "Customer age: zero nulls tolerated"
- missing_percent(monthly_charges) < 2%:
name: "Monthly charges: max 2% null"
# === VALIDITY ===
- invalid_percent(customer_age) < 0.1%:
name: "Customer age within 18-120 range"
valid min: 18
valid max: 120
- duplicate_count(customer_id) = 0:
name: "No duplicate customer IDs"
# === DISTRIBUTION DRIFT ===
# Compare against historical baseline (previous week)
- distribution_difference_index(monthly_charges) < 0.1:
name: "Monthly charges: drift < 10% vs baseline"
method: psi # Population Stability Index
- distribution_difference_index(customer_age) < 0.1:
name: "Customer age: drift < 10% vs baseline"
method: ks # Kolmogorov-Smirnov test
# === BUSINESS RULES ===
  # === BUSINESS RULES ===
  - failed rows:
      name: "No negative charges"
      fail condition: monthly_charges < 0
# Alert routing: handled by Soda Cloud or the orchestrator that runs the scan
# (shown here as illustrative configuration, not core SodaCL syntax)
alert config:
  slack:
    webhook: "https://hooks.slack.com/services/..."
    channel: "#data-quality-alerts"
  email:
    to:
      - "data-team@company.com"
      - "ml-team@company.com"
Best Practices and Anti-Patterns
Core Best Practices
- Quality as Code: Define quality requirements in versioned code (YAML, Python) in Git, not Word documents. This makes checks automatic, reproducible, and part of CI/CD.
- Fail Fast, Fail Loud: Data quality checks should block the pipeline (not just generate a warning) when data is out of threshold. A model trained on poor data causes more damage than a stopped pipeline.
- Separate Validation from Transformation: Validate data before transforming it (schema/range validation at ingestion), during transformation (dbt tests), and before ML use (Great Expectations on feature store).
- Monitor Drift, Not Just Static Quality: Data changes over time. PSI (Population Stability Index) and KS (Kolmogorov-Smirnov) tests are standard tools for detecting distributional shifts that silently degrade AI models.
- Document Governance Decisions: Every choice (quality threshold, imputation strategy, dataset exclusion) must be documented with date, author, and rationale. This documentation is required by the AI Act for high-risk systems.
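The "fail fast, fail loud" principle above can be enforced with a small gate function placed between validation and training. A self-contained sketch, assuming a quality-report dict like the one produced earlier in this article (the report format and names are hypothetical):

```python
class DataQualityGateError(RuntimeError):
    """Raised when the pipeline must halt on failed quality checks."""

def enforce_quality_gate(report: dict, min_success_rate: float = 1.0) -> None:
    """Halt the pipeline (raise) instead of merely logging a warning."""
    rate = report.get("success_rate", 0.0)
    if not report.get("success", False) or rate < min_success_rate:
        failed = report.get("failed_checks", [])
        raise DataQualityGateError(
            f"{len(failed)} quality checks failed (success rate {rate:.1%}); "
            f"halting pipeline to prevent model contamination."
        )

# Example: a failing report halts the run loudly instead of warning quietly
report = {
    "success": False,
    "success_rate": 0.92,
    "failed_checks": [
        {"expectation": "expect_column_values_to_not_be_null",
         "column": "customer_age"},
    ],
}
try:
    enforce_quality_gate(report)
except DataQualityGateError as exc:
    print(f"PIPELINE HALTED: {exc}")
```

In an orchestrator such as Airflow or Dagster, the raised exception marks the task as failed, which stops downstream training tasks automatically.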
Critical Anti-Patterns in Data Governance for AI
- "Best effort" data quality: Defining vague SLAs instead of precise, measurable numeric thresholds. Without metrics, there is no real governance.
- Silencing alerts: Configuring quality checks and then silencing alerts because they "disturb." Every ignored alert is a future production model incident.
- Governance only for new projects: Legacy datasets used for model retraining have the same quality requirements. They are often the most problematic.
- One-shot bias checks: Verifying bias only before initial training. Bias can emerge with new production data over time (concept drift).
- Confusing analytics and ML quality standards: Acceptable thresholds for a dashboard (5% null in a column) can be catastrophic for an ML input feature. The two contexts require different standards.
Conclusions and Next Steps
Data governance and data quality for AI are not bureaucracy: they are the invisible infrastructure that determines whether your models work in production or fail silently. With the EU AI Act bringing legally binding requirements on training data, investing in governance is not just good practice - it is a prerequisite for operating in the European market with high-risk AI systems.
The practical starting point for any organization is the same: identify the 3-5 most critical datasets for your AI initiatives, assign a data steward to each, implement automated checks with the open-source tools described in this article (Great Expectations, dbt-expectations, Soda Core), and gradually build the governance structure around these datasets.
Perfection is not the initial requirement: the journey matters as much as the destination, and every percentage point improvement in data quality directly translates into more reliable AI models, fewer production incidents, and more solid business decisions.
Data Governance Launch Checklist for AI
- AI-critical dataset inventory completed
- Data stewards assigned for each key data domain
- Quality SLAs defined and approved (completeness, freshness, validity)
- Automated checks implemented in pipelines (GE, dbt-expectations or Soda)
- Data catalog with lineage configured (OpenMetadata or Apache Atlas)
- Bias assessment performed for all datasets used in high-risk systems
- EU AI Act Art. 10 documentation started for high-risk AI systems
- Alerting configured with escalation channels defined
- First monthly Data Council meeting scheduled
- Team training completed
Related Resources
- MLOps for Business: How to monitor model drift in production with MLflow - Article #12 of this series
- LLMs in Enterprise: Data governance for RAG and data security in Large Language Models - Article #10 of this series
- AI Engineering: Feature store and embedding governance for enterprise RAG systems - AI Engineering Series
- PostgreSQL AI: pgvector and data quality for vector databases - PostgreSQL AI Series