Data Governance and Data Quality: Foundations for Trustworthy AI
An estimated 72% of enterprise AI projects fail before reaching production. Not for lack of sophisticated algorithms, not for inadequate architectures, not for missing talent. They fail because of insufficient data quality, absent governance frameworks, and pipelines that silently produce incorrect results before anyone notices.
In a context where the EU AI Act mandates legally binding requirements on training data quality for high-risk AI systems, and where organizations are increasingly investing in data-driven initiatives, building a solid data governance and data quality foundation is no longer optional: it is a prerequisite for competitive advantage and regulatory compliance.
This article guides you through operational frameworks, open-source tools, and practical implementations that enable trustworthy AI systems: from defining data quality dimensions, to implementing automated tests with Great Expectations and dbt, to managing data lineage with OpenMetadata, and ensuring EU AI Act compliance.
What You'll Learn in This Article
- The 6 DAMA data quality dimensions and how to measure them automatically
- Data quality checks implementation with Great Expectations and dbt-expectations
- Data catalog and data lineage with OpenMetadata and Apache Atlas
- Data governance framework: roles, processes, and organizational structure
- EU AI Act Article 10 requirements for training data: what to do before August 2026
- Data observability with Soda Core and Monte Carlo for production pipelines
- Bias detection in ML datasets and mitigation strategies
- Case study: data governance framework for an Italian manufacturing SMB
The Data Warehouse, AI and Digital Transformation Series
| # | Article | Focus |
|---|---|---|
| 1 | Data Warehouse Evolution | From SQL Server to Data Lakehouse |
| 2 | Data Mesh Architecture | Decentralized Data Ownership |
| 3 | Modern ETL vs ELT | dbt, Airbyte and Fivetran |
| 4 | Pipeline Orchestration | Airflow, Dagster and Prefect |
| 5 | AI in Manufacturing | Predictive Maintenance and Digital Twins |
| 6 | AI in Finance | Fraud Detection and Credit Scoring |
| 7 | AI in Retail | Demand Forecasting and Recommendations |
| 8 | AI in Healthcare | Diagnostics and Drug Discovery |
| 9 | AI in Logistics | Route Optimization and Warehouse Automation |
| 10 | LLMs in Enterprise | RAG and AI Guardrails |
| 11 | Enterprise Vector Databases | pgvector, Pinecone and Weaviate |
| 12 | MLOps for Business | AI Models in Production with MLflow |
| 13 | You are here - Data Governance | Data Quality for Trustworthy AI |
| 14 | Data-Driven Roadmap for SMBs | AI and DWH Adoption |
The Data Quality Problem in the AI Era
"Garbage in, garbage out" is a principle data engineers have known for decades, but with generative AI and machine learning models in production, the consequences of poor data quality have become exponentially more severe. A fraud detection model trained on imbalanced data generates false positives that block legitimate transactions. A recommendation system trained on biased data amplifies discrimination. A demand forecasting model trained on inconsistent data generates incorrect orders with direct impact on cash flow.
The Cost of Poor Data Quality - 2025 Data
| Indicator | Value | Source |
|---|---|---|
| Average cost per company (poor data quality) | $12.9M/year | Gartner 2024 |
| AI projects failing due to data quality | 72% | McKinsey 2025 |
| Data scientist time on data cleaning | 45-60% | Multiple surveys |
| Companies with formal data quality program | ~20% | Industry reports 2025 |
| ML error reduction with data governance | up to 35% | IBM Institute 2025 |
A fundamental distinction that many organizations miss is the difference between data quality for analytics and data quality for AI. In traditional reporting, an anomalous data point produces a wrong number on a dashboard: someone notices it, corrects it, and moves on. In machine learning, an anomalous data point in the training set silently contaminates the model, which then misbehaves in production for months before the problem is identified. The latency between a problem and its manifestation is orders of magnitude longer.
The 6 DAMA Data Quality Dimensions
The DAMA-DMBOK (Data Management Body of Knowledge) framework defines 6 fundamental data quality dimensions, which in 2025 remain the standard reference for any enterprise data quality program:
DAMA Dimensions for AI - Extended Framework
| Dimension | Definition | AI-Specific Metric | Critical Threshold |
|---|---|---|---|
| Accuracy | Data correctly represents real-world entities | % correct labels in training set | >99% for critical classification |
| Completeness | All necessary data is present | % non-null values for critical features | >95% for input features |
| Consistency | Data is uniform across different systems | % concordant records between sources | >98% for shared features |
| Timeliness | Data is up-to-date and accessible when needed | Average lag: production data vs. training | <24h for real-time models |
| Validity | Data conforms to defined formats and constraints | % schema and range violations | <0.1% violations |
| Uniqueness | No unintentional duplicates | % duplicate records in training set | <0.5% duplicates |
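Before adopting a dedicated framework, several of these dimensions can be measured directly with pandas. A minimal sketch (column names and the validity rule are illustrative, not from the DAMA standard):

```python
import pandas as pd

def dama_quality_profile(df: pd.DataFrame, key_column: str) -> dict:
    """Compute simple metrics for three of the six DAMA dimensions."""
    return {
        # Completeness: share of non-null cells across the whole frame
        "completeness": float(df.notna().mean().mean()),
        # Uniqueness: share of rows not duplicated on the business key
        "uniqueness": float(1.0 - df[key_column].duplicated().mean()),
        # Validity (example rule): ages inside the 18-120 range
        "validity_age": float(df["age"].between(18, 120).mean()),
        # Accuracy, consistency, and timeliness require external references,
        # cross-system comparison, and timestamps respectively.
    }

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, 17, 51, None],
})
print(dama_quality_profile(df, key_column="customer_id"))
```

Thresholds from the table above can then be applied to the resulting metrics to decide pass/fail per dimension.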
Practical Implementation: Data Quality with Great Expectations
Great Expectations is one of the most widely adopted open-source Python frameworks for data quality testing. The approach mirrors unit tests for code: define "expectations" about the data, run them automatically in the pipeline, and auto-generate documentation. Native integrations with Airflow, Prefect, and dbt make it a natural component of any modern data stack.
# data_quality_pipeline.py
# Complete data quality framework with Great Expectations
import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest
import pandas as pd
import numpy as np
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
class DataQualityFramework:
"""
Data quality framework for enterprise ML pipelines.
Implements the 6 DAMA dimensions with Great Expectations.
"""
def __init__(self, datasource_name: str = "ml_training_data"):
self.context = gx.get_context()
self.datasource_name = datasource_name
self.validation_results = {}
def build_ml_expectation_suite(
self,
suite_name: str,
target_column: str,
feature_columns: list[str]
):
"""
Creates an expectation suite for ML datasets.
Covers all 6 DAMA quality dimensions.
"""
suite = self.context.add_expectation_suite(
expectation_suite_name=suite_name
)
# === COMPLETENESS ===
for col in feature_columns:
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_not_be_null",
kwargs={
"column": col,
"mostly": 0.95 # 95% non-null
}
)
)
# Target column must be 100% non-null
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_not_be_null",
kwargs={"column": target_column}
)
)
# === VALIDITY ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_between",
kwargs={
"column": "customer_age",
"min_value": 18,
"max_value": 120,
"mostly": 0.999
}
)
)
# === UNIQUENESS ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_unique",
kwargs={"column": "transaction_id"}
)
)
# === CONSISTENCY: target must have expected values only ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_in_set",
kwargs={
"column": target_column,
"value_set": [0, 1],
}
)
)
# === TIMELINESS: no data older than 90 days ===
suite.add_expectation(
gx.core.ExpectationConfiguration(
expectation_type="expect_column_values_to_be_between",
kwargs={
"column": "event_timestamp",
"min_value": "2024-11-01",
"max_value": datetime.now().strftime("%Y-%m-%d"),
"parse_strings_as_datetimes": True,
"mostly": 0.99
}
)
)
self.context.save_expectation_suite(suite)
logger.info(f"Suite {suite_name} created with {len(suite.expectations)} expectations")
return suite
    def validate_dataset(
        self,
        df: pd.DataFrame,
        suite_name: str,
        run_name: str | None = None
    ) -> dict:
"""
Validates a DataFrame against the defined suite.
Returns structured results for monitoring.
"""
run_name = run_name or f"run_{datetime.now().isoformat()}"
batch_request = RuntimeBatchRequest(
datasource_name=self.datasource_name,
data_connector_name="runtime_connector",
data_asset_name="training_data",
runtime_parameters={"batch_data": df},
batch_identifiers={"run_id": run_name}
)
checkpoint = self.context.add_or_update_checkpoint(
name="ml_data_checkpoint",
validations=[{
"batch_request": batch_request,
"expectation_suite_name": suite_name
}]
)
results = checkpoint.run(run_name=run_name)
validation_result = results.list_validation_results()[0]
stats = validation_result.statistics
quality_report = {
"run_name": run_name,
"timestamp": datetime.now().isoformat(),
"success": results.success,
"success_rate": stats["success_percent"] / 100,
"evaluated_expectations": stats["evaluated_expectations"],
"failed_checks": [
{
"expectation": r.expectation_config.expectation_type,
"column": r.expectation_config.kwargs.get("column"),
}
for r in validation_result.results if not r.success
]
}
if not quality_report["success"]:
raise ValueError(
f"Data quality FAILED: {len(quality_report['failed_checks'])} checks failed. "
f"Pipeline halted to prevent model contamination."
)
return quality_report
Data Quality in dbt: Declarative Tests in the Transformation Layer
For teams using dbt as their transformation layer, the dbt-expectations package brings Great Expectations-style assertions directly into dbt models, with tests defined in YAML next to the SQL code. This "quality as code" approach ensures every transformation is validated on every run.
# models/schema.yml
# Data quality tests with dbt-expectations
version: 2
models:
- name: ml_features_customer
description: "Feature store for churn prediction model"
columns:
- name: customer_id
tests:
- unique
- not_null
- dbt_expectations.expect_column_values_to_be_of_type:
column_type: VARCHAR
- name: customer_age
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 18
max_value: 120
mostly: 0.999
- name: monthly_charges
tests:
- not_null
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
max_value: 10000
- dbt_expectations.expect_column_mean_to_be_between:
min_value: 50
max_value: 200
- name: churn_label
tests:
- not_null
- accepted_values:
values: [0, 1]
          # Verify dataset is not too imbalanced: the mean of a 0/1 label is the churn rate
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 0.01  # At least 1% churn
              max_value: 0.5   # Max 50% churn
# Table-level tests
tests:
- dbt_expectations.expect_table_row_count_to_be_between:
min_value: 10000
max_value: 10000000
- dbt_expectations.expect_compound_columns_to_be_unique:
column_list: ["customer_id", "snapshot_date"]
Data Catalog and Data Lineage with OpenMetadata
Data lineage - the ability to track the journey of data from its source to the final AI model - has become an indispensable requirement for two converging reasons: compliance with the EU AI Act (which requires documentation of training data provenance) and the practical need for debugging when a model produces unexpected results.
OpenMetadata is one of the most mature open-source platforms for data catalog and lineage in 2025, created by engineers who previously built metadata platforms at Uber and contributed to Apache Hadoop. It supports column-level lineage and native integration with dbt, Airflow, Spark, and the major data warehouses.
# openmetadata_lineage.py
# Automated data lineage registration for ML pipelines
from metadata.ingestion.ometa.ometa_api import OpenMetadata
from metadata.generated.schema.entity.data.table import Table
from metadata.generated.schema.type.entityLineage import (
ColumnLineage, EntitiesEdge, LineageDetails
)
from metadata.generated.schema.api.lineage.addLineage import AddLineageRequest
import json
from datetime import datetime
class MLPipelineLineageTracker:
"""
Automated data lineage tracker for ML pipelines.
Registers every transformation step in OpenMetadata.
Required for EU AI Act Article 10 compliance documentation.
"""
def __init__(self, server_url: str, jwt_token: str):
from metadata.generated.schema.entity.services.connections.metadata.openMetadataConnection import (
OpenMetadataConnection
)
from metadata.generated.schema.security.client.openMetadataJWTClientConfig import (
OpenMetadataJWTClientConfig
)
server_config = OpenMetadataConnection(
hostPort=server_url,
authProvider="openmetadata",
securityConfig=OpenMetadataJWTClientConfig(jwtToken=jwt_token)
)
self.metadata = OpenMetadata(server_config)
def register_training_data_lineage(
self,
source_tables: list[str],
feature_store_table: str,
ml_model_name: str,
transformation_description: str,
data_steward: str
):
"""
Registers: raw_data -> feature_store -> ml_model flow.
Critical for AI Act Art. 10 - Data and Data Governance.
"""
source_entities = []
for table_fqn in source_tables:
table = self.metadata.get_by_name(entity=Table, fqn=table_fqn)
if table:
source_entities.append(table)
feature_table = self.metadata.get_by_name(
entity=Table, fqn=feature_store_table
)
for source in source_entities:
lineage_request = AddLineageRequest(
edge=EntitiesEdge(
fromEntity={
"id": str(source.id.__root__),
"type": "table"
},
toEntity={
"id": str(feature_table.id.__root__),
"type": "table"
},
lineageDetails=LineageDetails(
description=(
f"{transformation_description} | "
f"Data Steward: {data_steward} | "
f"Reviewed: {datetime.now().strftime('%Y-%m-%d')}"
)
)
)
)
self.metadata.add_lineage(lineage_request)
# Add AI Act compliance tags
self._tag_for_ai_act_compliance(
feature_store_table, ml_model_name, data_steward
)
def _tag_for_ai_act_compliance(
self, table_fqn: str, model_name: str, steward: str
):
"""
Adds structured AI Act Article 10 compliance tags.
Required documentation for high-risk AI systems.
"""
compliance_metadata = {
"ai_act_article": "10",
"dataset_purpose": "training_and_validation",
"ai_system": model_name,
"data_steward": steward,
"governance_reviewed": "true",
"bias_assessment_completed": "true",
"next_review": "2025-07-01"
}
print(f"AI Act compliance tags registered: {json.dumps(compliance_metadata, indent=2)}")
# Usage example
def setup_churn_model_lineage():
tracker = MLPipelineLineageTracker(
server_url="http://openmetadata.internal:8585/api",
jwt_token="your-jwt-token"
)
tracker.register_training_data_lineage(
source_tables=[
"default.raw_crm.customers",
"default.raw_billing.transactions"
],
feature_store_table="default.feature_store.ml_churn_features_v3",
ml_model_name="churn_prediction_xgboost_v2",
transformation_description="Monthly aggregations, tenure calculation, "
"categorical encoding",
data_steward="Jane Smith, Data Governance Lead"
)
Data Governance Framework: Organizational Structure and Processes
Data governance is not a software tool: it is a system of people, processes, and technology that ensures data is managed as a strategic asset. For organizations approaching AI initiatives, building this structure pragmatically is essential.
Governance Structure for Mid-Market (50-500 employees)
| Role | Responsibilities | FTE Estimate | Profile |
|---|---|---|---|
| Chief Data Officer (CDO) | Data strategy, executive sponsor, AI Act compliance | 0.25 FTE (partial) | C-level or senior manager |
| Data Steward | Domain ownership, standard definition, change approval | 1 per data domain | Business + technical hybrid |
| Data Engineer | Pipelines, technical quality, monitoring tools | 1-2 FTE | Technical profile |
| Data Quality Analyst | KPI definition, audits, quality reporting | 0.5 FTE | Analytical/hybrid |
| DPO (Data Protection Officer) | GDPR, AI Act, data security, privacy by design | 0.25-0.5 FTE | Legal/technical |
EU AI Act and Training Data Requirements
The EU AI Act introduces the first legally binding requirements on training data quality for high-risk AI systems. Article 10 of the regulation is specifically dedicated to "Data and Data Governance". Obligations for general-purpose AI (GPAI) models took effect on 2 August 2025; for high-risk systems, full compliance with Article 10 is required by 2 August 2026.
EU AI Act Timeline - Actions Required Before August 2026
| Date | Milestone | Required Action |
|---|---|---|
| Feb 2025 | Prohibited AI practices operative | Audit AI systems for prohibited practices |
| Aug 2025 | GPAI models and governance operative | Governance for LLMs and foundation models |
| Aug 2026 | High-risk AI systems full compliance | Art. 10 data governance fully implemented |
| Aug 2027 | Legacy systems compliance | Existing AI systems brought into compliance |
# ai_act_compliance_checker.py
# EU AI Act Article 10 compliance verification for training datasets
from dataclasses import dataclass, field
from typing import Optional
import pandas as pd
import numpy as np
from scipy import stats
import json
from datetime import datetime
@dataclass
class DatasetComplianceReport:
"""EU AI Act Article 10 compliance report."""
dataset_name: str
assessment_date: str
assessor: str
compliant: bool = False
checks: dict = field(default_factory=dict)
recommendations: list = field(default_factory=list)
class AIActArticle10Checker:
"""
Verifies EU AI Act Art. 10 compliance for high-risk training datasets.
Art. 10 requires datasets to be:
1. Relevant and sufficiently representative
2. Free from errors as much as possible
3. Complete with respect to intended purpose
4. Having appropriate statistical properties
5. Free from biases that could discriminate against protected groups
"""
def __init__(self, dataset_name: str, assessor: str):
self.report = DatasetComplianceReport(
dataset_name=dataset_name,
assessment_date=datetime.now().isoformat(),
assessor=assessor
)
def check_representativeness(
self,
df: pd.DataFrame,
demographic_columns: list[str],
reference_distributions: dict
) -> bool:
"""
Art. 10(3): Verify demographic representativeness.
Chi-square test against reference population distribution.
"""
is_representative = True
self.report.checks["representativeness"] = {}
for col in demographic_columns:
if col not in df.columns or col not in reference_distributions:
continue
expected = reference_distributions[col]
categories = list(expected.keys())
obs_counts = [df[col].value_counts().get(c, 0) for c in categories]
            exp_props = [expected.get(c, 0.001) for c in categories]
            total_exp = sum(exp_props)
            # Scale expected frequencies to the observed total:
            # stats.chisquare requires observed and expected sums to match
            exp_counts = [p / total_exp * sum(obs_counts) for p in exp_props]
            chi2, p_value = stats.chisquare(obs_counts, exp_counts)
is_rep = p_value > 0.05
self.report.checks["representativeness"][col] = {
"chi2_statistic": chi2,
"p_value": p_value,
"is_representative": is_rep
}
if not is_rep:
is_representative = False
self.report.recommendations.append(
f"CRITICAL: Column '{col}' is not representative of the target population "
f"(chi2={chi2:.2f}, p={p_value:.4f}). "
f"Apply oversampling or collect additional data."
)
return is_representative
def check_bias_protected_attributes(
self,
df: pd.DataFrame,
target_column: str,
protected_attributes: list[str]
) -> bool:
"""
Art. 10(5): Verify absence of bias on protected attributes.
Uses disparate impact ratio (threshold: 0.8 = 80% rule).
"""
is_unbiased = True
self.report.checks["bias_assessment"] = {}
for attr in protected_attributes:
if attr not in df.columns:
continue
groups = df[attr].unique()
positive_rates = {
str(g): (df[df[attr] == g][target_column] == 1).mean()
for g in groups if len(df[df[attr] == g]) > 0
}
if len(positive_rates) < 2:
continue
max_rate = max(positive_rates.values())
min_rate = min(positive_rates.values())
disparate_impact = min_rate / max_rate if max_rate > 0 else 1.0
self.report.checks["bias_assessment"][attr] = {
"disparate_impact": disparate_impact,
"compliant": disparate_impact >= 0.8,
"group_rates": positive_rates
}
            if disparate_impact < 0.8:
                is_unbiased = False
                self.report.recommendations.append(
                    f"CRITICAL: Bias detected on protected attribute '{attr}'. "
                    f"Disparate impact = {disparate_impact:.3f} (80% rule threshold). "
                    f"Apply re-weighting, resampling, or fairness constraints."
                )
return is_unbiased
def generate_compliance_report(self) -> str:
"""Generates JSON report for EU AI Act audit."""
all_checks = (
list(self.report.checks.get("representativeness", {}).values()) +
list(self.report.checks.get("bias_assessment", {}).values()) +
list(self.report.checks.get("completeness", {}).values())
)
        self.report.compliant = all(
            check.get("is_representative", True) and check.get("compliant", True)
            for check in all_checks
        )
return json.dumps({
"dataset": self.report.dataset_name,
"assessment_date": self.report.assessment_date,
"assessor": self.report.assessor,
"eu_ai_act_article_10_compliant": self.report.compliant,
"checks": self.report.checks,
"recommendations": self.report.recommendations
}, indent=2, default=str)
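In isolation, the representativeness check above reduces to a chi-square goodness-of-fit test against the reference population. A condensed, standalone sketch on synthetic data (the 70/30 sample and 50/50 reference are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic training set skewed 70/30 vs. an assumed 50/50 reference population
df = pd.DataFrame({"gender": rng.choice(["F", "M"], size=1000, p=[0.7, 0.3])})

reference = {"F": 0.5, "M": 0.5}
categories = list(reference)
observed = [int((df["gender"] == c).sum()) for c in categories]
# Scale expected frequencies to the observed total (chisquare requires equal sums)
expected = [reference[c] * sum(observed) for c in categories]

chi2, p_value = stats.chisquare(observed, expected)
# A low p-value means the sample deviates significantly from the reference
print(f"chi2={chi2:.1f}  p={p_value:.2e}  representative={p_value > 0.05}")
```

The same pattern extends to any categorical demographic column; the hard part in practice is sourcing a trustworthy reference distribution, not the test itself.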
Data Observability: Continuous Monitoring of Production Pipelines
Data quality is not only guaranteed at ingestion time: data degrades over time. The phenomenon of data drift - where the distribution of production data progressively diverges from the training set - is one of the main causes of silent AI model degradation. Data observability addresses this with continuous monitoring and proactive alerting.
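PSI, used in the Soda checks below, can be computed in a few lines of numpy. A common (non-normative) rule of thumb reads PSI < 0.1 as stable, 0.1-0.25 as moderate drift, and > 0.25 as severe drift. A minimal sketch with synthetic data:

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a baseline (training) sample and a current (production) sample."""
    # Bin edges from baseline quantiles: avoids empty baseline bins
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor avoids log(0) on empty bins
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(100, 15, 10000)  # e.g. monthly_charges at training time
shifted = rng.normal(115, 15, 10000)   # production data after a price change
print(f"PSI (no drift):   {population_stability_index(baseline, baseline):.4f}")
print(f"PSI (mean shift): {population_stability_index(baseline, shifted):.4f}")
```

A one-standard-deviation mean shift, as in this example, lands well above the 0.25 "severe drift" threshold, which is exactly the kind of silent degradation observability tooling is meant to surface.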
Data Observability Tools Comparison 2025
| Tool | Type | Strengths | Ideal Use Case |
|---|---|---|---|
| Soda Core | Open source | YAML-based, CLI, CI/CD integration | SMBs with limited budget, dbt pipelines |
| Monte Carlo | SaaS (enterprise) | ML-powered, zero-config anomaly detection | Enterprise, high volume, small team |
| Great Expectations | Open source | Python-native, flexible, auto-documentation | Data engineering teams, Python pipelines |
| dbt tests | Open source | Quality as code, integrated in dbt workflow | Teams using dbt as primary layer |
# soda_checks_ml_pipeline.yml
# Soda Core checks for ML pipeline observability
checks for ml_features_customer:
# === FRESHNESS: data timeliness ===
- freshness(event_date) < 24h:
name: "Data not older than 24 hours"
# === VOLUME: volumetric anomalies ===
- row_count between 50000 and 5000000:
name: "Volume within expected range"
# === COMPLETENESS ===
- missing_count(customer_age) = 0:
name: "Customer age: zero nulls tolerated"
- missing_percent(monthly_charges) < 2%:
name: "Monthly charges: max 2% null"
# === VALIDITY ===
- invalid_percent(customer_age) < 0.1%:
name: "Customer age within 18-120 range"
valid min: 18
valid max: 120
- duplicate_count(customer_id) = 0:
name: "No duplicate customer IDs"
# === DISTRIBUTION DRIFT ===
# Compare against historical baseline (previous week)
- distribution_difference_index(monthly_charges) < 0.1:
name: "Monthly charges: drift < 10% vs baseline"
method: psi # Population Stability Index
- distribution_difference_index(customer_age) < 0.1:
name: "Customer age: drift < 10% vs baseline"
method: ks # Kolmogorov-Smirnov test
# === BUSINESS RULES ===
  # === BUSINESS RULES ===
  - failed rows:
      name: "No negative charges"
      fail condition: monthly_charges < 0
# Alert routing: handled by Soda Cloud or the orchestrator that runs the scan
# (shown here as illustrative configuration, not core SodaCL syntax)
alert config:
  slack:
    webhook: "https://hooks.slack.com/services/..."
    channel: "#data-quality-alerts"
  email:
    to:
      - "data-team@company.com"
      - "ml-team@company.com"
Best Practices and Anti-Patterns
Core Best Practices
- Quality as Code: Define quality requirements in versioned code (YAML, Python) in Git, not Word documents. This makes checks automatic, reproducible, and part of CI/CD.
- Fail Fast, Fail Loud: Data quality checks should block the pipeline (not just generate a warning) when data is out of threshold. A model trained on poor data causes more damage than a stopped pipeline.
- Separate Validation from Transformation: Validate data before transforming it (schema/range validation at ingestion), during transformation (dbt tests), and before ML use (Great Expectations on feature store).
- Monitor Drift, Not Just Static Quality: Data changes over time. PSI (Population Stability Index) and KS (Kolmogorov-Smirnov) tests are standard tools for detecting distributional shifts that silently degrade AI models.
- Document Governance Decisions: Every choice (quality threshold, imputation strategy, dataset exclusion) must be documented with date, author, and rationale. This documentation is required by the AI Act for high-risk systems.
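The "fail fast, fail loud" principle above can be enforced with a small gate function placed between validation and training. A self-contained sketch, assuming a quality-report dict like the one produced earlier in this article (the report format and names are hypothetical):

```python
class DataQualityGateError(RuntimeError):
    """Raised when the pipeline must halt on failed quality checks."""

def enforce_quality_gate(report: dict, min_success_rate: float = 1.0) -> None:
    """Halt the pipeline (raise) instead of merely logging a warning."""
    rate = report.get("success_rate", 0.0)
    if not report.get("success", False) or rate < min_success_rate:
        failed = report.get("failed_checks", [])
        raise DataQualityGateError(
            f"{len(failed)} quality checks failed (success rate {rate:.1%}); "
            f"halting pipeline to prevent model contamination."
        )

# Example: a failing report halts the run loudly instead of warning quietly
report = {
    "success": False,
    "success_rate": 0.92,
    "failed_checks": [
        {"expectation": "expect_column_values_to_not_be_null",
         "column": "customer_age"},
    ],
}
try:
    enforce_quality_gate(report)
except DataQualityGateError as exc:
    print(f"PIPELINE HALTED: {exc}")
```

In an orchestrator such as Airflow or Dagster, the raised exception marks the task as failed, which stops downstream training tasks automatically.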
Critical Anti-Patterns in Data Governance for AI
- "Best effort" data quality: Defining vague SLAs instead of precise, measurable numeric thresholds. Without metrics, there is no real governance.
- Silencing alerts: Configuring quality checks and then silencing alerts because they "disturb." Every ignored alert is a future production model incident.
- Governance only for new projects: Legacy datasets used for model retraining have the same quality requirements. They are often the most problematic.
- One-shot bias checks: Verifying bias only before initial training. Bias can emerge with new production data over time (concept drift).
- Confusing analytics and ML quality standards: Acceptable thresholds for a dashboard (5% null in a column) can be catastrophic for an ML input feature. The two contexts require different standards.
Conclusions and Next Steps
Data governance and data quality for AI are not bureaucracy: they are the invisible infrastructure that determines whether your models work in production or fail silently. With the EU AI Act bringing legally binding requirements on training data, investing in governance is not just good practice - it is a prerequisite for operating in the European market with high-risk AI systems.
The practical starting point for any organization is the same: identify the 3-5 most critical datasets for your AI initiatives, assign a data steward to each, implement automated checks with the open-source tools described in this article (Great Expectations, dbt-expectations, Soda Core), and gradually build the governance structure around these datasets.
Perfection is not the initial requirement: the journey matters as much as the destination, and every percentage point improvement in data quality directly translates into more reliable AI models, fewer production incidents, and more solid business decisions.
Data Governance Launch Checklist for AI
- AI-critical dataset inventory completed
- Data stewards assigned for each key data domain
- Quality SLAs defined and approved (completeness, freshness, validity)
- Automated checks implemented in pipelines (GE, dbt-expectations or Soda)
- Data catalog with lineage configured (OpenMetadata or Apache Atlas)
- Bias assessment performed for all datasets used in high-risk systems
- EU AI Act Art. 10 documentation started for high-risk AI systems
- Alerting configured with escalation channels defined
- First monthly Data Council meeting scheduled
- Team training completed
Related Resources
- MLOps for Business: How to monitor model drift in production with MLflow - Article #12 of this series
- LLMs in Enterprise: Data governance for RAG and data security in Large Language Models - Article #10 of this series
- AI Engineering: Feature store and embedding governance for enterprise RAG systems - AI Engineering Series
- PostgreSQL AI: pgvector and data quality for vector databases - PostgreSQL AI Series