Detecting Model Drift: Monitoring and Automated Retraining in Production
Your model is finally live. Metrics look great, the team is happy, and stakeholders are satisfied. Then, weeks later, someone notices the predictions feel off. A month later, the model is clearly degraded. Welcome to one of the most insidious problems in production machine learning: model drift.
Research from Gartner indicates that over 65% of ML models in production degrade significantly within 12 months of deployment, often without teams noticing in time. The problem is especially acute in retail and finance, where data distributions shift rapidly in response to market trends, seasonality, and changing user behavior.
In this guide we will build a complete drift detection and automated retraining system: understanding the different types of drift, implementing detectors with Evidently AI, NannyML and Alibi Detect, configuring statistical tests (KS, PSI, Chi-Square), integrating Prometheus and Grafana for continuous monitoring, and creating automated retraining pipelines triggered by alerts.
What You Will Learn
- Differences between data drift, concept drift, feature drift and label drift
- Statistical tests for drift detection: KS test, PSI, Chi-Square, MMD
- Practical implementation with Evidently AI, NannyML and Alibi Detect
- Monitoring dashboards with Prometheus and Grafana
- Alerting pipelines and automated retraining with MLflow
- Production-grade MLOps best practices on a limited budget
Why Drift is a Critical Problem
The real world is not static. The data your model was trained on reflected a specific statistical distribution, a snapshot of the world at that point in time. But the world keeps changing: user habits evolve, markets fluctuate, upstream systems change their data formats, and unexpected events like economic crises or pandemics reshape behavioral patterns.
The fundamental challenge with drift is silent degradation: the model becomes less accurate but keeps producing predictions without any technical errors. The service returns HTTP 200, logs show no exceptions, but the decisions based on those predictions are gradually becoming wrong. Without an active monitoring system, this degradation can go unnoticed for months.
The Economic Impact of Undetected Drift
A degraded fraud detection model lets fraudulent transactions pass through undetected. A drifting pricing model can cost millions in uncompetitive pricing. A degraded churn prediction model leads to wasted retention campaigns targeting the wrong customers. The cost of monitoring is always lower than the cost of undetected drift.
Drift Taxonomy: Four Fundamental Types
Before implementing solutions, it is essential to understand what is drifting. There are four main categories of drift, each with different root causes and detection strategies.
1. Data Drift (Covariate Shift)
Data drift, also known as covariate shift, occurs when the distribution of input features P(X) changes from training, but the relationship between features and labels P(Y|X) remains stable. A classic example: the model was trained on users of a specific age range, but the product is adopted by a new demographic segment.
Data drift is the most common type and the easiest to detect, because it only requires monitoring the distributions of input features, with no need for labels. It can be detected in near real-time, before the degradation shows up in prediction quality.
2. Concept Drift
Concept drift is more insidious: the relationship P(Y|X) between features and labels changes, even if the feature distribution X remains stable. Example: a sentiment analysis model trained on 2022 tweets fails on the slang of 2025. The words themselves (X) may look similar, but their meaning, and therefore the mapping from X to Y, has changed.
Concept drift requires ground truth to be detected directly: you need to compare predictions against real labels. When labels arrive slowly (like in churn prediction with 90-day observation windows), proxy metrics such as prediction drift or probability score distributions are used instead.
3. Feature Drift
Feature drift is a subset of data drift that focuses on specific critical features. Not all features have equal impact: a high-importance feature that drifts is far more critical than a low-relevance feature. Feature importance tools (SHAP, permutation importance) help prioritize monitoring efforts.
4. Label Drift (Prior Probability Shift)
Label drift occurs when the distribution of target labels P(Y) changes. In a binary classifier (spam/not-spam), if suddenly 90% of messages are spam instead of the usual 10%, the model is calibrated for a different distribution and predictions will be biased. This type of drift is common in scenarios with variable class imbalance over time.
Drift Types Summary
- Data Drift: P(X) changes, P(Y|X) stable. Detectable without labels.
- Concept Drift: P(Y|X) changes. Requires labels or proxy metrics.
- Feature Drift: Specific features change. Priority based on importance.
- Label Drift: P(Y) changes. Monitor prediction distribution.
Statistical Tests for Drift Detection
Statistical drift detection relies on comparing two distributions: the reference distribution (training or a stable production period) and the current distribution (the monitoring window). Different statistical tests have different characteristics in terms of sensitivity, interpretability, and computational cost.
Kolmogorov-Smirnov Test (KS)
The KS test is the most widely used for continuous features. It measures the maximum distance between the cumulative distribution functions (CDF) of the two distributions. A low p-value (typically < 0.05) signals statistically significant drift.
Advantages: distribution-free (non-parametric), robust, easy to visualize. Limitations: sensitive to tails of distributions, less powerful with small samples, can produce false positives with very large datasets.
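As a concrete illustration (not tied to any particular monitoring library), the two-sample KS test is available directly in SciPy; the feature values below are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Reference window: a continuous feature as seen at training time
reference = rng.normal(loc=50.0, scale=10.0, size=2000)
# Current window: the same feature after a simulated mean shift (drift)
current = rng.normal(loc=55.0, scale=10.0, size=2000)

# Two-sample KS test: the statistic is the max distance between the empirical CDFs
statistic, p_value = stats.ks_2samp(reference, current)
drift_detected = bool(p_value < 0.05)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}, drift={drift_detected}")
```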
Population Stability Index (PSI)
The PSI originated in banking to monitor the stability of risk score distributions. It bins both distributions and sums, over the bins, (actual% - expected%) * ln(actual% / expected%). Standard interpretation:
- PSI < 0.1: no significant change
- PSI 0.1 - 0.2: slight change, worth monitoring
- PSI > 0.2: significant change, action required
PSI is very intuitive for business stakeholders and applies to both continuous (with decile discretization) and categorical features. It is particularly popular in credit scoring and fraud detection models.
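The PSI calculation is simple enough to implement by hand. Below is a minimal NumPy sketch using decile bins taken from the reference distribution; the epsilon guard against empty bins is an implementation choice, not part of the standard definition:

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of a continuous feature.

    Bins are the deciles of the reference distribution; each term is
    (actual% - expected%) * ln(actual% / expected%).
    """
    # Interior decile cut points of the reference; the outer bins are open-ended
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    expected = np.bincount(np.digitize(reference, edges), minlength=n_bins) / len(reference)
    actual = np.bincount(np.digitize(current, edges), minlength=n_bins) / len(current)
    # Guard against log(0) on empty bins
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

rng = np.random.default_rng(0)
psi_stable = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
psi_shifted = population_stability_index(rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000))
print(f"PSI stable={psi_stable:.4f}, shifted={psi_shifted:.4f}")
```

Against the interpretation table above, the stable pair lands well below 0.1 while the one-standard-deviation shift lands well above 0.2.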
Chi-Square Test
The Chi-Square test is the reference test for categorical features. It compares observed frequencies with expected frequencies and produces a p-value. It is appropriate when features have a limited number of categories and samples are large enough (expected frequency > 5 for each category). For high-cardinality features, rare categories should be grouped together.
Maximum Mean Discrepancy (MMD)
MMD is a kernel-based test that measures the distance between two distributions in a Hilbert space. It is particularly powerful for detecting differences in multivariate structure and is used by Alibi Detect for drift in tabular data, images and text. The advantage is that it does not require choosing bins or discretization parameters.
Implementation with Evidently AI
Evidently AI has become the standard open-source library for ML model monitoring in Python, with over 20 million downloads. It provides predefined presets for the most common use cases and integrates with any workflow orchestrator.
# Installation
pip install evidently
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset, ClassificationPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
    ColumnSummaryMetric
)
# --- Load reference (training) and production data ---
reference_data = pd.read_parquet("data/training_features.parquet")
current_data = pd.read_parquet("data/production_batch_2025_02.parquet")
feature_columns = [
    "age", "tenure_months", "monthly_charges",
    "total_charges", "num_support_tickets",
    "contract_type", "payment_method"
]
# --- Data Drift Report ---
drift_report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
    ColumnDriftMetric(column_name="monthly_charges"),
    ColumnDriftMetric(column_name="contract_type"),
    ColumnSummaryMetric(column_name="monthly_charges"),
])

drift_report.run(
    reference_data=reference_data[feature_columns],
    current_data=current_data[feature_columns]
)
# Save interactive HTML report
drift_report.save_html("reports/drift_report_2025_02.html")
# Extract metrics programmatically for pipeline integration
report_dict = drift_report.as_dict()
dataset_drift = report_dict["metrics"][0]["result"]
print(f"Dataset drift detected: {dataset_drift['dataset_drift']}")
print(f"Features drifted: {dataset_drift['number_of_drifted_columns']}/{dataset_drift['number_of_columns']}")
print(f"Share of drifted features: {dataset_drift['share_of_drifted_columns']:.1%}")
Evidently generates interactive HTML reports with distribution visualizations, overlapping histograms and summary tables. For each feature, it reports the statistical test used (chosen automatically based on data type), the p-value or test statistic, and a drift/no-drift flag.
Test Suites with Custom Thresholds
To integrate Evidently into a CI/CD pipeline or an Airflow/Prefect workflow, the Test Suite is the right tool: it lets you define precise thresholds and returns pass/fail results programmatically.
from evidently.test_suite import TestSuite
from evidently.tests import (
    TestNumberOfDriftedColumns,
    TestShareOfDriftedColumns,
    TestColumnDrift,
    TestDatasetDrift
)
# --- Test Suite with custom thresholds ---
drift_test_suite = TestSuite(tests=[
    # No more than 20% of features should drift
    TestShareOfDriftedColumns(lt=0.2),
    # Critical features: individual tests with aggressive thresholds
    TestColumnDrift(
        column_name="monthly_charges",
        stattest="ks",
        stattest_threshold=0.05
    ),
    TestColumnDrift(
        column_name="contract_type",
        stattest="chi2",
        stattest_threshold=0.05
    ),
    TestColumnDrift(
        column_name="num_support_tickets",
        stattest="psi",
        stattest_threshold=0.1  # PSI < 0.1 = no drift
    ),
    # Dataset-level drift test
    TestDatasetDrift(stattest_threshold=0.05),
])

drift_test_suite.run(
    reference_data=reference_data[feature_columns],
    current_data=current_data[feature_columns]
)
# Pass/fail result for the pipeline
test_result = drift_test_suite.as_dict()
all_passed = all(
    test["status"] == "SUCCESS"
    for test in test_result["tests"]
)

if not all_passed:
    print("DRIFT DETECTED - Pipeline triggering retraining...")
    for test in test_result["tests"]:
        if test["status"] != "SUCCESS":
            print(f"  FAILED: {test['name']} - {test['description']}")
    # trigger_retraining_pipeline() is the entry point of the retraining
    # pipeline described later in this article
    trigger_retraining_pipeline()
else:
    print("All drift tests passed - Model healthy")
Monitoring with NannyML: Performance Without Labels
NannyML solves one of the hardest problems in model monitoring: estimating model performance when real labels are not yet available. In a churn prediction model, labels (whether the customer actually churned) might only arrive 90 days after the prediction. NannyML uses Confidence-Based Performance Estimation (CBPE) to estimate accuracy, F1 and AUC in real-time using only the score distributions.
pip install nannyml
import nannyml as nml
import pandas as pd
reference_df = pd.read_parquet("data/reference_with_targets.parquet")
analysis_df = pd.read_parquet("data/production_last_30_days.parquet")
# --- CBPE: Estimate performance without labels ---
estimator = nml.CBPE(
    y_pred_proba="churn_probability",
    y_pred="churn_predicted",
    y_true="churned",  # only present in reference data
    timestamp_column_name="prediction_date",
    problem_type="binary_classification",
    metrics=["roc_auc", "f1", "precision", "recall"],
    chunk_size=500
)
estimator.fit(reference_df)
results = estimator.estimate(analysis_df)
# Visualize results with automatic alerts
figure = results.plot()
figure.show()
# Extract metrics for alerting
estimated_metrics = results.to_df()
latest_chunk = estimated_metrics.tail(1)
auc_lower = latest_chunk["estimated_roc_auc_lower_confidence_boundary"].values[0]
if auc_lower < 0.70:
    print(f"ALERT: Estimated AUC < 0.70 (lower bound: {auc_lower:.3f})")
    trigger_retraining_pipeline()
# --- Univariate Drift Detection ---
univariate_calc = nml.UnivariateDriftCalculator(
    column_names=["monthly_charges", "tenure_months", "num_tickets"],
    timestamp_column_name="prediction_date",
    continuous_methods=["kolmogorov_smirnov", "jensen_shannon"],
    categorical_methods=["chi2", "jensen_shannon"],
    chunk_size=500
)
univariate_calc.fit(reference_df)
drift_results = univariate_calc.calculate(analysis_df)
# Plot drift over time for each feature
drift_figure = drift_results.filter(period="analysis").plot()
drift_figure.show()
Alibi Detect: Advanced Drift Detection with MMD
Alibi Detect (by Seldon) is the reference library for advanced detection that goes beyond univariate statistics. It supports MMD (Maximum Mean Discrepancy) for tabular data and images, LSDD (Least-Squares Density Difference) and outlier detection. It is ideal when you need to detect complex multivariate drift.
pip install alibi-detect
import numpy as np
from alibi_detect.cd import MMDDrift, KSDrift, TabularDrift
from alibi_detect.saving import save_detector, load_detector
X_ref = reference_data[feature_columns].values.astype(np.float32)
X_current = current_data[feature_columns].values.astype(np.float32)
# --- KS Drift for continuous features ---
ks_detector = KSDrift(
    x_ref=X_ref,
    p_val=0.05,
    alternative="two-sided"
)

ks_preds = ks_detector.predict(
    X_current,
    drift_type="batch",
    return_p_val=True,
    return_distance=True
)
print("KS Drift Results:")
print(f" Drift detected: {ks_preds['data']['is_drift']}")
print(f" p-values per feature: {ks_preds['data']['p_val']}")
print(f" Features drifted: {ks_preds['data']['is_drift'].sum()}")
# --- MMD Drift for multivariate detection ---
mmd_detector = MMDDrift(
    x_ref=X_ref,
    backend="pytorch",
    p_val=0.05,
    n_permutations=200
)

mmd_preds = mmd_detector.predict(
    X_current,
    return_p_val=True,
    return_distance=True
)
print(f"\nMMD Drift (multivariate):")
print(f" Drift detected: {mmd_preds['data']['is_drift']}")
print(f" p-value: {mmd_preds['data']['p_val']:.4f}")
print(f" MMD^2 statistic: {mmd_preds['data']['distance']:.6f}")
# --- TabularDrift: optimized for mixed tabular data ---
# Note: categorical columns must be integer-encoded before the
# astype(np.float32) conversion above
tabular_detector = TabularDrift(
    x_ref=X_ref,
    p_val=0.05,
    categories_per_feature={
        5: None,  # feature index 5 = contract_type (categorical)
        6: None   # feature index 6 = payment_method (categorical)
    },
)
# Save detector for reuse
save_detector(tabular_detector, "models/drift_detector/")
Monitoring System Architecture
A production-grade monitoring system requires multiple integrated components: a metrics collection layer, time-series storage, a visualization system and an alerting engine. The Prometheus + Grafana combination is the open-source standard for this use case, with extensive integration in the Kubernetes ecosystem.
# monitoring_service.py
from fastapi import FastAPI, BackgroundTasks
from prometheus_client import Counter, Gauge, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
import pandas as pd
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
app = FastAPI(title="ML Monitoring Service")
# --- Prometheus Metrics ---
DRIFT_GAUGE = Gauge(
    "ml_feature_drift_psi",
    "Population Stability Index per feature",
    labelnames=["feature_name", "model_name", "model_version"]
)
DATASET_DRIFT_GAUGE = Gauge(
    "ml_dataset_drift_detected",
    "1 if drift detected at dataset level, 0 otherwise",
    labelnames=["model_name", "model_version"]
)
ESTIMATED_AUC = Gauge(
    "ml_estimated_auc",
    "Estimated AUC via CBPE (NannyML)",
    labelnames=["model_name", "model_version"]
)
INFERENCE_LATENCY = Histogram(
    "ml_inference_duration_seconds",
    "Inference latency in seconds",
    labelnames=["model_name"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/drift/check")
async def trigger_drift_check(background_tasks: BackgroundTasks):
    """Manually trigger a drift check."""
    background_tasks.add_task(run_drift_check_job)
    return {"status": "drift check started"}

@app.get("/health")
async def health():
    return {"status": "healthy", "timestamp": datetime.utcnow().isoformat()}
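The `run_drift_check_job` task referenced above is not defined in the snippet. A hedged sketch of what it could look like is below: it computes a per-feature PSI and flags dataset-level drift, with comments marking where the Prometheus gauges from the service would be updated. The thresholds, the numeric-columns-only handling, and the inline PSI helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI with decile bins taken from the reference sample."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    expected = np.clip(np.bincount(np.digitize(reference, edges), minlength=n_bins) / len(reference), eps, None)
    actual = np.clip(np.bincount(np.digitize(current, edges), minlength=n_bins) / len(current), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

def run_drift_check_job(reference: pd.DataFrame, current: pd.DataFrame,
                        psi_threshold: float = 0.2, share_threshold: float = 0.2):
    """Compute per-feature PSI and decide whether dataset-level drift occurred."""
    per_feature = {}
    for col in reference.select_dtypes("number").columns:
        value = psi(reference[col].dropna().to_numpy(), current[col].dropna().to_numpy())
        per_feature[col] = value
        # In the service: DRIFT_GAUGE.labels(col, MODEL_NAME, MODEL_VERSION).set(value)
    drifted = sum(v > psi_threshold for v in per_feature.values())
    dataset_drift = drifted / max(len(per_feature), 1) > share_threshold
    # In the service: DATASET_DRIFT_GAUGE.labels(MODEL_NAME, MODEL_VERSION).set(int(dataset_drift))
    return per_feature, dataset_drift

# Smoke test on synthetic batches: one shifted feature, one stable
rng = np.random.default_rng(7)
ref = pd.DataFrame({"monthly_charges": rng.normal(70, 20, 5000),
                    "tenure_months": rng.normal(24, 10, 5000)})
cur = pd.DataFrame({"monthly_charges": rng.normal(90, 20, 5000),
                    "tenure_months": rng.normal(24, 10, 5000)})
scores, dataset_drift = run_drift_check_job(ref, cur)
print(scores, dataset_drift)
```

In the real service, `reference` and `current` would be loaded from the feature store or object storage rather than generated in-process.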
Prometheus Alerting Rules
# ml_drift_alerts.yml
groups:
  - name: ml_drift_alerts
    rules:
      - alert: HighFeatureDrift
        expr: ml_feature_drift_psi > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High drift detected on feature {{ $labels.feature_name }}"
          description: "PSI = {{ $value | humanize }} for model {{ $labels.model_name }}"

      - alert: DatasetDriftDetected
        expr: ml_dataset_drift_detected == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Dataset-level drift detected for {{ $labels.model_name }}"
          description: "Model performance may be degraded. Consider retraining."

      - alert: LowEstimatedAUC
        expr: ml_estimated_auc < 0.70
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Estimated AUC dropped below threshold"
          description: "AUC = {{ $value | humanize }} for {{ $labels.model_name }}"
Grafana Dashboard: Key Metrics to Monitor
- PSI per feature: heatmap with 0.1/0.2 color-coded thresholds (green/yellow/red)
- Drift score over time: line chart for critical features
- Estimated AUC (CBPE): time series with confidence bands
- Number of drifted features: gauge with alert threshold
- Prediction distribution: probability score histogram
- Latency and throughput: standard SLA monitoring panels
Automated Retraining Pipeline
Detecting drift is necessary but not sufficient: you also need to react automatically. An automated retraining pipeline must be triggered by drift alerts, validate the new model before replacing the production one, and guarantee rollback in case of performance regression.
# retraining_pipeline.py
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score
from datetime import datetime
import logging
logger = logging.getLogger(__name__)
MLFLOW_TRACKING_URI = "http://mlflow-server:5000"
MODEL_NAME = "churn-prediction"
MIN_AUC_THRESHOLD = 0.72
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
def train_new_model(df: pd.DataFrame) -> tuple:
    """Train a new model on fresh data."""
    feature_columns = [
        "age", "tenure_months", "monthly_charges",
        "total_charges", "num_support_tickets",
        "contract_type_encoded", "payment_method_encoded"
    ]
    X = df[feature_columns]
    y = df["churned"]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=4,
        learning_rate=0.05, subsample=0.8, random_state=42
    )
    model.fit(X_train, y_train)

    y_pred_proba = model.predict_proba(X_val)[:, 1]
    metrics = {
        "auc": roc_auc_score(y_val, y_pred_proba),
        "f1": f1_score(y_val, model.predict(X_val))
    }
    return model, metrics, feature_columns
def register_and_promote_model(model, metrics, feature_columns, trigger_reason):
    """Register the model in MLflow and promote to production if above threshold."""
    with mlflow.start_run(run_name=f"retrain_{datetime.utcnow().strftime('%Y%m%d_%H%M')}"):
        mlflow.log_param("trigger_reason", trigger_reason)
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name=MODEL_NAME
        )

    if metrics["auc"] >= MIN_AUC_THRESHOLD:
        client = mlflow.tracking.MlflowClient()
        latest = client.get_latest_versions(MODEL_NAME, stages=["None"])[0]
        client.transition_model_version_stage(
            name=MODEL_NAME,
            version=latest.version,
            stage="Production",
            archive_existing_versions=True
        )
        logger.info(f"Model v{latest.version} promoted. AUC={metrics['auc']:.4f}")
        return True

    logger.warning(f"AUC {metrics['auc']:.4f} below threshold. Manual review needed.")
    return False

def run_retraining_pipeline(trigger_reason: str = "drift_detected"):
    df = pd.read_parquet("data/training_data_fresh.parquet")
    model, metrics, feature_columns = train_new_model(df)
    return register_and_promote_model(model, metrics, feature_columns, trigger_reason)
Retraining Trigger Strategies
Defining when to retrain is just as important as how to retrain. There are three base strategies, plus a hybrid approach that combines them:
Retraining Strategies Compared
- Schedule-based: Fixed periodic retraining (weekly, monthly). Simple to implement but inefficient: retrains even when not needed and may not retrain fast enough during rapid drift episodes.
- Performance-based: Retrain when performance metrics drop below a threshold. Requires labels to be available quickly. Ideal for models with fast feedback loops (click-through rate, conversion).
- Drift-based: Retrain when statistically significant drift is detected in features or predictions. Does not require labels. Proactive approach that prevents degradation before it impacts performance. Risk of false positives.
- Hybrid (recommended): Combine drift detection as the primary trigger with performance validation as a quality gate before production promotion. Add a periodic fallback retraining as well.
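The hybrid strategy can be condensed into a small decision function. The sketch below is a minimal illustration; the thresholds and the 30-day scheduled fallback are assumptions to tune per model:

```python
from dataclasses import dataclass

@dataclass
class MonitoringSnapshot:
    share_of_drifted_features: float  # e.g. from Evidently / PSI checks
    estimated_auc_lower_bound: float  # e.g. from the NannyML CBPE confidence band
    days_since_last_retrain: int

def should_retrain(s: MonitoringSnapshot,
                   drift_share_threshold: float = 0.2,
                   auc_floor: float = 0.70,
                   max_days_between_retrains: int = 30) -> tuple:
    """Hybrid trigger: performance guard first, drift-based primary, scheduled fallback."""
    if s.estimated_auc_lower_bound < auc_floor:
        return True, "estimated_performance_below_floor"
    if s.share_of_drifted_features > drift_share_threshold:
        return True, "feature_drift_detected"
    if s.days_since_last_retrain >= max_days_between_retrains:
        return True, "scheduled_fallback"
    return False, "model_healthy"

print(should_retrain(MonitoringSnapshot(0.05, 0.81, 10)))  # healthy, no trigger
print(should_retrain(MonitoringSnapshot(0.35, 0.81, 10)))  # drift trigger fires
```

The returned reason string is exactly what the retraining pipeline above logs as `trigger_reason` in MLflow.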
Budget Under 5K EUR/Year for Small Teams
A complete drift detection system does not require enterprise budget. With the open-source and cloud-native approach, you can maintain a robust system at minimal cost:
- Evidently AI + NannyML: Open-source, free
- MLflow (self-hosted): Open-source, only infrastructure costs
- Prometheus + Grafana: Open-source, free
- Compute (VPS/cloud): ~50-100 EUR/month for a medium VM (600-1200 EUR/year)
- S3-compatible storage: ~20 EUR/month for 500GB (240 EUR/year)
- Total estimate: ~1000-2000 EUR/year for a complete stack
Best Practices for Production Drift Detection
Production Checklist
- Define statistical baselines before deploying: Split stationary validation data in two and run drift detection on the halves. The nonzero PSI and test statistics you observe on data with no real drift are the noise floor your alerting thresholds must sit above.
- Use appropriate time windows: Do not compare all historical traffic against today. Use sliding windows (7/14/30 days) to capture recent drift.
- Prioritize features by importance: Monitor high-SHAP-impact features more aggressively. Not all drift events are equally critical.
- Distinguish technical drift from semantic drift: A change in a field format (string to number) is an engineering bug, not ML drift. Add separate data quality checks.
- Avoid alert fatigue: Set conservative thresholds initially and refine over time. Too many alerts leads to ignoring them all.
- Log all retraining decisions: Every retraining must be tracked in MLflow, including trigger reason, pre/post metrics and the model version promoted.
- Test the detector itself: Periodically verify that the detection system works correctly with data injection testing (inject synthetic drift and verify it is detected).
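That last point, testing the detector itself, can be automated. Below is a minimal self-test sketch: it uses a per-feature KS detector with Bonferroni correction as a stand-in for whatever detector runs in production, and the shift size injected is an arbitrary choice:

```python
import numpy as np
from scipy import stats

def detects_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Stand-in detector: per-feature KS tests with Bonferroni correction."""
    n_features = reference.shape[1]
    p_values = [stats.ks_2samp(reference[:, i], current[:, i]).pvalue
                for i in range(n_features)]
    return any(p < alpha / n_features for p in p_values)

rng = np.random.default_rng(123)
reference = rng.normal(size=(2000, 3))

# Check 1: stationary data should NOT trigger an alert (false-positive guard)
clean = rng.normal(size=(2000, 3))
assert not detects_drift(reference, clean), "false positive on stationary data"

# Check 2: inject a synthetic mean shift on one feature and expect detection
drifted = rng.normal(size=(2000, 3))
drifted[:, 0] += 0.5  # synthetic drift injection
assert detects_drift(reference, drifted), "failed to detect injected drift"
print("Drift detector self-test passed")
```

Run as a scheduled job or CI test, this catches silent breakage of the monitoring system itself, such as a misconfigured threshold that never fires.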
Anti-Patterns to Avoid
- Automated retraining without quality gates: Never promote a freshly trained model to production without performance validation. Retraining on contaminated data can make the model worse.
- Monitoring only the output: Monitoring only predictions without input features makes it impossible to diagnose the root cause of drift.
- Fixed thresholds for all models: Each model has different sensitivity to drift. PSI > 0.2 can be catastrophic for a critical model and irrelevant for a low-priority one.
- Ignoring concept drift: If you are not collecting feedback labels from production predictions, it is impossible to detect concept drift directly. Invest in feedback loop infrastructure.
Conclusions and Next Steps
A drift detection and automated retraining system is at the heart of any mature MLOps setup. Without active monitoring, ML models in production degrade silently, generating wrong decisions that can cost far more than the monitoring system itself.
In this guide we have built a complete system: from the theoretical understanding of the four types of drift, to practical implementation with Evidently AI for interactive reports, NannyML for label-free performance estimation and Alibi Detect for advanced multivariate detection. We integrated everything with Prometheus, Grafana and an automated retraining pipeline with MLflow.
The next step is to integrate this system with a FastAPI serving layer and a Kubernetes scaling strategy, both covered in the related articles of this series. With these components in place, you will have a complete, production-grade and maintainable MLOps system.
Continue the MLOps Series
- Previous article: Experiment Tracking with MLflow: Complete Guide - logging experiments and comparing models
- Next article: Serving ML Models: FastAPI + Uvicorn in Production - building scalable inference APIs
- Deep dive: Scaling ML on Kubernetes - orchestrating deployment with KubeFlow and Seldon
- Related series: Advanced Deep Learning - monitoring for complex neural models