AI in Finance: Fraud Detection, Credit Scoring and Risk Management
In 2024, global financial fraud losses exceeded $12.5 billion, a 25% year-over-year increase. In the same year, AI-powered fraud detection systems prevented an estimated $25.5 billion in losses globally. The math is clear: AI in financial services is no longer a competitive differentiator — it is an operational necessity.
Today, 99% of financial organizations use some form of machine learning or AI to combat fraud, and 93% believe AI will revolutionize fraud detection capabilities. But how do these systems actually work? How do you build a credit scoring model that satisfies European regulatory requirements? How do you implement an AML system capable of detecting money laundering patterns across networks of thousands of transactions?
This article answers these questions with real code, concrete architectures, and a case study inspired by the Italian banking context. We will cover real-time fraud detection with Kafka and Flink, interpretable credit scoring with XGBoost and SHAP, AML detection with Graph Neural Networks, and compliance with the EU AI Act and PSD3.
What You Will Learn
- How ML-powered fraud detection works: isolation forest, gradient boosting, neural networks
- Credit scoring AI with XGBoost and LightGBM: feature engineering and real-world performance (>95% accuracy)
- Explainable AI (SHAP and LIME) for regulatory-compliant financial decisions
- Real-time architecture for transaction monitoring with Kafka and Flink
- Anti-Money Laundering with Graph Neural Networks: detecting patterns that rule-based systems miss
- RegTech: EU AI Act, PSD3, DORA and their implications for European banks
- Case study: fraud detection implementation for an Italian retail bank
Data Warehouse, AI and Digital Transformation Series
| # | Article | Focus |
|---|---|---|
| 1 | Data Warehouse Evolution | From SQL Server to Data Lakehouse |
| 2 | Data Mesh Architecture | Decentralized data ownership |
| 3 | Modern ETL vs ELT | dbt, Airbyte and Fivetran |
| 4 | Pipeline Orchestration | Airflow, Dagster and Prefect |
| 5 | AI in Manufacturing | Predictive Maintenance and Digital Twins |
| 6 | You are here - AI in Finance | Fraud Detection and Credit Scoring |
| 7 | AI in Retail | Demand Forecasting and Recommendations |
| 8 | AI in Healthcare | Diagnostics and Drug Discovery |
| 9 | AI in Logistics | Route Optimization and Warehouse Automation |
| 10 | Enterprise LLMs | RAG and Guardrails |
The AI in Finance Landscape: 2025 Data and Trends
Financial services was one of the first industries to adopt machine learning at scale, and it remains among the most advanced. The reasons are structural: abundant historical data, enormous economic incentives (preventing a $10,000 fraud delivers immediate ROI), and a regulatory framework that, while complex, explicitly requires automated monitoring systems.
Key Market Data 2025
| Indicator | 2025 Value | Trend |
|---|---|---|
| Global financial fraud losses | $12.5B (2024) | +25% year-over-year |
| Fraud prevented by AI | $25.5B | 2025 estimate |
| Financial institutions using AI | 99% | For fraud detection |
| Frauds involving generative AI | >50% | Deepfakes, synthetic identities |
| LightGBM credit scoring accuracy | 98.13% | With SHAP explainability |
| RegTech AI market size | $21.7B | Forecast, strong annual growth |
A fundamental shift in 2024-2025 is the emergence of generative AI as an offensive weapon: more than 50% of today's frauds involve AI on the criminal side, creating hyper-realistic deepfake videos, synthetic identities, and personalized phishing campaigns at scale. This has accelerated adoption of more sophisticated defense systems, moving beyond simple anomaly detection to models capable of detecting complex behavioral patterns.
The Three Main Application Areas
AI in finance revolves around three distinct but interconnected domains, each with specific technical and regulatory requirements:
AI Finance Domains
| Domain | Key Techniques | Required Latency | Relevant Regulation |
|---|---|---|---|
| Fraud Detection | Isolation Forest, GBM, Deep Learning | <100ms (real-time) | PSD2/PSD3, AI Act |
| Credit Scoring | XGBoost, LightGBM, SHAP | Seconds to minutes | CRR/CRD IV, AI Act High-Risk |
| AML (Anti-Money Laundering) | Graph Neural Networks, NLP | Hours (batch) + real-time alerts | AMLD6, FATF guidelines |
Fraud Detection with Machine Learning
Fraud detection is the most mature ML use case in finance. The primary challenge is not technical but statistical: fraud datasets are heavily imbalanced. Out of 10,000 transactions, typically only 1-3 are fraudulent (positive class), the rest are legitimate (negative class). A model that predicts "everything is legitimate" would achieve 99.9% accuracy but be completely useless.
Handling Class Imbalance
Several techniques address class imbalance: SMOTE (Synthetic Minority Over-sampling Technique) for generating synthetic minority class samples, class weights to penalize errors on rare class predictions, and appropriate evaluation metrics like AUC-ROC, Precision-Recall AUC, and F1-score instead of raw accuracy.
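To see why raw accuracy is the wrong yardstick, here is a toy check (synthetic numbers matching the 3-in-10,000 fraud rate above, not real data): a degenerate model that approves everything scores near-perfect accuracy but near-zero PR-AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(42)

# 10,000 transactions, 3 fraudulent (positive class)
y_true = np.zeros(10_000, dtype=int)
y_true[rng.choice(10_000, size=3, replace=False)] = 1

# Degenerate "model": always predicts legitimate, constant score 0
y_pred = np.zeros(10_000, dtype=int)
y_score = np.zeros(10_000)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")            # 0.9997
print(f"PR-AUC:   {average_precision_score(y_true, y_score):.4f}")  # 0.0003
```

The PR-AUC collapses to the base fraud rate, exposing a model that accuracy alone would rate as excellent.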
```python
# fraud_detection_pipeline.py
# Complete fraud detection pipeline with class imbalance handling
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report, roc_auc_score,
    average_precision_score
)
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline


def engineer_transaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build behavioral and contextual features from raw transactions."""
    df = df.copy()
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    # Temporal features
    df['hour_of_day'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['is_night'] = df['hour_of_day'].between(0, 6).astype(int)

    # Velocity features: time-based rolling windows need a DatetimeIndex
    df = df.sort_values(['account_id', 'timestamp']).set_index('timestamp')
    df['avg_amount_7d'] = (
        df.groupby('account_id')['amount']
        .transform(lambda x: x.rolling('7D', min_periods=1).mean())
    )
    df['txn_count_1h'] = (
        df.groupby('account_id')['amount']
        .transform(lambda x: x.rolling('1h', min_periods=1).count())
    )
    df['txn_count_24h'] = (
        df.groupby('account_id')['amount']
        .transform(lambda x: x.rolling('24h', min_periods=1).count())
    )
    df = df.reset_index()

    # Deviation from the customer's typical amount
    df['amount_z_score'] = (
        df.groupby('account_id')['amount']
        .transform(lambda x: (x - x.mean()) / (x.std() + 1e-8))
    )

    # Geographic anomaly: country change from previous transaction
    df['country_changed'] = (
        df.groupby('account_id')['country']
        .transform(lambda x: x != x.shift(1))
        .astype(int)
    )
    return df


def train_fraud_detector(df: pd.DataFrame):
    """
    Train a fraud detector with XGBoost + SMOTE.
    Returns the fitted pipeline, test data, and test-set probabilities.
    """
    feature_cols = [
        'amount', 'hour_of_day', 'day_of_week', 'is_weekend', 'is_night',
        'avg_amount_7d', 'txn_count_1h', 'txn_count_24h',
        'amount_z_score', 'distance_km', 'country_changed',
        'merchant_category', 'channel'
    ]
    X = df[feature_cols].fillna(0)
    y = df['is_fraud'].astype(int)

    print(f"Class distribution: {y.value_counts().to_dict()}")
    print(f"Fraud rate: {y.mean():.4%}")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Pipeline: SMOTE -> StandardScaler -> XGBoost.
    # Note: SMOTE already rebalances to ~10% fraud, so a scale_pos_weight
    # computed on the raw ratio over-corrects; in practice tune one of the two.
    pipeline = ImbPipeline([
        ('smote', SMOTE(sampling_strategy=0.1, random_state=42, k_neighbors=5)),
        ('scaler', StandardScaler()),
        ('classifier', xgb.XGBClassifier(
            n_estimators=500,
            max_depth=6,
            learning_rate=0.05,
            subsample=0.8,
            colsample_bytree=0.8,
            scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
            eval_metric='aucpr',
            tree_method='hist',
            random_state=42
        ))
    ])

    # Stratified cross-validation (5 folds)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = []
    for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
        X_cv_train, X_cv_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_cv_train, y_cv_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        pipeline.fit(X_cv_train, y_cv_train)
        score = roc_auc_score(y_cv_val, pipeline.predict_proba(X_cv_val)[:, 1])
        cv_scores.append(score)
        print(f"Fold {fold + 1} AUC-ROC: {score:.4f}")
    print(f"\nCV Mean AUC-ROC: {np.mean(cv_scores):.4f} +/- {np.std(cv_scores):.4f}")

    # Final training on the full train set
    pipeline.fit(X_train, y_train)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

    print("\n--- Classification Report ---")
    print(classification_report(y_test, pipeline.predict(X_test),
                                target_names=['Legitimate', 'Fraud']))
    print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.4f}")
    print(f"PR-AUC: {average_precision_score(y_test, y_pred_proba):.4f}")

    return pipeline, X_test, y_test, y_pred_proba
```
Two-Stage Architecture: Isolation Forest + XGBoost
Production fraud detection systems in 2025 use a two-stage architecture that balances speed and accuracy. Isolation Forest, an unsupervised algorithm, acts as a fast pre-screening layer: it identifies clearly normal transactions in under 1ms, routing them to immediate approval. Only suspicious transactions proceed to the more expensive XGBoost scoring stage, reducing computational costs while maintaining detection quality.
```python
# two_stage_fraud_pipeline.py
# Isolation Forest for fast pre-screening + XGBoost for detailed scoring

def dual_model_fraud_pipeline(transaction: dict, iso_model, xgb_pipeline) -> dict:
    """
    Two-stage pipeline:
      Stage 1: Isolation Forest - fast pre-screening (<1ms)
      Stage 2: XGBoost - precise scoring only for suspicious cases (<10ms)
    """
    features = extract_features(transaction)  # same feature order used in training

    # Stage 1: fast pre-screening
    anomaly_score = iso_model.score_samples([features])[0]

    # High score = likely normal, skip the expensive stage 2
    if anomaly_score > -0.1:
        return {
            'fraud_probability': 0.01,
            'decision': 'APPROVE',
            'stage': 'isolation_forest_fast_path',
            'latency_ms': 0.8
        }

    # Stage 2: detailed scoring only for suspected anomalies
    fraud_prob = xgb_pipeline.predict_proba([features])[0][1]
    decision = 'BLOCK' if fraud_prob > 0.7 else 'REVIEW' if fraud_prob > 0.3 else 'APPROVE'
    return {
        'fraud_probability': float(fraud_prob),
        'decision': decision,
        'stage': 'xgboost_detailed',
        'anomaly_score': float(anomaly_score)
    }
```
Credit Scoring AI: From Logistic Regression to LightGBM
Traditional credit scoring relies on linear statistical models like logistic regression, historically preferred by banks for their interpretability. But modern gradient boosting machines (GBMs), particularly XGBoost and LightGBM, have demonstrated significantly superior performance, with accuracy up to 98.13% in recent benchmarks compared to 70-75% for classical logistic regression.
The barrier to GBM adoption was not technical but regulatory: how do you explain to a customer (and to the regulator) why a loan was rejected, if the model is a "black box"? The answer is Explainable AI (XAI) with SHAP values.
```python
# credit_scoring_model.py
# LightGBM training for credit scoring with threshold optimization
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve


def train_credit_scoring_model(df):
    """
    Train LightGBM for credit scoring.
    Target: 1 = default (won't repay), 0 = good payer
    """
    feature_cols = [
        'avg_balance_12m', 'min_balance_12m', 'balance_volatility',
        'overdraft_days', 'avg_monthly_income', 'income_stability',
        'income_sources', 'avg_monthly_expenses', 'savings_ratio',
        'credit_utilization', 'active_loans', 'missed_payments',
        'max_days_late', 'repayment_ratio', 'banking_seniority_years',
        'age', 'employment_years', 'debt_income_ratio', 'employment_type_score'
    ]
    X = df[feature_cols].fillna(df[feature_cols].median())
    y = df['default_flag'].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    dtrain = lgb.Dataset(X_train, label=y_train)
    dval = lgb.Dataset(X_test, label=y_test, reference=dtrain)

    params = {
        'objective': 'binary',
        'metric': ['binary_logloss', 'auc'],
        'num_leaves': 63,
        'max_depth': 8,
        'learning_rate': 0.03,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'min_child_samples': 30,
        'lambda_l1': 0.1,
        'lambda_l2': 0.1,
        'is_unbalance': True,
        'verbose': -1,
        'random_state': 42
    }
    callbacks = [
        lgb.early_stopping(stopping_rounds=50, verbose=False),
        lgb.log_evaluation(period=100)
    ]
    # lgb.train takes the boosting-round budget as num_boost_round,
    # not as an 'n_estimators' entry in params
    model = lgb.train(params, dtrain, num_boost_round=1000,
                      valid_sets=[dtrain, dval],
                      valid_names=['train', 'val'], callbacks=callbacks)

    y_pred_proba = model.predict(X_test)
    threshold = optimize_threshold(y_test, y_pred_proba)
    auc = roc_auc_score(y_test, y_pred_proba)
    print(f"AUC-ROC: {auc:.4f}")
    print(f"Optimal threshold: {threshold:.3f}")
    return model, threshold


def optimize_threshold(y_true, y_prob) -> float:
    """
    Find the threshold maximizing F1-score.
    Can be calibrated to balance the cost of a default (FN)
    against the cost of an incorrectly rejected good customer (FP).
    """
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
    f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
    optimal_idx = np.argmax(f1_scores)
    return thresholds[optimal_idx] if optimal_idx < len(thresholds) else 0.5
```
Explainable AI: SHAP and LIME for Financial Decisions
The EU AI Act classifies credit scoring systems as High-Risk AI (Annex III, point 5). This implies stringent obligations: detailed technical documentation, conformity assessment, human oversight, and above all, transparency in automated decisions. The GDPR (Art. 22) and CRD IV banking regulation already require that automated credit decisions be explainable to customers.
SHAP (SHapley Additive exPlanations) is the standard tool for meeting these requirements. Based on cooperative game theory, it calculates the marginal contribution of each feature to the final prediction in a mathematically rigorous and consistent way.
```python
# explainable_credit_scoring.py
# SHAP for explaining credit scoring decisions
import shap
import pandas as pd
import numpy as np
import lightgbm as lgb


def explain_credit_decision(
    model: lgb.Booster,
    application_features: pd.DataFrame,
    feature_names: list,
    threshold: float = 0.4
) -> dict:
    """
    Generate a SHAP explanation for a single credit decision.
    Returns both a technical explanation and human-readable text.
    Required by GDPR Art. 22 and the EU AI Act for high-risk AI systems.
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(application_features)
    if isinstance(shap_values, list):
        shap_vals = shap_values[1]  # positive class (default)
    else:
        shap_vals = shap_values

    contributions = pd.DataFrame({
        'feature': feature_names,
        'value': application_features.values[0],
        'shap_value': shap_vals[0],
        'impact': ['INCREASES_RISK' if v > 0 else 'REDUCES_RISK'
                   for v in shap_vals[0]]
    }).sort_values('shap_value', key=abs, ascending=False)

    prob_default = model.predict(application_features)[0]
    decision = 'REJECTED' if prob_default >= threshold else 'APPROVED'
    explanation = generate_human_readable_explanation(
        contributions.head(5), decision, prob_default
    )
    return {
        'decision': decision,
        'default_probability': float(prob_default),
        'credit_score': int((1 - prob_default) * 1000),  # 0-1000 scale
        'explanation_text': explanation,
        'feature_contributions': contributions.to_dict('records'),
        'audit_timestamp': pd.Timestamp.now().isoformat()
    }


def generate_human_readable_explanation(
    contributions: pd.DataFrame,
    decision: str,
    probability: float
) -> str:
    """
    Generate a natural-language explanation for the customer.
    Satisfies the GDPR Art. 22 right to explanation and
    the AI Act transparency obligations.
    """
    lines = [
        f"Decision: {decision} (credit score: {int((1 - probability) * 1000)}/1000)",
        "\nMain factors influencing this decision:"
    ]
    FEATURE_LABELS = {
        'debt_income_ratio': 'Monthly installment-to-income ratio',
        'overdraft_days': 'Days with negative balance (last 12 months)',
        'max_days_late': 'Maximum payment delay (days)',
        'savings_ratio': 'Savings capacity',
        'banking_seniority_years': 'Banking relationship seniority',
        'avg_monthly_income': 'Estimated monthly income',
        'income_stability': 'Income stability',
        'missed_payments': 'Historical missed payments',
        'balance_volatility': 'Balance variability',
        'employment_type_score': 'Employment type'
    }
    for _, row in contributions.iterrows():
        label = FEATURE_LABELS.get(row['feature'], row['feature'])
        if row['shap_value'] > 0:
            lines.append(f"  [-] {label} (increases risk)")
        else:
            lines.append(f"  [+] {label} (reduces risk)")
    if decision == 'REJECTED':
        lines.append(
            "\nTo request a review or additional information, "
            "please contact your banking advisor."
        )
    return "\n".join(lines)
```
SHAP vs LIME: When to Use Which
| Aspect | SHAP | LIME |
|---|---|---|
| Theoretical foundation | Shapley values (game theory) | Local linear approximation |
| Consistency | High (mathematical guarantees) | Medium (stochastic) |
| Speed | Slower (TreeSHAP fast for GBMs) | Faster |
| Explanation type | Global + local | Local only |
| Use in banking production | De facto standard (audit-ready) | Complementary for debugging |
| AI Act compliance | Adequate (with documentation) | Insufficient alone |
Real-Time Transaction Monitoring with Kafka and Flink
Payment fraud detection cannot wait: a fraudulent credit card transaction must be blocked before completion, not after. This requires a streaming architecture capable of processing millions of events per second with sub-100 millisecond latency.
The 2025 standard architecture combines Apache Kafka as the event message broker, Apache Flink for stateful real-time processing with ML model inference, and a feature store to enrich events with pre-computed behavioral features.
Real-Time Fraud Detection Architecture
| Component | Technology | Role | Typical Latency |
|---|---|---|---|
| Event Ingestion | Apache Kafka | Buffer incoming transactions | <5ms |
| Feature Enrichment | Redis + Feature Store | Add behavioral features | <10ms |
| ML Inference | Apache Flink + ONNX | GBM model scoring | <20ms |
| Decision Engine | Rules + ML Score | APPROVE / REVIEW / BLOCK | <5ms |
| Alert & Case Mgmt | Elasticsearch + Kibana | Investigator dashboard and alerts | Near real-time |
| Feedback Loop | Kafka + MLflow | Incremental retraining | Daily/weekly |
```python
# real_time_fraud_scoring.py
# Flink-based real-time fraud detection scoring logic
import redis
import joblib
from datetime import datetime, timezone

MODEL = joblib.load('/models/fraud_detector_v2.pkl')
REDIS_CLIENT = redis.Redis(host='redis', port=6379, db=0)


def enrich_with_behavioral_features(transaction: dict) -> dict:
    """
    Enrich the transaction with pre-computed behavioral features from
    the Redis feature store. Features are updated every 5 minutes
    by a separate streaming job.
    """
    account_id = transaction['account_id']
    cached = REDIS_CLIENT.hgetall(f"features:{account_id}")
    if cached:
        transaction['avg_amount_7d'] = float(cached.get(b'avg_amount_7d', 0))
        transaction['txn_count_24h'] = int(cached.get(b'txn_count_24h', 0))
        transaction['max_amount_30d'] = float(cached.get(b'max_amount_30d', 0))
        transaction['countries_visited_30d'] = int(cached.get(b'countries_visited_30d', 1))
    else:
        # Defaults for new accounts (no history)
        transaction.update({
            'avg_amount_7d': 0.0, 'txn_count_24h': 0,
            'max_amount_30d': 0.0, 'countries_visited_30d': 1
        })
    return transaction


def score_transaction(transaction: dict) -> dict:
    """
    Apply the ML model to compute a fraud score.
    Runs inside Flink for each incoming event.
    Total p99 latency target: <50ms end-to-end.
    """
    transaction = enrich_with_behavioral_features(transaction)
    features = [
        transaction.get('amount', 0),
        transaction.get('hour_of_day', 12),
        transaction.get('is_weekend', 0),
        transaction.get('is_night', 0),
        transaction.get('avg_amount_7d', 0),
        transaction.get('txn_count_24h', 0),
        transaction.get('amount_z_score', 0),
        transaction.get('distance_km', 0),
        transaction.get('country_changed', 0),
        transaction.get('channel_encoded', 0),
    ]

    # Apply hard rules first (deterministic, high precision)
    if apply_hard_rules(transaction):
        return {
            **transaction,
            'fraud_score': 1.0,
            'decision': 'BLOCK',
            'rule_triggered': True,
            'scored_at': datetime.now(timezone.utc).isoformat()
        }

    fraud_prob = MODEL.predict_proba([features])[0][1]
    decision = 'BLOCK' if fraud_prob >= 0.70 else 'REVIEW' if fraud_prob >= 0.30 else 'APPROVE'
    return {
        **transaction,
        'fraud_score': float(fraud_prob),
        'decision': decision,
        'rule_triggered': False,
        'scored_at': datetime.now(timezone.utc).isoformat()
    }


def apply_hard_rules(txn: dict) -> bool:
    """
    Deterministic rules for immediate blocks.
    High precision, run before the ML model.
    """
    # Rule 1: Amount 10x above the customer average
    # (guard against new accounts, whose average defaults to 0)
    avg_7d = txn.get('avg_amount_7d', 0)
    if avg_7d > 0 and txn.get('amount', 0) > avg_7d * 10:
        return True
    # Rule 2: Velocity check - more than 5 transactions in 1 hour
    if txn.get('txn_count_1h', 0) > 5:
        return True
    # Rule 3: Impossible travel (2 countries in less than 2 hours)
    if txn.get('distance_km', 0) > 1000 and txn.get('hours_since_last', 999) < 2:
        return True
    return False
```
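The streaming code above assumes `distance_km` is already attached to the event. One common way to derive it from the coordinates of two consecutive card-present transactions is the haversine formula; the helper below is a sketch of that step, not part of the pipeline above.

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Milan to London is on the order of 950 km: enough to trip the
# impossible-travel rule if the two swipes are under 2 hours apart
distance = haversine_km(45.4642, 9.1900, 51.5074, -0.1278)
```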
Anti-Money Laundering with Graph Neural Networks
Money laundering is structurally different from card fraud: it involves networks of entities (people, companies, bank accounts) connected by complex transactions. Traditional rule-based systems detect known patterns (structuring, layering, integration) but miss emerging patterns and indirect connections.
Graph Neural Networks (GNNs) represent the natural evolution for AML: they model the financial system as a graph where nodes are entities and edges are transactions. GNNs learn to classify nodes as suspicious or normal by aggregating information from their neighborhood, capturing exactly the kind of patterns that money launderers hide in network complexity.
```python
# aml_graph_neural_network.py
# GNN for AML detection using PyTorch Geometric
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, SAGEConv
from torch_geometric.data import Data

# Entities already flagged in historical SARs / watchlists,
# assumed to be loaded elsewhere; used to label training nodes
KNOWN_AML_ENTITIES: set = set()


class AMLGraphAttentionNetwork(nn.Module):
    """
    Graph Attention Network for AML detection.
    Uses an attention mechanism to weight the most relevant neighbors.
    Architecture: GAT layer -> SAGEConv layer -> classifier
    """

    def __init__(
        self,
        in_channels: int,
        edge_channels: int,
        hidden_channels: int = 128,
        num_heads: int = 8,
        dropout: float = 0.3
    ):
        super().__init__()
        # Layer 1: Graph Attention (aggregate neighbors with attention)
        self.conv1 = GATConv(
            in_channels,
            hidden_channels // num_heads,
            heads=num_heads,
            dropout=dropout,
            edge_dim=edge_channels
        )
        # Layer 2: GraphSAGE (robust aggregation for sparse graphs)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)
        # Final classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_channels, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 2)  # 0: legitimate, 1: AML
        )
        self.dropout = nn.Dropout(dropout)
        self.bn1 = nn.BatchNorm1d(hidden_channels)

    def forward(self, x, edge_index, edge_attr):
        x = self.conv1(x, edge_index, edge_attr)
        x = F.elu(self.bn1(x))
        x = self.dropout(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)
        return self.classifier(x)


def build_transaction_graph(transactions: pd.DataFrame) -> Data:
    """
    Build a financial graph from a transaction dataframe.
    Nodes: bank accounts / entities
    Edges: transactions (features: amount, type, time)
    """
    entities = pd.concat([
        transactions['sender_account'],
        transactions['receiver_account']
    ]).unique()
    node_map = {e: i for i, e in enumerate(entities)}
    num_nodes = len(entities)

    # Aggregate per-node flow statistics
    node_features = np.zeros((num_nodes, 8))
    for _, txn in transactions.iterrows():
        i = node_map[txn['sender_account']]
        j = node_map[txn['receiver_account']]
        node_features[i][0] += txn['amount']  # total outflow
        node_features[i][1] += 1              # outgoing txn count
        node_features[j][2] += txn['amount']  # total inflow
        node_features[j][3] += 1              # incoming txn count
    node_features = (node_features - node_features.mean(0)) / (node_features.std(0) + 1e-8)

    edge_index = torch.tensor([
        [node_map[s] for s in transactions['sender_account']],
        [node_map[r] for r in transactions['receiver_account']]
    ], dtype=torch.long)
    edge_attr = torch.tensor(transactions[[
        'amount_normalized', 'transaction_type_encoded',
        'hour_sin', 'hour_cos', 'is_round_amount'
    ]].values, dtype=torch.float)
    y = torch.tensor(
        [1 if e in KNOWN_AML_ENTITIES else 0 for e in entities],
        dtype=torch.long
    )
    return Data(
        x=torch.tensor(node_features, dtype=torch.float),
        edge_index=edge_index,
        edge_attr=edge_attr,
        y=y
    )
```
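Before reaching for PyTorch Geometric, it helps to see what a single message-passing layer actually computes. The numpy sketch below (toy graph, random weights standing in for learned parameters) performs one round of mean-neighbor aggregation, ReLU(Â·H·W), which is the core operation every GNN layer above builds on.

```python
import numpy as np

# Toy graph: 4 accounts, directed edges sender -> receiver
edges = [(0, 1), (1, 2), (2, 3), (3, 1)]
num_nodes = 4

# Adjacency with self-loops, row-normalized => mean aggregation
A_hat = np.eye(num_nodes)
for sender, receiver in edges:
    A_hat[receiver, sender] = 1.0  # each receiver aggregates from its senders
A_hat /= A_hat.sum(axis=1, keepdims=True)

# Node features: [total_outflow, total_inflow], standardized offline
H = np.array([[1.2, -0.5], [0.3, 1.4], [-0.1, 0.2], [-0.9, -0.3]])

# Random weight matrix in place of the learned parameters
W = np.random.default_rng(0).normal(size=(2, 2))

# One message-passing layer: ReLU(A_hat @ H @ W)
H_next = np.maximum(A_hat @ H @ W, 0.0)
```

Stacking such layers lets information flow multiple hops through the account network, which is exactly how a GNN surfaces indirect connections that single-hop rules never see.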
GNN vs Rule-Based AML: Comparison
| Metric | Rule-Based | GNN |
|---|---|---|
| False Positive Rate | High (40-60%) | Low (10-20%) |
| New patterns | Not detected (zero-day) | Detectable (generalization) |
| Indirect connections | Max 1-2 hops | Multi-hop (up to 5+ degrees) |
| Maintenance burden | High (manual rules) | Low (auto-updates with retraining) |
| Explainability | High (readable rules) | Medium (attention weights) |
| Implementation cost | Low | High (data, infrastructure) |
RegTech: EU AI Act, PSD3 and Automated Compliance
The European financial sector is navigating an unprecedented regulatory transformation. Three major frameworks overlap with direct implications for banking AI systems:
EU Regulatory Framework for AI in Finance 2025-2027
| Regulation | Effective Date | AI Finance Impact |
|---|---|---|
| EU AI Act | Feb 2025 (phased until Aug 2027) | Credit scoring = High-Risk AI. Obligations: registry, audit, human oversight, transparency |
| PSD3 | Adoption 2025-2026, transposition 2027 | Enhanced SCA, open banking APIs, fraud liability shift, expanded data sharing |
| DORA | Jan 2025 (in force) | Mandatory ICT resilience, 4h incident reporting, AI system penetration testing |
| AMLD6 | Progressive application 2025 | Digital KYC, beneficial ownership, mandatory transaction monitoring |
EU AI Act Compliance Checklist for Banks (High-Risk AI)
- AI Systems Registry: Document every AI system used for credit decisions in the EU high-risk AI registry (mandatory from August 2026)
- Conformity Assessment: Pre-deployment assessment with risk analysis, bias testing, performance across protected demographic groups
- Human Oversight: Mandatory human review process for high-impact decisions (loan rejections above defined thresholds)
- Transparency: Customer must be informed that the decision was made (or assisted) by an AI system
- Accuracy and Robustness: Continuous performance monitoring, adversarial testing, data drift detection
- Technical Documentation: Model architecture, training data, performance metrics, bias analysis (Art. 11)
- Logs and Audit Trail: Retention of all AI decision logs for at least 5 years (Art. 12)
- Right to Explanation: Customer entitled to a meaningful explanation of any AI decision (GDPR Art. 22 + AI Act)
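The logging obligation (Art. 12) is concrete enough to sketch in code. Below is a minimal append-only audit record, assuming the SHAP contributions and model version are available at decision time; the field names and the hashing scheme are illustrative choices, not prescribed by the regulation.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(application_id: str, features: dict, shap_top5: list,
                       decision: str, default_probability: float,
                       model_version: str) -> str:
    """Serialize one credit decision as a tamper-evident JSON audit line."""
    record = {
        'application_id': application_id,
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'model_version': model_version,
        'features': features,                 # full input vector
        'top_shap_contributions': shap_top5,  # explanation snapshot
        'default_probability': round(default_probability, 6),
        'decision': decision,
        'human_review_required': decision == 'REJECTED',  # oversight hook
    }
    # Digest over the sorted record supports later integrity checks
    record['record_sha256'] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return json.dumps(record)

line = build_audit_record(
    'APP-2025-001', {'debt_income_ratio': 0.42},
    [('debt_income_ratio', 0.31)], 'REJECTED', 0.55, 'lgbm-credit-v3.2'
)
```

Writing one such line per decision to append-only storage (and retaining it for the mandated period) covers both the audit-trail and the right-to-explanation items above.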
Case Study: Fraud Detection for an Italian Retail Bank
Here we present a case study inspired by real implementations in the Italian banking context: a mid-size bank (approximately 500,000 retail customers) that migrated from a legacy rule-based system to a hybrid ML + rules approach for debit and credit card fraud detection.
Before and After Comparison
| Metric | Before (Rule-Based) | After (ML Hybrid) | Improvement |
|---|---|---|---|
| Fraud detection rate | 72% | 91% | +19 points |
| False positive rate | 5.2% | 0.8% | -84% |
| Average decision latency | 450ms | 85ms | -81% |
| Customer complaints (wrong blocks) | 1,200/month | 190/month | -84% |
| Net annual fraud cost | €4.2M | €1.1M | -74% |
| Fraud team headcount | 22 FTE | 14 FTE | -36% |
Implementation Roadmap (14 months)
- Phase 1 - Foundation (Months 1-4): Consolidation of 5 years of fraud history (120M transactions), feature engineering, training dataset labeling by investigators. Kafka infrastructure for transaction streaming. Redis feature store for real-time features. Baseline Logistic Regression model (AUC 0.78).
- Phase 2 - ML Core (Months 5-10): XGBoost + SMOTE training (AUC 0.94). Flink real-time scoring pipeline. Core banking integration. A/B test: 20% traffic on new system, 80% on legacy. Grafana monitoring dashboard. Threshold tuning for FP/FN balance.
- Phase 3 - Production & Compliance (Months 11-14): Full 100% traffic rollout. SHAP implementation for decision logging. Technical documentation for EU AI Act compliance (anticipating 2026 obligations). Monthly retraining pipeline with MLflow. Integration with fraud investigators' case management system.
Key Challenges and How They Were Solved
- Poor data quality in historical data: 5 years of data contained imprecise fraud labels. Solution: manual re-labeling campaign on stratified sample plus automatic cleaning rules for obvious cases (confirmed chargebacks = confirmed fraud).
- Legacy core banking latency: The Temenos T24 system was not designed for sub-100ms real-time integrations. Solution: asynchronous pre-authorization pattern with Kafka to decouple the scoring system from the final approval process.
- Demographic bias: The initial model showed higher false positive rates for transactions in certain regions with thinner transaction history. Solution: fairness constraints during training, differentiated thresholds for segments with limited history, monthly bias monitoring.
Best Practices and Anti-Patterns for AI in Finance
Established Best Practices
- Start with data quality: Before building any ML model, invest 2-3 months understanding and cleaning historical data. A model on dirty data performs worse than a rule-based system. The garbage-in-garbage-out rule applies doubly in finance, where incorrect decisions carry legal consequences.
- Use the right metrics for imbalanced classes: Never use accuracy as the primary metric for fraud detection. Use AUC-ROC, Precision-Recall AUC, F1-score. Explicitly define the relative cost of FP (customer wrongly blocked) and FN (undetected fraud) to calibrate the optimal decision threshold.
- Hybrid rules + ML model: Never completely eliminate deterministic rules. Rules capture known patterns with high precision and are auditable. ML excels at complex and novel patterns. The optimal architecture uses both in separate layers.
- Centralized feature store: Compute behavioral features (velocity, rolling averages, aggregations) once and store them in a feature store (Redis, Feast, Tecton). This ensures consistency between training and inference and reduces production latency.
- Champion-challenger testing: Never do big-bang deployments. Always use A/B testing with traffic splitting. The challenger model receives 10-20% of traffic and is promoted only if its metrics surpass the current champion.
- Complete decision logging: Store every prediction with its full feature vector, SHAP values, final decision, and verified outcome. This is essential for audits, debugging, and retraining.
Anti-Patterns to Avoid
- Data leakage: Do not include features that are only available at labeling time (e.g., "this account was blocked" as a training feature). The model learns to predict the label instead of the underlying phenomenon.
- Overfitting to known fraud patterns: A model trained only on historical fraud schemes fails on new patterns. Always include robust negative sampling and test on future time windows (forward validation, not random split).
- Ignoring the feedback loop: If the model blocks a transaction, it will never know whether it was actually fraudulent. This selection bias progressively distorts the training set. Implement a random sampling review of blocked transactions.
- Uncalibrated scores: XGBoost and LightGBM do not produce calibrated probabilities by default. A fraud_score of 0.8 does not mean "80% probability of fraud." Use Platt scaling or isotonic regression to calibrate probability outputs.
- Deploying without monitoring: Financial models degrade quickly. An excellent model in June can be mediocre in December due to seasonal changes and evolving fraud patterns. PSI and feature drift monitoring are non-negotiable.
Conclusions and Next Steps
AI in financial services has matured rapidly from simple anomaly detection to complex systems combining gradient boosting, graph neural networks, real-time streaming, and explainability to satisfy increasingly stringent regulatory requirements. 2025-2026 will be a critical period: the EU AI Act brings credit scoring systems under formal regulation, while PSD3 and DORA redefine security and resilience requirements.
For European banks and fintechs, the message is clear: those who have not yet modernized their fraud detection and credit scoring systems have a window of opportunity in 2025-2026 to do so ahead of regulatory obligations, turning compliance into a competitive advantage.
Key Takeaways
- 99% of financial institutions already use ML for fraud detection, but implementation quality varies enormously
- XGBoost and LightGBM with SMOTE dominate credit scoring: >95% AUC performance with SHAP explainability for compliance
- The Kafka + Flink + Redis real-time architecture is the standard for sub-100ms transaction monitoring
- Graph Neural Networks are transforming AML: detecting money laundering patterns in complex networks that rules miss entirely
- The EU AI Act classifies credit scoring as High-Risk AI: formal obligations from 2026, but preparing now is the smart move
- The hybrid rules + ML model is the best production choice: ML performance with the robustness and auditability of deterministic rules
The next article in the series explores AI in Retail, covering demand forecasting with time series, recommendation engines with collaborative filtering and matrix factorization, and dynamic pricing with reinforcement learning. Different techniques from finance, but many architectural patterns in common: feature stores, A/B testing, and MLOps for models in production.
Further Reading and Resources
- Cross-series link: MLOps series - "MLOps for Business: AI Models in Production with MLflow" for managing the lifecycle of financial models
- Cross-series link: AI Engineering series - "Enterprise LLMs: RAG and Guardrails" for banking customer care chatbots with compliance guardrails
- Official resources: EBA Guidelines on Internal Governance and the EU AI Act official text for regulatory context
- Public datasets to experiment with: IEEE-CIS Fraud Detection on Kaggle (590k transactions, 1% fraud rate) and Home Credit Default Risk for credit scoring