05 - Testing AI-Generated Code: Strategies, Frameworks, and Quality Gates
2025 confirmed an uncomfortable truth: AI-generated code is functional, fast to produce, and often elegant in structure. But it is not safe. According to the Veracode GenAI Code Security Report 2025, which analyzed over 100 LLMs across 80 coding tasks in Java, Python, C#, and JavaScript, 45% of AI-generated code fails security tests and introduces OWASP Top 10 vulnerabilities into codebases. A further 62% of samples exhibit design flaws: not obvious bugs, but deep architectural errors that surface only in production. Overall, AI-generated code introduces 2.74 times more vulnerabilities than human-written code.
These numbers are not an indictment of vibe coding. They are a reminder that AI changes who writes the code, not the need to test it. In fact, they amplify that need. When an AI agent generates 500 lines of code in 30 seconds, the productivity bottleneck shifts from writing to verification. Developers without a robust testing strategy for AI code end up with codebases whose quality is opaque, whose bugs are unpredictable, and whose security is a gamble.
This article builds that strategy: from understanding the unique failure patterns of AI code, through unit testing, property-based testing, static analysis, SAST, and mutation testing, to an operational checklist for deciding when to accept or reject AI-generated code. The target audience is developers who already use AI tools in their workflow and want to build the confidence needed to do it professionally.
What You Will Learn
- Why AI-generated code fails differently than human-written code
- Specific bug types: hallucinated APIs, logic errors, security holes
- Unit testing and property-based testing with pytest and Hypothesis
- Static analysis with ESLint and SonarQube configured for AI code
- Security testing with Semgrep: custom rules for AI patterns
- TDD with AI assistant: Red-Green-Refactor becomes Red-AI-Green-Review
- Automated code review through multi-agent pipelines
- Quality metrics: coverage, mutation score, cyclomatic complexity
- Operational checklist for accepting or rejecting AI code
Why AI-Generated Code Requires Specific Testing
Code written by a human developer carries a mental model of the system. The developer knows what lies above and below the module they are writing, has seen the code run in production, and understands the edge cases that burn. AI has no such context. It operates on statistical patterns extracted from billions of lines of code, and generates solutions that are plausible given the prompt, not necessarily correct given the system.
This difference produces a fundamentally different failure profile. Human code tends to fail predictably: a tired developer copies the wrong snippet, forgets a null-check, applies yesterday's business logic to today's model. Localized, traceable errors, often caught by existing tests. AI code fails structurally: it implements the wrong API with absolute consistency, introduces a vulnerability pattern across dozens of functions, builds an architecture that works perfectly for 95% of cases and collapses on the remaining 5% with behaviors impossible to anticipate without a purpose-built test suite.
The Problem of Synthetic Confidence
AI generates code with equal confidence regardless of correctness. A model that hallucinates a nonexistent library function does so with the same certainty as when implementing a correct algorithm. There is no signal of uncertainty in the generated text. This makes visual review unreliable: the code looks right because it is written well. Only execution reveals the truth.
Three specific characteristics of AI code require a different testing approach than traditional testing:
- Contextual variance: The same prompt on different codebases produces code with very different quality profiles. Testing must be specific to the deployment context, not just the abstract functionality.
- Internal consistency of errors: If AI introduces a wrong pattern (for example, unsafe handling of user input), it replicates it across all similar functions. Bugs are not isolated: they are systemic. This requires tests that search for patterns, not just individual cases.
- Outdated knowledge: Models have knowledge cutoffs. They use deprecated APIs, obsolete security patterns, dependencies with known vulnerabilities. Testing must include a layer of dependency and security pattern verification for current standards.
Bug Types in AI-Generated Code
Before building an effective testing strategy, you need to understand what you are looking for. AI code produces distinct bug categories, each with its own characteristics requiring different detection approaches.
1. Hallucinated APIs
The most insidious bug is also the most documented. AI invents functions, methods, parameters, and modules that do not exist, but with names so plausible they pass visual review without difficulty. A classic example: pandas.DataFrame.filter_by_threshold() does not exist, but it sounds exactly like something that should. The code parses cleanly (absent strict type checking), passes linting, and fails at runtime with an AttributeError that surfaces only when the function is called with real data.
The primary defense against hallucinated APIs is execution, not review. Unit tests that actually call the code, even with minimal inputs, detect this error type immediately. Static type checking with mypy (Python) or TypeScript strict mode adds a layer of protection at compile time.
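The point generalizes beyond pandas. A minimal, self-contained sketch (both ApiClient and load_users are hypothetical stand-ins invented for this example, not from any real library): the generated call looks plausible, passes review, and dies on first execution.

```python
class ApiClient:
    """Hypothetical HTTP client stand-in; only .get() actually exists."""

    def get(self, path: str) -> dict:
        return {"status": 200, "path": path}


def load_users(client: ApiClient) -> dict:
    # Hypothetical AI output: fetch_json() sounds plausible but does not exist
    return client.fetch_json("/users")


# A smoke test that merely executes the code exposes the hallucination:
try:
    load_users(ApiClient())
except AttributeError as exc:
    print(f"Hallucinated API caught at runtime: {exc}")
```

Because load_users is annotated against ApiClient, a static checker like mypy would also flag the nonexistent attribute before the code ever runs, which is the cheapest place to catch it.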
2. Logic Errors and Off-by-One
AI code is surprisingly good with simple logic and surprisingly fragile at boundaries. Array indices, inclusive vs exclusive ranges, loop termination conditions: these are where AI code introduces subtle errors that standard tests do not catch unless specifically designed to test boundary values.
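As an illustration, consider a hypothetical AI-generated sliding-window sum (the function and its bug are invented for this example). The happy-path output looks reasonable; only an assertion about the expected number of windows reveals the boundary error.

```python
def moving_sum(values: list[int], window: int) -> list[int]:
    """Hypothetical AI-generated sliding-window sum with a boundary bug."""
    sums = []
    # Bug: the loop stops one window early; the correct bound is
    # len(values) - window + 1.
    for i in range(len(values) - window):
        sums.append(sum(values[i:i + window]))
    return sums


result = moving_sum([1, 2, 3, 4], window=2)
print(result)  # [3, 5] -- the final window [3, 4] is silently dropped

# The boundary check a happy-path suite misses: there must be
# len(values) - window + 1 windows, i.e. 3 here, not 2.
assert len(result) == 2  # documents the bug; the correct length is 3
```

A test asserting `len(result) == len(values) - window + 1` fails immediately, while spot checks of individual sums pass.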
3. Security Holes from Known Patterns
The Veracode 2025 Report documents that LLMs fail to defend against XSS in 86% of relevant cases and against log injection in 88% of cases. These are not random bugs: they are a consequence of training data containing enormous amounts of vulnerable code that the model learned to replicate. SQL injection, SSRF, path traversal, improper deserialization: systematic patterns requiring dedicated SAST, not just functional tests.
4. Architectural Design Flaws
The hardest to detect with automated tests. AI builds solutions that work locally but do not scale, that have the right interfaces but wrong responsibilities, that respect the required contract but violate fundamental architectural principles (single responsibility, separation of concerns, loose coupling). The 62% design flaw rate cited by Veracode falls into this category. Detection requires a mix of complexity metrics, dependency analysis, and structured code review.
Testing Frameworks for AI Code
Unit Test Strategy: Testing Boundaries
The unit testing strategy for AI code must be shifted toward boundary values and non-obvious cases. AI code handles the happy path well (it was trained on millions of happy path examples). It fails on boundaries, anomalous inputs, and error handling.
An effective unit test suite for AI code includes:
- Tests with inputs at the edge of valid range (zero, negative, max-int values)
- Tests with empty, null, None, empty string inputs
- Tests with inputs containing special characters or unexpected encoding
- Tests verifying behavior on error (exception handling)
- Tests verifying side effects (changes to shared state, database, filesystem)
# test_ai_generated.py
# Testing strategy for AI-generated code
import pytest

# AI-generated function under test (example)
# def process_user_input(data: dict) -> dict:
#     return {
#         "name": data["name"].strip().title(),
#         "age": int(data["age"]),
#         "email": data["email"].lower()
#     }
from my_module import process_user_input


class TestAIGeneratedBoundaries:
    """Boundary tests for AI-generated code."""

    def test_nominal_case(self):
        """Happy path - the case AI almost certainly handled correctly."""
        result = process_user_input({
            "name": "john smith",
            "age": "30",
            "email": "JOHN@EXAMPLE.COM"
        })
        assert result["name"] == "John Smith"
        assert result["age"] == 30
        assert result["email"] == "john@example.com"

    def test_empty_name(self):
        """Edge case: empty name - often unhandled by AI."""
        with pytest.raises((ValueError, KeyError)):
            process_user_input({"name": "", "age": "25", "email": "a@b.com"})

    def test_negative_age(self):
        """Edge case: negative age - AI often skips range validation."""
        with pytest.raises(ValueError, match="age must be positive"):
            process_user_input({"name": "Test", "age": "-5", "email": "a@b.com"})

    def test_missing_required_field(self):
        """Edge case: missing field - KeyError or default?"""
        with pytest.raises(KeyError):
            process_user_input({"name": "Test", "age": "25"})

    def test_sql_injection_in_name(self):
        """Security: injection input must not pass through unmodified."""
        malicious = "'; DROP TABLE users; --"
        result = process_user_input({
            "name": malicious,
            "age": "25",
            "email": "a@b.com"
        })
        # Name must be sanitized or rejected
        assert "DROP TABLE" not in result.get("name", "")

    def test_extremely_long_input(self):
        """Edge case: very long input - buffer overflow / ReDoS."""
        long_name = "a" * 10000
        # Must not hang indefinitely or crash
        with pytest.raises((ValueError, OverflowError)):
            process_user_input({"name": long_name, "age": "25", "email": "a@b.com"})

    def test_unicode_name(self):
        """Unicode compatibility: non-ASCII names."""
        result = process_user_input({
            "name": "sofía garcía",
            "age": "28",
            "email": "sofia@example.es"
        })
        assert result["name"] == "Sofía García"


class TestAIGeneratedWithMocks:
    """Tests with mocks to isolate dependencies."""

    def test_database_not_called_on_validation_error(self, mocker):
        """Verifies DB is not called when validation fails."""
        mock_db = mocker.patch("my_module.database.save")
        with pytest.raises(ValueError):
            process_user_input({"name": "", "age": "25", "email": "a@b.com"})
        mock_db.assert_not_called()  # AI often calls DB before validating
Property-Based Testing with Hypothesis
Property-based testing is the most powerful tool available for testing AI code. Instead of defining individual test cases, you define invariant properties that must hold for any input within a specified domain, and Hypothesis automatically generates hundreds of inputs to find counterexamples. This technique is particularly effective against AI code because it systematically explores the input space, including edge cases we would not think to test manually.
A 2025 paper on arXiv (Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem) demonstrated that a Claude Code agent generating Hypothesis tests finds different and complementary bugs compared to manually written tests, with few false positives. The approach establishes a new paradigm for scalable software auditing, analyzing the target module, proposing grounded properties, and executing them iteratively with pytest.
# test_properties.py
# Property-based testing with Hypothesis for AI-generated code
import pytest
from hypothesis import given, strategies as st, settings, assume
from hypothesis import HealthCheck

# AI-generated function: discount calculation
# def calculate_discount(price: float, discount_percent: float) -> float:
#     if discount_percent < 0 or discount_percent > 100:
#         raise ValueError("Discount must be between 0 and 100")
#     return price * (1 - discount_percent / 100)
from pricing import calculate_discount


# Property 1: identity - 0% discount does not change the price
@given(price=st.floats(min_value=0.01, max_value=1_000_000.0, allow_nan=False))
def test_zero_discount_preserves_price(price):
    """Zero discount must return the original price."""
    result = calculate_discount(price, 0.0)
    assert abs(result - price) < 0.01  # floating point tolerance


# Property 2: monotonicity - higher discount = lower price
@given(
    price=st.floats(min_value=1.0, max_value=10000.0, allow_nan=False),
    discount1=st.floats(min_value=0.0, max_value=50.0, allow_nan=False),
    discount2=st.floats(min_value=50.0, max_value=100.0, allow_nan=False),
)
def test_higher_discount_lower_price(price, discount1, discount2):
    """Higher discount must produce lower or equal price."""
    result1 = calculate_discount(price, discount1)
    result2 = calculate_discount(price, discount2)
    assert result1 >= result2


# Property 3: output range - final price always between 0 and original price
@given(
    price=st.floats(min_value=0.0, max_value=1_000_000.0, allow_nan=False),
    discount=st.floats(min_value=0.0, max_value=100.0, allow_nan=False),
)
def test_output_range(price, discount):
    """Discounted price must be in range [0, price]."""
    result = calculate_discount(price, discount)
    assert 0.0 <= result <= price + 0.01  # small FP tolerance


# Property 4: boundary - 100% discount must zero the price
@given(price=st.floats(min_value=0.01, max_value=1_000_000.0, allow_nan=False))
def test_full_discount_zero_price(price):
    """100% discount must result in zero price."""
    result = calculate_discount(price, 100.0)
    assert abs(result) < 0.01


# Property 5: input validation - out-of-range discount raises exception
@given(discount=st.one_of(
    st.floats(max_value=-0.001, allow_nan=False),
    st.floats(min_value=100.001, allow_nan=False, allow_infinity=False)
))
def test_invalid_discount_raises(discount):
    """Discount out of range [0, 100] must raise ValueError."""
    assume(not (discount != discount))  # exclude NaN
    with pytest.raises(ValueError):
        calculate_discount(100.0, discount)


# Test with st.composite for correlated data
@st.composite
def valid_order(draw):
    """Composite strategy for valid orders."""
    price = draw(st.floats(min_value=1.0, max_value=10000.0, allow_nan=False))
    discount = draw(st.floats(min_value=0.0, max_value=99.9, allow_nan=False))
    return {"price": price, "discount": discount}


@given(order=valid_order())
@settings(max_examples=500, suppress_health_check=[HealthCheck.too_slow])
def test_order_processing_invariants(order):
    """Invariants on random orders generated by Hypothesis."""
    result = calculate_discount(order["price"], order["discount"])
    # Result must be a finite number
    assert result == result  # excludes NaN
    assert result < float('inf')
    # Result must be non-negative
    assert result >= 0.0
Static Analysis: ESLint, SonarQube and Custom Rules
Static analysis is the first line of defense against low-quality AI code. It runs without executing the code, making it ideal for integration into the commit cycle or CI/CD pipeline as a mandatory gate before any deployment.
For AI code, standard static analysis tool configurations are insufficient. AI code tends to produce specific patterns that require dedicated rules: excessively long functions generated in a single block, circular dependencies, overuse of any in TypeScript, imports of nonexistent or deprecated modules.
// .eslintrc.json
{
  "root": true,
  "parser": "@typescript-eslint/parser",
  "parserOptions": {
    "project": "./tsconfig.json",
    "ecmaVersion": 2024
  },
  "plugins": ["@typescript-eslint", "security", "sonarjs"],
  "extends": [
    "eslint:recommended",
    "plugin:@typescript-eslint/strict",
    "plugin:security/recommended",
    "plugin:sonarjs/recommended"
  ],
  "rules": {
    // Critical rules for AI-generated code

    // Forbids explicit 'any' - AI uses it frequently for "safety"
    "@typescript-eslint/no-explicit-any": "error",

    // Requires explicit return types - AI often omits them
    "@typescript-eslint/explicit-function-return-type": ["error", {
      "allowExpressions": false,
      "allowTypedFunctionExpressions": true
    }],

    // Cognitive complexity: AI generates long, complex functions
    "sonarjs/cognitive-complexity": ["error", 15],

    // Duplication: AI replicates identical patterns across functions
    "sonarjs/no-duplicate-string": ["warn", 3],
    "sonarjs/no-identical-functions": "error",

    // Security: common patterns in vulnerable AI code
    "security/detect-object-injection": "error",
    "security/detect-non-literal-regexp": "error",
    "security/detect-possible-timing-attacks": "error",
    "security/detect-unsafe-regex": "error",

    // Function complexity - AI generates monoliths
    "max-lines-per-function": ["warn", {
      "max": 50,
      "skipBlankLines": true,
      "skipComments": true
    }],

    // Deep nesting - frequent pattern in AI code
    "max-depth": ["warn", 4],

    // Promise handling - AI often forgets await
    "@typescript-eslint/no-floating-promises": "error",
    "@typescript-eslint/await-thenable": "error",

    // Null safety
    "@typescript-eslint/no-non-null-assertion": "error",

    // Unused variables (hallucinated imports)
    "@typescript-eslint/no-unused-vars": "error",
    "no-unused-vars": "off" // replaced by typescript version
  },
  "overrides": [
    {
      // Stricter configuration for AI-generated code directories
      "files": ["**/generated/**/*.ts", "**/ai-output/**/*.ts"],
      "rules": {
        "sonarjs/cognitive-complexity": ["error", 10],
        "max-lines-per-function": ["error", {"max": 30}]
      }
    }
  ]
}
For SonarQube, beyond standard configuration, it is useful to set up a Quality Gate specific to AI code with stricter thresholds: no blocker issues, minimum 80% coverage, maximum 3% duplication, and an explicit limit on function cyclomatic complexity.
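The project side of such a gate is wired in sonar-project.properties; a minimal sketch (the project key and paths are placeholders, and the quality-gate thresholds themselves live server-side in SonarQube, configured via the UI or the Web API, not in this file):

```properties
# sonar-project.properties - analysis scope for AI-heavy codebases (sketch)
sonar.projectKey=my-org:my-service        # placeholder
sonar.sources=src
sonar.tests=tests
sonar.python.coverage.reportPaths=coverage.xml

# Make the scanner poll the server and fail the pipeline
# if the Quality Gate fails
sonar.qualitygate.wait=true
```

With sonar.qualitygate.wait enabled, the scanner's exit code reflects the gate verdict, so the CI job turns red on any blocker issue or coverage shortfall.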
Security Testing: SAST with Semgrep for AI Code
SAST (Static Application Security Testing) specialized for AI code goes beyond style linting: it searches for active vulnerability patterns, often invisible to manual review because they are syntactically correct. Semgrep is the reference tool for this category in 2025, with 20,000+ predefined rules and the ability to write custom rules that precisely match patterns AI introduces.
In November 2025, Semgrep announced the private beta of its AI-powered detection system, which combines traditional static analysis with contextual reasoning to reduce false positives by 91% compared to standalone SAST tools. For AI code, this hybrid approach is particularly effective because it understands the context in which a pattern is used, not just its presence.
# semgrep-ai-patterns.yaml
# Custom Semgrep rules for vulnerability patterns typical of AI-generated code
rules:
  # Rule 1: SQL injection via f-string or concatenation
  - id: ai-sql-injection-fstring
    pattern-either:
      - pattern: cursor.execute(f"...")
      - pattern: cursor.execute("..." + $VAR)
    message: |
      Potential SQL injection via direct interpolation.
      AI frequently generates queries with f-strings instead of parameters.
      Use cursor.execute("SELECT ... WHERE id = %s", (var,))
    languages: [python]
    severity: ERROR
    metadata:
      category: security
      owasp: "A03:2021 - Injection"
      ai-pattern: true

  # Rule 2: Path traversal in file operations
  - id: ai-path-traversal
    pattern-either:
      - pattern: open($BASE + $USER_INPUT, ...)
      - pattern: open(os.path.join($BASE, $USER_INPUT), ...)
    message: |
      Potential path traversal. AI often concatenates user paths
      without sanitization. Use pathlib.Path().resolve() and verify
      the resulting path is within the allowed directory.
    languages: [python]
    severity: ERROR

  # Rule 3: Hardcoded credentials (very common AI pattern)
  - id: ai-hardcoded-credentials
    pattern-either:
      - pattern: password = "..."
      - pattern: api_key = "..."
      - pattern: secret = "..."
      - pattern: token = "..."
    message: |
      Hardcoded credentials detected. AI frequently inserts
      realistic placeholders that look like real credentials.
      Use environment variables or a secret manager.
    languages: [python, javascript, typescript]
    severity: ERROR

  # Rule 4: Eval on external input
  - id: ai-eval-injection
    pattern-either:
      - pattern: eval($USER_INPUT)
      - pattern: exec($USER_INPUT)
    message: |
      eval/exec on user-controllable input. Dangerous pattern
      that AI introduces when generating interpreters or dynamic systems.
    languages: [python, javascript]
    severity: ERROR

  # Rule 5: Unsafe deserialization (Python pickle)
  - id: ai-unsafe-deserialization
    pattern-either:
      - pattern: pickle.loads($DATA)
      - pattern: pickle.load($FILE)
    message: |
      pickle.loads/load on untrusted data. AI uses pickle for
      simplicity without considering security implications.
      Use json.loads() or secure libraries like marshmallow.
    languages: [python]
    severity: ERROR

  # Rule 6: JWT without signature verification
  - id: ai-jwt-no-verification
    pattern: jwt.decode($TOKEN, ..., options={"verify_signature": False})
    message: |
      JWT decoded without signature verification. Common pattern in
      AI-generated code for development environments that ends up in production.
    languages: [python]
    severity: ERROR

  # Rule 7: Potentially catastrophic regex (ReDoS)
  - id: ai-catastrophic-regex
    pattern-regex: '\(\.\*\)+|\(\.\+\)+'
    message: |
      Regex pattern potentially vulnerable to ReDoS.
      AI often generates regex with exponential backtracking.
    languages: [python, javascript, typescript]
    severity: WARNING
Critical Stat: Veracode 2025
The Veracode 2025 report analyzed over 100 LLMs across 80 coding tasks. Java is the riskiest language with a 72% security failure rate. Python, C#, and JavaScript range from 38% to 45%. Most alarming: newer, more advanced models are no safer than older ones. Security in AI code does not improve with model size.
TDD with AI Assistant: Red-AI-Green-Review
Traditional Test-Driven Development follows the Red-Green-Refactor cycle: write the failing test, implement the minimum to make it pass, refactor. With an AI assistant, this cycle becomes Red-AI-Green-Review: write tests first (maintaining human control over specifications), delegate implementation to AI, verify that all tests pass, then conduct code review focused on security patterns and architecture.
This approach has a fundamental advantage: tests written by the developer represent the real specifications of the system, not AI's interpretation of the prompt. If AI generates code that fails the tests, the problem is explicit and measurable. If the code passes all tests but has security issues, Semgrep finds them. If it has design flaws, code review identifies them. The cycle is complete.
# Step 1: Developer writes tests FIRST
# test_payment_processor.py
import pytest
from decimal import Decimal

# PaymentProcessor and PaymentError do not exist yet: the AI will
# implement them in payment_processor.py against these specs.
from payment_processor import PaymentProcessor, PaymentError


@pytest.fixture
def processor():
    return PaymentProcessor()


class TestPaymentProcessor:
    """
    PaymentProcessor specs written BEFORE asking AI to implement it.
    """

    def test_process_valid_payment(self, processor):
        """Valid payment must return a transaction_id."""
        result = processor.process(
            amount=Decimal("99.99"),
            currency="USD",
            card_token="tok_valid_test"
        )
        assert result.success is True
        assert result.transaction_id is not None
        assert len(result.transaction_id) == 36  # UUID format

    def test_negative_amount_rejected(self, processor):
        """Negative amount must raise ValueError."""
        with pytest.raises(ValueError, match="Amount must be positive"):
            processor.process(
                amount=Decimal("-10.00"),
                currency="USD",
                card_token="tok_valid_test"
            )

    def test_unsupported_currency_rejected(self, processor):
        """Unsupported currency must raise ValueError."""
        with pytest.raises(ValueError, match="Currency not supported"):
            processor.process(
                amount=Decimal("50.00"),
                currency="XYZ",
                card_token="tok_valid_test"
            )

    def test_invalid_token_raises_payment_error(self, processor):
        """Invalid token must raise PaymentError, not crash."""
        with pytest.raises(PaymentError, match="Invalid card token"):
            processor.process(
                amount=Decimal("50.00"),
                currency="USD",
                card_token=""
            )

    def test_duplicate_idempotency(self, processor):
        """Same idempotency_key must not create duplicates."""
        result1 = processor.process(
            amount=Decimal("25.00"),
            currency="USD",
            card_token="tok_valid_test",
            idempotency_key="key-123"
        )
        result2 = processor.process(
            amount=Decimal("25.00"),
            currency="USD",
            card_token="tok_valid_test",
            idempotency_key="key-123"
        )
        assert result1.transaction_id == result2.transaction_id

    def test_amount_precision_preserved(self, processor):
        """Decimal precision must be maintained."""
        result = processor.process(
            amount=Decimal("0.01"),
            currency="USD",
            card_token="tok_valid_test"
        )
        assert result.charged_amount == Decimal("0.01")


# Step 2: Ask AI to implement PaymentProcessor
#   Prompt: "Implement PaymentProcessor that passes all these tests.
#   Use Decimal for amounts, validate inputs,
#   handle idempotency, do not use float."

# Step 3: Verify AI implementation passes all tests
#   pytest test_payment_processor.py -v

# Step 4: Run Semgrep on the implementation
#   semgrep --config semgrep-ai-patterns.yaml payment_processor.py

# Step 5: Code review focused on security and architecture
Automated Code Review: Multi-Agent Pipeline
A multi-agent code review pipeline for AI code combines different specialists operating in parallel on different quality aspects: a security review agent, a code style agent, a performance agent, and an architecture agent. The result is a structured report that the developer uses as a starting point for human review, not as a replacement for it.
This approach is not new (tools like CodeClimate, DeepSource, and Codacy have existed for years), but in 2025 the quality of models has made a fully agentic pipeline practical. Claude Code can be orchestrated into parallel subagents that analyze code from different angles and aggregate results into a single review report.
# ai_code_review_pipeline.py
# Multi-agent pipeline for reviewing AI-generated code
import asyncio
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewResult:
    agent: str
    severity: str  # "critical", "high", "medium", "low", "info"
    category: str
    message: str
    file: str
    line: int | None = None


async def run_security_agent(file_path: str) -> List[ReviewResult]:
    """Security agent: runs Semgrep OWASP ruleset."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "semgrep", "--config", "p/owasp-top-ten",
        "--json", file_path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    data = json.loads(stdout)
    for finding in data.get("results", []):
        results.append(ReviewResult(
            agent="security",
            severity="critical" if finding["extra"]["severity"] == "ERROR" else "high",
            category="security",
            message=finding["extra"]["message"],
            file=finding["path"],
            line=finding["start"]["line"]
        ))
    return results


async def run_complexity_agent(file_path: str) -> List[ReviewResult]:
    """Complexity agent: analyzes cyclomatic complexity with radon."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "radon", "cc", "-s", "-n", "B", file_path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    for line in stdout.decode().splitlines():
        if line.strip():
            parts = line.strip().split()
            if len(parts) >= 5:
                results.append(ReviewResult(
                    agent="complexity",
                    severity="medium" if "B" in parts else "high",
                    category="maintainability",
                    message=f"High complexity function: {' '.join(parts)}",
                    file=file_path
                ))
    return results


async def run_coverage_agent(
    test_file: str,
    source_file: str
) -> List[ReviewResult]:
    """Coverage agent: verifies tests adequately cover AI code."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "pytest", test_file,
        f"--cov={source_file}",
        "--cov-report=json:coverage.json",
        "--quiet",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    await proc.communicate()
    try:
        with open("coverage.json") as f:
            cov_data = json.load(f)
        total = cov_data["totals"]["percent_covered"]
        if total < 80.0:
            results.append(ReviewResult(
                agent="coverage",
                severity="high",
                category="testing",
                message=f"Insufficient coverage: {total:.1f}% (minimum 80%)",
                file=source_file
            ))
    except FileNotFoundError:
        results.append(ReviewResult(
            agent="coverage",
            severity="critical",
            category="testing",
            message="Cannot calculate coverage. Missing tests?",
            file=source_file
        ))
    return results


async def review_ai_code(
    source_file: str,
    test_file: str | None = None
) -> dict:
    """Full review pipeline for AI-generated code."""
    tasks = [
        run_security_agent(source_file),
        run_complexity_agent(source_file),
    ]
    if test_file:
        tasks.append(run_coverage_agent(test_file, source_file))

    # Parallel execution of all agents
    all_results = await asyncio.gather(*tasks, return_exceptions=True)

    findings: List[ReviewResult] = []
    for result in all_results:
        if isinstance(result, Exception):
            print(f"Agent failed: {result}")
        else:
            findings.extend(result)

    critical = [f for f in findings if f.severity == "critical"]
    high = [f for f in findings if f.severity == "high"]

    return {
        "total_findings": len(findings),
        "critical": len(critical),
        "high": len(high),
        "approve": len(critical) == 0 and len(high) == 0,
        "findings": findings
    }


if __name__ == "__main__":
    result = asyncio.run(review_ai_code(
        source_file="payment_processor.py",
        test_file="test_payment_processor.py"
    ))
    print(f"Review complete: {result['total_findings']} findings")
    print(f"Approved: {result['approve']}")
    if not result["approve"]:
        print("\nCRITICAL ISSUES:")
        for f in result["findings"]:
            if f.severity in ["critical", "high"]:
                print(f"  [{f.severity.upper()}] {f.file}:{f.line} - {f.message}")
Quality Metrics for AI Code
Testing AI code requires a broader set of metrics than just code coverage. Coverage measures how many lines of code are executed by tests, but does not measure the quality of those tests. AI code can have 90% coverage with tests that verify nothing relevant. The supplementary metrics needed are:
Mutation Score
Mutation testing is the most reliable way to measure test quality, not just quantity.
A mutation tester (such as mutmut for Python or Stryker
for JavaScript/TypeScript) systematically introduces small changes to code (mutants:
changes > to >=, inverts a boolean, removes a return)
and verifies that existing tests detect these changes. If a mutant survives the tests,
the tests are not granular enough.
For AI code, the minimum target mutation score is 70% (Gold tier). Scores below 50% indicate an inadequate test suite, regardless of coverage percentage.
# setup.cfg - mutmut configuration
[mutmut]
paths_to_mutate=src/
backup=False
runner=python -m pytest tests/ -x -q
tests_dir=tests/

# Run mutation testing
#   mutmut run
# View results
#   mutmut results

# Example surviving mutant output:
#   Mutant #1: SURVIVED
#   src/payment_processor.py:45
#   -    if amount <= 0:
#   +    if amount < 0:
#
#   This means no test covers amount=0 exactly!
#   Add: assert raises ValueError for amount=0

# CI/CD script with minimum threshold
#!/bin/bash
mutmut run --no-progress

KILLED=$(mutmut results | grep "Killed" | grep -o '[0-9]*' | head -1)
TOTAL=$(mutmut results | grep "Total" | grep -o '[0-9]*' | head -1)

if [ -z "$TOTAL" ] || [ "$TOTAL" -eq 0 ]; then
    echo "No mutants generated"
    exit 0
fi

SCORE=$(echo "scale=2; $KILLED * 100 / $TOTAL" | bc)
echo "Mutation score: $SCORE%"

# Minimum threshold 70% for AI-generated code
if (( $(echo "$SCORE < 70" | bc -l) )); then
    echo "FAIL: Insufficient mutation score for AI code ($SCORE% < 70%)"
    exit 1
fi
echo "OK: Acceptable mutation score ($SCORE%)"
Cyclomatic Complexity
Cyclomatic complexity measures the number of linearly independent paths through a function. A function with complexity 1 has a single possible path (no branches). A function with complexity 20 has 20 paths and requires at least 20 test cases for complete branch coverage. AI code tends to generate high-complexity functions because it tries to handle all cases in a single monolithic function.
Recommended targets for AI code in 2025: median value < 10 per function, automatic flag at > 15 for mandatory review, and automatic rejection at > 25 as a CI/CD gate. These values are stricter than for human code because high-complexity AI code is less predictable in error handling than equivalently complex developer-written code.
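These thresholds can be prototyped with nothing but the standard library. A minimal sketch that approximates cyclomatic complexity as 1 + the number of branch points per function (a simplification: for example, each BoolOp counts once rather than once per operator; a real gate would use radon or xenon, which apply the full metric):

```python
import ast

# AST node types that open an additional execution path
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp, ast.Assert, ast.comprehension)


def approx_complexity(source: str) -> dict[str, int]:
    """Approximate cyclomatic complexity per function: 1 + branch points."""
    tree = ast.parse(source)
    scores = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores


code = """
def simple(x):
    return x + 1

def branchy(x):
    if x > 0:
        for i in range(x):
            if i % 2 == 0:
                x += i
    return x
"""
print(approx_complexity(code))  # {'simple': 1, 'branchy': 4}
```

Wired into CI, any function scoring above 15 would be flagged for mandatory review and above 25 rejected, matching the thresholds above.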
Best Practices: Checklist for Accepting AI Code
The following checklist represents the minimum quality gate before integrating AI-generated code into a production codebase. It is not exhaustive but covers the most frequent error categories documented in 2025.
Checklist: Quality Gate for AI-Generated Code
Layer 1 - Automated (CI/CD Gate, blocking)
- All unit tests pass (0 failures)
- Semgrep security scan: 0 ERROR severity issues
- ESLint/TypeScript: 0 errors (warnings acceptable with justification)
- Code coverage >= 80% on new lines
- No dependencies with known CVE (npm audit / pip-audit)
- No hardcoded credentials (git-secrets, detect-secrets)
Layer 2 - Quality Metrics (warnings if not met)
- Mutation score >= 70% on critical modules
- Cyclomatic complexity < 15 per function
- Cognitive complexity < 20 per function
- No function exceeding 50 lines
- Code duplication < 3%
Layer 3 - Manual Review (mandatory for significant changes)
- Business logic verified against original specifications
- Error handling reviewed: no silent catch, no generic Exception
- Edge cases documented in tests: empty input, null, extreme values
- No external API calls without timeout and retry logic
- State mutation isolated: no undocumented side effects
- Property-based tests for functions with numeric or textual inputs
Layer 4 - Architectural Review (for new modules or significant refactoring)
- Single Responsibility respected (one reason to change per class)
- Dependencies toward abstractions, not concrete implementations
- No circular dependencies between modules
- Public interfaces stable and well documented
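Layer 1 maps directly onto a blocking CI job. A hedged GitHub Actions sketch (action versions, the requirements-dev.txt path, and the .secrets.baseline file are placeholders to adapt to the project; overall coverage is used here as a proxy for new-line coverage):

```yaml
# .github/workflows/ai-code-gate.yml - Layer 1 as a blocking CI job (sketch)
name: ai-code-quality-gate
on: [pull_request]

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt  # placeholder path

      # Unit tests + coverage gate (>= 80%)
      - run: pytest --cov=src --cov-fail-under=80

      # Security scan: nonzero exit code on any finding
      - run: semgrep scan --config p/owasp-top-ten --error

      # Dependencies with known CVEs
      - run: pip-audit

      # Secret detection against a committed baseline
      - run: detect-secrets-hook --baseline .secrets.baseline $(git ls-files)
```

Each step exits nonzero on failure, so the pull request cannot merge while any Layer 1 criterion is unmet.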
Anti-Patterns: What NOT to Do with AI Code
- Never merge without running tests: even for "small fixes". AI code is deterministic but not transparent: a prompt change alters the entire implementation in non-obvious ways.
- Never trust "it works on my machine": AI code is particularly sensitive to environment differences. CI/CD tests on a clean container are mandatory.
- Never accept AI code on authentication modules without SAST: auth and authorization are statistically the most dangerous areas for AI code. The Veracode Report cites JWT without verification, broken session management, and privilege escalation among the top-5 AI vulnerabilities.
- Do not use AI to generate tests for AI code: tests must be written by the developer. Asking AI to write tests for its own code produces tests that verify the AI implementation, not the system specifications.
Conclusions: Testing as the Enabler of Professional Vibe Coding
The goal of this article is not to discourage AI-assisted development. It is the opposite: to build the confidence needed to do it well. The 45% security failure rate of AI code does not mean that 45% of AI code reaching production is insecure. It means that without a structured verification process, that is the probability. With the right process, that percentage drops dramatically.
The combination of TDD (tests written before, not after), property-based testing with Hypothesis, SAST with Semgrep, static analysis with ESLint configured for AI patterns, and mutation testing to verify test quality, creates a safety system that makes vibe coding professional. It does not slow development down: it stabilizes it.
The next article in the series explores Prompt Engineering for IDEs and Code Generation: how to write prompts that produce better, more secure, more maintainable code, reducing the testing burden downstream.
Resources and Useful Links
Vibe Coding and Agentic Development Series
Full Series
- Vibe Coding: The Paradigm That Changed 2025
- Claude Code: Agentic Development from the Terminal
- Agentic Workflows: Decomposing Problems for AI
- Multi-Agent Coding: LangGraph, CrewAI and AutoGen
- Testing AI-Generated Code (this article)
- Prompt Engineering for IDEs and Code Generation
- Security in Vibe Coding: Risks and Mitigations
- The Future of Agentic Development in 2026
Related Deep Dives
- Cursor IDE: The First AI-First IDE - How Cursor handles testing of AI-generated code
- OWASP Top 10 2025: Current Vulnerabilities - The vulnerability context AI code introduces
- Claude for Code Review - Using Claude as a review agent