05 - Testing AI-Generated Code: Strategies, Frameworks, and Quality Gates
2025 confirmed an uncomfortable truth: AI-generated code is functional, fast to produce, and often elegant in structure. But it is not safe. According to the Veracode GenAI Code Security Report 2025, which analyzed over 100 LLMs across 80 coding tasks in Java, Python, C#, and JavaScript, 45% of AI-generated code fails security tests and introduces OWASP Top 10 vulnerabilities into codebases. A further 62% of samples exhibit design flaws: not obvious bugs, but deep architectural errors that surface only in production. Overall, AI-generated code introduces 2.74 times more vulnerabilities than human-written code.
These numbers are not an indictment of vibe coding. They are a reminder that AI changes who writes the code, not the need to test it. In fact, they amplify that need. When an AI agent generates 500 lines of code in 30 seconds, the productivity bottleneck shifts from writing to verification. Developers without a robust testing strategy for AI code end up with codebases whose quality is opaque, whose bugs are unpredictable, and whose security is a gamble.
This article builds that strategy: from understanding the unique failure patterns of AI code, through unit testing, property-based testing, static analysis, SAST, and mutation testing, to an operational checklist for deciding when to accept or reject AI-generated code. The target audience is developers who already use AI tools in their workflow and want to build the confidence needed to do it professionally.
What You Will Learn
- Why AI-generated code fails differently than human-written code
- Specific bug types: hallucinated APIs, logic errors, security holes
- Unit testing and property-based testing with pytest and Hypothesis
- Static analysis with ESLint and SonarQube configured for AI code
- Security testing with Semgrep: custom rules for AI patterns
- TDD with AI assistant: Red-Green-Refactor becomes Red-AI-Green-Review
- Automated code review through multi-agent pipelines
- Quality metrics: coverage, mutation score, cyclomatic complexity
- Operational checklist for accepting or rejecting AI code
Why AI-Generated Code Requires Specific Testing
Code written by a human developer carries a mental model of the system. The developer knows what lies above and below the module they are writing, has seen the code run in production, and understands the edge cases that burn. AI has no such context. It operates on statistical patterns extracted from billions of lines of code, and generates solutions that are plausible given the prompt, not necessarily correct given the system.
This difference produces a fundamentally different failure profile. Human code tends to fail predictably: a tired developer copies the wrong snippet, forgets a null-check, applies yesterday's business logic to today's model. Localized, traceable errors, often caught by existing tests. AI code fails structurally: it implements the wrong API with absolute consistency, introduces a vulnerability pattern across dozens of functions, builds an architecture that works perfectly for 95% of cases and collapses on the remaining 5% with behaviors impossible to anticipate without a purpose-built test suite.
The Problem of Synthetic Confidence
AI generates code with equal confidence regardless of correctness. A model that hallucinates a nonexistent library function does so with the same certainty as when implementing a correct algorithm. There is no signal of uncertainty in the generated text. This makes visual review unreliable: the code looks right because it is written well. Only execution reveals the truth.
Three specific characteristics of AI code require a different testing approach than traditional testing:
- Contextual variance: The same prompt on different codebases produces code with very different quality profiles. Testing must be specific to the deployment context, not just the abstract functionality.
- Internal consistency of errors: If AI introduces a wrong pattern (for example, unsafe handling of user input), it replicates it across all similar functions. Bugs are not isolated: they are systemic. This requires tests that search for patterns, not just individual cases.
- Outdated knowledge: Models have knowledge cutoffs. They use deprecated APIs, obsolete security patterns, dependencies with known vulnerabilities. Testing must include a layer of dependency and security pattern verification for current standards.
Bug Types in AI-Generated Code
Before building an effective testing strategy, you need to understand what you are looking for. AI code produces distinct bug categories, each with its own characteristics requiring different detection approaches.
1. Hallucinated APIs
The most insidious bug is also the most documented. AI invents functions, methods, parameters, and modules that do not exist, but with names so plausible they pass visual review without difficulty. A classic example: pandas.DataFrame.filter_by_threshold() does not exist, but it sounds exactly like something that should. The code parses cleanly (absent strict type checking), passes linting, and fails at runtime with an AttributeError that surfaces only when the function is called with real data.
The primary defense against hallucinated APIs is execution, not review. Unit tests that actually call the code, even with minimal inputs, detect this error type immediately. Static type checking with mypy (Python) or TypeScript strict mode adds a layer of protection at compile time.
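The point generalizes beyond pandas. A minimal, self-contained sketch (both ApiClient and load_users are hypothetical stand-ins invented for this example, not from any real library): the generated call looks plausible, passes review, and dies on first execution.

```python
class ApiClient:
    """Hypothetical HTTP client stand-in; only .get() actually exists."""

    def get(self, path: str) -> dict:
        return {"status": 200, "path": path}


def load_users(client: ApiClient) -> dict:
    # Hypothetical AI output: fetch_json() sounds plausible but does not exist
    return client.fetch_json("/users")


# A smoke test that merely executes the code exposes the hallucination:
try:
    load_users(ApiClient())
except AttributeError as exc:
    print(f"Hallucinated API caught at runtime: {exc}")
```

Because load_users is annotated against ApiClient, a static checker like mypy would also flag the nonexistent attribute before the code ever runs, which is the cheapest place to catch it.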
2. Logic Errors and Off-by-One
AI code is surprisingly good with simple logic and surprisingly fragile at boundaries. Array indices, inclusive vs exclusive ranges, loop termination conditions: these are where AI code introduces subtle errors that standard tests do not catch unless specifically designed to test boundary values.
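As an illustration, consider a hypothetical AI-generated sliding-window sum (the function and its bug are invented for this example). The happy-path output looks reasonable; only an assertion about the expected number of windows reveals the boundary error.

```python
def moving_sum(values: list[int], window: int) -> list[int]:
    """Hypothetical AI-generated sliding-window sum with a boundary bug."""
    sums = []
    # Bug: the loop stops one window early; the correct bound is
    # len(values) - window + 1.
    for i in range(len(values) - window):
        sums.append(sum(values[i:i + window]))
    return sums


result = moving_sum([1, 2, 3, 4], window=2)
print(result)  # [3, 5] -- the final window [3, 4] is silently dropped

# The boundary check a happy-path suite misses: there must be
# len(values) - window + 1 windows, i.e. 3 here, not 2.
assert len(result) == 2  # documents the bug; the correct length is 3
```

A test asserting `len(result) == len(values) - window + 1` fails immediately, while spot checks of individual sums pass.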
3. Security Holes from Known Patterns
The Veracode 2025 Report documents that LLMs fail to defend against XSS in 86% of relevant cases and against log injection in 88% of cases. These are not random bugs: they are a consequence of training data containing enormous amounts of vulnerable code that the model learned to replicate. SQL injection, SSRF, path traversal, improper deserialization: systematic patterns requiring dedicated SAST, not just functional tests.
4. Architectural Design Flaws
The hardest to detect with automated tests. AI builds solutions that work locally but do not scale, that have the right interfaces but wrong responsibilities, that respect the required contract but violate fundamental architectural principles (single responsibility, separation of concerns, loose coupling). The 62% design flaw rate cited by Veracode falls into this category. Detection requires a mix of complexity metrics, dependency analysis, and structured code review.
Testing Frameworks for AI Code
Unit Test Strategy: Testing Boundaries
The unit testing strategy for AI code must be shifted toward boundary values and non-obvious cases. AI code handles the happy path well (it was trained on millions of happy path examples). It fails on boundaries, anomalous inputs, and error handling.
An effective unit test suite for AI code includes:
- Tests with inputs at the edge of valid range (zero, negative, max-int values)
- Tests with empty, null, None, empty string inputs
- Tests with inputs containing special characters or unexpected encoding
- Tests verifying behavior on error (exception handling)
- Tests verifying side effects (changes to shared state, database, filesystem)
# test_ai_generated.py
# Testing strategy for AI-generated code
import pytest

# AI-generated function under test (example)
# def process_user_input(data: dict) -> dict:
#     return {
#         "name": data["name"].strip().title(),
#         "age": int(data["age"]),
#         "email": data["email"].lower()
#     }
from my_module import process_user_input


class TestAIGeneratedBoundaries:
    """Boundary tests for AI-generated code."""

    def test_nominal_case(self):
        """Happy path - the case AI almost certainly handled correctly."""
        result = process_user_input({
            "name": "john smith",
            "age": "30",
            "email": "JOHN@EXAMPLE.COM"
        })
        assert result["name"] == "John Smith"
        assert result["age"] == 30
        assert result["email"] == "john@example.com"

    def test_empty_name(self):
        """Edge case: empty name - often unhandled by AI."""
        with pytest.raises((ValueError, KeyError)):
            process_user_input({"name": "", "age": "25", "email": "a@b.com"})

    def test_negative_age(self):
        """Edge case: negative age - AI often skips range validation."""
        with pytest.raises(ValueError, match="age must be positive"):
            process_user_input({"name": "Test", "age": "-5", "email": "a@b.com"})

    def test_missing_required_field(self):
        """Edge case: missing field - KeyError or default?"""
        with pytest.raises(KeyError):
            process_user_input({"name": "Test", "age": "25"})

    def test_sql_injection_in_name(self):
        """Security: injection input must not pass through unmodified."""
        malicious = "'; DROP TABLE users; --"
        result = process_user_input({
            "name": malicious,
            "age": "25",
            "email": "a@b.com"
        })
        # Name must be sanitized or rejected
        assert "DROP TABLE" not in result.get("name", "")

    def test_extremely_long_input(self):
        """Edge case: very long input - buffer overflow / ReDoS."""
        long_name = "a" * 10000
        # Must not hang indefinitely or crash
        with pytest.raises((ValueError, OverflowError)):
            process_user_input({"name": long_name, "age": "25", "email": "a@b.com"})

    def test_unicode_name(self):
        """Unicode compatibility: non-ASCII names."""
        result = process_user_input({
            "name": "sofía garcía",
            "age": "28",
            "email": "sofia@example.es"
        })
        assert result["name"] == "Sofía García"


class TestAIGeneratedWithMocks:
    """Tests with mocks to isolate dependencies."""

    def test_database_not_called_on_validation_error(self, mocker):
        """Verifies DB is not called when validation fails."""
        mock_db = mocker.patch("my_module.database.save")
        with pytest.raises(ValueError):
            process_user_input({"name": "", "age": "25", "email": "a@b.com"})
        mock_db.assert_not_called()  # AI often calls DB before validating
Property-Based Testing with Hypothesis
Property-based testing is the most powerful tool available for testing AI code. Instead of defining individual test cases, you define invariant properties that must hold for any input within a specified domain, and Hypothesis automatically generates hundreds of inputs to find counterexamples. This technique is particularly effective against AI code because it systematically explores the input space, including edge cases we would not think to test manually.
A 2025 paper on arXiv (Agentic Property-Based Testing: Finding Bugs Across the Python Ecosystem) demonstrated that a Claude Code agent generating Hypothesis tests finds different and complementary bugs compared to manually written tests, with few false positives. The approach establishes a new paradigm for scalable software auditing, analyzing the target module, proposing grounded properties, and executing them iteratively with pytest.
# test_properties.py
# Property-based testing with Hypothesis for AI-generated code
import pytest
from hypothesis import given, strategies as st, settings, assume
from hypothesis import HealthCheck

# AI-generated function: discount calculation
# def calculate_discount(price: float, discount_percent: float) -> float:
#     if discount_percent < 0 or discount_percent > 100:
#         raise ValueError("Discount must be between 0 and 100")
#     return price * (1 - discount_percent / 100)
from pricing import calculate_discount


# Property 1: identity - 0% discount does not change the price
@given(price=st.floats(min_value=0.01, max_value=1_000_000.0, allow_nan=False))
def test_zero_discount_preserves_price(price):
    """Zero discount must return the original price."""
    result = calculate_discount(price, 0.0)
    assert abs(result - price) < 0.01  # floating point tolerance


# Property 2: monotonicity - higher discount = lower price
@given(
    price=st.floats(min_value=1.0, max_value=10000.0, allow_nan=False),
    discount1=st.floats(min_value=0.0, max_value=50.0, allow_nan=False),
    discount2=st.floats(min_value=50.0, max_value=100.0, allow_nan=False),
)
def test_higher_discount_lower_price(price, discount1, discount2):
    """Higher discount must produce lower or equal price."""
    result1 = calculate_discount(price, discount1)
    result2 = calculate_discount(price, discount2)
    assert result1 >= result2


# Property 3: output range - final price always between 0 and original price
@given(
    price=st.floats(min_value=0.0, max_value=1_000_000.0, allow_nan=False),
    discount=st.floats(min_value=0.0, max_value=100.0, allow_nan=False),
)
def test_output_range(price, discount):
    """Discounted price must be in range [0, price]."""
    result = calculate_discount(price, discount)
    assert 0.0 <= result <= price + 0.01  # small FP tolerance


# Property 4: boundary - 100% discount must zero the price
@given(price=st.floats(min_value=0.01, max_value=1_000_000.0, allow_nan=False))
def test_full_discount_zero_price(price):
    """100% discount must result in zero price."""
    result = calculate_discount(price, 100.0)
    assert abs(result) < 0.01


# Property 5: input validation - out-of-range discount raises exception
@given(discount=st.one_of(
    st.floats(max_value=-0.001, allow_nan=False),
    st.floats(min_value=100.001, allow_nan=False, allow_infinity=False)
))
def test_invalid_discount_raises(discount):
    """Discount out of range [0, 100] must raise ValueError."""
    assume(not (discount != discount))  # exclude NaN
    with pytest.raises(ValueError):
        calculate_discount(100.0, discount)


# Test with st.composite for correlated data
@st.composite
def valid_order(draw):
    """Composite strategy for valid orders."""
    price = draw(st.floats(min_value=1.0, max_value=10000.0, allow_nan=False))
    discount = draw(st.floats(min_value=0.0, max_value=99.9, allow_nan=False))
    return {"price": price, "discount": discount}


@given(order=valid_order())
@settings(max_examples=500, suppress_health_check=[HealthCheck.too_slow])
def test_order_processing_invariants(order):
    """Invariants on random orders generated by Hypothesis."""
    result = calculate_discount(order["price"], order["discount"])
    # Result must be a finite number
    assert result == result  # excludes NaN
    assert result < float('inf')
    # Result must be non-negative
    assert result >= 0.0
Static Analysis: ESLint, SonarQube and Custom Rules
Static analysis is the first line of defense against low-quality AI code. It runs without executing the code, making it ideal for integration into the commit cycle or CI/CD pipeline as a mandatory gate before any deployment.
For AI code, standard static analysis tool configurations are insufficient. AI code tends to produce specific patterns that require dedicated rules: excessively long functions generated in a single block, circular dependencies, overuse of any in TypeScript, imports of nonexistent or deprecated modules.
// .eslintrc.json
{
  "root": true,
  "parser": "@typescript-eslint/parser",
  "parserOptions": {
    "project": "./tsconfig.json",
    "ecmaVersion": 2024
  },
  "plugins": ["@typescript-eslint", "security", "sonarjs"],
  "extends": [
    "eslint:recommended",
    "plugin:@typescript-eslint/strict",
    "plugin:security/recommended",
    "plugin:sonarjs/recommended"
  ],
  "rules": {
    // Critical rules for AI-generated code

    // Forbids explicit 'any' - AI uses it frequently for "safety"
    "@typescript-eslint/no-explicit-any": "error",

    // Requires explicit return types - AI often omits them
    "@typescript-eslint/explicit-function-return-type": ["error", {
      "allowExpressions": false,
      "allowTypedFunctionExpressions": true
    }],

    // Cognitive complexity: AI generates long, complex functions
    "sonarjs/cognitive-complexity": ["error", 15],

    // Duplication: AI replicates identical patterns across functions
    "sonarjs/no-duplicate-string": ["warn", 3],
    "sonarjs/no-identical-functions": "error",

    // Security: common patterns in vulnerable AI code
    "security/detect-object-injection": "error",
    "security/detect-non-literal-regexp": "error",
    "security/detect-possible-timing-attacks": "error",
    "security/detect-unsafe-regex": "error",

    // Function complexity - AI generates monoliths
    "max-lines-per-function": ["warn", {
      "max": 50,
      "skipBlankLines": true,
      "skipComments": true
    }],

    // Deep nesting - frequent pattern in AI code
    "max-depth": ["warn", 4],

    // Promise handling - AI often forgets await
    "@typescript-eslint/no-floating-promises": "error",
    "@typescript-eslint/await-thenable": "error",

    // Null safety
    "@typescript-eslint/no-non-null-assertion": "error",

    // Unused variables (hallucinated imports)
    "@typescript-eslint/no-unused-vars": "error",
    "no-unused-vars": "off" // replaced by typescript version
  },
  "overrides": [
    {
      // Stricter configuration for AI-generated code directories
      "files": ["**/generated/**/*.ts", "**/ai-output/**/*.ts"],
      "rules": {
        "sonarjs/cognitive-complexity": ["error", 10],
        "max-lines-per-function": ["error", {"max": 30}]
      }
    }
  ]
}
For SonarQube, beyond standard configuration, it is useful to set up a Quality Gate specific to AI code with stricter thresholds: no blocker issues, minimum 80% coverage, maximum 3% duplication, and an explicit limit on function cyclomatic complexity.
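The project side of such a gate is wired in sonar-project.properties; a minimal sketch (the project key and paths are placeholders, and the quality-gate thresholds themselves live server-side in SonarQube, configured via the UI or the Web API, not in this file):

```properties
# sonar-project.properties - analysis scope for AI-heavy codebases (sketch)
sonar.projectKey=my-org:my-service        # placeholder
sonar.sources=src
sonar.tests=tests
sonar.python.coverage.reportPaths=coverage.xml

# Make the scanner poll the server and fail the pipeline
# if the Quality Gate fails
sonar.qualitygate.wait=true
```

With sonar.qualitygate.wait enabled, the scanner's exit code reflects the gate verdict, so the CI job turns red on any blocker issue or coverage shortfall.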
Security Testing: SAST with Semgrep for AI Code
SAST (Static Application Security Testing) specialized for AI code goes beyond style linting: it searches for active vulnerability patterns, often invisible to manual review because they are syntactically correct. Semgrep is the reference tool for this category in 2025, with 20,000+ predefined rules and the ability to write custom rules that precisely match patterns AI introduces.
In November 2025, Semgrep announced the private beta of its AI-powered detection system, which combines traditional static analysis with contextual reasoning to reduce false positives by 91% compared to standalone SAST tools. For AI code, this hybrid approach is particularly effective because it understands the context in which a pattern is used, not just its presence.
# semgrep-ai-patterns.yaml
# Custom Semgrep rules for vulnerability patterns typical of AI-generated code
rules:
  # Rule 1: SQL injection via f-string or concatenation
  - id: ai-sql-injection-fstring
    pattern-either:
      - pattern: cursor.execute(f"...")
      - pattern: cursor.execute("..." + $VAR)
    message: |
      Potential SQL injection via direct interpolation.
      AI frequently generates queries with f-strings instead of parameters.
      Use cursor.execute("SELECT ... WHERE id = %s", (var,))
    languages: [python]
    severity: ERROR
    metadata:
      category: security
      owasp: "A03:2021 - Injection"
      ai-pattern: true

  # Rule 2: Path traversal in file operations
  - id: ai-path-traversal
    pattern-either:
      - pattern: open($BASE + $USER_INPUT, ...)
      - pattern: open(os.path.join($BASE, $USER_INPUT), ...)
    message: |
      Potential path traversal. AI often concatenates user paths
      without sanitization. Use pathlib.Path().resolve() and verify
      the resulting path is within the allowed directory.
    languages: [python]
    severity: ERROR

  # Rule 3: Hardcoded credentials (very common AI pattern)
  - id: ai-hardcoded-credentials
    pattern-either:
      - pattern: password = "..."
      - pattern: api_key = "..."
      - pattern: secret = "..."
      - pattern: token = "..."
    message: |
      Hardcoded credentials detected. AI frequently inserts
      realistic placeholders that look like real credentials.
      Use environment variables or a secret manager.
    languages: [python, javascript, typescript]
    severity: ERROR

  # Rule 4: Eval on external input
  - id: ai-eval-injection
    pattern-either:
      - pattern: eval($USER_INPUT)
      - pattern: exec($USER_INPUT)
    message: |
      eval/exec on user-controllable input. Dangerous pattern
      that AI introduces when generating interpreters or dynamic systems.
    languages: [python, javascript]
    severity: ERROR

  # Rule 5: Unsafe deserialization (Python pickle)
  - id: ai-unsafe-deserialization
    pattern-either:
      - pattern: pickle.loads($DATA)
      - pattern: pickle.load($FILE)
    message: |
      pickle.loads/load on untrusted data. AI uses pickle for
      simplicity without considering security implications.
      Use json.loads() or secure libraries like marshmallow.
    languages: [python]
    severity: ERROR

  # Rule 6: JWT without signature verification
  - id: ai-jwt-no-verification
    pattern: jwt.decode($TOKEN, ..., options={"verify_signature": False})
    message: |
      JWT decoded without signature verification. Common pattern in
      AI-generated code for development environments that ends up in production.
    languages: [python]
    severity: ERROR

  # Rule 7: Potentially catastrophic regex (ReDoS)
  - id: ai-catastrophic-regex
    pattern-regex: '\(\.\*\)+|\(\.\+\)+'
    message: |
      Regex pattern potentially vulnerable to ReDoS.
      AI often generates regex with exponential backtracking.
    languages: [python, javascript, typescript]
    severity: WARNING
Critical Stat: Veracode 2025
The Veracode 2025 report analyzed over 100 LLMs across 80 coding tasks. Java is the riskiest language with a 72% security failure rate. Python, C#, and JavaScript range from 38% to 45%. Most alarming: newer, more advanced models are no safer than older ones. Security in AI code does not improve with model size.
TDD with AI Assistant: Red-AI-Green-Review
Traditional Test-Driven Development follows the Red-Green-Refactor cycle: write the failing test, implement the minimum to make it pass, refactor. With an AI assistant, this cycle becomes Red-AI-Green-Review: write tests first (maintaining human control over specifications), delegate implementation to AI, verify that all tests pass, then conduct code review focused on security patterns and architecture.
This approach has a fundamental advantage: tests written by the developer represent the real specifications of the system, not AI's interpretation of the prompt. If AI generates code that fails the tests, the problem is explicit and measurable. If the code passes all tests but has security issues, Semgrep finds them. If it has design flaws, code review identifies them. The cycle is complete.
# Step 1: Developer writes tests FIRST
# test_payment_processor.py
import pytest
from decimal import Decimal

# PaymentProcessor and PaymentError do not exist yet: the AI will
# implement them in payment_processor.py against these specs.
from payment_processor import PaymentProcessor, PaymentError


@pytest.fixture
def processor():
    return PaymentProcessor()


class TestPaymentProcessor:
    """
    PaymentProcessor specs written BEFORE asking AI to implement it.
    """

    def test_process_valid_payment(self, processor):
        """Valid payment must return a transaction_id."""
        result = processor.process(
            amount=Decimal("99.99"),
            currency="USD",
            card_token="tok_valid_test"
        )
        assert result.success is True
        assert result.transaction_id is not None
        assert len(result.transaction_id) == 36  # UUID format

    def test_negative_amount_rejected(self, processor):
        """Negative amount must raise ValueError."""
        with pytest.raises(ValueError, match="Amount must be positive"):
            processor.process(
                amount=Decimal("-10.00"),
                currency="USD",
                card_token="tok_valid_test"
            )

    def test_unsupported_currency_rejected(self, processor):
        """Unsupported currency must raise ValueError."""
        with pytest.raises(ValueError, match="Currency not supported"):
            processor.process(
                amount=Decimal("50.00"),
                currency="XYZ",
                card_token="tok_valid_test"
            )

    def test_invalid_token_raises_payment_error(self, processor):
        """Invalid token must raise PaymentError, not crash."""
        with pytest.raises(PaymentError, match="Invalid card token"):
            processor.process(
                amount=Decimal("50.00"),
                currency="USD",
                card_token=""
            )

    def test_duplicate_idempotency(self, processor):
        """Same idempotency_key must not create duplicates."""
        result1 = processor.process(
            amount=Decimal("25.00"),
            currency="USD",
            card_token="tok_valid_test",
            idempotency_key="key-123"
        )
        result2 = processor.process(
            amount=Decimal("25.00"),
            currency="USD",
            card_token="tok_valid_test",
            idempotency_key="key-123"
        )
        assert result1.transaction_id == result2.transaction_id

    def test_amount_precision_preserved(self, processor):
        """Decimal precision must be maintained."""
        result = processor.process(
            amount=Decimal("0.01"),
            currency="USD",
            card_token="tok_valid_test"
        )
        assert result.charged_amount == Decimal("0.01")


# Step 2: Ask AI to implement PaymentProcessor
#   Prompt: "Implement PaymentProcessor that passes all these tests.
#   Use Decimal for amounts, validate inputs,
#   handle idempotency, do not use float."

# Step 3: Verify AI implementation passes all tests
#   pytest test_payment_processor.py -v

# Step 4: Run Semgrep on the implementation
#   semgrep --config semgrep-ai-patterns.yaml payment_processor.py

# Step 5: Code review focused on security and architecture
Automated Code Review: Multi-Agent Pipeline
A multi-agent code review pipeline for AI code combines different specialists operating in parallel on different quality aspects: a security review agent, a code style agent, a performance agent, and an architecture agent. The result is a structured report that the developer uses as a starting point for human review, not as a replacement for it.
This approach is not new (tools like CodeClimate, DeepSource, and Codacy have existed for years), but in 2025 the quality of models has made a fully agentic pipeline practical. Claude Code can be orchestrated into parallel subagents that analyze code from different angles and aggregate results into a single review report.
# ai_code_review_pipeline.py
# Multi-agent pipeline for reviewing AI-generated code
import asyncio
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewResult:
    agent: str
    severity: str  # "critical", "high", "medium", "low", "info"
    category: str
    message: str
    file: str
    line: int | None = None


async def run_security_agent(file_path: str) -> List[ReviewResult]:
    """Security agent: runs Semgrep OWASP ruleset."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "semgrep", "--config", "p/owasp-top-ten",
        "--json", file_path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    data = json.loads(stdout)
    for finding in data.get("results", []):
        results.append(ReviewResult(
            agent="security",
            severity="critical" if finding["extra"]["severity"] == "ERROR" else "high",
            category="security",
            message=finding["extra"]["message"],
            file=finding["path"],
            line=finding["start"]["line"]
        ))
    return results


async def run_complexity_agent(file_path: str) -> List[ReviewResult]:
    """Complexity agent: analyzes cyclomatic complexity with radon."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "radon", "cc", "-s", "-n", "B", file_path,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, _ = await proc.communicate()
    for line in stdout.decode().splitlines():
        if line.strip():
            parts = line.strip().split()
            if len(parts) >= 5:
                results.append(ReviewResult(
                    agent="complexity",
                    severity="medium" if "B" in parts else "high",
                    category="maintainability",
                    message=f"High complexity function: {' '.join(parts)}",
                    file=file_path
                ))
    return results


async def run_coverage_agent(
    test_file: str,
    source_file: str
) -> List[ReviewResult]:
    """Coverage agent: verifies tests adequately cover AI code."""
    results = []
    proc = await asyncio.create_subprocess_exec(
        "pytest", test_file,
        f"--cov={source_file}",
        "--cov-report=json:coverage.json",
        "--quiet",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    await proc.communicate()
    try:
        with open("coverage.json") as f:
            cov_data = json.load(f)
        total = cov_data["totals"]["percent_covered"]
        if total < 80.0:
            results.append(ReviewResult(
                agent="coverage",
                severity="high",
                category="testing",
                message=f"Insufficient coverage: {total:.1f}% (minimum 80%)",
                file=source_file
            ))
    except FileNotFoundError:
        results.append(ReviewResult(
            agent="coverage",
            severity="critical",
            category="testing",
            message="Cannot calculate coverage. Missing tests?",
            file=source_file
        ))
    return results


async def review_ai_code(
    source_file: str,
    test_file: str | None = None
) -> dict:
    """Full review pipeline for AI-generated code."""
    tasks = [
        run_security_agent(source_file),
        run_complexity_agent(source_file),
    ]
    if test_file:
        tasks.append(run_coverage_agent(test_file, source_file))

    # Parallel execution of all agents
    all_results = await asyncio.gather(*tasks, return_exceptions=True)

    findings: List[ReviewResult] = []
    for result in all_results:
        if isinstance(result, Exception):
            print(f"Agent failed: {result}")
        else:
            findings.extend(result)

    critical = [f for f in findings if f.severity == "critical"]
    high = [f for f in findings if f.severity == "high"]

    return {
        "total_findings": len(findings),
        "critical": len(critical),
        "high": len(high),
        "approve": len(critical) == 0 and len(high) == 0,
        "findings": findings
    }


if __name__ == "__main__":
    result = asyncio.run(review_ai_code(
        source_file="payment_processor.py",
        test_file="test_payment_processor.py"
    ))
    print(f"Review complete: {result['total_findings']} findings")
    print(f"Approved: {result['approve']}")
    if not result["approve"]:
        print("\nCRITICAL ISSUES:")
        for f in result["findings"]:
            if f.severity in ["critical", "high"]:
                print(f"  [{f.severity.upper()}] {f.file}:{f.line} - {f.message}")
Quality Metrics for AI Code
Testing AI code requires a broader set of metrics than just code coverage. Coverage measures how many lines of code are executed by tests, but does not measure the quality of those tests. AI code can have 90% coverage with tests that verify nothing relevant. The supplementary metrics needed are:
Mutation Score
Mutation testing is the most reliable way to measure test quality, not just quantity.
A mutation tester (such as mutmut for Python or Stryker
for JavaScript/TypeScript) systematically introduces small changes to code (mutants:
changes > to >=, inverts a boolean, removes a return)
and verifies that existing tests detect these changes. If a mutant survives the tests,
the tests are not granular enough.
For AI code, the minimum target mutation score is 70% (Gold tier). Scores below 50% indicate an inadequate test suite, regardless of coverage percentage.
# setup.cfg - mutmut configuration
[mutmut]
paths_to_mutate=src/
backup=False
runner=python -m pytest tests/ -x -q
tests_dir=tests/

# Run mutation testing
#   mutmut run
# View results
#   mutmut results

# Example surviving mutant output:
#   Mutant #1: SURVIVED
#   src/payment_processor.py:45
#   -    if amount <= 0:
#   +    if amount < 0:
#
#   This means no test covers amount=0 exactly!
#   Add: assert raises ValueError for amount=0

# CI/CD script with minimum threshold
#!/bin/bash
mutmut run --no-progress

KILLED=$(mutmut results | grep "Killed" | grep -o '[0-9]*' | head -1)
TOTAL=$(mutmut results | grep "Total" | grep -o '[0-9]*' | head -1)

if [ -z "$TOTAL" ] || [ "$TOTAL" -eq 0 ]; then
    echo "No mutants generated"
    exit 0
fi

SCORE=$(echo "scale=2; $KILLED * 100 / $TOTAL" | bc)
echo "Mutation score: $SCORE%"

# Minimum threshold 70% for AI-generated code
if (( $(echo "$SCORE < 70" | bc -l) )); then
    echo "FAIL: Insufficient mutation score for AI code ($SCORE% < 70%)"
    exit 1
fi
echo "OK: Acceptable mutation score ($SCORE%)"
Cyclomatic Complexity
Cyclomatic complexity measures the number of linearly independent paths through a function. A function with complexity 1 has a single possible path (no branches). A function with complexity 20 has 20 paths and requires at least 20 test cases for complete branch coverage. AI code tends to generate high-complexity functions because it tries to handle all cases in a single monolithic function.
Recommended targets for AI code in 2025: median value < 10 per function, automatic flag at > 15 for mandatory review, and automatic rejection at > 25 as a CI/CD gate. These values are stricter than for human code because high-complexity AI code is less predictable in error handling than equivalently complex developer-written code.
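These thresholds can be prototyped with nothing but the standard library. A minimal sketch that approximates cyclomatic complexity as 1 + the number of branch points per function (a simplification: for example, each BoolOp counts once rather than once per operator; a real gate would use radon or xenon, which apply the full metric):

```python
import ast

# AST node types that open an additional execution path
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp, ast.Assert, ast.comprehension)


def approx_complexity(source: str) -> dict[str, int]:
    """Approximate cyclomatic complexity per function: 1 + branch points."""
    tree = ast.parse(source)
    scores = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores


code = """
def simple(x):
    return x + 1

def branchy(x):
    if x > 0:
        for i in range(x):
            if i % 2 == 0:
                x += i
    return x
"""
print(approx_complexity(code))  # {'simple': 1, 'branchy': 4}
```

Wired into CI, any function scoring above 15 would be flagged for mandatory review and above 25 rejected, matching the thresholds above.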
Best Practices: Checklist for Accepting AI Code
The following checklist represents the minimum quality gate before integrating AI-generated code into a production codebase. It is not exhaustive but covers the most frequent error categories documented in 2025.
Checklist: Quality Gate for AI-Generated Code
Layer 1 - Automated (CI/CD Gate, blocking)
- All unit tests pass (0 failures)
- Semgrep security scan: 0 ERROR severity issues
- ESLint/TypeScript: 0 errors (warnings acceptable with justification)
- Code coverage >= 80% on new lines
- No dependencies with known CVE (npm audit / pip-audit)
- No hardcoded credentials (git-secrets, detect-secrets)
Layer 2 - Quality Metrics (warnings if not met)
- Mutation score >= 70% on critical modules
- Cyclomatic complexity < 15 per function
- Cognitive complexity < 20 per function
- No function exceeding 50 lines
- Code duplication < 3%
Layer 3 - Manual Review (mandatory for significant changes)
- Business logic verified against original specifications
- Error handling reviewed: no silent catch, no generic Exception
- Edge cases documented in tests: empty input, null, extreme values
- No external API calls without timeout and retry logic
- State mutation isolated: no undocumented side effects
- Property-based tests for functions with numeric or textual inputs
Layer 4 - Architectural Review (for new modules or significant refactoring)
- Single Responsibility respected (one reason to change per class)
- Dependencies toward abstractions, not concrete implementations
- No circular dependencies between modules
- Public interfaces stable and well documented
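Layer 1 maps directly onto a blocking CI job. A hedged GitHub Actions sketch (action versions, the requirements-dev.txt path, and the .secrets.baseline file are placeholders to adapt to the project; overall coverage is used here as a proxy for new-line coverage):

```yaml
# .github/workflows/ai-code-gate.yml - Layer 1 as a blocking CI job (sketch)
name: ai-code-quality-gate
on: [pull_request]

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-dev.txt  # placeholder path

      # Unit tests + coverage gate (>= 80%)
      - run: pytest --cov=src --cov-fail-under=80

      # Security scan: nonzero exit code on any finding
      - run: semgrep scan --config p/owasp-top-ten --error

      # Dependencies with known CVEs
      - run: pip-audit

      # Secret detection against a committed baseline
      - run: detect-secrets-hook --baseline .secrets.baseline $(git ls-files)
```

Each step exits nonzero on failure, so the pull request cannot merge while any Layer 1 criterion is unmet.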
Anti-Patterns: What NOT to Do with AI Code
- Never merge without running tests: even for "small fixes". AI code is deterministic but not transparent: a prompt change alters the entire implementation in non-obvious ways.
- Never trust "it works on my machine": AI code is particularly sensitive to environment differences. CI/CD tests on a clean container are mandatory.
- Never accept AI code on authentication modules without SAST: auth and authorization are statistically the most dangerous areas for AI code. The Veracode Report cites JWT without verification, broken session management, and privilege escalation among the top-5 AI vulnerabilities.
- Do not use AI to generate tests for AI code: tests must be written by the developer. Asking AI to write tests for its own code produces tests that verify the AI implementation, not the system specifications.
Conclusions: Testing as the Enabler of Professional Vibe Coding
The goal of this article is not to discourage AI-assisted development. It is the opposite: to build the confidence needed to do it well. The 45% security failure rate of AI code does not mean that 45% of AI code reaching production is insecure. It means that without a structured verification process, that is the probability. With the right process, that percentage drops dramatically.
The combination of TDD (tests written before, not after), property-based testing with Hypothesis, SAST with Semgrep, static analysis with ESLint configured for AI patterns, and mutation testing to verify test quality, creates a safety system that makes vibe coding professional. It does not slow development down: it stabilizes it.
The next article in the series explores Prompt Engineering for IDEs and Code Generation: how to write prompts that produce better, more secure, more maintainable code, reducing the testing burden downstream.
Resources and Useful Links
Vibe Coding and Agentic Development Series
Full Series
- Vibe Coding: The Paradigm That Changed 2025
- Claude Code: Agentic Development from the Terminal
- Agentic Workflows: Decomposing Problems for AI
- Multi-Agent Coding: LangGraph, CrewAI and AutoGen
- Testing AI-Generated Code (this article)
- Prompt Engineering for IDEs and Code Generation
- Security in Vibe Coding: Risks and Mitigations
- The Future of Agentic Development in 2026
Related Deep Dives
- Cursor IDE: The First AI-First IDE - How Cursor handles testing of AI-generated code
- OWASP Top 10 2025: Current Vulnerabilities - The vulnerability context AI code introduces
- Claude for Code Review - Using Claude as a review agent