Evidence-Driven Development

The Missing Discipline in AI-Assisted Engineering

Daniel Leblond, March 2026

Teams adopted AI coding tools, saw short-term velocity spikes, and then paid the verification tax later in debugging, regressions, and production incidents.

The gap is not generation. The gap is proof. Evidence-Driven Development turns that gap into a repeatable workflow with explicit gates.

Perceived vs. actual speed (METR, 2025): developers believed AI tools made them 24% faster; measured against the clock, they were 19% slower.

The Loop

The model is simple: define intent, prove the gap, capture baseline, implement, prove pass, capture result, and verify quality dimensions before review.

The critical constraint is sequence. Steps before implementation create reliability; steps after implementation create trust.

Phase	What happens	Why it matters
Document	Write what "done" means before implementation.	Prevents drifting requirements and vague success criteria.
Test: Fail	Define and run tests that prove the gap exists.	Confirms you are testing behavior, not assumptions.
Capture: Before	Record baseline outputs before touching implementation.	Provides non-negotiable proof for reviewers and future audits.
Implement	Apply the change with AI assistance under constraints.	Execution stays fast while the bar remains human-defined.
Test: Pass	Run targeted tests and confirm behavior now passes.	Validates the change solves the exact acceptance criteria.
Capture: After	Collect equivalent post-change artifacts.	Enables clear before/after comparison.
Verify	Audit security, accessibility, performance, docs, and drift.	Catches failure modes tests alone miss.
Review	Human reviewer accepts or rejects based on evidence.	Keeps accountability with engineers, not prompts.

The implementation loop: human-defined constraints, AI-assisted execution.
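The sequence constraint can be sketched in a few lines. The following is a minimal illustration, not part of any published tooling: `Phase`, `LoopTracker`, and `OutOfOrderError` are hypothetical names, and the only rule encoded is that each phase in the table may complete only after the one before it.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The eight phases of the loop, in their required order."""
    DOCUMENT = 1
    TEST_FAIL = 2
    CAPTURE_BEFORE = 3
    IMPLEMENT = 4
    TEST_PASS = 5
    CAPTURE_AFTER = 6
    VERIFY = 7
    REVIEW = 8


class OutOfOrderError(Exception):
    """Raised when a phase is attempted before its predecessors are done."""


class LoopTracker:
    """Hypothetical helper that only lets phases complete in sequence."""

    def __init__(self) -> None:
        self.completed: list[Phase] = []

    @property
    def done(self) -> bool:
        return len(self.completed) == len(Phase)

    def complete(self, phase: Phase) -> None:
        if self.done:
            raise OutOfOrderError("loop already finished")
        # The next allowed phase is fully determined by how many are done.
        expected = Phase(len(self.completed) + 1)
        if phase is not expected:
            raise OutOfOrderError(
                f"cannot complete {phase.name}; "
                f"next required phase is {expected.name}"
            )
        self.completed.append(phase)
```

With this sketch, calling `complete(Phase.IMPLEMENT)` right after `DOCUMENT` and `TEST_FAIL` raises an error naming `CAPTURE_BEFORE` as the missing step, which is the failure the loop is designed to make visible.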

Before Evidence Is Irreversible in Practice

Teams can theoretically reconstruct a baseline after implementation starts, but almost nobody does. Momentum shifts to fixing forward.

That is why missing before-evidence is treated as a reset condition in disciplined loops.
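One way to make the reset condition mechanical is to compare timestamps. This is a sketch under assumptions: `ChangeRecord` and `gate` are hypothetical names, and "reset" here simply means that a baseline was never captured or was captured only after implementation had started.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ChangeRecord:
    """Hypothetical record of when each artifact was produced for a change."""
    before_captured_at: Optional[datetime]
    implementation_started_at: Optional[datetime]


def gate(record: ChangeRecord) -> str:
    """Return 'proceed', 'capture-first', or 'reset' for the given change."""
    if record.implementation_started_at is None:
        # Nothing has been touched yet, so a baseline can still be captured.
        return "proceed" if record.before_captured_at else "capture-first"
    if record.before_captured_at is None:
        # Implementation began with no baseline at all.
        return "reset"
    if record.before_captured_at >= record.implementation_started_at:
        # A baseline taken after work began no longer describes the old state.
        return "reset"
    return "proceed"
```

The design choice is that the gate never offers a "reconstruct the baseline" outcome, mirroring the observation that teams rarely do so once work is under way.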

Maturity model: ad-hoc to audit-verified engineering.

The Audit: Ten Dimensions

Dimension	What it catches
Build	Compilation, lint, and suite integrity
Telemetry	PII leaks and unsafe logging payloads
Accessibility	Landmarks, keyboard flow, heading hierarchy
Security	Secrets, injection risk, dependency flaws
Performance	N+1 paths, unbounded loops, memory leaks
Documentation	Spec and implementation drift
Test Coverage	Behavior changes without matching tests
TODO Debt	Skipped follow-ups and unresolved placeholders
Error Handling	Swallowed errors and leaked internals
AI Verbosity	Redundant comments and unnecessary abstractions

Audit posture before and after evidence-driven checks.
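As an illustration of how such an audit might be structured, the sketch below registers one check function per dimension and runs them over a source string. Only two dimensions are implemented, and deliberately naively: TODO Debt as a comment scan and Error Handling as a search for `except` blocks that only `pass`. All function names are hypothetical, and real audits for the other eight dimensions would need dedicated tooling.

```python
import ast
import re
from typing import Callable

DIMENSIONS = [
    "Build", "Telemetry", "Accessibility", "Security", "Performance",
    "Documentation", "Test Coverage", "TODO Debt", "Error Handling",
    "AI Verbosity",
]

TODO_PATTERN = re.compile(r"#\s*(TODO|FIXME|XXX)\b")


def check_todo_debt(source: str) -> list[str]:
    """Flag unresolved placeholder comments, one finding per line."""
    return [
        f"line {number}: {line.strip()}"
        for number, line in enumerate(source.splitlines(), start=1)
        if TODO_PATTERN.search(line)
    ]


def check_error_handling(source: str) -> list[str]:
    """Flag except blocks whose whole body is `pass` (swallowed errors)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                findings.append(f"line {node.lineno}: exception swallowed")
    return findings


# Dimensions without an entry here are reported as not implemented.
CHECKS: dict[str, Callable[[str], list[str]]] = {
    "TODO Debt": check_todo_debt,
    "Error Handling": check_error_handling,
}


def run_audit(source: str) -> dict[str, object]:
    """Run every implemented check and report on all ten dimensions."""
    report: dict[str, object] = {}
    for dimension in DIMENSIONS:
        check = CHECKS.get(dimension)
        report[dimension] = check(source) if check else "not implemented"
    return report
```

Reporting unimplemented dimensions explicitly, rather than omitting them, keeps an empty result from being mistaken for a clean one.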

PR template enforcing observable evidence, audit output, and explicit test plans.

Evidence Examples by Domain

Domain	Before evidence	After evidence
API endpoint	curl response with wrong status	curl response with expected status and schema
Database migration	Query before migration	Query showing new columns and populated values
Infrastructure	Current plan output	Desired plan and apply output
Performance	Benchmark baseline	Benchmark delta after optimization
Security patch	Scanner finding	Scanner clean report

Same loop, different artifacts, one quality standard.
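Whatever the domain, the artifacts can share one shape. The sketch below is an assumption about how that might look, not a prescribed format: `capture` wraps any command output in a timestamped, hashed record, and `compare` produces a unified diff of a before/after pair using Python's standard `difflib`.

```python
import difflib
import hashlib
from datetime import datetime, timezone


def capture(label: str, domain: str, output: str) -> dict:
    """Wrap raw command output in a timestamped, hash-stamped record."""
    return {
        "label": label,    # "before" or "after"
        "domain": domain,  # a row from the table, e.g. "API endpoint"
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "output": output,
    }


def compare(before: dict, after: dict) -> str:
    """Return a unified diff of the two captured outputs."""
    if before["domain"] != after["domain"]:
        raise ValueError("before and after evidence must cover the same domain")
    diff = difflib.unified_diff(
        before["output"].splitlines(),
        after["output"].splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
    )
    return "\n".join(diff)
```

For the API endpoint row, the `output` strings would be the saved curl responses; for a database migration, the query results. The hash gives reviewers a cheap way to confirm that the artifact attached to a pull request is the one that was originally captured.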

The Burden of Proof in Pull Requests
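A pull request template that asks for observable evidence, audit output, and an explicit test plan can be backed by an automated check. The following is a rough sketch: the section titles in `REQUIRED_SECTIONS` are hypothetical wording, and the check only verifies that each section exists as a markdown heading with some text under it, not that the evidence is sound.

```python
import re

# Hypothetical section headings; adjust to the wording of the real template.
REQUIRED_SECTIONS = ["Evidence", "Audit Output", "Test Plan"]


def missing_sections(pr_body: str) -> list[str]:
    """Return required sections that are absent or have no content."""
    missing = []
    for title in REQUIRED_SECTIONS:
        # Match a markdown heading line, then capture everything up to the
        # next heading or the end of the description.
        pattern = re.compile(
            rf"^#+[ \t]*{re.escape(title)}[ \t]*$(.*?)(?=^#+\s|\Z)",
            re.MULTILINE | re.DOTALL | re.IGNORECASE,
        )
        match = pattern.search(pr_body)
        if match is None or not match.group(1).strip():
            missing.append(title)
    return missing
```

A continuous integration job could call `missing_sections` on the pull request description and fail when the list is non-empty, leaving the judgment about the quality of the evidence to the human reviewer.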

References

Kent Beck (2025) Augmented Coding: Beyond the Vibes
ThoughtWorks (2025) AI-Aided Test-First Development
METR (2025) AI Tools Made Experienced Developers 19% Slower
Addy Osmani (2026) AI Writes Code Faster. Your Job Is Still to Prove It Works.
Microsoft .NET (2026) Ten Months with Copilot Coding Agent in dotnet/runtime
