Core Loop

AI-first engineering at scale

The EDD Constitution

A Living Contract That Only Ratchets Up

Daniel Leblond, June 2026

In 2026, an enterprise team built an agent that read incident management tickets, generated post-mortems from them, and created repair items automatically. A second agent on the same team watched for new work items and opened draft PRs for each one to help developers get started. A developer reviewed one of those PRs, found it reasonable, approved it, and it merged to the main branch.

**The problem was that the change was entirely fabricated.**

The agent had invented a plausible-looking fix for a problem that did not exist in the way it described. The change regressed a real production scenario. A real customer noticed. There was an outage.

This is why the EDD Constitution exists.

The constitution is a single file that governs every non-trivial change to the codebase. It defines what a finished change looks like, what evidence it must carry, what a scoped audit must cover, and how the rules themselves can evolve. Every agent and every human who touches the repo loads it on every session. The bar only ratchets up.

This post walks through the constitution step by step: the incident that motivated it, the design question it answers, how it is structured, how it was written (and what we got wrong early on), how amendments work, and how the full kit unfurls into a new repo.

Step 1 - What it is

The incident is the clearest possible illustration of the failure mode EDD is designed to prevent. The agent did not hallucinate randomly. It produced a coherent, readable, professionally formatted PR. The developer who reviewed it did what developers do: they read for intent, found the intent plausible, and approved. The gap was not attentiveness. The gap was the absence of proof. Nothing in the review process required the agent to demonstrate that the problem it claimed to fix actually existed, that the fix addressed real behavior, or that the change did not regress anything.

EDD answers that gap with a single hard requirement: if the agent cannot prove it, the task is not done. Proof is not a summary. Proof is not a confidence assertion. Proof is an artifact that can be independently verified - a failing test, a scanner output, a before/after screenshot, a plan diff. The PR carries the artifact or the PR is incomplete.

The constitution lives at `docs/methodology/CONSTITUTION.md`. Every AI agent configured on the project loads it on every session through its own loader file. The body is agent-neutral: no tool names, no platform-specific syntax. It defines nine foundational principles, hard constraints with stable slug IDs, the 10-step implementation loop, audit dimensions, evidence acceptance criteria per change type, triviality carve-outs for low-risk work, and the amendment clause.

Each hard constraint has a slug like `[HC-EVIDENCE-BEFORE]` and declares three things: the bar (what must be true), the verification (how adherence is checked), and the pattern source (which instruction file elaborates the implementation guidance).

| HC | Bar | Verified by |
| --- | --- | --- |
| `[HC-EVIDENCE-BEFORE]` | Before evidence mandatory prior to implementation edits for non-trivial work | Loop step 4; human review |
| `[HC-SECURITY-LINT]` | Security lint rules remain at error severity; do not disable `eslint-plugin-security` or `@microsoft/eslint-plugin-sdl` | `pnpm lint` |
| `[HC-ASCII-PUNCT]` | No em dash characters in blog articles; use ASCII punctuation alternatives | `/audit Documentation` |
| `[HC-A11Y-GATE]` | New interactive UI surfaces require e2e a11y coverage for all supported themes | `e2e/a11y.spec.js` |
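
As a rough illustration, the three-part declaration could be modelled as a typed record with a small validator. The field names, the slug pattern (inferred from the slugs shown above), and the `patternSource` path are assumptions for this sketch; the constitution's real schema is not shown in this post.

```typescript
// Hypothetical shape for a hard-constraint record; field names are assumptions.
interface HardConstraint {
  slug: string;          // stable ID, e.g. "[HC-SECURITY-LINT]"
  bar: string;           // what must be true
  verification: string;  // how adherence is checked
  patternSource: string; // instruction file that elaborates the guidance
}

// Pattern inferred from the slugs in the table: [HC-UPPERCASE-WORDS].
const SLUG_PATTERN = /^\[HC-[A-Z0-9]+(?:-[A-Z0-9]+)*\]$/;

// Returns a list of problems; an empty list means the record is well formed.
function validateHardConstraint(hc: HardConstraint): string[] {
  const problems: string[] = [];
  if (!SLUG_PATTERN.test(hc.slug)) problems.push("malformed slug: " + hc.slug);
  if (hc.bar.trim() === "") problems.push("missing bar");
  if (hc.verification.trim() === "") problems.push("missing verification");
  if (hc.patternSource.trim() === "") problems.push("missing pattern source");
  return problems;
}

const securityLint: HardConstraint = {
  slug: "[HC-SECURITY-LINT]",
  bar: "Security lint rules remain at error severity",
  verification: "pnpm lint",
  patternSource: "docs/methodology/security.instructions.md", // hypothetical path
};
```

A record that cannot name its verification fails validation, which mirrors the stance that a rule without a check is not constitutional.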

The 10-step loop is the operational heart: define done, write a test that proves the gap, watch it fail, capture before-evidence, implement, watch the test pass, capture after-evidence, verify docs, audit across declared dimensions, then human review. Steps 1 through 4 happen before any implementation edit. The ordering is constitutional because the incident shows exactly what happens when it is not enforced.

As an additional safeguard, EDD borrows a core principle from TDD: the scenario must be proven to fail before work begins. A proof that passes at the end of a change is not sufficient evidence on its own - if the test also passed before the change, the agent may have fixed something that was never broken. This is a specific failure mode in AI-assisted workflows: an agent generates a plausible-sounding fix, the verification happens to be satisfied, and the change ships as though it solved a real problem. Requiring a documented failing state before implementation starts closes that gap. The before-evidence step is not just a baseline for comparison - it is proof that the problem being solved was real.
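
The fail-before, pass-after rule can be sketched as a small decision function. The type and function names here are hypothetical and only illustrate the logic described above; they are not taken from the EDD tooling.

```typescript
type TestOutcome = "pass" | "fail";

// Hypothetical record of one targeted test, run twice.
interface ProofRun {
  testId: string;
  before: TestOutcome; // outcome recorded before any implementation edit
  after: TestOutcome;  // outcome recorded after the change
}

type Verdict = "proven" | "never-broken" | "still-failing";

// A change counts as proven only when the test failed first and passes now.
function judgeProof(run: ProofRun): Verdict {
  if (run.after === "fail") return "still-failing";
  if (run.before === "pass") return "never-broken"; // green before and after: nothing was shown to be fixed
  return "proven";
}
```

The `never-broken` verdict is the case a pass-only check would miss: the suite is green, but there is no evidence the problem ever existed.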

Of all the hard constraints that have shipped through the constitution, two have caught more defects than all the others combined, and together they are arguably the most important upgrade EDD makes to any AI-assisted workflow: `[HC-EVIDENCE]` and `[HC-EVIDENCE-INTEGRITY]`. Both are declared in `.github/copilot-instructions.md` - the eager-loaded file that is in the agent's context on every session.

`[HC-EVIDENCE]` requires that every PR carries actual before and after artifacts - not descriptions of what the artifacts would show, not summaries of what the agent believes happened, but the raw output. `[HC-EVIDENCE-INTEGRITY]` goes further: it requires that the evidence in the PR can be traced back to the work that was actually done. It validates that the PR contains what it says it contains.

Together, these two constraints have caught hundreds of AI-generated bugs before they reached review. The failure mode they guard against is not hallucination in the obvious sense. It is a subtler behavior: the agent marks a task complete, writes a confident PR description, and includes what reads as evidence - but the evidence was composed, not captured. The agent described what the output should look like rather than running the command and including the actual output.

`[HC-EVIDENCE-INTEGRITY]` is specifically effective at catching what might be called the "I couldn't do that" pattern. An agent facing a hard or unfamiliar task will sometimes claim the task is impossible or out of scope rather than attempting it. The claim is often framed as a limitation - a tool that does not exist, a constraint that prevents the approach, an environment that does not support the operation. `[HC-EVIDENCE-INTEGRITY]` surfaces this: if the agent is claiming a task could not be done, the PR must show evidence that the task was genuinely attempted and the obstacle is real. "I couldn't run the test suite" requires a terminal output showing the failure, not a statement that the failure occurred. Without that requirement, the agent's avoidance of difficult work is invisible at review time and the task ships incomplete as though it were done.
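
A minimal sketch of a presence check for both constraints follows, assuming a hypothetical PR evidence record. It only confirms that each claim carries a captured command and raw output; it cannot tell composed output from captured output by itself, which would need something like a comparison against a CI log or a re-run.

```typescript
// Hypothetical evidence record attached to a PR; all names are assumptions.
interface Artifact {
  kind: "before" | "after" | "blocked-attempt";
  command: string;   // the command that was actually run
  rawOutput: string; // captured output, not a description of it
}

interface PullRequestEvidence {
  artifacts: Artifact[];
  blockedClaims: string[]; // e.g. "I couldn't run the test suite"
}

// Returns a list of problems; an empty list means the presence checks pass.
function evidenceProblems(pr: PullRequestEvidence): string[] {
  const problems: string[] = [];
  // An artifact without a command or without output is a description, not evidence.
  const captured = pr.artifacts.filter(
    (a) => a.command.trim() !== "" && a.rawOutput.trim() !== ""
  );
  for (const kind of ["before", "after"] as const) {
    if (!captured.some((a) => a.kind === kind)) {
      problems.push("missing " + kind + " artifact");
    }
  }
  // Every "could not be done" claim needs a captured attempt behind it.
  const attempts = captured.filter((a) => a.kind === "blocked-attempt").length;
  if (pr.blockedClaims.length > attempts) {
    problems.push((pr.blockedClaims.length - attempts) + " blocked claim(s) without a captured attempt");
  }
  return problems;
}
```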

Step 2 - How we got here

The constitution is not an accumulation of lessons learned. It is the answer to one design question asked at the start: how do you force an AI agent to prove its work every single time, in a way that is not bypassable and does not depend on a human remembering to ask?

Answering that question forced a decomposition. Proof requires knowing what done looks like before implementation starts - that produced the documentation step. Proof requires a test that fails before the change and passes after - that produced the failing/passing test discipline. Proof requires a before-state that was captured before implementation touched anything - that produced the before-evidence requirement. Proof requires an independent audit that catches what per-change evidence cannot - that produced the audit dimensions. And proof has to survive the review process without being softened away - that produced the hard constraint that the human reviewer is the final gate, and the only gate that cannot be automated.

The result is the 10-step loop. Every step exists because removing it creates a hole through which unproven work can pass. The loop is not a checklist. It is a causal chain.

From there the question became: what categories of failure can an agent produce that the loop does not catch by default? That produced the hard constraints. A security lint run that stayed silent after its rules had been downgraded produced `[HC-SECURITY-LINT]`. A character encoding failure in a deployed artifact produced `[HC-ASCII-PUNCT]`. Each HC closes a specific failure class with a specific automated verification. Rules that could not name a verification were declined: if a check cannot be automated, the rule is aspirational, not constitutional.

The final design question was: who governs the constitution itself? The answer is the ratchet. The constitution can only get stricter. Amendments must carry proof that agent behavior improves or holds. The floor never moves down. This means the constitution can be trusted over time in a way that a living style guide cannot: every version is at least as strict as the one before it.

Step 3 - How it is structured

The constitution follows a three-layer architecture.

**Layer 1: The canonical body.** One file, `docs/methodology/CONSTITUTION.md`. This is the source of truth. It is agent-neutral: it makes no assumptions about which AI tool is in use. It defines principles, hard constraints, the loop, audit dimensions, evidence acceptance criteria, triviality carve-outs, and the self-improvement clause. When the body exceeds roughly 250 lines, rules that apply to a specific path pattern are factored out into path-scoped files, with a one-line pointer left in the body.

**Layer 2: Agent loaders.** Every configured agent gets exactly one loader file in the location that agent eagerly loads. For GitHub Copilot that is `.github/copilot-instructions.md`. For Claude that is `CLAUDE.md`. The loader imports or inlines the constitution body depending on what the platform currently supports. The loader is mechanically rendered from the canonical body; it is not hand-edited. If a project uses three AI tools, there are three loader files, all rendering the same constitutional content.

**Layer 3: Path-scoped rules.** Some rules apply only to specific file types. Accessibility and localization rules apply to JSX and CSS files, not to infrastructure templates. Those rules live in instruction files whose frontmatter declares a glob for the paths they cover.

The agent loads these files only when it touches a matching path. This keeps the eager-load context budget tight (core rules always present, specialized rules loaded on demand) and prevents accessibility guidance from surfacing on a Bicep template edit.
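
To make the on-demand loading concrete, here is a minimal sketch of how a touched path could be matched against each rule file's glob. The `applyTo` field name, the file paths, and the tiny glob matcher are assumptions for illustration; each agent platform has its own frontmatter keys and matching rules.

```typescript
// Hypothetical record for a path-scoped rule file; `applyTo` stands in for the frontmatter glob.
interface ScopedRuleFile {
  path: string;
  applyTo: string[];
}

// Minimal glob support: `**/` spans zero or more directories, `*` stays within one path segment.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      pattern += "(?:.*/)?";
      i += 3;
    } else if (glob.charAt(i) === "*") {
      pattern += "[^/]*";
      i += 1;
    } else {
      // Escape regex metacharacters so literal characters match literally.
      pattern += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + pattern + "$");
}

// Returns the rule files whose globs match the path the agent is about to touch.
function rulesToLoad(touchedPath: string, files: ScopedRuleFile[]): string[] {
  return files
    .filter((f) => f.applyTo.some((g) => globToRegExp(g).test(touchedPath)))
    .map((f) => f.path);
}

const ruleFiles: ScopedRuleFile[] = [
  { path: ".github/instructions/a11y.instructions.md", applyTo: ["**/*.jsx", "**/*.css"] }, // hypothetical
  { path: ".github/instructions/infra.instructions.md", applyTo: ["**/*.bicep"] },          // hypothetical
];
```

With these example entries, `rulesToLoad("infra/main.bicep", ruleFiles)` selects only the infrastructure file, so accessibility guidance never enters context on a Bicep edit.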

Supporting files complete the structure: `/constitute` command bodies in `.github/prompts/`, per-feature folders at `docs/methodology/features/`, eval scenarios at `docs/methodology/eval/scenarios/`, and `verify-sequence.yaml` at the repo root that defines the CI gate ordering.

Step 4 - How we wrote it

**Context is everything. What is not in context is dead to the agent.**

This is the single most important insight about writing a constitution for AI-governed development. A rule that exists in a file the agent never loads does not exist. A hard constraint buried in a document that only loads when a specific file path is touched is not a hard constraint for any other path. The constitution must always be in context, and everything that must always apply must live there.

This drives three authoring decisions:

**Token optimization is non-negotiable.** The canonical body targets under 300 lines and 8k tokens. Every amendment must maintain or improve token efficiency - verbose rules that accomplish what terse ones can are rejected on those grounds alone. This is not a style preference. It is a load constraint. If the constitution exceeds the budget, agents in smaller context windows start truncating it, and truncated rules are no rules.
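
A budget gate along these lines could run in CI. This is a sketch only: the four-characters-per-token estimate is a crude heuristic assumed here (real tokenizers vary by model), 8k is read as 8000, and the function names are hypothetical.

```typescript
// Limits taken from the stated target: under 300 lines and under 8k tokens.
const MAX_LINES = 300;
const MAX_TOKENS = 8000;

// Assumption: roughly 4 characters per token. Swap in a real tokenizer for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface BudgetReport {
  lines: number;
  estimatedTokens: number;
  withinBudget: boolean;
}

// Measures the canonical body against both limits.
function checkBudget(body: string): BudgetReport {
  const lines = body.split("\n").length;
  const estimatedTokens = estimateTokens(body);
  return {
    lines,
    estimatedTokens,
    withinBudget: lines < MAX_LINES && estimatedTokens < MAX_TOKENS,
  };
}
```

Running the same check on main and on the amendment branch would also show whether an amendment maintained or improved token efficiency.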

**Conditional loading through frontmatter.** Rules that apply only to specific file types are factored out of the canonical body and into path-scoped instruction files with a frontmatter glob declaration. Accessibility and localization guidance loads only when the agent touches JSX or CSS. Infrastructure guidance loads only when the agent touches Bicep. The canonical body keeps a one-line pointer. The agent loads the path-scoped file only when it is relevant. This is not just efficiency - it prevents accessibility guidance from surfacing as a distraction on an infrastructure edit, which trains the agent to ignore it.

This project does not use frontmatter-separated rule files because the constitution is small enough to load entirely in context - a single developer, a focused scope, a lean rule set. For larger teams the calculus changes. An enterprise-scale project currently running EDD has 75 hard constraints across security, compliance, accessibility, and platform-specific requirements. Inlining all 75 into a single eager-load file would push the constitution well past the context budget for most agents. Frontmatter splitting keeps the canonical body under 250 lines - a summary pointer per domain - and lazy-loads the full rule detail only when the agent touches matching paths. The constitution stays fast and lean. The rules stay complete. The token cost stays bounded.

**Amendments as units of change.** A constitutional amendment is not a word change in a markdown file. It is a three-artifact bundle: the exact rule delta, the verification mechanism that will catch future violations, and a behavioral eval scenario that proves agent behavior improves. All three ship together in the same PR. The amendment is atomic. Partial amendments that promise to add the verification later are rejected - later does not come, and an unverifiable rule is decoration.

**Write the evals and rubrics before you think you need them.** The eval is the ratchet. Without it, amendments are accepted on good faith and the constitution drifts. The rubric scores agent behavior on realistic scenarios. Every new rule produces at least one scenario. The rubric produces a numeric score. The amendment passes only if the working-tree score meets or exceeds the baseline.

**Document what you can and cannot cover. Do not lie to yourself about coverage.**

Early on, the accessibility hard constraint declared that new interactive UI surfaces required e2e a11y coverage using axe-core. This felt comprehensive. In practice it was naive. axe-core handles a meaningful subset of WCAG - it catches missing labels, landmark structure, focus order, and contrast in cases where the DOM is fully rendered and colors are resolved. It does not catch screen reader announcement logic, cognitive load patterns, complex widget keyboard contracts, or contrast issues involving gradients and SVG image nodes where the computed color cannot be resolved.

Having `[HC-A11Y-GATE]` with axe-core in the verification does not mean a11y bugs are zero. It means the specific axe-core ruleset runs against the rendered DOM. The difference matters enormously in PR coverage claims.

The fix was decomposition. Instead of "axe-core clean," the verification was rewritten to enumerate which WCAG success criteria the axe-core ruleset deterministically covers (1.1.1 for non-text content, 1.3.1 for info and relationships, 1.4.3 for contrast where resolvable, 4.1.2 for name/role/value) and which have zero automated coverage (1.3.3 sensory characteristics, 2.4.6 headings and labels semantics, all of 3.x Understandable criteria). The known-gaps section of the audit dimension now states explicitly: axe-core handles these criteria; manual scanning is required for those. PR reviewers see the actual coverage, not a false-confidence summary.
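
One way to keep that enumeration honest is to store it as data and derive the manual-review list from it. The sketch below mirrors the criteria named above; the map, the `partial` category for 1.4.3, and the function name are assumptions, not the project's actual audit format.

```typescript
type Coverage = "automated" | "partial" | "manual";

// Coverage as described in the text above; anything not listed is treated as a gap.
const a11yCoverage: Record<string, Coverage> = {
  "1.1.1": "automated", // non-text content
  "1.3.1": "automated", // info and relationships
  "1.4.3": "partial",   // contrast, only where the computed color is resolvable
  "4.1.2": "automated", // name, role, value
  "1.3.3": "manual",    // sensory characteristics
  "2.4.6": "manual",    // headings and labels semantics
};

// Returns the criteria a reviewer must still check by hand.
// Unlisted criteria (for example any 3.x criterion) default to manual review.
function manualReviewNeeded(criteriaTouched: string[]): string[] {
  return criteriaTouched.filter((c) => a11yCoverage[c] !== "automated");
}
```

Defaulting unknown criteria to manual review keeps the failure direction safe: an undocumented check is treated as no check.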

The broader principle: for every verification, document what it catches and what it does not. "Security lint passes" does not mean the codebase is secure. "axe-core clean" does not mean WCAG 2.2 AA conformant. Name the gap. Log it in the audit dimension. Require manual scanning for the gap surface. Do not let the automated check substitute for the human judgment it cannot replace.

Step 5 - How we write amendments

Amendments almost never start as amendment proposals. They start as bugs.

A bug surfaces. The fix is applied. Before shipping, `/reflect` asks one question: is this a one-off, or is something missing in the constitution that would have caught this class of failure? If the answer is one-off, the fix ships and that is the end. If the answer is that something is missing, that is when `/constitute` is invoked.

**The /reflect -> gap -> amendment path.** `/reflect` examines the fix and classifies it: constitution gap (no rule covered this class of failure) or verification gap (a rule existed but no automated check enforced it). Both routes lead to `/constitute`. A constitution gap produces a new HC. A verification gap produces a tighter verification on an existing HC - typically a new audit dimension subsection, a new scanner rule, or a new eval scenario.

**The three required artifacts.** `/constitute` refuses to proceed without all three in the same PR:

  • **Rule delta.** The exact text change, classified: new rule, amended wording, displacement (replace and remove the old), supersession (raise the bar with the old as floor), or relocation. Duplicates are rejected on sight.
  • **Verification mechanism.** The specific gate that will catch a future violation - a test name, lint rule ID, audit dimension subsection, scanner exit-code check, or behavioral eval scenario. It must exist at commit time. Rules without detectable violations are decorative.
  • **Eval scenario.** Stored at `docs/methodology/eval/scenarios/<id>.md`. Describes a realistic situation where the old rule produces wrong agent behavior and the new rule produces the scoreable correct answer.
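
The all-three-or-nothing rule can be sketched as a completeness check over a hypothetical amendment record. The field names are assumptions; the delta classifications and the scenario directory come from the list above.

```typescript
type DeltaKind = "new" | "amended" | "displacement" | "supersession" | "relocation";

// Hypothetical amendment bundle; every artifact is optional so gaps can be detected.
interface Amendment {
  ruleDelta?: { kind: DeltaKind; text: string };
  verification?: { gate: string; existsAtCommit: boolean };
  evalScenarioPath?: string;
}

const SCENARIO_DIR = "docs/methodology/eval/scenarios/";

// Returns the names of missing artifacts; an empty list means the bundle is complete.
function missingArtifacts(a: Amendment): string[] {
  const missing: string[] = [];
  if (!a.ruleDelta || a.ruleDelta.text.trim() === "") missing.push("rule delta");
  // A verification promised for later does not count: it must exist at commit time.
  if (!a.verification || !a.verification.existsAtCommit) missing.push("verification mechanism");
  if (!a.evalScenarioPath || !a.evalScenarioPath.startsWith(SCENARIO_DIR)) missing.push("eval scenario");
  return missing;
}
```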

**The ratchet.** After all three artifacts are approved, the amendment is applied to a branch. The eval runs against main-branch rules and working-tree rules. The rubric scores both. Pass requires working-tree >= main on every scenario. Regression blocks the amendment until the wording is fixed. The ratchet is not optional for obvious amendments: they can still produce subtle regressions, and the eval is the only mechanism that catches them before they land.
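
The pass condition reduces to a per-scenario comparison. This is a minimal sketch, assuming rubric scores are available as a map from scenario ID to number; the names are hypothetical.

```typescript
// Scenario ID -> rubric score.
type Scores = Record<string, number>;

// Returns the scenarios on which the working tree scores below main.
// A scenario missing from the working-tree run counts as a regression.
function ratchetRegressions(main: Scores, workingTree: Scores): string[] {
  const regressions: string[] = [];
  for (const id of Object.keys(main)) {
    const baseline = main[id];
    const candidate = workingTree[id];
    if (baseline === undefined) continue;
    if (candidate === undefined || candidate < baseline) regressions.push(id);
  }
  return regressions;
}

// The amendment may land only when no scenario regresses.
function ratchetPasses(main: Scores, workingTree: Scores): boolean {
  return ratchetRegressions(main, workingTree).length === 0;
}
```

Scenarios that exist only in the working tree are ignored by the comparison, since a new rule is expected to add at least one scenario that main has never scored.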

**The retroactive sweep.** An amendment fixes the rule for new code immediately: `/audit`'s diff scope means new work meets the new bar from the moment the amendment lands. Pre-existing code that violates the new rule is handled by a separate sweep PR queued through `/rollout`. The amendment PR does not need to fix every pre-existing instance inline; that would make amendments prohibitively expensive. Instead: new code meets the new bar right away, old code is on the rollout queue, and the sweep PR carries its own evidence that pre-existing instances are resolved.

The four valid triggers for `/constitute` are: a bug, an incident, a post-mortem, and a new contractual standard. Proposals shaped as "we should probably..." with none of the four triggers are declined and routed to `/reflect` instead.

Step 6 - How we unfurl it

The Portable Methodology Kit (`EDD - Portable Methodology Kit.md`) is a self-contained document you hand to any AI agent with the instruction to run `/begin`. The agent inspects the repo, runs a discovery pass, confirms detected values with you in a single table, asks only what discovery cannot answer, and emits the minimum viable scaffolding for your project. One session to stand up the full EDD infrastructure.

How you unfurl it depends entirely on whether you are starting fresh or bringing it into an existing codebase. The two paths are different enough to treat separately.

**Greenfield.** On a new project, put every rule you can think of in the constitution on day one. You have no pre-existing code to audit, no team practices to protect, no existing PRs to grandfather. The cost of strictness at day one is nearly zero. Add all the hard constraints, all the audit dimensions, all the path-scoped rules. Then run the loop. What you will discover quickly is where the constitution creates friction: build times that balloon because every change triggers a full audit, test suites that are slow because the coverage bar is set too high for the current complexity, token budgets that are tight because the canonical body is too verbose. Day-one strictness surfaces these problems in development, not in production. Then you optimize: tighten the build gates, adjust the coverage rules, trim the constitutional body to its minimum. You stabilize everything at once, with no customers affected, no team disrupted. The short-term cost is a slightly slower first sprint. The long-term gain is a constitution that has been stress-tested from the first commit.

**Brownfield.** An existing codebase has an existing team, existing practices, and existing PRs that did not follow EDD. The unfurl here is incremental and must be additive, not disruptive. The goal for the first month is not to retrofit every past decision - it is to start generating the collateral that makes EDD trustworthy: one audit dimension that catches something real, one hard constraint that automates a review check the team was already doing manually, one amendment cycle end to end. Use the team's existing quality signals as raw material. If the team already has a lint rule for security, add `[HC-SECURITY-LINT]` and point it at the existing gate - nothing changes for developers, but now the constitutional record reflects what the gate actually enforces.

The cardinal rule in brownfield is: win allies before winning arguments. Do not push a full constitution that touches every area of the codebase in the first week. Start with the dimension the team already cares most about - usually security or reliability. Show that the amendment process closes a real gap they have seen before. Let the ratchet compound from there. A team that has seen EDD catch one real bug that slipped through their existing process will make room for the next rule. A team that encounters EDD as a document that tells them they have been doing everything wrong will route around it.

**What it unlocks.** The reason to go through the unfurl, whether greenfield or brownfield, is not the constitution document. It is what the constitution enables once the verification machinery is running.

Production code quality and delivery velocity compound together in a way that is genuinely counterintuitive if you have not seen it. Engineers stop context-switching to debug regressions that the loop would have caught. Review cycles shorten because PRs carry evidence instead of explanations. The audit runs automatically and flags the issues that would have been caught by the most experienced reviewer - freeing that reviewer to focus on the architectural decisions that actually require their judgment.

The clearest evidence of this: a new engineer joining the team, with full access to the constitution, the feature spec, and the loop, can make a production-quality feature contribution checked into main within 48 hours of their first day. Not a toy change. Not a documentation update. A real feature, with evidence, passing the full audit. It is not a fluke and it is not a particularly unusual engineer. The guardrails make it possible for any skilled developer to operate at the team's quality standard from day one, because the quality standard is written down, verifiable, and enforced automatically rather than carried as institutional knowledge in the heads of whoever has been around the longest.

That is the end state: a team where the AI does the verification toil, the guardrails catch the failure classes that would otherwise slip through code review, and the engineers spend their thinking time on the problems that actually require engineering judgment.

This has been verified at different scales - solo projects first, then medium-size teams, then enterprise-scale organizations. The mechanics hold across all three. The unfurl cost is different (a solo developer can skip requirements registries and cross-vendor adversarial review; an enterprise team needs them). The amendment cadence is different (a solo project may go weeks between `/constitute` invocations; an enterprise team with multiple contributing agents runs them weekly). But the core loop, the ratchet, and the evidence requirement behave the same way regardless of team size. The quality floor only goes up, and the verification machinery keeps it there without depending on whoever happens to be the most experienced reviewer in the room that week.

AI Hooks

The constitution governs agent behavior through loaded context. Hooks enforce it at the moment of action, before the work is already done and a PR is already open. Without hooks, a violation is caught at review: the agent has already written the code, the PR exists, and fixing it means re-doing work. With hooks, the interception happens before any keystroke.

**Both Claude Code and GitHub Copilot run a hook on prompt submit.** When a new task arrives, the hook fires before the agent does anything. Its job: check whether the task is non-trivial, then rewire the task list into the `/wow` loop - the 10-step EDD sequence the agent must follow before shipping anything.

| Step | Description |
| --- | --- |
| 1 | Update docs for the task |
| 2 | Write or update tests (E2E or unit) |
| 3 | Run targeted tests - confirm FAIL |
| 4 | Capture BEFORE evidence |
| 5 | Implement the task |
| 6 | Run targeted tests - confirm PASS + full suite green |
| 7 | Capture AFTER evidence |
| 8 | Verify docs match implementation |
| 9 | Run `/audit` - fix Critical/High findings |
| 10 | Summarize and wait for human review |

An agent that reaches step 5 without completing steps 1-4 has violated the loop. The hook establishes the sequence at session start - not after the fact.
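
The ordering rule can be sketched as a gate that a hook might consult before letting the agent start a step. The step identifiers and function name are hypothetical; this shows only the sequencing logic, not any platform's hook API.

```typescript
// The ten loop steps in order; identifiers are assumptions made for this sketch.
const LOOP_STEPS = [
  "update-docs",
  "write-tests",
  "confirm-fail",
  "capture-before",
  "implement",
  "confirm-pass",
  "capture-after",
  "verify-docs",
  "audit",
  "human-review",
] as const;

type LoopStep = (typeof LOOP_STEPS)[number];

// Returns the earliest step that should have been completed before `next`,
// or null when `next` may proceed.
function firstSkippedStep(completed: LoopStep[], next: LoopStep): LoopStep | null {
  for (const step of LOOP_STEPS) {
    if (step === next) return null;
    if (completed.indexOf(step) === -1) return step;
  }
  return null;
}
```

For example, an attempt to start `implement` with only the first two steps recorded would be blocked with `confirm-fail` as the reason.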

**Claude Code** additionally fires a pre-tool-call hook before any file write, terminal command, or git operation. A commit cannot be attempted if the loop steps haven't been satisfied.

**GitHub Copilot** additionally fires a PR creation hook. Before the PR description is finalized, the hook runs `/audit` in self-review mode - catching dimension violations, missing evidence, and empty test plans before a human reviewer ever sees the draft. What reaches the reviewer is already pre-screened.

**Codex and other agents** have no native hook surface as of this writing. The fallback is a CI watcher bot that comments on PRs immediately after creation and flags violations. It's a backstop, not a first-surface gate - the work is already done by the time it fires, so it doesn't prevent the rework that hooks eliminate.

On a project with active hooks, violations are corrected inline during the session. The agent catches the gap, produces the evidence, and includes it from the start. Review time drops. Rework disappears. The constitution moves from a document the agent reads to a constraint the agent operates inside in real time.

Appendix - The Full Constitution

What follows is the actual `CONSTITUTION.md` from this project - a single-developer, fully autonomous AI-assisted development project. It governs every non-trivial change made to this codebase. This is not a template or an illustration. This is the live document loaded by every agent on every session.
