agentic-dev-template/docs/work/prds/2026-05-13-coverage-architecture.prd.md
Danijel Martinek fc27eef6eb docs(coverage): sync docs to shipped state + wire sandcastle prompts
Closes the staleness gap after the 10-commit coverage epic shipped.

Doc sync (item 1 from the user's choice):
  - CLAUDE.md Quick Start: adds pnpm coverage:aggregate / coverage:diff
    / mutate to the command listing
  - CLAUDE.md: new "Sibling architecture: coverage (ADR-020)" section
    after the conformance gate table — captures the 4-layer table +
    points at docs/guides/coverage.md + ADR-020 + says agents must run
    coverage:diff before reporting complete
  - AGENTS.md preamble: now lists coverage as a parallel multi-latency
    quality system alongside conformance, with the same gate / latency
    framing
  - PRD frontmatter: status draft -> shipped + shipped date +
    shipping-commits list (all 10 SHAs anchoring the trace)
  - PRD findings table: each row gets a Resolution column citing the
    commit that closed it; conclusion text updated to past tense
  - ADR-020 implementation phasing: rewritten as a status table with
    each step linked to the commit that shipped it + Boot-time
    assertFeatureConformance explicitly marked Deferred with rationale
  - docs/guides/coverage.md: removed "Boot wiring lands in the next
    story" line; replaced with the deferral rationale + clarified
    that two readers (vitest, coverage:diff) consume the manifest

Sandcastle prompts (item 2 from the user's choice):
  - .sandcastle/implementer.prompt.md: new "Coverage gates" section
    after the conformance-gates list, requiring `pnpm test --coverage`,
    `pnpm coverage:aggregate`, and `pnpm coverage:diff` to all pass
    before reporting `complete`. Machine-readable JSON shape of
    coverage:diff documented (status / uncovered[] / kind enum), with
    explicit instructions on how to interpret each kind. Allowlist
    expansion requires justification + test.
  - .sandcastle/reviewer.prompt.md: AC coverage relabeled to "AC
    coverage (acceptance criteria, not test coverage)" to disambiguate;
    new check #7 "Coverage gates (ADR-020)" requiring CI's
    Coverage — diff (L1) step green + per-layer thresholds met +
    no silent allowlist expansion + manifest band drift detection.

Effect: future agent runs through sandcastle now treat coverage as a
first-class blocking gate, parallel to conformance. PRs no longer
discover coverage failures only via CI; the implementer is required
to check before reporting done, and the reviewer is required to
verify.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 16:47:16 +02:00


id: 2026-05-13-coverage-architecture
title: Agent-first coverage architecture (4 layers + manifest-driven thresholds)
type: prd
status: shipped
author: danijel
elicitation-session: brainstorm-2026-05-13
created: 2026-05-13
shipped: 2026-05-13
shipping-commits:
  • 7eb783a (PRD)
  • 4dce1df (ADR-020 + glossary + hook)
  • f7baa8b (manifest schema + helper + auth)
  • 412d994 (L1 coverage:diff)
  • bd5a077 (L2 coverage:aggregate)
  • 39e33eb (CI integration)
  • 15db9c4 (helper rollout blog + marketing-pages)
  • f4254aa (cookbook guide + generator)
  • 6428f10 (L3 Stryker mutation)
  • bf0b049 (L0 unification: all 5 features green)

Problem

The template enforces "every use case has a test file" via the usecase-must-have-test-file ESLint rule (structural) and declares per-layer coverage thresholds in each feature's vitest.config.ts (100% on entities/use-cases/controllers; 80/75/80/80 baseline). But:

  • Agents can ship slices that don't actually exercise the new code. The ESLint rule only checks file presence; coverage % checks what executed. There's no gate that says "this PR's diff was tested."
  • The declared 100%-on-critical-layers thresholds may be aspirational, not enforced. Their actual green/red state is unverified. CI uploads **/coverage/lcov.info as an artifact but doesn't gate on it.
  • There's no aggregate visibility. No merged report, no trend, no "is the codebase covered well right now?" answer.
  • 100% coverage with weak assertions is invisible. Tests that import the SUT but barely assert anything pass coverage. The third dimension of test quality — "would my test catch a real regression?" — isn't measured.
  • Coverage expectations live in 5 separate vitest configs. Drift across features is easy and hard to spot.

For an agent-first template where most code is authored by AI agents in vertical slices, this is the single biggest gap between "feature shipped" and "feature shipped safely."

Goal

Establish a 4-layer coverage architecture that mirrors the existing 5-gate conformance philosophy (multi-latency, machine-readable, agent-first) and makes coverage a first-class conformance signal driven from each feature's feature.manifest.ts.

In scope

  • A coverage: section in feature.manifest.ts as the single source of truth for per-layer expectations
  • Vitest config auto-derives test-time thresholds from the manifest
  • pnpm coverage:diff script — cover-the-diff gate (changed lines must be exercised), machine-readable output for the dispatch loop
  • pnpm coverage:aggregate script — merges per-package lcov into a root coverage/lcov.info + grep-able coverage/summary.json
  • coverage/summary.json committed per merge; trend readable from git log
  • pnpm mutate — Stryker on entities/ + application/use-cases/ only, on-demand (not part of pnpm test)
  • assertFeatureConformance reads the manifest's coverage band and the package's lcov at boot; fails if drift
  • CI gate: pnpm test --coverage (existing) + pnpm coverage:diff (new) + pnpm coverage:aggregate (new)
  • ADR-020 capturing the architecture
  • docs/guides/coverage.md cookbook
  • Glossary entries for new vocabulary
  • Generator update — pnpm turbo gen feature scaffolds the coverage: manifest section
  • .claude/hooks/prompt-context.sh detects coverage-related prompts and injects pointers

Out of scope

  • Codecov / SaaS dashboards. Aggregate trend ships as committed coverage/summary.json only. SaaS can be a later ADR.
  • Coverage badges or PR comments. Optional follow-up if humans want them; agents don't.
  • Mutation testing on infrastructure / repositories / controllers — entities + use-cases only for v1. Wider scope is a future epic.
  • Branch coverage on __seeds__/ / __factories__/ / __contracts__/ — already excluded from coverage; stays out.
  • Coverage for apps/ — they have their own existing thresholds; not part of this initiative.
  • Test quality beyond coverage + mutation — property-based tests, fuzzing, etc. are separate concerns.

Constraints

  • ADR-014, ADR-017 (instrumentation): coverage instrumentation must not interfere with span/log collection.
  • ADR-011 (TDD foundation): the declared per-layer thresholds (entities/use-cases/controllers at 100%; baseline 80/75/80/80) are existing decisions; we honor them and centralize their declaration.
  • The conformance ESLint rules are AST-time + filesystem; coverage assertions are runtime data — they must remain in separate gates.
  • Generator-first is non-negotiable: any new file added to a feature must come from pnpm turbo gen feature or a sibling generator.
  • pnpm conformance and pnpm test already take ~120s and ~90s respectively; new gates must not add more than ~30s wall time to the default loop.
  • Agent dispatch (pnpm work dispatch) must be able to read coverage results as JSON without a network call.

Success criteria

  • pnpm test --coverage passes green across all five feature packages (verifies L0 baseline is real, not aspirational).
  • pnpm coverage:diff exits non-zero when a changed line is uncovered; outputs JSON to stdout listing each uncovered hunk with file + line range.
  • pnpm coverage:aggregate produces coverage/lcov.info + coverage/summary.json at the repo root; both readable in <1s.
  • coverage/summary.json is committed and git log -- coverage/summary.json shows trend over time.
  • pnpm mutate --filter <feature> runs Stryker on entities + use-cases; produces a per-feature mutation score.
  • feature.manifest.ts includes coverage: { ... }; removing or weakening a band fails pnpm dev boot via assertFeatureConformance.
  • CI fails when (a) a per-layer threshold breach happens, (b) any changed line is uncovered, (c) the aggregate report can't be produced.
  • ADR-020 + docs/guides/coverage.md + glossary entries land alongside implementation.
  • pnpm turbo gen feature emits a manifest with the coverage: section pre-populated to defaults.

User stories

  1. As an AI implementer agent, I want pnpm coverage:diff to exit with a precise JSON list of uncovered hunks, so that I can immediately add the missing test without searching.
  2. As an AI implementer agent, I want pnpm dev to refuse to boot if I've authored a manifest entry without backing tests, so that I cannot ship an untested slice.
  3. As an AI reviewer agent, I want the dispatch loop to surface coverage drift as part of its post-task verification, so that I can flag the slice for revision.
  4. As a human reviewer, I want coverage/summary.json committed on merge, so that I can see coverage trend via git log without leaving the repo.
  5. As a template maintainer, I want each feature's feature.manifest.ts to declare its coverage band, so that I have one place to read or change expectations.
  6. As a template maintainer, I want pnpm mutate to surface tests that don't actually assert, so that 100% line coverage can't paper over weak tests.
  7. As a template adopter, I want pnpm turbo gen feature to scaffold the coverage: manifest section with sensible defaults, so that I don't have to remember the shape.
  8. As a future agent in a session, I want the prompt-context hook to inject coverage pointers when I mention "coverage" or "uncovered" in a prompt, so that the relevant ADR + guide load automatically.
  9. As an on-call engineer, I want coverage/summary.json to include a timestamp + commit SHA, so that I can correlate coverage state with deploys.
  10. As an AI agent receiving a handoff, I want the coverage state of the in-flight slice to be readable from a known path, so that I can continue without re-running the suite.

Implementation decisions

Architecture: 4 layers, mirroring the 5-gate conformance philosophy

| Layer | What it catches | Latency | Runs in |
| --- | --- | --- | --- |
| L0 Per-layer vitest thresholds | Drift below declared bands (e.g., entities < 100%) | ~5-30s per package | pnpm test --coverage (existing) |
| L1 Diff coverage | "Changed line was not exercised" | ~5s after L0 | pnpm coverage:diff; CI gate; dispatch post-task |
| L2 Aggregate trend | "Codebase coverage drifted over time" | ~10s | pnpm coverage:aggregate; committed coverage/summary.json |
| L3 Mutation testing | "Test exists, executes the code, but asserts nothing" | Minutes | pnpm mutate; on-demand; nightly GH Action |

Manifest-driven coverage band (the keystone)

A coverage: section in feature.manifest.ts is the single source of truth. Default scaffolded shape:

  • entities: 100/100/100/100
  • use-cases: 100/95/100/100
  • controllers: 100/95/100/100
  • baseline: 80/75/80/80
  • mutationTargets: ["entities", "use-cases"]

Three readers:

  1. Vitest: vitest.config.ts imports the manifest and emits its coverage.thresholds, which eliminates today's duplication across per-feature configs.
  2. assertFeatureConformance — reads coverage/lcov.info for the package at boot, asserts each band. Fails to boot on drift (matching the existing brand-assertion shape).
  3. pnpm coverage:diff — uses baseline as the default expectation for non-layer-tagged files; stricter bands override per-path.
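
A minimal sketch of how the first reader could derive Vitest thresholds from the manifest band. The field names, the `band` and `toVitestThresholds` helpers, and the `src/`-prefixed globs are all hypothetical; the PRD fixes only the numbers. The four-number shorthand is assumed to follow the statements/branches/functions/lines key order that coverage/summary.json uses later in this document. Vitest's `coverage.thresholds` option accepts glob keys alongside the global metrics, which is what the returned object targets.

```typescript
// Hypothetical shape of the manifest's coverage section (names are assumptions).
type Band = { statements: number; branches: number; functions: number; lines: number };

interface CoverageSection {
  entities: Band;
  "use-cases": Band;
  controllers: Band;
  baseline: Band;
  mutationTargets: string[];
}

// Parse the "S/B/F/L" shorthand; the ordering is an assumption, see the lead-in.
function band(shorthand: string): Band {
  const parts = shorthand.split("/").map((p) => (p.trim() === "" ? NaN : Number(p)));
  if (parts.length !== 4 || parts.some((n) => !(n >= 0 && n <= 100))) {
    throw new Error(`invalid coverage band: ${shorthand}`);
  }
  const [statements, branches, functions, lines] = parts as [number, number, number, number];
  return { statements, branches, functions, lines };
}

// The default scaffolded values listed in the PRD.
const defaultCoverage: CoverageSection = {
  entities: band("100/100/100/100"),
  "use-cases": band("100/95/100/100"),
  controllers: band("100/95/100/100"),
  baseline: band("80/75/80/80"),
  mutationTargets: ["entities", "use-cases"],
};

// Layer directories come from the PRD; the src/ prefix is an assumption.
const layerGlobs = {
  entities: "src/entities/**",
  "use-cases": "src/application/use-cases/**",
  controllers: "src/interface-adapters/controllers/**",
} as const;

// Baseline becomes the global thresholds; each stricter band becomes a glob entry.
function toVitestThresholds(c: CoverageSection): Record<string, number | Band> {
  const out: Record<string, number | Band> = { ...c.baseline };
  for (const layer of Object.keys(layerGlobs) as Array<keyof typeof layerGlobs>) {
    out[layerGlobs[layer]] = c[layer];
  }
  return out;
}
```

In a feature's vitest.config.ts the result would be passed as `test.coverage.thresholds`, so changing a band in the manifest changes the test-time gate without touching the config.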

Diff coverage algorithm

pnpm coverage:diff [<base-ref>] (default base ref: origin/main, compared as origin/main...HEAD):

  1. Run git diff --unified=0 --no-color <base-ref>...HEAD → list of (file, hunk-start, hunk-end) for changed/added lines.
  2. Read merged coverage/lcov.info → executable-line + execution-count per file.
  3. For each changed line that is executable (skip comments, types, exports, declarations): assert execution count > 0.
  4. Output stdout JSON { status: "pass" | "fail", uncovered: [{ file, line, kind }] }.
  5. Output stderr human-readable summary.
  6. Exit 0 on pass, 1 on fail.

Filename allowlist (files that are not gated on diff coverage): *.test.ts, *.test.tsx, *.config.*, *.md, *.json, *.mjs, *.cjs. The gate also applies the same exclude list as L0's existing per-package coverage excludes (DI bootstrap, interfaces, CMS, UI, factories, contracts).
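
The steps above can be sketched as three pure functions, which also fits the fixture-based testing approach described later. This is an illustration under stated assumptions, not the shipped scripts/coverage/diff.mjs: the function names are hypothetical, the `kind` value "uncovered-line" is a placeholder because the PRD does not list the enum, paths in lcov `SF:` records are assumed to match the repo-relative paths in the diff, and the allowlist filter is omitted. A changed line with no `DA:` record is treated as non-executable and skipped.

```typescript
type Uncovered = { file: string; line: number; kind: string };
type DiffResult = { status: "pass" | "fail"; uncovered: Uncovered[] };

// Step 1: collect added/changed line numbers per file from `git diff --unified=0` output.
function changedLines(diff: string): Map<string, number[]> {
  const result = new Map<string, number[]>();
  let current: string | null = null;
  for (const row of diff.split("\n")) {
    if (row.startsWith("+++ ")) {
      const target = row.slice(4).trim();
      current = target === "/dev/null" ? null : target.replace(/^b\//, "");
      continue;
    }
    // Hunk header: @@ -oldStart[,oldCount] +newStart[,newCount] @@
    const m = /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/.exec(row);
    if (m && current !== null) {
      const start = Number(m[1]);
      const count = m[2] === undefined ? 1 : Number(m[2]); // count 0 = pure deletion
      const lines = result.get(current) ?? [];
      for (let i = 0; i < count; i++) lines.push(start + i);
      result.set(current, lines);
    }
  }
  return result;
}

// Step 2: read execution counts per executable line from lcov DA records.
function executionCounts(lcov: string): Map<string, Map<number, number>> {
  const result = new Map<string, Map<number, number>>();
  let current: Map<number, number> | null = null;
  for (const raw of lcov.split("\n")) {
    const row = raw.trim();
    if (row.startsWith("SF:")) {
      current = new Map<number, number>();
      result.set(row.slice(3), current);
    } else if (row.startsWith("DA:") && current) {
      const [line, count] = row.slice(3).split(",");
      current.set(Number(line), Number(count));
    } else if (row === "end_of_record") {
      current = null;
    }
  }
  return result;
}

// Steps 3-4: a changed executable line with count 0 is reported as uncovered.
function diffCoverage(diff: string, lcov: string): DiffResult {
  const counts = executionCounts(lcov);
  const uncovered: Uncovered[] = [];
  changedLines(diff).forEach((lines, file) => {
    const perLine = counts.get(file);
    if (!perLine) return; // file absent from the report: nothing executable to check
    for (const line of lines) {
      if (perLine.get(line) === 0) uncovered.push({ file, line, kind: "uncovered-line" });
    }
  });
  return { status: uncovered.length === 0 ? "pass" : "fail", uncovered };
}
```

A CLI wrapper would print `JSON.stringify(result)` to stdout, a readable summary to stderr, and exit with 1 when `status` is "fail".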

Aggregate report

pnpm coverage:aggregate:

  1. Collect every packages/**/coverage/lcov.info + apps/**/coverage/lcov.info.
  2. Merge into coverage/lcov.info via lcov-result-merger or monocart-coverage-reports.
  3. Emit coverage/summary.json:
    {
      "generatedAt": "2026-05-13T12:34:56Z",
      "commit": "abc1234",
      "repo": { "statements": 87.4, "branches": 81.2, "functions": 89.0, "lines": 87.4 },
      "byPackage": {
        "@repo/auth": { "statements": 96.1, ... },
        ...
      }
    }
    
  4. Re-render HTML at coverage/html/index.html for human drill-down (gitignored).

coverage/summary.json is committed; coverage/lcov.info + coverage/html/ are gitignored.
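
As an illustration of step 3, the sketch below computes the summary object from per-package lcov text. It is a simplification under several assumptions: the function names are hypothetical, merging is done by summing the lcov `LF`/`LH`, `FNF`/`FNH`, and `BRF`/`BRH` totals rather than by calling a merge library, the lcov format carries no statement metric so `statements` mirrors `lines`, and a metric with nothing to cover is reported as 100.

```typescript
type Metrics = { statements: number; branches: number; functions: number; lines: number };
type Totals = { LF: number; LH: number; FNF: number; FNH: number; BRF: number; BRH: number };

const emptyTotals = (): Totals => ({ LF: 0, LH: 0, FNF: 0, FNH: 0, BRF: 0, BRH: 0 });

// Sum the found/hit counters over every record in one lcov file.
function totals(lcov: string): Totals {
  const t = emptyTotals();
  for (const raw of lcov.split("\n")) {
    const m = /^(LF|LH|FNF|FNH|BRF|BRH):(\d+)$/.exec(raw.trim());
    if (m) t[m[1] as keyof Totals] += Number(m[2]);
  }
  return t;
}

// Percentage rounded to one decimal place, as in the sample summary.
function pct(hit: number, found: number): number {
  return found === 0 ? 100 : Math.round((hit / found) * 1000) / 10;
}

function toMetrics(t: Totals): Metrics {
  const lines = pct(t.LH, t.LF);
  // Assumption: lcov has no statement counter, so statements mirrors lines.
  return { statements: lines, branches: pct(t.BRH, t.BRF), functions: pct(t.FNH, t.FNF), lines };
}

// perPackage maps a package name (e.g. "@repo/auth") to that package's lcov text.
function summarize(perPackage: Record<string, string>, commit: string, generatedAt: string) {
  const repo = emptyTotals();
  const byPackage: Record<string, Metrics> = {};
  for (const name of Object.keys(perPackage)) {
    const t = totals(perPackage[name] ?? "");
    byPackage[name] = toMetrics(t);
    for (const key of Object.keys(repo) as Array<keyof Totals>) repo[key] += t[key];
  }
  return { generatedAt, commit, repo: toMetrics(repo), byPackage };
}
```

Passing the commit SHA and timestamp in as arguments keeps the function deterministic, so a fixture test can assert the exact output.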

Mutation testing

pnpm mutate [--filter @repo/<feature>]:

  • Uses Stryker with @stryker-mutator/vitest-runner.
  • Per-feature stryker.config.json scaffolded by the generator.
  • Mutation scope honors the manifest's mutationTargets (entities + use-cases by default).
  • Output: reports/mutation/<feature>/ (HTML + JSON; gitignored).
  • Mutation score threshold: 80% per feature (tunable in manifest); not enforced in default pnpm test.
  • Optional nightly GH Action: mutation-nightly.yml — runs across all features, opens an issue on score drop > 5%.
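
One way the generator could turn the manifest's mutationTargets into a Stryker configuration object is sketched below. The helper name, the `src/`-prefixed globs, the test-file exclusion, and the report file names are assumptions; `testRunner`, `mutate`, `reporters`, `htmlReporter.fileName`, `jsonReporter.fileName`, and `thresholds.break` are standard Stryker options. Setting `break` makes a Stryker run fail below the threshold, which only affects pnpm mutate, not the default pnpm test.

```typescript
// Target-name → glob mapping; directory names come from the PRD, the src/ prefix is assumed.
const targetGlobs: Record<string, string> = {
  entities: "src/entities/**/*.ts",
  "use-cases": "src/application/use-cases/**/*.ts",
};

// Build the object that would be serialized to a per-feature stryker.config.json.
function strykerConfig(feature: string, mutationTargets: string[], threshold = 80) {
  const mutate = mutationTargets.map((target) => {
    const glob = targetGlobs[target];
    if (glob === undefined) throw new Error(`unknown mutation target: ${target}`);
    return glob;
  });
  mutate.push("!src/**/*.test.ts"); // never mutate the tests themselves
  return {
    testRunner: "vitest",
    mutate,
    reporters: ["html", "json"],
    htmlReporter: { fileName: `reports/mutation/${feature}/index.html` },
    jsonReporter: { fileName: `reports/mutation/${feature}/mutation.json` },
    thresholds: { break: threshold },
  };
}
```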

Boot-time assertion

assertFeatureConformance (in core-shared/conformance/) gains a new check:

  1. Look for coverage/lcov.info at the feature package root.
  2. If absent: skip with a logger.debug(...) (dev mode is allowed to lack coverage data).
  3. If present: parse, compute per-layer % for entities/, application/use-cases/, interface-adapters/controllers/.
  4. Compare against the manifest's coverage: bands. Fail boot on breach with a CoverageDriftError.

Graceful degradation in dev (USE_DEV_SEED=true): the assertion logs a warning instead of throwing, so pnpm dev boots without a fresh coverage run.
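
The check could look roughly like the sketch below. The commit message at the top of this page records the boot-time assertion as deferred, so this is purely illustrative. It compares only the lines metric for brevity, the `assertCoverage` signature and `layerLinePercent` helper are hypothetical, and the caller is assumed to pass `null` when coverage/lcov.info is absent and `devMode = true` when USE_DEV_SEED=true.

```typescript
class CoverageDriftError extends Error {
  layer: string;
  constructor(layer: string, actual: number, expected: number) {
    super(`coverage drift in ${layer}: ${actual.toFixed(2)}% lines < ${expected}%`);
    this.name = "CoverageDriftError";
    this.layer = layer;
  }
}

// Layer directories named in step 3.
const layerDirs: Record<string, string> = {
  entities: "/entities/",
  "use-cases": "/application/use-cases/",
  controllers: "/interface-adapters/controllers/",
};

// Compute line coverage % per layer from lcov DA records.
function layerLinePercent(lcov: string): Record<string, number> {
  const hit: Record<string, number> = {};
  const found: Record<string, number> = {};
  let layer: string | undefined;
  for (const raw of lcov.split("\n")) {
    const row = raw.trim();
    if (row.startsWith("SF:")) {
      const path = "/" + row.slice(3);
      layer = Object.keys(layerDirs).find((l) => path.includes(layerDirs[l] as string));
    } else if (layer !== undefined && row.startsWith("DA:")) {
      const count = Number(row.slice(3).split(",")[1]);
      found[layer] = (found[layer] ?? 0) + 1;
      if (count > 0) hit[layer] = (hit[layer] ?? 0) + 1;
    }
  }
  const out: Record<string, number> = {};
  for (const l of Object.keys(found)) out[l] = ((hit[l] ?? 0) / (found[l] as number)) * 100;
  return out;
}

// bands: expected line % per layer, taken from the manifest's coverage section.
function assertCoverage(
  lcov: string | null,
  bands: Record<string, number>,
  devMode: boolean,
  warn: (message: string) => void,
): void {
  if (lcov === null) return; // step 2: no coverage data, skip
  const actual = layerLinePercent(lcov);
  for (const layer of Object.keys(bands)) {
    const pct = actual[layer];
    const expected = bands[layer] as number;
    if (pct === undefined || pct >= expected) continue; // layer absent or within band
    const error = new CoverageDriftError(layer, pct, expected);
    if (devMode) warn(error.message); // graceful degradation in dev
    else throw error; // step 4: fail boot on breach
  }
}
```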

CI integration

.github/workflows/ci.yml runs the following steps after the existing test step:

  1. pnpm coverage:aggregate (always)
  2. pnpm coverage:diff (always; fails build on uncovered diff)
  3. Upload coverage/lcov.info + coverage/summary.json as artifact (existing flow)
  4. On merge to main only: commit coverage/summary.json back to the repo (separate workflow with permissions: contents: write)

Generator integration

pnpm turbo gen feature template emits:

  • Manifest with coverage: { ... defaults ... } section at a <gen:coverage> anchor
  • vitest.config.ts that imports the manifest and derives coverage.thresholds from it
  • stryker.config.json at the package root, scoped to entities/ + application/use-cases/

The CI guard at packages/core-eslint/anchors.test.js adds the new anchor.

Hook integration

.claude/hooks/prompt-context.sh gains a new keyword group:

if echo "$prompt" | grep -qE 'coverage|uncovered|mutation|stryker|lcov'; then
  inject+=('Coverage: ADR-020 + docs/guides/coverage.md. 4 layers: L0 vitest thresholds, L1 pnpm coverage:diff, L2 coverage/summary.json, L3 pnpm mutate. Manifest-driven via feature.manifest.ts coverage section.')
fi

Testing decisions

A good test for this initiative covers behavior through the public surface:

  • The diff-coverage script gets tested by feeding it a synthetic git diff + lcov.info and asserting JSON output shape and exit code. Fixtures, not e2e.
  • The aggregate script gets unit-tested by feeding it N synthetic per-package lcov files and asserting the merged output structure.
  • assertFeatureConformance gets a new test in core-shared/conformance/ that constructs a manifest + a coverage/lcov.info fixture and asserts pass/fail behavior. No real test run.
  • Stryker integration gets an integration-test style verification: run pnpm mutate --filter @repo/auth in CI nightly and assert mutation score > 80% on a known-good commit.
  • The manifest schema gets a Zod schema test: invalid coverage: shapes fail parse with a specific error message.

Modules to add test coverage to:

  • scripts/coverage/diff.mjs (the diff coverage runner)
  • scripts/coverage/aggregate.mjs (the merger)
  • packages/core-shared/src/conformance/assert-coverage.ts (the boot-time reader + comparator)
  • packages/core-shared/src/conformance/manifest-schema.ts (extended Zod schema)

Prior art to mirror:

  • scripts/work/state-builder.mjs — for the diff-coverage script's shape (Node ESM, no deps, fixture-based test)
  • packages/core-eslint/anchors.test.js — for the new anchor's CI guard pattern
  • Existing assertFeatureConformance brand-assertion code — for the boot-time check shape

Open questions

  • Q1: Mutation score threshold per feature — start at 80% across the board, or tune per-feature in the manifest? Recommended: start at 80% baseline, allow per-feature override via coverage.mutationThreshold.
  • Q2: When coverage/lcov.info is stale at boot — fail loudly or warn? Recommended: warn in dev, fail in NODE_ENV=production boot.
  • Q3: Should pnpm coverage:diff run automatically as a Git pre-push hook? Recommended: no — pre-commit already runs; pre-push adds latency without proportional value. Keep it CI-only + dispatch-loop.
  • Q4: How is coverage/summary.json committed safely on merge to main? Recommended: separate workflow with a github-actions[bot] identity, permissions: contents: write, skipped if no diff in summary.
  • Q5: monocart-coverage-reports vs. lcov-result-merger? Recommended: monocart — actively maintained, single-binary, handles V8 + Istanbul, no Python dep.

Out of scope (deferred to future PRDs)

  • Codecov / SaaS dashboards. Aggregate is committed; SaaS is gold-plating.
  • Mutation testing on infrastructure/, interface-adapters/, integrations/. Bigger surface, slower runs; revisit after L3 v1 is stable.
  • Coverage-aware Storybook screenshot diffing. Visual regression is its own concern.
  • Code review heuristics from coverage (e.g., "this PR touched a 100%-covered file and dropped to 95% — needs review tag"). Possible follow-up via fallow audit.

L0 verification findings (2026-05-13, during brainstorm)

Ran pnpm --filter @repo/<x> test -- --coverage --run per feature. State of the declared 100%-on-entities/use-cases/controllers:

| Feature | State | Detail |
| --- | --- | --- |
| @repo/auth | green | 21 tests, 93.7% overall, all per-layer bands hit 100% |
| @repo/blog | green | passed (no threshold errors surfaced) |
| @repo/marketing-pages | green | passed |
| @repo/navigation | real gap | entities/: 86.36% lines / 50% functions; controllers/: 86.66% lines / 80% branches |
| @repo/media | config + real gap | (1) Missing @vitest/coverage-v8 dev dep, so --coverage crashed. (2) vitest.config.ts had NO coverage block at all (no per-layer thresholds, no excludes). When the standard block was applied, real gaps surfaced in controllers/ (one controller at 86.66% lines / 75% branches, lines 19-20 uncovered). |

Conclusions reinforcing the PRD:

  • The L0 layer is real (auth/blog/marketing-pages prove the 100%/100%/95%/100% bar is achievable).
  • The duplication problem is real (5 features, 4 different vitest configs, one entirely absent the coverage block — exactly the drift the manifest-driven keystone eliminates).
  • Real test gaps exist in 2 of 5 features (navigation, media). Fixing them is part of the implementation epic, not a blocker for this design landing.

Patches landed alongside the PRD (non-blocking for CI):

  • packages/media/package.json — added @vitest/coverage-v8 dev dep so pnpm test -- --coverage no longer crashes.
  • packages/media/vitest.config.ts — left intentionally minimal (no coverage block); commented to point at the L0 unification story.

The full per-layer block for media + the missing tests in navigation + media land in the L0 unification story of the implementation epic, as the first work item after the manifest schema lands.

Further notes

  • Builds on: ADR-011 (TDD foundation), ADR-006 (vertical-feature-packages), the existing 5-gate conformance system.
  • ADR-020 is the durable record of the architecture decisions; this PRD is the implementation seed.
  • Each layer should land as a separate story (L0 unification, manifest schema + auto-derive, L1 diff, L2 aggregate, L3 mutation, ADR + docs). Estimated effort: one mid-sized epic, 6-8 stories.
  • This is the canonical example of "agent-first observability": every layer optimized for machine consumption first, human consumption second.