Danijel Martinek 841655573b docs(adr): rename ADR-012 — drop Lazar; update title + content + cross-refs

- Rename docs/decisions/adr-012-lazar-conformance.md → adr-012-feature-conventions.md
- Strip "Lazar", "Plan 8/9/10/11", "refactor-logs" refs from all ADRs,
  architecture docs, HTML explainers, and feature/core AGENTS.md files
- Update all incoming links in docs/, packages/*/AGENTS.md, HTML explainers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 10:07:37 +02:00

ADR-019 — Sandcastle for Agent Orchestration

Status: Accepted
Date: 2026-05-13
Spec: docs/architecture/agent-first-workflow-and-conformance.md
Companion guide: docs/guides/runbook.md ("Using Sandcastle for agent dispatch")
Related: ADR-011 (TDD foundation), ADR-012 (feature conventions), ADR-015 (events and jobs)

Context

This template is designed for agent-driven feature development. The conformance system (ADR-012 + the post-ADR conformance-system-v1 epic) gives agents a tight, layered feedback loop — type errors in 0s, lint in <1s, boot assertion in ~3s, CI gates in ~120s. The remaining substrate question is: how does an agent actually get dispatched against a task?

Three pieces are needed:

A way to invoke an agent (Claude / Codex) with a task description, inside a sandbox so the agent can't break the host while iterating.
A way to capture the agent's commits so a reviewer agent can inspect the diff and approve or reject.
A way to compose the above into a per-task dispatch loop with retry semantics, branch management, and integration into the existing docs/work/ task system.

Without a substrate that handles all three, agentic development falls back to copy-paste-prompt-by-hand, which is slow and error-prone.

Decision

Adopt Sandcastle (@ai-hero/sandcastle) as the agent-orchestration substrate. pnpm work dispatch is the entry point.

Concretely:

@ai-hero/sandcastle is a workspace-root devDependency, declared as ^2.73.0 at adoption. The caret range admits later 2.x minor and patch releases; the lockfile fixes the exact version that gets installed.
.sandcastle/ holds the canonical prompt templates. Five role-specific prompts: PRD eliciter, ADR eliciter, decomposer, implementer, reviewer. Each enforces the generator-first rule (prefer pnpm turbo gen <kind> over hand-rolling — see saved memory generator-first-for-agents).
.sandcastle/Dockerfile is the sandbox baseline (node:22-bookworm-slim + pnpm via corepack). The agent runs pnpm install --frozen-lockfile as its first step per the implementer prompt.
scripts/work/dispatch.mjs is the orchestrator. It reads _state.json, finds the first ready story's first unchecked AC bullet, builds a task spec, and calls sandcastle.run({ promptFile, promptArgs: { TASK_FILE_CONTENT } }) for the implementer, then again for the reviewer with {{DIFF}}. The orchestrator does NOT mutate state in v1 — it prints suggested mutations for the human to apply.
Two modes: pnpm work dispatch (planning, no agent invoked) and pnpm work dispatch --execute (real sandcastle call, requires auth; see the Bring-your-own-auth item below).
Reviewer agent verifies generator-first. Hand-rolled output that should have been a pnpm turbo gen <kind> invocation is grounds for rejection.
Bring-your-own-auth. Two paths are supported, in priority order:
- Subscription (primary) — bind-mount the host's ~/.claude/ into the sandbox. Claude Code CLI inside the sandbox uses the host's logged-in subscription session. Zero per-task token spend for Pro/Max subscribers. Path overridable via SANDCASTLE_CLAUDE_CREDS_DIR env var.
- API key (fallback) — ANTHROPIC_API_KEY or OPENAI_API_KEY passed through to the sandbox env. Used when no host creds directory exists.
- The resolver (resolveClaudeAuth in scripts/work/dispatch.mjs) picks automatically with subscription always preferred. Sandcastle's own issue #191 documents that subscription support won't be added natively; this mount-based pattern is our workaround promoted to first-class.
Per-task max-attempts honoured (v2). Each task's frontmatter may carry max-attempts: N to bound the implementer↔reviewer retry loop. Default 3.
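
The auth priority in the Bring-your-own-auth item can be sketched as a small pure function. This is an illustration only, not the actual resolveClaudeAuth from scripts/work/dispatch.mjs: the name resolveClaudeAuthSketch, the injected dirExists helper, and the returned object shape are all assumptions made for the example. Only the environment variable names and the "subscription first, API key second" ordering come from the text above.

```javascript
// Hypothetical sketch of the auth priority described above.
// `dirExists` is injected so the function stays pure; a real script
// could pass a wrapper around fs.existsSync instead.
function resolveClaudeAuthSketch({ env, homeDir, dirExists }) {
  // Subscription path: explicit override first, else the host's ~/.claude/.
  const credsDir = env.SANDCASTLE_CLAUDE_CREDS_DIR || `${homeDir}/.claude`;
  if (dirExists(credsDir)) {
    return { mode: "subscription", credsDir };
  }

  // API-key fallback: pass through whichever keys the host has set.
  const passthrough = {};
  for (const name of ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]) {
    if (env[name]) passthrough[name] = env[name];
  }
  if (Object.keys(passthrough).length > 0) {
    return { mode: "api-key", env: passthrough };
  }

  // Neither path is available, so --execute could not authenticate.
  return { mode: "none" };
}

// Usage: the subscription wins even when an API key is also present.
const auth = resolveClaudeAuthSketch({
  env: { ANTHROPIC_API_KEY: "sk-test" },
  homeDir: "/home/dev",
  dirExists: (path) => path === "/home/dev/.claude",
});
console.log(auth.mode); // prints "subscription"
```

Checking for the credentials directory before looking at any key is what makes the subscription path "always preferred": a key in the environment never shadows a logged-in host session.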

Alternatives considered

Bare Claude Code / Codex CLI invocation per task — rejected. No sandbox isolation; no consistent prompt template surface; no built-in branch management; no reviewer-loop primitive.
GitHub Copilot Workspace / native CI agent — rejected. Vendor lock-in; workflow lives outside the repo; no local equivalent for development time.
Custom orchestrator built from scratch on the Anthropic SDK — rejected. Sandcastle already solves sandbox + branch + structured-output extraction; rebuilding it is not the leverage point.
No orchestrator — humans dispatch each task manually via copy-paste — rejected as the steady-state mode, but supported as a fallback via planning mode (pnpm work dispatch without --execute).
A different sandbox provider (Vercel sandboxes, Daytona, native fly.io) — sandcastle is provider-agnostic; the choice of provider sits behind the SANDCASTLE_PROVIDER env var and can change without disrupting prompts or orchestrator code. Default is Docker.

Consequences

Positive

Per-task isolation. Each implementer dispatch runs in its own Docker sandbox + sandbox branch. Bad agent output stays in the branch; merge to main is gated by the reviewer agent + the full 5-gate stack.
Agent-agnostic. Switching from Claude to Codex (or to a future agent runtime) is a one-line change to the prompt's agent parameter.
Composable with existing workflow. pnpm work CLI already reads _state.json and the docs/work/ markdown; dispatch is one more subcommand layered on top.
Cost-aware default. Planning mode invokes no agent; only --execute spends tokens. Operators choose when to escalate from plan to execute.
Recoverable failure modes. If an implementer goes off-rails, its diff lives on a sandbox branch — review, reject, re-dispatch with notes.
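
The layering on top of _state.json and the docs/work/ markdown rests on one selection step: take the first ready story, then its first unchecked AC bullet. A minimal sketch follows. The state shape ({ stories: [{ id, status, file }] }), the "ready" status string, and the checkbox-style bullets are assumptions for illustration, since this ADR does not specify them; the function names are hypothetical too.

```javascript
// Assumed shapes (hypothetical, not taken from the repo):
//   state:    { stories: [{ id, status, file }] }
//   markdown: acceptance criteria as "- [ ]" / "- [x]" checkbox bullets
function firstUncheckedBullet(markdown) {
  for (const line of markdown.split(/\r?\n/)) {
    const match = /^\s*[-*] \[ \] (.+)$/.exec(line);
    if (match) return match[1].trim();
  }
  return null; // every bullet is already ticked
}

// `readFile` is injected (path -> string) so the sketch has no I/O of its own.
function selectNextTask(state, readFile) {
  const story = state.stories.find((s) => s.status === "ready");
  if (!story) return null;
  const bullet = firstUncheckedBullet(readFile(story.file));
  if (bullet === null) return null;
  return { storyId: story.id, file: story.file, bullet };
}

// Usage with an in-memory "file system".
const files = {
  "docs/work/story-2.md": "# Story 2\n- [x] add schema\n- [ ] add endpoint\n- [ ] add tests",
};
const state = {
  stories: [
    { id: "story-1", status: "done", file: "docs/work/story-1.md" },
    { id: "story-2", status: "ready", file: "docs/work/story-2.md" },
  ],
};
const task = selectNextTask(state, (path) => files[path]);
console.log(task.storyId, task.bullet); // prints "story-2 add endpoint"
```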

Negative / accepted trade-offs

External dependency on sandcastle. If the project stalls, we either pin + maintain a fork or migrate to another orchestrator. Sandcastle is small enough (~3KLOC) that a fork is manageable.
Token cost is real. A complex task can use 100K-200K tokens per implementer + reviewer round-trip. Operators budget per-dispatch; the planning mode + the optional max-attempts frontmatter cap exposure.
Docker dependency for the default sandbox. Without Docker (or a provider swap), --execute won't run. Documented in the runbook.
State mutation is manual in v1. The orchestrator prints suggested state mutations; a human ticks the AC bullet + commits. Auto-mutation is v2 work, gated on confidence that the reviewer's decision can be trusted without human inspection.
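
Two of these trade-offs, capped token exposure and manual state mutation, meet in the retry loop. The sketch below shows one way such a loop could be bounded. It is an assumption-laden illustration rather than the orchestrator's code: runImplementer and runReviewer stand in for the two sandcastle calls, and their signatures, the review object ({ decision, notes }), and the task fields are hypothetical. The default of 3 attempts is the one stated in the Decision section.

```javascript
// Hypothetical sketch of a bounded implementer/reviewer loop.
const DEFAULT_MAX_ATTEMPTS = 3;

async function dispatchTask(task, { runImplementer, runReviewer }) {
  // A per-task max-attempts value overrides the default when present.
  const maxAttempts = task.maxAttempts ?? DEFAULT_MAX_ATTEMPTS;
  let notes = null; // reviewer feedback carried into the next attempt

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { diff } = await runImplementer(task, notes);
    const review = await runReviewer(task, diff);

    if (review.decision === "approve") {
      // v1 behaviour: suggest the state change, never apply it.
      return {
        approved: true,
        attempts: attempt,
        suggestion: `tick "${task.bullet}" in ${task.file} and commit`,
      };
    }
    notes = review.notes;
  }

  // Attempts exhausted: stop spending tokens and hand back to the operator.
  return { approved: false, attempts: maxAttempts, suggestion: null };
}
```

With the default cap and the 100K-200K tokens per round-trip quoted above, the worst case for a single task is roughly 300K-600K tokens before the loop gives up.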

Follow-up work

Auto state mutation — when the reviewer agent's decision is approve, the orchestrator could automatically tick the AC bullet + commit. Currently manual; promote when reviewer confidence is established empirically.
Multi-task batch dispatch — pnpm work dispatch --all-ready would fan out across all ready stories. Requires DAG-aware concurrency (no two implementers touching the same files).
Sandcastle CI image alignment — the .sandcastle/Dockerfile is minimal; once we identify the CI base image, the sandbox should extend it to match the CI environment exactly.
Cost telemetry — sandcastle.run() returns iteration usage stats; the orchestrator could log these to _state.json per-task so operators see cumulative spend.
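
For the multi-task batch item, the "no two implementers touching the same files" constraint could be approximated with a greedy pass over the ready stories. This sketch covers only the file-overlap part, not dependency ordering between stories, and it assumes each story declares the files it expects to modify in a `touches` array. That field and the function name are hypothetical and do not exist in the current task format.

```javascript
// Hypothetical: each ready story carries `touches`, the files it expects to modify.
function pickNonConflicting(readyStories) {
  const claimed = new Set(); // files already owned by a story in this batch
  const batch = [];

  for (const story of readyStories) {
    const conflicts = story.touches.some((file) => claimed.has(file));
    if (conflicts) continue; // leave it for a later batch

    for (const file of story.touches) claimed.add(file);
    batch.push(story.id);
  }
  return batch;
}

// Usage: story B overlaps with A on b.ts, so it waits for the next batch.
const batch = pickNonConflicting([
  { id: "A", touches: ["src/a.ts", "src/b.ts"] },
  { id: "B", touches: ["src/b.ts"] },
  { id: "C", touches: ["src/c.ts"] },
]);
console.log(batch); // prints [ 'A', 'C' ]
```

The greedy order favours earlier stories, which keeps the result deterministic but is not guaranteed to produce the largest possible batch.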
