ADR-019 — Sandcastle for Agent Orchestration
Status: Accepted
Date: 2026-05-13
Spec: docs/architecture/agent-first-workflow-and-conformance.md
Companion guide: docs/guides/runbook.md ("Using Sandcastle for agent dispatch")
Related: ADR-011 (TDD foundation), ADR-012 (feature conventions), ADR-015 (events and jobs)
Context
This template is designed for agent-driven feature development. The conformance system (ADR-012 + the post-ADR conformance-system-v1 epic) gives agents a tight, layered feedback loop — type errors in 0s, lint in <1s, boot assertion in ~3s, CI gates in ~120s. The remaining substrate question is: how does an agent actually get dispatched against a task?
Three pieces are needed:
- A way to invoke an agent (Claude / Codex) with a task description, inside a sandbox so the agent can't break the host while iterating.
- A way to capture the agent's commits so a reviewer agent can inspect the diff and approve or reject.
- A way to compose the above into a per-task dispatch loop with retry semantics, branch management, and integration into the existing docs/work/ task system.
Without a substrate that handles all three, agentic development falls back to copy-paste-prompt-by-hand, which is slow and error-prone.
Decision
Adopt Sandcastle (`@ai-hero/sandcastle`) as the agent-orchestration substrate. `pnpm work dispatch` is the entry point.
Concretely:
1. `@ai-hero/sandcastle` is a workspace-root devDependency. Pinned at `^2.73.0` at adoption; pnpm resolves later patches automatically.
2. `.sandcastle/` holds the canonical prompt templates. Five role-specific prompts: PRD eliciter, ADR eliciter, decomposer, implementer, reviewer. Each enforces the generator-first rule (prefer `pnpm turbo gen <kind>` over hand-rolling; see saved memory `generator-first-for-agents`).
3. `.sandcastle/Dockerfile` is the sandbox baseline (`node:22-bookworm-slim` + pnpm via corepack). The agent runs `pnpm install --frozen-lockfile` as its first step per the implementer prompt.
4. `scripts/work/dispatch.mjs` is the orchestrator. It reads `_state.json`, finds the first ready story's first unchecked AC bullet, builds a task spec, and calls `sandcastle.run({ promptFile, promptArgs: { TASK_FILE_CONTENT } })` for the implementer, then again for the reviewer with `{{DIFF}}`. The orchestrator does NOT mutate state in v1; it prints suggested mutations for the human to apply.
5. Two modes: `pnpm work dispatch` (planning, no agent invoked) and `pnpm work dispatch --execute` (real sandcastle call, requires auth; see point 7).
6. Reviewer agent verifies generator-first. Hand-rolled output that should have been a `pnpm turbo gen <kind>` invocation is grounds for rejection.
7. Bring-your-own-auth. Two paths are supported, in priority order:
   - Subscription (primary): bind-mount the host's `~/.claude/` into the sandbox. Claude Code CLI inside the sandbox uses the host's logged-in subscription session. Zero per-task token spend for Pro/Max subscribers. Path overridable via the `SANDCASTLE_CLAUDE_CREDS_DIR` env var.
   - API key (fallback): `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` passed through to the sandbox env. Used when no host creds directory exists.

   The resolver (`resolveClaudeAuth` in `scripts/work/dispatch.mjs`) picks automatically, with subscription always preferred. Sandcastle's own issue #191 documents that subscription support won't be added natively; this mount-based pattern is our workaround promoted to first-class.
8. Per-task max-attempts honoured (v2). Each task's frontmatter may carry `max-attempts: N` to bound the implementer↔reviewer retry loop. Default 3.
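The auth priority in point 7 can be sketched as a pure function. This is not the actual `resolveClaudeAuth` from `scripts/work/dispatch.mjs`: the name `resolveAuthSketch`, the return shape, and the injected `homeDir` and `dirExists` parameters are assumptions made for illustration. Only the env var names and the `~/.claude/` default come from the text above.

```javascript
// Hypothetical sketch of the auth priority described in point 7.
// `env`, `homeDir`, and `dirExists` are injected so the function stays pure.
function resolveAuthSketch({ env, homeDir, dirExists }) {
  // Subscription path: an explicit override wins over the default location.
  const credsDir = env.SANDCASTLE_CLAUDE_CREDS_DIR || `${homeDir}/.claude`;
  if (dirExists(credsDir)) {
    return { mode: "subscription", mountDir: credsDir };
  }

  // API-key fallback: pass through whichever of the two keys are set.
  const passThrough = {};
  for (const name of ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]) {
    if (env[name]) passThrough[name] = env[name];
  }
  if (Object.keys(passThrough).length > 0) {
    return { mode: "api-key", env: passThrough };
  }

  // Neither path is available; the caller decides how to report this.
  return { mode: "none" };
}

// Example: the host creds directory exists, so the API key is ignored.
const auth = resolveAuthSketch({
  env: { ANTHROPIC_API_KEY: "sk-example" },
  homeDir: "/home/dev",
  dirExists: (dir) => dir === "/home/dev/.claude",
});
console.log(auth.mode); // "subscription"
```

Injecting the directory check keeps the priority logic testable without touching the real filesystem; a real resolver would presumably check the path on disk.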
Alternatives considered
- Bare Claude Code / Codex CLI invocation per task — rejected. No sandbox isolation; no consistent prompt template surface; no built-in branch management; no reviewer-loop primitive.
- GitHub Copilot Workspace / native CI agent — rejected. Vendor lock-in; workflow lives outside the repo; no local equivalent for development time.
- Custom orchestrator built from scratch on the Anthropic SDK — rejected. Sandcastle already solves sandbox + branch + structured-output extraction; rebuilding it is not the leverage point.
- No orchestrator (humans dispatch each task manually via copy-paste): rejected as the steady-state mode, but supported as a fallback via planning mode (`pnpm work dispatch` without `--execute`).
- A different sandbox provider (Vercel sandboxes, Daytona, native fly.io): sandcastle is provider-agnostic; the choice of provider sits behind the `SANDCASTLE_PROVIDER` env var and can change without disrupting prompts or orchestrator code. Default is Docker.
Consequences
Positive
- Per-task isolation. Each implementer dispatch runs in its own Docker sandbox + sandbox branch. Bad agent output stays in the branch; merge to `main` is gated by the reviewer agent + the full 5-gate stack.
- Provider-agnostic. Switching from Claude to Codex (or to a future agent runtime) is a one-line change to the prompt's `agent` parameter.
- Composable with existing workflow. The `pnpm work` CLI already reads `_state.json` and the docs/work/ markdown; dispatch is one more subcommand layered on top.
- Cost-aware default. Planning mode invokes no agent; only `--execute` spends tokens. Operators choose when to escalate from plan to execute.
- Recoverable failure modes. If an implementer goes off-rails, its diff lives on a sandbox branch: review, reject, re-dispatch with notes.
Negative / accepted trade-offs
- External dependency on sandcastle. If the project stalls, we either pin + maintain a fork or migrate to another orchestrator. Sandcastle is small enough (~3KLOC) that a fork is manageable.
- Token cost is real. A complex task can use 100K-200K tokens per implementer + reviewer round-trip. Operators budget per-dispatch; the planning mode + the optional `max-attempts` frontmatter cap exposure.
- Docker dependency for the default sandbox. Without Docker (or a provider swap), `--execute` won't run. Documented in the runbook.
- State mutation is manual in v1. The orchestrator prints suggested state mutations; a human ticks the AC bullet + commits. Auto-mutation is v2 work, gated on confidence that the reviewer's decision can be trusted without human inspection.
Follow-up work
- Auto state mutation: when the reviewer agent's decision is approve, the orchestrator could automatically tick the AC bullet + commit. Currently manual; promote when reviewer confidence is established empirically.
- Multi-task batch dispatch: `pnpm work dispatch --all-ready` would fan out across all ready stories. Requires DAG-aware concurrency (no two implementers touching the same files).
- Sandcastle CI image alignment: the `.sandcastle/Dockerfile` is minimal; once we identify the CI base image, the sandbox should extend it to match the CI environment exactly.
- Cost telemetry: `sandcastle.run()` returns iteration usage stats; the orchestrator could log these to `_state.json` per-task so operators see cumulative spend.
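The "no two implementers touching the same files" constraint for batch dispatch could be approximated with a greedy pass like the one below. It is only a sketch of one possible approach: the `files` field is hypothetical (this ADR does not define how a task declares the paths it will touch), and the function covers file-disjointness only, not dependency ordering between stories.

```javascript
// Greedy selection of ready tasks whose declared file sets do not overlap.
// `files` is a hypothetical per-task declaration of paths it may modify.
function pickDisjointBatch(readyTasks) {
  const claimed = new Set();
  const batch = [];
  for (const task of readyTasks) {
    const conflicts = task.files.some((file) => claimed.has(file));
    if (conflicts) continue; // defer to a later batch
    task.files.forEach((file) => claimed.add(file));
    batch.push(task.id);
  }
  return batch;
}

const batch = pickDisjointBatch([
  { id: "a", files: ["src/users/index.ts"] },
  { id: "b", files: ["src/users/index.ts", "src/users/schema.ts"] },
  { id: "c", files: ["src/billing/index.ts"] },
]);
console.log(batch); // [ 'a', 'c' ]
```

A greedy pass favours earlier tasks and does not maximise batch size, which seems a reasonable trade for predictability; deferred tasks would be picked up on the next dispatch.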