agentic-dev-template/docs/decisions/adr-019-sandcastle-for-agent-orchestration.md

# ADR-019 — Sandcastle for Agent Orchestration
**Status:** Accepted
**Date:** 2026-05-13
**Spec:** docs/architecture/agent-first-workflow-and-conformance.md
**Companion guide:** docs/guides/runbook.md ("Using Sandcastle for agent dispatch")
**Related:** ADR-011 (TDD foundation), ADR-012 (feature conventions), ADR-015 (events and jobs)
## Context
This template is designed for **agent-driven feature development**. The conformance
system (ADR-012 + the post-ADR conformance-system-v1 epic) gives agents a tight,
layered feedback loop — type errors in 0s, lint in <1s, boot assertion in ~3s, CI
gates in ~120s. The remaining substrate question is: how does an agent actually
get dispatched against a task?
Three pieces are needed:
1. **A way to invoke an agent** (Claude / Codex) with a task description,
inside a sandbox so the agent can't break the host while iterating.
2. **A way to capture the agent's commits** so a reviewer agent can inspect
the diff and approve or reject.
3. **A way to compose the above into a per-task dispatch loop** with retry
semantics, branch management, and integration into the existing
docs/work/ task system.
Without a substrate that handles all three, agentic development falls back to
copy-paste-prompt-by-hand, which is slow and error-prone.
## Decision
Adopt [Sandcastle](https://github.com/mattpocock/sandcastle) (`@ai-hero/sandcastle`)
as the agent-orchestration substrate. `pnpm work dispatch` is the entry point.
Concretely:
1. **`@ai-hero/sandcastle` is a workspace-root devDependency.** Declared as
`^2.73.0` at adoption; the caret range admits later 2.x minor and patch
releases, and the lockfile fixes the exact version actually installed.
2. **`.sandcastle/` holds the canonical prompt templates.** Five role-specific
prompts: PRD eliciter, ADR eliciter, decomposer, implementer, reviewer.
Each enforces the **generator-first** rule (prefer `pnpm turbo gen <kind>`
over hand-rolling; see saved memory `generator-first-for-agents`).
3. **`.sandcastle/Dockerfile`** is the sandbox baseline (`node:22-bookworm-slim`
plus pnpm via corepack). The agent runs `pnpm install --frozen-lockfile` as
its first step per the implementer prompt.
4. **`scripts/work/dispatch.mjs` is the orchestrator.** It reads `_state.json`,
finds the first ready story's first unchecked AC bullet, builds a task spec,
and calls `sandcastle.run({ promptFile, promptArgs: { TASK_FILE_CONTENT } })`
for the implementer, then again for the reviewer with `{{DIFF}}`. The
orchestrator does NOT mutate state in v1; it prints suggested mutations
for the human to apply.
5. **Two modes:** `pnpm work dispatch` (planning, no agent invoked) and
`pnpm work dispatch --execute` (real sandcastle call, requires auth; see
point 7).
6. **Reviewer agent verifies generator-first.** Hand-rolled output that should
have been a `pnpm turbo gen <kind>` invocation is grounds for rejection.
7. **Bring-your-own-auth.** Two paths are supported, in priority order:
   - **Subscription (primary):** bind-mount the host's `~/.claude/` into the
     sandbox. Claude Code CLI inside the sandbox uses the host's logged-in
     subscription session. Zero per-task token spend for Pro/Max subscribers.
     Path overridable via the `SANDCASTLE_CLAUDE_CREDS_DIR` env var.
   - **API key (fallback):** `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` passed
     through to the sandbox env. Used when no host creds directory exists.
   - The resolver (`resolveClaudeAuth` in `scripts/work/dispatch.mjs`) picks
     automatically, with subscription always preferred. Sandcastle's own
     issue #191 documents that subscription support won't be added natively;
     this mount-based pattern is our workaround promoted to first-class.
8. **Per-task max-attempts honoured (v2).** Each task's frontmatter may carry
`max-attempts: N` to bound the implementer/reviewer retry loop. Default 3.
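
The auth priority order in point 7 can be sketched as a small pure function. This is an illustration only, not the actual `resolveClaudeAuth`: the function name `resolveAuthSketch`, its return shape, and the injected `dirExists` helper are assumptions made so the logic can be read and tested without touching the filesystem.

```javascript
// Hypothetical sketch of the auth priority order; not the real resolveClaudeAuth.
// `dirExists` is injected so the function stays pure and easy to test.
function resolveAuthSketch({ env, homeDir, dirExists }) {
  // Subscription path: an explicit override wins over the default ~/.claude.
  const credsDir = env.SANDCASTLE_CLAUDE_CREDS_DIR || `${homeDir}/.claude`;
  if (dirExists(credsDir)) {
    return { mode: "subscription", mount: credsDir };
  }

  // API-key fallback: forward whichever keys are set on the host.
  const passthrough = {};
  for (const name of ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]) {
    if (env[name]) passthrough[name] = env[name];
  }
  if (Object.keys(passthrough).length > 0) {
    return { mode: "api-key", env: passthrough };
  }

  // Neither path is available, so an --execute run could not authenticate.
  return { mode: "none" };
}

// Usage with stubbed inputs: the creds directory exists, so it wins
// even though an API key is also set.
const auth = resolveAuthSketch({
  env: { ANTHROPIC_API_KEY: "sk-example" },
  homeDir: "/home/dev",
  dirExists: (dir) => dir === "/home/dev/.claude",
});
console.log(auth.mode); // subscription
```

Returning an explicit `none` mode, rather than throwing, would let planning mode keep working on machines with no credentials at all.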
## Alternatives considered
- **Bare Claude Code / Codex CLI invocation per task:** rejected. No sandbox
isolation; no consistent prompt template surface; no built-in branch
management; no reviewer-loop primitive.
- **GitHub Copilot Workspace / native CI agent:** rejected. Vendor lock-in;
workflow lives outside the repo; no local equivalent for development time.
- **Custom orchestrator built from scratch on the Anthropic SDK:** rejected.
Sandcastle already solves sandbox + branch + structured-output extraction;
rebuilding it is not the leverage point.
- **No orchestrator (humans dispatch each task manually via copy-paste):**
rejected as the steady-state mode, but supported as a fallback via planning
mode (`pnpm work dispatch` without `--execute`).
- **A different sandbox provider (Vercel sandboxes, Daytona, native fly.io):**
Sandcastle is provider-agnostic; the choice of provider sits behind the
`SANDCASTLE_PROVIDER` env var and can change without disrupting prompts or
orchestrator code. Default is Docker.
## Consequences
### Positive
- **Per-task isolation.** Each implementer dispatch runs in its own Docker
sandbox + sandbox branch. Bad agent output stays in the branch; merge to
`main` is gated by the reviewer agent + the full 5-gate stack.
- **Agent-agnostic.** Switching from Claude to Codex (or to a future
agent runtime) is a one-line change to the prompt's `agent` parameter.
- **Composable with existing workflow.** `pnpm work` CLI already reads
`_state.json` and the docs/work/ markdown; dispatch is one more subcommand
layered on top.
- **Cost-aware default.** Planning mode invokes no agent; only `--execute`
spends tokens. Operators choose when to escalate from plan to execute.
- **Recoverable failure modes.** If an implementer goes off-rails, its diff
lives on a sandbox branch: review, reject, re-dispatch with notes.
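
How dispatch layers on the existing task files, and why planning mode costs nothing, can be sketched as follows. The shapes here are assumptions for illustration: the ADR does not specify the `_state.json` schema, so `state.stories`, the `status: "ready"` value, and the `markdownByStory` map are hypothetical, as are the function names.

```javascript
// Hypothetical shapes: `state.stories` with `id` and `status`, and a map from
// story id to markdown text, are assumptions made for this sketch.
function pickNextTask(state, markdownByStory) {
  const story = state.stories.find((s) => s.status === "ready");
  if (!story) return null;

  const lines = (markdownByStory[story.id] || "").split("\n");
  // An unchecked task-list item looks like "- [ ] text".
  const marker = /^\s*[-*] \[ \] /;
  const unchecked = lines.find((line) => marker.test(line));
  if (!unchecked) return null;

  return { storyId: story.id, criterion: unchecked.replace(marker, "").trim() };
}

function planDispatch(state, markdownByStory, { execute = false } = {}) {
  const task = pickNextTask(state, markdownByStory);
  if (!task) return { action: "nothing-ready" };
  // Planning mode only reports what would be dispatched; no agent is invoked.
  return { action: execute ? "run-agent" : "print-plan", task };
}

// Usage: the first ready story is s2, and its first unchecked bullet is picked.
const state = { stories: [{ id: "s1", status: "done" }, { id: "s2", status: "ready" }] };
const docs = { s2: "## Acceptance criteria\n- [x] parses input\n- [ ] rejects empty payloads\n" };
const plan = planDispatch(state, docs);
console.log(plan.action);         // print-plan
console.log(plan.task.criterion); // rejects empty payloads
```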
### Negative / accepted trade-offs
- **External dependency on sandcastle.** If the project stalls, we either pin
and maintain a fork or migrate to another orchestrator. Sandcastle is small
enough (~3KLOC) that a fork is manageable.
- **Token cost is real.** A complex task can use 100K-200K tokens per
implementer + reviewer round-trip. Operators budget per-dispatch; the
planning mode + the optional `max-attempts` frontmatter cap exposure.
- **Docker dependency for the default sandbox.** Without Docker (or a
provider swap), `--execute` won't run. Documented in the runbook.
- **State mutation is manual in v1.** The orchestrator prints suggested
state mutations; a human ticks the AC bullet + commits. Auto-mutation is
v2 work, gated on confidence that the reviewer's decision can be trusted
without human inspection.
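
The way `max-attempts` caps token exposure can be worked through with the figures above: at the default of 3 attempts and 100K-200K tokens per round-trip, the worst case for one task is 300K-600K tokens. The sketch below shows that arithmetic alongside a bounded retry loop. The `implement` and `review` callbacks are hypothetical stand-ins for the two agent calls, not Sandcastle APIs, and the function names are illustrative.

```javascript
// Worst-case token exposure for one task: every attempt is rejected.
function worstCaseTokens(maxAttempts, perRoundTrip) {
  return { low: maxAttempts * perRoundTrip.low, high: maxAttempts * perRoundTrip.high };
}

// Bounded implementer/reviewer loop. `implement` and `review` are
// hypothetical stand-ins for the two agent calls, not Sandcastle APIs.
async function runWithRetries(task, { implement, review, maxAttempts = 3 }) {
  let notes = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const diff = await implement(task, notes);
    const verdict = await review(task, diff);
    if (verdict.approved) return { approved: true, attempts: attempt, diff };
    // Carry the reviewer's notes into the next attempt.
    notes = notes.concat(verdict.notes || []);
  }
  return { approved: false, attempts: maxAttempts, notes };
}

console.log(worstCaseTokens(3, { low: 100000, high: 200000 }));
// { low: 300000, high: 600000 }

// Usage with stub agents: the first diff is rejected, the second approved.
runWithRetries("task-1", {
  implement: async (task, notes) => `diff after ${notes.length} notes`,
  review: async (task, diff) =>
    diff.includes("1 notes") ? { approved: true } : { approved: false, notes: ["use the generator"] },
}).then((result) => console.log(result.attempts)); // 2
```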
### Follow-up work
- **Auto state mutation:** when the reviewer agent's decision is approve,
the orchestrator could automatically tick the AC bullet + commit. Currently
manual; promote when reviewer confidence is established empirically.
- **Multi-task batch dispatch:** `pnpm work dispatch --all-ready` would
fan out across all ready stories. Requires DAG-aware concurrency
(no two implementers touching the same files).
- **Sandcastle CI image alignment:** the `.sandcastle/Dockerfile` is
minimal; once we identify the CI base image, the sandbox should extend it
to match the CI environment exactly.
- **Cost telemetry:** `sandcastle.run()` returns iteration usage stats; the
orchestrator could log these to `_state.json` per-task so operators see
cumulative spend.
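
The concurrency constraint for batch dispatch (no two implementers touching the same files) could be approached with a greedy pass over the ready tasks. This is one possible sketch, not a design commitment: the `files` field on each task is a hypothetical declaration, since the ADR does not say how touched files would be known ahead of time, and a full DAG-aware scheduler would also need dependency ordering.

```javascript
// Greedy selection of tasks whose declared file sets do not overlap.
// The `files` field on each task is a hypothetical declaration.
function selectNonConflicting(readyTasks) {
  const claimed = new Set();
  const batch = [];
  const deferred = [];
  for (const task of readyTasks) {
    const conflicts = task.files.some((file) => claimed.has(file));
    if (conflicts) {
      // Wait for a later batch rather than race another implementer.
      deferred.push(task.id);
      continue;
    }
    task.files.forEach((file) => claimed.add(file));
    batch.push(task.id);
  }
  return { batch, deferred };
}

// Usage: tasks a and b both touch src/y.ts, so b is deferred.
const result = selectNonConflicting([
  { id: "a", files: ["src/x.ts", "src/y.ts"] },
  { id: "b", files: ["src/y.ts"] },
  { id: "c", files: ["src/z.ts"] },
]);
console.log(result.batch.join(","));    // a,c
console.log(result.deferred.join(",")); // b
```

Earlier tasks win ties in this sketch, so the order of the ready list acts as the priority order.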