
Danijel Martinek 3f0d60e082 docs(adr): ADR-019 — Sandcastle for agent orchestration

Captures the decision to adopt @ai-hero/sandcastle as the orchestration
substrate for agent-driven development in this template. Records the
8-point decision (workspace dep, .sandcastle/ prompts, Dockerfile,
dispatch.mjs orchestrator, planning vs execute modes, generator-first
reviewer check, bring-your-own-key, per-task max-attempts), the four
alternatives considered (bare CLI / Copilot Workspace / custom-from-
scratch / no orchestrator), and four trade-offs (external dep, token
cost, Docker dependency, manual state mutation in v1).

Surfaces the decision at the top of README.md and AGENTS.md so new
contributors see the agent-driven framing before they hit the package
map or daily commands.

2026-05-13 09:15:13 +02:00


ADR-019 — Sandcastle for Agent Orchestration

Status: Accepted
Date: 2026-05-13
Spec: docs/architecture/agent-first-workflow-and-conformance.md
Companion guide: docs/guides/runbook.md ("Using Sandcastle for agent dispatch")
Related: ADR-011 (TDD foundation), ADR-012 (Lazar conformance), ADR-015 (events and jobs)

Context

This template is designed for agent-driven feature development. The conformance system (ADR-012 + the post-ADR conformance-system-v1 epic) gives agents a tight, layered feedback loop — type errors in 0s, lint in <1s, boot assertion in ~3s, CI gates in ~120s. The remaining substrate question is: how does an agent actually get dispatched against a task?

Three pieces are needed:

A way to invoke an agent (Claude / Codex) with a task description, inside a sandbox so the agent can't break the host while iterating.
A way to capture the agent's commits so a reviewer agent can inspect the diff and approve or reject.
A way to compose the above into a per-task dispatch loop with retry semantics, branch management, and integration into the existing docs/work/ task system.

Without a substrate that handles all three, agentic development falls back to copy-paste-prompt-by-hand, which is slow and error-prone.

Decision

Adopt Sandcastle (@ai-hero/sandcastle) as the agent-orchestration substrate. pnpm work dispatch is the entry point.

Concretely:

@ai-hero/sandcastle is a workspace-root devDependency. Declared as ^2.73.0 at adoption; the caret range lets pnpm resolve later 2.x minor and patch releases, while the lockfile fixes the exact version installed.
.sandcastle/ holds the canonical prompt templates. Five role-specific prompts: PRD eliciter, ADR eliciter, decomposer, implementer, reviewer. Each enforces the generator-first rule (prefer pnpm turbo gen <kind> over hand-rolling — see saved memory generator-first-for-agents).
.sandcastle/Dockerfile is the sandbox baseline (node:22-bookworm-slim + pnpm via corepack). The agent runs pnpm install --frozen-lockfile as its first step per the implementer prompt.
scripts/work/dispatch.mjs is the orchestrator. It reads _state.json, finds the first ready story's first unchecked AC bullet, builds a task spec, and calls sandcastle.run({ promptFile, promptArgs: { TASK_FILE_CONTENT } }) for the implementer, then again for the reviewer with {{DIFF}}. The orchestrator does NOT mutate state in v1 — it prints suggested mutations for the human to apply.
Two modes: pnpm work dispatch (planning, no agent invoked) and pnpm work dispatch --execute (real sandcastle call, requires ANTHROPIC_API_KEY or OPENAI_API_KEY).
Reviewer agent verifies generator-first. Hand-rolled output that should have been a pnpm turbo gen <kind> invocation is grounds for rejection.
Bring-your-own-key for cost control. No bundled API key. Agents only dispatch when the operator explicitly provides credentials.
Per-task max-attempts honoured (v2). Each task's frontmatter may carry max-attempts: N to bound the implementer↔reviewer retry loop. Default 3.
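The selection step and the planning/execute split described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/work/dispatch.mjs: the _state.json shape (stories, status, acceptanceCriteria, checked) and the function names findNextTask and resolveMode are assumptions made for the example. Only the --execute flag and the two environment variable names come from this ADR.

```javascript
// Hypothetical shape for _state.json; the real schema is not shown in this ADR.
const exampleState = {
  stories: [
    { id: "story-1", status: "done", acceptanceCriteria: [{ text: "Scaffold package", checked: true }] },
    {
      id: "story-2",
      status: "ready",
      acceptanceCriteria: [
        { text: "Endpoint returns 200", checked: true },
        { text: "Endpoint validates input", checked: false },
      ],
    },
  ],
};

// Return the first ready story's first unchecked AC bullet, or null if there is none.
// Read literally: only the first ready story is inspected.
function findNextTask(state) {
  const story = state.stories.find((s) => s.status === "ready");
  if (!story) return null;
  const acIndex = story.acceptanceCriteria.findIndex((ac) => !ac.checked);
  if (acIndex === -1) return null;
  return { storyId: story.id, acIndex, text: story.acceptanceCriteria[acIndex].text };
}

// Planning mode is the default; execute mode needs an operator-supplied key.
function resolveMode(argv, env) {
  if (!argv.includes("--execute")) return { mode: "planning" };
  if (!env.ANTHROPIC_API_KEY && !env.OPENAI_API_KEY) {
    throw new Error("--execute requires ANTHROPIC_API_KEY or OPENAI_API_KEY");
  }
  return { mode: "execute" };
}

const task = findNextTask(exampleState);
const { mode } = resolveMode(["--execute"], { ANTHROPIC_API_KEY: "example-key" });
console.log(mode, task);
```

Passing argv and env in as arguments, rather than reading process.argv and process.env inside the functions, keeps both steps testable without touching the real environment.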

Alternatives considered

Bare Claude Code / Codex CLI invocation per task — rejected. No sandbox isolation; no consistent prompt template surface; no built-in branch management; no reviewer-loop primitive.
GitHub Copilot Workspace / native CI agent — rejected. Vendor lock-in; workflow lives outside the repo; no local equivalent for development time.
Custom orchestrator built from scratch on the Anthropic SDK — rejected. Sandcastle already solves sandbox + branch + structured-output extraction; rebuilding it is not the leverage point.
No orchestrator — humans dispatch each task manually via copy-paste — rejected as the steady-state mode, but supported as a fallback via planning mode (pnpm work dispatch without --execute).
A different sandbox provider (Vercel sandboxes, Daytona, native fly.io) — sandcastle is provider-agnostic; the choice of provider sits behind the SANDCASTLE_PROVIDER env var and can change without disrupting prompts or orchestrator code. Default is Docker.

Consequences

Positive

Per-task isolation. Each implementer dispatch runs in its own Docker sandbox + sandbox branch. Bad agent output stays in the branch; merge to main is gated by the reviewer agent + the full 5-gate stack.
Agent-agnostic. Switching from Claude to Codex (or to a future agent runtime) is a one-line change to the prompt's agent parameter.
Composable with existing workflow. pnpm work CLI already reads _state.json and the docs/work/ markdown; dispatch is one more subcommand layered on top.
Cost-aware default. Planning mode invokes no agent; only --execute spends tokens. Operators choose when to escalate from plan to execute.
Recoverable failure modes. If an implementer goes off-rails, its diff lives on a sandbox branch — review, reject, re-dispatch with notes.

Negative / accepted trade-offs

External dependency on sandcastle. If the project stalls, we either pin + maintain a fork or migrate to another orchestrator. Sandcastle is small enough (~3KLOC) that a fork is manageable.
Token cost is real. A complex task can use 100K-200K tokens per implementer + reviewer round-trip. Operators budget per-dispatch; the planning mode + the optional max-attempts frontmatter cap exposure.
Docker dependency for the default sandbox. Without Docker (or a provider swap), --execute won't run. Documented in the runbook.
State mutation is manual in v1. The orchestrator prints suggested state mutations; a human ticks the AC bullet + commits. Auto-mutation is v2 work, gated on confidence that the reviewer's decision can be trusted without human inspection.
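To make the interaction between the max-attempts cap and manual state mutation concrete, here is a hedged sketch of a bounded implementer/reviewer loop. The helper names (readMaxAttempts, dispatchTask), the review result shape ({ approved, notes }), and the injected runImplementer / runReviewer functions are all hypothetical stand-ins for the real agent calls. The sketch only assumes what this ADR states: a max-attempts: N frontmatter field, a default of 3, and an orchestrator that suggests mutations instead of applying them.

```javascript
// Read max-attempts from the task's frontmatter text; fall back to the default of 3.
function readMaxAttempts(taskFileContent, fallback = 3) {
  const match = /^max-attempts:\s*(\d+)\s*$/m.exec(taskFileContent);
  return match ? Number(match[1]) : fallback;
}

// Run implementer then reviewer until approval or until the attempt cap is hit.
// runImplementer and runReviewer are injected so the loop can be exercised with stubs.
async function dispatchTask(taskFileContent, { runImplementer, runReviewer }) {
  const maxAttempts = readMaxAttempts(taskFileContent);
  let notes = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const diff = await runImplementer({ taskFileContent, notes });
    const review = await runReviewer({ diff });
    if (review.approved) {
      // v1 behaviour: report a suggested mutation, never write state.
      return { approved: true, attempts: attempt, suggestion: "tick the AC bullet and commit" };
    }
    notes = review.notes; // rejection notes feed the next implementer attempt
  }
  return { approved: false, attempts: maxAttempts, suggestion: "leave the AC bullet unchecked" };
}

// Usage with stubs: the reviewer rejects the first diff and approves the second.
const taskFile = "---\nmax-attempts: 2\n---\nImplement the endpoint.";
let calls = 0;
const result = dispatchTask(taskFile, {
  runImplementer: async () => `diff-${++calls}`,
  runReviewer: async ({ diff }) => ({ approved: diff === "diff-2", notes: "use the generator" }),
});
result.then((r) => console.log(r));
```

Because the cap bounds the number of round-trips, worst-case token spend per task is roughly the cap multiplied by the cost of one implementer + reviewer pass.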

Follow-up work

Auto state mutation — when the reviewer agent's decision is approve, the orchestrator could automatically tick the AC bullet + commit. Currently manual; promote when reviewer confidence is established empirically.
Multi-task batch dispatch — pnpm work dispatch --all-ready would fan out across all ready stories. Requires DAG-aware concurrency (no two implementers touching the same files).
Sandcastle CI image alignment — the .sandcastle/Dockerfile is minimal; once we identify the CI base image, the sandbox should extend it to match the CI environment exactly.
Cost telemetry — sandcastle.run() returns iteration usage stats; the orchestrator could log these to _state.json per-task so operators see cumulative spend.
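The concurrency constraint for batch dispatch (no two implementers touching the same files) could be approximated with a greedy pass like the one below. This is only a sketch under a strong assumption: each ready task declares up front the files it expects to touch, via a hypothetical files array. Nothing in this ADR says the task format carries that information, and selectBatch is an illustrative name.

```javascript
// Greedily pick ready tasks whose declared file sets do not overlap.
// Tasks that conflict with an already-selected task are deferred to a later batch.
function selectBatch(readyTasks) {
  const claimed = new Set();
  const batch = [];
  for (const task of readyTasks) {
    const conflict = task.files.some((file) => claimed.has(file));
    if (conflict) continue;
    task.files.forEach((file) => claimed.add(file));
    batch.push(task.id);
  }
  return batch;
}

const ready = [
  { id: "t1", files: ["src/a.ts", "src/b.ts"] },
  { id: "t2", files: ["src/b.ts"] }, // overlaps with t1 on src/b.ts
  { id: "t3", files: ["src/c.ts"] },
];
console.log(selectBatch(ready)); // [ 't1', 't3' ]
```

A greedy pass favours tasks earlier in the list and does not maximise batch size; whether that matters depends on how ready stories are ordered.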
