Designing an AI agent pipeline that ships every time

How Moonshift's 14-agent pipeline is structured: phases, parallelism, the contract validator, the fixer loop, the per-run cost cap, and why human-in-the-loop (HITL) gates are non-negotiable. A technical post for engineers building their own agent systems.

May 23, 2026·10 min read·engineering · agents · pipeline · architecture

Most agent systems fail the same way: an agent gets stuck in a loop, burns tokens, produces nothing usable, and the operator has no signal until the bill arrives. The two papers everyone reads (ReAct, Reflexion) describe the agent. The hard part of running agents in production is everything around the agent: the contract, the budget, the fixer, the gate.

This post documents how Moonshift’s pipeline is shaped after 18 months of running real agentic launches end to end. None of this is novel research. It is the operational scaffolding that turns a flaky LLM into a thing that ships a deployed app 95% of the time without supervision.

The agent is the model. The pipeline is everything that catches the model when it falls.

The shape: 14 agents, 10 phases, one DAG

Moonshift runs 14 specialised agents across a 10-phase directed acyclic graph (DAG). Phases sequence so that downstream agents can consume upstream artefacts. Specifically:

Phase 1: planner (Opus 4.7) emits a spec.md + contract.json describing routes, schema, components.
Phase 2: db (Sonnet 4.6) emits a Drizzle schema.ts + migration.sql from the contract.
Phase 3: backend (Sonnet 4.6) emits API routes that satisfy the contract.
Phase 4: frontend + tests run in parallel: frontend emits pages/components, tests emits Playwright specs.
Phase 5: contract validator (deterministic) reconciles the generated backend and frontend code against contract.json.
Phase 6: deployer pushes to GitHub, creates Vercel project, runs vercel deploy.
Phase 7: auditor + auditor-security run in parallel (40-point quality audit + auth flow review).
Phase 8: marketer drafts X thread + LinkedIn post.
Phase 9: image-gen renders three hero card sizes via gpt-image-2.
Phase 10: publisher stages everything in a PENDING.json behind a HITL approval gate.
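
To make the phase list concrete, here is a minimal sketch of a phase runner. Only the phase order and agent names come from the list above; `Phase`, `AgentFn`, `runPipeline`, and the artefact map are hypothetical names chosen for illustration, not Moonshift's actual orchestrator.

```typescript
// Hypothetical types: an agent reads upstream artefacts and returns its own.
type AgentFn = (artefacts: Map<string, string>) => Promise<string>;

interface Phase {
  id: number;
  agents: string[]; // more than one entry means the agents fan out in parallel
}

// Phase order and agent names taken from the list above.
const PHASES: Phase[] = [
  { id: 1, agents: ["planner"] },
  { id: 2, agents: ["db"] },
  { id: 3, agents: ["backend"] },
  { id: 4, agents: ["frontend", "tests"] },
  { id: 5, agents: ["contract-validator"] },
  { id: 6, agents: ["deployer"] },
  { id: 7, agents: ["auditor", "auditor-security"] },
  { id: 8, agents: ["marketer"] },
  { id: 9, agents: ["image-gen"] },
  { id: 10, agents: ["publisher"] },
];

async function runPipeline(
  phases: Phase[],
  registry: Record<string, AgentFn>,
): Promise<Map<string, string>> {
  const artefacts = new Map<string, string>();
  for (const phase of phases) {
    // Every agent in a phase gets the same copy of the upstream artefacts,
    // so parallel siblings cannot see each other's half-finished output.
    const snapshot = new Map(artefacts);
    const outputs = await Promise.all(
      phase.agents.map(async (name) => {
        const agent = registry[name];
        if (!agent) throw new Error(`no agent registered for ${name}`);
        return [name, await agent(snapshot)] as const;
      }),
    );
    // Outputs become visible to downstream phases only after the whole phase ends.
    for (const [name, output] of outputs) artefacts.set(name, output);
  }
  return artefacts;
}
```

Phases run strictly in sequence; `Promise.all` is what gives phases 4 and 7 their parallelism.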

Fan-out at phases 4 and 7 shaves roughly 90 seconds off the median run. The DAG is strict: an out-of-order write is rejected at the tool boundary, so the backend cannot accidentally rewrite the dashboard and the frontend cannot accidentally clobber an API route.
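
One way to enforce that rejection is to scope each agent's write tool to the paths it owns. The sketch below assumes a per-agent prefix map; the directory layout, `WRITE_SCOPES`, `assertWritable`, and `writeFile` are all hypothetical and only illustrate the idea of checking at the tool boundary rather than in the prompt.

```typescript
// Hypothetical ownership map: path prefixes each agent is allowed to write.
// The directory layout is an assumption, not Moonshift's real tree.
const WRITE_SCOPES: Record<string, string[]> = {
  db: ["db/"],
  backend: ["app/api/"],
  frontend: ["app/dashboard/", "components/"],
  tests: ["tests/"],
};

function assertWritable(agent: string, path: string): void {
  const scopes = WRITE_SCOPES[agent] ?? [];
  // Reject path traversal before checking prefixes.
  const escapes = path.split("/").includes("..");
  const inScope = scopes.some((prefix) => path.startsWith(prefix));
  if (escapes || !inScope) {
    throw new Error(`write rejected: ${agent} may not write ${path}`);
  }
}

// The write tool is the only route into the workspace, so the check
// runs on every write regardless of what the agent was prompted to do.
function writeFile(
  workspace: Map<string, string>,
  agent: string,
  path: string,
  content: string,
): void {
  assertWritable(agent, path);
  workspace.set(path, content);
}
```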

The contract: the only thing that keeps agents honest

Every phase between the planner and the deployer reads the same contract.json. The contract describes routes, request bodies, response shapes, and database tables. The contract validator runs after the backend and frontend emit code and asserts that:

Every route declared in the contract exists in the backend with the right method.
Every required request body field is consumed.
Every response shape matches a TypeScript type the frontend imports.
Every database table referenced by an API route is in the Drizzle schema.
Every Drizzle table has at least one read path or it is dead schema.
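
A deterministic validator along these lines can be sketched in a few dozen lines. The shapes below (`Contract`, `BuildSummary`, `Violation`) are assumptions about what a contract.json and a build summary might contain, not Moonshift's real schema. The sketch covers four of the five assertions; the response-shape check needs real TypeScript type analysis and is left out. "Read path" is approximated here as "referenced by at least one contract route".

```typescript
// Hypothetical shapes for the contract and for a summary of the generated code.
interface ContractRoute {
  method: string;
  path: string;
  requiredFields: string[];
  tables: string[];
}
interface Contract { routes: ContractRoute[]; }
interface ImplementedRoute { method: string; path: string; consumedFields: string[]; }
interface BuildSummary { routes: ImplementedRoute[]; schemaTables: string[]; }
interface Violation { rule: string; subject: string; }

function validate(contract: Contract, build: BuildSummary): Violation[] {
  const violations: Violation[] = [];
  const referencedTables = new Set<string>();

  for (const route of contract.routes) {
    const key = `${route.method} ${route.path}`;
    const impl = build.routes.find(
      (r) => r.method === route.method && r.path === route.path,
    );
    if (!impl) {
      // Declared in the contract, absent from the backend (or wrong method).
      violations.push({ rule: "missing-route", subject: key });
    } else {
      for (const field of route.requiredFields) {
        if (!impl.consumedFields.includes(field)) {
          violations.push({ rule: "unconsumed-field", subject: `${key} ${field}` });
        }
      }
    }
    for (const table of route.tables) {
      referencedTables.add(table);
      if (!build.schemaTables.includes(table)) {
        violations.push({ rule: "missing-table", subject: `${key} ${table}` });
      }
    }
  }

  // Tables that no route touches are dead schema.
  for (const table of build.schemaTables) {
    if (!referencedTables.has(table)) {
      violations.push({ rule: "dead-schema", subject: table });
    }
  }
  return violations;
}
```

Because the function is pure, the same inputs always yield the same violation list, which is what lets a downstream loop key on a stable fingerprint such as `rule` plus `subject`.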

A violation does not crash the run. It feeds back into a fixer loop. The fixer is an LLM with a tight write budget and a strict instruction to repair only the specific violation. After 3 fixer retries on the same fingerprint we bail to a deterministic codemod for the common cases (missing default export, schemaless route, implicit-any cleanup) and fail fast if even that does not converge.
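
The retry policy in that paragraph can be sketched as a small loop. This is an illustration under assumptions: `Finding`, `repair`, and the three callbacks are hypothetical, and "3 retries" is read here as three LLM attempts per fingerprint before the codemod gets a single shot.

```typescript
// Hypothetical finding shape; the fingerprint identifies "the same violation".
interface Finding { fingerprint: string; message: string; }

const MAX_FIXER_RETRIES = 3;

function repair(
  check: () => Finding[],              // re-runs the deterministic validator
  llmFix: (f: Finding) => void,        // LLM fixer, scoped to one finding
  codemod: (f: Finding) => boolean,    // returns false if no codemod applies
): "clean" | "failed" {
  const llmAttempts = new Map<string, number>();
  const codemodTried = new Set<string>();

  for (;;) {
    const findings = check();
    if (findings.length === 0) return "clean";

    const finding = findings[0];
    const used = llmAttempts.get(finding.fingerprint) ?? 0;
    if (used < MAX_FIXER_RETRIES) {
      llmAttempts.set(finding.fingerprint, used + 1);
      llmFix(finding);
      continue;
    }

    // LLM attempts for this fingerprint are spent: one shot at a codemod.
    if (codemodTried.has(finding.fingerprint)) return "failed"; // did not converge
    codemodTried.add(finding.fingerprint);
    if (!codemod(finding)) return "failed"; // no codemod for this case
  }
}
```

Each fingerprint gets a bounded number of attempts, but a fixer that keeps introducing new fingerprints could still loop; in this sketch that case is left to an outer spend limit rather than handled here.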

The budget: per-run moon ceiling between phases

Open-ended agent loops burn money. A widely-reported $6,000 overnight Claude Code bill happened because a scheduler had no spending ceiling. We made the ceiling a hard architectural primitive instead of a soft warning.

Every Moonshift run declares a per-run moon cap up front. Between every phase the orchestrator checks consumed vs. cap. If the next phase’s projected spend would push the run over, the orchestrator aborts. The user only pays for completed phases. The worst-case bad run loses 60% of its budget. There is no runaway case.

This is why Moonshift can afford to be aggressive about retries inside a phase budget. The fixer can iterate; the ceiling cannot be breached. The shape of the cost curve is a step function, not an open-ended exponential.
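
A minimal sketch of the between-phase check, assuming each phase carries a projected spend that also acts as its hard in-phase budget. `PlannedPhase`, `runUnderCap`, and the `runPhase` callback are hypothetical names; how Moonshift actually projects spend is not described here.

```typescript
// Hypothetical shapes; spend is measured in moons.
interface PlannedPhase { name: string; projectedMoons: number; }
interface RunResult {
  completed: string[];
  consumed: number;
  abortedBefore: string | null;
}

function runUnderCap(
  cap: number,
  phases: PlannedPhase[],
  runPhase: (phase: PlannedPhase) => number, // returns moons actually spent
): RunResult {
  let consumed = 0;
  const completed: string[] = [];

  for (const phase of phases) {
    // Gate BEFORE the phase starts: abort if the projection would breach the cap.
    if (consumed + phase.projectedMoons > cap) {
      return { completed, consumed, abortedBefore: phase.name };
    }
    const spent = runPhase(phase);
    // Assumption: the projection doubles as the phase's hard budget, so
    // retries inside a phase can never push the run past the cap.
    if (spent > phase.projectedMoons) {
      throw new Error(`${phase.name} overspent its phase budget`);
    }
    consumed += spent;
    completed.push(phase.name);
  }
  return { completed, consumed, abortedBefore: null };
}
```

Since the check happens only at phase boundaries and every phase is bounded by its own budget, consumed spend rises in steps and stops at or below the cap.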

Soft warnings do not stop runaway agents. Hard ceilings do.

The gate: HITL is a feature, not a bug

The publisher agent is the only agent in the pipeline that is explicitly forbidden from posting. It writes the drafted X thread and LinkedIn post to a PENDING.json and exits. Nothing publishes until the operator approves it from the dashboard. We get asked weekly to make this configurable. We will not.

Two reasons. First, the cost of a bad post on your own account is permanent: followers churn and the reputational damage lingers. An autoposter that is acceptable 95% of the time is still unacceptable, because 1 in every 20 posts is embarrassing. Second, drafted-then-approved beats from-scratch by a factor of 5 on actual output, while autoposting adds nothing on top of that, because nobody trusts an agent to post to their own account unreviewed.
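
The gate itself is simple to express in code. The sketch below assumes a hypothetical PENDING.json item shape (`PendingItem`) and three functions; the point is structural: the publisher can only stage, and the posting path has no bypass flag to configure.

```typescript
type Channel = "x" | "linkedin";
type Status = "pending" | "approved" | "rejected";

// Hypothetical shape of one entry in PENDING.json.
interface PendingItem {
  channel: Channel;
  body: string;
  status: Status;
  approvedBy: string | null;
}

// Publisher agent: stages drafts and exits. It holds no posting client.
function stageDrafts(drafts: Array<{ channel: Channel; body: string }>): PendingItem[] {
  return drafts.map((d): PendingItem => ({ ...d, status: "pending", approvedBy: null }));
}

// Dashboard action: only a named operator can flip the status.
function approve(item: PendingItem, operatorId: string): PendingItem {
  if (operatorId.trim() === "") throw new Error("approval requires an operator");
  return { ...item, status: "approved", approvedBy: operatorId };
}

// Posting path: refuses anything that was not explicitly approved.
function publish(
  item: PendingItem,
  send: (channel: Channel, body: string) => void,
): void {
  if (item.status !== "approved" || item.approvedBy === null) {
    throw new Error(`refusing to publish ${item.channel} draft: not approved`);
  }
  send(item.channel, item.body);
}
```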

The audit pass: catch the silent failures

Two parallel auditors run at phase 7. Both are independent of the generation agents (different models, different prompts, no shared scratchpad). The general auditor runs a 40-point check covering layout, accessibility, copy quality, link integrity, image alt text, dead buttons, and contrast ratios. The security auditor runs a separate check on the auth flow, CORS posture, env var leaks, and route protection.

A failed audit does not kill the run. It triggers a focused fixer pass on the specific violations. We track auditor signal-to-noise carefully; over the last six months the audit pass has caught and fixed issues in roughly 1 in 4 runs without operator involvement. The other 3 in 4 come back clean on the first audit pass.
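
As a rough sketch of that flow: run the auditors in parallel, hand only their findings to the fixer, then audit again. The names (`AuditFinding`, `Auditor`, `auditAndFix`) and the single re-audit with a "needs-operator" fallback are assumptions made for illustration, not a description of Moonshift's exact policy.

```typescript
// Hypothetical shapes for audit output.
interface AuditFinding { auditor: string; check: string; detail: string; }
type Auditor = (deployUrl: string) => Promise<AuditFinding[]>;
type Verdict = "clean" | "fixed" | "needs-operator";

async function auditAndFix(
  deployUrl: string,
  auditors: Auditor[],
  fix: (findings: AuditFinding[]) => Promise<void>,
): Promise<{ verdict: Verdict; findings: AuditFinding[] }> {
  // Each auditor receives only the deploy URL: no generator scratchpad,
  // no access to the other auditor's output.
  const runAll = async (): Promise<AuditFinding[]> =>
    (await Promise.all(auditors.map((audit) => audit(deployUrl)))).flat();

  const first = await runAll();
  if (first.length === 0) return { verdict: "clean", findings: [] };

  // Focused fixer pass: it sees the specific findings and nothing else.
  await fix(first);

  // Assumption: one re-audit decides whether the fix held.
  const second = await runAll();
  return second.length === 0
    ? { verdict: "fixed", findings: first }
    : { verdict: "needs-operator", findings: second };
}
```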

What this gets you, operationally

The combination of contract validator + per-run cap + fixer loop + audit pass + HITL gate means a Moonshift run can be triggered and walked away from. Median run time is 6m 42s. 95% of runs complete without human intervention between the prompt and the publisher gate. The 5% that fail do so under a budget cap, with a clear diagnostic, and a resume button.

If you are building your own agent system, the bullet-point version is:

Write down the contract first. Every agent that reads the contract is honest. Every agent without a contract drifts.
Cap spend per run between phases, not per agent. Per-agent caps leak through cumulative overruns.
Run the fixer within a tight write budget. Three retries on the same fingerprint means bail to a deterministic codemod or fail.
Run the auditor in a fresh context. The same model with a shared scratchpad will agree with itself.
Make the HITL gate non-bypassable. Operator trust is the most expensive resource you can lose.

Want to see the architecture in action? Every Moonshift run produces a public trace at /showcase with the prompt, the phase timeline, the audit verdict, and the drafted launch kit. Click into any one to walk the trace phase by phase.
