
Auditable autonomy: why we gate the publisher behind a human

Autonomous agents publishing to your socials is a bad idea, full stop. Here's the architectural principle we built Moonshift around - why the publisher is gated, what the auditor actually audits, and what 'human in the loop' looks like when the loop is five minutes long.

6 min read · hitl · safety · architecture

There's a question we get on roughly every demo: "wait, the agents deploy the app, push the repo, generate the hero image, and write the X post - but you won't let them hit Publish on the X post? Why?" The short answer is: autonomy should stop at identity. The long answer is what this post is about.

The two things autonomous agents can't unring

When you hand an AI process the keys to act on your behalf, the failures come in two flavors. Reversible failures: it deploys a bad build, pushes a commit with a typo, picks the wrong database schema. You fix those with git revert, a second deploy, or a migration. Annoying, not existential.

Irreversible failures: it says something in your voice, to your audience, on a platform that punishes deletion with an algorithmic shadow. You cannot unring that bell. A tweet lives in quote-retweet caches, in screenshots, in someone's angry DM archive. A LinkedIn post lives in the newsfeed of every recruiter you've ever added. Deleting the post doesn't undo the damage; it just adds "and also panicked" to the damage.

The internet forgives broken builds. It does not forgive you posting "thrilled to announce" about a product that doesn't work yet.

So the load-bearing principle in Moonshift is: autonomous agents can do anything that's reversible, and nothing that isn't. Everything on the reversible side of the line is fair game for the swarm. Everything on the irreversible side - specifically, anything that emits a public artifact carrying your name - gets a human gate.
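One way to make that line concrete is to tag every action an agent can request with whether it can be undone, and to fail closed on anything unknown. The sketch below is illustrative only: the action names, the `ACTIONS` catalogue, and the `requires_human` helper are hypothetical, not Moonshift's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"      # can be undone: revert, redeploy, migrate
    IRREVERSIBLE = "irreversible"  # emits a public artifact carrying your name


@dataclass(frozen=True)
class Action:
    name: str
    reversibility: Reversibility


# Hypothetical catalogue of actions, for illustration only.
ACTIONS = {
    a.name: a
    for a in (
        Action("git_push", Reversibility.REVERSIBLE),
        Action("deploy", Reversibility.REVERSIBLE),
        Action("db_migrate", Reversibility.REVERSIBLE),
        Action("post_to_x", Reversibility.IRREVERSIBLE),
        Action("post_to_linkedin", Reversibility.IRREVERSIBLE),
    )
}


def requires_human(action_name: str) -> bool:
    """Return True if the action must wait for a human gate.

    Unknown actions are treated as irreversible, so the policy fails closed.
    """
    action = ACTIONS.get(action_name)
    if action is None:
        return True
    return action.reversibility is Reversibility.IRREVERSIBLE
```

The fail-closed default is the design choice worth copying: a new action nobody has classified yet waits for a human instead of slipping through.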

What that looks like in the pipeline

The Moonshift DAG has ten phases and fourteen agents. Nine of those phases run unattended. The tenth - publisher - does not. The distinction is intentional and architectural:

  • Phases 1–9 (autonomous): planner, db, backend, frontend, tests, contract-validator, deployer, marketer, auditor, auditor-security, image-gen. These run end-to-end in about five minutes. They touch your GitHub (push), your Vercel (deploy), and your Turso (DB migrations). They don't touch any third-party social platform.
  • Phase 10 (gated): publisher. The publisher agent produces X and LinkedIn drafts, writes them to your dashboard, and stops. Nothing calls the X API. Nothing calls the LinkedIn API. The only code path that makes an outbound social request is triggered by a human click on your dashboard - one click per platform, with a preview rendered exactly as it will appear when posted.

That's not a UX convenience. It's enforced at the orchestrator layer. The publisher agent literally does not have credentials for the social APIs; only the dashboard does. A compromised agent can't post on its own: it would also need a compromised dashboard session, and that is a different class of problem, one that still requires a human-initiated action to exploit.
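A minimal sketch of that separation, under stated assumptions: the class names, the `human_clicked` flag, and the `outbox` list standing in for the real outbound API call are all hypothetical. The point it illustrates is structural: the agent type has nowhere to keep a token, so it cannot post even if it misbehaves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Draft:
    platform: str  # "x" or "linkedin"
    text: str


class PublisherAgent:
    """Drafts only. Holds no social credentials (hypothetical sketch)."""

    def draft(self, platform: str, summary: str) -> Draft:
        return Draft(platform=platform, text=f"Shipped: {summary}")


@dataclass
class Dashboard:
    # Assumption: credentials live only here, never in the agent process.
    tokens: dict[str, str]
    # Stand-in for real outbound API calls, so the sketch stays runnable.
    outbox: list[tuple[str, str]] = field(default_factory=list)

    def publish(self, draft: Draft, human_clicked: bool) -> None:
        """The only code path that makes an outbound social request."""
        if not human_clicked:
            raise PermissionError("publish requires a human click")
        if draft.platform not in self.tokens:
            raise KeyError(f"no credential for {draft.platform}")
        self.outbox.append((draft.platform, draft.text))
```

In this sketch an agent can call `draft()` as often as it likes; nothing leaves the system until `publish()` is invoked with a human click.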

What the auditor actually does (it's not rubber-stamping)

Before the publisher even gets to draft, phase 7 runs the auditor and auditor-security agents in parallel. These aren't decorative. They produce two JSON reports: auditor/REPORT.json and auditor-security/SECURITY_REPORT.json. Each carries a score, a set of enumerated issues with severities, and a pass/fail verdict. If either fails beyond a configured threshold, an audit-fixer agent runs a repair pass (up to one retry), after which the auditor reruns.
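The audit-and-repair loop can be sketched roughly as follows. This is an assumption-heavy illustration, not the real orchestrator: the `Report` fields mirror the score, issues, and verdict described above, but the field names, the `audit_with_repair` function, and the reading of "fails beyond a configured threshold" as "score below threshold" are ours.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_REPAIR_PASSES = 1  # "up to one retry"


@dataclass
class Issue:
    severity: str  # e.g. "low", "medium", "high"
    message: str


@dataclass
class Report:
    score: int
    verdict: str  # "pass" or "fail"
    issues: list[Issue] = field(default_factory=list)


def audit_with_repair(
    run_audits: Callable[[], list[Report]],
    run_fixer: Callable[[list[Report]], None],
    threshold: int,
) -> tuple[list[Report], int]:
    """Run the audits; if any report scores below threshold, run the
    fixer and re-audit, at most MAX_REPAIR_PASSES times.

    Returns the final reports and the number of repair passes used.
    """
    reports = run_audits()
    passes = 0
    while passes < MAX_REPAIR_PASSES and any(r.score < threshold for r in reports):
        run_fixer(reports)
        passes += 1
        reports = run_audits()
    return reports, passes
```

Note that the loop returns whatever the last audit said, even if it still fails. Deciding what to do with a failing report is left to the gate, not to the repair loop.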

The auditor-security pass is the one that matters for this conversation. It flags:

  • Secrets leaked into committed files (API keys, tokens, service accounts).
  • Public endpoints missing auth gates that the planner said should exist.
  • SQL injection vectors introduced by the backend agent's query construction.
  • Dangerous package choices (unmaintained, known-CVE, or typosquat candidates).
  • Missing rate limits on endpoints the planner flagged as sensitive.

The auditor-security verdict is attached to the run. When you land on the publisher gate, the dashboard shows you the verdict before the draft copy. If the security score is below a configured threshold, the "Publish" button is disabled by default. You can override that, but the override is explicit, logged, and shown in the audit trail for that run.
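As a rough sketch of that gate logic (the `PublishGate` class, its field names, and the audit-trail entry shape are hypothetical, chosen only to illustrate the behavior described above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PublishGate:
    security_score: int
    threshold: int
    overridden: bool = False
    audit_trail: list[dict] = field(default_factory=list)

    def publish_enabled(self) -> bool:
        """Enabled if the scan cleared the threshold or a human overrode it."""
        return self.security_score >= self.threshold or self.overridden

    def override(self, user: str, reason: str) -> None:
        """Explicit override: always leaves an entry in the audit trail."""
        self.overridden = True
        self.audit_trail.append(
            {
                "event": "security_override",
                "user": user,
                "reason": reason,
                "score": self.security_score,
                "threshold": self.threshold,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )
```

The override and the log entry happen in the same method on purpose: there is no way to flip the flag without writing the record.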

HITL when the loop is five minutes long

Classic human-in-the-loop design assumes the human has time. The loop is slow, so review is deliberate. Moonshift inverts that assumption: the pipeline is fast, so the review has to be fast, too, without being shallow.

Our answer is "review the diff, not the code." What the human is asked to approve at the publisher gate is not "did the whole pipeline work?" - you can see that yourself from the deploy URL. It's specifically:

  1. Does this X draft say something you're willing to stand behind, exactly as written?
  2. Does this LinkedIn draft say something you're willing to stand behind, exactly as written?
  3. Does this hero image look like a thing you'd attach to your name?
  4. Did the auditor-security scan pass at a level you accept?

Four questions, each answerable in under thirty seconds. That's the loop. A human who just woke up can answer it with coffee in hand. A human who's in a meeting can answer it on their phone. The whole point is that the review stays bounded even as the production side keeps getting faster.
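The four questions map naturally onto a tiny checklist structure. The sketch below is hypothetical (the `GateReview` name and its fields are ours), and only shows how a bounded review can be represented so that the dashboard can say exactly which answer is still blocking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateReview:
    x_draft_ok: bool         # question 1
    linkedin_draft_ok: bool  # question 2
    hero_image_ok: bool      # question 3
    security_ok: bool        # question 4

    def blocking(self) -> list[str]:
        """Names of the questions that were answered 'no', in order."""
        return [name for name, ok in vars(self).items() if not ok]

    def all_clear(self) -> bool:
        return not self.blocking()
```

Because the review is a fixed set of four booleans, it stays the same size no matter how much the pipeline behind it grows.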

A note on edits, because nobody approves an AI draft verbatim

We know you're going to edit the draft. Everyone edits the draft. That's fine - it's actually the point. The drafts exist to solve the "blank page" problem, not to replace your voice. The publisher agent writes something specific enough that you can react to it instead of inventing it. React-to-text is ten times faster than generate-text, especially at 11pm.

Edits happen in the same dashboard surface that holds the gate. The published version - the one that actually hits X or LinkedIn - is whatever you pressed publish on, with whatever edits you made, logged verbatim in the run's audit trail. If you ever need to answer the question "did an AI post this under my name?" - the answer is "no, I did, and here's the timestamp."
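A publish record along these lines would be enough to answer that question. This is a sketch under assumptions: the `PublishRecord` fields and the `record_publish` helper are hypothetical, not the actual audit-trail schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PublishRecord:
    platform: str
    draft_text: str      # what the publisher agent wrote
    published_text: str  # what the human pressed publish on, verbatim
    published_by: str
    published_at: str    # ISO 8601 timestamp, UTC

    @property
    def edited(self) -> bool:
        """True if the human changed the draft before publishing."""
        return self.draft_text != self.published_text


def record_publish(
    platform: str, draft_text: str, published_text: str, user: str
) -> PublishRecord:
    """Build the audit-trail entry at the moment of the human click."""
    return PublishRecord(
        platform=platform,
        draft_text=draft_text,
        published_text=published_text,
        published_by=user,
        published_at=datetime.now(timezone.utc).isoformat(),
    )
```

Keeping both the draft and the published text in the record makes the human's edits visible after the fact, not just the final wording.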

The meta-principle

There's a temptation, in the current cycle, to demo "full autonomy" as a feature. An agent that does the whole thing, posts and all, is more viral than an agent that stops at a gate. We understand the attraction; we've watched the demos. We think they're a bad idea dressed up as a good one.

The gate is the feature. "Deployed and drafted while you slept, parked for your approval" is a more defensible product than "posted while you slept, hope you like it." The day we drop the gate, we've stopped making the thing builders actually want and started making a thing that generates an incident report on a Tuesday morning.

Auditable autonomy, bounded authority, and a human at the identity boundary. That's the deal. That's the whole deal.