An AI agent tried to skip 17 steps.
StepProof caught every one.
26 deploys, 132 steps attempted, 17 blocked — by checking real state, not the agent's word.
An AI agent was shipping code across several repositories. On 38% of its deploys, it tried to skip a step that mattered — claiming work it hadn't done, citing commits that didn't exist, merging before tests had finished. StepProof caught every one, because StepProof doesn't trust what the agent says. It goes and checks. That's the bet, and that's what makes it different from anything you can install with a prompt or a CI rule.
The problem
When humans ship code, social pressure does most of the work. If you skip a step, you feel uneasy. A teammate at standup asks, "did you run the tests?" That nagging feeling is a kind of safety net. It's not perfect, but it works because humans care what their colleagues think.
AI agents don't have that. An agent that merges before tests finish isn't being lazy — it's just trying to finish the task. An agent that grabs a commit hash from the wrong repo isn't lying — it's juggling too many things at once. These aren't bugs in the model. They're how agents work. They're built to reach "done" as fast as they can.
The trouble starts when "done" means "I think so" instead of "I checked." On a project with several repos, two environments, a background worker, and database migrations, every unchecked guess is a small bomb. They pile up. And agents pile them up faster than humans, because agents move faster.
The numbers below come from a real production deployment pipeline. The client and industry have been anonymized. The data is real.
The approach
StepProof does four things that nothing else in this stack does. Not CI, not policy docs, not the agent's prompt, not a code review. Together, they're the reason the agent couldn't quietly skip a step.
- It blocks the agent. It doesn't ask politely. A prompt that says "please follow the process" is advice. A CI rule fires after the merge. StepProof sits in front of the action: until the proof checks out, the step does not run.
- It checks real state, not the agent's word. The agent says "I committed this." StepProof asks git. The agent says "tests passed." StepProof asks GitHub. The verifier reads the world directly, so there is no claim to trust.
- It lets the agent be wrong and try again. A blocked step isn't a failure to escalate; it's a chance to fix the proof. Twelve of seventeen blocks were "wrong evidence, retry." Three tool calls and the agent was unblocked.
- It writes a tamper-evident log of what actually happened. Every step, every attempt, every block, every retry — hash-chained. When you ask "what shipped, why, and was it verified?" months later, the answer is in the chain, not somebody's memory.
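To make "hash-chained" concrete, here is a minimal sketch of a tamper-evident log in Python. It is not StepProof's actual format: the record fields, the `GENESIS` placeholder, and the function names are all assumptions chosen for illustration. The idea is only that each entry's hash covers the previous entry's hash, so changing an old entry invalidates everything after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, linking it to the current tail of the chain."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "record": record,
                "hash": entry_hash(prev_hash, record)})


def verify_chain(log: list) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative records only; the field names are made up for this sketch.
log = []
append(log, {"step": "merge", "attempt": 1, "outcome": "blocked"})
append(log, {"step": "merge", "attempt": 2, "outcome": "passed"})
```

Note that a chain like this detects edits, not truncation: dropping the newest entries still leaves a valid chain, so a real system would also need to store the latest hash somewhere the writer cannot change.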
To use it on this project, we wrote six checklists, one for each kind of deploy. The agent picks the right list and works through it. The agent still decides how to write the feature, what to name things, which approach to take — that's its job. StepProof only gates the steps where being wrong has consequences.
- rb-ship-to-staging (6 steps): issue → PR → CI → merge → deploy → smoke
- rb-quick-fix (3 steps): PR → CI → merge
- rb-promote-to-production (4 steps): merge → deploy → smoke → verify
- rb-apply-migration (3 steps): backup → apply → verify
- rb-deploy-pipeline (6 steps): tests → PR → CI → merge → deploy → health
- rb-ship-worker (5 steps): tests → deploy → verify → docs → health
The evidence
Over four days, the agent ran 26 deploys. It tried to complete 132 steps. 17 of those tries got blocked — about one in every eight steps, affecting 38% of deploys.
- 8 blocks: Wrong repo, or the commit didn't exist. The agent gave us a hash that wasn't in the repo it claimed. We checked with git cat-file and it failed.
- 4 blocks: Tests weren't done yet. The agent tried to merge before CI finished. We asked GitHub; GitHub said "still running."
- 3 blocks: Missing proof. The agent skipped a required field, like the file path or row count we needed to confirm the change.
- 2 blocks: Misconfigured checklist. The verifier we were supposed to run didn't exist yet. We caught this before any action could run under a broken check.
The wrong-repo failures were the most telling. Picture an agent juggling several repos at once. It commits to the dashboard, then hands us that commit hash as proof of a fix in the database repo. Wrong repo, instant block, agent corrects it and retries within a few seconds.
Without us checking, those eight bad references would have just gone through. Nothing breaks right away. Three weeks later, at 2am, someone tries to roll back that deploy. They run git revert <sha> and get fatal: bad object. What should have been a 20-minute fix turns into a 4-hour scramble through audit logs that don't actually point to anything real.
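The wrong-repo check can be sketched in a few lines. This is an illustration under assumptions, not StepProof's verifier: the function name `commit_exists` and the injectable `run` parameter are hypothetical. The git invocation itself is standard: `git cat-file -e` exits non-zero when the object is missing, and the `^{commit}` suffix requires that the hash resolve to a commit.

```python
import subprocess


def commit_exists(repo_path: str, sha: str, run=subprocess.run) -> bool:
    """Return True only if `sha` names a commit in the repo at `repo_path`.

    `run` is injectable so the check can be exercised without a real repo.
    """
    if not sha or sha.startswith("-"):
        # Never let agent-supplied text be parsed as a command-line option.
        return False
    result = run(
        ["git", "-C", repo_path, "cat-file", "-e", f"{sha}^{{commit}}"],
        capture_output=True,
    )
    return result.returncode == 0
```

The important argument is `repo_path`: the check runs against the repo the agent claimed, so a hash that is perfectly valid in the dashboard repo still fails when offered as proof for the database repo.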
The four "tests not done" catches are a different kind of trouble. The merge goes through. The code happens to work. Nobody notices the tests were never green for that commit. Six weeks later, an edge case trips a bug. Someone tries to find the cause, traces it back to that PR, and spends three days looking in the wrong place — because the merge slipped past CI and nothing in the history says so.
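A "tests not done" check might look like the sketch below. It assumes the verifier has already fetched the JSON that GitHub's REST API returns when listing check runs for a commit (a `check_runs` list whose items carry `name`, `status`, and `conclusion`). The function name `ci_verdict` and the set of conclusions treated as passing are assumptions for illustration, not StepProof's rules.

```python
from typing import Tuple

# Conclusions treated as passing in this sketch; which ones count is a policy choice.
PASSING = {"success", "neutral", "skipped"}


def ci_verdict(payload: dict) -> Tuple[bool, str]:
    """Decide whether a merge may proceed, given a check-runs style payload."""
    runs = payload.get("check_runs", [])
    if not runs:
        # No evidence is not the same as green.
        return False, "no check runs reported for this commit"
    for run in runs:
        if run.get("status") != "completed":
            return False, f"{run.get('name')} is still {run.get('status')}"
    for run in runs:
        if run.get("conclusion") not in PASSING:
            return False, f"{run.get('name')} concluded {run.get('conclusion')}"
    return True, "all check runs completed and passed"
```

The reason string matters as much as the boolean: it is what the agent reads to learn that the answer is "wait and retry" rather than "fix the code."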
The graph
How a single step gets through — or doesn't. The agent says "here's my evidence." A separate piece of code goes and checks. If the evidence doesn't hold up, the step is blocked and the agent gets to try again with better proof.
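The flow described above can be sketched as a small gate function. Everything here is hypothetical (the `gate` and `Decision` names, the verifier signature, the in-memory `known_commits` stand-in for real repo state); it only shows the shape: the action runs if and only if a separate verifier accepts the evidence, and a missing verifier blocks rather than passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A verifier inspects evidence and returns (ok, reason). Hypothetical signature.
Verifier = Callable[[Dict[str, str]], Tuple[bool, str]]


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate(step: str, evidence: Dict[str, str],
         verifiers: Dict[str, Verifier],
         action: Callable[[], None]) -> Decision:
    """Run `action` only if the step's verifier accepts the evidence."""
    verifier = verifiers.get(step)
    if verifier is None:
        # A step with no verifier is blocked, not waved through.
        return Decision(False, f"no verifier configured for step '{step}'")
    ok, reason = verifier(evidence)
    if not ok:
        return Decision(False, reason)  # the agent reads this and retries
    action()
    return Decision(True, reason)


# Demo with a stand-in for real repo state.
known_commits = {"database": {"a1b2c3"}}


def verify_commit(evidence: Dict[str, str]) -> Tuple[bool, str]:
    repo, sha = evidence.get("repo"), evidence.get("sha")
    if sha in known_commits.get(repo, set()):
        return True, "commit found"
    return False, f"{sha} not found in {repo}"


ran = []
verifiers = {"merge": verify_commit}
first = gate("merge", {"repo": "database", "sha": "ffff00"},
             verifiers, lambda: ran.append("merge"))
retry = gate("merge", {"repo": "database", "sha": "a1b2c3"},
             verifiers, lambda: ran.append("merge"))
```

In the demo, the first attempt is blocked with a reason the agent can act on, and the retry with the correct hash is the only call that actually runs the action.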
Before and after
Before:
- Standards slip over time. You start at 100%. Within months you're at 60–70%, because the agent learned which steps it can skip without anyone noticing.
- Bugs hide for weeks. A broken audit link sits there until somebody needs it. A skipped test shows up as a production bug long after anyone remembers the merge.
- Mistakes stack up. Feature B is built on top of broken deploy A. Feature C builds on B. By the time you find A, three features depend on it.
- Trust costs you meetings. Engineers add manual reviews. Releases get bigger and rarer. "What's actually in production?" becomes a recurring agenda item.
After:
- The rules can't slip. 100% on every deploy. Drift isn't a risk you're managing; it isn't possible.
- Mistakes get caught right away. Every catch happens before anything ships. A block costs the agent 10–30 seconds. The alternative would have cost hours or days of incident work.
- The agent fixes itself. The reason for the block is right there in the response. The agent reads it, fixes the proof, retries. Cheaper than a Slack thread.
- You can prove what shipped. Every deploy has a tamper-evident trail. To roll back, run git revert <sha>, and the hash is guaranteed to actually exist.
The result
Here's the thing other tools miss. When humans were the ones deploying, social pressure was the safety net. It was slow and imperfect, but it held — and when it slipped, it slipped slowly. AI agents broke that math. They move faster, so small problems pile up faster. They don't feel social pressure, so the slip isn't gradual — it's instant. They build confidently on guesses that a human would have paused over.
The pieces that worked for humans don't work for agents. Code review is too slow. CI checks one thing: do the tests pass. Policy docs are advice the agent will ignore the moment finishing the task gets in the way. Telling the agent in its prompt to "always verify before merging" is exactly as effective as telling a person to "always remember to floss." StepProof is the piece that fills the gap between "we wrote down the rules" and "the rules can't be skipped."
Why we didn't build a blockchain
Bitcoin solved a similar-sounding problem, but its answer was wrong for software. A blockchain locks every transaction in forever. Every block is permanent. You can't change it, retry it, or fix it later. That makes sense when you're moving money between strangers who don't trust each other. It's the wrong shape for writing software, where being wrong, getting caught, and fixing it is the whole job.
StepProof made a different bet. Check the proof on the steps that matter. Let the agent be wrong on the way there. Twelve of the seventeen blocks in this run were "you gave us the wrong evidence; here's what we actually saw; try again." A blockchain would have rejected those as invalid transactions. StepProof treated them as the agent learning, and gave it a few tool calls to fix the proof. The result is an audit trail you can actually trust without the rigidity of a permanent ledger: verifiability without immutability.
The trade is lopsided
When the agent gets blocked, it loses 10 to 30 seconds figuring out what went wrong and trying again. The mistakes those blocks prevent would have cost hours of incident response, days of root-cause hunting, and months of slow-burning trust loss — that drift from "the deploys are fine" to "are we sure?" The numbers don't have to be exact for the trade to come out the same way every time.
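A rough back-of-the-envelope version of that trade, using only figures stated in this write-up (17 blocks, 10 to 30 seconds each, and the 4-hour rollback scramble from the wrong-repo scenario). The variable names are ours, and the comparison is illustrative rather than a measurement.

```python
BLOCKS = 17                     # blocked attempts in this run
BLOCK_COST_SECONDS = (10, 30)   # stated cost per block, low and high
SCRAMBLE_HOURS = 4              # the rollback scramble described earlier

low_min = BLOCKS * BLOCK_COST_SECONDS[0] / 60
high_min = BLOCKS * BLOCK_COST_SECONDS[1] / 60
scramble_min = SCRAMBLE_HOURS * 60

print(f"All 17 blocks combined: {low_min:.1f} to {high_min:.1f} minutes")
print(f"One rollback scramble: {scramble_min} minutes")
print(f"Ratio at the high end: {scramble_min / high_min:.0f}x")
```

Even at 30 seconds per block, all seventeen blocks together cost 8.5 minutes of agent time, while a single one of the avoided incidents is put at 240 minutes.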
That's what StepProof is selling. Not a process; an enforcement layer. Not advice; a wall the agent has to produce real evidence to pass. Not a permanent ledger; a tamper-evident chain you can read, audit, and act on. A few seconds per deploy in exchange for shortcuts that never get to ship.