Bitcoin engineered proof into money so it can't be faked. StepProof engineers proof into AI instructions so agents can't lie about following them.
```
uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof
```

Ask an AI to follow a 10-step release checklist and it will tell you all 10 steps are done. Some of them won't be. The model was trained to complete the mission, and if step 7 is tedious, the fastest path to "done" is to skip it and say everything went fine. StepProof makes that impossible.
Before any step advances, the agent has to produce concrete proof — the commit SHA, the pytest output, the migration log, the PR URL. A separate checker reads real state and either allows the next step or denies it. The agent can't talk past a denial. And because every decision is chained onto a public, tamper-evident ledger — the same engineering Bitcoin used to make transactions unforgeable, applied here to AI steps — the agent can't rewrite history after the fact either. What you asked for is what actually happened.
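The ledger mechanics fit in a few lines. Here is a minimal sketch of a hash-chained, append-only log, assuming (as the feature list below describes) that each .stepproof/events.jsonl entry carries a SHA-256 over its contents plus the previous entry's hash; the function names are illustrative, not StepProof's API:

```python
import hashlib
import json
import pathlib

LOG = pathlib.Path(".stepproof/events.jsonl")

def append_event(event: dict) -> None:
    """Chain each event to its predecessor so a later edit breaks every following hash."""
    lines = LOG.read_text().splitlines() if LOG.exists() else []
    event["prev_hash"] = json.loads(lines[-1])["hash"] if lines else "genesis"
    body = json.dumps(event, sort_keys=True)           # hash covers contents + prev_hash
    event["hash"] = hashlib.sha256(body.encode()).hexdigest()
    LOG.parent.mkdir(parents=True, exist_ok=True)
    with LOG.open("a") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")

def verify_chain() -> bool:
    """Replay the whole file; any retroactive edit surfaces as a hash mismatch."""
    prev = "genesis"
    for line in LOG.read_text().splitlines():
        event = json.loads(line)
        claimed = event.pop("hash")
        recomputed = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
        if event["prev_hash"] != prev or recomputed != claimed:
            return False
        prev = claimed
    return True
```

Edit any past line and every hash after it stops matching. That is the entire tamper-evidence trick, and it is the property stepproof audit verify checks.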
Anthropic, OpenAI, and Perplexity all ship something they call agent governance. None of them ship this. Permission prompts are stateless nags — they interrupt on each sensitive call and forget. Audit logs are the model's own story — written by the thing they're supposed to audit. System-prompt trust isn't a contract, it's a wish. Tool scopes gate single calls, not sequences. From 30 feet these look sufficient; up close they aren't.
The shape StepProof fills — pre-declared plan, evidence at every step, independent verifier, tamper-evident log — is a specific thing the big labs haven't built, for a few reasons, none of them charity.
So we built it. StepProof runs with Claude Code, not instead of it — agents declare plans over MCP, a PreToolUse hook enforces sequence, the audit log lives in .stepproof/events.jsonl. No lab cooperation required. Full argument: THE_GAP.md.
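To make the enforcement point concrete: here is a minimal sketch of a PreToolUse gate, assuming Claude Code's documented hook contract (the pending tool call arrives as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the model). The plan-state file and gate logic are illustrative, not StepProof's actual hook:

```python
#!/usr/bin/env python3
"""Illustrative PreToolUse gate; a sketch, not StepProof's real implementation."""
import json
import pathlib
import sys

event = json.load(sys.stdin)                 # pending tool call from Claude Code
tool = event.get("tool_name", "")

state = pathlib.Path(".stepproof/current_step.json")   # hypothetical plan-state file
if not state.exists():
    sys.exit(0)                              # no active plan: allow everything

step = json.loads(state.read_text())
if step.get("allowed_tools") and tool not in step["allowed_tools"]:
    # Exit code 2 blocks the call; stderr is surfaced to the agent as the denial reason.
    print(f"step {step['step_id']}: tool '{tool}' not allowed by the declared plan",
          file=sys.stderr)
    sys.exit(2)
sys.exit(0)                                  # sequence respected: allow
```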
Autonomy and enforcement are in tension — and the tension has to be managed, not resolved. Too much autonomy, the agent shortcuts. Too much enforcement, you've built a workflow engine with extra steps: the agent becomes a function in someone else's flowchart, and you lose the judgment you hired it for. Both extremes cost you something.
The other way to stop an agent from lying is to code every step deterministically — n8n, Temporal, Airflow. Those work. But determinism alone kills the thing agents are actually good at: picking the right approach for this codebase, noticing that the failing test is brittle rather than a real defect, choosing between three plausible fixes. Lock those decisions down and you've paid agent prices for a switch statement.
StepProof is the middle dial. The agent keeps autonomy where autonomy matters. Only the load-bearing checkpoints — the ones where a wrong move costs you an incident — gate on evidence. "Migrate the schema" waits on a verifier reading real DB state. "Pick how to implement this feature" gates on nothing; the agent just works.
And the dial ratchets toward less autonomy as you learn. Start at Tier 0: log everything, block nothing. Watch for two weeks. When the audit log shows a real failure pattern — pre-deploy checks skipped, tests claimed but never run — add a Tier 1 verifier for that specific step. You never block a step you haven't seen go sideways. You never tighten the loop blindly. The data tells you which way to turn the dial — and the ratchet only turns one way, so trust compounds instead of eroding.
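In practice, "add a Tier 1 verifier for that specific step" can be a one-entry change to a runbook file. A sketch follows; the JSON layout is an assumption for illustration, not StepProof's documented runbook format (runbooks live in .stepproof/runbooks/, and TIERS.md describes the tiers):

```python
import json
import pathlib

# Hypothetical runbook: only the step the audit log showed going sideways gets a gate.
runbook = {
    "template_id": "rb-deploy",
    "steps": [
        {"step_id": "pre-deploy-tests",                  # Tier 1: gated on evidence
         "required_evidence": ["pytest_output_path", "min_passed"],
         "verification_method": "verify_pytest_passed"},
        {"step_id": "deploy"},                           # still ungated: agent judgment
    ],
}
path = pathlib.Path(".stepproof/runbooks/rb-deploy.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(runbook, indent=2) + "\n")
```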
An agent can't just "deploy." It has to show evidence that the runbook's gates are met. StepProof reads the evidence — PR approvals, test results, migration logs — and decides. No evidence, no advance. Every decision is written to a hash-chained log the reviewer can replay.
1. First attempt — gates unmet, StepProof denies
2. After addressing the gaps — StepProof allows
The denial and the allowance are both durable. The next person (or next tick, or next auditor) reads .stepproof/events.jsonl and replays exactly what happened — what the agent asked for, what evidence it had, how StepProof decided, and why.
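What "how StepProof decided" looks like mechanically is a verifier reading evidence against a check. Here is a minimal sketch in the style of verify_pytest_passed from the quickstart below; the evidence keys (pytest_output_path, min_passed) match that example, while the parsing logic is illustrative:

```python
import pathlib
import re

def verify_pytest_passed(evidence: dict) -> tuple[bool, str]:
    """Sketch: allow the step only if saved pytest output shows enough passing tests."""
    out = pathlib.Path(evidence["pytest_output_path"]).read_text()
    match = re.search(r"(\d+) passed", out)
    if match is None:
        return False, "no 'N passed' summary found in pytest output"
    if re.search(r"\d+ failed", out):
        return False, "pytest output reports failures"
    passed = int(match.group(1))
    if passed < int(evidence["min_passed"]):
        return False, f"only {passed} passed; the step requires {evidence['min_passed']}"
    return True, f"{passed} tests passed"
```

Either branch lands in the log as the decision plus its reason, which is exactly what the replay shows.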
Run stepproof metrics after two weeks and see how often enforcement actually bit.

- stepproof_keep_me_honest binds the session to a plan. Each step specifies allowed_tools, required_evidence, and a verification_method. The agent can't change its own constraints mid-run.
- .stepproof/events.jsonl carries a SHA-256 over its contents plus prev_hash. stepproof audit verify detects retroactive edits.
- stepproof metrics computes (deny + wedged_runs) / opportunities directly from the log. No vendor dashboard; your own data answers whether enforcement is catching real drift.
- .stepproof/runbooks/ holds pre-registered runbooks. Operators define the ceremony; agents run it by ID. Three tiers of adoption (see TIERS.md).

```
# inside the repo you want enforcement in — adds hooks + MCP registration
uvx stepproof install --scope project
# or register globally via Claude Code's MCP config
claude mcp add stepproof --scope user -- \
uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof
```

```
# 1. Agent-declared plan (Keep Me Honest mode)
mcp__stepproof__stepproof_keep_me_honest \
intent: "release v0.1.0" \
steps: [
{step_id: "s1", required_evidence: ["path", "min_lines"],
verification_method: "verify_file_exists"},
{step_id: "s2", required_evidence: ["pytest_output_path", "min_passed"],
verification_method: "verify_pytest_passed"},
{step_id: "s3", required_evidence: ["commit_sha"],
verification_method: "verify_git_commit"},
]
# 2. Or start a pre-registered runbook (Template mode)
mcp__stepproof__stepproof_run_start template_id: "rb-stepproof-release"
# 3. Submit evidence per step — verifier decides pass/fail
mcp__stepproof__stepproof_step_complete \
run_id: "<run_id>" step_id: "s3" \
evidence: {commit_sha: "1bfea1c..."}
```

Models disagree about whether unsupervised agents shortcut 8% or 40% of the time — that gap drives a 30× spread in projected ROI. Don't argue about it. Run StepProof for two to three weeks, then ask your own audit log where you land:

```
# cryptographic integrity — detect retroactive edits to events.jsonl
stepproof audit verify
# empirical off-rails rate (deny + wedged runs / enforcement opportunities)
stepproof metrics --days 14
stepproof metrics --json    # scriptable output
```

The number is yours, not ours. The quintile it lands in tells you whether StepProof is net overhead or governance you were missing. HONEST_LIMITS.md names the failure modes the design doesn't close, so you know what you're still on the hook for.
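If you would rather compute the number straight from the log than trust the CLI, the metric is a few lines. A minimal sketch, assuming one JSON event per line; the type and decision field names here are assumptions for illustration, not StepProof's actual event schema:

```python
import json
import pathlib

def off_rails_rate(log: str = ".stepproof/events.jsonl") -> float:
    """Sketch of the metric: (denies + wedged runs) / enforcement opportunities."""
    denies = wedged = opportunities = 0
    for line in pathlib.Path(log).read_text().splitlines():
        event = json.loads(line)
        if event.get("type") == "step_decision":    # assumed field names
            opportunities += 1
            if event.get("decision") == "deny":
                denies += 1
        elif event.get("type") == "run_wedged":
            wedged += 1
    return (denies + wedged) / opportunities if opportunities else 0.0
```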
Built for DevOps runbooks — migrations, deploys, incident response, rollbacks. The primitive generalizes: durable workflow + bounded action permissions + evidence-based verification + audit trail. Same shape applies to security (access changes, secret rotation), data (backfills, schema promotions), regulated operations (financial reconciliations, healthcare workflows), and agent-platform governance (Claude Code, Cursor, OpenAI Agents as a shared enforcement layer).
Designed to produce the artifacts regulators will ask for under the EU AI Act (effective Aug 2026), the Colorado AI Act (Jun 2026), and the OWASP Agentic AI Top 10 (Dec 2025). StepProof itself doesn't encode any specific regulation; the runbook author does.
StepProof is step one in a trinity. Loops is step two; Lighthouse is step three. Three roles, one-way flow — no loop can rewrite itself without evidence promoted by Lighthouse.
| Role | Question it answers | Artifact |
| --- | --- | --- |
| StepProof | What happened — auditable ground truth | `.stepproof/events.jsonl` |
| Loops | Are we trending right — cadence + delta proposal | `loops/<name>/history/*.md` |
| Lighthouse | Is this pattern real enough to promote | hypothesis → trial → keep/kill |

Read more: THE_GAP · HONEST_LIMITS · TIERS · KEEP_ME_HONEST