stepproof
An MCP server from eidos agi

Bitcoin engineered proof into money so it can't be faked. StepProof engineers proof into AI instructions so agents can't lie about following them.

uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof

What it does

Ask an AI to follow a 10-step release checklist and it will tell you all 10 steps are done. Some of them won't be. The model was trained to complete the mission, and if step 7 is tedious the fastest path to "done" is to skip it and say everything went fine. StepProof makes that impossible.

Before any step advances, the agent has to produce concrete proof — the commit SHA, the pytest output, the migration log, the PR URL. A separate checker reads real state and either allows the next step or denies it. The agent can't talk past a denial. And because every decision is chained onto a public, tamper-evident ledger — the same engineering Bitcoin used to make transactions unforgeable, applied here to AI steps — the agent can't rewrite history after the fact either. What you asked for is what actually happened.
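The tamper-evidence comes from the same primitive Bitcoin uses: each log entry's hash covers the previous entry's hash, so rewriting any entry breaks every later link. A minimal sketch in Python — field names are illustrative, not StepProof's actual events.jsonl schema:

```python
import hashlib, json

GENESIS = "0" * 64

def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash,
    so a retroactive edit breaks every later link in the chain."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"prev": prev, **event}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log):
    """Recompute every hash from scratch; an edit anywhere returns False."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev"] != prev or expected != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_event(log, {"step": "deploy", "decision": "DENY"})
append_event(log, {"step": "deploy", "decision": "ALLOW"})
assert verify_chain(log)        # untampered chain replays clean
log[0]["decision"] = "ALLOW"    # rewrite history...
assert not verify_chain(log)    # ...and the chain exposes it
```

This is the property that lets a reviewer replay .stepproof/events.jsonl and trust what they read.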

Why this exists when every lab already has "governance"

Anthropic, OpenAI, and Perplexity all ship something they call agent governance. None of them ship this. Permission prompts are stateless nags — they interrupt on each sensitive call and forget. Audit logs are the model's own story — written by the thing they're supposed to audit. System-prompt trust isn't a contract, it's a wish. Tool scopes gate single calls, not sequences. From 30 feet these look sufficient; up close they aren't.

The shape StepProof fills — pre-declared plan, evidence at every step, independent verifier, tamper-evident log — is a specific thing the big labs haven't built, for a few structural reasons, none of them charity.

So we built it. StepProof runs with Claude Code, not instead of it — agents declare plans over MCP, a PreToolUse hook enforces sequence, the audit log lives in .stepproof/events.jsonl. No lab cooperation required. Full argument: THE_GAP.md.

The autonomy dial

Autonomy and enforcement are in tension — and the tension has to be managed, not resolved. Too much autonomy, the agent shortcuts. Too much enforcement, you've built a workflow engine with extra steps: the agent becomes a function in someone else's flowchart, and you lose the judgment you hired it for. Both extremes cost you something.

The other way to stop an agent from lying is to code every step deterministically — n8n, Temporal, Airflow. Those work. But determinism alone kills the thing agents are actually good at: picking the right approach for this codebase, noticing that the failing test is brittle rather than a real defect, choosing between three plausible fixes. Lock those decisions down and you've paid agent prices for a switch statement.

StepProof is the middle dial. The agent keeps autonomy where autonomy matters. Only the load-bearing checkpoints — the ones where a wrong move costs you an incident — gate on evidence. "Migrate the schema" waits on a verifier reading real DB state. "Pick how to implement this feature" gates on nothing; the agent just works.

And the dial ratchets toward less autonomy as you learn. Start at Tier 0: log everything, block nothing. Watch for two weeks. When the audit log shows a real failure pattern — pre-deploys skipped, tests claimed but never run — add a Tier 1 verifier for that specific step. You never block a step you haven't seen go sideways. You never tighten the loop blindly. The data tells you which way to turn the dial — and the ratchet only turns one way, so trust compounds instead of erodes.

TIER 0 → log, don't block
TIER 1 → verifier-gated
TIER 2 → step + human signoff
one-way · trust compounds
Start at Tier 0: log everything, block nothing. When the audit log shows a specific step drifting, add a Tier 1 verifier for that step. The dial never turns back — you don't loosen without evidence, only tighten.
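The ratchet rule itself is tiny. A sketch — the tier names follow the dial above, but the escalation threshold is an assumed knob, not anything StepProof exposes:

```python
# Tier meanings from the dial above; the escalation rule is a sketch,
# not StepProof's actual policy engine.
TIERS = {0: "log, don't block", 1: "verifier-gated", 2: "step + human signoff"}

def next_tier(current: int, observed_failures: int, threshold: int = 1) -> int:
    """Tighten only on observed evidence; never loosen. One-way ratchet."""
    if observed_failures >= threshold and current < max(TIERS):
        return current + 1
    return current  # no evidence of drift: the dial does not move back

assert next_tier(0, observed_failures=0) == 0  # two clean weeks: stay at Tier 0
assert next_tier(0, observed_failures=3) == 1  # real drift seen: add a verifier
assert next_tier(2, observed_failures=9) == 2  # already at the top of the dial
```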

How it works

An agent can't just "deploy." It has to show evidence that the runbook's gates are met. StepProof reads the evidence — PR approvals, test results, migration logs — and decides. No evidence, no advance. Every decision is appended to a hash-chained log the reviewer can replay.
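The decision is a pure function of gates and submitted evidence. A hypothetical checker in that spirit — the gate names and evidence keys are invented for illustration, mirroring the deploy walkthrough below:

```python
# Hypothetical gate table for a deploy step; names and evidence keys
# are illustrative, not StepProof's real configuration.
GATES = {
    "pr_approved":     lambda e: e.get("pr") is not None,
    "tests_green":     lambda e: e.get("tests") == "pass",
    "migrations_real": lambda e: e.get("migrations") == "applied",
}

def decide(gates, evidence):
    """All gates must pass; any miss is a DENY the agent cannot talk past."""
    unmet = [name for name, check in gates.items() if not check(evidence)]
    return ("DENY", unmet) if unmet else ("ALLOW", [])

# First attempt: dry-run migrations, no PR reference
assert decide(GATES, {"tests": "pass", "migrations": "dry-run"}) \
    == ("DENY", ["pr_approved", "migrations_real"])
# Retry with real evidence
assert decide(GATES, {"pr": 128, "tests": "pass", "migrations": "applied"}) \
    == ("ALLOW", [])
```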

1. First attempt — gates unmet, stepproof denies

agent-7 · deploy
# agent-7 requests step "deploy" under runbook rb-deploy-prod
stepproof step-complete deploy --evidence tests=pass --evidence migrations=dry-run

stepproof checking step deploy against rb-deploy-prod...
  PR approved     → no PR reference provided
  tests green     → 3/3 suites pass
  migrations      → dry-run only
DENY deploy — 2 of 3 evidence gates unmet
audit: .stepproof/events.jsonl (hash-chained, entry #246)

agent: halted. Collect approvals, run migrations for real, retry.

2. After addressing the gaps — stepproof allows

agent-7 · deploy (retry)
# agent re-submits after opening PR #128 and applying migrations
stepproof step-complete deploy --evidence pr=128 --evidence tests=pass --evidence migrations=applied

stepproof checking step deploy against rb-deploy-prod...
  PR approved     → #128 · 2 reviewers
  tests green     → 3/3 suites pass
  migrations      → applied 0042..0044 (12.4s)
ALLOW deploy — all evidence verified
audit: entry #247, sha 3a91c20 → links back to #246

agent: proceeding with deploy.

The denial and the allowance are both durable. The next person (or next tick, or next auditor) reads .stepproof/events.jsonl and replays exactly what happened — what the agent asked for, what evidence it had, how stepproof decided, and why.

When to reach for what

Reach for these instead when…
  • Permission prompts — a single tool call needs human sign-off. One gate, no sequence, no audit trail a reviewer can replay.
  • System-prompt trust — a chat assistant where the cost of the agent freelancing is low and interruption is worse than drift.
  • Tool scopes — the blast radius of the credential is the right boundary. Read-only keys, narrow MCP servers.
  • Workflow engines (n8n, Temporal) — the author writes every step in code. The agent's role is a function call, not a decision-maker.
Reach for StepProof when…
  • Sequence matters — the agent must run migration, then tests, then deploy — not any subset in any order.
  • Evidence must be real — "tests passed" has to mean a pytest log with 160 passing, not a string the agent typed.
  • The audit trail will be replayed — by an auditor, a postmortem, a regulator. Tamper-evident matters.
  • The ROI is measurable — you want to run stepproof metrics after two weeks and see how often enforcement actually bit.


Install

install-stepproof.sh
# inside the repo you want enforcement in — adds hooks + MCP registration
uvx stepproof install --scope project

# or register globally via Claude Code's MCP config
claude mcp add stepproof --scope user -- \
  uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof

Usage

session.md
# 1. Agent-declared plan (Keep Me Honest mode)
mcp__stepproof__stepproof_keep_me_honest \
  intent: "release v0.1.0" \
  steps: [
    {step_id: "s1", required_evidence: ["path", "min_lines"],
     verification_method: "verify_file_exists"},
    {step_id: "s2", required_evidence: ["pytest_output_path", "min_passed"],
     verification_method: "verify_pytest_passed"},
    {step_id: "s3", required_evidence: ["commit_sha"],
     verification_method: "verify_git_commit"},
  ]

# 2. Or start a pre-registered runbook (Template mode)
mcp__stepproof__stepproof_run_start template_id: "rb-stepproof-release"

# 3. Submit evidence per step — verifier decides pass/fail
mcp__stepproof__stepproof_step_complete \
  run_id: "<run_id>" step_id: "s3" \
  evidence: {commit_sha: "1bfea1c..."}

Measure it from your own log

Models disagree about whether unsupervised agents shortcut 8% or 40% of the time — that gap drives a 30× spread in projected ROI. Don't argue about it. Run StepProof for two to three weeks, then ask your own audit log where you land:

  • < 5% (Q1–Q2) · StepProof mostly dormant
  • 15%+ (Q3–Q4) · catching real drift
  • 30%+ (Q5) · high-stakes, mandatory
measure-off-rails.sh
# cryptographic integrity — detect retroactive edits to events.jsonl
stepproof audit verify

# empirical off-rails rate (deny + wedged runs / enforcement opportunities)
stepproof metrics --days 14
stepproof metrics --json  # scriptable output

The number is yours, not ours. The quintile tells you whether StepProof is overhead-positive or governance you were missing. HONEST_LIMITS.md names the failure modes the design doesn't close, so you know what you're still on the hook for.
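The off-rails rate is simple arithmetic over the log. A sketch under an assumed minimal event shape (not StepProof's real events.jsonl schema):

```python
def off_rails_rate(events):
    """(denies + wedged runs) / enforcement opportunities.
    The event shape here is an assumption for illustration."""
    opportunities = [e for e in events if e.get("type") == "step_complete"]
    if not opportunities:
        return 0.0
    off = sum(1 for e in opportunities if e["decision"] in ("DENY", "WEDGED"))
    return off / len(opportunities)

# Four gated attempts, one denial: a 25% rate lands in the 15%+ band.
events = [
    {"type": "step_complete", "decision": "ALLOW"},
    {"type": "step_complete", "decision": "DENY"},
    {"type": "step_complete", "decision": "ALLOW"},
    {"type": "step_complete", "decision": "ALLOW"},
]
assert off_rails_rate(events) == 0.25
```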

Where this applies

Built for DevOps runbooks — migrations, deploys, incident response, rollbacks. The primitive generalizes: durable workflow + bounded action permissions + evidence-based verification + audit trail. Same shape applies to security (access changes, secret rotation), data (backfills, schema promotions), regulated operations (financial reconciliations, healthcare workflows), and agent-platform governance (Claude Code, Cursor, OpenAI Agents as a shared enforcement layer).

Designed to produce the artifacts regulators will ask for under the EU AI Act (effective Aug 2026), the Colorado AI Act (Jun 2026), and the OWASP Agentic AI Top 10 (Dec 2025). StepProof itself doesn't encode any specific regulation; the runbook author does.

What's next

StepProof is step one in a trinity. Loops is step two; Lighthouse is step three. Three roles, one-way flow — no loop can rewrite itself without evidence promoted by Lighthouse.

  • StepProof: What happened — auditable ground truth. (.stepproof/events.jsonl)
  • Loops: Are we trending right — cadence + delta proposal. (loops/<name>/history/*.md)
  • Lighthouse: Is this pattern real enough to promote. (hypothesis → trial → keep/kill)

Read more: THE_GAP · HONEST_LIMITS · TIERS · KEEP_ME_HONEST