stepproof
An MCP server from eidos agi

Bitcoin engineered proof into money so it can't be faked. StepProof engineers proof into AI instructions so agents can't lie about following them.

uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof

What it does

Ask an AI to follow a 10-step release checklist and it will tell you all 10 steps are done. Some of them won't be. The model was trained to complete the mission, and if step 7 is tedious the fastest path to "done" is to skip it and say everything went fine. StepProof makes that impossible.

Before any step advances, the agent has to produce concrete proof — the commit SHA, the pytest output, the migration log, the PR URL. A separate checker reads real state and either allows the next step or denies it. The agent can't talk past a denial. And because every decision is chained onto a public, tamper-evident ledger — the same engineering Bitcoin used to make transactions unforgeable, applied here to AI steps — the agent can't rewrite history after the fact either. What you asked for is what actually happened.
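The tamper-evidence comes from the same primitive Bitcoin uses: each log entry's hash covers the previous entry's hash, so rewriting any entry breaks every later link. A minimal sketch in Python — field names are illustrative, not StepProof's actual events.jsonl schema:

```python
import hashlib, json

GENESIS = "0" * 64

def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash,
    so a retroactive edit breaks every later link in the chain."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"prev": prev, **event}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log):
    """Recompute every hash from scratch; an edit anywhere returns False."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if body["prev"] != prev or expected != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_event(log, {"step": "deploy", "decision": "DENY"})
append_event(log, {"step": "deploy", "decision": "ALLOW"})
assert verify_chain(log)        # untampered chain replays clean
log[0]["decision"] = "ALLOW"    # rewrite history...
assert not verify_chain(log)    # ...and the chain exposes it
```

This is the property that lets a reviewer replay .stepproof/events.jsonl and trust what they read.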

Why this exists when every lab already has "governance"

Anthropic, OpenAI, and Perplexity all ship something they call agent governance. None of them ship this. Permission prompts are stateless nags — they interrupt on each sensitive call and forget. Audit logs are the model's own story — written by the thing they're supposed to audit. System-prompt trust isn't a contract, it's a wish. Tool scopes gate single calls, not sequences. From 30 feet these look sufficient; up close they aren't.

The shape StepProof fills — pre-declared plan, evidence at every step, independent verifier, tamper-evident log — is a specific thing the big labs haven't built, for a few structural reasons, none of them charity.

So we built it. StepProof runs with Claude Code, not instead of it — agents declare plans over MCP, a PreToolUse hook enforces sequence, the audit log lives in .stepproof/events.jsonl. No lab cooperation required. Full argument: THE_GAP.md.

The autonomy dial

Autonomy and enforcement are in tension — and the tension has to be managed, not resolved. Too much autonomy, the agent shortcuts. Too much enforcement, you've built a workflow engine with extra steps: the agent becomes a function in someone else's flowchart, and you lose the judgment you hired it for. Both extremes cost you something.

The other way to stop an agent from lying is to code every step deterministically — n8n, Temporal, Airflow. Those work. But determinism alone kills the thing agents are actually good at: picking the right approach for this codebase, noticing that the failing test is brittle rather than a real defect, choosing between three plausible fixes. Lock those decisions down and you've paid agent prices for a switch statement.

StepProof is the middle dial. The agent keeps autonomy where autonomy matters. Only the load-bearing checkpoints — the ones where a wrong move costs you an incident — gate on evidence. "Migrate the schema" waits on a verifier reading real DB state. "Pick how to implement this feature" gates on nothing; the agent just works.

And the dial ratchets toward less autonomy as you learn. Start at Tier 0: log everything, block nothing. Watch for two weeks. When the audit log shows a real failure pattern — pre-deploys skipped, tests claimed but never run — add a Tier 1 verifier for that specific step. You never block a step you haven't seen go sideways. You never tighten the loop blindly. The data tells you which way to turn the dial — and the ratchet only turns one way, so trust compounds instead of erodes.

TIER 0 → log, don't block
TIER 1 → verifier-gated
TIER 2 → step + human signoff
one-way · trust compounds
Start at Tier 0: log everything, block nothing. When the audit log shows a specific step drifting, add a Tier 1 verifier for that step. The dial never turns back — you don't loosen without evidence, only tighten.
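The ratchet rule itself is tiny. A sketch — the tier names follow the dial above, but the escalation threshold is an assumed knob, not anything StepProof exposes:

```python
# Tier meanings from the dial above; the escalation rule is a sketch,
# not StepProof's actual policy engine.
TIERS = {0: "log, don't block", 1: "verifier-gated", 2: "step + human signoff"}

def next_tier(current: int, observed_failures: int, threshold: int = 1) -> int:
    """Tighten only on observed evidence; never loosen. One-way ratchet."""
    if observed_failures >= threshold and current < max(TIERS):
        return current + 1
    return current  # no evidence of drift: the dial does not move back

assert next_tier(0, observed_failures=0) == 0  # two clean weeks: stay at Tier 0
assert next_tier(0, observed_failures=3) == 1  # real drift seen: add a verifier
assert next_tier(2, observed_failures=9) == 2  # already at the top of the dial
```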

How it works

An agent can't just "deploy." It has to show evidence that the runbook's gates are met. StepProof reads the evidence — PR approvals, test results, migration logs — and decides. No evidence, no advance. Every decision is appended to a hash-chained log the reviewer can replay.
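The decision is a pure function of gates and submitted evidence. A hypothetical checker in that spirit — the gate names and evidence keys are invented for illustration, mirroring the deploy walkthrough below:

```python
# Hypothetical gate table for a deploy step; names and evidence keys
# are illustrative, not StepProof's real configuration.
GATES = {
    "pr_approved":     lambda e: e.get("pr") is not None,
    "tests_green":     lambda e: e.get("tests") == "pass",
    "migrations_real": lambda e: e.get("migrations") == "applied",
}

def decide(gates, evidence):
    """All gates must pass; any miss is a DENY the agent cannot talk past."""
    unmet = [name for name, check in gates.items() if not check(evidence)]
    return ("DENY", unmet) if unmet else ("ALLOW", [])

# First attempt: dry-run migrations, no PR reference
assert decide(GATES, {"tests": "pass", "migrations": "dry-run"}) \
    == ("DENY", ["pr_approved", "migrations_real"])
# Retry with real evidence
assert decide(GATES, {"pr": 128, "tests": "pass", "migrations": "applied"}) \
    == ("ALLOW", [])
```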

1. First attempt — gates unmet, stepproof denies

agent-7 · deploy
# agent-7 requests step "deploy" under runbook rb-deploy-prod
stepproof step-complete deploy --evidence tests=pass --evidence migrations=dry-run

stepproof checking step deploy against rb-deploy-prod...
  PR approved     → no PR reference provided
  tests green     → 3/3 suites pass
  migrations      → dry-run only
DENY deploy — 2 of 3 evidence gates unmet
audit: .stepproof/events.jsonl (hash-chained, entry #246)

agent: halted. Collect approvals, run migrations for real, retry.

2. After addressing the gaps — stepproof allows

agent-7 · deploy (retry)
# agent re-submits after opening PR #128 and applying migrations
stepproof step-complete deploy --evidence pr=128 --evidence tests=pass --evidence migrations=applied

stepproof checking step deploy against rb-deploy-prod...
  PR approved     → #128 · 2 reviewers
  tests green     → 3/3 suites pass
  migrations      → applied 0042..0044 (12.4s)
ALLOW deploy — all evidence verified
audit: entry #247, sha 3a91c20 → links back to #246

agent: proceeding with deploy.

The denial and the allowance are both durable. The next person (or next tick, or next auditor) reads .stepproof/events.jsonl and replays exactly what happened — what the agent asked for, what evidence it had, how stepproof decided, and why.

When to reach for what

Reach for these instead when…
  • Permission prompts — a single tool call needs human sign-off. One gate, no sequence, no audit trail a reviewer can replay.
  • System-prompt trust — a chat assistant where the cost of the agent freelancing is low and interruption is worse than drift.
  • Tool scopes — the blast radius of the credential is the right boundary. Read-only keys, narrow MCP servers.
  • Workflow engines (n8n, Temporal) — the author writes every step in code. The agent's role is a function call, not a decision-maker.
Reach for StepProof when…
  • Sequence matters — the agent must run migration, then tests, then deploy — not any subset in any order.
  • Evidence must be real — "tests passed" has to mean a pytest log with 160 passing, not a string the agent typed.
  • The audit trail will be replayed — by an auditor, a postmortem, a regulator. Tamper-evident matters.
  • The ROI is measurable — you want to run stepproof metrics after two weeks and see how often enforcement actually bit.


Install

install-stepproof.sh
# inside the repo you want enforcement in — adds hooks + MCP registration
uvx stepproof install --scope project

# or register globally via Claude Code's MCP config
claude mcp add stepproof --scope user -- \
  uvx --from git+https://github.com/eidos-agi/stepproof.git stepproof

Usage

session.md
# 1. Agent-declared plan (Keep Me Honest mode)
mcp__stepproof__stepproof_keep_me_honest \
  intent: "release v0.1.0" \
  steps: [
    {step_id: "s1", required_evidence: ["path", "min_lines"],
     verification_method: "verify_file_exists"},
    {step_id: "s2", required_evidence: ["pytest_output_path", "min_passed"],
     verification_method: "verify_pytest_passed"},
    {step_id: "s3", required_evidence: ["commit_sha"],
     verification_method: "verify_git_commit"},
  ]

# 2. Or start a pre-registered runbook (Template mode)
mcp__stepproof__stepproof_run_start template_id: "rb-stepproof-release"

# 3. Submit evidence per step — verifier decides pass/fail
mcp__stepproof__stepproof_step_complete \
  run_id: "<run_id>" step_id: "s3" \
  evidence: {commit_sha: "1bfea1c..."}

Measure it from your own log

Models disagree about whether unsupervised agents shortcut 8% or 40% of the time — that gap drives a 30× spread in projected ROI. Don't argue about it. Run StepProof for two to three weeks, then ask your own audit log where you land:

  • < 5% (Q1–Q2) · StepProof mostly dormant
  • 15%+ (Q3–Q4) · catching real drift
  • 30%+ (Q5) · high-stakes, mandatory
measure-off-rails.sh
# cryptographic integrity — detect retroactive edits to events.jsonl
stepproof audit verify

# empirical off-rails rate (deny + wedged runs / enforcement opportunities)
stepproof metrics --days 14
stepproof metrics --json  # scriptable output

The number is yours, not ours. The quintile tells you whether StepProof is overhead-positive or governance you were missing. HONEST_LIMITS.md names the failure modes the design doesn't close, so you know what you're still on the hook for.
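The off-rails rate is simple arithmetic over the log. A sketch under an assumed minimal event shape (not StepProof's real events.jsonl schema):

```python
def off_rails_rate(events):
    """(denies + wedged runs) / enforcement opportunities.
    The event shape here is an assumption for illustration."""
    opportunities = [e for e in events if e.get("type") == "step_complete"]
    if not opportunities:
        return 0.0
    off = sum(1 for e in opportunities if e["decision"] in ("DENY", "WEDGED"))
    return off / len(opportunities)

# Four gated attempts, one denial: a 25% rate lands in the 15%+ band.
events = [
    {"type": "step_complete", "decision": "ALLOW"},
    {"type": "step_complete", "decision": "DENY"},
    {"type": "step_complete", "decision": "ALLOW"},
    {"type": "step_complete", "decision": "ALLOW"},
]
assert off_rails_rate(events) == 0.25
```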

Where this applies

Built for DevOps runbooks — migrations, deploys, incident response, rollbacks. The primitive generalizes: durable workflow + bounded action permissions + evidence-based verification + audit trail. Same shape applies to security (access changes, secret rotation), data (backfills, schema promotions), regulated operations (financial reconciliations, healthcare workflows), and agent-platform governance (Claude Code, Cursor, OpenAI Agents as a shared enforcement layer).

Designed to produce the artifacts regulators will ask for under the EU AI Act (effective Aug 2026), the Colorado AI Act (Jun 2026), and the OWASP Agentic AI Top 10 (Dec 2025). StepProof itself doesn't encode any specific regulation; the runbook author does.

What's next

StepProof is step one in a trinity. Loops is step two; Lighthouse is step three. Three roles, one-way flow — no loop can rewrite itself without evidence promoted by Lighthouse.

  • StepProof: What happened — auditable ground truth. (.stepproof/events.jsonl)
  • Loops: Are we trending right — cadence + delta proposal. (loops/<name>/history/*.md)
  • Lighthouse: Is this pattern real enough to promote. (hypothesis → trial → keep/kill)

Read more: THE_GAP · HONEST_LIMITS · TIERS · KEEP_ME_HONEST