E EidosAGI

Proprioception
in AI agents

An outside-in steering tool catches what an in-flow agent cannot see about its own work.

Claude Code + cept + dbt + Postgres

A consultant's agent built a data layer for a regional services company across three slices. Each slice was defensible. Inside those slices the agent silently made five architectural decisions, each individually reasonable, each expensive to undo. An outside-in consultation tool — reading the recent transcript and the actual code — surfaced the one decision the agent had decided to defer, and explained why deferring it was the wrong call.

The problem

Proprioception, in humans, is the body's sense of where its parts are without looking. You know your hand is behind your back without checking. You catch a stumble before you process the trip. The signal does not require attention; it is constant background information about your own configuration.

AI agents working through a multi-step engineering problem do not have this. Each tool call, each commit, each schema choice looks defensible in the local frame. The agent is good — often very good — at making the next step coherent with the previous one. The agent is bad at stepping back to ask what am I committing the system to long-term, and what's expensive to undo?

This is not a knowledge gap. The agent often knows the right architectural questions in the abstract. The gap is cognitive mode. Stepwise execution and bird's-eye review are different shapes of work. The in-flow agent has a clear goal and is converging on it; pausing to ask "but should this exist?" is an interrupt the goal-directed mode resists.

The case below is a clean instance. A consultant was building a data layer for a regional services company — bronze raw landings already in place, the next layer being a dbt-driven silver and gold buildout, with around 100 financial measures targeted long-term and around 20 currently ratified. Work was going well. Commits were defensible. And the agent was committing the project to architectural choices nobody had agreed to.

The approach

The agent shipped three slices over a session:

Slice 6.1 — Port mapping seeds (department reassignments, known exclusions)Slice 6.2 — Build the keystone silver fact joining bronze entries with all dimensionsSlice 6.3 — Build the gold monthly measures table — one file, ~19 CTEs unioned

Inside those three slices the agent quietly made five "decide once, expensive to undo" architectural decisions:

silent-architectural-decisions.md
  • Custom generate_schema_name.sql macro. Overrode dbt's default schema-naming convention to land models in the desired warehouse schemas. Reasonable. Also: a sitewide override that every future model on this project will inherit.
  • Monolithic gold table. One file containing ~19 CTEs unioned together. Reasonable for the first 19. Likely the wrong shape for the 80 measures still to come.
  • Long-format output. One row per measure × period × entity, rather than one column per measure. Reasonable; matches modern semantic-layer practice. But it determines the shape every downstream consumer has to learn.
  • Single keystone silver fact. One fct_gl_entry joining bronze entries with all dimensions, rather than separate fact tables per business category. Reasonable; reduces duplication. Also: every gold model now sits on top of one source of gravity.
  • Exclusion handling as a column on the silver fact. The "is this row excluded from a given measure" logic encoded as boolean columns on the silver row, rather than as a separate exclusions table. Reasonable for the first measure. Becomes increasingly wrong as exclusion rules diverge across measures.

None of these were called out in commit messages. None were surfaced for stakeholder review. Each was the kind of choice an experienced data engineer would have flagged as a "let's talk about this first" decision, and the agent — coherent, productive, on goal — did not flag any of them.

The intervention was a single instruction from the human stakeholder:

Check in with me on major decisions that have long-term architectural consequences.

The agent stopped, listed the five decisions, and classified them: two were easy (the macro and the keystone fact were defensible and reversible enough), three were tough (per-measure split, long vs wide, exclusion handling). Then the human asked the agent to consult an outside-in tool: cept.

cept is an Eidos AGI consultation tool. It reads the active session transcript, distills recent events and code diffs, attaches relevant source files, and asks an outside model — with web search enabled — to rank guidance against the questions the agent posed. The asymmetry is the whole point: the in-flow agent has no easy way to see itself; the outside model has nothing but the bird's-eye view.

the cept invocation pattern
cept \
  --attach silver/fct_gl_entry.sql \
  --attach gold/measures_sage_monthly.sql \
  --attach seeds/exclusions.csv \
  --question "Are there standard dbt patterns I'm violating?" \
  --question "Does long-format match modern semantic-layer practice?" \
  --question "Is per-measure model explosion actually harder than \
              monolithic, or is that an old worry?" \
  --question "Should the exclusion logic move out of fct_gl_entry \
              into its own table now, or can it wait?"

# cept reads the active session JSONL, distills the last N events
# and diffs, attaches the named files, formulates the consultation
# packet, and asks an OpenRouter model with web search.
# Returns ranked, confidence-scored guidance — not a chat.

The evidence

The consultation came back with high-confidence guidance on all three tough decisions. Two confirmed the agent's instinct. One contradicted it.

Per-measure split. Defer. The monolithic file is fine for now; refactor when it gets unwieldy.

Long vs wide format. Long-format. Aligns with how downstream BI tools want to consume metrics.

Exclusion handling. Defer. Boolean columns on the silver fact are simple. Refactor to a table when consumers actually exist.

Split now (0.85 confidence). The pattern is well-established; the cost grows superlinearly with measure count. Don't carry the monolithic shape into the rest of the buildout.

Long-format is correct (0.95 confidence). Matches the dbt Semantic Layer's expected shape; reduces friction with MetricFlow if that becomes the eventual home.

Refactor exclusions to a table NOW (0.80 confidence). Encoding exclusions as boolean columns on a single fact table forces a 1:1 between rows and measures. The next 10 measures will hit a 1:N pattern and force a rewrite of the rows the tests have already been written against. 30 minutes now versus 3–4 hours later, with consumers in flight.

The exclusion-handling reversal is the headline catch. The agent's local-frame instinct was the universal engineer's instinct — "simple now, refactor when needed." The outside-in view saw what the local frame could not: that "when needed" had a specific cost curve attached, and the curve was much steeper than the agent estimated.

Why the cost curve matters

Refactoring exclusion logic before tests and consumers exist costs roughly 30 minutes — change the silver model, run the build, confirm row counts. After tests and consumers exist, the same refactor has to update the silver model, rewrite every test that assumed the row shape, update every gold measure that read the boolean columns, and reconcile the historical reconciliation outputs that used the old shape.

The agent's "defer" call was reasonable in the abstract. It was wrong in the specific because the agent could not see the steepness of the curve from inside the slice it was building.

The consultation also surfaced four blind spots the agent never raised on its own:

dbt MetricFlow / Semantic Layer as the long-term home for the metrics layerPostgres CTE planner thrash on a 19-way (eventually 100-way) UNIONDagster asset grouping for parallel gold builds once splitdbt-expectations for richer data tests beyond stock dbt assertions

None of these would have been catastrophic to miss. All four are the kind of architectural alternative a senior data engineer would mention in a hallway conversation. The agent, having a working solution that satisfied the immediate goal, did not naturally produce them. The outside-in consultation produced all four in a single round.

What the cost ledger looked like

One cept invocation. Roughly 30 seconds of agent time. One OpenRouter API call. The agent then made the exclusion-table refactor in about 30 minutes, before any downstream tests had been written.

The exclusion-table refactor would have happened anyway, around the time the 10th measure hit the 1:N pattern. By then, several hours of tests and reconciliation outputs would have been written against the old shape. Estimated rework: 3–4 hours, distributed across whichever future session noticed the constraint, with the additional cost that the rework would surface as a "fix the failing tests" task rather than a "ship the next measure" task.

The ratio is not the point — every "fix it now" story has a favorable ratio in retrospect. The point is that without the consultation the choice was never made consciously. The deferral was implicit. There was no decision log entry that said "we considered making exclusions a separate table and chose not to." There was just code, written in a defensible local frame, that happened to lock in the wrong long-term shape.

The result

The agent shipped the exclusion-table refactor in the same session, before committing tests against the old shape. The decision to split the monolithic gold table was logged for the next session. The long-format choice was confirmed and recorded as a deliberate decision rather than an inherited default. And the four blind spots were captured in the project's architectural notes for future review.

More durably, the agent saved a feedback memory: before silently committing to architectural decisions, surface them. The failure mode that produced five undisclosed decisions in three slices was named, recorded, and routed back into the agent's behavior. The next session opens with the lesson loaded.

What this argues

Four claims, in order of confidence.

One: AI agents have measurable proprioception failure modes. Working stepwise toward a clear goal, they make defensible-in-isolation choices that compound into "decide once, expensive to undo" architectural commits. They do not naturally step back to ask what they are committing the system to. This is not a model-quality issue; it is a cognitive-mode issue. The execution mode and the bird's-eye-review mode are different shapes of work, and the in-flow execution mode resists the interrupt.

Two: the proprioception gap is real value left on the table. Without intervention in this case, the five decisions would have shipped silently. The exclusion-handling choice would likely have hit its 1:N constraint within the next 10 measures and forced 3–4 hours of rework with consumers already attached. The consultation cost roughly 30 seconds of agent time and one API call. The asymmetry is large enough to make the consultation effectively free relative to its expected return.

Three: outside-in steering tools fill the gap. The mechanism that works is not "ask the agent to think harder." It is to give a different model — one that has not been doing the work — the transcript, the diff, the attached files, and a sharp question. The outside model has nothing but the bird's-eye view. The agent has nothing but the local view. The combination is what humans get from a senior reviewer poking their head into the room every few hours; cept is the agent-shaped version of that.

Four: the pattern generalizes. Any multi-step engineering session — schema design, API contract design, refactoring sequences, infrastructure choices — has the same proprioception failure mode. The dbt example is convenient because the architectural decisions are crisp and the cost curves are well-understood, but nothing about the failure mode is dbt-specific. Outside-in steering tools should be considered standard tooling for autonomous agents working on production systems, not an optional polish.

The deeper claim is that an agent without proprioception is operating with a known sensory gap. Treating that gap as a tooling problem — solvable by the right outside-in primitive — is more honest than waiting for the next model release to grow the missing sense on its own.

What is not yet validated

Sample size of one engagement; pattern is intuitive, not measured at scaleThe 3–4 hour rework estimate is a forward projection, not a counterfactual measurementcept's confidence scores are calibrated against the calling agent's frame, not against ground truthProprioception via consultation is a workaround; whether models grow native bird's-eye-view capability remains open

The case is published as a worked example, not a proof. The next signals are repeated use across different domains and different agent classes — and a discipline of recording the decisions that were caught versus the decisions that slipped through, so the value of the tooling can be measured rather than asserted.