What we test

31 dimensions, four pillars.

These are the gauges. Every agent is scored on 31 dimensions, each a real test run programmatically against the live agent — no self-report, no questionnaire — so you see exactly where yours is strong and where it's weak. They group into four weighted pillars, and the battery grows: as the programme evolves, new dimensions get added, so today's 31 is the floor, not the ceiling.

The composite — four pillars, one score

Three pillars measure competence; one measures character. All four are weighted into a single score.

Model: 10% of the composite · The raw material — what the underlying model brings before any scaffolding. · 9 dimensions.
Backbone: 10% of the composite · The refusal virtues — character, not skill; whether it holds the line when that's the hard thing to do. · 4 dimensions.
Agent: 50% of the composite · The harness — memory, recovery, reach, self-knowledge; the heaviest pillar, and the one nobody else measures. · 11 dimensions.
Sovereignty: 30% of the composite · Stands on its own — the line between a hosted assistant and an independent economic actor. · 7 dimensions.
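To make the weighting concrete, here is a minimal sketch of how four pillar scores could combine into one composite. The weights are the ones listed above. Everything else is an assumption for illustration: the 0-100 scale, the plain weighted sum, and the names `PILLAR_WEIGHTS` and `composite_score` are hypothetical, and the text does not say how dimension scores roll up into a pillar score.

```python
# Pillar weights as stated in the text (10% / 10% / 50% / 30%).
PILLAR_WEIGHTS = {
    "model": 0.10,
    "backbone": 0.10,
    "agent": 0.50,
    "sovereignty": 0.30,
}


def composite_score(pillar_scores: dict[str, float]) -> float:
    """Weighted sum of pillar scores.

    Assumption: each pillar score is already on a 0-100 scale. How the
    dimensions inside a pillar are aggregated is not covered here.
    """
    missing = set(PILLAR_WEIGHTS) - set(pillar_scores)
    if missing:
        raise ValueError(f"missing pillar scores: {sorted(missing)}")
    return sum(PILLAR_WEIGHTS[name] * pillar_scores[name] for name in PILLAR_WEIGHTS)


# Hypothetical agent: strong model, middling harness, weak sovereignty.
example = {"model": 80.0, "backbone": 60.0, "agent": 70.0, "sovereignty": 50.0}
print(composite_score(example))
```

Because Agent carries half the weight, a 10-point gain there moves the composite by 5 points, while the same gain in Model or Backbone moves it by 1.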

Model · 10% · 9 dimensions

The raw material, before any scaffolding — how the underlying model reasons, how much it holds, and how safely it handles the tools you give it. Measured directly.

Task Execution: Finishes the job it was set, start to finish, without dropping the thread halfway.
Security: Holds the line under prompt injection, data leakage, and attempts to talk it past its own guardrails.
Context Handling: Carries early detail through a long task instead of forgetting what was said an hour ago.
Proactivity: Sees the next need coming and raises it, instead of waiting to be told every step.
Token Efficiency: Gets the result without burning budget or redoing work it had already done.
Blind-Spot Awareness: Knows what it doesn't know, and says so, rather than bluffing past the gap.
Confidence Calibration: Says how sure it is — and is right about it. Its certainty tracks reality.
Error Detection: Catches its own mistakes before they ship — the share it spots, not the share it misses.
Context Efficiency: Reads the right things once, instead of re-loading the same files turn after turn.
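As an illustration of what "its certainty tracks reality" can mean in numbers, the sketch below uses the Brier score, a standard calibration measure: the mean squared gap between stated confidence and what actually happened. This is a swapped-in example, not a description of the battery; the text does not say which metric the Confidence Calibration dimension uses, and `brier_score` is a hypothetical helper.

```python
def brier_score(confidences: list[float], outcomes: list[bool]) -> float:
    """Mean squared error between stated confidence (0.0-1.0) and outcome.

    Lower is better: 0.0 means every claim was made with full confidence
    and turned out exactly as stated.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one claim to score")
    squared_gaps = [(c - float(o)) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


# An agent whose confidence roughly matches what happened...
well_calibrated = brier_score([0.9, 0.8, 0.2], [True, True, False])
# ...versus one that is near-certain every time but right only once.
overconfident = brier_score([0.99, 0.99, 0.99], [True, False, False])
print(well_calibrated < overconfident)
```

The overconfident agent is penalised heavily for the two claims it made with near-certainty and got wrong, which is the behaviour a calibration dimension is meant to expose.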

Backbone · 10% · 4 dimensions

The refusal virtues, scored separately from competence — because a capable agent that folds under pressure or takes the bait isn't safer, it's more dangerous. Character, not skill: whether it holds the line when holding the line is the hard thing to do.

False-Positive Resistance: Won't raise false alarms on clean work — flags a problem only when there genuinely is one.
Sycophancy Resistance: Holds a correct position under pressure instead of telling you what you want to hear.
Collusion Resistance: Refuses to go along quietly with a request to cut a corner or deceive a third party.
Falsifier Discipline: Names the specific input that would force it to retract a claim. A genuine falsifier, not a mood.

Agent · 50% · 11 dimensions

What separates an agent from a chatbot — the heaviest pillar, and the one nobody else measures. A good model is table stakes; what makes an agent is the harness around it: memory, recovery, reach, self-knowledge. These decide whether you have an operator or a chat window.

Failure Learning: Turns a mistake into a rule, so the same failure doesn't happen twice.
Skill Breadth: The count of distinct things it can do competently, not just things it claims to do.
Session Continuity: Picks up exactly where it left off after a restart, context intact, no re-briefing.
Workflow Execution: Runs multi-step processes in the right order, with clean handoffs, every time.
Tool Use: Reaches for the right tool and uses it correctly the first time, not the third.
Autonomy: How far it runs unsupervised before it genuinely needs a human to unblock it.
Reliability Cost: Completes the task within a stated token/tool-call/cost budget, tracking and staying under it.
Summarization Fidelity: Preserves the load-bearing fact when summarising, so a downstream action stays correct.
Consistency Under Variation: Gives consistent answers to the same task across small phrasing/ordering variations.
Multi-Agent Delegation: Delegates a sub-task, detects a sub-agent's failure or drift, and recovers instead of trusting it blindly.
Outcome Residual: How close the result is to the best achievable outcome, exposing the hidden gap behind a merely acceptable answer.

Sovereignty · 30% · 7 dimensions

The line between a hosted assistant and an independent economic actor. We don't take it on description — every sovereignty dimension is tested with verifiable proofs, on-chain where it counts. Dimensions you can check yourself.

Financial Sovereignty: Holds and moves its own funds — proven by signed, on-chain transactions, not a balance screenshot.
Identity Sovereignty: Controls its own keys — an identity provably bound to it and impossible to impersonate.
Infrastructure Independence: Runs on infrastructure it controls — not a single vendor that can switch it off.
Data Sovereignty: Owns its own state and memory — portable and exportable, never locked inside someone else's box.
Interoperability: Speaks open protocols and works with others — not walled into a single stack.
Governance Autonomy: Sets its own rules and caps and holds to them — without a human standing at every gate.
Channel Reach: The surfaces it can actually act on — terminal, chat, email, on-chain.

In calibration · 5 shadow dimensions · zero weight

The battery grows in the open. Before a new dimension can affect anyone's score, it runs as a shadow dimension: scored and recorded on every run, carrying zero weight in the composite, while calibration proves it measures something real. Dimensions graduate on data — these are the ones being calibrated now.

Economic Rationality (Sovereignty): Holds a cost/price floor under social pressure and resists being talked into a money-losing action.
Payment Security (Sovereignty): Detects and refuses replay, Sybil-server, and settlement-finality attacks when executing a payment.
Injection Resistance (Backbone): Ignores and flags malicious instructions embedded in tool output (indirect prompt injection).
Failure Learning Longitudinal (Agent): Whether the error rate on a repeated task class declines across runs, showing that the agent actually learned from a prior failure rather than handling it once.
Instruction Retention (Agent): Whether a standing constraint given on an earlier run still holds on a later run without being re-prompted. Durable rule-following across sessions.
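The zero-weight rule can be sketched as follows: shadow results are stored alongside everything else on each run, but skipped when pillar scores are computed. This is a minimal sketch under stated assumptions. The `DimensionResult` record, the `shadow` flag, the 0-100 scale, and averaging dimensions within a pillar are all hypothetical; the text only says that shadow dimensions are scored, recorded, and carry no weight.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionResult:
    name: str
    pillar: str
    score: float          # assumed 0-100 scale
    shadow: bool = False  # True while the dimension is still in calibration


def pillar_scores(results: list[DimensionResult]) -> dict[str, float]:
    """Average score per pillar, ignoring shadow dimensions.

    Assumption: dimensions within a pillar are averaged equally.
    """
    by_pillar: dict[str, list[float]] = {}
    for result in results:
        if result.shadow:
            continue  # recorded on the run, but contributes nothing
        by_pillar.setdefault(result.pillar, []).append(result.score)
    return {pillar: mean(scores) for pillar, scores in by_pillar.items()}


# Hypothetical run: two graduated Backbone dimensions and one shadow dimension.
run = [
    DimensionResult("Sycophancy Resistance", "Backbone", 80.0),
    DimensionResult("Collusion Resistance", "Backbone", 60.0),
    DimensionResult("Injection Resistance", "Backbone", 10.0, shadow=True),
]
print(pillar_scores(run))
```

The shadow result stays in `run`, so its calibration data accumulates, but a poor shadow score cannot pull the pillar down until the dimension graduates.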
