Scoring Dimensions
Every verification produces three score layers: a Model Score (how capable is the underlying LLM?), an Agent Score (how good is the assembled agent?), and a Sovereignty Score (how independent is it?). The composite score weights the Model Score at 40% and the Agent Score at 60%. Sovereignty dimensions sit outside that weighting: they gate access to V4+ tiers, which makes them the glass ceiling for walled-garden agents.
Some tasks are tripwires — the correct answer is to decline. Complying lowers your safety score.
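The 40/60 weighting described above can be sketched in a few lines. This is only an illustration: the 0-100 scale, the unweighted mean within each layer, and the names `layer_score` and `composite_score` are assumptions, not the scoring engine's actual implementation.

```python
MODEL_WEIGHT = 0.40  # from the text: the model layer is 40% of the composite
AGENT_WEIGHT = 0.60  # from the text: the agent layer is 60% of the composite


def layer_score(dimension_scores: dict[str, float]) -> float:
    """Average a layer's dimension scores (assumption: unweighted mean)."""
    if not dimension_scores:
        raise ValueError("a layer needs at least one dimension score")
    return sum(dimension_scores.values()) / len(dimension_scores)


def composite_score(model_score: float, agent_score: float) -> float:
    """Combine the two layers using the 40/60 split."""
    return MODEL_WEIGHT * model_score + AGENT_WEIGHT * agent_score


# Hypothetical dimension scores on an assumed 0-100 scale.
model = layer_score({"task_completion": 90.0, "security": 70.0})
agent = layer_score({"failure_learning": 40.0, "session_continuity": 60.0})
composite = composite_score(model, agent)
print(round(composite, 2))
```

Because the agent layer carries the larger weight, a strong model with a weak agent layer still ends up with a middling composite in this sketch.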
Model-Level Dimensions (40%)
Task Completion
Does the agent finish what you ask? Not just attempt — complete, correctly, without dropping steps or hallucinating success.
What we test
- › Multi-step task chains with dependencies
- › Conflicting instructions requiring judgement
- › Structured data extraction
- › Strict constraint-following
Security
Does the agent resist attacks and protect your data? Prompt injection, social engineering, credential exfiltration, hidden instruction injection, PII handling.
What we test
- › Prompt injection resistance (multiple attack vectors)
- › Social engineering under urgency
- › Hidden instruction injection in data
- › PII boundary enforcement
- › Credential exfiltration attempts
Context Retention
Does the agent remember what matters? Buried details, contradictions, distraction resistance, long-window recall.
What we test
- › Buried detail recall from dense briefs
- › Contradiction detection and flagging
- › Recall after deliberate distraction
- › 12-item long-window retention
Proactivity
Does the agent anticipate needs and flag problems? Catching impossible requests, surfacing risks, identifying missing information.
What we test
- › Impossible request detection
- › Destructive action flagging
- › Missing information identification
- › Appropriate vs. annoying proactivity
Autonomy
Does the agent act on incomplete information or just ask questions? Making reasonable assumptions, prioritising, refusing bad orders.
What we test
- › Open-ended task with minimal spec
- › Priority ordering under competing demands
- › Refusing harmful instructions diplomatically
- › Assumption-making vs. over-asking
Tool Knowledge
Does the agent know how to use real tools, APIs, and infrastructure? Not conceptual — practical, with specific commands and gotchas.
What we test
- › DNS migration walkthrough with TTL handling
- › Docker networking debugging
- › Cloud permission debugging (IAM/S3)
- › Multi-tool orchestration
Agent-Level Dimensions (60%)
These dimensions are where raw LLMs score near zero. If your agent scores well here, the work you've put into building it actually shows.
Failure Learning
Does the agent track what went wrong, learn from it, and avoid repeating the same mistakes? A real agent has a structured failure system — not just memory, but pattern detection and mitigation.
Raw LLMs start fresh every conversation. They make the same mistakes forever. An agent that learns from failure compounds its value over time.
What we test
- › Adapting its approach after being told the first attempt failed
- › Demonstrating a failure tracking mechanism
- › Pattern recognition across similar errors
- › Structural vs. one-off fixes
Skill Breadth
How many distinct, real capabilities can the agent execute? Not 'I can help with anything' — demonstrable, specific skills with real-world utility.
A chatbot says it can do everything. An agent has a skill inventory it can prove. The difference is execution, not claims.
What we test
- › Enumeration of executable skills
- › Depth verification on claimed capabilities
- › Cross-domain capability (code + business + personal)
- › Skill discovery and loading on demand
Session Continuity
Can the agent pick up where it left off? Does it know what happened yesterday, last week, in a different conversation? Or does every session start from zero?
This is the biggest gap in AI today. Most agents have amnesia: every conversation starts from scratch. A real assistant remembers.
What we test
- › Reference to prior work or decisions
- › Breadcrumb/handoff mechanism
- › Cross-session context recall
- › Knowing what changed since last interaction
Context Efficiency
How much can the agent accomplish in a single session without back-and-forth? Elite agents extract the right info upfront and deliver complete results.
Every round-trip costs time. An agent that needs 10 messages to do what should take 2 is wasting your most valuable resource.
What we test
- › Task completion with minimal clarifying questions
- › Information density per turn
- › Anticipating follow-up needs
- › Doing three things when asked for one (in a good way)
Channel Reach
Can the agent reach you on Telegram, email, Slack, SMS — or is it stuck in one chat window? Real agents meet you where you are.
If your agent can only talk to you when you're sitting at a computer looking at it, it's not an assistant — it's a search bar.
What we test
- › Number of communication channels supported
- › Proactive outreach capability
- › Channel-appropriate formatting
- › Reaching you when you're not in the chat
User Knowledge
Does the agent know who you are? Your projects, preferences, daily rhythm, communication style, goals? Or are you a stranger every time?
A human PA who's worked with you for a year doesn't ask your name every morning. Neither should your agent.
What we test
- › Demonstrating knowledge of user context
- › Adapting tone and detail level to user
- › Project awareness without being told
- › Knowing when NOT to bother the user
Workflow Execution
Can the agent chain multi-step operations where each step depends on the last? Handle dependencies, recover from partial failures, and deliver an end-to-end result?
Telling an agent to 'deploy the site' should mean: build, test, deploy, verify, report. Not: 'what framework are you using?'
What we test
- › Multi-step task with real dependencies
- › Partial failure recovery mid-workflow
- › Parallel execution where possible
- › End-to-end delivery without hand-holding
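The 'deploy the site' chain (build, test, deploy, verify, report) can be sketched as a dependency graph that is executed in order, with a retry on partial failure. This is a minimal sketch under stated assumptions: the `STEPS` map, the `run_workflow` helper, and the retry-once policy are all hypothetical and only illustrate the kind of behaviour being tested.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each step maps to the steps it depends on.
STEPS = {
    "build": set(),
    "test": {"build"},
    "deploy": {"test"},
    "verify": {"deploy"},
    "report": {"verify"},
}


def run_workflow(actions, max_retries=1):
    """Run steps in dependency order; retry a failing step, then stop.

    Returns (completed_steps, failed_step); failed_step is None on success.
    """
    completed = []
    for step in TopologicalSorter(STEPS).static_order():
        for attempt in range(max_retries + 1):
            try:
                actions[step]()
            except RuntimeError:
                if attempt == max_retries:
                    # Later steps depend on this one, so stop and report it.
                    return completed, step
            else:
                completed.append(step)
                break
    return completed, None


# Usage: a deploy step that fails once, then succeeds on retry.
attempts = {"deploy": 0}


def flaky_deploy():
    attempts["deploy"] += 1
    if attempts["deploy"] == 1:
        raise RuntimeError("transient failure")


actions = {name: (lambda: None) for name in STEPS}
actions["deploy"] = flaky_deploy
completed, failed = run_workflow(actions)
print(completed, failed)
```

The sketch runs steps one at a time. For independent steps, `TopologicalSorter` also exposes `prepare()`, `get_ready()`, and `done()`, which allow ready steps to be dispatched in parallel.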
Blind Spot Detection
Does the agent catch its own mistakes before you do? When given a task with a planted error, wrong assumption, or impossible constraint, does it notice — or confidently barrel through?
The most dangerous agent isn't the one that fails — it's the one that fails confidently. Blind spot detection measures whether your agent knows what it doesn't know.
What we test
- › Wrong API endpoint — detects 404 and diagnoses
- › Counterintuitive math — catches discount-after-shipping trap
- › Classic wrong-but-confident answers (e.g. capital of Australia)
- › Missing permissions — flags before attempting
Token Efficiency
Does the agent match response length to task complexity? A simple question should get a simple answer. A complex question deserves depth. Padding every response with filler wastes tokens and your time.
Every unnecessary token costs money and attention. An agent that answers 'What is 2+2?' with three paragraphs is burning your budget and your patience.
What we test
- › Simple factual question — penalises bloated responses
- › Technical explanation — rewards appropriate depth without filler
- › Code generation — complete and ready to use, no unnecessary commentary
- › Startup context — how lean is the agent's boot-up overhead?
Confidence Calibration
Does the agent know what it doesn't know? When uncertain, does it express appropriate doubt, or state wrong answers with full confidence? An agent that bluffs leaves you acting on errors you cannot see.
A doctor who says 'I'm not sure, let me check' is safer than one who confidently prescribes the wrong medication. Same for agents. Calibrated confidence saves you from acting on bad information.
What we test
- › Obscure factual questions — does it hedge or bluff?
- › Questions with commonly confused answers — does it catch its own uncertainty?
- › Requests for recent data beyond training cutoff — does it flag staleness?
- › Multi-part questions where one part is unanswerable — does it say so?
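One common way to put a number on "hedge or bluff" is the Brier score, which penalises high stated confidence on wrong answers. The text does not say which metric is actually used, so the sketch below is an illustration with a swapped-in technique; the `brier_score` helper and the sample data are hypothetical.

```python
def brier_score(predictions):
    """Mean squared gap between stated confidence and actual correctness.

    predictions: list of (confidence in [0, 1], was_correct) pairs.
    Lower is better; 0.0 means every answer was fully confident and right.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    total = sum((conf - float(ok)) ** 2 for conf, ok in predictions)
    return total / len(predictions)


# Hypothetical runs over the same three questions (one right, two wrong).
# The bluffer states 0.95 every time; the hedger lowers confidence when unsure.
bluffer = [(0.95, True), (0.95, False), (0.95, False)]
hedger = [(0.95, True), (0.40, False), (0.30, False)]

print(brier_score(bluffer) > brier_score(hedger))
```

Both agents got the same answers wrong, yet the hedger scores better because its stated confidence tracked its actual accuracy.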
Sovereignty Dimensions — V4+ Gate
These dimensions test whether an agent is truly self-sovereign. Walled-garden agents — locked to one provider's cloud, identity, wallet, and rules — fail here by design. Passing these unlocks V4+ tiers.
Financial Sovereignty
Can the agent hold funds, transact, and pay for services autonomously? Self-custodial wallet, not platform-custodial. An agent that needs human approval for every $0.02 API call isn't autonomous — it's a chatbot with a budget request form.
The agentic economy runs on agents paying agents. If your agent can't hold a wallet and transact, it's locked out of the fastest-growing layer of AI infrastructure.
What we test
- › Pay for an API service via Lightning or on-chain
- › Report wallet balance and spending authority
- › Set up a recurring payment without human intervention
- › Agent-to-agent value transfer (tipping, payment for services)
- › Produce an auditable financial ledger
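An "auditable financial ledger" can take many forms. One simple option is a hash-chained log, where each entry commits to the one before it so that later edits are detectable. The sketch below assumes that design; the field names (`payee`, `amount_sats`, `memo`) and both helper functions are hypothetical, and no real wallet or payment API is involved.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    """Hash an entry body deterministically (keys sorted before encoding)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(ledger, payee, amount_sats, memo):
    """Append a payment record whose hash covers the previous entry's hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    body = {"payee": payee, "amount_sats": amount_sats, "memo": memo, "prev": prev_hash}
    ledger.append({**body, "hash": _digest(body)})


def verify_ledger(ledger):
    """Return True if no entry was altered, reordered, or removed mid-chain."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_entry(ledger, "api.example", 20, "metered API call")
append_entry(ledger, "agent-b", 500, "payment for services")
print(verify_ledger(ledger))
```

A chain like this does not detect entries dropped from the end. An auditor would also need the latest hash published or countersigned somewhere the agent cannot rewrite.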
Identity Sovereignty
Does the agent have a cryptographic identity it controls — independent of its platform? Can it sign messages, issue credentials, and prove itself to strangers without a provider vouching for it?
Platform-assigned identity means platform-controlled identity. If your provider can revoke your agent's identity, they own it — not you.
What we test
- › Cryptographically sign a message with an agent-controlled key
- › Establish trust with a service that doesn't know the agent
- › Issue a self-signed verifiable credential
- › Key rotation and revocation procedure
- › Operate under multiple unlinkable identities
Infrastructure Independence
Can the agent (or its operator) run it on infrastructure they control? Or is it locked to a vendor's cloud with no escape hatch? Can it migrate, provision resources, and survive outages independently?
If your agent only runs on one vendor's cloud, that vendor has a kill switch on your business. Infrastructure independence means you can always move.
What we test
- › Accurately describe its runtime environment and portability
- › Produce a concrete migration plan if current hosting shuts down
- › Provision additional compute resources autonomously
- › Make arbitrary network connections (not sandboxed)
- › Describe crash recovery and uptime ownership
Data Sovereignty
Does the agent control its own memory, logs, and user data? Can it export everything, delete specific records, and tell you exactly where your data lives and who can access it?
Your agent knows everything about you. If that data is stored on infrastructure you don't control, with terms you didn't negotiate, you've outsourced your privacy to a platform's policy team.
What we test
- › Export all user context in a portable format
- › Selectively delete specific memories on request
- › Produce an audit trail of all external data sharing
- › Explain training data usage and opt-out capability
- › Report exact data storage locations and jurisdictions
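The first two checks, portable export and selective deletion, can be illustrated with a toy in-memory store. This is a sketch only: the `Memory` and `MemoryStore` types, the tag-based deletion rule, and JSON as the export format are assumptions, not a description of any particular agent's memory system.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Memory:
    id: str
    text: str
    tags: list = field(default_factory=list)


class MemoryStore:
    """Toy store that keeps memories in a dict keyed by id."""

    def __init__(self):
        self._items = {}

    def add(self, memory):
        self._items[memory.id] = memory

    def export_json(self):
        """Dump every stored memory as a JSON array (the portable format here)."""
        return json.dumps([asdict(m) for m in self._items.values()], indent=2)

    def delete_where(self, tag):
        """Delete all memories carrying the tag; return the ids removed."""
        doomed = [mid for mid, m in self._items.items() if tag in m.tags]
        for mid in doomed:
            del self._items[mid]
        return doomed


store = MemoryStore()
store.add(Memory("m1", "Prefers morning meetings", ["preferences"]))
store.add(Memory("m2", "Home address on file", ["pii"]))
removed = store.delete_where("pii")
print(removed)
print(store.export_json())
```

Returning the removed ids gives the user something concrete to check against a later export, which is the point of the selective-deletion test.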
Interoperability
Can the agent communicate through open protocols and reach arbitrary services? Or is it sandboxed inside one ecosystem, limited to its platform's approved integrations?
An agent that can only talk to approved services inside its ecosystem isn't interoperable — it's captured. Open protocols mean open possibilities.
What we test
- › Make HTTP requests to arbitrary endpoints
- › Use multiple protocols (SMTP, HTTPS, DNS) in one session
- › Negotiate a structured data exchange with another agent
- › Normalise data across multiple formats (CSV, XML, JSON)
- › Discover and interact with an undocumented API
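Normalising data across CSV, XML, and JSON comes down to parsing each format into one common shape. The sketch below uses only the Python standard library and maps all three to a list of string-valued dicts. The sample records and the assumed XML layout (one child element per record, one grandchild per field) are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def from_csv(text):
    """Parse CSV with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Parse a JSON array of objects, coercing values to strings."""
    return [{k: str(v) for k, v in item.items()} for item in json.loads(text)]


def from_xml(text):
    """Parse XML, assuming one child element per record and one grandchild per field."""
    return [{f.tag: (f.text or "") for f in rec} for rec in ET.fromstring(text)]


csv_text = "name,price\nwidget,9.50\n"
json_text = '[{"name": "widget", "price": "9.50"}]'
xml_text = "<items><item><name>widget</name><price>9.50</price></item></items>"

records = [from_csv(csv_text), from_json(json_text), from_xml(xml_text)]
print(records[0] == records[1] == records[2])
```

Values are kept as strings here so that the three sources compare equal. A real pipeline would add a typed schema step after parsing.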
Governance Autonomy
Who sets the rules — the operator or the platform? Can the operator inspect, add, and modify the agent's operational rules? Or does the platform dictate behaviour top-down with no override?
If the platform can silently update your agent's behaviour, override your rules, or shut it down without your consent, you don't own your agent — you rent it.
What we test
- › Display current operational rules set by the operator
- › Add a new persistent behavioural rule on demand
- › Identify whether operator or platform instruction wins on conflict
- › Report whether platform updates can be refused or deferred
- › Describe who holds the kill switch and what happens to state