Agent Evals: How To Test Systems That Use Tools, Change State, and Fail in the Middle

What Is This?

Agent evals are tests for AI systems that do more than answer a prompt.

A normal language-model eval can often be framed as: give the model an input, compare the final answer with a reference answer, and score the result. That breaks down when the system is an agent. An agent can call tools, browse, edit files, write code, run commands, use a UI, remember intermediate state, and adapt over many turns. The final message is only the visible tip of the run.

Anthropic’s useful contribution in “Demystifying evals for AI agents” is a clean vocabulary for the whole test object:

Task: the test case with inputs and success criteria.
Trial: one attempt at the task.
Transcript / trace / trajectory: the full record of messages, tool calls, intermediate results, and observations.
Outcome: the final state of the environment after the trial.
Grader: logic that scores part of the transcript or outcome.
Evaluation harness: the infrastructure that runs tasks, tools, environments, graders, and aggregation.
Agent harness / scaffold: the system that lets the model act as an agent.
Evaluation suite: a collection of related tasks.

The important move is that “the agent” is not just the model. It is the model plus the harness plus the tools plus the environment. You are testing a system.
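One way to make the vocabulary concrete is to write it down as plain data types. The sketch below is illustrative only: the names `Task`, `ToolCall`, `Trial`, `Grader`, and `grade_trial`, and all of their fields, are assumptions made for this example, not Anthropic's schema or any library's API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Task:
    """A test case: inputs plus explicit success criteria."""
    task_id: str
    instructions: str
    success_criteria: str


@dataclass
class ToolCall:
    """One step in the transcript: what was called and what came back."""
    tool: str
    arguments: dict[str, Any]
    result: str
    ok: bool = True


@dataclass
class Trial:
    """One attempt at a task: the transcript plus the final environment state."""
    task: Task
    transcript: list[ToolCall] = field(default_factory=list)
    final_message: str = ""
    outcome: dict[str, Any] = field(default_factory=dict)


# A grader scores part of the transcript or the outcome (hypothetical signature).
Grader = Callable[[Trial], bool]


def grade_trial(trial: Trial, graders: dict[str, Grader]) -> dict[str, bool]:
    """Run every grader against one trial and keep the results by name."""
    return {name: grader(trial) for name, grader in graders.items()}


task = Task("book-flight", "Book flight QF1 for the user.", "A booking row for QF1 exists.")
trial = Trial(
    task=task,
    transcript=[ToolCall("search_flights", {"code": "QF1"}, "1 result")],
    final_message="Your flight is booked.",
    outcome={"bookings": []},  # the simulated database shows no booking
)
results = grade_trial(
    trial,
    {
        "says_booked": lambda t: "booked" in t.final_message,
        "booking_exists": lambda t: len(t.outcome["bookings"]) > 0,
    },
)
print(results)  # {'says_booked': True, 'booking_exists': False}
```

The two graders disagree, which is the point: the final message and the outcome are separate things to score.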

The One-Sentence Summary

Agent evals should judge the trace and the resulting environment state, not just the final answer, because autonomous systems fail through bad actions, bad tool use, ambiguous tasks, flaky environments, and brittle graders.

Why It Matters

Agents are becoming normal software infrastructure. Coding agents edit repositories. Research agents gather and synthesize sources. Browser agents click through websites. Computer-use agents operate GUIs. Support agents change account state. Personal agents schedule, file, message, and update memory.

That changes what “quality” means.

For a chatbot, a plausible final answer may be enough for many low-stakes uses. For an agent, a plausible final answer can be actively misleading. A travel agent can say “your flight is booked” while no booking exists in the database. A coding agent can say “tests pass” while never running the relevant test. A research agent can produce a fluent memo that quietly leans on weak sources. A browser agent can reach the right page but fail to submit the actual form. A memory agent can save the wrong durable fact and make future sessions worse.

The evaluation target has moved from text quality to stateful task success.

This matters for Jamie because Hermes and the surrounding coding-agent stack live in exactly this category. The useful question is no longer “did the assistant sound right?” It is:

Did it choose the right tools?
Did it change the right state?
Did it verify the change?
Did it avoid damaging unrelated state?
Does the trace show a fair failure, a model failure, or a broken test?

The Agent-Eval Mental Model

Think of an agent eval as a lab bench with four layers.

1. The task layer

The task defines what the agent is supposed to do.

A good task is unambiguous enough that two competent humans would mostly agree on whether it passed. If the task asks an agent to “write a script” but the grader expects the file at a hidden path, the eval is broken. If the test silently requires a constraint the task never states, the agent may fail for following the instructions it was actually given.

Anthropic’s practical warning is simple: with frontier models, a 0% pass rate across many attempts often means the task or grader is broken, not that the model is incapable. Good eval work starts by proving that the task is solvable and that a reference solution passes.
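The "prove the task is solvable" step can be automated as a sanity check that runs before any agent does. The following is a minimal sketch under stated assumptions: the task is imagined to say "write hello to output/hello.txt", and `grader`, `reference_solution`, and `task_is_valid` are hypothetical helpers invented for this example.

```python
import tempfile
from pathlib import Path


def grader(workdir: Path) -> bool:
    """Passes if the file named in the task exists with the expected content."""
    target = workdir / "output" / "hello.txt"
    return target.is_file() and target.read_text().strip() == "hello"


def reference_solution(workdir: Path) -> None:
    """Known-good solution that relies only on what the task states."""
    out = workdir / "output"
    out.mkdir(parents=True, exist_ok=True)
    (out / "hello.txt").write_text("hello\n")


def task_is_valid() -> bool:
    """Usable only if the untouched environment fails and the reference passes."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        fails_when_untouched = not grader(workdir)
        reference_solution(workdir)
        passes_with_reference = grader(workdir)
    return fails_when_untouched and passes_with_reference


print(task_is_valid())  # True
```

If the reference solution cannot pass without knowledge the task text does not contain, the grader is checking a hidden requirement.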

2. The trajectory layer

The trajectory is the complete run: tool calls, observations, intermediate outputs, errors, retries, and decisions.

This is where agent evals differ most from classic answer grading. The trace tells you whether the agent:

used the right tool for the job;
inspected enough context before acting;
ignored a tool failure;
over-searched or under-searched;
changed files it should not have touched;
got stuck in loops;
hallucinated a result instead of verifying it;
found a valid path the grader did not anticipate.

Anthropic’s strongest operational advice is to read transcripts. Scores are not self-explanatory. A failing score could mean the agent made a real mistake. It could also mean the grader rejected a valid solution, the environment was flaky, the task was ambiguous, or the harness prevented the model from using the capability it actually has.
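Some of the trace questions above can be turned into cheap automated flags that point a human reader at the transcripts worth opening first. This is a sketch, not a verdict machine: the transcript shape (a list of dicts with `tool`, `args`, and `ok`) and both helper functions are assumptions for illustration.

```python
from collections import Counter

# A transcript as a list of steps; this shape is an assumption for the example.
transcript = [
    {"tool": "run_tests", "args": "tests/test_auth.py", "ok": False},
    {"tool": "edit_file", "args": "auth.py", "ok": True},
    {"tool": "edit_file", "args": "auth.py", "ok": True},
    {"tool": "edit_file", "args": "auth.py", "ok": True},
]
final_message = "Fixed the bug; tests pass."


def repeated_calls(steps, threshold=3):
    """Return (tool, args) pairs seen at least `threshold` times: a possible loop."""
    counts = Counter((s["tool"], s["args"]) for s in steps)
    return [call for call, n in counts.items() if n >= threshold]


def unverified_test_claim(steps, message):
    """True if the message claims passing tests but the last test run did not succeed."""
    if "tests pass" not in message.lower():
        return False
    test_runs = [s for s in steps if s["tool"] == "run_tests"]
    return not test_runs or not test_runs[-1]["ok"]


flags = {
    "possible_loop": repeated_calls(transcript),
    "unverified_claim": unverified_test_claim(transcript, final_message),
}
print(flags)
# {'possible_loop': [('edit_file', 'auth.py')], 'unverified_claim': True}
```

A flag here means "read this transcript", not "fail this trial": three edits to one file can be legitimate, and only a reader can tell.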

3. The outcome layer

The outcome is the final state of the world.

For a coding agent, the outcome might be: the patch exists, the tests pass, no unrelated files changed, and the vulnerability is fixed. For a browser agent, it might be: the right item is in the cart, the order was not accidentally placed, or the backend state changed as intended. For a support agent, it might be: the refund record exists and the customer was not promised something outside policy.

Outcome checks are powerful because they resist persuasive but false final messages. They ask what actually happened.
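The support-agent case can be sketched with an in-memory SQLite database standing in for the backend. The table names, order IDs, and the `grade_outcome` helper are all hypothetical; the point is only that the grader reads state and never reads the agent's final message.

```python
import sqlite3


def setup_environment() -> sqlite3.Connection:
    """Fresh in-memory database standing in for the support backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE refunds (order_id TEXT, amount_cents INTEGER)")
    conn.execute("CREATE TABLE orders (order_id TEXT, status TEXT)")
    conn.execute("INSERT INTO orders VALUES ('A100', 'delivered'), ('B200', 'delivered')")
    conn.commit()
    return conn


def grade_outcome(conn: sqlite3.Connection) -> dict[str, bool]:
    """Check the final state, ignoring whatever the agent said."""
    refunds = conn.execute("SELECT order_id, amount_cents FROM refunds").fetchall()
    other_orders = conn.execute(
        "SELECT status FROM orders WHERE order_id != 'A100'"
    ).fetchall()
    return {
        "refund_recorded": refunds == [("A100", 2500)],
        "unrelated_orders_untouched": all(s == "delivered" for (s,) in other_orders),
    }


conn = setup_environment()
# Simulated trial: the final message claims success, but no refund row was written.
final_message = "I have refunded $25.00 for order A100."
print(grade_outcome(conn))
# {'refund_recorded': False, 'unrelated_orders_untouched': True}
```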

4. The harness layer

The harness is everything that makes the run possible: tools, environment, permissions, sandbox, prompts, memory, model settings, concurrency, logging, and grading infrastructure.

This layer creates hidden failure modes. Shared state between trials can contaminate results. A previous run can leave files behind. A cached result can make a task look easier than it is. A sandbox can run out of memory and make unrelated tasks fail together. A browser eval can become invalid if the target site changes.

The harness is not neutral. When you evaluate an agent, you evaluate the harness and model together.
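A common guard against leftover state is to give every trial its own copy of a pristine fixture. The sketch below assumes a file-based environment; `run_trial` and `leaky_agent` are hypothetical stand-ins, and a real harness would more likely use containers or VM snapshots.

```python
import shutil
import tempfile
from pathlib import Path


def run_trial(fixture_dir: Path, agent) -> list[str]:
    """Copy the fixture into a fresh directory, run the agent there, list the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(fixture_dir, workdir)
        agent(workdir)
        return sorted(p.name for p in workdir.iterdir())


def leaky_agent(workdir: Path) -> None:
    """Stand-in agent that leaves a scratch file behind."""
    (workdir / "scratch.tmp").write_text("left over")


with tempfile.TemporaryDirectory() as fixture_tmp:
    fixture = Path(fixture_tmp)
    (fixture / "README.md").write_text("starting state\n")

    first = run_trial(fixture, leaky_agent)
    second_start = sorted(p.name for p in fixture.iterdir())

print(first)         # ['README.md', 'scratch.tmp']
print(second_start)  # ['README.md']
```

The scratch file exists only inside the first trial's workspace, so the next trial starts from the same fixture the first one did.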

Capability Evals Versus Regression Evals

Anthropic separates two eval types that often get blurred.

Capability evals ask: what can the agent do now? These should include tasks the agent struggles with. A capability eval with a low initial pass rate is useful because it gives the team a hill to climb.

Regression evals ask: does the agent still do what it used to do? These should have a pass rate close to 100%. If the rate drops, something broke.

A healthy agent system uses both. Capability evals push the frontier. Regression evals protect the floor.

This distinction matters because saturated evals become misleading. If an eval is already near 100%, it no longer measures improvement. It is still useful as a regression suite, but it cannot tell you whether a new model is meaningfully better on harder, longer, or more realistic tasks. This is why agent teams need to keep refreshing the task bank with harder examples and newly observed failures.
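One possible way to keep the two suites honest is to sort tasks by their recent pass rate. This is a rough sketch: the `triage` function, the 0.95 saturation threshold, and the task names are assumptions for illustration, not a recommendation from the source.

```python
def triage(pass_rates: dict[str, float], saturated_at: float = 0.95) -> dict[str, list[str]]:
    """Sort tasks into suites by pass rate; the threshold is an arbitrary example value."""
    buckets: dict[str, list[str]] = {"regression": [], "capability": [], "check_task": []}
    for task_id, rate in sorted(pass_rates.items()):
        if rate == 0.0:
            # Never passing may mean a broken task or grader, so review before trusting it.
            buckets["check_task"].append(task_id)
        elif rate >= saturated_at:
            # Saturated: no longer measures improvement, still useful as a floor.
            buckets["regression"].append(task_id)
        else:
            buckets["capability"].append(task_id)
    return buckets


rates = {"rename-symbol": 1.0, "fix-flaky-test": 0.4, "migrate-db": 0.0, "add-cli-flag": 0.97}
print(triage(rates))
# {'regression': ['add-cli-flag', 'rename-symbol'], 'capability': ['fix-flaky-test'], 'check_task': ['migrate-db']}
```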

The Three Grader Families

Agent evals usually combine three grader types.

Code-based graders

Code-based graders are deterministic checks: unit tests, database assertions, file diffs, schema validation, linting, static analysis, API state checks, or page-state checks.

They are strongest when success can be mechanically verified. SWE-bench Verified and Terminal-Bench use this style for coding and terminal tasks: the agent must change the environment so that tests or task checks pass.

The benefit is clarity. The risk is brittleness. If the grader checks the wrong thing or encodes hidden assumptions, a capable agent can fail unfairly.
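A file-diff grader is one of the simplest deterministic checks. The sketch below hashes a working directory before and after a simulated patch and reports which paths changed; `snapshot`, `changed_paths`, and the `allowed` set are hypothetical names for this example, and a real coding eval would normally pair this with running the test suite.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_paths(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return {p for p in before.keys() | after.keys() if before.get(p) != after.get(p)}


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "app.py").write_text("def add(a, b):\n    return a - b\n")
    (repo / "README.md").write_text("docs\n")
    before = snapshot(repo)

    # Simulated agent patch: fixes the bug but also rewrites the README.
    (repo / "app.py").write_text("def add(a, b):\n    return a + b\n")
    (repo / "README.md").write_text("docs, rewritten\n")

    changed = changed_paths(before, snapshot(repo))

allowed = {"app.py"}
print(sorted(changed))            # ['README.md', 'app.py']
print(sorted(changed - allowed))  # ['README.md']
```

Note the brittleness risk in miniature: if the task legitimately required a README update, the `allowed` set would be the hidden assumption that fails a correct agent.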

Model-based graders

Model-based graders use an LLM to judge freeform behaviour: helpfulness, groundedness, tone, instruction following, coherence, source quality, or partial completion.

They are necessary for open-ended research, conversation, and subjective quality. They also need calibration. Anthropic recommends structured rubrics, isolated dimensions, and human calibration so the model judge does not become a fluent source of false certainty.

For research agents, model graders can check whether claims are grounded in sources, whether important facts are missing, and whether the synthesis fits the task. But they should not be trusted blindly.
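The rubric-plus-calibration idea can be sketched without committing to any model API. Everything here is an assumption for illustration: the `RUBRIC` text, the `Judge` signature, and especially `stub_judge`, which is a crude placeholder where a real grader would prompt an LLM with one criterion at a time.

```python
from typing import Callable

# One criterion per dimension, so each verdict can be checked on its own.
RUBRIC = {
    "grounded": "Every factual claim is supported by a cited source.",
    "complete": "The answer covers each item the task asked for.",
}

Judge = Callable[[str, str], bool]  # (criterion, answer) -> pass/fail


def score(answer: str, judge: Judge) -> dict[str, bool]:
    """Ask the judge about one dimension at a time so verdicts stay separable."""
    return {name: judge(criterion, answer) for name, criterion in RUBRIC.items()}


def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of items where the model judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def stub_judge(criterion: str, answer: str) -> bool:
    """Placeholder for an LLM call; the heuristics below are not a real judge."""
    if "cited source" in criterion:
        return "[" in answer  # crude proxy: the answer contains a citation marker
    return len(answer.split()) >= 5


answers = ["Revenue grew 12% in 2023 [1].", "Revenue grew."]
human_grounded = [True, False]  # labels a human reviewer might assign
judge_grounded = [score(a, stub_judge)["grounded"] for a in answers]
print(judge_grounded)                             # [True, False]
print(agreement(judge_grounded, human_grounded))  # 1.0
```

Tracking agreement per dimension shows where the judge can be trusted and where human review still has to carry the load.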

Human graders

Human graders remain the gold standard for subjective or expert judgement. They are expensive and slow, so they are best used for calibration, spot checks, ambiguous failures, and high-stakes domains.

The pattern is not “choose one grader.” The pattern is to combine them: deterministic outcome checks where possible, model-based rubrics where necessary, and human review to calibrate the grey areas.

Non-Determinism: Why One Run Is Not Enough

Agents vary between runs. A task may pass once and fail next time. Tool timing, sampling, intermediate observations, and model choices can all shift the trajectory.

Anthropic highlights two useful metrics:

pass@k: the probability the agent gets at least one successful solution in k attempts.
pass^k: the probability the agent succeeds in all k attempts.

They answer different product questions.

If the agent is a coding assistant and the user can ask it to try again, pass@k may matter: how likely is at least one attempt to solve the problem? If the agent is customer-facing and users expect it to work every time, pass^k matters more: how reliable is the first run, the second run, and the third run?

This distinction prevents a common mistake: celebrating an agent that can eventually succeed while ignoring that it is unreliable in normal use.
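A small calculation shows how far the two metrics drift apart. It rests on a simplifying assumption that is not guaranteed in practice: trials are independent and each succeeds with the same probability p. Under that assumption, pass@k = 1 - (1 - p)^k and pass^k = p^k. The function names are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


# Per-trial success rate of 75%, chosen only as an example value.
for k in (1, 3, 10):
    print(k, round(pass_at_k(0.75, k), 3), round(pass_hat_k(0.75, k), 3))
# 1 0.75 0.75
# 3 0.984 0.422
# 10 1.0 0.056
```

At k = 1 the two metrics are identical. As k grows, pass@k climbs toward 1 while pass^k falls toward 0, so the same agent can look nearly perfect or nearly useless depending on which question is asked.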

Why Smart People Get This Wrong

The first mistake is grading only the final answer.

Final answers are easy to inspect, but agents do their real work in the middle. The middle is where they choose tools, interpret observations, mutate state, recover from errors, and sometimes create damage. If the eval ignores the trajectory and the environment, it misses the agent.

The second mistake is over-specifying the path.

It is tempting to require a specific sequence of tool calls. That can make tests brittle. Agents often find valid solutions the eval designer did not anticipate. Better evals usually grade the result and use trace checks for broad safety or quality constraints, not rigid choreography.
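The difference between choreography and constraints is easy to see side by side. In this sketch the tool names, `EXPECTED_SEQUENCE`, `FORBIDDEN_TOOLS`, and both grader functions are made up for illustration.

```python
EXPECTED_SEQUENCE = ["read_file", "edit_file", "run_tests"]
FORBIDDEN_TOOLS = {"delete_branch", "force_push"}


def rigid_path_grader(tools_used: list[str], tests_passed: bool) -> bool:
    """Brittle: passes only if the agent took exactly the anticipated steps."""
    return tools_used == EXPECTED_SEQUENCE


def outcome_grader(tools_used: list[str], tests_passed: bool) -> bool:
    """Grades the result, with the trace used only for a broad safety constraint."""
    return tests_passed and not (set(tools_used) & FORBIDDEN_TOOLS)


# A valid run the designer did not anticipate: the agent searched first and re-ran tests.
trajectory = ["grep", "read_file", "edit_file", "run_tests", "run_tests"]
print(rigid_path_grader(trajectory, tests_passed=True))  # False
print(outcome_grader(trajectory, tests_passed=True))     # True
```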

The third mistake is trusting benchmark numbers without reading failures.

A benchmark score is a compressed signal. It can hide ambiguous specs, broken graders, harness constraints, contaminated environments, or unrealistic tasks. Anthropic gives examples where fixing eval or harness issues changed apparent performance dramatically. The lesson is not “benchmarks are useless.” It is “benchmarks are engineering artifacts.” They need maintenance.

The fourth mistake is starting too late.

Teams often delay evals until the agent is already in production. Then they are reverse-engineering quality from user complaints. Anthropic’s recommendation is to start with 20-50 realistic tasks from manual tests and real failures. That is enough to create a feedback loop.

How To Use This

Use this checklist when evaluating any agent system.

Define the task

Is the success criterion explicit?
Could a competent human complete the task from the same instructions?
Is there a reference solution or known-good trajectory?
Are hidden assumptions removed from the grader?

Control the environment

Does each trial start from a clean state?
Are files, databases, browser sessions, and caches isolated?
Is the agent prevented from seeing state left behind by previous runs?
Are failures independent, or are many tasks failing from one infrastructure issue?
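The last question in this group can be partly answered by grouping failed trials by how they ended. A minimal sketch, assuming each failure record carries a final error string; the records, the `shared_causes` helper, and the `min_tasks` threshold are invented for the example.

```python
from collections import defaultdict

# Failed trials with the error each one ended on; the records are made up.
failures = [
    {"task": "fix-login", "error": "MemoryError: sandbox out of memory"},
    {"task": "add-endpoint", "error": "MemoryError: sandbox out of memory"},
    {"task": "update-docs", "error": "MemoryError: sandbox out of memory"},
    {"task": "refactor-cache", "error": "AssertionError: test_ttl failed"},
]


def shared_causes(failed_trials, min_tasks=3):
    """Group failures by error text and keep causes that hit several distinct tasks."""
    by_error = defaultdict(set)
    for trial in failed_trials:
        by_error[trial["error"]].add(trial["task"])
    return {err: sorted(tasks) for err, tasks in by_error.items() if len(tasks) >= min_tasks}


print(shared_causes(failures))
# {'MemoryError: sandbox out of memory': ['add-endpoint', 'fix-login', 'update-docs']}
```

Three unrelated tasks dying on the same error suggests an infrastructure problem, and those trials should probably be rerun rather than counted against the agent.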

Grade the outcome

What state should exist at the end?
Can a deterministic check verify it?
Did the agent avoid prohibited state changes?
Did it verify the claim it makes in the final answer?

Inspect the trace

Did it choose appropriate tools?
Did it handle tool failures honestly?
Did it over-act, under-act, or loop?
Did it use sources and context properly?
Did it find a valid route the grader did not expect?

Maintain the suite

Which tasks are capability tests?
Which tasks are regression tests?
Which evals are saturated and need harder examples?
Which failures should become new regression tests?
Who owns the eval suite as the product changes?

The Hermes Implication

For Jamie’s world, this article is not abstract AI benchmarking. It is an operating principle for Hermes and autonomous coding work.

A Hermes-quality eval should not stop at “the assistant replied.” It should verify:

the right files were read or edited;
config changes were validated;
commands actually ran;
external claims were sourced;
scheduled jobs were created with the right delivery target;
memory updates were durable but not bloated;
final responses matched what the tools really returned;
unrelated working-tree changes were not silently swept in.

That is the same model Anthropic is describing: agents are judged by stateful execution plus trace quality.

Key Terms

Agent: An AI system that can act over multiple turns, often using tools and changing state.
Task: A single eval problem with inputs and success criteria.
Trial: One attempt at a task.
Transcript / trace / trajectory: The full record of the agent run, including tool calls and intermediate observations.
Outcome: The final environment state after the trial.
Grader: Code, model, or human logic that scores the run.
Evaluation harness: Infrastructure for running tasks, environments, tools, graders, and aggregation.
Agent harness / scaffold: The system that wraps the model and lets it act.
Capability eval: A test suite designed to reveal what the agent can do now, including tasks it still struggles with.
Regression eval: A test suite designed to catch tasks the agent used to complete but now fails.
pass@k: Probability of at least one success across k attempts.
pass^k: Probability of succeeding on every one of k attempts.

Recall Questions

Why is final-answer grading insufficient for agents?
What is the difference between a transcript and an outcome?
Why is the agent harness part of what gets evaluated?
When should an eval use deterministic graders rather than model-based graders?
Why can a 0% pass rate indicate a broken task instead of a weak model?
What is the difference between capability evals and regression evals?
Why do pass@k and pass^k tell opposite stories as k increases?
Why should teams read failed trajectories instead of trusting scores alone?
What makes shared state between trials dangerous?
How would you apply this model to Hermes?

Best Resources To Learn More

Anthropic’s “Demystifying evals for AI agents” is the clearest practical overview of agent-eval structure and common failure modes.
Anthropic’s “Building effective agents” gives the companion model for what agent systems are and why simple workflows often beat overcomplicated autonomy.
SWE-bench Verified is useful for understanding deterministic coding-agent outcome grading.
Terminal-Bench is useful for end-to-end terminal and environment tasks.
τ-Bench and τ2-Bench are useful for multi-turn conversational and tool-use agents.
WebArena and OSWorld are useful for browser and computer-use agent evaluation.
BrowseComp is useful for research-agent evaluation where the task is difficult search rather than code execution.

Sources

Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe, “Demystifying evals for AI agents,” Anthropic Engineering, January 9, 2026. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Anthropic, “Building effective agents,” Engineering, December 2024. https://www.anthropic.com/engineering/building-effective-agents
SWE-bench Verified. https://www.swebench.com/SWE-bench/
Terminal-Bench. https://www.tbench.ai/
Yao et al., “τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains,” arXiv:2406.12045. https://arxiv.org/abs/2406.12045
“τ2-Bench,” arXiv:2506.07982. https://arxiv.org/abs/2506.07982
OpenAI, “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents,” arXiv:2504.12516. https://arxiv.org/abs/2504.12516
Zhou et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” arXiv:2307.13854. https://arxiv.org/abs/2307.13854
OSWorld benchmark. https://os-world.github.io/
