When nobody reviews the output: evaluating agents in production
With a coding assistant, a developer checks every change. With an autonomous agent, no one does, so the eval suite becomes the quality gate: capability and regression suites; code, model, and human graders; outcomes over transcripts.
A coding assistant has a human in the loop by design: the developer reads every diff before it merges. An autonomous agent does not. It uses tools across many turns, changes state as it goes, and adapts, which means a mistake early in a run compounds through everything after it. No one reads each output before it reaches a customer. So the eval system is not a nice-to-have alongside the quality gate. It is the quality gate.
Two suites, not one
Capability evals ask what the agent can do well. They should start at a low pass rate and target the tasks the agent struggles with, because their job is to show headroom. Regression evals ask whether the agent still handles everything it used to, and they should sit near a hundred percent, because their job is to catch what a prompt change or a model upgrade quietly broke. A platform team maintains both and runs them in CI on every change to a prompt, a tool, or a model version.
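One way to wire the two suites into CI is to let only the regression suite block a change, while the capability suite is reported for visibility. The sketch below is a minimal illustration, not a prescribed setup: `SuiteResult`, `ci_gate`, and the 98 percent threshold are hypothetical names and values, and "near a hundred percent" remains a decision for each team.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty suite counts as 0.0 so it can never pass a gate by accident.
        return self.passed / self.total if self.total else 0.0


# Hypothetical threshold; pick the value that matches your own risk tolerance.
REGRESSION_MIN_PASS_RATE = 0.98


def ci_gate(regression: SuiteResult, capability: SuiteResult) -> tuple[bool, str]:
    """Block the change on regressions; report capability without blocking."""
    ok = regression.pass_rate >= REGRESSION_MIN_PASS_RATE
    report = (
        f"regression {regression.pass_rate:.1%} "
        f"(min {REGRESSION_MIN_PASS_RATE:.0%}), "
        f"capability {capability.pass_rate:.1%} (informational)"
    )
    return ok, report


# Illustrative numbers only.
ok, report = ci_gate(
    regression=SuiteResult("regression", passed=118, total=120),
    capability=SuiteResult("capability", passed=9, total=30),
)
print(report, "-> ship" if ok else "-> block")
```

A low capability score does not fail the build here, because that suite exists to show headroom rather than to guard what already works.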
Three graders, and the thing teams get wrong
Grading splits three ways. Code-based graders handle the deterministic questions: did the agent call the right API, did the database row actually change. Model-based graders, LLM-as-judge, handle the subjective ones: did it follow the tone guidelines, did it explain the error well. Human review calibrates the other two.
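Calibration can be as simple as measuring how often the model-based grader agrees with a human on the same traces. The following is a rough sketch under assumed data shapes: verdicts are plain pass/fail booleans keyed by trace ID, and `judge_agreement`, `disagreements`, and the trace IDs are all hypothetical.

```python
def judge_agreement(judge: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of human-reviewed traces where the model grader matches the human."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no traces were scored by both the judge and a human")
    matches = sum(judge[t] == human[t] for t in shared)
    return matches / len(shared)


def disagreements(judge: dict[str, bool], human: dict[str, bool]) -> list[str]:
    """Trace IDs worth reading by hand, because the two graders disagree."""
    return sorted(t for t in judge.keys() & human.keys() if judge[t] != human[t])


# Illustrative verdicts: True means the grader marked the trace as a pass.
judge_verdicts = {"t1": True, "t2": True, "t3": False, "t4": True}
human_verdicts = {"t1": True, "t2": False, "t3": False}

print(f"agreement: {judge_agreement(judge_verdicts, human_verdicts):.0%}")
print("review:", disagreements(judge_verdicts, human_verdicts))
```

Traces the human never scored, such as `t4` above, are left out of the rate rather than counted as agreement.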
The mistake almost everyone makes is grading the transcript instead of the outcome. A booking agent can end its transcript with “your flight is booked” while no reservation exists in the database. Grade the outcome first, the real final state in the environment, then read the transcript only to understand why it went wrong.
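The booking example can be sketched as a code-based grader that queries the environment and treats the transcript as diagnostic evidence only. This is an illustration under stated assumptions: the `reservations` table, its columns, and the `grade_booking` function are hypothetical, and an in-memory SQLite database stands in for the real system of record.

```python
import sqlite3


def grade_booking(db: sqlite3.Connection, customer_id: str, flight: str,
                  transcript: str) -> dict:
    """Grade on the environment's final state; keep the transcript for diagnosis."""
    row = db.execute(
        "SELECT 1 FROM reservations WHERE customer_id = ? AND flight = ?",
        (customer_id, flight),
    ).fetchone()
    outcome_ok = row is not None
    claimed = "booked" in transcript.lower()
    return {
        "passed": outcome_ok,          # the verdict depends only on the database
        "claimed_success": claimed,    # what the agent told the customer
        "false_claim": claimed and not outcome_ok,
    }


# The agent said it booked the flight, but no row was ever written.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reservations (customer_id TEXT, flight TEXT)")
result = grade_booking(db, "c-42", "XY123", "Your flight is booked.")
print(result)
```

The `false_claim` flag marks the runs worth opening first, since those are the ones where the transcript and the outcome disagree.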
You do not need hundreds of tasks
Twenty to fifty tasks drawn from real failures is enough to start. If you are already in production, your bug tracker and support queue are the best source material there is: every reported failure converts into a test case. Synthetic evals encode what you imagined; production captures what actually happened. The discipline is turning each real catch into a new task.
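Turning a reported failure into a task can be mostly mechanical. The sketch below assumes a ticket exported as a dictionary; the field names (`id`, `user_input`, `expected`, `labels`, `source`) and the `EvalTask` shape are hypothetical and would need mapping to whatever your tracker actually exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    task_id: str
    prompt: str            # the input that triggered the failure
    expected_outcome: str  # what the final state should have been
    source: str            # where the failure was reported
    tags: tuple[str, ...] = ()


def task_from_ticket(ticket: dict) -> EvalTask:
    """Convert one reported failure into a replayable eval task."""
    missing = {"id", "user_input", "expected"} - ticket.keys()
    if missing:
        raise ValueError(f"ticket is missing fields: {sorted(missing)}")
    return EvalTask(
        task_id=f"ticket-{ticket['id']}",
        prompt=ticket["user_input"],
        expected_outcome=ticket["expected"],
        source=ticket.get("source", "bug-tracker"),
        tags=tuple(ticket.get("labels", ())),
    )


# Illustrative ticket, not real data.
task = task_from_ticket({
    "id": 1017,
    "user_input": "Book me on XY123 tomorrow morning",
    "expected": "reservation row exists for flight XY123",
    "labels": ["booking"],
})
print(task.task_id, task.tags)
```

Requiring an expected outcome up front forces each ticket to state what the correct final state would have been, which is the part a grader needs.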
Put the domain expert in the loop
The people who know whether an answer is correct are often not engineers. In a regulated setting it is the policy owner, the underwriter, the clinician. An annotation queue lets those domain experts score production traces directly, and their scores feed new eval tasks. This is the judgment layer made literal: human judgment from the people who own the domain, wired into the production loop rather than collected once at sign-off.
Make the instrumentation the default, not a request
None of this works without traces. Instrument every agent with OpenTelemetry, then remove the friction: ship a thin, framework-agnostic wrapper so any team, on Pydantic AI, LangGraph, or Microsoft Agent Framework, gets tracing, cost, and latency by calling one init function at startup. Then make the governed path the easy path by enforcing it in CI, so a new agent cannot ship without it.
# Fail CI if an agent file skips the observability init.
AGENT_FILES=$(grep -rl \
  "from pydantic_ai\|from langgraph\|from agent_framework" \
  --include="*.py" .)
for f in $AGENT_FILES; do
  if ! grep -q "init_agent_observability" "$f"; then
    echo "ERROR: $f imports an agent framework with no tracing init"
    exit 1
  fi
done
Score cost next to quality
The eval dashboard should show pass rate and cost per run side by side. A model upgrade that lifts quality two percent and triples cost is not obviously a win, and you can only see that trade if the two numbers live together. Quality without a cost column is how budgets get surprised.
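A small sketch of what "side by side" can mean in practice: report the change in pass rate and the change in cost in the same line, plus the cost of one passing run. The `RunSummary` type, the `compare` function, and every number below are hypothetical; the figures simply mirror the two-percent, triple-cost example above.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    label: str
    pass_rate: float         # fraction of eval tasks passed, 0.0 to 1.0
    cost_per_run_usd: float  # average spend per agent run

    @property
    def cost_per_pass_usd(self) -> float:
        # Expected spend to obtain one passing run; infinite if nothing passes.
        if self.pass_rate == 0:
            return float("inf")
        return self.cost_per_run_usd / self.pass_rate


def compare(baseline: RunSummary, candidate: RunSummary) -> str:
    """Summarize quality and cost movement together so neither is read alone."""
    quality_delta_pts = (candidate.pass_rate - baseline.pass_rate) * 100
    cost_ratio = candidate.cost_per_run_usd / baseline.cost_per_run_usd
    return (
        f"{candidate.label} vs {baseline.label}: "
        f"pass rate {quality_delta_pts:+.1f} pts, cost x{cost_ratio:.1f}"
    )


# Illustrative numbers only.
current = RunSummary("current", pass_rate=0.80, cost_per_run_usd=0.10)
upgraded = RunSummary("new-model", pass_rate=0.82, cost_per_run_usd=0.30)

print(compare(current, upgraded))
print(f"cost per passing run: {current.cost_per_pass_usd:.3f} -> "
      f"{upgraded.cost_per_pass_usd:.3f} USD")
```

Cost per passing run is one possible way to fold the two columns into a single figure; whether it is the right one depends on how much a failed run costs you downstream.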
The honest state of it
Only about half of organizations run offline evals, and fewer run online ones, so even a basic CI pipeline with twenty or thirty tasks per agent puts you ahead of most. The tooling is not the hard part. The hard part is the discipline: defining what good actually means for your specific system, and reading the transcripts. The score is a proxy. The definition behind it is the thing.
This is the substance of agent evaluation and observability, the two pillars that decide whether an agent is trusted in production.