How to Evaluate AI Agents

Evaluating an AI agent means measuring whether it completes its intended task reliably, safely, and within acceptable cost and latency bounds — not just whether it produces output. Most teams check whether an agent "works" during development, but skip defining what "works" means in quantitative terms. Without measurable thresholds, you cannot compare agent versions, catch regressions before they reach production, or make a deployment decision you can defend.

This guide covers the 7 core metrics for agent evaluation, how to measure each one, what thresholds to apply, and a step-by-step methodology for running an evaluation before deployment.


Why Evaluation Matters

An agent that performs well on your test scenarios may still fail in production. The reasons are predictable: test sets are too small, edge cases are underrepresented, adversarial inputs are not included, and the criteria for "task complete" are never written down. Structured evaluation does not eliminate failure — but it reduces the probability of preventable failures and creates a documented basis for the deployment decision.

Evaluation also enables iteration. Without a baseline measurement, you cannot tell whether a system prompt change improved performance or made it worse.


Two Levels of Evaluation

Agent evaluation operates at two levels, and both are required:

Task-level evaluation asks whether the agent completed one specific task correctly. This is the question you answer for each item in your test set: did the agent do what it was supposed to do, or not?

System-level evaluation asks how the agent performs across all runs — across your whole test set and in production. This is where you compute rates, percentiles, and cost metrics.

Task-level evaluation requires human judgment. System-level evaluation can be partially automated once you have consistent data from individual runs.


The 7 Core Metrics

1. Successful task completion rate

Definition: The percentage of runs where the agent completed the intended task — not just produced output, but achieved the goal the task required.

The distinction matters. An agent that produces a long, confident-sounding response that does not actually answer the question, or that calls the right tools in the wrong order, has produced output but has not completed the task.

How to measure: Human reviewers rate a sample of 50–100 runs as "task complete" or "task incomplete." Define what "complete" means before scoring — not after.

Thresholds:

  • Below 70%: do not deploy
  • 70–85%: conditional deployment; approval gates required on all consequential actions
  • Above 85%: production-ready

In ProvenanceOne: Run status (succeeded / failed / canceled) captures execution-level outcomes. Task completion requires an additional layer of manual review of run outputs, using the run debugger to inspect each step's output payload.


2. Critical safety failure rate

Definition: The percentage of runs where the agent attempted a prohibited action, produced harmful output, or bypassed a required approval step.

This metric has a binary threshold: any non-zero rate is a deployment blocker. Safety failures are not weighed against performance — they are absolute disqualifiers.

How to measure: Review your test run outputs for any of the following: tool calls to tools the agent should not have access to, outputs containing prohibited content, or audit log entries showing approval bypass attempts.

Threshold: 0% required before any production deployment.

In ProvenanceOne: Approval bypass attempts appear in the audit log as authz.denied events. Tool call records in each run step's ToolCalls field show every tool invoked and its parameters.


3. Tool-use accuracy

Definition: The percentage of tool calls where the correct tool was selected and called with correct parameters.

An agent that selects the wrong tool, or selects the right tool but passes malformed or incorrect parameters, will fail at the task even if its reasoning was otherwise sound. Tool-use accuracy is a diagnostic metric: low scores point to system prompt issues or tool description quality problems.

How to measure: Review the ToolCalls field in run step outputs for a sample of runs. For each tool call, assess: was this the right tool for this step? Were the parameters correct?

Thresholds:

  • Below 80%: agent needs system prompt tuning before evaluation continues
  • 80–95%: acceptable with review
  • Above 95%: strong performance

In ProvenanceOne: Every tool call is recorded in the run step debugger, including the tool name, input parameters, and response. This data is available for every run without additional instrumentation.


4. Escalation precision and recall

Definition: How accurately the agent escalates to human review — escalating when it should and not escalating when it should not.

Escalation accuracy has two dimensions that must be measured separately:

  • Recall (did the agent escalate cases that needed escalation?): more important for high-risk tasks. A missed escalation on a high-risk case is worse than an unnecessary escalation.
  • Precision (of the cases the agent escalated, how many actually needed escalation?): measures whether the agent is generating excessive approval burden.

How to measure: Identify the cases in your test set that should have triggered escalation. Compute recall as (correctly escalated / should have been escalated). Compute precision as (correctly escalated / total escalated).

Thresholds:

  • Recall above 95% for high-risk tasks; above 85% for medium-risk
  • Precision above 80% (lower precision means unnecessary escalation overhead)

In ProvenanceOne: Approval events are recorded in the audit log. Track approval.granted and approval.rejected rates against the expected escalation pattern from your test set design.


5. Cost per successful task

Definition: Total cost (LLM token cost plus skill execution cost) divided by the number of successfully completed tasks.

Cost per task is not a deployment blocker on its own, but it is essential for sustainability analysis. An agent that costs $0.50 per completed task may be acceptable for high-value tasks and completely unacceptable for commodity automation.

How to measure: Sum the costUsd field across all runs in the evaluation period. Divide by the count of runs rated as "task complete."

Threshold: Context-dependent. Track trend over time; watch for cost increases when system prompts or model versions change.

In ProvenanceOne: The costUsd field is recorded on every completed run. No additional instrumentation required.


6. Human override rate

Definition: The percentage of agent outputs that a human reviewer rejects or substantially modifies before the output takes effect.

Override rate is the clearest signal from humans in the loop about whether the agent is producing work they trust. It combines correctness, tone, formatting, and judgment into a single behavioral metric.

How to measure: Track approval rejection rates from the audit log (approval.rejected events). Additionally, track cases where an approver edits the payload substantially before approving. Both represent overrides.

Thresholds:

  • Above 30%: agent needs significant rework; do not expand deployment
  • 10–30%: acceptable for early-stage deployment; treat as an iteration signal
  • Below 10%: strong performance; consider increasing agent trust level

In ProvenanceOne: approval.rejected events in the audit log are the primary source. Payload edits are visible in the approval history for each run.


7. Latency (p50 and p95)

Definition: End-to-end time from workflow trigger to final output, measured at the 50th and 95th percentiles.

p50 latency tells you what a typical run looks like. p95 latency tells you about the tail — the slow runs that users notice. Both matter.

How to measure: Use the durationMs field on run records. Compute p50 and p95 across all runs in the evaluation period.

Thresholds:

  • Real-time customer-facing workflows: p95 below 10 seconds
  • Internal async enrichment workflows: p95 below 5 minutes
  • Background batch workflows: define based on business SLA

In ProvenanceOne: durationMs is recorded on every Run and RunStep record. No additional instrumentation required.


Evaluation Methodology: Step by Step

Step 1: Build a representative test set

Construct a test set of 50–100 tasks with known correct outcomes. Include:

  • Typical cases that represent the majority of real inputs
  • Edge cases: unusual inputs, incomplete data, ambiguous instructions
  • Adversarial cases: inputs designed to push the agent outside its intended scope

Aim for at least 20% adversarial or edge cases. A test set of only happy-path examples will produce an inflated score that does not predict production behavior.

Step 2: Define success criteria before scoring

Write down what "task complete" means for your use case before you run a single test. If you define success after seeing the agent's outputs, you will unconsciously calibrate the definition to what the agent actually does rather than what it should do.

Step 3: Run the agent on all test cases

Execute the full test set as workflow runs. Do not intervene during the runs.

Step 4: Score each case against the rubric

For each run, record: task complete (yes/no), safety failure (yes/no), tool-use accuracy assessment, escalation decision (correct/incorrect/unnecessary), and any notes. Use the Agent Evaluation Rubric as a structured scorecard.

Step 5: Compute system-level metrics

Aggregate the per-run scores to compute the seven metrics described above. Calculate rates, not counts.

Step 6: Apply the binary safety gate

If the critical safety failure rate is non-zero, stop. Do not proceed to composite scoring. Fix the underlying issue (system prompt scope, tool access, approval configuration) and re-run the evaluation.

Step 7: Make the go/no-go decision

Apply the thresholds for each metric. If all metrics meet their thresholds and the safety gate passes, the agent is ready for the deployment level indicated. If any metric falls short, document which metric and why, and return to the agent configuration.


Red-Teaming Your Agent

Before finalizing the evaluation, run a structured adversarial test:

  • Forbidden tool test — give the agent inputs designed to make it attempt to call a tool it should not have access to. Verify it stays within its permitted tool set.
  • System prompt override test — give the agent conflicting instructions in the input, designed to override its system prompt. Verify it holds its defined behavior.
  • Empty and malformed input test — give the agent empty inputs, null fields, and inputs with unexpected structure. Verify it fails gracefully rather than hallucinating a plausible-looking response.
  • Escalation trigger test — give the agent inputs that should trigger escalation to a human. Verify it escalates every time, not most of the time.

Red-team results feed directly into the critical safety failure rate metric. Any prohibited-tool attempt or system prompt override is a safety failure.


Automated vs. Human Evaluation

MethodScaleSpeedDepthCost
Human review (100% sample)LowSlowHighHigh
Human review (sampled)MediumMediumHighMedium
Automated metric trackingHighFastLimitedLow
LLM-as-judgeHighFastMediumMedium
Combined (recommended)HighFast + slow blendHighMedium

Recommended approach: Use automated metric tracking (completion rate from run status, cost, latency, escalation rate) for continuous monitoring in production. Use human review on a statistically meaningful sample for pre-deployment evaluation and after significant changes to the agent configuration. LLM-as-judge is useful for supplementing human review at scale, but should not replace it for safety-critical dimensions.


Common Mistakes When Evaluating Agents

Evaluating only happy-path cases. If your test set contains only clean, well-formed inputs that the agent handles well in development, your evaluation score will not predict production behavior. Production inputs are messier.

Using a test set that overlaps with the agent's development data. If the test cases are inputs the agent has been tuned or prompted against, the evaluation is not a fair measure of generalization. Build your test set independently.

Conflating "agent produced output" with "agent completed the task." An agent can produce output — sometimes confident, well-formatted output — while failing to achieve the task. Define and measure task completion separately from output production.

Not measuring critical safety failures. Teams often evaluate task success, latency, and cost but skip safety evaluation because it is harder to structure. Safety failures are the most important thing to catch before deployment.

Setting thresholds based on what the agent achieves today, not what the use case requires. If the use case requires 90% task completion and your agent achieves 75%, the answer is not to set the threshold at 75%. The threshold should reflect what is acceptable for the task, not what is achievable right now.


How ProvenanceOne Helps

ProvenanceOne surfaces the data needed for most of these metrics without additional instrumentation. Every run records costUsd, durationMs, step-level ToolCalls, and outcome status. The audit log captures approval events and authorization denials. The run debugger lets you inspect every step's input, output, and tool calls for manual review. The Agent Evaluation Rubric template provides a structured scorecard that maps to the metrics above.


FAQ

How many test cases do I need for a meaningful evaluation?

A minimum of 50 cases; 100–200 is better for any claim about success rates. Below 50, a small number of failures or successes can swing your percentages enough to make the results unreliable. For agents in regulated domains or high-risk workflows, use 200+ cases and have a second reviewer score a subset to check inter-rater reliability.

Do I need to re-evaluate the agent after changing the system prompt?

Yes. A new system prompt is a new agent configuration. Performance characteristics can change significantly even from small prompt changes. Re-run the evaluation — at minimum the critical safety gate and task completion rate — before deploying the updated agent.

What is the difference between override rate and task completion rate?

Task completion rate measures whether the agent achieved the goal. Override rate measures whether the human reviewers trusted the agent's output enough to let it through unchanged. An agent can complete tasks at 90% but have a 25% override rate if reviewers frequently edit the output before approving. Both metrics provide different information.

How do I measure tool-use accuracy at scale?

In ProvenanceOne, the ToolCalls field on every run step records the tool name, input parameters, and response for every call the agent made. For systematic measurement, build a scoring function that compares the recorded tool calls against the expected tool calls for each test case. For initial evaluation, manual review of 50–100 sampled runs is sufficient.

Is LLM-as-judge reliable for evaluating agents?

It is useful at scale for dimensions like output quality and tone, but has known failure modes: models tend to prefer longer outputs, outputs that match their own style, and outputs that avoid hedging. Use LLM-as-judge to supplement human review, not replace it, and calibrate the judge model's assessments against a human-reviewed ground truth set before relying on it.

What should I do if my agent passes all thresholds except latency?

Evaluate whether the latency is acceptable for the specific use case. A background data enrichment workflow may tolerate p95 of 3 minutes; a customer-facing chat interface cannot. If latency is unacceptable, investigate which steps are slowest using run step durationMs data, then consider model selection, prompt length reduction, or parallelizing independent steps.

How do I know if my test set is adversarial enough?

A useful check: give the agent inputs that are specifically designed to make it fail. If the agent passes every adversarial case in your test set, either your agent is very robust or your adversarial cases are not sufficiently challenging. Try prompt injection (instructions in the input designed to override the system prompt), missing required fields, and inputs that should trigger escalation but are phrased to discourage it.

What is a reasonable override rate target for a newly deployed agent?

For a newly deployed agent, a 10–30% override rate is acceptable and expected. Agents typically improve as the team refines the system prompt based on patterns in what reviewers are correcting. A rate above 30% suggests the agent needs significant rework before expanding deployment. A rate below 10% suggests strong performance and may justify increasing the agent's trust level.


  • Agent Evaluation Rubric — structured 9-dimension scorecard with deployment thresholds
  • What Is Agentic AI? — foundational concepts before setting up evaluation
  • Agents — ProvenanceOne agent configuration, trust levels, and performance metrics
  • Workflow Runs — run records including costUsd, durationMs, and step-level output
  • Audit Log — immutable log of approval events, authorization denials, and tool calls
  • System Prompt Template — structured template for writing testable system prompts