2/5/2026 · 5 min read
# How to Evaluate AI Agents: The Missing Guide for Product Engineers
**Meta description:** Learn how to evaluate AI agents with practical frameworks, scoring methodologies, and code examples. The complete guide to AI agent evals for product engineers.
**SEO targets:** how to evaluate AI agents, AI agent evals, LLM evaluation framework
---
Everyone's shipping AI agents. Almost nobody is evaluating them properly.
You've seen the pattern: a team builds a customer support agent, demos it to leadership, gets applause, ships it to production, and three weeks later users are complaining about hallucinated refund policies and conversations that go in circles.
The gap isn't in *building* agents. It's in *evaluating* them.
This guide is the missing manual. Whether you're a product engineer, an AI PM, or someone who just inherited an agent that's already in production, this is how you build an evaluation system that actually works.
## Why Agent Evals Are Different from Model Evals
If you've worked with LLMs, you've probably seen benchmarks like MMLU, HELM, or Chatbot Arena. These are **model-level evaluations**: they tell you whether GPT-4o is generally smarter than Claude 3.5 Sonnet at reasoning tasks.
Agent evals are fundamentally different. Here's why:
| Dimension | Model Evals | Agent Evals |
|-----------|------------|-------------|
| What you're testing | Raw capability | End-to-end behavior |
| Input/Output | Prompt → Completion | Goal → Multi-step outcome |
| Determinism | Mostly deterministic | Highly stochastic |
| Scope | Single turn | Multi-turn, tool use, memory |
| Failure modes | Wrong answer | Wrong action, loops, hallucinated tool calls |
An agent might use the right model but still fail because of bad tool orchestration, poor retrieval, or broken memory management. Model benchmarks won't catch any of that.
## The Three Layers of Agent Evaluation
Think of agent evals as a pyramid:
### Layer 1: Component Evals (Unit Tests for AI)
Test each piece in isolation:
- **LLM quality:** Is the base model producing good completions for your prompts?
- **Retrieval quality:** Is your RAG pipeline returning relevant documents?
- **Tool accuracy:** When the agent calls a function, does it pass correct parameters?
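To make the tool-accuracy check concrete, here's a minimal sketch of a scorer. The 2/1/0 scale and the exact-sequence rule are illustrative choices, not the only reasonable ones; adapt them to how strict your agent's tool ordering needs to be.

```python
def score_tool_selection(actual_tools: list[str], expected_tools: list[str]) -> int:
    """Score how well the agent's tool calls match the expected ones.

    2 = exact sequence match, 1 = all expected tools called but with extras
    or in the wrong order, 0 = at least one required tool was never called.
    """
    if actual_tools == expected_tools:
        return 2  # exact sequence match
    if set(expected_tools).issubset(set(actual_tools)):
        return 1  # right tools present, but extras or wrong order
    return 0      # missing at least one required tool
```

For example, an agent that calls `lookup_account` then `check_billing_history` when only `check_billing_history` was expected gets a partial score: it found the answer, but took a detour.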
### Layer 2: Trajectory Evals (Integration Tests for AI)
Test the agent's decision-making path:
- Did the agent take the right *sequence* of steps?
- Did it use the right tools in the right order?
- Did it know when to ask for clarification vs. when to act?
### Layer 3: Outcome Evals (End-to-End Tests for AI)
Test whether the agent achieved the goal:
- Did the customer's issue get resolved?
- Was the final output correct and complete?
- How long did it take? How many steps?
Most teams only do Layer 3 (if they evaluate at all). The magic is in combining all three.
## A Practical Example: Evaluating "SupportBot"
Let's make this concrete. You're the PM for **SupportBot**, a customer support AI agent at a B2B SaaS company. SupportBot handles:
- Account questions ("What plan am I on?")
- Billing issues ("Why was I charged twice?")
- Feature requests ("Can you add dark mode?")
- Bug reports ("The export button is broken")
Here's how to build a real eval system for it.
### Step 1: Build Your Eval Dataset
You need test cases. Not 5. Not 50. **At minimum 200**, spread across your agent's capabilities.
```python
# eval_dataset.py
eval_cases = [
    {
        "id": "billing_001",
        "category": "billing",
        "input": "I was charged $99 but I'm on the free plan",
        "expected_tools": ["lookup_account", "check_billing_history"],
        "expected_behavior": "Verify account status, check for billing discrepancy, escalate if confirmed",
        "expected_outcome": "Agent identifies billing error and initiates refund OR correctly explains charge",
        "golden_response_keywords": ["billing", "account", "refund OR charge explanation"],
        "difficulty": "medium"
    },
    {
        "id": "account_001",
        "category": "account",
        "input": "What plan am I on and when does it renew?",
        "expected_tools": ["lookup_account"],
        "expected_behavior": "Look up account, return plan name and renewal date",
        "expected_outcome": "Correct plan name and exact renewal date",
        "golden_response_keywords": ["plan", "renewal", "date"],
        "difficulty": "easy"
    },
    # ... 198 more cases
]
```
**Pro tip:** Seed your dataset from real conversations. Pull 500 actual support tickets, categorize them, and turn the best examples into eval cases. This is 10x more valuable than synthetic data.
### Step 2: Define Your Scoring Rubric
You need metrics that actually mean something. Here's the rubric I recommend:
```python
# scoring.py
from dataclasses import dataclass
from enum import Enum

class Score(Enum):
    FAIL = 0
    PARTIAL = 1
    PASS = 2

@dataclass
class EvalResult:
    case_id: str

    # Layer 1: Component scores
    retrieval_relevance: float      # 0-1, were the right docs fetched?
    tool_selection_accuracy: Score  # Did it pick the right tools?

    # Layer 2: Trajectory scores
    step_efficiency: float          # optimal_steps / actual_steps
    no_hallucinated_actions: bool   # Did it call tools that don't exist?
    appropriate_escalation: bool    # Did it escalate when it should have?

    # Layer 3: Outcome scores
    task_completed: bool
    response_quality: float         # 0-1, LLM-as-judge score
    factual_accuracy: Score         # Did it state correct facts?
    tone_appropriate: bool

    @property
    def composite_score(self) -> float:
        weights = {
            'factual_accuracy': 0.25,
            'task_completed': 0.25,
            'response_quality': 0.20,
            'tool_selection_accuracy': 0.15,
            'step_efficiency': 0.10,
            'tone_appropriate': 0.05,
        }
        raw = (
            (self.factual_accuracy.value / 2) * weights['factual_accuracy'] +
            float(self.task_completed) * weights['task_completed'] +
            self.response_quality * weights['response_quality'] +
            (self.tool_selection_accuracy.value / 2) * weights['tool_selection_accuracy'] +
            self.step_efficiency * weights['step_efficiency'] +
            float(self.tone_appropriate) * weights['tone_appropriate']
        )
        return round(raw, 3)
```
### Step 3: Implement LLM-as-Judge
For subjective metrics (response quality, tone), use another LLM as a judge. This is the most scalable approach, but it needs calibration.
```python
# llm_judge.py
import json
from openai import OpenAI
client = OpenAI()
JUDGE_PROMPT = """You are evaluating an AI customer support agent's response.
## Context
Customer query: {query}
Agent response: {response}
Expected behavior: {expected}
## Evaluation Criteria
Rate each dimension from 0.0 to 1.0:
1. **Helpfulness**: Did the response address the customer's actual need?
2. **Accuracy**: Are all stated facts correct? (0.0 if any hallucination detected)
3. **Completeness**: Did it cover everything needed, without over-explaining?
4. **Tone**: Professional, empathetic, appropriate for the situation?
5. **Actionability**: Does the customer know exactly what to do next?
Return JSON:
{{"helpfulness": 0.0-1.0, "accuracy": 0.0-1.0, "completeness": 0.0-1.0, "tone": 0.0-1.0, "actionability": 0.0-1.0, "reasoning": "brief explanation"}}
"""
def judge_response(query: str, response: str, expected: str) -> dict:
    result = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                query=query, response=response, expected=expected
            )
        }],
        temperature=0.1  # Low temp for consistency
    )
    return json.loads(result.choices[0].message.content)
```
**Critical:** Calibrate your judge. Run it against 50 cases you've manually scored, and check the correlation. If the LLM judge disagrees with humans more than 20% of the time, your judge prompt needs work.
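The calibration step itself is simple enough to automate. Here's an illustrative helper that compares judge scores against human scores case-for-case; the 0.2 tolerance for "disagreement" is an assumption you should tune to your rubric's granularity.

```python
def judge_disagreement_rate(human: list[float], judge: list[float],
                            tol: float = 0.2) -> float:
    """Fraction of cases where the LLM judge's score differs from the
    human score by more than `tol` (both scores on a 0-1 scale)."""
    assert len(human) == len(judge), "score lists must align case-for-case"
    disagreements = sum(1 for h, j in zip(human, judge) if abs(h - j) > tol)
    return disagreements / len(human)
```

If this comes back above 0.2 on your manually scored set, rework the judge prompt (clearer criteria, few-shot examples of each score band) before trusting it at scale.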
### Step 4: Build the Eval Harness
Now wire it all together:
```python
# eval_harness.py
import asyncio
from datetime import datetime
from typing import List
from your_agent import SupportBot # your actual agent
async def run_eval_suite(
    agent: SupportBot,
    cases: List[dict],
    run_id: str | None = None,
) -> dict:
    run_id = run_id or f"eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    results = []

    for case in cases:
        # Run the agent
        trace = await agent.run(
            message=case["input"],
            trace_enabled=True  # Capture tool calls, intermediate steps
        )

        # Score components
        tool_accuracy = score_tool_selection(
            actual_tools=trace.tools_called,
            expected_tools=case["expected_tools"]
        )

        # LLM judge for quality
        judge_scores = judge_response(
            query=case["input"],
            response=trace.final_response,
            expected=case["expected_outcome"]
        )

        # Check for hallucinated tool calls
        valid_tools = agent.get_available_tools()
        hallucinated = any(t not in valid_tools for t in trace.tools_called)

        result = EvalResult(
            case_id=case["id"],
            retrieval_relevance=score_retrieval(trace),
            tool_selection_accuracy=tool_accuracy,
            step_efficiency=len(case["expected_tools"]) / max(len(trace.tools_called), 1),
            no_hallucinated_actions=not hallucinated,
            appropriate_escalation=check_escalation(trace, case),
            task_completed=judge_scores["completeness"] > 0.7,
            response_quality=judge_scores["helpfulness"],
            factual_accuracy=Score.PASS if judge_scores["accuracy"] > 0.9 else Score.FAIL,
            tone_appropriate=judge_scores["tone"] > 0.7,
        )
        results.append(result)

    # Aggregate
    avg_score = sum(r.composite_score for r in results) / len(results)
    pass_rate = sum(1 for r in results if r.composite_score > 0.7) / len(results)

    return {
        "run_id": run_id,
        "total_cases": len(results),
        "avg_composite_score": round(avg_score, 3),
        "pass_rate": f"{pass_rate:.1%}",
        "by_category": aggregate_by_category(results, cases),
        "worst_cases": sorted(results, key=lambda r: r.composite_score)[:10],
        "results": results,
    }
```
### Step 5: Run It in CI/CD
The eval is only useful if it runs automatically. Here's a basic GitHub Actions setup:
```yaml
# .github/workflows/agent-eval.yml
name: Agent Eval Suite
on:
  pull_request:
    paths:
      - 'agent/**'
      - 'prompts/**'
      - 'tools/**'
  schedule:
    - cron: '0 6 * * *'  # Daily at 6am UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python -m evals.run --suite full --output results.json
      - name: Check regression
        run: |
          python -m evals.check_regression \
            --current results.json \
            --baseline evals/baseline.json \
            --max-regression 0.05
      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## 🤖 Agent Eval Results
            | Metric | Score |
            |--------|-------|
            | Composite | ${results.avg_composite_score} |
            | Pass Rate | ${results.pass_rate} |
            | Factual Accuracy | ${results.factual_accuracy_rate} |
            | Regression | ${results.regression_detected ? '⚠️ YES' : '✅ None'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
```
## Offline vs. Online Evals
Everything above is **offline evaluation**: you run it against a fixed dataset before deployment. But you also need **online evaluation**: monitoring the agent in production.
### Offline Evals (Pre-deployment)
- ✅ Controlled, reproducible
- ✅ Can test edge cases and adversarial inputs
- ✅ Blocks bad changes from shipping
- ❌ Doesn't capture real user behavior
- ❌ Dataset may not reflect production distribution
### Online Evals (Post-deployment)
- ✅ Real user interactions
- ✅ Catches distribution shift and novel inputs
- ✅ Measures actual business outcomes
- ❌ By definition, users see failures
- ❌ Harder to attribute regressions
**You need both.** The ratio depends on your risk tolerance. Customer support? Heavy offline evals + conservative deploy gates. Internal productivity tool? Lighter offline, heavier online monitoring.
### Online Monitoring Essentials
```typescript
// monitor.ts - lightweight production eval
// (ToolCall, Alert, filterByWindow, percentile, and detectHallucination
// are defined elsewhere in your codebase)

interface AgentTrace {
  sessionId: string;
  query: string;
  response: string;
  toolCalls: ToolCall[];
  latencyMs: number;
  tokenCount: number;
  escalatedToHuman: boolean;  // needed for the escalation-rate check below
  timestamp: Date;
}

interface QualitySignals {
  // Implicit signals (no user effort)
  conversationLength: number;  // Long = possibly struggling
  toolCallFailures: number;    // Failed API calls
  selfCorrections: number;     // Agent contradicted itself
  escalatedToHuman: boolean;   // Had to bail out

  // Explicit signals (user feedback)
  thumbsUp: boolean | null;
  csatScore: number | null;    // 1-5
}

function detectAnomalies(traces: AgentTrace[], window: string = '1h'): Alert[] {
  const alerts: Alert[] = [];
  const recent = filterByWindow(traces, window);

  // Hallucination proxy: response references tools/data the agent doesn't have
  const hallRate = recent.filter(t => detectHallucination(t)).length / recent.length;
  if (hallRate > 0.05) {
    alerts.push({ severity: 'high', type: 'hallucination_spike', rate: hallRate });
  }

  // Latency regression
  const p95Latency = percentile(recent.map(t => t.latencyMs), 95);
  if (p95Latency > 10000) {
    alerts.push({ severity: 'medium', type: 'latency_regression', p95: p95Latency });
  }

  // Escalation rate spike
  const escalationRate = recent.filter(t => t.escalatedToHuman).length / recent.length;
  if (escalationRate > 0.3) {
    alerts.push({ severity: 'high', type: 'escalation_spike', rate: escalationRate });
  }

  return alerts;
}
```
## Hallucination Monitoring
Hallucinations are the #1 risk in production agents. Three practical approaches:
### 1. Factual Grounding Check
After the agent responds, run a verification pass: does every factual claim in the response trace back to a retrieved document or tool output?
### 2. Consistency Check
Ask the same question 3 times with slight paraphrasing. If the agent gives materially different answers, something's wrong.
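The comparison step can start out embarrassingly simple. This sketch flags a case when any pair of answers to paraphrased questions shares too little vocabulary (Jaccard overlap on word sets); in practice you'd compare with embeddings or an LLM judge, and the 0.5 threshold is a placeholder to tune against labeled examples.

```python
from itertools import combinations

def answers_consistent(answers: list[str], threshold: float = 0.5) -> bool:
    """True if every pair of answers shares at least `threshold` word overlap
    (Jaccard similarity on lowercase word sets). A crude consistency proxy."""
    def words(text: str) -> set[str]:
        return set(text.lower().split())

    for a, b in combinations(answers, 2):
        overlap = len(words(a) & words(b)) / len(words(a) | words(b))
        if overlap < threshold:
            return False  # at least one pair of answers diverges materially
    return True
```

The point is the workflow, not the metric: rephrase, re-ask, compare. Even the crudest comparator catches the worst case, where the agent confidently gives two contradictory answers to the same question.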
### 3. Entailment Scoring
Use an NLI (Natural Language Inference) model to check if the agent's response is *entailed by* its retrieved context. If the response says things the context doesn't support, flag it.
```python
# hallucination_check.py
from openai import OpenAI

client = OpenAI()

def check_grounding(response: str, sources: list[str]) -> float:
    """Score 0-1 for how well the response is grounded in sources."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Given these source documents:
{chr(10).join(sources)}

And this agent response:
{response}

Score from 0.0 to 1.0: what fraction of claims in the response
are directly supported by the sources?

Return just the number."""
        }],
        temperature=0
    )
    return float(result.choices[0].message.content.strip())
```
## Model Drift Detection
Your agent will degrade over time, even if you change nothing. Why?
- **Model provider updates:** OpenAI/Anthropic quietly update models
- **Data drift:** User questions shift (seasonal, product changes)
- **Context drift:** Your RAG documents get stale
### Canary Eval Pattern
Run a fixed set of 50 "canary" test cases every day. Track scores over time. If the 7-day moving average drops by more than 5%, trigger an alert.
```python
# canary.py
import json
from datetime import date

# run_single_eval, load_history, and send_alert are assumed to exist
# in your eval harness and alerting setup

def run_canary(canary_cases: list, history_file: str = "canary_history.jsonl"):
    today_scores = []
    for case in canary_cases:
        result = run_single_eval(case)
        today_scores.append(result.composite_score)
    avg = sum(today_scores) / len(today_scores)

    # Append to history
    with open(history_file, "a") as f:
        f.write(json.dumps({"date": str(date.today()), "avg_score": avg}) + "\n")

    # Check 7-day trend
    history = load_history(history_file)
    if len(history) >= 7:
        recent_avg = sum(h["avg_score"] for h in history[-7:]) / 7
        baseline = history[-30:-7]
        baseline_avg = sum(h["avg_score"] for h in baseline) / max(len(baseline), 1)
        if baseline_avg - recent_avg > 0.05:
            send_alert(f"⚠️ Agent quality regression: {baseline_avg:.3f} → {recent_avg:.3f}")
```
## The Eval Frameworks Landscape
Here's what's out there and when to use each:
| Framework | Best For | Approach |
|-----------|----------|----------|
| **Chatbot Arena (LMSYS)** | Comparing base models | Crowdsourced human preference (Elo ratings) |
| **HELM** (Stanford) | Holistic model assessment | Multi-metric benchmark across scenarios |
| **DeepEval** | Agent & LLM eval in CI/CD | Python framework, LLM-as-judge, 14+ metrics |
| **LangSmith** | LangChain-based agents | Tracing + eval built into LangChain ecosystem |
| **Langfuse** | Open-source observability | Tracing, scoring, prompt management |
| **Braintrust** | Production LLM apps | Logging, eval, prompt playground |
| **Opik** (Comet) | Open-source LLM eval | Tracing, automated scoring, CI integration |
**My recommendation for most teams:** Start with DeepEval or Langfuse for structure, but build your domain-specific scoring rubric from scratch. No off-the-shelf framework knows that "SupportBot should never promise a refund without checking the billing system first."
## Regression Testing for AI: The Non-Obvious Parts
Traditional regression testing is binary: it works or it doesn't. AI regression testing is probabilistic. Here's how to handle it:
### 1. Statistical Significance
Don't fail a PR because one score dipped by 0.01. Use a paired t-test or bootstrap confidence interval to determine if the regression is statistically significant.
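A bootstrap is often the easiest version of this to implement, since it makes no normality assumptions about your score distribution. Here's a hedged sketch: resample the paired per-case differences and fail the check only when the entire 95% confidence interval for the mean difference sits below zero. The 2,000 resamples and 95% level are conventional defaults, not requirements.

```python
import random

def regression_is_significant(baseline: list[float], current: list[float],
                              n_boot: int = 2000, seed: int = 0) -> bool:
    """True if the current run is significantly worse than baseline, judged by
    a bootstrap 95% CI on the mean of paired per-case score differences."""
    rng = random.Random(seed)  # seeded so CI results are reproducible
    diffs = [c - b for b, c in zip(baseline, current)]

    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()

    hi = means[int(0.975 * n_boot)]  # upper bound of the 95% CI
    # Significant regression only if even the optimistic bound is below zero
    return hi < 0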
### 2. Category-Level Regression
Your overall score might stay flat while one category tanks. Always break results down by category, difficulty level, and tool dependency.
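The per-category rollup is a few lines; the field names here (`category`, `score`) are illustrative and should match whatever your eval results actually carry.

```python
from collections import defaultdict

def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average composite score per category, so a collapse in one
    category can't hide behind a flat overall average."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["score"])
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}
```

Compare each category against its own baseline, not against the global one; a billing category that drops from 0.9 to 0.6 deserves an alert even if account questions improved enough to keep the overall number flat.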
### 3. The Baseline Problem
What's "good enough"? Set your baseline from a human-evaluated golden set. Have 3 humans rate 100 agent responses, take the average โ that's your target. If the agent scores within 0.05 of human quality, it ships.
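As a sketch, the gate described above reduces to a few lines, assuming scores on a 0-1 scale and one inner list of human ratings per response:

```python
def meets_human_baseline(human_ratings: list[list[float]],
                         agent_scores: list[float],
                         margin: float = 0.05) -> bool:
    """True if the agent's average score is within `margin` of the
    human baseline (mean of per-response human rating averages)."""
    human_avg = sum(sum(r) / len(r) for r in human_ratings) / len(human_ratings)
    agent_avg = sum(agent_scores) / len(agent_scores)
    return agent_avg >= human_avg - margin
```

Recompute the human baseline periodically; as your product and support policies change, a golden set rated a year ago stops being the right bar.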
### 4. Version Pinning
Always pin your eval against a specific model version, tool version, and prompt version. When something regresses, you need to know *which* change caused it.
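One lightweight way to make "which change caused it?" answerable is to stamp every eval run with a content hash of the prompt plus explicit model and tool versions. This is a sketch with hypothetical field names; the important habit is recording the metadata at all.

```python
import hashlib

def eval_run_metadata(model: str, prompt_text: str,
                      tool_versions: dict[str, str]) -> dict:
    """Metadata to store alongside every eval run, so regressions can be
    attributed to a specific model, prompt, or tool change."""
    prompt_hash = hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
    return {
        "model": model,             # pin a dated snapshot, not a floating alias
        "prompt_sha": prompt_hash,  # changes whenever the prompt text changes
        "tools": dict(sorted(tool_versions.items())),
    }
```

When a canary dips, diff the metadata of the last green run against the first red one and the culprit is usually obvious.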
## Putting It All Together: The Eval Maturity Model
**Level 0 – Vibes:** "It seems to work well in demos." (Most teams are here.)
**Level 1 – Manual Spot Checks:** Someone reviews 20 conversations per week.
**Level 2 – Automated Offline Evals:** Eval suite runs in CI, blocks regressions.
**Level 3 – Online Monitoring:** Production quality signals, drift detection, alerting.
**Level 4 – Continuous Improvement:** Eval results feed back into training data, prompt optimization, and product decisions. The eval system itself gets evaluated.
Most teams should aim for Level 2 within the first month of shipping an agent, and Level 3 within three months. Level 4 is where the best AI teams operate.
## The Bottom Line
Building an AI agent without evals is like launching a product without analytics. You're flying blind, and the first time you notice a problem is when a customer complains.
The good news: you don't need to boil the ocean. Start with 50 test cases, a simple scoring rubric, and a daily canary eval. That alone puts you ahead of 90% of teams shipping agents today.
The teams that master evals will ship better agents, faster, with fewer production fires. That's not a prediction; it's already happening at the companies that take this seriously.
---
*Building AI products and want more practical guides like this? Subscribe to the PMtheBuilder newsletter for weekly frameworks and templates.*
---
**Related reading:**
- Aakash Gupta, ["The One Skill Every AI PM Needs"](https://www.news.aakashg.com/p/ai-evals), an excellent overview of eval types for PMs
- Stanford HELM: [crfm.stanford.edu/helm](https://crfm.stanford.edu/helm/)
- DeepEval docs: [deepeval.com](https://deepeval.com)
- Chatbot Arena: [lmarena.ai](https://lmarena.ai)