Commit 265bfbd: updated guides
1 parent cd3c09b

13 files changed
Lines changed: 1483 additions & 4 deletions

docs/content/guides/guides-ai-agent-evaluation-metrics.mdx

Lines changed: 91 additions & 0 deletions
@@ -428,6 +428,97 @@ Not every agent needs every metric. Here's a decision framework:
All AI agent evaluation metrics in `deepeval` support custom LLM judges, configurable thresholds, strict mode for binary scoring, and detailed reasoning explanations. See each metric's documentation for full configuration options.
:::

## FAQs

<FAQs
  qas={[
    {
      question: "What metrics does DeepEval provide for AI agents?",
      answer: (
        <>
          DeepEval ships agent metrics across three layers: reasoning (
          <code>PlanQualityMetric</code>, <code>PlanAdherenceMetric</code>),
          action (<code>ToolCorrectnessMetric</code>,{" "}
          <code>ArgumentCorrectnessMetric</code>), and execution (
          <code>TaskCompletionMetric</code>, <code>StepEfficiencyMetric</code>).
          You can also build custom metrics with <code>GEval</code> or{" "}
          <code>DAGMetric</code>.
        </>
      ),
    },
    {
      question: "Which metric should I use to evaluate tool selection?",
      answer: (
        <>
          Use <code>ToolCorrectnessMetric</code> to check whether the agent
          picked the right tools, and <code>ArgumentCorrectnessMetric</code> to
          check whether it passed the correct arguments. Both are
          component-level metrics attached to the LLM span that decides tool
          calls.
        </>
      ),
    },
    {
      question: "What is the difference between PlanQualityMetric and PlanAdherenceMetric?",
      answer: (
        <>
          <code>PlanQualityMetric</code> evaluates whether the agent's plan is
          logical and complete given the task.{" "}
          <code>PlanAdherenceMetric</code> evaluates whether the agent then
          actually followed that plan during execution.
        </>
      ),
    },
    {
      question: "How does TaskCompletionMetric work?",
      answer: (
        <>
          <code>TaskCompletionMetric</code> reads the full trace, extracts the
          user's goal, and uses an LLM judge to score whether the agent
          completed it. It's the best end-to-end metric for task-critical
          agents.
        </>
      ),
    },
    {
      question: "Do AI agent metrics require expected outputs?",
      answer: (
        <>
          Most agent metrics are referenceless—they only need the trace.
          Tool-related metrics like <code>ToolCorrectnessMetric</code> become
          reference-based when you provide <code>expected_tools</code> on the
          golden, which lets the metric compare actual versus expected tool
          calls.
        </>
      ),
    },
    {
      question: "Should I attach agent metrics end-to-end or component-level?",
      answer: (
        <>
          Reasoning and execution metrics need the full trace, so attach them
          end-to-end via <code>evals_iterator(metrics=[...])</code>. Action
          layer metrics evaluate a specific decision, so attach them
          component-level via <code>@observe(metrics=[...])</code> on the LLM
          span.
        </>
      ),
    },
    {
      question: "Can I run agent metrics in production?",
      answer: (
        <>
          Yes. Define a metric collection on{" "}
          <a href="https://confident-ai.com">Confident AI</a> and reference it
          on your <code>@observe</code> decorators. The platform evaluates
          exported traces asynchronously, so production agents are scored
          continuously without added latency.
        </>
      ),
    },
  ]}
/>

## Next Steps

Now that you understand the available AI agent evaluation metrics, here's where to go next:
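The FAQ above says a reference-based tool check compares the tools the agent actually called against the `expected_tools` listed on the golden. As a rough, hypothetical sketch of that idea only (this is not DeepEval's implementation; the `ToolCall` class and `tool_correctness` function are invented names), the comparison reduces to set membership over tool names:

```python
# Toy illustration of reference-based tool correctness. DeepEval's real
# ToolCorrectnessMetric has its own matching logic and test case objects;
# everything below is invented for this sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str  # name of the tool the agent invoked (or was expected to invoke)


def tool_correctness(tools_called: list[ToolCall], expected_tools: list[ToolCall]) -> float:
    """Fraction of expected tools the agent actually called, order-insensitive."""
    if not expected_tools:
        return 1.0  # nothing was required, so nothing can be missing
    called_names = {t.name for t in tools_called}
    matched = sum(1 for t in expected_tools if t.name in called_names)
    return matched / len(expected_tools)


score = tool_correctness(
    tools_called=[ToolCall("search_flights"), ToolCall("book_flight")],
    expected_tools=[
        ToolCall("search_flights"),
        ToolCall("book_flight"),
        ToolCall("send_confirmation"),
    ],
)
print(round(score, 2))  # 2 of 3 expected tools were called -> 0.67
```

Without `expected_tools` there is no reference to compare against, which is why the same metric falls back to referenceless, trace-only judging.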

docs/content/guides/guides-ai-agent-evaluation.mdx

Lines changed: 77 additions & 0 deletions
@@ -506,6 +506,83 @@ To catch these issues, `deepeval` provides metrics you can apply at different sc
With proper evaluation in place, you can catch regressions before users do, pinpoint exactly where your agent is failing, make data-driven decisions about which version to ship, and continuously monitor quality in production.

## FAQs

<FAQs
  qas={[
    {
      question: "What is AI agent evaluation?",
      answer:
        "AI agent evaluation is the process of measuring how well an autonomous LLM system reasons, plans, selects and calls tools, and completes tasks. Unlike single-turn LLM evaluation, agent evaluation operates on the full execution trace and assesses the reasoning layer and the action layer separately.",
    },
    {
      question: "How is AI agent evaluation different from regular LLM evaluation?",
      answer:
        "Standard LLM evaluation scores one input-output pair. AI agent evaluation runs against an execution trace that contains reasoning steps, tool calls, and intermediate decisions—so you can pinpoint whether failures came from bad planning, wrong tool selection, incorrect arguments, or incomplete task execution.",
    },
    {
      question: "Which AI agent metrics should I use in DeepEval?",
      answer: (
        <>
          For most agents, start with <code>PlanQualityMetric</code> and{" "}
          <code>PlanAdherenceMetric</code> for reasoning,{" "}
          <code>ToolCorrectnessMetric</code> and{" "}
          <code>ArgumentCorrectnessMetric</code> for the action layer, and{" "}
          <code>TaskCompletionMetric</code> with{" "}
          <code>StepEfficiencyMetric</code> for end-to-end execution quality.
        </>
      ),
    },
    {
      question: "What is the difference between end-to-end and component-level agent evals?",
      answer: (
        <>
          End-to-end evals are passed to <code>evals_iterator(metrics=[...])</code>{" "}
          and score the entire trace—best for plan quality and task completion.
          Component-level evals are attached via{" "}
          <code>@observe(metrics=[...])</code> and score a specific span like
          the LLM tool-calling component—best for tool selection and argument
          correctness.
        </>
      ),
    },
    {
      question: "Do I need tracing to evaluate AI agents?",
      answer: (
        <>
          Yes. Agent metrics in DeepEval require tracing because they read from
          the full execution trace—reasoning steps, tool calls, and arguments.
          Wrap your agent functions with <code>@observe</code> and the trace is
          built automatically.
        </>
      ),
    },
    {
      question: "Can I write custom AI agent evaluation metrics?",
      answer: (
        <>
          Yes. Use <code>GEval</code> for subjective natural-language criteria
          like reasoning clarity or professional tone, and{" "}
          <code>DAGMetric</code> for deterministic decision-tree logic. Both can
          run end-to-end or be attached to a specific span.
        </>
      ),
    },
    {
      question: "How do I run AI agent evaluation in production?",
      answer: (
        <>
          Run development evaluations locally with DeepEval, then export traces
          to <a href="https://confident-ai.com">Confident AI</a> for
          asynchronous production evaluation. Attach metric collections to your
          agent and LLM spans so the platform scores live traffic without
          adding latency to your application.
        </>
      ),
    },
  ]}
/>

## Next Steps And Additional Resources

While `deepeval` handles the metrics and evaluation logic, [Confident AI](https://confident-ai.com) is the platform that brings everything together. It solves the infrastructure overhead so you can focus on building better agents:
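The end-to-end versus component-level distinction described above can be sketched in plain Python. This is a conceptual illustration only, under invented assumptions (the `Trace` and `Span` shapes and both scorer functions are hypothetical, not DeepEval's API): an end-to-end scorer consumes the whole trace, while a component-level scorer sees only the single span it is attached to.

```python
# Conceptual sketch of end-to-end vs component-level scoring over a trace.
# All names here are invented; a real TaskCompletionMetric uses an LLM judge,
# not a string check.
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str                                        # e.g. "llm", "tool", "agent"
    output: str = ""
    tools_called: list[str] = field(default_factory=list)


@dataclass
class Trace:
    user_goal: str
    spans: list[Span] = field(default_factory=list)


def end_to_end_scorer(trace: Trace) -> float:
    """End-to-end style: reads every span in the trace to judge the outcome."""
    transcript = " ".join(span.output for span in trace.spans)
    return 1.0 if "booked" in transcript else 0.0


def component_scorer(span: Span, expected_tools: set[str]) -> float:
    """Component-level style: only sees the one LLM span that chose tools."""
    if not expected_tools:
        return 1.0
    return len(set(span.tools_called) & expected_tools) / len(expected_tools)


trace = Trace(
    user_goal="Book me a flight to Tokyo",
    spans=[
        Span(kind="llm", output="I will search, then book.",
             tools_called=["search_flights", "book_flight"]),
        Span(kind="tool", output="Flight NH847 found."),
        Span(kind="tool", output="Flight NH847 booked."),
    ],
)

print(end_to_end_scorer(trace))                                             # 1.0
print(component_scorer(trace.spans[0], {"search_flights", "book_flight"}))  # 1.0
```

The same separation is why, in `deepeval`, trace-level metrics are passed to the evaluation loop while span-level metrics are declared where the span is defined: each scorer only receives the data it is scoped to.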
