docs/content/guides/guides-ai-agent-evaluation-metrics.mdx (91 additions, 0 deletions)
@@ -428,6 +428,97 @@ Not every agent needs every metric. Here's a decision framework:

All AI agent evaluation metrics in `deepeval` support custom LLM judges, configurable thresholds, strict mode for binary scoring, and detailed reasoning explanations. See each metric's documentation for full configuration options.
:::

## FAQs

<FAQs
  qas={[
    {
      question: "What metrics does DeepEval provide for AI agents?",
      answer: (
        <>
          DeepEval ships agent metrics across three layers: reasoning (
docs/content/guides/guides-ai-agent-evaluation.mdx (77 additions, 0 deletions)
@@ -506,6 +506,83 @@ To catch these issues, `deepeval` provides metrics you can apply at different sc

With proper evaluation in place, you can catch regressions before users do, pinpoint exactly where your agent is failing, make data-driven decisions about which version to ship, and continuously monitor quality in production.

## FAQs

<FAQs
  qas={[
    {
      question: "What is AI agent evaluation?",
      answer:
        "AI agent evaluation is the process of measuring how well an autonomous LLM system reasons, plans, selects and calls tools, and completes tasks. Unlike single-turn LLM evaluation, agent evaluation operates on the full execution trace and assesses the reasoning layer and the action layer separately.",
    },
    {
      question: "How is AI agent evaluation different from regular LLM evaluation?",
      answer:
        "Standard LLM evaluation scores one input-output pair. AI agent evaluation runs against an execution trace that contains reasoning steps, tool calls, and intermediate decisions—so you can pinpoint whether failures came from bad planning, wrong tool selection, incorrect arguments, or incomplete task execution.",
    },
    {
      question: "Which AI agent metrics should I use in DeepEval?",
      answer: (
        <>
          For most agents, start with <code>PlanQualityMetric</code> and{" "}
          <code>PlanAdherenceMetric</code> for reasoning,{" "}
          <code>ToolCorrectnessMetric</code> and{" "}
          <code>ArgumentCorrectnessMetric</code> for the action layer, and{" "}
          <code>TaskCompletionMetric</code> with{" "}
          <code>StepEfficiencyMetric</code> for end-to-end execution quality.
        </>
      ),
    },
    {
      question: "What is the difference between end-to-end and component-level agent evals?",
      answer: (
        <>
          End-to-end evals are passed to <code>evals_iterator(metrics=[...])</code>{" "}
          and score the entire trace—best for plan quality and task completion.
          Component-level evals are attached via{" "}
          <code>@observe(metrics=[...])</code> and score a specific span like
          the LLM tool-calling component—best for tool selection and argument
          correctness.
        </>
      ),
    },
    {
      question: "Do I need tracing to evaluate AI agents?",
      answer: (
        <>
          Yes. Agent metrics in DeepEval require tracing because they read from
          the full execution trace—reasoning steps, tool calls, and arguments.
          Wrap your agent functions with <code>@observe</code> and the trace is
          built automatically.
        </>
      ),
    },
    {
      question: "Can I write custom AI agent evaluation metrics?",
      answer: (
        <>
          Yes. Use <code>GEval</code> for subjective natural-language criteria
          like reasoning clarity or professional tone, and{" "}
          <code>DAGMetric</code> for deterministic decision-tree logic. Both can
          run end-to-end or be attached to a specific span.
        </>
      ),
    },
    {
      question: "How do I run AI agent evaluation in production?",
      answer: (
        <>
          Run development evaluations locally with DeepEval, then export
          traces to <a href="https://confident-ai.com">Confident AI</a> for
          asynchronous production evaluation. Attach metric collections to
          your agent and LLM spans so the platform scores live traffic without
          adding latency to your application.
        </>
      ),
    },
  ]}
/>
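The trace-based scoring the FAQs describe can be illustrated with a toy example. This is a simplified sketch of the idea behind tool-correctness metrics, not deepeval's actual implementation; the `tool_correctness` function and the trace schema here are hypothetical, invented for illustration only:

```python
# Toy sketch of trace-based tool-correctness scoring (illustrative only;
# NOT deepeval's API). An agent execution trace is modeled as a list of
# steps, each either a reasoning step or a tool call.

def tool_correctness(trace, expected_tools):
    """Return the fraction of expected tools the agent actually called."""
    called = {step["tool"] for step in trace if step.get("type") == "tool_call"}
    if not expected_tools:
        return 1.0
    return len(called & set(expected_tools)) / len(set(expected_tools))

trace = [
    {"type": "reasoning", "text": "Need flight prices, then weather."},
    {"type": "tool_call", "tool": "search_flights", "args": {"to": "NYC"}},
    {"type": "tool_call", "tool": "get_weather", "args": {"city": "NYC"}},
]

print(tool_correctness(trace, ["search_flights", "get_weather"]))  # 1.0
print(tool_correctness(trace, ["book_hotel"]))                     # 0.0
```

Because the score is computed from the full trace rather than a single input-output pair, a failure can be attributed to a specific step (a missing or wrong tool call) instead of only to the final answer — which is the core difference from single-turn LLM evaluation described above.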
## Next Steps And Additional Resources

While `deepeval` handles the metrics and evaluation logic, [Confident AI](https://confident-ai.com) is the platform that brings everything together. It solves the infrastructure overhead so you can focus on building better agents: