Commit 265bfbd: updated guides
1 parent cd3c09b

13 files changed
Lines changed: 1483 additions & 4 deletions

docs/content/guides/guides-ai-agent-evaluation-metrics.mdx

Lines changed: 91 additions & 0 deletions
@@ -428,6 +428,97 @@ Not every agent needs every metric. Here's a decision framework:
All AI agent evaluation metrics in `deepeval` support custom LLM judges, configurable thresholds, strict mode for binary scoring, and detailed reasoning explanations. See each metric's documentation for full configuration options.
:::

## FAQs

<FAQs
  qas={[
    {
      question: "What metrics does DeepEval provide for AI agents?",
      answer: (
        <>
          DeepEval ships agent metrics across three layers: reasoning (
          <code>PlanQualityMetric</code>, <code>PlanAdherenceMetric</code>),
          action (<code>ToolCorrectnessMetric</code>,{" "}
          <code>ArgumentCorrectnessMetric</code>), and execution (
          <code>TaskCompletionMetric</code>, <code>StepEfficiencyMetric</code>).
          You can also build custom metrics with <code>GEval</code> or{" "}
          <code>DAGMetric</code>.
        </>
      ),
    },
    {
      question: "Which metric should I use to evaluate tool selection?",
      answer: (
        <>
          Use <code>ToolCorrectnessMetric</code> to check whether the agent
          picked the right tools, and <code>ArgumentCorrectnessMetric</code> to
          check whether it passed the correct arguments. Both are
          component-level metrics attached to the LLM span that decides tool
          calls.
        </>
      ),
    },
    {
      question: "What is the difference between PlanQualityMetric and PlanAdherenceMetric?",
      answer: (
        <>
          <code>PlanQualityMetric</code> evaluates whether the agent's plan is
          logical and complete given the task.{" "}
          <code>PlanAdherenceMetric</code> evaluates whether the agent then
          actually followed that plan during execution.
        </>
      ),
    },
    {
      question: "How does TaskCompletionMetric work?",
      answer: (
        <>
          <code>TaskCompletionMetric</code> reads the full trace, extracts the
          user's goal, and uses an LLM judge to score whether the agent
          completed it. It's the best end-to-end metric for task-critical
          agents.
        </>
      ),
    },
    {
      question: "Do AI agent metrics require expected outputs?",
      answer: (
        <>
          Most agent metrics are referenceless—they only need the trace.
          Tool-related metrics like <code>ToolCorrectnessMetric</code> become
          reference-based when you provide <code>expected_tools</code> on the
          golden, which lets the metric compare actual versus expected tool
          calls.
        </>
      ),
    },
    {
      question: "Should I attach agent metrics end-to-end or component-level?",
      answer: (
        <>
          Reasoning and execution metrics need the full trace, so attach them
          end-to-end via <code>evals_iterator(metrics=[...])</code>. Action
          layer metrics evaluate a specific decision, so attach them
          component-level via <code>@observe(metrics=[...])</code> on the LLM
          span.
        </>
      ),
    },
    {
      question: "Can I run agent metrics in production?",
      answer: (
        <>
          Yes. Define a metric collection on{" "}
          <a href="https://confident-ai.com">Confident AI</a> and reference it
          on your <code>@observe</code> decorators. The platform evaluates
          exported traces asynchronously, so production agents are scored
          continuously without added latency.
        </>
      ),
    },
  ]}
/>

## Next Steps

Now that you understand the available AI agent evaluation metrics, here's where to go next:
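The FAQ above says a reference-based tool check compares the tools the agent actually called against the `expected_tools` listed on the golden. As a rough, hypothetical sketch of that idea only (this is not DeepEval's implementation; the `ToolCall` class and `tool_correctness` function are invented names), the comparison reduces to set membership over tool names:

```python
# Toy illustration of reference-based tool correctness. DeepEval's real
# ToolCorrectnessMetric has its own matching logic and test case objects;
# everything below is invented for this sketch.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str  # name of the tool the agent invoked (or was expected to invoke)


def tool_correctness(tools_called: list[ToolCall], expected_tools: list[ToolCall]) -> float:
    """Fraction of expected tools the agent actually called, order-insensitive."""
    if not expected_tools:
        return 1.0  # nothing was required, so nothing can be missing
    called_names = {t.name for t in tools_called}
    matched = sum(1 for t in expected_tools if t.name in called_names)
    return matched / len(expected_tools)


score = tool_correctness(
    tools_called=[ToolCall("search_flights"), ToolCall("book_flight")],
    expected_tools=[
        ToolCall("search_flights"),
        ToolCall("book_flight"),
        ToolCall("send_confirmation"),
    ],
)
print(round(score, 2))  # 2 of 3 expected tools were called -> 0.67
```

Without `expected_tools` there is no reference to compare against, which is why the same metric falls back to referenceless, trace-only judging.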

docs/content/guides/guides-ai-agent-evaluation.mdx

Lines changed: 77 additions & 0 deletions
@@ -506,6 +506,83 @@ To catch these issues, `deepeval` provides metrics you can apply at different sc
With proper evaluation in place, you can catch regressions before users do, pinpoint exactly where your agent is failing, make data-driven decisions about which version to ship, and continuously monitor quality in production.

## FAQs

<FAQs
  qas={[
    {
      question: "What is AI agent evaluation?",
      answer:
        "AI agent evaluation is the process of measuring how well an autonomous LLM system reasons, plans, selects and calls tools, and completes tasks. Unlike single-turn LLM evaluation, agent evaluation operates on the full execution trace and assesses the reasoning layer and the action layer separately.",
    },
    {
      question: "How is AI agent evaluation different from regular LLM evaluation?",
      answer:
        "Standard LLM evaluation scores one input-output pair. AI agent evaluation runs against an execution trace that contains reasoning steps, tool calls, and intermediate decisions—so you can pinpoint whether failures came from bad planning, wrong tool selection, incorrect arguments, or incomplete task execution.",
    },
    {
      question: "Which AI agent metrics should I use in DeepEval?",
      answer: (
        <>
          For most agents, start with <code>PlanQualityMetric</code> and{" "}
          <code>PlanAdherenceMetric</code> for reasoning,{" "}
          <code>ToolCorrectnessMetric</code> and{" "}
          <code>ArgumentCorrectnessMetric</code> for the action layer, and{" "}
          <code>TaskCompletionMetric</code> with{" "}
          <code>StepEfficiencyMetric</code> for end-to-end execution quality.
        </>
      ),
    },
    {
      question: "What is the difference between end-to-end and component-level agent evals?",
      answer: (
        <>
          End-to-end evals are passed to <code>evals_iterator(metrics=[...])</code>{" "}
          and score the entire trace—best for plan quality and task completion.
          Component-level evals are attached via{" "}
          <code>@observe(metrics=[...])</code> and score a specific span like
          the LLM tool-calling component—best for tool selection and argument
          correctness.
        </>
      ),
    },
    {
      question: "Do I need tracing to evaluate AI agents?",
      answer: (
        <>
          Yes. Agent metrics in DeepEval require tracing because they read from
          the full execution trace—reasoning steps, tool calls, and arguments.
          Wrap your agent functions with <code>@observe</code> and the trace is
          built automatically.
        </>
      ),
    },
    {
      question: "Can I write custom AI agent evaluation metrics?",
      answer: (
        <>
          Yes. Use <code>GEval</code> for subjective natural-language criteria
          like reasoning clarity or professional tone, and{" "}
          <code>DAGMetric</code> for deterministic decision-tree logic. Both can
          run end-to-end or be attached to a specific span.
        </>
      ),
    },
    {
      question: "How do I run AI agent evaluation in production?",
      answer: (
        <>
          Run development evaluations locally with DeepEval, then export traces
          to <a href="https://confident-ai.com">Confident AI</a> for
          asynchronous production evaluation. Attach metric collections to your
          agent and LLM spans so the platform scores live traffic without
          adding latency to your application.
        </>
      ),
    },
  ]}
/>

## Next Steps And Additional Resources

While `deepeval` handles the metrics and evaluation logic, [Confident AI](https://confident-ai.com) is the platform that brings everything together. It solves the infrastructure overhead so you can focus on building better agents:
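The end-to-end versus component-level distinction described above can be sketched in plain Python. This is a conceptual illustration only, under invented assumptions (the `Trace` and `Span` shapes and both scorer functions are hypothetical, not DeepEval's API): an end-to-end scorer consumes the whole trace, while a component-level scorer sees only the single span it is attached to.

```python
# Conceptual sketch of end-to-end vs component-level scoring over a trace.
# All names here are invented; a real TaskCompletionMetric uses an LLM judge,
# not a string check.
from dataclasses import dataclass, field


@dataclass
class Span:
    kind: str                                        # e.g. "llm", "tool", "agent"
    output: str = ""
    tools_called: list[str] = field(default_factory=list)


@dataclass
class Trace:
    user_goal: str
    spans: list[Span] = field(default_factory=list)


def end_to_end_scorer(trace: Trace) -> float:
    """End-to-end style: reads every span in the trace to judge the outcome."""
    transcript = " ".join(span.output for span in trace.spans)
    return 1.0 if "booked" in transcript else 0.0


def component_scorer(span: Span, expected_tools: set[str]) -> float:
    """Component-level style: only sees the one LLM span that chose tools."""
    if not expected_tools:
        return 1.0
    return len(set(span.tools_called) & expected_tools) / len(expected_tools)


trace = Trace(
    user_goal="Book me a flight to Tokyo",
    spans=[
        Span(kind="llm", output="I will search, then book.",
             tools_called=["search_flights", "book_flight"]),
        Span(kind="tool", output="Flight NH847 found."),
        Span(kind="tool", output="Flight NH847 booked."),
    ],
)

print(end_to_end_scorer(trace))                                             # 1.0
print(component_scorer(trace.spans[0], {"search_flights", "book_flight"}))  # 1.0
```

The same separation is why, in `deepeval`, trace-level metrics are passed to the evaluation loop while span-level metrics are declared where the span is defined: each scorer only receives the data it is scoped to.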
