96 changes: 96 additions & 0 deletions website/blog/2026-02-23-automatic-evaluation/index.mdx
@@ -0,0 +1,96 @@
---
title: "Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate"
description: Run LLM judges on your traces and conversations as they're logged, with no code required
slug: automatic-evaluation
authors: [corey-zumar, samraj-moorjani, avesh-singh, serena-ruan, yuki-watanabe]
tags: [genai, evaluation, tracing, llm-as-judge, production-monitoring, rag]
thumbnail: img/blog/mlflow-automatic-evaluation-thumbnail.png
image: img/blog/mlflow-automatic-evaluation-thumbnail.png
---

We're excited to introduce **Automatic Evaluation**, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged, with no code required.

## Challenges

Building an AI agent, LLM application, or RAG system is one thing. Knowing if it's working well is another.

During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.

Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong, ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?

These challenges share a common solution: continuous, automated quality checks that run without manual intervention.

## What is Automatic Evaluation?

Automatic evaluation runs [**LLM judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/#llms-as-judges) on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.

An LLM judge is a pair of a language model and a tailored prompt that evaluates the outputs of an agent or LLM application against specific criteria: safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
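
For example, here is a minimal sketch of a custom judge created with `make_judge()`. The judge name and criteria text are illustrative placeholders, the model URI reuses the placeholder gateway endpoint from the setup example below, and the exact template fields may vary by MLflow version:

```python
from mlflow.genai.judges import make_judge

# A hypothetical custom judge; replace the name, instructions, and model URI
# with your own. The {{ inputs }} and {{ outputs }} fields are filled in from
# each trace at evaluation time.
politeness_judge = make_judge(
    name="politeness",
    instructions=(
        "Evaluate whether the response in {{ outputs }} addresses the request in "
        "{{ inputs }} in a polite, professional tone. Answer 'yes' or 'no'."
    ),
    model="gateway:/my-llm-endpoint",  # same placeholder endpoint used below
)
```

A judge defined this way can be registered and started just like the built-in `Safety` judge shown in the setup example below.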

## Get Started

Before setting up automatic evaluation, you'll need MLflow 3.9 or later:

```bash
pip install 'mlflow>=3.9'
```

Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via [OpenTelemetry](https://mlflow.org/docs/latest/genai/tracing/opentelemetry/). MLflow also provides [more than 40 autologging integrations](https://mlflow.org/docs/latest/genai/tracing/integrations/) for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:

```python
import mlflow

mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
```
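
As a quick sanity check, a call like the following is captured as a trace once autologging is enabled. This is a sketch that assumes the OpenAI SDK is installed and `OPENAI_API_KEY` is set; the experiment and model names are placeholders:

```python
import mlflow
from openai import OpenAI

mlflow.openai.autolog()
mlflow.set_experiment("automatic-evaluation-demo")  # hypothetical experiment name

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whatever your application calls
    messages=[{"role": "user", "content": "Summarize MLflow Tracing in one sentence."}],
)
# The request, response, latency, and token usage are recorded as a trace in MLflow.
print(response.choices[0].message.content)
```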

Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.

### Setting Up Automatic Evaluation

The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the **Judges** tab, and click **+ New LLM judge**. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.

<video width="100%" controls autoPlay loop muted>
<source src={require("./automatic-evaluation-ui-setup.mp4").default} type="video/mp4" />
</video>

You can also configure judges programmatically:

```python
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

judge = Safety(model="gateway:/my-llm-endpoint")
registered = judge.register(name="safety_check")
registered.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
        filter_string="metadata.environment = 'production'",  # Optional: target specific traces
    )
)
```
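
The `filter_string` above only matches traces that actually carry that metadata. Here is a minimal sketch of how you might attach it from inside your instrumented code, assuming your MLflow version supports updating the active trace's metadata; the `environment` key is a convention of this example, not a requirement:

```python
import mlflow


@mlflow.trace
def answer(question: str) -> str:
    # Tag the active trace so the production-only filter above can match it.
    mlflow.update_current_trace(metadata={"environment": "production"})
    reply = "..."  # call your agent or LLM here
    return reply
```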

For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
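
Session-level evaluation assumes your traces are grouped into conversations. One way to do that, sketched here with an illustrative session ID and based on MLflow's session-tracking metadata convention (check the tracing docs for your version), is to stamp each trace with the session it belongs to:

```python
import mlflow


@mlflow.trace
def handle_turn(session_id: str, user_message: str) -> str:
    # Associate this trace with a conversation so session-level judges can
    # evaluate the whole exchange once it goes quiet.
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    reply = "..."  # generate the assistant reply here
    return reply
```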

For complete setup instructions, see the [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/).

## Viewing Results

Quality scores from your judges appear directly in the MLflow UI. The [Overview tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#overview-tab) shows trends over time, so you can see at a glance whether quality is improving or degrading. The [Traces tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#traces-tab) shows individual scores alongside each trace, letting you drill into specific failures. Scores typically appear within a minute or two of trace logging.
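
If you prefer to pull scores programmatically rather than reading them in the UI, a sketch along these lines may work, assuming `mlflow.search_traces()` in your version returns assessment data alongside each trace; the experiment ID is a placeholder:

```python
import mlflow

# Retrieve recent traces (and their attached judge assessments) from an experiment.
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],  # placeholder
    max_results=100,
)
print(traces.columns)  # inspect the returned columns, including assessment data
```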

<video width="100%" controls autoPlay loop muted>
<source
src={require("./automatic-evaluation-ui-setup.mp4").default}
type="video/mp4"
/>
</video>

## Learn More

- [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/)
- [Built-in LLM judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/)
- [Creating custom judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/create-custom-judge/)
- [Third-party integrations (DeepEval, RAGAS, Phoenix)](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/)

## Feedback

If this is useful, give us a star on GitHub: [github.com/mlflow/mlflow](https://github.com/mlflow/mlflow)

Have questions or feedback? Open an issue on [GitHub](https://github.com/mlflow/mlflow/issues) or join the conversation in the MLflow community.
12 changes: 12 additions & 0 deletions website/blog/authors.yml
@@ -1,3 +1,9 @@
corey-zumar:
  name: Corey Zumar
  title: Software Engineer at Databricks
  url: https://github.com/dbczumar
  image_url: https://github.com/dbczumar.png

mlflow-maintainers:
  name: MLflow maintainers
  title: MLflow maintainers
@@ -158,6 +164,12 @@ shyam-sankararaman:
  url: https://www.linkedin.com/in/shyampr16/
  image_url: /img/authors/shyam_sankararaman.png

serena-ruan:
  name: Serena Ruan
  title: Software Engineer at Databricks
  url: https://www.linkedin.com/in/serena-ruan/
  image_url: https://github.com/serena-ruan.png

samraj-moorjani:
  name: Samraj Moorjani
  title: Software Engineer at Databricks
Binary file added website/static/img/authors/corey_zumar.jpg
Binary file added website/static/img/authors/serena_ruan.jpg