96 changes: 96 additions & 0 deletions website/blog/2026-02-23-automatic-evaluation/index.mdx
@@ -0,0 +1,96 @@
---
title: "Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate"
description: Run LLM judges on your traces and conversations as they're logged, with no code required
slug: automatic-evaluation
authors: [corey-zumar, samraj-moorjani, avesh-singh, serena-ruan, yuki-watanabe]
tags: [genai, evaluation, tracing, llm-as-judge, production-monitoring, rag]
thumbnail: img/blog/mlflow-automatic-evaluation-thumbnail.png
image: img/blog/mlflow-automatic-evaluation-thumbnail.png
---

We're excited to introduce **Automatic Evaluation**, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged, with no code required.

## Challenges

Building an AI agent, LLM application, or RAG system is one thing. Knowing if it's working well is another.

During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.

Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong, ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?

These challenges share a common solution: continuous, automated quality checks that run without manual intervention.

## What is Automatic Evaluation?

Automatic evaluation runs [**LLM judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/#llms-as-judges) on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.

An LLM judge is a pair of a language model and a tailored prompt that evaluates the outputs of an agent or LLM application against specific criteria: safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
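
For example, here is a minimal sketch of a custom judge created with `make_judge()`. The judge name and criteria text are illustrative placeholders, the model URI reuses the placeholder gateway endpoint from the setup example below, and the exact template fields may vary by MLflow version:

```python
from mlflow.genai.judges import make_judge

# A hypothetical custom judge; replace the name, instructions, and model URI
# with your own. The {{ inputs }} and {{ outputs }} fields are filled in from
# each trace at evaluation time.
politeness_judge = make_judge(
    name="politeness",
    instructions=(
        "Evaluate whether the response in {{ outputs }} addresses the request in "
        "{{ inputs }} in a polite, professional tone. Answer 'yes' or 'no'."
    ),
    model="gateway:/my-llm-endpoint",  # same placeholder endpoint used below
)
```

A judge defined this way can be registered and started just like the built-in `Safety` judge shown in the setup example below.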

## Get Started

Before setting up automatic evaluation, you'll need MLflow 3.9 or later:

```bash
pip install 'mlflow>=3.9'
```

Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via [OpenTelemetry](https://mlflow.org/docs/latest/genai/tracing/opentelemetry/). MLflow also provides [more than 40 autologging integrations](https://mlflow.org/docs/latest/genai/tracing/integrations/) for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:

```python
import mlflow

mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
```
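
As a quick sanity check, a call like the following is captured as a trace once autologging is enabled. This is a sketch that assumes the OpenAI SDK is installed and `OPENAI_API_KEY` is set; the experiment and model names are placeholders:

```python
import mlflow
from openai import OpenAI

mlflow.openai.autolog()
mlflow.set_experiment("automatic-evaluation-demo")  # hypothetical experiment name

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; use whatever your application calls
    messages=[{"role": "user", "content": "Summarize MLflow Tracing in one sentence."}],
)
# The request, response, latency, and token usage are recorded as a trace in MLflow.
print(response.choices[0].message.content)
```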

Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.

### Setting Up Automatic Evaluation

The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the **Judges** tab, and click **+ New LLM judge**. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.

<video width="100%" controls autoPlay loop muted>
<source src={require("./automatic-evaluation-ui-setup.mp4").default} type="video/mp4" />
</video>

You can also configure judges programmatically:

```python
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

judge = Safety(model="gateway:/my-llm-endpoint")
registered = judge.register(name="safety_check")
registered.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
        filter_string="metadata.environment = 'production'",  # Optional: target specific traces
    )
)
```
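
The `filter_string` above only matches traces that actually carry that metadata. Here is a minimal sketch of how you might attach it from inside your instrumented code, assuming your MLflow version supports updating the active trace's metadata; the `environment` key is a convention of this example, not a requirement:

```python
import mlflow


@mlflow.trace
def answer(question: str) -> str:
    # Tag the active trace so the production-only filter above can match it.
    mlflow.update_current_trace(metadata={"environment": "production"})
    reply = "..."  # call your agent or LLM here
    return reply
```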

For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
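
Session-level evaluation assumes your traces are grouped into conversations. One way to do that, sketched here with an illustrative session ID and based on MLflow's session-tracking metadata convention (check the tracing docs for your version), is to stamp each trace with the session it belongs to:

```python
import mlflow


@mlflow.trace
def handle_turn(session_id: str, user_message: str) -> str:
    # Associate this trace with a conversation so session-level judges can
    # evaluate the whole exchange once it goes quiet.
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    reply = "..."  # generate the assistant reply here
    return reply
```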

For complete setup instructions, see the [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/).

## Viewing Results

Quality scores from your judges appear directly in the MLflow UI. The [Overview tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#overview-tab) shows trends over time, so you can see at a glance whether quality is improving or degrading. The [Traces tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#traces-tab) shows individual scores alongside each trace, letting you drill into specific failures. Scores typically appear within a minute or two of trace logging.
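
If you prefer to pull scores programmatically rather than reading them in the UI, a sketch along these lines may work, assuming `mlflow.search_traces()` in your version returns assessment data alongside each trace; the experiment ID is a placeholder:

```python
import mlflow

# Retrieve recent traces (and their attached judge assessments) from an experiment.
traces = mlflow.search_traces(
    experiment_ids=["<your-experiment-id>"],  # placeholder
    max_results=100,
)
print(traces.columns)  # inspect the returned columns, including assessment data
```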

<video width="100%" controls autoPlay loop muted>
<source
src={require("./automatic-evaluation-ui-setup.mp4").default}
type="video/mp4"
/>
</video>

## Learn More

- [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/)
- [Built-in LLM judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/)
- [Creating custom judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/create-custom-judge/)
- [Third-party integrations (DeepEval, RAGAS, Phoenix)](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/)

## Feedback

If this is useful, give us a star on GitHub: [github.com/mlflow/mlflow](https://github.com/mlflow/mlflow)

Have questions or feedback? Open an issue on [GitHub](https://github.com/mlflow/mlflow/issues) or join the conversation in the MLflow community.
12 changes: 12 additions & 0 deletions website/blog/authors.yml
@@ -1,3 +1,9 @@
corey-zumar:
  name: Corey Zumar
  title: Software Engineer at Databricks
  url: https://github.com/dbczumar
  image_url: https://github.com/dbczumar.png

mlflow-maintainers:
  name: MLflow maintainers
  title: MLflow maintainers
@@ -158,6 +164,12 @@ shyam-sankararaman:
  url: https://www.linkedin.com/in/shyampr16/
  image_url: /img/authors/shyam_sankararaman.png

serena-ruan:
  name: Serena Ruan
  title: Software Engineer at Databricks
  url: https://www.linkedin.com/in/serena-ruan/
  image_url: https://github.com/serena-ruan.png

samraj-moorjani:
  name: Samraj Moorjani
  title: Software Engineer at Databricks
Binary file added website/static/img/authors/corey_zumar.jpg
Binary file added website/static/img/authors/serena_ruan.jpg