diff --git a/website/blog/2026-02-23-automatic-evaluation/automatic-evaluation-ui-setup.mp4 b/website/blog/2026-02-23-automatic-evaluation/automatic-evaluation-ui-setup.mp4
new file mode 100644
index 0000000000..9c43ffc9cd
Binary files /dev/null and b/website/blog/2026-02-23-automatic-evaluation/automatic-evaluation-ui-setup.mp4 differ
diff --git a/website/blog/2026-02-23-automatic-evaluation/index.mdx b/website/blog/2026-02-23-automatic-evaluation/index.mdx
new file mode 100644
index 0000000000..002f195456
--- /dev/null
+++ b/website/blog/2026-02-23-automatic-evaluation/index.mdx
@@ -0,0 +1,96 @@
+---
+title: "Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate"
+description: Run LLM judges on your traces and conversations as they're logged—no code required
+slug: automatic-evaluation
+authors: [corey-zumar, samraj-moorjani, avesh-singh, serena-ruan, yuki-watanabe]
+tags: [genai, evaluation, tracing, llm-as-judge, production-monitoring, rag]
+thumbnail: img/blog/mlflow-automatic-evaluation-thumbnail.png
+image: img/blog/mlflow-automatic-evaluation-thumbnail.png
+---
+
+We're excited to introduce **Automatic Evaluation**, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged—no code required.
+
+## Challenges
+
+Building an AI agent, LLM application, or RAG system is one thing. Knowing whether it's working well is another.
+
+During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.
+
+Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong—ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?
+
+These challenges share a common solution: continuous, automated quality checks that run without manual intervention.
+
+## What is Automatic Evaluation?
+
+Automatic evaluation runs [**LLM judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/#llms-as-judges) on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.
+
+An LLM judge is a language model that evaluates the outputs of an agent or LLM application against specific criteria—safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also lets you create custom judges from natural language criteria.
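+
+For instance, here is a minimal sketch of a custom judge built with `make_judge()`. The judge name, criteria text, and model endpoint below are illustrative placeholders, not a prescribed configuration:
+
+```python
+from mlflow.genai.judges import make_judge
+
+# The instructions reference the template variables {{ inputs }} and
+# {{ outputs }}, which are filled in from each trace at evaluation time.
+relevance_judge = make_judge(
+    name="relevance",  # illustrative name
+    instructions=(
+        "Evaluate whether the response in {{ outputs }} directly addresses "
+        "the user's request in {{ inputs }}. Answer 'yes' or 'no'."
+    ),
+    model="gateway:/my-llm-endpoint",  # placeholder endpoint, as elsewhere in this post
+)
+```
+
+A judge created this way is a scorer like the built-in ones, so it can be registered and started on an experiment in the same way as the `Safety` example shown below.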
+
+## Get Started
+
+Before setting up automatic evaluation, you'll need MLflow 3.9 or later:
+
+```bash
+pip install 'mlflow>=3.9'
+```
+
+Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via [OpenTelemetry](https://mlflow.org/docs/latest/genai/tracing/opentelemetry/). MLflow also provides [more than 40 autologging integrations](https://mlflow.org/docs/latest/genai/tracing/integrations/) for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:
+
+```python
+import mlflow
+
+mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
+```
+
+Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.
+
+### Setting Up Automatic Evaluation
+
+The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the **Judges** tab, and click **+ New LLM judge**. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.
+
+
+
+You can also configure judges programmatically:
+
+```python
+from mlflow.genai.scorers import Safety, ScorerSamplingConfig
+
+judge = Safety(model="gateway:/my-llm-endpoint")
+registered = judge.register(name="safety_check")
+registered.start(
+    sampling_config=ScorerSamplingConfig(
+        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
+        filter_string="metadata.environment = 'production'"  # Optional: target specific traces
+    )
+)
+```
+
+For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
+
+For complete setup instructions, see the [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/).
+
+## Viewing Results
+
+Quality scores from your judges appear directly in the MLflow UI. The [Overview tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#overview-tab) shows trends over time—you can see at a glance whether quality is improving or degrading. The [Traces tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#traces-tab) shows individual scores alongside each trace, so you can drill into specific failures. Scores typically appear within a minute or two of trace logging.
+
+
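+Scores can also be pulled out programmatically for ad-hoc analysis. Here is a small sketch using `mlflow.search_traces()`; the experiment name is a placeholder, and the exact DataFrame columns vary across MLflow versions:
+
+```python
+import mlflow
+
+# "my-agent" is an illustrative experiment name for this sketch.
+mlflow.set_experiment("my-agent")
+
+# Fetch recent traces from the active experiment; judge results are
+# attached to each trace as assessments.
+traces = mlflow.search_traces(max_results=100)
+
+# Inspect the layout before relying on specific columns.
+print(traces.columns)
+print(traces.head())
+```
+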
+## Learn More
+
+- [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/)
+- [Built-in LLM judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/)
+- [Creating custom judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/create-custom-judge/)
+- [Third-party integrations (DeepEval, RAGAS, Phoenix)](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/)
+
+## Feedback
+
+If you find Automatic Evaluation useful, give us a star on GitHub: [github.com/mlflow/mlflow](https://github.com/mlflow/mlflow)
+
+Have questions or feedback? Open an issue on [GitHub](https://github.com/mlflow/mlflow/issues) or join the conversation in the MLflow community.
diff --git a/website/blog/2026-02-23-automatic-evaluation/overview_demo.mp4 b/website/blog/2026-02-23-automatic-evaluation/overview_demo.mp4
new file mode 100644
index 0000000000..cd9a8679af
Binary files /dev/null and b/website/blog/2026-02-23-automatic-evaluation/overview_demo.mp4 differ
diff --git a/website/blog/authors.yml b/website/blog/authors.yml
index 609e798aac..b455a09fcb 100644
--- a/website/blog/authors.yml
+++ b/website/blog/authors.yml
@@ -1,3 +1,9 @@
+corey-zumar:
+  name: Corey Zumar
+  title: Software Engineer at Databricks
+  url: https://github.com/dbczumar
+  image_url: https://github.com/dbczumar.png
+
 mlflow-maintainers:
   name: MLflow maintainers
   title: MLflow maintainers
@@ -158,6 +164,12 @@ shyam-sankararaman:
   name: Shyam Sankararaman
   url: https://www.linkedin.com/in/shyampr16/
   image_url: /img/authors/shyam_sankararaman.png
+serena-ruan:
+  name: Serena Ruan
+  title: Software Engineer at Databricks
+  url: https://www.linkedin.com/in/serena-ruan/
+  image_url: https://github.com/serena-ruan.png
+
 samraj-moorjani:
   name: Samraj Moorjani
   title: Software Engineer at Databricks
diff --git a/website/static/img/authors/corey_zumar.jpg b/website/static/img/authors/corey_zumar.jpg
new file mode 100644
index 0000000000..4471124a0f
Binary files /dev/null and b/website/static/img/authors/corey_zumar.jpg differ
diff --git a/website/static/img/authors/serena_ruan.jpg b/website/static/img/authors/serena_ruan.jpg
new file mode 100644
index 0000000000..d2be75a64d
Binary files /dev/null and b/website/static/img/authors/serena_ruan.jpg differ
diff --git a/website/static/img/blog/mlflow-automatic-evaluation-thumbnail.png b/website/static/img/blog/mlflow-automatic-evaluation-thumbnail.png
new file mode 100644
index 0000000000..40b4a4c7a6
Binary files /dev/null and b/website/static/img/blog/mlflow-automatic-evaluation-thumbnail.png differ