# Blog for automatic evaluation #484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account

Open: dbczumar wants to merge 9 commits into mlflow:main from dbczumar:blog_online

Commits (9, all by dbczumar, each titled "fix"): 7881fab, a75a3c0, 2ce0e5b, ea1cdae, 6d68a6e, 29f4742, 28202f6, 307d940, b87d5fc
dbczumar File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing

Binary file added (+2.13 MB): website/blog/2026-02-23-automatic-evaluation/automatic-evaluation-ui-setup.mp4

---
title: "Automatic LLM and Agent Evaluation: Quality Monitoring Without the Boilerplate"
description: Run LLM judges on your traces and conversations as they're logged, no code required
slug: automatic-evaluation
authors: [corey-zumar, samraj-moorjani, avesh-singh, serena-ruan, yuki-watanabe]
tags: [genai, evaluation, tracing, llm-as-judge, production-monitoring, rag]
thumbnail: img/blog/mlflow-automatic-evaluation-thumbnail.png
image: img/blog/mlflow-automatic-evaluation-thumbnail.png
---

We're excited to introduce **Automatic Evaluation**, a new capability in MLflow that runs LLM judges on your agent traces and conversations as they're logged, with no code required.

## Challenges

Building an AI agent, LLM application, or RAG system is one thing. Knowing whether it's working well is another.

During development, you want fast feedback. You make a change, run your agent, and wonder: did that actually improve things? Is it still hallucinating? Are the tool calls correct? Manually writing and running evaluation scripts for every iteration slows you down.

Production introduces new challenges. Your agent is handling real traffic, and you need to know when things go wrong, ideally before users complain. Is the agent repeating itself? Is it leaking PII? Is response quality degrading over time?

These challenges share a common solution: continuous, automated quality checks that run without manual intervention.

## What is Automatic Evaluation?

Automatic evaluation runs [**LLM judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/#llms-as-judges) on your traces and conversations as they're logged to MLflow. You configure judges once, and they run automatically on new traces as they arrive. Judges run asynchronously on the MLflow server, so your application's performance is unaffected.

An LLM judge is a language model that evaluates the outputs of an agent or LLM application against specific criteria: safety, groundedness, tool usage, user satisfaction, and more. MLflow provides a large ecosystem of [**built-in judges**](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation criteria, plus integrations with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#deepeval), [RAGAS](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#ragas), and [Phoenix Evals](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party#arize-phoenix-evals). The [`make_judge()`](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/) API also enables you to create custom judges from natural language criteria.
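
As a sketch of what a custom judge can look like, based on the linked `make_judge()` docs: the judge name, criteria text, and model URI below are illustrative, not from this post.

```python
from mlflow.genai.judges import make_judge

# Hypothetical judge: checks that responses actually answer the question.
# {{ inputs }} and {{ outputs }} are template variables filled from each trace.
relevance_judge = make_judge(
    name="answer_relevance",
    instructions=(
        "Evaluate whether the response in {{ outputs }} directly answers "
        "the user's question in {{ inputs }}."
    ),
    model="openai:/gpt-4o",  # illustrative; any supported provider URI works
)
```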

## Get Started

Before setting up automatic evaluation, you'll need MLflow 3.9 or later:

```bash
pip install 'mlflow>=3.9'
```

Next, enable tracing for your agent. MLflow Tracing works with any agent framework and programming language via [OpenTelemetry](https://mlflow.org/docs/latest/genai/tracing/opentelemetry/). MLflow also provides [more than 40 autologging integrations](https://mlflow.org/docs/latest/genai/tracing/integrations/) for tracing popular libraries like LangChain, the Vercel AI SDK, and the OpenAI SDK with just one line of code:

```python
import mlflow

mlflow.openai.autolog()  # or mlflow.langchain.autolog(), etc.
```
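
If your library isn't covered by an autologging integration, you can also instrument code directly with the `@mlflow.trace` decorator. A minimal sketch, in which the function and experiment name are hypothetical:

```python
import mlflow

mlflow.set_experiment("my-agent")  # hypothetical experiment name


@mlflow.trace  # captures inputs, outputs, and latency on a trace
def answer_question(question: str) -> str:
    # Call your LLM or agent here; a placeholder stands in for it
    return f"Answer to: {question}"
```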

Once tracing is enabled, you can set up automatic evaluation through the MLflow UI or SDK.

### Setting Up Automatic Evaluation

The fastest way to get started is through the MLflow UI. Navigate to your experiment, open the **Judges** tab, and click **+ New LLM judge**. Select a built-in judge (like safety or correctness), configure your sampling rate, and save. New traces will be evaluated automatically.

<video width="100%" controls autoPlay loop muted>
  <source src={require("./automatic-evaluation-ui-setup.mp4").default} type="video/mp4" />
</video>

You can also configure judges programmatically:

```python
from mlflow.genai.scorers import Safety, ScorerSamplingConfig

judge = Safety(model="gateway:/my-llm-endpoint")
registered = judge.register(name="safety_check")
registered.start(
    sampling_config=ScorerSamplingConfig(
        sample_rate=0.5,  # Evaluate 50% of traces; use 1.0 for dev, lower for production
        filter_string="metadata.environment = 'production'",  # Optional: target specific traces
    )
)
```
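
For a `filter_string` like the one above to match, traces need the corresponding metadata. One way to attach it, assuming the `mlflow.update_current_trace()` API available in recent MLflow 3.x releases, is from inside a traced function (the function itself is hypothetical):

```python
import mlflow


@mlflow.trace
def handle_request(question: str) -> str:
    # Label this trace so the judge's filter_string can target it
    mlflow.update_current_trace(metadata={"environment": "production"})
    return f"Answer to: {question}"
```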

For multi-turn conversations, session-level judges evaluate the entire conversation rather than individual traces. Sessions are evaluated after 5 minutes of inactivity, so the judge sees the complete interaction before scoring.
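
For session-level judges to see a whole conversation, traces from the same conversation must share a session identifier. A sketch using MLflow's reserved `mlflow.trace.session` metadata key, with a hypothetical function and ID:

```python
import mlflow


@mlflow.trace
def chat_turn(session_id: str, message: str) -> str:
    # Group this trace with other turns from the same conversation
    mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
    return f"Reply to: {message}"
```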

For complete setup instructions, see the [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/).

## Viewing Results

Quality scores from your judges appear directly in the MLflow UI. The [Overview tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#overview-tab) shows trends over time, so you can see at a glance whether quality is improving or degrading. The [Traces tab](https://mlflow.org/docs/latest/genai/tracing/observe-with-traces/ui/#traces-tab) shows individual scores alongside each trace, letting you drill into specific failures. Scores typically appear within a minute or two of trace logging.
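
Scores are also retrievable programmatically. A sketch using `mlflow.search_traces()`, which returns a pandas DataFrame; per current MLflow docs the `assessments` column holds judge feedback, but verify column names against your MLflow version:

```python
import mlflow

mlflow.set_experiment("my-agent")  # hypothetical experiment name

# Fetch recent traces together with their judge assessments
traces = mlflow.search_traces(max_results=50)
print(traces[["trace_id", "assessments"]].head())
```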

<video width="100%" controls autoPlay loop muted>
  <source
    src={require("./automatic-evaluation-ui-setup.mp4").default}
    type="video/mp4"
  />
</video>

## Learn More

- [Automatic Evaluation documentation](https://mlflow.org/docs/latest/genai/eval-monitor/automatic-evaluations/)
- [Built-in LLM judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/)
- [Creating custom judges](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/custom-judges/create-custom-judge/)
- [Third-party integrations (DeepEval, RAGAS, Phoenix)](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/)

## Feedback

If this is useful, give us a star on GitHub: [github.com/mlflow/mlflow](https://github.com/mlflow/mlflow)

Have questions or feedback? Open an issue on [GitHub](https://github.com/mlflow/mlflow/issues) or join the conversation in the MLflow community.