---
title: Judge Builder
description: Build, test, and deploy custom LLM judges directly from the MLflow UI. Define evaluation criteria with natural language, preview results on real traces, and schedule automatic evaluation—no code required.
date: 2026-01-23
authors:
- name: MLflow Team
---

# Judge Builder

![Judge Builder Hero](./hero.png)

## What's New

MLflow 3.9 introduces the **Judge Builder**, a visual interface for creating custom LLM judges directly in the MLflow UI.

Evaluating LLM applications is challenging because quality is often subjective and context-dependent. While MLflow has long supported LLM-as-a-Judge evaluation through the SDK, creating and iterating on judges required coding expertise and constant context-switching between your IDE and the traces UI. With Judge Builder, you can now define evaluation criteria in natural language and test judges on real traces without leaving the UI.

## Get Started

```bash
# Install MLflow with the GenAI extras (Judge Builder ships in 3.9+)
pip install 'mlflow[genai]>=3.9'
# Start the tracking server; the UI is served at http://localhost:5000 by default
mlflow server
```

### 1. Navigate to the Judges Tab

From any experiment in MLflow, select the **Judges** tab and click **New LLM judge** to open Judge Builder.

### 2. Define Your Evaluation Scope

Choose what you want to evaluate:

- **Traces**: Evaluate individual request-response pairs for quality, correctness, and safety
- **Sessions**: Evaluate entire multi-turn conversations for coherence and goal completion
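The scope determines which template variables your instructions can reference later: trace-scoped judges see a single request-response pair (`{{ inputs }}`, `{{ outputs }}`, `{{ trace }}`), while session-scoped judges see the whole exchange (`{{ conversation }}`). A quick illustration of how the criteria differ (the wording here is just an example):

```python
# Trace-scoped criteria: judge one request-response pair.
trace_instructions = (
    "Evaluate whether the response in {{ outputs }} correctly and safely "
    "answers the question in {{ inputs }}. Answer 'yes' or 'no'."
)

# Session-scoped criteria: judge the whole multi-turn conversation.
session_instructions = (
    "Review {{ conversation }} and decide whether the assistant stayed "
    "coherent across turns and ultimately completed the user's goal."
)
```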

### 3. Configure Your Judge

The Judge Builder provides an intuitive interface for defining your judge:

| Field | Description |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| **LLM Judge** | Start from a built-in judge or select "Custom judge" for full control |
| **Name** | A unique identifier for your judge (e.g., `tone_checker`, `accuracy_evaluator`) |
| **Instructions** | Natural language criteria using template variables like `{{ inputs }}`, `{{ outputs }}`, `{{ trace }}`, and `{{ conversation }}` |
| **Output Type** | The return type: boolean, categorical, numeric, or structured |
| **Model** | Select an AI Gateway endpoint or specify a model directly |
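For reference, the same configuration can also be expressed in code. A minimal sketch, assuming the `mlflow.genai.judges.make_judge` API from the MLflow SDK (the model URI and criteria below are illustrative, not prescriptive):

```python
from mlflow.genai.judges import make_judge

# Mirrors the form above: a name, natural-language instructions with
# template variables, and the model that powers the judge.
tone_checker = make_judge(
    name="tone_checker",
    instructions=(
        "Given the user request in {{ inputs }} and the response in "
        "{{ outputs }}, judge whether the response maintains a professional, "
        "helpful tone. Answer 'yes' or 'no' with a brief rationale."
    ),
    model="openai:/gpt-4o-mini",  # illustrative; an AI Gateway endpoint also works
)
```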

### 4. Test on Real Data

Before saving, use the trace/session selector to choose specific traces or sessions, then click **Run judge** to preview the evaluation results. This lets you validate that your instructions produce the expected assessments before the judge goes live.
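From code, a comparable spot-check might look like the following sketch, which reuses the `tone_checker` judge from step 3 and assumes `mlflow.search_traces` and `mlflow.genai.evaluate` from the MLflow SDK (the experiment ID is a placeholder):

```python
import mlflow
from mlflow.genai import evaluate

# Pull a handful of recent traces from the experiment to preview against.
traces = mlflow.search_traces(experiment_ids=["<experiment-id>"], max_results=5)

# Run the judge over them, mirroring the UI's "Run judge" preview.
# Scores are logged as assessments on the traces and summarized in the UI.
results = evaluate(data=traces, scorers=[tone_checker])
```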

See the [Judge Builder documentation](/genai/eval-monitor/scorers/llm-judge/judge-builder) for more details.