POLLUX LLM-Judge metric #2610

Open
ulyanaisaeva wants to merge 2 commits into confident-ai:main from ulyanaisaeva:main

Conversation

@ulyanaisaeva

This PR adds POLLUX, a criteria-based LLM-judge metric suitable for any generative task, with customizable criteria descriptions.

Summary

  • Adds PolluxJudgeMetric (deepeval.metrics.pollux.pollux): a rubric-based POLLUX judge wired to an OpenAI-compatible chat completions API (openai / AsyncOpenAI), with the usual DeepEval knobs (threshold, strict_mode, normalize_score, async_mode, include_reason, verbose_mode, max_tokens, temperature).
  • deepeval.metrics.pollux.pollux_utils: build_pollux_prompt, normalize_rubrics (dict → sorted score: text lines + sorted keys; requires ≥2 numeric levels), parse_score / parse_feedback, and regex constants POLLUX_DEFAULT_SCORE_RE, POLLUX_TAGGED_SCORE_RE, POLLUX_TAGGED_FEEDBACK_RE (the default feedback pattern is None, so no reason is extracted unless you pass a pattern); see the sketch after this list.
  • Mapping: LLMTestCase.input → instruction, actual_output → answer, optional expected_output → reference (omitted in the prompt when empty).
  • Parsing: optional score_pattern / feedback_pattern on the metric (see the tagged-output example after the usage example below); by default the score is read from a plain numeric completion. An unparseable score raises ValueError with metric.error set (unlike LightEval's 0.0 fallback).
  • Exports: PolluxJudgeMetric from deepeval.metrics; metric + all POLLUX_* regex helpers from deepeval.metrics.pollux.
  • Docs: docs/docs/metrics-pollux-judge.mdx (parameters, tagged-output example, vLLM prerequisite).
  • Tests: tests/test_metrics/test_pollux_judge_metric.py — mocked HTTP via patched _get_sync_client / _get_async_client (sync, async, tagged patterns, non-zero-based rubrics, normalize_score=False, strict thresholds, parse failure, rubrics validation). Optional live test test_integration_with_real_endpoint when POLLUX_BASE_URL is set (POLLUX_MODEL, POLLUX_API_KEY, POLLUX_USE_TAGGED_JUDGE_OUTPUT optional).
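
To make the rubric handling concrete, here is a hedged sketch (not the PR's actual code) of how normalize_rubrics and the default score parsing described above could behave; the function names follow the summary, while the bodies are assumptions.

import re

# Assumption for this sketch: by default the judge returns a plain numeric completion.
POLLUX_DEFAULT_SCORE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*$")

def normalize_rubrics(rubrics: dict) -> tuple[str, list]:
    # Turn a {score: description} dict into sorted "score: text" lines plus sorted keys;
    # at least two numeric levels are required.
    if len(rubrics) < 2:
        raise ValueError("POLLUX rubrics require at least 2 numeric score levels.")
    keys = sorted(rubrics)
    lines = "\n".join(f"{k}: {rubrics[k]}" for k in keys)
    return lines, keys

def parse_score(completion: str, pattern=POLLUX_DEFAULT_SCORE_RE) -> float:
    match = pattern.search(completion)
    if match is None:
        # Unlike LightEval's 0.0 fallback, an unparseable score is a hard error.
        raise ValueError(f"Could not parse a score from judge output: {completion!r}")
    return float(match.group(1))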

Usage example

from deepeval import evaluate
from deepeval.metrics import PolluxJudgeMetric
from deepeval.test_case import LLMTestCase

metric = PolluxJudgeMetric(
    criteria_name="Correctness",
    rubrics={
        0: "Wrong answer.",
        1: "Partially correct answer.",
        2: "Fully correct answer.",
    },
    judge_model="ai-forever/Pollux-4B-Judge",  # POLLUX judge checkpoint served separately
    base_url="http://localhost:8000/v1",       # OpenAI-compatible endpoint (e.g. vLLM)
    api_key="NONE",
    normalize_score=True,   # map the raw rubric score onto [0, 1]
    strict_mode=False,
)

test_case = LLMTestCase(
    input="What is 2 + 2?",
    actual_output="4",
    expected_output="4",
)

evaluate(test_cases=[test_case], metrics=[metric])
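
For the tagged judge output mentioned in the summary, the exported regex constants can be passed as parsing patterns. This is a hedged sketch based on the score_pattern / feedback_pattern parameters described above; the exact signature may differ in the actual implementation.

from deepeval.metrics import PolluxJudgeMetric
from deepeval.metrics.pollux import POLLUX_TAGGED_SCORE_RE, POLLUX_TAGGED_FEEDBACK_RE

tagged_metric = PolluxJudgeMetric(
    criteria_name="Correctness",
    rubrics={0: "Wrong answer.", 1: "Partially correct answer.", 2: "Fully correct answer."},
    judge_model="ai-forever/Pollux-4B-Judge",
    base_url="http://localhost:8000/v1",
    api_key="NONE",
    score_pattern=POLLUX_TAGGED_SCORE_RE,        # extract the score from a tagged completion
    feedback_pattern=POLLUX_TAGGED_FEEDBACK_RE,  # also extract a reason (default pattern is None)
    include_reason=True,
)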

Strict mode and threshold semantics

  • strict_mode=False: the user-provided threshold is used, applied to the final metric score (normalized to [0, 1] or the raw rubric value, depending on normalize_score); the resolution logic is sketched after this list.
  • strict_mode=True:
    • normalize_score=True: threshold becomes 1.0 (only top of scale passes).
    • normalize_score=False: threshold becomes max(rubric_keys) (maximum rubric score).
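
As a hedged illustration of this rule (not the actual implementation), the effective pass threshold could be resolved like this:

def resolve_threshold(user_threshold, strict_mode, normalize_score, rubric_keys):
    if not strict_mode:
        # Non-strict: the user-provided threshold applies to the final metric score.
        return user_threshold
    # Strict: only the top of the scale passes.
    return 1.0 if normalize_score else max(rubric_keys)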

Prerequisites

POLLUX judge checkpoints are served separately via an OpenAI-compatible API (for example vLLM):

vllm serve ai-forever/Pollux-4B-Judge --port 8000
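
As an optional, purely illustrative sanity check (not part of the PR), any OpenAI-compatible client can be pointed at the served endpoint before running the metric:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NONE")
response = client.chat.completions.create(
    model="ai-forever/Pollux-4B-Judge",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(response.choices[0].message.content)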

@vercel

vercel Bot commented Apr 10, 2026

@ulyanaisaeva is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

@penguine-ip
Contributor

Hey @ulyanaisaeva thanks for the PR - what functionality does this add on top of our already available GEval metric?

ulyanaisaeva marked this pull request as ready for review on April 13, 2026 at 08:31
@ulyanaisaeva
Author

ulyanaisaeva commented Apr 16, 2026

Hi @penguine-ip!

GEval's value is in its two-step flow (auto-generate eval steps, then evaluate with logprobs weighting) and in deepeval's model abstraction. POLLUX judges are small fine-tuned models designed for single-shot rubric evaluation with plain-numeric output. Plugging them into GEval would mean disabling step generation, replacing the JSON parser with regex, bypassing DeepEvalBaseLLM in favor of a direct OpenAI-compatible client, and dropping logprobs, at which point nothing meaningful from GEval would actually be reused.

That said, if deepeval eventually adds hooks in BaseMetric for "single-shot judge" style metrics (custom prompt → custom parser → score), we'd happily migrate to that. For now, a lean standalone metric seemed cleaner than a GEval subclass that overrides every method.

@ulyanaisaeva
Author

Hi! I'd like to check what the next steps are from my side to get this PR merged.

From what I can see, the PR is currently not mergeable and these checks are failing:

  • lint (ubuntu-latest) at psf/black
  • Core Tests / test at Run tests (no secrets)
  • Confident Tests / test at Run tests

I also see a separate Vercel failure (“Authorization required to deploy”), which looks like a permissions/infrastructure issue rather than a code issue in this PR.

Could you please clarify:

  • which of these failures are merge-blocking for this PR,
  • whether you want me to rebase/update from main and re-run CI,
  • whether you’d like any additional code/docs changes before merge?
