POLLUX LLM-Judge metric #2610

Open
ulyanaisaeva wants to merge 2 commits into confident-ai:main from ulyanaisaeva:main

Conversation

@ulyanaisaeva

This PR adds POLLUX, a criteria-based LLM-judge metric suitable for any generative task, with customizable criteria descriptions.

Summary

  • Adds PolluxJudgeMetric (deepeval.metrics.pollux.pollux): a rubric-based POLLUX judge wired to an OpenAI-compatible chat completions API (openai / AsyncOpenAI), with the usual DeepEval knobs (threshold, strict_mode, normalize_score, async_mode, include_reason, verbose_mode, max_tokens, temperature).
  • deepeval.metrics.pollux.pollux_utils: build_pollux_prompt, normalize_rubrics (dict → sorted score: text lines + sorted keys; requires ≥2 numeric levels), parse_score / parse_feedback, and regex constants POLLUX_DEFAULT_SCORE_RE, POLLUX_TAGGED_SCORE_RE, POLLUX_TAGGED_FEEDBACK_RE (the default feedback pattern is None, so no reason is extracted unless you pass a pattern); see the sketch after this list.
  • Mapping: LLMTestCase.input → instruction, actual_output → answer, optional expected_output → reference (omitted in the prompt when empty).
  • Parsing: optional score_pattern / feedback_pattern on the metric (see the tagged-output example after the usage example below); by default the score is read from a plain numeric completion. An unparseable score raises ValueError with metric.error set (unlike LightEval's 0.0 fallback).
  • Exports: PolluxJudgeMetric from deepeval.metrics; metric + all POLLUX_* regex helpers from deepeval.metrics.pollux.
  • Docs: docs/docs/metrics-pollux-judge.mdx (parameters, tagged-output example, vLLM prerequisite).
  • Tests: tests/test_metrics/test_pollux_judge_metric.py — mocked HTTP via patched _get_sync_client / _get_async_client (sync, async, tagged patterns, non-zero-based rubrics, normalize_score=False, strict thresholds, parse failure, rubrics validation). Optional live test test_integration_with_real_endpoint when POLLUX_BASE_URL is set (POLLUX_MODEL, POLLUX_API_KEY, POLLUX_USE_TAGGED_JUDGE_OUTPUT optional).
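
To make the rubric handling concrete, here is a hedged sketch (not the PR's actual code) of how normalize_rubrics and the default score parsing described above could behave; the function names follow the summary, while the bodies are assumptions.

import re

# Assumption for this sketch: by default the judge returns a plain numeric completion.
POLLUX_DEFAULT_SCORE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*$")

def normalize_rubrics(rubrics: dict) -> tuple[str, list]:
    # Turn a {score: description} dict into sorted "score: text" lines plus sorted keys;
    # at least two numeric levels are required.
    if len(rubrics) < 2:
        raise ValueError("POLLUX rubrics require at least 2 numeric score levels.")
    keys = sorted(rubrics)
    lines = "\n".join(f"{k}: {rubrics[k]}" for k in keys)
    return lines, keys

def parse_score(completion: str, pattern=POLLUX_DEFAULT_SCORE_RE) -> float:
    match = pattern.search(completion)
    if match is None:
        # Unlike LightEval's 0.0 fallback, an unparseable score is a hard error.
        raise ValueError(f"Could not parse a score from judge output: {completion!r}")
    return float(match.group(1))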

Usage example

from deepeval import evaluate
from deepeval.metrics import PolluxJudgeMetric
from deepeval.test_case import LLMTestCase

metric = PolluxJudgeMetric(
    criteria_name="Correctness",
    rubrics={
        0: "Wrong answer.",
        1: "Partially correct answer.",
        2: "Fully correct answer.",
    },
    judge_model="ai-forever/Pollux-4B-Judge",  # POLLUX judge checkpoint served separately
    base_url="http://localhost:8000/v1",       # OpenAI-compatible endpoint (e.g. vLLM)
    api_key="NONE",
    normalize_score=True,   # map the raw rubric score onto [0, 1]
    strict_mode=False,
)

test_case = LLMTestCase(
    input="What is 2 + 2?",
    actual_output="4",
    expected_output="4",
)

evaluate(test_cases=[test_case], metrics=[metric])
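
For the tagged judge output mentioned in the summary, the exported regex constants can be passed as parsing patterns. This is a hedged sketch based on the score_pattern / feedback_pattern parameters described above; the exact signature may differ in the actual implementation.

from deepeval.metrics import PolluxJudgeMetric
from deepeval.metrics.pollux import POLLUX_TAGGED_SCORE_RE, POLLUX_TAGGED_FEEDBACK_RE

tagged_metric = PolluxJudgeMetric(
    criteria_name="Correctness",
    rubrics={0: "Wrong answer.", 1: "Partially correct answer.", 2: "Fully correct answer."},
    judge_model="ai-forever/Pollux-4B-Judge",
    base_url="http://localhost:8000/v1",
    api_key="NONE",
    score_pattern=POLLUX_TAGGED_SCORE_RE,        # extract the score from a tagged completion
    feedback_pattern=POLLUX_TAGGED_FEEDBACK_RE,  # also extract a reason (default pattern is None)
    include_reason=True,
)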

Strict mode and threshold semantics

  • strict_mode=False: the user-provided threshold is used, applied to the final metric score (normalized to [0, 1] or the raw rubric value, depending on normalize_score); the resolution logic is sketched after this list.
  • strict_mode=True:
    • normalize_score=True: threshold becomes 1.0 (only top of scale passes).
    • normalize_score=False: threshold becomes max(rubric_keys) (maximum rubric score).
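
As a hedged illustration of this rule (not the actual implementation), the effective pass threshold could be resolved like this:

def resolve_threshold(user_threshold, strict_mode, normalize_score, rubric_keys):
    if not strict_mode:
        # Non-strict: the user-provided threshold applies to the final metric score.
        return user_threshold
    # Strict: only the top of the scale passes.
    return 1.0 if normalize_score else max(rubric_keys)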

Prerequisites

POLLUX judge checkpoints are served separately via an OpenAI-compatible API (for example vLLM):

vllm serve ai-forever/Pollux-4B-Judge --port 8000
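
As an optional, purely illustrative sanity check (not part of the PR), any OpenAI-compatible client can be pointed at the served endpoint before running the metric:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NONE")
response = client.chat.completions.create(
    model="ai-forever/Pollux-4B-Judge",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(response.choices[0].message.content)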

@vercel

vercel Bot commented Apr 10, 2026

@ulyanaisaeva is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

@penguine-ip
Contributor

Hey @ulyanaisaeva thanks for the PR - what functionality does this add on top of our already available GEval metric?

ulyanaisaeva marked this pull request as ready for review on April 13, 2026 at 08:31
@ulyanaisaeva
Author

ulyanaisaeva commented Apr 16, 2026

Hi @penguine-ip!

GEval's value is in its two-step flow (auto-generate eval steps, then evaluate with logprobs weighting) and in deepeval's model abstraction. POLLUX judges are small fine-tuned models designed for single-shot rubric evaluation with plain-numeric output. Plugging them into GEval would mean disabling step generation, replacing the JSON parser with regex, bypassing DeepEvalBaseLLM in favor of a direct OpenAI-compatible client, and dropping logprobs, at which point nothing meaningful from GEval would actually be reused.

That said, if deepeval eventually adds hooks in BaseMetric for "single-shot judge" style metrics (custom prompt → custom parser → score), we'd happily migrate to that. For now, a lean standalone metric seemed cleaner than a GEval subclass that overrides every method.

@ulyanaisaeva
Author

Hi! I'd like to check what the next steps are from my side to get this PR merged.

From what I can see, the PR is currently not mergeable and these checks are failing:

  • lint (ubuntu-latest) at psf/black
  • Core Tests / test at Run tests (no secrets)
  • Confident Tests / test at Run tests

I also see a separate Vercel failure (“Authorization required to deploy”), which looks like a permissions/infrastructure issue rather than a code issue in this PR.

Could you please clarify:

  • which of these failures are merge-blocking for this PR,
  • whether you want me to rebase/update from main and re-run CI,
  • whether you’d like any additional code/docs changes before merge?
