POLLUX LLM-Judge metric #2610
Conversation
@ulyanaisaeva is attempting to deploy a commit to the Confident AI Team on Vercel. A member of the Team first needs to authorize it.

Hey @ulyanaisaeva thanks for the PR - what functionality does this add on top of our already available GEval metric?

Hi @penguine-ip! GEval's value is in its two-step flow (auto-generate eval steps → evaluate with logprobs weighting) and deepeval's model abstraction. POLLUX judges are small fine-tuned models designed for a single-shot rubric evaluation with plain-numeric output — plugging them into GEval would mean disabling steps generation, replacing the JSON parser with regex, bypassing DeepEvalBaseLLM for a direct OpenAI-compatible client, and dropping logprobs — at which point nothing meaningful from GEval is actually reused. That said, if deepeval eventually adds hooks in BaseMetric for "single-shot judge" style metrics (custom prompt → custom parser → score), we'd happily migrate to that. For now, a lean standalone metric seemed cleaner than a GEval subclass that overrides every method.
Hi! I'd like to check what the next steps are from my side to get this PR merged. From what I can see, the PR is currently not mergeable and these checks are failing:
I also see a separate Vercel failure (“Authorization required to deploy”), which looks like a permissions/infrastructure issue rather than a code issue in this PR. Could you please clarify:

This PR adds POLLUX, a criteria-based LLM judge suitable for any generative task, with customizable criteria descriptions.
Summary
- `PolluxJudgeMetric` (`deepeval.metrics.pollux.pollux`): a rubric-based POLLUX judge wired to an OpenAI-compatible chat completions API (`openai`/`AsyncOpenAI`), with the usual DeepEval knobs (`threshold`, `strict_mode`, `normalize_score`, `async_mode`, `include_reason`, `verbose_mode`, `max_tokens`, `temperature`).
- `deepeval.metrics.pollux.pollux_utils`: `build_pollux_prompt`, `normalize_rubrics` (dict → sorted `score: text` lines + sorted keys; requires ≥2 numeric levels), `parse_score`/`parse_feedback`, and the regex constants `POLLUX_DEFAULT_SCORE_RE`, `POLLUX_TAGGED_SCORE_RE`, `POLLUX_TAGGED_FEEDBACK_RE` (the default feedback pattern is `None` — no reason is extracted unless you pass a pattern).
- Test-case mapping: `LLMTestCase.input` → instruction, `actual_output` → answer, optional `expected_output` → reference (omitted from the prompt when empty).
- Customization: `score_pattern`/`feedback_pattern` on the metric; the default score format is a plain numeric completion. An unparseable score raises `ValueError` with `metric.error` set (unlike LightEval's 0.0 fallback).
- Exports: `PolluxJudgeMetric` from `deepeval.metrics`; the metric plus all `POLLUX_*` regex helpers from `deepeval.metrics.pollux`.
- Docs: `docs/docs/metrics-pollux-judge.mdx` (parameters, tagged-output example, vLLM prerequisite).
- Tests: `tests/test_metrics/test_pollux_judge_metric.py` — mocked HTTP via patched `_get_sync_client`/`_get_async_client` (sync, async, tagged patterns, non-zero-based rubrics, `normalize_score=False`, strict thresholds, parse failure, rubrics validation). An optional live test, `test_integration_with_real_endpoint`, runs when `POLLUX_BASE_URL` is set (`POLLUX_MODEL`, `POLLUX_API_KEY`, `POLLUX_USE_TAGGED_JUDGE_OUTPUT` optional).

Usage example
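Since the metric itself needs a live judge endpoint, here is a self-contained sketch of the rubric → prompt → parse flow described above. The helper names mirror the ones in this PR's `pollux_utils`, but the bodies are illustrative stand-ins, not the actual implementations:

```python
import re

# Hypothetical stand-ins for the pollux_utils helpers named above; the real
# implementations live in deepeval.metrics.pollux.pollux_utils and may differ.
POLLUX_DEFAULT_SCORE_RE = re.compile(r"(-?\d+(?:\.\d+)?)")

def normalize_rubrics(rubrics: dict) -> tuple[str, list[float]]:
    """Render the rubric dict as sorted 'score: text' lines; require >= 2 levels."""
    keys = sorted(float(k) for k in rubrics)
    if len(keys) < 2:
        raise ValueError("rubrics must define at least two numeric levels")
    lines = "\n".join(f"{k}: {text}" for k, text in sorted(rubrics.items()))
    return lines, keys

def parse_score(completion: str, pattern: re.Pattern = POLLUX_DEFAULT_SCORE_RE) -> float:
    """Pull the first numeric score out of the judge's completion."""
    match = pattern.search(completion)
    if match is None:
        raise ValueError(f"could not parse a score from {completion!r}")
    return float(match.group(1))

rubrics = {0: "incorrect", 1: "partially correct", 2: "fully correct"}
rubric_block, keys = normalize_rubrics(rubrics)
raw = parse_score("2")  # a plain numeric completion from the judge
normalized = (raw - keys[0]) / (keys[-1] - keys[0])  # map onto [0, 1]
print(rubric_block.splitlines()[0])  # 0: incorrect
print(normalized)                    # 1.0
```

In the real metric the rubric block is embedded into the judge prompt by `build_pollux_prompt` and the completion comes back from the OpenAI-compatible endpoint; only the parse-failure behavior (`ValueError` rather than a silent 0.0) is taken directly from the PR description.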
Strict mode and threshold semantics
- `strict_mode=False`: use the user-provided `threshold` (applied to the final metric score: normalized `[0, 1]` or the raw rubric value, depending on `normalize_score`).
- `strict_mode=True`:
  - `normalize_score=True`: the threshold becomes `1.0` (only the top of the scale passes).
  - `normalize_score=False`: the threshold becomes `max(rubric_keys)` (the maximum rubric score).
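These rules can be sketched as a small pure function (the function name is mine, not from this PR; it only mirrors the semantics above):

```python
def effective_threshold(
    threshold: float,
    strict_mode: bool,
    normalize_score: bool,
    rubric_keys: list[float],
) -> float:
    """Resolve the passing threshold per the strict-mode rules above (sketch)."""
    if not strict_mode:
        return threshold  # user-provided threshold is used as-is
    # strict mode: only the top of the scale passes
    return 1.0 if normalize_score else max(rubric_keys)

print(effective_threshold(0.5, False, True, [0, 1, 2]))  # 0.5
print(effective_threshold(0.5, True, True, [0, 1, 2]))   # 1.0
print(effective_threshold(0.5, True, False, [0, 1, 2]))  # 2
```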
Prerequisites

POLLUX judge checkpoints are served separately via an OpenAI-compatible API (for example, vLLM):
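A typical invocation might look like the following; the checkpoint path, port, and API key are placeholders, not names from this PR:

```shell
# Sketch only: serve a POLLUX judge checkpoint through vLLM's
# OpenAI-compatible server (placeholder model path and credentials).
vllm serve /path/to/pollux-judge-checkpoint \
  --port 8000 \
  --api-key local-key
```

The metric would then be pointed at the server's OpenAI-compatible base URL (e.g. `http://localhost:8000/v1`), matching the `POLLUX_BASE_URL` environment variable used by the optional live test.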