
Commit 16203d6

[NVBug 6007314] Deprecate MT-Bench support, remove openai pin, and add NeMo Evaluator reference (#1116)
### What does this PR do?

**Type of change:** Deprecation, Bug fix, Documentation

Removes MT-Bench (FastChat) evaluation support from `examples/llm_eval` and `examples/llm_ptq`. Also removes the stale `openai>=0.28.1` pin from `requirements.txt` that caused dependency conflicts with TRT-LLM (see [NVBug 6007314](https://nvbugspro.nvidia.com/bug/6007314)). Adds a NeMo Evaluator section to the llm_eval README as the recommended evaluation workflow for quantized checkpoints.

**Changes:**

- Delete `examples/llm_eval/run_fastchat.sh` and `examples/llm_eval/gen_model_answer.py`
- Remove the `mtbench` task from `examples/llm_ptq/scripts/parser.sh` and `huggingface_example.sh`
- Remove the `openai` dependency from `examples/llm_eval/requirements.txt`
- Add a NeMo Evaluator section to `examples/llm_eval/README.md` as the recommended way to evaluate quantized checkpoints from llm_ptq via TensorRT-LLM, vLLM, or SGLang
- Update README docs in both `llm_eval` and `llm_ptq`
- Add a deprecation note to CHANGELOG.rst for 0.43

### Usage

N/A — this is a removal and documentation update.

### Testing

N/A — removed code paths; no new functionality introduced.

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and that your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoid hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ❌ — MT-Bench evaluation via `--tasks mtbench` is no longer supported.
- If you copied code from any other sources or added a new PIP dependency, did you follow the guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

### Additional Information

Related: [NVBug 6007314](https://nvbugspro.nvidia.com/bug/6007314) — the openai dependency conflict was caused by FastChat's `llm_judge` extra pinning `openai<1`.

## Summary by CodeRabbit

* **Deprecations**
  * Removed MT-Bench (FastChat) evaluation support. NeMo Evaluator is now the recommended approach for evaluating quantized model checkpoints across multiple benchmarks.
* **Documentation**
  * Updated evaluation guides to reflect NeMo Evaluator as the primary evaluation method, with support for TensorRT-LLM, vLLM, and SGLang serving backends.

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
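For context, the serve-then-evaluate flow the new README section recommends looks roughly like the sketch below. It uses vLLM's OpenAI-compatible server as one of the supported serving backends; the checkpoint path is a hypothetical placeholder, and the actual NeMo Evaluator invocation should be taken from its quickstart documentation rather than from this sketch.

```bash
# Sketch only: serve a (hypothetical) llm_ptq-quantized checkpoint with vLLM's
# OpenAI-compatible server, then point an evaluator at the endpoint.
# /ckpt/llama-3.1-8b-fp8 is a placeholder path, not a real artifact.
vllm serve /ckpt/llama-3.1-8b-fp8 --host 0.0.0.0 --port 8000 &

# Sanity-check the endpoint before launching any benchmark suite.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/ckpt/llama-3.1-8b-fp8",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16
      }'

# NeMo Evaluator (or any OpenAI-API-compatible harness) can then target
# http://localhost:8000/v1 -- see the NeMo Evaluator quickstart for the
# actual launcher command and benchmark configuration.
```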
1 parent ae965a9 commit 16203d6

9 files changed

Lines changed: 7 additions & 725 deletions


.pre-commit-config.yaml

Lines changed: 0 additions & 1 deletion
@@ -89,7 +89,6 @@ repos:
           examples/deepseek/ptq.py|
           examples/diffusers/cache_diffusion/pipeline/models/sdxl.py|
           examples/diffusers/quantization/onnx_utils/export.py|
-          examples/llm_eval/gen_model_answer.py|
           examples/llm_eval/humaneval.py|
           examples/llm_eval/lm_eval_hf.py|
           examples/llm_eval/mmlu.py|

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ NVIDIA Model Optimizer Changelog
 
 **Deprecations**
 
+- Removed MT-Bench (FastChat) support from ``examples/llm_eval``. The ``run_fastchat.sh`` and ``gen_model_answer.py`` scripts have been deleted, and the ``mtbench`` task has been removed from the ``llm_ptq`` example scripts.
 - Remove deprecated NeMo-2.0 Framework references.
 
 **Misc**

examples/llm_eval/README.md

Lines changed: 4 additions & 28 deletions
@@ -4,6 +4,10 @@ This folder includes popular 3rd-party LLM benchmarks for LLM accuracy evaluatio
 
 The following instructions show how to evaluate the Model Optimizer quantized LLM with the benchmarks, including the TensorRT-LLM deployment.
 
+## NeMo Evaluator
+
+[NeMo Evaluator](https://docs.nvidia.com/nemo/evaluator/latest/get-started/quickstart/index.html#self-hosted-options) is the recommended way to evaluate a large choice of benchmarks on quantized checkpoints generated from [llm_ptq](../llm_ptq). Quantized checkpoints can be served with [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm), or [SGLang](https://github.com/sgl-project/sglang) and then evaluated using NeMo Evaluator.
+
 ## LM-Eval-Harness
 
 [LM-Eval-Harness](https://github.com/EleutherAI/lm-evaluation-harness) provides a unified framework to test generative language models on a large number of different evaluation tasks.
@@ -143,34 +147,6 @@ python mmlu.py --model_name causal --model_path <HF model folder or model card>
 python mmlu.py --model_name causal --model_path <HF model folder or model card> --checkpoint_dir <Quantized checkpoint dir>
 ```
 
-## MT-Bench
-
-[MT-Bench](https://arxiv.org/abs/2306.05685). These responses are generated using [FastChat](https://github.com/lm-sys/FastChat).
-
-### Baseline
-
-```bash
-bash run_fastchat.sh -h <HF model folder or model card>
-```
-
-### Quantized (simulated)
-
-```bash
-# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
-bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
-```
-
-### Evaluate with TensorRT-LLM
-
-```bash
-bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
-```
-
-### Judging the responses
-
-The responses to questions from MT Bench will be stored under `data/mt_bench/model_answer`.
-The quality of the responses can be judged using [llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) from the FastChat repository. Please refer to the [llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to compute the final MT-Bench score.
 
 ## LiveCodeBench
 
 [LiveCodeBench](https://livecodebench.github.io/) is a holistic and contamination-free evaluation benchmark of LLMs for code that continuously collects new problems over time.
