Add qwen3.5-fp4-b200-trt-mtp single-node TensorRT-LLM benchmark #1894
Add the qwen3.5-fp4-b200-trt-mtp config (Qwen3.5-397B-A17B-NVFP4, B200, 1k/1k and 8k/1k) with MTP speculative decode across a TP/TEP/DEP parallelism sweep, the qwen3.5_fp4_b200_trt_mtp.sh benchmark script, and a perf-changelog entry.
# Conflicts: # perf-changelog.yaml
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7649ae1.
    - 16
    - 32
    - 64
    - 128
CUDA graph sizes exceed max batch
Medium Severity
The extra LLM config hardcodes cuda_graph_config.batch_sizes through 128, while trtllm-serve gets --max_batch_size from CONC or CONC/8 (often 4–16 in this recipe). Peer Qwen and TRT-MTP scripts tie CUDA graph capture to MAX_BATCH_SIZE via max_batch_size, so graph warmup can overshoot the runtime batch cap and risk validation failures or excess memory use on low-concurrency jobs.
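One possible shape for the fix, as a minimal sketch rather than the script's actual code: emit the `cuda_graph_config` block from the same value that is passed to `--max_batch_size`, so graph capture cannot exceed the runtime cap. The helper name `emit_cuda_graph_config` and the default of 8 are assumptions for illustration; only `cuda_graph_config`, `max_batch_size`, and `MAX_BATCH_SIZE` come from the comment above.

```shell
#!/usr/bin/env bash
# Sketch only: emit a cuda_graph_config fragment whose capture limit follows
# the same cap that trtllm-serve receives via --max_batch_size.
# emit_cuda_graph_config is a hypothetical helper, not part of the PR.
emit_cuda_graph_config() {
  local max_batch_size="$1"
  # Reject empty, zero, or non-numeric caps instead of writing a bad config.
  if ! [[ "$max_batch_size" =~ ^[1-9][0-9]*$ ]]; then
    echo "max_batch_size must be a positive integer, got: '$max_batch_size'" >&2
    return 1
  fi
  cat <<EOF
cuda_graph_config:
  max_batch_size: ${max_batch_size}
EOF
}

# Example value (assumption); the real script derives it from CONC or CONC/8.
MAX_BATCH_SIZE="${MAX_BATCH_SIZE:-8}"
emit_cuda_graph_config "$MAX_BATCH_SIZE"
```

With this shape, a low-concurrency job that serves with a cap of 8 would also capture graphs only up to batch size 8, instead of warming up sizes through 128.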
    run_benchmark_serving \
        --model "$MODEL" \
        --port "$PORT" \
        --backend openai \
        --input-len "$ISL" \
        --output-len "$OSL" \
        --random-range-ratio "$RANDOM_RANGE_RATIO" \
        --num-prompts "$(( CONC * 10 ))" \
        --max-concurrency "$CONC" \
        --result-filename "$RESULT_FILENAME" \
        --result-dir /workspace/
missing --chat-templates
Thanks for catching it!
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28002602936
# Conflicts: # perf-changelog.yaml
MTP runs need --use-chat-template on run_benchmark_serving for meaningful acceptance, matching the other single-node MTP scripts.
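A minimal sketch of how the flag could be wired in, assuming the benchmark arguments are collected in an array first. `build_benchmark_args`, `BENCH_ARGS`, and the placeholder values are hypothetical; the flags are the ones quoted in the review plus `--use-chat-template` from the commit message above.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the run_benchmark_serving arguments and append
# --use-chat-template when the run uses MTP speculative decode.
# build_benchmark_args and BENCH_ARGS are hypothetical names.
build_benchmark_args() {
  local spec_decoding="$1"
  BENCH_ARGS=(
    --model "$MODEL"
    --port "$PORT"
    --backend openai
    --input-len "$ISL"
    --output-len "$OSL"
    --random-range-ratio "$RANDOM_RANGE_RATIO"
    --num-prompts "$(( CONC * 10 ))"
    --max-concurrency "$CONC"
    --result-filename "$RESULT_FILENAME"
    --result-dir /workspace/
  )
  if [[ "$spec_decoding" == "mtp" ]]; then
    BENCH_ARGS+=(--use-chat-template)
  fi
}

# Placeholder values for illustration only (assumptions, not the PR's values).
MODEL="placeholder-model"
PORT=8888
ISL=1024
OSL=1024
RANDOM_RANGE_RATIO=1.0
CONC=4
RESULT_FILENAME="placeholder-result"

build_benchmark_args mtp
# The real script would call: run_benchmark_serving "${BENCH_ARGS[@]}"
printf '%s ' "${BENCH_ARGS[@]}"
echo
```

Keeping the flag conditional on the decode mode means the same helper could serve MTP and non-MTP scripts, though a script that only ever runs MTP could simply pass the flag unconditionally.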
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28051750810
As a PR reviewer and CODEOWNER, I have reviewed this and have:
Additional detail section: This is a TRTLLM config, hence no recipe required. Signed:
@Ankur-singh Blocks merge: Check 3 fails — the sign-off's Additional detail section has no recipe link (only "This is a TRTLLM config"); this workflow requires a link even for a TRT-LLM config. Open/link a recipe (vllm-project/recipes or sglang cookbook) or the published recipe page.
/reuse-sweep-run
Adds the qwen3.5-fp4-b200-trt-mtp config (Qwen3.5-397B-A17B-NVFP4 on B200, single-node TensorRT-LLM with MTP speculative decode) for the 1k/1k and 8k/1k cells with a TP/TEP/DEP parallelism sweep.
- nvidia-master.yaml: new config entry + MTP search space.
- qwen3.5_fp4_b200_trt_mtp.sh: trtllm-serve benchmark script; generates the extra-llm-api config (MoE backend, attention-DP / batch-wait settings, MTP speculative config) per parallelism mode.
Note
Low Risk
Benchmark-only wiring (YAML config, launch script, changelog); no production inference, auth, or data-path changes.
Overview
Adds qwen3.5-fp4-b200-trt-mtp so Qwen3.5-397B-A17B-NVFP4 on B200 can be measured with single-node TensorRT-LLM and MTP speculative decode, alongside the existing non-MTP qwen3.5-fp4-b200-trt entry.
- nvidia-master.yaml registers the config on tensorrt-llm/release:1.3.0rc18 with 1k/1k and 8k/1k fixed-seq-len cells and a TP / EP / attention-DP search space where every point sets spec-decoding: "mtp".
- qwen3.5_fp4_b200_trt_mtp.sh drives trtllm-serve (pytorch backend): disables FlashInfer GDN prefill for MTP, writes qwen3.5-fp4-trt-mtp.yml with MTP (num_nextn_predict_layers: 3), CUTEDSL vs TRTLLM MoE and KV / batch-wait tuning keyed off DP attention and ISL/TP/EP, then runs the standard serving benchmark (optional lm-eval).
- perf-changelog.yaml documents the new config key.
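For readers unfamiliar with the extra-llm-api file, the sketch below shows roughly what generating it per parallelism mode could look like. It is an illustration under assumptions, not the PR's script: `write_extra_llm_config` is a hypothetical helper, the YAML key layout (`enable_attention_dp`, `moe_config.backend`, `speculative_config`) is assumed from common TensorRT-LLM LLM API usage, and the rule that picks CUTEDSL or TRTLLM is left to the caller because it is not spelled out here. Only the file name and `num_nextn_predict_layers: 3` are taken from the summary above.

```shell
#!/usr/bin/env bash
# Sketch only: write an extra-llm-api YAML with the MTP speculative config.
# write_extra_llm_config is a hypothetical helper; the key layout is assumed.
write_extra_llm_config() {
  local out_file="$1"       # path of the YAML file to write
  local dp_attention="$2"   # "true" or "false"
  local moe_backend="$3"    # e.g. CUTEDSL or TRTLLM, chosen by the caller
  cat > "$out_file" <<EOF
enable_attention_dp: ${dp_attention}
moe_config:
  backend: ${moe_backend}
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF
}

# Write into a temp directory so the example does not clobber real files.
WORK_DIR="$(mktemp -d)"
EXTRA_CONFIG="$WORK_DIR/qwen3.5-fp4-trt-mtp.yml"
write_extra_llm_config "$EXTRA_CONFIG" true CUTEDSL
cat "$EXTRA_CONFIG"
```

The KV cache and batch-wait settings mentioned above are omitted because their values are not given here.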