Add GLM-5-FP8 GB200 dynamo-sglang multinode benchmark #1895
hshrivastava-droid wants to merge 3 commits into
- nvidia-master.yaml: add glm5-fp8-gb200-dynamo-sglang (14 topologies across 1k/1k and 8k/1k; prefill TP8 STP + decode wide-EP DEP16/DEP32 high-throughput and per-node TP8 low-latency).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/.
- launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch and glm5 recipes-copy stage so the runtime overlays our recipes onto srt-slurm.
- perf-changelog: entry for the new config.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 74c1f2c.
mem-fraction-static: 0.7
weight-loader-prefetch-checkpoints: true
model-loader-extra-config: '{"enable_multithread_load": true}'
Missing prefill radix cache disable
Medium Severity
The three 8k/1k low-latency GB200 recipes omit disable-radix-cache: true on the shared prefill block, while every other new GLM-5 GB200 recipe (including 8k/1k high-throughput) sets it. Prefill can keep radix caching enabled for those runs only, skewing latency and throughput versus the rest of the matrix.
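The fix Bugbot describes could look roughly like the fragment below. This is only a sketch: the `prefill:` nesting is an assumed placeholder for wherever the shared prefill block sits in the recipe, and only the four leaf keys are taken from this PR.

```yaml
# Hypothetical nesting: "prefill:" stands in for the recipe's shared prefill block.
# The leaf keys are the ones quoted in this review; their surrounding layout is assumed.
prefill:
  mem-fraction-static: 0.7
  weight-loader-prefetch-checkpoints: true
  model-loader-extra-config: '{"enable_multithread_load": true}'
  # Flag reported as missing from the three 8k/1k low-latency recipes.
  disable-radix-cache: true
```

With the flag set in all three recipes, the 8k/1k low-latency prefill workers would match the other GLM-5 GB200 recipes, so radix caching would no longer differ across the matrix.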
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28063826617


Summary
Adds the glm5-fp8-gb200-dynamo-sglang config: GLM-5-FP8 disaggregated multinode SGLang benchmark on GB200 via Dynamo.
- 14 recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ covering 1k/1k and 8k/1k.
- nvidia-master.yaml: new entry mirroring the gb300-fp8 sibling, prefill TP8 STP everywhere.
- runners/launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch + glm5 recipes-copy stage (overlays our recipes onto srt-slurm at runtime).
- Image: lmsysorg/sglang:v0.5.12.

Note
Low Risk
Benchmark and CI launch wiring only; no changes to application auth, APIs, or production inference paths.
Overview
Introduces glm5-fp8-gb200-dynamo-sglang so InferenceX can run GLM-5-FP8 on GB200 with disaggregated multinode SGLang behind Dynamo (lmsysorg/sglang:v0.5.12).

nvidia-master.yaml gains a new block with 14 fixed-seq-len search-space points for 1k/1k and 8k/1k: prefill TP8 STP paired with decode wide-EP (DEP16/DEP32) high-throughput layouts and per-node TP8 low-latency 1pNd shapes, each wired via CONFIG_FILE to a concrete recipe.

14 flat recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ (split per topology, aligned with the existing gb300-fp8 glm5 layout).

runners/launch_gb200-nv.sh maps glm5/fp8 to Lustre weights and glm-5-fp8, and copies the glm5 recipe tree into srt-slurm at job launch.

perf-changelog.yaml documents the new config key.