Add GLM-5-FP8 GB200 dynamo-sglang multinode benchmark#1895
Open
hshrivastava-droid wants to merge 2 commits into
- nvidia-master.yaml: add glm5-fp8-gb200-dynamo-sglang (14 topologies across 1k/1k and 8k/1k; prefill TP8 STP + decode wide-EP DEP16/DEP32 high-throughput and per-node TP8 low-latency).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/.
- launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch and glm5 recipes-copy stage so the runtime overlays our recipes onto srt-slurm.
- perf-changelog: entry for the new config.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 74c1f2c.
mem-fraction-static: 0.7
weight-loader-prefetch-checkpoints: true
model-loader-extra-config: '{"enable_multithread_load": true}'
Missing prefill radix cache disable
Medium Severity
The three 8k/1k low-latency GB200 recipes omit disable-radix-cache: true on the shared prefill block, while every other new GLM-5 GB200 recipe (including 8k/1k high-throughput) sets it. Prefill can keep radix caching enabled for those runs only, skewing latency and throughput versus the rest of the matrix.
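A minimal sketch of the fix this comment implies, assuming the flag belongs alongside the prefill settings quoted in the diff excerpt above. The enclosing keys of the shared prefill block are not shown in this review, so the fragment is deliberately left un-nested; the key name `disable-radix-cache` is taken from the comment itself, and the other three lines are copied from the excerpt.

```yaml
# Sketch only: where this sits inside the recipe file is an assumption,
# since the review excerpt does not show the parent keys.
disable-radix-cache: true            # flag reported missing in the three 8k/1k low-latency recipes
mem-fraction-static: 0.7
weight-loader-prefetch-checkpoints: true
model-loader-extra-config: '{"enable_multithread_load": true}'
```

Applying the same line to all three affected recipes would keep prefill caching behavior uniform across the 14-recipe matrix, which is what makes the latency and throughput numbers comparable.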


Summary
Adds the `glm5-fp8-gb200-dynamo-sglang` config: GLM-5-FP8 disaggregated multinode SGLang benchmark on GB200 via Dynamo.
- `benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/` covering 1k/1k and 8k/1k.
- `nvidia-master.yaml`: new entry mirroring the gb300-fp8 sibling, prefill TP8 STP everywhere.
- `runners/launch_gb200-nv.sh`: glm5/fp8 `MODEL_PATH` branch + glm5 recipes-copy stage (overlays our recipes onto srt-slurm at runtime).
- `lmsysorg/sglang:v0.5.12`.
Note
Low Risk
Benchmark and CI launch wiring only; no changes to application auth, APIs, or production inference paths.
Overview
Introduces `glm5-fp8-gb200-dynamo-sglang` so InferenceX can run GLM-5-FP8 on GB200 with disaggregated multinode SGLang behind Dynamo (`lmsysorg/sglang:v0.5.12`).
`nvidia-master.yaml` gains a new block with 14 fixed-seq-len search-space points for 1k/1k and 8k/1k: prefill TP8 STP paired with decode wide-EP (DEP16/DEP32) high-throughput layouts and per-node TP8 low-latency 1pNd shapes, each wired via `CONFIG_FILE` to a concrete recipe.
14 flat recipe YAMLs are added under `benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/` (split per topology, aligned with the existing gb300-fp8 glm5 layout). `runners/launch_gb200-nv.sh` maps glm5/fp8 to Lustre weights and `glm-5-fp8`, and copies the glm5 recipe tree into srt-slurm at job launch. `perf-changelog.yaml` documents the new config key.
Reviewed by Cursor Bugbot for commit 74c1f2c. Bugbot is set up for automated code reviews on this repo.