Add GLM-5-FP8 GB200 dynamo-sglang multinode benchmark #1895

Open: hshrivastava-droid wants to merge 2 commits into main from nv/glm-5-fp8-v2
Conversation

hshrivastava-droid commented Jun 23, 2026

Collaborator

Summary

Adds the glm5-fp8-gb200-dynamo-sglang config: GLM-5-FP8 disaggregated multinode SGLang benchmark on GB200 via Dynamo.

  • 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ covering 1k/1k and 8k/1k:
    • 8k/1k: 4 wide-EP high-throughput shapes (DEP16/DEP32) + 3 per-node TP8 low-latency shapes
    • 1k/1k: 1 wide-EP high-throughput shape + 3 wide-EP low-latency shapes + 3 per-node TP8 low-latency shapes
  • nvidia-master.yaml: new entry mirroring the gb300-fp8 sibling, prefill TP8 STP everywhere.
  • runners/launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch + glm5 recipes-copy stage (overlays our recipes onto srt-slurm at runtime).
  • Image: lmsysorg/sglang:v0.5.12.
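As a quick sanity check on the breakdown above, the per-shape counts can be tallied to confirm they add up to the 14 recipes stated. This is only an arithmetic sketch: the dictionary keys are labels taken from the summary bullets, and the helper name `totals_by_seq_len` is hypothetical, not something in the repository.

```python
# Recipe counts as listed in the PR summary, keyed by
# (sequence-length setting, shape family).
recipe_counts = {
    ("8k/1k", "wide-EP high-throughput"): 4,
    ("8k/1k", "per-node TP8 low-latency"): 3,
    ("1k/1k", "wide-EP high-throughput"): 1,
    ("1k/1k", "wide-EP low-latency"): 3,
    ("1k/1k", "per-node TP8 low-latency"): 3,
}


def totals_by_seq_len(counts):
    """Sum recipe counts per sequence-length setting (hypothetical helper)."""
    totals = {}
    for (seq_len, _family), n in counts.items():
        totals[seq_len] = totals.get(seq_len, 0) + n
    return totals


per_seq_len = totals_by_seq_len(recipe_counts)
total = sum(per_seq_len.values())
print(per_seq_len, total)
```

Both sequence-length settings contribute 7 recipes each, which matches the 14 search-space points described for nvidia-master.yaml.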

Note

Low Risk
Benchmark and CI launch wiring only; no changes to application auth, APIs, or production inference paths.

Overview
Introduces glm5-fp8-gb200-dynamo-sglang so InferenceX can run GLM-5-FP8 on GB200 with disaggregated multinode SGLang behind Dynamo (lmsysorg/sglang:v0.5.12).

nvidia-master.yaml gains a new block with 14 fixed-seq-len search-space points for 1k/1k and 8k/1k: prefill TP8 STP paired with decode wide-EP (DEP16/DEP32) high-throughput layouts and per-node TP8 low-latency 1pNd shapes, each wired via CONFIG_FILE to a concrete recipe.

14 flat recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ (split per topology, aligned with the existing gb300-fp8 glm5 layout).

runners/launch_gb200-nv.sh maps glm5/fp8 to Lustre weights and glm-5-fp8, and copies the glm5 recipe tree into srt-slurm at job launch. perf-changelog.yaml documents the new config key.
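The recipes-copy stage described above behaves like an overlay: the repository's recipe tree is copied into the srt-slurm checkout without removing what is already there. The launcher itself is a shell script whose exact commands are not shown here, so the following is only a Python sketch of the overlay idea. The function names and the directory layout in the demo are hypothetical; `shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or later.

```python
import shutil
import tempfile
from pathlib import Path


def overlay_recipes(src: Path, dst: Path) -> list[str]:
    """Copy every file under src into dst.

    Files already in dst are kept unless a file at the same relative
    path exists in src, in which case the src copy replaces it.
    Returns the sorted relative paths of all files in dst afterwards.
    """
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())


def demo() -> list[str]:
    """Build two throwaway trees (hypothetical names) and overlay one on the other."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ours = root / "ours" / "glm5" / "gb200-fp8"
        ours.mkdir(parents=True)
        (ours / "example.yaml").write_text("name: example\n")

        upstream = root / "upstream"
        (upstream / "other").mkdir(parents=True)
        (upstream / "other" / "existing.yaml").write_text("name: existing\n")

        return overlay_recipes(root / "ours", upstream)


print(demo())
```

The point of the overlay approach is that recipes shipped with the upstream checkout stay available while the PR's recipes are added alongside them.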

Reviewed by Cursor Bugbot for commit 74c1f2c. Bugbot is set up for automated code reviews on this repo.

- nvidia-master.yaml: add glm5-fp8-gb200-dynamo-sglang (14 topologies
  across 1k/1k and 8k/1k; prefill TP8 STP + decode wide-EP DEP16/DEP32
  high-throughput and per-node TP8 low-latency).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/
  glm5/gb200-fp8/.
- launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch and glm5 recipes-copy
  stage so the runtime overlays our recipes onto srt-slurm.
- perf-changelog: entry for the new config.

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

mem-fraction-static: 0.7
weight-loader-prefetch-checkpoints: true
model-loader-extra-config: '{"enable_multithread_load": true}'

Missing prefill radix cache disable

Medium Severity

The three 8k/1k low-latency GB200 recipes omit disable-radix-cache: true on the shared prefill block, while every other new GLM-5 GB200 recipe (including 8k/1k high-throughput) sets it. As a result, prefill may keep radix caching enabled for those three runs only, which would skew their latency and throughput relative to the rest of the matrix.
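An inconsistency like this can be caught with a simple lint over the recipe files before merging. The sketch below is a minimal, text-based check under stated assumptions: it only looks for a line reading `disable-radix-cache: true` anywhere in each recipe and does not verify that the setting sits inside the prefill block, since the recipe schema is not shown here. The function name and the sample file names and contents are hypothetical.

```python
def missing_radix_disable(recipes: dict[str, str]) -> list[str]:
    """Return names of recipes whose text never sets disable-radix-cache: true.

    recipes maps a file name to the raw YAML text of that recipe.
    Trailing '#' comments and surrounding whitespace are ignored.
    """
    missing = []
    for name, text in sorted(recipes.items()):
        lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
        if "disable-radix-cache: true" not in lines:
            missing.append(name)
    return missing


# Hypothetical recipe fragments for illustration only.
sample = {
    "hypothetical-low-latency.yaml": (
        "mem-fraction-static: 0.7\n"
        "weight-loader-prefetch-checkpoints: true\n"
    ),
    "hypothetical-high-throughput.yaml": (
        "mem-fraction-static: 0.7\n"
        "disable-radix-cache: true\n"
    ),
}

print(missing_radix_disable(sample))
```

In practice the `recipes` mapping would be built by reading the 14 YAML files under the gb200-fp8 recipe directory; a stricter version would parse the YAML and inspect the prefill section specifically.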

Additional Locations (2)
