Add GLM-5-FP8 GB200 dynamo-sglang multinode benchmark by hshrivastava-droid · Pull Request #1895 · SemiAnalysisAI/InferenceX

hshrivastava-droid · 2026-06-23T05:43:00Z

Summary

Adds the glm5-fp8-gb200-dynamo-sglang config: GLM-5-FP8 disaggregated multinode SGLang benchmark on GB200 via Dynamo.

- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ covering 1k/1k and 8k/1k:
  - 8k/1k: 4 wide-EP high-throughput shapes (DEP16/DEP32) + 3 per-node TP8 low-latency shapes
  - 1k/1k: 1 wide-EP high-throughput shape + 3 wide-EP low-latency shapes + 3 per-node TP8 low-latency shapes
- nvidia-master.yaml: new entry mirroring the gb300-fp8 sibling, prefill TP8 STP everywhere.
- runners/launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch + glm5 recipes-copy stage (overlays our recipes onto srt-slurm at runtime).
- Image: lmsysorg/sglang:v0.5.12.
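As a quick sanity check on the recipe count, the per-bucket shape counts listed in the summary can be tallied. This is only an illustrative sketch: the dictionary layout and group labels are chosen here for readability and do not correspond to any file in the PR.

```python
# Shape counts per sequence-length bucket, copied from the PR summary.
# The dict structure itself is illustrative, not taken from the repo.
shapes = {
    "8k/1k": {
        "wide-EP high-throughput": 4,
        "per-node TP8 low-latency": 3,
    },
    "1k/1k": {
        "wide-EP high-throughput": 1,
        "wide-EP low-latency": 3,
        "per-node TP8 low-latency": 3,
    },
}

# Sum the groups inside each bucket, then sum across buckets.
per_bucket = {bucket: sum(groups.values()) for bucket, groups in shapes.items()}
total = sum(per_bucket.values())

print(per_bucket)  # {'8k/1k': 7, '1k/1k': 7}
print(total)       # 14
```

Both buckets come to 7 shapes each, which matches the 14 recipes and 14 search-space points stated in the description.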

Note

Low Risk
Benchmark and CI launch wiring only; no changes to application auth, APIs, or production inference paths.

Overview
Introduces glm5-fp8-gb200-dynamo-sglang so InferenceX can run GLM-5-FP8 on GB200 with disaggregated multinode SGLang behind Dynamo (lmsysorg/sglang:v0.5.12).

nvidia-master.yaml gains a new block with 14 fixed-seq-len search-space points for 1k/1k and 8k/1k: prefill TP8 STP paired with decode wide-EP (DEP16/DEP32) high-throughput layouts and per-node TP8 low-latency 1pNd shapes, each wired via CONFIG_FILE to a concrete recipe.

14 flat recipe YAMLs are added under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/ (split per topology, aligned with the existing gb300-fp8 glm5 layout).

runners/launch_gb200-nv.sh maps glm5/fp8 to Lustre weights and glm-5-fp8, and copies the glm5 recipe tree into srt-slurm at job launch. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 74c1f2c.

- nvidia-master.yaml: add glm5-fp8-gb200-dynamo-sglang (14 topologies across 1k/1k and 8k/1k; prefill TP8 STP + decode wide-EP DEP16/DEP32 high-throughput and per-node TP8 low-latency).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/.
- launch_gb200-nv.sh: glm5/fp8 MODEL_PATH branch and glm5 recipes-copy stage so the runtime overlays our recipes onto srt-slurm.
- perf-changelog: entry for the new config.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-23T05:45:44Z

+      mem-fraction-static: 0.7
+      weight-loader-prefetch-checkpoints: true
+      model-loader-extra-config: '{"enable_multithread_load": true}'
+


Missing prefill radix cache disable

Medium Severity

The three 8k/1k low-latency GB200 recipes omit disable-radix-cache: true on the shared prefill block, while every other new GLM-5 GB200 recipe (including 8k/1k high-throughput) sets it. Prefill can keep radix caching enabled for those runs only, skewing latency and throughput versus the rest of the matrix.

Additional Locations (2)

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml#L78-L84

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb200-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml#L78-L84
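An inconsistency like this could be caught before a sweep with a small text-level check. The sketch below is an assumption-laden illustration, not part of the PR: the function names are hypothetical, the only key taken from the review is `disable-radix-cache: true`, and the check looks for the flag anywhere in the recipe text rather than confirming that it sits inside the prefill block (doing that would require knowing the recipe schema).

```python
import re

# Matches a line that sets the flag named in the review comment.
# Tolerates indentation, extra spaces, and a trailing YAML comment.
FLAG = re.compile(r"^\s*disable-radix-cache:\s*true\s*(#.*)?$", re.MULTILINE)


def has_radix_cache_disabled(recipe_text: str) -> bool:
    """Return True if the recipe text sets disable-radix-cache: true on any line."""
    return FLAG.search(recipe_text) is not None


def recipes_missing_flag(recipes: dict[str, str]) -> list[str]:
    """Given {recipe name: recipe text}, return the sorted names that never set the flag."""
    return sorted(
        name for name, text in recipes.items()
        if not has_radix_cache_disabled(text)
    )


# Hypothetical in-memory recipes; real use would read the YAML files from disk.
example = {
    "highthroughput.yaml": "mem-fraction-static: 0.7\ndisable-radix-cache: true\n",
    "lowlat.yaml": "mem-fraction-static: 0.7\n",
}
print(recipes_missing_flag(example))  # ['lowlat.yaml']
```

To run it over the new recipe tree, the dictionary could be built with `pathlib.Path(...).rglob("*.yaml")` and `read_text()`. A schema-aware version would parse the YAML and inspect the prefill section directly.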


github-actions · 2026-06-24T03:43:43Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28063826617
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28063826617

hshrivastava-droid requested a review from a team June 23, 2026 05:43

hshrivastava-droid requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners June 23, 2026 05:43

github-project-automation Bot added this to InferenceMAX Board Jun 23, 2026

Update perf-changelog pr-link for #1895

74c1f2c

hshrivastava-droid added the full-sweep-enabled label Jun 23, 2026

cursor Bot reviewed Jun 23, 2026

View reviewed changes

Merge branch 'main' into nv/glm-5-fp8-v2

3ee9670

hshrivastava-droid wants to merge 3 commits into main from nv/glm-5-fp8-v2
