Changes from 96 commits
169 commits
83761d0
Add CollectiveX experimental cross-vendor collective/EP benchmark
Oseltamivir Jun 23, 2026
b7ed913
CollectiveX: import container by multi-arch tag, fix CI import hang
Oseltamivir Jun 23, 2026
e6fdd84
Merge branch 'main' into collectivex
Oseltamivir Jun 23, 2026
ccfae8e
CollectiveX: copy staged results back to checkout for artifact upload
Oseltamivir Jun 23, 2026
b384171
CollectiveX: per-job summary table + address PR review findings
Oseltamivir Jun 23, 2026
f48daed
CollectiveX: render results as a GitHub Actions job summary
Oseltamivir Jun 23, 2026
be9cc91
CollectiveX: add MI355X / MoRI EP path (dispatch+combine)
Oseltamivir Jun 23, 2026
d8ee9bf
CollectiveX: run MI355X MoRI on push; align launcher with serving script
Oseltamivir Jun 23, 2026
ac3f1b9
CollectiveX: size MoRI symmetric heap (first MI355X run hit the 2 GiB…
Oseltamivir Jun 23, 2026
46208f2
CollectiveX: set MoRI heap to 6G (16 GiB failed RDMA MR registration)
Oseltamivir Jun 23, 2026
b62de99
CollectiveX: MoRI MI355X validated on hardware; fix heap/buffer/teardown
Oseltamivir Jun 23, 2026
481ef59
CollectiveX: wire rccl-tests collective primitives for MI355X (CX_BEN…
Oseltamivir Jun 23, 2026
78322de
CollectiveX: key dispatch concurrency by SKU so B200/MI355X runs don'…
Oseltamivir Jun 23, 2026
2b23573
CollectiveX: render busbw & latency vs bytes/rank sweep tables in the…
Oseltamivir Jun 23, 2026
a3a492c
CollectiveX: GB200 8-GPU multi-node MNNVL path (CX_NODES), validated …
Oseltamivir Jun 23, 2026
871086d
CollectiveX: fix multi-node build cache (MPI=0 vs MPI=1) + gate all-z…
Oseltamivir Jun 23, 2026
368cfbc
CollectiveX: EP dispatch/combine token sweep with separated timing (t…
Oseltamivir Jun 24, 2026
e2717a3
CollectiveX: make MI355X launcher CI-robust (writable lock dir + node…
Oseltamivir Jun 24, 2026
5c7b273
CollectiveX: fair-comparison EP rebuild — deterministic trace, real f…
Oseltamivir Jun 24, 2026
0052b11
CollectiveX: resource-normalized + tuned regimes for the EP comparison
Oseltamivir Jun 24, 2026
3a872a9
CollectiveX: fail-fast timeout guard + cap the MoRI push smoke (T>=32…
Oseltamivir Jun 24, 2026
5876ea0
CollectiveX: floor MoRI normalized block_num — it deadlocks at T>=32 …
Oseltamivir Jun 24, 2026
353c8ee
CollectiveX: FP8 dispatch + low-latency mode + reject-unsupported fra…
Oseltamivir Jun 24, 2026
3bc941c
CollectiveX: fix B300 warmup artifact + GHA matrix for h100-dgxc/b300…
Oseltamivir Jun 24, 2026
9f85d05
CollectiveX: fix h100-dgxc + b300 launcher slurm/storage from serving…
Oseltamivir Jun 24, 2026
c596882
CollectiveX: serialize same-SKU GHA dispatches + add 3-run reproducib…
Oseltamivir Jun 24, 2026
e71ef3c
CollectiveX: per-point clock-ramp burst (gated) — fixes MoRI wedge + …
Oseltamivir Jun 24, 2026
4e217f9
CollectiveX: MoRI repro/validation drivers pass COLLECTIVEX_IMAGE (pr…
Oseltamivir Jun 24, 2026
7a2f94f
CollectiveX: repro driver — match the T row (MoRI ramp-safe) + cap Mo…
Oseltamivir Jun 24, 2026
bbe0578
CollectiveX: dedicated MoRI repro driver (validation-exact invocation)
Oseltamivir Jun 24, 2026
f7b9d35
CollectiveX v3 measurement: explicit contracts, pooled-trial p50/p90/…
Oseltamivir Jun 25, 2026
1afd268
CollectiveX v3 workflow: capability resolver + NCCL phase-dedup + con…
Oseltamivir Jun 25, 2026
6122acb
CollectiveX v3 plotter: percentile + suite selectors, logical-payload…
Oseltamivir Jun 25, 2026
c136ec5
CollectiveX: v3 harness smoke driver (validates contracts/trials/rout…
Oseltamivir Jun 25, 2026
cf34cb3
CollectiveX: MoRI repro driver iters knob (MORI_ITERS, tighter fast-o…
Oseltamivir Jun 25, 2026
82ec864
CollectiveX: v3 re-run drivers (deepep _v3_rerun.sh + mori _v3_mori.s…
Oseltamivir Jun 25, 2026
cad380a
CollectiveX plotter: default to p50 (p99 too noisy a tail estimate at…
Oseltamivir Jun 25, 2026
81cddca
CollectiveX plotter: X-axis Log/Linear toggle (was hardcoded log)
Oseltamivir Jun 25, 2026
e97bc8b
CollectiveX plotter: auto-stitch decode range into prefill curves (co…
Oseltamivir Jun 25, 2026
6a3a185
chore: dispatch CollectiveX snapshot updates [skip ci]
Oseltamivir Jun 25, 2026
270b7b4
CollectiveX: GB300 EP8 across 2 NVL72 trays + EP-degree-aware plotter
Oseltamivir Jun 25, 2026
a6812dc
CollectiveX: routing axis (balanced/zipf) + EPLB expert-replication l…
Oseltamivir Jun 25, 2026
45c4570
CollectiveX v4 (goal Part 1 + scaffolding): workload identity, measur…
Oseltamivir Jun 25, 2026
600e909
CollectiveX: analyze_ep.py — operating-envelope analysis (skew penalt…
Oseltamivir Jun 25, 2026
171c7d1
CollectiveX: --workload-dir canonical-trace consumption + make_worklo…
Oseltamivir Jun 25, 2026
6dba193
CollectiveX: failure taxonomy (classify hang/OOM/registration/deadloc…
Oseltamivir Jun 25, 2026
8ff23bd
CollectiveX plotter: coverage table (publication status per measured …
Oseltamivir Jun 25, 2026
9e52693
CollectiveX: provenance enrichment (GitHub ref/job/artifact, image ar…
Oseltamivir Jun 25, 2026
82c6130
CollectiveX: structured placement metadata + routing locality fractio…
Oseltamivir Jun 25, 2026
e273009
CollectiveX: scaling efficiency (strong/weak from EP4/EP8) + regressi…
Oseltamivir Jun 25, 2026
978d338
CollectiveX: MI355X cross-vendor canonical-workload consume driver (D…
Oseltamivir Jun 25, 2026
a413de2
CollectiveX plotter: fix grid 'undefined' panel title (stale 'serial'…
Oseltamivir Jun 26, 2026
d799e0f
CollectiveX plotter: prefill panels show only the real prefill range …
Oseltamivir Jun 26, 2026
1622dff
CollectiveX plotter: --legacy {all,exclude,only} — v4-only main plot …
Oseltamivir Jun 26, 2026
f5df0ea
CollectiveX GHA: add routing/eplb inputs + h200/gb300 SKUs; wire CX_E…
Oseltamivir Jun 26, 2026
bb296c4
CollectiveX: launch_gb300-nv.sh — GHA launcher for GB300 (EP4 via run…
Oseltamivir Jun 26, 2026
73da67b
CollectiveX GHA: per-(SKU+config) concurrency group so a multi-config…
Oseltamivir Jun 26, 2026
0df55e8
CollectiveX: per-runner stage dir (fix concurrent-dispatch stale-hand…
Oseltamivir Jun 26, 2026
13f0a0f
CollectiveX: fix H200 GHA launcher FS (/home/sa-shared, not /mnt/nfs)
Oseltamivir Jun 26, 2026
9fb6e5d
CollectiveX: H200 partition main (not hpc-gpu-1)
Oseltamivir Jun 26, 2026
2b5e26c
CollectiveX: GB300 launcher uses docker tag, not squash path
Oseltamivir Jun 26, 2026
d2433e3
CollectiveX: pin h200 dispatch to the h200-dgxc runner pool
Oseltamivir Jun 26, 2026
156bf44
CollectiveX: GHA campaign tooling — collector + matrix dry-label fix
Oseltamivir Jun 26, 2026
59a05e0
CollectiveX: gitignore _ssh_v4_archive/ (superseded SSH result JSONs)
Oseltamivir Jun 26, 2026
a767844
CollectiveX: distribution-identity hardening + quant-combine (PR311) …
Oseltamivir Jun 26, 2026
fd23d02
CollectiveX: complete goal Part 1 + Part 2 — runtime-visible contract…
Oseltamivir Jun 26, 2026
70cfef3
CollectiveX: cohort official-membership gate (publication_status==off…
Oseltamivir Jun 26, 2026
60dec7d
CollectiveX: immediate-priority — LL fixed-kernel resource split, res…
Oseltamivir Jun 26, 2026
36d3eb6
CollectiveX: fix UnboundLocalError on EPLB canonical runs — define ro…
Oseltamivir Jun 26, 2026
ee4ffe7
CollectiveX: gitignore _seeded_archive/ (superseded seeded-runtime re…
Oseltamivir Jun 26, 2026
45fa504
CollectiveX: full-suite GHA dispatch — workflow inputs (hidden/topk/e…
Oseltamivir Jun 26, 2026
2c15d94
CollectiveX: full-suite completeness fixes — collect limit 500 (was 1…
Oseltamivir Jun 27, 2026
880f82c
CollectiveX: keep-newest cfg_key includes resource axis (resource_mod…
Oseltamivir Jun 27, 2026
ddc08e7
CollectiveX: add iters workflow input (CX_ITERS) — for the MoRI/MI355…
Oseltamivir Jun 27, 2026
8392632
CollectiveX: add trials/warmup workflow inputs (CX_TRIALS/CX_WARMUP) …
Oseltamivir Jun 27, 2026
74f52e0
CollectiveX: fix workflow_dispatch >25-input limit — consolidate iter…
Oseltamivir Jun 27, 2026
1495866
CollectiveX: add B300 to ep-nightly/ep-models/ep-routing (was missing…
Oseltamivir Jun 27, 2026
0cf9fc6
CollectiveX: DeepEP V2 build hook (CX_DEEPEP_V2 -> build NCCL-Gin V2 …
Oseltamivir Jun 27, 2026
76a3032
CollectiveX: kernel_gen (deepep v1/v2) as a distinct identity axis — …
Oseltamivir Jun 27, 2026
91c7acf
collectivex: fix DeepEP V2 build on PEP 668 images (H200/B300)
Oseltamivir Jun 27, 2026
df7fdde
collectivex: headline defaults, decision/summary/tabs UI, regression …
Oseltamivir Jun 27, 2026
803b785
collectivex: render NCCL all-reduce/all-gather (family=nccl) in plot …
Oseltamivir Jun 27, 2026
b6176a6
collectivex: collect family=nccl (all-reduce/all-gather) + uccl/flash…
Oseltamivir Jun 27, 2026
a504a3e
collectivex: model-shape selector in plot (DeepSeek-V3/V4, MiniMax-M3…
Oseltamivir Jun 27, 2026
1e21c72
collectivex: UCCL EP backend + memcpy-family collective benches (offl…
Oseltamivir Jun 27, 2026
eb6f953
collectivex: document hardware/kernel-gated items (honest blockers)
Oseltamivir Jun 27, 2026
c16f885
collectivex: fix UCCL build-check (import torch first) + capability/c…
Oseltamivir Jun 27, 2026
4c661f9
collectivex: summarize.py recognizes memcpy-family collectives (offlo…
Oseltamivir Jun 27, 2026
95137b8
collectivex: correct UCCL EP status — scaffolded, full run deferred
Oseltamivir Jun 27, 2026
645f9d5
collectivex: collect offload/copy_engine/kvcache files + robust _coll…
Oseltamivir Jun 27, 2026
f531529
collectivex: review upstream precision PRs (MoRI 311, FlashInfer 3376…
Oseltamivir Jun 27, 2026
0e54cde
collectivex: populate offload/copy-engine/kv-cache plot tabs (real data)
Oseltamivir Jun 27, 2026
71477ee
collectivex: RL mesh-to-mesh transfer benchmark (family=rl-mesh)
Oseltamivir Jun 27, 2026
e6224de
collectivex: rl-mesh passes capability pre-flight (non-EP bench passt…
Oseltamivir Jun 27, 2026
c40de99
collectivex: render RL mesh-to-mesh tab (family=rl-mesh) — final coll…
Oseltamivir Jun 27, 2026
925285d
collectivex: launchers/ contains only launch*; runtime/ + tools/ split
Oseltamivir Jun 27, 2026
ca8a505
collectivex: FlashInfer EP adapter + framework all-reduce bench (wire…
Oseltamivir Jun 27, 2026
762eb48
collectivex: direct-cast FP8 + per-token scale-layout dispatch recipes
Oseltamivir Jun 27, 2026
42eddb4
collectivex: fix fp8-variant CLI choices + allreduce-fw gate + surfac…
Oseltamivir Jun 27, 2026
ccb0b4a
collectivex: fix FlashInfer EP Mapping (tp_size=world_size for pure EP)
Oseltamivir Jun 27, 2026
9e1ac40
collectivex: FlashInfer MoeAlltoAll requires hidden_size (Mapping fix…
Oseltamivir Jun 27, 2026
91530dd
collectivex: FlashInfer MNNVL via TorchDistBackend (no MPI) — the rea…
Oseltamivir Jun 27, 2026
e150424
collectivex: FlashInfer EP combine — clone payload + payload_in_works…
Oseltamivir Jun 27, 2026
7aca33d
collectivex: FlashInfer EP — handle stateful dispatch/combine FSM
Oseltamivir Jun 27, 2026
1535869
collectivex: roundtrip-only timing for FlashInfer EP (stateful paired…
Oseltamivir Jun 27, 2026
511188e
collectivex: FlashInfer combine — pass recv as-is (source contract: s…
Oseltamivir Jun 27, 2026
2ebeba9
collectivex: FlashInfer EP correctness factor = distinct ranks per token
Oseltamivir Jun 27, 2026
04d83bf
collectivex: UCCL EP — vendor deep_ep_wrapper (group-based Buffer) + …
Oseltamivir Jun 27, 2026
5d08a93
collectivex: UCCL — pin vendored deep_ep_wrapper to the wheel's tag (…
Oseltamivir Jun 27, 2026
cfa1ec5
collectivex: UCCL EP finalize os._exit past teardown SIGSEGV (result …
Oseltamivir Jun 27, 2026
510fc17
CollectiveX: FlashInfer EP quant dispatch (fp8 e4m3 variants + mxfp8 …
Oseltamivir Jun 28, 2026
0b2753b
CollectiveX: real FlashInfer one-shot/two-shot all-reduce (trtllm_all…
Oseltamivir Jun 28, 2026
5c48dfd
CollectiveX: gate nvfp4 dispatch to Blackwell + refresh gated.md
Oseltamivir Jun 28, 2026
156e9ea
CollectiveX: render framework all-reduce in the All-reduce tab + gate…
Oseltamivir Jun 28, 2026
d8b4764
CollectiveX: document collective-suite serving-use mapping (all-reduc…
Oseltamivir Jun 28, 2026
02ef8d2
CollectiveX: DeepEP hybrid-ep branch backend (NVIDIA TMA HybridEPBuffer)
Oseltamivir Jun 28, 2026
90877fb
CollectiveX: allow AMD collective benches on the MI355X launcher (kv-…
Oseltamivir Jun 28, 2026
3850003
CollectiveX: FlashInfer quantized COMBINE output (fp8) via newer moe_…
Oseltamivir Jun 28, 2026
49dd8db
CollectiveX: fix flashinfer-combine upgrade — match cubin/jit-cache v…
Oseltamivir Jun 28, 2026
f684b37
CollectiveX: raise MI355X wall-clock guard to 1800s (slow shared clus…
Oseltamivir Jun 28, 2026
d9e0423
CollectiveX: install flashinfer from NIGHTLY index for combine output…
Oseltamivir Jun 28, 2026
c2c7feb
CollectiveX: upgrade nvidia-cutlass-dsl with the nightly flashinfer (…
Oseltamivir Jun 28, 2026
43614ad
CollectiveX: record exact upgraded FlashInfer library stack in proven…
Oseltamivir Jun 28, 2026
d4c508a
CollectiveX: build flashinfer main from source if the nightly wheel l…
Oseltamivir Jun 28, 2026
ba7c14a
CollectiveX: force JIT-from-main for combine kernel (uninstall stale …
Oseltamivir Jun 28, 2026
85273c6
CollectiveX: fix combine-quant output_scales to UE8M0 uint8 block-32 …
Oseltamivir Jun 28, 2026
4b3fe29
CollectiveX: NVFP4 quantized combine output (flashinfer fp4 path) — c…
Oseltamivir Jun 28, 2026
ddfbdf7
CollectiveX: gated.md — quant combine OUTPUT now DONE on B300 (flashi…
Oseltamivir Jun 28, 2026
2d65048
CollectiveX: add nvfp4 to harness --combine-dtype argparse choices
Oseltamivir Jun 28, 2026
0e61ac1
CollectiveX: nvfp4 combine dequant — view e4m3 scales as uint8 for e2…
Oseltamivir Jun 28, 2026
d6bf7b1
CollectiveX: gated.md — NVFP4 combine also DONE on B300 (valid, corre…
Oseltamivir Jun 28, 2026
94f03d5
CollectiveX: MXFP4 dispatch via fp4_quantize(ue8m0, swizzled=False) —…
Oseltamivir Jun 28, 2026
99e4ba0
CollectiveX: MoRI fp8 blockwise (e4m3fnuz) dispatch — the FNUZ precis…
Oseltamivir Jun 28, 2026
fe013ce
CollectiveX: NIXL via container switch — transfer bench (wired) + dev…
Oseltamivir Jun 28, 2026
a15bd8b
CollectiveX: AMD SDMA copy path — attempt the off-SM DMA engine on MI…
Oseltamivir Jun 28, 2026
f06b701
CollectiveX: direct-cast FP8 combine — output_scalar_scale-only on th…
Oseltamivir Jun 28, 2026
8405b10
CollectiveX: MoRI-IO transfer bench — the AMD RDMA p2p transfer engin…
Oseltamivir Jun 28, 2026
3ab6feb
CollectiveX: gated.md — NIXL container-switch result + direct-cast ke…
Oseltamivir Jun 28, 2026
83679b0
CollectiveX: methodology — named per-model TP-MoE handoff shapes table
Oseltamivir Jun 28, 2026
ae3032f
CollectiveX: copy-engine — add flash-attention victim for copy-vs-att…
Oseltamivir Jun 28, 2026
0078e31
CollectiveX: MoRI fp8 = fp8_direct_cast (not blockwise) — the validat…
Oseltamivir Jun 28, 2026
08a2f1e
CollectiveX: MoRI fp8_direct_cast needs non-zero-copy (use_external_i…
Oseltamivir Jun 28, 2026
e4f71c4
CollectiveX: MoRI fp8 correctness — gate against the e4m3fnuz consist…
Oseltamivir Jun 28, 2026
8eec44d
CollectiveX: gated.md — FNUZ fp8 VALIDATED (fp8_direct_cast e4m3fnuz,…
Oseltamivir Jun 28, 2026
0cbfe17
CollectiveX: NCCL/RCCL KV-cache transfer backend (p2p send/recv)
Oseltamivir Jun 28, 2026
744426a
CollectiveX: GB200 launcher — add EP multi-srun path (was nccl-only m…
Oseltamivir Jun 28, 2026
001626a
CollectiveX: MoonCake KV transfer backend — pip-import the transfer e…
Oseltamivir Jun 28, 2026
1d7e063
CollectiveX: AITER all-reduce builder (AMD framework-AR tier)
Oseltamivir Jun 28, 2026
a51018c
CollectiveX: workflow concurrency group += inputs.nodes (multi-node E…
Oseltamivir Jun 28, 2026
7a104f2
CollectiveX: gated.md — NVL72 rack-scale EP DONE up to EP64 via Flash…
Oseltamivir Jun 28, 2026
e8b5013
CollectiveX: framework all-reduce — replicate the serving distributed…
Oseltamivir Jun 28, 2026
0688f5d
CollectiveX: vLLM all-reduce via container switch (allreduce-fw-vllm …
Oseltamivir Jun 28, 2026
568b0a7
CollectiveX: AITER all-reduce via serving-init replication (like sglang)
Oseltamivir Jun 28, 2026
f8d87b4
CollectiveX: vLLM AR — enter VllmConfig context; NIXL EP — build UCX-…
Oseltamivir Jun 28, 2026
f594ab9
CollectiveX: gated.md — framework-AR (sglang/vllm/aiter) DONE; NIXL U…
Oseltamivir Jun 28, 2026
e3b1aad
CollectiveX: MI355X cross-node EP path — MoRI RDMA internode (goal 183)
Oseltamivir Jun 28, 2026
79cf2f6
CollectiveX: cross-node H100/H200 EP path — multi-node torchrun + UCC…
Oseltamivir Jun 28, 2026
22c2a12
CollectiveX: add prune_results.py — results hygiene (newest-N-valid p…
Oseltamivir Jun 28, 2026
aaf79c9
CollectiveX: cross-node EP — MASTER_ADDR = routable NodeAddr IP (fix …
Oseltamivir Jun 28, 2026
34943b1
CollectiveX: pin cross-node PG bootstrap iface for EP rendezvous
Oseltamivir Jun 28, 2026
45097ca
CollectiveX: drop superseded DeepEP capability probes
Oseltamivir Jun 28, 2026
308101a
CollectiveX: drop tools/_keep_newest.py — subsumed by prune_results.py
Oseltamivir Jun 28, 2026
53c4575
CollectiveX: xnode-net — always-on net diagnostic + missing-iproute2 …
Oseltamivir Jun 28, 2026
7b93bc0
CollectiveX: opt-in FileStore rendezvous for cross-node EP (CX_RDZV_F…
Oseltamivir Jun 28, 2026
f108874
CollectiveX: H200 cross-node EP via multi-srun + FileStore rendezvous
Oseltamivir Jun 28, 2026
344d051
CollectiveX: cross-node EP local-spawn via FileStore (no torchrun agent)
Oseltamivir Jun 28, 2026
e8d9a77
CollectiveX: add nccl-ep — NCCL/RCCL all-to-all EP (cross-node, both …
Oseltamivir Jun 28, 2026
127785d
CollectiveX: add nccl-ep to run_ep.py --backend argparse choices
Oseltamivir Jun 28, 2026
68d0e18
CollectiveX: gated.md — cross-node EP DONE via nccl-ep (rendezvous + …
Oseltamivir Jun 28, 2026
336 changes: 336 additions & 0 deletions .github/workflows/collectivex-experimental.yml

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions experimental/CollectiveX/.gitignore
@@ -0,0 +1,22 @@
# in-container nccl-tests build cache
.nccl-tests/
# python
__pycache__/
*.pyc
# generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs,
# so keep results out of git (CI uploads them as workflow artifacts instead).
# Sanitized headline numbers live in CONTAINERS.md.
results/*.json
results/plots/
results/raw_*.txt
results/raw_*.txt.stderr
# superseded SSH-provenance result JSONs moved aside so plot_ep's recursive glob
# won't double-load them; same hostname/UUID sensitivity as results/.
_ssh_v4_archive/
# running local-only reflection log (not a committed artifact)
notes.md
goal.md
# superseded seeded-runtime GHA results (canonical counterpart exists); kept out of the plot glob
_seeded_archive/
# newest-good-per-config kept in results/; superseded runs moved here (out of the plot glob)
_superseded/
75 changes: 75 additions & 0 deletions experimental/CollectiveX/CONTAINERS.md
@@ -0,0 +1,75 @@
# CollectiveX — container & library versions

One **multi-arch** container is used for all NVIDIA SKUs, so B200 (x86_64) and
GB200 (aarch64) share a single reference and the cross-SKU comparison is truly
same-image. It is imported by tag, with the index digest recorded for
verification. Set in `runtime/common.sh` (`cx_default_image`).

## Default container (all NVIDIA SKUs)

- **Image:** import by tag **`lmsysorg/sglang:v0.5.11-cu130`** (multi-arch OCI index). Expected index digest, recorded for provenance/verification: `sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975`.
- **Multi-arch manifest list:** linux/amd64 + linux/arm64; `enroot import` on each host pulls the matching arch.
- **Import by TAG, not digest.** enroot builds its anonymous Docker Hub token scope from the *tag* and succeeds (no creds needed — same as the serving launchers). A bare `repo@sha256:` ref makes enroot prompt for a password and **hang** in non-interactive CI; a combined `tag@sha256:` ref 400s. `cx_ensure_squash` therefore imports by tag with `</dev/null` (a missing token fails fast instead of hanging). First import is multi-GB (~minutes); subsequent runs reuse the staged squash.
- **Why v0.5.11-cu130 (chosen):** it's the newest cu130 release **pre-staged on BOTH clusters** — B200 `/home/sa-shared/containers/` (amd64 squash) and GB200 `/mnt/lustre01/users-public/sa-shared/` (arm64 squash), same filename — so neither side imports at all. (Shared cu130 multi-arch squashes across both clusters: v0.5.8.post1, v0.5.9, v0.5.11 — v0.5.11 is newest.) `v0.5.12-cu130` is staged on B200 but **not** GB200: its 62 layers overflow enroot's overlay-based squash creation on the GB200 kernel (`enroot-mksquashovlfs: failed to mount overlay … Invalid argument`), so it can't be the shared default.
- **DeepEP: NOT bundled** here → `run_in_container.sh` builds it via `rebuild-deepep` at job setup (CX_BENCH=deepep). The NCCL path needs no DeepEP.
- **nccl-tests build:** in-container (login nodes have no `nvcc`), `CX_NCCL_HOME=/usr` (system `nccl.h` in `/usr/include`), `CX_CUDA_HOME=/usr/local/cuda`. cu130 lineage ⇒ CUDA 13; confirm exact NCCL/torch on first run and append below.
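
The three reference shapes described above (tag, bare digest, tag plus digest) can be told apart mechanically before an import is attempted. The following is a minimal sketch, not the repo's `cx_ensure_squash` logic; `classify_image_ref` and `safe_for_anonymous_import` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical helpers (not part of the repo): classify a Docker image
# reference into the three shapes discussed above.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_image_ref(ref: str) -> str:
    """Return 'tag', 'digest', or 'tag+digest' for repo[:tag][@sha256:...]."""
    has_digest = bool(_DIGEST.search(ref))
    name = _DIGEST.sub("", ref)
    # A tag is a ':' after the last '/', so a registry port such as
    # host:5000/repo is not mistaken for a tag.
    has_tag = ":" in name.rsplit("/", 1)[-1]
    if has_digest and has_tag:
        return "tag+digest"
    if has_digest:
        return "digest"
    return "tag"  # an untagged ref resolves to the implicit 'latest' tag


def safe_for_anonymous_import(ref: str) -> bool:
    """Only plain tag refs are reported above to import without credentials."""
    return classify_image_ref(ref) == "tag"
```

A launcher could call `safe_for_anonymous_import` up front and fail fast on a digest-style ref instead of waiting on a password prompt.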

## Audited reference (cu130 lineage)

Live audit of the sibling DeepSeek-V4 image `lmsysorg/sglang:deepseek-v4-grace-blackwell` (aarch64) on GB200, 2026-06-23 — the multi-arch `v0.5.11-cu130` should match closely (same cu130 base); reconfirm on first run:

| Component | Version |
|---|---|
| OS / arch | Ubuntu 24.04.3, aarch64 |
| CUDA (`nvcc`) | 13.0 (V13.0.88) |
| NCCL (system `/usr/include/nccl.h`) | 2.28.3; torch-bundled 2.27.7 |
| PyTorch | 2.9.1+cu130 |
| DeepEP | bundled in *that* image; **not** in the multi-arch default |
| NVSHMEM | `libnvshmem_host.so.3` present |
| OpenMPI / gcc / make | 4.1.6 / 13.3.0 / 4.3 |
| GPU / driver | GB200, 580.126.20 |

**Version caveat:** the nccl-tests binary links **system NCCL** (2.28.x), while torch/DeepEP use the **bundled** NCCL (2.27.x). Record both in provenance (env_capture does); don't compare an nccl-tests curve against a DeepEP run as if NCCL were identical.
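
One way to enforce this caveat is a small check on the recorded versions before two curves are compared. This is a sketch under assumptions: the `nccl_version` field name is hypothetical and may not match what `env_capture` actually writes.

```python
def nccl_major_minor(version: str) -> tuple:
    """Reduce '2.28.3' to (2, 28) so patch-level differences are ignored."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor))


def same_nccl_lineage(record_a: dict, record_b: dict) -> bool:
    """True when both records were produced against the same NCCL major.minor.

    'nccl_version' is a hypothetical field name used for illustration.
    """
    return nccl_major_minor(record_a["nccl_version"]) == nccl_major_minor(
        record_b["nccl_version"]
    )


# Versions taken from the audit table above.
nccl_tests_run = {"bench": "nccl", "nccl_version": "2.28.3"}  # system NCCL
deepep_run = {"bench": "deepep", "nccl_version": "2.27.7"}    # torch-bundled NCCL
```

With the audited versions, `same_nccl_lineage(nccl_tests_run, deepep_run)` is false, which is the case the caveat warns about.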

## Bundled-DeepEP reference images (not the default)

If a bundled DeepEP is needed before `rebuild-deepep` is wired on the multi-arch image, these arch-specific images bundle it (pin by digest):

- B200 (amd64): `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b` (pre-staged on B200)
- GB200 (arm64): `lmsysorg/sglang:deepseek-v4-grace-blackwell@sha256:4f583347d7ff08aef7e16dbb4985b2a7c147ff49a0c261d5e27b8f5f41719368` (staged on GB200 Lustre)

Select via `CX_IMAGE=…@sha256:…` on the launch script.

## AMD container (MI355X) — MoRI EP

AMD CDNA4 cannot run the CUDA multi-arch image; MI355X uses a ROCm image that
bundles **MoRI** (AMD's EP dispatch/combine library). Set in `cx_default_image`
for `mi355x*` (also `mi350x*`/`mi325x*`/`mi300x*`).

- **Image:** `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2` (single-arch ROCm 7.2.0 runtime; from the AMD master serving config). **Not digest-pinned yet** — record the digest here and pin once validated on the runner, like the NVIDIA image.
- **MoRI:** bundled in-image (build tag `mori-0227`). `tests/ep_mori.py` follows the upstream `ROCm/mori` `tests`/`examples` dispatch+combine path; capture the exact MoRI commit (`MORI_COMMIT` env → provenance) on first run.
- **Squash is NODE-LOCAL** (`/var/lib/squash`), not a shared FS, so `launch_mi355x-amds.sh` imports via `srun` on the allocated node (the NVIDIA adapters import on the login node onto shared FS). pyxis flags `--container-writable --container-remap-root` (matches the AMD serving launcher); workspace is bind-mounted directly (no `CX_STAGE_DIR`).
- **Transport:** intra-node **XGMI** (8× MI355X). Two backends wired: `CX_BENCH=mori` (MoRI EP dispatch/combine) and `CX_BENCH=nccl` (collective primitives via **rccl-tests**, the ROCm nccl-tests fork — built in-container with `make` against `/opt/rocm`/`amdclang++`/`librccl`; same `<op>_perf` binaries + output format as nccl-tests, so `run_nccl.py` parses it unchanged).
- **Validated on MI355X** (on-node via `salloc`+`srun`, nodes `mia1-p01-g10`/`g15`): `salloc` → enroot import (anonymous auth + tag, 24 layers → ~60 GB node-local squash) → torchrun → 8-rank Gloo + MoRI shmem → `EpDispatchCombineConfig`/dispatch/combine **numerically correct** (combine within tol, `max_rel ~2e-3`, ~85 µs round-trip at the decode shape). Three ionic_rdma-fabric constraints, all handled in `tests/ep_mori.py`:
- **RDMA MR size ceiling (~4 GiB).** MoRI registers the *entire* symmetric heap as one RDMA MR at init — even single-node (no disable-RDMA knob exists; only `MORI_DISABLE_P2P`, which forces the opposite). On these ionic NICs a 6 GiB MR fails (`RegisterRdmaMemoryRegion … errno 22 EINVAL`) while 2 GiB registers. Heap is held at **`MORI_SHMEM_HEAP_SIZE=2G`** (override `CX_MORI_HEAP_SIZE`). The reference test's hardcoded `6G` is exactly why it can't run as-is here.
- **Buffer sizing.** `max_num_inp_token_per_rank` is bounded (512 at the decode shape) so dispatch/combine buffers fit the 2 GiB heap. Much larger token counts would need a heap past the MR ceiling — out of reach on this fabric for now.
- **Teardown.** MoRI's shmem teardown asserts (`CheckStatusValid` → SIGABRT) when the op is destroyed after `shmem_finalize()`; `tests/ep_mori.py`'s `finalize()` hard-exits after writing results to avoid it.
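
The heap constraint can be sketched as a guard that resolves the heap size and rejects values at or above the observed ceiling. This is illustrative only: the suffix parsing, the 4 GiB constant (inferred from "2 GiB registers, 6 GiB fails"), and the function names are assumptions, not code from `tests/ep_mori.py`.

```python
import os

_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# Assumed ceiling: the text above only establishes that 2 GiB registers and
# 6 GiB fails, so ~4 GiB is an estimate rather than a measured limit.
ASSUMED_MR_CEILING = 4 * (1 << 30)


def parse_size(text: str) -> int:
    """Parse '2G', '512M', or a plain byte count into bytes."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def resolve_heap_size(env=None) -> str:
    """Return the value to export as MORI_SHMEM_HEAP_SIZE.

    CX_MORI_HEAP_SIZE overrides the 2G default; values at or above the
    assumed ceiling are rejected before MoRI attempts the RDMA registration.
    """
    env = os.environ if env is None else env
    heap = env.get("CX_MORI_HEAP_SIZE", "2G")
    if parse_size(heap) >= ASSUMED_MR_CEILING:
        raise ValueError(f"heap {heap} is at or above the assumed ~4 GiB MR ceiling")
    return heap
```

Failing in Python before init gives a readable error in place of the `errno 22 EINVAL` registration failure.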

Still TODO: capture the exact MoRI commit + a version table (ROCm/torch/RCCL) into provenance, and digest-pin the image.

## Cluster access / QOS

- **B200** (`slurm-login-slinky`): account `benchmark`, **only `gpu-2_qos`** → partition `gpu-2` only (shared with the serving sweep). The idle `gpu-1` and `all` partitions need `gpu-1_qos`/`all_qos`, which are not associated with this account.
- **GB200** (`watchtower`): account `benchmark`, qos `normal`, partition `batch` (`AllowQos=ALL`); idle capacity available. Runner workspace is **not** compute-visible → set `CX_STAGE_DIR` to a Lustre path (the launcher rsyncs there).

## First real results (Milestone-0 spike, on the DeepSeek-V4 images)

nccl-tests (system NCCL 2.28.3), all correctness-passed, peak bus-bw:

| op | B200 8× (NVLink island, x86_64) | GB200 4× (NVL72 MNNVL, aarch64) |
|---|---|---|
| all_reduce | 835 GB/s | 689 GB/s |
| all_gather | 653 GB/s | 658 GB/s |
| reduce_scatter | 667 GB/s | 661 GB/s |
| alltoall | 638 GB/s | 666 GB/s |

(B200 and GB200 carry distinct `comparison_key`s by topology class, so their results are labelled as distinct rather than silently merged. Re-run on the multi-arch default to refresh both under one image.)
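
The guard can be pictured as grouping results by key and only overlaying within a group. This is a sketch, assuming each result is a dict with a `comparison_key` string; the key values shown are made up for illustration and are not the real key format.

```python
from collections import defaultdict


def group_by_comparison_key(results: list) -> dict:
    """Bucket result records so that only same-key records share a plot."""
    groups = defaultdict(list)
    for record in results:
        groups[record["comparison_key"]].append(record)
    return dict(groups)


def overlay_allowed(a: dict, b: dict) -> bool:
    """Two curves may be drawn on one axis only if their keys match."""
    return a["comparison_key"] == b["comparison_key"]


# Peak all_reduce bus bandwidth from the table above; key strings are hypothetical.
b200 = {"comparison_key": "all_reduce|b200-nvlink-island", "busbw_gb_s": 835}
gb200 = {"comparison_key": "all_reduce|gb200-nvl72-mnnvl", "busbw_gb_s": 689}
groups = group_by_comparison_key([b200, gb200])
```

Here `groups` has two entries, so the two SKUs end up in separate, labelled series.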
128 changes: 128 additions & 0 deletions experimental/CollectiveX/README.md
@@ -0,0 +1,128 @@
# CollectiveX

Cross-vendor collective / EP-library benchmark (see `plan.md`). Per-SKU **launch
adapters** (InferenceX-style `launch_<sku>.sh`) run **any benchmark** — selected
by `CX_BENCH` — through a shared in-container runner, and a GitHub Actions
workflow triggers runs on `push` (no merge to main needed). Milestone-0 headline
already ran for real on both B200 (8× NVLink island) and GB200 (4× NVL72 MNNVL).

> Experimental: WIP, not an official InferenceMAX result. All logic stays under
> `experimental/CollectiveX/`; the only file outside is the orchestration-only
> workflow.

## Files

| File | Role |
|---|---|
| `env_capture.py` | Layer-0 environment + topology fingerprint → JSON (stdlib only) |
| `run_nccl.py` | run stock `nccl-tests`, parse the text table, emit flat JSON (stdlib only) |
| `tests/run_ep.py` | EP dispatch/combine entrypoint (torchrun): source-tokens-per-rank sweep, dispatch & combine timed **separately** |
| `tests/ep_harness.py` | shared EP harness: token ladder, separated timing, correctness gate, doc emission (stdlib top) |
| `tests/ep_deepep.py`, `tests/ep_mori.py` | per-backend adapters (DeepEP / MoRI) implementing the harness protocol |
| `plot.py` | latency/bus-bw curves, B200-vs-GB200 overlay with a comparison guard (matplotlib) |
| `runtime/common.sh` | shared helpers: image resolve, enroot squash, staging, nccl-tests build |
| `runtime/run_in_container.sh` | generic in-container dispatcher — runs `CX_BENCH` (nccl/deepep/mori/all) over `CX_PHASE` |
| `launchers/launch_<sku>.sh` | per-SKU adapters: `launch_b200-dgxc.sh` (8× NVLink), `launch_b200-dgxc-slurm.sh` (2-node IB), `launch_gb200-nv.sh` (NVL72 MNNVL), `launch_mi355x-amds.sh` (8× XGMI, AMD MoRI + rccl) |
| `CONTAINERS.md` | the pinned multi-arch container + audited library versions |
| `results/` | flat JSON artifacts (+ `plots/`, raw captures) |
| `tests/fixtures/` | captured nccl-tests output for offline parser checks |
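
For orientation, parsing the `nccl-tests` text table into flat records can look roughly like the sketch below. It is not the code in `run_nccl.py`: it assumes the 13-column `all_reduce_perf` layout (out-of-place columns followed by in-place columns), and the sample row is synthetic, not a measurement.

```python
# Synthetic sample in the all_reduce_perf layout; the numbers are made up.
SAMPLE = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1    25.10   41.78   73.11      0    24.90   42.11   73.69      0
# Avg bus bandwidth    : 73.4
"""


def parse_nccl_table(text: str) -> list:
    """Return one dict per data row, using the out-of-place columns."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip headers, comments, and blank lines
        cols = stripped.split()
        if len(cols) != 13 or not cols[0].isdigit():
            continue  # not a data row in the assumed layout
        rows.append(
            {
                "size_bytes": int(cols[0]),
                "time_us": float(cols[5]),
                "algbw_gb_s": float(cols[6]),
                "busbw_gb_s": float(cols[7]),
                # '#wrong' may be non-numeric when checking is disabled.
                "wrong": int(cols[8]) if cols[8].isdigit() else None,
            }
        )
    return rows


rows = parse_nccl_table(SAMPLE)
```

Keeping the parser free of third-party imports matches the "stdlib only" note in the table and lets it run against the captured fixtures without a GPU.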

## Run

### Via GitHub Actions (`.github/workflows/collectivex-experimental.yml`)

- **push** to `experimental/CollectiveX/**` → the **MI355X MoRI** EP dispatch/combine
sweep, **one job per phase** (decode + prefill) via a matrix (lands on free
`mi355x-amds` runners).
- **workflow_dispatch** → pick `sku` (gb200 / b200-dgxc / b200-multinode /
mi355x), `benchmark` (nccl / deepep / mori / all — `mori` is AMD-only; `nccl`
on MI355X runs rccl-tests), `phase` (decode / prefill / **both** → a job each),
`tokens_ladder`, `dispatch_dtype`, ops, sizes, ngpus. Lands on that SKU's
self-hosted runner and runs `launch_${RUNNER_NAME%%_*}.sh`. For EP results
across all SKUs, dispatch once per `sku` with `phase=both`.
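
The `${RUNNER_NAME%%_*}` expansion strips the longest suffix that starts with an underscore, which leaves everything before the first `_`. A Python equivalent, with a hypothetical runner name used purely as an example:

```python
def launcher_for_runner(runner_name: str) -> str:
    """Mirror launch_${RUNNER_NAME%%_*}.sh: keep the text before the first '_'."""
    sku = runner_name.split("_", 1)[0]
    return f"launch_{sku}.sh"


# 'gb200-nv_0' is a made-up runner name for illustration.
example = launcher_for_runner("gb200-nv_0")
```

A name without an underscore passes through unchanged, which matches the shell behaviour when the pattern does not match.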

Each job renders a results table to the **GitHub Actions job summary** (via
`summarize.py --markdown` → `$GITHUB_STEP_SUMMARY`) and uploads the result JSONs
as an artifact. (The workflow only fires once the branch is pushed to GitHub.)

### Directly on a cluster login node

```bash
# benchmark is selected by CX_BENCH (default nccl; the MI355X launcher defaults to mori)
bash experimental/CollectiveX/launchers/launch_gb200-nv.sh # GB200, NCCL primitives
CX_BENCH=deepep bash experimental/CollectiveX/launchers/launch_gb200-nv.sh # GB200, DeepEP (rebuild)
bash experimental/CollectiveX/launchers/launch_b200-dgxc.sh # B200 8× NVLink
bash experimental/CollectiveX/launchers/launch_b200-dgxc-slurm.sh # B200 2-node, cross-IB
bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh # MI355X 8× XGMI, MoRI EP (CX_BENCH=mori, default)
CX_BENCH=nccl bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh # MI355X primitives via rccl-tests
```

Knobs: `CX_BENCH` (nccl|deepep|mori|all), `CX_OPS`, `CX_MIN_BYTES`/`CX_MAX_BYTES`,
`CX_NGPUS`, `CX_TIME`, `CX_IMAGE`, `CX_SQUASH_DIR`, `CX_STAGE_DIR` (compute-visible
staging — needed on GB200/watchtower), `CX_DRYRUN=1` (print plan, allocate
nothing). EP (deepep/mori) adds `CX_PHASE` (decode|prefill|both), `CX_TOKENS_LADDER`
(e.g. `"1 2 4 8 16 32 64 128"`), `CX_HIDDEN`/`CX_TOPK`/`CX_EXPERTS`,
`CX_DISPATCH_DTYPE`, `CX_NUM_EP_GROUPS`. Results land in `experimental/CollectiveX/results/`.
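
As an illustration of how a space-separated knob such as `CX_TOKENS_LADDER` can be turned into a sweep, here is a minimal sketch. The validation rules (positive, strictly increasing) are assumptions made for this example and are not confirmed behaviour of `tests/ep_harness.py`.

```python
def parse_tokens_ladder(raw: str) -> list:
    """Parse '1 2 4 8' into [1, 2, 4, 8], rejecting malformed ladders."""
    ladder = [int(token) for token in raw.split()]
    if not ladder:
        raise ValueError("tokens ladder is empty")
    if any(step <= 0 for step in ladder):
        raise ValueError("token counts must be positive")
    # Assumed rule: a sweep is easier to plot when strictly increasing.
    if ladder != sorted(set(ladder)):
        raise ValueError("tokens ladder must be strictly increasing")
    return ladder


ladder = parse_tokens_ladder("1 2 4 8 16 32 64 128")
```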

### Offline (no GPU) — verify the parser/JSON pipeline

```bash
python3 run_nccl.py --op all_reduce --parse-only tests/fixtures/all_reduce_perf_b200_8gpu.txt \
--world-size 8 --nodes 1 --runner b200-dgxc --topology-class b200-nvlink-island --out /tmp/parsed.json
python3 env_capture.py # prints a (degraded, off-GPU) env record
python3 plot.py --results-dir results --out-dir results/plots # needs matplotlib
```

## Container

One **multi-arch** image for all NVIDIA SKUs, imported by tag
`lmsysorg/sglang:v0.5.11-cu130` (amd64 + arm64; index digest `sha256:061fb71f…`
recorded for provenance). Imported by tag, not digest — enroot's anonymous
Docker Hub auth needs a tag, and a bare digest ref hangs in CI. See
`CONTAINERS.md` for versions, the DeepEP-rebuild note, and the bundled-DeepEP
DeepSeek-V4 fallback images.

## How it runs (confirmed against the live clusters)

- Adapters mirror `runners/launch_*.sh`: `salloc` → enroot squash (import only if
missing) → `srun --container-image=… --container-mounts=<repo>:/ix` → in-container
`run_in_container.sh`. B200 partition `gpu-2`, GB200 partition `batch`, account
`benchmark`.
- **AMD MI355X** (`launch_mi355x-amds.sh`, MoRI / `CX_BENCH=mori`) diverges: partition
`compute`, no account, pyxis `--container-writable --container-remap-root`, and a
**node-local** squash (`/var/lib/squash`) imported via `srun` on the allocated node
(not the login node). Workspace is bind-mounted directly (no `CX_STAGE_DIR`).
- Login nodes have no `nvcc`, so `nccl-tests` is **built in-container** (cached in
`.nccl-tests/`, `CX_NCCL_HOME=/usr`). Single-node uses `-g N`; the 2-node
adapter builds `MPI=1` and launches one rank per GPU (`srun --mpi=pmix`).
- The sglang image installs editable under `/workspace`, so the repo is mounted at
**`/ix`**. GB200 compute nodes don't see the runner workspace → `CX_STAGE_DIR`
rsyncs the tree to Lustre first.
- Every result embeds an `env_capture` record and a `comparison_key`; topology
  class is part of the key, so B200 (IB/NVLink) and GB200 (MNNVL) stay labelled
  distinct and are never silently overlaid.

## Status & known risks

- **Spike done on real hardware** (both SKUs, 4 NCCL primitives, correctness-passed)
— on the DeepSeek-V4 images. Now standardizing on the **multi-arch** default;
validate it on first run and refresh `CONTAINERS.md` (expect CUDA 13 / NCCL 2.28 / torch 2.9).
- **DeepEP** is not bundled in the multi-arch image → `run_in_container.sh` builds
it via `rebuild-deepep` (CX_BENCH=deepep). Its Python API is version-sensitive;
`tests/ep_deepep.py` follows the documented normal-mode API — validate against
the built commit. B200 (x86_64) first; GB200 (aarch64) follows.
- **MoRI / MI355X** (`tests/ep_mori.py` + `launch_mi355x-amds.sh`) is **validated on
hardware** (8× MI355X: dispatch+combine numerically correct, ~85 µs round-trip).
It mirrors `ROCm/mori`'s example (config + `get_registered_combine_input_buffer`
zero-copy path, `expected = input × #unique-destination-ranks`). Three
ionic_rdma-fabric constraints are baked in (see `CONTAINERS.md`): a 2 GiB heap
(the NICs cap RDMA MRs at ~4 GiB), a bounded `max_num_inp_token_per_rank`, and a
hard-exit past MoRI's buggy shmem teardown. The ROCm image isn't digest-pinned yet.
- **Multi-node** (`launch_b200-dgxc-slurm.sh`) assumes `srun --mpi=pmix` + a
compute-visible checkout (`CX_STAGE_DIR`); else fall back to mpirun-in-container
  or srt-slurm. `CX_BENCH=nccl` only for now.
- **B200 QOS:** account `benchmark` has only `gpu-2_qos` (the serving-sweep
partition); idle `gpu-1` needs a QOS grant. GB200 `batch` is open.
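
The MoRI correctness rule quoted above, `expected = input × #unique-destination-ranks`, can be sketched in plain Python. Two assumptions are made here and are not taken from `tests/ep_mori.py`: experts are laid out contiguously across ranks (expert `e` lives on rank `e // experts_per_rank`), and combine sums one unweighted copy of the token per destination rank.

```python
def unique_destination_ranks(topk_experts: list, experts_per_rank: int) -> int:
    """Count distinct ranks hit by a token's top-k experts.

    Assumes a contiguous layout: expert e lives on rank e // experts_per_rank.
    """
    return len({expert // experts_per_rank for expert in topk_experts})


def expected_combine(token: list, topk_experts: list, experts_per_rank: int) -> list:
    """Expected combine output: the input scaled by the distinct-rank count."""
    factor = unique_destination_ranks(topk_experts, experts_per_rank)
    return [value * factor for value in token]


# Experts 0 and 1 share rank 0, so four experts reach only three ranks.
expected = expected_combine([1.0, -2.0], [0, 1, 9, 17], experts_per_rank=8)
```

The factor is the rank count rather than `topk` because a token routed to several experts on the same rank is sent to that rank once.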

Once the multi-arch image is validated end-to-end, freeze the schema from the
artifacts (plan: "Freeze the contract").