CollectiveX: experimental cross-vendor collective/EP benchmark #1896
base: main
Changes from 13 commits
83761d0
b7ed913
e6fdd84
ccfae8e
b384171
f48daed
be9cc91
d8ee9bf
ac3f1b9
46208f2
b62de99
481ef59
78322de
2b23573
a3a492c
871086d
368cfbc
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| name: CollectiveX Experimental | ||
|
|
||
| # Orchestration only — all benchmark logic lives in experimental/CollectiveX/. | ||
| # Push to the feature branch runs the MI355X MoRI dispatch/combine benchmark (no | ||
| # merge to main needed); workflow_dispatch runs a chosen SKU + benchmark (the lane | ||
| # for GB200/B200 NCCL, DeepEP, and larger sweeps). Each job lands on the SKU's | ||
| # self-hosted runner and invokes that SKU's launch script — the same | ||
| # launch_${RUNNER_NAME%%_*}.sh convention the serving benchmarks use. | ||
|
|
||
| on: | ||
| push: | ||
| branches: | ||
| - collectivex | ||
| paths: | ||
| - 'experimental/CollectiveX/**' | ||
| - '.github/workflows/collectivex-experimental.yml' | ||
| workflow_dispatch: | ||
| inputs: | ||
| sku: | ||
| # Only SKUs with a matching launchers/launch_<prefix>.sh are offered — | ||
| # runner.name's prefix selects the script, so an SKU without one fails. | ||
| description: Self-hosted runner pool (must have a CollectiveX launcher) | ||
| type: choice | ||
| default: gb200 | ||
| options: [gb200, b200-dgxc, b200-multinode, mi355x] | ||
| benchmark: | ||
| # mori runs only on mi355x; nccl/deepep/all on the NVIDIA SKUs. | ||
| description: Which benchmark to run | ||
| type: choice | ||
| default: nccl | ||
| options: [nccl, deepep, mori, all] | ||
| ops: | ||
| description: NCCL ops (space-separated); blank = default set | ||
| type: string | ||
| default: '' | ||
| min_bytes: | ||
| description: nccl-tests min message size | ||
| type: string | ||
| default: '8' | ||
| max_bytes: | ||
| description: nccl-tests max message size | ||
| type: string | ||
| default: '8G' | ||
| ngpus: | ||
| description: GPUs per node (blank = SKU default) | ||
| type: string | ||
| default: '' | ||
|
|
||
| concurrency: | ||
| # Include the dispatch SKU so two workflow_dispatch runs on different SKUs do | ||
| # not cancel each other; push has no sku input -> shares one 'push' group. | ||
| group: collectivex-${{ github.ref }}-${{ github.event_name }}-${{ inputs.sku || 'push' }} | ||
| cancel-in-progress: true | ||
|
|
||
| permissions: | ||
| contents: read | ||
|
|
||
| jobs: | ||
| # Push -> MI355X MoRI dispatch/combine. Lands on a free mi355x-amds runner and | ||
| # runs launch_mi355x-amds.sh (CX_BENCH=mori). The AMD workspace is compute- | ||
| # visible, so no CX_STAGE_DIR; the launcher defaults to 8 GPUs. | ||
| experimental: | ||
| name: CollectiveX Experimental | ||
| if: github.event_name == 'push' | ||
| runs-on: mi355x | ||
| timeout-minutes: 90 | ||
| env: | ||
| CX_BENCH: mori | ||
| steps: | ||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0 | ||
| with: { clean: true } | ||
| - name: Launch MI355X MoRI | ||
| env: | ||
| RUNNER_NAME: ${{ runner.name }} | ||
| run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh" | ||
| - name: Results summary | ||
| if: always() | ||
| run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" | ||
| - name: Upload results | ||
| if: always() | ||
| uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1 | ||
| with: | ||
| name: collectivex_mi355x_mori_${{ github.run_id }} | ||
| path: experimental/CollectiveX/results/*.json | ||
| if-no-files-found: warn | ||
|
cursor[bot] marked this conversation as resolved.
|
||
|
|
||
| # Manual dispatch -> chosen SKU + benchmark. Lands on the inputs.sku runner. | ||
| dispatch: | ||
| if: github.event_name == 'workflow_dispatch' | ||
| runs-on: ${{ inputs.sku }} | ||
| timeout-minutes: 120 | ||
| env: | ||
| CX_BENCH: ${{ inputs.benchmark }} | ||
| CX_OPS: ${{ inputs.ops }} | ||
| CX_MIN_BYTES: ${{ inputs.min_bytes }} | ||
| CX_MAX_BYTES: ${{ inputs.max_bytes }} | ||
| CX_NGPUS: ${{ inputs.ngpus }} | ||
|
Workflow ngpus env ignored
Medium Severity
The dispatch job sets …
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a3a492c.
||
| # GB200/watchtower needs a compute-visible workspace; harmless elsewhere. | ||
| CX_STAGE_DIR: ${{ inputs.sku == 'gb200' && '/mnt/lustre01/users-public/sa-shared/cx-stage' || '' }} | ||
| steps: | ||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v5.0.0 | ||
| with: { clean: true } | ||
| - name: Launch ${{ inputs.sku }} / ${{ inputs.benchmark }} | ||
| env: | ||
| RUNNER_NAME: ${{ runner.name }} | ||
| run: bash "experimental/CollectiveX/launchers/launch_${RUNNER_NAME%%_*}.sh" | ||
|
cursor[bot] marked this conversation as resolved.
Workflow skips multinode staging
Medium Severity
Reviewed by Cursor Bugbot for commit d8ee9bf.
||
| - name: Results summary | ||
| if: always() | ||
| run: python3 experimental/CollectiveX/summarize.py --results-dir experimental/CollectiveX/results --markdown >> "$GITHUB_STEP_SUMMARY" | ||
| - name: Upload results | ||
| if: always() | ||
| uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1 | ||
| with: | ||
| name: collectivex_${{ inputs.sku }}_${{ inputs.benchmark }}_${{ github.run_id }} | ||
| path: experimental/CollectiveX/results/*.json | ||
| if-no-files-found: warn | ||
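Both jobs pick their launcher with `launch_${RUNNER_NAME%%_*}.sh`, so the part of the runner name before the first underscore decides which script runs. The sketch below mirrors that shell expansion in Python to make the mapping easy to check. The runner names in the usage note are hypothetical, and `launcher_for` is an illustrative helper, not something in the PR.

```python
from pathlib import PurePosixPath

# Directory taken from the workflow's `run:` lines.
LAUNCHER_DIR = PurePosixPath("experimental/CollectiveX/launchers")


def launcher_for(runner_name: str) -> PurePosixPath:
    """Mirror bash's ${RUNNER_NAME%%_*}.

    `%%_*` strips the longest suffix matching `_*`, i.e. everything from the
    first underscore onward. A name without an underscore is left unchanged.
    """
    prefix = runner_name.split("_", 1)[0]
    return LAUNCHER_DIR / f"launch_{prefix}.sh"
```

For a hypothetical runner named `mi355x-amds_03`, this resolves to `experimental/CollectiveX/launchers/launch_mi355x-amds.sh`. A runner whose prefix has no matching script would fail at the `bash` call, which is why the `sku` choice list is restricted.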
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| # in-container nccl-tests build cache | ||
| .nccl-tests/ | ||
| # python | ||
| __pycache__/ | ||
| *.pyc | ||
| # generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs, | ||
| # so keep results out of git (CI uploads them as workflow artifacts instead). | ||
| # Sanitized headline numbers live in CONTAINERS.md. | ||
| results/*.json | ||
| results/plots/ | ||
| results/raw_*.txt | ||
| results/raw_*.txt.stderr |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| # CollectiveX — container & library versions | ||
|
|
||
| One **multi-arch** container, imported by tag with its index digest recorded for | ||
| verification, is used for all NVIDIA SKUs, so B200 (x86_64) and GB200 (aarch64) share | ||
| a single reference and the B200-vs-GB200 comparison is truly same-image. Set in `launchers/common.sh` (`cx_default_image`). | ||
|
|
||
| ## Default container (all NVIDIA SKUs) | ||
|
|
||
| - **Image:** import by tag **`lmsysorg/sglang:v0.5.11-cu130`** (multi-arch OCI index). Expected index digest, recorded for provenance/verification: `sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975`. | ||
| - **Multi-arch manifest list:** linux/amd64 + linux/arm64; `enroot import` on each host pulls the matching arch. | ||
| - **Import by TAG, not digest.** enroot builds its anonymous Docker Hub token scope from the *tag* and succeeds (no creds needed — same as the serving launchers). A bare `repo@sha256:` ref makes enroot prompt for a password and **hang** in non-interactive CI; a combined `tag@sha256:` ref 400s. `cx_ensure_squash` therefore imports by tag with `</dev/null` (a missing token fails fast instead of hanging). First import is multi-GB (~minutes); subsequent runs reuse the staged squash. | ||
| - **Why v0.5.11-cu130 (chosen):** it's the newest cu130 release **pre-staged on BOTH clusters** — B200 `/home/sa-shared/containers/` (amd64 squash) and GB200 `/mnt/lustre01/users-public/sa-shared/` (arm64 squash), same filename — so neither side imports at all. (Shared cu130 multi-arch squashes across both clusters: v0.5.8.post1, v0.5.9, v0.5.11 — v0.5.11 is newest.) `v0.5.12-cu130` is staged on B200 but **not** GB200: its 62 layers overflow enroot's overlay-based squash creation on the GB200 kernel (`enroot-mksquashovlfs: failed to mount overlay … Invalid argument`), so it can't be the shared default. | ||
| - **DeepEP: NOT bundled** here → `run_in_container.sh` builds it via `rebuild-deepep` at job setup (CX_BENCH=deepep). The NCCL path needs no DeepEP. | ||
| - **nccl-tests build:** in-container (login nodes have no `nvcc`), `CX_NCCL_HOME=/usr` (system `nccl.h` in `/usr/include`), `CX_CUDA_HOME=/usr/local/cuda`. cu130 lineage ⇒ CUDA 13; confirm exact NCCL/torch on first run and append below. | ||
|
|
||
| ## Audited reference (cu130 lineage) | ||
|
|
||
| Live audit of the sibling DeepSeek-V4 image `lmsysorg/sglang:deepseek-v4-grace-blackwell` (aarch64) on GB200, 2026-06-23 — the multi-arch `v0.5.11-cu130` should match closely (same cu130 base); reconfirm on first run: | ||
|
|
||
| | Component | Version | | ||
| |---|---| | ||
| | OS / arch | Ubuntu 24.04.3, aarch64 | | ||
| | CUDA (`nvcc`) | 13.0 (V13.0.88) | | ||
| | NCCL (system `/usr/include/nccl.h`) | 2.28.3; torch-bundled 2.27.7 | | ||
| | PyTorch | 2.9.1+cu130 | | ||
| | DeepEP | bundled in *that* image; **not** in the multi-arch default | | ||
| | NVSHMEM | `libnvshmem_host.so.3` present | | ||
| | OpenMPI / gcc / make | 4.1.6 / 13.3.0 / 4.3 | | ||
| | GPU / driver | GB200, 580.126.20 | | ||
|
|
||
| **Version caveat:** the nccl-tests binary links **system NCCL** (2.28.x), while torch/DeepEP use the **bundled** NCCL (2.27.x). Record both in provenance (env_capture does); don't compare an nccl-tests curve against a DeepEP run as if NCCL were identical. | ||
|
|
||
| ## Bundled-DeepEP reference images (not the default) | ||
|
|
||
| If a bundled DeepEP is needed before `rebuild-deepep` is wired on the multi-arch image, these arch-specific images bundle it (pin by digest): | ||
|
|
||
| - B200 (amd64): `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b` (pre-staged on B200) | ||
| - GB200 (arm64): `lmsysorg/sglang:deepseek-v4-grace-blackwell@sha256:4f583347d7ff08aef7e16dbb4985b2a7c147ff49a0c261d5e27b8f5f41719368` (staged on GB200 Lustre) | ||
|
|
||
| Select via `CX_IMAGE=…@sha256:…` on the launch script. | ||
|
|
||
| ## AMD container (MI355X) — MoRI EP | ||
|
|
||
| AMD CDNA4 cannot run the CUDA multi-arch image; MI355X uses a ROCm image that | ||
| bundles **MoRI** (AMD's EP dispatch/combine library). Set in `cx_default_image` | ||
| for `mi355x*` (also `mi350x*`/`mi325x*`/`mi300x*`). | ||
|
|
||
| - **Image:** `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2` (single-arch ROCm 7.2.0 runtime; from the AMD master serving config). **Not digest-pinned yet** — record the digest here and pin once validated on the runner, like the NVIDIA image. | ||
| - **MoRI:** bundled in-image (build tag `mori-0227`). `run_mori.py` follows the upstream `ROCm/mori` `tests`/`examples` dispatch+combine path; capture the exact MoRI commit (`MORI_COMMIT` env → provenance) on first run. | ||
| - **Squash is NODE-LOCAL** (`/var/lib/squash`), not a shared FS, so `launch_mi355x-amds.sh` imports via `srun` on the allocated node (the NVIDIA adapters import on the login node onto shared FS). pyxis flags `--container-writable --container-remap-root` (matches the AMD serving launcher); workspace is bind-mounted directly (no `CX_STAGE_DIR`). | ||
| - **Transport:** intra-node **XGMI** (8× MI355X). Two backends wired: `CX_BENCH=mori` (MoRI EP dispatch/combine) and `CX_BENCH=nccl` (collective primitives via **rccl-tests**, the ROCm nccl-tests fork — built in-container with `make` against `/opt/rocm`/`amdclang++`/`librccl`; same `<op>_perf` binaries + output format as nccl-tests, so `run_nccl.py` parses it unchanged). | ||
| - **Validated on MI355X** (on-node via `salloc`+`srun`, nodes `mia1-p01-g10`/`g15`): `salloc` → enroot import (anonymous auth + tag, 24 layers → ~60 GB node-local squash) → torchrun → 8-rank Gloo + MoRI shmem → `EpDispatchCombineConfig`/dispatch/combine **numerically correct** (combine within tol, `max_rel ~2e-3`, ~85 µs round-trip at the decode shape). Three ionic_rdma-fabric constraints, all handled in `run_mori.py`: | ||
| - **RDMA MR size ceiling (~4 GiB).** MoRI registers the *entire* symmetric heap as one RDMA MR at init — even single-node (no disable-RDMA knob exists; only `MORI_DISABLE_P2P`, which forces the opposite). On these ionic NICs a 6 GiB MR fails (`RegisterRdmaMemoryRegion … errno 22 EINVAL`) while 2 GiB registers. Heap is held at **`MORI_SHMEM_HEAP_SIZE=2G`** (override `CX_MORI_HEAP_SIZE`). The reference test's hardcoded `6G` is exactly why it can't run as-is here. | ||
| - **Buffer sizing.** `max_num_inp_token_per_rank` is bounded (512 at the decode shape) so dispatch/combine buffers fit the 2 GiB heap. Much larger token counts would need a heap past the MR ceiling — out of reach on this fabric for now. | ||
| - **Teardown.** MoRI's shmem teardown asserts (`CheckStatusValid` → SIGABRT) when the op is destroyed after `shmem_finalize()`; `run_mori.py` hard-exits after writing results to avoid it. | ||
|
|
||
| Still TODO: capture the exact MoRI commit + a version table (ROCm/torch/RCCL) into provenance, and digest-pin the image. | ||
|
|
||
| ## Cluster access / QOS | ||
|
|
||
| - **B200** (`slurm-login-slinky`): account `benchmark`, **only `gpu-2_qos`** → partition `gpu-2` only (shared with the serving sweep). `gpu-1`/`all` (idle) need `gpu-1_qos`/`all_qos`, not associated with this account. | ||
| - **GB200** (`watchtower`): account `benchmark`, qos `normal`, partition `batch` (`AllowQos=ALL`); idle capacity available. Runner workspace is **not** compute-visible → set `CX_STAGE_DIR` to a Lustre path (the launcher rsyncs there). | ||
|
|
||
| ## First real results (Milestone-0 spike, on the DeepSeek-V4 images) | ||
|
|
||
| nccl-tests (system NCCL 2.28.3), all correctness-passed, peak bus-bw: | ||
|
|
||
| | op | B200 8× (NVLink island, x86_64) | GB200 4× (NVL72 MNNVL, aarch64) | | ||
| |---|---|---| | ||
| | all_reduce | 835 GB/s | 689 GB/s | | ||
| | all_gather | 653 GB/s | 658 GB/s | | ||
| | reduce_scatter | 667 GB/s | 661 GB/s | | ||
| | alltoall | 638 GB/s | 666 GB/s | | ||
|
|
||
| (B200 vs GB200 carry distinct `comparison_key`s by topology-class, so they are labelled-distinct, not silently merged. Re-run on the multi-arch default to refresh under one image.) |
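The MI355X notes in this file say a 6 GiB symmetric heap fails RDMA MR registration while 2 GiB registers, with a ceiling of roughly 4 GiB. A pre-flight check along those lines could look like the sketch below. It assumes size strings such as `2G` use binary suffixes, which matches the "2 GiB" wording here but is not confirmed against MoRI's own parser. The exact ceiling is also an assumption, and `parse_size` and `heap_fits` are hypothetical helpers.

```python
import re

# Binary multipliers; assumes "2G" means 2 GiB, as the text above implies.
_UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

# The document only says "~4 GiB"; the precise limit is an assumption.
ASSUMED_MR_CEILING = 4 * 2**30


def parse_size(text: str) -> int:
    """Convert strings like '8', '512M' or '2G' to a byte count."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def heap_fits(heap: str, ceiling: int = ASSUMED_MR_CEILING) -> bool:
    """True when the requested heap is no larger than the assumed MR ceiling."""
    return parse_size(heap) <= ceiling
```

With these assumptions, `heap_fits("2G")` is true and `heap_fits("6G")` is false, matching the two data points reported above. Values near 4 GiB are untested, so treat the boundary as approximate.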
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,123 @@ | ||
| # CollectiveX | ||
|
|
||
| Cross-vendor collective / EP-library benchmark (see `plan.md`). Per-SKU **launch | ||
| adapters** (InferenceX-style `launch_<sku>.sh`) run **any benchmark** — selected | ||
| by `CX_BENCH` — through a shared in-container runner, and a GitHub Actions | ||
| workflow triggers runs on `push` (no merge to main needed). Milestone-0 headline | ||
| already ran for real on both B200 (8× NVLink island) and GB200 (4× NVL72 MNNVL). | ||
|
|
||
| > Experimental: WIP, not an official InferenceMAX result. All logic stays under | ||
| > `experimental/CollectiveX/`; the only file outside is the orchestration-only | ||
| > workflow. | ||
|
|
||
| ## Files | ||
|
|
||
| | File | Role | | ||
| |---|---| | ||
| | `env_capture.py` | Layer-0 environment + topology fingerprint → JSON (stdlib only) | | ||
| | `run_nccl.py` | run stock `nccl-tests`, parse the text table, emit flat JSON (stdlib only) | | ||
| | `run_deepep.py` | DeepEP dispatch+combine, normal mode, correctness-gated (torch + DeepEP) | | ||
| | `run_mori.py` | MoRI (AMD) dispatch+combine, normal mode, correctness-gated (torch + MoRI) | | ||
| | `plot.py` | latency/bus-bw curves, B200-vs-GB200 overlay with a comparison guard (matplotlib) | | ||
| | `launchers/common.sh` | shared helpers: image resolve, enroot squash, staging, nccl-tests build | | ||
| | `launchers/run_in_container.sh` | generic in-container dispatcher — runs `CX_BENCH` (nccl/deepep/mori/all) | | ||
| | `launchers/launch_<sku>.sh` | per-SKU adapters: `launch_b200-dgxc.sh` (8× NVLink), `launch_b200-dgxc-slurm.sh` (2-node IB), `launch_gb200-nv.sh` (NVL72 MNNVL), `launch_mi355x-amds.sh` (8× XGMI, AMD MoRI + rccl) | | ||
| | `CONTAINERS.md` | the pinned multi-arch container + audited library versions | | ||
| | `results/` | flat JSON artifacts (+ `plots/`, raw captures) | | ||
| | `tests/fixtures/` | captured nccl-tests output for offline parser checks | | ||
|
|
||
| ## Run | ||
|
|
||
| ### Via GitHub Actions (`.github/workflows/collectivex-experimental.yml`) | ||
|
|
||
| - **push** to the `collectivex` branch touching `experimental/CollectiveX/**` or the | ||
| workflow file → the **MI355X MoRI** dispatch/combine run (the "CollectiveX | ||
| Experimental" job; lands on a free `mi355x-amds` runner). | ||
| - **workflow_dispatch** → pick `sku` (gb200 / b200-dgxc / b200-multinode / | ||
| mi355x), `benchmark` (nccl / deepep / mori / all — `mori` is AMD-only; `nccl` | ||
| on MI355X runs rccl-tests), ops, | ||
| sizes, ngpus. Lands on that SKU's self-hosted runner and runs | ||
| `launch_${RUNNER_NAME%%_*}.sh`. | ||
|
|
||
| Each job renders a results table to the **GitHub Actions job summary** (via | ||
| `summarize.py --markdown` → `$GITHUB_STEP_SUMMARY`) and uploads the result JSONs | ||
| as an artifact. (The workflow only fires once the branch is pushed to GitHub.) | ||
|
|
||
| ### Directly on a cluster login node | ||
|
|
||
| ```bash | ||
| # benchmark is selected by CX_BENCH (default nccl; launch_mi355x-amds.sh defaults to mori) | ||
| bash experimental/CollectiveX/launchers/launch_gb200-nv.sh # GB200, NCCL primitives | ||
| CX_BENCH=deepep bash experimental/CollectiveX/launchers/launch_gb200-nv.sh # GB200, DeepEP (rebuild) | ||
| bash experimental/CollectiveX/launchers/launch_b200-dgxc.sh # B200 8× NVLink | ||
| bash experimental/CollectiveX/launchers/launch_b200-dgxc-slurm.sh # B200 2-node, cross-IB | ||
| bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh # MI355X 8× XGMI, MoRI EP (CX_BENCH=mori, default) | ||
| CX_BENCH=nccl bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh # MI355X primitives via rccl-tests | ||
| ``` | ||
|
|
||
| Knobs: `CX_BENCH` (nccl|deepep|mori|all), `CX_OPS`, `CX_MIN_BYTES`/`CX_MAX_BYTES`, | ||
| `CX_NGPUS`, `CX_TIME`, `CX_IMAGE`, `CX_SQUASH_DIR`, `CX_STAGE_DIR` (compute-visible | ||
| staging — needed on GB200/watchtower), `CX_DRYRUN=1` (print plan, allocate | ||
| nothing). Results land in `experimental/CollectiveX/results/`. | ||
|
|
||
| ### Offline (no GPU) — verify the parser/JSON pipeline | ||
|
|
||
| ```bash | ||
| python3 run_nccl.py --op all_reduce --parse-only tests/fixtures/all_reduce_perf_b200_8gpu.txt \ | ||
| --world-size 8 --nodes 1 --runner b200-dgxc --topology-class b200-nvlink-island --out /tmp/parsed.json | ||
| python3 env_capture.py # prints a (degraded, off-GPU) env record | ||
| python3 plot.py --results-dir results --out-dir results/plots # needs matplotlib | ||
| ``` | ||
|
|
||
| ## Container | ||
|
|
||
| One **multi-arch** image for all NVIDIA SKUs, imported by tag | ||
| `lmsysorg/sglang:v0.5.11-cu130` (amd64 + arm64; index digest `sha256:061fb71f…` | ||
| recorded for provenance). Imported by tag, not digest — enroot's anonymous | ||
| Docker Hub auth needs a tag, and a bare digest ref hangs in CI. See | ||
| `CONTAINERS.md` for versions, the DeepEP-rebuild note, and the bundled-DeepEP | ||
| DeepSeek-V4 fallback images. | ||
|
|
||
| ## How it runs (confirmed against the live clusters) | ||
|
|
||
| - Adapters mirror `runners/launch_*.sh`: `salloc` → enroot squash (import only if | ||
| missing) → `srun --container-image=… --container-mounts=<repo>:/ix` → in-container | ||
| `run_in_container.sh`. B200 partition `gpu-2`, GB200 partition `batch`, account | ||
| `benchmark`. | ||
| - **AMD MI355X** (`launch_mi355x-amds.sh`, MoRI / `CX_BENCH=mori`) diverges: partition | ||
| `compute`, no account, pyxis `--container-writable --container-remap-root`, and a | ||
| **node-local** squash (`/var/lib/squash`) imported via `srun` on the allocated node | ||
| (not the login node). Workspace is bind-mounted directly (no `CX_STAGE_DIR`). | ||
| - Login nodes have no `nvcc`, so `nccl-tests` is **built in-container** (cached in | ||
| `.nccl-tests/`, `CX_NCCL_HOME=/usr`). Single-node uses `-g N`; the 2-node | ||
| adapter builds `MPI=1` and launches one rank per GPU (`srun --mpi=pmix`). | ||
| - The sglang image installs editable under `/workspace`, so the repo is mounted at | ||
| **`/ix`**. GB200 compute nodes don't see the runner workspace → `CX_STAGE_DIR` | ||
| rsyncs the tree to Lustre first. | ||
| - Every result embeds an `env_capture` record and a `comparison_key`; topology | ||
| class is part of the key, so B200(IB/NVLink) and GB200(MNNVL) stay labelled | ||
| distinct, never silently overlaid. | ||
|
|
||
| ## Status & known risks | ||
|
|
||
| - **Spike done on real hardware** (both SKUs, 4 NCCL primitives, correctness-passed) | ||
| — on the DeepSeek-V4 images. Now standardizing on the **multi-arch** default; | ||
| validate it on first run and refresh `CONTAINERS.md` (expect CUDA 13 / NCCL 2.28 / torch 2.9). | ||
| - **DeepEP** is not bundled in the multi-arch image → `run_in_container.sh` builds | ||
| it via `rebuild-deepep` (CX_BENCH=deepep). Its Python API is version-sensitive; | ||
| `run_deepep.py` marks the dispatch/combine block `ADAPT HERE` — validate against | ||
| the built commit. B200 (x86_64) first; GB200 (aarch64) follows. | ||
| - **MoRI / MI355X** (`run_mori.py` + `launch_mi355x-amds.sh`) is **validated on | ||
| hardware** (8× MI355X: dispatch+combine numerically correct, ~85 µs round-trip). | ||
| It mirrors `ROCm/mori`'s example (config + `get_registered_combine_input_buffer` | ||
| zero-copy path, `expected = input × #unique-destination-ranks`). Three | ||
| ionic_rdma-fabric constraints are baked in (see `CONTAINERS.md`): a 2 GiB heap | ||
| (the NICs cap RDMA MRs at ~4 GiB), a bounded `max_num_inp_token_per_rank`, and a | ||
| hard-exit past MoRI's buggy shmem teardown. The ROCm image isn't digest-pinned yet. | ||
| - **Multi-node** (`launch_b200-dgxc-slurm.sh`) assumes `srun --mpi=pmix` + a | ||
| compute-visible checkout (`CX_STAGE_DIR`); else fall back to mpirun-in-container | ||
| or srt-slurm. CX_BENCH=nccl only for now. | ||
| - **B200 QOS:** account `benchmark` has only `gpu-2_qos` (the serving-sweep | ||
| partition); idle `gpu-1` needs a QOS grant. GB200 `batch` is open. | ||
|
|
||
| Once the multi-arch image is validated end-to-end, freeze the schema from the | ||
| artifacts (plan: "Freeze the contract"). |
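The README states that every result carries a `comparison_key` and that topology class is part of it, so B200 and GB200 curves stay labelled distinct. A minimal sketch of that guard follows. The field list inside the key is an assumption based on the `run_nccl.py` flags shown above (`--world-size`, `--nodes`, `--topology-class`). The real key may include more fields, and the GB200 topology-class string in the usage note is hypothetical.

```python
# Fields assumed to make up the key; the real schema is not shown in this PR page.
KEY_FIELDS = ("op", "world_size", "nodes", "topology_class")


def comparison_key(record: dict) -> str:
    """Build a key so that only like-for-like results share one."""
    return "|".join(str(record[field]) for field in KEY_FIELDS)


def group_for_plot(records: list[dict]) -> dict[str, list[dict]]:
    """Group results by key; each group becomes its own labelled series."""
    groups: dict[str, list[dict]] = {}
    for record in records:
        groups.setdefault(comparison_key(record), []).append(record)
    return groups
```

Given an `all_reduce` record with `topology_class` of `b200-nvlink-island` and another with a hypothetical `gb200-nvl72-mnnvl`, `group_for_plot` returns two groups, so a plotting step can overlay them only as separately labelled series.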


Workflow skips result failure gate
Medium Severity
Both jobs only run `summarize.py --markdown`, which is documented to always exit 0. The workflow never runs the plain `summarize.py` gate on the checkout's `results/` after launch, so a successful Launch step can stay green when the checkout has no valid JSON (e.g. staged runs where copy-back failed).
Additional Locations (1)
.github/workflows/collectivex-experimental.yml#L106-L109
Reviewed by Cursor Bugbot for commit f48daed.
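One way to close this gap is a small gate that returns a non-zero status when `results/` holds no parseable JSON. The sketch below is illustrative only: it does not reproduce the plain `summarize.py` gate, whose checks are not shown here. It assumes "valid" means a non-empty JSON object or array, and `count_valid_results` and `gate` are hypothetical names.

```python
import json
from pathlib import Path


def count_valid_results(results_dir: str) -> int:
    """Count *.json files that parse to a non-empty object or array."""
    valid = 0
    for path in sorted(Path(results_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable or malformed files do not count as results.
            continue
        if isinstance(data, (dict, list)) and data:
            valid += 1
    return valid


def gate(results_dir: str) -> int:
    """Return 0 when at least one valid result exists, else 1."""
    return 0 if count_valid_results(results_dir) > 0 else 1
```

A workflow step could pass `gate("experimental/CollectiveX/results")` to `sys.exit` after the Launch step, so a run whose copy-back failed turns red instead of green.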