SemiAnalysisAI · Oseltamivir · Jun 23, 2026
diff --git a/.github/workflows/collectivex-experimental.yml b/.github/workflows/collectivex-experimental.yml
diff --git a/experimental/CollectiveX/.gitignore b/experimental/CollectiveX/.gitignore
@@ -0,0 +1,22 @@
+# in-container nccl-tests build cache
+.nccl-tests/
+# python
+__pycache__/
+*.pyc
+# generated run artifacts: captured env embeds hostnames / GPU UUIDs / NIC GUIDs,
+# so keep results out of git (CI uploads them as workflow artifacts instead).
+# Sanitized headline numbers live in CONTAINERS.md.
+results/*.json
+results/plots/
+results/raw_*.txt
+results/raw_*.txt.stderr
+# superseded SSH-provenance result JSONs moved aside so plot_ep's recursive glob
+# won't double-load them; same hostname/UUID sensitivity as results/.
+_ssh_v4_archive/
+# running local-only reflection log (not a committed artifact)
+notes.md
+goal.md
+# superseded seeded-runtime GHA results (canonical counterpart exists); kept out of the plot glob
+_seeded_archive/
+# newest-good-per-config kept in results/; superseded runs moved here (out of the plot glob)
+_superseded/
diff --git a/experimental/CollectiveX/CONTAINERS.md b/experimental/CollectiveX/CONTAINERS.md
@@ -0,0 +1,75 @@
+# CollectiveX — container & library versions
+
+One **multi-arch, digest-pinned** container is used for all NVIDIA SKUs, so B200
+(x86_64) and GB200 (aarch64) share a single reference and the B200-vs-GB200
+comparison is truly same-image. Set in `runtime/common.sh` (`cx_default_image`).
+
+## Default container (all NVIDIA SKUs)
+
+- **Image:** import by tag **`lmsysorg/sglang:v0.5.11-cu130`** (multi-arch OCI index). Expected index digest, recorded for provenance/verification: `sha256:061fb71f838e82000a1768c159654d526c2f17ebe751c21e7fc48ca53c8ef975`.
+- **Multi-arch manifest list:** linux/amd64 + linux/arm64; `enroot import` on each host pulls the matching arch.
+- **Import by TAG, not digest.** enroot builds its anonymous Docker Hub token scope from the *tag* and succeeds (no creds needed — same as the serving launchers). A bare `repo@sha256:` ref makes enroot prompt for a password and **hang** in non-interactive CI; a combined `tag@sha256:` ref 400s. `cx_ensure_squash` therefore imports by tag with `</dev/null` (a missing token fails fast instead of hanging). First import is multi-GB (~minutes); subsequent runs reuse the staged squash.
+- **Why v0.5.11-cu130 (chosen):** it's the newest cu130 release **pre-staged on BOTH clusters** — B200 `/home/sa-shared/containers/` (amd64 squash) and GB200 `/mnt/lustre01/users-public/sa-shared/` (arm64 squash), same filename — so neither side imports at all. (Shared cu130 multi-arch squashes across both clusters: v0.5.8.post1, v0.5.9, v0.5.11 — v0.5.11 is newest.) `v0.5.12-cu130` is staged on B200 but **not** GB200: its 62 layers overflow enroot's overlay-based squash creation on the GB200 kernel (`enroot-mksquashovlfs: failed to mount overlay … Invalid argument`), so it can't be the shared default.
+- **DeepEP: NOT bundled** here → `run_in_container.sh` builds it via `rebuild-deepep` at job setup (CX_BENCH=deepep). The NCCL path needs no DeepEP.
+- **nccl-tests build:** in-container (login nodes have no `nvcc`), `CX_NCCL_HOME=/usr` (system `nccl.h` in `/usr/include`), `CX_CUDA_HOME=/usr/local/cuda`. cu130 lineage ⇒ CUDA 13; confirm exact NCCL/torch on first run and append below.
+
+## Audited reference (cu130 lineage)
+
+Live audit of the sibling DeepSeek-V4 image `lmsysorg/sglang:deepseek-v4-grace-blackwell` (aarch64) on GB200, 2026-06-23 — the multi-arch `v0.5.11-cu130` should match closely (same cu130 base); reconfirm on first run:
+
+| Component | Version |
+|---|---|
+| OS / arch | Ubuntu 24.04.3, aarch64 |
+| CUDA (`nvcc`) | 13.0 (V13.0.88) |
+| NCCL (system `/usr/include/nccl.h`) | 2.28.3; torch-bundled 2.27.7 |
+| PyTorch | 2.9.1+cu130 |
+| DeepEP | bundled in *that* image; **not** in the multi-arch default |
+| NVSHMEM | `libnvshmem_host.so.3` present |
+| OpenMPI / gcc / make | 4.1.6 / 13.3.0 / 4.3 |
+| GPU / driver | GB200, 580.126.20 |
+
+**Version caveat:** the nccl-tests binary links **system NCCL** (2.28.x), while torch/DeepEP use the **bundled** NCCL (2.27.x). Record both in provenance (env_capture does); don't compare an nccl-tests curve against a DeepEP run as if NCCL were identical.
+
+## Bundled-DeepEP reference images (not the default)
+
+If a bundled DeepEP is needed before `rebuild-deepep` is wired on the multi-arch image, these arch-specific images bundle it (pin by digest):
+
+- B200 (amd64): `lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b` (pre-staged on B200)
+- GB200 (arm64): `lmsysorg/sglang:deepseek-v4-grace-blackwell@sha256:4f583347d7ff08aef7e16dbb4985b2a7c147ff49a0c261d5e27b8f5f41719368` (staged on GB200 Lustre)
+
+Select via `CX_IMAGE=…@sha256:…` on the launch script.
+
+## AMD container (MI355X) — MoRI EP
+
+AMD CDNA4 cannot run the CUDA multi-arch image; MI355X uses a ROCm image that
+bundles **MoRI** (AMD's EP dispatch/combine library). Set in `cx_default_image`
+for `mi355x*` (also `mi350x*`/`mi325x*`/`mi300x*`).
+
+- **Image:** `rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0227-2` (single-arch ROCm 7.2.0 runtime; from the AMD master serving config). **Not digest-pinned yet** — record the digest here and pin once validated on the runner, like the NVIDIA image.
+- **MoRI:** bundled in-image (build tag `mori-0227`). `tests/ep_mori.py` follows the upstream `ROCm/mori` `tests`/`examples` dispatch+combine path; capture the exact MoRI commit (`MORI_COMMIT` env → provenance) on first run.
+- **Squash is NODE-LOCAL** (`/var/lib/squash`), not a shared FS, so `launch_mi355x-amds.sh` imports via `srun` on the allocated node (the NVIDIA adapters import on the login node onto shared FS). pyxis flags `--container-writable --container-remap-root` (matches the AMD serving launcher); workspace is bind-mounted directly (no `CX_STAGE_DIR`).
+- **Transport:** intra-node **XGMI** (8× MI355X). Two backends wired: `CX_BENCH=mori` (MoRI EP dispatch/combine) and `CX_BENCH=nccl` (collective primitives via **rccl-tests**, the ROCm nccl-tests fork — built in-container with `make` against `/opt/rocm`/`amdclang++`/`librccl`; same `<op>_perf` binaries + output format as nccl-tests, so `run_nccl.py` parses it unchanged).
+- **Validated on MI355X** (on-node via `salloc`+`srun`, nodes `mia1-p01-g10`/`g15`): `salloc` → enroot import (anonymous auth + tag, 24 layers → ~60 GB node-local squash) → torchrun → 8-rank Gloo + MoRI shmem → `EpDispatchCombineConfig`/dispatch/combine **numerically correct** (combine within tol, `max_rel ~2e-3`, ~85 µs round-trip at the decode shape). Three ionic_rdma-fabric constraints, all handled in `tests/ep_mori.py`:
+  - **RDMA MR size ceiling (~4 GiB).** MoRI registers the *entire* symmetric heap as one RDMA MR at init — even single-node (no disable-RDMA knob exists; only `MORI_DISABLE_P2P`, which forces the opposite). On these ionic NICs a 6 GiB MR fails (`RegisterRdmaMemoryRegion … errno 22 EINVAL`) while 2 GiB registers. Heap is held at **`MORI_SHMEM_HEAP_SIZE=2G`** (override `CX_MORI_HEAP_SIZE`). The reference test's hardcoded `6G` is exactly why it can't run as-is here.
+  - **Buffer sizing.** `max_num_inp_token_per_rank` is bounded (512 at the decode shape) so dispatch/combine buffers fit the 2 GiB heap. Much larger token counts would need a heap past the MR ceiling — out of reach on this fabric for now.
+  - **Teardown.** MoRI's shmem teardown asserts (`CheckStatusValid` → SIGABRT) when the op is destroyed after `shmem_finalize()`; `tests/ep_mori.py`'s `finalize()` hard-exits after writing results to avoid it.
+
+  Still TODO: capture the exact MoRI commit + a version table (ROCm/torch/RCCL) into provenance, and digest-pin the image.
+
+## Cluster access / QOS
+
+- **B200** (`slurm-login-slinky`): account `benchmark`, **only `gpu-2_qos`** → partition `gpu-2` only (shared with the serving sweep). `gpu-1`/`all` (idle) need `gpu-1_qos`/`all_qos`, not associated with this account.
+- **GB200** (`watchtower`): account `benchmark`, qos `normal`, partition `batch` (`AllowQos=ALL`); idle capacity available. Runner workspace is **not** compute-visible → set `CX_STAGE_DIR` to a Lustre path (the launcher rsyncs there).
+
+## First real results (Milestone-0 spike, on the DeepSeek-V4 images)
+
+nccl-tests (system NCCL 2.28.3), all correctness-passed, peak bus-bw:
+
+| op | B200 8× (NVLink island, x86_64), GB/s | GB200 4× (NVL72 MNNVL, aarch64), GB/s |
+|---|---|---|
+| all_reduce | 835 | 689 |
+| all_gather | 653 | 658 |
+| reduce_scatter | 667 | 661 |
+| alltoall | 638 | 666 |
+
+(B200 and GB200 results carry distinct `comparison_key`s because topology class is part of the key, so they stay labelled distinct rather than silently merged. Re-run on the multi-arch default to refresh both under one image.)
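Review note on the peak bus-bw table in `CONTAINERS.md`: a minimal sketch of how a peak figure could be pulled from nccl-tests text output, assuming the stock 13-column `<op>_perf` row layout (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and again for in-place). `parse_rows` and `peak_busbw` are hypothetical names, not the functions in `run_nccl.py`, and the sample rows are invented for illustration, not measured.

```python
from typing import List

# Illustrative rows only (NOT measured data), in the assumed nccl-tests layout.
SAMPLE = """\
# nThread 1 nGpus 8 minBytes 8 maxBytes 1024
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    20.00    0.00    0.00      0    19.00    0.00    0.00      0
        1024           256     float     sum      -1    21.00    0.05    0.09      0    20.50    0.05    0.10      0
# Out of bounds values : 0 OK
"""


def parse_rows(text: str) -> List[dict]:
    """Return one flat dict per data row; header and comment lines are skipped."""
    rows = []
    for line in text.splitlines():
        tok = line.split()
        # Data rows have exactly 13 columns and begin with an integer byte size.
        if len(tok) != 13 or not tok[0].isdigit():
            continue
        rows.append({
            "size_bytes": int(tok[0]),
            "oop_busbw": float(tok[7]),   # out-of-place bus bandwidth, GB/s
            "oop_wrong": tok[8],          # kept as text: may be "N/A" if checks are off
            "ip_busbw": float(tok[11]),   # in-place bus bandwidth, GB/s
            "ip_wrong": tok[12],
        })
    return rows


def peak_busbw(rows: List[dict]) -> float:
    """Peak bus-bw over rows whose correctness check reported zero wrong elements."""
    passed = [r for r in rows if r["oop_wrong"] == "0" and r["ip_wrong"] == "0"]
    return max((max(r["oop_busbw"], r["ip_busbw"]) for r in passed), default=0.0)


rows = parse_rows(SAMPLE)
print(len(rows), peak_busbw(rows))
```

Filtering on `#wrong` before taking the maximum keeps a "correctness-passed" claim tied to the same rows that produce the headline number.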
diff --git a/experimental/CollectiveX/README.md b/experimental/CollectiveX/README.md
@@ -0,0 +1,128 @@
+# CollectiveX
+
+Cross-vendor collective / EP-library benchmark (see `plan.md`). Per-SKU **launch
+adapters** (InferenceX-style `launch_<sku>.sh`) run **any benchmark** — selected
+by `CX_BENCH` — through a shared in-container runner, and a GitHub Actions
+workflow triggers runs on `push` (no merge to main needed). The Milestone-0 headline
+run has already completed on real hardware: B200 (8× NVLink island) and GB200 (4× NVL72 MNNVL).
+
+> Experimental: WIP, not an official InferenceMAX result. All logic stays under
+> `experimental/CollectiveX/`; the only file outside is the orchestration-only
+> workflow.
+
+## Files
+
+| File | Role |
+|---|---|
+| `env_capture.py` | Layer-0 environment + topology fingerprint → JSON (stdlib only) |
+| `run_nccl.py` | run stock `nccl-tests`, parse the text table, emit flat JSON (stdlib only) |
+| `tests/run_ep.py` | EP dispatch/combine entrypoint (torchrun): source-tokens-per-rank sweep, dispatch & combine timed **separately** |
+| `tests/ep_harness.py` | shared EP harness: token ladder, separated timing, correctness gate, doc emission (stdlib top) |
+| `tests/ep_deepep.py`, `tests/ep_mori.py` | per-backend adapters (DeepEP / MoRI) implementing the harness protocol |
+| `plot.py` | latency/bus-bw curves, B200-vs-GB200 overlay with a comparison guard (matplotlib) |
+| `runtime/common.sh` | shared helpers: image resolve, enroot squash, staging, nccl-tests build |
+| `runtime/run_in_container.sh` | generic in-container dispatcher — runs `CX_BENCH` (nccl/deepep/mori/all) over `CX_PHASE` |
+| `launchers/launch_<sku>.sh` | per-SKU adapters: `launch_b200-dgxc.sh` (8× NVLink), `launch_b200-dgxc-slurm.sh` (2-node IB), `launch_gb200-nv.sh` (NVL72 MNNVL), `launch_mi355x-amds.sh` (8× XGMI, AMD MoRI + rccl) |
+| `CONTAINERS.md` | the pinned multi-arch container + audited library versions |
+| `results/` | flat JSON artifacts (+ `plots/`, raw captures) |
+| `tests/fixtures/` | captured nccl-tests output for offline parser checks |
+
+## Run
+
+### Via GitHub Actions (`.github/workflows/collectivex-experimental.yml`)
+
+- **push** to `experimental/CollectiveX/**` → the **MI355X MoRI** EP dispatch/combine
+  sweep, **one job per phase** (decode + prefill) via a matrix (lands on free
+  `mi355x-amds` runners).
+- **workflow_dispatch** → pick `sku` (gb200 / b200-dgxc / b200-multinode /
+  mi355x), `benchmark` (nccl / deepep / mori / all — `mori` is AMD-only; `nccl`
+  on MI355X runs rccl-tests), `phase` (decode / prefill / **both** → a job each),
+  `tokens_ladder`, `dispatch_dtype`, ops, sizes, ngpus. Lands on that SKU's
+  self-hosted runner and runs `launch_${RUNNER_NAME%%_*}.sh`. For EP results
+  across all SKUs, dispatch once per `sku` with `phase=both`.
+
+Each job renders a results table to the **GitHub Actions job summary** (via
+`summarize.py --markdown` → `$GITHUB_STEP_SUMMARY`) and uploads the result JSONs
+as an artifact. (The workflow only fires once the branch is pushed to GitHub.)
+
+### Directly on a cluster login node
+
+```bash
+# benchmark is selected by CX_BENCH (default nccl; launch_mi355x-amds.sh defaults to mori)
+bash experimental/CollectiveX/launchers/launch_gb200-nv.sh                 # GB200, NCCL primitives
+CX_BENCH=deepep bash experimental/CollectiveX/launchers/launch_gb200-nv.sh # GB200, DeepEP (rebuild)
+bash experimental/CollectiveX/launchers/launch_b200-dgxc.sh               # B200 8× NVLink
+bash experimental/CollectiveX/launchers/launch_b200-dgxc-slurm.sh         # B200 2-node, cross-IB
+bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh                # MI355X 8× XGMI, MoRI EP (CX_BENCH=mori, default)
+CX_BENCH=nccl bash experimental/CollectiveX/launchers/launch_mi355x-amds.sh   # MI355X primitives via rccl-tests
+```
+
+Knobs: `CX_BENCH` (nccl|deepep|mori|all), `CX_OPS`, `CX_MIN_BYTES`/`CX_MAX_BYTES`,
+`CX_NGPUS`, `CX_TIME`, `CX_IMAGE`, `CX_SQUASH_DIR`, `CX_STAGE_DIR` (compute-visible
+staging — needed on GB200/watchtower), `CX_DRYRUN=1` (print plan, allocate
+nothing). EP (deepep/mori) adds `CX_PHASE` (decode|prefill|both), `CX_TOKENS_LADDER`
+(e.g. `"1 2 4 8 16 32 64 128"`), `CX_HIDDEN`/`CX_TOPK`/`CX_EXPERTS`,
+`CX_DISPATCH_DTYPE`, `CX_NUM_EP_GROUPS`. Results land in `experimental/CollectiveX/results/`.
+
+### Offline (no GPU) — verify the parser/JSON pipeline
+
+```bash
+python3 run_nccl.py --op all_reduce --parse-only tests/fixtures/all_reduce_perf_b200_8gpu.txt \
+  --world-size 8 --nodes 1 --runner b200-dgxc --topology-class b200-nvlink-island --out /tmp/parsed.json
+python3 env_capture.py            # prints a (degraded, off-GPU) env record
+python3 plot.py --results-dir results --out-dir results/plots   # needs matplotlib
+```
+
+## Container
+
+One **multi-arch** image for all NVIDIA SKUs, imported by tag
+`lmsysorg/sglang:v0.5.11-cu130` (amd64 + arm64; index digest `sha256:061fb71f…`
+recorded for provenance). Imported by tag, not digest — enroot's anonymous
+Docker Hub auth needs a tag, and a bare digest ref hangs in CI. See
+`CONTAINERS.md` for versions, the DeepEP-rebuild note, and the bundled-DeepEP
+DeepSeek-V4 fallback images.
+
+## How it runs (confirmed against the live clusters)
+
+- Adapters mirror `runners/launch_*.sh`: `salloc` → enroot squash (import only if
+  missing) → `srun --container-image=… --container-mounts=<repo>:/ix` → in-container
+  `run_in_container.sh`. B200 partition `gpu-2`, GB200 partition `batch`, account
+  `benchmark`.
+- **AMD MI355X** (`launch_mi355x-amds.sh`, MoRI / `CX_BENCH=mori`) diverges: partition
+  `compute`, no account, pyxis `--container-writable --container-remap-root`, and a
+  **node-local** squash (`/var/lib/squash`) imported via `srun` on the allocated node
+  (not the login node). Workspace is bind-mounted directly (no `CX_STAGE_DIR`).
+- Login nodes have no `nvcc`, so `nccl-tests` is **built in-container** (cached in
+  `.nccl-tests/`, `CX_NCCL_HOME=/usr`). Single-node uses `-g N`; the 2-node
+  adapter builds `MPI=1` and launches one rank per GPU (`srun --mpi=pmix`).
+- The sglang image installs editable under `/workspace`, so the repo is mounted at
+  **`/ix`**. GB200 compute nodes don't see the runner workspace → `CX_STAGE_DIR`
+  rsyncs the tree to Lustre first.
+- Every result embeds an `env_capture` record and a `comparison_key`; topology
+  class is part of the key, so B200 (IB/NVLink) and GB200 (MNNVL) stay labelled
+  distinct, never silently overlaid.
+
+## Status & known risks
+
+- **Spike done on real hardware** (both SKUs, 4 NCCL primitives, correctness-passed)
+  — on the DeepSeek-V4 images. Now standardizing on the **multi-arch** default;
+  validate it on first run and refresh `CONTAINERS.md` (expect CUDA 13 / NCCL 2.28 / torch 2.9).
+- **DeepEP** is not bundled in the multi-arch image → `run_in_container.sh` builds
+  it via `rebuild-deepep` (CX_BENCH=deepep). Its Python API is version-sensitive;
+  `tests/ep_deepep.py` follows the documented normal-mode API — validate against
+  the built commit. B200 (x86_64) first; GB200 (aarch64) follows.
+- **MoRI / MI355X** (`tests/ep_mori.py` + `launch_mi355x-amds.sh`) is **validated on
+  hardware** (8× MI355X: dispatch+combine numerically correct, ~85 µs round-trip).
+  It mirrors `ROCm/mori`'s example (config + `get_registered_combine_input_buffer`
+  zero-copy path, `expected = input × #unique-destination-ranks`). Three
+  ionic_rdma-fabric constraints are baked in (see `CONTAINERS.md`): a 2 GiB heap
+  (the NICs cap RDMA MRs at ~4 GiB), a bounded `max_num_inp_token_per_rank`, and a
+  hard-exit past MoRI's buggy shmem teardown. The ROCm image isn't digest-pinned yet.
+- **Multi-node** (`launch_b200-dgxc-slurm.sh`) assumes `srun --mpi=pmix` + a
+  compute-visible checkout (`CX_STAGE_DIR`); else fall back to mpirun-in-container
+  or srt-slurm. CX_BENCH=nccl only for now.
+- **B200 QOS:** account `benchmark` has only `gpu-2_qos` (the serving-sweep
+  partition); idle `gpu-1` needs a QOS grant. GB200 `batch` is open.
+
+Once the multi-arch image is validated end-to-end, freeze the schema from the
+artifacts (plan: "Freeze the contract").
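Review note on the `comparison_key` guard described in the README: a minimal sketch of how results might be grouped so that runs from different topology classes are never overlaid under one label. The field names (`op`, `world_size`, `topology_class`) mirror the `run_nccl.py` flags shown above but are assumptions about the JSON schema, `comparison_key` and `group_by_key` are hypothetical helpers rather than the code in `plot.py`, and `gb200-nvl72-mnnvl` is a made-up class name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, str]


def comparison_key(doc: dict) -> Key:
    # Topology class is part of the key, so identical ops on different
    # fabrics can never collapse into a single series.
    return (doc["op"], doc["world_size"], doc["topology_class"])


def group_by_key(docs: Iterable[dict]) -> Dict[Key, List[dict]]:
    """Bucket result documents by comparison key."""
    groups: Dict[Key, List[dict]] = defaultdict(list)
    for doc in docs:
        groups[comparison_key(doc)].append(doc)
    return dict(groups)


def overlay_labels(docs: Iterable[dict]) -> List[str]:
    """One legend label per key, always naming the topology class."""
    return sorted(f"{op} / {ws} ranks / {topo}" for op, ws, topo in group_by_key(docs))


# Hypothetical documents carrying only the key fields (no measurements).
docs = [
    {"op": "all_reduce", "world_size": 8, "topology_class": "b200-nvlink-island"},
    {"op": "all_reduce", "world_size": 4, "topology_class": "gb200-nvl72-mnnvl"},
    {"op": "all_reduce", "world_size": 8, "topology_class": "b200-nvlink-island"},
]
groups = group_by_key(docs)
labels = overlay_labels(docs)
print(labels)
```

Two B200 runs share a bucket here while the GB200 run gets its own, so an overlay plot would draw two labelled series instead of merging them.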