[PyTorch] Expert Parallelism: PyTorch wrapper + autograd ops with symm-mem zero-copy by phu0ngng · Pull Request #3035 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-22T02:54:20Z

Summary

Second PR in the TE Expert Parallelism (EP) series. Adds the PyTorch binding on top of the common C API (#3034): exposes EP dispatch/combine as torch.library custom ops with autograd, and plumbs NCCL symmetric-memory windows through for the zero-copy path.

Payload tensors allocated via te.pytorch.symm_mem_alloc take the one-sided zero-copy path; anything else silently falls back to staged-copy, so the API is drop-in compatible with any allocator.

Implementation

Public Python API (`transformer_engine/pytorch/ep.py`)

from transformer_engine.pytorch.ep import (
    EpHandle, ep_bootstrap, ep_finalize,
    ep_prepare, ep_dispatch, ep_combine,
    symm_mem_alloc,
)

ep_bootstrap / ep_finalize: one-time per-process init + teardown (also auto-registered via atexit). Rank 0 mints an ncclUniqueId and broadcasts it on ep_group; the backend then opens its own ncclComm_t. Requires ep_group.size() >= 4.
symm_mem_alloc(shape, dtype, ep_group): allocate a per-rank tensor backed by NCCL symmetric memory, already rendezvoused on ep_group.
EpHandle: per-layer routing state; reuse across steps.
ep_prepare / ep_dispatch / ep_combine: per-step ops; both dispatch and combine are autograd-aware and registered as torch.library.custom_op, so they compose with torch.compile fullgraph capture and CUDA graphs.

C++ bindings (`transformer_engine/pytorch/csrc/extensions/ep.cpp`)

POD-only pybind boundary (py::bytes for ncclUniqueId, primitives for config): no c10d ABI on the boundary.
maybe_make_window() looks up each payload tensor's NCCLSymmetricMemory window and returns an NVTECommWindow to the backend; non-symm-mem tensors get {nullptr, 0} and the backend picks staged-copy automatically.
Warn-once hint when high-traffic payloads (tokens, recv_tokens, expert_out, grad) aren't symm-mem-backed. Routing-weight tensors stay silent (nice-to-have, not required). Suppress with NVTE_EP_SILENCE_NONSYMM_WARN=1.

Build

build_tools/pytorch.py propagates -DNVTE_WITH_NCCL_EP to the PyTorch extension. When NCCL EP is off, the extension still loads — nvte_ep_* come from the common stub and throw on first call.

Testing

tests/pytorch/distributed/run_ep.py: 17-test unittest suite covering prepare correctness, dispatch/combine identity (uniform + non-uniform), 3D input, VJPs, top_k=1 all-to-one, alignment edge cases, CUDA graph capture (eager + zero-copy), torch.compile fullgraph, bf16 autocast (eager + autograd), zero-copy autograd combine, symm-mem fallback, gradient checkpointing.
Launcher: tests/pytorch/distributed/run_test_ep.sh. Verified on 8×H200: Ran 17 tests in 19.8s … OK on every rank.
Example: examples/pytorch/ep/ep_moe.py — minimal end-to-end MoE forward+backward driver.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


greptile-apps · 2026-05-22T03:01:34Z

Greptile Summary

This PR adds the PyTorch-level binding for Expert Parallelism (EP): a public Python API (ep.py), pybind11 C++ extensions (ep.cpp), a backend singleton (ep_backend.cpp), and a comprehensive distributed test suite. Payload tensors backed by NCCL symmetric memory take a zero-copy one-sided path; all others fall back to staged copy transparently.

transformer_engine/pytorch/ep.py — ep_bootstrap/ep_finalize lifecycle, EpHandle per-layer state, ep_prepare/ep_dispatch/ep_combine as torch.library.custom_op with autograd, and symm_mem_alloc for symmetric-memory buffer allocation.
transformer_engine/pytorch/csrc/extensions/ep.cpp — pybind11 bindings that translate PyTorch tensors to NVTE descriptors, look up NCCL symmetric-memory windows via maybe_make_window, and forward to the C API.
transformer_engine/common/ep/ep_backend.cpp — EPBackend singleton wrapping ncclEpGroup_t, with per-layer handle caching and forward/backward dispatch/combine ops.

Confidence Score: 3/5

Two correctness bugs in the Python autograd layer and C++ bootstrap path need fixes before this lands in a production training run.

The _EpDispatch.backward fallback allocates a zero-gradient tensor shaped (max_tokens_per_rank, H) when the correct shape is (recv_capacity_per_rank, H); any training path where the upstream gradient of recv_tokens is None will hit either a runtime NVTE_CHECK or silent wrong-sized communication. Separately, ep_initialize in the C++ extension creates an NCCL communicator and then calls nvte_ep_initialize; if the latter throws, the communicator is never stored and can never be destroyed.

transformer_engine/pytorch/ep.py (_EpDispatch.backward zero-grad shape) and transformer_engine/pytorch/csrc/extensions/ep.cpp (ep_initialize NCCL comm lifetime) require the most attention before merge.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/ep.py	New public Python EP API: bootstrap, EpHandle, ep_prepare/dispatch/combine with autograd. Contains a shape bug in _EpDispatch.backward (wrong zero-gradient fallback for g_recv_tokens) and a non-reentrant _zero_copy_scope context manager.
transformer_engine/pytorch/csrc/extensions/ep.cpp	New C++ PyTorch extension binding; implements ep_initialize, per-step ops, and symm-mem window lookup. Has a NCCL communicator resource leak on ep_initialize's error path when nvte_ep_initialize throws.
transformer_engine/common/ep/ep_backend.cpp	EPBackend singleton: NCCL EP group creation, per-op dispatch/combine, and handle cache. Missing validate_config check for max_recv_tokens_per_rank > 0; otherwise well-structured with proper mutex usage and RAII handle guard.
tests/pytorch/distributed/run_ep.py	Comprehensive 17-test multi-process suite covering correctness, autograd VJPs, CUDA graph capture, torch.compile, autocast, and symm-mem paths.

Reviews (1): Last reviewed commit: "ep: PyTorch wrapper, autograd ops, symm-..."

greptile-apps · 2026-05-22T03:01:38Z

+        if g_recv_tokens is None:
+            g_recv_tokens = torch.zeros_like(grad_tokens)


Wrong fallback shape for g_recv_tokens zero gradient

grad_tokens has shape (max_tokens_per_rank, H), but g_recv_tokens should have shape (recv_capacity_per_rank, H). These two dimensions are intentionally different (recv capacity is typically much larger). When g_recv_tokens is None — which happens whenever recv_tokens is detached or the loss graph does not flow back through it — the zero-gradient tensor passed to ep_dispatch_bwd has the wrong first dimension. The C++ layer then computes recv_pr = grad.numel() / H = max_tokens_per_rank, causing the NVTE_CHECK g_recv_topk_weights.numel() == recv_pr to fail (since recv_topk_weights has size recv_capacity_per_rank), or silently producing wrong results if the two values happen to coincide.

The correct shape for the zero-gradient fallback can be derived from ctx.recv_topk_weights (already stashed on ctx), whose length equals recv_capacity_per_rank: use torch.zeros(recv_w_tmpl.shape[0], grad_tokens.shape[-1], dtype=grad_tokens.dtype, device=grad_tokens.device) instead of torch.zeros_like(grad_tokens).
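
The mismatch can be sketched with shape arithmetic alone. The sizes below (128, 1024, 64) are made up for illustration, and both helper names are hypothetical. The sketch only mirrors the recv_pr = grad.numel() / H computation described above, with recv_topk_weights assumed to have recv_capacity_per_rank as its leading dimension.

```python
def fallback_grad_shape(recv_topk_weights_shape, grad_tokens_shape):
    """Shape for the zero gradient of recv_tokens: (recv_capacity_per_rank, H)."""
    recv_capacity = recv_topk_weights_shape[0]  # leading dim of recv_topk_weights
    hidden = grad_tokens_shape[-1]              # H, shared by both tensors
    return (recv_capacity, hidden)


def rows_seen_by_backend(grad_shape):
    """Mirror of recv_pr = grad.numel() / H for a 2-D (rows, H) gradient."""
    rows, hidden = grad_shape
    return (rows * hidden) // hidden


# Illustrative sizes only; recv capacity is typically the larger of the two.
max_tokens_per_rank, recv_capacity_per_rank, H = 128, 1024, 64
grad_tokens_shape = (max_tokens_per_rank, H)
recv_topk_weights_shape = (recv_capacity_per_rank,)

buggy = grad_tokens_shape  # the shape zeros_like(grad_tokens) would produce
fixed = fallback_grad_shape(recv_topk_weights_shape, grad_tokens_shape)

print(rows_seen_by_backend(buggy), rows_seen_by_backend(fixed))
```

In the backward pass the fixed shape would feed torch.zeros(fixed, dtype=grad_tokens.dtype, device=grad_tokens.device), so the row count derived by the backend matches the length of recv_topk_weights.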

greptile-apps · 2026-05-22T03:01:38Z

+      /*max_recv_tokens_per_rank=*/static_cast<int>(max_recv_tokens_per_rank),
+      /*hidden_dim=*/static_cast<int>(hidden_dim),
+      /*max_num_sms=*/static_cast<int>(max_num_sms),
+      /*allow_handle_mem_reloc=*/allow_handle_mem_reloc ? 1 : 0,
+  };
+
+  // Copy bytes into a typed ncclUniqueId so the ABI is unambiguous when
+  // passing it by value to ncclCommInitRank.
+  ncclUniqueId uid{};
+  std::memcpy(uid.internal, unique_id_bytes.data(), kEpUniqueIdSize);
+  ncclComm_t ep_comm = nullptr;


NCCL communicator leaked on nvte_ep_initialize failure

ep_comm is created by ncclCommInitRank and then passed to nvte_ep_initialize. If nvte_ep_initialize throws (e.g., the NCCL version or compute-capability check inside EPBackend::initialize fails), control jumps out before g_ep_nccl_comm = ep_comm is reached. The communicator is never stored and can never be destroyed, leaking the NCCL resource. Additionally, validate_config is invoked inside nvte_ep_initialize rather than before ncclCommInitRank, so validation failures surface only after the collective initialization has already succeeded on all participating ranks.

The simplest fix is to assign g_ep_nccl_comm = ep_comm immediately after the two NVTE_CHECKs and before calling nvte_ep_initialize, then rely on ep_finalize to destroy it on any subsequent error. Alternatively, wrap ep_comm in a RAII guard that calls ncclCommDestroy on scope exit unless it is explicitly released.
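
The release-on-success guard can be illustrated in Python with contextlib.ExitStack, whose pop_all() plays the role of an explicit release. This is an analogy rather than the C++ fix: create_comm, destroy_comm, and the state dict are stand-ins for ncclCommInitRank, ncclCommDestroy, and g_ep_nccl_comm, and none of them are TE or NCCL APIs.

```python
import contextlib

destroyed = []            # records which stand-in communicators were torn down
state = {"comm": None}    # stand-in for the module-level communicator slot


def create_comm():
    """Stand-in for communicator creation."""
    return object()


def destroy_comm(comm):
    """Stand-in for communicator destruction."""
    destroyed.append(comm)


def initialize(backend_init):
    """Create a communicator, run backend init, keep the comm only on success."""
    with contextlib.ExitStack() as guard:
        comm = create_comm()
        guard.callback(destroy_comm, comm)  # runs on any exit from this block...
        backend_init(comm)                  # may raise, like the backend init call
        guard.pop_all()                     # ...unless ownership is released here
        state["comm"] = comm
```

If backend_init raises, the callback destroys the communicator before the exception propagates; on success the stored communicator is left for the finalize path to destroy.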

greptile-apps · 2026-05-22T03:01:39Z

+@contextlib.contextmanager
+def _zero_copy_scope(enabled: bool):
+    """Toggles whether per-step ops apply the symm-mem NCCL window annotation."""
+    if enabled:
+        yield
+        return
+    tex.ep_set_zero_copy(False)
+    try:
+        yield
+    finally:
+        tex.ep_set_zero_copy(True)


_zero_copy_scope does not save/restore the previous flag value

When enabled=False, the manager unconditionally sets g_zero_copy_enabled=False on entry and g_zero_copy_enabled=True on exit. If two callers both use zero_copy=False concurrently (e.g., pipeline-parallel microbatches dispatched from separate Python threads) or if the context is nested, the inner scope's finally block prematurely re-enables zero-copy while the outer scope is still active. The outer scope's finally then sets True again, but between the inner finally and the outer finally the C++ layer sees True unexpectedly.

The fix is to capture the previous value before writing and restore it unconditionally: save old = tex.ep_get_zero_copy() (adding a corresponding getter), then tex.ep_set_zero_copy(old) in the finally block. At minimum, document the single-caller-at-a-time assumption prominently so pipeline-parallel users know to serialize.
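
A minimal sketch of the save/restore variant follows. FakeTex is a stand-in for the extension module so the snippet runs on its own, and ep_get_zero_copy is the getter proposed above, which does not exist in the PR as submitted.

```python
import contextlib


class FakeTex:
    """Stand-in for the extension module; ep_get_zero_copy is the proposed getter."""

    def __init__(self):
        self._enabled = True

    def ep_get_zero_copy(self):
        return self._enabled

    def ep_set_zero_copy(self, value):
        self._enabled = bool(value)


tex = FakeTex()


@contextlib.contextmanager
def zero_copy_scope(enabled: bool):
    """Disable zero-copy for the scope, then restore whatever value was set before."""
    if enabled:
        yield  # same as the original: the enabled path leaves the flag untouched
        return
    previous = tex.ep_get_zero_copy()
    tex.ep_set_zero_copy(False)
    try:
        yield
    finally:
        tex.ep_set_zero_copy(previous)  # restore, rather than force True
```

With this change a nested zero_copy_scope(False) leaves the flag disabled until the outermost scope exits. Saving and restoring does not by itself make a process-global flag safe across threads, so concurrent callers would still need to serialize.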

greptile-apps · 2026-05-22T03:01:40Z

+      return ncclFloat32;
+    case kNVTEFloat16:
+      return ncclFloat16;
+    case kNVTEBFloat16:
+      return ncclBfloat16;
+    case kNVTEInt32:
+      return ncclInt32;
+    case kNVTEInt64:
+      return ncclInt64;
+    case kNVTEByte:
+      return ncclUint8;
+    case kNVTEFloat8E4M3:
+      return ncclFloat8e4m3;
+    case kNVTEFloat8E5M2:
+      return ncclFloat8e5m2;
+    default:
+      NVTE_ERROR("Unsupported NVTEDType for NCCL EP conversion: ", static_cast<int>(dtype));
+  }
+  return ncclFloat32;  // unreachable
+}
+


Missing validation for max_recv_tokens_per_rank > 0

validate_config checks max_tokens_per_rank, hidden_dim, num_experts, and ep_size, but omits max_recv_tokens_per_rank. The inline comment on the cfg.max_recv_tokens_per_rank assignment even notes "Must be > 0; NCCL EP errors out on 0", yet there is no NVTE_CHECK to surface a clear message. Passing 0 from Python would silently reach ncclEpCreateGroup and produce an opaque NCCL error. Add NVTE_CHECK(config.max_recv_tokens_per_rank > 0, ...) alongside the other positive-value checks.
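
The same check could also be mirrored on the Python side before the config crosses the binding, so a bad value fails with a readable message. The sketch below is an assumption about how such a pre-check might look, not the C++ validate_config: the dict-based config and the function are hypothetical, and only the field names are taken from the comment above.

```python
# Fields the review lists as needing positive values, including the missing one.
POSITIVE_FIELDS = (
    "max_tokens_per_rank",
    "max_recv_tokens_per_rank",
    "hidden_dim",
    "num_experts",
    "ep_size",
)


def validate_config(config: dict) -> None:
    """Raise ValueError naming the first field that is not a positive integer."""
    for name in POSITIVE_FIELDS:
        value = config.get(name)
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
```

A config with max_recv_tokens_per_rank=0 then fails before any communicator is created, which also addresses the ordering concern raised for ep_initialize.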

phu0ngng added 2 commits May 22, 2026 02:03

Expert Parallelism: common C API + NCCL EP backend

44a8a49

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ep: PyTorch wrapper, autograd ops, symm-mem zero-copy bindings + dist…

3452634

…ributed tests/example Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng requested review from ksivaman and ptrendx as code owners May 22, 2026 02:54

greptile-apps Bot reviewed May 22, 2026

phu0ngng marked this pull request as draft May 22, 2026 03:03

phu0ngng mentioned this pull request May 22, 2026

[PyTorch] Dispatch/Combine for BF16 tensor + Zero-Copy #3024

Open

phu0ngng wants to merge 2 commits into NVIDIA:main from phu0ngng:phuong/ep-3-pytorch-on-commwindow
