Fix std::terminate in ibverbs destructors on systems without RDMA hardware #500

Merged: meta-codesync[bot] merged 1 commit into main from d4l3k/fix_ibv on Mar 20, 2026
@d4l3k d4l3k commented Mar 17, 2026

Summary

  • On CI runners (ubuntu-latest with -DUSE_IBVERBS=ON), rdma-core userspace providers make ibv_get_device_list() return devices even without real RDMA hardware. The ibverbs Device constructor succeeds, but ibv_create_qp() fails in the Pair constructor, which throws EnforceNotMet. During stack unwinding, ~Pair() and ~Device() call GLOO_ENFORCE, which throws from destructors that are implicitly noexcept (C++11 and later), so the runtime calls std::terminate().
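
The failure mode above follows from a general C++ rule: destructors are implicitly noexcept, so an exception that escapes one calls std::terminate(). The sketch below shows one common way to avoid that, which is to report a teardown failure instead of throwing. `FakeVerbsHandle`, `fakeDestroy`, `Owner`, and `constructThenFail` are hypothetical names invented for illustration; this is not the code from this PR and it does not call libibverbs.

```cpp
#include <cstdio>
#include <stdexcept>

// Hypothetical stand-in for a verbs resource whose teardown can fail.
struct FakeVerbsHandle {
  int destroyRc;  // value the simulated destroy call returns
};

// Simulated teardown call; non-zero means failure.
int fakeDestroy(const FakeVerbsHandle& h) {
  return h.destroyRc;
}

class Owner {
 public:
  Owner(FakeVerbsHandle h, int* failures) : h_(h), failures_(failures) {}

  // Destructors are implicitly noexcept since C++11: throwing here would
  // call std::terminate(). Report the failure and continue instead.
  ~Owner() {
    const int rc = fakeDestroy(h_);
    if (rc != 0) {
      std::fprintf(stderr, "teardown failed: rc=%d\n", rc);
      ++*failures_;
    }
  }

 private:
  FakeVerbsHandle h_;
  int* failures_;
};

// Simulates a later construction step failing after Owner already exists.
// Returns true if the exception was caught, i.e. unwinding completed.
bool constructThenFail(int* failures) {
  bool caught = false;
  try {
    Owner owner(FakeVerbsHandle{-1}, failures);
    throw std::runtime_error("simulated queue pair creation failure");
  } catch (const std::runtime_error&) {
    caught = true;
  }
  return caught;
}
```

Because `~Owner()` only logs and counts the failure, the original exception reaches the `catch` block. If the destructor threw instead, the process would abort during unwinding.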

Fixes the crash at AllgatherRing/AllgatherTest.VarNumPointer/360 seen in every CI run: https://github.com/pytorch/gloo/actions/runs/22975489184/job/66702898253

See also #497 which identified the same crash but addressed it differently.

Test plan

  • Full test suite passes locally (3058 passed, 1061 skipped, 0 failed)
  • CI passes on ubuntu-latest with -DUSE_IBVERBS=ON -DUSE_LIBUV=ON -DUSE_TCP_OPENSSL_LINK=ON

@meta-cla meta-cla Bot added the CLA Signed label Mar 17, 2026
@d4l3k d4l3k force-pushed the d4l3k/fix_ibv branch 2 times, most recently from 7de27ba to 7b0254f Compare March 19, 2026 01:22
@d4l3k d4l3k marked this pull request as ready for review March 19, 2026 17:53
meta-codesync Bot commented Mar 19, 2026

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D97331908.

…vice

On CI runners without real RDMA hardware, rdma-core software providers
let ibv_open_device/ibv_alloc_pd/ibv_create_comp_channel succeed but
ibv_create_qp fails with EINVAL. Creating a gloo Device starts a
background thread; after fork() in TransportMultiProcTest the thread
handle is invalid, causing SIGSEGV (exit 139) in Device::~Device.

Fix: probe ibverbs capability using raw APIs (through ibv_create_qp)
in the test's createDevice() before constructing a gloo Device. If QP
creation fails, mark IBVERBS as unavailable and return nullptr.
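
The probe-before-construct idea can be sketched without the real library. The snippet below assumes a hypothetical `ProbeStep` abstraction (an acquire callback plus a release callback) standing in for the raw ibverbs calls. It runs the steps in order, stops at the first failure, and releases whatever was acquired in reverse order. The actual createDevice() in the test may be structured differently.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical description of one setup stage (for example: open device,
// allocate protection domain, create queue pair). acquire returns true on
// success; release undoes a successful acquire.
struct ProbeStep {
  std::string name;
  std::function<bool()> acquire;
  std::function<void()> release;
};

// Runs every step in order. On the first failure, records which step
// failed. In all cases, releases the acquired steps in reverse order.
// Returns true only if every step succeeded.
bool probeTransport(const std::vector<ProbeStep>& steps, std::string* failed) {
  std::size_t acquired = 0;
  bool ok = true;
  for (const auto& step : steps) {
    if (!step.acquire()) {
      if (failed != nullptr) {
        *failed = step.name;
      }
      ok = false;
      break;
    }
    ++acquired;
  }
  while (acquired > 0) {
    steps[--acquired].release();
  }
  return ok;
}
```

Under this sketch, a device factory would mark the transport as unavailable and return nullptr when `probeTransport` returns false, so no background thread is ever started for a device that cannot create a queue pair.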

Also moves GTEST_SKIP() out of worker threads to avoid concurrent
calls racing on GTest internals (exit 134), adds a SIGSEGV backtrace
handler for test debugging, and builds with RelWithDebInfo.
@dolpm dolpm self-requested a review March 20, 2026 00:05
@meta-codesync meta-codesync Bot merged commit 6f4c667 into main Mar 20, 2026
19 of 20 checks passed
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 11, 2026
…87077)

## Summary

Bumps the `gloo` submodule from `3135b0b` to `74cc005` (15 commits, 2026-01-09 -> 2026-06-10; +615/-144 across 29 files).

The headline change is a **security fix**: a `size_t` overflow in the TCP transport's recv bounds check that allowed a relative write-what-where. `roffset` and `length` are read directly off the wire, and the old check `roffset + length <= size_` could wrap around 2^64, letting an out-of-bounds pair pass and yield an arbitrary write at `ptr_ + roffset`. The fix validates each term independently before forming the pointer.
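
To make the wraparound concrete, here is a minimal sketch contrasting the two forms of the check. The function names are hypothetical and this is not gloo's actual code; it only illustrates the arithmetic described above, using unsigned `std::size_t` values whose overflow wraps by definition.

```cpp
#include <cstddef>
#include <limits>

// The wrapping form: roffset + length can overflow size_t and wrap to a
// small value that passes the comparison.
bool uncheckedInBounds(std::size_t roffset, std::size_t length,
                       std::size_t size) {
  return roffset + length <= size;
}

// Overflow-safe form: validate each term on its own. Once roffset <= size
// is known, size - roffset cannot underflow.
bool checkedInBounds(std::size_t roffset, std::size_t length,
                     std::size_t size) {
  return roffset <= size && length <= size - roffset;
}
```

With `roffset` set to the maximum `size_t` value and `length` set to 2, the sum wraps to 1, so the first form accepts the pair for a 16-byte buffer while the second rejects it.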

The rest is a mix of portability/bug fixes (ibverbs destructor crash on no-RDMA hosts, TCP address-family mismatch, ROCm `[[nodiscard]]` and `-Wnontrivial-memcall` warnings), toolchain/CI work (move to C++20, native CMake HIP support for ROCm, new CUDA/ROCm/arm64 CI runners, clang-format 21), and named gloo threads for observability.

Note: the SHM allreduce optimization ([#458](pytorch/gloo#458)) was added then reverted ([#490](pytorch/gloo#490)) within this range, so it is **not** present in the final tree.

## Included commits (newest -> oldest)

| Commit | Author | Date | Description |
|--------|--------|------|-------------|
| `74cc005` | Tristan Rice | 2026-06-10 | **gloo/tcp: fix size_t overflow in recv bounds check (relative write-what-where)** ([#509](pytorch/gloo#509)) |
| `70dc360` | Tristan Rice | 2026-04-20 | Fix address family mismatch by reusing bound socket ([#503](pytorch/gloo#503)) |
| `6f4c667` | Tristan Rice | 2026-03-19 | Fix `std::terminate` in ibverbs destructors on systems without RDMA hardware ([#500](pytorch/gloo#500)) |
| `2ba34a6` | Karl Gyllstrom | 2026-03-11 | Fix `[[nodiscard]]` `cudaDeviceEnablePeerAccess` warning |
| `845824e` | Richard Barnes | 2026-03-11 | Move gloo onto C++20 |
| `bcd1672` | Gavin Zhao | 2026-02-12 | ROCm: Migrate to native CMake HIP support ([#478](pytorch/gloo#478)) |
| `9322e67` | Tristan Rice | 2026-02-09 | Revert "Intra-node shared memory (SHM) optimizations" ([#490](pytorch/gloo#490)) |
| `f834c75` | Tristan Rice | 2026-02-06 | gloo: improve error message on connection closed |
| `8789be7` | Tristan Rice | 2026-02-05 | ci: add CUDA and rocm builds |
| `d8d0f77` | Nathan Brown | 2026-02-05 | ci: add arm64 runner for github actions ([#487](pytorch/gloo#487)) |
| `8d0b9a4` | Lydia Kim | 2026-01-14 | Fix `-Wnontrivial-memcall` |
| `980c925` | Lucian Adrian Grijincu | 2026-01-13 | Fix `-Wnontrivial-memcall` error in AllreduceLocal |
| `b9cac96` | Tristan Rice | 2026-01-13 | Add `setThreadName` helper and name all gloo threads |
| `7ec708d` | Tristan Rice | 2026-01-12 | ci: bump linter to clang-format 21.1.2 |
| `5994546` | gaopengff | 2026-01-09 | Intra-node shared memory (SHM) optimizations for CPU primitives ([#458](pytorch/gloo#458)) -- *later reverted by `9322e67`* |

## Test Plan

CI. Submodule bump only; the gloo changes carry their own unit tests (TCP pair bounds-check coverage added in #509).

Pull Request resolved: #187077
Approved by: https://github.com/dolpm, https://github.com/kapilsh, https://github.com/Regina8023, https://github.com/malfet
jemitche1 pushed a commit to jemitche1/pytorch that referenced this pull request Jun 13, 2026
…torch#187077)
