submodules: bump gloo to 74cc005ae13f69c11d8a41e50b42025b6730e796 #187077
d4l3k wants to merge 1 commit into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/187077
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 28 Pending
As of commit c81903d with merge base 99fd1c8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot merge

Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchbot merge -f "All distributed tests seems green"

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

## Summary

Bumps the `gloo` submodule from `3135b0b` to `74cc005` (15 commits, 2026-01-09 -> 2026-06-10; +615/-144 across 29 files).

The headline change is a **security fix**: a `size_t` overflow in the TCP transport's recv bounds check that allowed a relative write-what-where. `roffset` and `length` are read directly off the wire, and the old check `roffset + length <= size_` could wrap around 2^64, letting an out-of-bounds pair pass and yield an arbitrary write at `ptr_ + roffset`. The fix validates each term independently before forming the pointer.

The rest is a mix of:

- portability/bug fixes (ibverbs destructor crash on no-RDMA hosts, TCP address-family mismatch, ROCm `[[nodiscard]]` and `-Wnontrivial-memcall` warnings),
- toolchain/CI work (move to C++20, native CMake HIP support for ROCm, new CUDA/ROCm/arm64 CI runners, clang-format 21), and
- named gloo threads for observability.

Note: the SHM allreduce optimization (pytorch/gloo#458) was added then reverted (pytorch/gloo#490) within this range, so it is **not** present in the final tree.
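
To make the overflow described in the summary concrete, here is a minimal sketch of the two styles of bounds check. The function names `naiveInBounds` and `checkedInBounds` are hypothetical and are not taken from gloo. The checked variant is only one plausible way to validate each term independently, and may differ from the actual code in pytorch/gloo#509.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the old style of check. Unsigned addition wraps
// modulo 2^N, so a huge roffset plus a small length can produce a small sum
// that passes the comparison.
constexpr bool naiveInBounds(std::size_t roffset, std::size_t length,
                             std::size_t size) {
  return roffset + length <= size;
}

// Assumed shape of an overflow-safe check: validate roffset on its own, then
// compare length against the remaining space. The subtraction cannot wrap
// because roffset <= size has already been established.
constexpr bool checkedInBounds(std::size_t roffset, std::size_t length,
                               std::size_t size) {
  if (roffset > size) {
    return false;
  }
  return length <= size - roffset;
}

// Wire-controlled values that wrap: (SIZE_MAX - 7) + 16 == 8 after wraparound.
static_assert(naiveInBounds(SIZE_MAX - 7, 16, 1024),
              "naive check accepts an out-of-bounds pair");
static_assert(!checkedInBounds(SIZE_MAX - 7, 16, 1024),
              "checked variant rejects the same pair");
```

With a 1024-byte buffer, `roffset = SIZE_MAX - 7` and `length = 16` sum to 8 after wraparound, so the naive check accepts the pair even though a write at `ptr_ + roffset` would land far outside the buffer. The checked variant rejects it at the first comparison and still accepts in-range pairs such as `(0, 1024)`.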

## Included commits (newest -> oldest)

| Commit | Author | Date | Description |
|--------|--------|------|-------------|
| `74cc005` | Tristan Rice | 2026-06-10 | **gloo/tcp: fix size_t overflow in recv bounds check (relative write-what-where)** (pytorch/gloo#509) |
| `70dc360` | Tristan Rice | 2026-04-20 | Fix address family mismatch by reusing bound socket (pytorch/gloo#503) |
| `6f4c667` | Tristan Rice | 2026-03-19 | Fix `std::terminate` in ibverbs destructors on systems without RDMA hardware (pytorch/gloo#500) |
| `2ba34a6` | Karl Gyllstrom | 2026-03-11 | Fix `[[nodiscard]]` `cudaDeviceEnablePeerAccess` warning |
| `845824e` | Richard Barnes | 2026-03-11 | Move gloo onto C++20 |
| `bcd1672` | Gavin Zhao | 2026-02-12 | ROCm: Migrate to native CMake HIP support (pytorch/gloo#478) |
| `9322e67` | Tristan Rice | 2026-02-09 | Revert "Intra-node shared memory (SHM) optimizations" (pytorch/gloo#490) |
| `f834c75` | Tristan Rice | 2026-02-06 | gloo: improve error message on connection closed |
| `8789be7` | Tristan Rice | 2026-02-05 | ci: add CUDA and rocm builds |
| `d8d0f77` | Nathan Brown | 2026-02-05 | ci: add arm64 runner for github actions (pytorch/gloo#487) |
| `8d0b9a4` | Lydia Kim | 2026-01-14 | Fix `-Wnontrivial-memcall` |
| `980c925` | Lucian Adrian Grijincu | 2026-01-13 | Fix `-Wnontrivial-memcall` error in AllreduceLocal |
| `b9cac96` | Tristan Rice | 2026-01-13 | Add `setThreadName` helper and name all gloo threads |
| `7ec708d` | Tristan Rice | 2026-01-12 | ci: bump linter to clang-format 21.1.2 |
| `5994546` | gaopengff | 2026-01-09 | Intra-node shared memory (SHM) optimizations for CPU primitives (pytorch/gloo#458) -- *later reverted by `9322e67`* |

## Test Plan

CI. Submodule bump only; the gloo changes carry their own unit tests (TCP pair bounds-check coverage added in pytorch/gloo#509).

Pull Request resolved: pytorch#187077
Approved by: https://github.com/dolpm, https://github.com/kapilsh, https://github.com/Regina8023, https://github.com/malfet