submodules: bump gloo to 74cc005ae13f69c11d8a41e50b42025b6730e796 #187077

Closed
d4l3k wants to merge 1 commit into main from d4l3k/bump_gloo

d4l3k (Member) commented Jun 11, 2026

Summary

Bumps the gloo submodule from 3135b0b to 74cc005 (15 commits, 2026-01-09 -> 2026-06-10; +615/-144 across 29 files).

The headline change is a security fix: a size_t overflow in the TCP transport's recv bounds check that allowed a relative write-what-where. roffset and length are read directly off the wire, and the old check roffset + length <= size_ could wrap around 2^64, letting an out-of-bounds pair pass and yield an arbitrary write at ptr_ + roffset. The fix validates each term independently before forming the pointer.
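
The wraparound is easy to reproduce in isolation. The following is a minimal sketch, not gloo's actual code: the function names `wrappingInBounds` and `checkedInBounds` and the constants are hypothetical, and the safe variant is just one way to validate each term independently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the pre-fix check: the sum is computed in
// size_t, so it wraps modulo 2^N (N = width of size_t) for a huge roffset.
constexpr bool wrappingInBounds(std::size_t roffset, std::size_t length,
                                std::size_t size) {
  return roffset + length <= size;
}

// Overflow-safe variant (an assumption about the shape of the fix): bound
// roffset first, then compare length against the space that remains.
// size - roffset cannot underflow because roffset <= size was checked first.
constexpr bool checkedInBounds(std::size_t roffset, std::size_t length,
                               std::size_t size) {
  return roffset <= size && length <= size - roffset;
}

// Illustrative wire values: roffset sits just below SIZE_MAX, so
// roffset + length wraps around to 8, which is well under the buffer size.
constexpr std::size_t kSize = 1024;
constexpr std::size_t kBadOffset = SIZE_MAX - 7;
constexpr std::size_t kBadLength = 16;

static_assert(wrappingInBounds(kBadOffset, kBadLength, kSize),
              "the wrapped sum passes the naive check");
static_assert(!checkedInBounds(kBadOffset, kBadLength, kSize),
              "the per-term check rejects the same pair");
static_assert(checkedInBounds(1000, 24, kSize),
              "a write ending exactly at the buffer end is still accepted");
```

Because unsigned arithmetic is defined to wrap, the naive check compiles and runs without any diagnostic, which is why comparing `length` against `size - roffset` is preferable to forming the sum at all.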

The rest is a mix of portability/bug fixes (ibverbs destructor crash on no-RDMA hosts, TCP address-family mismatch, ROCm [[nodiscard]] and -Wnontrivial-memcall warnings), toolchain/CI work (move to C++20, native CMake HIP support for ROCm, new CUDA/ROCm/arm64 CI runners, clang-format 21), and named gloo threads for observability.

Note: the SHM allreduce optimization (#458) was added then reverted (#490) within this range, so it is not present in the final tree.

Included commits (newest -> oldest)

| Commit | Author | Date | Description |
|--------|--------|------|-------------|
| `74cc005` | Tristan Rice | 2026-06-10 | gloo/tcp: fix size_t overflow in recv bounds check (relative write-what-where) (#509) |
| `70dc360` | Tristan Rice | 2026-04-20 | Fix address family mismatch by reusing bound socket (#503) |
| `6f4c667` | Tristan Rice | 2026-03-19 | Fix `std::terminate` in ibverbs destructors on systems without RDMA hardware (#500) |
| `2ba34a6` | Karl Gyllstrom | 2026-03-11 | Fix `[[nodiscard]]` `cudaDeviceEnablePeerAccess` warning |
| `845824e` | Richard Barnes | 2026-03-11 | Move gloo onto C++20 |
| `bcd1672` | Gavin Zhao | 2026-02-12 | ROCm: Migrate to native CMake HIP support (#478) |
| `9322e67` | Tristan Rice | 2026-02-09 | Revert "Intra-node shared memory (SHM) optimizations" (#490) |
| `f834c75` | Tristan Rice | 2026-02-06 | gloo: improve error message on connection closed |
| `8789be7` | Tristan Rice | 2026-02-05 | ci: add CUDA and rocm builds |
| `d8d0f77` | Nathan Brown | 2026-02-05 | ci: add arm64 runner for github actions (#487) |
| `8d0b9a4` | Lydia Kim | 2026-01-14 | Fix `-Wnontrivial-memcall` |
| `980c925` | Lucian Adrian Grijincu | 2026-01-13 | Fix `-Wnontrivial-memcall` error in AllreduceLocal |
| `b9cac96` | Tristan Rice | 2026-01-13 | Add `setThreadName` helper and name all gloo threads |
| `7ec708d` | Tristan Rice | 2026-01-12 | ci: bump linter to clang-format 21.1.2 |
| `5994546` | gaopengff | 2026-01-09 | Intra-node shared memory (SHM) optimizations for CPU primitives (#458), later reverted by `9322e67` |

Test Plan

CI. Submodule bump only; the gloo changes carry their own unit tests (TCP pair bounds-check coverage added in #509).

pytorch-bot Bot commented Jun 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/187077

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 28 Pending

As of commit c81903d with merge base 99fd1c8 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the topic: not user facing topic category label Jun 11, 2026
@d4l3k d4l3k requested review from Regina8023, atalman and kapilsh June 11, 2026 19:53
d4l3k (Member, Author) commented Jun 11, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 11, 2026
pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

malfet (Contributor) commented Jun 11, 2026

@pytorchbot merge -f "All distributed tests seems green"

pytorchmergebot (Collaborator) commented

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@d4l3k d4l3k deleted the d4l3k/bump_gloo branch June 12, 2026 01:12
jemitche1 pushed a commit to jemitche1/pytorch that referenced this pull request Jun 13, 2026
…torch#187077)

Pull Request resolved: pytorch#187077
Approved by: https://github.com/dolpm, https://github.com/kapilsh, https://github.com/Regina8023, https://github.com/malfet
Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, topic: not user facing (topic category)


6 participants