[RFC] Implement gloo abort for graceful shutdown #388

Open
Aidyn-A wants to merge 7 commits into pytorch:main from Aidyn-A:gloo_abort

Aidyn-A commented Sep 25, 2024

Contributor

pytorch/pytorch#130345 requested a ProcessGroupGloo.shutdown() method for faster recovery from distributed rank failures. This PR is the first step toward a proper shutdown. The second step would be implementing gloo::abort() within PyTorch's ProcessGroupGloo.

c-p-i-o commented Nov 1, 2024

Contributor

Sorry for the delay. Are you able to add a test for this change?

c-p-i-o commented Nov 1, 2024

Contributor

Ignore the CI breakage for now. I'm trying to revive the CI for this repository.

Aidyn-A commented Nov 5, 2024

Contributor Author

> Sorry for the delay. Are you able to add a test for this change?

Sure, I will add a test and resolve the merge conflicts soon.

Aidyn-A commented Nov 15, 2024

Contributor Author

Hey @c-p-i-o, how does the PR look to you? Do you think it is ready to merge? Please let me know if you have any comments.

Aidyn-A requested a review from c-p-i-o November 15, 2024 14:37

@facebook-github-bot

@c-p-i-o has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Aidyn-A commented Dec 5, 2024

Contributor Author

Hey @c-p-i-o can you please let me know what tests are failing?
Also what kind of linter is used? Would just clang-format be enough to resolve lint errors?

c-p-i-o commented Jan 7, 2025

Contributor

> Hey @c-p-i-o can you please let me know what tests are failing? Also what kind of linter is used? Would just clang-format be enough to resolve lint errors?

Sorry for the delay here.

  1. CLANGFORMAT errors.
  2. Some internal CI failed on this diff. Re-running internal CI and will report back.

c-p-i-o commented Jan 8, 2025

Contributor

Grr. Some failures are internal to Meta when they try to build this change.
[Image: Screenshot 2025-01-07 at 4 49 54 PM]
Let me see if I can address these on the internal side.

Aidyn-A commented Jan 17, 2025

Contributor Author

I do not see how this message relates to the current PR. How do I reproduce it locally?

> Grr. Some failures are internal to Meta when they try to build this change. [Image: Screenshot 2025-01-07 at 4 49 54 PM] Let me see if I can address these on the internal side.

Aidyn-A requested a review from c-p-i-o February 17, 2025 13:14

pramodk commented Jul 21, 2025

I ended up on this PR while reviewing some NVIDIA framework docs.

I am wondering what is blocking this PR, and whether the previously mentioned issues still exist and are blockers here.
