[Common] Comm+GEMM overlap API updated to support cuBlasMp backend (incl. framework API) #2443
Open
denera
wants to merge
81
commits into
NVIDIA:main
from
denera:common/tp-overlap-cublasmp
Changes from 23 commits
Commits
177b2ec denera: cuBlasMp backend logic added to TE/common with connections to framewo…
7d46b0b denera: added use_cublasmp flags to CollectiveGemm bootstrapping to avoid UB …
6d4a141 denera: added cuBLASMp backend option to JAX unit tests for CollectiveGEMM
35d0f19 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
dd8eaf3 denera: added pytorch unit tests for comm+GEMM overlap with cuBLASMp backend
d79bf21 denera: greptile fixes
ee517d3 denera: linting
51b64fb denera: function argument call order fixes
9be771c pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
4cec043 denera: JAX collective GEMM modified to inherit cublasmp usage from global bo…
898cf30 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
422a654 pre-commit-ci[bot]: typos and style fixes
6e42235 denera: documentation and build fixes
626dd1d pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
d44cfc4 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
e341a8b denera: fixed default SM margin option and JAX cgemm test runner cleanup
6942d20 denera: cublasmp running with TE/PyTorch
bef5c7e denera: cublasmp working with TE/JAX
81d6383 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
6c6cc4d denera: cublasmp working with TE/JAX (JAX container is missing cuBLASMp insta…
9ed2adf pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
ca913b9 denera: added arch suffixes for CUBLASMP lib lookup in CMAKE
c55626d denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
f863ba8 denera: fixed TE/JAX collective gemm test runner
5a8c7ae denera: TE/JAX CGEMM test runner script fix
5b9df92 denera: fixed the cublasmp option in the pytest runners
775df95 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
441472a denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
3df11fc denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
58f1e68 denera: cuBLASMp passing tests with TE/PyTorch
f05f849 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
f95f229 denera: updated cuBLASMp C++ tests to also test local chunks instead of globa…
f84e8f9 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
c67c183 denera: cuBLASmp C++ tests switched to NCCL comms for reference results, now …
e9c79a3 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
1b8fb1e denera: [JAX] Fix cuBLASMp collective GEMM tests and document XLA command buf…
caa741e denera: changed cuBLASMp call sizing to use flat first/last dims
9cca8a9 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
ff4187c denera: cuBLASMp backend passing tests with both PyT and JAX, CUDA graph comp…
218257f denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
c2af15b denera: fixed JAX cublasmp bootstrapping TP rank argument, fixed PyTorch Comm…
f75d98e pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
c208d83 denera: C++ tests restored to working order, TE/PyTorch layer failures diagno…
5bd8ff9 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
509c12e denera: fixed linting issues, corrected Hopper/Blackwell FP8 GEMM layout hand…
a51bd3b denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
b0bbe6d denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
04c52ca pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
4ea7334 denera: updated TE/JAX CollectiveGemm tests to use normal distributions with …
6d6c7b2 denera: added cuda stream sync to CollectiveGemm XLA custom op prepare stage …
f4740ea denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
ee80f69 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
85292f3 denera: fixed TE/PyTorch cublasmp backend flag, warmup workspace now cleaned …
0cdbd6a denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
80b0a71 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
cc25997 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
0b4ecba denera: handling ncclComm_t via shared pointers to make sure they don't get p…
f753353 denera: dummy warmup cuBLASMp GEMM buffers are locally allocated and destroye…
cf54c14 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
deb0890 denera: fixed bulk-overlap fallback for cuBLASMP backend, all comm+GEMM overl…
f959f34 denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
89f5d8d pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
67521f7 denera: test skip condition when TE is NOT built with cuBLASMp
6d5ca20 denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
8bcdaff denera: enforcing initialize_ub() call before module construction
e90498d pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
a77c914 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
8d254af denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
a5c9117 denera: fixed UB initializer flag
cd3ad03 denera: cublasmp backend support in comm+GEMM overlap extended to fusible ops
5b413cd pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
4e54318 denera: added non-multicast algo fallback for cuBLASMp
d29895b denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
15c44e4 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
cf07453 pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
32402f7 denera: disabling fused attention in TE/PyTorch comm+GEMM layers test to avoi…
0421e2b denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
24169fb pre-commit-ci[bot]: [pre-commit.ci] auto fixes from pre-commit.com hooks
0206400 denera: removed requirement for PyTorch PG to be on the NCCL backend when boo…
1f2710b denera: Merge branch 'common/tp-overlap-cublasmp' of github.com:denera/Transf…
94e90b9 denera: Merge remote-tracking branch 'upstream/main' into common/tp-overlap-c…
Is this NCCL communicator literally stored by cuBLASMp or is it copied somehow? I'm worried about the case where the process group gets destroyed and the NCCL communicator that was used in it also gets destroyed from underneath us.
Actually, we don't use the NCCL comm from PyTorch anymore. We create our own NCCL comm that spans the same devices as the PyTorch PG with the same TP ranks, so there's no risk of the NCCL comm disappearing if the PyT PG gets destroyed.
The comment you quoted here is left over from an earlier iteration when we did try to extract the NCCL comm from the PyT PG. I will remove it to avoid confusion, and I believe we might also no longer need to link directly to `libtorch_cuda.so` because of this. I will double-check that and remove it as well.
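
For anyone following the lifetime question, here is a toy sketch of the difference between borrowing the process group's communicator and creating a separate one over the same ranks. Everything in it (`MockComm`, `MockProcessGroup`, `borrow_comm`, `create_own_comm`) is a hypothetical stand-in written for illustration; none of it is the NCCL, PyTorch, or cuBLASMp API, and it does not reflect how the PR actually bootstraps the communicator.

```python
from dataclasses import dataclass, field


@dataclass
class MockComm:
    """Hypothetical stand-in for a communicator handle (not an NCCL type)."""
    ranks: tuple
    alive: bool = True

    def destroy(self):
        self.alive = False


@dataclass
class MockProcessGroup:
    """Hypothetical stand-in for a framework process group that owns its comm."""
    ranks: tuple
    comm: MockComm = field(init=False)

    def __post_init__(self):
        self.comm = MockComm(self.ranks)

    def destroy(self):
        # Tearing down the group also tears down the communicator it owns.
        self.comm.destroy()


def borrow_comm(pg):
    # Earlier-iteration style: reuse the handle that lives inside the group.
    return pg.comm


def create_own_comm(pg):
    # Current style as described above: a separate communicator over the same
    # ranks, whose lifetime is independent of the process group.
    return MockComm(tuple(pg.ranks))


pg = MockProcessGroup(ranks=(0, 1, 2, 3))
borrowed = borrow_comm(pg)
owned = create_own_comm(pg)

pg.destroy()
print(borrowed.alive, owned.alive)  # False True
```

In this model the borrowed handle is invalidated when the group is destroyed, while the separately created one is not, which is the property the reply relies on.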