
Fix a race condition in contiguous k-grouped GEMM where in-flight tensormaps are updated in-place#343

Open
dfyz wants to merge 1 commit into deepseek-ai:main from dfyz:main

Conversation

@dfyz

dfyz commented May 29, 2026


The weight gradient kernel for SM90 modifies the A/B tensormaps stored in GMEM when the scheduler decides to switch groups, but it doesn't make sure that previous TMA loads have already finished reading the tensormaps at that point. A TMA load reading from a tensormap being modified concurrently corrupts the output for obvious reasons.

A reliable way to reproduce this corruption (at least on an H200) is to run the test_k_grouped_gemm_contiguous test a couple of times with the shape added in this PR. Eventually the corruption will be large enough to make the test fail:

[...]
Testing k-grouped contiguous GEMM:
 > Perf (num_groups= 4, m= 4096, n= 7168, k=34304, gran_k=128): 1966 us | 1025 TFLOPS |  681 GB/s
 > Perf (num_groups= 4, m= 7168, n= 2048, k=32896, gran_k=128):  970 us |  996 TFLOPS |  807 GB/s
Traceback (most recent call last):
  File "/persistent/DeepGEMM/tests/test_fp8_fp4.py", line 222, in <module>
    test_k_grouped_gemm_contiguous()
  File "/persistent/DeepGEMM/tests/test_fp8_fp4.py", line 194, in test_k_grouped_gemm_contiguous
    assert diff < 0.001, f'{m=}, {n=}, {k=}, {ks=}, {diff:.5f}'
           ^^^^^^^^^^^^
AssertionError: m=768, n=2048, k=1408, ks=[128, 128, 256, 256, 128, 256, 256, 256], 0.00103

Even when the test doesn't fail, the corruption still happens. You can see that by printing the torch.norm() of the result here and seeing that it differs between the test runs:

[...]
tensor(124499.3047, device='cuda:0')
tensor(122834.4531, device='cuda:0')
 > Perf (num_groups= 8, m=  768, n= 2048, k= 1664, gran_k=128):   44 us |  118 TFLOPS | 2372 GB/s
[...]
tensor(124495.9609, device='cuda:0')
tensor(122833.1406, device='cuda:0')
 > Perf (num_groups= 8, m=  768, n= 2048, k= 1664, gran_k=128):   45 us |  117 TFLOPS | 2356 GB/s
[...]
tensor(124501.3516, device='cuda:0')
tensor(122833.2109, device='cuda:0')
 > Perf (num_groups= 8, m=  768, n= 2048, k= 1664, gran_k=128):   45 us |  117 TFLOPS | 2367 GB/s
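One way to make this check mechanical, as a minimal sketch: collect the printed `tensor(...)` lines from each run and report which positions are not identical across runs. The helper names (`parse_norms`, `differing_positions`) are hypothetical and not part of DeepGEMM; the values are the ones from the three runs above. The sketch assumes that a race-free kernel with fixed inputs would print the same norm on every run.

```python
import re

# Matches the numeric part of a printed scalar tensor, e.g.
#   tensor(124499.3047, device='cuda:0')
TENSOR_RE = re.compile(r"tensor\(([0-9.]+)")


def parse_norms(run_log):
    """Return the norms printed in one test run, in order of appearance."""
    return [float(value) for value in TENSOR_RE.findall(run_log)]


def differing_positions(run_logs):
    """Return indices of printed norms that are not identical across all runs."""
    parsed = [parse_norms(log) for log in run_logs]
    return [
        index
        for index, values in enumerate(zip(*parsed))
        if len(set(values)) > 1
    ]


# The three runs quoted above (only the tensor lines are kept).
RUNS = [
    "tensor(124499.3047, device='cuda:0')\ntensor(122834.4531, device='cuda:0')",
    "tensor(124495.9609, device='cuda:0')\ntensor(122833.1406, device='cuda:0')",
    "tensor(124501.3516, device='cuda:0')\ntensor(122833.2109, device='cuda:0')",
]

if __name__ == "__main__":
    print(differing_positions(RUNS))
```

Both printed norms differ between the runs here, which is what you would not expect from a kernel that reads a stable tensormap.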

The fix for this is conceptually straightforward: section 9.7.9.26.5.2 from the PTX docs says that the bulk async-group based completion mechanism can be used "for the completion of reading of the tensormap object" in (otherwise mbarrier-based) TMA loads. So running cp.async.bulk.commit_group + cp.async.bulk.wait_group.read 0 before modifying GMEM ensures all in-flight TMA loads have finished reading their tensormaps, which fixes the race condition.

The only catch is that writing to a TMA descriptor from a single thread generates multiple STG.E.128 instructions, and ptxas decides to move some of them before the DEPBAR.LE SB0, 0x0 generated by cp.async.bulk.wait_group.read 0, which re-introduces the race condition. CUTLASS does a __syncwarp() before updating tensormaps in GMEM, and inserting a warp sync indeed prevents ptxas from reordering instructions, but this seems more like a coincidence, since we only have one thread from the warp anyway and I don't see a good way to explain this in terms of the PTX memory model. In other words, this is an ugly hack, so if you have better solutions, I will gladly implement them. :)

@RayWang96

RayWang96 commented Jun 2, 2026

Collaborator

I believe that, at the memory-model level, warpsync can guarantee correctness. However, calling it inside if (elect_one_sync) is undefined behavior. Instead, you should use cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_thread); which can prevent stg from being reordered before wait_group.read.

@xay5421 can help review and correct me if I missed something.

@dfyz

dfyz commented Jun 2, 2026

Author

> However, calling it inside if (elect_one_sync) is undefined behavior.

I'm totally fine with replacing __syncwarp() with whatever works best, but can you please clarify why it's undefined behavior? Before settling on __syncwarp(), I read the constraints from the CUDA programming guide and decided that my usage was okay:

  • if (... cute::elect_one_sync()) will elect a single thread from a warp, and only that thread will call __syncwarp() (with the full mask)
  • the programming guide says that "Each non-calling thread must have its corresponding bit set to zero in the mask", but it explicitly says that "Exited threads are ignored"
  • all unelected threads from the same warp will skip that if (... cute::elect_one_sync()) and exit
  • so I think that all of the "constraints [that] must be met for correct execution" are satisfied

The guide also says that the behavior is "invalid or undefined" only if "A non-exited thread specified in the mask fails to either eventually exit or call the intrinsic at the same program point with the same mask value", which is not the case here. It even gives an example of valid usage:

    if (threadIdx.x == 0)
        return; // exit
    // CORRECT, all non-exited threads participate in the call
    __all_sync(0xFFFFFFFF, pred);

I think that my usage is conceptually this:

    if (threadIdx.x != ELECTED_THREAD_ID)
        return;
    __syncwarp(0xFFFFFFFF);

Which seems equivalent to the example from the docs.

@RayWang96

Collaborator

Thanks for pointing to the CUDA programming guide. I agree my previous wording was too strong: this is not unconditionally undefined behavior.

The precise point is: this pattern is only valid if all non-elected lanes have truly exited before the elected lane reaches the full-mask __syncwarp().

    if (... && cute::elect_one_sync()) {
      __syncwarp(0xffffffff);
    }

cute::elect_one_sync() itself does not make the other lanes exit. It only makes one lane enter the branch; the other lanes skip the branch body and follow whatever control flow the compiler generates after the if.

So this would be valid if the final control flow is really equivalent to:

    if (threadIdx.x != ELECTED_THREAD_ID)
      return;

    __syncwarp(0xffffffff);

But the source pattern above does not guarantee that. For example, even if there is no explicit code after the if, a C++ destructor / cleanup for an object declared before the if can still be emitted at scope exit. More generally, the compiler could lower the skipped path to a common exit block:

    ELECT P0
    @!P0 BRA common_exit

    WARPSYNC 0xffffffff  // elected lane only

    common_exit:
      // cleanup / epilogue / setmaxnreg-related code
      EXIT

In that case, the non-elected lanes are still non-exited when the elected lane executes WARPSYNC 0xffffffff, and they never execute the same WARPSYNC at the same program point. That violates the warp-sync constraints.
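The constraints under discussion can be written down as a toy model, which may make the two readings easier to compare. This is only a sketch of the rules as quoted in this thread, plus the programming guide's requirement that a calling lane has its own bit set in the mask (stated here as an assumption). The function `warp_sync_is_valid` is hypothetical, and "calling lanes" means lanes that reach the same intrinsic at the same program point with the same mask.

```python
WARP_SIZE = 32
FULL_MASK = 0xFFFFFFFF


def warp_sync_is_valid(mask, calling_lanes, exited_lanes):
    """Toy check of the mask constraints for a single warp-sync call site.

    mask          -- 32-bit lane mask passed to the intrinsic
    calling_lanes -- lanes that execute the intrinsic at this program point
    exited_lanes  -- lanes that have already exited (these are ignored)
    """
    for lane in range(WARP_SIZE):
        named = bool((mask >> lane) & 1)
        calls = lane in calling_lanes
        if calls and not named:
            # Assumed rule: a calling lane must have its own bit set.
            return False
        if named and not calls and lane not in exited_lanes:
            # A named, non-exited lane never reaches the call.
            return False
    return True


ELECTED = 0  # hypothetical elected lane
OTHERS = set(range(WARP_SIZE)) - {ELECTED}

if __name__ == "__main__":
    # Full mask, every other lane has truly exited.
    print(warp_sync_is_valid(FULL_MASK, {ELECTED}, OTHERS))
    # Full mask, other lanes are still alive in a common exit block.
    print(warp_sync_is_valid(FULL_MASK, {ELECTED}, set()))
    # Mask naming only the calling lane; the other lanes do not matter.
    print(warp_sync_is_valid(1 << ELECTED, {ELECTED}, set()))
```

In this model the full-mask call is valid only when the other lanes have exited, while a mask that names just the calling lane is valid regardless of what the other lanes do.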

@DanBlanaru


Just to confirm, the next required action here is for Ivan @dfyz to update the MR based on the suggestion above?
Right, Ray? @RayWang96

@dfyz

dfyz commented Jun 4, 2026

Author

> the next required action here is for Ivan @dfyz to update the MR based on the suggestion above?

Yeah, I think so. I'm still not 100% sure I understand the logic about __syncwarp() being invalid in Ray's response, but I have no problem with changing the code to use a thread fence. I'll do it today.

@dfyz

dfyz commented Jun 4, 2026

Author

@RayWang96 I've tried using cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_thread); in a branch, but it didn't prevent reordering. Judging from the CCCL code, it can't possibly work, since thread-level fences literally don't do anything at all.

In general, I'm not sure if any form of cuda::atomic_thread_fence is the right tool for this job, since we are dealing with tensormap proxy loads (in terms of the PTX memory model), and none of the cuda::atomic_thread_fence variants lower down to a fence.proxy.* PTX instruction.

What is the right tool for the job? I believe that, conceptually, we need to order the generic proxy write for the next group (*(gmem_tensor_map_{a,b}) = *(smem_tensor_map_{a,b})) after the tensormap proxy read that implicitly happens for the previous group. We already order the tensormap proxy read after the generic proxy write for the same group using fence.proxy.tensormap::generic[...], so it would make sense to use a hypothetical fence.proxy.generic::tensormap[...] instruction instead of __syncwarp(), but unfortunately this PTX instruction doesn't exist. So I don't know what the right tool is.

Of course, __syncwarp() also has nothing to do with proxy fences, and is also not the right tool for the job, as I said in the PR description. But at least it works empirically for some reason. :)

@RayWang96

Collaborator

> @RayWang96 I've tried using cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_thread); in a branch, but it didn't prevent reordering. Judging from the CCCL code, it can't possibly work, since thread-level fences literally don't do anything at all.

Have you checked the SASS code already? I believe all the STG instructions have been scheduled after DEPBAR.LE.

thread_scope_thread has no membar/fence effect in the final PTX or hardware semantics, but that does not mean it is completely invisible during the compiler’s intermediate optimization stages. So it can still prevent reordering.

@RayWang96

Collaborator

> Just to confirm, the next required action here is for Ivan @dfyz to update the MR based on the suggestion above? Right, Ray? @RayWang96

Right, and then we can wait for the DS team to review and merge it. If you want to get it merged quickly, you can merge it into the nv_dev branch first.

@RayWang96

Collaborator

> What is the right tool for the job? I believe that, conceptually, we need to order the generic proxy write for the next group (*(gmem_tensor_map_{a,b}) = *(smem_tensor_map_{a,b})) after the tensormap proxy read that implicitly happens for the previous group. We already order the tensormap proxy read after the generic proxy write for the same group using fence.proxy.tensormap::generic[...], so it would make sense to use a hypothetical fence.proxy.generic::tensormap[...] instruction instead of __syncwarp(), but unfortunately this PTX instruction doesn't exist. So I don't know what the right tool is.
>
> Of course, __syncwarp() also has nothing to do with proxy fences, and is also not the right tool for the job, as I said in the PR description. But at least it works empirically for some reason. :)

After cp.async.bulk.wait_group.read 0 returns, the read side of TMA on the tensormap/descriptor in the bulk async group it is waiting on has completed. Therefore, subsequent modifications to that descriptor storage through the generic proxy will no longer conflict with those earlier TMA reads.

What we need is to prevent STG from being reordered before cp.async.bulk.wait_group.read 0, and this can be achieved with cuda::atomic_thread_fence.

@dfyz

dfyz commented Jun 5, 2026

Author

> Have you checked the SASS code already? I believe all the STG instructions have been scheduled after DEPBAR.LE.

Yes I did (after seeing the test fail). To clarify, I'm using the following nvcc version (although I've seen the same behavior with other versions):

$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Thu_Mar_19_11:12:51_PM_PDT_2026
Cuda compilation tools, release 13.2, V13.2.78
Build cuda_13.2.r13.2/compiler.37668154_0

To reproduce the desired behavior, you can run the test_fp8_fp4.py test using commit 898b7c298df1e6b7958fd77013f330037e59b9bf from my fork, then take any random cubin from the DeepGEMM cache and run the following:

$ ~/.deep_gemm/cache/kernel.sm90_fp8_gemm_1d1d.2ad67ad5b93c28f0715fa1a57eb0d886$ /usr/local/cuda/bin/cuobjdump -sass kernel.cubin | egrep 'DEPBAR\.LE SB0|STG'
        /*22d0*/                   DEPBAR.LE SB0, 0x0 ;                                           /* 0x000080000000791a */
        /*30a0*/                   DEPBAR.LE SB0, 0x0 ;                                           /* 0x000080000000791a */
        /*3160*/                   STG.E.128 desc[UR60][R2.64+0x70], R8 ;                         /* 0x0000700802007986 */
        /*3170*/                   STG.E.128 desc[UR60][R2.64+0x60], R16 ;                        /* 0x0000601002007986 */
        /*3190*/                   STG.E.128 desc[UR60][R2.64+0x50], R24 ;                        /* 0x0000501802007986 */
        /*31a0*/                   STG.E.128 desc[UR60][R2.64+0x40], R32 ;                        /* 0x0000402002007986 */
        /*3230*/                   STG.E.128 desc[UR60][R2.64+0x30], R16 ;                        /* 0x0000301002007986 */
        /*3250*/                   STG.E.128 desc[UR60][R2.64+0x20], R12 ;                        /* 0x0000200c02007986 */
        /*3260*/                   STG.E.128 desc[UR60][R2.64+0x10], R20 ;                        /* 0x0000101402007986 */
        /*3280*/                   STG.E.128 desc[UR60][R2.64], R28 ;                             /* 0x0000001c02007986 */
        /*3290*/                   STG.E.128 desc[UR60][R2.64+0xf0], R8 ;                         /* 0x0000f00802007986 */
        /*32b0*/                   STG.E.128 desc[UR60][R2.64+0xe0], R16 ;                        /* 0x0000e01002007986 */
        /*3300*/                   STG.E.128 desc[UR60][R2.64+0xd0], R20 ;                        /* 0x0000d01402007986 */
        /*3340*/                   STG.E.128 desc[UR60][R2.64+0xc0], R32 ;                        /* 0x0000c02002007986 */
        /*3360*/                   STG.E.128 desc[UR60][R2.64+0xb0], R28 ;                        /* 0x0000b01c02007986 */
        /*33b0*/                   STG.E.128 desc[UR60][R30.64+0xa0], R8 ;                        /* 0x0000a0081e007986 */
        /*33c0*/                   STG.E.128 desc[UR60][R28.64+0x90], R16 ;                       /* 0x000090101c007986 */
        /*33d0*/                   STG.E.128 desc[UR60][R2.64+0x80], R24 ;                        /* 0x0000801802007986 */

Here, the DEPBAR from cp.async.bulk.wait_group.read 0 is at address 30a0, and the following STG.E.128 are at larger addresses. Just to be sure, you can also inspect the control-flow graph with nvdisasm -cfg [...] to double-check that the dependency barrier precedes the stores in the CFG.

If you do the same for commit 045c895a160e6ae9efc51f65442de26a7796d573 (which uses cuda::atomic_thread_fence(...)), you will see that some of the stores (for example, the one at address 3510) precede the DEPBAR at address 35a0:

~/.deep_gemm/cache/kernel.sm90_fp8_gemm_1d1d.0ab23d298093f1e942cc3a461ec6c36d$ /usr/local/cuda/bin/cuobjdump -sass kernel.cubin | egrep 'DEPBAR\.LE SB0|STG'
        /*22d0*/                   DEPBAR.LE SB0, 0x0 ;                                           /* 0x000080000000791a */
        /*32c0*/                   STG.E.128 desc[UR60][R2.64+0x70], R8 ;                         /* 0x0000700802007986 */
        /*32d0*/                   STG.E.128 desc[UR60][R2.64+0x60], R16 ;                        /* 0x0000601002007986 */
        /*32e0*/                   STG.E.128 desc[UR60][R2.64+0x50], R24 ;                        /* 0x0000501802007986 */
        /*3450*/                   STG.E.128 desc[UR60][R2.64+0x40], R32 ;                        /* 0x0000402002007986 */
        /*3460*/                   STG.E.128 desc[UR60][R2.64+0x30], R16 ;                        /* 0x0000301002007986 */
        /*3480*/                   STG.E.128 desc[UR60][R2.64+0x20], R20 ;                        /* 0x0000201402007986 */
        /*34a0*/                   STG.E.128 desc[UR60][R2.64+0x10], R24 ;                        /* 0x0000101802007986 */
        /*34c0*/                   STG.E.128 desc[UR60][R2.64], R28 ;                             /* 0x0000001c02007986 */
        /*34d0*/                   STG.E.128 desc[UR60][R2.64+0xf0], R8 ;                         /* 0x0000f00802007986 */
        /*3510*/                   STG.E.128 desc[UR60][R2.64+0xe0], R16 ;                        /* 0x0000e01002007986 */
        /*35a0*/                   DEPBAR.LE SB0, 0x0 ;                                           /* 0x000080000000791a */
        /*35e0*/                   STG.E.128 desc[UR60][R2.64+0xd0], R20 ;                        /* 0x0000d01402007986 */
        /*3620*/                   STG.E.128 desc[UR60][R2.64+0xc0], R32 ;                        /* 0x0000c02002007986 */
        /*3630*/                   STG.E.128 desc[UR60][R2.64+0xb0], R28 ;                        /* 0x0000b01c02007986 */
        /*3660*/                   STG.E.128 desc[UR60][R28.64+0xa0], R8 ;                        /* 0x0000a0081c007986 */
        /*3670*/                   STG.E.128 desc[UR60][R2.64+0x90], R16 ;                        /* 0x0000901002007986 */
        /*3680*/                   STG.E.128 desc[UR60][R2.64+0x80], R24 ;                        /* 0x0000801802007986 */

Again, you can inspect the control-flow graph to be sure.
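The address comparison above can also be scripted, as a rough sketch. It only looks at linear address order, so it is a heuristic and does not replace the CFG check. It also assumes that the last DEPBAR.LE in the filtered output is the one generated by cp.async.bulk.wait_group.read 0, which holds for the two listings above but is not guaranteed in general. The function name `stores_before_last_depbar` is hypothetical, and the sample inputs are a few lines trimmed from the listings above.

```python
import re

# Matches the address comment and the opcode of a cuobjdump -sass line, e.g.
#   /*30a0*/    DEPBAR.LE SB0, 0x0 ;
LINE_RE = re.compile(r"/\*([0-9a-f]{4,})\*/\s+(\S+)")


def stores_before_last_depbar(sass_text):
    """Return addresses of STG instructions placed before the last DEPBAR."""
    instructions = []
    for line in sass_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            instructions.append((int(match.group(1), 16), match.group(2)))

    depbars = [addr for addr, op in instructions if op.startswith("DEPBAR")]
    if not depbars:
        return []
    last_depbar = max(depbars)
    return [
        addr
        for addr, op in instructions
        if op.startswith("STG") and addr < last_depbar
    ]


# Trimmed from the listing for commit 898b7c29... (with __syncwarp()).
WITH_WARP_SYNC = """
/*22d0*/    DEPBAR.LE SB0, 0x0 ;
/*30a0*/    DEPBAR.LE SB0, 0x0 ;
/*3160*/    STG.E.128 desc[UR60][R2.64+0x70], R8 ;
/*33d0*/    STG.E.128 desc[UR60][R2.64+0x80], R24 ;
"""

# Trimmed from the listing for commit 045c895a... (with the thread fence).
WITH_THREAD_FENCE = """
/*22d0*/    DEPBAR.LE SB0, 0x0 ;
/*32c0*/    STG.E.128 desc[UR60][R2.64+0x70], R8 ;
/*3510*/    STG.E.128 desc[UR60][R2.64+0xe0], R16 ;
/*35a0*/    DEPBAR.LE SB0, 0x0 ;
/*35e0*/    STG.E.128 desc[UR60][R2.64+0xd0], R20 ;
"""

if __name__ == "__main__":
    for name, text in [("warp sync", WITH_WARP_SYNC),
                       ("thread fence", WITH_THREAD_FENCE)]:
        early = [hex(addr) for addr in stores_before_last_depbar(text)]
        print(name, early)
```

For real use you would feed it the full output of the cuobjdump command above instead of the trimmed samples.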

> thread_scope_thread has no membar/fence effect in the final PTX or hardware semantics, but that does not mean it is completely invisible during the compiler’s intermediate optimization stages. So it can still prevent reordering.

Fair enough, but if I take commit 045c895a160e6ae9efc51f65442de26a7796d573 that I mentioned above and manually comment out the atomic thread fence, then run the test with DG_JIT_DUMP_PTX=1, I can observe that the generated PTX doesn't change at all after removing the fence (modulo some generated IDs). So I think it's fair to assume that in this particular case the fence is a no-op.
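A comparison "modulo generated IDs" could look roughly like the sketch below. The exact shape of the generated IDs is an assumption: here they are taken to be numeric suffixes after an underscore (as in labels like `$L__BB0_12`), and the pattern would need adjusting if the real dumps differ in other ways. The function names and the sample PTX strings are hypothetical and are not taken from the actual dumps.

```python
import difflib
import re

# Assumption: a generated ID is a run of digits following an underscore.
GENERATED_ID_RE = re.compile(r"_\d+\b")


def normalize_ptx(ptx_text):
    """Replace assumed generated IDs with a placeholder, line by line."""
    return [
        GENERATED_ID_RE.sub("_N", line.rstrip())
        for line in ptx_text.splitlines()
    ]


def ptx_diff(ptx_a, ptx_b):
    """Return unified-diff lines between two PTX dumps after normalization."""
    return list(
        difflib.unified_diff(
            normalize_ptx(ptx_a), normalize_ptx(ptx_b), lineterm=""
        )
    )


# Hypothetical snippets that differ only in a label suffix.
PTX_WITH_FENCE = "$L__BB0_12:\n\tbra.uni $L__BB0_12;\n\tret;"
PTX_WITHOUT_FENCE = "$L__BB0_37:\n\tbra.uni $L__BB0_37;\n\tret;"

# Hypothetical snippet with a real extra instruction.
PTX_EXTRA_INSTRUCTION = "$L__BB0_12:\n\tfence.sc.cta;\n\tbra.uni $L__BB0_12;\n\tret;"

if __name__ == "__main__":
    print(len(ptx_diff(PTX_WITH_FENCE, PTX_WITHOUT_FENCE)))
    print("\n".join(ptx_diff(PTX_WITH_FENCE, PTX_EXTRA_INSTRUCTION)))
```

An empty diff after normalization is what "the generated PTX doesn't change at all" means in practice; any remaining diff line points at a real difference between the two dumps.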

> If you want to get it merged quickly, you can merge it into the nv_dev branch first.

I think there's no rush! The training codebase where this issue was initially found has a workaround applied internally, and it would be good to fix it and review it in a proper way upstream.

@RayWang96

Collaborator


Indeed, my earlier approach was incorrect; the issue here cannot be resolved with a memory barrier. I believe warp sync is the correct fix, but I would recommend using __syncwarp(1 << lane_idx); so that it acts purely as an instruction scheduling barrier and avoids the risk of causing other side effects.

@dfyz

dfyz commented Jun 6, 2026

Author

> I would recommend using __syncwarp(1 << lane_idx); so that it acts purely as an instruction scheduling barrier and avoids the risk of causing other side effects

Thanks, this sounds like a good idea. I rebased my branch on top of main, and added a lane mask as you suggested.

@dfyz

dfyz commented Jun 17, 2026

Author

@RayWang96

> Right, and then we can wait for the DS team to review and merge it

Do we maybe need to tag someone from the DS team to look at this PR?

3 participants