[JAX] Support for cuDNN-backed flex attention by vcherepanov-nv · Pull Request #2985 · NVIDIA/TransformerEngine

vcherepanov-nv · 2026-05-13T03:18:24Z

Description

Adds experimental JAX fused-attention score_mod support through cuDNN frontend SDPA graphs.

This introduces a score_mod(graph, score, tensors) callback path for fused_attn, plus optional score_mod_bprop(graph, dscore, tensors) support for backward. The Python side builds and serializes cuDNN frontend forward/backward graphs, caches graph metadata with stable callback keys, supports auxiliary tensor operands, and supports Python/NumPy scalar operands as cuDNN pass-by-value tensors. The C++ JAX extension deserializes and caches the graphs per device, then executes them through new forward/backward FFI handlers.
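A rough sketch of how call-time operands could be divided into tensor operands and pass-by-value scalars. The helper name `split_score_mod_operands` and the sorted-name ordering are assumptions for illustration; the PR's actual helper is not shown in this thread.

```python
import numbers
from typing import Any, Dict, Mapping, Optional, Tuple


def split_score_mod_operands(
    operands: Optional[Mapping[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split named operands into (tensors, scalars). Hypothetical helper."""
    tensors: Dict[str, Any] = {}
    scalars: Dict[str, Any] = {}
    # Sort by name so the operand order is deterministic across calls.
    for name in sorted(operands or {}):
        value = operands[name]
        # Python numbers (and NumPy scalar types, which register with the
        # numbers ABCs) are treated as pass-by-value scalars.
        if isinstance(value, numbers.Number):
            scalars[name] = value
        else:
            # Everything else is assumed to be an array-like tensor operand.
            tensors[name] = value
    return tensors, scalars
```

Keeping scalars apart from tensors matters because only the tensors need to stay in the JAX computation graph; the scalars can be baked into the call as plain values.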

The Flax API now plumbs score_mod through DotProductAttention, MultiHeadAttention, and TransformerLayer. Packed QKV/KV layouts are unpacked to the separate BSHD layout when score modification is requested.
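As an illustration of the unpacking step, here is a minimal JAX sketch. It assumes the packed axis sits at dimension 2, i.e. `[batch, seq, 3, heads, head_dim]` for packed QKV and `[batch, seq, 2, heads, head_dim]` for packed KV; the function names are hypothetical and not taken from the PR.

```python
import jax.numpy as jnp


def unpack_qkv(qkv):
    """Split a packed [batch, seq, 3, heads, head_dim] tensor into Q, K, V."""
    if qkv.ndim != 5 or qkv.shape[2] != 3:
        raise ValueError(f"expected [B, S, 3, H, D], got {qkv.shape}")
    # Indexing the packed axis yields three separate BSHD tensors.
    return qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]


def unpack_kv(kv):
    """Split a packed [batch, seq, 2, heads, head_dim] tensor into K, V."""
    if kv.ndim != 5 or kv.shape[2] != 2:
        raise ValueError(f"expected [B, S, 2, H, D], got {kv.shape}")
    return kv[:, :, 0], kv[:, :, 1]
```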

Users are responsible for supplying a mathematically correct score_mod_bprop for the corresponding score_mod; Transformer Engine wires the callback into the cuDNN graph but does not validate gradient semantics.
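Since the gradient is not validated, it may be worth checking a hand-written backward formula before wiring it into a graph. The sketch below does this for a tanh softcap using a central finite difference on plain Python floats. It only checks the math; it is not the cuDNN graph callback itself, and all function names are illustrative.

```python
import math


def softcap_forward(score: float, softcap: float) -> float:
    """Elementwise softcap: softcap * tanh(score / softcap)."""
    return softcap * math.tanh(score / softcap)


def softcap_backward(dscore: float, score: float, softcap: float) -> float:
    """Analytic gradient: d/ds [c * tanh(s / c)] = 1 - tanh(s / c) ** 2."""
    t = math.tanh(score / softcap)
    return dscore * (1.0 - t * t)


def finite_difference(fn, x: float, eps: float = 1e-6) -> float:
    """Central-difference estimate of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


def bprop_matches(score: float, softcap: float, dscore: float = 1.0, tol: float = 1e-6) -> bool:
    """Compare the analytic backward against the numeric derivative."""
    numeric = dscore * finite_difference(lambda s: softcap_forward(s, softcap), score)
    analytic = softcap_backward(dscore, score, softcap)
    return abs(numeric - analytic) <= tol
```

A check like `bprop_matches(0.3, 0.8)` can run in a unit test without a GPU, which keeps gradient mistakes from surfacing only as training divergence.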

Current score_mod limitations:

Requires fused attention to be enabled.
Supports separate rank-4 BSHD_BSHD_BSHD Q/K/V tensors only.
Supports FP16/BF16 Q/K/V tensors.
Mutually exclusive with attention bias, masks, sequence descriptors, dropout, sliding-window attention, packed/ragged metadata, context parallelism, and non-vanilla softmax/softmax offset.
Requires matching cuDNN frontend Python package and C++ headers.
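The limitations above can be read as a set of preconditions. The following sketch encodes them as a plain validation function. `ScoreModRequest` and its field names are hypothetical stand-ins for whatever the real validation inspects, and the cuDNN frontend version check is left out because it depends on the installed packages.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScoreModRequest:
    """Hypothetical summary of an attention call; field names are illustrative."""

    fused_attn_enabled: bool = True
    qkv_layout: str = "BSHD_BSHD_BSHD"
    qkv_rank: int = 4
    dtype: str = "bfloat16"
    has_bias: bool = False
    has_mask: bool = False
    has_sequence_descriptor: bool = False
    has_packed_metadata: bool = False
    dropout_probability: float = 0.0
    window_size: Optional[Tuple[int, int]] = None
    context_parallel: bool = False
    vanilla_softmax: bool = True


def score_mod_violations(req: ScoreModRequest) -> List[str]:
    """Return one message per limitation that the request violates."""
    problems: List[str] = []
    if not req.fused_attn_enabled:
        problems.append("fused attention must be enabled")
    if req.qkv_layout != "BSHD_BSHD_BSHD" or req.qkv_rank != 4:
        problems.append("Q/K/V must be separate rank-4 BSHD_BSHD_BSHD tensors")
    if req.dtype not in ("float16", "bfloat16"):
        problems.append("Q/K/V must be FP16 or BF16")
    # Features listed as mutually exclusive with score_mod.
    exclusive = {
        "attention bias": req.has_bias,
        "masks": req.has_mask,
        "sequence descriptors": req.has_sequence_descriptor,
        "packed/ragged metadata": req.has_packed_metadata,
        "dropout": req.dropout_probability > 0.0,
        "sliding-window attention": req.window_size is not None,
        "context parallelism": req.context_parallel,
        "non-vanilla softmax": not req.vanilla_softmax,
    }
    problems.extend(
        f"{name} is not supported with score_mod" for name, used in exclusive.items() if used
    )
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every incompatible option at once.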

Fixes #2492

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

A new score_mod code path for the JAX FusedAttention backend
cuDNN frontend graph serialization and JAX FFI execution for score_mod forward/backward
Flax plumbing for DotProductAttention, MultiHeadAttention, and TransformerLayer
Tests

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-05-13T03:29:19Z

Greptile Summary

This PR adds experimental cuDNN-frontend-backed flex attention (score_mod) to the JAX backend, including graph serialization/deserialization, a Python-level graph cache, new C++ FFI handlers for forward and backward, and Flax plumbing through DotProductAttention, MultiHeadAttention, and TransformerLayer.

Core path: fused_attn short-circuits to a new _fused_attn_score_mod custom_vjp primitive when score_mod is provided; the Python side builds and serializes cuDNN frontend graphs at trace time (cached by shape/dtype/config key), then passes the serialized bytes + UID maps as static FFI attributes to C++ handlers that deserialize and execute them.
C++ side: Two new FFI handlers deserialize graphs on demand into a process-lifetime unordered_map guarded by a mutex, with a thread-local cuDNN handle cache; the current double-checked locking leaves a window for redundant concurrent deserialization.
Flax plumbing: Packed and KV-packed layouts are transparently converted to separate BSHD tensors before the score_mod path; score_mod_tensors / score_mod_bprop_tensors are forwarded as call-time arguments to keep tensor operands in the JAX computation graph.
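To illustrate what a stable cache key could look like, here is a small sketch. It assumes callbacks are identified by module and qualified name rather than by object identity, which is one plausible way to keep keys stable when a bound method such as `obj.forward` is re-created on every attribute access. The helper names and the exact key contents are assumptions, not the PR's scheme.

```python
from typing import Callable, Hashable, Optional, Sequence


def callback_key(fn: Optional[Callable]) -> Hashable:
    """Identify a callback by where it is defined, not by id()."""
    if fn is None:
        return None
    # Bound methods wrap the underlying function in __func__.
    target = getattr(fn, "__func__", fn)
    return (target.__module__, target.__qualname__)


def graph_cache_key(
    shapes: Sequence[Sequence[int]],
    dtype: str,
    scale: float,
    score_mod: Optional[Callable],
    score_mod_bprop: Optional[Callable] = None,
) -> Hashable:
    """Build a hashable key from shapes, dtype, config, and callback identity."""
    return (
        tuple(tuple(shape) for shape in shapes),
        str(dtype),
        float(scale),
        callback_key(score_mod),
        callback_key(score_mod_bprop),
    )
```

One caveat of this particular scheme: two lambdas defined in the same scope, or two instances of one class with different internal state, map to the same key, so any per-call state would have to travel through the tensor or scalar operands instead.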

Confidence Score: 5/5

Safe to merge as an experimental feature; all flagged items are non-blocking quality improvements with no correctness impact.

The core forward and backward graph building, caching, FFI dispatch, and Flax plumbing are all structurally correct. Cache key stability, UID ordering, and pytree gradient structure are handled properly. The findings are race conditions that produce at worst redundant work (not wrong results) and a shutdown-order concern for the thread-local cuDNN handle that matches patterns already present elsewhere in the codebase.

transformer_engine/jax/csrc/extensions/attention.cpp (double-checked locking in GetScoreModGraph, thread-local handle destructor ordering) and transformer_engine/jax/cpp_extensions/flex_attention.py (Python-level cache lock).
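For reference, the locking pattern under discussion can be sketched in Python. This is an illustration of the general technique, not a translation of `GetScoreModGraph`: the class and its `build_fn` parameter are hypothetical. Building while the lock is held removes the window for redundant work, at the cost of serializing builds for unrelated keys.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class GraphCache:
    """Double-checked cache that builds each entry at most once."""

    def __init__(self, build_fn: Callable[[Hashable], Any]):
        self._build_fn = build_fn
        self._lock = threading.Lock()
        self._graphs: Dict[Hashable, Any] = {}
        self.build_count = 0

    def get(self, key: Hashable) -> Any:
        # Fast path: no lock needed when the entry already exists.
        graph = self._graphs.get(key)
        if graph is not None:
            return graph
        with self._lock:
            # Re-check under the lock; another thread may have built it.
            graph = self._graphs.get(key)
            if graph is None:
                # Build while holding the lock so no two threads build the same key.
                graph = self._build_fn(key)
                self.build_count += 1
                self._graphs[key] = graph
            return graph
```

If the build is released from the lock instead (check, unlock, build, lock, insert), two threads can both miss and both build; the result is still correct but the work is duplicated, which matches the non-blocking finding above.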

Important Files Changed

Filename	Overview
transformer_engine/jax/cpp_extensions/flex_attention.py	New 967-line file implementing cuDNN frontend score_mod graph building, caching, and FFI dispatch; implements a stable cache-key scheme and separates tensor vs. scalar operands cleanly.
transformer_engine/jax/csrc/extensions/attention.cpp	Adds 251 lines for C++ cuDNN graph deserialization, thread-local handle cache, and two new FFI handlers (forward/backward); double-checked locking leaves redundant deserializations possible under thread contention.
transformer_engine/jax/attention.py	Adds custom_vjp wrapper for score_mod path with correct residual propagation, early-return before the deprecated sequence_descriptor path, and proper validation delegation.
transformer_engine/jax/flax/transformer.py	Plumbs score_mod/score_mod_bprop through DotProductAttention, MultiHeadAttention, and TransformerLayer; handles packed/kvpacked layout unpacking correctly before the score_mod path.
tests/jax/test_fused_attn_score_mod.py	New 671-line test suite covering causal masking, post-scale bias, softcap (forward/backward), and Flax layer integration, with reference implementations for correctness comparison.

Sequence Diagram

sequenceDiagram
    participant User
    participant fused_attn
    participant ScoreMod as "_fused_attn_score_mod"
    participant FlexPy as "flex_attention.py"
    participant FFI as "FFI/XLA"
    participant Cpp as "C++ Handler"
    participant Cache as "cuDNN Graph Cache"

    User->>fused_attn: "call with score_mod callback"
    fused_attn->>fused_attn: "validate_fused_attn_score_mod()"
    fused_attn->>FlexPy: "make_fused_attn_score_mod_config()"
    fused_attn->>ScoreMod: "custom_vjp forward"

    Note over ScoreMod,FlexPy: JAX Tracing Phase
    ScoreMod->>FlexPy: "fused_attn_score_mod_fwd()"
    FlexPy->>FlexPy: "check _score_mod_graph_cache"
    alt cache miss
        FlexPy->>FlexPy: "_build_score_mod_fwd_graph()"
        FlexPy->>FlexPy: "store in _score_mod_graph_cache"
    end
    FlexPy->>FFI: "ffi.ffi_call(serialized_graph, uids)"

    Note over FFI,Cache: XLA Execution Phase
    FFI->>Cpp: "FusedAttnScoreModForwardFFI(stream, q, k, v)"
    Cpp->>Cache: "GetScoreModGraph(stream, attrs)"
    alt C++ cache miss
        Cache->>Cache: "graph->deserialize(handle, data)"
        Cache->>Cache: "store shared_ptr in map"
    end
    Cpp->>Cpp: "graph->execute(handle, variant_pack)"
    Cpp-->>FFI: "output, stats, workspace"

    Note over ScoreMod,FlexPy: Backward pass
    ScoreMod->>FlexPy: "fused_attn_score_mod_bwd(qkv, o, dO, stats)"
    FlexPy->>FFI: "ffi.ffi_call(serialized_bwd_graph)"
    FFI->>Cpp: "FusedAttnScoreModBackwardFFI(...)"
    Cpp-->>FFI: "dq, dk, dv"

Reviews (9): Last reviewed commit: "Skip softcap score-mod test before SM90"

KshitijLakhani · 2026-05-20T01:59:13Z

+    score_mod: Optional[Callable] = None,
+    score_mod_bprop: Optional[Callable] = None,
+    score_mod_tensors: Optional[Mapping[str, Any]] = None,
+    score_mod_bprop_tensors: Optional[Mapping[str, Any]] = None,
 ):


It looks like this is the highest-level API that score_mod has been plumbed to.
There are higher-level APIs that need the plumbing as well, so please take a look.
At the very least, cover FusedDPA and DPA.

KshitijLakhani · 2026-05-20T17:01:04Z

+def _reference_attention(
+    query, key, value, scale, *, causal=False, post_scale_bias=False, softcap=None
+):
+    scores = jnp.einsum("bqhd,bkhd->bhqk", query, key).astype(jnp.float32) * scale
+    if causal:
+        q_pos = jnp.arange(query.shape[1])[:, None]
+        kv_pos = jnp.arange(key.shape[1])[None, :]
+        scores = jnp.where(q_pos >= kv_pos, scores, -1e9)
+    if post_scale_bias:
+        q_pos = jnp.arange(query.shape[1], dtype=jnp.float32)[:, None]
+        kv_pos = jnp.arange(key.shape[1], dtype=jnp.float32)[None, :]
+        scores = scores + q_pos - kv_pos
+    if softcap is not None:
+        scores = softcap * jnp.tanh(scores / softcap)
+    probs = jax.nn.softmax(scores, axis=-1)
+    return jnp.einsum("bhqk,bkhd->bqhd", probs, value).astype(query.dtype)
+
+


Why create your own reference instead of using the JAX-native reference that is already in the test file?

KshitijLakhani · 2026-05-27T21:22:24Z

+        if score_mod_requested:
+            if not enable_fused_attn:
+                raise ValueError("score_mod requires fused attention, but NVTE_FUSED_ATTN=0.")
+            has_fused_attn_kernel = True


Why do we force this to True?
If the user wants flex attention and has enabled fused attention, shouldn't we still check availability via is_fused_attn_kernel_available()?
That is our contract API for determining whether fused attention is both available and requested, so I think we should rely on it rather than forcing the flag to True here.
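A sketch of the control flow this comment suggests, for discussion only. The `kernel_available` callable stands in for `is_fused_attn_kernel_available()`, whose real signature is not shown in this thread, and the function name and second error message are hypothetical.

```python
from typing import Callable


def resolve_fused_attn_kernel(
    score_mod_requested: bool,
    enable_fused_attn: bool,
    kernel_available: Callable[[], bool],
) -> bool:
    """Decide whether the fused kernel is used, without forcing the flag."""
    if score_mod_requested and not enable_fused_attn:
        raise ValueError("score_mod requires fused attention, but NVTE_FUSED_ATTN=0.")
    # Always consult the availability check instead of assuming True.
    has_fused_attn_kernel = enable_fused_attn and kernel_available()
    if score_mod_requested and not has_fused_attn_kernel:
        raise ValueError(
            "score_mod requires a fused attention kernel, "
            "but none is available for this configuration."
        )
    return has_fused_attn_kernel
```

With this shape, an unsupported configuration fails with an explicit error at the layer boundary rather than later inside the fused path.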

KshitijLakhani · 2026-05-27T21:24:32Z

+        score_mod_requested = (
+            self.score_mod is not None
+            or self.score_mod_bprop is not None
+            or score_mod_tensors is not None
+            or score_mod_bprop_tensors is not None
+        )



Why repeat this check in both DPA and _FusedDPA?
_FusedDPA is meant to be an internal class (hence the leading underscore), so if the checks exist in DPA I do not think we need them again in _FusedDPA. Thoughts?
IIRC, only DPA should be exposed to the user.

KshitijLakhani · 2026-05-27T21:25:40Z

+    score_mod: Optional[Callable] = None,
+    score_mod_bprop: Optional[Callable] = None,
+    score_mod_tensors: Optional[Mapping[str, Any]] = None,
+    score_mod_bprop_tensors: Optional[Mapping[str, Any]] = None,
 ):


Left some comments for this

KshitijLakhani · 2026-05-27T21:43:51Z

+def _reference_attention(
+    query, key, value, scale, *, causal=False, post_scale_bias=False, softcap=None
+):
+    scores = jnp.einsum("bqhd,bkhd->bhqk", query, key).astype(jnp.float32) * scale
+    if causal:
+        q_pos = jnp.arange(query.shape[1])[:, None]
+        kv_pos = jnp.arange(key.shape[1])[None, :]
+        scores = jnp.where(q_pos >= kv_pos, scores, -1e9)
+    if post_scale_bias:
+        q_pos = jnp.arange(query.shape[1], dtype=jnp.float32)[:, None]
+        kv_pos = jnp.arange(key.shape[1], dtype=jnp.float32)[None, :]
+        scores = scores + q_pos - kv_pos
+    if softcap is not None:
+        scores = softcap * jnp.tanh(scores / softcap)
+    probs = jax.nn.softmax(scores, axis=-1)
+    return jnp.einsum("bhqk,bkhd->bqhd", probs, value).astype(query.dtype)
+
+


Just to clarify, I believe the idea behind this comment was to check the possibility of code reuse. However, it seems the solution chosen was to move the flex attention tests to a different file altogether.

I do not think it is good practice for us to have different ways of creating the reference for different types of fused attention. It would have been best to use the reference implementation in test_fused_attn.py, tweaked for the flex attention test cases.

However, I'll try not to hold the PR for this, but I definitely think this should be looked into in a follow-up PR.
cc: @cyanguwa
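One way a single shared reference could cover these cases is to accept the score transformation as a callable. The sketch below is an assumption about how such a reference might look; it is not the existing reference in test_fused_attn.py, and `reference_attention`, `causal_mask`, `softcap`, and `compose` are illustrative names.

```python
import jax
import jax.numpy as jnp


def reference_attention(query, key, value, scale, score_fn=None):
    """Dense BSHD attention with an optional transformation of the scores."""
    scores = jnp.einsum("bqhd,bkhd->bhqk", query, key).astype(jnp.float32) * scale
    if score_fn is not None:
        scores = score_fn(scores)
    probs = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhqk,bkhd->bqhd", probs, value).astype(query.dtype)


def causal_mask(scores):
    """Mask out keys that lie after the query position."""
    q_pos = jnp.arange(scores.shape[-2])[:, None]
    kv_pos = jnp.arange(scores.shape[-1])[None, :]
    return jnp.where(q_pos >= kv_pos, scores, -1e9)


def softcap(cap):
    """Return a score function computing cap * tanh(scores / cap)."""
    return lambda scores: cap * jnp.tanh(scores / cap)


def compose(*fns):
    """Apply score functions left to right."""

    def apply(scores):
        for fn in fns:
            scores = fn(scores)
        return scores

    return apply
```

When combining the two, the mask should come after the softcap, as in `compose(softcap(0.8), causal_mask)`: applying tanh to an already masked score would map -1e9 to roughly `-cap` and effectively undo the mask.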

KshitijLakhani · 2026-05-27T21:47:05Z

+        query = (0.125 * runner.q).astype(dtype)
+        key_tensor = (0.125 * runner.k).astype(dtype)
+        value = (0.125 * runner.v).astype(dtype)
+        doutput = random.normal(random.PRNGKey(2025), data_shape, dtype=dtype)


Why do we scale the inputs by 0.125 here?

KshitijLakhani · 2026-05-27T21:56:06Z

+        runner = FusedAttnRunner(
+            batch,
+            seqlen,
+            seqlen,
+            num_heads,
+            num_heads,
+            head_dim,
+            head_dim,
+            AttnBiasType.NO_BIAS,
+            AttnMaskType.NO_MASK,
+            AttnSoftmaxType.VANILLA_SOFTMAX,
+            0.0,
+            dtype,
+            True,
+            QKVLayout.BSHD_BSHD_BSHD,
+            None,
+            None,
+            SeqDescFormat.Mask,
+            number_of_devices=device_count,
+            mesh_shape=mesh_shape,
+            mesh_axes=mesh_axes,
+            mesh_resource=mesh_resource,
+        )
+        runner._setup_inputs()
+


So it seems you are using the runner only to set up the inputs, and then following that with code that duplicates test_forward and test_backward in test_fused_attn.py.
The suggestion was to use the runner to call forward(), which does the setup and also runs the test.

The idea is that if you integrate the non-distributed tests with the Runner infrastructure, the distributed tests can use it here for free. The approach here seems to be somewhere between your older approach and the suggested one. You can refer to other tests in this file for reference, in case I have not explained this well.

I'm curious whether the reason was that you were unable to fully integrate the score_mod tests into the fused attention runner. That seems to be the same reason for creating a separate test_fused_attn_score_mod.py rather than integrating the flex attention tests into the runner in test_fused_attn.py.

If test_fused_attn_score_mod.py must be created, then a Runner should be created in it too, and the distributed flex attention tests can then use that runner here (similar to how other tests do).
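To make the suggestion concrete, here is a minimal sketch of what such a runner could look like. It is not FusedAttnRunner: the class name, fields, and `check_*` methods are hypothetical, and sharding is omitted. The point is only that input setup and the forward/backward comparisons live in one place that both single-device and distributed tests can call.

```python
from dataclasses import dataclass
from typing import Callable

import jax
import jax.numpy as jnp


@dataclass
class ScoreModRunnerSketch:
    """Hypothetical runner: owns input setup plus forward/backward comparisons."""

    batch: int
    seqlen: int
    num_heads: int
    head_dim: int
    attn_fn: Callable  # implementation under test: (q, k, v) -> out
    ref_fn: Callable  # reference implementation: (q, k, v) -> out
    seed: int = 0

    def setup_inputs(self):
        shape = (self.batch, self.seqlen, self.num_heads, self.head_dim)
        keys = jax.random.split(jax.random.PRNGKey(self.seed), 4)
        self.q, self.k, self.v, self.dout = (
            jax.random.normal(key, shape, dtype=jnp.float32) for key in keys
        )

    def _grads(self, fn):
        # Scalar loss weighted by dout, so the gradients depend on every output.
        def loss(q, k, v):
            return jnp.sum(fn(q, k, v) * self.dout)

        return jax.grad(loss, argnums=(0, 1, 2))(self.q, self.k, self.v)

    def check_forward(self, atol=1e-5):
        self.setup_inputs()
        out = self.attn_fn(self.q, self.k, self.v)
        ref = self.ref_fn(self.q, self.k, self.v)
        assert bool(jnp.allclose(out, ref, atol=atol))

    def check_backward(self, atol=1e-5):
        self.setup_inputs()
        grads = self._grads(self.attn_fn)
        ref_grads = self._grads(self.ref_fn)
        for grad, ref_grad in zip(grads, ref_grads):
            assert bool(jnp.allclose(grad, ref_grad, atol=atol))
```

A distributed test would then construct the same runner with mesh and sharding arguments and call the same two methods, instead of repeating the loss and comparison code.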

KshitijLakhani · 2026-05-27T21:57:13Z

+        qkv_sharding = NamedSharding(runner.mesh, PartitionSpec(dp_axis, None, tp_axis, None))
+        query = (0.125 * runner.q).astype(dtype)
+        key_tensor = (0.125 * runner.k).astype(dtype)
+        value = (0.125 * runner.v).astype(dtype)
+        doutput = random.normal(random.PRNGKey(2025), data_shape, dtype=dtype)
+
+        scaling_factor = runner.scaling_factor
+        softcap = 0.8
+        softcap_score_mod = _ScoreModSoftcap()
+
+        def score_mod_loss(q, k, v, dout):
+            out = customcall_fused_dpa(
+                q,
+                k,
+                v,
+                None,
+                None,
+                None,
+                None,
+                attn_bias_type=AttnBiasType.NO_BIAS,
+                attn_mask_type=AttnMaskType.NO_MASK,
+                qkv_layout=QKVLayout.BSHD_BSHD_BSHD,
+                softmax_type=AttnSoftmaxType.VANILLA_SOFTMAX,
+                scaling_factor=scaling_factor,
+                dropout_probability=0.0,
+                is_training=True,
+                score_mod=softcap_score_mod.forward,
+                score_mod_bprop=softcap_score_mod.backward,
+                score_mod_tensors={"softcap": softcap},
+                score_mod_bprop_tensors={"softcap": softcap},
+            )
+            loss = jnp.sum(out.astype(jnp.float32) * dout.astype(jnp.float32))
+            return loss, out
+
+        def ref_loss(q, k, v, dout):
+            out = _reference_attention(q, k, v, scaling_factor, softcap=softcap)
+            loss = jnp.sum(out.astype(jnp.float32) * dout.astype(jnp.float32))
+            return loss, out
+
+        jitted_score_mod = jax.jit(
+            jax.value_and_grad(score_mod_loss, argnums=(0, 1, 2), has_aux=True),
+            in_shardings=(
+                qkv_sharding,
+                qkv_sharding,
+                qkv_sharding,
+                qkv_sharding,
+            ),
+            out_shardings=((None, qkv_sharding), (qkv_sharding, qkv_sharding, qkv_sharding)),
+        )
+        jitted_ref = jax.jit(jax.value_and_grad(ref_loss, argnums=(0, 1, 2), has_aux=True))
+
+        sharded_args = (
+            jax.device_put(query, qkv_sharding),
+            jax.device_put(key_tensor, qkv_sharding),
+            jax.device_put(value, qkv_sharding),
+            jax.device_put(doutput, qkv_sharding),
+        )


All of this can come for free if the flex attn is integrated with the FusedAttn Runner

vcherepanov-nv added 2 commits May 12, 2026 05:15

Add JAX fused attention score_mod support

f967a26

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Stabilize score_mod callback cache keys

6b05328

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

vcherepanov-nv requested a review from cyanguwa May 13, 2026 03:18

vcherepanov-nv requested review from KshitijLakhani and jberchtold-nvidia as code owners May 13, 2026 03:18

vcherepanov-nv added the 2.16.0 label May 13, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

1a96352

for more information, see https://pre-commit.ci

vcherepanov-nv mentioned this pull request May 13, 2026

[Draft]Support for score_mod and score_mod_bprop in cuDNN's sdpa #2767

Closed

13 tasks

greptile-apps Bot reviewed May 13, 2026

Add distributed JAX score mod attention test

3bf9e97

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps Bot reviewed May 13, 2026

Comment thread tests/jax/test_fused_attn.py Outdated

vcherepanov-nv and others added 2 commits May 15, 2026 03:35

Address JAX score_mod review items

29bbac7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c597af5

for more information, see https://pre-commit.ci

jberchtold-nvidia reviewed May 15, 2026

vcherepanov-nv and others added 2 commits May 18, 2026 23:54

Use serialized cuDNN graphs for score_mod attention

f8bd844

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2c01c5e

for more information, see https://pre-commit.ci

jberchtold-nvidia reviewed May 19, 2026

Comment thread transformer_engine/jax/csrc/extensions/attention.cpp Outdated

Comment thread transformer_engine/jax/csrc/extensions/attention.cpp Outdated

vcherepanov-nv and others added 2 commits May 19, 2026 00:32

Rename score_mod graph cache helpers

ba6a1a7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

deebf8e

for more information, see https://pre-commit.ci

KshitijLakhani requested changes May 20, 2026

Add Flax score_mod attention support

9a92dd2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

jberchtold-nvidia reviewed May 21, 2026

Comment thread transformer_engine/jax/attention.py

Address JAX score_mod review feedback

2198a79

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

github-actions Bot added the community-contribution label May 21, 2026

vcherepanov-nv added 2 commits May 22, 2026 12:20

Add flex-attn tests to QA scripts

ffc8c79

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Skip softcap score-mod test before SM90

8856323

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

jberchtold-nvidia approved these changes May 26, 2026

vcherepanov-nv removed the community-contribution label May 26, 2026

vcherepanov-nv requested a review from KshitijLakhani May 27, 2026 17:33

vcherepanov-nv added 2.17 and removed 2.17 2.16.0 labels May 27, 2026

KshitijLakhani reviewed May 27, 2026

3 participants
