[Common] Enable NVFP4 2D block scaling in columnwise only by negvet · Pull Request #3027 · NVIDIA/TransformerEngine

negvet · 2026-05-21T17:39:35Z

Description

Enables 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 in the forward pass plus NVFP4 in the backward pass on the weights (W).

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Evgeny <etsykunov@nvidia.com>

greptile-apps · 2026-05-21T17:46:37Z

Greptile Summary

This PR enables NVFP4 2D block scaling in columnwise-only mode. It removes the early-return guard and the host-side check that previously blocked (!return_identity, use_2d_quantization) combinations, and adds a new "Step 2.5" amax-only pass inside the CUDA kernel. That pass re-runs Step 2's 2D warp/smem amax reduction without the rowwise scale, quantize, and store writes, so that amax_smem is fully populated before Step 3 (the transpose path) reads it.

Kernel change (quantize_transpose_vector_blockwise_fp4.cu): The `if constexpr (kIs2DBlockScaling && !kReturnIdentity) { return; }` early exit and its host-side `NVTE_CHECK` are removed. A new `if constexpr (!kReturnIdentity && kIs2DBlockScaling)` block mirrors Step 2's 2D reduction and populates `amax_smem` via the same warp-shuffle + shared-memory reduction sequence, which keeps the columnwise output bitwise-identical to that of the dual-direction path.
Test (test_nvfp4_quantize_exact.py): Adds `test_nvfp4_2d_columnwise_only_matches_both_directions`, a parametrized bitwise-equality test that covers aligned and padded tile shapes and two input dtypes. It cross-checks the new columnwise-only code path against the reference rowwise+columnwise path and excludes uninitialized padding bytes from the scale tensor comparison.
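As a rough illustration of why the columnwise-only path still needs the 2D amax pass: with 2D block scaling, one amax is shared by every element of a square tile, so the transposed (columnwise) output depends on the same per-tile values that Step 2 would otherwise have produced. The sketch below is plain Python, not the kernel. The 16x16 tile size and the helper names `tile_amax` and `transpose` are assumptions made for illustration; the thread does not state them.

```python
def tile_amax(matrix, tile=16):
    """Return a grid holding one absolute maximum per tile x tile block.

    Edge tiles may be partial. The tile size of 16 is an assumption.
    """
    rows, cols = len(matrix), len(matrix[0])
    n_tile_rows = (rows + tile - 1) // tile
    n_tile_cols = (cols + tile - 1) // tile
    grid = [[0.0] * n_tile_cols for _ in range(n_tile_rows)]
    for r in range(rows):
        for c in range(cols):
            tr, tc = r // tile, c // tile
            grid[tr][tc] = max(grid[tr][tc], abs(matrix[r][c]))
    return grid


def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


# Deterministic 32 x 48 toy input (hypothetical data).
matrix = [[((r * 31 + c * 17) % 97) - 48.0 for c in range(48)] for r in range(32)]

amax_rowwise_view = tile_amax(matrix)
amax_columnwise_view = tile_amax(transpose(matrix))

# The tile amax grid of the transposed input is the transpose of the original
# grid, so a columnwise-only run needs exactly the same per-tile amax values.
assert amax_columnwise_view == transpose(amax_rowwise_view)
```

Because the per-tile amax does not depend on which direction is being emitted, an amax-only pass is enough to feed the transpose path.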

Confidence Score: 4/5

Safe to merge; the new amax-only pass correctly mirrors the existing Step 2 reduction and is guarded by compile-time template parameters, leaving the existing rowwise and dual-direction paths completely unchanged.

The CUDA kernel change is surgical: the new Step 2.5 block is mutually exclusive with Step 2 at compile time, uses identical warp-shuffle and shared-memory reduction logic, and the final __syncthreads() inside the loop correctly makes amax_smem visible before Step 3 reads it. The bitwise-equality test covers both aligned and padded shapes in two dtypes. The one minor issue is a step-label name collision that has no runtime impact.
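The two-stage reduction described above (warp shuffle, then shared memory) can be simulated on the host to see why it yields bitwise-identical amax values regardless of ordering: `max` on floats is exact, so any reduction tree returns the same bits. This is a minimal sketch that assumes a butterfly (XOR) shuffle pattern and one shared-memory slot per warp. The kernel's actual shuffle pattern and thread-to-tile mapping are not shown in this thread, and the function names are hypothetical.

```python
import random

WARP_SIZE = 32


def warp_max_butterfly(lane_values):
    """Simulate a butterfly (XOR) max reduction across the lanes of one warp."""
    assert len(lane_values) == WARP_SIZE
    lanes = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Every lane combines its value with its partner's value from the
        # previous round, which is what a shuffle exchange would deliver.
        lanes = [max(lanes[i], lanes[i ^ offset]) for i in range(WARP_SIZE)]
        offset //= 2
    return lanes  # every lane now holds the warp-wide maximum


def block_max_via_smem(per_warp_lane_values):
    """Second stage: each warp publishes one value, then the values are combined."""
    smem = [warp_max_butterfly(v)[0] for v in per_warp_lane_values]
    # On the GPU a __syncthreads() would sit here, before any thread reads smem.
    return max(smem)


random.seed(0)
warps = [[abs(random.uniform(-10.0, 10.0)) for _ in range(WARP_SIZE)] for _ in range(4)]
expected = max(v for warp in warps for v in warp)
assert block_max_via_smem(warps) == expected
```

The barrier comment marks the point the review refers to: the shared-memory values must be visible to all threads before the transpose path reads them.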

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu — the new Step 2.5 block and surrounding synchronization are the only areas worth a second look.

Important Files Changed

Filename	Overview
transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu	Removes the early-return guard and host-side NVTE_CHECK that blocked columnwise-only 2D quantization, and adds a new "Step 2.5" amax-only pass that mirrors Step 2's 2D warp/smem reduction to populate `amax_smem` before Step 3 (the transpose path) reads it. Synchronization and reduction logic appear correct; a minor step-label collision exists.
tests/pytorch/nvfp4/test_nvfp4_quantize_exact.py	Adds a new parametrized bitwise-equality test covering both aligned and padded tile sizes for `torch.float32` and `torch.bfloat16`, comparing the new columnwise-only 2D path against the existing rowwise+columnwise path. Correctly excludes padding bytes from the scale tensor comparison.
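The padding exclusion mentioned for the test can be pictured as follows. This is a hedged sketch in plain Python rather than the PyTorch test itself: `valid_region_equal` and the buffer shapes are hypothetical, and the real test may mask the scale tensor differently.

```python
def valid_region_equal(a, b, valid_rows, valid_cols):
    """Compare two equally shaped 2D buffers only inside [0:valid_rows, 0:valid_cols]."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return False
    return all(a[r][:valid_cols] == b[r][:valid_cols] for r in range(valid_rows))


# Two padded 4 x 8 scale buffers whose valid region is 3 x 5.
# The padding is filled with different garbage to mimic uninitialized memory.
PAD_ROWS, PAD_COLS, VALID_ROWS, VALID_COLS = 4, 8, 3, 5


def make_buffer(garbage):
    return [
        [
            (r * 7 + c) % 256 if (r < VALID_ROWS and c < VALID_COLS) else garbage
            for c in range(PAD_COLS)
        ]
        for r in range(PAD_ROWS)
    ]


reference_scales = make_buffer(garbage=0xAA)
columnwise_only_scales = make_buffer(garbage=0x55)

# A whole-buffer comparison fails on the padding alone ...
assert reference_scales != columnwise_only_scales
# ... while the masked comparison sees identical valid bytes.
assert valid_region_equal(reference_scales, columnwise_only_scales, VALID_ROWS, VALID_COLS)
```

Restricting the comparison to the valid region keeps a bitwise test meaningful even when the padded bytes are never written.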

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel Entry] --> B{"kIs2DBlockScaling<br/>&& kIsE8Scaling?"}
    B -- yes --> Z[Early return]
    B -- no --> C["Step 1: Load input to smem<br/>__syncthreads"]
    C --> D{kReturnIdentity?}
    D -- yes --> E["Step 2: Cast and store rowwise<br/>2.1 load smem to regs<br/>2.2 local amax<br/>2.3 2D warp+smem reduction to amax_smem<br/>2.4-2.8 scale / quant / store"]
    D -- no --> F{kIs2DBlockScaling?}
    F -- yes --> G["Step 2.5 NEW: amax-only pass<br/>load smem to regs<br/>local amax<br/>2D warp+smem reduction to amax_smem<br/>no scale/quant/store"]
    F -- no --> H{kReturnTranspose?}
    E --> H
    G --> H
    H -- yes --> I["Step 3: Transpose cast and store columnwise<br/>read amax_smem for 2D path<br/>compute scale / quant / store"]
    H -- no --> J[Done]
    I --> J
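The branching in the flowchart can also be read as a small decision function. The sketch below mirrors only the control flow drawn above. The function name, argument names, and step strings are hypothetical, and in the kernel these flags are compile-time template parameters rather than runtime arguments.

```python
def kernel_steps(return_identity, return_transpose, is_2d_block_scaling, is_e8_scaling=False):
    """Return the ordered list of steps the flowchart would execute for the given flags."""
    if is_2d_block_scaling and is_e8_scaling:
        return ["early return"]

    steps = ["step 1: load input to smem"]
    if return_identity:
        # Rowwise path: includes the 2D amax reduction when 2D scaling is on.
        steps.append("step 2: rowwise cast and store")
    elif is_2d_block_scaling:
        # Columnwise-only with 2D scaling: populate amax_smem, write nothing rowwise.
        steps.append("step 2.5: amax-only pass")
    if return_transpose:
        steps.append("step 3: columnwise cast and store")
    return steps


# Columnwise-only 2D case enabled by this PR.
assert kernel_steps(False, True, True) == [
    "step 1: load input to smem",
    "step 2.5: amax-only pass",
    "step 3: columnwise cast and store",
]
```

Step 2 and Step 2.5 never both run for the same flag combination, which matches the review's note that the two blocks are mutually exclusive.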

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

greptile-apps · 2026-05-21T17:46:44Z

    }
  }

+  // Step 2.5: 2D-amax-only pass for columnwise-only mode.


Step label collision with existing substep

The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.

ptrendx · 2026-05-21T22:24:13Z

This is just the fallback kernel being changed. Does the main kernel already support this?

negvet and others added 2 commits May 21, 2026 17:35

Enable colwise only 2d nvfp4

56780d1

Signed-off-by: Evgeny <etsykunov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

61a2387

for more information, see https://pre-commit.ci

negvet wants to merge 2 commits into NVIDIA:main from negvet:nvfp4_2d_colwise_only
