[Common] Enable NVFP4 2D block scaling in columnwise only #3027

Open
negvet wants to merge 2 commits into NVIDIA:main from negvet:nvfp4_2d_colwise_only

negvet commented May 21, 2026

Description

Enabling 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 fwd + NVFP4 bwd on W.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet and others added 2 commits May 21, 2026 17:35
Signed-off-by: Evgeny <etsykunov@nvidia.com>
greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR enables NVFP4 2D block scaling in columnwise-only mode. It removes the early-return guard and the host-side check that previously blocked the (!return_identity, use_2d_quantization) combination. It also adds a new "Step 2.5" amax-only pass inside the CUDA kernel, which re-runs Step 2's 2D warp/smem amax reduction without the rowwise scale/quantize/store writes, so that amax_smem is fully populated before Step 3 (the transpose path) reads it.
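To illustrate why an amax-only pass is enough for the columnwise path, here is a minimal pure-Python reference model. It is a sketch under stated assumptions, not the kernel's logic: the function names and the tiny `tile=2` size are hypothetical, and the real tile size is not stated in this thread. With 2D block scaling, each square tile has a single amax, so the value the transposed output needs can be computed from the untransposed input.

```python
def tile_amax_2d(matrix, tile):
    """Return {(tile_row, tile_col): max |x|} over tile x tile blocks."""
    amax = {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            key = (r // tile, c // tile)
            amax[key] = max(amax.get(key, 0.0), abs(value))
    return amax


def transpose(matrix):
    """Plain list-of-lists transpose."""
    return [list(col) for col in zip(*matrix)]


def columnwise_tile_amax(matrix, tile):
    """Per-tile amax for the transposed output, computed from the
    untransposed input by swapping the tile coordinates."""
    return {(tc, tr): v for (tr, tc), v in tile_amax_2d(matrix, tile).items()}


x = [
    [1.0, -2.0, 0.5, 0.25],
    [0.0, 3.0, -4.0, 1.0],
    [-0.5, 0.5, 8.0, -1.0],
    [2.0, -6.0, 0.0, 0.125],
]

# Amax derived from the input alone (what an amax-only pass would produce).
amax_from_input = columnwise_tile_amax(x, tile=2)
# Amax computed directly on the transposed data, for comparison.
amax_from_transpose = tile_amax_2d(transpose(x), tile=2)
```

In this model the two dictionaries are identical, which is the property that lets the columnwise-only path skip the rowwise scale, quantize, and store work while still reading the same per-tile amax values.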

  • Kernel change (quantize_transpose_vector_blockwise_fp4.cu): The if constexpr (kIs2DBlockScaling && !kReturnIdentity) { return; } early exit and its host-side NVTE_CHECK are removed; a new if constexpr (!kReturnIdentity && kIs2DBlockScaling) block mirrors Step 2's 2D reduction to populate amax_smem via the same warp-shuffle + shared-memory reduction sequence, ensuring bitwise-identical columnwise output compared to the dual-direction path.
  • Test (test_nvfp4_quantize_exact.py): Adds test_nvfp4_2d_columnwise_only_matches_both_directions, a parametrized bitwise-equality test across aligned and padded tile shapes and two input dtypes. It cross-checks the new columnwise-only code path against the reference rowwise+columnwise path and correctly excludes uninitialized padding bytes from the scale tensor comparison.
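The idea of a bitwise comparison that ignores padding can be sketched as follows. This is not the test added in the PR: the helper names, the row-major layout, and the buffer sizes are assumptions made for illustration, since the actual scale tensor layout is not described in this thread.

```python
def valid_region(buf, rows, cols, padded_cols):
    """Extract the logical rows x cols bytes from a row-major padded buffer."""
    return [bytes(buf[r * padded_cols : r * padded_cols + cols]) for r in range(rows)]


def equal_ignoring_padding(a, b, rows, cols, padded_cols):
    """Bitwise equality restricted to the valid (non-padding) bytes."""
    return valid_region(a, rows, cols, padded_cols) == valid_region(b, rows, cols, padded_cols)


# Two scale buffers with 2 rows of 4 bytes each; only the first 3 bytes per
# row are valid. The last byte of each row is padding and holds arbitrary
# (uninitialized) values that differ between the two buffers.
scale_both_directions = bytes([10, 11, 12, 0xAA, 20, 21, 22, 0xBB])
scale_columnwise_only = bytes([10, 11, 12, 0x00, 20, 21, 22, 0xFF])

matches = equal_ignoring_padding(
    scale_both_directions, scale_columnwise_only, rows=2, cols=3, padded_cols=4
)
```

Comparing the raw buffers directly would report a mismatch here even though every valid byte agrees, which is why a comparison of this kind restricts itself to the valid region.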

Confidence Score: 4/5

Safe to merge; the new amax-only pass correctly mirrors the existing Step 2 reduction and is guarded by compile-time template parameters, leaving the existing rowwise and dual-direction paths completely unchanged.

The CUDA kernel change is surgical: the new Step 2.5 block is mutually exclusive with Step 2 at compile time, uses identical warp-shuffle and shared-memory reduction logic, and the final __syncthreads() inside the loop correctly makes amax_smem visible before Step 3 reads it. The bitwise-equality test covers both aligned and padded shapes in two dtypes. The one minor issue is a step-label name collision that has no runtime impact.

transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu — the new Step 2.5 block and surrounding synchronization are the only areas worth a second look.
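The mutual exclusivity of Step 2 and the new Step 2.5 can be modeled with a small Python function. This is a hedged sketch of the control flow as described in this review, not the kernel itself: `kernel_steps` and its parameters are hypothetical stand-ins for the template parameters kReturnIdentity, kReturnTranspose, kIs2DBlockScaling, and kIsE8Scaling, and in the kernel the branches are resolved at compile time rather than at run time.

```python
def kernel_steps(return_identity, return_transpose, is_2d_block_scaling,
                 is_e8_scaling=False):
    """Return the ordered list of steps the kernel would execute."""
    # The remaining early return applies when 2D block scaling is combined
    # with E8 scaling.
    if is_2d_block_scaling and is_e8_scaling:
        return []
    steps = ["step 1: load input to smem"]
    if return_identity:
        # Step 2 already populates amax_smem as part of the rowwise path.
        steps.append("step 2: rowwise cast and store")
    elif is_2d_block_scaling:
        # Columnwise-only 2D mode: populate amax_smem without rowwise writes.
        steps.append("step 2.5: amax-only pass")
    if return_transpose:
        steps.append("step 3: columnwise cast and store")
    return steps


columnwise_only_2d = kernel_steps(
    return_identity=False, return_transpose=True, is_2d_block_scaling=True
)
both_directions_2d = kernel_steps(
    return_identity=True, return_transpose=True, is_2d_block_scaling=True
)
```

Because Step 2 and Step 2.5 sit on opposite sides of the same branch, no configuration in this model runs both, and every 2D configuration that reaches Step 3 has had amax_smem populated by one of them.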

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu | Removes the early-return guard and host-side NVTE_CHECK that blocked columnwise-only 2D quantization, and adds a new "Step 2.5" amax-only pass that mirrors Step 2's 2D warp/smem reduction to populate amax_smem before Step 3 (the transpose path) reads it. Synchronization and reduction logic appear correct; a minor step-label collision exists. |
| tests/pytorch/nvfp4/test_nvfp4_quantize_exact.py | Adds a new parametrized bitwise-equality test covering both aligned and padded tile sizes for torch.float32 and torch.bfloat16, comparing the new columnwise-only 2D path against the existing rowwise+columnwise path. Correctly excludes padding bytes from the scale tensor comparison. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Kernel Entry] --> B{"kIs2DBlockScaling<br/>&& kIsE8Scaling?"}
    B -- yes --> Z[Early return]
    B -- no --> C["Step 1: Load input to smem<br/>__syncthreads"]
    C --> D{kReturnIdentity?}
    D -- yes --> E["Step 2: Cast and store rowwise<br/>2.1 load smem to regs<br/>2.2 local amax<br/>2.3 2D warp+smem reduction to amax_smem<br/>2.4-2.8 scale / quant / store"]
    D -- no --> F{kIs2DBlockScaling?}
    F -- yes --> G["Step 2.5 NEW: amax-only pass<br/>load smem to regs<br/>local amax<br/>2D warp+smem reduction to amax_smem<br/>no scale/quant/store"]
    F -- no --> H{kReturnTranspose?}
    E --> H
    G --> H
    H -- yes --> I["Step 3: Transpose cast and store columnwise<br/>read amax_smem for 2D path<br/>compute scale / quant / store"]
    H -- no --> J[Done]
    I --> J

Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

}
}

// Step 2.5: 2D-amax-only pass for columnwise-only mode.
P2 Step label collision with existing substep

The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.

ptrendx commented May 21, 2026

This is just the fallback kernel being changed. Does the main kernel already support this?
