Skip to content

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1152

Closed
liaopingbo wants to merge 4 commits into
ModelTC:mainfrom
liaopingbo:feature/add-npu-operators
Closed

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1152
liaopingbo wants to merge 4 commits into
ModelTC:mainfrom
liaopingbo:feature/add-npu-operators

Conversation

@liaopingbo

Copy link
Copy Markdown

Add 4 Ascend NPU custom operators for sparse video inference:

  • sparse attention
  • npu_layer_norm
  • npu_rms_norm
  • npu_rope

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Ascend NPU operations, including blockwise sparse attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback highlights several critical issues in the new Rainfusion_blockwise implementation, such as shape mismatch bugs in einops.rearrange dimensions, silent failures for batch sizes greater than 1, redundant tensor concatenations, and unused variables. Additionally, the feedback suggests adding safety checks for None weights/biases in NPU normalization layers, optimizing type casting in mask generation, and expanding the data type mapping in RoPE to prevent potential runtime errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py
Comment thread lightx2v_platform/ops/norm/ascend_npu/npu_layer_norm.py Outdated
Comment thread lightx2v_platform/ops/norm/ascend_npu/npu_rms_norm.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/rope/ascend_npu/npu_rope.py

@liaopingbo liaopingbo left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All modifications have been completed in accordance with the review feedback.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds Ascend NPU support to the platform by implementing NPU-optimized operations, including a blockwise Rainfusion attention module, LayerNorm and RMSNorm layers, and a Rotary Position Embedding (RoPE) module. The review feedback highlights several improvement opportunities: handling potential runtime errors in the attention module when the remaining frame count is odd, refactoring duplicated tensor rearrangement logic, adding defensive checks for null weights in the RMSNorm layer, and casting rotated tensors prior to concatenation in the RoPE module to avoid implicit type promotion overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/norm/ascend_npu/npu_rms_norm.py Outdated
Comment thread lightx2v_platform/ops/rope/ascend_npu/npu_rope.py Outdated
@liaopingbo

Copy link
Copy Markdown
Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds Ascend NPU support to the platform, introducing custom implementations for blockwise Rainfusion attention, Layer/RMS normalization, and Rotary Position Embeddings (RoPE). The feedback identifies several robustness issues in the new attention module, including potential shape mismatches in avgpool for empty sequences, a runtime error in torch.topk when sparsity is 1.0, and inconsistent frame-count calculations during tensor rearrangement. Additionally, caching and improved error handling are recommended for the sensitive layer data type lookup in the RoPE module.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py
Comment thread lightx2v_platform/ops/rope/ascend_npu/npu_rope.py
@liaopingbo liaopingbo marked this pull request as draft June 16, 2026 09:09
@helloyongyang

Copy link
Copy Markdown
Contributor

Please fix the code format error:

pip install ruff pre-commit

pre-commit run --all-files

@liaopingbo

Copy link
Copy Markdown
Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds Ascend NPU platform support by implementing custom operators for blockwise attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback identifies a critical bug where the computed block mask is overwritten with None before returning, which breaks caching. Additionally, the feedback suggests replacing uninitialized torch.empty tensors in pooling to avoid dtype mismatches, using explicit tensor reshaping and clearer dimension names in einops for robustness, adding defensive checks for sequence lengths, and ensuring that precision casting correctly respects the configured sensitive_layer_dtype across the normalization and RoPE modules.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +255 to +256
x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)
base_blockmask = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Setting base_blockmask = None right before returning completely overwrites the computed block mask, meaning the caller will always receive None and can never cache or reuse the mask. This defeats the caching mechanism and causes redundant mask computations on every step.

Suggested change
x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)
base_blockmask = None
x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)

Comment on lines +47 to +66
def avgpool(self, input_tensor, pool_size=128): # BSND in, BSND out
batch, seqlen, headnum, dim = input_tensor.shape

num_full_blocks = seqlen // pool_size
tail_size = seqlen % pool_size

if num_full_blocks > 0:
full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :]
full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
full_pooled = full_blocks_reshaped.mean(dim=2)
else:
full_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device)
if tail_size > 0:
tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :]
tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
tail_pooled = tail_reshaped.mean(dim=2)
else:
tail_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device)

return torch.cat([full_pooled, tail_pooled], dim=1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The avgpool implementation uses torch.empty to create uninitialized tensors when num_full_blocks or tail_size is 0. This can lead to device/dtype mismatches during concatenation (since dtype is not specified for torch.empty and defaults to float32). Simplifying this to use a list of active pooled tensors avoids uninitialized memory and potential dtype mismatches.

    def avgpool(self, input_tensor, pool_size=128):  # BSND in,  BSND out
        batch, seqlen, headnum, dim = input_tensor.shape

        num_full_blocks = seqlen // pool_size
        tail_size = seqlen % pool_size

        pooled_tensors = []
        if num_full_blocks > 0:
            full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :]
            full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
            pooled_tensors.append(full_blocks_reshaped.mean(dim=2))
        if tail_size > 0:
            tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :]
            tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
            pooled_tensors.append(tail_reshaped.mean(dim=2))

        return torch.cat(pooled_tensors, dim=1)

batch_size = mask_reshaped.shape[0]

# 2. 生成行索引 标记False位置为S(大于所有有效索引)
row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using .expand(batch_size, S, -1) on a 1D tensor of shape (S,) relies on implicit dimension prepending and can be fragile across different PyTorch versions or backends. Explicitly reshaping with .view(1, 1, S) before expanding is much more robust and readable.

Suggested change
row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S)
row_indices = torch.arange(S, device=device).view(1, 1, S).expand(batch_size, S, S) # (B*N, S, S)

Comment on lines +126 to +127
tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.

Suggested change
tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)
tensor_hwt = rearrange(tensor_hwt, 'bn (fn fb) (hn hb) (wn wb) d -> bn (fn hn wn fb hb wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

Comment on lines +145 to +146
tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.

Suggested change
tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)
tensor_hwt = rearrange(tensor_hwt, 'bn (fn hn wn fb hb wb) d -> bn (fn fb) (hn hb) (wn wb) d',
fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

Comment on lines +159 to +165
if self.txt_first:
tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,
s - self.text_len - self.first_frame_len], dim=1)
else:
tensor_f, tensor_i, tensor_t = torch.split(tensor,
[self.first_frame_len, s - self.text_len - self.first_frame_len,
self.text_len], dim=1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the sequence length s is smaller than self.text_len + self.first_frame_len, torch.split will receive a negative split size and raise a cryptic PyTorch runtime error. Adding a defensive check prevents this.

Suggested change
if self.txt_first:
tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,
s - self.text_len - self.first_frame_len], dim=1)
else:
tensor_f, tensor_i, tensor_t = torch.split(tensor,
[self.first_frame_len, s - self.text_len - self.first_frame_len,
self.text_len], dim=1)
if s < self.text_len + self.first_frame_len:
raise ValueError(f"Sequence length {s} is too small for text_len {self.text_len} and first_frame_len {self.first_frame_len}")
if self.txt_first:
tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,
s - self.text_len - self.first_frame_len], dim=1)
else:
tensor_f, tensor_i, tensor_t = torch.split(tensor,
[self.first_frame_len, s - self.text_len - self.first_frame_len,
self.text_len], dim=1)

Comment on lines +20 to +24
if self.sensitive_layer_dtype != self.infer_dtype:
output_tensor, _ = torch_npu.npu_rms_norm(
input_tensor.float(), weight.float(), self.eps,
)
return output_tensor.to(self.infer_dtype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.

Suggested change
if self.sensitive_layer_dtype != self.infer_dtype:
output_tensor, _ = torch_npu.npu_rms_norm(
input_tensor.float(), weight.float(), self.eps,
)
return output_tensor.to(self.infer_dtype)
if self.sensitive_layer_dtype != self.infer_dtype:
output_tensor, _ = torch_npu.npu_rms_norm(
input_tensor.to(self.sensitive_layer_dtype),
weight.to(self.sensitive_layer_dtype),
self.eps,
)
return output_tensor.to(self.infer_dtype)

Comment on lines +18 to +28
if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None:
out = torch_npu.npu_layer_norm(
input_tensor,
(input_tensor.shape[-1],),
self.weight,
self.bias,
self.eps,
)
if isinstance(out, tuple):
out = out[0]
return out

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

NpuLayerNormWeight does not respect self.sensitive_layer_dtype when it differs from self.infer_dtype, unlike NpuRmsNormWeight and NpuRope. It should be updated to cast inputs, weight, and bias to self.sensitive_layer_dtype for consistency and correctness.

        if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None:
            if self.sensitive_layer_dtype != self.infer_dtype:
                out = torch_npu.npu_layer_norm(
                    input_tensor.to(self.sensitive_layer_dtype),
                    (input_tensor.shape[-1],),
                    self.weight.to(self.sensitive_layer_dtype),
                    self.bias.to(self.sensitive_layer_dtype),
                    self.eps,
                )
                if isinstance(out, tuple):
                    out = out[0]
                return out.to(self.infer_dtype)
            out = torch_npu.npu_layer_norm(
                input_tensor,
                (input_tensor.shape[-1],),
                self.weight,
                self.bias,
                self.eps,
            )
            if isinstance(out, tuple):
                out = out[0]
            return out

Comment on lines +68 to +72
if self.sensitive_layer_dtype != self.infer_dtype:
xq_part = xq_part.float()
xk_part = xk_part.float()
cos = cos.float() if cos.dtype != torch.float32 else cos
sin = sin.float() if sin.dtype != torch.float32 else sin

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.

Suggested change
if self.sensitive_layer_dtype != self.infer_dtype:
xq_part = xq_part.float()
xk_part = xk_part.float()
cos = cos.float() if cos.dtype != torch.float32 else cos
sin = sin.float() if sin.dtype != torch.float32 else sin
if self.sensitive_layer_dtype != self.infer_dtype:
xq_part = xq_part.to(self.sensitive_layer_dtype)
xk_part = xk_part.to(self.sensitive_layer_dtype)
cos = cos.to(self.sensitive_layer_dtype)
sin = sin.to(self.sensitive_layer_dtype)

@liaopingbo liaopingbo closed this Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants