feat(npu): Add 4 Ascend NPU custom operators for sparse video inference by liaopingbo · Pull Request #1152 · ModelTC/LightX2V

liaopingbo · 2026-06-16T07:11:46Z

Add 4 Ascend NPU custom operators for sparse video inference:

sparse attention
npu_layer_norm
npu_rms_norm
npu_rope

gemini-code-assist

Code Review

This pull request introduces support for Ascend NPU operations, including blockwise sparse attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback highlights several critical issues in the new Rainfusion_blockwise implementation, such as shape mismatch bugs in einops.rearrange dimensions, silent failures for batch sizes greater than 1, redundant tensor concatenations, and unused variables. Additionally, the feedback suggests adding safety checks for None weights/biases in NPU normalization layers, optimizing type casting in mask generation, and expanding the data type mapping in RoPE to prevent potential runtime errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

liaopingbo

All modifications have been completed in accordance with the review feedback.

gemini-code-assist

Code Review

This pull request adds Ascend NPU support to the platform by implementing NPU-optimized operations, including a blockwise Rainfusion attention module, LayerNorm and RMSNorm layers, and a Rotary Position Embedding (RoPE) module. The review feedback highlights several improvement opportunities: handling potential runtime errors in the attention module when the remaining frame count is odd, refactoring duplicated tensor rearrangement logic, adding defensive checks for null weights in the RMSNorm layer, and casting rotated tensors prior to concatenation in the RoPE module to avoid implicit type promotion overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

liaopingbo · 2026-06-16T08:47:28Z

/gemini review

gemini-code-assist

Code Review

This pull request adds Ascend NPU support to the platform, introducing custom implementations for blockwise Rainfusion attention, Layer/RMS normalization, and Rotary Position Embeddings (RoPE). The feedback identifies several robustness issues in the new attention module, including potential shape mismatches in avgpool for empty sequences, a runtime error in torch.topk when sparsity is 1.0, and inconsistent frame-count calculations during tensor rearrangement. Additionally, caching and improved error handling are recommended for the sensitive layer data type lookup in the RoPE module.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

helloyongyang · 2026-06-16T09:16:14Z

Please fix the code format error：

pip install ruff pre-commit

pre-commit run --all-files

liaopingbo · 2026-06-16T09:19:41Z

/gemini review

gemini-code-assist

Code Review

This pull request adds Ascend NPU platform support by implementing custom operators for blockwise attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback identifies a critical bug where the computed block mask is overwritten with None before returning, which breaks caching. Additionally, the feedback suggests replacing uninitialized torch.empty tensors in pooling to avoid dtype mismatches, using explicit tensor reshaping and clearer dimension names in einops for robustness, adding defensive checks for sequence lengths, and ensuring that precision casting correctly respects the configured sensitive_layer_dtype across the normalization and RoPE modules.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-16T09:22:14Z

+            x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)
+            base_blockmask = None


Setting base_blockmask = None right before returning completely overwrites the computed block mask, meaning the caller will always receive None and can never cache or reuse the mask. This defeats the caching mechanism and causes redundant mask computations on every step.

Suggested change

x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)

base_blockmask = None

x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)

gemini-code-assist · 2026-06-16T09:22:14Z

+    def avgpool(self, input_tensor, pool_size=128):  # BSND in,  BSND out
+        batch, seqlen, headnum, dim = input_tensor.shape
+
+        num_full_blocks = seqlen // pool_size
+        tail_size = seqlen % pool_size
+
+        if num_full_blocks > 0:
+            full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :]
+            full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
+            full_pooled = full_blocks_reshaped.mean(dim=2)
+        else:
+            full_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device)
+        if tail_size > 0:
+            tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :]
+            tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
+            tail_pooled = tail_reshaped.mean(dim=2)
+        else:
+            tail_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device)
+
+        return torch.cat([full_pooled, tail_pooled], dim=1)


The avgpool implementation uses torch.empty to create uninitialized tensors when num_full_blocks or tail_size is 0. This can lead to device/dtype mismatches during concatenation (since dtype is not specified for torch.empty and defaults to float32). Simplifying this to use a list of active pooled tensors avoids uninitialized memory and potential dtype mismatches.

def avgpool(self, input_tensor, pool_size=128): # BSND in, BSND out batch, seqlen, headnum, dim = input_tensor.shape num_full_blocks = seqlen // pool_size tail_size = seqlen % pool_size pooled_tensors = [] if num_full_blocks > 0: full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :] full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim) pooled_tensors.append(full_blocks_reshaped.mean(dim=2)) if tail_size > 0: tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :] tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim) pooled_tensors.append(tail_reshaped.mean(dim=2)) return torch.cat(pooled_tensors, dim=1)

gemini-code-assist · 2026-06-16T09:22:14Z

+        batch_size = mask_reshaped.shape[0]
+
+        # 2. 生成行索引  标记False位置为S（大于所有有效索引）
+        row_indices = torch.arange(S, device=device).expand(batch_size, S, -1)  # (B*N, S, S)


Using .expand(batch_size, S, -1) on a 1D tensor of shape (S,) relies on implicit dimension prepending and can be fragile across different PyTorch versions or backends. Explicitly reshaping with .view(1, 1, S) before expanding is much more robust and readable.

Suggested change

row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S)

row_indices = torch.arange(S, device=device).view(1, 1, S).expand(batch_size, S, S) # (B*N, S, S)

gemini-code-assist · 2026-06-16T09:22:14Z

+        tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d',
+                               fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)


Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.

Suggested change

tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d',

fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

tensor_hwt = rearrange(tensor_hwt, 'bn (fn fb) (hn hb) (wn wb) d -> bn (fn hn wn fb hb wb) d',

fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

gemini-code-assist · 2026-06-16T09:22:14Z

+        tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d',
+                               fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)


Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.

Suggested change

tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d',

fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

tensor_hwt = rearrange(tensor_hwt, 'bn (fn hn wn fb hb wb) d -> bn (fn fb) (hn hb) (wn wb) d',

fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

gemini-code-assist · 2026-06-16T09:22:14Z

+        if self.txt_first:
+            tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,
+                                                                s - self.text_len - self.first_frame_len], dim=1)
+        else:
+            tensor_f, tensor_i, tensor_t = torch.split(tensor,
+                                                       [self.first_frame_len, s - self.text_len - self.first_frame_len,
+                                                        self.text_len], dim=1)


If the sequence length s is smaller than self.text_len + self.first_frame_len, torch.split will receive a negative split size and raise a cryptic PyTorch runtime error. Adding a defensive check prevents this.

Suggested change

if self.txt_first:

tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,

s - self.text_len - self.first_frame_len], dim=1)

else:

tensor_f, tensor_i, tensor_t = torch.split(tensor,

[self.first_frame_len, s - self.text_len - self.first_frame_len,

self.text_len], dim=1)

if s < self.text_len + self.first_frame_len:

raise ValueError(f"Sequence length {s} is too small for text_len {self.text_len} and first_frame_len {self.first_frame_len}")

if self.txt_first:

tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len,

s - self.text_len - self.first_frame_len], dim=1)

else:

tensor_f, tensor_i, tensor_t = torch.split(tensor,

[self.first_frame_len, s - self.text_len - self.first_frame_len,

self.text_len], dim=1)

gemini-code-assist · 2026-06-16T09:22:15Z

+            if self.sensitive_layer_dtype != self.infer_dtype:
+                output_tensor, _ = torch_npu.npu_rms_norm(
+                    input_tensor.float(), weight.float(), self.eps,
+                )
+                return output_tensor.to(self.infer_dtype)


When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.

Suggested change

if self.sensitive_layer_dtype != self.infer_dtype:

output_tensor, _ = torch_npu.npu_rms_norm(

input_tensor.float(), weight.float(), self.eps,

)

return output_tensor.to(self.infer_dtype)

if self.sensitive_layer_dtype != self.infer_dtype:

output_tensor, _ = torch_npu.npu_rms_norm(

input_tensor.to(self.sensitive_layer_dtype),

weight.to(self.sensitive_layer_dtype),

self.eps,

)

return output_tensor.to(self.infer_dtype)

gemini-code-assist · 2026-06-16T09:22:15Z

+        if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None:
+            out = torch_npu.npu_layer_norm(
+                input_tensor,
+                (input_tensor.shape[-1],),
+                self.weight,
+                self.bias,
+                self.eps,
+            )
+            if isinstance(out, tuple):
+                out = out[0]
+            return out


NpuLayerNormWeight does not respect self.sensitive_layer_dtype when it differs from self.infer_dtype, unlike NpuRmsNormWeight and NpuRope. It should be updated to cast inputs, weight, and bias to self.sensitive_layer_dtype for consistency and correctness.

if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None: if self.sensitive_layer_dtype != self.infer_dtype: out = torch_npu.npu_layer_norm( input_tensor.to(self.sensitive_layer_dtype), (input_tensor.shape[-1],), self.weight.to(self.sensitive_layer_dtype), self.bias.to(self.sensitive_layer_dtype), self.eps, ) if isinstance(out, tuple): out = out[0] return out.to(self.infer_dtype) out = torch_npu.npu_layer_norm( input_tensor, (input_tensor.shape[-1],), self.weight, self.bias, self.eps, ) if isinstance(out, tuple): out = out[0] return out

gemini-code-assist · 2026-06-16T09:22:15Z

+            if self.sensitive_layer_dtype != self.infer_dtype:
+                xq_part = xq_part.float()
+                xk_part = xk_part.float()
+                cos = cos.float() if cos.dtype != torch.float32 else cos
+                sin = sin.float() if sin.dtype != torch.float32 else sin


When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.

Suggested change

if self.sensitive_layer_dtype != self.infer_dtype:

xq_part = xq_part.float()

xk_part = xk_part.float()

cos = cos.float() if cos.dtype != torch.float32 else cos

sin = sin.float() if sin.dtype != torch.float32 else sin

if self.sensitive_layer_dtype != self.infer_dtype:

xq_part = xq_part.to(self.sensitive_layer_dtype)

xk_part = xk_part.to(self.sensitive_layer_dtype)

cos = cos.to(self.sensitive_layer_dtype)

sin = sin.to(self.sensitive_layer_dtype)

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference

d29185f

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference

273957e

liaopingbo commented Jun 16, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference

b7fb73d

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

liaopingbo marked this pull request as draft June 16, 2026 09:09

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference

5f78da6

gemini-code-assist Bot reviewed Jun 16, 2026

View reviewed changes

liaopingbo closed this Jun 16, 2026

		x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len)
		base_blockmask = None

	row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S)
	row_indices = torch.arange(S, device=device).view(1, 1, S).expand(batch_size, S, S) # (B*N, S, S)

		tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d',
		fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8)

Conversation

liaopingbo commented Jun 16, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

liaopingbo left a comment

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

liaopingbo commented Jun 16, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

helloyongyang commented Jun 16, 2026

Uh oh!

liaopingbo commented Jun 16, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!