feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1152
feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1152liaopingbo wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for Ascend NPU operations, including blockwise sparse attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback highlights several critical issues in the new Rainfusion_blockwise implementation, such as shape mismatch bugs in einops.rearrange dimensions, silent failures for batch sizes greater than 1, redundant tensor concatenations, and unused variables. Additionally, the feedback suggests adding safety checks for None weights/biases in NPU normalization layers, optimizing type casting in mask generation, and expanding the data type mapping in RoPE to prevent potential runtime errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
liaopingbo
left a comment
There was a problem hiding this comment.
All modifications have been completed in accordance with the review feedback.
There was a problem hiding this comment.
Code Review
This pull request adds Ascend NPU support to the platform by implementing NPU-optimized operations, including a blockwise Rainfusion attention module, LayerNorm and RMSNorm layers, and a Rotary Position Embedding (RoPE) module. The review feedback highlights several improvement opportunities: handling potential runtime errors in the attention module when the remaining frame count is odd, refactoring duplicated tensor rearrangement logic, adding defensive checks for null weights in the RMSNorm layer, and casting rotated tensors prior to concatenation in the RoPE module to avoid implicit type promotion overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds Ascend NPU support to the platform, introducing custom implementations for blockwise Rainfusion attention, Layer/RMS normalization, and Rotary Position Embeddings (RoPE). The feedback identifies several robustness issues in the new attention module, including potential shape mismatches in avgpool for empty sequences, a runtime error in torch.topk when sparsity is 1.0, and inconsistent frame-count calculations during tensor rearrangement. Additionally, caching and improved error handling are recommended for the sensitive layer data type lookup in the RoPE module.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Please fix the code format error: |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds Ascend NPU platform support by implementing custom operators for blockwise attention (Rainfusion), layer normalization, RMS normalization, and rotary position embeddings (RoPE). The review feedback identifies a critical bug where the computed block mask is overwritten with None before returning, which breaks caching. Additionally, the feedback suggests replacing uninitialized torch.empty tensors in pooling to avoid dtype mismatches, using explicit tensor reshaping and clearer dimension names in einops for robustness, adding defensive checks for sequence lengths, and ensuring that precision casting correctly respects the configured sensitive_layer_dtype across the normalization and RoPE modules.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len) | ||
| base_blockmask = None |
There was a problem hiding this comment.
Setting base_blockmask = None right before returning completely overwrites the computed block mask, meaning the caller will always receive None and can never cache or reuse the mask. This defeats the caching mechanism and causes redundant mask computations on every step.
| x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len) | |
| base_blockmask = None | |
| x = self.do_tensor_inv_rearrange(x, h_res_len, w_res_len) |
| def avgpool(self, input_tensor, pool_size=128): # BSND in, BSND out | ||
| batch, seqlen, headnum, dim = input_tensor.shape | ||
|
|
||
| num_full_blocks = seqlen // pool_size | ||
| tail_size = seqlen % pool_size | ||
|
|
||
| if num_full_blocks > 0: | ||
| full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :] | ||
| full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim) | ||
| full_pooled = full_blocks_reshaped.mean(dim=2) | ||
| else: | ||
| full_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device) | ||
| if tail_size > 0: | ||
| tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :] | ||
| tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim) | ||
| tail_pooled = tail_reshaped.mean(dim=2) | ||
| else: | ||
| tail_pooled = torch.empty(batch, 0, headnum, dim, device=input_tensor.device) | ||
|
|
||
| return torch.cat([full_pooled, tail_pooled], dim=1) |
There was a problem hiding this comment.
The avgpool implementation uses torch.empty to create uninitialized tensors when num_full_blocks or tail_size is 0. This can lead to device/dtype mismatches during concatenation (since dtype is not specified for torch.empty and defaults to float32). Simplifying this to use a list of active pooled tensors avoids uninitialized memory and potential dtype mismatches.
def avgpool(self, input_tensor, pool_size=128): # BSND in, BSND out
batch, seqlen, headnum, dim = input_tensor.shape
num_full_blocks = seqlen // pool_size
tail_size = seqlen % pool_size
pooled_tensors = []
if num_full_blocks > 0:
full_blocks = input_tensor[:, :num_full_blocks * pool_size, :, :]
full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
pooled_tensors.append(full_blocks_reshaped.mean(dim=2))
if tail_size > 0:
tail_block = input_tensor[:, num_full_blocks * pool_size:, :, :]
tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
pooled_tensors.append(tail_reshaped.mean(dim=2))
return torch.cat(pooled_tensors, dim=1)| batch_size = mask_reshaped.shape[0] | ||
|
|
||
| # 2. 生成行索引 标记False位置为S(大于所有有效索引) | ||
| row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S) |
There was a problem hiding this comment.
Using .expand(batch_size, S, -1) on a 1D tensor of shape (S,) relies on implicit dimension prepending and can be fragile across different PyTorch versions or backends. Explicitly reshaping with .view(1, 1, S) before expanding is much more robust and readable.
| row_indices = torch.arange(S, device=device).expand(batch_size, S, -1) # (B*N, S, S) | |
| row_indices = torch.arange(S, device=device).view(1, 1, S).expand(batch_size, S, S) # (B*N, S, S) |
| tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d', | ||
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) |
There was a problem hiding this comment.
Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.
| tensor_hwt = rearrange(tensor_hwt, 'b (fn fb) (hn hb) (wn wb) d -> b (fn hn wn fb hb wb) d', | |
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) | |
| tensor_hwt = rearrange(tensor_hwt, 'bn (fn fb) (hn hb) (wn wb) d -> bn (fn hn wn fb hb wb) d', | |
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) |
| tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d', | ||
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) |
There was a problem hiding this comment.
Using b as the first dimension name in the einops pattern is highly confusing because b in the outer scope represents the batch size, whereas the actual size of the first dimension here is b * n. Using a distinct name like bn avoids any confusion.
| tensor_hwt = rearrange(tensor_hwt, 'b (fn hn wn fb hb wb) d -> b (fn fb) (hn hb) (wn wb) d', | |
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) | |
| tensor_hwt = rearrange(tensor_hwt, 'bn (fn hn wn fb hb wb) d -> bn (fn fb) (hn hb) (wn wb) d', | |
| fn=remaining_frames // 2, fb=2, hb=8, wb=8, hn=h // 8, wn=w // 8) |
| if self.txt_first: | ||
| tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len, | ||
| s - self.text_len - self.first_frame_len], dim=1) | ||
| else: | ||
| tensor_f, tensor_i, tensor_t = torch.split(tensor, | ||
| [self.first_frame_len, s - self.text_len - self.first_frame_len, | ||
| self.text_len], dim=1) |
There was a problem hiding this comment.
If the sequence length s is smaller than self.text_len + self.first_frame_len, torch.split will receive a negative split size and raise a cryptic PyTorch runtime error. Adding a defensive check prevents this.
| if self.txt_first: | |
| tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len, | |
| s - self.text_len - self.first_frame_len], dim=1) | |
| else: | |
| tensor_f, tensor_i, tensor_t = torch.split(tensor, | |
| [self.first_frame_len, s - self.text_len - self.first_frame_len, | |
| self.text_len], dim=1) | |
| if s < self.text_len + self.first_frame_len: | |
| raise ValueError(f"Sequence length {s} is too small for text_len {self.text_len} and first_frame_len {self.first_frame_len}") | |
| if self.txt_first: | |
| tensor_t, tensor_f, tensor_i = torch.split(tensor, [self.text_len, self.first_frame_len, | |
| s - self.text_len - self.first_frame_len], dim=1) | |
| else: | |
| tensor_f, tensor_i, tensor_t = torch.split(tensor, | |
| [self.first_frame_len, s - self.text_len - self.first_frame_len, | |
| self.text_len], dim=1) |
| if self.sensitive_layer_dtype != self.infer_dtype: | ||
| output_tensor, _ = torch_npu.npu_rms_norm( | ||
| input_tensor.float(), weight.float(), self.eps, | ||
| ) | ||
| return output_tensor.to(self.infer_dtype) |
There was a problem hiding this comment.
When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.
| if self.sensitive_layer_dtype != self.infer_dtype: | |
| output_tensor, _ = torch_npu.npu_rms_norm( | |
| input_tensor.float(), weight.float(), self.eps, | |
| ) | |
| return output_tensor.to(self.infer_dtype) | |
| if self.sensitive_layer_dtype != self.infer_dtype: | |
| output_tensor, _ = torch_npu.npu_rms_norm( | |
| input_tensor.to(self.sensitive_layer_dtype), | |
| weight.to(self.sensitive_layer_dtype), | |
| self.eps, | |
| ) | |
| return output_tensor.to(self.infer_dtype) |
| if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None: | ||
| out = torch_npu.npu_layer_norm( | ||
| input_tensor, | ||
| (input_tensor.shape[-1],), | ||
| self.weight, | ||
| self.bias, | ||
| self.eps, | ||
| ) | ||
| if isinstance(out, tuple): | ||
| out = out[0] | ||
| return out |
There was a problem hiding this comment.
NpuLayerNormWeight does not respect self.sensitive_layer_dtype when it differs from self.infer_dtype, unlike NpuRmsNormWeight and NpuRope. It should be updated to cast inputs, weight, and bias to self.sensitive_layer_dtype for consistency and correctness.
if torch_npu is not None and hasattr(torch_npu, 'npu_layer_norm') and self.weight is not None and self.bias is not None:
if self.sensitive_layer_dtype != self.infer_dtype:
out = torch_npu.npu_layer_norm(
input_tensor.to(self.sensitive_layer_dtype),
(input_tensor.shape[-1],),
self.weight.to(self.sensitive_layer_dtype),
self.bias.to(self.sensitive_layer_dtype),
self.eps,
)
if isinstance(out, tuple):
out = out[0]
return out.to(self.infer_dtype)
out = torch_npu.npu_layer_norm(
input_tensor,
(input_tensor.shape[-1],),
self.weight,
self.bias,
self.eps,
)
if isinstance(out, tuple):
out = out[0]
return out| if self.sensitive_layer_dtype != self.infer_dtype: | ||
| xq_part = xq_part.float() | ||
| xk_part = xk_part.float() | ||
| cos = cos.float() if cos.dtype != torch.float32 else cos | ||
| sin = sin.float() if sin.dtype != torch.float32 else sin |
There was a problem hiding this comment.
When self.sensitive_layer_dtype != self.infer_dtype, the code casts the inputs to .float() (which is float32) instead of self.sensitive_layer_dtype. To respect the configured sensitive layer dtype (which could be float16 or bfloat16), cast to self.sensitive_layer_dtype instead.
| if self.sensitive_layer_dtype != self.infer_dtype: | |
| xq_part = xq_part.float() | |
| xk_part = xk_part.float() | |
| cos = cos.float() if cos.dtype != torch.float32 else cos | |
| sin = sin.float() if sin.dtype != torch.float32 else sin | |
| if self.sensitive_layer_dtype != self.infer_dtype: | |
| xq_part = xq_part.to(self.sensitive_layer_dtype) | |
| xk_part = xk_part.to(self.sensitive_layer_dtype) | |
| cos = cos.to(self.sensitive_layer_dtype) | |
| sin = sin.to(self.sensitive_layer_dtype) |
Add 4 Ascend NPU custom operators for sparse video inference: