Skip to content

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1157

Open
liaopingbo wants to merge 7 commits into
ModelTC:mainfrom
liaopingbo:feature/add-npu-operators
Open

feat(npu): Add 4 Ascend NPU custom operators for sparse video inference#1157
liaopingbo wants to merge 7 commits into
ModelTC:mainfrom
liaopingbo:feature/add-npu-operators

Conversation

@liaopingbo

Copy link
Copy Markdown

Add 4 Ascend NPU custom operators for sparse video inference:

  • sparse attention
  • npu_layer_norm
  • npu_rms_norm
  • npu_rope

@liaopingbo liaopingbo marked this pull request as ready for review June 16, 2026 10:03

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for the Ascend NPU platform by implementing several key operators, including a blockwise Rainfusion attention module, NPU-specific LayerNorm and RMSNorm layers, and an NPU Rotary Position Embedding (RoPE) implementation. The feedback highlights several critical issues in the Rainfusion_blockwise module, such as a potential runtime error from calling .view() on a transposed tensor, an inaccurate calculation of protect_len when sequence lengths are not multiples of the pool size, and the presence of unused code (do_tensor_pooling). Additionally, using torch.cat instead of torch.concat is recommended for consistency.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Ascend NPU operations, including the implementation of Rainfusion blockwise attention, NPU layer normalization, NPU RMS normalization, and NPU rotary position embedding (RoPE). The review feedback highlights critical issues where calling .view() on non-contiguous tensors (resulting from transpose or slicing operations) can cause runtime errors, suggesting the use of .reshape() instead. Additionally, it is recommended to replace torch.concat with torch.cat across the codebase for consistency and compatibility with older PyTorch versions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
Comment thread lightx2v_platform/ops/attn/ascend_npu/rainfusion_blockwise.py Outdated
@liaopingbo liaopingbo marked this pull request as draft June 16, 2026 10:21
@liaopingbo

Copy link
Copy Markdown
Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Ascend NPU platform-specific operations, including Rainfusion blockwise attention, layer and RMS normalization, and rotary position embeddings. The review feedback highlights several safety and robustness improvements: replacing .view() with .reshape() on sliced tensors to avoid non-contiguous memory crashes, ensuring tensors are contiguous before calling torch.view_as_complex, and replacing runtime validation assert statements with explicit ValueError exceptions to prevent them from being optimized away.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +51 to +58
if num_full_blocks > 0:
full_blocks = input_tensor[:, : num_full_blocks * pool_size, :, :]
full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
pooled_tensors.append(full_blocks_reshaped.mean(dim=2))
if tail_size > 0:
tail_block = input_tensor[:, num_full_blocks * pool_size :, :, :]
tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
pooled_tensors.append(tail_reshaped.mean(dim=2))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using .view() on sliced tensors (like full_blocks and tail_block) can cause runtime crashes if the tensor is not contiguous in memory. Slicing along the sequence dimension often results in non-contiguous memory layouts. It is safer to use .reshape() instead of .view(), as .reshape() automatically handles non-contiguous tensors by copying them if necessary.

Suggested change
if num_full_blocks > 0:
full_blocks = input_tensor[:, : num_full_blocks * pool_size, :, :]
full_blocks_reshaped = full_blocks.view(batch, num_full_blocks, pool_size, headnum, dim)
pooled_tensors.append(full_blocks_reshaped.mean(dim=2))
if tail_size > 0:
tail_block = input_tensor[:, num_full_blocks * pool_size :, :, :]
tail_reshaped = tail_block.view(batch, 1, tail_size, headnum, dim)
pooled_tensors.append(tail_reshaped.mean(dim=2))
if num_full_blocks > 0:
full_blocks = input_tensor[:, : num_full_blocks * pool_size, :, :]
full_blocks_reshaped = full_blocks.reshape(batch, num_full_blocks, pool_size, headnum, dim)
pooled_tensors.append(full_blocks_reshaped.mean(dim=2))
if tail_size > 0:
tail_block = input_tensor[:, num_full_blocks * pool_size :, :, :]
tail_reshaped = tail_block.reshape(batch, 1, tail_size, headnum, dim)
pooled_tensors.append(tail_reshaped.mean(dim=2))

Comment on lines +46 to +47
xq_fp32 = torch.view_as_complex(xq[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2))
xk_fp32 = torch.view_as_complex(xk[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

torch.view_as_complex requires the input tensor to be contiguous in memory. If xq or xk is already in torch.float32 and is non-contiguous, .to(torch.float32) will be a no-op, and the subsequent .reshape() may return a non-contiguous tensor, causing a runtime crash. Calling .contiguous() before torch.view_as_complex ensures safety.

Suggested change
xq_fp32 = torch.view_as_complex(xq[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2))
xk_fp32 = torch.view_as_complex(xk[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2))
xq_fp32 = torch.view_as_complex(xq[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2).contiguous())
xk_fp32 = torch.view_as_complex(xk[:seq_len].to(torch.float32).reshape(seq_len, n, -1, 2).contiguous())

x = attention_forward(q, k, v, opt_mode="manual", op_type="ascend_laser_attention", layout="BNSD")
else:
batch, qSeqlen, numHeads, headDim = q.shape
assert batch == 1, "Rainfusion_blockwise currently only supports batch size 1."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert statements for runtime input validation is discouraged because they can be optimized away when Python is run with the -O (optimize) flag. This would bypass the batch size check and could lead to unexpected behavior or crashes later in the execution. Please replace this with an explicit if check and raise a ValueError.

Suggested change
assert batch == 1, "Rainfusion_blockwise currently only supports batch size 1."
if batch != 1:
raise ValueError("Rainfusion_blockwise currently only supports batch size 1.")

@helloyongyang helloyongyang marked this pull request as ready for review June 16, 2026 10:35
@helloyongyang

Copy link
Copy Markdown
Contributor

你好,感谢你的贡献。

请问,你的Rainfusion_blockwise是用来做什么的?它的格式没有和lightx2v的attention格式对齐(比如:https://github.com/ModelTC/LightX2V/blob/main/lightx2v/common/ops/attn/sla_attn.py)
,也没有看到pr中有任何使用到它的地方。

@liaopingbo

liaopingbo commented Jun 16, 2026 via email

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants