Skip to content

[CPU:Perf] Adapt CommonOptFunction for RVV architecture#4426

Open
jxgxxx wants to merge 1 commit into
alibaba:masterfrom
jxgxxx:rvv-CommonOptFunction
Open

[CPU:Perf] Adapt CommonOptFunction for RVV architecture#4426
jxgxxx wants to merge 1 commit into
alibaba:masterfrom
jxgxxx:rvv-CommonOptFunction

Conversation

@jxgxxx

@jxgxxx jxgxxx commented May 6, 2026

Copy link
Copy Markdown

Description

This PR implements the RISC-V Vector (RVV) adaptation for core operators in CommonOptFunction.

Accuracy Validation

  • Verified that the outputs of all RVV-adapted functions strictly match the original C++ implementations.

Performance Metrics

The performance was evaluated on a remote RISC-V server. Profiling was conducted using perf for each individual function.

Test Environment & Parameters:

  • Hardware: SG2044 64-Core RISC-V 64-bit
  • OS: PolyOS Server 24.03 LTS
Function / Operator Input Dimensions (M, K, N / e, l, h) Baseline (C++) RVV Optimized Speedup
MNNPackedMatMulFP32 M=16, K=256, N=64 1355.11ms 165.08ms 8.21x
generalIm2col e=64, l=1024, pack=16 114.56 ms 12.27 ms 9.33x
MNNDynamicUpdateConvBiasScale OC=65536 (ocQuad=16384) 510.58 ms 400.24 ms 1.23x
MNNPackedMatMulRemainFP32 M=15, K=256, N=64 613.35 ms 105.72 ms 5.80x
MNNPackForMatMul_B N=256, K=256 (h=256, ic=256) 3390.81 ms 714.80 ms 4.74x
MNNQuantScaleFP32 batch=100000, thread=4 2373.37 ms 323.82 ms 7.33x

(Note: Data represents average execution time across 5,000-20,000 iterations.)

Module

CPU / RVV

Type

  • Feature
  • Bugfix
  • Perf
  • Refact
  • Style
  • Doc
  • Test
  • Chore

Checklist

  • Commit message follows [Module:Type] Description format
  • Code compiles without errors
  • Tested on relevant platform(s)
  • No unrelated format or style changes included

@CLAassistant

CLAassistant commented May 6, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@wangzhaode wangzhaode self-assigned this May 7, 2026

@wangzhaode wangzhaode left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work on RVV optimization — the performance numbers look impressive! Before we can merge this, there are a few issues that need to be addressed:

1. Duplicate symbol: MNNPackC4ForMatMul_A (build blocker)
The new file MNNPackC4ForMatMul_A_RVV.cpp defines MNNPackC4ForMatMul_A, which is already defined in the existing MNNPackC4ForMatMul_A.cpp. Since CMakeLists.txt uses FILE(GLOB ...) to compile all .cpp files in the directory, this will cause a linker error due to duplicate symbols. Please either replace the old file or rename the new function.

2. Missing framework integration
The new functions are not registered in CommonOptFunction.cpp (e.g., gCoreFunction->MNNPackedMatMul = ...), so they won't actually be called at runtime. Please add the necessary registration code.

3. Inconsistent naming convention
Some functions use the _RVV suffix (MNNPackForMatMul_B_RVV, MNNDynamicUpdateConvBiasScale_RVV, MNNQuantScaleFP32_RVV) while others don't (MNNPackedMatMulFP32, generalIm2col). Please unify the naming to be consistent with the framework's integration pattern.

4. ARM SME2 macro names in RVV code
MNNPackForMatMul_B.cpp uses SME2_MATMUL_LP and SME2_MATMUL_HP, which are ARM-specific names. Please rename them to something architecture-neutral or RVV-specific.

5. Code duplication
MNNPackedMatMulFP32.cpp and MNNPackedMatMulRemainFP32.cpp are nearly identical. Consider having the Packed version call the Remain version (similar to the SME2 approach):

void MNNPackedMatMulFP32(...) {
    MNNPackedMatMulRemainFP32(C, A, B, 16, parameter, ...);
}

6. Missing trailing newlines
All new files are missing the POSIX-required trailing newline at the end of the file.

Looking forward to the updated version. Thanks again for the contribution!

@wangzhaode wangzhaode mentioned this pull request May 9, 2026
12 tasks
@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch from 88728c4 to ee8a361 Compare May 11, 2026 13:14
@jxgxxx jxgxxx requested a review from wangzhaode May 11, 2026 13:44
@wangzhaode

Copy link
Copy Markdown
Collaborator

Critical Bug: layout incompatible with

MNNGetMatMulPackMode_RVV returns hP=4, which means the framework expects B matrix packed in [h/4][l][4] layout. However, MNNPackForMatMul_B_RVV uses RVV_MATMUL_HP = 64, producing a [h/64][l][64] layout.

MNNPackedMatMulRemainFP32_RVV reads B as:

size_t bStride = bExtraStride + l * 4;   // stride per h-block = l*4
const float* w_ptr = b_base + z * 4;     // 4 weights per l-step

This assumes [h/4][l][4] layout, which does NOT match what PackForMatMul_B produces.

Verification

I wrote a standalone C++ test that simulates PackB(HP=64) + MatMul(hP=4) and compares against a scalar reference. All 8 test cases fail:

=== PR#4426: PackB(HP=64) vs MatMul(hP=4) Layout Test ===

Test0: h=4   l=8   tr=0  FAIL (16 mismatches, maxErr=41.28)
Test1: h=8   l=16  tr=0  FAIL (32 mismatches, maxErr=77.77)
Test2: h=32  l=32  tr=0  FAIL (128 mismatches, maxErr=170.79)
Test3: h=64  l=16  tr=0  FAIL (256 mismatches, maxErr=124.90)
Test4: h=128 l=16  tr=0  FAIL (512 mismatches, maxErr=151.80)
Test5: h=256 l=24  tr=0  FAIL (1024 mismatches, maxErr=207.31)
Test6: h=8   l=16  tr=1  FAIL (32 mismatches, maxErr=58.22)
Test7: h=128 l=32  tr=1  FAIL (512 mismatches, maxErr=169.48)

=== 0 PASSED, 8 FAILED ===

Suggested Fix

Either:

  1. Change RVV_MATMUL_HP in MNNPackForMatMul_B_RVV from 64 to 4 to match hP=4, or
  2. If HP=64 tiling is intentional for performance, update MNNGetMatMulPackMode_RVV to return hP=64 and adjust MNNPackedMatMulRemainFP32_RVV to read B with bStride = bExtraStride + l * 64 and process 64 h-values per block.

Also a minor note: the diff contains many unrelated whitespace/formatting changes (alignment adjustments in NEON code, for-loop spacing, etc.) that make review harder. Consider separating those into a dedicated commit.

@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch 5 times, most recently from 2fc23d2 to 761e012 Compare May 20, 2026 07:03
@jxgxxx

jxgxxx commented May 20, 2026

Copy link
Copy Markdown
Author

Hi @wangzhaode ,

Thank you for the detailed review and the helpful test script! I have addressed all the issues mentioned:

  1. Fixed layout mismatch: Changed RVV_MATMUL_HP to 4 in MNNPackForMatMul_B_RVV to perfectly align with the hP=4 compute kernel.
  2. Cleaned up formatting: Reverted all unintentional formatting/whitespace changes (including the NEON code and CommonOptFunction.cpp) to keep the diff clean.

The commits have been squashed and updated. Please let me know if anything else is needed. Thanks again for your guidance!

@ihb2032

ihb2032 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

I have some concerns about the MNNPackC4ForMatMul_A_RVV implementation in this PR.

First, I agree that registering RVV implementations into CommonOptFunction and fixing the symbol naming / dispatch path are reasonable. The RVV functions should be properly selected through the backend dispatch mechanism.

However, for MNNPackC4ForMatMul_A_RVV, this PR appears to replace or reimplement the already merged RVV kernel from #3813 with a different algorithm. I do not think this replacement is justified by the current benchmark data.

#3813 already provided benchmark data with explicit benchmark entry, shapes, timing results, and test environment. It included:

  • benchmark entry: test_pack_c4_for_mat_mul_a
  • dimensions: eReal and l
  • multiple benchmark shapes, including large and asymmetric cases
  • scalar time, RVV time, and speedup for each case
  • test environment: Banana Pi BPI-F3, EulixOS 3.0
  • also a negative case where RVV is slower, such as eReal = 1, which helps clarify the applicable workload range

For example, #3813 included benchmark cases such as:

  • eReal = 1024, l = 128
  • eReal = 1024, l = 1024
  • eReal = 1024, l = 4096
  • eReal = 1024, l = 8192
  • eReal = 1024, l = 16384
  • eReal = 1024, l = 32768
  • eReal = 65536, l = 128
  • eReal = 1000000, l = 64
  • eReal = 16, l = 1000000
  • eReal = 1, l = 65536

The reported speedups for the large eReal = 1024 cases were around 33x to 63x over the scalar implementation.

In contrast, this PR only reports a single number for MNNPackC4ForMatMul_A:

24.82 ms -> 8.79 ms, 2.83x

This only shows that the new implementation is faster than the scalar C++ baseline. Since #3813 is already merged, the correct comparison target should be the existing RVV implementation from #3813, not only the scalar fallback.

The current benchmark information is also incomplete compared with #3813. For this packing kernel, performance depends heavily on eReal and l, but this PR does not provide the shape corresponding to the reported timing number. Without the benchmark shape and environment, the single timing number is not enough to evaluate whether the new implementation is actually better.

More importantly, the two implementations use different vectorization strategies.

The implementation from #3813 intentionally vectorizes along the e dimension. Although it uses strided loads from the source C4 layout, it uses large vl / m8 and stores to the destination contiguously. This allows the kernel to use the available RVV vector length effectively, especially when e is large.

The implementation in this PR uses contiguous vle32 loads from source, but the effective vector length is only 4, and then it writes to destination with strided stores. This changes the vectorization dimension from the long e dimension to the small C4 dimension. Even though the source load is contiguous, this may underutilize RVV for large e cases.

So I do not think this kernel should be evaluated only by whether the load is contiguous or whether the code looks simpler. The key question is whether the implementation uses the RVV vector length effectively for the real packing workload.

Please provide an apples-to-apples benchmark using the same benchmark entry and shapes from #3813, including:

  1. scalar C++ baseline
  2. current ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation #3813 RVV implementation
  3. this PR's RVV implementation

Please include at least the same level of benchmark information as #3813:

  • benchmark entry
  • eReal and l
  • scalar time
  • RVV time
  • speedup
  • test environment

Before such data is provided, I suggest keeping the #3813 implementation for MNNPackC4ForMatMul_A_RVV, and only adapting the function name / registration if needed for CommonOptFunction.

In short, integrating RVV functions into the dispatch path is reasonable. But replacing an existing performance-critical RVV kernel should require direct benchmark data against the existing RVV kernel, not only against the scalar implementation.

@wangzhaode

Copy link
Copy Markdown
Collaborator

Closing this PR as #4433 has been merged into master, which covers the same CommonOptFunction RVV optimizations. If there are additional unique optimizations not covered by #4433, feel free to open a new PR based on the latest master. Thanks for your contribution!

@wangzhaode wangzhaode closed this Jun 11, 2026
@jxgxxx

jxgxxx commented Jun 11, 2026

Copy link
Copy Markdown
Author

Hi @wangzhaode ,

Thank you for the update and for managing the merges!

I just carefully reviewed PR #4433 to see how I should align my work. However, it seems that #4433 doesn't cover the specific RVV optimizations we implemented in this PR. While #4433 might have touched CommonOptFunction.cpp, the core of my PR contains unique RVV implementations for Matrix Multiplication (e.g., MNNPackedMatMulFP32 ,generalIm2col , MNNDynamicUpdateConvBiasScale ,MNNPackedMatMulFP32 ,MNNPackedMatMulRemainFP32 ,MNNPackC4ForMatMul_A and MNNPackForMatMul_B ).

I have attached a few screenshots below comparing the changes in #4433 with the unique kernels implemented here for clarity.

Could you kindly help double-check this? If these specific optimizations are still valuable to the framework, would you prefer to reopen this PR, or should I create a brand new PR freshly rebased on the latest master to keep the history completely clean?
5d0668a835f4657a06a1aaf3ee132e22
9993aff43fb4d40d46873000d5754920

@wangzhaode

Copy link
Copy Markdown
Collaborator

Thanks for the follow-up @jxgxxx. After reviewing both PRs carefully, I agree that #4433 does not cover the core FP32 GEMM optimizations in this PR. The functions unique to this PR — MNNPackedMatMulFP32, MNNPackedMatMulRemainFP32, MNNPackForMatMul_B, generalIm2col, MNNDynamicUpdateConvBiasScale, and MNNQuantScaleFP32 — are not present in master and represent valuable RVV optimizations.

I'm reopening this PR. However, before merging, please address the following based on previous review feedback:

  1. Remove MNNPackC4ForMatMul_A changes — The existing RVV implementation from ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation #3813 (merged) is well-benchmarked with 33x-63x speedups. As @ihb2032 correctly pointed out, replacing it requires a direct apples-to-apples benchmark comparison against the existing RVV kernel, not just the scalar baseline. Please keep the ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation #3813 implementation and remove this file from your PR.

  2. Rebase on latest master — Since RVV: Optimize common opt functions #4433 has been merged, please rebase your branch to avoid conflicts, especially in CommonOptFunction_RVV.cpp.

  3. Provide comprehensive benchmark data — For each remaining function, please include:

    • Benchmark entry / function name
    • Input dimensions (e.g., M, N, K for GEMM)
    • Scalar baseline time
    • RVV optimized time
    • Speedup ratio
    • Test environment (hardware, OS)
  4. Fix code duplicationMNNPackedMatMulFP32 and MNNPackedMatMulRemainFP32 are nearly identical. Consider having the Packed version call the Remain version (as done in the SME2 approach).

  5. Confirm the HP layout bug is fixed — The critical hP=4 vs HP=64 mismatch identified earlier must be verified as resolved.

Looking forward to the updated version. Thanks for the contribution!

@wangzhaode wangzhaode reopened this Jun 11, 2026
@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch from 4f3fa0b to 31e0f10 Compare June 14, 2026 07:29
@jxgxxx

jxgxxx commented Jun 14, 2026

Copy link
Copy Markdown
Author

Hi @wangzhaode ,

Thanks for your patience and detailed feedback! I have completely refactored the PR and addressed all 5 points:

  1. Reverted MNNPackC4ForMatMul_A: Removed from this PR to avoid conflicting with the merged ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation #3813 upstream optimization.
  2. Rebased on latest master: Cleanly resolved all conflicts in CommonOptFunction.cpp, keeping the execution sequence intact.
  3. Fixed code duplication: Refactored MNNPackedMatMulFP32_RVV to directly call the logic from MNNPackedMatMulRemainFP32_RVV.
  4. Confirmed HP layout fix: Verified that RVV_MATMUL_HP is set to 4 in MNNPackForMatMul_B to perfectly match the hP=4 framework expectation, resolving the layout mismatch.
  5. Comprehensive Benchmark Data: I have updated the original PR Description above to include the detailed test environment (SG2044 64-Core RISC-V), the specific input dimensions (M, N, K mapping), and the final speedup ratios for all remaining functions.

The branch is now completely clean and up-to-date. Please let me know if everything looks good to proceed with the merge!

@wangzhaode wangzhaode left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jxgxxx, thanks for the updated PR! I've done a detailed correctness review. Here's my analysis:

Requirements Check (5/5 from June 11 review)

# Requirement Status
1 Remove MNNPackC4ForMatMul_A changes ✅ Not in diff, not registered in dispatch
2 Rebase on latest master ❌ See below
3 Code dedup (MatMulFP32 calls Remain) MNNPackedMatMulFP32.cpp:7 forwards to Remain with eSize=16
4 HP layout fixed to hP=4 MNNGetMatMulPackMode.cpp:5, MNNPackForMatMul_B.cpp:7, MatMulRemain:15 all use 4
5 Benchmark data provided ✅ In PR description

Blocker: Rebase Required

Your branch is based on a7de2a08, which is 10 commits behind current master (cc20f672). Critically, it's missing commit f83ed32e[CPU:Bugfix] fix rvv pack and unpack functions errors (#4531) — which fixes bugs in MNNPackC2.cpp, MNNPackC4.cpp, and MNNUnpackC4.cpp. These files are in your branch's rvv/ directory and will carry the old buggy versions if merged without rebasing.

git fetch origin master
git rebase origin/master

GEMM Kernel Correctness (MNNPackedMatMulRemainFP32.cpp)

I verified the core GEMM against the scalar reference (CommonOptFunction.cpp:1591-1646):

  • A access A + z * aStride: correct, matches [l][eP] layout ✅
  • B access b_base + z * 4: correct, matches [hC4][l][hP=4] from PackB
  • C write vsse32(c_base + v, stride=16): correct scatter into C4 pack layout ✅
  • bias/clamp: matches reference exactly ✅
  • h not multiple of 4: safe — PackB does memset(0) padding ✅

One defensive suggestion: vl = __riscv_vsetvl_e32m4(eSize) is called once outside the loop (line 26). This implicitly assumes VLMAX >= eP=16, which is guaranteed by -march=rv64gcv (VLEN≥128). Consider adding MNN_ASSERT(vl >= eSize) as a safety net for future changes.

Other Findings

generalIm2col_RVV — RVV path is effectively dead code (minor)

  • generalIm2col.cpp:42: condition xIn + current_pack <= LP with LP=1 and pack=4 is always false
  • The function always falls through to the scalar loop at line 47
  • This means the "RVV optimization" for im2col has no actual effect
  • Suggestion: either vectorize along a different dimension, or note this is a placeholder

MNNPackForMatMul_B.cpp:54 — LMUL=8 overkill for ≤4 elements (nit)

  • y_len ≤ HP=4, but vsetvl_e32m8 allocates 8 register groups for at most 4 elements
  • e32m1 would suffice and reduce register pressure

Summary

Core GEMM logic is correct and consistent with the scalar reference. The HP=4 layout mismatch from the original review is properly fixed. Please rebase on master (critical to pick up #4531 RVV bugfixes), then we can proceed to merge.

@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch 2 times, most recently from ed41a9d to bf72374 Compare June 17, 2026 13:15
Co-authored-by: jxgxxx <1955992348@qq.com>

Co-authored-by: typer-J <2236066784@qq.com>

Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com>

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch from bf72374 to 7c7d23d Compare June 17, 2026 13:31
@jxgxxx

jxgxxx commented Jun 17, 2026

Copy link
Copy Markdown
Author

Hi @wangzhaode ,

Thank you for the meticulous review and for validating the core GEMM logic! I have fully reset and rebased the branch, addressing all your new findings cleanly:

  1. Rebase: Fetched the absolute latest upstream and rebased, ensuring the [CPU:Bugfix] fix rvv pack and unpack functions errors #4531 bugfixes in the pack/unpack files are fully preserved. The commit history is now completely clean.
  2. Defensive Assert: Added MNN_ASSERT(vl >= eSize) in MNNPackedMatMulRemainFP32.cpp as a safety net for future vector length changes.
  3. Register Pressure: Changed the LMUL from e32m8 to e32m1 in MNNPackForMatMul_B.cpp to reduce register pressure since we are processing $\le 4$ elements.
  4. Im2col Placeholder: Added a clear comment in generalIm2col_RVV acknowledging that the RVV path is currently a placeholder that falls through to the scalar loop, pending future vectorization.

The PR is now perfectly clean, up-to-date, and ready. Thanks again for your excellent guidance and patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants