[CPU:Perf] Adapt CommonOptFunction for RVV architecture#4426
Conversation
wangzhaode
left a comment
There was a problem hiding this comment.
Thanks for the great work on RVV optimization — the performance numbers look impressive! Before we can merge this, there are a few issues that need to be addressed:
1. Duplicate symbol: MNNPackC4ForMatMul_A (build blocker)
The new file MNNPackC4ForMatMul_A_RVV.cpp defines MNNPackC4ForMatMul_A, which is already defined in the existing MNNPackC4ForMatMul_A.cpp. Since CMakeLists.txt uses FILE(GLOB ...) to compile all .cpp files in the directory, this will cause a linker error due to duplicate symbols. Please either replace the old file or rename the new function.
2. Missing framework integration
The new functions are not registered in CommonOptFunction.cpp (e.g., gCoreFunction->MNNPackedMatMul = ...), so they won't actually be called at runtime. Please add the necessary registration code.
3. Inconsistent naming convention
Some functions use the _RVV suffix (MNNPackForMatMul_B_RVV, MNNDynamicUpdateConvBiasScale_RVV, MNNQuantScaleFP32_RVV) while others don't (MNNPackedMatMulFP32, generalIm2col). Please unify the naming to be consistent with the framework's integration pattern.
4. ARM SME2 macro names in RVV code
MNNPackForMatMul_B.cpp uses SME2_MATMUL_LP and SME2_MATMUL_HP, which are ARM-specific names. Please rename them to something architecture-neutral or RVV-specific.
5. Code duplication
MNNPackedMatMulFP32.cpp and MNNPackedMatMulRemainFP32.cpp are nearly identical. Consider having the Packed version call the Remain version (similar to the SME2 approach):
void MNNPackedMatMulFP32(...) {
MNNPackedMatMulRemainFP32(C, A, B, 16, parameter, ...);
}6. Missing trailing newlines
All new files are missing the POSIX-required trailing newline at the end of the file.
Looking forward to the updated version. Thanks again for the contribution!
88728c4 to
ee8a361
Compare
Critical Bug: layout incompatible with
size_t bStride = bExtraStride + l * 4; // stride per h-block = l*4
const float* w_ptr = b_base + z * 4; // 4 weights per l-stepThis assumes VerificationI wrote a standalone C++ test that simulates Suggested FixEither:
Also a minor note: the diff contains many unrelated whitespace/formatting changes (alignment adjustments in NEON code, for-loop spacing, etc.) that make review harder. Consider separating those into a dedicated commit. |
2fc23d2 to
761e012
Compare
|
Hi @wangzhaode , Thank you for the detailed review and the helpful test script! I have addressed all the issues mentioned:
The commits have been squashed and updated. Please let me know if anything else is needed. Thanks again for your guidance! |
|
I have some concerns about the First, I agree that registering RVV implementations into However, for #3813 already provided benchmark data with explicit benchmark entry, shapes, timing results, and test environment. It included:
For example, #3813 included benchmark cases such as:
The reported speedups for the large In contrast, this PR only reports a single number for This only shows that the new implementation is faster than the scalar C++ baseline. Since #3813 is already merged, the correct comparison target should be the existing RVV implementation from #3813, not only the scalar fallback. The current benchmark information is also incomplete compared with #3813. For this packing kernel, performance depends heavily on More importantly, the two implementations use different vectorization strategies. The implementation from #3813 intentionally vectorizes along the The implementation in this PR uses contiguous So I do not think this kernel should be evaluated only by whether the load is contiguous or whether the code looks simpler. The key question is whether the implementation uses the RVV vector length effectively for the real packing workload. Please provide an apples-to-apples benchmark using the same benchmark entry and shapes from #3813, including:
Please include at least the same level of benchmark information as #3813:
Before such data is provided, I suggest keeping the #3813 implementation for In short, integrating RVV functions into the dispatch path is reasonable. But replacing an existing performance-critical RVV kernel should require direct benchmark data against the existing RVV kernel, not only against the scalar implementation. |
|
Hi @wangzhaode , Thank you for the update and for managing the merges! I just carefully reviewed PR #4433 to see how I should align my work. However, it seems that #4433 doesn't cover the specific RVV optimizations we implemented in this PR. While #4433 might have touched CommonOptFunction.cpp, the core of my PR contains unique RVV implementations for Matrix Multiplication (e.g., MNNPackedMatMulFP32 ,generalIm2col , MNNDynamicUpdateConvBiasScale ,MNNPackedMatMulFP32 ,MNNPackedMatMulRemainFP32 ,MNNPackC4ForMatMul_A and MNNPackForMatMul_B ). I have attached a few screenshots below comparing the changes in #4433 with the unique kernels implemented here for clarity. Could you kindly help double-check this? If these specific optimizations are still valuable to the framework, would you prefer to reopen this PR, or should I create a brand new PR freshly rebased on the latest master to keep the history completely clean? |
|
Thanks for the follow-up @jxgxxx. After reviewing both PRs carefully, I agree that #4433 does not cover the core FP32 GEMM optimizations in this PR. The functions unique to this PR — I'm reopening this PR. However, before merging, please address the following based on previous review feedback:
Looking forward to the updated version. Thanks for the contribution! |
4f3fa0b to
31e0f10
Compare
|
Hi @wangzhaode , Thanks for your patience and detailed feedback! I have completely refactored the PR and addressed all 5 points:
The branch is now completely clean and up-to-date. Please let me know if everything looks good to proceed with the merge! |
wangzhaode
left a comment
There was a problem hiding this comment.
Hi @jxgxxx, thanks for the updated PR! I've done a detailed correctness review. Here's my analysis:
Requirements Check (5/5 from June 11 review)
| # | Requirement | Status |
|---|---|---|
| 1 | Remove MNNPackC4ForMatMul_A changes |
✅ Not in diff, not registered in dispatch |
| 2 | Rebase on latest master | ❌ See below |
| 3 | Code dedup (MatMulFP32 calls Remain) |
✅ MNNPackedMatMulFP32.cpp:7 forwards to Remain with eSize=16 |
| 4 | HP layout fixed to hP=4 | ✅ MNNGetMatMulPackMode.cpp:5, MNNPackForMatMul_B.cpp:7, MatMulRemain:15 all use 4 |
| 5 | Benchmark data provided | ✅ In PR description |
Blocker: Rebase Required
Your branch is based on a7de2a08, which is 10 commits behind current master (cc20f672). Critically, it's missing commit f83ed32e — [CPU:Bugfix] fix rvv pack and unpack functions errors (#4531) — which fixes bugs in MNNPackC2.cpp, MNNPackC4.cpp, and MNNUnpackC4.cpp. These files are in your branch's rvv/ directory and will carry the old buggy versions if merged without rebasing.
git fetch origin master
git rebase origin/master
GEMM Kernel Correctness (MNNPackedMatMulRemainFP32.cpp)
I verified the core GEMM against the scalar reference (CommonOptFunction.cpp:1591-1646):
- A access
A + z * aStride: correct, matches[l][eP]layout ✅ - B access
b_base + z * 4: correct, matches[hC4][l][hP=4]fromPackB✅ - C write
vsse32(c_base + v, stride=16): correct scatter into C4 pack layout ✅ - bias/clamp: matches reference exactly ✅
- h not multiple of 4: safe —
PackBdoesmemset(0)padding ✅
One defensive suggestion: vl = __riscv_vsetvl_e32m4(eSize) is called once outside the loop (line 26). This implicitly assumes VLMAX >= eP=16, which is guaranteed by -march=rv64gcv (VLEN≥128). Consider adding MNN_ASSERT(vl >= eSize) as a safety net for future changes.
Other Findings
generalIm2col_RVV — RVV path is effectively dead code (minor)
generalIm2col.cpp:42: conditionxIn + current_pack <= LPwith LP=1 and pack=4 is always false- The function always falls through to the scalar loop at line 47
- This means the "RVV optimization" for im2col has no actual effect
- Suggestion: either vectorize along a different dimension, or note this is a placeholder
MNNPackForMatMul_B.cpp:54 — LMUL=8 overkill for ≤4 elements (nit)
y_len≤ HP=4, butvsetvl_e32m8allocates 8 register groups for at most 4 elementse32m1would suffice and reduce register pressure
Summary
Core GEMM logic is correct and consistent with the scalar reference. The HP=4 layout mismatch from the original review is properly fixed. Please rebase on master (critical to pick up #4531 RVV bugfixes), then we can proceed to merge.
ed41a9d to
bf72374
Compare
Co-authored-by: jxgxxx <1955992348@qq.com> Co-authored-by: typer-J <2236066784@qq.com> Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com> Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
bf72374 to
7c7d23d
Compare
|
Hi @wangzhaode , Thank you for the meticulous review and for validating the core GEMM logic! I have fully reset and rebased the branch, addressing all your new findings cleanly:
The PR is now perfectly clean, up-to-date, and ready. Thanks again for your excellent guidance and patience! |


Description
This PR implements the RISC-V Vector (RVV) adaptation for core operators in
CommonOptFunction.Accuracy Validation
Performance Metrics
The performance was evaluated on a remote RISC-V server. Profiling was conducted using
perffor each individual function.Test Environment & Parameters:
(Note: Data represents average execution time across 5,000-20,000 iterations.)
Module
CPU / RVV
Type
Checklist
[Module:Type] Descriptionformat