What happened
In ggml-hexagon, htp_iface_open() powers up HMX with the wrong client/context pointer.
File:
src/ggml-hexagon/htp/main.c
Current code on commit 35ae589fa189a3682a1fe25b7803122680c401b4:
request.type = HAP_power_set_HMX;
request.hmx.power_up = TRUE;
err = HAP_power_set((void *) &ctx, &request);
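That passes &ctx (the address of the local pointer variable, effectively a struct htp_context **) instead of ctx (the actual struct htp_context *). HAP_power_set() uses its first argument as the client identifier, so the HMX vote appears to be registered under a different client than the other votes.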
The rest of the power votes in the same function use ctx correctly:
HAP_power_set((void *) ctx, &request)
The one-line fix is:
err = HAP_power_set((void *) ctx, &request);
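To illustrate why the extra & matters, here is a minimal host-side sketch. It does not use the Hexagon SDK: mock_power_set(), open_buggy(), open_fixed() and the reduced struct htp_context are hypothetical stand-ins, and the only assumption is that the power API keys each vote by the opaque client pointer it receives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT the real ggml-hexagon struct. */
struct htp_context { int id; };

#define MAX_VOTES 8
static const void *recorded_clients[MAX_VOTES];
static size_t n_recorded;

/* Mock power-vote call: records the opaque client pointer it was given.
 * Assumption: the real API identifies the voting client by this pointer. */
static int mock_power_set(void *client, int request_type) {
    (void) request_type;
    assert(n_recorded < MAX_VOTES);
    recorded_clients[n_recorded++] = client;
    return 0;
}

/* Returns 1 if every recorded vote used the same client pointer. */
static int all_votes_same_client(void) {
    for (size_t i = 1; i < n_recorded; i++) {
        if (recorded_clients[i] != recorded_clients[0]) return 0;
    }
    return 1;
}

/* Mirrors the reported bug: the second vote passes &ctx, the address
 * of the local pointer variable, not the context it points to. */
static int open_buggy(struct htp_context *ctx) {
    n_recorded = 0;
    mock_power_set((void *) ctx, 0);   /* earlier vote: correct client */
    mock_power_set((void *) &ctx, 1);  /* HMX vote: wrong client       */
    return all_votes_same_client();
}

/* Mirrors the one-line fix: both votes pass ctx itself. */
static int open_fixed(struct htp_context *ctx) {
    n_recorded = 0;
    mock_power_set((void *) ctx, 0);
    mock_power_set((void *) ctx, 1);
    return all_votes_same_client();
}
```

Calling open_buggy() with any valid context returns 0 (the two votes carry different client pointers), while open_fixed() returns 1. The cast to void * hides the type mismatch, which is why the compiler does not warn about the original line.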
Why this matters
This is not just a cosmetic bug. On device, it causes a large performance regression in the ggml-hexagon HMX GEMM path.
After fixing only this pointer bug, the HMX core segment drops from ~65 ms to ~22 ms on the same workload, roughly a 3x reduction.
Reproduction
Repo / commit:
ggml-org/ggml
35ae589fa189a3682a1fe25b7803122680c401b4
Command used:
GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
-p "type_a=q8_0,type_b=f32,m=4096,n=12288,k=4096"
and:
GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
-p "type_a=q4_0,type_b=f32,m=4096,n=12288,k=4096"
Measured before / after
q8_0, shape 4096 x 4096 x 12288
Before fix:
dequant ~= 777xx us
core ~= 648xx us
After changing only HAP_power_set((void *)&ctx, ...) -> HAP_power_set((void *)ctx, ...):
dequant ~= 776xx us
core ~= 2205x us
q4_0, shape 4096 x 4096 x 12288
Before fix:
dequant ~= 643xx us
core ~= 653xx us
After fix:
dequant ~= 595xx us
core ~= 2218x us
Suggested fix
Change this line in src/ggml-hexagon/htp/main.c:
err = HAP_power_set((void *) &ctx, &request);
to:
err = HAP_power_set((void *) ctx, &request);
Notes
I ruled out a few unrelated explanations before isolating this:
- HMX lock mode (lock vs lock2(shared)) was not the cause.
- Chunk/layout changes were not the cause.
- The same workload still ran and produced numerical results; the bug showed up only as a large performance drop.
This issue is specifically about the wrong context pointer passed to HAP_power_set() for the HAP_power_set_HMX request.