ggml-hexagon: HAP_power_set_HMX uses &ctx instead of ctx in htp_iface_open(), causing large HMX GEMM slowdown #1452

@happyyzy

Description

What happened

In ggml-hexagon, htp_iface_open() powers up HMX with the wrong client/context pointer.

File:

  • src/ggml-hexagon/htp/main.c

Current code on commit 35ae589fa189a3682a1fe25b7803122680c401b4:

request.type         = HAP_power_set_HMX;
request.hmx.power_up = TRUE;
err = HAP_power_set((void *) &ctx, &request);

This passes &ctx, the address of the local pointer variable (effectively a struct htp_context **), instead of ctx, the actual struct htp_context *.

The rest of the power votes in the same function use ctx correctly:

HAP_power_set((void *) ctx, &request)

The one-line fix is:

err = HAP_power_set((void *) ctx, &request);
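To illustrate why the cast hides the mistake, here is a minimal sketch in plain C. It does not use the Hexagon SDK: struct htp_context is a hypothetical stand-in with a dummy field, and the two helper functions are invented for illustration. It only shows that (void *) &ctx and (void *) ctx are different pointer values, so an API that identifies a client by that pointer would presumably see the HMX vote as coming from a different client than the other votes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real context type; only its address matters here. */
struct htp_context {
    int dummy;
};

/* Mirrors the buggy call site: the HMX vote passes the address of the
 * local pointer variable, while the earlier votes pass the pointer itself.
 * Returns 1 if both votes would use the same client pointer, 0 otherwise. */
static int buggy_vote_matches(struct htp_context *ctx) {
    assert(ctx != NULL);
    const void *earlier_votes = (void *) ctx;   /* what the other votes pass */
    const void *hmx_vote      = (void *) &ctx;  /* address of the local variable */
    return earlier_votes == hmx_vote;
}

/* Mirrors the fixed call site: every vote passes ctx itself. */
static int fixed_vote_matches(struct htp_context *ctx) {
    assert(ctx != NULL);
    const void *earlier_votes = (void *) ctx;
    const void *hmx_vote      = (void *) ctx;
    return earlier_votes == hmx_vote;
}
```

Because both expressions are cast to void *, the compiler accepts either one without a diagnostic, which is likely why the bug went unnoticed.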

Why this matters

This is not just a cosmetic bug. On device, it causes a large slowdown in the ggml-hexagon HMX GEMM path.

After fixing only this pointer bug, the HMX core segment drops immediately from ~65 ms to ~22 ms on the same workload.

Reproduction

Repo / commit:

  • ggml-org/ggml
  • 35ae589fa189a3682a1fe25b7803122680c401b4

Command used:

GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
  -p "type_a=q8_0,type_b=f32,m=4096,n=12288,k=4096"

and:

GGML_HEXAGON_ARCH=79 GGML_HEXAGON_PROFILE=1 \
./test-backend-ops perf -o MUL_MAT -b HTP0 \
  -p "type_a=q4_0,type_b=f32,m=4096,n=12288,k=4096"

Measured before / after

q8_0, shape m=4096, k=4096, n=12288

Before fix:

  • dequant ~= 777xx us
  • core ~= 648xx us

After changing only HAP_power_set((void *)&ctx, ...) -> HAP_power_set((void *)ctx, ...):

  • dequant ~= 776xx us
  • core ~= 2205x us

q4_0, shape m=4096, k=4096, n=12288

Before fix:

  • dequant ~= 643xx us
  • core ~= 653xx us

After fix:

  • dequant ~= 595xx us
  • core ~= 2218x us

Suggested fix

Change this line in src/ggml-hexagon/htp/main.c:

err = HAP_power_set((void *) &ctx, &request);

to:

err = HAP_power_set((void *) ctx, &request);

Notes

I ruled out a few unrelated explanations before isolating this:

  • HMX lock mode (lock vs lock2(shared)) was not the cause.
  • Chunk/layout changes were not the cause.
  • The same workload still ran and produced numerical results; the bug manifested only as a large performance drop.

This issue is specifically about the wrong context pointer passed to HAP_power_set() for the HAP_power_set_HMX request.
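One possible way to guard against this class of bug is to funnel all power votes through a small typed wrapper, so the (void *) cast happens in exactly one place. The sketch below is an assumption, not existing ggml-hexagon code: htp_power_vote, stub_power_set, and struct power_request are hypothetical names, and the stub only records the client pointer instead of calling the real HAP_power_set().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real context and request types. */
struct htp_context {
    int id;
};

struct power_request {
    int type;
    int power_up;
};

/* Records the client pointer of the most recent vote (for illustration only). */
static void *seen_client;

/* Hypothetical stub in place of the real power API: accepts an opaque
 * client pointer and a request, and always reports success. */
static int stub_power_set(void *client, struct power_request *req) {
    assert(req != NULL);
    seen_client = client;
    return 0;
}

/* Typed wrapper: the only place that casts to void *.
 * Calling htp_power_vote(&ctx, ...) with ctx of type struct htp_context *
 * passes a struct htp_context **, which compilers diagnose as an
 * incompatible pointer type instead of accepting it silently. */
static int htp_power_vote(struct htp_context *ctx, struct power_request *req) {
    return stub_power_set((void *) ctx, req);
}
```

With a wrapper like this, call sites would read htp_power_vote(ctx, &request), and the &ctx typo would surface at compile time rather than as a runtime slowdown.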
