metal/ggml: keep q4_0 decode on the mat-vec kernel up to 8 rows #16

Closed
czoli1976 wants to merge 1 commit into perf/metal-ggml-f16-roundtrip from perf/metal-q4-gemv-rows

@czoli1976

The q4_0 matrix-vector kernel is bandwidth-bound on the weight read and stays cheaper than the tiled GEMM up to ~8 activation rows, but the dispatcher switched to GEMM at m > 4, so 5–8-row q4 decode (batched, or speculative / lookahead) paid the full GEMM cost for no gain. This raises the q4 mat-vec row cap to 8; f16/f32 stay at 4.
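The dispatch rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual dispatcher code: the names `WeightType`, `MAT_VEC_ROW_CAP`, and `select_kernel` are hypothetical, and only the caps themselves (8 for q4_0, 4 for f16/f32) come from this PR.

```python
from enum import Enum


class WeightType(Enum):
    """Weight formats mentioned in this PR (hypothetical enum, for illustration)."""
    Q4_0 = "q4_0"
    F16 = "f16"
    F32 = "f32"


# Largest number of activation rows (m) still routed to the mat-vec kernel.
# Values are the ones stated in the PR: q4_0 raised to 8, f16/f32 kept at 4.
MAT_VEC_ROW_CAP = {
    WeightType.Q4_0: 8,
    WeightType.F16: 4,
    WeightType.F32: 4,
}


def select_kernel(weight_type: WeightType, m: int) -> str:
    """Return "mat_vec" or "gemm" for a matmul with m activation rows."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return "mat_vec" if m <= MAT_VEC_ROW_CAP[weight_type] else "gemm"


if __name__ == "__main__":
    for m in (1, 4, 6, 8, 12):
        print(m, select_kernel(WeightType.Q4_0, m), select_kernel(WeightType.F16, m))
```

With the previous cap of 4 for every type, the m = 5 to 8 cases for q4_0 would have returned "gemm"; with the per-type cap they stay on "mat_vec".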

Stacked on sonos#2366 (perf/metal-ggml-f16-roundtrip).

Perf

Forward-pass latency, Qwen3-1.7B q40ef16, Metal (Apple M-series), 256-token past, median ms/pass:

| tokens/pass | main | +sonos#2366 | +sonos#2366 +this |
|---|---|---|---|
| 1 | 30.5 | 26.5 | 26.6 |
| 4 | 44.0 | 43.2 | 41.7 |
| 6 | 72.2 | 75.5 | 55.6 |
| 8 | 72.1 | 76.3 | 71.3 |
| 12 | 72.8 | 77.0 | 76.0 |

The 5–8-row band now lands on the mat-vec path (m=6: −26% vs sonos#2366). Single-token decode (m=1) and prefill (m≥12) are unchanged. Downstream this turns k=4 speculative decoding on Qwen3-1.7B from a slowdown (~0.81×) into a ~1.19× speedup, and benefits any small-batch q4 decode.
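For readers who want to check the percentages against the table, the relative change per row count can be recomputed as below. The latencies are copied from the table in this PR; the helper name `relative_change` is just an illustrative choice.

```python
# Median ms/pass from the table: (main, +sonos#2366, +sonos#2366 +this).
latency_ms = {
    1: (30.5, 26.5, 26.6),
    4: (44.0, 43.2, 41.7),
    6: (72.2, 75.5, 55.6),
    8: (72.1, 76.3, 71.3),
    12: (72.8, 77.0, 76.0),
}


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (negative means faster)."""
    return (after - before) / before


if __name__ == "__main__":
    for m, (_main, base, this) in latency_ms.items():
        # Compare this PR against its base branch (sonos#2366).
        print(f"m={m:>2}: {relative_change(base, this):+.1%} vs sonos#2366")
```

The m=6 row works out to roughly −26%, matching the figure quoted above, while m=1 and m=12 move by about 1% or less.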

The crossover (8) is measured on Apple GPUs and would ideally be device-tuned.
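One way device tuning could work, sketched under assumptions: benchmark both kernels at several row counts on the target GPU and pick the largest row count for which mat-vec is still no slower than GEMM. The function `find_row_cap` and the timing values below are hypothetical placeholders, not measurements from this PR.

```python
def find_row_cap(mat_vec_ms: dict, gemm_ms: dict) -> int:
    """Largest measured m such that mat-vec is no slower than GEMM at m
    and at every smaller measured row count. Returns 0 if mat-vec never wins."""
    cap = 0
    for m in sorted(set(mat_vec_ms) & set(gemm_ms)):
        if mat_vec_ms[m] <= gemm_ms[m]:
            cap = m
        else:
            break  # stop at the first row count where GEMM is faster
    return cap


if __name__ == "__main__":
    # Placeholder per-kernel timings (ms), made up purely for illustration.
    mat_vec = {1: 1.0, 2: 1.2, 4: 1.6, 8: 2.4, 16: 4.8}
    gemm = {1: 2.5, 2: 2.5, 4: 2.5, 8: 2.5, 16: 2.6}
    print(find_row_cap(mat_vec, gemm))
```

The resulting cap could then replace the hard-coded 8 for q4_0 on that device; how such a value would be stored or looked up is outside what this PR describes.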

🤖 Generated with Claude Code

The q4_0 matrix-vector kernel is bandwidth-bound on the weight read and stays
cheaper than the tiled GEMM up to ~8 activation rows, but the dispatcher switched
to GEMM at m>4, making 5-8-row q4 decode (batched or speculative) needlessly
slow. Raise the q4 mat-vec row cap to 8; f16/f32 stay at 4.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@czoli1976

Superseded by sonos#2369 (same branch, opened upstream against main, stacked on sonos#2366).

czoli1976 closed this Jun 14, 2026