metal/ggml: keep q4_0 decode on the mat-vec kernel up to 8 rows #16
Closed
czoli1976 wants to merge 1 commit into
The q4_0 matrix-vector kernel is bandwidth-bound on the weight read and stays cheaper than the tiled GEMM up to ~8 activation rows, but the dispatcher switched to GEMM at m > 4, making 5–8-row q4 decode (batched or speculative) needlessly slow. Raise the q4 mat-vec row cap to 8; f16/f32 stay at 4.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Superseded by sonos#2369 (same branch, opened upstream against main, stacked on sonos#2366).
The q4_0 matrix-vector kernel is bandwidth-bound on the weight read and stays cheaper than the tiled GEMM up to ~8 activation rows, but the dispatcher switched to GEMM at m > 4, so 5–8-row q4 decode (batched, or speculative / lookahead) paid the full GEMM cost for no gain. This raises the q4 mat-vec row cap to 8; f16/f32 stay at 4.

Stacked on sonos#2366 (perf/metal-ggml-f16-roundtrip).

Perf
Forward-pass latency, Qwen3-1.7B q40ef16, Metal (Apple M-series), 256-token past, median ms/pass:
The 5–8-row band now lands on the mat-vec path (m=6: −26% vs sonos#2366). Single-token decode (m=1) and prefill (m≥12) are unchanged. Downstream this turns k=4 speculative decoding on Qwen3-1.7B from a slowdown (~0.81×) into a ~1.19× speedup, and benefits any small-batch q4 decode.
The crossover (8) is measured on Apple GPUs and would ideally be device-tuned.
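For illustration only, the dispatch rule described above could be sketched as follows. The type and function names (`WeightFormat`, `Kernel`, `mat_vec_row_cap`, `select_kernel`) are hypothetical and are not taken from the actual patch; only the caps themselves (8 for q4_0, 4 for f16/f32) come from this description. A device-tuned variant would presumably make the cap depend on the GPU as well as the format.

```rust
/// Hypothetical weight formats; illustrative names, not the crate's real types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WeightFormat {
    Q40,
    F16,
    F32,
}

/// The two kernel families the dispatcher chooses between.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kernel {
    MatVec,
    Gemm,
}

/// Largest number of activation rows (m) still routed to the mat-vec kernel.
/// Values are the ones stated in this PR; a device-tuned version would take
/// a device identifier as an extra argument (assumption, not implemented here).
pub fn mat_vec_row_cap(fmt: WeightFormat) -> usize {
    match fmt {
        WeightFormat::Q40 => 8,
        WeightFormat::F16 | WeightFormat::F32 => 4,
    }
}

/// Pick mat-vec while m is at or below the cap, tiled GEMM above it.
pub fn select_kernel(fmt: WeightFormat, m: usize) -> Kernel {
    if m <= mat_vec_row_cap(fmt) {
        Kernel::MatVec
    } else {
        Kernel::Gemm
    }
}
```

Under this sketch, `select_kernel(WeightFormat::Q40, 6)` selects the mat-vec path, whereas with the previous cap of 4 the same call would have fallen through to GEMM; f16 and f32 behave as before.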
🤖 Generated with Claude Code