linalg/block_quant: W4A8 int8-dot decode GEMV for Q4_0 by czoli1976 · Pull Request #2348 · sonos/tract

czoli1976 · 2026-06-07T14:14:07Z

Adds a W4A8 (4-bit weight, int8 activation) decode GEMV on the Q4_0 block-quant format —
the CPU primitive for an int4-compute matmul path, as an alternative to today's
dequantize-to-f32-then-f32-kernel route. Q4_0::w4a8_gemv quantizes the activation row to int8
per block, dots it with the unpacked 4-bit weights, and scales each block by the weight and
activation scales.
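The per-block computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Q4Block`, `quantize_block`, `quantize_activation_block` and the free function `w4a8_gemv` are hypothetical names, the scale is stored as f32, and the nibble order (element 2i in the low nibble of byte i) is chosen for simplicity. The real Q4_0 layout in tract may differ on all of these points.

```rust
/// Number of weights per block (Q4_0 uses 32-element blocks).
pub const BLOCK: usize = 32;

/// Hypothetical simplified block: one scale plus 32 packed 4-bit codes.
/// Assumption: scale kept as f32; each code is (weight + 8) in 0..=15.
pub struct Q4Block {
    pub scale: f32,
    pub nibbles: [u8; BLOCK / 2],
}

/// Symmetric 4-bit quantization of 32 f32 weights (illustrative scheme).
pub fn quantize_block(w: &[f32]) -> Q4Block {
    assert_eq!(w.len(), BLOCK);
    let amax = w.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 7.0 };
    let code = |v: f32| ((v / scale).round().clamp(-8.0, 7.0) as i8 + 8) as u8;
    let mut nibbles = [0u8; BLOCK / 2];
    for i in 0..BLOCK / 2 {
        nibbles[i] = code(w[2 * i]) | (code(w[2 * i + 1]) << 4);
    }
    Q4Block { scale, nibbles }
}

/// Quantize one 32-element activation block to int8 with its own scale.
pub fn quantize_activation_block(x: &[f32]) -> (f32, [i8; BLOCK]) {
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    let mut q = [0i8; BLOCK];
    for (qi, xi) in q.iter_mut().zip(x) {
        *qi = (xi / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

/// y = W * x for a row-major matrix of Q4 blocks and an f32 activation row.
pub fn w4a8_gemv(weights: &[Q4Block], rows: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len() % BLOCK, 0);
    let blocks_per_row = x.len() / BLOCK;
    assert_eq!(weights.len(), rows * blocks_per_row);

    // The activation row is quantized once and reused for every output row.
    let xq: Vec<(f32, [i8; BLOCK])> =
        x.chunks_exact(BLOCK).map(quantize_activation_block).collect();

    let mut y = vec![0f32; rows];
    for (r, out) in y.iter_mut().enumerate() {
        let row = &weights[r * blocks_per_row..(r + 1) * blocks_per_row];
        let mut acc = 0f32;
        for (wb, (xs, xv)) in row.iter().zip(&xq) {
            // Unpack the 4-bit codes back to signed values in -8..=7.
            let mut w = [0i8; BLOCK];
            for (i, byte) in wb.nibbles.iter().enumerate() {
                w[2 * i] = (byte & 0x0f) as i8 - 8;
                w[2 * i + 1] = (byte >> 4) as i8 - 8;
            }
            // Integer dot product, then one f32 scaling per block.
            let dot: i32 = w.iter().zip(xv).map(|(a, b)| *a as i32 * *b as i32).sum();
            acc += wb.scale * xs * dot as f32;
        }
        *out = acc;
    }
    y
}
```

The only floating-point work per block is the final multiply by the two scales; everything inside the block is integer arithmetic.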

Why

The GPU backends already do exactly this — CUDA routes Q4_0 to a Q4_0 × Q8_1 kernel
(cuda/src/kernels/cu/mm_mv_q.cu), Metal to mul_mv_q4_0. On CPU the block-quant path only
dequantizes to f32, so the int4 win is just reduced weight bandwidth, not int8 compute. This is
the missing CPU piece.

Notes / numbers

Near-lossless: validated against the f32-dequant GEMV over the same Q4_0 weight. The int8
activation quantization adds ≈0.3% error (W4A8 ≈ W4A16); test included.
No intrinsics, no unsafe: the contiguous i8→i32 inner loop autovectorizes to a NEON
sdot on its own (integer reductions are associative; strict-IEEE f32 ones aren't).
Performance: on the decode shape (4096², M=1), a local prototype measured ~1.8× over mmv_f32. That
figure includes op-wrapper overhead, so a kernel-selection integration should do at least as well.
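As a rough illustration of the autovectorization note above, the sketch below shows the shape of a contiguous i8→i32 reduction and a small demonstration that f32 addition is order-dependent. `dot_i8` and `f32_sum_depends_on_order` are hypothetical names, not functions from this PR, and whether a given compiler actually emits a NEON `sdot` for this loop depends on the target features and compiler version.

```rust
/// Contiguous i8 x i8 -> i32 dot product. Each product is widened to i32
/// before accumulation, so the compiler is free to reorder the partial
/// sums (integer addition is associative) and vectorize the reduction.
pub fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// f32 addition is not associative: grouping changes the rounded result,
/// which is why a strict-IEEE f32 reduction cannot be freely reordered.
pub fn f32_sum_depends_on_order() -> bool {
    let (a, b, c) = (1.0e8f32, -1.0e8f32, 1.0f32);
    (a + b) + c != a + (b + c)
}
```

With 32-element blocks the i32 accumulator cannot overflow: the largest possible magnitude is 32 × 128 × 128 = 524288.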

Scope / integration

This PR is the kernel primitive only — it is intentionally not yet wired into kernel
selection. The integration question (the W4A8-vs-W4A16 numerical contract, and whether it lands as
a dedicated op like the Metal/CUDA Q4_0×Q8 ops or via strategize in einsum_matmul) is the
open discussion in #2341. Filing the validated primitive + numbers to ground that — happy to wire
it whichever way you prefer.

🤖 Generated with Claude Code

Quantizes the activation row to int8 per block and dots it with the unpacked 4-bit weights, scaling each block by the weight and activation scales. The contiguous inner loop autovectorizes to a NEON int8 dot.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feature/w4a8-q4_0-gemv
