feat: add w4afp8 on Hopper GPUs by foreverrookie · Pull Request #287 · deepseek-ai/DeepGEMM

foreverrookie · 2026-02-09T14:17:40Z

Hi, this is from the novita.ai team.

The algorithm is compatible with https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8, but it uses a custom weight layout (see convert_fp8_to_int4 in tests/generators.py).
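For readers unfamiliar with 4-bit weight storage, here is a minimal sketch of generic symmetric int4 quantization and nibble packing. It is not the custom layout used in this PR (that is defined by convert_fp8_to_int4 in tests/generators.py). The function names, the per-group scale, the group size, and the low-nibble-first ordering are all assumptions made for illustration.

```python
# Illustrative only: generic int4 quantize/pack, NOT the PR's custom layout.
# All names and the nibble order here are hypothetical.

def quantize_int4(values, group_size=4):
    """Symmetric per-group quantization of floats to signed int4 in [-8, 7]."""
    q, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Map the largest magnitude in the group to 7; avoid division by zero.
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(v / scale))) for v in group)
    return q, scales


def pack_int4(q):
    """Pack pairs of signed int4 values into bytes, low nibble first (assumed order)."""
    assert len(q) % 2 == 0, "need an even number of int4 values"
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(q[0::2], q[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values from packed bytes."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            # Sign-extend the 4-bit two's-complement nibble.
            out.append(nib - 16 if nib >= 8 else nib)
    return out


weights = [1.0, -0.5, 0.25, 0.0, 0.1, -0.2, 0.3, -0.4]
q, scales = quantize_int4(weights)
packed = pack_int4(q)
assert unpack_int4(packed) == q      # packing is lossless for int4 values
assert len(packed) == len(q) // 2    # two weights per byte
```

Packing two weights per byte halves weight storage relative to FP8, at the cost of an unpack and rescale step before the multiply.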

Performance test (W4AFP8 vs FP8) on H200, nvcc 13.0:

groups	m/grp	n	k	W4 us	W4 GB/s	FP8 us	FP8 GB/s	Speedup
8	16	4096	7168	48	2649	68	3477	1.42x
8	24	4096	7168	48	2667	68	3493	1.42x
8	32	4096	7168	48	2685	68	3500	1.42x
8	40	4096	7168	48	2727	68	3516	1.42x
8	48	4096	7168	48	2721	68	3530	1.42x
8	56	4096	7168	62	2115	72	3368	1.16x
8	64	4096	7168	62	2122	72	3393	1.16x
8	16	7168	2048	32	2029	40	3012	1.25x
8	24	7168	2048	31	2071	40	3034	1.29x
8	32	7168	2048	32	2096	40	3061	1.25x
8	40	7168	2048	31	2149	40	3085	1.29x
8	48	7168	2048	31	2187	40	3118	1.29x
8	56	7168	2048	31	2259	42	2991	1.35x
8	64	7168	2048	42	1657	42	2981	1.00x
16	16	4096	7168	93	2724	127	3741	1.37x
16	24	4096	7168	93	2748	127	3760	1.37x
16	32	4096	7168	93	2758	127	3772	1.37x
16	40	4096	7168	93	2793	127	3794	1.37x
16	48	4096	7168	93	2797	127	3806	1.37x
16	56	4096	7168	131	2010	137	3542	1.05x
16	64	4096	7168	130	2036	138	3527	1.06x
16	16	7168	2048	57	2262	71	3362	1.25x
16	24	7168	2048	57	2317	71	3394	1.25x
16	32	7168	2048	57	2344	71	3420	1.25x
16	40	7168	2048	56	2392	71	3462	1.27x
16	48	7168	2048	57	2420	71	3486	1.25x
16	56	7168	2048	68	2062	75	3307	1.10x
16	64	7168	2048	69	2048	76	3292	1.10x
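The Speedup column appears to be the FP8 time divided by the W4 time. This is inferred from the numbers in the table, not stated in the PR. The sketch below recomputes it for a few rows copied from the table. The reported times are rounded to whole microseconds, so a recomputed value could differ in the last digit for other rows.

```python
# Rows copied from the table above: (groups, m_per_group, n, k, w4_us, fp8_us).
rows = [
    (8, 16, 4096, 7168, 48, 68),
    (8, 56, 4096, 7168, 62, 72),
    (8, 64, 7168, 2048, 42, 42),
    (16, 48, 4096, 7168, 93, 127),
    (16, 56, 4096, 7168, 131, 137),
]


def speedup(w4_us, fp8_us):
    """Assumed definition: FP8 kernel time divided by W4AFP8 kernel time."""
    return fp8_us / w4_us


for groups, m, n, k, w4_us, fp8_us in rows:
    print(f"groups={groups:2d} m/grp={m:2d} n={n} k={k}: {speedup(w4_us, fp8_us):.2f}x")
```

In these rows the gain is largest for m/grp up to 48 and shrinks at m/grp of 56 and 64, where the W4 time steps up while the FP8 time changes little.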

- Rebase onto latest main (7f2a703)
- Optimize single-wave performance: switch BLOCK_M from 64 to 128
- Fix benchmark first timing incorrectly returning 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

foreverrookie force-pushed the feat/w4a8 branch from 4e49fdc to e39a0be on April 21, 2026 03:26

foreverrookie force-pushed the feat/w4a8 branch from e39a0be to 59ef1df on April 21, 2026 06:02

foreverrookie wants to merge 1 commit into deepseek-ai:main from foreverrookie:feat/w4a8