feat: add w4afp8 on Hopper GPUs #287

Open: foreverrookie wants to merge 1 commit into deepseek-ai:main from foreverrookie:feat/w4a8

foreverrookie commented Feb 9, 2026

Hi, this is the novita.ai team.

Performance test (W4AFP8 vs. FP8) on H200, nvcc 13.0:

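| groups | m/grp | n | k | W4 (us) | W4 (GB/s) | FP8 (us) | FP8 (GB/s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| 8 | 16 | 4096 | 7168 | 48 | 2649 | 68 | 3477 | 1.42x |
| 8 | 24 | 4096 | 7168 | 48 | 2667 | 68 | 3493 | 1.42x |
| 8 | 32 | 4096 | 7168 | 48 | 2685 | 68 | 3500 | 1.42x |
| 8 | 40 | 4096 | 7168 | 48 | 2727 | 68 | 3516 | 1.42x |
| 8 | 48 | 4096 | 7168 | 48 | 2721 | 68 | 3530 | 1.42x |
| 8 | 56 | 4096 | 7168 | 62 | 2115 | 72 | 3368 | 1.16x |
| 8 | 64 | 4096 | 7168 | 62 | 2122 | 72 | 3393 | 1.16x |
| 8 | 16 | 7168 | 2048 | 32 | 2029 | 40 | 3012 | 1.25x |
| 8 | 24 | 7168 | 2048 | 31 | 2071 | 40 | 3034 | 1.29x |
| 8 | 32 | 7168 | 2048 | 32 | 2096 | 40 | 3061 | 1.25x |
| 8 | 40 | 7168 | 2048 | 31 | 2149 | 40 | 3085 | 1.29x |
| 8 | 48 | 7168 | 2048 | 31 | 2187 | 40 | 3118 | 1.29x |
| 8 | 56 | 7168 | 2048 | 31 | 2259 | 42 | 2991 | 1.35x |
| 8 | 64 | 7168 | 2048 | 42 | 1657 | 42 | 2981 | 1.00x |
| 16 | 16 | 4096 | 7168 | 93 | 2724 | 127 | 3741 | 1.37x |
| 16 | 24 | 4096 | 7168 | 93 | 2748 | 127 | 3760 | 1.37x |
| 16 | 32 | 4096 | 7168 | 93 | 2758 | 127 | 3772 | 1.37x |
| 16 | 40 | 4096 | 7168 | 93 | 2793 | 127 | 3794 | 1.37x |
| 16 | 48 | 4096 | 7168 | 93 | 2797 | 127 | 3806 | 1.37x |
| 16 | 56 | 4096 | 7168 | 131 | 2010 | 137 | 3542 | 1.05x |
| 16 | 64 | 4096 | 7168 | 130 | 2036 | 138 | 3527 | 1.06x |
| 16 | 16 | 7168 | 2048 | 57 | 2262 | 71 | 3362 | 1.25x |
| 16 | 24 | 7168 | 2048 | 57 | 2317 | 71 | 3394 | 1.25x |
| 16 | 32 | 7168 | 2048 | 57 | 2344 | 71 | 3420 | 1.25x |
| 16 | 40 | 7168 | 2048 | 56 | 2392 | 71 | 3462 | 1.27x |
| 16 | 48 | 7168 | 2048 | 57 | 2420 | 71 | 3486 | 1.25x |
| 16 | 56 | 7168 | 2048 | 68 | 2062 | 75 | 3307 | 1.10x |
| 16 | 64 | 7168 | 2048 | 69 | 2048 | 76 | 3292 | 1.10x |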
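For readers checking the numbers: the Speedup column appears to be the FP8 latency divided by the W4AFP8 latency. That definition is an assumption, not stated in the PR, but it is consistent with the rows above. A minimal sketch using a few latency pairs copied from the table (the `speedup` helper is hypothetical and not part of this PR):

```python
# (w4_us, fp8_us) latency pairs copied from rows of the benchmark table.
rows = [
    (48, 68),    # groups=8,  m/grp=16, n=4096, k=7168
    (62, 72),    # groups=8,  m/grp=56, n=4096, k=7168
    (42, 42),    # groups=8,  m/grp=64, n=7168, k=2048
    (131, 137),  # groups=16, m/grp=56, n=4096, k=7168
]


def speedup(w4_us: float, fp8_us: float) -> float:
    """Assumed definition: FP8 latency divided by W4AFP8 latency."""
    return fp8_us / w4_us


for w4_us, fp8_us in rows:
    print(f"W4 {w4_us:>3} us, FP8 {fp8_us:>3} us -> {speedup(w4_us, fp8_us):.2f}x")
```

Because the printed latencies are rounded to whole microseconds, a ratio recomputed this way can differ slightly from one computed on unrounded timings.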

- Rebase onto latest main (7f2a703)
- Optimize single-wave performance: switch BLOCK_M from 64 to 128
- Fix benchmark first timing incorrectly returning 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>