feat: support bf16 output and plain TMA writes in k_grouped_gemm on SM90; #298

Open
fedorovgv wants to merge 1 commit into deepseek-ai:main from fedorovgv:feat/k_grouped_sm90_bfp16
@fedorovgv

Add two features to SM90 FP8 1D1D k-grouped GEMM:

  • Support plain TMA store as an alternative to atomic accumulation, controlled by the presence of the c tensor
  • Support BF16 output dtype, casting WGMMA FP32 accumulators to BF16 before the TMA store
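
For readers less familiar with the two behaviours, here is a minimal host-side sketch in plain Python. It is not the kernel code from this PR. It only illustrates, on lists of floats, the difference between overwriting an output tile and accumulating into it, and what casting an FP32 accumulator to BF16 does to the bits. The names `fp32_to_bf16_bits`, `bf16_bits_to_float`, `store_tile` and the `accumulate` flag are hypothetical, and round-to-nearest-even is an assumption about the conversion mode. How the presence of `c` maps onto the flag is not modelled here.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, assuming round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: keep the top 16 bits and force a quiet-NaN mantissa bit.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 bit pattern back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def store_tile(out: list, acc: list, accumulate: bool) -> None:
    """Write an accumulator tile into out.

    accumulate=True models atomic accumulation (out += acc);
    accumulate=False models a plain store (out = acc).
    The flag is a hypothetical stand-in for the kernel's real control.
    """
    for i, value in enumerate(acc):
        out[i] = out[i] + value if accumulate else value


if __name__ == "__main__":
    tile = [1.0, 2.0]
    store_tile(tile, [0.5, 0.5], accumulate=True)
    print(tile)
    store_tile(tile, [0.5, 0.5], accumulate=False)
    print(tile)
    print(hex(fp32_to_bf16_bits(1.0)))
```

BF16 keeps the FP32 sign and exponent but only 7 explicit mantissa bits, so the cast discards the low 16 bits of each accumulator after rounding.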
