[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes by jasl · Pull Request #41834 · vllm-project/vllm

jasl · 2026-05-06T15:17:15Z

Summary

This PR enables DeepSeek V4 Flash on SM120/SM121 Blackwell client hardware by carrying the SM12x fallback and tuning stack needed for the current vLLM V1 path. It is intended for RTX PRO 6000 Blackwell Workstation Edition, RTX 5090-class SM120, and GB10 / DGX Spark SM121 users who cannot use SM100-only TMEM / tcgen05 kernels.

Change footprint — model kernels vs. core-vLLM touch points

The branch splits cleanly into model/kernel code and a small set of core-vLLM integration points (116 files, +15.5k/−0.4k, of which ~+3.5k is tests):

DeepSeek-V4 model + SM12x kernels (~49 files, ~+10.2k) — the enablement itself. Everything under vllm/models/deepseek_v4/** plus the SM12x sparse-MLA decode / indexer / DeepGEMM kernels that live in shared dirs (v1/attention/backends/mla/sparse_mla_kernels.py, model_executor/layers/sparse_attn_indexer.py, v1/attention/backends/mla/{indexer,sparse_swa}.py, utils/deep_gemm.py, kernels/mhc/tilelang.py), the new DSv4 reasoning parser / tokenizer, and device tuning JSONs.
Core-vLLM integration (~36 files, +1.8k/−0.2k) — the hooks below. Almost all are gated by model architecture / quant config / an env flag and are inert for other models.

Subsystem	Files (≈ lines)	What it does
KV-cache core	`single_type_kv_cache_manager.py` (+243), `kv_cache_coordinator.py` (+67), `kv_cache_manager.py` (+60), `block_pool.py` (+21), `sched/scheduler.py` (+1)	prefix-cache correctness for DSv4 sparse-MLA + MTP: an MLA cache-manager with prompt-block protection, a hybrid-coordinator `cache_blocks` tail-block-reuse rewrite, stale-hash reset (= upstream #44237)
MTP spec-decode	`v1/spec_decode/llm_base_proposer.py` (+173)	DSv4 MTP probabilistic draft sampling + per-step MTP-layer routing in the shared proposer base
MoE quantization	`fused_moe.py` (+65), `oracle/mxfp4.py` (+43), `routed_experts.py` (+33), `experts/flashinfer_cutlass_moe.py` (+27), `quantization/mxfp4.py` (+12), `oracle/nvfp4.py` (+1)	MXFP4 / NVFP4 backend selection; the one-line NVFP4 fix (FLASHINFER_CUTLASS into the SwiGLU-clamp allow-list) lets DSv4-Flash-NVFP4 serve
FP8 / Marlin GEMM	`quantization/utils/fp8_utils.py` (+99), `linear/scaled_mm/{cutlass,marlin}.py` (+45/+16), `csrc/.../marlin_moe_wna16/ops.cu` (+10, the only C++)	SM12x e8m0→fp32 upcast + Marlin MoE SM12.0a cudagraph hardening (mirrors open upstream #43730 / #43722)
cudagraph / compile / config	`config/vllm.py` (+44), `compilation/breakable_cudagraph.py` (+22), `passes/utility/fix_functionalization.py` (+12), `config/compilation.py` (+11)	breakable-cudagraph auto-enable gate (MiniMax-only; DSv4 deliberately excluded), DSv4 custom-op defunctionalization + splitting-op registration
OpenAI entrypoints / parsers	`chat_completion/protocol.py` (+101), `serve/render/serving.py` (+28), `tool_parsers/structural_tag_registry.py` (+16), `chat_utils.py` (+11), `engine/protocol.py` (+9), `chat_completion/{serving,batch_serving}.py` (+8/+6), `reasoning/__init__.py` (+4)	expose DSv4 API semantics — `reasoning_content` / `thinking` param / tool-call streaming (jasl#19 instruction-following)
Kernel warmup	`model_executor/warmup/kernel_warmup.py` (+617)	additive DSv4 warmup (D512-split prefill precompile + MTP) to avoid JIT-during-inference wedges
Weight loading	`weight_utils.py` (+43), `default_loader.py` (+16)	fast-safetensors weight filter + EP-skip (lowers DSv4 load overhead on GB10)
env / utils	`envs.py` (+63), `utils/flashinfer.py` (+16), `utils/import_utils.py` (+9), `v1/worker/{gpu_model_runner,ubatch_utils}.py` (+12/+12)	`VLLM_DEEPSEEK_V4_*` flags + `has_cutedsl` / `has_flashinfer_trtllm_sparse_mla` probes

Two notes for review:

The most invasive generic edits were removed in the 2026-06-21 audit cleanup (below): the scheduler now carries a single +1-line change (the prefill-fairness heuristics were dropped) and the prefix-cache write-fence is gone.
A few hooks do touch code paths shared with non-DSv4 models and are the ones worth a closer look: the block_pool stale-hash reset (a generic prefix-cache bugfix, upstream #44237, "[Bugfix] Fix linear host RSS growth under sustained classification load with prefix caching (V1)"), the kv_cache_coordinator cache_blocks rewrite (affects hybrid-KV models; validated as no worse than prior behavior), the MTP proposer base-class change, and the OpenAI-entrypoint plumbing. Everything else (MoE oracle, fp8_utils, cudagraph gate, warmup, envs) is arch / quant / env-gated and inert for other models.

Duplicate-work check

Open PR search was refreshed on 2026-06-12 for SM120 / SM12x / DeepSeek V4 / GB10 terms. The nearest open PRs are related but not duplicates:

PR	Difference
#43477	Draft branch for DeepSeek V4 + GLM-5.1 using FlashInfer SM120 sparse MLA and DeepGEMM SM120 / MXFP4 dependency branches. This PR keeps the SM12x fallback/tuning path and validation surface for users who need the current vLLM branch without that external branch stack.
#40929	Earlier WIP Triton fallback effort. This PR is the maintained replacement branch with the broader scheduler, prefix-cache, parser, quant, warmup, and harness-validated fixes carried forward.
#42856	Focused workspace-bound fix that explicitly depends on / references this PR; it is a subset-style bugfix, not the full DeepSeek V4 SM12x enablement branch.

Fixed preview tags

These tags are in jasl/vllm and give users stable pins while the PR is still moving:

Tag	Commit	Use
`sm120-pr-41834-stable-preview-20260621`	`72261a7af149fa5d3fe2ed2b9956e92590731012`	latest validated head: post-audit cleanup — breakable-cudagraph default OFF (MiniMax-only gate), long-context recall fixed by the int64 block-offset cast (the redundant write-fence + scheduler prefill-fairness heuristics + NVFP4 b12x lever removed), on top of the jasl#19 + #45309-revert correctness fixes. Validated metrics-flat on SM120 + SM121.
`sm120-pr-41834-stable-preview-20260620`	`a743ef5dfbd16cad0b9a628773c0c1d1841f1790`	prior head (write-fence / COW recall approach, since superseded by the int64-cast fix)
`sm120-pr-41834-stable-preview-20260612075245`	`f32247a5a695fa8979d61837bf6b87da897dcb7d`	earlier validated rebased PR branch preview
`sm120-pr-41834-fallback-before-replacement-20260612053720`	`5d1584e2de2b3c64540e70dfc370b0211eb6b2fc`	fallback tag for the old PR head before branch replacement

Update 2026-06-21 — post-audit cleanup (latest validated head)

This supersedes the 2026-06-20 and 2026-06-18 heads and the earlier validation data below. An audit of the SM12x branch against current upstream removed redundant, disproven, and experimental deltas; the cleaned head is validated metrics-flat (marginally better) on 2x RTX PRO 6000 Blackwell (SM120) and 2-node GB10 / DGX Spark (SM121), DeepSeek-V4-Flash, fp8 KV, MTP=2. The jasl#19 (instruction-following) and #45309 breakable-cudagraph-garbage revert (#45972) correctness fixes are retained. Five changes:

Breakable-cudagraph stays default OFF (FULL_AND_PIECEWISE). DeepSeek-V4 is deliberately excluded from breakable-cudagraph auto-enable — on real 2x GB10 MTP decode breakable regressed throughput and degraded as output length grew (≈31→19 tok/s at 400→800 max-tokens vs a flat ≈40). The gate is now a single MiniMax-only helper instead of a dead always-False stub; behavior is unchanged. (VLLM_USE_BREAKABLE_CUDAGRAPH=1 still opts in.)
The long-context recall fix is the int64 block-offset cast, not a cache fence. The 2026-06-20 head attributed the MTP high-concurrency recall/garble bug to a missing copy-on-write on writable caches and added a prefix-cache write-completion fence. Further investigation showed that hypothesis was wrong: the actual cause is an int32 overflow of the packed-KV block offset in the SM12x paged-MQA-logits indexer kernels, fixed by an int64 cast (retained). With the int64 fix in place the write fence is redundant — a fence-OFF recall gate holds 8/8 @ conc=8 and 16/16 @ conc=16 (0 miss) on RTX, and 8/8 on GB10. The write-completion fence and the COW broadening paired with it were therefore removed.
Removed the scheduler prefill-fairness heuristics (ungated, generic very-long-prefill / mixed-decode chunk-limiting). They targeted a decode cliff later re-diagnosed as MoE-GEMM + NCCL-all-reduce bound (not schedulable) and were not load-bearing: a cleanup-vs-prior A/B shows an identical mixed prefill/decode fairness ratio (0.716 vs 0.714) and equal inter-chunk latency.
Moved the experimental VLLM_NVFP4_GEMM_BACKEND b12x research lever out of the PR (off-by-default, unused on the shipped path) and dropped a tool-calling-env diff-reflow churn.

Net vs the 2026-06-20 head: 6 files, −833 lines (the removed fence + scheduler heuristics + their tests). The decode/prefill CUDA kernels are byte-identical across the cleanup, so the gated-decode-optimization profile and the 2026-06-12 throughput baselines below are unchanged.
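The int32 overflow behind the recall bug can be illustrated without a GPU. The following is a minimal sketch, not the kernel code: the block stride and block index are hypothetical numbers chosen only to push the product past 2**31 - 1, and `wrap_int32` emulates 32-bit signed arithmetic in plain Python.

```python
def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def block_offset_int32(block_id: int, block_stride: int) -> int:
    # 32-bit multiply: the product wraps once it exceeds 2**31 - 1.
    return wrap_int32(block_id * block_stride)


def block_offset_int64(block_id: int, block_stride: int) -> int:
    # Widening before the multiply keeps the offset exact. Python ints are
    # unbounded, so a plain multiply models the int64 cast here.
    return block_id * block_stride


# Hypothetical values (not taken from the PR).
BLOCK_STRIDE = 150_000
SMALL_BLOCK_ID = 100
LARGE_BLOCK_ID = 20_000

if __name__ == "__main__":
    for block_id in (SMALL_BLOCK_ID, LARGE_BLOCK_ID):
        print(
            block_id,
            block_offset_int32(block_id, BLOCK_STRIDE),
            block_offset_int64(block_id, BLOCK_STRIDE),
        )
```

For small block indices both variants agree, which is consistent with the bug only showing up under long context and high concurrency, where block indices grow large.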

Validation, 2026-06-21

Trivial-prompt generation (cudagraph sanity), both platforms: 2+2 → 4, 7*8 → 56, capital of France → Paris — no garbage.

Default decode path, MTP=2:

Gate	RTX SM120	GB10 SM121
GSM8K strict (8-shot full · 5-shot limit-200 · limit-100)	0.954 (full) · 0.96 (l200)	0.96 (l100)
Long-context recall, fence OFF, conc 8 / 16, MTP2	8/8 + 16/16, 0 miss	8/8
Instruction-following (jasl#19)	pass (JSON-only)	—
tool-call (15-case suite)	87%	—
Scheduler-removal A/B — mixed prefill/decode fairness ratio (cleanup vs prior)	0.716 vs 0.714	—
random 8192×512 TPOT (cleanup vs prior, ms)	6.27 vs 6.5	—
indexed-D512 min-token gate 4096 vs 8192 — prefill @4k	9,687 vs 6,203 tok/s	—

The GB10 SM121 run is a from-scratch 2-node rebuild of the cleaned head (NCCL 2.30.7 re-pinned per node); arithmetic, GSM8K, and the long-context recall gate all pass, confirming the fence removal holds recall on SM121 as well. The recall fix is the int64 cast in the SM12x indexer kernel, so the 2026-06-12 throughput baselines below are unchanged.

llama-benchy (eugr format), GB10 2-node / SM121, MTP=2, prefix-cache on (GB10 MTP decode/prefill profile — unchanged across the 2026-06-21 cleanup, decode kernel byte-identical):

test	t/s	peak t/s	ttfr (ms)
pp2048 (cold)	1205.5 ± 22		1705
tg128 (C=1)	40.0 ± 0.4	45.7
ctx_pp @ d8192	1722.5 ± 5		4762
ctx_tg @ d8192	38.5 ± 1.5	43.3
ctx_pp @ d16384	1674.8 ± 2		9788
ctx_tg @ d16384	39.2 ± 2.3	44.3
ctx_pp @ d32768	1595.3 ± 1		20547
ctx_tg @ d32768	41.6 ± 1.5	46.3

Prefill 1595–1722 tok/s at depth; decode 40 tok/s @ C=1 holding 38–42 out to 32K context (no decode cliff); prefix-cache hit 42–46% under MTP.

Gated SM120 decode optimization (`VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE=1`)

The decode gate uses flashinfer.mla._sparse_mla_sm120 (in FlashInfer main / 0.6.13; absent from the 0.6.12 release). Installing it correctly matters: a bare pip install --upgrade flashinfer-python @ git+main bumps flashinfer-python but leaves a stale flashinfer-cubin / flashinfer-jit-cache, and FlashInfer then raises a version-mismatch error at startup (and re-JITs kernels). Uninstall the precompiled packages first, then upgrade:

pip uninstall -y flashinfer-jit-cache flashinfer-cubin
pip install --upgrade "flashinfer-python @ git+https://github.com/flashinfer-ai/flashinfer.git"

For a reproducible pin instead of tracking moving main, install matching flashinfer-python + flashinfer-cubin nightlies (e.g. 0.6.13.dev20260619, a bit-identical decode kernel to the validated build) — again uninstalling flashinfer-jit-cache first.
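Before starting the server, it may help to confirm that the three FlashInfer packages agree. The sketch below uses only the standard-library `importlib.metadata`; the exact-string comparison is an assumption made for illustration and may be stricter than FlashInfer's own startup check.

```python
from importlib import metadata
from typing import Dict, List, Optional

FLASHINFER_PACKAGES = (
    "flashinfer-python",
    "flashinfer-cubin",
    "flashinfer-jit-cache",
)


def installed_versions(packages=FLASHINFER_PACKAGES) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(versions: Dict[str, Optional[str]]) -> List[str]:
    """List precompiled packages whose version differs from flashinfer-python."""
    core = versions.get("flashinfer-python")
    return [
        name
        for name, version in versions.items()
        if name != "flashinfer-python" and version is not None and version != core
    ]


if __name__ == "__main__":
    found = installed_versions()
    print(found)
    print("stale packages:", find_mismatches(found))
```

An empty "stale packages" list means either the precompiled packages are uninstalled or they match flashinfer-python exactly.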

RTX SM120, decode gate ON vs OFF, ctx0 decode (aggregate tok/s, 0 errors all rows):

C	gate OFF	gate ON	gain
1	189.7	201.4	+6%
2	311.3	334.7	+8%
4	483.1	531.6	+10%
8	707.6	801.9	+13%
16	990.5	1164.7	+18%
32	1545.0	1849.6	+20%
64	2132.5	2814.8	+32%

gate-ON @C64 = 2814.8 tok/s matches the community target (~2815). The decode CUDA kernel is byte-identical across the rebase, so this profile is unchanged.
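The gain column can be re-derived from the two throughput columns. This short sketch copies the numbers from the table above and assumes the gains were rounded to the nearest whole percent.

```python
# (concurrency, gate OFF tok/s, gate ON tok/s), copied from the table above.
ROWS = [
    (1, 189.7, 201.4),
    (2, 311.3, 334.7),
    (4, 483.1, 531.6),
    (8, 707.6, 801.9),
    (16, 990.5, 1164.7),
    (32, 1545.0, 1849.6),
    (64, 2132.5, 2814.8),
]


def gain_percent(off: float, on: float) -> int:
    """Relative throughput gain of gate ON over gate OFF, in whole percent."""
    return round((on / off - 1.0) * 100)


GAINS = [gain_percent(off, on) for _, off, on in ROWS]

if __name__ == "__main__":
    for (concurrency, off, on), gain in zip(ROWS, GAINS):
        print(f"C={concurrency:<3} {off:>7.1f} -> {on:>7.1f}  +{gain}%")
```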

The default path needs no FlashInfer update. With the gate off (default), the import is lazy/gated, so FlashInfer 0.6.12 (official) works unchanged. On GB10 / 2-node, also pin nvidia-nccl-cu13==2.30.7 (a rebuild reverts it; a per-node mismatch hangs the NCCL handshake).

Gated SM120 prefill optimization (`VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1`)

Symmetric to the decode gate, prefill has an opt-in packed FlashInfer sparse-MLA path: VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1 (default off; ~+5–6% single-stream prefill). With it off, prefill defers to the default FlashMLA indexed-D512 path. It routes through the same flashinfer.mla._sparse_mla_sm120 kernels as the decode gate, so it carries the identical FlashInfer version requirement — the install/pin steps above apply unchanged (FlashInfer main / 0.6.13; the default-off path needs no FlashInfer update and runs on 0.6.12). Decode and prefill share one FlashInfer build; there is no separate version to track for prefill.

Branch validation, 2026-06-12

Base and head:

upstream base: 8a91228dbe363d1d113deb2a82e289429130dd01
PR head: f32247a5a695fa8979d61837bf6b87da897dcb7d
branch range: 96 commits over upstream/main

Commands run on the final head:

Command	Result
`git diff --check upstream/main...HEAD`	pass
DCO scan over `upstream/main..HEAD`	pass; every commit has `Signed-off-by`
`VLLM_TARGET_DEVICE=empty .venv/bin/python -m compileall -q vllm/envs.py vllm/model_executor/warmup/kernel_warmup.py vllm/models/deepseek_v4 vllm/v1/core vllm/v1/attention/backends/mla vllm/reasoning/deepseek_v4_reasoning_parser.py tests/test_envs.py tests/v1/core/test_prefix_caching.py tests/v1/core/test_scheduler.py tests/reasoning/test_deepseekv4_reasoning_parser.py tests/quantization/test_sm12x_tuned_config_lookup.py`	pass
`.venv/bin/python -m pytest tests/test_envs.py::test_deepseek_v4_sparse_mla_stats_path_env -q` on the remote vLLM environment	`1 passed, 16 warnings`
`python3 -m pytest tests/test_scripts.py -q` in the public harness	`128 passed in 14.41s`

Local vLLM pytest/ruff were not run on the Mac checkout because its .venv does not currently include torch or ruff. GPU-path validation remains remote SM120/SM121-only.

Latest clean SM120 RTX PRO 6000 x2 data, 2026-06-12

Artifact roots:

artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_short_throughput_mtp_noep_20260612084721
artifacts/codex_pr_stable_preview_f32247a/2x_rtx_pro_6000_sm120/rtx_current_pr_clean_mtp_noep_20260612080629

Short-throughput profile:

TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
max_model_len=131072, gpu_memory_utilization=0.975, max_num_batched_tokens=4096, max_num_seqs=24.
Prefix cache disabled, FULL_AND_PIECEWISE, 80 prompts per concurrency.
Phase exits: server_startup=0, bench_hf_mt_bench=0, bench_random_prefill_sweep=0.
Regression check: output/input throughput ratios are against the previous accepted same-profile EP-off reference; all are above the 0.95 floor.

HF MT-bench, 80 prompts:

C	output tok/s	ratio vs reference	mean TTFT ms	p99 ITL ms	MTP acceptance %
1	180.94	1.009	49.59	13.08	68.36
2	284.53	1.003	70.04	32.35	68.19
4	427.10	0.999	82.70	38.83	68.25
8	600.33	1.005	110.97	86.19	67.91
16	840.46	1.019	156.73	86.50	67.34
24	987.77	1.030	209.05	86.71	68.20

Random prefill sweep, C=1, output length 128, 8 requests per case:

Prompt / output tokens	input tok/s	ratio vs reference	mean TTFT ms	requests
4K / 128	3123.74	0.996	660.21	8 / 8
16K / 128	6209.00	1.005	2030.49	8 / 8
64K / 128	7049.72	0.999	8715.51	8 / 8
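The regression check in the profile notes reduces to comparing each ratio with the 0.95 floor. A minimal sketch using the ratios from the two tables above; the function name is illustrative and is not part of the harness.

```python
FLOOR = 0.95

# Ratios vs the reference, copied from the tables above.
MT_BENCH_RATIOS = {1: 1.009, 2: 1.003, 4: 0.999, 8: 1.005, 16: 1.019, 24: 1.030}
PREFILL_RATIOS = {"4K": 0.996, "16K": 1.005, "64K": 0.999}


def below_floor(ratios: dict, floor: float = FLOOR) -> list:
    """Return the keys whose throughput ratio falls below the floor."""
    return [key for key, ratio in ratios.items() if ratio < floor]


if __name__ == "__main__":
    print("MT-bench regressions:", below_floor(MT_BENCH_RATIOS))
    print("prefill regressions:", below_floor(PREFILL_RATIOS))
```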

Correctness and reliability profile:

TP=2, MTP=2, expert parallel off, FP8 KV, prefix cache disabled, max_model_len=131072, max_num_seqs=4, max_num_batched_tokens=4096.
Phase exits: server_startup=0, bench_hf_mt_bench=0, eval_gsm8k=0, bench_random_prefill_sweep=0, bench_random_8000x1000=0, bench_random_256x256=0.
Post-run current-boot driver scan found no Xid, UVM, NV_ERR, GPU-lost, illegal-access, unspecified-launch, or fatal GPU signals; no vLLM compute processes were left running.

GSM8K 5-shot, limit-200, /v1/completions, MTP=2, concurrency 4:

Metric	Value	Floor	Result
flexible exact match	0.965	0.940	pass
strict exact match	0.940	0.925	pass

Additional 128K-profile random checks:

Shape	C	output tok/s	mean TTFT ms	p99 ITL ms	MTP acceptance %
8K / 1K	1	130.93	1367.03	13.44	52.56
8K / 1K	2	191.19	1586.64	17.44	50.28
8K / 1K	4	260.72	1666.96	199.75	51.76
256 / 256	1	153.07	88.80	13.17	51.46
256 / 256	4	369.86	127.80	84.44	52.50

Latest clean GB10 / SM121 data, 2026-06-12

Artifact root:

artifacts/codex_pr_stable_preview_f32247a/2x_gb10_sm121/gb10_forum53_mtp2_epoff_c2_gmem0685_mml81920/20260612074113

Profile:

TP=2, MTP=2, expert parallel off, FP8 KV, block size 256.
max_model_len=81920, max_num_seqs=2, max_num_batched_tokens=4096, gpu_memory_utilization=0.685.
Prefix cache enabled; Forum #53 C=2 shape: forum53_c2:2:2:3200:256.
This covers the 80K-token prompt case on the final PR head. Failed, interrupted, or driver-signal artifacts are intentionally excluded from this PR body.

Gate result:

Gate	Result
summary `ok`	`true`
`serve_start.exit_code`	`0`
`streaming_pressure.exit_code`	`0`
driver health	`ok=true`, signal count `0`
request failures	`0 / 4`
preemptions	`0`

Timing and runtime summary:

Metric	Value
max prompt tokens	80,127
max TTFT	124.045698 s
max elapsed	124.949141 s
avg inter-chunk latency	0.056711 s
p95 inter-chunk latency	0.064278 s
p99 inter-chunk latency	0.144954 s
max inter-chunk latency	0.144954 s
GPU KV usage avg / max	65.81% / 86.40%
prefix-cache hits / queries	79,872 / 3,444,165

Running the NVFP4 checkpoint

This branch also serves nvidia/DeepSeek-V4-Flash-NVFP4 on SM12x (RTX PRO 6000 / GB10). The NVFP4 MoE auto-selects the FlashInfer CUTLASS backend (the SwiGLU-clamp model gate now accepts it), so no --moe-backend flag is required, and no special FlashInfer build is needed (the 0.6.12 release works):

vllm serve nvidia/DeepSeek-V4-Flash-NVFP4 \
  --trust-remote-code --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --tokenizer-mode deepseek_v4

--kv-cache-dtype fp8 is mandatory: DeepSeek-V4's fp8_ds_mla attention asserts an fp8 KV layout, so the default auto fails at model construction (this is not NVFP4-specific). Expert-parallel off (plain TP) is the supported path.
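For scripted launches, the command above can be assembled in a small wrapper that rejects a non-fp8 KV dtype before vLLM starts. This is a hypothetical helper, not part of vLLM; it only uses the flags shown in the command above, and the strict equality check on "fp8" is an assumption.

```python
import shlex
from typing import List


def build_serve_command(
    model: str,
    tensor_parallel_size: int = 2,
    kv_cache_dtype: str = "fp8",
) -> List[str]:
    """Assemble the `vllm serve` argv used above, failing fast on a non-fp8 KV dtype."""
    if kv_cache_dtype != "fp8":
        # Mirrors the note above: the default "auto" fails at model construction.
        raise ValueError(
            f"DeepSeek-V4 needs --kv-cache-dtype fp8, got {kv_cache_dtype!r}"
        )
    return [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--kv-cache-dtype", kv_cache_dtype,
        "--tokenizer-mode", "deepseek_v4",
    ]


if __name__ == "__main__":
    argv = build_serve_command("nvidia/DeepSeek-V4-Flash-NVFP4")
    print(shlex.join(argv))  # pass argv to subprocess.run(...) to launch
```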

Accuracy matches MXFP4 (GSM8K 8-shot ~0.96 on both SM120 and SM121). Note that on SM12x NVFP4 is not a memory or throughput win versus MXFP4: NVFP4 weights are ~4 GiB/GPU larger (~78 vs ~74 GiB), leaving less KV-cache room (lower max concurrency); single-stream prefill is marginally faster and aggregate decode marginally slower. Its value here is checkpoint availability / parity with the SM100 datacenter path, not an SM12x performance advantage — MXFP4 remains the better practical choice on consumer Blackwell.

AI assistance disclosure

AI assistants, including OpenAI Codex/GPT models and Anthropic Claude models, were used for code review, refactoring support, regression-script writing, and benchmark analysis. The branch was validated through human review plus the commands and harness artifacts listed above.

jasl · 2026-05-06T15:21:50Z

@zyongye
I've cleaned up the old PR, could you help review this one?

gemini-code-assist

Code Review

This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.

chatgpt-codex-connector · 2026-05-06T15:25:09Z

💡 Codex Review

vllm/vllm/model_executor/layers/sparse_attn_indexer.py

Lines 86 to 89 in 9596dbf

    
    def _sparse_indexer_requires_deep_gemm() -> bool:
        return current_platform.is_cuda() and not (
            current_platform.is_device_capability_family(120)
        )

Keep DeepGEMM requirement for SM120 FP4 indexer path

This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (fp8_fp4_*) because the new SM120 fallback only handles q_scale is None (FP8 Q). With use_fp4_cache=True on SM120 and no DeepGEMM installed, construction succeeds and the first prefill/decode call fails at runtime with the DeepGEMM _missing() error instead of being rejected up front.
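One way to address this finding is to make the helper depend on the cache format as well as the device family. The sketch below is illustrative only: `FakePlatform` is a hypothetical stand-in for vLLM's `current_platform`, and the `use_fp4_cache` parameter on the helper is an assumed signature change, not the PR's actual fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakePlatform:
    """Hypothetical stand-in for vLLM's current_platform (illustration only)."""

    cuda: bool
    capability_major: int  # e.g. 12 for SM120/SM121, 10 for SM100

    def is_cuda(self) -> bool:
        return self.cuda

    def is_device_capability_family(self, family: int) -> bool:
        # Simplified model: family 120 matches any SM12x device.
        return self.capability_major == family // 10


def sparse_indexer_requires_deep_gemm(platform, use_fp4_cache: bool) -> bool:
    """Require DeepGEMM unless the SM12x fallback actually covers this config."""
    if not platform.is_cuda():
        return False
    if platform.is_device_capability_family(120) and not use_fp4_cache:
        # The SM12x fallback handles the FP8-Q path only (q_scale is None).
        return False
    return True


if __name__ == "__main__":
    sm120 = FakePlatform(cuda=True, capability_major=12)
    print(sparse_indexer_requires_deep_gemm(sm120, use_fp4_cache=False))
    print(sparse_indexer_requires_deep_gemm(sm120, use_fp4_cache=True))
```

With this shape, an SM120 run with `use_fp4_cache=True` and no DeepGEMM would be rejected at construction instead of failing on the first prefill or decode call.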

vllm/vllm/model_executor/model_loader/default_loader.py

Lines 236 to 240 in 9596dbf

    
    if self.load_config.load_format == "fastsafetensors":
        weights_iterator = fastsafetensors_weights_iterator(
            hf_weights_files,
            self.load_config.use_tqdm_on_load,
        )

Propagate weight_name_filter to fast safetensor loaders

The new pre-load weight_name_filter is only wired into safetensors_weights_iterator; this branch still loads all tensors for fastsafetensors (and similarly other non-default safetensor iterators), so skipped tensors are still materialized. For DeepSeek V4 this defeats the intended early skip of MTP weights and can reintroduce high transient memory use/OOM when these load formats are enabled.
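The point of the finding is that the filter has to run before a tensor is read, not after. A minimal sketch of that ordering follows; `iter_filtered_weights`, `load_tensor`, and the `mtp.` prefix are hypothetical names, and the convention that the filter returns True for names to keep is an assumption.

```python
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple


def iter_filtered_weights(
    names: Iterable[str],
    load_tensor: Callable[[str], Any],
    weight_name_filter: Optional[Callable[[str], bool]] = None,
) -> Iterator[Tuple[str, Any]]:
    """Yield (name, tensor) pairs, loading only the names the filter accepts."""
    for name in names:
        if weight_name_filter is not None and not weight_name_filter(name):
            # Skipped before load_tensor runs, so nothing is materialized.
            continue
        yield name, load_tensor(name)


if __name__ == "__main__":
    loaded = []

    def fake_load(name: str) -> bytes:
        loaded.append(name)
        return b"\x00" * 4

    all_names = ["model.layers.0.weight", "mtp.0.weight", "model.norm.weight"]
    kept = dict(
        iter_filtered_weights(all_names, fake_load, lambda n: not n.startswith("mtp."))
    )
    print(sorted(kept), loaded)
```

Wrapping an iterator that has already read every tensor would produce the same output names but would not avoid the transient memory use the review describes.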

…-project#45061) Replace the fixed PREFILL_CHUNK_SIZE chunking + batch-wide workspace bound in the SM12x sparse-MLA prefill path with upstream's adaptive get_prefill_chunk_plan (vllm-project#45061): pack as many requests as fit the workspace-area bound per chunk and allocate the kv workspace per-chunk at this chunk's compressed+gather width (chunk_M) instead of the batch-wide worst case. Preserves the SM12x indexed-D512 split/chunked prefill paths and the Triton sparse-MLA dispatch. Signed-off-by: jasl <jasl9187@hotmail.com>

The rebase onto upstream/main surfaced 9 mypy errors in the sparse-MLA decode kernels where decode_swa_lens / decode_swa_indices / seq_lens (typed torch.Tensor | None) are indexed without a None-guard. They were uncaught because git rebase skips pre-commit hooks. The fields are unconditionally populated when num_decode_tokens > 0 (the only path that reaches the decode kernels), so assert-guard them. Signed-off-by: jasl <jasl9187@hotmail.com>

git rebase replays commits without running pre-commit hooks, so the rebased branch carried pre-existing type-safety gaps and ruff-format drift in our changed files (surfaced under the newer upstream config).

Fixes:
- kernel_warmup: drop the phantom _disable_sparse_mla_prefill_stats reference (the symbol was never defined anywhere; the stats-disable wrapper collapses to a direct warmup call) and type: ignore the intentional SimpleNamespace warmup batch for apply_grammar_bitmask
- single_type_kv_cache_manager: access MLA-spec subtype attributes via getattr (mypy-safe, identical runtime)
- llm_base_proposer: assert the draft temperature tensor is non-None (it is used unconditionally below)
- fused_moe: annotate config_file_paths: list[str]
- test_deepseek_v4_mega_moe: annotate the mixed-type calls list
- ruff-format drift across the touched files

No runtime behavior change except the warmup phantom, which previously raised ImportError when VLLM_DEEPSEEK_V4_SPARSE_MLA_STATS_PATH was set.

Signed-off-by: jasl <jasl9187@hotmail.com>

A dev-branch sparse-MLA stats diagnostic leaked into the PR: the only reader was the phantom _disable_sparse_mla_prefill_stats warmup wrapper (removed in the prior cleanup, as it referenced a never-defined symbol). With that gone the env has no production reader, so remove its envs.py declaration + lookup and the dedicated test that exercised it. Signed-off-by: jasl <jasl9187@hotmail.com>

Wire FlashInfer PR3395's packed SM120 sparse-MLA decode kernel into the DeepSeek V4 FlashMLA attention as an env-gated decode override (default off). The kernel ships in official flashinfer >= 0.6.13; we drive it through its low-level _SparseMLAPagedAttentionRunner rather than the public trtllm_batch_decode_sparse_mla_dsv4 wrapper.

Root cause of the C8-C64 ctx0 decode gap versus the FlashMLA decode path is this decode kernel (the prior PR3395 reintegration ported only the packed prefill). Holding everything else fixed (MARLIN MoE, packed fp8_ds_mla cache, source tree, MTP2) and swapping only the decode kernel lifts ctx0 decode throughput on dual RTX PRO 6000 / SM120, in128/out512:

C	default (Triton)	this	delta
8	542	582	+7%
16	771	833	+8%
32	790	981	+24%
64	1345	1683	+25%

GSM8K 5-shot limit-300 is correctness-neutral (flexible 0.953 / strict 0.927, matching the MXFP4 baseline).

Why the low-level runner and not the public wrapper: the wrapper's _sparse_mla_decode_workspace returns no scratch when num_tokens > 64, so it allocates mid_out/mid_lse (hundreds of MB) fresh on every decode step. The MTP multi-query decode shape routinely exceeds 64 tokens (C32/C64), making the wrapper a regression (-17 to -20% vs the FlashMLA path). The runner instead takes graph-stable mid_out/mid_lse reserved once from the vLLM workspace manager and reused every step; that cached scratch is the entire win (a decode-shaped autotune pass over the kernel's chunks_per_block tactic added 0% on top, so it is not included).

- New DeepseekV4FlashInferSM120Attention(DeepseekV4FlashMLAAttention) overrides only _forward_decode; reuses the packed cache, sparse-index metadata, and packed prefill. The compressed decode index is forced contiguous (the kernel asserts it).
- Gated by VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE, SM12x, and has_flashinfer_trtllm_sparse_mla_dsv4(); default off keeps the FlashMLA decode path byte-for-byte.

Signed-off-by: jasl <jasl9187@hotmail.com>

… startup The first long prefill JIT-compiled the D512-split sparse-MLA prefill Triton kernels mid-engine-step (~20s), parking EngineCore in shm_broadcast and surfacing as a "sample_tokens RPC timed out" wedge under concurrency. Pre-compile them during the DeepSeek-V4 sparse-MLA warmup over the complete 128-aligned combined_topk specialization set [256..1152], gated by a new env VLLM_DEEPSEEK_V4_INDEXED_D512_SPLIT_PREFILL_WARMUP (default on). Synthetic throwaway tensors only (no workspace manager use, no state leak); cleanly no-ops when the split path is unreachable. No inference-path behavior change. Signed-off-by: jasl <jasl9187@hotmail.com>
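The specialization set named in this commit message, 128-aligned values from 256 to 1152, can be enumerated directly. The function name below is illustrative; the warmup code may build the set differently.

```python
from typing import List


def combined_topk_specializations(
    low: int = 256, high: int = 1152, align: int = 128
) -> List[int]:
    """Enumerate the aligned combined_topk widths in the closed range [low, high]."""
    if low % align or high % align:
        raise ValueError("bounds must be multiples of the alignment")
    return list(range(low, high + 1, align))


if __name__ == "__main__":
    widths = combined_topk_specializations()
    print(len(widths), widths)
```

Each width corresponds to one kernel specialization to compile ahead of time, so the default range yields eight warmup compilations per kernel.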

Port the dropped "Align DeepSeek V4 API semantics" layer (a874655, on ds4-sm120-full but never on the PR line) onto the PR head 73e99c1:
- top-level `thinking` request field (DeepSeek OpenAI-compat) -> chat-template enable_thinking, via apply_chat_template_kwargs at the protocol boundary;
- bare DeepSeek-V4 requests (no thinking key) now default thinking ON, matching ds4-sm120-full (the PR line currently defaults them OFF);
- deepseek_v4_sampling_override (apply DeepSeek's official sampling defaults when thinking is enabled; per-request opt-out);
- reasoning_content alias on ChatMessage/DeltaMessage; prefix/wo_eos message fields; tool-call empty-arguments robustness in the DSv4 tokenizer.

Addresses the common bare-request instruction-following regression in #19: preview-dev defaulted bare requests to thinking OFF, so the model answered directly and prepended explanatory prose despite "output ONLY a JSON array"; full defaulted bare -> thinking ON, reasoned, then complied. (The reporter's EXPLICIT enable_thinking=false case is a separate residual softness, not fixed by this change.)

Cherry-picked from a874655; conflicts resolved by keeping the June PR's evolved code (build_chat_params reasoning_effort handling; the dedicated DeepSeekV4ReasoningParser) and layering the DSv4 semantics on top (serving._effective_chat_template_kwargs applies apply_chat_template_kwargs after build_chat_params).

Signed-off-by: jasl <jasl9187@hotmail.com>
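The request-side behavior described in this commit message can be sketched as a small resolver. Everything here is an assumption made for illustration: the function name, the `{"type": "enabled"}` shape of the `thinking` field, and the choice to let an explicit `chat_template_kwargs.enable_thinking` win over the top-level field. Only the "bare request defaults to thinking ON" rule comes from the commit message.

```python
from typing import Any, Mapping


def resolve_enable_thinking(request: Mapping[str, Any]) -> bool:
    """Decide the chat-template enable_thinking value for one request."""
    template_kwargs = request.get("chat_template_kwargs") or {}
    if "enable_thinking" in template_kwargs:
        # Assumed precedence: an explicit template kwarg wins.
        return bool(template_kwargs["enable_thinking"])

    thinking = request.get("thinking")
    if thinking is None:
        # Bare request (no thinking key): default ON, per the commit message.
        return True
    if isinstance(thinking, Mapping):
        # Assumed field shape: {"type": "enabled"} or {"type": "disabled"}.
        return thinking.get("type") == "enabled"
    return bool(thinking)


if __name__ == "__main__":
    print(resolve_enable_thinking({}))
    print(resolve_enable_thinking({"thinking": {"type": "disabled"}}))
```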

…tion top_k_per_row_prefill writes its output as a contiguous [M, select_k] buffer (it receives the logits strides, not the output's). The indexer passes out[:, :select_k], which is non-contiguous whenever the compressed-KV count is below the top-k width -- i.e. for short prompts and the early queries of long prompts. Writing it as contiguous silently corrupts the later rows' top-k (all -1), so the C4A sparse-MLA prefill drops the distant/downsampled context and attends only the recent sliding window. This degrades instruction following (returns prose instead of the requested JSON) and garbles long-context generation under concurrent traffic. Hand the op a contiguous work buffer and copy the result back; this is a no-op when the slice is already contiguous (select_k == top-k width), so behavior is unchanged outside the corrupted case. The chunked long-context path was already safe (stride-aware torch.topk / torch.gather). Signed-off-by: jasl <jasl9187@hotmail.com>

The non-contiguous-output fix used selected.contiguous(), which copies the slice's current contents (the -1 placeholders just written by out.fill_(-1)) into the work buffer. top_k_per_row_prefill then overwrites every element, so that copy is wasted. Allocate an uninitialized contiguous buffer via selected.new_empty(selected.shape) instead; behavior is unchanged (copy-back still lands the result in the strided slice), one elementwise pass saved on the short-prompt / early-query path. Signed-off-by: jasl <jasl9187@hotmail.com>
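The corruption and its fix can be reproduced with plain Python lists standing in for tensor storage. This is a simplified model with made-up values, not the indexer code: `write_contiguous` plays the role of an op that ignores the output's row stride, and the fixed path writes into a contiguous work buffer and copies back with the real stride.

```python
from typing import List


def write_contiguous(storage: list, rows: List[List[int]], width: int) -> None:
    """Model of an op that writes [M, width] output assuming a contiguous layout."""
    for r, row in enumerate(rows):
        for c in range(width):
            storage[r * width + c] = row[c]


def read_strided(storage: list, num_rows: int, row_stride: int, width: int) -> list:
    """Read the out[:, :width] view of storage whose rows are row_stride apart."""
    return [storage[r * row_stride : r * row_stride + width] for r in range(num_rows)]


def topk_buggy(results: List[List[int]], topk_width: int) -> list:
    select_k = len(results[0])
    out = [-1] * (len(results) * topk_width)  # out.fill_(-1)
    write_contiguous(out, results, select_k)  # ignores out's row stride
    return read_strided(out, len(results), topk_width, select_k)


def topk_fixed(results: List[List[int]], topk_width: int) -> list:
    select_k = len(results[0])
    out = [-1] * (len(results) * topk_width)
    work = [None] * (len(results) * select_k)  # contiguous scratch, uninitialized
    write_contiguous(work, results, select_k)
    for r in range(len(results)):  # stride-aware copy-back into the slice
        out[r * topk_width : r * topk_width + select_k] = work[
            r * select_k : (r + 1) * select_k
        ]
    return read_strided(out, len(results), topk_width, select_k)


RESULTS = [[10, 11], [20, 21], [30, 31]]  # made-up per-row top-k indices

if __name__ == "__main__":
    print("buggy:", topk_buggy(RESULTS, topk_width=4))
    print("fixed:", topk_fixed(RESULTS, topk_width=4))
```

When `topk_width` equals `select_k` the slice is already contiguous and both paths return the same rows, matching the commit's note that behavior is unchanged outside the corrupted case.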

The SM121 carve-out in _should_auto_enable_deepseek_v4_breakable_cudagraph was added because breakable cudagraph produced garbage on trivial prompts on SM121. That was upstream vllm-project#45309 (reduced eager_break_during_capture), reverted upstream in vllm-project#45972 and now in our base. With the full @eager_break_during_capture split restored, breakable cudagraph generates correctly on SM121 again -- verified on 2x GB10 (EP off): "2+2等于几" ("what is 2+2") and arithmetic clean, throughput on-par-or-slightly-better than FULL_AND_PIECEWISE (38.5 vs 36.7 out_tok/s). Drop the carve-out so breakable auto-enables for DeepSeek-V4 on all SM12x platforms. Signed-off-by: jasl <jasl9187@hotmail.com>

Restore ruff cleanliness after the upstream rebase: the indexer top-k fallback contiguity guard exceeded the 88-char line limit (E501). Signed-off-by: jasl <jasl9187@hotmail.com>

…r spec-decode The DSv4 sparse-MLA decode is per-token (token_to_req_indices -> per-request block_table/topk gather). Captured into a FULL monolithic cudagraph under speculative decoding (MTP) it cross-contaminates concurrent requests (long-context high-concurrency gibberish) and, for the q=1 draft forward, collapses deep-context recall. Keep cudagraph_mode=FULL_AND_PIECEWISE but eager-break the DSv4 attention out of the FULL graph whenever spec-decode is active (draft + verify); the nested indexer then runs eagerly too. Non-spec single-token decode and all non-DSv4 ops keep FULL capture (zero default-path regression). GPU-validated on 2xRTX SM120: forcing the DSv4 attention to eager-break under FULL recovered no-MTP parity vs gibberish when fully captured. Signed-off-by: jasl <jasl9187@hotmail.com>

A cached block was registered cache-shareable at schedule time (for total_computed_tokens + num_new_tokens, i.e. including this step's not-yet-forwarded tokens), with no barrier proving the forward that writes its KV has retired. Under >=3 concurrent identical-prefix requests a sibling could bind a recent-region block whose write was still in flight; for DeepSeek-V4, whose SWA + C4/C128 + compressor-state groups are byte-packed into one physical page, this persistently committed a corrupted block to the shared prefix cache, dropping the most-recent long-context needle (long-context high-concurrency recall failure).

Add a write-completion fence: tag each block with the schedule pass at which it was committed, and only hand a cached block to OTHER requests once a forward has retired past that pass (two decoupled counters: schedule_pass advances at schedule() start, retired_forward at update_from_output). Safe cross-step prefix hits are preserved; only the unsafe in-flight intra-pass hand-off is withheld. Gated by VLLM_PREFIX_CACHE_WRITE_FENCE (default on).

GPU-validated 2xRTX SM120: arthur 280-line conc=8 recall ~20% -> ~91% (MTP2) / ~78% -> ~98% (no-MTP), prefix-cache hit rate preserved (~87% vs ~90%); GSM8K-200 0.97 + issue19 JSON-only PASS (no correctness regression); GB10 SM121 2-node non-regressive + serves cleanly.

Signed-off-by: jasl <jasl9187@hotmail.com>
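The two-counter fence described in this commit message (later removed in the 2026-06-21 head, as noted in the PR description) can be sketched as follows. The class and method names are hypothetical, and having `update_from_output` receive the pass id of the completed forward is an assumption about how the counter was advanced.

```python
from typing import Dict, Tuple


class WriteFence:
    """Withhold a cached block from other requests until its write has retired."""

    def __init__(self) -> None:
        self.schedule_pass = 0    # advances at the start of each schedule()
        self.retired_forward = 0  # highest pass whose forward has completed
        self._committed: Dict[int, Tuple[int, str]] = {}  # block -> (pass, owner)

    def begin_schedule(self) -> int:
        self.schedule_pass += 1
        return self.schedule_pass

    def commit(self, block_id: int, owner: str) -> None:
        # Tag the block with the pass in which it became cache-shareable.
        self._committed[block_id] = (self.schedule_pass, owner)

    def update_from_output(self, completed_pass: int) -> None:
        self.retired_forward = max(self.retired_forward, completed_pass)

    def can_share(self, block_id: int, requester: str) -> bool:
        committed_pass, owner = self._committed[block_id]
        if requester == owner:
            return True  # the writer can always use its own block
        return self.retired_forward >= committed_pass


if __name__ == "__main__":
    fence = WriteFence()
    current = fence.begin_schedule()
    fence.commit(block_id=7, owner="req-A")
    print(fence.can_share(7, "req-B"))  # write still in flight
    fence.update_from_output(current)
    print(fence.can_share(7, "req-B"))  # forward retired
```

Blocks committed in earlier, already-retired passes stay shareable, which is how cross-step prefix hits are preserved while the in-flight hand-off is withheld.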

…e-cudagraph auto-enable)

Breakable cudagraph was auto-enabled for DeepSeek-V4 on the assumption it was "on-par-or-faster than FULL_AND_PIECEWISE". That is wrong: breakable mode disables the torch.compile pipeline (equivalent to -O.mode=none) and runs attention eagerly every decode step, so single-stream MTP decode is 1.5-3.8x SLOWER and degrades with output length.

Measured on both arches. RTX PRO 6000 (SM120), single-stream MTP2 decode tok/s, breakable on vs off:

- max_tokens 400: ~103 vs ~160
- max_tokens 600: ~55 vs ~175
- max_tokens 800: ~45 vs ~172

2x GB10 (SM121) shows the same on/off split (community report on vllm-project#41834).

FULL_AND_PIECEWISE + torch.compile is correct (GSM8K-200 0.96, bare prompts clean: 2+2->4, capital->Paris) and faster, so make it the default for both SM120 and SM121. Breakable stays available via VLLM_USE_BREAKABLE_CUDAGRAPH=1 for the MTP + long-context + high-concurrency garbled-output workaround (which then also engages the spec-decode attention eager-break). The prefix-cache write-fence and the spec-break fix are unaffected; the latter is now opt-in with breakable.

Signed-off-by: jasl <jasl9187@hotmail.com>
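
The 1.5-3.8x range quoted in this commit message can be reproduced from the RTX PRO 6000 measurements it lists. A minimal check, using only those tok/s values (the variable names are illustrative):

```python
# Single-stream MTP2 decode tok/s on RTX PRO 6000 (SM120), copied from the
# commit message: max_tokens -> (breakable on, breakable off).
measured = {400: (103, 160), 600: (55, 175), 800: (45, 172)}

# Slowdown of breakable mode relative to FULL_AND_PIECEWISE + torch.compile.
slowdown = {n: off / on for n, (on, off) in measured.items()}

for n, ratio in slowdown.items():
    print(f"max_tokens={n}: breakable is {ratio:.2f}x slower")
```

The ratio grows with `max_tokens`, which matches the message's remark that breakable decode degrades with output length.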

…ups under spec decode

Under MTP/spec decode, _annotate_eagle_groups_deepseek_v4 flags only the MTP draft layer's group as an eagle group, so only it receives drop_eagle_block (the COW that gives spec decode a private final block). DeepSeek-V4's other writable cache groups -- notably the sliding-window caches (SWA + compressor state, SlidingWindowMLASpec), which are rewritten every decode step -- kept a final block that is physically SHARED across concurrent identical prefix-hit requests. Under MTP that shared block is written while sibling requests read it, corrupting the KV the sparse-MLA decode consumes and producing ~9% needle-drop / mixed-script gibberish under long-context concurrent traffic (reporter: arthur, PR vllm-project#41834).

Fix: in HybridKVCacheCoordinator.find_longest_cache_hit and find_longest_cache_hit_per_group, extend drop_eagle_block to fire for writable compressed (compress_ratio>1) and sliding-window groups whenever spec decode is active (eagle_group_ids non-empty), not only the MTP group. The sliding-window drop is the load-bearing part; compressed (C4/C128) is included for completeness (all DeepSeek-V4 writable groups get a private final block under spec). Perf-neutral: one extra private final block per writable group under spec.

Validated: RTX SM120 TP=2 warm-full-hit conc=8 MTP2 ~9%->0% recall-miss (0.4% conc=16); GB10 SM121 2-node 64/64; GSM8K-200 0.96-0.975; decode 170-180 tok/s (no regression); bare prompts clean.

Signed-off-by: jasl <jasl9187@hotmail.com>

@4096

…-tunable) Re-add VLLM_DEEPSEEK_V4_INDEXED_D512_SPLIT_PREFILL_MIN_TOKENS (dropped in the upstream rebase, then left hard-coded at 8192) and default it to 4096. With max-num-batched-tokens=4096 this admits the first ~8192-token region (the early chunks) of every long prefill to the fast indexed-D512 path instead of the slow fallback. Measured on 2x RTX PRO 6000 (SM120): +57%@4096, +28%@8192, decaying to +9%@24K (gain proportional to 1/num_chunks); GSM8K-200 strict 0.965 (clean) and KV-cache-neutral. Set =8192 to restore the prior threshold. Signed-off-by: jasl <jasl9187@hotmail.com>

Reverts a743ef5. The broadened drop_eagle_block (copy-on-write the shared final block for all DSv4 compressed/sliding-window cache groups under spec decode) fixed the long-context + high-concurrency recall/garble bug, but it drops nearly all prefix-cache reuse under MTP, not just the contended block. On 2-node GB10 the prefix-cache hit rate was 0% with it (warm request = full re-prefill) vs 33-71% without; a same-build recall-fix ON/OFF x MTP ON/OFF matrix plus a revert test isolated it definitively. That is a broad perf regression for every multi-turn / agentic MTP request, traded for a narrow recall corner case. Reverting to restore caching; reopening the recall bug pending a surgical fix (drop only the genuinely spec-written block). Signed-off-by: jasl <jasl.lightsworn@gmail.com>

The post-rebase indexer kv_cache block stride grew (~1039680, a strided slice of the fused KV block); block_idx * stride overflowed int32 in the SM120 paged MQA-logits Triton kernel for block_idx beyond ~2065, giving an illegal memory access on SM120 and silent garbage on SM121 under long-context / multi-request indexer calls. Cast block_idx to int64 at the K and scale gather sites in _fp8_paged_mqa_logits_rowwise_kernel. Signed-off-by: jasl <jasl.lightsworn@gmail.com>
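
The "~2065" threshold in this commit message follows from dividing the int32 range by the quoted stride. A small sketch of the arithmetic, emulating int32 wraparound on plain Python integers (the stride is the approximate value from the message; `wrap_int32` is an illustrative helper, not part of vLLM or Triton):

```python
INT32_MAX = 2**31 - 1
BLOCK_STRIDE = 1_039_680  # approximate indexer KV block stride from the commit message


def wrap_int32(value: int) -> int:
    """Emulate two's-complement int32 wraparound on a Python int."""
    return (value + 2**31) % 2**32 - 2**31


# Largest block index whose offset still fits in int32.
max_safe_block_idx = INT32_MAX // BLOCK_STRIDE

first_bad = max_safe_block_idx + 1
offset_int64 = first_bad * BLOCK_STRIDE  # what a 64-bit multiply yields
offset_int32 = wrap_int32(offset_int64)  # what a 32-bit multiply yields

print(max_safe_block_idx, offset_int64, offset_int32)
```

The wrapped offset is negative, so the gather lands outside the intended block, which is consistent with the illegal memory access and garbage reads the message reports.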

…eek-V4 Add a _forward_prefill override driving the SM120 packed sparse-MLA runner, gated by VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL (default off; defers to the FlashMLA indexed-D512 prefill when off). Rebase the indexer's batch-global compressed top-k to per-request-local before the per-request block-table map, and slice the query to the real prefill-token count to stay consistent under padded / MTP-draft batches. ~+5-6% single-stream prefill, flat with concurrency vs the FlashMLA prefill path. Signed-off-by: jasl <jasl.lightsworn@gmail.com>

…ombine combine_topk_swa_indices maps a compressed position p of the k-th in-chunk request to gathered slot p + M*k, which is only correct for request-local p; the indexer writes batch-global (cu_seqlen_ks) positions, so non-first prefill requests indexed past their gathered slot and read stale workspace (latent C4A multi-request prefill correctness bug). Rebase to per-request-local before combine. No-op at num_prefills == 1. Signed-off-by: jasl <jasl.lightsworn@gmail.com>
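
The slot mapping `p + M*k` and the effect of skipping the rebase can be illustrated on plain lists. This is a sketch under stated assumptions: the function names, the `request_starts` offsets (standing in for the per-request starts that the batch-global positions are measured from), the treatment of negative entries as padding, and the sizes are all hypothetical and not taken from the PR's code.

```python
def rebase_topk_to_request_local(global_topk, request_starts):
    """Subtract each request's start offset from its batch-global positions.

    global_topk[k]    : batch-global compressed positions picked for request k
    request_starts[k] : batch-global offset where request k's positions begin
    Negative entries are treated as padding and left untouched (an assumption).
    """
    return [
        [p - start if p >= 0 else p for p in row]
        for row, start in zip(global_topk, request_starts)
    ]


def gathered_slots(topk, m):
    """Map position p of the k-th request to gathered slot p + m * k."""
    return [
        [p + m * k if p >= 0 else p for p in row]
        for k, row in enumerate(topk)
    ]


# Two prefill requests, M = 8 gathered slots per request (hypothetical sizes).
M = 8
starts = [0, 6]  # request 1 begins at batch-global position 6
global_topk = [[0, 3, 4], [6, 12, -1]]

local_topk = rebase_topk_to_request_local(global_topk, starts)
slots_rebased = gathered_slots(local_topk, M)
slots_unrebased = gathered_slots(global_topk, M)
print(slots_rebased)
print(slots_unrebased)
```

With the rebase, request 1 stays inside its own slot range (8 to 15 here); without it, one index lands past the gathered region. The first request is unaffected either way, matching the "no-op at num_prefills == 1" note.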

…ernel too 197d21e cast block_idx to int64 in _fp8_paged_mqa_logits_rowwise_kernel but left _fp8_paged_mqa_logits_kernel (the non-rowwise variant) unfixed: its `block_idx * stride_sb` (scale) and `block_idx[:, :, None] * stride_kvb` (KV) still overflow int32 for the post-rebase packed-KV block stride (~1039680) at higher block ids. conc=1 exercises the rowwise kernel (already clean) but conc=8 hits this one and hard-crashes (Xid 31 MMU fault, SM121 / IMA on SM120). Cast both block-offset multiplies to int64 to match the rowwise kernel. Validated on GB10 2-node (SM121, MTP2): conc=8 long-context coherence 68/68 (0 gibberish, 0 recall-miss; was a hard CUDA crash), GSM8K-200 0.965. Signed-off-by: jasl <jasl9187@hotmail.com>

jasl · 2026-06-20T07:28:22Z

@eugr done

If you are running a cutting-edge FlashInfer build, you can enable:

VLLM_DEEPSEEK_V4_FLASHINFER_SM120_DECODE=1
VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1

jasl · 2026-06-20T07:41:24Z

This round (rebased onto current upstream/main) — highlights

- Long-context garble on SM12x fixed (packed-KV int32 overflow). The fused KV layout on main makes the indexer KV block stride large; the SM12x paged-MQA-logits kernels computed block_idx * stride in int32, which overflows past 2³¹ at high block ids, giving an IMA on SM120 and wrong-memory garble on SM121. Both kernels (rowwise and non-rowwise) now compute the block offset in int64. (conc=1 only exercised the rowwise path; conc=8 exposed the non-rowwise one.)
- FlashMLA prefill combine, per-request compressed top-k: fixes a latent multi-request prefill correctness bug (non-first prefill requests read stale workspace; no-op at a single prefill).
- New opt-in prefill switch VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL=1 (default off). Routes prefill through the packed FlashInfer sparse-MLA SM120 path; defers to the default FlashMLA indexed-D512 prefill when off. It shares the decode gate's _sparse_mla_sm120 kernels, so it has the same FlashInfer version requirement (main / 0.6.13): one shared build, no separate version for prefill.

Validation (GB10 2-node / SM121, MTP=2): long-context coherence 68/68 (8-concurrent), GSM8K-200 0.965, prefix-cache restored (42–46% hit). llama-benchy: prefill 1595–1722 tok/s at depth, decode 40 tok/s @ C=1 holding 38–42 out to 32K.

Prefill perf — and how the packed switch relates to the default INDEXED_D512_SPLIT_PREFILL_MIN_TOKENS=4096 gate. The default FlashMLA prefill is unchanged this round (the fixes above are correctness). The packed switch and that 4096 gate are mutually-exclusive routes, not additive: enabling the packed switch bypasses the FlashMLA indexed-D512 split path and its 4096 gate entirely (an if/else over the same sparse-MLA prefill). So the trade-off is context-dependent:

- Short/medium prefill (4–8k): the packed +7.5% (2026-06-17 A/B @ 8k, N=10) was measured against a baseline using the old 8192 gate, where an 8k prefill did not take the split path. The current 4096 gate already routes that band to the split kernel (its own +9–59%, biggest at 4–8k), so it likely captures most of the packed gain, and gate-4096 may win here. Net packed-over-gate-4096 is unproven; it needs a same-build packed-ON vs gate-4096-default A/B.
- Long context (≥64k → 256K+): the 4096 gate is moot (any long prefill clears it, so the FlashMLA baseline is already the split kernel), so the comparison comes down purely to kernel quality. There the packed kernel shows a genuine, growing gain: ~+10% over the split at 64k (N=12) vs +7.5% at 8k, and the fork's hand-tuned kernel widens to +47% @ 64k vs +27% @ 8k (headroom). So the packed switch is fundamentally a long-context lever and its value should be judged at long context; we don't yet have a same-build >256K number.

Net: default-off is the right call until a same-build A/B — at long context specifically — shows the packed kernel beats the gate-4096 FlashMLA route.
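
The "mutually-exclusive routes" relationship described above can be sketched as a small decision function. This is an illustration, not the PR's code: the function name, the returned labels, the `>=` comparison, and the idea that the gate is checked against a per-chunk token count are assumptions. Only the two environment variable names and their defaults (off, 4096) come from this thread.

```python
import os


def choose_prefill_route(chunk_tokens: int, env=None) -> str:
    """Return which sparse-MLA prefill route a chunk would take (sketch)."""
    env = os.environ if env is None else env

    # Opt-in packed FlashInfer route (default off). When enabled it bypasses
    # the FlashMLA split path and its token gate entirely.
    if env.get("VLLM_DEEPSEEK_V4_FLASHINFER_SM120_PREFILL", "0") == "1":
        return "packed_flashinfer_sm120"

    # Default FlashMLA route, guarded by the split gate (default 4096).
    min_tokens = int(
        env.get("VLLM_DEEPSEEK_V4_INDEXED_D512_SPLIT_PREFILL_MIN_TOKENS", "4096")
    )
    if chunk_tokens >= min_tokens:  # assumed comparison
        return "flashmla_indexed_d512_split"
    return "flashmla_fallback"


print(choose_prefill_route(4096, env={}))
```

Under these assumptions, raising the gate back to 8192 sends a 4096-token chunk to the fallback, while setting the packed switch makes the gate value irrelevant.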

…_STRICT_TOOL_CALLING Cosmetic reflow churn from 59c7918 on an upstream-owned env; restore byte-identical to base 0fbf42a. No behavior change.

…eal MiniMax-only gate The DSv4-out behavior (default FULL_AND_PIECEWISE, 1.5-3.8x faster MTP decode, measured RTX/SM120 + GB10/SM121) is unchanged. Replaces the dead always-False _should_auto_enable_deepseek_v4_breakable_cudagraph stub + unused DEEPSEEK_V4_CUDAGRAPH_ARCHITECTURES frozenset + misleading SM120/SM121 comment with a single meaningful _should_auto_enable_breakable_cudagraph(model_config) that returns True only for the MiniMax M3 architectures (upstream's auto-enable set minus DSv4). Test upgraded from tautological always-False asserts to observable behavior: DSv4 off, MiniMax on, others off.

Reverts 99a9f10 (whose actual content is solely the gemm-backend env + _NVFP4_BACKEND_TO_KERNEL force-map, not the modelopt-routing its title names). The DSv4-Flash shipped path does not use the flashinfer-b12x NVFP4 route; this env was only a research lever (and the sole working way to reach b12x, since FlashInferB12xNvFp4LinearKernel is excluded from auto-selection and --linear-backend flashinfer_b12x is filtered out). Preserved verbatim on backup/min-enable-88ec-pre-audit-20260620 for future NVFP4-backend experiments; restore the env+map from that commit to re-enable b12x A/B.

… + tests

Removes the 9-commit very-long-prefill starvation / mixed-decode-prefill chunk limiting family (a8bdc00, 129e129, a962bf1[sched part], 1059c81, ad26f8f, db3a71f, 6dac492, 52c549e, 574905a[sched part]) from scheduler.py: 9 helper methods + 3 call sites; restores the deleted blank line and the original 'assert num_new_tokens > 0'. Drops the 11 tautological fairness tests. Ungated generic-vLLM tuning aimed at a cliff re-diagnosed as MoE-GEMM + NCCL-all-reduce bound (config knobs proven dead on GB10) plus a phantom wedge; never required for correctness.

Preserved (verified standalone, not fairness-coupled): the write-fence hooks (20e1472, kept pending the fence-OFF recall gate), max_num_seqs + DSv4 MLA prefix-retention (fde655c), the a962bf1 adaptive BLOCK_M kernel tuning in sm12x_mqa.py, and the 574905a record_stats param + its test_prefix_cache_peek_does_not_record_stats (tests kept kv_cache_manager stats-suppression behavior).

Net: scheduler.py == base except the 3 KEEP hunks; test_scheduler.py == base except the peek test.

Needs RTX long-ctx-concurrency + GSM8K + toolcall-15 no-regression revalidation.

…ds recall) GPU-validated 2026-06-21 on the int64-fixed 88ec build: the fence-OFF recall gate (VLLM_PREFIX_CACHE_WRITE_FENCE=0) holds arthur long-context coherence 8/8 at conc=8 and 16/16 at conc=16 (MTP2, 0 miss) -- exercising exactly the >=3 concurrent-identical-prefix in-flight hand-off window the fence guarded. So the write fence is redundant: the int64 block-offset overflow fix (197d21e + 88ec87e) is the real long-context recall fix; the fence was built on the disproven shared-write/COW theory and its 06-19 commit-message recall claim (~20%->91%) was masking the then-unfixed int64 bug. Reverts 20e1472 (committed_step/schedule_pass/retired_forward clocks, get_one_block_retired, the default-on env, and the two scheduler hooks). scheduler.py is now identical to base except the fde655c max_num_seqs prefix-retention arg.

DerAndereAndi · 2026-06-20T19:03:50Z

Activating flashinfer decode and prefill shows ~40% prefill performance improvement, nice!

| model             |                 test |      t/s (total) |        t/s (req) |     peak t/s |   peak t/s (req) |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:------------------|---------------------:|-----------------:|-----------------:|-------------:|-----------------:|-------------------:|-------------------:|-------------------:|
| deepseek-v4-flash |          pp2048 (c1) |  1917.05 ± 20.09 |  1917.05 ± 20.09 |              |                  |    1186.91 ± 11.12 |    1068.42 ± 11.12 |    1186.91 ± 11.12 |
| deepseek-v4-flash |           tg512 (c1) |     39.04 ± 0.27 |     39.04 ± 0.27 | 47.33 ± 1.25 |     47.33 ± 1.25 |                    |                    |                    |
| deepseek-v4-flash |          pp2048 (c2) | 1779.90 ± 109.81 | 1291.89 ± 472.44 |              |                  |   1896.37 ± 530.09 |   1777.89 ± 530.09 |   1896.37 ± 530.09 |
| deepseek-v4-flash |           tg512 (c2) |     57.78 ± 4.22 |     31.95 ± 1.89 | 74.33 ± 1.70 |     40.83 ± 3.02 |                    |                    |                    |
| deepseek-v4-flash |  pp2048 @ d4096 (c1) |   2023.77 ± 6.66 |   2023.77 ± 6.66 |              |                  |     3154.43 ± 9.97 |     3035.95 ± 9.97 |     3154.43 ± 9.97 |
| deepseek-v4-flash |   tg512 @ d4096 (c1) |     37.53 ± 2.43 |     37.53 ± 2.43 | 45.67 ± 4.78 |     45.67 ± 4.78 |                    |                    |                    |
| deepseek-v4-flash |  pp2048 @ d4096 (c2) |   1946.30 ± 1.72 | 1511.55 ± 519.76 |              |                  |  4728.23 ± 1585.09 |  4609.75 ± 1585.09 |  4728.23 ± 1585.09 |
| deepseek-v4-flash |   tg512 @ d4096 (c2) |     44.47 ± 5.77 |     26.47 ± 4.92 | 68.00 ± 3.56 |     37.67 ± 3.77 |                    |                    |                    |
| deepseek-v4-flash |  pp2048 @ d8192 (c1) |   1942.51 ± 4.12 |   1942.51 ± 4.12 |              |                  |    5390.03 ± 11.17 |    5271.55 ± 11.17 |    5390.03 ± 11.17 |
| deepseek-v4-flash |   tg512 @ d8192 (c1) |     37.63 ± 2.34 |     37.63 ± 2.34 | 44.67 ± 2.05 |     44.67 ± 2.05 |                    |                    |                    |
| deepseek-v4-flash |  pp2048 @ d8192 (c2) |   1892.56 ± 1.56 | 1449.92 ± 493.14 |              |                  |  8104.78 ± 2716.24 |  7986.30 ± 2716.24 |  8104.78 ± 2716.24 |
| deepseek-v4-flash |   tg512 @ d8192 (c2) |     44.71 ± 1.73 |     27.54 ± 4.13 | 73.67 ± 1.70 |     39.33 ± 2.75 |                    |                    |                    |
| deepseek-v4-flash | pp2048 @ d16384 (c1) |   1895.43 ± 1.40 |   1895.43 ± 1.40 |              |                  |     9842.92 ± 7.17 |     9724.43 ± 7.17 |     9842.92 ± 7.17 |
| deepseek-v4-flash |  tg512 @ d16384 (c1) |     37.61 ± 2.46 |     37.61 ± 2.46 | 45.00 ± 2.16 |     45.00 ± 2.16 |                    |                    |                    |
| deepseek-v4-flash | pp2048 @ d16384 (c2) |   1829.18 ± 1.30 | 1165.41 ± 245.39 |              |                  | 16668.08 ± 3484.76 | 16549.60 ± 3484.76 | 16668.08 ± 3484.76 |
| deepseek-v4-flash |  tg512 @ d16384 (c2) |     36.84 ± 5.33 |     26.17 ± 4.65 | 67.67 ± 4.71 |     38.17 ± 4.88 |                    |                    |                    |
| deepseek-v4-flash | pp2048 @ d32768 (c1) |   1800.17 ± 1.23 |   1800.17 ± 1.23 |              |                  |   19458.87 ± 13.27 |   19340.38 ± 13.27 |   19458.87 ± 13.27 |
| deepseek-v4-flash |  tg512 @ d32768 (c1) |     34.09 ± 0.58 |     34.09 ± 0.58 | 42.33 ± 2.05 |     42.33 ± 2.05 |                    |                    |                    |
| deepseek-v4-flash | pp2048 @ d32768 (c2) |  1723.90 ± 18.63 | 1200.41 ± 335.99 |              |                  | 31588.38 ± 8813.53 | 31469.90 ± 8813.53 | 31588.38 ± 8813.53 |
| deepseek-v4-flash |  tg512 @ d32768 (c2) |     25.51 ± 5.12 |    23.32 ± 14.26 | 63.33 ± 9.74 |    34.83 ± 14.87 |                    |                    |                    |

brianmiller · 2026-06-20T19:17:25Z

GB10 TP=2 Benchmark — `88ec87e` (June 20)

Dual NVIDIA GB10 (Grace Blackwell, SM121), TP=2 over 200GbE RoCE, CUDA graphs FULL_AND_PIECEWISE, MTP=2, --kv-cache-dtype fp8 --block-size 256 --max-num-seqs 4 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.85.

Short-context (512 tok output, 5 requests per concurrency)

| Concurrency | tok/s | Mean latency (ms) |
|:------------|------:|------------------:|
| C=1         | 40.56 |            12,528 |
| C=2         | 56.51 |            15,477 |
| C=4         | 69.59 |            22,155 |

Long-context (121K prompt tokens, warm prefix cache)

| Test                      |           Time |       tok/s |
|:--------------------------|---------------:|------------:|
| 121K prefill C=1 (cached) |          5.5 s |       35.30 |
| 121K cached C=2           | 5.5 s / 11.0 s | 33.6 / 18.0 |

vs previous build (`5d1584e`, June 10)

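| Metric              |    `5d1584e` |    `88ec87e` |       Delta |
|:--------------------|-------------:|-------------:|------------:|
| C=1 (tok/s)         |        42.50 |        40.56 |       −4.6% |
| C=2 (tok/s)         |        56.63 |        56.51 |       −0.2% |
| C=4 (tok/s)         |        72.31 |        69.59 |       −3.8% |
| 121K cached prefill |        9.2 s |        5.5 s |        −40% |
| 121K cached C=2     | 9.2 / 11.1 s | 5.5 / 11.0 s | −40% / −1% |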

Short-context decode is slightly slower (~3-5%), but prefix cache performance improved dramatically — 40% faster on cached 121K prompts. The MTP+prefix cache stability fixes (COW shared blocks, write-completion fence, int64 block offsets, eager-break cudagraph fix) are all working well. No crashes across the full benchmark suite.

Stable, running in production. Thank you @jasl!

jasl requested review from 22quinn, LucasWilkinson, MatthewBonanni, ProExpertProg, WoosukKwon, hmellor, houseroad, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and youkaichao as code owners May 6, 2026 15:17

claude Bot reviewed May 6, 2026

mergify Bot added deepseek Related to DeepSeek models nvidia v1 labels May 6, 2026

github-project-automation Bot added this to NVIDIA May 6, 2026

jasl mentioned this pull request May 6, 2026

[DSv4][Nvidia] SM12x DeepSeek V4 support #40991

Closed

gemini-code-assist Bot reviewed May 6, 2026

jasl changed the title ~~[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash~~ [New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes May 6, 2026

jasl requested review from ApostaC, alexm-redhat, heheda12345, njhill, orozery and ywang96 as code owners May 6, 2026 15:54

jasl force-pushed the codex/ds4-sm120-min-enable branch from 042e366 to df2e6f8 Compare May 6, 2026 16:26

jasl and others added 21 commits June 20, 2026 08:22

chore: wrap SM12x indexer fallback line under ruff line-length

3476a25

Restore ruff cleanliness after the upstream rebase: the indexer top-k fallback contiguity guard exceeded the 88-char line limit (E501). Signed-off-by: jasl <jasl9187@hotmail.com>

jasl added 5 commits June 20, 2026 22:16

chore(sm12x-audit): restore upstream multi-line form for VLLM_ENFORCE…

3f42d92

…_STRICT_TOOL_CALLING Cosmetic reflow churn from 59c7918 on an upstream-owned env; restore byte-identical to base 0fbf42a. No behavior change.

jasl wants to merge 126 commits into vllm-project:main from jasl:codex/ds4-sm120-min-enable

20 participants
