[DeepSeek] Default to top-k calibration with peer-max input amax sync
Previously DeepSeek PTQ forced every token through every MoE expert during
calibration via CalibMoe. This doubled the calibration forward pass and
exposed cold-routing experts to outliers they would never see at inference,
inflating their input_quantizer.amax.
The default now uses native top-k routing and runs fixup_moe_expert_amax after
mtq.quantize. For each (MoE layer, linear) pair (w1/w2/w3), every expert's
input_quantizer.amax is synced to the per-layer global peer max
(dist.all_reduce(MAX) across EP ranks). weight_quantizer.amax stays
per-expert; uncalibrated experts fall back to a compute path over the
dequantized FP8 weight.
The previous behavior is preserved behind --calib_all_experts.
Also write mtq.print_quant_summary output to <output_path>/.quant_summary.txt
to mirror llm_ptq/hf_ptq.py.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
CHANGELOG.rst (1 addition & 0 deletions)
@@ -18,6 +18,7 @@ Changelog

 **New Features**

 - Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.
+- DeepSeek PTQ (``examples/deepseek/ptq.py``) now defaults to native top-k calibration with post-hoc per-layer peer-max sync of expert ``input_quantizer.amax``; the all-experts path is preserved behind ``--calib_all_experts``.