
Commit 5de5541

cleanup recipes
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
1 parent b5c5331 commit 5de5541

3 files changed: 13 additions & 17 deletions

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml

Lines changed: 4 additions & 5 deletions
@@ -31,14 +31,16 @@
 # values. This keeps the FP8 static-scale path but uses a coarser candidate set.
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj
-    FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj
+    FP8 per-tensor; FP8 KV cache; everything else (lm_head/MTP/Latent MOE) stays BF16. Weight-MSE calibration with stride-4 FP8 scale sweep.
 quantize:
   algorithm:
     method: mse
     fp8_scale_sweep: true
     fp8_scale_sweep_stride: 4
   quant_cfg:
+    # Disable all layers by default so that these layers stay in their original BF16 precision:
+    # lm_head, output projection, MoE routers/gates, Latent MOE, MTP head, mamba conv1d.
     - quantizer_name: '*'
       enable: false

@@ -130,6 +132,3 @@ quantize:
       enable: true
       cfg:
         num_bits: e4m3
-
-# Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-# SSM state / mamba conv1d stay FP16.
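
The comment added above documents the "deny by default" pattern these recipes rely on: the wildcard entry disables every quantizer, and later, more specific entries re-enable just the layers that should be quantized. A minimal sketch of that shape, assembled from the quantizer names visible in these hunks; the NVFP4 cfg body and the KV-cache quantizer name do not appear in this diff, so those parts below are illustrative placeholders, not the recipe's actual values:

quant_cfg:
  # Deny by default: anything not re-enabled below stays BF16.
  - quantizer_name: '*'
    enable: false
  # Re-enable the routed MoE experts (NVFP4 W4A4 settings elided; see the full recipe).
  - quantizer_name: '*mixer.experts.*weight_quantizer'
    enable: true
  # Hypothetical KV-cache entry; the real quantizer name is truncated out of this diff.
  - quantizer_name: '*kv_cache_quantizer'
    enable: true
    cfg:
      num_bits: e4m3

Presumably the narrower entries override the broad disable, which is why the wildcard rule comes first in each recipe.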

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml

Lines changed: 5 additions & 8 deletions
@@ -28,21 +28,21 @@
 # - Latent MOE (fc1_latent_proj, fc2_latent_proj): BF16 (not quantized)
 # - SSM cache: FP32 (can be set to FP16 in VLLM)
 #
-# Calibration: amax/max calibration comparison variant. This skips MSE weight
-# scale search and uses max calibration for enabled quantizers.
+# Calibration: amax/max calibration comparison variant
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj
-    FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Amax calibration comparison variant.
+  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj
+    FP8 per-tensor; FP8 KV cache; everything else (lm_head/MTP/Latent MOE) stays BF16. Amax calibration comparison variant.
 quantize:
   algorithm:
     method: max
   quant_cfg:
+    # Disable all layers by default so that these layers stay in their original BF16 precision:
+    # lm_head, output projection, MoE routers/gates, Latent MOE, MTP head, mamba conv1d.
     - quantizer_name: '*'
       enable: false

     # MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
-    # Max/amax calibration uses dynamic block scales for both weight and activation.
     # HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
     - quantizer_name: '*mixer.experts.*weight_quantizer'
       enable: true

@@ -129,6 +129,3 @@ quantize:
       enable: true
       cfg:
         num_bits: e4m3
-
-# Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-# SSM state / mamba conv1d stay FP16.
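
For readers comparing the two calibration modes: max/amax calibration tracks the largest observed magnitude and derives a static scale directly from it, while the MSE recipes additionally search over candidate scales for the one that minimizes weight quantization error. For an FP8 e4m3 tensor the amax-derived scale follows from the fact that 448 is the largest finite e4m3 magnitude:

    scale = amax / 448

The fp8_scale_sweep recipes presumably evaluate a grid of candidate scales around this value; the stride-4 variant thins that grid to a coarser candidate set, trading search time for resolution.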

modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml

Lines changed: 4 additions & 4 deletions
@@ -32,15 +32,15 @@
 # are also chosen via MSE search instead of plain amax).
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj
-    FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj
+    FP8 per-tensor; FP8 KV cache; everything else (lm_head/MTP/Latent MOE) stays BF16. Weight-MSE calibration with FP8 scale sweep.
 quantize:
   algorithm:
     method: mse
     fp8_scale_sweep: true
   quant_cfg:
-    # Disable all layers by default so that these layers stay in their original precision: BF16/FP32:
-    # lm_head, output projection, MoE routers/gates, MTP head, SSM state, mamba conv1d.
+    # Disable all layers by default so that these layers stay in their original BF16 precision:
+    # lm_head, output projection, MoE routers/gates, Latent MOE, MTP head, mamba conv1d.
     - quantizer_name: '*'
       enable: false
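
Taken together, the three recipes now share an identical quant_cfg skeleton and differ only in their algorithm block. A compact comparison, reconstructed from the hunks above:

# super-nvfp4.yaml: weight-MSE calibration with a full FP8 scale sweep
algorithm:
  method: mse
  fp8_scale_sweep: true

# super-nvfp4-fp8-sweep-stride4.yaml: same search, coarser candidate set
algorithm:
  method: mse
  fp8_scale_sweep: true
  fp8_scale_sweep_stride: 4

# super-nvfp4-max-calib.yaml: plain amax/max calibration, no sweep
algorithm:
  method: max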
