
Commit 42482b1

Add nvfp4_omlp_only config and simplify the config.py (#973)
### What does this PR do?

**Type of change:** new feature

1. Add the `nvfp4_omlp_only` config (= `nvfp4_mlp_only` + `o_proj` quantization).
2. Add block-sparse MoE to the MLP-only config.
3. Simplify config.py.
4. Update the README in llm_ptq to mention these two configs for better accuracy.

### Usage

`huggingface_example.sh ... --quant nvfp4_omlp_only`

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and that your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other source, did you follow the IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added the `nvfp4_omlp_only` quantization format for NVFP4, enabling selective quantization of MLP and output projection layers while preserving attention QKV projection accuracy.
* **Changed**
  * `pass_through_bwd` now defaults to True; set it to False if using STE with zeroed outlier gradients for better QAT accuracy.
* **Documentation**
  * Updated post-training quantization guidance with NVFP4-specific configuration recommendations and usage examples.

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
1 parent dd16a96 commit 42482b1

8 files changed

Lines changed: 104 additions & 181 deletions
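
Beyond the shell usage above, a minimal Python sketch of applying the new config directly; the `model` and `forward_loop` arguments are placeholders, and the `mtq.quantize` call mirrors the one already shown in the llm_ptq README:

```python
import modelopt.torch.quantization as mtq


def quantize_omlp_only(model, forward_loop):
    # nvfp4_omlp_only == nvfp4_mlp_only + o_proj quantization: MLP/MoE layers and the
    # attention output projection get NVFP4, while the QKV projections stay unquantized.
    return mtq.quantize(model, mtq.NVFP4_OMLP_ONLY_CFG, forward_loop)
```

Keeping QKV in higher precision is what preserves accuracy; quantizing `o_proj` on top of the MLP layers recovers a bit more compression than `nvfp4_mlp_only` alone.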


CHANGELOG.rst

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ NVIDIA Model Optimizer Changelog
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Enable PTQ workflow for Qwen3.5 MoE models.
+- Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
+- ``pass_through_bwd`` in the quantization config now defaults to True. Set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.

 **Misc**
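
For the `pass_through_bwd` change, a minimal sketch of restoring the old behavior for QAT; the `*weight_quantizer` / `*input_quantizer` key names are an assumption (they do not appear in this diff), so verify them against your modelopt version:

```python
import copy

import modelopt.torch.quantization as mtq

# Start from the stock NVFP4 config and flip pass_through_bwd back to False
# (STE with zeroed outlier gradients), which the changelog suggests for QAT accuracy.
qat_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in ("*weight_quantizer", "*input_quantizer"):  # assumed key names
    quantizer_cfg = qat_cfg["quant_cfg"].get(pattern)
    if isinstance(quantizer_cfg, dict):
        quantizer_cfg["pass_through_bwd"] = False

# model = mtq.quantize(model, qat_cfg, forward_loop)
```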

examples/llm_ptq/README.md

Lines changed: 6 additions & 4 deletions
@@ -69,6 +69,8 @@ def forward_loop(model):
 model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
 ```

+> *For higher NVFP4 PTQ accuracy, we recommend using `mtq.NVFP4_MLP_ONLY_CFG` or `mtq.NVFP4_OMLP_ONLY_CFG` instead of `mtq.NVFP4_DEFAULT_CFG`. `NVFP4_MLP_ONLY_CFG` applies NVFP4 quantization to MLP (and MoE) layers, leaving attention layers unquantized. `NVFP4_OMLP_ONLY_CFG` additionally quantizes the `o_proj` layer. Both preserve accuracy in the sensitive attention QKV projections while still providing significant compression.*
+
 ### 2. Export Quantized Model

 Once your model is quantized, you can now export that model to a checkpoint for easy deployment. \
@@ -126,7 +128,7 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http
 > *<sup>7.</sup>[PTQ for DeepSeek](../deepseek/README.md)* \
 > *<sup>8.</sup>GLM-4.7 has MTP (Multi-Token Prediction) layers that are automatically loaded and excluded from quantization.*

-> *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try either modifying [hf_ptq.py](./hf_ptq.py) and disabling the KV cache quantization or using the [QAT](./../llm_qat/README.md) instead.*
+> *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try either modifying [hf_ptq.py](./hf_ptq.py) and disabling the KV cache quantization or using the [QAT](./../llm_qat/README.md) instead. For NVFP4 quantization specifically, we recommend `nvfp4_mlp_only` or `nvfp4_omlp_only` to achieve higher accuracy by restricting quantization to the MLP layers (and optionally the `o_proj` layer) while keeping the attention QKV projections unquantized.*

 > You can also create your own custom config using [this](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide.
@@ -144,7 +146,7 @@ For LLM models like [Llama-3](https://huggingface.co/meta-llama):
 # Install model specific pip dependencies if needed

 export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
-scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
+scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|nvfp4_mlp_only|nvfp4_omlp_only|int8_sq|int4_awq|w4a8_awq] --tp [1|2|4|8]
 ```

 > *By default `trust_remote_code` is set to false. Please turn it on if model calibration and eval requires it using `--trust_remote_code`.*
@@ -295,7 +297,7 @@ accelerate launch --config_file fsdp2.yaml \
     --fsdp_transformer_layer_cls_to_wrap=<decoder_layer_name>
     multinode_ptq.py \
     --pyt_ckpt_path <path_to_model> \
-    --qformat <fp8/nvfp4/nvfp4_awq/int8> \
+    --qformat <fp8/nvfp4/nvfp4_mlp_only/nvfp4_omlp_only/nvfp4_awq/int8> \
     --kv_cache_qformat <fp8/nvfp4/nvfp4_affine/none> \
     --batch_size <calib_batch_size> \
     --calib_size <num_calib_samples> \
@@ -460,4 +462,4 @@ There are many quantization schemes supported in the example scripts:

 1. The W4A8 AWQ is an extension of the INT4 AWQ quantization that it also uses FP8 for activation for more speed up and acceleration.

-1. The [NVFP4](https://blogs.nvidia.com/blog/generative-ai-studio-ces-geforce-rtx-50-series/) is one of the new FP4 formats supported by NVIDIA Blackwell GPU and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights as well as activations, providing the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell.
+1. The [NVFP4](https://blogs.nvidia.com/blog/generative-ai-studio-ces-geforce-rtx-50-series/) is one of the new FP4 formats supported by NVIDIA Blackwell GPU and demonstrates good accuracy compared with other 4-bit alternatives. NVFP4 can be applied to both model weights as well as activations, providing the potential for both a significant increase in math throughput and reductions in memory footprint and memory bandwidth usage compared to the FP8 data format on Blackwell. For higher accuracy with NVFP4 PTQ, we recommend `nvfp4_mlp_only` or `nvfp4_omlp_only`. `nvfp4_mlp_only` restricts NVFP4 quantization to MLP (and MoE) layers only, leaving attention layers in higher precision. `nvfp4_omlp_only` extends this by also quantizing the `o_proj` layer, providing a middle ground between full NVFP4 and MLP-only quantization.
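
To see exactly which layer patterns differ between the two configs, a small inspection sketch; it assumes the configs use the `{"quant_cfg": {pattern: {...}}}` layout seen in `example_utils.py`, and how exclusions are expressed may differ between modelopt versions:

```python
import modelopt.torch.quantization as mtq


def pattern_entries(cfg):
    # Map each wildcard pattern in the config to its quantizer settings.
    return {k: v for k, v in cfg["quant_cfg"].items() if isinstance(v, dict)}


mlp_only = pattern_entries(mtq.NVFP4_MLP_ONLY_CFG)
omlp_only = pattern_entries(mtq.NVFP4_OMLP_ONLY_CFG)

# Print the patterns whose settings differ between the two configs;
# per this PR the difference should involve the attention o_proj layer.
for pattern in sorted(set(mlp_only) | set(omlp_only)):
    if mlp_only.get(pattern) != omlp_only.get(pattern):
        print(pattern, mlp_only.get(pattern), omlp_only.get(pattern))
```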

examples/llm_ptq/example_utils.py

Lines changed: 0 additions & 12 deletions
@@ -258,18 +258,6 @@ def build_quant_cfg(
         quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
         quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}

-    if model_type in ["qwen3moe", "qwen3next", "minimax"] and qformat == "nvfp4":
-        # Disable the attention projection layers to retain accuracy
-        quant_cfg["quant_cfg"]["model*.*attn*in_proj*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["model*.*attn*q_proj*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["model*.*attn*k_proj*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["model*.*attn*v_proj*"] = {"enable": False}
-
-    if model_type == "deepseek":
-        # Disable MLA quantization for accuracy.
-        quant_cfg["quant_cfg"]["*self_attn.q*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*self_attn.kv*"] = {"enable": False}
-
     return quant_cfg
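
The removed model-specific branches are superseded by the new formats, but the same wildcard-disable mechanism remains available if you need custom exclusions; a minimal sketch (the patterns below are illustrative, not taken from this commit):

```python
import copy

import modelopt.torch.quantization as mtq

# Reproduce the effect of the deleted branches by hand: copy a base config and
# disable selected attention projections via wildcard patterns.
quant_cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in ("*self_attn*q_proj*", "*self_attn*k_proj*", "*self_attn*v_proj*"):
    quant_cfg["quant_cfg"][pattern] = {"enable": False}

# model = mtq.quantize(model, quant_cfg, forward_loop)
```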

examples/llm_ptq/hf_ptq.py

Lines changed: 3 additions & 0 deletions
@@ -89,6 +89,7 @@
     "w4a8_nvfp4_fp8": mtq.W4A8_NVFP4_FP8_CFG,
     "w4a8_mxfp4_fp8": mtq.W4A8_MXFP4_FP8_CFG,
     "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
+    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
     "nvfp4_svdquant": mtq.NVFP4_SVDQUANT_DEFAULT_CFG,
     "mxfp8": mtq.MXFP8_DEFAULT_CFG,
 }
@@ -254,6 +255,7 @@ def auto_quantize(
             "fp8_pb_wo",
             "w4a8_mxfp4_fp8",
             "nvfp4_mlp_only",
+            "nvfp4_omlp_only",
             "mxfp8",
         ]
         for args.qformat in qformat_list
@@ -909,6 +911,7 @@ def quantize_main(
             "fp8_pb_wo",
             "w4a8_mxfp4_fp8",
             "nvfp4_mlp_only",
+            "nvfp4_omlp_only",
             "mxfp8",
         ]
         or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
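
The first hunk extends the string-to-config mapping that `hf_ptq.py` uses to resolve `--qformat`. A sketch of the same lookup pattern (the dict and function names here are illustrative; only the dict entries appear in this diff):

```python
import modelopt.torch.quantization as mtq

# Illustrative name-to-config table in the style of hf_ptq.py.
QFORMAT_TO_CFG = {
    "nvfp4": mtq.NVFP4_DEFAULT_CFG,
    "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG,
    "nvfp4_omlp_only": mtq.NVFP4_OMLP_ONLY_CFG,
}


def resolve_quant_cfg(qformat: str):
    # Fail fast with the supported names, mirroring the script's validation.
    if qformat not in QFORMAT_TO_CFG:
        raise ValueError(f"Unknown quant format: {qformat!r}; expected one of {sorted(QFORMAT_TO_CFG)}")
    return QFORMAT_TO_CFG[qformat]
```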

examples/llm_ptq/scripts/huggingface_example.sh

Lines changed: 2 additions & 2 deletions
@@ -53,9 +53,9 @@ esac
 IFS=","
 for qformat in $QFORMAT; do
     case $qformat in
-        fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8 | nvfp4_mlp_only | nvfp4_svdquant | mxfp8) ;;
+        fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8 | nvfp4_mlp_only | nvfp4_omlp_only | nvfp4_svdquant | mxfp8) ;;
         *)
-            echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8, nvfp4_mlp_only, nvfp4_svdquant, mxfp8]" >&2
+            echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8, nvfp4_mlp_only, nvfp4_omlp_only, nvfp4_svdquant, mxfp8]" >&2
             exit 1
             ;;
     esac
