
Commit 168cd82

Add qwen3 moe experts only test (#1274)
## Summary

- Add a unit test for Qwen3 MoE HF export with the `NVFP4_EXPERTS_ONLY_CFG` quantization config.
- Verifies that `hf_quant_config.json` correctly reports `quant_algo: NVFP4` and that non-expert modules (`self_attn`, `lm_head`) appear in `exclude_modules` while routed expert layers (`mlp.experts.*`) do not.
- Reference: https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4/blob/main/hf_quant_config.json

Type of change: New tests

### Known issue

On `transformers>=5.0`, fused MoE experts (`_QuantFusedExperts`) are not recognized by `get_quant_config`, causing `quant_algo=None` in the exported config. This test currently **fails** on transformers 5.x and is intended to be fixed by a follow-up change.

## Testing

- **transformers 4.57.6**: PASSED
- **transformers 5.5.4**: FAILED (`quant_algo` is `None` due to the fused-expert export gap)

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ✅
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

## Summary by CodeRabbit

- **Tests**
  - Added GPU test coverage for exporting Qwen3 Mixture-of-Experts models with NVFP4 quantization.
  - Verifies that the exported checkpoint records the NVFP4 quantization algorithm and that module exclusion patterns correctly exclude attention and LM-head components while not excluding routed expert paths.

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
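For context, a minimal sketch of the `hf_quant_config.json` shape the test asserts, written as a Python dict. Only `quant_algo` and `exclude_modules` are checked by the test; the concrete pattern strings below are assumptions for illustration, not the exporter's actual output:

```python
# Minimal sketch of the expected hf_quant_config.json contents, as a Python
# dict. The pattern strings are assumed; the real values are produced by
# export_hf_checkpoint when writing the checkpoint.
expected = {
    "quantization": {
        "quant_algo": "NVFP4",
        "exclude_modules": [
            "*self_attn*",  # attention projections stay unquantized (assumed pattern)
            "lm_head",      # LM head stays unquantized
            # no entry here may match "model.layers.*.mlp.experts.*"
        ],
    }
}
```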
Parent: 3ad4f4f

1 file changed: tests/gpu/torch/export/test_export.py (54 additions, 0 deletions)

```diff
@@ -13,6 +13,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import json
+from fnmatch import fnmatch
+
 import pytest
 import torch
 from _test_utils.torch.export.utils import (
@@ -29,6 +32,7 @@
     partial_nvfp4_config,
     partial_w4a8_config,
 )
+from _test_utils.torch.transformers_models import get_tiny_qwen3_moe
 
 import modelopt.torch.quantization as mtq
 from modelopt.torch.export.model_config import (
@@ -53,13 +57,15 @@
     postprocess_state_dict,
     process_layer_quant_config,
 )
+from modelopt.torch.export.unified_export_hf import export_hf_checkpoint
 from modelopt.torch.quantization.config import (
     FP8_DEFAULT_CFG,
     INT4_AWQ_CFG,
     INT8_SMOOTHQUANT_CFG,
     INT8_WEIGHT_ONLY_CFG,
     NVFP4_AWQ_LITE_CFG,
     NVFP4_DEFAULT_CFG,
+    NVFP4_EXPERTS_ONLY_CFG,
     W4A8_AWQ_BETA_CFG,
 )
 from modelopt.torch.quantization.nn import SequentialQuantizer, TensorQuantizer
@@ -466,3 +472,51 @@ def test_get_quant_config(config, expected):
     mtq.quantize(model, config, lambda x: x(torch.randn(1, 4, 10, device="cuda")))
     quant_config = get_quant_config(model)
     assert quant_config["quantization"] == expected
+
+
+def test_qwen3_moe_nvfp4_experts_only_export_exclude_modules(tmp_path):
+    """Test that NVFP4_EXPERTS_ONLY_CFG correctly excludes non-expert modules in HF export.
+
+    For a Qwen3 MoE model, only routed expert layers (mlp.experts.*) should be quantized.
+    Attention layers and lm_head should appear in the exported hf_quant_config.json
+    exclude_modules.
+
+    Reference: https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4/blob/main/hf_quant_config.json
+    """
+    model = get_tiny_qwen3_moe().to("cuda")
+    # from_config doesn't set architectures; export code requires it
+    model.config.architectures = ["Qwen3MoeForCausalLM"]
+
+    # Quantize with NVFP4_EXPERTS_ONLY_CFG (targets only *mlp.experts* patterns)
+    dummy_inputs = {k: v.to("cuda") for k, v in model.dummy_inputs.items()}
+    mtq.quantize(model, NVFP4_EXPERTS_ONLY_CFG, lambda m: m(**dummy_inputs))
+
+    # Export
+    export_dir = tmp_path / "qwen3_moe_nvfp4_experts_only"
+    export_hf_checkpoint(model, export_dir=export_dir)
+
+    # Load the generated hf_quant_config.json
+    hf_quant_config_path = export_dir / "hf_quant_config.json"
+    assert hf_quant_config_path.exists(), "hf_quant_config.json should be generated"
+    with open(hf_quant_config_path) as f:
+        hf_quant_config = json.load(f)
+
+    quant_section = hf_quant_config["quantization"]
+    assert quant_section["quant_algo"] == "NVFP4"
+    exclude_modules = quant_section["exclude_modules"]
+
+    def is_excluded(module_name: str) -> bool:
+        return any(fnmatch(module_name, pattern) for pattern in exclude_modules)
+
+    # Attention layers must be excluded
+    assert is_excluded("model.layers.0.self_attn.q_proj"), (
+        f"self_attn should be excluded, got patterns: {exclude_modules}"
+    )
+
+    # lm_head must be excluded
+    assert is_excluded("lm_head"), f"lm_head should be excluded, got patterns: {exclude_modules}"
+
+    # Routed experts should NOT be excluded
+    assert not is_excluded("model.layers.0.mlp.experts.0.down_proj"), (
+        f"Routed experts should not be excluded, got patterns: {exclude_modules}"
+    )
```
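The test's `is_excluded` helper resolves full dotted module names against glob patterns via `fnmatch`. Below is a self-contained sketch of that matching logic using a hypothetical pattern list (the real list comes from the exported `hf_quant_config.json`):

```python
from fnmatch import fnmatch

# Hypothetical exclude patterns for illustration; the real list is read from
# the exported hf_quant_config.json.
exclude_modules = ["*self_attn*", "lm_head"]

def is_excluded(name: str) -> bool:
    # A module is excluded if any glob pattern matches its full dotted name.
    return any(fnmatch(name, pattern) for pattern in exclude_modules)

assert is_excluded("model.layers.0.self_attn.q_proj")   # matches "*self_attn*"
assert is_excluded("lm_head")                           # exact match
assert not is_excluded("model.layers.0.mlp.experts.0.down_proj")  # no pattern matches
```

Glob matching keeps the exported config compact: a single pattern like `*self_attn*` covers every attention projection in every layer, while routed expert paths fall through to quantization.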
