
Commit 8fa0d4c

feat(ptq): replace Nemotron-H ad-hoc lm_head/embedding helper with YAML recipe
Move the Nemotron-H-specific quantization extensions out of `hf_ptq.py` and into a declarative recipe at `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, addressing PR #1327 review feedback. The recipe captures exactly what the removed `_enable_lm_head_and_embedding_quantization` helper did:

* All Linear weight quantizers ON (NVFP4 W4A16, group_size 16, scale_bits e4m3).
* Standard `_default_disabled_quantizer_cfg` exclusions (BatchNorm, conv1d, etc.).
* `*lm_head*weight_quantizer`, `*embeddings*weight_quantizer`, and `*embed_tokens*weight_quantizer` re-enabled AFTER the default disables so they take precedence (last matching entry wins).

Drop the helpers (`_enable_lm_head_and_embedding_quantization`, `_extract_wildcard_quantizer_cfg`) and the `if model_type == "nemotron_h":` block in `mono_quantize`. Users now opt in explicitly via `--recipe models/Nemotron-H/nvfp4_w4a16` instead of relying on auto-detection.

Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (RTX 6000 Ada, calib_size=16, calib_seq=256): 94 weight quantizers enabled and 21 disabled (the Mamba `*mixer.conv1d*` layers), `lm_head.weight_quantizer` and `model.embeddings.weight_quantizer` carry the NVFP4 cfg, the exported safetensors is 2.13 GiB (matches the prior PR-validation export size), and `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d layers]`.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
1 parent a115c88 commit 8fa0d4c
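The third bullet above relies on ordered, last-match-wins pattern matching over the quant_cfg entry list. A minimal standalone sketch of that rule (plain fnmatch over hand-written entries in recipe order; the quantizer names are illustrative and this is not ModelOpt's actual matcher):

    from fnmatch import fnmatch

    # Entries in recipe order: broad enable, default disables, Nemotron-H override last.
    entries = [
        {"quantizer_name": "*weight_quantizer", "enable": True},
        {"quantizer_name": "*lm_head*", "enable": False},         # default disable
        {"quantizer_name": "*mixer.conv1d*", "enable": False},    # Mamba conv stays bf16
        {"quantizer_name": "*lm_head*weight_quantizer", "enable": True},  # recipe override
    ]

    def resolve(name: str) -> bool:
        state = False
        for entry in entries:
            if fnmatch(name, entry["quantizer_name"]):
                state = entry["enable"]  # a later match overrides an earlier one
        return state

    print(resolve("lm_head.weight_quantizer"))                         # True: the override wins
    print(resolve("backbone.layers.0.mixer.conv1d.weight_quantizer"))  # False: stays excluded
    print(resolve("backbone.layers.0.mlp.up_proj.weight_quantizer"))   # True: plain Linear weight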

3 files changed

Lines changed: 132 additions & 120 deletions

File tree

CHANGELOG.rst
examples/llm_ptq/hf_ptq.py
modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ Changelog
 **New Features**

 - Add NVFP4 W4A16 weight-only quantization (``nvfp4_w4a16``): FP4 weights with group_size=16, BF16 activations, no calibration forward pass required. Use ``mtq.NVFP4_W4A16_CFG`` or ``--qformat nvfp4_w4a16`` in ``hf_ptq.py``. Exported checkpoints can be served on vLLM after conversion to compressed-tensors format.
-- Register ``nn.Embedding`` with ``QuantModuleRegistry`` (weight-only wrapper) and extend the unified HF exporter to pack quantized embedding weights. Enables NVFP4 quantization of ``lm_head`` and the input token embedding on hybrid SSM+Attention models such as Nemotron-H, where those two tables are a sizeable fraction of parameters and leaving them in bf16 wastes most of the compression. Nemotron-H-specific enablement + ``--exclude_modules`` CLI flag wired up in ``examples/llm_ptq/hf_ptq.py``.
+- Register ``nn.Embedding`` with ``QuantModuleRegistry`` (weight-only wrapper) and extend the unified HF exporter to pack quantized embedding weights. Enables NVFP4 quantization of ``lm_head`` and the input token embedding on hybrid SSM+Attention models such as Nemotron-H, where those two tables are a sizeable fraction of parameters and leaving them in bf16 wastes most of the compression. Use ``--recipe models/Nemotron-H/nvfp4_w4a16`` (see `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml>`_) to opt in. The ``--exclude_modules`` CLI flag in ``examples/llm_ptq/hf_ptq.py`` lets users selectively exclude individual modules from the recipe's coverage.
 - Support full Transformer Engine spec for Minitron pruning (``mcore_minitron``). Now we no longer need to use custom ModelOpt spec. Note that this does not affect the usage of the pruning workflow but makes pruning slightly faster and may result in slightly different pruned model because of different kernel and numerics.
 - Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See `examples/puzzletron/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/puzzletron>`_ for more details.
 - Added iterator interface using CalibrationDataReader in ONNX quantization workflow.
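For the first entry above, the weight-only flow needs no calibration loop; a minimal sketch of the programmatic path (the checkpoint name comes from the verification note in the commit message; the trust_remote_code / dtype arguments are assumptions about how that checkpoint loads):

    import torch
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM

    # Any BF16 causal-LM checkpoint works; this one matches the commit's verification run.
    model = AutoModelForCausalLM.from_pretrained(
        "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

    # NVFP4 W4A16 is weight-only: weights get FP4 with group_size=16, activations stay
    # BF16, so no forward_loop (calibration pass) is required.
    model = mtq.quantize(model, mtq.NVFP4_W4A16_CFG, forward_loop=None)
    mtq.print_quant_summary(model)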

examples/llm_ptq/hf_ptq.py

Lines changed: 5 additions & 119 deletions
@@ -596,99 +596,6 @@ def sparsity_main(
     mts.export(full_model)


-def _enable_lm_head_and_embedding_quantization(
-    quant_cfg: dict[str, Any],
-    weight_quantizer_cfg: dict[str, Any],
-    input_quantizer_cfg: dict[str, Any] | None = None,
-    user_excluded_modules: list[str] | None = None,
-) -> None:
-    """Re-enable quantization of ``lm_head`` and the input embedding table.
-
-    ModelOpt's default PTQ recipes exclude ``*lm_head*`` and never touch ``nn.Embedding``
-    because most LLM deployment runtimes keep those layers at full precision. For Nemotron-H
-    (and similar SSM+Attention hybrids) the embedding and lm_head are a large fraction of the
-    total parameters — quantizing them recovers most of the promised memory savings. This
-    helper appends entries to the cfg list that override earlier ``*lm_head*`` disables
-    and explicitly target the embedding weight quantizer.
-
-    For activation-aware recipes (``fp8``, ``nvfp4``, ...) ``input_quantizer_cfg`` is mirrored
-    onto ``*lm_head*input_quantizer`` so ``lm_head`` keeps the same activation format as the
-    rest of the model. Embedding input quantizers are left alone since
-    ``QuantEmbedding._setup`` disables them by default (embedding inputs are integer indices).
-
-    If ``user_excluded_modules`` is provided, entries matching any user exclusion pattern
-    are skipped so ``--exclude_modules lm_head`` / ``--exclude_modules embeddings`` is not
-    silently overridden.
-
-    Args:
-        quant_cfg: the primary quant_cfg dict (``{"quant_cfg": [...], "algorithm": ...}``).
-        weight_quantizer_cfg: the weight-quantizer attribute dict to apply (e.g. ``_nvfp4_cfg``).
-        input_quantizer_cfg: the activation-quantizer attribute dict to mirror on ``lm_head``.
-            ``None`` for weight-only recipes, in which case no input-quantizer entry is added.
-        user_excluded_modules: raw ``--exclude_modules`` patterns from the CLI; targets
-            matching any of them (bidirectional substring match) are skipped.
-    """
-    excluded = user_excluded_modules or []
-
-    def _user_excluded(target_hint: str) -> bool:
-        # Bidirectional substring: "lm_head" user pattern excludes target "lm_head"; a more
-        # specific user pattern (e.g. "backbone.embeddings") also excludes "embeddings".
-        return any(p in target_hint or target_hint in p for p in excluded)
-
-    # Ordering matters: these entries must come AFTER the _default_disabled_quantizer_cfg
-    # entries (which set *lm_head* → disabled) so they take effect.
-    if not _user_excluded("lm_head"):
-        quant_cfg["quant_cfg"].append(
-            {
-                "quantizer_name": "*lm_head*weight_quantizer",
-                "cfg": copy.deepcopy(weight_quantizer_cfg),
-            }
-        )
-        # For activation-aware recipes, keep lm_head's input format aligned with the rest of
-        # the model — otherwise lm_head silently downgrades to weight-only and gets
-        # reclassified as e.g. NVFP4_W4A16 on export while the rest of the model is NVFP4.
-        if input_quantizer_cfg is not None:
-            quant_cfg["quant_cfg"].append(
-                {
-                    "quantizer_name": "*lm_head*input_quantizer",
-                    "cfg": copy.deepcopy(input_quantizer_cfg),
-                }
-            )
-
-    # nn.Embedding quantizers only exist once `quant_embedding.py` registers the class.
-    # Nemotron-H's backbone attribute name differs between the remote-code ("backbone.embeddings")
-    # and transformers built-in ("model.embeddings") paths; both are weight-only vocab
-    # embeddings here. The broad "*embeddings*" wildcard covers both and does not match
-    # any other layer in a Nemotron-H model (no positional/rotary embeddings exist).
-    if not _user_excluded("embeddings"):
-        quant_cfg["quant_cfg"].append(
-            {
-                "quantizer_name": "*embeddings*weight_quantizer",
-                "cfg": copy.deepcopy(weight_quantizer_cfg),
-            }
-        )
-    # Also keep the standard HF "embed_tokens" naming in case future Nemotron-H variants
-    # rename the attribute.
-    if not _user_excluded("embed_tokens"):
-        quant_cfg["quant_cfg"].append(
-            {
-                "quantizer_name": "*embed_tokens*weight_quantizer",
-                "cfg": copy.deepcopy(weight_quantizer_cfg),
-            }
-        )
-
-
-def _extract_wildcard_quantizer_cfg(
-    quant_cfg: dict[str, Any], quantizer_attr: str
-) -> dict[str, Any] | None:
-    """Return the first ``*<quantizer_attr>`` cfg dict from an ordered quant_cfg list."""
-    target = f"*{quantizer_attr}"
-    for entry in quant_cfg.get("quant_cfg", []):
-        if entry.get("quantizer_name") == target and isinstance(entry.get("cfg"), dict):
-            return entry["cfg"]
-    return None
-
-
 def mono_quantize(
     args: argparse.Namespace,
     quant_cfg: dict[str, Any],
@@ -725,32 +632,11 @@ def mono_quantize(
         )  # Nemotron-Parse specific
         print("Quantization will only be applied to the decoder (text generation) component")

-    # For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
-    # extend quantization coverage to the lm_head and the input token embedding. On this
-    # architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
-    # them at bf16 wastes most of the NVFP4 memory benefit.
-    if model_type == "nemotron_h":
-        weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
-        if weight_quantizer_cfg is not None:
-            # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
-            # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
-            # ``lm_head`` stays weight-only along with the embedding.
-            input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
-            print(
-                "Nemotron-H detected: extending quantization to lm_head and input embedding "
-                "(backbone.embeddings)."
-            )
-            _enable_lm_head_and_embedding_quantization(
-                quant_cfg,
-                weight_quantizer_cfg,
-                input_quantizer_cfg=input_quantizer_cfg,
-                user_excluded_modules=args.exclude_modules or None,
-            )
-        else:
-            warnings.warn(
-                "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
-                "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
-            )
+    # Model-specific quantization extensions (e.g. quantizing lm_head + input embedding for
+    # Nemotron-H, where those tables are a large fraction of parameters and leaving them at
+    # bf16 wastes most of the memory savings) are now expressed as recipes under
+    # ``modelopt_recipes/models/<ModelName>/``. Pass ``--recipe models/<ModelName>/<flavor>``
+    # (e.g. ``--recipe models/Nemotron-H/nvfp4_w4a16``) to opt in.

     if not model_is_already_quantized or calibration_only:
         # quantize the model
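The five-line comment that replaces the deleted block points at the recipe. In the ordered list-of-entries form that the deleted helper manipulated, the recipe's override section amounts to the following sketch (nvfp4_w4a16_weight_cfg is a placeholder for the wildcard weight-quantizer attributes, which the YAML spells out inline; this illustrates the equivalence and is not code that remains in hf_ptq.py):

    import copy

    # Placeholder for the NVFP4 W4A16 weight-quantizer attributes (group_size 16,
    # dynamic e4m3 scales, e2m1 values), matching what the recipe writes inline.
    nvfp4_w4a16_weight_cfg = {
        "num_bits": "e2m1",
        "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": "e4m3"},
    }

    quant_cfg = {"algorithm": "max", "quant_cfg": []}  # ordered list, last match wins
    # ... the broad "*weight_quantizer" enable and the default disables go here ...

    # Re-enable lm_head plus both embedding naming schemes AFTER the default
    # "*lm_head*" disable, so these later entries take precedence.
    for pattern in (
        "*lm_head*weight_quantizer",
        "*embeddings*weight_quantizer",
        "*embed_tokens*weight_quantizer",
    ):
        quant_cfg["quant_cfg"].append(
            {"quantizer_name": pattern, "cfg": copy.deepcopy(nvfp4_w4a16_weight_cfg)}
        )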
modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# NVFP4 W4A16 (weight-only) recipe for Nemotron-H hybrid models.
+#
+# Mirrors the general ``nvfp4_w4a16`` qformat (NVFP4_W4A16_CFG) but additionally
+# re-enables quantization of ``lm_head`` and the input token embedding. On
+# Nemotron-3-Nano-4B those two 131072x3136 tables account for ~21% of model
+# parameters, so leaving them at bf16 wastes most of the NVFP4 memory benefit.
+#
+# Coverage:
+#   * Linear layers in attention + MLP: NVFP4 W4A16 weight-only.
+#   * lm_head: NVFP4 W4A16 weight-only (re-enabled here; default disables it).
+#   * Input embedding (``backbone.embeddings`` / ``model.embed_tokens``):
+#     NVFP4 W4A16 weight-only via ``QuantEmbedding``. Embedding inputs are
+#     integer indices, so the input quantizer is intentionally not enabled.
+#   * Mamba ``*mixer.conv1d*``: kept at bf16 (default exclusion).
+#
+# Notes for vLLM consumption:
+#   * ``vllm.compressed-tensors`` consumes packed NVFP4 weights for ``Linear``
+#     and ``Embedding`` layers when the corresponding kernels are present. As
+#     of vLLM 0.19, ``ParallelLMHead`` and ``VocabParallelEmbedding`` need an
+#     additional patch to dispatch ``CompressedTensorsLinearMethod``; see the
+#     PR notes for details. If the target deployment is stock vLLM and you
+#     can't apply that patch, use the general ``nvfp4_w4a16`` qformat
+#     instead, which leaves ``lm_head`` and embeddings at bf16.
+
+metadata:
+  recipe_type: ptq
+  description: NVFP4 W4A16 weight-only for Nemotron-H, including lm_head and input embedding.
+quantize:
+  algorithm: max
+  quant_cfg:
+    # Start with everything disabled, then enable layers explicitly.
+    - quantizer_name: '*'
+      enable: false
+
+    # Quantize all Linear weight quantizers (attention q/k/v/o + MLP up/down).
+    - quantizer_name: '*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+
+    # Standard exclusions copied from ``_default_disabled_quantizer_cfg``.
+    # Order matters: later entries override earlier ones in
+    # ``modelopt.torch.quantization.set_quantizer_by_cfg``.
+    - quantizer_name: '*lm_head*'
+      enable: false
+    - quantizer_name: '*proj_out.*'
+      enable: false
+    - quantizer_name: '*block_sparse_moe.gate*'
+      enable: false
+    - quantizer_name: '*router*'
+      enable: false
+    - quantizer_name: '*mlp.gate.*'
+      enable: false
+    - quantizer_name: '*mlp.shared_expert_gate.*'
+      enable: false
+    - quantizer_name: '*linear_attn.conv1d*'
+      enable: false
+    - quantizer_name: '*mixer.conv1d*'
+      enable: false
+    - quantizer_name: '*output_layer*'
+      enable: false
+    - quantizer_name: 'output.*'
+      enable: false
+    - parent_class: 'nn.BatchNorm1d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm2d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.BatchNorm3d'
+      quantizer_name: '*'
+      enable: false
+    - parent_class: 'nn.LeakyReLU'
+      quantizer_name: '*'
+      enable: false
+
+    # Nemotron-H specific overrides: re-enable the weight quantizer for
+    # ``lm_head`` and the input embedding. These come AFTER the default
+    # disables above so they take precedence (last matching entry wins).
+    - quantizer_name: '*lm_head*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+
+    # Two embedding patterns cover both the Nemotron-H remote-code path
+    # (``backbone.embeddings``) and the standard transformers naming
+    # (``model.embed_tokens``).
+    - quantizer_name: '*embeddings*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
+    - quantizer_name: '*embed_tokens*weight_quantizer'
+      enable: true
+      cfg:
+        block_sizes:
+          -1: 16
+          type: dynamic
+          scale_bits: e4m3
+        num_bits: e2m1
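A quick way to sanity-check that the committed recipe matches the commit-message claims (group_size 16, e4m3 scales, overrides listed after the disables) is to load it with PyYAML and print the enabled entries that carry a cfg block; a standalone sketch, assumed to be run from the repository root:

    import yaml

    with open("modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml") as f:
        recipe = yaml.safe_load(f)

    # Entries keep file order, so the lm_head / embeddings overrides appear last,
    # which is exactly what lets them win over the earlier '*lm_head*' disable.
    for entry in recipe["quantize"]["quant_cfg"]:
        if entry.get("enable") and "cfg" in entry:
            cfg = entry["cfg"]
            print(
                f"{entry['quantizer_name']:35s}"
                f" group_size={cfg['block_sizes'][-1]}"
                f" scale_bits={cfg['block_sizes']['scale_bits']}"
                f" num_bits={cfg['num_bits']}"
            )

    # Expected output: '*weight_quantizer', '*lm_head*weight_quantizer',
    # '*embeddings*weight_quantizer', and '*embed_tokens*weight_quantizer',
    # each with group_size=16, scale_bits=e4m3, num_bits=e2m1.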
