
Commit 78f4c42

chore(ptq): drop --exclude_modules CLI flag (recipes own exclusions)
The `--exclude_modules` flag was added in this PR as an escape hatch for overriding the auto-applied lm_head/embedding inclusion on Nemotron-H. Now that meenchen's recipe-system review is addressed and the Nemotron-H extensions live in `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, the flag has no remaining purpose: users who want different exclusions write a different recipe.

Removes:

* the `--exclude_modules` argparse entry in `hf_ptq.py`
* the `args.exclude_modules` apply-loop in `quantize_main()`
* the `EXCLUDE_MODULES` env-var passthrough and `EXCLUDE_MODULES_ARGS` bash array in `examples/llm_ptq/scripts/huggingface_example.sh`

Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` with `--recipe models/Nemotron-H/nvfp4_w4a16` (transformers 4.56.2, GPU 5, calib_size=16). Coverage is unchanged: 94 weight quantizers enabled, 21 disabled (the Mamba `*mixer.conv1d*` layers); `lm_head.weight_quantizer` and `backbone.embeddings.weight_quantizer` carry the NVFP4 W4A16 config; the exported safetensors total 2.1 GiB; `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, and `exclude_modules=[21 conv1d layers]`. The recipe still dictates the exclusion set, so behavior is unchanged for the supported codepath.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
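For reproducibility, the reported `hf_quant_config.json` fields can be spot-checked with a short script. This is a minimal sketch: the field names (`quant_algo`, `group_size`, `exclude_modules`) come from the commit message above, while the JSON nesting and the `exported_model` path are assumptions.

```python
# Sketch of the post-export check described above. Field names come from the
# commit message; the JSON nesting and export directory are assumptions.
import json
from pathlib import Path

export_dir = Path("exported_model")  # hypothetical export location
cfg = json.loads((export_dir / "hf_quant_config.json").read_text())

quant = cfg.get("quantization", cfg)  # tolerate flat or nested layouts
assert quant.get("quant_algo") == "NVFP4_W4A16"
assert quant.get("group_size") == 16
print(f"excluded modules: {len(quant.get('exclude_modules', []))}")  # expect 21
```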
1 parent: e63965e

2 files changed: 0 additions & 36 deletions


examples/llm_ptq/hf_ptq.py

Lines changed: 0 additions & 23 deletions
```diff
@@ -1128,18 +1128,6 @@ def quantize_main(
         quant_cfg["quant_cfg"].append({"quantizer_name": pattern, "enable": False})
         print(f"Excluding MTP layer from quantization: {pattern}")
 
-    # Apply user-requested per-module exclusions (--exclude_modules).
-    if args.exclude_modules:
-        quant_cfg = copy.deepcopy(quant_cfg)
-        for mod in args.exclude_modules:
-            quant_cfg["quant_cfg"].append(
-                {"quantizer_name": f"*{mod}*.weight_quantizer", "enable": False}
-            )
-            quant_cfg["quant_cfg"].append(
-                {"quantizer_name": f"*{mod}*.input_quantizer", "enable": False}
-            )
-            print(f"Excluding module from quantization: {mod}")
-
     # Use constant amax for KV quantizers when a cast format is selected.
     if args.kv_cache_qformat in _KV_CAST_FORMATS:
         quant_cfg = copy.deepcopy(quant_cfg)
@@ -1338,17 +1326,6 @@ def parse_args() -> argparse.Namespace:
         default=False,
         action="store_true",
     )
-    parser.add_argument(
-        "--exclude_modules",
-        nargs="+",
-        default=[],
-        metavar="MODULE",
-        help=(
-            "Module name patterns to exclude from quantization "
-            "(e.g. lm_head backbone.layers.0.mixer). "
-            "Appends a disable rule for each pattern's weight and input quantizers."
-        ),
-    )
     parser.add_argument(
         "--low_memory_mode",
         help=(
```
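For reference, the deleted apply-loop reads as a small standalone helper. The sketch below mirrors the removed lines: the config shape (a dict whose `quant_cfg` key holds a list of enable/disable rules) is taken from the diff above, and `demo_cfg` is a hypothetical starting config.

```python
# Standalone restatement of the removed --exclude_modules loop, for reference.
# The rule shape ({"quantizer_name": ..., "enable": False}) follows the diff.
import copy


def exclude_modules(quant_cfg: dict, patterns: list[str]) -> dict:
    """Return a copy of quant_cfg with weight/input quantizers disabled per pattern."""
    quant_cfg = copy.deepcopy(quant_cfg)
    for mod in patterns:
        quant_cfg["quant_cfg"].append(
            {"quantizer_name": f"*{mod}*.weight_quantizer", "enable": False}
        )
        quant_cfg["quant_cfg"].append(
            {"quantizer_name": f"*{mod}*.input_quantizer", "enable": False}
        )
    return quant_cfg


demo_cfg = {"quant_cfg": []}  # hypothetical minimal config
print(exclude_modules(demo_cfg, ["lm_head", "backbone.layers.0.mixer"]))
```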

examples/llm_ptq/scripts/huggingface_example.sh

Lines changed: 0 additions & 13 deletions
```diff
@@ -127,18 +127,6 @@ if $TRUST_REMOTE_CODE; then
     PTQ_ARGS+=" --trust_remote_code "
 fi
 
-# --exclude_modules is kept out of the PTQ_ARGS string and passed via a bash array so
-# wildcard patterns like '*embed_tokens*' reach hf_ptq.py verbatim. Word-splitting into
-# per-pattern elements happens with glob expansion disabled (set -f) so the shell does
-# not expand '*' against the filesystem.
-EXCLUDE_MODULES_ARGS=()
-if [ -n "${EXCLUDE_MODULES:-}" ]; then
-    set -f
-    # shellcheck disable=SC2206 # intentional word-splitting without glob expansion
-    EXCLUDE_MODULES_ARGS=(--exclude_modules $EXCLUDE_MODULES)
-    set +f
-fi
-
 if $USE_SEQ_DEVICE_MAP; then
     PTQ_ARGS+=" --use_seq_device_map "
 fi
@@ -195,7 +183,6 @@ if [[ $TASKS =~ "quant" ]] || [[ ! -d "$SAVE_PATH" ]] || [[ ! $(ls -A $SAVE_PATH
     --inference_tensor_parallel=$TP \
     --inference_pipeline_parallel=$PP \
     $PTQ_ARGS \
-    "${EXCLUDE_MODULES_ARGS[@]}" \
     $AWQ_ARGS
 else
     echo "Quantized model config $MODEL_CONFIG exists, skipping the quantization stage"
```
