
Commit 011d5ab

Merge branch 'main' into dev-willg-integrate-auto-qdq-placement-part2.3
2 parents: db2f2cc + 1ebb6b3

132 files changed: 17,604 additions & 1,926 deletions


.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/llm_ptq @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/llm_qat @NVIDIA/modelopt-examples-llm_qat-codeowners
 /examples/llm_sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
+/examples/megatron_bridge @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
 /examples/nemo_run @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners

.github/workflows/example_tests.yml

Lines changed: 3 additions & 3 deletions
@@ -86,14 +86,14 @@ jobs:
       pip_install_extras: "[hf,dev-test]"
       runner: linux-amd64-gpu-h100-latest-2
 
-  ##### Speculative Decoding Example Tests (requires 25.08 image) #####
+  ##### Speculative Decoding Example Tests (requires 26.01 image) #####
   speculative-decoding-pr:
     needs: [check-file-changes, wait-checks]
     if: startsWith(github.ref, 'refs/heads/pull-request/') && needs.check-file-changes.outputs.any_changed == 'true'
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/pytorch:25.08-py3"
+      docker_image: "nvcr.io/nvidia/pytorch:26.01-py3"
       example: speculative_decoding
       pip_install_extras: "[hf,dev-test]"
       runner: linux-amd64-gpu-l4-latest-1
@@ -103,7 +103,7 @@ jobs:
     uses: ./.github/workflows/_example_tests_runner.yml
     secrets: inherit
     with:
-      docker_image: "nvcr.io/nvidia/pytorch:25.08-py3"
+      docker_image: "nvcr.io/nvidia/pytorch:26.01-py3"
       example: speculative_decoding
       pip_install_extras: "[hf,dev-test]"
       runner: linux-amd64-gpu-h100-latest-2

.github/workflows/gpu_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# NOTE: Make sure this file is consistent with .gitlab/tests.yml
+# TODO: Optimize gpu tests runtime!
 name: GPU tests
 
 on:
@@ -78,7 +78,7 @@ jobs:
   gpu-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     runs-on: linux-amd64-gpu-h100-latest-2
-    timeout-minutes: 120
+    timeout-minutes: 150
     container: *gpu_container
     steps: *gpu_steps
   gpu-pr-required-check:

.pre-commit-config.yaml

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,8 @@ repos:
           examples/speculative_decoding/main.py|
           examples/speculative_decoding/medusa_utils.py|
           examples/speculative_decoding/server_generate.py|
+          experimental/dms/models/qwen3/configuration_qwen3_dms.py|
+          experimental/dms/models/qwen3/modeling_qwen3_dms.py|
         )$
 
       # Default hook for Apache 2.0 in c/c++/cuda files

CHANGELOG.rst

Lines changed: 6 additions & 1 deletion
@@ -13,9 +13,14 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
 - Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
 - Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
+- Add new example for Minitron pruning with the Megatron-Bridge framework, including advanced usage of the new ``params`` constraint based pruning. Also add an example for distillation with the Megatron-Bridge framework. See `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
 - Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
 - Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
 - Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
+- Add unified Hugging Face export support for diffusers pipelines/components.
+- Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
+- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
+- Add support for image-text data calibration in PTQ for Nemotron VL models.
 
 0.41 (2026-01-19)
 ^^^^^^^^^^^^^^^^^
@@ -225,7 +230,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add support for UNet ONNX quantization.
 - Enable ``concat_elimination`` pass by default to improve the performance of quantized ONNX models.
 - Enable Redundant Cast elimination pass by default in :meth:`moq.quantize <modelopt.onnx.quantization.quantize>`.
-- Add new attribute ``parallel_state`` to :class:`DynamicModule <modelopt.torch.opt.dynamic.DynamicModule>` to support distributed parallelism such as data parallel and tensor parallel.
+- Add new attribute ``parallel_state`` to :class:`QuantModule <modelopt.torch.quantization.nn.modules.quant_module.QuantModule>` to support distributed parallelism such as data parallel and tensor parallel.
 - Add MXFP8, NVFP4 quantized ONNX export support.
 - Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
 
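For readers skimming the changelog, a hypothetical sketch of the new ``params`` constraint described in the Minitron pruning bullet above. The constraint value format, the return shape, and the ``forward_loop`` wiring are assumptions rather than this commit's code; see examples/pruning/README.md for the supported usage.

    # Hypothetical sketch: params-constrained Minitron pruning (not this commit's code).
    import modelopt.torch.prune as mtp

    pruned_model, _ = mtp.prune(
        model,  # a Megatron Core GPT model to be pruned
        mode="mcore_minitron",
        constraints={"params": "50%"},  # assumed spelling: target ~50% of parameters
        dummy_input=None,  # assumed unused by this mode
        config={"forward_loop": forward_loop},  # calibration loop used to score candidate subnets
    )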

README.md

Lines changed: 2 additions & 2 deletions
@@ -20,9 +20,9 @@ ______________________________________________________________________
 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.
 
 **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
-Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
+Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for inference optimization techniques that require training.
 
-**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
+**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.
 
 ## Latest News
 
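To make the compose-then-export flow described in the README diff above concrete, a minimal PTQ sketch. The model name, calibration loop body, and output directory are placeholders, not part of this commit; mtq.quantize and export_hf_checkpoint are the public ModelOpt APIs the README refers to.

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    def forward_loop(model):
        # Run a few calibration batches through the model (placeholder).
        ...

    # Post-training quantization with a built-in FP8 recipe.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Write a unified Hugging Face checkpoint consumable by SGLang/TensorRT-LLM/vLLM.
    export_hf_checkpoint(model, export_dir="llama_fp8_ckpt")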

docs/source/deployment/3_unified_hf.rst

Lines changed: 5 additions & 1 deletion
@@ -2,7 +2,7 @@
 Unified HuggingFace Checkpoint
 =================================================================
 
-We support exporting modelopt-optimized Huggingface models and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.
+We support exporting modelopt-optimized Hugging Face models (transformers and diffusers pipelines/components) and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.
 
 The workflow is as follows:
 
@@ -32,6 +32,10 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
         export_dir,  # The directory where the exported files will be stored.
     )
 
+.. note::
+    ``export_hf_checkpoint`` also supports diffusers pipelines and components (e.g., UNet/transformer). See the
+    diffusers quantization examples for end-to-end workflows and CLI usage.
+
 Deployment Support Matrix
 ==============================================
 
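To make the new note concrete, a minimal sketch of quantizing and exporting a diffusers component. The pipeline choice, FP8 recipe, calibration prompt, and export directory are placeholders, and whether a component or a whole pipeline is passed may vary by workflow, so treat this as an assumption and consult the diffusers quantization examples.

    import modelopt.torch.quantization as mtq
    from diffusers import DiffusionPipeline
    from modelopt.torch.export import export_hf_checkpoint

    pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

    def forward_loop(unet):
        # Drive calibration through the full pipeline (placeholder prompt).
        pipe("a scenic mountain lake at sunrise", num_inference_steps=4)

    # Quantize the UNet component in place within the pipeline.
    pipe.unet = mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Export the quantized component in the unified checkpoint format.
    export_hf_checkpoint(pipe.unet, export_dir="sdxl_unet_fp8")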

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Minimizing inference costs presents a significant challenge as generative AI mod
 The `NVIDIA Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
 is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress model.
 It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
-techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
+techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). The unified Hugging Face export API supports both transformers and diffusers models. ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
 
 For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.
