Commit 839fa3d
add: ModelOpt Launcher for Slurm job submission (#1031)
```
# Install
cd Model-Optimizer/launcher
curl -LsSf https://astral.sh/uv/install.sh | sh
git submodule update --init --recursive
# Run locally with Docker (single GPU)
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml hf_local=/mnt/hf-local --yes
# Run on a Slurm cluster (no need to export the following SLURM_* env vars if they are already set in your sandbox)
export SLURM_HOST=login-node.example.com
export SLURM_ACCOUNT=my_account
export SLURM_HF_LOCAL=/shared/hf-local
export SLURM_JOB_DIR=/shared/experiments
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes
# Preview config without running
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --dryrun --yes -v
# Override parameters
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml \
pipeline.task_0.slurm_config.nodes=2 --yes
# Dump the resolved config (a single YAML, useful for reproducibility and for QA, engineering, and agents to triage)
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --to-yaml resolved.yaml
# Run tests
uv pip install -e . pytest
uv run pytest -v
```
## Summary
Add a `launcher/` module for submitting quantization, training, and
evaluation jobs to Slurm clusters, or running them locally with Docker,
via `nemo-run`. `nemo-run` is used across `NVIDIA-NeMo/*` projects. It
supports a modern YAML factory system (a superset of `OmegaConf` and
`Hydra`) and multiple executor backends (here we mainly use Docker and
Slurm).
A sample YAML config `launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml`:
```
job_name: Qwen3-8B_NVFP4_DEFAULT_CFG
pipeline:
  # hf_local: path prefix for model weights and datasets.
  #
  # This should be a self-managed directory that mirrors the HuggingFace Hub
  # hierarchy (e.g., /hf-local/Qwen/Qwen3-8B/, /hf-local/cais/mmlu/). Using
  # a dedicated folder is preferred over the HuggingFace cache (~/.cache/huggingface)
  # to avoid cache corruption issues with concurrent jobs.
  #
  # Override on CLI:
  #   pipeline.global_vars.hf_local=/mnt/my-models/  # use a different path
  #   pipeline.global_vars.hf_local=""               # download from HuggingFace Hub
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron-lm/quantize/quantize.sh
    args:
      - --calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail
      - --calib-size 32
    environment:
      - MLM_MODEL_CFG: Qwen/Qwen3-8B
      - QUANT_CFG: NVFP4_DEFAULT_CFG
      - HF_MODEL_CKPT: <<global_vars.hf_local>>Qwen/Qwen3-8B
      - MMLU_DATASET: <<global_vars.hf_local>>cais/mmlu
      - TP: 4
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```
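The `<<global_vars.X>>` placeholders above are resolved before any task runs. A minimal sketch of how such string interpolation could work (the `interpolate` function and its regex are illustrative assumptions, not the launcher's actual API):

```python
import re

def interpolate(value: str, global_vars: dict) -> str:
    """Replace <<global_vars.X>> placeholders with values from global_vars."""
    def repl(match: re.Match) -> str:
        key = match.group(1)
        if key not in global_vars:
            raise KeyError(f"unknown global var: {key}")
        return str(global_vars[key])
    return re.sub(r"<<global_vars\.([A-Za-z0-9_]+)>>", repl, value)

global_vars = {"hf_local": "/hf-local/"}
print(interpolate("<<global_vars.hf_local>>Qwen/Qwen3-8B", global_vars))
# → /hf-local/Qwen/Qwen3-8B
```

Setting `hf_local=""` then degrades each placeholder to a bare Hub ID (e.g. `Qwen/Qwen3-8B`), which matches the "download from HuggingFace Hub" override documented in the YAML comments.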
### Key features
- **`launch.py`** — public entrypoint accepting `--yaml` config format
- **`core.py`** — shared logic (dataclasses, executor builders, run
loop) also used by nmm-sandbox's `slurm.py`
- **Factory system** — env-var-driven `slurm_factory` with
`register_factory()` registry
- **`<<global_vars.X>>`** interpolation for sharing values across
pipeline tasks
- **`hf_local`** global var for configurable model/dataset storage path
- **Version reporting** — git commit/branch printed at job start for
reproducibility
- **`--to-yaml`** — dump resolved config for bug reports and
reproducibility
- **Model-Optimizer symlink** — `modules/Model-Optimizer -> ../..`
(auto-created, avoids recursive submodule)
### Files
| Path | Description |
|------|-------------|
| `launcher/launch.py` | Public entrypoint |
| `launcher/core.py` | Shared dataclasses, executors, run loop |
| `launcher/slurm_config.py` | SlurmConfig + env-var factory |
| `launcher/common/` | Shell scripts (quantize, query, eagle3, specdec_bench) |
| `launcher/Qwen/Qwen3-8B/` | Example configs (PTQ, EAGLE3 pipeline) |
| `launcher/tests/` | 64 unit tests |
| `launcher/README.md` | User guide |
| `launcher/ADVANCED.md` | Architecture, mount mechanism, Claude Code workflows |
| `launcher/CLAUDE.md` | Claude Code project instructions |
| `.github/workflows/unit_tests.yml` | CI job for launcher tests |
### Verified
- Same YAML produces identical MMLU results via both `slurm.py` and
`launch.py`:
- Local Docker (TP=1): 0.719 (128/178)
- OCI-HSG Slurm (TP=4): 0.730 (130/178)
## Test plan
- [x] 64 unit tests (core, factory, YAML, Docker executor, Slurm
executor, Docker launch)
- [x] CI workflow added to `.github/workflows/unit_tests.yml`
- [x] Local Docker end-to-end with `python:3.12-slim`
- [x] Qwen3-8B PTQ on OCI-HSG via both launchers
- [ ] Reviewer runs: `cd launcher && uv pip install -e . pytest && uv run pytest -v`
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Introduced ModelOpt Launcher for submitting quantization, training,
and evaluation jobs to Slurm clusters or running locally via Docker.
* Added YAML-based job configuration with multi-task pipeline support
and global variable interpolation.
* Included example workflows for Qwen3-8B quantization and EAGLE3
speculative decoding.
* Provided configurable Slurm and execution environment defaults.
* **Documentation**
* Added comprehensive README with quick start, environment setup, and
configuration guidance.
* Added advanced guide detailing launcher architecture and integration
patterns.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
38 files changed
Lines changed: 4108 additions & 2 deletions