
Commit 839fa3d

ChenhanYu and claude authored
add: ModelOpt Launcher for Slurm job submission (#1031)
```shell
# Install
cd Model-Optimizer/launcher
curl -LsSf https://astral.sh/uv/install.sh | sh
git submodule update --init --recursive

# Run locally with Docker (single GPU)
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml hf_local=/mnt/hf-local --yes

# Run on a Slurm cluster (no need to export the following SLURM_XXX envs if used in sandbox)
export SLURM_HOST=login-node.example.com
export SLURM_ACCOUNT=my_account
export SLURM_HF_LOCAL=/shared/hf-local
export SLURM_JOB_DIR=/shared/experiments
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes

# Preview config without running
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --dryrun --yes -v

# Override parameters
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml \
    pipeline.task_0.slurm_config.nodes=2 --yes

# Dump resolved config (a single YAML for reproducibility; useful for QA, engineering, and agents to triage)
uv run launch.py --yaml Qwen/Qwen3-8B/megatron_lm_ptq.yaml --to-yaml resolved.yaml

# Run tests
uv pip install -e . pytest
uv run pytest -v
```

## Summary

Add a `launcher/` module for submitting quantization, training, and evaluation jobs to Slurm clusters, or running them locally with Docker, via `nemo-run`. `nemo-run` is used across the `NVIDIA-NeMo/*` projects; it supports a modern YAML factory format (a superset of `OmegaConf` and `Hydra`) and multiple executor backends (here, mainly Docker and Slurm).

A sample YAML config `launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml`:

```yaml
job_name: Qwen3-8B_NVFP4_DEFAULT_CFG
pipeline:
  # hf_local: path prefix for model weights and datasets.
  #
  # This should be a self-managed directory that mirrors the HuggingFace Hub
  # hierarchy (e.g., /hf-local/Qwen/Qwen3-8B/, /hf-local/cais/mmlu/). Using
  # a dedicated folder is preferred over the HuggingFace cache (~/.cache/huggingface)
  # to avoid cache corruption issues with concurrent jobs.
  #
  # Override on CLI:
  #   pipeline.global_vars.hf_local=/mnt/my-models/  # use a different path
  #   pipeline.global_vars.hf_local=""               # download from HuggingFace Hub
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron-lm/quantize/quantize.sh
    args:
      - --calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail
      - --calib-size 32
    environment:
      - MLM_MODEL_CFG: Qwen/Qwen3-8B
      - QUANT_CFG: NVFP4_DEFAULT_CFG
      - HF_MODEL_CKPT: <<global_vars.hf_local>>Qwen/Qwen3-8B
      - MMLU_DATASET: <<global_vars.hf_local>>cais/mmlu
      - TP: 4
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```

### Key features

- **`launch.py`** — public entrypoint accepting the `--yaml` config format
- **`core.py`** — shared logic (dataclasses, executor builders, run loop) also used by nmm-sandbox's `slurm.py`
- **Factory system** — env-var-driven `slurm_factory` with a `register_factory()` registry
- **`<<global_vars.X>>`** interpolation for sharing values across pipeline tasks
- **`hf_local`** global var for a configurable model/dataset storage path
- **Version reporting** — git commit/branch printed at job start for reproducibility
- **`--to-yaml`** — dump the resolved config for bug reports and reproducibility
- **Model-Optimizer symlink** — `modules/Model-Optimizer -> ../..` (auto-created; avoids a recursive submodule)

### Files

| Path | Description |
|------|-------------|
| `launcher/launch.py` | Public entrypoint |
| `launcher/core.py` | Shared dataclasses, executors, run loop |
| `launcher/slurm_config.py` | SlurmConfig + env-var factory |
| `launcher/common/` | Shell scripts (quantize, query, eagle3, specdec_bench) |
| `launcher/Qwen/Qwen3-8B/` | Example configs (PTQ, EAGLE3 pipeline) |
| `launcher/tests/` | 64 unit tests |
| `launcher/README.md` | User guide |
| `launcher/ADVANCED.md` | Architecture, mount mechanism, Claude Code workflows |
| `launcher/CLAUDE.md` | Claude Code project instructions |
| `.github/workflows/unit_tests.yml` | CI job for launcher tests |

### Verified

- The same YAML produces identical MMLU results via both `slurm.py` and `launch.py`:
  - Local Docker (TP=1): 0.719 (128/178)
  - OCI-HSG Slurm (TP=4): 0.730 (130/178)

## Test plan

- [x] 64 unit tests (core, factory, YAML, Docker executor, Slurm executor, Docker launch)
- [x] CI workflow added to `.github/workflows/unit_tests.yml`
- [x] Local Docker end-to-end with `python:3.12-slim`
- [x] Qwen3-8B PTQ on OCI-HSG via both launchers
- [ ] Reviewer runs: `cd launcher && uv pip install -e . pytest && uv run pytest -v`

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g., avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. -->
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. -->

### Additional Information

<!-- E.g. related issue. -->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Introduced ModelOpt Launcher for submitting quantization, training, and evaluation jobs to Slurm clusters or running locally via Docker.
  * Added YAML-based job configuration with multi-task pipeline support and global variable interpolation.
  * Included example workflows for Qwen3-8B quantization and EAGLE3 speculative decoding.
  * Provided configurable Slurm and execution environment defaults.
* **Documentation**
  * Added a comprehensive README with quick start, environment setup, and configuration guidance.
  * Added an advanced guide detailing the launcher architecture and integration patterns.

---

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
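To illustrate the `<<global_vars.X>>` interpolation feature described above, here is a minimal sketch of how such markers could be resolved across a task tree. The function names (`interpolate`, `resolve_task`) and the regex-based approach are assumptions for illustration, not the launcher's actual implementation:

```python
import re

# Matches markers of the form <<global_vars.NAME>> (illustrative pattern).
PATTERN = re.compile(r"<<global_vars\.([A-Za-z0-9_]+)>>")

def interpolate(value, global_vars):
    """Replace every <<global_vars.X>> marker in a string with its value."""
    return PATTERN.sub(lambda m: str(global_vars[m.group(1)]), value)

def resolve_task(node, global_vars):
    """Walk a task config and interpolate all string leaves."""
    if isinstance(node, str):
        return interpolate(node, global_vars)
    if isinstance(node, list):
        return [resolve_task(v, global_vars) for v in node]
    if isinstance(node, dict):
        return {k: resolve_task(v, global_vars) for k, v in node.items()}
    return node

global_vars = {"hf_local": "/hf-local/"}
task = {"args": ["--calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail"]}
print(resolve_task(task, global_vars))
# → {'args': ['--calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail']}
```

Because the prefix is substituted textually, setting `hf_local=""` on the CLI naturally falls back to bare HuggingFace Hub repo IDs.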
1 parent 52cfa4e commit 839fa3d

38 files changed

Lines changed: 4108 additions & 2 deletions

.github/workflows/unit_tests.yml

Lines changed: 21 additions & 2 deletions
```diff
@@ -12,6 +12,7 @@ on:
       - "tests/unit/**"
       - "pyproject.toml"
       - "tox.ini"
+      - "tools/launcher/**"
   schedule:
     - cron: "0 0 * * *" # Nightly
   workflow_dispatch: # On-demand
@@ -98,6 +99,23 @@ jobs:
       - uses: ./.github/actions/ubuntu-setup
       - name: Run unit tests
         run: pip install tox && tox -e py312-torch210-tf_${{ matrix.tf }}-unit
+  launcher:
+    if: github.event_name == 'pull_request'
+    needs: [linux]
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          submodules: recursive
+      - name: Run launcher tests
+        working-directory: tools/launcher
+        run: |
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          export PATH="$HOME/.local/bin:$PATH"
+          uv venv .venv
+          uv pip install -e . pytest
+          uv run python3 -m pytest -v
   partial-install:
     if: github.event_name == 'pull_request'
     needs: [linux]
@@ -114,7 +132,7 @@ jobs:
   unit-pr-required-check:
     # Run even if some jobs are skipped
     if: ${{ github.event_name == 'pull_request' && always() }}
-    needs: [linux, windows, multi-py, multi-torch, multi-transformers, partial-install]
+    needs: [linux, windows, multi-py, multi-torch, multi-transformers, partial-install, launcher]
     runs-on: ubuntu-latest
     steps:
       - name: Required unit tests did not succeed
@@ -124,5 +142,6 @@ jobs:
             needs.multi-py.result != 'success' ||
             needs.multi-torch.result != 'success' ||
             needs.multi-transformers.result != 'success' ||
-            needs.partial-install.result != 'success' }}
+            needs.partial-install.result != 'success' ||
+            needs.launcher.result != 'success' }}
         run: exit 1
```

.gitmodules

Lines changed: 3 additions & 0 deletions
```diff
@@ -0,0 +1,3 @@
+[submodule "tools/launcher/modules/Megatron-LM"]
+    path = tools/launcher/modules/Megatron-LM
+    url = https://github.com/NVIDIA/Megatron-LM.git
```

tools/launcher/.gitignore

Lines changed: 22 additions & 0 deletions
```text
# Virtual environment
.venv/

# nemo-run state
.slurm_jobs
.docker_jobs.json
.local_jobs.json

# Experiment artifacts (generated at runtime)
experiments/
local_experiments/

# uv lock (generated, not portable)
uv.lock

# Python cache
__pycache__/

# Editor swap files
*.swp
*.swo
*~
```

tools/launcher/CLAUDE.md

Lines changed: 113 additions & 0 deletions
````markdown
# CLAUDE.md — ModelOpt Launcher

## Overview

The launcher submits ModelOpt quantization, training, and evaluation jobs to Slurm clusters or runs them locally with Docker.

## Key Files

| File | Role |
|------|------|
| `launch.py` | Public entrypoint — accepts `--yaml` or `pipeline=@` |
| `core.py` | Shared dataclasses, executor builders, run loop, version reporting |
| `slurm_config.py` | `SlurmConfig` dataclass and env-var-driven `slurm_factory` |
| `common/` | Shell scripts and `query.py` packaged to the cluster |
| `modules/Megatron-LM/` | Git submodule |
| `modules/Model-Optimizer` | Symlink to `../..` (auto-created by `launch.py` if missing) |

## Common Commands

```shell
# Run locally with Docker
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml hf_local=/mnt/hf-local --yes

# Run on Slurm (set env vars first)
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes

# Dry run — preview resolved config
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --dryrun --yes -v

# Dump resolved config
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --to-yaml resolved.yaml

# Run unit tests
uv pip install pytest
uv run python3 -m pytest tests/ -v
```

## YAML Config Format

The `--yaml` format maps top-level keys to `launch()` function arguments:

```yaml
job_name: Qwen3-8B_NVFP4_DEFAULT_CFG
pipeline:
  global_vars:
    hf_local: /hf-local/
  task_0:
    script: common/megatron_lm/quantize/quantize.sh
    args:
      - --calib-dataset-path-or-name <<global_vars.hf_local>>abisee/cnn_dailymail
    environment:
      - MLM_MODEL_CFG: Qwen/Qwen3-8B
      - HF_MODEL_CKPT: <<global_vars.hf_local>>Qwen/Qwen3-8B
      - TP: 4
    slurm_config:
      _factory_: "slurm_factory"
      nodes: 1
      ntasks_per_node: 4
      gpus_per_node: 4
```

Key conventions:

- Scripts go in `common/` (not `services/`)
- `<<global_vars.X>>` interpolation for shared values across tasks
- `_factory_: "slurm_factory"` — resolved via `register_factory()` in `core.py`
- Environment is a list of single-key dicts: `- KEY: value`
- CLI overrides: `pipeline.task_0.slurm_config.nodes=2`

## Architecture

```text
launch.py → imports core.py + slurm_config.py

core.run_jobs()

build_docker_executor() or build_slurm_executor()

nemo_run.Experiment → Docker or Slurm
```

- `set_slurm_config_type(SlurmConfig)` — patches the `SandboxTask` annotation at import time
- `register_factory("slurm_factory", slurm_factory)` — enables YAML `_factory_` resolution
- `report_versions(base_dir)` — prints git commit/branch for the launcher + submodules
- `get_default_env(title)` — returns `(slurm_env, local_env)` dicts

## Adding a New Model Config

1. Create `examples/<Org>/<Model>/megatron_lm_ptq.yaml` following the format above
2. Set `MLM_MODEL_CFG` to the HuggingFace repo ID
3. Set `QUANT_CFG` (e.g., `NVFP4_DEFAULT_CFG`, `INT8_DEFAULT_CFG`)
4. Set GPU/node counts based on model size
5. Test: `uv run launch.py --yaml <path> --dryrun --yes -v`

## Testing

65 unit tests in `tests/`. They run standalone, without installing `modelopt`, from the launcher directory:

```shell
uv run python3 -m pytest tests/ -v
```

Tests cover: core dataclasses, factory registry, global_vars interpolation, YAML formats, Docker/Slurm executor construction (mocked), environment merging, metadata writing, and end-to-end Docker launch via subprocess.

## Further Reading

- [docs/configuration.md](docs/configuration.md) — YAML formats, overrides, hf_local
- [docs/architecture.md](docs/architecture.md) — Shared core, factory system, typed tasks, mount mechanism
- [docs/testing.md](docs/testing.md) — Running tests locally and in CI
- [docs/claude_code.md](docs/claude_code.md) — Claude Code workflows
- [docs/contributing.md](docs/contributing.md) — Adding models, typed tasks, bug reporting
````
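The `_factory_` convention described in CLAUDE.md can be sketched in a few lines. The `register_factory` name mirrors the doc; the registry layout and function bodies below are assumptions for illustration:

```python
# Registry mapping YAML `_factory_` names to builder callables.
_FACTORIES = {}

def register_factory(name, fn):
    """Make `fn` available to YAML nodes as `_factory_: "<name>"`."""
    _FACTORIES[name] = fn

def resolve(node):
    """If a dict carries `_factory_`, build the object via the registered callable."""
    if isinstance(node, dict) and "_factory_" in node:
        kwargs = {k: v for k, v in node.items() if k != "_factory_"}
        return _FACTORIES[node["_factory_"]](**kwargs)
    return node

def slurm_factory(nodes=1, ntasks_per_node=1, gpus_per_node=1):
    # A real factory would also fold in SLURM_* env-var defaults and
    # return a SlurmConfig; a plain dict stands in here.
    return {"nodes": nodes, "ntasks_per_node": ntasks_per_node, "gpus_per_node": gpus_per_node}

register_factory("slurm_factory", slurm_factory)
cfg = resolve({"_factory_": "slurm_factory", "nodes": 1, "ntasks_per_node": 4, "gpus_per_node": 4})
print(cfg)  # → {'nodes': 1, 'ntasks_per_node': 4, 'gpus_per_node': 4}
```

This keeps the YAML declarative: the config names a builder, and CLI overrides such as `pipeline.task_0.slurm_config.nodes=2` simply change the kwargs passed to it.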

tools/launcher/README.md

Lines changed: 67 additions & 0 deletions
````markdown
# ModelOpt Launcher

Submit ModelOpt quantization, training, and evaluation jobs to Slurm clusters or run them locally with Docker.

## Quick Start

```bash
# Install
curl -LsSf https://astral.sh/uv/install.sh | sh
git submodule update --init --recursive

# Run locally with 1 GPU
cd Model-Optimizer/tools/launcher
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq_local.yaml hf_local=/mnt/hf-local --yes

# Run on a Slurm cluster (4 GPUs)
export SLURM_HOST=login-node.example.com
export SLURM_ACCOUNT=my_account
export SLURM_HF_LOCAL=/mnt/hf-local
export SLURM_JOB_DIR=/shared/experiments
uv run launch.py --yaml examples/Qwen/Qwen3-8B/megatron_lm_ptq.yaml --yes
```

> **Local vs cluster:** `megatron_lm_ptq.yaml` defaults to TP=4 on 4 GPUs.
> Use `megatron_lm_ptq_local.yaml` for single-GPU local Docker runs.

## Directory Structure

```text
tools/launcher/
├── launch.py                # Main entrypoint
├── core.py                  # Core logic (dataclasses, executors, run loop)
├── slurm_config.py          # SlurmConfig dataclass and factory
├── common/                  # Scripts and typed tasks
│   ├── megatron_lm/quantize/
│   │   ├── quantize.sh      # PTQ quantization + MMLU evaluation
│   │   └── task.py          # MegatronLMQuantizeTask (typed config)
│   ├── tensorrt_llm/query.sh  # TRT-LLM server + query
│   ├── vllm/query.sh        # vLLM server + query
│   ├── eagle3/              # EAGLE3 speculative decoding scripts
│   └── specdec_bench/       # Speculative decoding benchmark
├── examples/                # Example configs
│   └── Qwen/Qwen3-8B/
│       ├── megatron_lm_ptq.yaml        # PTQ (4 GPUs, Slurm)
│       ├── megatron_lm_ptq_local.yaml  # PTQ (1 GPU, local Docker)
│       └── hf_offline_eagle3.yaml      # EAGLE3 offline pipeline
├── tests/                   # 64 unit tests
├── modules/                 # Dependencies
│   ├── Megatron-LM/         # Git submodule
│   └── Model-Optimizer -> ../..  # Symlink (auto-created)
└── docs/                    # Documentation
    ├── configuration.md     # YAML formats, overrides, hf_local
    ├── architecture.md      # Design, factory system, typed tasks
    ├── testing.md           # Running tests, CI
    ├── claude_code.md       # Claude Code workflows
    └── contributing.md      # Adding models, bug reporting
```

## Documentation

| Guide | Description |
|-------|-------------|
| [Configuration](docs/configuration.md) | YAML formats, CLI overrides, flags, `hf_local` |
| [Architecture](docs/architecture.md) | Shared core, factory system, typed tasks, mount mechanism |
| [Testing](docs/testing.md) | Running tests locally and in CI |
| [Claude Code](docs/claude_code.md) | Submit, monitor, diagnose workflows |
| [Contributing](docs/contributing.md) | Adding models, typed tasks, bug reporting |
````
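The Quick Start above exports `SLURM_*` environment variables that the env-var-driven factory consumes. A hedged sketch of that pattern follows; the `SlurmConfig` fields and the `slurm_config_from_env` helper are illustrative assumptions, not the launcher's actual API:

```python
import os
from dataclasses import dataclass

@dataclass
class SlurmConfig:
    # Field names are assumptions mirroring the exported env vars.
    host: str
    account: str
    hf_local: str
    job_dir: str
    nodes: int = 1

def slurm_config_from_env(**overrides):
    """Build a config from SLURM_* env vars, then apply explicit overrides."""
    cfg = SlurmConfig(
        host=os.environ.get("SLURM_HOST", ""),
        account=os.environ.get("SLURM_ACCOUNT", ""),
        hf_local=os.environ.get("SLURM_HF_LOCAL", "/hf-local/"),
        job_dir=os.environ.get("SLURM_JOB_DIR", "experiments"),
    )
    for key, value in overrides.items():
        setattr(cfg, key, value)
    return cfg

os.environ["SLURM_HOST"] = "login-node.example.com"
print(slurm_config_from_env(nodes=2).host)  # → login-node.example.com
```

Layering env-var defaults under YAML/CLI overrides lets one config file run unchanged across clusters that differ only in login node, account, and shared paths.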

tools/launcher/__init__.py

Lines changed: 16 additions & 0 deletions
```python
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""ModelOpt Launcher — submit quantization, training, and evaluation jobs to Slurm clusters."""
```
Lines changed: 42 additions & 0 deletions
```bash
#!/bin/bash

# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"

source ${SCRIPT_DIR}/../service_utils.sh

###################################################################################################

if [ -z ${SLURM_ARRAY_TASK_ID} ]; then
    TASK_ID=0
else
    echo "SLURM_ARRAY_TASK_ID ${SLURM_ARRAY_TASK_ID}"
    TASK_ID=${SLURM_ARRAY_TASK_ID}
fi

if [ -z ${SLURM_ARRAY_TASK_COUNT} ]; then
    TASK_COUNT=1
else
    echo "SLURM_ARRAY_TASK_COUNT ${SLURM_ARRAY_TASK_COUNT}"
    TASK_COUNT=${SLURM_ARRAY_TASK_COUNT}
fi

trtllm-llmapi-launch python3 modules/Model-Optimizer/examples/speculative_decoding/collect_hidden_states/compute_hidden_states_trtllm.py \
    --model ${HF_MODEL_CKPT} \
    --dp-rank ${TASK_ID} \
    --dp-world-size ${TASK_COUNT} \
    ${@}
```
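The script above maps each Slurm array task to a `--dp-rank` out of `--dp-world-size`, i.e. data-parallel sharding of the workload. A hedged sketch of that idea (the `shard` helper and the interleaved split are illustrative; the real consumer of `--dp-rank` may partition differently):

```python
import os

def shard(items, dp_rank, dp_world_size):
    """Return the interleaved slice of `items` owned by this rank."""
    return items[dp_rank::dp_world_size]

# Mirror the shell defaults: rank 0 of 1 when not running as a job array.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))

samples = list(range(10))
print(shard(samples, task_id, task_count))
```

With a 4-task array, rank 1 would process items `[1, 5, 9]` of a 10-item dataset, so the tasks cover the data with no overlap.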
Lines changed: 40 additions & 0 deletions
```bash
#!/bin/bash

# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source ${SCRIPT_DIR}/../service_utils.sh

pip install -r modules/Model-Optimizer/examples/speculative_decoding/requirements.txt
pip install huggingface-hub>=1.2.1
export PATH=$PATH:/workspace/.local/bin

###################################################################################################

trap 'error_handler $0 $LINENO' ERR # ERROR HANDLER

bash modules/Model-Optimizer/examples/speculative_decoding/launch_train.sh \
    --model ${HF_MODEL_CKPT} \
    ${@}

python modules/Model-Optimizer/examples/speculative_decoding/scripts/export_hf_checkpoint.py \
    --model_path /scratchspace/eagle3 \
    --export_path /scratchspace/export

###################################################################################################

# This function handles the exit status (fails the CI).
#exit_handler $0
```
