Commit 9d2e608

[Fix]: $HOME in launcher eagle example (#1365)
### What does this PR do?

Type of change: Bug fix

Launcher example bug raised by @cjluo-nv.

Before fix: task1 in `tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml` fails.

Reason: because `HOME: /tmp` is set in the container, the enroot credentials in `$HOME/.config/enroot/.credentials` are not found, so the image import is attempted anonymously and fails with `401 Unauthorized`.

```
GpuFreq=control_disabled
pyxis: importing docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
Apr 28 13:35:59.491365 2515157 slurmstepd 0x155552c3b780: error: pyxis: child 2515158 failed with error code: 1
Apr 28 13:35:59.491415 2515157 slurmstepd 0x155552c3b780: error: pyxis: failed to import docker image
Apr 28 13:35:59.491433 2515157 slurmstepd 0x155552c3b780: error: pyxis: printing enroot log file:
Apr 28 13:35:59.491453 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Querying registry for permission grant
Apr 28 13:35:59.491469 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Authenticating with user: <anonymous>
Apr 28 13:35:59.491483 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Authentication succeeded
Apr 28 13:35:59.491499 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Fetching image manifest list
Apr 28 13:35:59.491512 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Fetching image manifest
Apr 28 13:35:59.491524 2515157 slurmstepd 0x155552c3b780: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/tensorrt-llm/release/manifests/1.3.0rc10 returned error code: 401 Unauthorized
Apr 28 13:35:59.491564 2515157 slurmstepd 0x155552c3b780: error: pyxis: couldn't start container
Apr 28 13:35:59.491579 2515157 slurmstepd 0x155552c3b780: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
Apr 28 13:35:59.491593 2515157 slurmstepd 0x155552c3b780: error: Failed to invoke spank plugin stack
Apr 28 13:35:59.515523 2515146 slurmstepd 0x155552c3b780: error: pyxis: child 2515240 failed with error code: 1
```

After fix:

```
GpuFreq=control_disabled
pyxis: importing docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
pyxis: imported docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Chores**
  * Updated example pipeline to use the standardized dataset example path.
  * Removed unnecessary per-task overrides of the process home and cache directory to simplify environment setup.
  * Preserved required model checkpoint environment setting for the relevant task so model resolution continues to work.

---------

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
1 parent 50706d1 commit 9d2e608

1 file changed

Lines changed: 1 addition & 6 deletions

tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml

@@ -24,7 +24,7 @@ pipeline:
   task_0:
     script: common/eagle3/make_dataset.sh
     args:
-      - -f modules/Model-Optimizer/examples/speculative_decoding/prepare_input_conversations/example_data_config.yaml
+      - -f modules/Model-Optimizer/examples/dataset/example_data_config.yaml
       - --full-conversations
     slurm_config:
       _factory_: "slurm_factory"
@@ -44,9 +44,6 @@ pipeline:
       - training.disable_tqdm=true
       - training.ar_validate_steps=500000
       - training.num_train_epochs=1
-    environment:
-      - HOME: /tmp
-      - TORCHINDUCTOR_CACHE_DIR: /tmp/torch_cache
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
@@ -68,8 +65,6 @@ pipeline:
       - --concurrency 1
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
-      - HOME: /tmp
-      - TORCHINDUCTOR_CACHE_DIR: /tmp/torch_cache
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
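
The effect of the removed `HOME: /tmp` override can be illustrated with a small, self-contained sketch. It is not code from the launcher, pyxis, or enroot: the helper names `credentials_path` and `credentials_visible` are hypothetical, the lookup location follows the `$HOME/.config/enroot/` path given in the commit message, and the file name `.credentials` is an assumption. The sketch only shows that a file resolved relative to `HOME` stops being found once `HOME` points at a different directory.

```python
import tempfile
from pathlib import Path


def credentials_path(env: dict) -> Path:
    # Location taken from the commit message: $HOME/.config/enroot/<file>.
    # The file name ".credentials" is an assumption made for this sketch.
    return Path(env["HOME"]) / ".config" / "enroot" / ".credentials"


def credentials_visible(env: dict) -> bool:
    # True only if a credentials file exists under the HOME given in env.
    return credentials_path(env).is_file()


# Two throwaway directories stand in for the user's real home and for /tmp.
with tempfile.TemporaryDirectory() as real_home, tempfile.TemporaryDirectory() as scratch:
    cred_file = credentials_path({"HOME": real_home})
    cred_file.parent.mkdir(parents=True)
    cred_file.write_text("placeholder\n")  # content is irrelevant to the lookup

    found_with_real_home = credentials_visible({"HOME": real_home})
    found_with_override = credentials_visible({"HOME": scratch})

print(found_with_real_home, found_with_override)
```

With the override in place the lookup misses the file, which is consistent with the anonymous authentication and the `401 Unauthorized` reported in the log before the fix.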
