[Fix]: $HOME in launcher eagle example (#1365)

h-guo18 · web-flow · commit 9d2e6087d1c0 · 2026-05-01T16:31:46.000-07:00
### What does this PR do?

Type of change: Bug fix

Fixes a launcher example bug raised by @cjluo-nv.

Before fix: task1 in `tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml` fails.

Reason: with `HOME: /tmp` set in the container environment, the enroot credentials in `$HOME/.config/enroot/.credentials` are not found, so the image import authenticates as `<anonymous>` and fails with `401 Unauthorized`:

```
GpuFreq=control_disabled
pyxis: importing docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
Apr 28 13:35:59.491365 2515157 slurmstepd 0x155552c3b780: error: pyxis: child 2515158 failed with error code: 1
Apr 28 13:35:59.491415 2515157 slurmstepd 0x155552c3b780: error: pyxis: failed to import docker image
Apr 28 13:35:59.491433 2515157 slurmstepd 0x155552c3b780: error: pyxis: printing enroot log file:
Apr 28 13:35:59.491453 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Querying registry for permission grant
Apr 28 13:35:59.491469 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Authenticating with user: <anonymous>
Apr 28 13:35:59.491483 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Authentication succeeded
Apr 28 13:35:59.491499 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Fetching image manifest list
Apr 28 13:35:59.491512 2515157 slurmstepd 0x155552c3b780: error: pyxis: [INFO] Fetching image manifest
Apr 28 13:35:59.491524 2515157 slurmstepd 0x155552c3b780: error: pyxis: [ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/tensorrt-llm/release/manifests/1.3.0rc10 returned error code: 401 Unauthorized
Apr 28 13:35:59.491564 2515157 slurmstepd 0x155552c3b780: error: pyxis: couldn't start container
Apr 28 13:35:59.491579 2515157 slurmstepd 0x155552c3b780: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
Apr 28 13:35:59.491593 2515157 slurmstepd 0x155552c3b780: error: Failed to invoke spank plugin stack
Apr 28 13:35:59.515523 2515146 slurmstepd 0x155552c3b780: error: pyxis: child 2515240 failed with error code: 1
```

After fix:

```
GpuFreq=control_disabled
pyxis: importing docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
pyxis: imported docker image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
```

## Summary by CodeRabbit

* **Chores**
  * Updated example pipeline to use the standardized dataset example path.
  * Removed unnecessary per-task overrides of the process home and cache directory to simplify environment setup.
  * Preserved required model checkpoint environment setting for the relevant task so model resolution continues to work.

---------

Signed-off-by: h-guo18 <67671475+h-guo18@users.noreply.github.com>
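
The failure mode described above can be illustrated with a small sketch of how the credentials path moves when `HOME` changes. It assumes enroot's documented defaults (`ENROOT_CONFIG_PATH` falling back to `$XDG_CONFIG_HOME/enroot`, and `XDG_CONFIG_HOME` falling back to `$HOME/.config`) and that the override is visible to the process importing the image. The helper name `enroot_credentials_path` and the user `alice` are hypothetical, not part of the launcher.

```python
def enroot_credentials_path(env: dict[str, str]) -> str:
    """Return the path where enroot would look for credentials.

    Assumption: default lookup chain of
    ENROOT_CONFIG_PATH -> $XDG_CONFIG_HOME/enroot -> $HOME/.config/enroot.
    """
    home = env.get("HOME", "")
    xdg_config = env.get("XDG_CONFIG_HOME") or f"{home}/.config"
    enroot_config = env.get("ENROOT_CONFIG_PATH") or f"{xdg_config}/enroot"
    return f"{enroot_config}/.credentials"


# Hypothetical user environment, then the same environment with the
# per-task override that this PR removes.
user_env = {"HOME": "/home/alice"}
overridden_env = {**user_env, "HOME": "/tmp"}

print(enroot_credentials_path(user_env))        # /home/alice/.config/enroot/.credentials
print(enroot_credentials_path(overridden_env))  # /tmp/.config/enroot/.credentials
```

With `HOME` pointing at `/tmp`, the lookup lands on a path that normally holds no credentials file, which is consistent with the anonymous authentication seen in the log.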
diff --git a/tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml b/tools/launcher/examples/Qwen/Qwen3-8B/hf_online_eagle3.yaml
@@ -24,7 +24,7 @@ pipeline:
   task_0:
     script: common/eagle3/make_dataset.sh
     args:
-      - -f modules/Model-Optimizer/examples/speculative_decoding/prepare_input_conversations/example_data_config.yaml
+      - -f modules/Model-Optimizer/examples/dataset/example_data_config.yaml
       - --full-conversations
     slurm_config:
       _factory_: "slurm_factory"
@@ -44,9 +44,6 @@ pipeline:
       - training.disable_tqdm=true
       - training.ar_validate_steps=500000
       - training.num_train_epochs=1
-    environment:
-      - HOME: /tmp
-      - TORCHINDUCTOR_CACHE_DIR: /tmp/torch_cache
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
@@ -68,8 +65,6 @@ pipeline:
       - --concurrency 1
     environment:
       - HF_MODEL_CKPT: <<global_vars.hf_model>>
-      - HOME: /tmp
-      - TORCHINDUCTOR_CACHE_DIR: /tmp/torch_cache
     slurm_config:
       _factory_: "slurm_factory"
       nodes: 1
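
To guard against the same override reappearing in other launcher examples, a check along the following lines could be run over an already-parsed pipeline. This is a minimal sketch, not part of the launcher: the function name `find_env_overrides` and the task names `task_1` and `task_2` are hypothetical, and the `environment` layout (a list of single-key mappings) is taken from the diff above.

```python
def find_env_overrides(pipeline: dict, names=("HOME",)) -> list[tuple[str, str, str]]:
    """List (task, variable, value) for each flagged variable set per task."""
    hits = []
    for task_name, task in pipeline.items():
        if not isinstance(task, dict):
            continue
        # `environment` is a list of single-key mappings, as in the YAML above.
        for entry in task.get("environment") or []:
            for key, value in entry.items():
                if key in names:
                    hits.append((task_name, key, str(value)))
    return hits


# Pre-fix shape of the example; task names are assumed for illustration.
pipeline_before = {
    "task_0": {"script": "common/eagle3/make_dataset.sh"},
    "task_1": {"environment": [{"HOME": "/tmp"},
                               {"TORCHINDUCTOR_CACHE_DIR": "/tmp/torch_cache"}]},
    "task_2": {"environment": [{"HF_MODEL_CKPT": "<<global_vars.hf_model>>"},
                               {"HOME": "/tmp"}]},
}

print(find_env_overrides(pipeline_before))
# [('task_1', 'HOME', '/tmp'), ('task_2', 'HOME', '/tmp')]
```

After this change, the same check on the example would return an empty list, since only `HF_MODEL_CKPT` remains in the task environments.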