Restructure Slurm training recipes to align with standard naming convention #225
Open
weikuo0506 wants to merge 2 commits into AI-Hypercomputer:main
…ention
This commit restructures the Slurm-based training recipes in the training/ directory.
Key changes:
- Renamed directories to follow the 7-level convention: training/{accelerator}/{model}/{framework_runtime}/{version}/{scale_params}/recipe
- Standardized framework names to *-slurm.
- Omitted SEQ length for NeMo recipes and cases where it was not specified in files.
- Corrected GBS for Qwen3-30b based on file content (512).
- Corrected GPU count for Wan based on file content (32).
- Cleaned up empty directories.
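As a rough illustration of the 7-level layout described above, the sketch below assembles a recipe path from its parts and checks the `-slurm` suffix. The helper name `build_recipe_path` and the `example-*` values are hypothetical and do not come from the repository; the format of `{version}` and `{scale_params}` is not spelled out here, so both are treated as opaque strings.

```python
from pathlib import PurePosixPath

# Level names taken from the convention in the commit message.
LEVELS = ("training", "accelerator", "model", "framework_runtime",
          "version", "scale_params", "recipe")


def build_recipe_path(accelerator, model, framework_runtime, version, scale_params):
    """Join the five variable levels between the fixed 'training' and 'recipe' ends."""
    parts = [accelerator, model, framework_runtime, version, scale_params]
    for name, value in zip(LEVELS[1:-1], parts):
        # Each level must be exactly one directory name.
        if not value or "/" in value:
            raise ValueError(f"{name} must be a single non-empty path segment: {value!r}")
    if not framework_runtime.endswith("-slurm"):
        raise ValueError(f"Slurm framework names should end in '-slurm': {framework_runtime!r}")
    return PurePosixPath("training", *parts, "recipe")


# Hypothetical values, used only to show the shape of the result.
path = build_recipe_path("a4", "example-model", "nemo-slurm",
                         "example-version", "example-scale")
print(path)             # training/a4/example-model/nemo-slurm/example-version/example-scale/recipe
print(len(path.parts))  # 7
```

Using `PurePosixPath` keeps the sketch independent of the local filesystem, since it only manipulates path strings.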
This commit removes the accidental nested recipe/recipe folder structure in all Slurm recipes (a3ultra, a4, and a4x) and updates the corresponding README files to use the correct, non-truncated RECIPE_ROOT paths.
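One way to confirm that no nested recipe/recipe folders remain is to scan the tree for a `recipe` directory whose parent is also named `recipe`. The following is a minimal sketch of such a check, not the method used in this PR; `find_nested_recipe_dirs` and the demo directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_nested_recipe_dirs(root):
    """Return directories named 'recipe' whose parent is also named 'recipe'."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("recipe")
        if p.is_dir() and p.parent.name == "recipe"
    )


# Demo on a throwaway tree; the directory names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp, "training", "a4", "model-a", "recipe")
    bad = Path(tmp, "training", "a4", "model-b", "recipe", "recipe")
    good.mkdir(parents=True)
    bad.mkdir(parents=True)
    nested = find_nested_recipe_dirs(tmp)
    # Only the model-b entry should be reported.
    print([p.relative_to(tmp).as_posix() for p in nested])
```

Run against a checkout, an empty result would suggest the accidental nesting is gone.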
02c326c to 5cf4ecb
This PR restructures the Slurm-based training recipes in the training/ directory to comply with the new standardized folder and file naming convention. This follows the previous work done for GKE recipes.
Key Changes:
- Renamed directories to follow training/{accelerator}/{model}/{framework_runtime}/{version}/{scale_params}/recipe.
- Standardized framework names to *-slurm (e.g., megatron-bridge-slurm, nemo-slurm).
- Aligned precision names (fp8cs, fp8mx) to match GKE.
- Omitted seq{len} for all NeMo recipes and cases where it was not explicitly specified in the configuration files (e.g., Llama 3.1 405B Slurm recipes).
- Corrected GBS for qwen3_30b_a3b from 1024 to 512 based on the actual launch_script.sh.
- Corrected GPU count for wan_14b from 64 to 32 (8 nodes * 4 GPUs/node) based on wan_14b_benchmark.sh.
Scope:
This PR focuses strictly on Slurm recipes across a4, a4x, and a3ultra accelerators. GKE recipes in a3ultra and a3mega are left untouched for now.
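The GPU-count correction for wan_14b comes down to nodes times GPUs per node. A small sketch of that sanity check follows; the function name `gpu_count_matches` is hypothetical, and the numbers are the ones quoted in this description.

```python
def gpu_count_matches(declared_gpus, nodes, gpus_per_node):
    """True when the GPU count in a recipe name equals nodes * GPUs per node."""
    return declared_gpus == nodes * gpus_per_node


# wan_14b: 8 nodes with 4 GPUs each.
print(gpu_count_matches(64, nodes=8, gpus_per_node=4))  # False (old directory name)
print(gpu_count_matches(32, nodes=8, gpus_per_node=4))  # True (corrected name)
```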