Restructure Slurm training recipes to align with standard naming convention #225

Open
weikuo0506 wants to merge 2 commits into AI-Hypercomputer:main from weikuo0506:restructure-slurm-recipes

@weikuo0506
Contributor

This PR restructures the Slurm-based training recipes in the training/ directory to comply with the new standardized folder and file naming convention. This follows the previous work done for GKE recipes.

Key Changes:

  • 7-Level Directory Structure: Renamed Slurm directories to follow the standard: training/{accelerator}/{model}/{framework_runtime}/{version}/{scale_params}/recipe.
  • Framework Normalization: Standardized framework names to *-slurm (e.g., megatron-bridge-slurm, nemo-slurm).
  • Scale Parameters Alignment:
    • Used contiguous precision names with no separator (e.g., fp8cs, fp8mx) to match the GKE recipes.
    • Omitted seq{len} for all NeMo recipes, and for any other recipe whose configuration files do not explicitly specify a sequence length (e.g., the Llama 3.1 405B Slurm recipes).
  • Parameter Corrections based on File Content:
    • Corrected Global Batch Size (GBS) for qwen3_30b_a3b from 1024 to 512 based on the actual launch_script.sh.
    • Corrected total GPU count for wan_14b from 64 to 32 (8 nodes * 4 GPUs/node) based on wan_14b_benchmark.sh.
  • Reference Updates: Updated internal path references in the affected scripts, READMEs, and Slurm files.
  • Cleanup: Removed empty parent directories left after moving files.
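
The naming rules listed above can be checked mechanically. The following is a minimal sketch, not part of this PR: the function name `check_slurm_recipe_path` is hypothetical, the accelerator set is taken from the Scope section, and the `version` and `scale_params` values in the example path are placeholders, since their exact format is not spelled out here.

```python
from pathlib import PurePosixPath

# Levels named in the PR description:
# training/{accelerator}/{model}/{framework_runtime}/{version}/{scale_params}/recipe
EXPECTED_LEVELS = 7
# Accelerators listed under "Scope" for the Slurm recipes.
SLURM_ACCELERATORS = {"a4", "a4x", "a3ultra"}


def check_slurm_recipe_path(path: str) -> list[str]:
    """Return the problems found in a recipe path (an empty list means none)."""
    parts = PurePosixPath(path).parts
    if len(parts) != EXPECTED_LEVELS:
        return [f"expected {EXPECTED_LEVELS} levels, found {len(parts)}"]

    root, accelerator, _model, framework, _version, _scale_params, leaf = parts
    problems = []
    if root != "training":
        problems.append("first level must be 'training'")
    if accelerator not in SLURM_ACCELERATORS:
        problems.append(f"unexpected accelerator: {accelerator}")
    if not framework.endswith("-slurm"):
        problems.append(f"framework must end in '-slurm': {framework}")
    if leaf != "recipe":
        problems.append("last level must be 'recipe'")
    return problems


# Hypothetical example path; the version and scale_params values are placeholders.
example = "training/a4/qwen3_30b_a3b/megatron-bridge-slurm/some-version/some-scale-params/recipe"
print(check_slurm_recipe_path(example))
```

Because the check counts levels, a path with an extra trailing `recipe` component is reported as having eight levels rather than seven.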

Scope:

This PR focuses strictly on Slurm recipes across a4, a4x, and a3ultra accelerators. GKE recipes in a3ultra and a3mega are left untouched for now.

weikuo0506 added 2 commits May 8, 2026 21:51
…ention

This commit restructures the Slurm-based training recipes in the training/ directory.

Key changes:
- Renamed directories to follow the 7-level convention: training/{accelerator}/{model}/{framework_runtime}/{version}/{scale_params}/recipe
- Standardized framework names to *-slurm.
- Omitted SEQ length for NeMo recipes and cases where it was not specified in files.
- Corrected GBS for Qwen3-30b based on file content (512).
- Corrected GPU count for Wan based on file content (32).
- Cleaned up empty directories.

This commit removes the accidental nested recipe/recipe folder structure in all Slurm recipes (a3ultra, a4, and a4x) and updates the corresponding README files to use the correct, non-truncated RECIPE_ROOT paths.
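
A nested recipe/recipe folder like the one described above can be detected with a short directory walk. This is a sketch under assumptions rather than the tooling used for this commit: `find_nested_recipe_dirs` is a hypothetical helper, and the demo runs on a throwaway tree with made-up directory names instead of the real training/ directory.

```python
import tempfile
from pathlib import Path


def find_nested_recipe_dirs(root):
    """Return POSIX-style paths, relative to root, of every recipe/recipe directory."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("recipe")
        if p.is_dir() and p.parent.name == "recipe"
    )


# Demo on a temporary tree with hypothetical directory names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "good" / "recipe").mkdir(parents=True)
    (Path(tmp) / "bad" / "recipe" / "recipe").mkdir(parents=True)
    found = find_nested_recipe_dirs(tmp)

print(found)
```

Pointing the helper at a checkout's training/ directory should return an empty list once the nesting has been removed.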
@weikuo0506 force-pushed the restructure-slurm-recipes branch from 02c326c to 5cf4ecb on May 9, 2026 08:29