Merged
@@ -115,7 +115,7 @@ recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/gpt3-175b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/gpt3_175b/nemo-gke/nemo2507/recipe
 ```
 
 ### Get cluster credentials
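The path moves in this PR follow a consistent pattern: the model directory switches to underscores, the framework directory drops `-pretraining`, and a NeMo version plus `recipe` segment is appended. A local script still pinned to the old layout can be rewritten with a one-liner along these lines (a sketch; `launch.sh` is a hypothetical file, and the old-to-new mapping is taken verbatim from the hunk above):

```shell
# Create a demo script that still uses the pre-PR recipe path (hypothetical file).
printf 'export RECIPE_ROOT=$REPO_ROOT/training/a3mega/gpt3-175b/nemo-pretraining-gke\n' > launch.sh

# Rewrite old -> new layout, using the mapping from this PR's diff (GNU sed -i).
sed -i 's|a3mega/gpt3-175b/nemo-pretraining-gke|a3mega/gpt3_175b/nemo-gke/nemo2507/recipe|' launch.sh

cat launch.sh
# → export RECIPE_ROOT=$REPO_ROOT/training/a3mega/gpt3_175b/nemo-gke/nemo2507/recipe
```

The same `sed` pattern, with the respective old and new path pair substituted, applies to each of the other renames in this PR.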
@@ -111,7 +111,7 @@ From your client, clone the `gpu-recipes` repository and set a reference to the
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke-gcs
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3_70b/nemo-gke-gcs/nemo2507/recipe
 ```
 
 ### Get cluster credentials
@@ -147,7 +147,7 @@ From your client, clone the `gpu-recipes` repository and set a reference to the
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke-resiliency
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3_70b/nemo-gke-resiliency/nemo2507/recipe
 ```
 
 ### Get cluster credentials
@@ -26,7 +26,7 @@ Achieving high GoodPut can be challenging due to several factors common in large
 | **Stragglers and Performance Bottlenecks** | Slower nodes delay the entire job, underutilizing resources. | 3-7% |
 | **Lack of Rapid Failure Detection and Diagnosis** | Longer detection/diagnosis time increases downtime. | 2-5% |
 
-This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Llama 3.1 70B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3mega/llama3-1-70b/nemo-pretraining-gke-resiliency) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
+This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Llama 3.1 70B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3mega/llama3_70b/nemo-gke-resiliency/nemo2507/recipe) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
 
 ## TLDR: Recommended Lego Blocks for Your Deployment
 For customers looking to improve GoodPut on their own ML training workloads, here’s a concise guide to the key strategies discussed in this document, presented as 'Lego blocks' you can implement:
@@ -113,7 +113,7 @@ recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-70b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3_70b/nemo-gke/nemo2507/256gpus-bf16/recipe/old_llama3_70b
 ```
 
 ### Get cluster credentials
@@ -109,7 +109,7 @@ recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3_70b/nemo-gke/nemo2507/256gpus-bf16/recipe
 ```
 
 ### Get cluster credentials
@@ -106,7 +106,7 @@ recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3mega/mixtral-8x7b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3mega/mixtral_8x7b/nemo-gke/nemo2507/recipe
 ```
 
 ### Get cluster credentials
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/gpt-oss-120b/megatron-bridge-pretraining-gke/8node-BF16-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/gpt_oss_120b/nemo-gke/nemo2602/64gpus-bf16-gbs1280/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -159,7 +159,7 @@ From your client, clone the `gpu-recipes` repository and set a reference to the
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama3-1-405b/nemo-pretraining-gke-resiliency
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama31_405b/nemo-gke-resiliency/nemo2412/recipe
 ```
 
 ### Get cluster credentials
@@ -26,7 +26,7 @@ Achieving high GoodPut can be challenging due to several factors common in large
 | **Stragglers and Performance Bottlenecks** | Slower nodes delay the entire job, underutilizing resources. | 3-7% |
 | **Lack of Rapid Failure Detection and Diagnosis** | Longer detection/diagnosis time increases downtime. | 2-5% |
 
-This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Llama 3.1 405B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3ultra/llama3-1-405b/nemo-pretraining-gke-resiliency) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
+This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Llama 3.1 405B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3ultra/llama31_405b/nemo-gke-resiliency/nemo2412/recipe) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
 
 ## TLDR: Recommended Lego Blocks for Your Deployment
 For customers looking to improve GoodPut on their own ML training workloads, here’s a concise guide to the key strategies discussed in this document, presented as 'Lego blocks' you can implement:
@@ -83,7 +83,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama3-1-405b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama31_405b/nemo-gke/nemo2412/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -82,7 +82,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama3-1-70b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/llama3_70b/nemo-gke/nemo2407/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -157,7 +157,7 @@ From your client, clone the `gpu-recipes` repository and set a reference to the
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/mixtral-8x7b/nemo-pretraining-gke-resiliency
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/mixtral_8x7b/nemo-gke-resiliency/nemo2407/recipe
 ```
 
 ### Get cluster credentials
@@ -26,7 +26,7 @@ Achieving high GoodPut can be challenging due to several factors common in large
 | **Stragglers and Performance Bottlenecks** | Slower nodes delay the entire job, underutilizing resources. | 3-7% |
 | **Lack of Rapid Failure Detection and Diagnosis** | Longer detection/diagnosis time increases downtime. | 2-5% |
 
-This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Mixtral 8x7B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3ultra/mixtral-8x7b/nemo-pretraining-gke-resiliency) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
+This guide provides a general overview of techniques and tools to address these common challenges and maximize ML GoodPut. While the principles discussed are broadly applicable, we will use the [Mixtral 8x7B pretraining recipe](https://github.com/AI-Hypercomputer/gpu-recipes/tree/main/training/a3ultra/mixtral_8x7b/nemo-gke-resiliency/nemo2407/recipe) as a concrete case study to illustrate how these components can be implemented and customized for large-scale training workloads on Google Cloud. The goal is to showcase a "DIY" style product, where users can understand and selectively adopt these "Lego blocks" to build resilient and efficient training pipelines.
 
 ## TLDR: Recommended Lego Blocks for Your Deployment
 For customers looking to improve GoodPut on their own ML training workloads, here’s a concise guide to the key strategies discussed in this document, presented as 'Lego blocks' you can implement:
@@ -81,7 +81,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/mixtral-8x7b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/mixtral_8x7b/nemo-gke/nemo2407/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/qwen3-30b-a3b/megatron-bridge-pretraining-gke/2node-BF16-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/qwen3_30b_a3b/nemo-gke/nemo2602/16gpus-bf16-gbs1024/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/qwen3-30b-a3b/megatron-bridge-pretraining-gke/2node-FP8CS-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a3ultra/qwen3_30b_a3b/nemo-gke/nemo2602/16gpus-fp8cs-gbs1024/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-8b/megatron-bridge-pretraining-gke/1node-FP8CS-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-8b/megatron-bridge-gke/nemo2602/8gpus-fp8cs-seq8192-gbs128/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/llama31-405b/megatron-bridge-pretraining-gke/32node-FP8CS-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama31_405b/nemo-gke/nemo2602/256gpus-fp8cs-gbs256/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3_70b/nemo-gke/nemo2507/256gpus-bf16-gbs256/recipe/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3_70b/nemo-gke/nemo2507/256gpus-bf16-gbs256/recipe
 cd $RECIPE_ROOT
 ```
 
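One hunk above fixes a doubled trailing segment (`.../recipe/recipe`). A tiny guard for that class of typo might look like this (a sketch; the path literal is the pre-fix value from the diff):

```shell
# Flag a path whose last two segments are identical (e.g. ".../recipe/recipe").
p='training/a4/llama3_70b/nemo-gke/nemo2507/256gpus-bf16-gbs256/recipe/recipe'
last=$(basename "$p")
parent=$(basename "$(dirname "$p")")
if [ "$last" = "$parent" ]; then
  echo "duplicated segment: $last"
fi
# → duplicated segment: recipe
```

Running a check like this over the repository's README snippets would catch the duplication before it lands.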
@@ -83,7 +83,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/mixtral-8x7b/nemo-pretraining-gke
+export RECIPE_ROOT=$REPO_ROOT/training/a4/mixtral_8x7b/nemo-gke/nemo2507/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3-30b-a3b/megatron-bridge-pretraining-gke/1node-FP8MX-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4/qwen3_30b_a3b/nemo-gke/nemo2602/8gpus-fp8mx-seq4096-gbs512/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-70b/megatron-bridge-pretraining-gke/16node-FP8CS-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/64gpus-bf16-gbs256
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-70b/megatron-bridge-pretraining-gke/16node-FP8CS-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/64gpus-fp8cs-gbs256
 cd $RECIPE_ROOT
 ```
 
@@ -73,7 +73,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-70b/megatron-bridge-pretraining-gke/16node-FP8MX-GBSunknown/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/64gpus-fp8mx-gbs256
 cd $RECIPE_ROOT
 ```
 
@@ -72,7 +72,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/g4/llama3-1-70b/nemo-pretraining-gke/4gpu-bf16
+export RECIPE_ROOT=$REPO_ROOT/training/g4/llama3_70b/nemo-finetuning-gke/nemo2507/4gpus-bf16-gbs32/recipe
 cd $RECIPE_ROOT
 ```
 
@@ -72,7 +72,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/g4/llama3-1-70b/nemo-finetuning-gke/8gpu-bf16
+export RECIPE_ROOT=$REPO_ROOT/training/g4/llama3_70b/nemo-finetuning-gke/nemo2507/8gpus-bf16-gbs32/recipe
 cd $RECIPE_ROOT
 ```
 