fix(aws): align Kueue batch memory with EKS allocatable + safety margin #75
Open
mike-ainsel wants to merge 4 commits into
Conversation
…argin
Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot
schedule on a real node — EKS reserves ~14 GiB for kubelet/OS, leaving
allocatable at ~486 GiB. Subtract an additional 1 GiB safety margin for
customer-installed DaemonSets (Datadog, Falco, Wiz, etc.) and the effective
per-node ceiling is 485 GiB.
Changes:
- CloudFormation Mappings BatchMemoryGi (per-node × max-batch-nodes):
small 500 -> 485 (1 node)
medium 1000 -> 970 (2 nodes)
large 2000 -> 1940 (4 nodes)
xlarge 4000 -> 3880 (8 nodes)
- kueue.maxJobResources.memory: 500Gi -> 485Gi (both CFN-rendered values
and values-aws-s3.yaml)
- dedicated.resources.batch.memory: 500Gi -> 485Gi (values-aws-s3.yaml)
- Parameter description and Mappings header comment updated.
Also bumps platforma controller memory limit from 16Gi -> 32Gi (matches
the request:limit ratio used on GCP) so the controller has burst headroom
under heavy workflow scheduling. Memory request stays at 16Gi.
This combines two related fixes: aligning the per-job ceiling with what
EKS can actually schedule, and reserving headroom for customer DaemonSets
so the chart's defaults work out of the box even with extra agents.
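The arithmetic behind the new Mappings values can be sketched as below. This is a minimal illustration that uses only the figures quoted in this commit message; the constant and function names are hypothetical and do not appear in the chart or the CloudFormation template.

```python
# Figures come from the commit message; all names here are illustrative only.
NOMINAL_GI = 500        # per-node value previously assumed for r7i.16xlarge
EKS_RESERVED_GI = 14    # approximate kubelet/OS reservation on EKS
SAFETY_MARGIN_GI = 1    # headroom for customer-installed DaemonSets

# Maximum batch nodes per deployment size, as listed in the Mappings change.
MAX_BATCH_NODES = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}


def per_node_ceiling_gi() -> int:
    """Effective per-node (and per-job) memory ceiling in GiB."""
    return NOMINAL_GI - EKS_RESERVED_GI - SAFETY_MARGIN_GI


def batch_memory_gi() -> dict[str, int]:
    """Total batch memory per deployment size: per-node ceiling x node count."""
    ceiling = per_node_ceiling_gi()
    return {size: ceiling * nodes for size, nodes in MAX_BATCH_NODES.items()}


if __name__ == "__main__":
    print(per_node_ceiling_gi())
    print(batch_memory_gi())
```

The reservation and margin are approximations stated in the message, so the measured allocatable on a real node remains the source of truth.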
… GCP (484 GiB)
Reduces the per-node batch memory ceiling from 485 GiB to 484 GiB to
subtract ~1 GiB of EKS-managed DaemonSet overhead (aws-node, kube-proxy,
ebs-csi) explicitly, matching the GCP per-job ceiling on n2d-highmem-64.
Same deployment_size label now means the same workload capacity on both
clouds.
- CloudFormation Mappings BatchMemoryGi:
small 485 -> 484 (1 node)
medium 970 -> 968 (2 nodes)
large 1940 -> 1936 (4 nodes)
xlarge 3880 -> 3872 (8 nodes)
- kueue.maxJobResources.memory: 485Gi -> 484Gi (CFN-rendered +
values-aws-s3.yaml)
- kueue.dedicated.resources.batch.memory: 485Gi -> 484Gi (values-aws-s3.yaml)
…aint (singular)
- BuildSpecRevision: 2 -> 3 forces CodeBuild to re-run on stack update,
picking up the 484Gi Kueue values from the earlier commit on this branch
(without this bump, existing stacks would keep the cached buildspec and
never apply the new Kueue ceiling).
- Cluster Autoscaler --set arg: startup-taints -> startup-taint (CA flag is
singular; the plural form was silently ignored, leaving GPU nodes without
the intended startup taint).
…es, max 25 chars)
Sync from pl main commit 1f25f0cdd (review feedback): tighten the
ClusterName parameter to '^[a-z0-9][a-z0-9-]{0,24}$' to match the
constraints of derived resource names:
- ECR pull-through cache prefix (quay-${ClusterName}) must stay under
AWS's 30-char limit → 25-char ceiling on ClusterName.
- S3 bucket name (platforma-${ClusterName}-...) is S3-naming-rules
bound → no underscores, no uppercase.
Regression on this branch: the loose pattern (alphanumeric +
underscores + uppercase, 1-100 chars) silently allowed names that
then break ECR/S3 downstream.
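A quick way to sanity-check the tightened pattern against the derived-name limit is sketched below. The regex and the `quay-` prefix are taken from this commit message; the helper names and the 30-character constant are illustrative, and the check only mirrors the constraint described here, not CloudFormation's own AllowedPattern evaluation.

```python
import re

# Pattern quoted in the commit message for the ClusterName parameter.
CLUSTER_NAME_PATTERN = r"^[a-z0-9][a-z0-9-]{0,24}$"
# Limit on the ECR pull-through cache prefix, as cited in the commit message.
ECR_PREFIX_LIMIT = 30


def is_valid_cluster_name(name: str) -> bool:
    """True if the name satisfies the tightened pattern.

    fullmatch is used so that a trailing newline cannot slip past `$`.
    """
    return re.fullmatch(CLUSTER_NAME_PATTERN, name) is not None


def ecr_prefix(name: str) -> str:
    """Derived pull-through cache prefix (quay-${ClusterName})."""
    return f"quay-{name}"


if __name__ == "__main__":
    for candidate in ["prod-1", "My_Cluster", "a" * 25, "a" * 26]:
        ok = is_valid_cluster_name(candidate)
        print(candidate, ok, len(ecr_prefix(candidate)))
```

With a 25-character name the derived prefix is exactly 30 characters long, which is why the pattern caps the name at 25.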
Summary
- Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot schedule on a real node: EKS reserves ~14 GiB for kubelet/OS, leaving allocatable at ~486 GiB.
- Bump app.resources.limits.memory from 16Gi to 32Gi so the controller has burst headroom under heavy workflow scheduling (request stays at 16Gi).
This combines two related fixes into one PR: aligning the per-job ceiling with what EKS can actually schedule, and reserving headroom for customer DaemonSets.
Changes
cloudformation-eks-1-35.yaml:
- BatchMemoryGi (small): 500 -> 485
- BatchMemoryGi (medium): 1000 -> 970
- BatchMemoryGi (large): 2000 -> 1940
- BatchMemoryGi (xlarge): 4000 -> 3880
- kueue.maxJobResources.memory (rendered): 500Gi -> 485Gi
- app.resources.limits.memory (rendered): 16Gi -> 32Gi
Parameter description and the Mappings header comment also updated to mention the safety-margin rationale.
values-aws-s3.yaml: the same three values (maxJobResources.memory, dedicated.resources.batch.memory, app.limits.memory), kept in sync with the CFN-rendered block.
Sister change on GCP
GCP equivalent applied separately in core/pl (refactor/gcp-split-infra-platforma-modules, commit 55e73ac0): per-pool memory_gi = measured GKE allocatable − GKE DS overhead (~1 GiB) − 1 GiB safety margin. End-state per-job ceiling is 484 GiB on GCP (n2d-highmem-64) and 485 GiB on AWS (r7i.16xlarge); they differ by 1 GiB because GKE's managed DS footprint is ~1 GiB larger than EKS's.
Test plan
- … small-size cluster: verify ClusterQueue reports BatchMemoryGi=485 after rollout.
- … (62 CPU / 485Gi): verify ProvisioningRequest succeeds and pod schedules.
- … limits.memory=32Gi and request still 16Gi.
- … memory: 486Gi: should be rejected by Kueue with "exceeds maxJobResources".
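The boundary that the last test-plan item exercises can be expressed as a small comparison. The sketch below is not Kueue's admission logic; it only handles whole-number `Gi` quantities, and both function names are hypothetical.

```python
def parse_gi(quantity: str) -> int:
    """Parse a whole-number 'Gi' quantity such as '485Gi'.

    Other Kubernetes quantity suffixes (Mi, G, ...) are deliberately not handled.
    """
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def exceeds_ceiling(requested: str, ceiling: str = "485Gi") -> bool:
    """True if the requested memory is above the per-job ceiling."""
    return parse_gi(requested) > parse_gi(ceiling)


if __name__ == "__main__":
    print(exceeds_ceiling("485Gi"))  # at the ceiling
    print(exceeds_ceiling("486Gi"))  # one GiB over
```

A request equal to the ceiling is expected to be admitted, while anything above it should be the case that triggers the rejection described in the test plan.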