fix(aws): align Kueue batch memory with EKS allocatable + safety margin by mike-ainsel · Pull Request #75 · milaboratory/platforma-helm

mike-ainsel · 2026-05-26T11:55:47Z

Summary

Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot schedule on a real node — EKS reserves ~14 GiB for kubelet/OS, leaving allocatable at ~486 GiB.
Subtracts an additional 1 GiB safety margin for customer-installed DaemonSets (Datadog, Falco, Wiz, etc.) so the chart's defaults work out of the box even with extra agents — per-node ceiling becomes 485 GiB.
Also bumps platforma controller app.resources.limits.memory from 16Gi to 32Gi so the controller has burst headroom under heavy workflow scheduling (request stays at 16Gi).

This combines two related fixes into one PR: aligning the per-job ceiling with what EKS can actually schedule, and reserving headroom for customer DaemonSets.
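The arithmetic behind the 485 GiB figure can be sketched as follows. This is only an illustration using the approximate numbers quoted in this summary; the constant and function names are hypothetical and do not appear in the chart.

```python
# Figures quoted in the PR summary, all in GiB. Names are illustrative only.
NOMINAL_GIB = 500        # capacity the chart previously assumed for r7i.16xlarge
EKS_RESERVED_GIB = 14    # approximate kubelet/OS reservation on EKS
SAFETY_MARGIN_GIB = 1    # headroom for customer-installed DaemonSets


def per_node_ceiling(nominal: int, reserved: int, margin: int) -> int:
    """Return the largest per-job memory request (GiB) expected to fit on one node."""
    allocatable = nominal - reserved  # what the scheduler can actually hand out
    return allocatable - margin       # leave room for extra agents


ceiling = per_node_ceiling(NOMINAL_GIB, EKS_RESERVED_GIB, SAFETY_MARGIN_GIB)
print(f"{ceiling}Gi")
```

With the figures above this yields 485, the new value of `kueue.maxJobResources.memory`.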

Changes

cloudformation-eks-1-35.yaml:

Mapping	Old	New
`BatchMemoryGi` (small)	`500`	`485`
`BatchMemoryGi` (medium)	`1000`	`970`
`BatchMemoryGi` (large)	`2000`	`1940`
`BatchMemoryGi` (xlarge)	`4000`	`3880`
`kueue.maxJobResources.memory` (rendered)	`500Gi`	`485Gi`
`app.resources.limits.memory` (rendered)	`16Gi`	`32Gi`

Parameter description and the Mappings header comment also updated to mention the safety-margin rationale.
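As a sanity check on the table, each `BatchMemoryGi` value is the per-node ceiling multiplied by the maximum number of batch nodes for that deployment size. The sketch below assumes the node counts listed in the first commit message (1, 2, 4, 8); the dictionary and function names are hypothetical.

```python
# Max batch nodes per deployment size, as listed in the first commit message.
MAX_BATCH_NODES = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}


def batch_memory_gi(per_node_gib: int) -> dict[str, int]:
    """Scale a per-node ceiling (GiB) to the total batch quota for each size."""
    return {size: per_node_gib * nodes for size, nodes in MAX_BATCH_NODES.items()}


# Ceiling from the PR description (485 GiB per node).
print(batch_memory_gi(485))
# Ceiling after the follow-up commit on this branch (484 GiB per node).
print(batch_memory_gi(484))
```

The same helper reproduces both sets of mapping values on this branch, which makes it easy to confirm that a change to the per-node ceiling was propagated to all four sizes.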

values-aws-s3.yaml: the same three values (kueue.maxJobResources.memory, kueue.dedicated.resources.batch.memory, app.resources.limits.memory) are kept in sync with the CFN-rendered block.

Sister change on GCP

GCP equivalent applied separately in core/pl (refactor/gcp-split-infra-platforma-modules, commit 55e73ac0): per-pool memory_gi = measured GKE allocatable − GKE DS overhead (~1 GiB) − 1 GiB safety margin. End-state per-job ceiling is 484 GiB on GCP (n2d-highmem-64), 485 GiB on AWS (r7i.16xlarge) — different by 1 GiB because GKE's managed DS footprint is ~1 GiB larger than EKS's.

Test plan

CloudFormation stack update on a small-size cluster — verify ClusterQueue reports BatchMemoryGi=485 after rollout.
Submit a max-size batch job (62 CPU / 485Gi) — verify ProvisioningRequest succeeds and pod schedules.
Verify platforma controller pod runs with limits.memory=32Gi and request still 16Gi.
(Spot check) Submit a job with memory: 486Gi — should be rejected by Kueue with "exceeds maxJobResources".
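The boundary exercised by the last two test-plan items (485Gi admitted, 486Gi rejected) can be modelled with a small comparison of memory quantities. This is a simplified sketch, not Kueue's implementation: it handles only whole numbers with binary suffixes, and the function names are hypothetical.

```python
import re

# Binary suffixes only; a simplified subset of Kubernetes quantity syntax.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a string such as '485Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    # A missing suffix means plain bytes.
    return int(number) * _BINARY_SUFFIXES.get(suffix, 1)


def fits_ceiling(request: str, ceiling: str = "485Gi") -> bool:
    """Return True if the requested memory does not exceed the per-job ceiling."""
    return parse_quantity(request) <= parse_quantity(ceiling)


print(fits_ceiling("485Gi"))  # max-size job from the test plan
print(fits_ceiling("486Gi"))  # spot-check job, one GiB over the ceiling
```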

…argin

Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot schedule on a real node: EKS reserves ~14 GiB for kubelet/OS, leaving allocatable at ~486 GiB. Subtract an additional 1 GiB safety margin for customer-installed DaemonSets (Datadog, Falco, Wiz, etc.) and the effective per-node ceiling is 485 GiB.

Changes:
- CloudFormation Mappings BatchMemoryGi (per-node × max-batch-nodes):
    small 500 -> 485 (1 node)
    medium 1000 -> 970 (2 nodes)
    large 2000 -> 1940 (4 nodes)
    xlarge 4000 -> 3880 (8 nodes)
- kueue.maxJobResources.memory: 500Gi -> 485Gi (both CFN-rendered values and values-aws-s3.yaml)
- dedicated.resources.batch.memory: 500Gi -> 485Gi (values-aws-s3.yaml)
- Parameter description and Mappings header comment updated.

Also bumps platforma controller memory limit from 16Gi to 32Gi (matches the request:limit ratio used on GCP) so the controller has burst headroom under heavy workflow scheduling. Memory request stays at 16Gi.

This combines two related fixes: aligning the per-job ceiling with what EKS can actually schedule, and reserving headroom for customer DaemonSets so the chart's defaults work out of the box even with extra agents.

… GCP (484 GiB)

Reduces the per-node batch memory ceiling from 485 GiB to 484 GiB to subtract ~1 GiB of EKS-managed DaemonSet overhead (aws-node, kube-proxy, ebs-csi) explicitly, matching the GCP per-job ceiling on n2d-highmem-64. Same deployment_size label now means the same workload capacity on both clouds.

- CloudFormation Mappings BatchMemoryGi:
    small 485 -> 484 (1 node)
    medium 970 -> 968 (2 nodes)
    large 1940 -> 1936 (4 nodes)
    xlarge 3880 -> 3872 (8 nodes)
- kueue.maxJobResources.memory: 485Gi -> 484Gi (CFN-rendered + values-aws-s3.yaml)
- kueue.dedicated.resources.batch.memory: 485Gi -> 484Gi (values-aws-s3.yaml)

…aint (singular)

- BuildSpecRevision: 2 -> 3 forces CodeBuild to re-run on stack update, picking up the 484Gi Kueue values from the earlier commit on this branch (without this bump, existing stacks would keep the cached buildspec and never apply the new Kueue ceiling).
- Cluster Autoscaler --set arg: startup-taints -> startup-taint (CA flag is singular; the plural form was silently ignored, leaving GPU nodes without the intended startup taint).

…es, max 25 chars)

Sync from pl main commit 1f25f0cdd (review feedback): tighten the ClusterName parameter to '^[a-z0-9][a-z0-9-]{0,24}$' to match the constraints of derived resource names:
- ECR pull-through cache prefix (quay-${ClusterName}) must stay under AWS's 30-char limit → 25-char ceiling on ClusterName.
- S3 bucket name (platforma-${ClusterName}-...) is S3-naming-rules bound → no underscores, no uppercase.

Regression on this branch: the loose pattern (alphanumeric + underscores + uppercase, 1-100 chars) silently allowed names that then break ECR/S3 downstream.
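To illustrate what the tightened pattern accepts and rejects, here is a minimal sketch that applies the same regular expression in Python. The helper names and sample cluster names are hypothetical; CloudFormation evaluates `AllowedPattern` itself, so this only mirrors the check for quick experimentation.

```python
import re

# Pattern quoted in the commit message for the ClusterName parameter.
CLUSTER_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{0,24}$")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the tightened ClusterName pattern."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


def ecr_cache_prefix(name: str) -> str:
    """Build the derived ECR pull-through cache prefix (quay-${ClusterName})."""
    return f"quay-{name}"


for candidate in ["prod-eu-1", "a" * 25, "a" * 26, "My_Cluster", "-leading-dash"]:
    print(candidate, is_valid_cluster_name(candidate))

# The longest allowed name produces a prefix of 5 + 25 characters.
print(len(ecr_cache_prefix("a" * 25)))
```

The first character class guarantees at least one character and the `{0,24}` quantifier caps the total at 25, which is where the "max 25 chars" in the commit title comes from.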

mike-ainsel added 4 commits May 26, 2026 13:53

mike-ainsel wants to merge 4 commits into main from fix/kueue-batch-memory-safety-margin
