fix(aws): align Kueue batch memory with EKS allocatable + safety margin#75

Open
mike-ainsel wants to merge 4 commits into main from fix/kueue-batch-memory-safety-margin

@mike-ainsel

Member

Summary

  • Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot schedule on a real node — EKS reserves ~14 GiB for kubelet/OS, leaving allocatable at ~486 GiB.
  • Subtracts an additional 1 GiB safety margin for customer-installed DaemonSets (Datadog, Falco, Wiz, etc.) so the chart's defaults work out of the box even with extra agents — per-node ceiling becomes 485 GiB.
  • Also bumps platforma controller app.resources.limits.memory from 16Gi to 32Gi so the controller has burst headroom under heavy workflow scheduling (request stays at 16Gi).

This combines two related fixes into one PR: aligning the per-job ceiling with what EKS can actually schedule, and reserving headroom for customer DaemonSets.
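
The arithmetic behind the 485 GiB figure can be written out as a minimal sketch. The numbers are the approximate ones quoted in this summary; the constant and function names are hypothetical and do not come from the chart or the CloudFormation template.

```python
# Figures quoted in the PR summary (approximate); names here are illustrative only.
NOMINAL_GIB = 500        # nominal capacity quoted for r7i.16xlarge
EKS_RESERVED_GIB = 14    # approximate kubelet/OS reservation on EKS
SAFETY_MARGIN_GIB = 1    # headroom for customer-installed DaemonSets


def per_node_ceiling_gib(nominal: int, reserved: int, margin: int) -> int:
    """Return the largest whole-GiB request that should still fit on one node."""
    allocatable = nominal - reserved   # what the node can actually schedule
    return allocatable - margin        # minus the extra DaemonSet headroom


ceiling = per_node_ceiling_gib(NOMINAL_GIB, EKS_RESERVED_GIB, SAFETY_MARGIN_GIB)
print(f"{ceiling}Gi")
```

With the quoted inputs this prints `485Gi`. Because the reservation is only approximate, the real allocatable value is best read from the node itself rather than derived this way.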

Changes

cloudformation-eks-1-35.yaml:

| Mapping | Old | New |
| --- | --- | --- |
| BatchMemoryGi (small) | 500 | 485 |
| BatchMemoryGi (medium) | 1000 | 970 |
| BatchMemoryGi (large) | 2000 | 1940 |
| BatchMemoryGi (xlarge) | 4000 | 3880 |
| kueue.maxJobResources.memory (rendered) | 500Gi | 485Gi |
| app.resources.limits.memory (rendered) | 16Gi | 32Gi |

Parameter description and the Mappings header comment also updated to mention the safety-margin rationale.

values-aws-s3.yaml: same three values — maxJobResources.memory, dedicated.resources.batch.memory, app.limits.memory — kept in sync with the CFN-rendered block.

Sister change on GCP

GCP equivalent applied separately in core/pl (refactor/gcp-split-infra-platforma-modules, commit 55e73ac0): per-pool memory_gi = measured GKE allocatable − GKE DS overhead (~1 GiB) − 1 GiB safety margin. End-state per-job ceiling is 484 GiB on GCP (n2d-highmem-64), 485 GiB on AWS (r7i.16xlarge) — different by 1 GiB because GKE's managed DS footprint is ~1 GiB larger than EKS's.

Test plan

  • CloudFormation stack update on a small-size cluster — verify ClusterQueue reports BatchMemoryGi=485 after rollout.
  • Submit a max-size batch job (62 CPU / 485Gi) — verify ProvisioningRequest succeeds and pod schedules.
  • Verify platforma controller pod runs with limits.memory=32Gi and request still 16Gi.
  • (Spot check) Submit a job with memory: 486Gi — should be rejected by Kueue with "exceeds maxJobResources".
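
The boundary that the last two job-submission checks exercise can be modelled with a small sketch. This is a simplified stand-in, not Kueue's admission logic: the helper names are hypothetical, only whole-number `Gi` quantities are handled, and the default ceiling is the 485Gi value used in this test plan.

```python
def parse_gi(quantity: str) -> int:
    """Parse a whole-number 'Gi' quantity such as '485Gi' (sketch only)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"only Gi quantities are handled in this sketch: {quantity!r}")
    return int(quantity[:-2])


def fits_max_job_resources(requested: str, ceiling: str = "485Gi") -> bool:
    """Return True when the requested memory does not exceed the per-job ceiling."""
    return parse_gi(requested) <= parse_gi(ceiling)


for request in ("485Gi", "486Gi"):
    verdict = "admitted" if fits_max_job_resources(request) else "exceeds maxJobResources"
    print(request, verdict)
```

A request exactly at the ceiling passes and one 1 GiB above it fails, which is the behaviour the spot check expects from the real cluster.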

…argin

Batch jobs requesting the full r7i.16xlarge nominal capacity (500Gi) cannot
schedule on a real node — EKS reserves ~14 GiB for kubelet/OS, leaving
allocatable at ~486 GiB. Subtract an additional 1 GiB safety margin for
customer-installed DaemonSets (Datadog, Falco, Wiz, etc.) and the effective
per-node ceiling is 485 GiB.

Changes:
- CloudFormation Mappings BatchMemoryGi (per-node × max-batch-nodes):
    small   500 -> 485   (1 node)
    medium  1000 -> 970  (2 nodes)
    large   2000 -> 1940 (4 nodes)
    xlarge  4000 -> 3880 (8 nodes)
- kueue.maxJobResources.memory: 500Gi -> 485Gi (both CFN-rendered values
  and values-aws-s3.yaml)
- dedicated.resources.batch.memory: 500Gi -> 485Gi (values-aws-s3.yaml)
- Parameter description and Mappings header comment updated.

Also bumps platforma controller memory limit from 16Gi -> 32Gi (matches
the request:limit ratio used on GCP) so the controller has burst headroom
under heavy workflow scheduling. Memory request stays at 16Gi.

This combines two related fixes: aligning the per-job ceiling with what
EKS can actually schedule, and reserving headroom for customer DaemonSets
so the chart's defaults work out of the box even with extra agents.
… GCP (484 GiB)

Reduces the per-node batch memory ceiling from 485 GiB to 484 GiB to subtract
~1 GiB of EKS-managed DaemonSet overhead (aws-node, kube-proxy, ebs-csi)
explicitly, matching the GCP per-job ceiling on n2d-highmem-64. Same
deployment_size label now means the same workload capacity on both clouds.

CloudFormation Mappings BatchMemoryGi:
  small   485 -> 484   (1 node)
  medium  970 -> 968   (2 nodes)
  large   1940 -> 1936 (4 nodes)
  xlarge  3880 -> 3872 (8 nodes)

kueue.maxJobResources.memory: 485Gi -> 484Gi (CFN-rendered + values-aws-s3.yaml)
kueue.dedicated.resources.batch.memory: 485Gi -> 484Gi (values-aws-s3.yaml)
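
The BatchMemoryGi values are the per-node ceiling multiplied by the maximum batch node count for each size, so they can be regenerated rather than edited by hand. A minimal sketch, assuming the node counts listed in the commit messages on this branch; the dictionary and function names are hypothetical.

```python
# Max batch nodes per deployment size, as listed in the commit message.
MAX_BATCH_NODES = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}


def batch_memory_gi(per_node_gib: int) -> dict:
    """Return BatchMemoryGi per deployment size for a given per-node ceiling."""
    return {size: per_node_gib * nodes for size, nodes in MAX_BATCH_NODES.items()}


print(batch_memory_gi(484))
```

Passing 485 instead reproduces the values from the first commit (485, 970, 1940, 3880).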
…aint (singular)

- BuildSpecRevision: 2 -> 3 forces CodeBuild to re-run on stack update,
  picking up the 484Gi Kueue values from the earlier commit on this
  branch (without this bump, existing stacks would keep the cached
  buildspec and never apply the new Kueue ceiling).
- Cluster Autoscaler --set arg: startup-taints -> startup-taint (CA
  flag is singular; the plural form was silently ignored, leaving GPU
  nodes without the intended startup taint).
…es, max 25 chars)

Sync from pl main commit 1f25f0cdd (review feedback): tighten the
ClusterName parameter to '^[a-z0-9][a-z0-9-]{0,24}$' to match the
constraints of derived resource names:

- ECR pull-through cache prefix (quay-${ClusterName}) must stay within
  AWS's 30-char limit; the 5-char "quay-" prefix leaves a 25-char
  ceiling on ClusterName.
- S3 bucket name (platforma-${ClusterName}-...) is S3-naming-rules
  bound → no underscores, no uppercase.

Regression on this branch: the loose pattern (alphanumeric +
underscores + uppercase, 1-100 chars) silently allowed names that
then break ECR/S3 downstream.
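
A quick way to sanity-check candidate names against the tightened pattern is a sketch like the following. The regular expression and the `quay-` prefix are taken from this commit message; the helper names are hypothetical, and CloudFormation applies its own AllowedPattern matching, so this only approximates that behaviour.

```python
import re

# Pattern quoted in the commit message for the ClusterName parameter.
CLUSTER_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{0,24}$")


def is_valid_cluster_name(name: str) -> bool:
    """Return True when the name matches the tightened ClusterName pattern."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None


def ecr_prefix(name: str) -> str:
    """Build the ECR pull-through cache prefix derived from the cluster name."""
    return f"quay-{name}"


for candidate in ("prod-cluster-1", "My_Cluster", "a" * 26):
    print(candidate, is_valid_cluster_name(candidate), len(ecr_prefix(candidate)))
```

The longest accepted name is 25 characters, which makes the derived prefix exactly 30 characters long.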