From 3bf9fddf823a2cd971666fec1ddcf260b160ba42 Mon Sep 17 00:00:00 2001
From: Aryan <aryan@gupta-inc.com>
Date: Wed, 17 Jun 2026 19:02:40 -0700
Subject: [PATCH 1/2] fix(ci): bound the multinode pre-run Slurm cleanup drain
 loop
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 'Slurm cleanup (pre-run)' step waits for jobs named after the runner with
NO timeout. On the NVIDIA clusters squeue/scancel can hang (a zombie job that
scancel can't reap, or an unresponsive slurmctld), so the while-condition's
$(squeue ...) blocks and the step wedges for 15-20min+, failing EVERY dsr1
multinode leg (gb300-nv, gb200, b200; CoreWeave gb300-cw is unaffected).
5 wedged sweep runs observed.

Wrap every scancel/squeue in 'timeout 30' so a hung call can't block the loop.
Once a 120s deadline passes, send a force-KILL and proceed instead of looping
forever. The benchmark legs then reach launch (sbatch works on these clusters:
glm5-gb300 succeeds), unblocking measured-power sweeps for everyone.
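
For reviewers who want to try the bounded-drain pattern without a Slurm
cluster, here is a minimal bash sketch. The helper name drain_with_deadline
and its four arguments are hypothetical and not part of the workflow; the
list and force commands stand in for squeue and scancel --signal=KILL.

```shell
# Hypothetical helper (NOT in the workflow). Requires bash for $SECONDS.
# Usage: drain_with_deadline DEADLINE_SECS POLL_SECS LIST_CMD FORCE_CMD
drain_with_deadline() {
  local deadline=$((SECONDS + $1)) poll=$2 list_cmd=$3 force_cmd=$4
  # Cap every poll with timeout(1) so a hung call cannot block the loop.
  while [ -n "$(timeout 30 bash -c "$list_cmd" 2>/dev/null)" ]; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "drain exceeded ${1}s; forcing and proceeding"
      timeout 30 bash -c "$force_cmd" 2>/dev/null || true
      return 0
    fi
    sleep "$poll"
  done
  echo "drained"
}
```

With a list command that always prints something, the helper gives up once
the deadline passes; with one that prints nothing, it returns immediately.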

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../workflows/benchmark-multinode-tmpl.yml    | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml
index 81727ef39..76bc6e8f6 100644
--- a/.github/workflows/benchmark-multinode-tmpl.yml
+++ b/.github/workflows/benchmark-multinode-tmpl.yml
@@ -172,9 +172,22 @@ jobs:
         run: &slurm-cleanup |
           if command -v squeue >/dev/null 2>&1; then
             echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
-            scancel --name="${{ runner.name }}" || true
-            while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
-              squeue --name="${{ runner.name }}"
+            timeout 30 scancel --name="${{ runner.name }}" 2>/dev/null || true
+            # Bound the drain: on the NVIDIA clusters squeue/scancel hang (zombie
+            # job scancel can't reap, or unresponsive slurmctld), wedging this
+            # step for 15-20min+ and failing every dsr1 multinode leg (gb300-nv,
+            # gb200, b200; CoreWeave is unaffected). timeout-wrap every slurm call
+            # so a hang in the while-condition can't block, and force-KILL +
+            # proceed after 120s rather than looping forever.
+            _drain_deadline=$((SECONDS + 120))
+            while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do
+              if [ "$SECONDS" -ge "$_drain_deadline" ]; then
+                echo "[Slurm] drain exceeded 120s; force-cancelling (KILL) and proceeding"
+                timeout 30 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true
+                sleep 5
+                break
+              fi
+              timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true
               sleep 5
             done
           fi

From 00a040f5ad909601abff62756ba76a720ba5a5fe Mon Sep 17 00:00:00 2001
From: Aryan <aryan@gupta-inc.com>
Date: Mon, 22 Jun 2026 15:40:55 -0700
Subject: [PATCH 2/2] fix(ci): raise multinode Slurm cleanup timeouts to 5min
 (scancel epilog headroom)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Review feedback: 30s was too short. scancel triggers the node epilog, which
can be a slow/complex script, so a 30s cap could kill a cleanup that was
still legitimately working. Raise the scancel timeout from 30s to 300s and
the overall drain deadline from 120s to 300s; the force-KILL scancel goes
from 30s to 60s. squeue stays at 30s (a hung squeue should give up fast so
we proceed). A real not-yet-cleared job now gets a full 5min to drain before
the force-KILL.
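
To illustrate why a hung squeue makes the loop exit rather than spin, the
sketch below uses 'sleep 5' as a stand-in for a hung squeue (an assumption
for demonstration only). GNU timeout kills the command and exits with status
124, the command substitution captures nothing, and the -n test is false.

```shell
# 'sleep 5' is a stand-in for a hung squeue; it never prints anything.
out="$(timeout 1 sleep 5 2>/dev/null)"
status=$?   # 124 when GNU timeout had to kill the command
echo "status=$status output='${out}'"
if [ -n "$out" ]; then
  echo "jobs still listed; keep draining"
else
  echo "nothing listed; loop exits"
fi
```

The trade-off is that a timed-out squeue is indistinguishable from an empty
queue, so in that case the step proceeds without confirming the drain.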

Proven live 2026-06-22: gb300-nv_2 answered squeue in 37ms, then the same
runner's cleanup squeue hung for >6min only 14min later, with gb300-nv_0
hanging concurrently. That points to an intermittent cluster-wide
slurmctld/munge/network hang, not a stuck job. Unbounded, the drain loop
froze dsr1 multinode legs for 15-20min+ (observed up to 8h).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../workflows/benchmark-multinode-tmpl.yml    | 27 ++++++++++++-------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml
index 76bc6e8f6..ffc9ad093 100644
--- a/.github/workflows/benchmark-multinode-tmpl.yml
+++ b/.github/workflows/benchmark-multinode-tmpl.yml
@@ -172,18 +172,25 @@ jobs:
         run: &slurm-cleanup |
           if command -v squeue >/dev/null 2>&1; then
             echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
-            timeout 30 scancel --name="${{ runner.name }}" 2>/dev/null || true
-            # Bound the drain: on the NVIDIA clusters squeue/scancel hang (zombie
-            # job scancel can't reap, or unresponsive slurmctld), wedging this
-            # step for 15-20min+ and failing every dsr1 multinode leg (gb300-nv,
-            # gb200, b200; CoreWeave is unaffected). timeout-wrap every slurm call
-            # so a hang in the while-condition can't block, and force-KILL +
-            # proceed after 120s rather than looping forever.
-            _drain_deadline=$((SECONDS + 120))
+            # scancel can legitimately run a while: it triggers the node epilog,
+            # which may be a slow/complex script. Give it 5min before giving up
+            # (30s was too short — it could kill an epilog that was still working).
+            timeout 300 scancel --name="${{ runner.name }}" 2>/dev/null || true
+            # Bound the drain: on the NVIDIA clusters squeue/scancel intermittently
+            # HANG (unresponsive slurmctld / munge / network — NOT a stuck job),
+            # wedging this step for 15-20min+ (observed up to 8h) and failing dsr1
+            # multinode legs on gb300-nv, gb200, b200 (CoreWeave unaffected). Proven
+            # live 2026-06-22: gb300-nv_2 answered squeue in 37ms, then the same
+            # runner's cleanup squeue hung >6min only 14min later; gb300-nv_0 hung
+            # concurrently. So: timeout-wrap every slurm call (a hung squeue returns
+            # empty -> the while-condition is false -> loop exits and we proceed),
+            # and cap the whole drain at 5min with a force-KILL instead of looping
+            # forever. A real not-yet-cleared job still gets the full 5min to drain.
+            _drain_deadline=$((SECONDS + 300))
             while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do
               if [ "$SECONDS" -ge "$_drain_deadline" ]; then
-                echo "[Slurm] drain exceeded 120s; force-cancelling (KILL) and proceeding"
-                timeout 30 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true
+                echo "[Slurm] drain exceeded 5min; force-cancelling (KILL) and proceeding"
+                timeout 60 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true
                 sleep 5
                 break
               fi