From 3bf9fddf823a2cd971666fec1ddcf260b160ba42 Mon Sep 17 00:00:00 2001 From: Aryan Date: Wed, 17 Jun 2026 19:02:40 -0700 Subject: [PATCH 1/2] fix(ci): bound the multinode pre-run Slurm cleanup drain loop MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 'Slurm cleanup (pre-run)' step waits for jobs named after the runner with NO timeout. On the NVIDIA clusters squeue/scancel hang (a zombie scancel can't reap, or unresponsive slurmctld), so the while-condition's $(squeue ...) blocks and the step wedges 15-20min+, failing EVERY dsr1 multinode leg (gb300-nv, gb200, b200; CoreWeave gb300-cw is unaffected — 5 wedged sweep runs observed). Wrap every scancel/squeue in 'timeout 30' so a hung call can't block the loop, and force-KILL + proceed after a 120s deadline instead of looping forever. The benchmark legs then reach launch (sbatch works on these clusters — glm5-gb300 succeeds), unblocking measured-power sweeps for everyone. Co-Authored-By: Claude Opus 4.8 --- .../workflows/benchmark-multinode-tmpl.yml | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml index 81727ef39..76bc6e8f6 100644 --- a/.github/workflows/benchmark-multinode-tmpl.yml +++ b/.github/workflows/benchmark-multinode-tmpl.yml @@ -172,9 +172,22 @@ jobs: run: &slurm-cleanup | if command -v squeue >/dev/null 2>&1; then echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..." - scancel --name="${{ runner.name }}" || true - while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do - squeue --name="${{ runner.name }}" + timeout 30 scancel --name="${{ runner.name }}" 2>/dev/null || true + # Bound the drain: on the NVIDIA clusters squeue/scancel hang (zombie + # job scancel can't reap, or unresponsive slurmctld), wedging this + # step for 15-20min+ and failing every dsr1 multinode leg (gb300-nv, + # gb200, b200; CoreWeave is unaffected). timeout-wrap every slurm call + # so a hang in the while-condition can't block, and force-KILL + + # proceed after 120s rather than looping forever. + _drain_deadline=$((SECONDS + 120)) + while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do + if [ "$SECONDS" -ge "$_drain_deadline" ]; then + echo "[Slurm] drain exceeded 120s; force-cancelling (KILL) and proceeding" + timeout 30 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true + sleep 5 + break + fi + timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true sleep 5 done fi From 00a040f5ad909601abff62756ba76a720ba5a5fe Mon Sep 17 00:00:00 2001 From: Aryan Date: Mon, 22 Jun 2026 15:40:55 -0700 Subject: [PATCH 2/2] fix(ci): raise multinode Slurm cleanup timeouts to 5min (scancel epilog headroom) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review feedback: 30s was too short — scancel triggers the node epilog, which can be a slow/complex script, so a 30s cap could kill a cleanup that was still legitimately working. Raise scancel to 300s and the overall drain deadline to 300s; squeue stays at 30s (a hung squeue should give up fast so we proceed). A real not-yet-cleared job now gets a full 5min to drain before the force-KILL. Proven live 2026-06-22: gb300-nv_2 answered squeue in 37ms, then the same runner's cleanup squeue hung >6min 14min later, with gb300-nv_0 hanging concurrently — an intermittent cluster-wide slurmctld/munge/network hang, not a stuck job. Unbounded, the drain loop froze dsr1 multinode legs 15-20min+ (observed up to 8h). Co-Authored-By: Claude Opus 4.8 --- .../workflows/benchmark-multinode-tmpl.yml | 27 ++++++++++++------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/.github/workflows/benchmark-multinode-tmpl.yml b/.github/workflows/benchmark-multinode-tmpl.yml index 76bc6e8f6..ffc9ad093 100644 --- a/.github/workflows/benchmark-multinode-tmpl.yml +++ b/.github/workflows/benchmark-multinode-tmpl.yml @@ -172,18 +172,25 @@ jobs: run: &slurm-cleanup | if command -v squeue >/dev/null 2>&1; then echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..." - timeout 30 scancel --name="${{ runner.name }}" 2>/dev/null || true - # Bound the drain: on the NVIDIA clusters squeue/scancel hang (zombie - # job scancel can't reap, or unresponsive slurmctld), wedging this - # step for 15-20min+ and failing every dsr1 multinode leg (gb300-nv, - # gb200, b200; CoreWeave is unaffected). timeout-wrap every slurm call - # so a hang in the while-condition can't block, and force-KILL + - # proceed after 120s rather than looping forever. - _drain_deadline=$((SECONDS + 120)) + # scancel can legitimately run a while: it triggers the node epilog, + # which may be a slow/complex script. Give it 5min before giving up + # (30s was too short — it could kill an epilog that was still working). + timeout 300 scancel --name="${{ runner.name }}" 2>/dev/null || true + # Bound the drain: on the NVIDIA clusters squeue/scancel intermittently + # HANG (unresponsive slurmctld / munge / network — NOT a stuck job), + # wedging this step for 15-20min+ (observed up to 8h) and failing dsr1 + # multinode legs on gb300-nv, gb200, b200 (CoreWeave unaffected). Proven + # live 2026-06-22: gb300-nv_2 answered squeue in 37ms, then the same + # runner's cleanup squeue hung >6min only 14min later; gb300-nv_0 hung + # concurrently. So: timeout-wrap every slurm call (a hung squeue returns + # empty -> the while-condition is false -> loop exits and we proceed), + # and cap the whole drain at 5min with a force-KILL instead of looping + # forever. A real not-yet-cleared job still gets the full 5min to drain. + _drain_deadline=$((SECONDS + 300)) while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do if [ "$SECONDS" -ge "$_drain_deadline" ]; then - echo "[Slurm] drain exceeded 120s; force-cancelling (KILL) and proceeding" - timeout 30 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true + echo "[Slurm] drain exceeded 5min; force-cancelling (KILL) and proceeding" + timeout 60 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true sleep 5 break fi