19 changes: 16 additions & 3 deletions .github/workflows/benchmark-multinode-tmpl.yml
@@ -172,9 +172,22 @@ jobs:
 run: &slurm-cleanup |
   if command -v squeue >/dev/null 2>&1; then
     echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} ..."
-    scancel --name="${{ runner.name }}" || true
-    while [ -n "$(squeue --name='${{ runner.name }}' --noheader --format='%i')" ]; do
-      squeue --name="${{ runner.name }}"
+    timeout 30 scancel --name="${{ runner.name }}" 2>/dev/null || true
+    # Bound the drain: on the NVIDIA clusters squeue/scancel hang (zombie
+    # job scancel can't reap, or unresponsive slurmctld), wedging this
+    # step for 15-20min+ and failing every dsr1 multinode leg (gb300-nv,
+    # gb200, b200; CoreWeave is unaffected). timeout-wrap every slurm call
+    # so a hang in the while-condition can't block, and force-KILL +
+    # proceed after 120s rather than looping forever.
+    _drain_deadline=$((SECONDS + 120))
+    while [ -n "$(timeout 30 squeue --name='${{ runner.name }}' --noheader --format='%i' 2>/dev/null)" ]; do
+      if [ "$SECONDS" -ge "$_drain_deadline" ]; then
+        echo "[Slurm] drain exceeded 120s; force-cancelling (KILL) and proceeding"
+        timeout 30 scancel --signal=KILL --name="${{ runner.name }}" 2>/dev/null || true
+        sleep 5
+        break
+      fi
+      timeout 30 squeue --name="${{ runner.name }}" 2>/dev/null || true

Hung squeue skips force-KILL

Medium Severity

The while test cannot tell a timed-out squeue from an empty queue: in both cases the command substitution yields an empty string, because stderr is discarded and the exit status of timeout is never checked. The loop can therefore exit without reaching the _drain_deadline force-KILL block. Even after jobs were seen and the 120-second window has elapsed, a hung squeue still skips scancel --signal=KILL, leaving the named jobs in place and risking a collision with the next sbatch.
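One possible way to close this gap is to classify the queue query by its exit status before deciding whether to leave the loop. The sketch below is only an illustration, not the PR's code: queue_state, drain_jobs, and the QUEUE_TIMEOUT variable are hypothetical names, and it assumes GNU timeout, which exits with 124 when it kills the command.

```shell
#!/usr/bin/env bash
# Sketch only: queue_state and drain_jobs are hypothetical helpers, not part of the PR.

# Run a queue query under a timeout and print one of:
#   empty   - the query succeeded and listed no jobs
#   busy    - the query succeeded and listed at least one job
#   unknown - the query timed out (GNU timeout exits 124) or failed
queue_state() {
  local out rc=0
  # "|| rc=$?" keeps this safe under "set -e", the default for Actions bash steps.
  out="$(timeout "${QUEUE_TIMEOUT:-30}" "$@" 2>/dev/null)" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo unknown
  elif [ -n "$out" ]; then
    echo busy
  else
    echo empty
  fi
}

# Drain jobs named $1 for at most $2 seconds (default 120).
# Only a confirmed-empty queue ends the loop early; a hung or failing
# squeue counts as "unknown" and still reaches the force-KILL branch.
drain_jobs() {
  local name="$1" limit="${2:-120}" deadline state
  deadline=$((SECONDS + limit))
  while :; do
    state="$(queue_state squeue --name="$name" --noheader --format='%i')"
    if [ "$state" = empty ]; then
      return 0
    fi
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "[Slurm] drain exceeded ${limit}s (last state: $state); force-cancelling (KILL)"
      timeout 30 scancel --signal=KILL --name="$name" 2>/dev/null || true
      return 1
    fi
    sleep 5
  done
}
```

In the workflow this might be called as drain_jobs "${{ runner.name }}" || true after the initial scancel. Treating "unknown" as not-yet-drained trades a longer wait (up to the deadline) for the guarantee that the KILL is attempted whenever the queue could not be confirmed empty.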


Reviewed by Cursor Bugbot for commit 00a040f.

       sleep 5
     done
   fi