
Fix #23081 - druntime parallel GC livelock when scan stack stays empty #23082

Open

CyberShadow wants to merge 2 commits into dlang:stable from CyberShadow:fix-evStackFilled-livelock

Conversation

@CyberShadow
Member

Fixes #23081.

Summary

  • markParallel unconditionally sets the manual-reset evStackFilled event before its pull loop. The only two reset paths (pullFromScanStackImpl, scanStackPopLocked) fire after popping the last item from the global scan stack. If a collection's roots are few enough that pointersPerThread == 0, and the live structure is narrow enough that nothing spills from mark()'s 32-entry local stack, then nothing is ever pushed, nothing is ever popped, and evStackFilled is left set over an empty stack. The 47 background scan threads then spin forever in scanBackground: wait() returns immediately, pullFromScanStack is a no-op, evDone is broadcast, repeat, pegging every CPU (sketched after this list).
  • Reset evStackFilled at the end of markParallel, symmetric with the setIfInitialized() above the pull loop. When pullLoop returns, pullFromScanStackImpl has reported busyThreads == 0 with an empty stack; from that point until the reset no worker can push (push only happens from mark(), which is only entered after a successful pop, which requires a non-empty stack), so no concurrent setIfInitialized can race with this reset.
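
For illustration, the shape of the spin (a paraphrase of the behavior described above, using names from this PR; not the literal druntime source):

```d
// Paraphrased shape of scanBackground's loop while the bug is active.
// `terminating` is a hypothetical stand-in for the real shutdown check.
void scanBackground() nothrow
{
    while (!terminating)
    {
        evStackFilled.wait();       // returns immediately: event was left set
        pullFromScanStack();        // no-op: the global scan stack is empty
        evDone.setIfInitialized();  // broadcast "done" and loop again
    }
}
```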

Test plan

compiler/test/runnable/test23081.d measures total process CPU consumption against wall time during a 500ms quiescent sleep after automatic GC collections. With the livelock, the background scan threads each peg a CPU; without it, the ratio is near 0.
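
The core of the measurement looks roughly like this (a sketch assuming a POSIX host; the real test file's structure may differ):

```d
// Sketch of the test's CPU-vs-wall measurement (illustrative, not the
// actual test source). getrusage gives cumulative process CPU time.
import core.sys.posix.sys.resource : getrusage, rusage, RUSAGE_SELF;
import core.thread : Thread;
import core.time : MonoTime, msecs;

double processCpuSeconds()
{
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
}

void main()
{
    // ... allocations that trigger automatic collections go here ...
    immutable cpuBefore  = processCpuSeconds();
    immutable wallBefore = MonoTime.currTime;
    Thread.sleep(500.msecs);   // quiescent period: the process should be idle
    immutable wall  = (MonoTime.currTime - wallBefore).total!"usecs" * 1e-6;
    immutable ratio = (processCpuSeconds() - cpuBefore) / wall;
    assert(ratio < 1.0);       // livelocked scan threads push this into the tens
}
```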

Verified on a 48-core machine:

Both runs use `--DRT-gcopt=parallel:128 minPoolSize:1`:

| Build | CPU/wall | Result |
| --- | --- | --- |
| DMD 2.112.0 (unpatched) | 38.24 | ❌ assert fires |
| Patched druntime | 0.00 | passes |

The test uses parallel:128 (capped to logical_cpus - 1 by startScanThreads) so the regression path is exercised on many-core machines. On few-core CI runners the bug cannot manifest (numScanThreads + 1 fits within toscanRoots.length) and the test passes without exercising the regression path — but it remains a useful canary on big-box runners and developer machines.
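
Concretely, the trigger condition reduces to integer division (a sketch with hypothetical numbers; the identifiers are the druntime ones named above):

```d
// Hypothetical numbers: parallel:128 capped to logical_cpus - 1 on a 48-core
// host gives 47 scan threads; suppose the collection sees 30 root ranges.
enum numScanThreads = 47;
enum rootCount = 30;                                        // toscanRoots.length
enum pointersPerThread = rootCount / (numScanThreads + 1);  // integer division
static assert(pointersPerThread == 0);  // nothing pre-pushed: bug path armed
```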

Druntime unit tests pass.


The diagnosis was produced by Claude Code from a core dump of a hung process; see #23081 for the full analysis.

Fix dlang#23081.

`evStackFilled` is a manual-reset event (initialized by
`startScanThreads` with `manualReset=true`). The only places that
reset it are inside the global scan-stack pop paths
(`pullFromScanStackImpl` and `scanStackPopLocked`), both after the
last item has been popped.

`markParallel` unconditionally sets `evStackFilled` before entering
its pull loop. If the collection has so few roots that
`pointersPerThread == 0` (i.e. `toscanRoots.length < numScanThreads
+ 1`), nothing is pre-pushed onto the global stack; and if the
breadth of live structure stays below `mark`'s `FANOUT_LIMIT == 32`,
nothing spills from the local stack into the global stack either.
In that case no pop ever happens, so `evStackFilled` is never
reset. After `markParallel` returns, the background scan threads
spin in `scanBackground`: `evStackFilled.wait()` returns
immediately, `pullFromScanStack` is a no-op, `evDone` is
broadcast, repeat -- pegging every CPU.

Reset `evStackFilled` symmetrically with the `setIfInitialized()`
above the pull loop, so that after `markParallel` returns the
event reflects the actual state (no work pending). When `pullLoop`
returns, `pullFromScanStackImpl` has reported `busyThreads == 0`
with an empty stack; from that point until this reset no worker
can push (push only happens from `mark()`, which is only entered
after a successful pop, which requires a non-empty stack), so no
concurrent `setIfInitialized` can race with this reset.
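
In code, the change amounts to something like the following (a paraphrase of markParallel's tail using names from the commit message; the exact reset call is an assumption, not the literal diff):

```d
// Paraphrased shape of markParallel with the fix (illustrative only).
void markParallel() nothrow
{
    // ... pre-push pointersPerThread roots per thread, if any ...
    evStackFilled.setIfInitialized();  // wake the background scan threads
    pullLoop();                        // returns once busyThreads == 0 and the
                                       // global scan stack is empty
    // The fix: clear the manual-reset event so idle workers block in
    // scanBackground instead of spinning. No worker can push between
    // pullLoop's return and this point, so the reset cannot be raced.
    evStackFilled.reset();             // hypothetical name for the reset API
}
```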
Comment on lines +16 to +20
// We use parallel:128 (capped to logical_cpus - 1 by startScanThreads) to push
// numScanThreads as high as the host allows. The bug only triggers when
// numScanThreads + 1 exceeds toscanRoots.length, so on few-core CI runners this
// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
Member Author


⚠️ Are we OK with adding a test that only catches the bug on certain hardware configurations?

Member


I'd be fine with a test that would fail eventually without the fix, but it should not falsely fail with the fix.

Member Author


Yeah. This measures CPU utilization so I guess there's also a risk of a spurious false positive.

@CyberShadow
Member Author

CC @schveiguy @rainers

Member

@rainers left a comment


> Druntime unit tests pass.

It is failing on Windows as it uses POSIX functions. You could add `DISABLED: win32 win64` at the top.

I think the test rather belongs to the druntime test suite, though.

I didn't read the full analysis in the bug report (yet), but the presented situation in the test comment is already convincing. The fix also seems reasonable and should not have bad side effects.

@CyberShadow force-pushed the fix-evStackFilled-livelock branch 2 times, most recently from a353cd3 to ee32d9b on May 6, 2026 at 20:54
…ng#23081)

Measures total process CPU consumption against wall time during a
500ms quiescent sleep after automatic GC collections. With the
livelock, the background scan threads each peg a CPU, pushing the
ratio into the tens. Healthy is near 0.

Uses --DRT-gcopt=parallel:128 (capped to logical_cpus - 1 by
startScanThreads) so the test is effective on many-core machines;
on few-core CI runners the bug cannot manifest (numScanThreads + 1
fits within toscanRoots.length) and the test passes without
exercising the regression path.

Posix-only (uses getrusage), so listed in the Posix-only TESTS
group of druntime/test/gc/Makefile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CyberShadow force-pushed the fix-evStackFilled-livelock branch from ee32d9b to 20012c4 on May 6, 2026 at 20:57
@CyberShadow
Member Author

> It is failing on Windows as it uses POSIX functions. You could add `DISABLED: win32 win64` at the top.
>
> I think the test rather belongs to the druntime test suite, though.

Fixed. Or we can drop it.

// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
//
// NOTE: GC.collect() calls fullcollect(isFinal=true) which disables parallel
Member


This seems to have crept in with https://github.com/dlang/dmd/pull/16401/changes#diff-cb1050c2f89057445a525aeaabe2c1864a24ddcd5f8caa9de24074dbab81aca3L3019 where the call from GC.collect was not adapted. Should be fixed by another PR.

// allocator triggers an automatic collection (isFinal=false path) rather
// than growing the heap. Keep each object tiny so the live set is small.
foreach (i; 0 .. 8192)
    new int[32]; // 8192 * 128 bytes = 1MB of small allocations
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check whether a collection happened at all via GC.stats?
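
One way to implement that suggestion (a sketch, not part of the PR; it uses `GC.profileStats.numCollections` from `core.memory`, which counts collections, since `GC.stats` reports only sizes):

```d
// Sketch of the suggested guard: verify an automatic collection actually ran.
import core.memory : GC;

void allocateAndVerifyCollection()
{
    immutable before = GC.profileStats().numCollections;
    foreach (i; 0 .. 8192)
        new int[32]; // same tiny allocations as the test
    assert(GC.profileStats().numCollections > before,
           "no automatic collection ran; the test did not exercise the GC");
}
```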
