Fix #23081 - druntime parallel GC livelock when scan stack stays empty #23082

CyberShadow wants to merge 2 commits into dlang:stable
Conversation
Fix dlang#23081.

`evStackFilled` is a manual-reset event (initialized by `startScanThreads` with `manualReset=true`). The only places that reset it are inside the global scan-stack pop paths (`pullFromScanStackImpl` and `scanStackPopLocked`), both after the last item has been popped. `markParallel` unconditionally sets `evStackFilled` before entering its pull loop.

If the collection has so few roots that `pointersPerThread == 0` (i.e. `toscanRoots.length < numScanThreads + 1`), nothing is pre-pushed onto the global stack; and if the breadth of live structure stays below `mark`'s `FANOUT_LIMIT == 32`, nothing spills from the local stack into the global stack either. In that case no pop ever happens, so `evStackFilled` is never reset. After `markParallel` returns, the background scan threads spin in `scanBackground`: `evStackFilled.wait()` returns immediately, `pullFromScanStack` is a no-op, `evDone` is broadcast, repeat -- pegging every CPU.

Reset `evStackFilled` symmetrically with the `setIfInitialized()` above the pull loop, so that after `markParallel` returns the event reflects the actual state (no work pending). When `pullLoop` returns, `pullFromScanStackImpl` has reported `busyThreads == 0` with an empty stack; from that point until this reset no worker can push (push only happens from `mark()`, which is only entered after a successful pop, which requires a non-empty stack), so no concurrent `setIfInitialized` can race with this reset.
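To make the failure mode concrete, here is a minimal self-contained model of the pattern described above -- not the druntime code; apart from `evStackFilled`, the names are illustrative. A manual-reset `core.sync.event.Event` left set over an empty work queue makes the worker's `wait()` a no-op, so its loop spins:

```d
import core.atomic;
import core.sync.event : Event;
import core.thread : Thread;
import core.time : msecs;

__gshared Event evStackFilled;  // manual-reset, as in startScanThreads
shared bool stop;
shared int idleSpins;

void scanBackgroundModel()
{
    while (!atomicLoad(stop))
    {
        evStackFilled.wait();   // returns immediately while the event stays set
        // pullFromScanStack would be a no-op here: the queue is empty, and
        // only a successful pop path ever calls reset() on the event.
        atomicOp!"+="(idleSpins, 1);
    }
}

void main()
{
    import std.stdio : writeln;

    evStackFilled.initialize(true, false);  // manualReset=true, initially unset
    auto worker = new Thread(&scanBackgroundModel);
    worker.start();

    evStackFilled.set();        // markParallel sets it unconditionally...
    Thread.sleep(100.msecs);    // ...and with no push/pop, nobody resets it
    writeln("idle spins while event left set: ", atomicLoad(idleSpins)); // large

    evStackFilled.reset();      // the fix: reset once the pull loop has drained
    Thread.sleep(100.msecs);
    writeln("idle spins after reset: ", atomicLoad(idleSpins)); // stops growing

    atomicStore(stop, true);
    evStackFilled.set();        // release the worker so it can observe `stop`
    worker.join();
}
```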
// We use parallel:128 (capped to logical_cpus - 1 by startScanThreads) to push
// numScanThreads as high as the host allows. The bug only triggers when
// numScanThreads + 1 exceeds toscanRoots.length, so on few-core CI runners this
// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
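For the threshold in this comment, a small sketch of the implied arithmetic (assuming `pointersPerThread` is the integer division `toscanRoots.length / (numScanThreads + 1)`, which is what "`pointersPerThread == 0` when `numScanThreads + 1` exceeds `toscanRoots.length`" suggests):

```d
// Illustrative only: chunk size used to pre-distribute roots to scan threads.
size_t pointersPerThread(size_t rootCount, size_t numScanThreads)
{
    return rootCount / (numScanThreads + 1);
}

unittest
{
    // 47 scan threads (e.g. parallel:128 capped on a 48-logical-CPU host):
    assert(pointersPerThread(40, 47) == 0); // < 48 roots: nothing pre-pushed
    assert(pointersPerThread(96, 47) == 2); // enough roots: work gets pushed
}
```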
I'd be fine with a test that would fail eventually without the fix, but it should not falsely fail with the fix.
Yeah. This measures CPU utilization so I guess there's also a risk of a spurious false positive.
rainers left a comment:
> Druntime unit tests pass.
It is failing on Windows as it uses POSIX functions. You could add `DISABLED: win32 win64` at the top (see the sketch after this comment).
I think the test rather belongs to the druntime test suite, though.
I didn't read the full analysis in the bug report (yet), but the presented situation in the test comment is already convincing. The fix also seems reasonable and should not have bad side effects.
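A sketch of the suggested header, following the dmd test-suite convention of `KEY: value` directives in the leading comment of the test file (the exact comment placement is an assumption):

```d
/*
DISABLED: win32 win64
*/
// test23081.d is Posix-only: the CPU-time measurement uses getrusage.
```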
CyberShadow force-pushed from a353cd3 to ee32d9b:
…ng#23081)

Measures total process CPU consumption against wall time during a 500ms quiescent sleep after automatic GC collections. With the livelock, the background scan threads each peg a CPU, pushing the ratio into the tens. Healthy is near 0.

Uses --DRT-gcopt=parallel:128 (capped to logical_cpus - 1 by startScanThreads) so the test is effective on many-core machines; on few-core CI runners the bug cannot manifest (numScanThreads + 1 fits within toscanRoots.length) and the test passes without exercising the regression path.

Posix-only (uses getrusage), so listed in the Posix-only TESTS group of druntime/test/gc/Makefile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
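A sketch of that measurement approach, assuming only POSIX `getrusage`; the structure and the threshold are illustrative, not the test's literal code:

```d
import core.sys.posix.sys.resource : getrusage, rusage, RUSAGE_SELF;
import core.thread : Thread;
import core.time : msecs;

// Total process CPU time (user + system, across all threads), in seconds.
double processCpuSeconds()
{
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

void main()
{
    // ... allocate enough to trigger automatic collections first ...
    immutable cpuBefore = processCpuSeconds();
    Thread.sleep(500.msecs);            // quiescent: no GC work should remain
    immutable ratio = (processCpuSeconds() - cpuBefore) / 0.5;
    // Healthy: near 0. Livelocked: roughly numScanThreads (tens on a big box).
    assert(ratio < 2.0, "background scan threads appear to be spinning");
}
```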
CyberShadow force-pushed from ee32d9b to 20012c4.
Fixed. Or we can drop it.
// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
//
// NOTE: GC.collect() calls fullcollect(isFinal=true) which disables parallel
This seems to have crept in with https://github.com/dlang/dmd/pull/16401/changes#diff-cb1050c2f89057445a525aeaabe2c1864a24ddcd5f8caa9de24074dbab81aca3L3019 where the call from GC.collect was not adapted. Should be fixed by another PR.
// allocator triggers an automatic collection (isFinal=false path) rather
// than growing the heap. Keep each object tiny so the live set is small.
foreach (i; 0 .. 8192)
    new int[32]; // 8192 * 128 bytes = 1MB of small allocations
Maybe check whether a collection happened at all via GC.stats?
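One possible way to implement that check. `GC.stats` itself only reports heap sizes, so this sketch uses `GC.profileStats` from `core.memory`, whose `numCollections` field counts completed collections:

```d
import core.memory : GC;

void main()
{
    immutable collectionsBefore = GC.profileStats().numCollections;

    foreach (i; 0 .. 8192)
        new int[32];    // small allocations meant to trigger automatic GC

    assert(GC.profileStats().numCollections > collectionsBefore,
           "no automatic collection ran; the test never hit the bug path");
}
```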
Fixes #23081.

Summary

`markParallel` unconditionally sets the manual-reset `evStackFilled` event before its pull loop. The two reset paths (`pullFromScanStackImpl`, `scanStackPopLocked`) only fire after popping the last item from the global scan stack. If a collection has few enough roots (`pointersPerThread == 0`) and a narrow enough live structure (no spill from `mark()`'s 32-entry local stack), nothing is ever pushed, nothing is ever popped, and `evStackFilled` is left set with an empty stack. The 47 background scan threads then spin forever in `scanBackground`: `wait()` returns immediately, `pullFromScanStack` is a no-op, `evDone` is broadcast, repeat -- pegging every CPU.

The fix: reset `evStackFilled` at the end of `markParallel`, symmetric with the `setIfInitialized()` above the pull loop. When `pullLoop` returns, `pullFromScanStackImpl` has reported `busyThreads == 0` with an empty stack; from that point until the reset no worker can push (push only happens from `mark()`, which is only entered after a successful pop, which requires a non-empty stack), so no concurrent `setIfInitialized` can race with this reset.

Test plan
`compiler/test/runnable/test23081.d` measures total process CPU consumption against wall time during a 500ms quiescent sleep after automatic GC collections. With the livelock, the background scan threads each peg a CPU; without it, the ratio is near 0.

Verified on a 48-core machine:

    --DRT-gcopt=parallel:128 minPoolSize:1

The test uses `parallel:128` (capped to `logical_cpus - 1` by `startScanThreads`) so the regression path is exercised on many-core machines. On few-core CI runners the bug cannot manifest (`numScanThreads + 1` fits within `toscanRoots.length`) and the test passes without exercising the regression path -- but it remains a useful canary on big-box runners and developer machines.

Druntime unit tests pass.
The diagnosis was produced by Claude Code from a core dump of a hung process; see #23081 for the full analysis.