Fix #23081 - druntime parallel GC livelock when scan stack stays empty #23082

CyberShadow wants to merge 2 commits into dlang:stable
Conversation
Fix dlang#23081.

`evStackFilled` is a manual-reset event (initialized by `startScanThreads` with `manualReset=true`). The only places that reset it are inside the global scan-stack pop paths (`pullFromScanStackImpl` and `scanStackPopLocked`), both after the last item has been popped. `markParallel` unconditionally sets `evStackFilled` before entering its pull loop.

If the collection has so few roots that `pointersPerThread == 0` (i.e. `toscanRoots.length < numScanThreads + 1`), nothing is pre-pushed onto the global stack; and if the breadth of live structure stays below `mark`'s `FANOUT_LIMIT == 32`, nothing spills from the local stack into the global stack either. In that case no pop ever happens, so `evStackFilled` is never reset. After `markParallel` returns, the background scan threads spin in `scanBackground`: `evStackFilled.wait()` returns immediately, `pullFromScanStack` is a no-op, `evDone` is broadcast, repeat -- pegging every CPU.

Reset `evStackFilled` symmetrically with the `setIfInitialized()` above the pull loop, so that after `markParallel` returns the event reflects the actual state (no work pending). When `pullLoop` returns, `pullFromScanStackImpl` has reported `busyThreads == 0` with an empty stack; from that point until this reset no worker can push (push only happens from `mark()`, which is only entered after a successful pop, which requires a non-empty stack), so no concurrent `setIfInitialized` can race with this reset.
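To make the failure mode concrete, here is a minimal self-contained model of the pattern described above -- not the druntime code; apart from `evStackFilled`, the names are illustrative. A manual-reset `core.sync.event.Event` left set over an empty work queue makes the worker's `wait()` a no-op, so its loop spins:

```d
import core.atomic;
import core.sync.event : Event;
import core.thread : Thread;
import core.time : msecs;

__gshared Event evStackFilled;  // manual-reset, as in startScanThreads
shared bool stop;
shared int idleSpins;

void scanBackgroundModel()
{
    while (!atomicLoad(stop))
    {
        evStackFilled.wait();   // returns immediately while the event stays set
        // pullFromScanStack would be a no-op here: the queue is empty, and
        // only a successful pop path ever calls reset() on the event.
        atomicOp!"+="(idleSpins, 1);
    }
}

void main()
{
    import std.stdio : writeln;

    evStackFilled.initialize(true, false);  // manualReset=true, initially unset
    auto worker = new Thread(&scanBackgroundModel);
    worker.start();

    evStackFilled.set();        // markParallel sets it unconditionally...
    Thread.sleep(100.msecs);    // ...and with no push/pop, nobody resets it
    writeln("idle spins while event left set: ", atomicLoad(idleSpins)); // large

    evStackFilled.reset();      // the fix: reset once the pull loop has drained
    Thread.sleep(100.msecs);
    writeln("idle spins after reset: ", atomicLoad(idleSpins)); // stops growing

    atomicStore(stop, true);
    evStackFilled.set();        // release the worker so it can observe `stop`
    worker.join();
}
```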
// We use parallel:128 (capped to logical_cpus - 1 by startScanThreads) to push
// numScanThreads as high as the host allows. The bug only triggers when
// numScanThreads + 1 exceeds toscanRoots.length, so on few-core CI runners this
// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
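For the threshold in this comment, a small sketch of the implied arithmetic (assuming `pointersPerThread` is the integer division `toscanRoots.length / (numScanThreads + 1)`, which is what "`pointersPerThread == 0` when `numScanThreads + 1` exceeds `toscanRoots.length`" suggests):

```d
// Illustrative only: chunk size used to pre-distribute roots to scan threads.
size_t pointersPerThread(size_t rootCount, size_t numScanThreads)
{
    return rootCount / (numScanThreads + 1);
}

unittest
{
    // 47 scan threads (e.g. parallel:128 capped on a 48-logical-CPU host):
    assert(pointersPerThread(40, 47) == 0); // < 48 roots: nothing pre-pushed
    assert(pointersPerThread(96, 47) == 2); // enough roots: work gets pushed
}
```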
I'd be fine with a test that would fail eventually without the fix, but it should not falsely fail with the fix.
Yeah. This measures CPU utilization so I guess there's also a risk of a spurious false positive.
rainers left a comment:
> Druntime unit tests pass.
It is failing on Windows as it uses POSIX functions. You could add `DISABLED: win32 win64` at the top (see the sketch after this comment).
I think the test rather belongs to the druntime test suite, though.
I didn't read the full analysis in the bug report (yet), but the presented situation in the test comment is already convincing. The fix also seems reasonable and should not have bad side effects.
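A sketch of the suggested header, following the dmd test-suite convention of `KEY: value` directives in the leading comment of the test file (the exact comment placement is an assumption):

```d
/*
DISABLED: win32 win64
*/
// test23081.d is Posix-only: the CPU-time measurement uses getrusage.
```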
CyberShadow force-pushed from a353cd3 to ee32d9b:
…ng#23081)

Measures total process CPU consumption against wall time during a 500ms quiescent sleep after automatic GC collections. With the livelock, the background scan threads each peg a CPU, pushing the ratio into the tens. Healthy is near 0.

Uses --DRT-gcopt=parallel:128 (capped to logical_cpus - 1 by startScanThreads) so the test is effective on many-core machines; on few-core CI runners the bug cannot manifest (numScanThreads + 1 fits within toscanRoots.length) and the test passes without exercising the regression path.

Posix-only (uses getrusage), so listed in the Posix-only TESTS group of druntime/test/gc/Makefile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
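A sketch of that measurement approach, assuming only POSIX `getrusage`; the structure and the threshold are illustrative, not the test's literal code:

```d
import core.sys.posix.sys.resource : getrusage, rusage, RUSAGE_SELF;
import core.thread : Thread;
import core.time : msecs;

// Total process CPU time (user + system, across all threads), in seconds.
double processCpuSeconds()
{
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

void main()
{
    // ... allocate enough to trigger automatic collections first ...
    immutable cpuBefore = processCpuSeconds();
    Thread.sleep(500.msecs);            // quiescent: no GC work should remain
    immutable ratio = (processCpuSeconds() - cpuBefore) / 0.5;
    // Healthy: near 0. Livelocked: roughly numScanThreads (tens on a big box).
    assert(ratio < 2.0, "background scan threads appear to be spinning");
}
```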
CyberShadow force-pushed from ee32d9b to 20012c4.
Fixed. Or we can drop it.
// test passes without exercising the regression path; on many-core machines
// (>= ~20 cores) it reliably triggers without the fix.
//
// NOTE: GC.collect() calls fullcollect(isFinal=true) which disables parallel
This seems to have crept in with https://github.com/dlang/dmd/pull/16401/changes#diff-cb1050c2f89057445a525aeaabe2c1864a24ddcd5f8caa9de24074dbab81aca3L3019 where the call from GC.collect was not adapted. Should be fixed by another PR.
// allocator triggers an automatic collection (isFinal=false path) rather
// than growing the heap. Keep each object tiny so the live set is small.
foreach (i; 0 .. 8192)
    new int[32]; // 8192 * 128 bytes = 1MB of small allocations
Maybe check whether a collection happened at all via GC.stats?
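One possible way to implement that check. `GC.stats` itself only reports heap sizes, so this sketch uses `GC.profileStats` from `core.memory`, whose `numCollections` field counts completed collections:

```d
import core.memory : GC;

void main()
{
    immutable collectionsBefore = GC.profileStats().numCollections;

    foreach (i; 0 .. 8192)
        new int[32];    // small allocations meant to trigger automatic GC

    assert(GC.profileStats().numCollections > collectionsBefore,
           "no automatic collection ran; the test never hit the bug path");
}
```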
Fixes #23081.

Summary

`markParallel` unconditionally sets the manual-reset `evStackFilled` event before its pull loop. The two reset paths (`pullFromScanStackImpl`, `scanStackPopLocked`) only fire after popping the last item from the global scan stack. If a collection has few enough roots (`pointersPerThread == 0`) and a narrow enough live structure (no spill from `mark()`'s 32-entry local stack), nothing is ever pushed, nothing is ever popped, and `evStackFilled` is left set with an empty stack. The 47 background scan threads then spin forever in `scanBackground`: `wait()` returns immediately, `pullFromScanStack` is a no-op, `evDone` is broadcast, repeat -- pegging every CPU.

The fix: reset `evStackFilled` at the end of `markParallel`, symmetric with the `setIfInitialized()` above the pull loop. When `pullLoop` returns, `pullFromScanStackImpl` has reported `busyThreads == 0` with an empty stack; from that point until the reset no worker can push (push only happens from `mark()`, which is only entered after a successful pop, which requires a non-empty stack), so no concurrent `setIfInitialized` can race with this reset.

Test plan
`compiler/test/runnable/test23081.d` measures total process CPU consumption against wall time during a 500ms quiescent sleep after automatic GC collections. With the livelock, the background scan threads each peg a CPU; without it, the ratio is near 0.

Verified on a 48-core machine:

    --DRT-gcopt=parallel:128 minPoolSize:1

The test uses `parallel:128` (capped to `logical_cpus - 1` by `startScanThreads`) so the regression path is exercised on many-core machines. On few-core CI runners the bug cannot manifest (`numScanThreads + 1` fits within `toscanRoots.length`) and the test passes without exercising the regression path -- but it remains a useful canary on big-box runners and developer machines.

Druntime unit tests pass.
The diagnosis was produced by Claude Code from a core dump of a hung process; see #23081 for the full analysis.