fix: ThreadSafeDeque race condition in kill_all()/notify_all()#298
Conversation
Fix a race condition in ThreadSafeDeque where kill_all() and notify_all()
modified _is_waitable and signaled the condition variable without holding
the mutex. This caused a deadlock when the consumer thread (read_row)
checked is_waitable()==true but the worker thread called kill_all() before
the consumer entered wait(), causing the notify signal to be lost.
Timeline of the race:
1. Consumer: records->empty() == true
2. Consumer: records->is_waitable() == true (lock-free atomic read)
3. Worker: kill_all() sets _is_waitable=false, calls _cond.notify_all()
4. Consumer: wait() acquires lock, enters _cond.wait() - DEADLOCK
(notification was already sent in step 3, predicate !is_waitable()
was already true but checked AFTER the signal was lost)
Fix: Acquire the mutex in kill_all() and notify_all() before modifying
_is_waitable and signaling. This ensures the condition variable
notification cannot be lost between the is_waitable() check and wait().
The bug is most easily triggered with small CSV files where the worker
thread completes almost instantly, and is particularly reproducible in
no_header() mode on Windows.
Also adds a regression test (test_threadsafe_deque_race.cpp) that
exercises the race window with rapid small-CSV iterations.
Move C++ standard detection and string_view includes before csv namespace to prevent <functional> from being included inside csv namespace. This prevents namespace collision where std:: gets resolved as csv::std::
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #298 +/- ##
==========================================
+ Coverage 88.89% 88.91% +0.01%
==========================================
Files 23 23
Lines 1513 1515 +2
Branches 617 617
==========================================
+ Hits 1345 1347 +2
Misses 66 66
Partials 102 102 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thank you for your detailed PR and analysis. In the last version, I tried to optimize the I made a minor change to wrap your tests in a timeout helper since, if the parser regresses (hopefully not), they would cause the CI to hang. Overall, this is a great PR. |
There was a problem hiding this comment.
Pull request overview
Fixes a lost-wakeup deadlock in ThreadSafeDeque by ensuring kill_all()/notify_all() update _is_waitable and signal the condition variable while holding the deque mutex, and adds regression coverage to prevent hangs from resurfacing.
Changes:
- Serialize
_is_waitabletransitions +condition_variablenotifications inThreadSafeDeque::notify_all()andkill_all()by taking_lock. - Add a deadlock-regression/stress test plus a shared timeout helper to fail fast instead of hanging CI.
- Add internal synchronization design documentation and update test utility documentation/CMake wiring.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
include/internal/thread_safe_deque.hpp |
Locks around _is_waitable state transitions and notify_all() to prevent lost wakeups. |
single_include/csv.hpp |
Mirrors the ThreadSafeDeque locking change in the generated single-include header. |
single_include_test/csv.hpp |
Mirrors the ThreadSafeDeque locking change for single-include testing. |
tests/test_threadsafe_deque_race.cpp |
Adds regression/stress scenarios intended to reproduce the deadlock window. |
tests/shared/timeout_helper.hpp |
Adds a timeout wrapper to avoid indefinite hangs in deadlock-regression tests. |
tests/CMakeLists.txt |
Adds the new race-regression test file to the test target sources. |
include/internal/THREADSAFE_DEQUE_DESIGN.md |
Documents the producer/consumer protocol and the rationale for the fix. |
tests/CLAUDE.md |
Documents the new shared timeout helper and testing conventions. |
tests/shared/CLAUDE.md |
Adds guidance for maintaining shared test utilities documentation. |
include/internal/common.hpp |
Moves C++ standard detection and string_view inclusion outside the csv namespace. |
.claude/rules/testing-conventions.md |
Removes the old AI/testing conventions doc (migrated into tests/CLAUDE.md). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto future = std::async(std::launch::async, fn); | ||
|
|
||
| auto status = future.wait_for(timeout); | ||
|
|
||
| REQUIRE(status == std::future_status::ready); |
There was a problem hiding this comment.
test_with_timeout() uses std::async(std::launch::async, ...) and then REQUIRE(status == std::future_status::ready). If the async task deadlocks and the REQUIRE fails, stack unwinding will destroy future; for futures created by std::async this destructor can block waiting for the task to finish, which defeats the whole purpose of the timeout and can still hang CI. Consider reworking the helper so the timeout path does not wait for the worker (e.g., run the callable on a detached std::thread with a std::promise/std::future to signal completion, and on timeout forcefully end the test process via std::quick_exit/std::_Exit, or otherwise ensure the background task cannot block destruction).
| * @param fn Test function to execute | ||
| * @param timeout Maximum time to wait before failing (default: 10 seconds) | ||
| * | ||
| * @throws std::runtime_error if timeout occurs |
There was a problem hiding this comment.
The doc comment says @throws std::runtime_error if timeout occurs, but the implementation uses a Catch2 REQUIRE(...) assertion instead of throwing (and never uses <stdexcept>). Either update the documentation to match the actual failure mechanism, or change the implementation to throw on timeout and let the caller assert/fail appropriately.
| * @throws std::runtime_error if timeout occurs | |
| * @throws Catch2 assertion failure (test failure) if timeout occurs |
| @@ -0,0 +1,109 @@ | |||
| // | |||
There was a problem hiding this comment.
The file appears to start with a UTF-8 BOM character before // (shown as an invisible character in the diff). BOMs can cause tooling/formatting issues and occasionally break compilation on stricter toolchains; please remove the BOM so the source starts with a plain //.
| // | |
| // |
| ### Test Organization | ||
|
|
There was a problem hiding this comment.
### Test Organization is an empty section header (it immediately jumps to ### Testing Conventions). Either add content under "Test Organization" or remove the empty header to avoid confusing documentation structure.
| ### Test Organization |
| // Detect C++ standard version BEFORE namespace to properly include string_view | ||
| #if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 20) || __cplusplus >= 202002L | ||
| #define CSV_HAS_CXX20 | ||
| #endif | ||
|
|
||
| #if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 17) || __cplusplus >= 201703L | ||
| #define CSV_HAS_CXX17 | ||
| #endif | ||
|
|
||
| #if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD >= 14) || __cplusplus >= 201402L | ||
| #define CSV_HAS_CXX14 | ||
| #endif | ||
|
|
||
| // Include string_view BEFORE csv namespace to avoid namespace pollution issues | ||
| #ifdef CSV_HAS_CXX17 | ||
| #include <string_view> | ||
| #else | ||
| #include "../external/string_view.hpp" | ||
| #endif | ||
|
|
There was a problem hiding this comment.
This PR description focuses on the ThreadSafeDeque race fix and regression tests, but it also changes include/internal/common.hpp to move C++-standard detection macros and the string_view include outside the csv namespace. If this change is intentional (it can affect downstream include behavior), please either mention it explicitly in the PR description or split it into a separate PR so the concurrency fix can be reviewed/merged independently.
Fix a race condition in ThreadSafeDeque where kill_all() and notify_all() modified _is_waitable and signaled the condition variable without holding the mutex. This caused a deadlock when the consumer thread (read_row) checked is_waitable()==true but the worker thread called kill_all() before the consumer entered wait(), causing the notify signal to be lost.
Timeline of the race:
(notification was already sent in step 3, predicate !is_waitable()
was already true but checked AFTER the signal was lost)
Fix: Acquire the mutex in kill_all() and notify_all() before modifying _is_waitable and signaling. This ensures the condition variable notification cannot be lost between the is_waitable() check and wait().
The bug is most easily triggered with small CSV files where the worker thread completes almost instantly, and is particularly reproducible in no_header() mode on Windows.
Also adds a regression test (test_threadsafe_deque_race.cpp) that exercises the race window with rapid small-CSV iterations.