Skip to content

fix: ThreadSafeDeque race condition in kill_all()/notify_all()#298

Merged
vincentlaucsb merged 3 commits intovincentlaucsb:masterfrom
YamingPei:fix/threadsafe-deque-race-condition
Mar 28, 2026
Merged

fix: ThreadSafeDeque race condition in kill_all()/notify_all()#298
vincentlaucsb merged 3 commits intovincentlaucsb:masterfrom
YamingPei:fix/threadsafe-deque-race-condition

Conversation

@YamingPei
Copy link
Copy Markdown
Contributor

Fix a race condition in ThreadSafeDeque where kill_all() and notify_all() modified _is_waitable and signaled the condition variable without holding the mutex. This caused a deadlock when the consumer thread (read_row) checked is_waitable()==true but the worker thread called kill_all() before the consumer entered wait(), causing the notify signal to be lost.

Timeline of the race:

  1. Consumer: records->empty() == true
  2. Consumer: records->is_waitable() == true (lock-free atomic read)
  3. Worker: kill_all() sets _is_waitable=false, calls _cond.notify_all()
  4. Consumer: wait() acquires lock, enters _cond.wait() - DEADLOCK
    (notification was already sent in step 3, predicate !is_waitable()
    was already true but checked AFTER the signal was lost)

Fix: Acquire the mutex in kill_all() and notify_all() before modifying _is_waitable and signaling. This ensures the condition variable notification cannot be lost between the is_waitable() check and wait().

The bug is most easily triggered with small CSV files where the worker thread completes almost instantly, and is particularly reproducible in no_header() mode on Windows.

Also adds a regression test (test_threadsafe_deque_race.cpp) that exercises the race window with rapid small-CSV iterations.

Fix a race condition in ThreadSafeDeque where kill_all() and notify_all()
modified _is_waitable and signaled the condition variable without holding
the mutex. This caused a deadlock when the consumer thread (read_row)
checked is_waitable()==true but the worker thread called kill_all() before
the consumer entered wait(), causing the notify signal to be lost.

Timeline of the race:
  1. Consumer: records->empty() == true
  2. Consumer: records->is_waitable() == true  (lock-free atomic read)
  3. Worker:   kill_all() sets _is_waitable=false, calls _cond.notify_all()
  4. Consumer: wait() acquires lock, enters _cond.wait() - DEADLOCK
     (notification was already sent in step 3, predicate !is_waitable()
      was already true but checked AFTER the signal was lost)

Fix: Acquire the mutex in kill_all() and notify_all() before modifying
_is_waitable and signaling. This ensures the condition variable
notification cannot be lost between the is_waitable() check and wait().

The bug is most easily triggered with small CSV files where the worker
thread completes almost instantly, and is particularly reproducible in
no_header() mode on Windows.

Also adds a regression test (test_threadsafe_deque_race.cpp) that
exercises the race window with rapid small-CSV iterations.
Move C++ standard detection and string_view includes before csv namespace
to prevent <functional> from being included inside csv namespace.

This prevents namespace collision where std:: gets resolved as csv::std::
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.91%. Comparing base (8be1567) to head (65efdf8).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #298      +/-   ##
==========================================
+ Coverage   88.89%   88.91%   +0.01%     
==========================================
  Files          23       23              
  Lines        1513     1515       +2     
  Branches      617      617              
==========================================
+ Hits         1345     1347       +2     
  Misses         66       66              
  Partials      102      102              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread tests/test_threadsafe_deque_race.cpp Dismissed
Comment thread tests/test_threadsafe_deque_race.cpp Dismissed
@vincentlaucsb
Copy link
Copy Markdown
Owner

vincentlaucsb commented Mar 28, 2026

Thank you for your detailed PR and analysis. In the last version, I tried to optimize the ThreadSafeDeque but introduced the race condition you described. A lock was really not a big deal, since the two functions are not on a hot path.

I made a minor change to wrap your tests in a timeout helper since, if the parser regresses (hopefully not), they would cause the CI to hang.

Overall, this is a great PR.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a lost-wakeup deadlock in ThreadSafeDeque by ensuring kill_all()/notify_all() update _is_waitable and signal the condition variable while holding the deque mutex, and adds regression coverage to prevent hangs from resurfacing.

Changes:

  • Serialize _is_waitable transitions + condition_variable notifications in ThreadSafeDeque::notify_all() and kill_all() by taking _lock.
  • Add a deadlock-regression/stress test plus a shared timeout helper to fail fast instead of hanging CI.
  • Add internal synchronization design documentation and update test utility documentation/CMake wiring.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
include/internal/thread_safe_deque.hpp Locks around _is_waitable state transitions and notify_all() to prevent lost wakeups.
single_include/csv.hpp Mirrors the ThreadSafeDeque locking change in the generated single-include header.
single_include_test/csv.hpp Mirrors the ThreadSafeDeque locking change for single-include testing.
tests/test_threadsafe_deque_race.cpp Adds regression/stress scenarios intended to reproduce the deadlock window.
tests/shared/timeout_helper.hpp Adds a timeout wrapper to avoid indefinite hangs in deadlock-regression tests.
tests/CMakeLists.txt Adds the new race-regression test file to the test target sources.
include/internal/THREADSAFE_DEQUE_DESIGN.md Documents the producer/consumer protocol and the rationale for the fix.
tests/CLAUDE.md Documents the new shared timeout helper and testing conventions.
tests/shared/CLAUDE.md Adds guidance for maintaining shared test utilities documentation.
include/internal/common.hpp Moves C++ standard detection and string_view inclusion outside the csv namespace.
.claude/rules/testing-conventions.md Removes the old AI/testing conventions doc (migrated into tests/CLAUDE.md).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +38 to +42
auto future = std::async(std::launch::async, fn);

auto status = future.wait_for(timeout);

REQUIRE(status == std::future_status::ready);
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_with_timeout() uses std::async(std::launch::async, ...) and then REQUIRE(status == std::future_status::ready). If the async task deadlocks and the REQUIRE fails, stack unwinding will destroy future; for futures created by std::async this destructor can block waiting for the task to finish, which defeats the whole purpose of the timeout and can still hang CI. Consider reworking the helper so the timeout path does not wait for the worker (e.g., run the callable on a detached std::thread with a std::promise/std::future to signal completion, and on timeout forcefully end the test process via std::quick_exit/std::_Exit, or otherwise ensure the background task cannot block destruction).

Copilot uses AI. Check for mistakes.
* @param fn Test function to execute
* @param timeout Maximum time to wait before failing (default: 10 seconds)
*
* @throws std::runtime_error if timeout occurs
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment says @throws std::runtime_error if timeout occurs, but the implementation uses a Catch2 REQUIRE(...) assertion instead of throwing (and never uses <stdexcept>). Either update the documentation to match the actual failure mechanism, or change the implementation to throw on timeout and let the caller assert/fail appropriately.

Suggested change
* @throws std::runtime_error if timeout occurs
* @throws Catch2 assertion failure (test failure) if timeout occurs

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,109 @@
//
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file appears to start with a UTF-8 BOM character before // (shown as an invisible character in the diff). BOMs can cause tooling/formatting issues and occasionally break compilation on stricter toolchains; please remove the BOM so the source starts with a plain //.

Suggested change
//
//

Copilot uses AI. Check for mistakes.
Comment thread tests/CLAUDE.md
Comment on lines 37 to 38
### Test Organization

Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

### Test Organization is an empty section header (it immediately jumps to ### Testing Conventions). Either add content under "Test Organization" or remove the empty header to avoid confusing documentation structure.

Suggested change
### Test Organization

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +76
// Detect C++ standard version BEFORE namespace to properly include string_view
#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 20) || __cplusplus >= 202002L
#define CSV_HAS_CXX20
#endif

#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 17) || __cplusplus >= 201703L
#define CSV_HAS_CXX17
#endif

#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD >= 14) || __cplusplus >= 201402L
#define CSV_HAS_CXX14
#endif

// Include string_view BEFORE csv namespace to avoid namespace pollution issues
#ifdef CSV_HAS_CXX17
#include <string_view>
#else
#include "../external/string_view.hpp"
#endif

Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR description focuses on the ThreadSafeDeque race fix and regression tests, but it also changes include/internal/common.hpp to move C++-standard detection macros and the string_view include outside the csv namespace. If this change is intentional (it can affect downstream include behavior), please either mention it explicitly in the PR description or split it into a separate PR so the concurrency fix can be reviewed/merged independently.

Copilot uses AI. Check for mistakes.
@vincentlaucsb vincentlaucsb merged commit 172ad8d into vincentlaucsb:master Mar 28, 2026
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants