Commit b603042

CSV Parser 4.0.0 - Benchmarks, DataFrame gets parallelized, MSVC AVX2 Fixes (#309)

* Add benchmarks + MSVC AVX2 fixes
* Simplify DataFrame
* JsonConverter refactor
* Fix compiler errors
* Euthanize CSVStat
* Update file2.cpp
* Make read_chunk() more efficient
* Remove DataFrame dummy key
* Clean up CSV writing duplicated logic
* Make sure DataFrameExecutor does not swallow exceptions; fix writing edge case with zero
* Simplify DelimWriter factory functions
* Update csv_writer.hpp
* Made DataFrame edit overlay thread safe
* More unit tests
* More unit tests
* Update sanitizers.yml
1 parent 9172253 commit b603042

64 files changed

Lines changed: 5597 additions & 1745 deletions

.github/workflows/README.md

Lines changed: 18 additions & 3 deletions
@@ -12,14 +12,18 @@ This directory contains GitHub Actions workflows for comprehensive memory and th
 - **UndefinedBehaviorSanitizer (UBSan)** - Catches undefined behavior: signed overflow, type mismatches
 
 **Config:**
-- Runs on Ubuntu with GCC
-- Tests C++17 and C++20 standards
+- Runs Linux sanitizer coverage on Ubuntu with GCC
+- Adds a dedicated Windows/MSVC AddressSanitizer leg on C++20
 - Debug builds for better diagnostics
 - Timeout: 20 minutes per configuration
 - Artifacts: Upload logs on failure
 
 **Key Features:**
-- Matrix testing: 3 sanitizers × 2 C++ standards = 6 parallel jobs
+- Linux matrix testing:
+  - ASan on C++20
+  - TSan on C++20
+  - UBSan on C++17
+- Dedicated Windows/MSVC ASan coverage for AVX2- and codegen-sensitive issues
 - Fail-fast disabled to see all results
 - Environment variables configured for halt-on-error behavior
 

@@ -54,6 +58,16 @@ cmake -B build/asan -DCMAKE_BUILD_TYPE=Debug \
   -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer -g"
 cmake --build build/asan
 cd build/asan && ctest --output-on-failure && cd ../..
+
+# Windows / MSVC AddressSanitizer
+cmake -S . -B build/msvc-asan ^
+  -DCSV_CXX_STANDARD=20 ^
+  -DCMAKE_BUILD_TYPE=RelWithDebInfo ^
+  -DCMAKE_CXX_FLAGS="/fsanitize=address /Zi" ^
+  -DCMAKE_C_FLAGS="/fsanitize=address /Zi" ^
+  -DCMAKE_EXE_LINKER_FLAGS="/INCREMENTAL:NO"
+cmake --build build/msvc-asan --config RelWithDebInfo
+ctest --test-dir build/msvc-asan --build-config RelWithDebInfo --output-on-failure
 ```
 
 ### CI/CD Pipeline Order
@@ -75,6 +89,7 @@ cd build/asan && ctest --output-on-failure && cd ../..
 - **Why Important:** Catches CSVFieldList memory issues like issue #278
 - Cannot run simultaneously with TSan (different memory models)
 - Better performance than TSan for memory safety
+- The MSVC C++20 ASan leg is especially useful for AVX2- and codegen-sensitive regressions
 
 ### Valgrind
 - Slower than sanitizers but more mature tool

.github/workflows/codeql.yml

Lines changed: 3 additions & 0 deletions
@@ -47,6 +47,9 @@ jobs:
 cmake .. \
   -DCMAKE_BUILD_TYPE=Release \
   -DCSV_CXX_STANDARD=20 \
+  -DCSV_BUILD_PROGRAMS=OFF \
+  -DCSV_BUILD_TESTS=OFF \
+  -DBUILD_PYTHON=OFF \
   -DCMAKE_CXX_COMPILER=g++ \
   -DCMAKE_C_COMPILER=gcc
 

.github/workflows/sanitizers.yml

Lines changed: 48 additions & 4 deletions
@@ -10,22 +10,23 @@ permissions:
   contents: read
 
 jobs:
-  sanitizers:
+  linux-sanitizers:
     runs-on: ubuntu-latest
 
     strategy:
       fail-fast: false
       matrix:
-        sanitizer: [address, thread, undefined]
-        cxx_standard: [17, 20]
         include:
           - sanitizer: address
+            cxx_standard: 20
             flag: "-fsanitize=address -fno-omit-frame-pointer"
             name: "AddressSanitizer"
           - sanitizer: thread
+            cxx_standard: 20
             flag: "-fsanitize=thread"
             name: "ThreadSanitizer"
           - sanitizer: undefined
+            cxx_standard: 17
             flag: "-fsanitize=undefined -fno-omit-frame-pointer"
             name: "UndefinedBehaviorSanitizer"
 

@@ -69,9 +70,52 @@ jobs:
         if: failure()
         uses: actions/upload-artifact@v4
         with:
-          name: sanitizer-logs-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }}
+          name: linux-sanitizer-logs-${{ matrix.sanitizer }}-std${{ matrix.cxx_standard }}
           path: build/Testing/
 
+  windows-asan-msvc:
+    name: MSVC AddressSanitizer (C++20)
+    runs-on: windows-latest
+
+    env:
+      ASAN_OPTIONS: "halt_on_error=1"
+
+    steps:
+      - name: Checkout repository and submodules
+        uses: actions/checkout@v5
+        with:
+          submodules: recursive
+
+      - name: Set up MSVC developer command prompt
+        uses: ilammy/msvc-dev-cmd@0b201ec74fa43914dc39ae48a89fd1d8cb592756
+
+      - name: Configure CMake with MSVC AddressSanitizer
+        shell: cmd
+        run: >
+          cmake -S %GITHUB_WORKSPACE%
+          -B %GITHUB_WORKSPACE%\build\msvc-asan
+          -DCSV_CXX_STANDARD=20
+          -DCMAKE_BUILD_TYPE=RelWithDebInfo
+          -DCMAKE_CXX_FLAGS="/fsanitize=address /Zi"
+          -DCMAKE_C_FLAGS="/fsanitize=address /Zi"
+          -DCMAKE_EXE_LINKER_FLAGS="/INCREMENTAL:NO"
+          -DCMAKE_SHARED_LINKER_FLAGS="/INCREMENTAL:NO"
+
+      - name: Build with MSVC AddressSanitizer
+        run: cmake --build ${{ github.workspace }}/build/msvc-asan --config RelWithDebInfo
+
+      - name: Test with MSVC AddressSanitizer
+        working-directory: ${{ github.workspace }}/build/msvc-asan
+        run: ctest --build-config RelWithDebInfo --output-on-failure -V
+        timeout-minutes: 20
+
+      - name: Upload MSVC AddressSanitizer logs
+        if: failure()
+        uses: actions/upload-artifact@v4
+        with:
+          name: windows-msvc-asan-logs-std20
+          path: build/msvc-asan/Testing/
+
   valgrind:
     runs-on: ubuntu-latest
 

AGENTS.md

Lines changed: 4 additions & 1 deletion
@@ -69,9 +69,12 @@ See `tests/AGENTS.md` for test strategy, checklist, and conventions.
 - **Templates must stay in `.hpp`** — the compiler needs the definition at instantiation time. `init_from_stream` is the standing example.
 - **Trivial one-liner accessors** may be unconditionally `inline` in the header when the call overhead is measurable and the body will never change.
 - **Consolidation:** If a `.cpp` would be under ~100 lines *and* the split causes excessive comment duplication between the two files, prefer a single `.hpp` with definitions marked `inline` (free functions and methods alike). Do not use `CSV_INLINE` for consolidated definitions — `CSV_INLINE` expands to empty in multi-header mode, which would produce ODR violations across TUs. Do not consolidate just for brevity — only when duplication is the dominant cost.
+7. **Prefer LF (`\n`) line endings for tracked source, test, CMake, and Markdown files.** When you touch a file with mixed line endings, normalize the edited file to LF unless there is a file-specific reason not to. Avoid introducing mixed CRLF/LF endings in the same file.
+8. **Keep preprocessor directives flush left.** `#define`, `#if`, `#ifdef`, `#else`, and `#endif` should start at column 0. Code inside multi-line macros should be indented exactly as the equivalent non-macro code would be; do not add extra indentation just because it lives inside a macro body.
 
 ### Rules for Comments
 1. **Always update or remove incorrect comments.**
 2. **Don't reference internal functions in public API comments.** Public API docs should describe user-visible behavior and contracts; internal helper/function details belong in internal docs.
 3. **Avoid meaningless @param and @return descriptions.** Do not add comments that could trivially be inferred by the function's name or other existing comments. When editing a function, remove any @param/@return descriptions that merely restate the function name or signature.
-4. **Don't delete or simplify comments** unless allowed by other rules in this section. Comments in this codebase frequently encode concurrency invariants, non-obvious design decisions, and hard-won bug context that cannot be recovered from the code alone.
+4. **Don't delete or simplify comments** unless allowed by other rules in this section. Comments in this codebase frequently encode concurrency invariants, non-obvious design decisions, and hard-won bug context that cannot be recovered from the code alone.
+5. **Public API docs belong on declarations in `.hpp` files.** When a class has both a header and implementation file, put user-facing/Doxygen documentation on the declaration in the header. Keep the `.cpp` focused on implementation notes, concurrency invariants, performance rationale, and bug-history comments.
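
The flush-left preprocessor rule added to AGENTS.md in this hunk may be easier to picture with a small sketch. The names below (`DEMO_HAS_FAST_PATH`, `DEMO_RETURN_IF_EMPTY`, `demo::field_length`) are hypothetical and do not come from the library; the snippet only illustrates one reading of the rule: directives start at column 0 even inside an indented function body, while a multi-line macro body is indented like ordinary code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical feature switch, for illustration only.
#ifndef DEMO_HAS_FAST_PATH
#define DEMO_HAS_FAST_PATH 1
#endif

// Multi-line macro: the body is indented exactly as the equivalent
// non-macro code would be, with no extra level for the macro wrapper.
#define DEMO_RETURN_IF_EMPTY(str, fallback) \
    if ((str).empty()) {                    \
        return (fallback);                  \
    }

namespace demo {
    // Returns the number of characters in a field (0 for an empty field).
    inline std::size_t field_length(const std::string& field) {
        DEMO_RETURN_IF_EMPTY(field, 0)
// Directives stay at column 0 even though the surrounding code is indented.
#if DEMO_HAS_FAST_PATH
        return field.size();
#else
        std::size_t n = 0;
        for (char c : field) {
            (void)c;
            ++n;
        }
        return n;
#endif
    }
}
```

Under this reading, only the `#`-lines move to the left margin; the code between them keeps the indentation it would have had without any conditional compilation.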

ARCHITECTURE.md

Lines changed: 3 additions & 0 deletions
@@ -18,8 +18,11 @@ Notes:
 - Queue synchronization details are maintained only in THREADSAFE_DEQUE_DESIGN.md to avoid duplication.
 - Always update or remove incorrect comments.
 - Public API comments should remain user-facing and avoid references to internal helper/function details.
+- Public API docs belong on declarations in `.hpp` files; keep `.cpp` comments focused on implementation notes, concurrency invariants, performance rationale, and bug history.
 - When editing a function, remove `@param` and `@return` descriptions that merely restate the function name or signature.
 - Private member naming should prefer trailing underscores; when editing mixed-style code, normalize the touched region toward that convention.
+- Prefer LF (`\n`) line endings for tracked source, test, CMake, and Markdown files; when touching a file with mixed endings, normalize it to LF unless there is a file-specific reason not to.
+- Keep preprocessor directives flush left; `#define`, `#if`, `#ifdef`, `#else`, and `#endif` should start at column 0, and code inside multi-line macros should be indented as if the macro wrapper were not present.
 - Compatibility macros defined in `common.hpp` must only be referenced after including `common.hpp`. See AGENTS.md and CLAUDE.md for details.
 - API constraints should be user-friendly: do not over-constrain templates unless needed for correctness, safety, or a measured performance win.
 - `CSVReader` is intentionally non-copyable and move-enabled; use explicit ownership transfer patterns (`std::move`, `std::unique_ptr`) at API boundaries.
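
The last note in this hunk describes `CSVReader` as non-copyable and move-enabled. The sketch below shows what explicit ownership transfer could look like for such a type. `DemoReader`, `consume`, and `make_reader` are hypothetical stand-ins written for this illustration; they are not the library's `CSVReader` or any of its API.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical move-only type; NOT the library's CSVReader.
class DemoReader {
public:
    explicit DemoReader(std::string source) : source_(std::move(source)) {}

    // Copying is disabled; moving is allowed.
    DemoReader(const DemoReader&) = delete;
    DemoReader& operator=(const DemoReader&) = delete;
    DemoReader(DemoReader&&) = default;
    DemoReader& operator=(DemoReader&&) = default;

    const std::string& source() const { return source_; }

private:
    std::string source_;  // trailing underscore, per the naming note above
};

static_assert(!std::is_copy_constructible<DemoReader>::value,
              "DemoReader must not be copyable");
static_assert(std::is_move_constructible<DemoReader>::value,
              "DemoReader must be movable");

// Takes ownership by value, so callers have to write std::move explicitly.
inline std::string consume(DemoReader reader) {
    return reader.source();
}

// Hands out heap ownership through std::unique_ptr.
inline std::unique_ptr<DemoReader> make_reader(std::string source) {
    return std::make_unique<DemoReader>(std::move(source));
}
```

A caller would write `consume(std::move(reader))` or hold the result of `make_reader(...)`; passing `reader` without `std::move` does not compile, which keeps the transfer visible at the call site.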
