Skip to content
1 change: 1 addition & 0 deletions .github/workflows/cmake-multi-platform.yml
Original file line number Diff line number Diff line change
Expand Up @@ -93,3 +93,4 @@ jobs:
run: ctest --build-config ${{ matrix.build_type }}



63 changes: 63 additions & 0 deletions .github/workflows/msvc-consumer-compat.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
name: MSVC Consumer Compatibility

on:
push:
branches: [ "master" ]
paths:
- "include/**"
- "single_header.py"
- "CMakeLists.txt"
- ".github/workflows/msvc-consumer-compat.yml"
pull_request:
branches: [ "master" ]
paths:
- "include/**"
- "single_header.py"
- "CMakeLists.txt"
- ".github/workflows/msvc-consumer-compat.yml"
workflow_dispatch:

jobs:
windows-msvc-no-zc-cplusplus:
name: Windows MSVC compatibility without /Zc:__cplusplus
runs-on: windows-latest

steps:
- name: Checkout repository and submodules
uses: actions/checkout@v4
with:
submodules: recursive

- name: Set up MSVC developer command prompt
uses: ilammy/msvc-dev-cmd@v1

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.x'

- name: Configure CMake (for single-header generation)
run: >
cmake -S ${{ github.workspace }}
-B ${{ github.workspace }}/build/no-zc-cplusplus
-DCSV_CXX_STANDARD=20
-DCSV_BUILD_PROGRAMS=OFF
-DBUILD_PYTHON=OFF

- name: Generate amalgamated single-header
run: cmake --build ${{ github.workspace }}/build/no-zc-cplusplus --config Release --target generate_single_header

- name: Compile generated single-header without /Zc:__cplusplus
shell: cmd
run: |
REM NOTE: single_include/csv.hpp is an intentional compatibility shim; compile the generated amalgamated header instead.
echo #include "csv.hpp" > ci_no_zc_single_header.cpp
echo int main() { return 0; } >> ci_no_zc_single_header.cpp
cl /nologo /std:c++20 /EHsc /Zc:__cplusplus- /I build\no-zc-cplusplus\single_include_generated /c ci_no_zc_single_header.cpp

- name: Compile unamalgamated header without /Zc:__cplusplus
shell: cmd
run: |
echo #include "csv.hpp" > ci_no_zc_unamalgamated.cpp
echo int main() { return 0; } >> ci_no_zc_unamalgamated.cpp
cl /nologo /std:c++20 /EHsc /Zc:__cplusplus- /I include /c ci_no_zc_unamalgamated.cpp
Comment thread Fixed
77 changes: 77 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# CSV Parser - AI Agent Context

Architectural overview for AI assistants working with this codebase.

> **Maintenance rule:** Whenever this file is changed, `CLAUDE.md` in the same directory must be updated to reflect the changes. `CLAUDE.md` is a bullet-point summary of this file and must stay in sync.

## Critical: single_include/csv.hpp Is A Shim

`single_include/csv.hpp` is intentionally **non-functional** and exists only as a compatibility shim.

- Do **not** compile against `single_include/csv.hpp`
- For single-header validation, generate `build/.../single_include_generated/csv.hpp` via the `generate_single_header` target, then compile that generated file
- For unamalgamated usage, include headers from `include/`

This guard exists to prevent stale-in-repo amalgamated headers and to force use of the canonical generated distribution.

## Critical: Two Independent Code Paths

The `CSVReader` class has **two completely different implementations**:

```cpp
// PATH 1: Memory-mapped I/O (MmapParser)
CSVReader reader("filename.csv");

// PATH 2: Stream-based (StreamParser)
std::ifstream infile("filename.csv", std::ios::binary);
CSVReader reader(infile, format);
```

**Impact:** Bugs can exist in one path but not the other (see issue #281). Any test validating parsing behavior must test BOTH paths using Catch2 `SECTION`.

## Threading: Worker + 10MB Chunks

- Worker thread reads in 10MB chunks (`ITERATION_CHUNK_SIZE`)
- Communicates via `ThreadSafeDeque<CSVRow>`
- Exceptions propagate via `std::exception_ptr`
- Critical: Fields spanning chunk boundaries must not corrupt

**Testing requirement:** Use ≥500K rows to cross 10MB boundary.

## Key Files

| File | Contains |
|------|----------|
| `csv_reader.hpp` | Mmap vs stream constructors |
| `csv_reader.cpp` | Delimiter guessing, header detection |
| `basic_csv_parser.hpp` | Parser base class (IBasicCSVParser, MmapParser, StreamParser) |
| `basic_csv_parser.cpp` | Chunk transitions, worker thread |
| `raw_csv_data.hpp` | Internal parser data structures (RawCSVField, CSVFieldList, RawCSVData) |
| `thread_safe_deque.hpp` | Producer-consumer queue for parser→main thread communication |
| `csv_row.hpp` | Public API types (CSVField, CSVRow) |
| `test_round_trip.cpp` | Exemplar test patterns |

## Data Flow: Parser → Row API

```
Parser Thread Main Thread
↓ ↓
RawCSVData (shared_ptr) ─────────────→ CSVRow
↓ ↓
CSVFieldList → RawCSVField[] CSVField (lazy unescaping)
ThreadSafeDeque<CSVRow>
(producer-consumer queue)
```

**Thread Safety:** Parser populates `RawCSVData`, pushes `CSVRow` to `ThreadSafeDeque`, main thread pops and reads. The `CSVFieldList` uses chunked allocation (~170 fields/chunk) for cache locality. See `raw_csv_data.hpp` and `thread_safe_deque.hpp` for implementation details.

## Common Pitfalls

1. **Don't assume one code path:** Mmap and stream paths are different. Always test both.
2. **Don't write tiny tests:** Need ≥500K rows to cross 10MB chunk boundary.
3. **Don't use uniform values:** Each column needs distinct values to detect corruption.
4. **Don't ignore async:** Worker thread means exceptions must use `exception_ptr`.
5. **Don't change one constructor:** Likely affects both mmap and stream paths.

See `tests/AGENTS.md` for test strategy, checklist, and conventions.
113 changes: 28 additions & 85 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -1,94 +1,37 @@
# CSV Parser - AI Agent Context
# CSV Parser - Claude Summary

Architectural overview for AI assistants working with this codebase.
> **`AGENTS.md` is the source of truth.** This file is a bullet-point summary only. Always load and follow `AGENTS.md` — it takes precedence over anything here.

## Critical: Two Independent Code Paths
## single_include/csv.hpp
- Non-functional shim — do **not** compile against it
- For single-header use: generate `build/.../single_include_generated/csv.hpp` via `generate_single_header` target
- For unamalgamated use: include from `include/`

The `CSVReader` class has **two completely different implementations**:
## Two Independent Code Paths
- `CSVReader("file.csv")` → MmapParser
- `CSVReader(istream, format)` → StreamParser
- Bugs can exist in one and not the other — always test both with Catch2 `SECTION`

```cpp
// PATH 1: Memory-mapped I/O (MmapParser)
CSVReader reader("filename.csv");

// PATH 2: Stream-based (StreamParser)
std::ifstream infile("filename.csv", std::ios::binary);
CSVReader reader(infile, format);
```

**Impact:** Bugs can exist in one path but not the other (see issue #281). Any test validating parsing behavior must test BOTH paths using Catch2 `SECTION`.

## Threading: Worker + 10MB Chunks

- Worker thread reads in 10MB chunks (`ITERATION_CHUNK_SIZE`)
- Communicates via `ThreadSafeDeque<CSVRow>`
## Threading
- Worker thread reads 10MB chunks (`ITERATION_CHUNK_SIZE`)
- Communication via `ThreadSafeDeque<CSVRow>`
- Exceptions propagate via `std::exception_ptr`
- Critical: Fields spanning chunk boundaries must not corrupt

**Testing requirement:** Use ≥500K rows to cross 10MB boundary.

## Test Strategy: Use Distinct Column Values

❌ **BAD:** `array{i, i, i, i, i}` - All columns identical
✅ **GOOD:** `array{i*5+0, i*5+1, i*5+2, i*5+3, i*5+4}` - Each column distinct

**Why:** Field corruption is only detectable if columns have different values.
- Tests must use ≥500K rows to cross chunk boundary

## Key Files

| File | Contains |
|------|----------|
| `csv_reader.hpp` | Mmap vs stream constructors |
| `csv_reader.cpp` | Delimiter guessing, header detection |
| `basic_csv_parser.hpp` | Parser base class (IBasicCSVParser, MmapParser, StreamParser) |
| `basic_csv_parser.cpp` | Chunk transitions, worker thread |
| `raw_csv_data.hpp` | Internal parser data structures (RawCSVField, CSVFieldList, RawCSVData) |
| `thread_safe_deque.hpp` | Producer-consumer queue for parser→main thread communication |
| `csv_row.hpp` | Public API types (CSVField, CSVRow) |
| `test_round_trip.cpp` | Exemplar test patterns |

## Data Flow: Parser → Row API

```
Parser Thread Main Thread
↓ ↓
RawCSVData (shared_ptr) ─────────────→ CSVRow
↓ ↓
CSVFieldList → RawCSVField[] CSVField (lazy unescaping)
ThreadSafeDeque<CSVRow>
(producer-consumer queue)
```

**Thread Safety:** Parser populates `RawCSVData`, pushes `CSVRow` to `ThreadSafeDeque`, main thread pops and reads. The `CSVFieldList` uses chunked allocation (~170 fields/chunk) for cache locality. See `raw_csv_data.hpp` and `thread_safe_deque.hpp` for implementation details.
- `csv_reader.hpp` — mmap vs stream constructors
- `basic_csv_parser.hpp` — MmapParser, StreamParser implementations
- `basic_csv_parser.cpp` — chunk transitions, worker thread
- `raw_csv_data.hpp` — RawCSVField, CSVFieldList, RawCSVData
- `thread_safe_deque.hpp` — producer-consumer queue
- `csv_row.hpp` — CSVField, CSVRow public API

## Common Pitfalls

1. **Don't assume one code path:** Mmap and stream paths are different. Always test both.
2. **Don't write tiny tests:** Need ≥500K rows to cross 10MB chunk boundary.
3. **Don't use uniform values:** Each column needs distinct values to detect corruption.
4. **Don't ignore async:** Worker thread means exceptions must use `exception_ptr`.
5. **Don't change one constructor:** Likely affects both mmap and stream paths.

## Test Checklist

- [ ] Tests both mmap and stream paths (use `SECTION`)
- [ ] Distinct values per column
- [ ] ≥500K rows to cross chunk boundary
- [ ] Documents bug it would catch
- [ ] Lambda + SECTION pattern for code reuse
- [ ] Test data in `tests/data/fake_data` (real data in `tests/data/real_data`)
- [ ] Use `FileGuard` for temporary files (ensures cleanup even if test fails)

**Note:** `tests/data` is a git submodule. Remember to commit changes separately.

## Recent Bug Fixes

| Issue | Bug | Fixed |
|-------|-----|-------|
| #278 | CSVFieldList move constructor dangling pointer | Feb 2026 |
| #280 | Field corruption at chunk boundaries | PR #282 |
| #281 | Stream-specific exception handling | PR #282 |
| #283 | Header detection with variable-width rows | Jan 2026 |
| #285 | Delimiter guessing overwrites `no_header()` | Feb 2026 |

See inline comments in source files for implementation details.
- Always test both mmap and stream paths
- ≥500K rows needed to cross 10MB boundary
- Use distinct column values to detect field corruption
- Exceptions from worker thread need `exception_ptr`
- Changes to one constructor likely affect both paths

## Tests
See `tests/AGENTS.md` for full test strategy, checklist, and conventions.
4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,10 @@ Found a bug? Please report it! This project welcomes **genuine bug reports broug
* ✅ Performance regressions in real-world scenarios
* ✅ API issues that affect **practical, real-world use cases**

When reporting integration or compiler issues, please state which library form you are using:
* Single-header
* Unamalgamated headers/library (`include/` with your own build system, CMake, etc.)

Please keep reports grounded in real use cases—no contrived edge cases or philosophical debates about API design, thanks!

**Design Note:** `CSVReader` uses `std::input_iterator_tag` for single-pass streaming of arbitrarily large files. If you need multi-pass iteration or random access, copy rows to a `std::vector` first. This is by design, not a bug.
Expand Down
13 changes: 10 additions & 3 deletions include/internal/common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -55,15 +55,22 @@
#endif

// Detect C++ standard version BEFORE namespace to properly include string_view
#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 20) || __cplusplus >= 202002L
// MSVC: __cplusplus == 199711L unless /Zc:__cplusplus is set; use _MSVC_LANG instead.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
# define CSV_CPLUSPLUS _MSVC_LANG
#else
# define CSV_CPLUSPLUS __cplusplus
#endif

#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 20) || CSV_CPLUSPLUS >= 202002L
#define CSV_HAS_CXX20
#endif

#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 17) || __cplusplus >= 201703L
#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD == 17) || CSV_CPLUSPLUS >= 201703L
#define CSV_HAS_CXX17
#endif

#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD >= 14) || __cplusplus >= 201402L
#if (defined(CMAKE_CXX_STANDARD) && CMAKE_CXX_STANDARD >= 14) || CSV_CPLUSPLUS >= 201402L
#define CSV_HAS_CXX14
#endif

Expand Down
Loading
Loading