|
1 | | -# CSV Parser - AI Agent Context |
| 1 | +# CSV Parser - Claude Summary |
2 | 2 |
|
3 | | -Architectural overview for AI assistants working with this codebase. |
| 3 | +> **`AGENTS.md` is the source of truth.** This file is a bullet-point summary only. Always load and follow `AGENTS.md` — it takes precedence over anything here. |
4 | 4 |
|
5 | | -## Critical: Two Independent Code Paths |
| 5 | +## single_include/csv.hpp |
| 6 | +- Non-functional shim — do **not** compile against it |
| 7 | +- For single-header use: generate `build/.../single_include_generated/csv.hpp` via `generate_single_header` target |
| 8 | +- For unamalgamated use: include from `include/` |
6 | 9 |
|
7 | | -The `CSVReader` class has **two completely different implementations**: |
| 10 | +## Two Independent Code Paths |
| 11 | +- `CSVReader("file.csv")` → MmapParser |
| 12 | +- `CSVReader(istream, format)` → StreamParser |
| 13 | +- Bugs can exist in one and not the other — always test both with Catch2 `SECTION` |
8 | 14 |
|
9 | | -```cpp |
10 | | -// PATH 1: Memory-mapped I/O (MmapParser) |
11 | | -CSVReader reader("filename.csv"); |
12 | | - |
13 | | -// PATH 2: Stream-based (StreamParser) |
14 | | -std::ifstream infile("filename.csv", std::ios::binary); |
15 | | -CSVReader reader(infile, format); |
16 | | -``` |
17 | | -
|
18 | | -**Impact:** Bugs can exist in one path but not the other (see issue #281). Any test validating parsing behavior must test BOTH paths using Catch2 `SECTION`. |
19 | | -
|
20 | | -## Threading: Worker + 10MB Chunks |
21 | | -
|
22 | | -- Worker thread reads in 10MB chunks (`ITERATION_CHUNK_SIZE`) |
23 | | -- Communicates via `ThreadSafeDeque<CSVRow>` |
| 15 | +## Threading |
| 16 | +- Worker thread reads 10MB chunks (`ITERATION_CHUNK_SIZE`) |
| 17 | +- Communication via `ThreadSafeDeque<CSVRow>` |
24 | 18 | - Exceptions propagate via `std::exception_ptr` |
25 | | -- Critical: Fields spanning chunk boundaries must not corrupt |
26 | | -
|
27 | | -**Testing requirement:** Use ≥500K rows to cross 10MB boundary. |
28 | | -
|
29 | | -## Test Strategy: Use Distinct Column Values |
30 | | -
|
31 | | -❌ **BAD:** `array{i, i, i, i, i}` - All columns identical |
32 | | -✅ **GOOD:** `array{i*5+0, i*5+1, i*5+2, i*5+3, i*5+4}` - Each column distinct |
33 | | -
|
34 | | -**Why:** Field corruption is only detectable if columns have different values. |
| 19 | +- Tests must use ≥500K rows to cross the 10MB chunk boundary |
35 | 20 |
|
36 | 21 | ## Key Files |
37 | | -
|
38 | | -| File | Contains | |
39 | | -|------|----------| |
40 | | -| `csv_reader.hpp` | Mmap vs stream constructors | |
41 | | -| `csv_reader.cpp` | Delimiter guessing, header detection | |
42 | | -| `basic_csv_parser.hpp` | Parser base class (IBasicCSVParser, MmapParser, StreamParser) | |
43 | | -| `basic_csv_parser.cpp` | Chunk transitions, worker thread | |
44 | | -| `raw_csv_data.hpp` | Internal parser data structures (RawCSVField, CSVFieldList, RawCSVData) | |
45 | | -| `thread_safe_deque.hpp` | Producer-consumer queue for parser→main thread communication | |
46 | | -| `csv_row.hpp` | Public API types (CSVField, CSVRow) | |
47 | | -| `test_round_trip.cpp` | Exemplar test patterns | |
48 | | -
|
49 | | -## Data Flow: Parser → Row API |
50 | | -
|
51 | | -``` |
52 | | -Parser Thread Main Thread |
53 | | - ↓ ↓ |
54 | | -RawCSVData (shared_ptr) ─────────────→ CSVRow |
55 | | - ↓ ↓ |
56 | | -CSVFieldList → RawCSVField[] CSVField (lazy unescaping) |
57 | | - ↓ |
58 | | -ThreadSafeDeque<CSVRow> |
59 | | -(producer-consumer queue) |
60 | | -``` |
61 | | -
|
62 | | -**Thread Safety:** Parser populates `RawCSVData`, pushes `CSVRow` to `ThreadSafeDeque`, main thread pops and reads. The `CSVFieldList` uses chunked allocation (~170 fields/chunk) for cache locality. See `raw_csv_data.hpp` and `thread_safe_deque.hpp` for implementation details. |
| 22 | +- `csv_reader.hpp` — mmap vs stream constructors |
| 23 | +- `basic_csv_parser.hpp` — MmapParser, StreamParser implementations |
| 24 | +- `basic_csv_parser.cpp` — chunk transitions, worker thread |
| 25 | +- `raw_csv_data.hpp` — RawCSVField, CSVFieldList, RawCSVData |
| 26 | +- `thread_safe_deque.hpp` — producer-consumer queue |
| 27 | +- `csv_row.hpp` — CSVField, CSVRow public API |
63 | 28 |
|
64 | 29 | ## Common Pitfalls |
65 | | -
|
66 | | -1. **Don't assume one code path:** Mmap and stream paths are different. Always test both. |
67 | | -2. **Don't write tiny tests:** Need ≥500K rows to cross 10MB chunk boundary. |
68 | | -3. **Don't use uniform values:** Each column needs distinct values to detect corruption. |
69 | | -4. **Don't ignore async:** Worker thread means exceptions must use `exception_ptr`. |
70 | | -5. **Don't change one constructor:** Likely affects both mmap and stream paths. |
71 | | -
|
72 | | -## Test Checklist |
73 | | -
|
74 | | -- [ ] Tests both mmap and stream paths (use `SECTION`) |
75 | | -- [ ] Distinct values per column |
76 | | -- [ ] ≥500K rows to cross chunk boundary |
77 | | -- [ ] Documents bug it would catch |
78 | | -- [ ] Lambda + SECTION pattern for code reuse |
79 | | -- [ ] Test data in `tests/data/fake_data` (real data in `tests/data/real_data`) |
80 | | -- [ ] Use `FileGuard` for temporary files (ensures cleanup even if test fails) |
81 | | -
|
82 | | -**Note:** `tests/data` is a git submodule. Remember to commit changes separately. |
83 | | -
|
84 | | -## Recent Bug Fixes |
85 | | -
|
86 | | -| Issue | Bug | Fixed | |
87 | | -|-------|-----|-------| |
88 | | -| #278 | CSVFieldList move constructor dangling pointer | Feb 2026 | |
89 | | -| #280 | Field corruption at chunk boundaries | PR #282 | |
90 | | -| #281 | Stream-specific exception handling | PR #282 | |
91 | | -| #283 | Header detection with variable-width rows | Jan 2026 | |
92 | | -| #285 | Delimiter guessing overwrites `no_header()` | Feb 2026 | |
93 | | -
|
94 | | -See inline comments in source files for implementation details. |
| 30 | +- Always test both mmap and stream paths |
| 31 | +- ≥500K rows needed to cross 10MB boundary |
| 32 | +- Use distinct column values to detect field corruption |
| 33 | +- Exceptions thrown on the worker thread must reach the caller via `std::exception_ptr` |
| 34 | +- Changes to one constructor likely affect both paths |
| 35 | + |
| 36 | +## Tests |
| 37 | +See `tests/AGENTS.md` for full test strategy, checklist, and conventions. |
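
The "distinct column values" pitfall in the summary can be made concrete with a small data generator. The sketch below is not taken from the repository: `make_row`, `make_csv`, `row_is_intact`, and the `A,B,C,D,E` header are hypothetical names chosen for illustration, and only the `i*5+0 … i*5+4` value pattern comes from the removed text. It uses only the standard library, so it shows how test input might be built and checked without assuming anything about the `CSVReader` API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical column count; matches the five-column pattern i*5+0 .. i*5+4.
constexpr std::size_t kCols = 5;

// Expected values for row i. Every cell in the file is unique, so a field
// that is shifted, duplicated, or truncated cannot match by accident.
std::array<long, kCols> make_row(long i) {
    std::array<long, kCols> row{};
    for (std::size_t c = 0; c < kCols; ++c) {
        row[c] = i * static_cast<long>(kCols) + static_cast<long>(c);
    }
    return row;
}

// Builds CSV text with a header line followed by n_rows data rows.
// The summary calls for at least 500K rows to cross the 10MB chunk boundary.
std::string make_csv(long n_rows) {
    assert(n_rows >= 0);
    std::ostringstream out;
    out << "A,B,C,D,E\n";
    for (long i = 0; i < n_rows; ++i) {
        const auto row = make_row(i);
        for (std::size_t c = 0; c < kCols; ++c) {
            if (c != 0) out << ',';
            out << row[c];
        }
        out << '\n';
    }
    return out.str();
}

// Returns true when the values parsed for row i equal the generated ones.
bool row_is_intact(long i, const std::array<long, kCols>& parsed) {
    return parsed == make_row(i);
}
```

In a real test, the same generated text would presumably be written to a temporary file for the mmap path and wrapped in a stream for the stream path, with `row_is_intact` applied to each parsed row in both Catch2 `SECTION`s.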