Commit 6eb125d

Clean up CSVFieldList
And rename it to RawCSVFieldList
1 parent 5e6e499 commit 6eb125d

16 files changed

Lines changed: 356 additions & 107 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 Architectural overview for AI assistants working with this codebase.

-> **Maintenance rule:** Whenever this file is changed, `CLAUDE.md` in the same directory must be updated to reflect the changes. `CLAUDE.md` is a bullet-point summary of this file and must stay in sync.
+> **Maintenance rule:** Whenever this file is changed, update both `CLAUDE.md` and `ARCHITECTURE.md` in the same directory to reflect relevant changes. `CLAUDE.md` is a bullet-point summary and `ARCHITECTURE.md` is the top-level architecture index; both must stay in sync with this guidance.

 ## Critical: single_include/csv.hpp Is A Shim

ARCHITECTURE.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Architecture Index
+
+This file is a top-level index for architecture documentation.
+
+Primary architecture document:
+- include/internal/ARCHITECTURE.md
+
+Subsystem deep-dive:
+- include/internal/THREADSAFE_DEQUE_DESIGN.md
+
+Operational/testing guidance:
+- AGENTS.md
+- tests/AGENTS.md
+
+Notes:
+- Internal architecture content lives under include/internal to stay close to implementation.
+- Queue synchronization details are maintained only in THREADSAFE_DEQUE_DESIGN.md to avoid duplication.
+

CLAUDE.md

Lines changed: 3 additions & 0 deletions
@@ -2,6 +2,9 @@

 > **`AGENTS.md` is the source of truth.** This file is a bullet-point summary only. Always load and follow `AGENTS.md` — it takes precedence over anything here.

+## Maintenance Rule
+- When `AGENTS.md` changes, update both `CLAUDE.md` and root `ARCHITECTURE.md` to keep guidance and architecture index references aligned.
+
 ## single_include/csv.hpp
 - Non-functional shim — do **not** compile against it
 - For single-header use: generate `build/.../single_include_generated/csv.hpp` via `generate_single_header` target

include/internal/ARCHITECTURE.md

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+# Internal Architecture
+
+This document describes the high-level architecture of the CSV parser internals and how major classes interact.
+
+Scope:
+- Internal component responsibilities
+- End-to-end data flow
+- Core invariants for safe changes
+- Where to add tests for subsystem changes
+
+For queue synchronization protocol details, see THREADSAFE_DEQUE_DESIGN.md.
+For AI-agent workflow and guardrails, see ../../AGENTS.md and ../../tests/AGENTS.md.
+
+## 1. System Shape
+
+The parser is a streaming producer-consumer pipeline:
+
+1. A parser implementation reads source bytes in chunks.
+2. Parsed rows are emitted into a queue.
+3. Consumer-side APIs expose rows and fields lazily.
+
+Two independent parser paths exist and must be kept behaviorally aligned:
+
+- File constructor path: memory-mapped parser
+- iostream constructor path: stream parser
+
+## 2. Major Components
+
+### API-facing core
+
+- CSVReader
+  - Orchestrates parser lifecycle, worker cycle, and row retrieval.
+  - Holds parser, queue, format, and exception propagation state.
+
+- CSVRow
+  - Lightweight row view over shared chunk data.
+  - Resolves field slices and supports index/name access.
+
+- CSVField
+  - Field-level typed conversion facade.
+  - Defers conversion work until requested.
+
+- CSVFormat
+  - Parse configuration (delimiter/quote/trim/header/chunk size/policies).
+
+### Parsing core
+
+- IBasicCSVParser
+  - Shared parse loop and field/row state machine.
+
+- MmapParser
+  - Reads chunks from memory maps and handles chunk-transition remainder.
+
+- StreamParser
+  - Reads chunks from stream sources.
+
+### Internal storage and transport
+
+- RawCSVData
+  - Shared chunk payload and per-chunk parse metadata.
+
+- RawCSVFieldList
+  - Compact field metadata storage (start/length/quote flags).
+
+- ThreadSafeDeque<CSVRow>
+  - Parser-to-consumer transport queue.
+  - Synchronization protocol is documented in THREADSAFE_DEQUE_DESIGN.md.
+
+### Relationship diagrams
+
+Parser hierarchy:
+
+```text
+          +------------------+
+          | IBasicCSVParser  |
+          | (abstract base)  |
+          +---------+--------+
+                    ^
+         +----------+----------+
+         |                     |
++--------+--------+    +-------+--------+
+| MmapParser      |    | StreamParser   |
+| concrete parser |    | concrete parser|
++-----------------+    +----------------+
+```
+
+Reader + row/data ownership:
+
+```text
+CSVReader
+  -> parser->next() builds RawCSVData chunk
+  -> emits CSVRow objects into ThreadSafeDeque
+
+   +--------------------------+
+   | RawCSVData               |
+   | - _data: shared_ptr<void>|
+   | - data: string_view      |
+   | - fields: RawCSVFieldList|
+   +------------+-------------+
+                ^
+                | shared_ptr<RawCSVData>
+    +-----------+-----------+-----------+
+    |                       |           |
+CSVRow #1               CSVRow #2   CSVRow #N
+```
+
+Notes:
+- Multiple CSVRow instances can share the same RawCSVData chunk.
+- RawCSVData lifetime extends until the last referencing CSVRow is destroyed.
+- RawCSVFieldList is contained inside RawCSVData and indexes slices into the backing data payload.
+
+CSVRow -> CSVField lazy materialization:
+
+```text
+RawCSVData
+  |- data (chunk bytes)
+  |- fields[i] = {start, length, has_double_quote}
+  v
+CSVRow::get_field_impl(i)
+  -> slice = data.substr(start, length)
+  -> if quoted: unescape/cached materialization
+  -> if trim enabled: apply trim at access time
+  v
+CSVField(string_view)
+  -> typed conversion only when get<T>() / try_get<T>() is called
+```
+
+Implication:
+- Parser throughput stays focused on boundary detection and row emission; expensive string work is deferred until fields are actually accessed.
+
+## 3. End-to-End Flow
+
+Source bytes -> parser chunk read -> parse loop -> RawCSVData + RawCSVFieldList -> CSVRow enqueue -> CSVReader read_row / iteration -> CSVField materialization
+
+Operationally:
+
+1. CSVReader starts a read cycle with current chunk size.
+2. Parser next(bytes) ingests one chunk and emits complete rows.
+3. Queue buffers rows for consumer-side retrieval.
+4. CSVRow/CSVField lazily materialize trim/unescape/conversion behavior.
+5. Worker completion and errors are signaled back to the consumer side.
+
+## 4. Key Invariants
+
+### Chunk boundary integrity
+
+Fields spanning chunk boundaries must not be split/corrupted.
+
+### Path parity
+
+Mmap and stream parsers must preserve the same externally observable behavior.
+
+### Lazy materialization contract
+
+Trimming/unescaping/conversion behavior must remain coherent across parser and field-access layers.
+
+### Bounded streaming semantics
+
+Avoid designs that force retaining all parsed chunks globally.
+
+## 5. Change Impact Map
+
+- Parser state machine changes:
+  - basic_csv_parser.hpp, basic_csv_parser.cpp
+
+- Chunk transition changes:
+  - basic_csv_parser.cpp (MmapParser/StreamParser next)
+
+- Reader worker/iteration behavior:
+  - csv_reader.hpp, csv_reader.cpp, csv_reader_iterator.cpp
+
+- Field extraction and trimming/unescaping:
+  - csv_row.hpp, csv_row.cpp, raw_csv_data.hpp
+
+- Parse configuration behavior:
+  - csv_format.hpp, csv_format.cpp
+
+- Queue synchronization semantics:
+  - thread_safe_deque.hpp, THREADSAFE_DEQUE_DESIGN.md
+
+## 6. Test Guidance by Subsystem
+
+For full testing strategy, checklist, and conventions, see:
+- ../../tests/AGENTS.md
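
Reviewer note: the ownership relationship described in the new document (several CSVRow views sharing one RawCSVData chunk, with the chunk living until the last row is destroyed) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the library's code: `Chunk`, `RowView`, `make_rows`, and `example` are hypothetical names, and storing field offsets relative to the row start is an assumption inferred from the `field_start` arithmetic elsewhere in this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for RawCSVData: owns the chunk bytes and field metadata.
struct Chunk {
    struct Field { std::size_t start; std::size_t length; };
    std::string bytes;          // backing payload for every row in the chunk
    std::vector<Field> fields;  // slices into `bytes`, relative to each row's start
};

// Hypothetical stand-in for CSVRow: a lightweight view that keeps its chunk alive.
struct RowView {
    std::shared_ptr<const Chunk> chunk;
    std::size_t row_start;    // offset of this row within chunk->bytes
    std::size_t first_field;  // index of this row's first entry in chunk->fields
    std::size_t field_count;

    // Resolve a field lazily: no copy is made, only a slice of the shared bytes.
    std::string_view field(std::size_t i) const {
        const Chunk::Field& f = chunk->fields.at(first_field + i);
        return std::string_view(chunk->bytes).substr(row_start + f.start, f.length);
    }
};

// Build two rows that share one chunk holding "a,bb\nccc,d\n".
std::vector<RowView> make_rows() {
    auto chunk = std::make_shared<Chunk>();
    chunk->bytes = "a,bb\nccc,d\n";
    chunk->fields = std::vector<Chunk::Field>{{0, 1}, {2, 2}, {0, 3}, {4, 1}};
    return {RowView{chunk, 0, 0, 2}, RowView{chunk, 5, 2, 2}};
}

// Usage: both rows read from the same chunk, which has exactly two owners.
void example() {
    std::vector<RowView> rows = make_rows();
    assert(rows[0].field(1) == "bb");
    assert(rows[1].field(0) == "ccc");
    assert(rows[0].chunk.use_count() == 2);
}
```

Because each row holds a `shared_ptr`, dropping consumed rows releases the chunk, which fits the bounded streaming invariant listed in section 4.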

include/internal/THREADSAFE_DEQUE_DESIGN.md

Lines changed: 13 additions & 13 deletions
@@ -42,7 +42,7 @@ The producer is intentionally not always running.

 ### 1. **Early Guard Check in `wait()`**

-**Location:** [thread_safe_deque.hpp:85](thread_safe_deque.hpp#L85)
+**Location:** [thread_safe_deque.hpp](thread_safe_deque.hpp)

 ```cpp
 void wait() {
@@ -167,7 +167,7 @@ The atomic store has memory ordering but does **not** prevent the lost-wakeup wi

 ## Batching Predicate: The `notify_size` Parameter

-**Location:** [thread_safe_deque.hpp:27](thread_safe_deque.hpp#L27)
+**Location:** [thread_safe_deque.hpp](thread_safe_deque.hpp)

 ```cpp
 ThreadSafeDeque(size_t notify_size = 100) : _notify_size(notify_size) {}
@@ -247,17 +247,17 @@ t=5: Consumer: pop_front() → returns row1

 | Component | Where | What |
 |-----------|-------|------|
-| **Consumer** | [csv_reader.cpp:310](../csv_reader.cpp#L310) | `read_row()` main thread |
-| **Consumer** | [csv_reader.cpp:313-315](../csv_reader.cpp#L313) | Check empty, is_waitable, call wait() |
-| **Producer** | [csv_reader.cpp:261](../csv_reader.cpp#L261) | `read_csv()` worker thread |
-| **Producer** | [csv_reader.cpp:274](../csv_reader.cpp#L274) | notify_all() at start |
-| **Producer** | [csv_reader.cpp:278](../csv_reader.cpp#L278) | parser->next() pushes rows |
-| **Producer** | [csv_reader.cpp:291](../csv_reader.cpp#L291) | kill_all() at end |
-| **Queue** | [thread_safe_deque.hpp:57](thread_safe_deque.hpp#L57) | push_back() producer |
-| **Queue** | [thread_safe_deque.hpp:67](thread_safe_deque.hpp#L67) | pop_front() consumer |
-| **Queue** | [thread_safe_deque.hpp:84](thread_safe_deque.hpp#L84) | wait() – consumer |
-| **Queue** | [thread_safe_deque.hpp:108](thread_safe_deque.hpp#L108) | notify_all() – producer |
-| **Queue** | [thread_safe_deque.hpp:115](thread_safe_deque.hpp#L115) | kill_all() – producer |
+| **Consumer** | [csv_reader.cpp](../csv_reader.cpp) | `read_row()` flow on main thread |
+| **Consumer** | [csv_reader.cpp](../csv_reader.cpp) | Empty/is_waitable checks and wait() call |
+| **Producer** | [csv_reader.cpp](../csv_reader.cpp) | `read_csv()` worker lifecycle |
+| **Producer** | [csv_reader.cpp](../csv_reader.cpp) | notify_all() at cycle start |
+| **Producer** | [csv_reader.cpp](../csv_reader.cpp) | parser->next() push cycle |
+| **Producer** | [csv_reader.cpp](../csv_reader.cpp) | kill_all() terminal signal |
+| **Queue** | [thread_safe_deque.hpp](thread_safe_deque.hpp) | push_back() producer path |
+| **Queue** | [thread_safe_deque.hpp](thread_safe_deque.hpp) | pop_front() consumer path |
+| **Queue** | [thread_safe_deque.hpp](thread_safe_deque.hpp) | wait() condition protocol |
+| **Queue** | [thread_safe_deque.hpp](thread_safe_deque.hpp) | notify_all() wake-up state |
+| **Queue** | [thread_safe_deque.hpp](thread_safe_deque.hpp) | kill_all() terminal publication |

 ---

include/internal/basic_csv_parser.cpp

Lines changed: 11 additions & 23 deletions
@@ -74,6 +74,7 @@ namespace csv {
         _ws_flags = internals::make_ws_flags(
             format.trim_chars.data(), format.trim_chars.size()
         );
+        _has_ws_trimming = !format.trim_chars.empty();
     }

     CSV_INLINE void IBasicCSVParser::end_feed() {
@@ -99,10 +100,6 @@ namespace csv {
         using internals::ParseFlags;
         auto& in = this->data_ptr->data;

-        // Trim off leading whitespace
-        while (data_pos < in.size() && ws_flag(in[data_pos]))
-            data_pos++;
-
         if (field_start == UNINITIALIZED_FIELD)
             field_start = (int)(data_pos - current_row_start());

@@ -122,34 +119,23 @@ namespace csv {

         field_length = data_pos - (field_start + current_row_start());

-        // Trim off trailing whitespace, this->field_length constraint matters
-        // when field is entirely whitespace
-        for (size_t j = data_pos - 1; ws_flag(in[j]) && this->field_length > 0; j--)
-            this->field_length--;
+        // Whitespace trimming is deferred to get_field_impl() so callers that never
+        // read field values (e.g. row counting) pay no trimming cost.
     }

     CSV_INLINE void IBasicCSVParser::push_field()
     {
         // Update
-        if (field_has_double_quote) {
-            fields->emplace_back(
-                field_start == UNINITIALIZED_FIELD ? 0 : (unsigned int)field_start,
-                field_length,
-                true
-            );
-            field_has_double_quote = false;
-
-        }
-        else {
-            fields->emplace_back(
-                field_start == UNINITIALIZED_FIELD ? 0 : (unsigned int)field_start,
-                field_length
-            );
-        }
+        fields->emplace_back(
+            field_start == UNINITIALIZED_FIELD ? 0 : (unsigned int)field_start,
+            field_length,
+            field_has_double_quote
+        );

         current_row.row_length++;

         // Reset field state
+        field_has_double_quote = false;
         field_start = UNINITIALIZED_FIELD;
         field_length = 0;
     }
@@ -245,6 +231,8 @@ namespace csv {
     CSV_INLINE void IBasicCSVParser::reset_data_ptr() {
         this->data_ptr = std::make_shared<RawCSVData>();
         this->data_ptr->parse_flags = this->_parse_flags;
+        this->data_ptr->ws_flags = this->_ws_flags;
+        this->data_ptr->has_ws_trimming = this->_has_ws_trimming;
         this->data_ptr->col_names = this->_col_names;
         this->fields = &(this->data_ptr->fields);
     }
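
Reviewer note: this file's hunks remove the parse-time trim loops and copy `ws_flags` and `has_ws_trimming` onto each chunk, so trimming can happen when a field is read. A rough sketch of what access-time trimming could look like follows. It is an illustration only: `WsFlags`, `trim_on_access`, and `example_trim` are hypothetical names, the 256-entry flag table is an assumed layout, and the actual `get_field_impl()` is not shown in this diff.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumed layout: one flag per possible byte value, true if that byte is trimmed.
using WsFlags = std::array<bool, 256>;

// Hypothetical helper: trim a raw field slice only when the field is read.
std::string_view trim_on_access(std::string_view raw,
                                const WsFlags& ws_flags,
                                bool has_ws_trimming) {
    // Fast path for the common case where no trim characters are configured.
    if (!has_ws_trimming) return raw;

    std::size_t begin = 0;
    std::size_t end = raw.size();
    // Advance past leading trim characters.
    while (begin < end && ws_flags[static_cast<unsigned char>(raw[begin])]) ++begin;
    // Step back over trailing trim characters; stops at `begin` for all-whitespace input.
    while (end > begin && ws_flags[static_cast<unsigned char>(raw[end - 1])]) --end;
    return raw.substr(begin, end - begin);
}

// Usage: trim spaces and tabs, and leave the slice untouched when trimming is off.
void example_trim() {
    WsFlags flags{};
    flags[' '] = true;
    flags['\t'] = true;
    assert(trim_on_access("  ab \t", flags, true) == "ab");
    assert(trim_on_access("   ", flags, true).empty());
    assert(trim_on_access(" ab ", flags, false) == " ab ");
}
```

The `end > begin` bound plays the role of the `field_length > 0` guard in the removed loop: a field made entirely of trim characters collapses to an empty slice instead of reading before its start.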

include/internal/basic_csv_parser.hpp

Lines changed: 7 additions & 1 deletion
@@ -123,6 +123,7 @@ namespace csv {
         ) : _parse_flags(parse_flags), _ws_flags(ws_flags) {
             const char d = internals::infer_delimiter(parse_flags);
             _simd_sentinels = SentinelVecs(d, internals::infer_quote_char(parse_flags, d));
+            _has_ws_trimming = std::any_of(ws_flags.begin(), ws_flags.end(), [](bool b) { return b; });
         }

         virtual ~IBasicCSVParser() {}
@@ -155,7 +156,7 @@ namespace csv {
         CSVRow current_row;
         RawCSVDataPtr data_ptr = nullptr;
         ColNamesPtr _col_names = nullptr;
-        CSVFieldList* fields = nullptr;
+        RawCSVFieldList* fields = nullptr;
         int field_start = UNINITIALIZED_FIELD;
         size_t field_length = 0;

@@ -190,6 +191,11 @@ namespace csv {
          * be trimmed
          */
         WhitespaceMap _ws_flags;
+
+        /** True when at least one whitespace trim character is configured.
+         * Used to skip trim loops entirely in the common no-trim case.
+         */
+        bool _has_ws_trimming = false;
         bool quote_escape = false;
         bool field_has_double_quote = false;
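
Reviewer note: `_has_ws_trimming` is set in two ways in this commit, from `!format.trim_chars.empty()` in the .cpp and from `std::any_of` over the flag table in this constructor. The sketch below shows why the two should agree when the table is built from the trim characters. It assumes `WhitespaceMap` behaves like a 256-entry array of `bool`; `WhitespaceFlags`, `make_flags`, `any_flag_set`, and `example_flags` are hypothetical names, not the library's `make_ws_flags`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Assumed shape of WhitespaceMap: one bool per byte value.
using WhitespaceFlags = std::array<bool, 256>;

// Hypothetical builder: mark every configured trim character in the table.
WhitespaceFlags make_flags(const std::string& trim_chars) {
    WhitespaceFlags flags{};  // value-initialized, so every entry starts false
    for (char c : trim_chars) flags[static_cast<unsigned char>(c)] = true;
    return flags;
}

// Same predicate as the constructor in this hunk: is any flag set?
bool any_flag_set(const WhitespaceFlags& flags) {
    return std::any_of(flags.begin(), flags.end(), [](bool b) { return b; });
}

// Usage: a non-empty trim set yields true, an empty one yields false.
void example_flags() {
    assert(any_flag_set(make_flags(" \t")));
    assert(!any_flag_set(make_flags("")));
}
```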

include/internal/common.hpp

Lines changed: 0 additions & 10 deletions
@@ -24,8 +24,6 @@
  * in the single header version
  */
 #define CSV_INLINE
-
-#pragma once
 #include <type_traits>

 #if defined(__EMSCRIPTEN__)
@@ -57,14 +55,6 @@
 #define CSV_NON_NULL(...)
 #endif

-#if defined(__GNUC__) || defined(__clang__)
-#define CSV_UNREACHABLE() __builtin_unreachable()
-#elif defined(_MSC_VER)
-#define CSV_UNREACHABLE() __assume(0)
-#else
-#define CSV_UNREACHABLE() abort()
-#endif
-
 // This library uses C++ exceptions for error reporting in public APIs.
 #if defined(__cpp_exceptions) || defined(_CPPUNWIND) || defined(__EXCEPTIONS)
 #define CSV_EXCEPTIONS_ENABLED 1
