| 1 | +# Internal Architecture |
| 2 | + |
| 3 | +This document describes the high-level architecture of the CSV parser internals and how major classes interact. |
| 4 | + |
| 5 | +Scope: |
| 6 | +- Internal component responsibilities |
| 7 | +- End-to-end data flow |
| 8 | +- Core invariants for safe changes |
| 9 | +- Where to add tests for subsystem changes |
| 10 | + |
| 11 | +For queue synchronization protocol details, see THREADSAFE_DEQUE_DESIGN.md. |
| 12 | +For AI-agent workflow and guardrails, see ../../AGENTS.md and ../../tests/AGENTS.md. |
| 13 | + |
| 14 | +## 1. System Shape |
| 15 | + |
| 16 | +The parser is a streaming producer-consumer pipeline: |
| 17 | + |
| 18 | +1. A parser implementation reads source bytes in chunks. |
| 19 | +2. Parsed rows are emitted into a queue. |
| 20 | +3. Consumer-side APIs expose rows and fields lazily. |
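
The three steps above can be sketched as a worker thread feeding a closable queue. This is a minimal illustration under stated assumptions, not the library's implementation: `RowQueue` and `run_pipeline` are hypothetical names, rows are plain strings split on `'\n'`, and the real synchronization protocol is the one documented in THREADSAFE_DEQUE_DESIGN.md.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the row queue; not the library's ThreadSafeDeque.
class RowQueue {
public:
    void push(std::string row) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);  // pushing after close() is a producer bug
            rows_.push_back(std::move(row));
        }
        ready_.notify_one();
    }

    // The producer calls this once when no more rows will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until a row is available; returns false once closed and drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !rows_.empty() || closed_; });
        if (rows_.empty()) return false;
        out = std::move(rows_.front());
        rows_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> rows_;
    bool closed_ = false;
};

// The producer splits the input on '\n' on a worker thread; the caller drains rows.
std::vector<std::string> run_pipeline(const std::string& source) {
    RowQueue queue;
    std::thread producer([&] {
        std::size_t begin = 0;
        while (begin < source.size()) {
            std::size_t end = source.find('\n', begin);
            if (end == std::string::npos) end = source.size();
            queue.push(source.substr(begin, end - begin));
            begin = end + 1;
        }
        queue.close();
    });

    std::vector<std::string> rows;
    std::string row;
    while (queue.pop(row)) rows.push_back(row);
    producer.join();
    return rows;
}
```

The consumer never sees the producer's internal state; it only observes rows and the closed signal, which is what lets rows be exposed lazily.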
| 21 | + |
| 22 | +Two independent parser paths exist and must be kept behaviorally aligned: |
| 23 | + |
| 24 | +- File constructor path: memory-mapped parser |
| 25 | +- iostream constructor path: stream parser |
| 26 | + |
| 27 | +## 2. Major Components |
| 28 | + |
| 29 | +### API-facing core |
| 30 | + |
| 31 | +- CSVReader |
| 32 | + - Orchestrates parser lifecycle, worker cycle, and row retrieval. |
| 33 | + - Holds parser, queue, format, and exception propagation state. |
| 34 | + |
| 35 | +- CSVRow |
| 36 | + - Lightweight row view over shared chunk data. |
| 37 | + - Resolves field slices and supports index/name access. |
| 38 | + |
| 39 | +- CSVField |
| 40 | + - Field-level typed conversion facade. |
| 41 | + - Defers conversion work until requested. |
| 42 | + |
| 43 | +- CSVFormat |
| 44 | + - Parse configuration (delimiter/quote/trim/header/chunk size/policies). |
| 45 | + |
| 46 | +### Parsing core |
| 47 | + |
| 48 | +- IBasicCSVParser |
| 49 | + - Shared parse loop and field/row state machine. |
| 50 | + |
| 51 | +- MmapParser |
| 52 | + - Reads chunks from memory maps and handles chunk-transition remainder. |
| 53 | + |
| 54 | +- StreamParser |
| 55 | + - Reads chunks from stream sources. |
| 56 | + |
| 57 | +### Internal storage and transport |
| 58 | + |
| 59 | +- RawCSVData |
| 60 | + - Shared chunk payload and per-chunk parse metadata. |
| 61 | + |
| 62 | +- RawCSVFieldList |
| 63 | + - Compact field metadata storage (start/length/quote flags). |
| 64 | + |
| 65 | +- ThreadSafeDeque<CSVRow> |
| 66 | + - Parser-to-consumer transport queue. |
| 67 | + - Synchronization protocol is documented in THREADSAFE_DEQUE_DESIGN.md. |
| 68 | + |
| 69 | +### Relationship diagrams |
| 70 | + |
| 71 | +Parser hierarchy: |
| 72 | + |
| 73 | +```text |
| 74 | + +------------------+ |
| 75 | + | IBasicCSVParser | |
| 76 | + | (abstract base) | |
| 77 | + +---------+--------+ |
| 78 | + ^ |
| 79 | + +----------+----------+ |
| 80 | + | | |
| 81 | + +--------+--------+ +-------+--------+ |
| 82 | + | MmapParser | | StreamParser | |
| 83 | + | concrete parser | | concrete parser| |
| 84 | + +-----------------+ +----------------+ |
| 85 | +``` |
| 86 | + |
| 87 | +Reader + row/data ownership: |
| 88 | + |
| 89 | +```text |
| 90 | +CSVReader |
| 91 | + -> parser->next() builds RawCSVData chunk |
| 92 | + -> emits CSVRow objects into ThreadSafeDeque |
| 93 | + |
| 94 | + +--------------------------+ |
| 95 | + | RawCSVData | |
| 96 | + | - _data: shared_ptr<void>| |
| 97 | + | - data: string_view | |
| 98 | + | - fields: RawCSVFieldList| |
| 99 | + +------------+-------------+ |
| 100 | + ^ |
| 101 | + | shared_ptr<RawCSVData> |
| 102 | + +-----------+-----------+-----------+ |
| 103 | + | | | |
| 104 | + CSVRow #1 CSVRow #2 CSVRow #N |
| 105 | +``` |
| 106 | + |
| 107 | +Notes: |
| 108 | +- Multiple CSVRow instances can share the same RawCSVData chunk. |
| 109 | +- RawCSVData lifetime extends until the last referencing CSVRow is destroyed. |
| 110 | +- RawCSVFieldList is contained inside RawCSVData and indexes slices into the backing data payload. |
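
The ownership notes above can be illustrated with a small sketch. `ChunkSketch`, `RowSketch`, and `split_rows` are hypothetical, simplified stand-ins (the real RawCSVData and CSVRow carry more state); the point is only that each row view holds a `shared_ptr` to the chunk, so the payload is released when the last row goes away.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified chunk: owns the payload bytes.
struct ChunkSketch {
    std::string bytes;
};

// Hypothetical row view: a slice plus shared ownership of the chunk.
struct RowSketch {
    std::shared_ptr<ChunkSketch> chunk;  // keeps the payload alive
    std::size_t start;
    std::size_t length;

    std::string_view text() const {
        assert(start + length <= chunk->bytes.size());
        return std::string_view(chunk->bytes).substr(start, length);
    }
};

// Build one row view per '\n'-separated line, all sharing a single chunk.
std::vector<RowSketch> split_rows(std::shared_ptr<ChunkSketch> chunk) {
    std::vector<RowSketch> rows;
    const std::string& bytes = chunk->bytes;
    std::size_t begin = 0;
    while (begin < bytes.size()) {
        std::size_t end = bytes.find('\n', begin);
        if (end == std::string::npos) end = bytes.size();
        rows.push_back(RowSketch{chunk, begin, end - begin});
        begin = end + 1;
    }
    return rows;
}
```

A `std::weak_ptr` to the chunk is a convenient way to check in a test that the payload is freed once all rows referencing it are destroyed.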
| 111 | + |
| 112 | +CSVRow -> CSVField lazy materialization: |
| 113 | + |
| 114 | +```text |
| 115 | +RawCSVData |
| 116 | + |- data (chunk bytes) |
| 117 | + |- fields[i] = {start, length, has_double_quote} |
| 118 | + v |
| 119 | +CSVRow::get_field_impl(i) |
| 120 | + -> slice = data.substr(start, length) |
| 121 | + -> if quoted: unescape/cached materialization |
| 122 | + -> if trim enabled: apply trim at access time |
| 123 | + v |
| 124 | +CSVField(string_view) |
| 125 | + -> typed conversion only when get<T>() / try_get<T>() is called |
| 126 | +``` |
| 127 | + |
| 128 | +Implication: |
| 129 | +- Parser throughput stays focused on boundary detection and row emission; expensive string work is deferred until fields are actually accessed. |
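
A minimal sketch of the deferred work described above, assuming C++17. `FieldMeta`, `materialize`, `trim`, and `try_get_int` are hypothetical names chosen for illustration; they mirror the `{start, length, has_double_quote}` triple in the diagram but are not the library's functions.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical field record mirroring {start, length, has_double_quote}.
struct FieldMeta {
    std::size_t start;
    std::size_t length;
    bool has_double_quote;
};

// Slice the field out of the chunk; collapse "" to " only when flagged.
std::string materialize(std::string_view chunk, const FieldMeta& meta) {
    assert(meta.start + meta.length <= chunk.size());
    std::string_view slice = chunk.substr(meta.start, meta.length);
    if (!meta.has_double_quote) return std::string(slice);

    std::string out;
    out.reserve(slice.size());
    for (std::size_t i = 0; i < slice.size(); ++i) {
        out.push_back(slice[i]);
        // Skip the second quote of an escaped pair.
        if (slice[i] == '"' && i + 1 < slice.size() && slice[i + 1] == '"') ++i;
    }
    return out;
}

// Trim is applied at access time, not during parsing.
std::string_view trim(std::string_view text, std::string_view trim_chars = " \t") {
    const std::size_t first = text.find_first_not_of(trim_chars);
    if (first == std::string_view::npos) return {};
    const std::size_t last = text.find_last_not_of(trim_chars);
    return text.substr(first, last - first + 1);
}

// Typed conversion happens only when the caller asks for it.
std::optional<int> try_get_int(std::string_view text) {
    int value = 0;
    const char* end = text.data() + text.size();
    auto [ptr, ec] = std::from_chars(text.data(), end, value);
    if (ec != std::errc() || ptr != end) return std::nullopt;
    return value;
}
```

In this sketch the parser side only has to record three small values per field; the quote collapsing, trimming, and number parsing cost is paid by the code that reads the field.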
| 130 | + |
| 131 | +## 3. End-to-End Flow |
| 132 | + |
| 133 | +Source bytes -> parser chunk read -> parse loop -> RawCSVData + RawCSVFieldList -> CSVRow enqueue -> CSVReader read_row / iteration -> CSVField materialization |
| 134 | + |
| 135 | +Operationally: |
| 136 | + |
| 137 | +1. CSVReader starts a read cycle with the current chunk size. |
| 138 | +2. The parser's next(bytes) call ingests one chunk and emits every complete row. |
| 139 | +3. The queue buffers rows for consumer-side retrieval. |
| 140 | +4. CSVRow and CSVField apply trimming, unescaping, and type conversion lazily, at access time. |
| 141 | +5. Worker completion and errors are signaled back to the consumer side. |
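
Step 5 can be sketched with `std::exception_ptr`, which is one standard way to carry a worker-thread failure back to the consumer. `run_and_rethrow` is a hypothetical helper written for this illustration; how CSVReader actually stores and rethrows its exception state is not shown here.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: run `work` on a worker thread and rethrow any
// failure on the calling (consumer) thread.
template <typename Work>
void run_and_rethrow(Work work) {
    std::exception_ptr failure;
    std::thread worker([&] {
        try {
            work();
        } catch (...) {
            failure = std::current_exception();  // capture, do not escape the thread
        }
    });
    worker.join();                // join orders the write above before the read below
    assert(!worker.joinable());   // the worker has finished
    if (failure) std::rethrow_exception(failure);
}
```

An exception that escapes a thread's entry function calls `std::terminate`, so capturing it on the worker and rethrowing on the consumer side is what keeps parse errors catchable by the caller.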
| 142 | + |
| 143 | +## 4. Key Invariants |
| 144 | + |
| 145 | +### Chunk boundary integrity |
| 146 | + |
| 147 | +A field that spans a chunk boundary must not be split or corrupted. |
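
One common way to uphold this invariant is to carry the incomplete tail of a chunk into the next one. The sketch below assumes C++17 and uses hypothetical names (`RowAssembler`, `parse_in_chunks`); it deliberately ignores quoting, so a newline inside a quoted field would be mishandled here, whereas a real parser must track quote state across the boundary as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical chunk feeder: emits only complete rows and carries the tail forward.
class RowAssembler {
public:
    // Feed the next chunk; complete '\n'-terminated rows are appended to `rows`.
    void feed(std::string_view chunk, std::vector<std::string>& rows) {
        remainder_.append(chunk);
        std::size_t begin = 0;
        for (;;) {
            const std::size_t end = remainder_.find('\n', begin);
            if (end == std::string::npos) break;
            rows.push_back(remainder_.substr(begin, end - begin));
            begin = end + 1;
        }
        remainder_.erase(0, begin);  // keep the partial row for the next chunk
    }

    // Call at end of input to flush a final row that lacks a trailing newline.
    void finish(std::vector<std::string>& rows) {
        if (!remainder_.empty()) rows.push_back(remainder_);
        remainder_.clear();
    }

private:
    std::string remainder_;
};

// Parse `source` as if it arrived in pieces of `chunk_size` bytes.
std::vector<std::string> parse_in_chunks(std::string_view source, std::size_t chunk_size) {
    assert(chunk_size > 0);
    RowAssembler assembler;
    std::vector<std::string> rows;
    for (std::size_t offset = 0; offset < source.size(); offset += chunk_size) {
        assembler.feed(source.substr(offset, chunk_size), rows);
    }
    assembler.finish(rows);
    return rows;
}
```

A useful regression test for this invariant is to parse the same input at every chunk size from one byte up to the full length and require identical rows each time.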
| 148 | + |
| 149 | +### Path parity |
| 150 | + |
| 151 | +Mmap and stream parsers must preserve the same externally observable behavior. |
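
Parity is easiest to protect with a check that feeds identical bytes through both paths and compares the results. The sketch below assumes C++17 and uses toy stand-ins: `rows_from_buffer` and `rows_from_stream` are hypothetical functions that only split on `'\n'`, standing in for the memory-mapped and stream parsers respectively.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Path A (hypothetical): split an in-memory buffer, as a memory-mapped reader might.
std::vector<std::string> rows_from_buffer(std::string_view buffer) {
    std::vector<std::string> rows;
    std::size_t begin = 0;
    while (begin < buffer.size()) {
        std::size_t end = buffer.find('\n', begin);
        if (end == std::string_view::npos) end = buffer.size();
        rows.emplace_back(buffer.substr(begin, end - begin));
        begin = end + 1;
    }
    return rows;
}

// Path B (hypothetical): read the same bytes through an istream.
std::vector<std::string> rows_from_stream(std::istream& input) {
    std::vector<std::string> rows;
    for (std::string line; std::getline(input, line);) rows.push_back(line);
    return rows;
}

// True when both paths produce the same rows for the same bytes.
bool paths_agree(const std::string& source) {
    std::istringstream stream(source);
    return rows_from_buffer(source) == rows_from_stream(stream);
}

// Convenience wrapper for use in tests.
void require_parity(const std::string& source) {
    assert(paths_agree(source));
}
```

Inputs worth including in such a check are those where the two paths are most likely to drift apart: a missing trailing newline, empty rows, and empty input.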
| 152 | + |
| 153 | +### Lazy materialization contract |
| 154 | + |
| 155 | +Trimming, unescaping, and conversion behavior must remain consistent across the parser and field-access layers. |
| 156 | + |
| 157 | +### Bounded streaming semantics |
| 158 | + |
| 159 | +Avoid designs that require every parsed chunk to stay in memory for the whole read. |
| 160 | + |
| 161 | +## 5. Change Impact Map |
| 162 | + |
| 163 | +- Parser state machine changes: |
| 164 | + - basic_csv_parser.hpp, basic_csv_parser.cpp |
| 165 | + |
| 166 | +- Chunk transition changes: |
| 167 | + - basic_csv_parser.cpp (MmapParser/StreamParser next) |
| 168 | + |
| 169 | +- Reader worker/iteration behavior: |
| 170 | + - csv_reader.hpp, csv_reader.cpp, csv_reader_iterator.cpp |
| 171 | + |
| 172 | +- Field extraction and trimming/unescaping: |
| 173 | + - csv_row.hpp, csv_row.cpp, raw_csv_data.hpp |
| 174 | + |
| 175 | +- Parse configuration behavior: |
| 176 | + - csv_format.hpp, csv_format.cpp |
| 177 | + |
| 178 | +- Queue synchronization semantics: |
| 179 | + - thread_safe_deque.hpp, THREADSAFE_DEQUE_DESIGN.md |
| 180 | + |
| 181 | +## 6. Test Guidance by Subsystem |
| 182 | + |
| 183 | +For full testing strategy, checklist, and conventions, see: |
| 184 | +- ../../tests/AGENTS.md |