AGENTS.md
Lines changed: 9 additions & 1 deletion
@@ -74,7 +74,15 @@ ThreadSafeDeque<CSVRow>
74
74
4. **Don't ignore async:** Worker thread means exceptions must use `exception_ptr`.
75
75
5. **Don't change one constructor:** Likely affects both mmap and stream paths.
76
76
6. **Don't delete or simplify comments** unless they are trivially obvious (e.g. `// increment i`) or factually incorrect. Comments in this codebase frequently encode concurrency invariants, non-obvious design decisions, and hard-won bug context that cannot be recovered from the code alone.
77
+
7. **Compatibility macros defined in `common.hpp` MUST be referenced only after including `common.hpp`.** Any macro (such as `CSV_HAS_CXX20`) that is defined in `common.hpp` must not be used or checked before `#include "common.hpp"` appears in the file. This ensures feature detection and conditional compilation work as intended across all supported compilers and build modes.
77
78
7. **Don't reference internal functions in public API comments.** Public API docs should describe user-visible behavior and contracts; internal helper/function details belong in internal docs.
78
-
8. **`CSVReader` is non-copyable and non-movable.** The preferred idiom for sharing or transferring a reader is `std::unique_ptr<CSVReader>`. Document this at any API surface that might tempt callers to copy or move.
79
+
8. **`CSVReader` is non-copyable and move-enabled.** Prefer explicit ownership transfer (`std::move`) or `std::unique_ptr<CSVReader>` when sharing/handing off parser ownership across APIs.
80
+
9. **Prefer trailing underscore for private members** (for example `source_`, `leftover_`). When you touch code with mixed private-member naming styles, normalize the edited region toward trailing underscores instead of introducing more leading-underscore or unsuffixed names.
81
+
10. **Prefer user-friendly API constraints.** Do not narrow template constraints unless required for correctness, safety, or a measured performance win. If an implementation already handles common standard-library containers/ranges correctly, keep those inputs accepted instead of over-constraining APIs for aesthetic purity.
82
+
11. **Respect existing compile-time compatibility macros.** Keep `IF_CONSTEXPR`, `CONSTEXPR_VALUE`, and similar macros unless there is a correctness bug.
83
+
12. **Do not replace compile-time constructs with runtime control flow to silence warnings.** Prefer smallest scoped warning suppression at the exact site (for example, local `#pragma warning(push/pop)` on MSVC) over semantic rewrites.
84
+
13. **Opportunistic rewrites/refactors are allowed when they are safe and justified.** Keep them separated from build-fix urgency where possible, and avoid bundling unrelated churn with compiler triage unless explicitly requested.
85
+
14. **When proposing changes that affect compile-time behavior, explain the tradeoff clearly.** Call out any impact to codegen, performance, portability, and readability before applying the change.
86
+
15. **If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.** Provide a short justification before expanding further.
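As a rough illustration of rules 11 and 12, the sketch below keeps a compile-time branch intact and silences a "conditional expression is constant" warning only at the exact site. `DEMO_IF_CONSTEXPR` and `clamp_width` are hypothetical names made up for this example; the library's real `IF_CONSTEXPR` lives in `common.hpp` and may be defined differently.

```cpp
#include <cstddef>

// Hypothetical stand-in for a compatibility macro such as IF_CONSTEXPR.
// Assumption: the real macro in common.hpp may use different detection logic.
#if (defined(_MSVC_LANG) && _MSVC_LANG >= 201703L) || __cplusplus >= 201703L
    #define DEMO_IF_CONSTEXPR if constexpr
#else
    #define DEMO_IF_CONSTEXPR if
#endif

// Hypothetical helper: caps a compile-time width at 16.
template <std::size_t N>
std::size_t clamp_width() {
#if defined(_MSC_VER)
    // Suppress C4127 (conditional expression is constant) for this branch only.
    #pragma warning(push)
    #pragma warning(disable : 4127)
#endif
    DEMO_IF_CONSTEXPR (N > 16) {
        return 16;
    } else {
        return N;
    }
#if defined(_MSC_VER)
    #pragma warning(pop)
#endif
}
```

The branch stays a compile-time construct on C++17 and later, and the suppression cannot leak past the `pop`, so no semantic rewrite is needed to get a clean build.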
79
87
80
88
See `tests/AGENTS.md` for test strategy, checklist, and conventions.
ARCHITECTURE.md
Lines changed: 9 additions & 0 deletions
@@ -16,4 +16,13 @@ Notes:
16
16
- Internal architecture content lives under include/internal to stay close to implementation.
17
17
- Queue synchronization details are maintained only in THREADSAFE_DEQUE_DESIGN.md to avoid duplication.
18
18
- Public API comments should remain user-facing and avoid references to internal helper/function details.
19
+
- Private member naming should prefer trailing underscores; when editing mixed-style code, normalize the touched region toward that convention.
20
+
- Compatibility macros defined in `common.hpp` must only be referenced after including `common.hpp`. See AGENTS.md and CLAUDE.md for details.
21
+
- API constraints should be user-friendly: do not over-constrain templates unless needed for correctness, safety, or a measured performance win.
22
+
- `CSVReader` is intentionally non-copyable and move-enabled; use explicit ownership transfer patterns (`std::move`, `std::unique_ptr`) at API boundaries.
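The ownership-transfer note can be sketched with a small move-only type. `DemoReader`, `consume_by_value`, and `consume_by_pointer` are hypothetical names used only for illustration; they are not the library's `CSVReader` or any of its APIs.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a non-copyable, move-enabled parser type.
class DemoReader {
public:
    explicit DemoReader(std::string source) : source_(std::move(source)) {}

    DemoReader(const DemoReader&) = delete;             // copying is disabled
    DemoReader& operator=(const DemoReader&) = delete;
    DemoReader(DemoReader&&) = default;                 // moving is allowed
    DemoReader& operator=(DemoReader&&) = default;

    const std::string& source() const { return source_; }

private:
    std::string source_;  // trailing underscore for private members
};

static_assert(!std::is_copy_constructible<DemoReader>::value, "must not copy");
static_assert(std::is_move_constructible<DemoReader>::value, "must be movable");

// Ownership handed over by value: the caller has to write std::move explicitly.
inline std::string consume_by_value(DemoReader reader) {
    return reader.source();
}

// Ownership handed over through an owning pointer.
inline std::string consume_by_pointer(std::unique_ptr<DemoReader> reader) {
    return reader->source();
}
```

Either signature makes the hand-off visible at the call site, which is the point of asking for explicit transfer at API boundaries.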
CLAUDE.md
Lines changed: 9 additions & 1 deletion
@@ -36,8 +36,16 @@
36
36
- Exceptions from worker thread need `exception_ptr`
37
37
- Changes to one constructor likely affect both paths
38
38
-**Do not delete or simplify comments** unless trivially obvious or factually wrong — comments encode concurrency invariants and bug history
39
+
- **Compatibility macros defined in `common.hpp` MUST be referenced only after including `common.hpp`.** Any macro (such as `CSV_HAS_CXX20`) that is defined in `common.hpp` must not be used or checked before `#include "common.hpp"` appears in the file. This ensures feature detection and conditional compilation work as intended across all supported compilers and build modes.
39
40
-**Do not reference internal functions in public API comments** — public API docs should remain user-facing; internal details belong in internal docs
40
-
-**`CSVReader` is non-copyable and non-movable** — the preferred sharing/transfer idiom is `std::unique_ptr<CSVReader>`
41
+
-**`CSVReader` is non-copyable and move-enabled** — prefer explicit ownership transfer (`std::move`) or `std::unique_ptr<CSVReader>` when handing off parser ownership
42
+
-**Prefer trailing underscore for private members** — when touching mixed-style code, normalize the edited region toward names like `source_` and `leftover_`
43
+
-**Prefer user-friendly API constraints** — do not narrow template constraints unless required for correctness, safety, or a measured performance win; if common containers/ranges already work, keep them accepted
44
+
-**Respect compile-time compatibility macros** — keep constructs like `IF_CONSTEXPR` and `CONSTEXPR_VALUE` unless there is a correctness bug
45
+
-**Do not rewrite compile-time logic to silence warnings** — prefer tightly scoped suppression at the exact site when needed
46
+
-**Opportunistic rewrites are allowed when safe and justified** — avoid mixing unrelated churn into urgent compiler triage unless requested
47
+
-**Explain compile-time tradeoffs explicitly** — when a change affects compile-time behavior, call out impact on codegen/perf/portability/readability
48
+
-**Scope guard for build fixes** — if a fix grows beyond roughly 3 files or 60 changed lines, pause and confirm scope with justification
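The first bullet in this list, about worker-thread exceptions, can be illustrated with a minimal sketch of the `std::exception_ptr` hand-off. `run_on_worker` and `worker_error_message` are hypothetical helpers written for this example and do not reflect how the library's parsing thread is actually structured.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs `work` on a worker thread and rethrows any
// exception it raised on the calling thread.
template <typename Fn>
void run_on_worker(Fn work) {
    std::exception_ptr error;  // stays null if the worker succeeds
    std::thread worker([&]() {
        try {
            work();
        } catch (...) {
            // An exception escaping a thread calls std::terminate,
            // so capture it instead of letting it propagate.
            error = std::current_exception();
        }
    });
    worker.join();  // join() synchronizes, so reading `error` afterwards is safe
    if (error) {
        std::rethrow_exception(error);
    }
}

// Usage: the caller sees the worker's exception as if it were thrown locally.
inline std::string worker_error_message() {
    try {
        run_on_worker([] { throw std::runtime_error("bad row"); });
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "";
}
```

On some toolchains a program using `std::thread` needs to be linked with the platform's thread library (for example `-pthread`).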
41
49
42
50
## Tests
43
51
See `tests/AGENTS.md` for full test strategy, checklist, and conventions.
README.md
Lines changed: 42 additions & 64 deletions
@@ -22,7 +22,7 @@
22
22
-[Avoid cloning with FetchContent](#avoid-cloning-with-fetchcontent)
23
23
-[Features \& Examples](#features--examples)
24
24
-[Reading an Arbitrarily Large File (with Iterators)](#reading-an-arbitrarily-large-file-with-iterators)
25
-
-[Memory-Mapped Files vs. Streams](#memory-mapped-files-vs-streams)
25
+
-[Memory-Mapped I/O and Streams](#memory-mapped-io-and-streams)
26
26
-[Indexing by Column Names](#indexing-by-column-names)
27
27
-[Numeric Conversions](#numeric-conversions)
28
28
-[Converting to JSON](#converting-to-json)
@@ -33,14 +33,19 @@
33
33
-[Parsing an In-Memory String](#parsing-an-in-memory-string)
34
34
-[DataFrames for Random Access and Updates](#dataframes-for-random-access-and-updates)
35
35
-[Writing CSV Files](#writing-csv-files)
36
+
-[C++20 Ranges: Efficient writing for `CSVRow`, `DataFrameRow`, and STL containers](#c20-ranges-efficient-writing-for-csvrow-dataframerow-and-stl-containers)
36
37
37
38
## Motivation
38
-
There's plenty of other CSV parsers in the wild, but I had a hard time finding what I wanted. Inspired by Python's `csv` module, I wanted a library with **simple, intuitive syntax**. Furthermore, I wanted support for special use cases such as calculating statistics on very large files. Thus, this library was created with these following goals in mind.
39
+
I wanted a CSV library that was fast and reliable without forcing you into either:
40
+
* A 1990s C-style API
41
+
* A high-level wrapper that murders `malloc()` and your memory cache
42
+
43
+
This library tries to be **fast for developers** and **fast for your computer**.
39
44
40
45
### Performance and Memory Requirements
41
-
A high-performance CSV parser lets you take advantage of large datasets efficiently. This library combines SIMD-accelerated parsing, memory-mapped I/O, careful memory layout, minimal allocation, and background parsing to process large CSV files quickly, even when they exceed available RAM.
46
+
This library combines SIMD-accelerated parsing, memory-mapped I/O, careful memory layout, minimal allocation, and background parsing to process large CSV files quickly, even when they exceed available RAM.
42
47
43
-
In fact, [according to Visual Studio's profiler](https://github.com/vincentlaucsb/csv-parser/wiki/Microsoft-Visual-Studio-CPU-Profiling-Results) this
48
+
[According to Visual Studio's profiler](https://github.com/vincentlaucsb/csv-parser/wiki/Microsoft-Visual-Studio-CPU-Profiling-Results) this
44
49
CSV parser **spends almost 90% of its CPU cycles actually reading your data** as opposed to getting hung up in hard disk I/O or pushing around memory.
45
50
46
51
#### Show me the numbers
@@ -52,7 +57,7 @@ All benchmarks shown are warm cache runs to focus on parser/CPU performance rath
52
57
53
58
#### Chunk Size Tuning
54
59
55
-
By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs.
60
+
By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs, but feel free to experiment and measure with different numbers yourself.
56
61
57
62
If you encounter rows larger than the chunk size, pass a custom `CSVFormat` with `chunk_size()`:
58
63
@@ -65,13 +70,11 @@ for (auto& row : reader) {
65
70
}
66
71
```
67
72
68
-
**Tuning guidance:** The default 10MB provides good balance for typical workloads. Smaller chunks (e.g., 500KB) increase thread overhead without meaningful memory savings. Larger chunks (e.g., 100MB+) reduce thread coordination overhead but consume more memory and delay the first row. Feel free to experiment and measure with your own hardware and data patterns.
69
-
70
73
### Robust Yet Flexible
71
74
#### RFC 4180 and Beyond
72
75
This CSV parser is much more than a fancy string splitter, and parses all files following [RFC 4180](https://www.rfc-editor.org/rfc/rfc4180.txt).
73
76
74
-
However, in reality we know that RFC 4180 is just a suggestion, and there's many "flavors" of CSV such as tab-delimited files. Thus, this library has:
77
+
However, in reality we know that RFC 4180 is just a suggestion, so this library has:
75
78
* Automatic delimiter guessing
76
79
* Ability to ignore comments in leading rows and elsewhere
77
80
* Ability to handle rows of different lengths
@@ -81,32 +84,34 @@ By default, rows of variable length are silently ignored, although you may elect
81
84
82
85
#### Encoding
83
86
This CSV parser is encoding-agnostic and will handle ANSI and UTF-8 encoded files.
84
-
It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks.
87
+
It does not try to decode UTF-8, except for detecting and stripping UTF-8 byte order marks (BOM).
85
88
86
89
### Well Tested
87
90
This CSV parser has:
88
91
* An extensive Catch2 test suite
89
92
* Tests of various CMake and non-CMake builds across g++, clang, MSVC, and MinGW
90
93
* Address, thread safety, and undefined behavior checks with ASan, TSan, and Valgrind (see [GitHub Actions](https://github.com/vincentlaucsb/csv-parser/actions))
91
-
94
+
92
95
#### Bug Reports
93
-
Found a bug? Please report it! This project welcomes **genuine bug reports brought in good faith**:
94
-
* ✅ Crashes, memory leaks, data corruption, race conditions
95
-
* ✅ Incorrect parsing of valid CSV files
96
-
* ✅ Performance regressions in real-world scenarios
97
-
* ✅ API issues that affect **practical, real-world use cases**
98
96
99
-
When reporting integration or compiler issues, please state which library form you are using:
100
-
* Single-header
101
-
* Unamalgamated headers/library (`include/` with your own build system, CMake, etc.)
97
+
I welcome genuine bug reports brought in good faith. This includes:
98
+
99
+
- Crashes, memory leaks, data corruption, or race conditions
100
+
- Incorrect parsing of valid CSV files
101
+
- Performance regressions on real-world data
102
+
- API issues that affect practical use cases
102
103
103
-
Please keep reports grounded in real use cases—no contrived edge cases or philosophical debates about API design, thanks!
104
+
When reporting compiler or integration issues, please mention which form of the library you're using:
105
+
- Single-header
106
+
- Regular headers + your own build system
107
+
- CMake
104
108
105
-
**Design Note:** `CSVReader` uses `std::input_iterator_tag` for single-pass streaming of arbitrarily large files. If you need multi-pass iteration or random access, copy rows to a `std::vector` first. This is by design, not a bug.
109
+
**Note:** Please keep reports focused on real-world problems.
110
+
Questions about extremely edge-case behavior (e.g. "what should `,,,` return?") do not belong in the issue tracker.
106
111
107
112
## Documentation
108
113
109
-
In addition to the [Features & Examples](#features--examples) below, a [fully-fledged online documentation](https://vincentlaucsb.github.io/csv-parser/) contains more examples, details, interesting features, and instructions for less common use cases.
114
+
In addition to the [Features & Examples](#features--examples) below, an [extensive documentation site](https://vincentlaucsb.github.io/csv-parser/) contains more examples, details, interesting features, and instructions for less common use cases.
110
115
111
116
## Sponsors
112
117
If you use this library for work, please [become a sponsor](https://github.com/sponsors/vincentlaucsb). Your donation
@@ -121,7 +126,7 @@ This library was developed with Microsoft Visual Studio and is compatible with >
121
126
All of the code required to build this library, aside from the C++ standard library, is contained under `include/`.
122
127
123
128
### C++ Version
124
-
While C++17 is recommended, C++11 is the minimum version required. This library makes extensive use of string views, and uses
129
+
While C++20 is recommended, C++11 is the minimum version required. This library makes extensive use of string views, and uses
125
130
[Martin Moene's string view library](https://github.com/martinmoene/string-view-lite) if `std::string_view` is not available.
126
131
127
132
This library requires C++ exceptions to be enabled (for example, do not compile with `-fno-exceptions`).
@@ -226,40 +231,18 @@ while (reader.read_row(row)) {
226
231
...
227
232
```
228
233
229
-
#### Memory-Mapped Files vs. Streams
230
-
By default, passing in a file path string to the constructor of `CSVReader`
231
-
causes memory-mapped IO to be used. In general, this option is the most
232
-
performant.
233
-
234
-
However, `std::ifstream` may also be used as well as in-memory sources via `std::stringstream`.
235
-
236
-
**Note**: Currently CSV guessing only works for memory-mapped files. The CSV dialect
237
-
must be manually defined for other sources.
238
-
239
234
**⚠️ IMPORTANT - Iterator Type and Memory Safety**:
240
235
`CSVReader::iterator` is an **input iterator** (`std::input_iterator_tag`), NOT a forward iterator.
241
-
This design enables streaming large CSV files (50+ GB) without loading them entirely into memory.
236
+
This design enables streaming large CSV files (50+ GB) without loading them entirely into memory, but standard algorithms that require forward iterators (for example `std::max_element`) are not safe to use on it directly.
242
237
243
-
**Why Forward Iterator Algorithms Don't Work**:
244
-
- As the iterator advances, underlying data chunks are automatically freed to bound memory usage
245
-
- Algorithms like `std::max_element` require ForwardIterator semantics (multi-pass, hold multiple positions)
246
-
- Using such algorithms directly on `CSVReader::iterator` will cause **heap-use-after-free** when the
247
-
algorithm tries to access iterators pointing to already-freed data chunks
248
-
- While it may appear to work with small files that fit in a single chunk, it WILL fail with larger files
238
+
If you need to get around this, I suggest either loading all rows into an STL container, e.g. `std::vector<CSVRow>`, or using the `DataFrame` class which supports row and column random access.
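A minimal sketch of the "load into a container first" workaround follows. It uses `std::istream_iterator<int>` as a stand-in single-pass input iterator, so it makes no claims about `CSVReader` or `CSVRow` themselves; `largest_value` is a hypothetical helper named for this example.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the largest whitespace-separated integer
// in `text`, or 0 if there are none.
inline int largest_value(const std::string& text) {
    std::istringstream in(text);

    // std::istream_iterator is an input iterator: it can be traversed once.
    std::istream_iterator<int> first(in), last;

    // Single pass: materialize everything the input iterator yields.
    std::vector<int> values(first, last);

    // Vector iterators are random-access, so multi-pass algorithms such as
    // std::max_element are safe here.
    auto it = std::max_element(values.begin(), values.end());
    return it == values.end() ? 0 : *it;
}
```

The same shape should carry over to rows: copy them out in one pass, then run the algorithm on the container rather than on the streaming iterator.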
249
239
250
-
**✅ Correct Approach for ForwardIterator Algorithms**:
251
-
```cpp
252
-
// Copy rows to vector first (enables multi-pass iteration)
When passing in a file path to `CSVReader`, memory-mapped I/O is used as it is the most performant.
242
+
243
+
However, most finite streams implementing `std::istream`, such as `std::stringstream` and `std::ifstream`, are supported, as are non-seekable streams. `CSVReader` can take a stream by reference, although passing in an owning `std::unique_ptr<std::istream>` is recommended for memory safety.
262
244
245
+
Both memory-mapped and `std::istream` paths benefit from having a background parsing thread, unless disabled.
263
246
264
247
```cpp
265
248
CSVFormat format;
@@ -483,7 +466,7 @@ for (auto& r: rows) {
483
466
484
467
### DataFrames for Random Access and Updates
485
468
486
-
For files that fit comfortably in memory, `DataFrame` provides fast and powerful keyed access, in-place updates, and grouping operations—all built on the same high-performance parser. It uses the same parsing pipeline as `CSVReader` but retains the results in memory for random access.
469
+
For files that fit comfortably in memory, `DataFrame` provides fast and powerful keyed access, in-place updates, and grouping operations—all built on the same high-performance parser. It uses the same parsing pipeline as `CSVReader` but retains the results in memory for both row-wise and column-wise random access.
-**Use DataFrame** for: Files that fit in RAM, frequent lookups/updates, grouping operations, data that needs random access
596
-
597
-
**When Not to Use DataFrame:**
598
-
- Extremely large files that do not fit in RAM
599
-
- Streaming pipelines where you only need single-pass access
600
-
601
-
Both options deliver the same parsing performance—DataFrame simply keeps the results in memory for convenience.
602
-
603
576
### Writing CSV Files
577
+
Writing CSVs is powered by the generic `DelimWriter`, with helpful factory functions like `make_csv_writer()` and `make_tsv_writer()` that cut down on boilerplate.