Commit 9172253

Spring Cleaning (#308)
* Unify inference logic; add explicit column_names flag; fix stream/mmap parity regressions
* Clean up MSVC compatibility macros
* Simplify IBasicCSVParser construction
* Refactor get_col_names()
  Use parse() and move to utility header
* Get rid of get_file_size()
  This returned the file size of an mmap source for the MmapParser, but used ifstream. Not efficient!
* Delete sh.hpp
* Cleanup constructors for CSVReader & IBasicCSVParser
* Clean up CSVField duplication
* Figured out what was wrong with clang & CSVWriter
  Also add variadic write_row() method
* Fix MSVC pedantry
* Potential fix for pull request finding 'CodeQL / Large object passed by value'
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for pull request finding 'CodeQL / Large object passed by value'
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Update writing documentation
* Fix clanker-induced error
  Also fix shadowing issues
* Fix CSV writing issues
* Bump version
* Delete sh.hpp
* Reduce duplication in get_csv_head_stream()

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
1 parent 786023f commit 9172253

37 files changed

Lines changed: 1065 additions & 861 deletions

.github/workflows/codeql.yml

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ jobs:
  cd build
  cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
- -DCSV_CXX_STANDARD=17 \
+ -DCSV_CXX_STANDARD=20 \
  -DCMAKE_CXX_COMPILER=g++ \
  -DCMAKE_C_COMPILER=gcc

AGENTS.md

Lines changed: 21 additions & 36 deletions
@@ -38,33 +38,7 @@ CSVReader reader(infile, format);

  **Testing requirement:** Use ≥500K rows to cross 10MB boundary.

- ## Key Files
-
- | File | Contains |
- |------|----------|
- | `csv_reader.hpp` | Mmap vs stream constructors |
- | `csv_reader.cpp` | Delimiter guessing, header detection |
- | `basic_csv_parser.hpp` | Parser base class (IBasicCSVParser, MmapParser, StreamParser) |
- | `basic_csv_parser.cpp` | Chunk transitions, worker thread |
- | `raw_csv_data.hpp` | Internal parser data structures (RawCSVField, CSVFieldList, RawCSVData) |
- | `thread_safe_deque.hpp` | Producer-consumer queue for parser→main thread communication |
- | `csv_row.hpp` | Public API types (CSVField, CSVRow) |
- | `test_round_trip.cpp` | Exemplar test patterns |
-
- ## Data Flow: Parser → Row API
-
- ```
- Parser Thread Main Thread
- ↓ ↓
- RawCSVData (shared_ptr) ─────────────→ CSVRow
- ↓ ↓
- CSVFieldList → RawCSVField[] CSVField (lazy unescaping)
-
- ThreadSafeDeque<CSVRow>
- (producer-consumer queue)
- ```
-
- **Thread Safety:** Parser populates `RawCSVData`, pushes `CSVRow` to `ThreadSafeDeque`, main thread pops and reads. The `CSVFieldList` uses chunked allocation (~170 fields/chunk) for cache locality. See `raw_csv_data.hpp` and `thread_safe_deque.hpp` for implementation details.
+ For detailed file mapping, parser data flow, and component relationships, see `ARCHITECTURE.md` and `include/internal/ARCHITECTURE.md`.

  ## Common Pitfalls

@@ -73,18 +47,29 @@ ThreadSafeDeque<CSVRow>
  3. **Don't use uniform values:** Each column needs distinct values to detect corruption.
  4. **Don't ignore async:** Worker thread means exceptions must use `exception_ptr`.
  5. **Don't change one constructor:** Likely affects both mmap and stream paths.
- 7. **Compatibility macros defined in `common.hpp` MUST be referenced only after including `common.hpp`.** Any macro (such as `CSV_HAS_CXX20`) that is defined in `common.hpp` must not be used or checked before `#include "common.hpp"` appears in the file. This ensures feature detection and conditional compilation work as intended across all supported compilers and build modes.
- 8. **`CSVReader` is non-copyable and move-enabled.** Prefer explicit ownership transfer (`std::move`) or `std::unique_ptr<CSVReader>` when sharing/handing off parser ownership across APIs.
- 9. **Prefer trailing underscore for private members** (for example `source_`, `leftover_`). When you touch code with mixed private-member naming styles, normalize the edited region toward trailing underscores instead of introducing more leading-underscore or unsuffixed names.
- 10. **Prefer user-friendly API constraints.** Do not narrow template constraints unless required for correctness, safety, or a measured performance win. If an implementation already handles common standard-library containers/ranges correctly, keep those inputs accepted instead of over-constraining APIs for aesthetic purity.
- 11. **Respect existing compile-time compatibility macros.** Keep `IF_CONSTEXPR`, `CONSTEXPR_VALUE`, and similar macros unless there is a correctness bug.
- 12. **Do not replace compile-time constructs with runtime control flow to silence warnings.** Prefer smallest scoped warning suppression at the exact site (for example, local `#pragma warning(push/pop)` on MSVC) over semantic rewrites.
- 13. **Opportunistic rewrites/refactors are allowed when they are safe and justified.** Keep them separated from build-fix urgency where possible, and avoid bundling unrelated churn with compiler triage unless explicitly requested.
- 14. **When proposing changes that affect compile-time behavior, explain the tradeoff clearly.** Call out any impact to codegen, performance, portability, and readability before applying the change.
- 15. **If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.** Provide a short justification before expanding further.
+ 6. **`CSVReader` is non-copyable and move-enabled.** Prefer explicit ownership transfer (`std::move`) or `std::unique_ptr<CSVReader>` when sharing/handing off parser ownership across APIs.
+ 7. **Prefer user-friendly API constraints.** Do not narrow template constraints unless required for correctness, safety, or a measured performance win. If an implementation already handles common standard-library containers/ranges correctly, keep those inputs accepted instead of over-constraining APIs for aesthetic purity.
+ 8. **Opportunistic rewrites/refactors are allowed when they are safe and justified.** Keep them separated from build-fix urgency where possible, and avoid bundling unrelated churn with compiler triage unless explicitly requested.
+ 9. **When proposing changes that affect compile-time behavior, explain the tradeoff clearly.** Call out any impact to codegen, performance, portability, and readability before applying the change.
+ 10. **If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.** Provide a short justification before expanding further.

  See `tests/AGENTS.md` for test strategy, checklist, and conventions.

+ ### Rules for Coding
+ 1. **Use compatibility macros defined in `common.hpp`** for cross-compiler or cross-standard concerns. If it doesn't exist, consider creating one.
+ 2. **Compatibility macros defined in `common.hpp` MUST be referenced only after including `common.hpp`** to ensure correctness.
+ 3. **Prefer compile time control flow and assertions where possible**. For example, if a branch may be safely written with `if constexpr`, then use the `IF_CONSTEXPR` macro (from `common.hpp`) to ensure C++11 compatibility while ensuring optimal control flow for C++17 and later users.
+    1. **If this causes compiler warnings, always silence the compiler. Do not revert to unnecessary runtime flow.**
+ 4. **Prefer trailing underscore for private members** (for example `source_`, `leftover_`). When you touch code with mixed private-member naming styles, normalize the edited region toward trailing underscores instead of introducing more leading-underscore or unsuffixed names.
+ 5. **Apply the 5/2 anti-duplication rule.**
+    1. If equivalent behavior exists in 2 or more code paths and each copy is about 5+ meaningful lines, extract a shared helper.
+    2. If duplication is intentionally kept, add a brief comment explaining why (for example performance, API boundary, or template constraints).
+    3. For behavior-sensitive duplicated logic, keep at least one regression test that exercises each path (for example mmap and stream via separate Catch2 `SECTION`s).
+ 6. If a class has both a `.hpp` and `.cpp` file, put methods inside the `.cpp` and prefix the definition with `CSV_INLINE` to ensure proper single-header compilation (the macro is `inline` in the generated single-header and empty otherwise). Exceptions:
+    - **Templates must stay in `.hpp`** — the compiler needs the definition at instantiation time. `init_from_stream` is the standing example.
+    - **Trivial one-liner accessors** may be unconditionally `inline` in the header when the call overhead is measurable and the body will never change.
+    - **Consolidation:** If a `.cpp` would be under ~100 lines *and* the split causes excessive comment duplication between the two files, prefer a single `.hpp` with definitions marked `inline` (free functions and methods alike). Do not use `CSV_INLINE` for consolidated definitions — `CSV_INLINE` expands to empty in multi-header mode, which would produce ODR violations across TUs. Do not consolidate just for brevity — only when duplication is the dominant cost.
+
  ### Rules for Comments
  1. **Always update or remove incorrect comments.**
  2. **Don't reference internal functions in public API comments.** Public API docs should describe user-visible behavior and contracts; internal helper/function details belong in internal docs.
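
The compile-time control-flow rule under "Rules for Coding" can be illustrated with a small sketch. The macro below, `IF_CONSTEXPR_SKETCH`, is a hypothetical stand-in written for this example; the actual `IF_CONSTEXPR` definition in `common.hpp` is not shown in this diff and may differ. The helper `is_numeric_field` is likewise invented for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for IF_CONSTEXPR; the real macro in common.hpp may
// be defined differently.
#if defined(__cpp_if_constexpr) && __cpp_if_constexpr >= 201606L
    #define IF_CONSTEXPR_SKETCH if constexpr   // C++17 and later
#else
    #define IF_CONSTEXPR_SKETCH if             // C++11/14 fallback
#endif

// Returns true when T is an arithmetic type. Both branches must compile
// under the plain `if` fallback, because pre-C++17 builds do not discard
// the untaken branch.
template <typename T>
bool is_numeric_field(const T&) {
    IF_CONSTEXPR_SKETCH (std::is_arithmetic<T>::value) {
        return true;
    } else {
        return false;
    }
}
```

Under the fallback, some compilers may warn about a constant condition; per the rule above, the preferred response is a scoped warning suppression rather than a rewrite into runtime logic.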

ARCHITECTURE.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,7 @@ Operational/testing guidance:

  Notes:
  - Internal architecture content lives under include/internal to stay close to implementation.
+ - Detailed file map, parser data flow, and component relationship diagrams are maintained in include/internal/ARCHITECTURE.md.
  - Queue synchronization details are maintained only in THREADSAFE_DEQUE_DESIGN.md to avoid duplication.
  - Always update or remove incorrect comments.
  - Public API comments should remain user-facing and avoid references to internal helper/function details.
@@ -27,4 +28,5 @@ Notes:
  - Opportunistic rewrites are acceptable when safe/justified, but should be kept separate from urgent compiler triage unless requested.
  - When changing compile-time behavior, explicitly document tradeoffs (codegen, performance, portability, readability).
  - If a build fix appears to require more than ~3 files or ~60 changed lines, pause and confirm scope first.
+ - Apply the 5/2 anti-duplication rule: if equivalent behavior exists in 2+ code paths and each copy is ~5+ meaningful lines, extract a shared helper; if duplication remains, document why and keep regression coverage for each path.

CLAUDE.md

Lines changed: 4 additions & 7 deletions
@@ -21,13 +21,8 @@
  - Exceptions propagate via `std::exception_ptr`
  - Tests must use ≥500K rows to cross chunk boundary

- ## Key Files
- - `csv_reader.hpp` — mmap vs stream constructors
- - `basic_csv_parser.hpp` — MmapParser, StreamParser implementations
- - `basic_csv_parser.cpp` — chunk transitions, worker thread
- - `raw_csv_data.hpp` — RawCSVField, CSVFieldList, RawCSVData
- - `thread_safe_deque.hpp` — producer-consumer queue
- - `csv_row.hpp` — CSVField, CSVRow public API
+ ## Architecture Detail
+ - File mapping, parser data flow, and component relationships are maintained in `ARCHITECTURE.md` and `include/internal/ARCHITECTURE.md`

  ## Common Pitfalls
  - Always test both mmap and stream paths
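
The note in this hunk that exceptions propagate via `std::exception_ptr` can be sketched in isolation. This is a minimal illustration using only the standard library, not the parser's actual worker-thread code; the function name `run_worker_and_report` and the error message are invented for the example.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper: runs a failing job on a worker thread and returns the
// message of the exception once it has been rethrown on the calling thread.
std::string run_worker_and_report() {
    std::exception_ptr worker_error;  // written by the worker, read after join()

    std::thread worker([&worker_error]() {
        try {
            throw std::runtime_error("parse failed in worker");
        } catch (...) {
            // An exception escaping a std::thread body calls std::terminate,
            // so capture it instead of letting it propagate.
            worker_error = std::current_exception();
        }
    });
    worker.join();  // join() synchronizes, so reading worker_error is safe here

    try {
        if (worker_error) std::rethrow_exception(worker_error);
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "no error";
}
```

A test that only exercises the happy path would never touch the capture-and-rethrow branch, which is one reason error cases deserve coverage on both the mmap and stream paths.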
@@ -48,6 +43,8 @@
  - **Opportunistic rewrites are allowed when safe and justified** — avoid mixing unrelated churn into urgent compiler triage unless requested
  - **Explain compile-time tradeoffs explicitly** — when a change affects compile-time behavior, call out impact on codegen/perf/portability/readability
  - **Scope guard for build fixes** — if a fix grows beyond roughly 3 files or 60 changed lines, pause and confirm scope with justification
+ - **Apply the 5/2 anti-duplication rule** — if equivalent behavior exists in 2+ code paths and each copy is ~5+ meaningful lines, extract a shared helper; if duplication remains, document why; keep at least one regression test that exercises each path
+ - **Non-trivial methods go in `.cpp` with `CSV_INLINE`** — `CSV_INLINE` is `inline` in the generated single-header and empty otherwise; omitting it causes ODR violations. Exceptions: templated methods must stay in `.hpp` (`init_from_stream` is the standing example); trivial one-liner accessors may stay `inline` in the header when call overhead matters. Consolidate into a single `.hpp` only when the `.cpp` would be under ~100 lines *and* the split causes excessive comment duplication — consolidated definitions (free functions and methods alike) must use `inline`, not `CSV_INLINE`, to avoid ODR violations across TUs.

  ## Tests
  See `tests/AGENTS.md` for full test strategy, checklist, and conventions.
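
The `CSV_INLINE` convention described in this hunk can be sketched as follows. The guard `SKETCH_SINGLE_HEADER`, the macro `SKETCH_INLINE`, and the `RowCounter` class are all hypothetical names chosen for this example; how the real build switches `CSV_INLINE` between `inline` and empty is not shown in this diff.

```cpp
#include <cassert>

// Hypothetical guard and macro names; the real library toggles CSV_INLINE
// through its own build and single-header tooling.
#ifdef SKETCH_SINGLE_HEADER
    #define SKETCH_INLINE inline   // header-only build: definitions need `inline`
#else
    #define SKETCH_INLINE          // separate .cpp build: one definition per program
#endif

// --- would live in row_counter.hpp ---
class RowCounter {
public:
    void add_row();                      // non-trivial methods are only declared here
    int rows() const { return rows_; }   // trivial accessor kept in the header
private:
    int rows_ = 0;
};

// --- would live in row_counter.cpp ---
SKETCH_INLINE void RowCounter::add_row() { ++rows_; }
```

When the `.cpp` contents are pasted into a single header, the macro becomes `inline`, so including that header from several translation units does not produce duplicate definitions.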

Doxyfile

Lines changed: 2 additions & 1 deletion
@@ -2037,7 +2037,8 @@ INCLUDE_FILE_PATTERNS =
  # recursively expanded use the := operator instead of the = operator.
  # This tag requires that the tag ENABLE_PREPROCESSING is set to YES.

- PREDEFINED = DOXYGEN_SHOULD_SKIP_THIS
+ PREDEFINED = DOXYGEN_SHOULD_SKIP_THIS \
+ CSV_HAS_CXX20=1

  # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this
  # tag can be used to specify a list of macro names that should be expanded. The

README.md

Lines changed: 13 additions & 1 deletion
@@ -385,6 +385,12 @@ format.delimiter('\t')
  // Alternatively, we can use format.delimiter({ '\t', ',', ... })
  // to tell the CSV guesser which delimiters to try out

+ // Inference contract:
+ // - Single delimiter: delimiter is fixed (no delimiter inference)
+ // - Header row is still inferred unless you explicitly call header_row(...), no_header(),
+ // or provide column_names(...)
+ // - Multiple delimiters: delimiter inference is enabled
+
  CSVReader reader("weird_csv_dialect.csv", format);

  for (auto& row: reader) {
@@ -577,6 +583,8 @@ for (auto& row : df) {
  ```

  ### Writing CSV Files
+ *For a more in-depth guide, check out the [Doxygen page on CSV writing](https://vincentlaucsb.github.io/csv-parser/csv_writing_guide.html).*
+
  Writing CSVs is powered by the generic `DelimWriter`, with helpful factory functions like `make_csv_writer()` and `make_tsv_writer()` that cut down on boilerplate.

  ```cpp
@@ -599,8 +607,12 @@ writer << vector<string>({ "A", "B", "C" })
  << deque<string>({ "I'm", "too", "tired" })
  << list<string>({ "to", "write", "documentation." });

+ // Uses compile time templates
  writer << array<string, 3>({ "The quick brown", "fox", "jumps over the lazy dog" });
- writer << make_tuple(1, 2.0, "Three");
+ writer << make_tuple(1, 2.0, "Three", "Quatro");
+
+ // write_row() does everything operator<< does and then some
+ writer.write_row(67, "six", "seven", 6.7, "mogged");
  ...
  ```
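
To show roughly how a variadic `write_row()` of the kind added in this hunk can accept mixed field types without a container, here is a self-contained sketch. `RowWriterSketch` is a hypothetical class written for this example and is not the library's `DelimWriter`; it uses C++11 parameter-pack recursion and, unlike a real CSV writer, performs no quoting or escaping.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical minimal writer; not the library's DelimWriter.
class RowWriterSketch {
public:
    explicit RowWriterSketch(std::ostream& out) : out_(out) {}

    // Accepts one or more fields of any types that support operator<<.
    template <typename First, typename... Rest>
    void write_row(const First& first, const Rest&... rest) {
        out_ << first;
        write_rest(rest...);
        out_ << '\n';
    }

private:
    void write_rest() {}  // recursion terminator: no fields left

    // Writes a delimiter before each remaining field, then recurses.
    template <typename First, typename... Rest>
    void write_rest(const First& first, const Rest&... rest) {
        out_ << ',' << first;
        write_rest(rest...);
    }

    std::ostream& out_;
};
```

Usage mirrors the call in the diff: `writer.write_row(1, "two", 3.5);` emits one comma-separated line to the wrapped stream.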

docs/source/Doxy.md

Lines changed: 4 additions & 3 deletions
@@ -28,7 +28,7 @@ For quick examples, go to this project's [GitHub page](https://github.com/vincen
  * csv::get_col_pos(): Returns the zero-based index of a named column

  #### See also
- [Dealing with Variable Length CSV Rows](md_docs_source_variable_row_lengths.html)
+ [Dealing with Variable Length CSV Rows](variable_row_lengths.html)

  #### Working with parsed data
  * csv::CSVRow: \copybrief csv::CSVRow
@@ -97,7 +97,7 @@ column extraction, editing, and grouping.
  * csv::CSVStat::get_col_names()

  ### CSV Writing
- The [CSV Writing Guide](\ref md_docs_2source_2csv__writing) contains a
+ The [CSV Writing Guide](@ref csv_writing_guide) contains a
  high-level overview of writing CSVs.

  * csv::make_csv_writer(): Construct a csv::CSVWriter
@@ -126,4 +126,5 @@ experimenting should follow these guidelines:
  create separate threads to process each column
  * csv::CSVRow may be safely processed from multiple threads
  * csv::CSVField objects should only be read from one thread at a time
- * **Note**: csv::CSVRow::operator[]() produces separate copies of `csv::CSVField` objects
+ * **Note**: csv::CSVRow::operator[]() produces separate copies of `csv::CSVField` objects
+

docs/source/csv_writing.md

Lines changed: 30 additions & 1 deletion
@@ -1,3 +1,5 @@
+ @page csv_writing_guide CSV Writing Guide
+
  # CSV Writing Guide

  This page summarizes write-side APIs and practical usage patterns for emitting
@@ -18,13 +20,21 @@ Any row-like container of string-convertible values can be streamed directly.

  \snippet tests/test_write_csv.cpp CSV Writer Example

- ## Writing Tuples and Custom Types
+ ### Writing Tuples and Custom Types

  `DelimWriter` can also serialize tuples and custom types that provide a string
  conversion.

  \snippet tests/test_write_csv.cpp CSV Writer Tuple Example

+ ## Using `write_row()`
+
+ The `write_row()` method can be used to write rows with arbitrary fields and mixed types without having to construct a container first.
+
+ Through the magic of SFINAE, `write_row()` also supports any of the operations of `operator<<`.
+
+ \snippet tests/test_write_csv.cpp CSV write_row Variadic Example
+
  ## Data Reordering Workflow

  For read-transform-write pipelines, `csv::CSVRow` supports conversion to
@@ -39,3 +49,22 @@ Typical flow:
  4. Emit with `CSVWriter`

  \snippet tests/test_write_csv.cpp CSV Reordering Example
+
+ ### C++20 Ranges Version
+
+ With C++20, you can use `std::ranges::views` to elegantly reorder fields in a single expression:
+
+ \snippet tests/test_write_csv.cpp CSV Ranges Reordering Example
+
+ ## DataFrame with Sparse Overlay
+
+ When working with DataFrames, you can efficiently update specific cells without reconstructing entire rows. The overlay mechanism stores only the changed cells and writes them correctly:
+
+ \snippet tests/test_write_csv.cpp DataFrame Sparse Overlay Write Example
+
+ ## End-to-End Round-Trip Integrity Example
+
+ The following test is intentionally write-first then read/verify, but it validates
+ the same data-integrity guarantee as read-transform-write user workflows.
+
+ \snippet tests/test_round_trip.cpp Round Trip Distinct Field Values Example

docs/source/scientific_notation.md

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+ @page scientific_notation Scientific Notation Parsing
+
  # Scientific Notation Parsing

  This library has support for parsing scientific notation through `csv::internals::data_type()`,

docs/source/variable_row_lengths.md

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+ @page variable_row_lengths Dealing with Variable Length CSV Rows
+
  # Dealing with Variable Length CSV Rows

  `csv::CSVReader` generally assumes that most rows in a CSV are of the same length.
