Skip to content

Commit 6a31bdc

Browse files
committed
Fix CSV iterator test to not use std::max_element
1 parent 1507866 commit 6a31bdc

5 files changed

Lines changed: 119 additions & 21 deletions

File tree

.claude/rules/csv_reader_rules.md

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
---
2+
paths:
3+
- "include/internal/csv_reader.hpp"
4+
- "include/internal/csv_reader.cpp"
5+
- "include/internal/csv_reader_iterator.cpp"
6+
---
7+
8+
# CSVReader Iterator Rules
9+
10+
## CRITICAL CONSTRAINT: DO NOT Cache All RawCSVDataPtr Chunks
11+
12+
### The Streaming Architecture
13+
- Library parses 50+ GB CSV files with bounded memory (e.g., 8GB RAM)
14+
- Previous data chunks are freed as iterator advances
15+
- `CSVReader::iterator` is `std::input_iterator_tag` by design (single-pass)
16+
17+
### When You See Heap-Use-After-Free with std::max_element or ForwardIterator Algorithms
18+
19+
**❌ WRONG FIX (defeats streaming)**:
20+
```cpp
21+
class iterator {
22+
std::vector<RawCSVDataPtr> _all_data_chunks; // NO!
23+
// Result: 50GB CSV requires 50GB RAM
24+
};
25+
```
26+
27+
**✅ CORRECT FIXES**:
28+
1. **Document limitation**: ForwardIterator algorithms not supported on CSVReader::iterator
29+
2. **Fix test**: Copy to vector first: `std::vector<CSVRow> rows(reader.begin(), reader.end());`
30+
3. **Update README**: Show vector-based workaround for std::max_element
31+
32+
### Valid RawCSVDataPtr Usage
33+
- ✅ `CSVRow::iterator` CAN have RawCSVDataPtr member (single row scope, negligible memory)
34+
- ❌ `CSVReader::iterator` CANNOT cache all chunks (file scope, unbounded memory)
35+
36+
### Root Cause Analysis
37+
- Small test files (< chunk size) fit in one chunk → ForwardIterator algorithms appear to work
38+
- Large files span multiple chunks → heap-use-after-free when algorithm accesses freed chunk
39+
- Solution: Document as unsupported OR provide vector workaround, NOT cache all chunks

.github/workflows/sanitizers.yml

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -51,8 +51,8 @@ jobs:
5151
-DCSV_CXX_STANDARD=${{ matrix.cxx_standard }} \
5252
-DCMAKE_CXX_COMPILER=g++ \
5353
-DCMAKE_C_COMPILER=gcc \
54-
-DCMAKE_CXX_FLAGS="${{ matrix.flag }} -g" \
55-
-DCMAKE_C_FLAGS="${{ matrix.flag }} -g"
54+
-DCMAKE_CXX_FLAGS="${{ matrix.flag }} -g -fno-coverage" \
55+
-DCMAKE_C_FLAGS="${{ matrix.flag }} -g -fno-coverage"
5656
5757
- name: Build with ${{ matrix.name }}
5858
run: cmake --build build --config Debug
@@ -92,8 +92,8 @@ jobs:
9292
-DCSV_CXX_STANDARD=17 \
9393
-DCMAKE_CXX_COMPILER=g++ \
9494
-DCMAKE_C_COMPILER=gcc \
95-
-DCMAKE_CXX_FLAGS="-g -O1" \
96-
-DCMAKE_C_FLAGS="-g -O1"
95+
-DCMAKE_CXX_FLAGS="-g -O1 -fno-coverage" \
96+
-DCMAKE_C_FLAGS="-g -O1 -fno-coverage"
9797
9898
- name: Build
9999
run: cmake --build build --config Debug

README.md

Lines changed: 22 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -147,17 +147,31 @@ However, `std::ifstream` may also be used as well as in-memory sources via `std:
147147
**Note**: Currently CSV guessing only works for memory-mapped files. The CSV dialect
148148
must be manually defined for other sources.
149149

150-
**Note on Iterator Type**: `CSVReader::iterator` is an **input iterator**, not a forward iterator.
151-
This design enables streaming large CSV files without loading them entirely into memory.
152-
While some algorithms requiring forward iterators (e.g., `std::max_element`) may work in practice,
153-
only input iterator algorithms are guaranteed to work reliably. If you need guaranteed forward-iterator
154-
or random-access support, collect rows into a container first:
150+
**⚠️ IMPORTANT - Iterator Type and Memory Safety**:
151+
`CSVReader::iterator` is an **input iterator** (`std::input_iterator_tag`), NOT a forward iterator.
152+
This design enables streaming large CSV files (50+ GB) without loading them entirely into memory.
153+
154+
**Why Forward Iterator Algorithms Don't Work**:
155+
- As the iterator advances, underlying data chunks are automatically freed to bound memory usage
156+
- Algorithms like `std::max_element` require ForwardIterator semantics (multi-pass, hold multiple positions)
157+
- Using such algorithms directly on `CSVReader::iterator` will cause **heap-use-after-free** when the
158+
algorithm tries to access iterators pointing to already-freed data chunks
159+
- While it may appear to work with small files that fit in a single chunk, it WILL fail with larger files
160+
161+
**✅ Correct Approach for ForwardIterator Algorithms**:
155162
```cpp
156-
std::vector<CSVRow> rows;
157-
for (auto& row : reader) rows.push_back(row);
158-
// Now use std::max_element or other forward-iterator algorithms reliably
163+
// Copy rows to vector first (enables multi-pass iteration)
164+
CSVReader reader("large_file.csv");
165+
std::vector<CSVRow> rows(reader.begin(), reader.end());
166+
167+
// Now safely use any algorithm requiring ForwardIterator
168+
auto max_row = std::max_element(rows.begin(), rows.end(),
169+
[](const CSVRow& a, const CSVRow& b) {
170+
return a["salary"].get<double>() < b["salary"].get<double>();
171+
});
159172
```
160173
174+
161175
```cpp
162176
CSVFormat format;
163177
// custom formatting options go here

include/internal/csv_reader.hpp

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -85,6 +85,34 @@ namespace csv {
8585
*
8686
* @par Using with `<algorithm>` library
8787
* @snippet tests/test_csv_iterator.cpp CSVReader Iterator 2
88+
*
89+
* @warning STREAMING CONSTRAINT - DO NOT ATTEMPT TO CACHE ALL DATA
90+
* This iterator is intentionally std::input_iterator_tag (single-pass) to support
91+
* streaming large CSV files that may exceed available RAM (e.g., 50+ GB files).
92+
*
93+
* @par CRITICAL DESIGN CONSTRAINT:
94+
* - The underlying RawCSVData chunks are automatically freed as the iterator advances
95+
* - Only the CURRENT row's data is kept alive (via CSVRow's RawCSVDataPtr member)
96+
* - Previous chunks are freed to allow unlimited file sizes with bounded memory
97+
*
98+
* @par WHY FORWARD ITERATOR IS NOT POSSIBLE:
99+
* - ForwardIterator requires multi-pass guarantees (can hold multiple valid positions)
100+
* - Supporting this would require caching ALL RawCSVDataPtr chunks in the iterator
101+
* - This defeats the streaming purpose: a 50 GB CSV would require 50+ GB of RAM
102+
* - The entire library design depends on automatic chunk cleanup for memory efficiency
103+
*
104+
* @par IMPLICATIONS FOR ALGORITHM USE:
105+
* - Algorithms requiring ForwardIterator (std::max_element, std::sort, etc.) may
106+
* appear to work in tests with small files, but will cause heap-use-after-free
107+
* when the algorithm tries to access iterators pointing to freed chunks
108+
* - CORRECT approach: Copy rows to std::vector first, then use algorithms
109+
* - Example: auto rows = std::vector<CSVRow>(reader.begin(), reader.end());
110+
* auto max_row = std::max_element(rows.begin(), rows.end(), ...);
111+
*
112+
* @par FOR AI AGENTS AND MAINTAINERS:
113+
* If you encounter heap-use-after-free with ForwardIterator algorithms, the solution
114+
* is NOT to cache all RawCSVDataPtr chunks - this violates the fundamental streaming
115+
* architecture. Instead, document the limitation or provide vector-based workarounds.
88116
*/
89117
class iterator {
90118
public:

tests/test_csv_iterator.cpp

Lines changed: 26 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -107,31 +107,48 @@ TEST_CASE("Basic CSVReader Iterator Test", "[read_ints_iter]") {
107107
//! [CSVReader Iterator 1]
108108

109109
//! [CSVReader Iterator 2]
110-
TEST_CASE("CSVReader Iterator + std::max_elem", "[iter_max_elem]") {
111-
SECTION("Find Max Element with std::max_element") {
110+
/**
111+
* IMPORTANT: CSVReader::iterator is std::input_iterator_tag (single-pass)
112+
* to support streaming large files with bounded memory usage.
113+
*
114+
* Algorithms requiring ForwardIterator (like std::max_element) may appear
115+
* to work with small files, but cause heap-use-after-free when data spans
116+
* multiple RawCSVData chunks that get freed as the iterator advances.
117+
*
118+
* CORRECT approach: Copy to vector first, then use algorithms.
119+
*/
120+
TEST_CASE("CSVReader Iterator + Algorithms Requiring ForwardIterator", "[iter_algorithms]") {
121+
SECTION("std::max_element - CORRECT approach using vector") {
112122
// The first is such that each value in the ith row is the number i
113123
// There are 100 rows
114-
CSVReader r1("./tests/data/fake_data/ints.csv");
124+
CSVReader reader("./tests/data/fake_data/ints.csv");
125+
126+
// Copy rows to vector to enable ForwardIterator algorithms
127+
auto rows = std::vector<CSVRow>(reader.begin(), reader.end());
128+
REQUIRE(rows.size() == 100);
115129

116130
// Find largest number
117-
auto int_finder = [](CSVRow& left, CSVRow& right) {
131+
auto int_finder = [](const CSVRow& left, const CSVRow& right) {
118132
return (left["A"].get<int>() < right["A"].get<int>());
119133
};
120134

121-
auto max_int = std::max_element(r1.begin(), r1.end(), int_finder);
135+
auto max_int = std::max_element(rows.begin(), rows.end(), int_finder);
122136
REQUIRE((*max_int)["A"] == 100);
123137
}
124138

125-
SECTION ("Find Max Element with std::max_element - Large File") {
139+
SECTION("std::max_element - Large File using vector") {
126140
// The second file is a database of California state employee salaries
127-
CSVReader r2("./tests/data/real_data/2015_StateDepartment.csv");
141+
CSVReader reader("./tests/data/real_data/2015_StateDepartment.csv");
142+
143+
// Copy rows to vector to enable ForwardIterator algorithms
144+
auto rows = std::vector<CSVRow>(reader.begin(), reader.end());
128145

129146
// Find highest salary
130-
auto wage_finder = [](CSVRow& left, CSVRow& right) {
147+
auto wage_finder = [](const CSVRow& left, const CSVRow& right) {
131148
return (left["Total Wages"].get<double>() < right["Total Wages"].get<double>());
132149
};
133150

134-
auto max_wage = std::max_element(r2.begin(), r2.end(), wage_finder);
151+
auto max_wage = std::max_element(rows.begin(), rows.end(), wage_finder);
135152

136153
REQUIRE((*max_wage)["Total Wages"] == "812064.87");
137154
}

0 commit comments

Comments
 (0)