Commit b78130d

Codecov setup (#294)

* Codecov setup
* Update coverage.yml
* Update codecov.yml
* Potential fix for code scanning alert no. 269: Workflow does not contain permissions
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Fix --coverage race condition
* Update test_read_csv.cpp
* Update coverage.yml
* Update coverage.yml
* Add check for trailing EOF
* Improve unit test coverage
* Update coverage.yml
* Update coverage.yml
* Add branch coverage
* Update coverage.yml
* Update test_csv_field.cpp
* Fixed various idiosyncrasies
* CSVReader::begin() calls read_row() instead of being a 3rd path to starting the worker
* CSVFormat now has chunking property
* Fix DataFrame weirdness when key type is integral
* Update tests
* More tests added
* Update test_stream_sources.cpp

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
1 parent b870b08 commit b78130d

29 files changed

Lines changed: 736 additions & 452 deletions

.claude/rules/testing-conventions.md

Lines changed: 26 additions & 0 deletions
@@ -49,6 +49,32 @@ Or exclude from default runs:
 TEST_CASE("Description", "[.][bug]") { ... } // Skip by default
 ```
 
+## Rule: Edge Case and Regression Tests Go at the End of the File
+
+Test files are organized with mainline functionality tests first. Edge cases, regression
+tests for specific issues, and boundary condition tests must be placed **at the end of
+the file**, after all general feature tests.
+
+**WRONG - regression test inserted at the top or middle:**
+```cpp
+// At line 14, before any feature tests:
+TEST_CASE("Regression #149 - trailing newline", ...) { ... }
+
+TEST_CASE("Test Parse Flags", ...) { ... } // General test displaced
+```
+
+✅ **RIGHT - regression test at the bottom:**
+```cpp
+TEST_CASE("Test Parse Flags", ...) { ... } // General tests first
+TEST_CASE("Read CSV from string", ...) { ... } // ...
+
+// --- Edge cases and regression tests ---
+TEST_CASE("Regression #149 - trailing newline", ...) { ... }
+```
+
+This keeps the file readable: a maintainer skimming the top sees the feature coverage;
+scrolling to the bottom reveals all known edge cases in one place.
+
 ## Test Pattern for Known Bugs
 
 ```cpp

.github/workflows/coverage.yml

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+name: Code Coverage
+
+on:
+  push:
+    branches: [ "master" ]
+  pull_request:
+    branches: [ "master" ]
+
+permissions:
+  contents: read
+  pull-requests: write
+
+jobs:
+  coverage:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository and submodules
+        uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Install dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y lcov
+
+      - name: Configure CMake with coverage instrumentation
+        run: >
+          cmake -B build
+          -DCMAKE_BUILD_TYPE=Debug
+          -DENABLE_CODE_COVERAGE=ON
+          -DCMAKE_CXX_COMPILER=g++
+          -DCMAKE_C_COMPILER=gcc
+          -DCSV_CXX_STANDARD=17
+          -S ${{ github.workspace }}
+
+      - name: Build
+        run: cmake --build build --config Debug
+
+      - name: Run tests
+        working-directory: build
+        run: ctest --build-config Debug --output-on-failure
+
+      - name: Capture coverage data
+        run: |
+          lcov --capture \
+            --directory build \
+            --output-file coverage.info \
+            --ignore-errors mismatch,source,empty,unused,negative \
+            --rc lcov_branch_coverage=1
+          # Strip system headers, external deps, Catch2, and test files
+          lcov --remove coverage.info \
+            '/usr/*' \
+            '*/external/*' \
+            '*/catch2/*' \
+            '*/Catch2/*' \
+            '*/tests/*' \
+            --output-file coverage.info \
+            --ignore-errors empty,unused \
+            --rc lcov_branch_coverage=1
+          lcov --list coverage.info --ignore-errors empty,unused --rc lcov_branch_coverage=1
+
+      - name: Upload to Codecov
+        uses: codecov/codecov-action@v4
+        with:
+          files: coverage.info
+          fail_ci_if_error: true
+          token: ${{ secrets.CODECOV_TOKEN }}
+          override_commit: ${{ github.event.pull_request.head.sha || github.sha }}
+          override_pr: ${{ github.event.pull_request.number }}

.github/workflows/sanitizers.yml

Lines changed: 37 additions & 0 deletions
@@ -109,3 +109,40 @@ jobs:
         with:
           name: valgrind-results
           path: build/Testing/
+
+  # Reproduces the exact build profile reported in issue #293:
+  # -O3 combined with --coverage exposes memory ordering issues invisible to plain -O3
+  o3-coverage:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository and submodules
+        uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Install dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential cmake lcov
+
+      - name: Configure CMake
+        run: |
+          mkdir -p build
+          cd build
+          cmake .. \
+            -DCMAKE_BUILD_TYPE=Release \
+            -DCSV_CXX_STANDARD=17 \
+            -DCMAKE_CXX_COMPILER=g++ \
+            -DCMAKE_C_COMPILER=gcc \
+            -DCMAKE_CXX_FLAGS="-O3 --coverage" \
+            -DCMAKE_C_FLAGS="-O3 --coverage" \
+            -DCMAKE_EXE_LINKER_FLAGS="--coverage"
+
+      - name: Build
+        run: cmake --build build
+
+      - name: Test with -O3 --coverage
+        working-directory: build
+        run: ctest --output-on-failure
+        timeout-minutes: 20

CMakeLists.txt

Lines changed: 3 additions & 1 deletion
@@ -34,7 +34,9 @@ else()
 
   if(ENABLE_CODE_COVERAGE)
     message("Code coverage instrumentation enabled")
-    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} --coverage -Og")
+    # -fprofile-update=atomic prevents negative counter corruption when
+    # background reader threads write to the same .gcda file concurrently.
+    set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} --coverage -Og -fprofile-update=atomic")
   endif()
 endif(MSVC)

README.md

Lines changed: 7 additions & 5 deletions
@@ -52,11 +52,12 @@ On my computer (12th Gen Intel(R) Core(TM) i5-12400 @ 2.50 GHz/Western Digital B
 
 By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs.
 
-If you encounter rows larger than the chunk size, use `set_chunk_size()` to adjust:
+If you encounter rows larger than the chunk size, pass a custom `CSVFormat` with `chunk_size()`:
 
 ```cpp
-CSVReader reader("massive_rows.csv");
-reader.set_chunk_size(100 * 1024 * 1024); // 100MB chunks
+CSVFormat fmt;
+fmt.chunk_size(100 * 1024 * 1024); // 100MB chunks
+CSVReader reader("massive_rows.csv", fmt);
 for (auto& row : reader) {
     // Process row
 }
@@ -452,8 +453,9 @@ DataFrame<int> df2(reader, "employee_id");
 // O(1) lookups by key
 auto salary = df[12345]["salary"].get<double>();
 
-// Access by position also works
-auto first_row = df[0];
+// Positional access: operator[](size_t) is disabled when KeyType is an integer
+// type to prevent ambiguity with operator[](const KeyType&). Use iloc() instead.
+auto first_row = df.iloc(0);
 auto name = first_row["name"].get<std::string>();
 
 // Check if a key exists

codecov.yml

Lines changed: 12 additions & 4 deletions
@@ -1,8 +1,16 @@
 ignore:
   - "include/external"
+  - "programs"
+  - "single_include"
+  - "single_include_test"
   - "tests"
 coverage:
-  status:
-    project:
-      default:
-        target: 95%
+  status:
+    project:
+      default:
+        target: auto
+        informational: true
+  patch:
+    default:
+      target: auto
+      informational: true

include/internal/csv_format.cpp

Lines changed: 12 additions & 0 deletions
@@ -47,6 +47,18 @@ namespace csv {
         return *this;
     }
 
+    CSV_INLINE CSVFormat& CSVFormat::chunk_size(size_t size) {
+        if (size < internals::ITERATION_CHUNK_SIZE) {
+            throw std::invalid_argument(
+                "Chunk size must be at least " +
+                std::to_string(internals::ITERATION_CHUNK_SIZE) +
+                " bytes (10MB). Provided: " + std::to_string(size)
+            );
+        }
+        this->_chunk_size = size;
+        return *this;
+    }
+
     CSV_INLINE void CSVFormat::assert_no_char_overlap()
     {
         auto delims = std::set<char>(

include/internal/csv_format.hpp

Lines changed: 15 additions & 0 deletions
@@ -104,6 +104,17 @@ namespace csv {
             return *this;
         }
 
+        /** Sets the chunk size used when reading the CSV
+         *
+         * @param[in] size Chunk size in bytes (minimum: 10MB = ITERATION_CHUNK_SIZE)
+         * @throws std::invalid_argument if size < ITERATION_CHUNK_SIZE
+         *
+         * Use this when constructing a CSVReader from a filename and individual rows
+         * may exceed the default 10MB chunk size. The value is passed to CSVReader at
+         * construction time, before any data is read.
+         */
+        CSVFormat& chunk_size(size_t size);
+
 #ifndef DOXYGEN_SHOULD_SKIP_THIS
         char get_delim() const {
             // This error should never be received by end users.
@@ -120,6 +131,7 @@ namespace csv {
         std::vector<char> get_possible_delims() const { return this->possible_delimiters; }
         std::vector<char> get_trim_chars() const { return this->trim_chars; }
         CONSTEXPR VariableColumnPolicy get_variable_column_policy() const { return this->variable_column_policy; }
+        CONSTEXPR size_t get_chunk_size() const { return this->_chunk_size; }
 #endif
 
         /** CSVFormat for guessing the delimiter */
@@ -163,5 +175,8 @@ namespace csv {
 
         /**< Allow variable length columns? */
         VariableColumnPolicy variable_column_policy = VariableColumnPolicy::IGNORE_ROW;
+
+        /**< Chunk size for reading; passed to CSVReader at construction time */
+        size_t _chunk_size = internals::ITERATION_CHUNK_SIZE;
     };
 }

include/internal/csv_reader.cpp

Lines changed: 9 additions & 4 deletions
@@ -167,7 +167,8 @@ namespace csv {
     CSV_INLINE CSVReader::CSVReader(csv::string_view filename, CSVFormat format) : _format(format) {
         auto head = internals::get_csv_head(filename);
         using Parser = internals::MmapParser;
-
+        // Apply chunk size from format before any reading occurs
+        this->_chunk_size = format.get_chunk_size();
         /** Guess delimiter and header row */
         if (format.guess_delim()) {
             auto guess_result = internals::_guess_format(head, format.possible_delimiters);
@@ -326,14 +327,18 @@ namespace csv {
             // End of file and no more records
             return false;
 
-        // Detect infinite loop: A previous read was requested but records are still empty
-        // This typically means a single row is larger than the chunk size
+        // Detect infinite loop: a previous read was requested but records are still empty.
+        // This fires when a single row spans more than 2 × _chunk_size bytes:
+        //   - chunk N fills without finding '\n' → _read_requested set to true
+        //   - chunk N+1 also fills without '\n' → guard fires here
+        // Default _chunk_size is ITERATION_CHUNK_SIZE (10 MB), so the threshold is
+        // rows > 20 MB. Use CSVFormat::chunk_size() to raise the limit.
         if (this->_read_requested && this->records->empty()) {
             throw std::runtime_error(
                 "End of file not reached and no more records parsed. "
                 "This likely indicates a CSV row larger than the chunk size of " +
                 std::to_string(this->_chunk_size) + " bytes. "
-                "Use set_chunk_size() to increase the chunk size."
+                "Use CSVFormat::chunk_size() to increase the chunk size."
            );
         }

include/internal/csv_reader.hpp

Lines changed: 3 additions & 25 deletions
@@ -184,6 +184,9 @@ namespace csv {
         auto head = internals::get_csv_head(source);
         using Parser = internals::StreamParser<TStream>;
 
+        // Apply chunk size from format before any reading occurs
+        this->_chunk_size = format.get_chunk_size();
+
         if (format.guess_delim()) {
             auto guess_result = internals::_guess_format(head, format.possible_delimiters);
             format.delimiter(guess_result.delim);
@@ -260,31 +263,6 @@ namespace csv {
         /** Sets this reader's column names and associated data */
         void set_col_names(const std::vector<std::string>&);
 
-        /** @brief Set the size of chunks to read from the CSV in bytes
-         *
-         * @param[in] size Chunk size in bytes (minimum: 10MB, default: 10MB)
-         * @throws std::invalid_argument if size < 10MB (ITERATION_CHUNK_SIZE)
-         *
-         * Use this to handle CSV files where a single row exceeds the default 10MB chunk size.
-         * Larger chunks use more memory but allow parsing of larger individual rows.
-         *
-         * Example:
-         * @snippet tests/test_edge_cases_large_rows.cpp Set Chunk Size Example
-         *
-         * @note Chunk size must be at least ITERATION_CHUNK_SIZE (10MB) to avoid
-         *       architectural constraints and ensure reliable parsing behavior.
-         */
-        void set_chunk_size(size_t size) {
-            if (size < internals::ITERATION_CHUNK_SIZE) {
-                throw std::invalid_argument(
-                    "Chunk size must be at least " +
-                    std::to_string(internals::ITERATION_CHUNK_SIZE) +
-                    " bytes (10MB). Provided: " + std::to_string(size)
-                );
-            }
-            this->_chunk_size = size;
-        }
-
         /** @name CSV Settings **/
         ///@{
         CSVFormat _format;
