
Commit 672cdb2

Don't push empty rows (#271)

* Don't push empty rows

  The current NEWLINE behavior of reading as much CR/LF as possible means empty rows are, and should be, ignored. This fixes the issue of an empty row being pushed when a data chunk ends in the middle of such a CR/LF stream (either because of a block of empty lines or, if you are unlucky, because the chunk ended right between the CR and LF of a Windows line ending).

* Update basic_csv_parser.cpp

* Reduce minimum chunk size

  Also modified the chunk size stress tests to not use massive chunk sizes all the time.

---------

Co-authored-by: Vincent La <vincela9@gmail.com>
1 parent 429f37a commit 672cdb2

16 files changed

Lines changed: 173 additions & 68 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -31,7 +31,7 @@ CSVReader reader(infile, format);
 
 ## Threading: Worker + 10MB Chunks
 
-- Worker thread reads in 10MB chunks (`ITERATION_CHUNK_SIZE`)
+- Worker thread reads in 10MB chunks (`CSV_CHUNK_SIZE_DEFAULT`)
 - Communicates via `ThreadSafeDeque<CSVRow>`
 - Exceptions propagate via `std::exception_ptr`
 - Critical: Fields spanning chunk boundaries must not corrupt
```

CLAUDE.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@
 - Bugs can exist in one and not the other — always test both with Catch2 `SECTION`
 
 ## Threading
-- Worker thread reads 10MB chunks (`ITERATION_CHUNK_SIZE`)
+- Worker thread reads 10MB chunks (`CSV_CHUNK_SIZE_DEFAULT`)
 - Communication via `ThreadSafeDeque<CSVRow>`
 - Exceptions propagate via `std::exception_ptr`
 - Tests must use ≥500K rows to cross chunk boundary
```

README.md

Lines changed: 5 additions & 6 deletions
````diff
@@ -38,7 +38,7 @@
 ## Motivation
 I wanted a CSV library that was fast and reliable without forcing you into either:
 * A 1990s C-style API
-* A high-level wrapper that murders `malloc()` and your memory cache
+* A high-level wrapper that murders `malloc()` and your cache
 
 This library tries to be **fast for developers** and **fast for your computer**.
 
@@ -57,17 +57,16 @@ All benchmarks shown are warm cache runs to focus on parser/CPU performance rath
 
 #### Chunk Size Tuning
 
-By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs, but feel free to experiment and measure with different numbers yourself.
+By default, the parser reads CSV data in 10MB chunks. 10MB was chosen after empirical testing to optimize throughput while minimizing memory and thread synchronization costs, but feel free to experiment with different numbers yourself.
 
-If you encounter rows larger than the chunk size, pass a custom `CSVFormat` with `chunk_size()`:
+A custom `CSVFormat` with `chunk_size()` can be passed to:
+* Shrink the chunk size (down to a minimum of 500KB)
+* Expand the chunk size (necessary if you encounter very large rows)
 
 ```cpp
 CSVFormat fmt;
 fmt.chunk_size(100 * 1024 * 1024); // 100MB chunks
 CSVReader reader("massive_rows.csv", fmt);
-for (auto& row : reader) {
-    // Process row
-}
 ```
 
 ### Robust Yet Flexible
````

include/internal/basic_csv_parser.cpp

Lines changed: 6 additions & 4 deletions
```diff
@@ -165,9 +165,11 @@ namespace csv {
         while (this->data_pos_ < in.size() && parse_flag(in[this->data_pos_]) == ParseFlags::NEWLINE)
             this->data_pos_++;
 
-        // End of record -> Write record
-        this->push_field();
-        this->push_row();
+        // End of record -> Write non-empty record
+        if (this->field_length_ > 0 || !this->current_row_.empty()) {
+            this->push_field();
+            this->push_row();
+        }
 
         // Reset
         this->current_row_ = CSVRow(data_ptr_, this->data_pos_, fields_->size());
@@ -257,7 +259,7 @@ namespace csv {
 #pragma region Specializations
 #endif
 #if !defined(__EMSCRIPTEN__)
-    CSV_INLINE void MmapParser::next(size_t bytes = ITERATION_CHUNK_SIZE) {
+    CSV_INLINE void MmapParser::next(size_t bytes = CSV_CHUNK_SIZE_DEFAULT) {
         // CRITICAL SECTION: Chunk Transition Logic
         // This function reads 10MB chunks and must correctly handle fields that span
         // chunk boundaries. The 'remainder' calculation below ensures partial fields
```

include/internal/basic_csv_parser.hpp

Lines changed: 2 additions & 2 deletions
```diff
@@ -176,7 +176,7 @@ namespace csv {
         ///@}
 
         /** Whether or not source needs to be read in chunks */
-        CONSTEXPR bool no_chunk() const { return this->source_size_ < ITERATION_CHUNK_SIZE; }
+        CONSTEXPR bool no_chunk() const { return this->source_size_ < CSV_CHUNK_SIZE_DEFAULT; }
 
         /** Parse the current chunk of data *
          *
@@ -287,7 +287,7 @@ namespace csv {
 
         ~StreamParser() {}
 
-        void next(size_t bytes = ITERATION_CHUNK_SIZE) override {
+        void next(size_t bytes = CSV_CHUNK_SIZE_DEFAULT) override {
             if (this->eof()) return;
 
             // Reset parser state
```

include/internal/common.hpp

Lines changed: 10 additions & 3 deletions
```diff
@@ -204,17 +204,24 @@ namespace csv {
     const int PAGE_SIZE = 4096;
 #endif
 
-    /** Chunk size for lazy-loading large CSV files
+    /** Default chunk size for lazy-loading large CSV files
      *
-     * The worker thread reads this many bytes at a time (10MB).
+     * The worker thread reads this many bytes at a time by default (10MB).
      *
      * CRITICAL INVARIANT: Field boundaries at chunk transitions must be preserved.
      * Bug #280 was caused by fields spanning chunk boundaries being corrupted.
      *
      * @note Tests must write >10MB of data to cross chunk boundaries
      * @see basic_csv_parser.cpp MmapParser::next() for chunk transition logic
      */
-    constexpr size_t ITERATION_CHUNK_SIZE = 10000000; // 10MB
+    constexpr size_t CSV_CHUNK_SIZE_DEFAULT = 10000000; // 10MB
+
+    /** Minimum supported custom chunk size for CSVFormat::chunk_size().
+     *
+     * This lower bound allows memory-constrained environments to reduce parser
+     * buffer size while avoiding pathological tiny-buffer overhead.
+     */
+    constexpr size_t CSV_CHUNK_SIZE_FLOOR = 500 * 1024; // 500KB
 
     template<typename T>
     inline bool is_equal(T a, T b, T epsilon = 0.001) {
```

include/internal/csv_format.cpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -48,11 +48,11 @@ namespace csv {
     }
 
     CSV_INLINE CSVFormat& CSVFormat::chunk_size(size_t size) {
-        if (size < internals::ITERATION_CHUNK_SIZE) {
+        if (size < internals::CSV_CHUNK_SIZE_FLOOR) {
             throw std::invalid_argument(
                 "Chunk size must be at least " +
-                std::to_string(internals::ITERATION_CHUNK_SIZE) +
-                " bytes (10MB). Provided: " + std::to_string(size)
+                std::to_string(internals::CSV_CHUNK_SIZE_FLOOR) +
+                " bytes (500KB). Provided: " + std::to_string(size)
             );
         }
         this->_chunk_size = size;
```

include/internal/csv_format.hpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -123,8 +123,8 @@ namespace csv {
 
     /** Sets the chunk size used when reading the CSV
      *
-     * @param[in] size Chunk size in bytes (minimum: 10MB = ITERATION_CHUNK_SIZE)
-     * @throws std::invalid_argument if size < ITERATION_CHUNK_SIZE
+     * @param[in] size Chunk size in bytes (minimum: CSV_CHUNK_SIZE_FLOOR)
+     * @throws std::invalid_argument if size < CSV_CHUNK_SIZE_FLOOR
      *
      * Use this when constructing a CSVReader from a filename and individual rows
      * may exceed the default 10MB chunk size. The value is passed to CSVReader at
@@ -198,6 +198,6 @@ namespace csv {
         ColumnNamePolicy _column_name_policy = ColumnNamePolicy::EXACT;
 
         /**< Chunk size for reading; passed to CSVReader at construction time */
-        size_t _chunk_size = internals::ITERATION_CHUNK_SIZE;
+        size_t _chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
     };
 }
```

include/internal/csv_reader.cpp

Lines changed: 2 additions & 2 deletions
```diff
@@ -306,7 +306,7 @@ namespace csv {
      * Retrieve rows as CSVRow objects, returning true if more rows are available.
      *
      * @par Performance Notes
-     * - Reads chunks of data that are csv::internals::ITERATION_CHUNK_SIZE bytes large at a time
+     * - Reads chunks of data that are csv::internals::CSV_CHUNK_SIZE_DEFAULT bytes large at a time
      * - For performance details, read the documentation for CSVRow and CSVField.
      *
      * @param[out] row The variable where the parsed row will be stored
@@ -344,7 +344,7 @@ namespace csv {
         // This fires when a single row spans more than 2 × _chunk_size bytes:
         // - chunk N fills without finding '\n' → _read_requested set to true
         // - chunk N+1 also fills without '\n' → guard fires here
-        // Default _chunk_size is ITERATION_CHUNK_SIZE (10 MB), so the threshold is
+        // Default _chunk_size is CSV_CHUNK_SIZE_DEFAULT (10 MB), so the threshold is
         // rows > 20 MB. Use CSVFormat::chunk_size() to raise the limit.
         if (this->_read_requested && this->records->empty()) {
             throw std::runtime_error(
```

include/internal/csv_reader.hpp

Lines changed: 4 additions & 4 deletions
```diff
@@ -250,7 +250,7 @@ namespace csv {
             other._n_rows = 0;
             other.header_trimmed = false;
             other._read_requested = false;
-            other._chunk_size = internals::ITERATION_CHUNK_SIZE;
+            other._chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
         }
 
         /** Move assignment.
@@ -287,7 +287,7 @@ namespace csv {
             other._n_rows = 0;
             other.header_trimmed = false;
             other._read_requested = false;
-            other._chunk_size = internals::ITERATION_CHUNK_SIZE;
+            other._chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
 
             return *this;
         }
@@ -373,7 +373,7 @@ namespace csv {
 
         /** @name Multi-Threaded File Reading Functions */
         ///@{
-        bool read_csv(size_t bytes = internals::ITERATION_CHUNK_SIZE);
+        bool read_csv(size_t bytes = internals::CSV_CHUNK_SIZE_DEFAULT);
         ///@}
 
         /**@}*/
@@ -386,7 +386,7 @@ namespace csv {
 #if CSV_ENABLE_THREADS
         std::thread read_csv_worker; /**< Worker thread for read_csv() */
 #endif
-        size_t _chunk_size = internals::ITERATION_CHUNK_SIZE; /**< Current chunk size in bytes */
+        size_t _chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT; /**< Current chunk size in bytes */
         bool _read_requested = false; /**< Flag to detect infinite read loops (Issue #218) */
         ///@}
 
```
