
Commit 672cdb2

Don't push empty rows (#271)

* Don't push empty rows

  The current NEWLINE behavior of reading as much CR/LF as possible means empty rows are, and should be, ignored. This fixes the issue of an empty row being pushed when a data chunk ends in the middle of such a CR/LF stream (either because of a block of empty lines or, if you are unlucky, because the chunk ended right between the CR and LF of a Windows line ending).

* Update basic_csv_parser.cpp

* Reduce minimum chunk size

  Also modified the chunk size stress tests to not use massive chunk sizes all the time.

---------

Co-authored-by: Vincent La <vincela9@gmail.com>
1 parent 429f37a commit 672cdb2

16 files changed

Lines changed: 173 additions & 68 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -31,7 +31,7 @@ CSVReader reader(infile, format);
 
 ## Threading: Worker + 10MB Chunks
 
-- Worker thread reads in 10MB chunks (`ITERATION_CHUNK_SIZE`)
+- Worker thread reads in 10MB chunks (`CSV_CHUNK_SIZE_DEFAULT`)
 - Communicates via `ThreadSafeDeque<CSVRow>`
 - Exceptions propagate via `std::exception_ptr`
 - Critical: Fields spanning chunk boundaries must not corrupt
```

CLAUDE.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@
 - Bugs can exist in one and not the other — always test both with Catch2 `SECTION`
 
 ## Threading
-- Worker thread reads 10MB chunks (`ITERATION_CHUNK_SIZE`)
+- Worker thread reads 10MB chunks (`CSV_CHUNK_SIZE_DEFAULT`)
 - Communication via `ThreadSafeDeque<CSVRow>`
 - Exceptions propagate via `std::exception_ptr`
 - Tests must use ≥500K rows to cross chunk boundary
```

README.md

Lines changed: 5 additions & 6 deletions
````diff
@@ -38,7 +38,7 @@
 ## Motivation
 I wanted a CSV library that was fast and reliable without forcing you into either:
 * A 1990s C-style API
-* A high-level wrapper that murders `malloc()` and your memory cache
+* A high-level wrapper that murders `malloc()` and your cache
 
 This library tries to be **fast for developers** and **fast for your computer**.
 
@@ -57,17 +57,16 @@ All benchmarks shown are warm cache runs to focus on parser/CPU performance rath
 
 #### Chunk Size Tuning
 
-By default, the parser reads CSV data in 10MB chunks. This balance was determined through empirical testing to optimize throughput while minimizing memory overhead and thread synchronization costs, but feel free to experiment and measure with different numbers yourself.
+By default, the parser reads CSV data in 10MB chunks. 10MB was chosen after empirical testing to optimize throughput while minimizing memory and thread synchronization costs, but feel free to experiment with different numbers yourself.
 
-If you encounter rows larger than the chunk size, pass a custom `CSVFormat` with `chunk_size()`:
+A custom `CSVFormat` with `chunk_size()` can be passed to:
+* Shrink the chunk size (down to a minimum of 500KB)
+* Expand the chunk size (necessary if you encounter very large rows)
 
 ```cpp
 CSVFormat fmt;
 fmt.chunk_size(100 * 1024 * 1024); // 100MB chunks
 CSVReader reader("massive_rows.csv", fmt);
-for (auto& row : reader) {
-    // Process row
-}
 ```
 
 ### Robust Yet Flexible
````

include/internal/basic_csv_parser.cpp

Lines changed: 6 additions & 4 deletions
```diff
@@ -165,9 +165,11 @@ namespace csv {
         while (this->data_pos_ < in.size() && parse_flag(in[this->data_pos_]) == ParseFlags::NEWLINE)
             this->data_pos_++;
 
-        // End of record -> Write record
-        this->push_field();
-        this->push_row();
+        // End of record -> Write non-empty record
+        if (this->field_length_ > 0 || !this->current_row_.empty()) {
+            this->push_field();
+            this->push_row();
+        }
 
         // Reset
         this->current_row_ = CSVRow(data_ptr_, this->data_pos_, fields_->size());
@@ -257,7 +259,7 @@ namespace csv {
 #pragma region Specializations
 #endif
 #if !defined(__EMSCRIPTEN__)
-    CSV_INLINE void MmapParser::next(size_t bytes = ITERATION_CHUNK_SIZE) {
+    CSV_INLINE void MmapParser::next(size_t bytes = CSV_CHUNK_SIZE_DEFAULT) {
         // CRITICAL SECTION: Chunk Transition Logic
         // This function reads 10MB chunks and must correctly handle fields that span
         // chunk boundaries. The 'remainder' calculation below ensures partial fields
```

include/internal/basic_csv_parser.hpp

Lines changed: 2 additions & 2 deletions
```diff
@@ -176,7 +176,7 @@ namespace csv {
         ///@}
 
         /** Whether or not source needs to be read in chunks */
-        CONSTEXPR bool no_chunk() const { return this->source_size_ < ITERATION_CHUNK_SIZE; }
+        CONSTEXPR bool no_chunk() const { return this->source_size_ < CSV_CHUNK_SIZE_DEFAULT; }
 
         /** Parse the current chunk of data *
          *
@@ -287,7 +287,7 @@ namespace csv {
 
         ~StreamParser() {}
 
-        void next(size_t bytes = ITERATION_CHUNK_SIZE) override {
+        void next(size_t bytes = CSV_CHUNK_SIZE_DEFAULT) override {
             if (this->eof()) return;
 
             // Reset parser state
```

include/internal/common.hpp

Lines changed: 10 additions & 3 deletions
```diff
@@ -204,17 +204,24 @@ namespace csv {
     const int PAGE_SIZE = 4096;
 #endif
 
-    /** Chunk size for lazy-loading large CSV files
+    /** Default chunk size for lazy-loading large CSV files
      *
-     * The worker thread reads this many bytes at a time (10MB).
+     * The worker thread reads this many bytes at a time by default (10MB).
      *
      * CRITICAL INVARIANT: Field boundaries at chunk transitions must be preserved.
      * Bug #280 was caused by fields spanning chunk boundaries being corrupted.
      *
      * @note Tests must write >10MB of data to cross chunk boundaries
      * @see basic_csv_parser.cpp MmapParser::next() for chunk transition logic
      */
-    constexpr size_t ITERATION_CHUNK_SIZE = 10000000; // 10MB
+    constexpr size_t CSV_CHUNK_SIZE_DEFAULT = 10000000; // 10MB
+
+    /** Minimum supported custom chunk size for CSVFormat::chunk_size().
+     *
+     * This lower bound allows memory-constrained environments to reduce parser
+     * buffer size while avoiding pathological tiny-buffer overhead.
+     */
+    constexpr size_t CSV_CHUNK_SIZE_FLOOR = 500 * 1024; // 500KB
 
     template<typename T>
     inline bool is_equal(T a, T b, T epsilon = 0.001) {
```

include/internal/csv_format.cpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -48,11 +48,11 @@ namespace csv {
     }
 
     CSV_INLINE CSVFormat& CSVFormat::chunk_size(size_t size) {
-        if (size < internals::ITERATION_CHUNK_SIZE) {
+        if (size < internals::CSV_CHUNK_SIZE_FLOOR) {
             throw std::invalid_argument(
                 "Chunk size must be at least " +
-                std::to_string(internals::ITERATION_CHUNK_SIZE) +
-                " bytes (10MB). Provided: " + std::to_string(size)
+                std::to_string(internals::CSV_CHUNK_SIZE_FLOOR) +
+                " bytes (500KB). Provided: " + std::to_string(size)
             );
         }
         this->_chunk_size = size;
```

include/internal/csv_format.hpp

Lines changed: 3 additions & 3 deletions
```diff
@@ -123,8 +123,8 @@ namespace csv {
 
     /** Sets the chunk size used when reading the CSV
      *
-     * @param[in] size Chunk size in bytes (minimum: 10MB = ITERATION_CHUNK_SIZE)
-     * @throws std::invalid_argument if size < ITERATION_CHUNK_SIZE
+     * @param[in] size Chunk size in bytes (minimum: CSV_CHUNK_SIZE_FLOOR)
+     * @throws std::invalid_argument if size < CSV_CHUNK_SIZE_FLOOR
      *
      * Use this when constructing a CSVReader from a filename and individual rows
      * may exceed the default 10MB chunk size. The value is passed to CSVReader at
@@ -198,6 +198,6 @@ namespace csv {
         ColumnNamePolicy _column_name_policy = ColumnNamePolicy::EXACT;
 
         /**< Chunk size for reading; passed to CSVReader at construction time */
-        size_t _chunk_size = internals::ITERATION_CHUNK_SIZE;
+        size_t _chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
     };
 }
```

include/internal/csv_reader.cpp

Lines changed: 2 additions & 2 deletions
```diff
@@ -306,7 +306,7 @@ namespace csv {
      * Retrieve rows as CSVRow objects, returning true if more rows are available.
      *
      * @par Performance Notes
-     * - Reads chunks of data that are csv::internals::ITERATION_CHUNK_SIZE bytes large at a time
+     * - Reads chunks of data that are csv::internals::CSV_CHUNK_SIZE_DEFAULT bytes large at a time
      * - For performance details, read the documentation for CSVRow and CSVField.
      *
      * @param[out] row The variable where the parsed row will be stored
@@ -344,7 +344,7 @@ namespace csv {
         // This fires when a single row spans more than 2 × _chunk_size bytes:
         // - chunk N fills without finding '\n' → _read_requested set to true
         // - chunk N+1 also fills without '\n' → guard fires here
-        // Default _chunk_size is ITERATION_CHUNK_SIZE (10 MB), so the threshold is
+        // Default _chunk_size is CSV_CHUNK_SIZE_DEFAULT (10 MB), so the threshold is
         // rows > 20 MB. Use CSVFormat::chunk_size() to raise the limit.
         if (this->_read_requested && this->records->empty()) {
             throw std::runtime_error(
```

include/internal/csv_reader.hpp

Lines changed: 4 additions & 4 deletions
```diff
@@ -250,7 +250,7 @@ namespace csv {
             other._n_rows = 0;
             other.header_trimmed = false;
             other._read_requested = false;
-            other._chunk_size = internals::ITERATION_CHUNK_SIZE;
+            other._chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
         }
 
         /** Move assignment.
@@ -287,7 +287,7 @@ namespace csv {
             other._n_rows = 0;
             other.header_trimmed = false;
             other._read_requested = false;
-            other._chunk_size = internals::ITERATION_CHUNK_SIZE;
+            other._chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT;
 
             return *this;
         }
@@ -373,7 +373,7 @@ namespace csv {
 
         /** @name Multi-Threaded File Reading Functions */
         ///@{
-        bool read_csv(size_t bytes = internals::ITERATION_CHUNK_SIZE);
+        bool read_csv(size_t bytes = internals::CSV_CHUNK_SIZE_DEFAULT);
         ///@}
 
         /**@}*/
@@ -386,7 +386,7 @@ namespace csv {
 #if CSV_ENABLE_THREADS
         std::thread read_csv_worker; /**< Worker thread for read_csv() */
 #endif
-        size_t _chunk_size = internals::ITERATION_CHUNK_SIZE; /**< Current chunk size in bytes */
+        size_t _chunk_size = internals::CSV_CHUNK_SIZE_DEFAULT; /**< Current chunk size in bytes */
         bool _read_requested = false; /**< Flag to detect infinite read loops (Issue #218) */
         ///@}
 
```
