Commit 50583f8

2.5.0: Add DataFrame for enhanced data science-based workflows (#290)
* Add DataFrame
* Replace std::exception with more portable alternative
* Added CSVField::try_get() methods
* Fix #267
* Update README.md
1 parent 729d767 commit 50583f8

25 files changed

Lines changed: 7725 additions & 7955 deletions

.claude/rules/testing-conventions.md

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+---
+paths:
+- "tests/**/*"
+---
+
 # Testing Conventions for AI Agents

 ## Rule: Tests Should Expose Bugs, Not Assert Them

.github/codeql/codeql-config.yml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+name: "CodeQL config"
+
+paths-ignore:
+- "tests/**"
+- "single_include_test/**"
+- "**/tests/**"

.github/workflows/codeql.yml

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ jobs:
   with:
     language: ${{ matrix.language }}
     queries: security-and-quality
+    config-file: ./.github/codeql/codeql-config.yml

 - name: Install dependencies
   run: |

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 # Custom Settings
 CMakeLists2.txt
+.personal-todo.md

 # Build
 bin/

CMakeLists.txt

Lines changed: 2 additions & 1 deletion
@@ -26,7 +26,8 @@ if(MSVC)
     # Make Visual Studio report accurate C++ version
     # See: https://devblogs.microsoft.com/cppblog/msvc-now-correctly-reports-__cplusplus/
     # /Wall emits warnings about the C++ standard library
-    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /EHsc /GS- /Zc:__cplusplus /W4")
+    # /permissive- enables standards conformance (disables MSVC extensions)
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /EHsc /GS- /Zc:__cplusplus /W4 /permissive-")
 else()
     # Ignore Visual Studio pragma regions
     set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unknown-pragmas")

README.md

Lines changed: 131 additions & 2 deletions
@@ -29,6 +29,7 @@
 - [Handling Variable Numbers of Columns](#handling-variable-numbers-of-columns)
 - [Setting Column Names](#setting-column-names)
 - [Parsing an In-Memory String](#parsing-an-in-memory-string)
+- [DataFrames for Random Access and Updates](#dataframes-for-random-access-and-updates)
 - [Writing CSV Files](#writing-csv-files)

 ## Motivation
@@ -38,7 +39,7 @@ There's plenty of other CSV parsers in the wild, but I had a hard time finding w
 A high performance CSV parser allows you to take advantage of the deluge of large datasets available. By using overlapped threads, memory mapped IO, and
 minimal memory allocation, this parser can quickly tackle large CSV files--even if they are larger than RAM.

-In fact, [according to Visual Studio's profier](https://github.com/vincentlaucsb/csv-parser/wiki/Microsoft-Visual-Studio-CPU-Profiling-Results) this
+In fact, [according to Visual Studio's profiler](https://github.com/vincentlaucsb/csv-parser/wiki/Microsoft-Visual-Studio-CPU-Profiling-Results) this
 CSV parser **spends almost 90% of its CPU cycles actually reading your data** as opposed to getting hung up in hard disk I/O or pushing around memory.

 #### Show me the numbers
@@ -103,6 +104,9 @@ In addition to the [Features & Examples](#features--examples) below, a [fully-fl
 If you use this library for work, please [become a sponsor](https://github.com/sponsors/vincentlaucsb). Your donation
 will fund continued maintenance and development of the project.

+Shameless plug: If you like this library, check out my side project
+[experiencer](https://github.com/vincentlaucsb/experiencer), a WYSIWYG resume editor with clean HTML/CSS output.
+
 ## Integration

 This library was developed with Microsoft Visual Studio and is compatible with >g++ 7.5 and clang.
@@ -248,6 +252,7 @@ for (auto& row: reader) {
 If your CSV has lots of numeric values, you can also have this parser (lazily)
 convert them to the proper data type.

+* `try_get<T>()` is a non-throwing version of `get<T>()` that returns a `bool` indicating whether the conversion succeeded
 * Type checking is performed on conversions to prevent undefined behavior and integer overflow
 * Negative numbers cannot be blindly converted to unsigned integer types
 * `get<float>()`, `get<double>()`, and `get<long double>()` are capable of parsing numbers written in scientific notation.
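The non-throwing contract described by the `try_get<T>()` bullet can be illustrated without the library. The sketch below is not csv-parser's implementation: `try_parse_int` is a hypothetical helper built on `std::strtol`, and leaving the output untouched on failure is an assumption made here for illustration.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper illustrating a try_get-style contract.
// Not part of csv-parser; the library's own parsing logic may differ.
// Note: std::strtol skips leading whitespace, so " 42" is accepted.
bool try_parse_int(const std::string& text, int& out) {
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, 10);

    // Reject input with no digits and input with trailing characters.
    if (end == text.c_str() || *end != '\0') return false;
    // Reject values that overflow long or do not fit in int.
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) return false;

    out = static_cast<int>(value);  // only written on success
    return true;
}
```

Callers branch on the returned `bool` instead of wrapping the conversion in a `try`/`catch`, which is the same shape as the `if (row["timestamp"].try_get(timestamp))` usage in the README.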
@@ -263,6 +268,12 @@ using namespace csv;
 CSVReader reader("very_big_file.csv");

 for (auto& row: reader) {
+    int timestamp = 0;
+    if (row["timestamp"].try_get(timestamp)) {
+        // Non-throwing conversion
+        std::cout << "Timestamp: " << timestamp << std::endl;
+    }
+
     if (row["timestamp"].is_int()) {
         // Can use get<>() with any integer type, but negative
         // numbers cannot be converted to unsigned types
@@ -340,7 +351,7 @@ format.delimiter('\t')
 // Alternatively, we can use format.delimiter({ '\t', ',', ... })
 // to tell the CSV guesser which delimiters to try out

-CSVReader reader("wierd_csv_dialect.csv", format);
+CSVReader reader("weird_csv_dialect.csv", format);

 for (auto& row: reader) {
     // Do stuff with rows here
@@ -418,6 +429,124 @@ for (auto& r: rows) {

 ```

+### DataFrames for Random Access and Updates
+
+For files that fit comfortably in memory, `DataFrame` provides keyed access, in-place updates, and grouping operations. It uses the same parsing pipeline as `CSVReader` but retains the results in memory for random access.
+
+**Creating a DataFrame with Keyed Access**
+```cpp
+#include "csv.hpp"
+
+using namespace csv;
+
+...
+
+// Shortest form: pass a filename directly with DataFrameOptions
+DataFrame<int> df("employees.csv",
+    DataFrameOptions().set_key_column("employee_id"));
+
+// Or construct from an existing CSVReader (e.g. when you need a custom format)
+CSVReader reader("employees.csv");
+DataFrame<int> df2(reader, "employee_id");
+
+// O(1) lookups by key
+auto salary = df[12345]["salary"].get<double>();
+
+// Access by position also works
+auto first_row = df[0];
+auto name = first_row["name"].get<std::string>();
+
+// Check if a key exists
+if (df.contains(99999)) {
+    std::cout << "Employee exists" << std::endl;
+}
+```
+
+**Using DataFrameOptions for Fine-Grained Control**
+```cpp
+// Configure key column, duplicate-key policy, and missing-key behaviour
+DataFrameOptions opts;
+opts.set_key_column("employee_id")
+    .set_duplicate_key_policy(
+        DataFrameOptions::DuplicateKeyPolicy::KEEP_FIRST)  // or OVERWRITE / THROW
+    .set_throw_on_missing_key(false);  // silently skip rows with no key value
+
+DataFrame<int> df("employees.csv", opts);
+```
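The policy names in the snippet above (`KEEP_FIRST`, `OVERWRITE`, `THROW`) suggest how duplicate keys might be resolved while a key index is built. The following is a rough sketch of those semantics as their names imply them, not the library's code: the standalone enum and the `build_index` function are hypothetical stand-ins, and the real `DataFrameOptions` behaviour may differ in detail.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for DataFrameOptions::DuplicateKeyPolicy.
enum class DuplicateKeyPolicy { KEEP_FIRST, OVERWRITE, THROW };

// Hypothetical helper: maps each key to the index of the row that "wins".
std::unordered_map<int, std::size_t>
build_index(const std::vector<int>& keys, DuplicateKeyPolicy policy) {
    std::unordered_map<int, std::size_t> index;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        auto result = index.emplace(keys[i], i);
        if (result.second) continue;  // key seen for the first time

        switch (policy) {
            case DuplicateKeyPolicy::KEEP_FIRST:
                break;  // leave the earlier row in place
            case DuplicateKeyPolicy::OVERWRITE:
                result.first->second = i;  // the later row replaces it
                break;
            case DuplicateKeyPolicy::THROW:
                throw std::runtime_error(
                    "duplicate key: " + std::to_string(keys[i]));
        }
    }
    return index;
}
```

With keys `{7, 8, 7}`, this sketch maps key 7 to row 0 under `KEEP_FIRST`, to row 2 under `OVERWRITE`, and raises `std::runtime_error` under `THROW`.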
+
+**Creating a DataFrame with a Custom Key Function**
+```cpp
+CSVReader reader("employees.csv");
+
+// Build a composite key from two columns
+auto make_key = [](const CSVRow& row) {
+    return row["first_name"].get<std::string>() + "_" +
+        row["last_name"].get<std::string>();
+};
+
+DataFrame<std::string> by_name(reader, make_key);
+
+// Lookups by composite key
+auto employee = by_name["Ada_Lovelace"]["department"].get<std::string>();
+```
+
+**Updating Values**
+```cpp
+// Updates are stored in an efficient overlay without copying the entire dataset
+df.set(12345, "salary", "95000");
+df.set(67890, "department", "Engineering");
+
+// Access methods return updated values transparently
+std::cout << df[12345]["salary"].get<std::string>();  // "95000"
+
+// Iterate with edits visible
+for (auto& row : df) {
+    std::cout << row["salary"].get<std::string>();  // Shows edited values
+}
+```
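The comment about updates being "stored in an efficient overlay" describes a common design: keep the parsed rows immutable and record edits in a side table that reads consult first. The sketch below illustrates that idea only. `OverlayTable` and its members are hypothetical and are not taken from csv-parser's `DataFrame` source.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration of an edit overlay; not csv-parser's DataFrame.
class OverlayTable {
public:
    using Cell = std::pair<std::size_t, std::size_t>;  // (row, column)

    explicit OverlayTable(std::vector<std::vector<std::string>> rows)
        : base_(std::move(rows)) {}

    // Record an edit in the overlay; the base rows are never modified.
    void set(std::size_t row, std::size_t col, std::string value) {
        edits_[Cell(row, col)] = std::move(value);
    }

    // Reads check the overlay first and fall back to the base data.
    const std::string& get(std::size_t row, std::size_t col) const {
        const auto it = edits_.find(Cell(row, col));
        return it != edits_.end() ? it->second : base_.at(row).at(col);
    }

    std::size_t edit_count() const { return edits_.size(); }

private:
    std::vector<std::vector<std::string>> base_;
    std::map<Cell, std::string> edits_;
};
```

The cost of an edit in this sketch is proportional to the size of the overlay, not the dataset, which is the property the README comment points at.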
+
+**Grouping and Analysis**
+```cpp
+// Group by department
+auto groups = df.group_by("department");
+for (auto& [dept, row_indices] : groups) {
+    double total_salary = 0;
+    for (size_t i : row_indices) {
+        total_salary += df[i]["salary"].get<double>();
+    }
+    std::cout << dept << " total: $" << total_salary << std::endl;
+}
+
+// Group using a custom function
+auto by_salary_range = df.group_by([](const CSVRow& row) {
+    double salary = row["salary"].get<double>();
+    return salary < 50000 ? "junior" : salary < 100000 ? "mid" : "senior";
+});
+```
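The grouping example above iterates over pairs of a group key and a list of row indices. A minimal sketch of how such a mapping can be built is shown here. `group_indices` is a hypothetical free function written for illustration, and the actual return type and signature of `DataFrame::group_by` are not confirmed by this diff.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: buckets row indices by the key computed for each row.
// key_of may return std::string or anything convertible to it.
template <typename Row, typename KeyFunc>
std::unordered_map<std::string, std::vector<std::size_t>>
group_indices(const std::vector<Row>& rows, KeyFunc key_of) {
    std::unordered_map<std::string, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        // operator[] creates an empty bucket the first time a key appears.
        groups[key_of(rows[i])].push_back(i);
    }
    return groups;
}
```

Storing keys as `std::string` in this sketch means that keys produced from string literals such as `"junior"` are compared by content.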
+
+**Writing Back to CSV**
+
+Each `DataFrameRow` has an implicit conversion to `std::vector<std::string>`,
+which is convenient when using `CSVWriter`.
+
+```cpp
+// DataFrameRow has implicit conversion for CSVWriter compatibility
+auto writer = make_csv_writer(std::cout);
+for (auto& row : df) {
+    writer << row;  // Outputs edited values
+}
+```
+
+**When to Use DataFrame vs. CSVReader:**
+- **Use CSVReader** for: Large files (>1GB), streaming pipelines, minimal memory footprint
+- **Use DataFrame** for: Files that fit in RAM, frequent lookups/updates, grouping operations, data that needs random access
+
+**When Not to Use DataFrame:**
+- Extremely large files that do not fit in RAM
+- Streaming pipelines where you only need single-pass access
+
+Both options deliver the same parsing performance; `DataFrame` simply keeps the results in memory for convenience.
+
 ### Writing CSV Files

 ```cpp

include/csv.hpp

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,5 @@
 /*
-CSV for C++, version 2.4.2
+CSV for C++, version 2.5.0
 https://github.com/vincentlaucsb/csv-parser

 MIT License
@@ -29,6 +29,7 @@ SOFTWARE.
 #ifndef CSV_HPP
 #define CSV_HPP

+#include "internal/data_frame.hpp"
 #include "internal/csv_reader.hpp"
 #include "internal/csv_stat.hpp"
 #include "internal/csv_utility.hpp"
