Be explicit about the datatypes of each column in csv files #68

Open
ablack3 wants to merge 3 commits into develop from issue65

ablack3 (Collaborator) commented Sep 18, 2024

We have Eunomia CDM datasets stored in csv files. Currently the datatype of each column is not explicitly specified when the data is read from csv, which is causing #65.

In this PR I'm using the specification in the CommonDataModel package to be explicit about the datatypes when we read the csv files, which should fix the issue. However, this does mean that the column order matters.

I'm not sure whether we consider column order (first, second, etc.) part of the CDM specification, but I noticed that in the GiBleed dataset the column order does not match the order in the CommonDataModel specification csv. We can work around it and/or fix the file. Allowing columns to be in any order is possible but a bit trickier.
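To illustrate the "any order" option, here is a minimal sketch of looking up each column's type by header name instead of by position. It is written in Python with the standard library only, purely for illustration, and is not the code in this PR. `SPEC`, `read_typed`, and the sample data are hypothetical stand-ins, not the real CommonDataModel specification csv.

```python
import csv
import io
from datetime import date

# Hypothetical excerpt of a column-type specification
# (a stand-in for the real CommonDataModel specification csv).
SPEC = {
    "person_id": int,
    "gender_concept_id": int,
    "birth_datetime": date.fromisoformat,
}


def read_typed(text, spec):
    """Parse csv text, converting each field by header name, not position."""
    reader = csv.DictReader(io.StringIO(text))
    unknown = set(reader.fieldnames or []) - set(spec)
    if unknown:
        raise ValueError(f"columns not in spec: {sorted(unknown)}")
    rows = []
    for raw in reader:
        # Empty fields become None; everything else goes through its converter.
        rows.append({
            name: (spec[name](value) if value != "" else None)
            for name, value in raw.items()
        })
    return rows


# Columns deliberately appear in a different order than in SPEC.
sample = "gender_concept_id,person_id,birth_datetime\n8507,1,1970-01-31\n"
rows = read_typed(sample, SPEC)
```

Because the converter is chosen by column name, a file whose columns are reordered still gets the intended types. The cost is that every header has to be matched against the specification first.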

ablack3 changed the base branch from main to develop September 18, 2024 08:12
ablack3 marked this pull request as draft September 18, 2024 08:12
ablack3 marked this pull request as ready for review September 18, 2024 08:28
ablack3 (Collaborator, Author) commented Sep 18, 2024

I need to investigate and fix the failing tests.

fdefalco (Collaborator) commented

Thanks for looking into this. It's another reason the duckdb-based data examples are a nice direction to go in.

fdefalco (Collaborator) commented

For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification, so would we rather update the data files to follow that column order as a fix?

ablack3 (Collaborator, Author) commented Sep 18, 2024

> For the column order, I would suggest that the data files should match the order of the columns defined by the CDM specification, so would we rather update the data files to follow that column order as a fix?

That would be my preference as well. So we would require csv files to have their columns in the same order as the CommonDataModel specification.
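If the order becomes a requirement, a check along these lines could flag files that drift from it. This is only a sketch in Python for illustration, not part of this PR. `EXPECTED` and `check_column_order` are hypothetical names, and the column list is a made-up excerpt rather than the real specification.

```python
import csv
import io

# Hypothetical expected order for one table; the real order would come
# from the CommonDataModel specification.
EXPECTED = ["person_id", "gender_concept_id", "year_of_birth"]


def check_column_order(text, expected):
    """Return None if the csv header equals `expected`, else a message."""
    header = next(csv.reader(io.StringIO(text)), [])
    if header == expected:
        return None
    if sorted(header) == sorted(expected):
        return f"columns are out of order: got {header}, expected {expected}"
    return f"columns differ from spec: got {header}, expected {expected}"


ok = check_column_order(
    "person_id,gender_concept_id,year_of_birth\n1,8507,1970\n", EXPECTED
)
bad = check_column_order(
    "gender_concept_id,person_id,year_of_birth\n8507,1,1970\n", EXPECTED
)
```

Here `ok` is `None` and `bad` carries an "out of order" message, so a reordered file can be told apart from one with missing or extra columns.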
