Skip to content

1078 - Refactor csv_importer.py and csv_to_database.py #1103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: dev
Choose a base branch
from

Conversation

BrainIsDead
Copy link
Contributor

@BrainIsDead BrainIsDead commented Mar 1, 2023

closes 1078

  • csv_to_database functionality moved to csv_importer (csv_to_database deleted)
  • fixed imports
  • changed mocks
  • added pandas errors exceptions to pd.read_csv in the load_csv method
  • changed is_header_valid method with issubset method

… to `csv_importer`, fixed imports, changed mocks
@BrainIsDead BrainIsDead changed the title csv importer with issues 1078 - Refactor csv_importer.py and csv_to_database.py Mar 3, 2023
@BrainIsDead BrainIsDead marked this pull request as ready for review March 3, 2023 14:48
@BrainIsDead BrainIsDead requested a review from melange396 March 3, 2023 15:01
@krivard
Copy link
Contributor

krivard commented Mar 3, 2023

NB releasing this will need to be coordinated with configuration changes to Automation and/or Cronicle in prod.

Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start, but i think we need to change another test file -- since we got rid of csv_to_database.py and added its content to csv_importer.py (in src/acquisition/covidcast/), we should merge test_csv_to_database.py into test_csv_importer.py (in tests/acquisition/covidcast/)

@@ -414,3 +426,169 @@ def load_csv(filepath: str, details: PathDetails) -> Iterator[Optional[Covidcast
details.issue,
details.lag,
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from this line down, all this text is a straight copy-paste from csv_to_database.py (line 15 down) with no modifications, right?

@dshemetov
Copy link
Contributor

dshemetov commented Mar 3, 2023

Want to point out that this is a partial fix of #1078. This PR can handle merging the files together and adding some Pandas exceptions, but I'm hoping that in another PR we can:

  • rewrite load_csv to not loop over dataframe with df.itertuples and instead rely on Pandas data type-casting to do most of our validation
  • all the floaty_int and maybe_apply stuff should be deleted
  • extract_and_check_row is a very slow and manual way of going about the same things that can be done in Pandas column-by-column;
  • the way extract_and_check_row return a (Value, Error) tuple is unnecessary, especially because a CSV file is archived as failed if a single row fails, so we should just detect the failure and return immediately (instead of parsing the rest of the CSV); it would be a bit challenging to reproduce the same error log as current code with this setup (the current code returns the string of the full row that failed; while the Pandas code would just give an error for a full column; we could use Pandas to find the offending row though, probably)

@BrainIsDead BrainIsDead requested a review from melange396 March 7, 2023 15:04
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented extract_and_check_row, validate_missing_code, validate_quantity and other validations with pandas. I found the only way to create CovidcastRows - with itertuples.
Not sure if we need to save @staticmethods it makes no sense if we not using them somewhere else

@BrainIsDead BrainIsDead requested a review from dshemetov March 29, 2023 20:08
* rely on Pandas built-ins more for speed
* assume Pandas data types are not strings
* update tests
@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 14 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be good to go, as long as we prepare a rollback / hotfix plan in case there is a bug here undetected by our tests/integrations.

After this is merged, we should

  • watch the ingestion logs in Kibana (Epidata dashboard) and make sure they look healthy
  • get some queries ready to validate that the new data can be pulled from the database and that it looks fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor csv_importer.py and csv_to_database.py and covidcast_row.py
4 participants