-
Notifications
You must be signed in to change notification settings - Fork 67
1078 - Refactor csv_importer.py and csv_to_database.py #1103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
… to `csv_importer`, fixed imports, changed mocks
NB releasing this will need to be coordinated with configuration changes to Automation and/or Cronicle in prod. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great start, but i think we need to change another test file -- since we got rid of csv_to_database.py
and added its content to csv_importer.py
(in src/acquisition/covidcast/
), we should merge test_csv_to_database.py
into test_csv_importer.py
(in tests/acquisition/covidcast/
)
@@ -414,3 +426,169 @@ def load_csv(filepath: str, details: PathDetails) -> Iterator[Optional[Covidcast | |||
details.issue, | |||
details.lag, | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from this line down, all this text is a straight copy-paste from csv_to_database.py
(line 15 down) with no modifications, right?
Want to point out that this is a partial fix of #1078. This PR can handle merging the files together and adding some Pandas exceptions, but I'm hoping that in another PR we can:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented extract_and_check_row
, validate_missing_code
, validate_quantity
and other validations with pandas. I found the only way to create CovidcastRows
- with itertuples.
Not sure if we need to save @staticmethod
s it makes no sense if we not using them somewhere else
* rely on Pandas built-ins more for speed * assume Pandas data types are not strings * update tests
Refactor csv_importer.py with Pandas
Kudos, SonarCloud Quality Gate passed!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be good to go, as long as we prepare a rollback / hotfix plan in case there is a bug here undetected by our tests/integrations.
After this is merged, we should
- watch the ingestion logs in Kibana (Epidata dashboard) and make sure they look healthy
- get some queries ready to validate that the new data can be pulled from the database and that it looks fine
closes 1078
csv_to_database
functionality moved tocsv_importer
(csv_to_database deleted)pd.read_csv
in theload_csv
methodis_header_valid
method withissubset
method