-
Notifications
You must be signed in to change notification settings - Fork 68
[Refactor] Further cleans of csv_importer
#1072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* add type hints * add consistent spacings
* simplify and unify CovidcastRow creation from file * convert CSV file details tuple to NamedTuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🏆 refactoring from *_impl
to @patch
has been on my irklist for ages
some questions and a suggestion but i wouldn't be mad if this merged as-is
experimentally using the Conventional Comments approach
from dataclasses import dataclass | ||
from datetime import date | ||
from glob import glob | ||
from typing import Iterator, NamedTuple, Optional, Tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: keep these in alphabetical order
we use pylint to enforce alphabetical import order in covidcast-indicators, and it would be good to be consistent across repositories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explanation:
So I'm using a Python import sort extension in my IDE and it likes to:
- have three sections separated by two new lines: Python STD, third party, and first party
- place bare
import X
statements abovefrom X import Y
statements - sort alphabetically within these sections
So that's the convention I followed here. I think we were sort-of doing the first convention already, not really the second, and doing the third.
question: do you like the second requirement? should I just go with just 1 and 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is an open ended question that we can address in a more comprehensive push for broader linting/formatting goals. Gonna keep as is for now.
|
||
# first party | ||
from delphi_utils import Nans | ||
from delphi.utils.epiweek import delta_epiweeks | ||
from .logger import get_structured_logger | ||
from delphi.epidata.acquisition.covidcast.database import CovidcastRow | ||
from delphi.epidata.acquisition.covidcast.logger import get_structured_logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: fully-qualified name ftw!
this will make our lives easier if/when we eventually convert the delphi-epidata codebase to actual python modules, or if/when we switch to a single logger behavior definition everywhere
from delphi.epidata.acquisition.covidcast.logger import get_structured_logger | ||
|
||
DFRow = NamedTuple('DFRow', [('geo_id', str), ('value', float), ('stderr', float), ('sample_size', float), ('missing_value', int), ('missing_stderr', int), ('missing_sample_size', int)]) | ||
PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: is the ordering on this funky?
I would've expected either path order (issue, lag, source, signal, time_type, time_value, geo_type) or column order (source, signal, geo_type, issue, time_type, time_value, lag), with a preference for path order. ...i've just come back to this comment after reading further down the page, and it looks like this is just a legacy ordering that probably emerged as features were added over time. worth keeping or change it to something more natural?
nitpick: mixed quotes
PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)]) | |
PathDetails = NamedTuple('PathDetails', [('source', str), ('signal', str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, it's just a legacy ordering. I can change it to path ordering for now, though it may be something worth revisiting later.
csv_rows = CsvImporter.load_csv(path, details) | ||
rows_list = list(csv_rows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: this is an enormous improvement
way more readable and less dependent on catching small errors in long tuple definitions. I have no idea why load_csv wasn't returning a CovidcastRow iterator to begin with.
@@ -42,37 +42,38 @@ def test_is_sane_week(self): | |||
self.assertFalse(CsvImporter.is_sane_week(202054)) | |||
self.assertFalse(CsvImporter.is_sane_week(20200418)) | |||
|
|||
|
|||
@patch("delphi.epidata.acquisition.covidcast.csv_importer.glob") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why does this patch location need to be delphi.epidata.acquisition.covidcast.csv_importer
when the patch for isdir
on the line below can be in os.path
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the TLDR of this is due to Python imports
csv_importer
usesimport os
, so we can patchos.path
, becausecsv_importer
points toos.path
csv_importer
usesfrom glob import glob
, which makes a new variableglob
in thecsv_importer
namespace with the same value asglob.glob
, but if we patchglob.glob
, which updates that value to a Mock,csv_importer.glob
retains the old value, so it misses the patch
So the end result is that you generally want to patch where a name is used, not a module itself.
Fix mixed quotes Co-authored-by: Katie Mazaitis <[email protected]>
Co-authored-by: Katie Mazaitis <[email protected]>
* update covidcast_nowcast * update tests
Thanks @krivard ! Just going to merge this one in, we can add any further feedback to another issue. |
Ok guess not, whenever @melange396 gets to it. |
@dshemetov yeah I have it set to dismiss reviews when new commits are added; we had an intern that merged some unreviewed code a while back and it seemed prudent at the time. it's kindof a pain if we don't need it though. would you prefer to wait for George or want me to override and merge now? |
I'd prefer to just merge it in, since George won't be back until Wednesday and I don't think this is that complicated of a change |
Prerequisites:
dev
branchdev
Summary
A follow up to #949. There were just some pieces of this code that really bothered me cause they were unnecessarily confusing to read every time I went there.