Skip to content

[Refactor] Further cleans of csv_importer #1072

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 27, 2023
Merged

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Jan 21, 2023

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

A follow up to #949. There were just some pieces of this code that really bothered me cause they were unnecessarily confusing to read every time I went there.

  • Remove a bunch of unnecessary function arguments by correctly patching in tests
  • Add type hints to many functions
  • Simplify the logic that returns a list of CovidcastRows for database import
  • Convert the path details tuple to a NamedTuples, so it's less of a mystery

* add type hints
* add consistent spacings
* simplify and unify CovidcastRow creation from file
* convert CSV file details tuple to NamedTuple
krivard
krivard previously approved these changes Jan 25, 2023
Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏆 refactoring from *_impl to @patch has been on my irklist for ages

some questions and a suggestion but i wouldn't be mad if this merged as-is

experimentally using the Conventional Comments approach

Comment on lines +6 to +9
from dataclasses import dataclass
from datetime import date
from glob import glob
from typing import Iterator, NamedTuple, Optional, Tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: keep these in alphabetical order

we use pylint to enforce alphabetical import order in covidcast-indicators, and it would be good to be consistent across repositories.

Copy link
Contributor Author

@dshemetov dshemetov Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explanation:

So I'm using a Python import sort extension in my IDE and it likes to:

  • have three sections separated by two new lines: Python STD, third party, and first party
  • place bare import X statements above from X import Y statements
  • sort alphabetically within these sections

So that's the convention I followed here. I think we were sort-of doing the first convention already, not really the second, and doing the third.

question: do you like the second requirement? should I just go with just 1 and 3?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is an open ended question that we can address in a more comprehensive push for broader linting/formatting goals. Gonna keep as is for now.


# first party
from delphi_utils import Nans
from delphi.utils.epiweek import delta_epiweeks
from .logger import get_structured_logger
from delphi.epidata.acquisition.covidcast.database import CovidcastRow
from delphi.epidata.acquisition.covidcast.logger import get_structured_logger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: fully-qualified name ftw!

this will make our lives easier if/when we eventually convert the delphi-epidata codebase to actual python modules, or if/when we switch to a single logger behavior definition everywhere

from delphi.epidata.acquisition.covidcast.logger import get_structured_logger

DFRow = NamedTuple('DFRow', [('geo_id', str), ('value', float), ('stderr', float), ('sample_size', float), ('missing_value', int), ('missing_stderr', int), ('missing_sample_size', int)])
PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is the ordering on this funky?

I would've expected either path order (issue, lag, source, signal, time_type, time_value, geo_type) or column order (source, signal, geo_type, issue, time_type, time_value, lag), with a preference for path order. ...i've just come back to this comment after reading further down the page, and it looks like this is just a legacy ordering that probably emerged as features were added over time. worth keeping or change it to something more natural?

nitpick: mixed quotes

Suggested change
PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])
PathDetails = NamedTuple('PathDetails', [('source', str), ('signal', str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it's just a legacy ordering. I can change it to path ordering for now, though it may be something worth revisiting later.

Comment on lines +106 to +107
csv_rows = CsvImporter.load_csv(path, details)
rows_list = list(csv_rows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: this is an enormous improvement

way more readable and less dependent on catching small errors in long tuple definitions. I have no idea why load_csv wasn't returning a CovidcastRow iterator to begin with.

@@ -42,37 +42,38 @@ def test_is_sane_week(self):
self.assertFalse(CsvImporter.is_sane_week(202054))
self.assertFalse(CsvImporter.is_sane_week(20200418))


@patch("delphi.epidata.acquisition.covidcast.csv_importer.glob")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why does this patch location need to be delphi.epidata.acquisition.covidcast.csv_importer when the patch for isdir on the line below can be in os.path?

Copy link
Contributor Author

@dshemetov dshemetov Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the TLDR of this is due to Python imports

  • csv_importer uses import os, so we can patch os.path, because csv_importer points to os.path
  • csv_importer uses from glob import glob, which makes a new variable glob in the csv_importer namespace with the same value as glob.glob, but if we patch glob.glob, which updates that value to a Mock, csv_importer.glob retains the old value, so it misses the patch

So the end result is that you generally want to patch where a name is used, not a module itself.

Here's a more complete blog post on this.

Fix mixed quotes

Co-authored-by: Katie Mazaitis <[email protected]>
@dshemetov
Copy link
Contributor Author

Thanks @krivard ! Just going to merge this one in, we can add any further feedback to another issue.

@dshemetov dshemetov requested review from melange396 and removed request for melange396 January 25, 2023 22:19
@dshemetov
Copy link
Contributor Author

Ok guess not, whenever @melange396 gets to it.

@krivard
Copy link
Contributor

krivard commented Jan 27, 2023

@dshemetov yeah I have it set to dismiss reviews when new commits are added; we had an intern that merged some unreviewed code a while back and it seemed prudent at the time. it's kindof a pain if we don't need it though. would you prefer to wait for George or want me to override and merge now?

@dshemetov
Copy link
Contributor Author

I'd prefer to just merge it in, since George won't be back until Wednesday and I don't think this is that complicated of a change

@krivard krivard merged commit 9ab7840 into dev Jan 27, 2023
@krivard krivard deleted the ds/refactor-csv-importer branch January 27, 2023 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants