[Refactor] Further cleans of `csv_importer` #1072

dshemetov · 2023-01-21T12:46:45Z

Prerequisites:

Unless it is a documentation hotfix it should be merged against the dev branch
Branch is up-to-date with the branch to be merged with, i.e. dev
Build is successful
Code is cleaned up and formatted

Summary

A follow up to #949. There were just some pieces of this code that really bothered me cause they were unnecessarily confusing to read every time I went there.

Remove a bunch of unnecessary function arguments by correctly patching in tests
Add type hints to many functions
Simplify the logic that returns a list of CovidcastRows for database import
Convert the path details tuple to a NamedTuples, so it's less of a mystery

* add type hints * add consistent spacings

* simplify and unify CovidcastRow creation from file * convert CSV file details tuple to NamedTuple

krivard

🏆 refactoring from *_impl to @patch has been on my irklist for ages

some questions and a suggestion but i wouldn't be mad if this merged as-is

experimentally using the Conventional Comments approach

krivard · 2023-01-25T15:41:39Z

src/acquisition/covidcast/csv_importer.py

+from dataclasses import dataclass
+from datetime import date
+from glob import glob
+from typing import Iterator, NamedTuple, Optional, Tuple


suggestion: keep these in alphabetical order

we use pylint to enforce alphabetical import order in covidcast-indicators, and it would be good to be consistent across repositories.

explanation:

So I'm using a Python import sort extension in my IDE and it likes to:

have three sections separated by two new lines: Python STD, third party, and first party

place bare import X statements above from X import Y statements

sort alphabetically within these sections

So that's the convention I followed here. I think we were sort-of doing the first convention already, not really the second, and doing the third.

question: do you like the second requirement? should I just go with just 1 and 3?

I feel like this is an open ended question that we can address in a more comprehensive push for broader linting/formatting goals. Gonna keep as is for now.

krivard · 2023-01-25T15:42:17Z

src/acquisition/covidcast/csv_importer.py


 # first party
 from delphi_utils import Nans
 from delphi.utils.epiweek import delta_epiweeks
-from .logger import get_structured_logger
+from delphi.epidata.acquisition.covidcast.database import CovidcastRow
+from delphi.epidata.acquisition.covidcast.logger import get_structured_logger


praise: fully-qualified name ftw!

this will make our lives easier if/when we eventually convert the delphi-epidata codebase to actual python modules, or if/when we switch to a single logger behavior definition everywhere

krivard · 2023-01-25T15:51:57Z

src/acquisition/covidcast/csv_importer.py

+from delphi.epidata.acquisition.covidcast.logger import get_structured_logger
+
+DFRow = NamedTuple('DFRow', [('geo_id', str), ('value', float), ('stderr', float), ('sample_size', float), ('missing_value', int), ('missing_stderr', int), ('missing_sample_size', int)])
+PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])


question: is the ordering on this funky?

I would've expected either path order (issue, lag, source, signal, time_type, time_value, geo_type) or column order (source, signal, geo_type, issue, time_type, time_value, lag), with a preference for path order. ...i've just come back to this comment after reading further down the page, and it looks like this is just a legacy ordering that probably emerged as features were added over time. worth keeping or change it to something more natural?

nitpick: mixed quotes

Suggested change

PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])

PathDetails = NamedTuple('PathDetails', [('source', str), ('signal', str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])

Yea, it's just a legacy ordering. I can change it to path ordering for now, though it may be something worth revisiting later.

krivard · 2023-01-25T19:10:56Z

src/acquisition/covidcast/csv_to_database.py

+    csv_rows = CsvImporter.load_csv(path, details)
+    rows_list = list(csv_rows)


praise: this is an enormous improvement

way more readable and less dependent on catching small errors in long tuple definitions. I have no idea why load_csv wasn't returning a CovidcastRow iterator to begin with.

krivard · 2023-01-25T19:13:45Z

tests/acquisition/covidcast/test_csv_importer.py

@@ -42,37 +42,38 @@ def test_is_sane_week(self):
    self.assertFalse(CsvImporter.is_sane_week(202054))
    self.assertFalse(CsvImporter.is_sane_week(20200418))

+
+  @patch("delphi.epidata.acquisition.covidcast.csv_importer.glob")


question: why does this patch location need to be delphi.epidata.acquisition.covidcast.csv_importer when the patch for isdir on the line below can be in os.path?

So the TLDR of this is due to Python imports

csv_importer uses import os, so we can patch os.path, because csv_importer points to os.path

csv_importer uses from glob import glob, which makes a new variable glob in the csv_importer namespace with the same value as glob.glob, but if we patch glob.glob, which updates that value to a Mock, csv_importer.glob retains the old value, so it misses the patch

So the end result is that you generally want to patch where a name is used, not a module itself.

Here's a more complete blog post on this.

tests/acquisition/covidcast/test_csv_to_database.py

Fix mixed quotes Co-authored-by: Katie Mazaitis <[email protected]>

Co-authored-by: Katie Mazaitis <[email protected]>

* update covidcast_nowcast * update tests

dshemetov · 2023-01-25T22:19:29Z

Thanks @krivard ! Just going to merge this one in, we can add any further feedback to another issue.

dshemetov · 2023-01-25T22:20:54Z

Ok guess not, whenever @melange396 gets to it.

krivard · 2023-01-27T20:20:39Z

@dshemetov yeah I have it set to dismiss reviews when new commits are added; we had an intern that merged some unreviewed code a while back and it seemed prudent at the time. it's kindof a pain if we don't need it though. would you prefer to wait for George or want me to override and merge now?

dshemetov · 2023-01-27T20:37:51Z

I'd prefer to just merge it in, since George won't be back until Wednesday and I don't think this is that complicated of a change

dshemetov added 2 commits January 21, 2023 04:41

Acquisition: lint csv_importer and csv_to_database

8da3e0a

* add type hints * add consistent spacings

Acquisition: refactor csv_importer for cleanliness

ca356ed

* simplify and unify CovidcastRow creation from file * convert CSV file details tuple to NamedTuple

dshemetov requested a review from krivard January 21, 2023 12:46

dshemetov mentioned this pull request Jan 21, 2023

Add CovidcastRow testing util and a few other changes #1044

Merged

5 tasks

dshemetov requested a review from melange396 January 23, 2023 17:30

krivard previously approved these changes Jan 25, 2023

View reviewed changes

Update src/acquisition/covidcast/csv_importer.py

d6b653a

Fix mixed quotes Co-authored-by: Katie Mazaitis <[email protected]>

dshemetov dismissed krivard’s stale review via d6b653a January 25, 2023 21:12

dshemetov and others added 2 commits January 25, 2023 13:34

Update tests/acquisition/covidcast/test_csv_to_database.py

51722fa

Co-authored-by: Katie Mazaitis <[email protected]>

Acquisition: change PathDetails order to path order

57aa137

* update covidcast_nowcast * update tests

dshemetov requested review from melange396 and removed request for melange396 January 25, 2023 22:19

krivard merged commit 9ab7840 into dev Jan 27, 2023

krivard deleted the ds/refactor-csv-importer branch January 27, 2023 20:49

krivard mentioned this pull request Feb 9, 2023

Release Delphi Epidata 0.4.6 #1081

Merged

dshemetov mentioned this pull request Aug 21, 2023

Refactor usage of mock in csv_to_database.py #966

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Refactor] Further cleans of `csv_importer` #1072

[Refactor] Further cleans of `csv_importer` #1072

dshemetov commented Jan 21, 2023 •

edited

Loading

krivard left a comment

krivard Jan 25, 2023

dshemetov Jan 25, 2023 •

edited

Loading

dshemetov Jan 25, 2023

krivard Jan 25, 2023

krivard Jan 25, 2023

dshemetov Jan 25, 2023

krivard Jan 25, 2023

krivard Jan 25, 2023

dshemetov Jan 25, 2023 •

edited

Loading

dshemetov commented Jan 25, 2023

dshemetov commented Jan 25, 2023

krivard commented Jan 27, 2023

dshemetov commented Jan 27, 2023

	PathDetails = NamedTuple('PathDetails', [('source', str), ("signal", str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])
	PathDetails = NamedTuple('PathDetails', [('source', str), ('signal', str), ('time_type', str), ('geo_type', str), ('time_value', int), ('issue', int), ('lag', int)])

		csv_rows = CsvImporter.load_csv(path, details)
		rows_list = list(csv_rows)

[Refactor] Further cleans of csv_importer #1072

[Refactor] Further cleans of csv_importer #1072

Conversation

dshemetov commented Jan 21, 2023 • edited Loading

Summary

krivard left a comment

Choose a reason for hiding this comment

krivard Jan 25, 2023

Choose a reason for hiding this comment

dshemetov Jan 25, 2023 • edited Loading

Choose a reason for hiding this comment

dshemetov Jan 25, 2023

Choose a reason for hiding this comment

krivard Jan 25, 2023

Choose a reason for hiding this comment

krivard Jan 25, 2023

Choose a reason for hiding this comment

dshemetov Jan 25, 2023

Choose a reason for hiding this comment

krivard Jan 25, 2023

Choose a reason for hiding this comment

krivard Jan 25, 2023

Choose a reason for hiding this comment

dshemetov Jan 25, 2023 • edited Loading

Choose a reason for hiding this comment

dshemetov commented Jan 25, 2023

dshemetov commented Jan 25, 2023

krivard commented Jan 27, 2023

dshemetov commented Jan 27, 2023

[Refactor] Further cleans of `csv_importer` #1072

[Refactor] Further cleans of `csv_importer` #1072

dshemetov commented Jan 21, 2023 •

edited

Loading

dshemetov Jan 25, 2023 •

edited

Loading

dshemetov Jan 25, 2023 •

edited

Loading