Skip to content

Further refactoring for the geo coding utility #217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Oct 7, 2020
Merged

Conversation

dshemetov
Copy link
Contributor

The work for this #215

@dshemetov dshemetov self-assigned this Aug 25, 2020
@dshemetov
Copy link
Contributor Author

I am done with this refactor!

A summary of the changes:

  • added documentation and consolidated geocode source file info in /data_proc/geomap/README.md
  • added the functions zip_to_state_code, zip_to_state_id (and the convert_* versions), zip_to_msa and convert_zip_to_msa
  • added two functions add_geocode and replace_geocode meant to consolidate the logic in the utility and reduce the code size by a factor of 5. These functions work along side with the rest of the deprecated functions and are meant to replace e.g. zip_to_msa(df, ...) with replace_geocode(df, "zip", "msa", ...).
  • added tests for all new functions
  • cleaned geo_data_proc.py, removed unnecessary functions, separated derived crosswalks (shortcut crosswalks) from source crosswalks, consolidated the JHU UID hand modifications here
  • renamed functions that referred to fips or county interchangeably to consistently use fips, e.g. zip_to_county to zip_to_fips
  • enforced the string type on all geocodes, with zero padding as necessary
  • renamed instances of stcode to state_code for clarity
  • removed non-JHU UID functions for JHU conversion

@dshemetov dshemetov marked this pull request as ready for review August 28, 2020 04:37
@dshemetov
Copy link
Contributor Author

dshemetov commented Aug 28, 2020

Admittedly, this is a lot of changes. And for what it's worth, I don't plan on making such big PRs in the future. I believe, however, that this code is correct, because:

  • the tests ensure the two new functions have the same output as James' previous functions, on the given test cases
  • we total the values across the test dataframe columns after geocoding and verify that the data sum is identical

There may still be glitches in specific codings:

  • the JHU UID to FIPS conversion has some new additions and may need small revisions to its handling of edge cases
  • the FIPS to HRR conversion uses different files, the weights have an average absolute difference of 0.017 (normalized to 1)

@krivard
Copy link
Contributor

krivard commented Aug 28, 2020

Verify that this zip-fips mapping is not substantively different from the 02_20_uszips.csv mapping claimed to be updated in 2020.

@dshemetov
Copy link
Contributor Author

dshemetov commented Aug 29, 2020

Some updates:

  • Lower cased the state abbreviations
  • Updated the ZIP/FIPS population table to use the updated data from here (based on the total population count, this data appears to be from 2018), which will help @jingjtang's work in Quidel
    • Contacted the source about their ZIP/FIPS data, they said "...currently we are using the 2010 US Census mapping for ZCTAs to counties and are using our own GIS algorithm to calculate map point zips to counties (using recent county shape data)." It is still unclear how they have updated population values per region.
  • We will continue to use the US Census data for the ZIP/FIPS mapping, for source reliability @jsharpna
    • For future reference, I looked into the differences in weights given by simplemaps and the US census. The average absolute weights difference between the file above and the 2010 Census information is ~1e-05; the correlation between the twocolumns is ~1 - 1e-09.

@jsharpna
Copy link
Contributor

jsharpna commented Sep 2, 2020

We need to come to a decision, here are the pros and cons of using the 02_20_uszip file for weights...
pros: possibly updated to be consistent with the 2018 populations
cons: not clear what the source data is, not easy to update when new census data is available

Since we don't know where these weights come from, we can at least try to verify that they are consistent with the marginal population totals. To do this we...

  1. multiply the fips-zip weight from this file by the zip population - this should give us the fips x zip population
  2. sum that grouped by fips then if this matches the fips population then good

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 2, 2020

My apologies, it appears I made a large error in comparing the weights between the two sources in the comment above. The ZIP/FIPS weights in the two sources are actually almost identical. I have updated the comment to reflect the correct information.

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 2, 2020

@jsharpna, I ran your suggested query. The difference between the ZIP population total and the ZIP transformed to FIPS population total is 113.45 (out of 326 million), using simplemaps weights, and 147.0 using Census weights directly.

@jsharpna
Copy link
Contributor

jsharpna commented Sep 3, 2020

Ok @krivard, given that @dshemetov seems to think that the weights are just from the 2010 CB then I propose

  1. we use the same weight file that we currently are (derived from 2010 CB)
  2. we use most recent CB populations at the fips and zip level
  3. this is subtle, but I think that we should match the population base geo to the base geo for the signal - for example, jhu uses fips (ish) then the populations should be aggregated from the fips level. (this could be precomputed in geomapper - but is more complicated).

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 11, 2020

@jsharpna and I found population estimates at the county level from US Census Bureau. I will use the 2019 estimates.

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 17, 2020

It turns out that Puerto Rico is not in the 2019 estimates above. Population estimates for all of Puerto Rico in 2019 are available, but not at the municipality level. I think that if we want municipality-level prop signals, we should use the 2010 data for Puerto Rico in our FIPS population file. The alternative is not reporting population-adjusted signals for Puerto Rico. I am in the process of implementing this solution, but if someone thinks of something better, let me know.

@krivard
Copy link
Contributor

krivard commented Sep 18, 2020

@dshemetov go ahead with the 2010 population figures for Puerto Rico municipalities.

@jsharpna We'll want to call out the discrepancy in the API documentation, probably as a new Population section in https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast_geography.md. You and Dmitry should coordinate to write that section next week.

* README
* data processing script
* demo/tutorial notebooks

Changes include:

* updated data sources
* Added state->HHS region mappings
* new HRR weights
* new JHU UID mappings by hand
* 2019 population from US Census Bureau
* Puerto Rico population from 2010
* Shortcut crosswalks

Removed unused code
* Removed old sources, since we now pull from the internet
* Renamed cross files, since they are tables
* Added new mapping tables
* Updated population figures
* Added HHS mapping
* added the functions zip_to_state_code, zip_to_state_id (and the convert_* versions), zip_to_msa and convert_zip_to_msa
* added two functions add_geocode and replace_geocode meant to consolidate the logic in the utility and reduce the code size by a factor of 5. These functions work along side with the rest of the deprecated functions and are meant to replace e.g. zip_to_msa(df, ...) with replace_geocode(df, "zip", "msa", ...).
* renamed functions that referred to fips or county interchangeably to consistently use fips, e.g. zip_to_county to zip_to_fips
* enforced the string type on all geocodes, with zero padding as necessary
* renamed instances of stcode to state_code for clarity
* removed non-JHU UID functions for JHU conversion
* updated tests to match

Bugfixes:

* Removed .0000, 9xxx in output mappings - fixes most of #254
* Puerto Rico deaths should now be reported - fixes #179
* Generally fixes #215
fips_col: str = 'fips',
stcode_col: str = 'st_code'):
"""create st_code column from fips column
assert (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use a ValueError here since it's a user input that's problematic.

state_id_col='state_id',
full=False):
"""create state_id column from county (fips) column
### DEPRECATED FUNCTIONS BELOW
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add DeprecationWarnings to all of these when run?

@chinandrew
Copy link
Contributor

chinandrew commented Sep 20, 2020

I have tests failing, one of which may be due to pandas sorting. If so, adding sort=True to the pandas .merge should solve.

----------- coverage: platform linux, python 3.6.9-final-0 -----------
Name                                                                                        Stmts   Miss  Cover   Missing
-------------------------------------------------------------------------------------------------------------------------
/home/andrew/Documents/covidcast-indicators/_delphi_utils_python/delphi_utils/__init__.py       7      0   100%
/home/andrew/Documents/covidcast-indicators/_delphi_utils_python/delphi_utils/archive.py      153      0   100%
/home/andrew/Documents/covidcast-indicators/_delphi_utils_python/delphi_utils/export.py        11      0   100%
/home/andrew/Documents/covidcast-indicators/_delphi_utils_python/delphi_utils/geomap.py       296     18    94%   151, 283, 421, 502, 693, 737, 788, 854, 898, 907, 934, 938, 946, 970, 998, 1003, 1029, 1060
/home/andrew/Documents/covidcast-indicators/_delphi_utils_python/delphi_utils/utils.py          8      0   100%
-------------------------------------------------------------------------------------------------------------------------
TOTAL                                                                                         475     18    96%

======================================================================================================== short test summary info ========================================================================================================
FAILED test_geomap.py::TestGeoMapper::test_add_population_column - assert 172060 == 263790
FAILED test_geomap.py::TestGeoMapper::test_add_code - ValueError: operands could not be broadcast together with shapes (4,2) (3,2)

On linting, I get a number of errors, some of are in the deprecated functions and we could ignore.

************* Module delphi_utils.geomap
delphi_utils/geomap.py:3:5: C0303: Trailing whitespace (trailing-whitespace)
delphi_utils/geomap.py:13:61: C0303: Trailing whitespace (trailing-whitespace)
delphi_utils/geomap.py:413:0: C0301: Line too long (110/100) (line-too-long)
delphi_utils/geomap.py:492:0: C0301: Line too long (103/100) (line-too-long)
delphi_utils/geomap.py:546:0: C0301: Line too long (103/100) (line-too-long)
delphi_utils/geomap.py:844:0: C0301: Line too long (103/100) (line-too-long)
delphi_utils/geomap.py:887:0: C0301: Line too long (103/100) (line-too-long)
delphi_utils/geomap.py:961:0: C0301: Line too long (104/100) (line-too-long)
delphi_utils/geomap.py:1018:0: C0301: Line too long (104/100) (line-too-long)
delphi_utils/geomap.py:1080:0: C0301: Line too long (104/100) (line-too-long)
delphi_utils/geomap.py:1113:0: C0301: Line too long (104/100) (line-too-long)
delphi_utils/geomap.py:1:0: C0302: Too many lines in module (1137/1000) (too-many-lines)
delphi_utils/geomap.py:1137:0: C0305: Trailing newlines (trailing-newlines)
delphi_utils/geomap.py:90:4: W0102: Dangerous default value CROSSWALK_FILEPATHS (builtins.dict) as argument (dangerous-default-value)
delphi_utils/geomap.py:343:8: R1705: Unnecessary "else" after "return" (no-else-return)
delphi_utils/geomap.py:391:4: R0201: Method could be a function (no-self-use)
delphi_utils/geomap.py:442:4: R0201: Method could be a function (no-self-use)
delphi_utils/geomap.py:624:22: E1136: Value 'state_table' is unsubscriptable (unsubscriptable-object)
delphi_utils/geomap.py:686:35: W0613: Unused argument 'msa_col' (unused-argument)
delphi_utils/geomap.py:49:0: R0904: Too many public methods (30/20) (too-many-public-methods)

Also annoying docstring style nitpicks, courtesy of pydocstyle. Removed the warnings for deprecated functions

delphi_utils/geomap.py:50 in public class `GeoMapper`:
        D400: First line should end with a period (not 'i')
delphi_utils/geomap.py:91 in public method `__init__`:
        D205: 1 blank line required between summary line and description (found 0)
delphi_utils/geomap.py:91 in public method `__init__`:
        D400: First line should end with a period (not 's')
delphi_utils/geomap.py:110 in public method `load_crosswalk`:
        D401: First line should be in imperative mood (perhaps 'Load', not 'Loads')
delphi_utils/geomap.py:192 in public method `convert_fips_to_mega`:
        D400: First line should end with a period (not 'g')
delphi_utils/geomap.py:192 in public method `convert_fips_to_mega`:
        D403: First word of the first line should be properly capitalized ('Convert', not 'convert')
delphi_utils/geomap.py:208 in public method `megacounty_creation`:
        D202: No blank lines allowed after function docstring (found 1)
delphi_utils/geomap.py:208 in public method `megacounty_creation`:
        D400: First line should end with a period (not 'n')
delphi_utils/geomap.py:208 in public method `megacounty_creation`:
        D403: First word of the first line should be properly capitalized ('Create', not 'create')
delphi_utils/geomap.py:357 in public method `add_population_column`:
        D401: First line should be in imperative mood (perhaps 'Append', not 'Appends')
delphi_utils/geomap.py:402 in public method `fips_to_megacounty`:
        D400: First line should end with a period (not ')')
delphi_utils/geomap.py:402 in public method `fips_to_megacounty`:
        D403: First word of the first line should be properly capitalized ('Convert', not 'convert')

}

# Utility functions
def load_crosswalk(self, from_code, to_code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just used by other methods and not public facing, may want to make this protected or private


Return:
Return
---------
data: copy of dataframe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more description here

@chinandrew
Copy link
Contributor

@chinandrew I think those are the long URL lines. Is there are a preferred way to handle those?

if its a URL it's probably fine to leave it. might be able to do backslashes but that might actually make things less readable. Some of thse look like normal dosctrings though

Screenshot from 2020-09-30 14-24-57

@dshemetov
Copy link
Contributor Author

Oh oops, thanks for catching those. Just fixed those up.

@dshemetov
Copy link
Contributor Author

Just realized there are some renamings in this refactor that need to be smoothed out with the previous extension of the utility to JHU. Should be just naming fixes.

@krivard
Copy link
Contributor

krivard commented Oct 1, 2020

Merge is pending completion of corresponding documentation

* Change 'national' -> 'nation' for consistency with #207
* Enforce FIPS and ZIP from_code when going to 'nation' level
* if 'bucket_name' param is empty, do not archive
* typo fix to comment in geo_data_proc
* keep the JHU entries for the counties that Kansas City contributes to (were deleted previously, leading to missing counts)
* fix bug that deleted the Puerto Rico Unassigned and Out of State JHU UIDs
* delete the New York County splitting that was previously necessary
* add New York County UIDs back to the crosswalk
Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • delphi_utils lints and tests
  • jhu lints (with python3.6; 3.7+ still has a pytest recursion problem) and tests
  • emr_hosp is a retired signal, is exempt from linting, but tests pass
  • verified no other signals are currently using GeoMapper

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants