-
Notifications
You must be signed in to change notification settings - Fork 16
Further refactoring for the geo coding utility #217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I am done with this refactor! A summary of the changes:
|
Admittedly, this is a lot of changes. And for what it's worth, I don't plan on making such big PRs in the future. I believe, however, that this code is correct, because:
There may still be glitches in specific codings:
|
Verify that this zip-fips mapping is not substantively different from the 02_20_uszips.csv mapping claimed to be updated in 2020. |
Some updates:
|
We need to come to a decision, here are the pros and cons of using the 02_20_uszip file for weights... Since we don't know where these weights come from, we can at least try to verify that they are consistent with the marginal population totals. To do this we...
|
My apologies, it appears I made a large error in comparing the weights between the two sources in the comment above. The ZIP/FIPS weights in the two sources are actually almost identical. I have updated the comment to reflect the correct information. |
@jsharpna, I ran your suggested query. The difference between the ZIP population total and the ZIP transformed to FIPS population total is 113.45 (out of 326 million), using simplemaps weights, and 147.0 using Census weights directly. |
Ok @krivard, given that @dshemetov seems to think that the weights are just from the 2010 CB then I propose
|
@jsharpna and I found population estimates at the county level from US Census Bureau. I will use the 2019 estimates. |
It turns out that Puerto Rico is not in the 2019 estimates above. Population estimates for all of Puerto Rico in 2019 are available, but not at the municipality level. I think that if we want municipality-level prop signals, we should use the 2010 data for Puerto Rico in our FIPS population file. The alternative is not reporting population-adjusted signals for Puerto Rico. I am in the process of implementing this solution, but if someone thinks of something better, let me know. |
@dshemetov go ahead with the 2010 population figures for Puerto Rico municipalities. @jsharpna We'll want to call out the discrepancy in the API documentation, probably as a new Population section in https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast_geography.md. You and Dmitry should coordinate to write that section next week. |
* README * data processing script * demo/tutorial notebooks Changes include: * updated data sources * Added state->HHS region mappings * new HRR weights * new JHU UID mappings by hand * 2019 population from US Census Bureau * Puerto Rico population from 2010 * Shortcut crosswalks Removed unused code
* Removed old sources, since we now pull from the internet * Renamed cross files, since they are tables * Added new mapping tables * Updated population figures * Added HHS mapping
* added the functions zip_to_state_code, zip_to_state_id (and the convert_* versions), zip_to_msa and convert_zip_to_msa * added two functions add_geocode and replace_geocode meant to consolidate the logic in the utility and reduce the code size by a factor of 5. These functions work along side with the rest of the deprecated functions and are meant to replace e.g. zip_to_msa(df, ...) with replace_geocode(df, "zip", "msa", ...). * renamed functions that referred to fips or county interchangeably to consistently use fips, e.g. zip_to_county to zip_to_fips * enforced the string type on all geocodes, with zero padding as necessary * renamed instances of stcode to state_code for clarity * removed non-JHU UID functions for JHU conversion * updated tests to match Bugfixes: * Removed .0000, 9xxx in output mappings - fixes most of #254 * Puerto Rico deaths should now be reported - fixes #179 * Generally fixes #215
fd302a9
to
0c6b422
Compare
fips_col: str = 'fips', | ||
stcode_col: str = 'st_code'): | ||
"""create st_code column from fips column | ||
assert ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use a ValueError here since it's a user input that's problematic.
state_id_col='state_id', | ||
full=False): | ||
"""create state_id column from county (fips) column | ||
### DEPRECATED FUNCTIONS BELOW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add DeprecationWarnings to all of these when run?
I have tests failing, one of which may be due to pandas sorting. If so, adding sort=True to the pandas .merge should solve.
On linting, I get a number of errors, some of are in the deprecated functions and we could ignore.
Also annoying docstring style nitpicks, courtesy of
|
} | ||
|
||
# Utility functions | ||
def load_crosswalk(self, from_code, to_code): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just used by other methods and not public facing, may want to make this protected or private
|
||
Return: | ||
Return | ||
--------- | ||
data: copy of dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add more description here
if its a URL it's probably fine to leave it. might be able to do backslashes but that might actually make things less readable. Some of thse look like normal dosctrings though |
Oh oops, thanks for catching those. Just fixed those up. |
Just realized there are some renamings in this refactor that need to be smoothed out with the previous extension of the utility to JHU. Should be just naming fixes. |
* Correct README information about JHU fixes * Remove JHU NYC county splitting * Add Dmitry to authorship comments in geo_data_proc and geomap
…icators into rf_geo_refactor
Merge is pending completion of corresponding documentation |
* Change 'national' -> 'nation' for consistency with #207 * Enforce FIPS and ZIP from_code when going to 'nation' level
* if 'bucket_name' param is empty, do not archive * typo fix to comment in geo_data_proc
…icators into rf_geo_refactor
* keep the JHU entries for the counties that Kansas City contributes to (were deleted previously, leading to missing counts) * fix bug that deleted the Puerto Rico Unassigned and Out of State JHU UIDs * delete the New York County splitting that was previously necessary * add New York County UIDs back to the crosswalk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- delphi_utils lints and tests
- jhu lints (with python3.6; 3.7+ still has a pytest recursion problem) and tests
- emr_hosp is a retired signal, is exempt from linting, but tests pass
- verified no other signals are currently using GeoMapper
The work for this #215