Further refactoring for the geo coding utility #217

Merged Oct 7, 2020 (36 commits)

Commits
0c6b4ad
Code and documentation for producing geo mapping files
krivard Sep 18, 2020
b08ed3f
Static geo mapping files
krivard Sep 18, 2020
0c6b422
Updated geo mapping/aggregation utility
krivard Sep 18, 2020
528ef2e
Remove 8XXXX and 9XXYY, YY > 56 JHU FIPS codes, updated Puerto Rico
dshemetov Sep 18, 2020
b564e9f
Code review updates:
dshemetov Sep 22, 2020
149e16d
Update _delphi_utils_python/data_proc/geomap/geo_data_proc.py
dshemetov Sep 22, 2020
eafdf13
Update _delphi_utils_python/delphi_utils/geomap.py
dshemetov Sep 22, 2020
e16d51d
Replace assert with ValueError exception
dshemetov Sep 22, 2020
fde9525
Add doc string for megacounty code
dshemetov Sep 22, 2020
baf0c14
Link todo list to github issues
dshemetov Sep 22, 2020
9c8e1b3
Taking ownership in the README
dshemetov Sep 22, 2020
2c3de60
Add crosswalk sanity checks to test_geomap
dshemetov Sep 23, 2020
806db8a
Merge branch 'rf_geo_refactor' of https://github.com/cmu-delphi/covid…
dshemetov Sep 23, 2020
c26743e
Uncomment work functions
dshemetov Sep 24, 2020
5fbff78
Code review updates
dshemetov Sep 24, 2020
655f21d
String conversion check coverage
dshemetov Sep 24, 2020
7d34fb9
Two final features
dshemetov Sep 25, 2020
b88e89f
Part of previous commit
dshemetov Sep 25, 2020
9a0eb10
Final set of tests:
dshemetov Sep 25, 2020
f06e085
Complete national in todo, add dropna=True/False tests
dshemetov Sep 30, 2020
8689a8a
Fix too long lines
dshemetov Sep 30, 2020
5f7c28c
A few comment fixes and additions, minor change of "is" to "=="
dshemetov Oct 1, 2020
9a3ffe9
Linting fixes for the tests
dshemetov Oct 1, 2020
2ddd9b1
Update EMR hosp geomapper with new changes
dshemetov Oct 1, 2020
eccc72b
Emr hosp update that should've been in previous commit
dshemetov Oct 1, 2020
9aacec5
Update the jhu indicator with geomapper changes
dshemetov Oct 1, 2020
0cfe78b
A few minor changes:
dshemetov Oct 1, 2020
cb171dc
Merge branch 'rf_geo_refactor' of github.com:cmu-delphi/covidcast-ind…
dshemetov Oct 1, 2020
8c9a143
Remove unneeded numpy import
dshemetov Oct 1, 2020
3ab1871
Small update to README
dshemetov Oct 1, 2020
4e6ce63
Modify national level code support:
dshemetov Oct 6, 2020
21cb473
Add archive bypass flag to JHU
dshemetov Oct 6, 2020
78ffb9a
Merge branch 'rf_geo_refactor' of github.com:cmu-delphi/covidcast-ind…
dshemetov Oct 6, 2020
90ba3d6
Important fixes to JHU hand additions:
dshemetov Oct 7, 2020
64fb9b7
Add indicator testing notebook and geocoding utility demo notebook
dshemetov Oct 7, 2020
b830cc8
Remove trailing whitespace
krivard Oct 7, 2020

3 changes: 3 additions & 0 deletions .gitignore
@@ -116,6 +116,9 @@ venv.bak/
# mkdocs documentation
/site

# VSCode settings
*.vscode

# mypy
.mypy_cache/

33,100 changes: 0 additions & 33,100 deletions _delphi_utils_python/data_proc/geomap/02_20_uszips.csv

This file was deleted.

Binary file not shown.
Binary file not shown.
Binary file not shown.
122 changes: 34 additions & 88 deletions _delphi_utils_python/data_proc/geomap/README.md
@@ -2,114 +2,60 @@

Authors: Jingjing Tang, James Sharpnack

The data_proc/geomap directory contains original source data, processing scripts, and notes for processing from original source to crosswalk tables in the data directory for the delphi_utils package.

## Usage

Requires the following source files below.

Run the following to write the cross files in the package data dir...
Run the following to build the crosswalk tables in `covidcast-indicators/_delphi_utils_python/delphi_utils/data`:
```
$ python geo_data_proc.py
```
this will build the following files...
- fips_msa_cross.csv
- zip_fips_cross.csv
- state_codes.csv

You can see consistency checks and diffs with old sources in ./consistency_checks.ipynb

## Source files

- 03_20_MSAs.xls : [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html)
- 02_20_uszips.csv : Hand edited file from Jingjing, we only use the fips,zip encoding and also extract the states from these
- Crosswalk files from https://www.huduser.gov/portal/datasets/usps_crosswalk.html
- JHU crosswalk table: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic
- ZIP/County population: https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622, https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?#

## Todo

- make direct cross tables for fips -> hrr and zip -> msa / state
- use hud for zip -> fips?

## Notes

Some of the source files were constructed by hand, most notably 02_20_uszips.csv.

The 02_20_uszips.csv file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from here: https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population, but some matching problems remain. I manually checked and corrected those lines (~20) with zip-codes.com (https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the 2017 version of the file downloaded from https://atlasdata.dartmouth.edu/static/supp_research_data

transStateToHRR.csv and transfipsToHRR.csv are used to transform data from state level or county level to HRR respectively. For example, x is the horizontal vector of covid cases for different states in 04/10/20, then we have x @ H = y, where H is the table provided in these two csv files and y is a horizontal vector of covid cases for different HRRs.
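As a rough illustration of the `x @ H = y` transform described above, the sketch below multiplies a row vector of state-level counts by a small weight matrix. The 3-state by 2-HRR matrix, the counts, and the helper name `row_times_matrix` are hypothetical and are not taken from transStateToHRR.csv; plain lists are used so the example runs without extra dependencies.

```python
def row_times_matrix(x, H):
    """Return the row-vector/matrix product x @ H using plain Python lists."""
    n_cols = len(H[0])
    return [sum(x[i] * H[i][j] for i in range(len(x))) for j in range(n_cols)]


# Hypothetical weights: row i is the share of state i's cases assigned to each HRR.
# Each row sums to 1, so the total case count is preserved by the transform.
H = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
]

# Horizontal vector of case counts for the three states on one day.
x = [10.0, 20.0, 30.0]

# Case counts per HRR.
y = row_times_matrix(x, H)
print(y)
```

Because every row of `H` sums to 1, the HRR-level counts add up to the same total as the state-level counts.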
## TODO

HRRs are represented by hrrnum. There are 306 hrrs in total. They are not named as consecutive numbers.
- Fix Puerto Rico in the JHU UIDs.

-Jingjing
## Geo Codes

We support the following geocodes.

04/14/20: 'msa_id' and 'msa_name' are added according to the msa_list.csv that Aaron found from https://apps.bea.gov/regional/docs/msalist.cfm (2019)
- The ZIP code and the FIPS code are the most granular geocodes we support. The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is a five-digit code (with leading zeros). The FIPS code is a five-digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found from the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html).
- We are reserving 10001-10099 for state codes of the form 100XX, where XX is the FIPS code for the state. If the CBSA codes change, it should be verified that these values are still unused. The current smallest CBSA is 10100.
- State codes are a series of equivalent identifiers for US states. They include the state name, the state number, and the state two-letter abbreviation. The state number matches the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more.
- The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/).
- The JHU signal contains its own geographic identifier, labeled the UID. Documentation is provided at [their repo](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic). Its FIPS codes depart from the standard in some special cases, so we make some hand additions.
- Dukes and Nantucket counties in Massachusetts are aggregated, so we split them with 50/50 weight into two FIPS.
- Same with Kansas City and four of its counties.
- Kusilvak, Alaska
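The code formats described in the list above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper names `format_fips`, `split_fips`, and `reserved_state_code` are hypothetical, and the last one only shows the 100XX convention stated in the list.

```python
def format_fips(code):
    """Return a county FIPS code as a zero-padded five-character string."""
    fips = str(code).strip().zfill(5)
    if len(fips) != 5 or not fips.isdigit():
        raise ValueError(f"not a valid county FIPS code: {code!r}")
    return fips


def split_fips(fips):
    """Split a five-digit FIPS code into (state code, county code)."""
    return fips[:2], fips[2:]


def reserved_state_code(state_fips):
    """Build the reserved 100XX code for a two-digit state FIPS code (hypothetical helper)."""
    if len(state_fips) != 2 or not state_fips.isdigit():
        raise ValueError(f"not a two-digit state FIPS code: {state_fips!r}")
    return "100" + state_fips


# A FIPS code read as an integer loses its leading zero; padding restores it.
fips = format_fips(1001)
state, county = split_fips(fips)
print(fips, state, county, reserved_state_code(state))
```

Keeping the codes as strings avoids silently dropping leading zeros, which is the usual failure mode when ZIP or FIPS columns are parsed as integers.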

04/15/20:
The newly updated(added columns) are based on cbsatocountycrosswalk.csv from https://data.nber.org/data/cbsa-fips-county-crosswalk.html
- 'msa' : MSA ID
- 'msaname': Name of the MSA
- 'cbsa': CBSA ID
- 'cbsaname': Name of the CBSA


04/19/20:
Changed to msa_list.csv again.

05/20/20: Updated msa_list.csv to include MSAs in Puerto Rico, using the delineations file from March 2020: https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

06/15/20:
Added file co-est2019-annres.csv, which gives 2019 population estimates for each county by name

Source: Annual Estimates of the Resident Population for Counties in the United States: April 1, 2010 to July 1, 2019 (CO-EST2019-ANNRES). U.S. Census Bureau, Population Division. Release Date: March 2020
Note: The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions. All geographic boundaries for the 2019 population estimates are as of January 1, 2019. For population estimates methodology statements, see http://www.census.gov/programs-surveys/popest/technical-documentation/methodology.html.
## Source files

Note: The 6,222 people in Bedford city, Virginia, which was an independent city as of the 2010 Census, are not included in the April 1, 2010 Census enumerated population presented in the county estimates. In July 2013, the legal status of Bedford changed from a city to a town and it became dependent within (or part of) Bedford County, Virginia. This population of Bedford town is now included in the April 1, 2010 estimates base and all July 1 estimates for Bedford County. Because it is no longer an independent city, Bedford town is not listed in this table. As a result, the sum of the April 1, 2010 census values for Virginia counties and independent cities does not equal the 2010 Census count for Virginia, and the sum of April 1, 2010 census values for all counties and independent cities in the United States does not equal the 2010 Census count for the United States. Substantial geographic changes to counties can be found on the Census Bureau website at https://www.census.gov/programs-surveys/geography/technical-documentation/county-changes.html.
The source files are requested from a government URL when `geo_data_proc.py` is run (see the top of said script for the URLs). Below we describe the locations to find updated versions of the source files, if they are ever needed.

- ZIP -> FIPS (county) population tables available from [US Census](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622). This file contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two.
- ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data).
- FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS code should match the state code here.
- JHU UID -> FIPS comes from [the JHU documentation](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic). We have to do some hand modifications to the JHU UID because the mapping to FIPS isn't always consistent.
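The population-weighted transform mentioned in the first bullet can be sketched roughly as below. The rows, ZIP codes, and function names are made up for illustration and do not come from the Census file or from `geo_data_proc.py`; the sketch only assumes that each row gives the population of one ZIP/FIPS intersection.

```python
from collections import defaultdict

# Hypothetical (zip, fips, population of the intersection) rows.
rows = [
    ("00001", "01001", 300),
    ("00001", "01003", 100),
    ("00002", "01003", 50),
]


def zip_to_fips_weights(rows):
    """Return {(zip, fips): share of the ZIP's population living in that county}."""
    totals = defaultdict(int)
    for zip_code, _, pop in rows:
        totals[zip_code] += pop
    # ZIPs with zero total population are skipped to avoid dividing by zero.
    return {(z, f): pop / totals[z] for z, f, pop in rows if totals[z] > 0}


def aggregate_to_fips(zip_values, weights):
    """Distribute ZIP-level values to counties according to the weights."""
    out = defaultdict(float)
    for (z, f), w in weights.items():
        out[f] += zip_values.get(z, 0.0) * w
    return dict(out)


weights = zip_to_fips_weights(rows)
county_values = aggregate_to_fips({"00001": 40.0, "00002": 10.0}, weights)
print(county_values)
```

Since the weights for each ZIP sum to 1, the county totals add up to the same overall count as the ZIP-level input.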

07/07/2020:
Introduced the March 2020 MSA file, source is [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
## Derived files

07/08/2020:
We are reserving 00001-00099 for states codes of the form 100XX where XX is the fips code for the state. In the case that the CBSA codes change then it should be verified that these are not used. The current smallest CBSA is 10100.
The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. This is because there is no information gained when crosswalking from coarse to granular.
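One way to picture a derived table is as a composition of two granular-to-coarse mappings. The sketch below composes a hypothetical weighted ZIP -> FIPS table with the FIPS -> state rule (first two digits) to get ZIP -> state weights. Whether `geo_data_proc.py` derives its tables this way is an assumption, and the data and the name `derive_zip_to_state` are invented for the example.

```python
from collections import defaultdict

# Hypothetical weighted ZIP -> FIPS crosswalk: {(zip, fips): weight}.
zip_fips = {
    ("00001", "01001"): 0.75,
    ("00001", "01003"): 0.25,
    ("00002", "13001"): 1.0,
}


def derive_zip_to_state(zip_fips):
    """Compose ZIP -> FIPS weights with FIPS -> state (first two digits of the FIPS)."""
    zip_state = defaultdict(float)
    for (zip_code, fips), weight in zip_fips.items():
        zip_state[(zip_code, fips[:2])] += weight
    return dict(zip_state)


zip_state = derive_zip_to_state(zip_fips)
print(zip_state)
```

The composition only runs toward the coarser code: the two county rows for ZIP `00001` collapse into a single state row, and nothing in the state-level table says how to split it back into counties.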

-James
## Deprecated source files

07/22/2020:
- Introducing the COUNTY_ZIP and ZIP_COUNTY crosswalk files from https://www.huduser.gov/portal/datasets/usps_crosswalk.html
- Also the ZIP to HRR Crosswalk file (from 2018) from https://atlasdata.dartmouth.edu/static/supp_research_data
- Added the JHU crosswalk table and created a jhu_uid to fips crosswalk table: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic
- ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights.
- The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips)). This file matches best to the most recent (2020) situation in terms of the population, but some matching problems remain. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the 2017 version of the file downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data).
- ZIP -> FIPS crosswalk provided by [huduser.gov](https://www.huduser.gov/portal/datasets/usps_crosswalk.html).
- FIPS county population data from [US Census Bureau](http://www.census.gov/programs-surveys/popest/technical-documentation/methodology.html). Details of Bedford, Virginia counting [here](https://www.census.gov/programs-surveys/geography/technical-documentation/county-changes.html).
- JHU UID crosswalk table [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic)
- CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`).
- MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
- MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm)

There are NaN fips in the JHU tables, so to resolve this we are moving over to using the JHU unique id.
We have to deal with the NaN fips by hand, which are
```
748 US
887 Recovered, US
888 Dukes and Nantucket, Massachusetts, US
889 Kansas City, Missouri, US
890 Michigan Department of Corrections (MDOC), Mic...
891 Federal Correctional Institution (FCI), Michig...
892 Air Force, US Military, US
893 Army, US Military, US
894 Marine Corps, US Military, US
895 Navy, US Military, US
896 Unassigned, US Military, US
897 US Military, US
898 Inmates, Federal Bureau of Prisons, US
899 Staff, Federal Bureau of Prisons, US
900 Federal Bureau of Prisons, US
901 Bear River, Utah, US
902 Central Utah, Utah, US
903 Southeast Utah, Utah, US
904 Southwest Utah, Utah, US
905 TriCounty, Utah, US
906 Weber-Morgan, Utah, US
907 Veteran Hospitals, US
```
If you look at geo_data.py::
## Notes

08/04/2020:
Large changes in MSA from 2018 version from bea.gov (msa_list.csv), and the new 2020 version from census bureau (03_20_MSAs.xls).
Trying to use 2018 version instead from https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html
- NAs in the coding are currently zero-filled.