
Commit 1f4f352

Merge pull request #1831 from cmu-delphi/jhu-deactivation
JHU deactivation: geomap + validator
2 parents 72eb581 + c18d8e1 commit 1f4f352

10 files changed: +1164 -7714 lines changed
_delphi_utils_python/data_proc/geomap/README.md

Lines changed: 2 additions & 10 deletions
@@ -24,7 +24,7 @@ We support the following geocodes.
 - We are reserving 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
 - State codes are a series of equivalent identifiers for US state. They include the state name, the state number (state_id), and the state two-letter abbreviation (state_code). The state number is the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more.
 - The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/).
-- The JHU signal contains its own geographic identifier, labeled the UID. Documentation is provided at [their repo](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic). Its FIPS codes depart in some special cases, so we produce manual changes listed below.
+FIPS codes depart in some special cases, so we produce manual changes listed below.
 
 ## Source files
 
@@ -34,28 +34,20 @@ The source files are requested from a government URL when `geo_data_proc.py` is
 - ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data).
 - FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
 - State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS codes should match the state code here.
-- JHU UID -> FIPS comes from [the JHU documentation](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic). We have to do some hand modifications to the JHU UID because the mapping to FIPS isn't always consistent.
+
 
 ## Derived files
 
 The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. This is because there is no information gained when crosswalking from coarse to granular.
 
-## JHU UID mapping changes
 
-- Dukes and Nantucket counties in Massachusets are aggregated, so we split them with population-proportional weights (approximately 2/3 Dukes and 1/3 Nantucket).
-- The same procedure is followed by Kansas City and four of its counties.
-- Kusilvak, Alaska is mapped to the FIPS code 02270.
-- Ogalala Lakota, South Dakota is mapped to the FIPS code 46113.
-- Utah reports at a territory level, so we only report it at in a state level megaFIPS 49000.
-- JHU places cases and deaths that cannot be localized to a single county into "Out of State" and "Unassigned" categories. We map these to the "megaFIPS" code XX000, where XX is the state FIPS code. This way, the data is recovered when aggregating up to the state level, but does not interfere with other counties.
 
 ## Deprecated source files
 
 - ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights.
 - The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data).
 - ZIP -> FIPS is provided by [huduser.gov](https://www.huduser.gov/portal/datasets/usps_crosswalk.html) for zip -> fips?
 - FIPS county population data from [US Census Bureau](http://www.census.gov/programs-surveys/popest/technical-documentation/methodology.html). Details of Bedford, Virginia counting [here](https://www.census.gov/programs-surveys/geography/technical-documentation/county-changes.html).
-- JHU UID crosswalk table [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data#uid-lookup-table-logic)
 - CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`).
 - MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
 - MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm)

_delphi_utils_python/data_proc/geomap/geo_data_proc.py

Lines changed: 0 additions & 98 deletions
@@ -27,7 +27,6 @@
 ZIP_HSA_HRR_URL = "https://atlasdata.dartmouth.edu/downloads/geography/ZipHsaHrr18.csv.zip"
 ZIP_HSA_HRR_FILENAME = "ZipHsaHrr18.csv"
 FIPS_MSA_URL = "https://www2.census.gov/programs-surveys/metro-micro/geographies/reference-files/2018/delineation-files/list1_Sep_2018.xls"
-JHU_FIPS_URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
 STATE_CODES_URL = "http://www2.census.gov/geo/docs/reference/state.txt?#"
 FIPS_POPULATION_URL = f"https://www2.census.gov/programs-surveys/popest/datasets/2010-{YEAR}/counties/totals/co-est{YEAR}-alldata.csv"
 FIPS_PUERTO_RICO_POPULATION_URL = "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?"
@@ -57,7 +56,6 @@
 STATE_POPULATION_OUT_FILENAME = "state_pop.csv"
 HHS_POPULATION_OUT_FILENAME = "hhs_pop.csv"
 NATION_POPULATION_OUT_FILENAME = "nation_pop.csv"
-JHU_FIPS_OUT_FILENAME = "jhu_uid_fips_table.csv"
 
 
 def create_fips_zip_crosswalk():
@@ -111,101 +109,6 @@ def create_fips_msa_crosswalk():
     msa_df.sort_values(["fips", "msa"]).to_csv(join(OUTPUT_DIR, FIPS_MSA_OUT_FILENAME), columns=["fips", "msa"], index=False)
 
 
-def create_jhu_uid_fips_crosswalk():
-    """Build a crosswalk table from JHU UID to FIPS."""
-    # These are hand modifications that need to be made to the translation
-    # between JHU UID and FIPS. See below for the special cases information
-    # https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html#geographical-exceptions
-    hand_additions = pd.DataFrame(
-        [
-            {
-                "jhu_uid": "84070002",
-                "fips": "25007",  # Split aggregation of Dukes and Nantucket, Massachusetts
-                "weight": 16535 / (16535 + 10172),  # Population: 16535
-            },
-            {
-                "jhu_uid": "84070002",
-                "fips": "25019",
-                "weight": 10172 / (16535 + 10172),  # Population: 10172
-            },
-            {
-                "jhu_uid": "84070003",
-                "fips": "29095",  # Kansas City, Missouri
-                "weight": 674158 / 1084897,  # Population: 674158
-            },
-            {
-                "jhu_uid": "84070003",
-                "fips": "29165",
-                "weight": 89322 / 1084897,  # Population: 89322
-            },
-            {
-                "jhu_uid": "84070003",
-                "fips": "29037",
-                "weight": 99478 / 1084897,  # Population: 99478
-            },
-            {
-                "jhu_uid": "84070003",
-                "fips": "29047",
-                "weight": 221939 / 1084897,  # Population: 221939
-            },
-            # Kusilvak, Alaska
-            {"jhu_uid": "84002158", "fips": "02270", "weight": 1.0},
-            # Oglala Lakota
-            {"jhu_uid": "84046102", "fips": "46113", "weight": 1.0},
-            # Aggregate Utah territories into a "State FIPS"
-            {"jhu_uid": "84070015", "fips": "49000", "weight": 1.0},
-            {"jhu_uid": "84070016", "fips": "49000", "weight": 1.0},
-            {"jhu_uid": "84070017", "fips": "49000", "weight": 1.0},
-            {"jhu_uid": "84070018", "fips": "49000", "weight": 1.0},
-            {"jhu_uid": "84070019", "fips": "49000", "weight": 1.0},
-            {"jhu_uid": "84070020", "fips": "49000", "weight": 1.0},
-        ]
-    )
-    # Map the Unassigned category to a custom megaFIPS XX000
-    unassigned_states = pd.DataFrame(
-        {"jhu_uid": str(x), "fips": str(x)[-2:].ljust(5, "0"), "weight": 1.0}
-        for x in range(84090001, 84090057)
-    )
-    # Map the Out of State category to a custom megaFIPS XX000
-    out_of_state = pd.DataFrame(
-        {"jhu_uid": str(x), "fips": str(x)[-2:].ljust(5, "0"), "weight": 1.0}
-        for x in range(84080001, 84080057)
-    )
-    # Map the Unassigned and Out of State categories to the cusom megaFIPS 72000
-    puerto_rico_unassigned = pd.DataFrame(
-        [
-            {"jhu_uid": "63072888", "fips": "72000", "weight": 1.0},
-            {"jhu_uid": "63072999", "fips": "72000", "weight": 1.0},
-        ]
-    )
-    cruise_ships = pd.DataFrame(
-        [
-            {"jhu_uid": "84088888", "fips": "88888", "weight": 1.0},
-            {"jhu_uid": "84099999", "fips": "99999", "weight": 1.0},
-        ]
-    )
-
-
-    jhu_df = pd.read_csv(JHU_FIPS_URL, dtype={"UID": str, "FIPS": str}).query("Country_Region == 'US'")
-    jhu_df = jhu_df.rename(columns={"UID": "jhu_uid", "FIPS": "fips"}).dropna(subset=["fips"])
-
-    # FIPS Codes that are just two digits long should be zero filled on the right.
-    # These are US state codes (XX) and the territories Guam (66), Northern Mariana Islands (69),
-    # Virgin Islands (78), and Puerto Rico (72).
-    fips_territories = jhu_df["fips"].str.len() <= 2
-    jhu_df.loc[fips_territories, "fips"] = jhu_df.loc[fips_territories, "fips"].str.ljust(5, "0")
-
-    # Drop the JHU UIDs that were hand-modified
-    manual_correction_ids = pd.concat([hand_additions, unassigned_states, out_of_state, puerto_rico_unassigned, cruise_ships])["jhu_uid"]
-    jhu_df.drop(jhu_df.index[jhu_df["jhu_uid"].isin(manual_correction_ids)], inplace=True)
-
-    # Add weights of 1.0 to everything not in hand additions, then merge in hand-additions
-    # Finally, zero fill FIPS
-    jhu_df["weight"] = 1.0
-    jhu_df = pd.concat([jhu_df, hand_additions, unassigned_states, out_of_state, puerto_rico_unassigned])
-    jhu_df["fips"] = jhu_df["fips"].astype(int).astype(str).str.zfill(5)
-    jhu_df.sort_values(["jhu_uid", "fips"]).to_csv(join(OUTPUT_DIR, JHU_FIPS_OUT_FILENAME), columns=["jhu_uid", "fips", "weight"], index=False)
-
 
 def create_state_codes_crosswalk():
     """Build a State ID -> State Name -> State code crosswalk file."""
@@ -659,7 +562,6 @@ def clear_dir(dir_path: str):
     create_fips_zip_crosswalk()
     create_zip_hsa_hrr_crosswalk()
     create_fips_msa_crosswalk()
-    create_jhu_uid_fips_crosswalk()
     create_state_codes_crosswalk()
     create_state_hhs_crosswalk()
     create_fips_population_table()
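
Reviewer note: for anyone who later needs to reproduce the retired table, the two rules the removed `create_jhu_uid_fips_crosswalk()` relied on (the megaFIPS code `XX000` and the population-proportional split) can be sketched without pandas. This is only an illustrative sketch, not the code that was in the repository; the helper names `megafips_from_uid` and `population_weights` are hypothetical, and the populations are the ones quoted in the removed comments.

```python
from typing import Dict


def megafips_from_uid(jhu_uid: int) -> str:
    """Hypothetical helper: treat the last two UID digits as the state FIPS
    and right-pad with zeros to get the megaFIPS code XX000."""
    return str(jhu_uid)[-2:].ljust(5, "0")


def population_weights(populations: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical helper: split one aggregated UID across its counties
    in proportion to each county's population."""
    total = sum(populations.values())
    return {fips: pop / total for fips, pop in populations.items()}


# Dukes (25007) and Nantucket (25019), populations as quoted in the removed code.
dukes_nantucket = population_weights({"25007": 16535, "25019": 10172})

# The removed code mapped "Unassigned" UIDs over range(84090001, 84090057).
unassigned = {str(uid): megafips_from_uid(uid) for uid in range(84090001, 84090057)}
```

The weights for a split UID sum to 1, so aggregating the split counties back up to the state level recovers the original total.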
