diff --git a/.git-blame-ignore-revs b/.git-blame-ignore-revs index f91c04645..904a3bf69 100644 --- a/.git-blame-ignore-revs +++ b/.git-blame-ignore-revs @@ -1,2 +1,4 @@ -# Format geomap.py with black +# Format geomap.py d4b056e7a4c11982324e9224c9f9f6fd5d5ec65c +# Format test_geomap.py +79072dcdec3faca9aaeeea65de83f7fa5c00d53f \ No newline at end of file diff --git a/_delphi_utils_python/data_proc/geomap/README.md b/_delphi_utils_python/data_proc/geomap/README.md index 08075fff9..38297b691 100644 --- a/_delphi_utils_python/data_proc/geomap/README.md +++ b/_delphi_utils_python/data_proc/geomap/README.md @@ -1,4 +1,4 @@ -# Geocoding data processing pipeline +# Geocoding Data Processing Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov @@ -7,42 +7,37 @@ Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov Requires the following source files below. Run the following to build the crosswalk tables in `covidcast-indicators/_delph_utils_python/delph_utils/data` -``` + +```sh $ python geo_data_proc.py ``` -You can see consistency checks and diffs with old sources in ./consistency_checks.ipynb +Find data consistency checks in `./source-file-sanity-check.ipynb`. ## Geo Codes We support the following geocodes. -- The ZIP code and the FIPS code are the most granular geocodes we support. - - The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros). - - The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information). 
-- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html). - We are reserving 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used. +- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is a five-digit code (with leading zeros). +- The FIPS code is a five-digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information). +- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html). We reserve 10001-10099 for state codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). If the CBSA codes ever change, it should be verified that these values are not in use. - State codes are a series of equivalent identifiers for US state. They include the state name, the state number (state_id), and the state two-letter abbreviation (state_code). The state number is the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more. - The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). 
More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/). -FIPS codes depart in some special cases, so we produce manual changes listed below. -## Source files +## Source Files The source files are requested from a government URL when `geo_data_proc.py` is run (see the top of said script for the URLs). Below we describe the locations to find updated versions of the source files, if they are ever needed. - ZIP -> FIPS (county) population tables available from [US Census](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622). This file contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two. As of 4 February 2022, this source did not include population information for 24 ZIPs that appear in our indicators. We have added those values manually using information available from the [zipdatamaps website](www.zipdatamaps.com). - ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data). - FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). -- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS codes should match the state code here. +- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). - -## Derived files +## Derived Files The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. 
This is because there is no information gained when crosswalking from coarse to granular. - - -## Deprecated source files +## Deprecated Source Files - ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights. - The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data). @@ -51,7 +46,3 @@ The rest of the crosswalk tables are derived from the mappings above. We provide - CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`). - MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data). - MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm) - -## Notes - -- The NAs in the coding currently zero-fills. 
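The README above notes that the ZIP -> FIPS source file "contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two." As a minimal sketch of that normalization (the ZIP/FIPS values and populations below are hypothetical, not taken from the source files):

```python
# Hypothetical populations of ZIP/county intersections: one ZIP split
# across two counties. These numbers are illustrative only.
intersections = {
    ("33626", "12057"): 900,
    ("33626", "12101"): 100,
}

def crosswalk_weights(pops):
    """Normalize intersection populations into per-source weights.

    The weights for each source geo (here, each ZIP) sum to 1, which is
    the invariant the derived crosswalk tables are built to satisfy.
    """
    totals = {}
    for (src, _dst), pop in pops.items():
        totals[src] = totals.get(src, 0) + pop
    return {pair: pop / totals[pair[0]] for pair, pop in pops.items()}

weights = crosswalk_weights(intersections)
print(weights)  # {('33626', '12057'): 0.9, ('33626', '12101'): 0.1}
```

A ZIP count crosswalked to FIPS is then distributed 90/10 between the two counties before summing per county.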
diff --git a/_delphi_utils_python/data_proc/geomap/geo_data_proc.py b/_delphi_utils_python/data_proc/geomap/geo_data_proc.py index c2a07a78f..5634d6f83 100755 --- a/_delphi_utils_python/data_proc/geomap/geo_data_proc.py +++ b/_delphi_utils_python/data_proc/geomap/geo_data_proc.py @@ -1,10 +1,7 @@ """ -Authors: Dmitry Shemetov @dshemetov, James Sharpnack @jsharpna - -Intended execution: +Authors: Dmitry Shemetov, James Sharpnack cd _delphi_utils/data_proc/geomap -chmod u+x geo_data_proc.py python geo_data_proc.py """ @@ -19,7 +16,7 @@ # Source files -YEAR = 2019 +YEAR = 2020 INPUT_DIR = "./old_source_files" OUTPUT_DIR = f"../../delphi_utils/data/{YEAR}" FIPS_BY_ZIP_POP_URL = "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?#" @@ -41,7 +38,6 @@ FIPS_HHS_FILENAME = "fips_hhs_table.csv" FIPS_CHNGFIPS_OUT_FILENAME = "fips_chng-fips_table.csv" FIPS_POPULATION_OUT_FILENAME = "fips_pop.csv" - CHNGFIPS_STATE_OUT_FILENAME = "chng-fips_state_table.csv" ZIP_HSA_OUT_FILENAME = "zip_hsa_table.csv" ZIP_HRR_OUT_FILENAME = "zip_hrr_table.csv" diff --git a/_delphi_utils_python/delphi_utils/geomap.py b/_delphi_utils_python/delphi_utils/geomap.py index 29ae3667e..a313c754c 100644 --- a/_delphi_utils_python/delphi_utils/geomap.py +++ b/_delphi_utils_python/delphi_utils/geomap.py @@ -18,54 +18,90 @@ class GeoMapper: # pylint: disable=too-many-public-methods The GeoMapper class provides utility functions for translating between different geocodes. 
Supported geocodes: - - zip: zip5, a length 5 str of 0-9 with leading 0's - - fips: state code and county code, a length 5 str of 0-9 with leading 0's - - msa: metropolitan statistical area, a length 5 str of 0-9 with leading 0's - - state_code: state code, a str of 0-9 - - state_id: state id, a str of A-Z - - hrr: hospital referral region, an int 1-500 - - Mappings: - - [x] zip -> fips : population weighted - - [x] zip -> hrr : unweighted - - [x] zip -> msa : unweighted - - [x] zip -> state - - [x] zip -> hhs - - [x] zip -> population - - [x] state code -> hhs - - [x] fips -> state : unweighted - - [x] fips -> msa : unweighted - - [x] fips -> megacounty - - [x] fips -> hrr - - [x] fips -> hhs - - [x] fips -> chng-fips - - [x] chng-fips -> state : unweighted - - [x] nation - - [ ] zip -> dma (postponed) - - The GeoMapper instance loads crosswalk tables from the package data_dir. The - crosswalk tables are assumed to have been built using the geo_data_proc.py script - in data_proc/geomap. If a mapping between codes is NOT one to many, then the table has - just two colums. If the mapping IS one to many, then a third column, the weight column, - exists (e.g. zip, fips, weight; satisfying (sum(weights) where zip==ZIP) == 1). + + - zip: five characters [0-9] with leading 0's, e.g. "33626" + also known as zip5 or zip code + - fips: five characters [0-9] with leading 0's, e.g. "12057" + the first two digits are the state FIPS code and the last + three are the county FIPS code + - msa: five characters [0-9] with leading 0's, e.g. 
"90001" + also known as metropolitan statistical area + - state_code: two characters [0-9], e.g "06" + - state_id: two characters [A-Z], e.g "CA" + - state_name: human-readable name, e.g "California" + - state_*: we use this below to refer to the three above geocodes in aggregate + - hrr: an integer from 1-500, also known as hospital + referral region + - hhs: an integer from 1-10, also known as health and human services region + https://www.hhs.gov/about/agencies/iea/regional-offices/index.html + + Valid mappings: + + From To Population Weighted + zip fips Yes + zip hrr No + zip msa Yes + zip state_* Yes + zip hhs Yes + zip population -- + zip nation No + state_* state_* No + state_* hhs No + state_* population -- + state_* nation No + fips state_* No + fips msa No + fips megacounty No + fips hrr Yes + fips hhs No + fips chng-fips No + fips nation No + chng-fips state_* No + + Crosswalk Tables + ================ + + The GeoMapper instance loads pre-generated crosswalk tables (built by the + script in `data_proc/geomap/geo_data_proc.py`). If a mapping between codes + is one to one or many to one, then the table has just two columns. If the + mapping is one to many, then a weight column is provided, which gives the + fractional population contribution of a source_geo to the target_geo. The + weights satisfy the condition that df.groupby(from_code).sum(weight) == 1.0 + for all values of from_code. + + Aggregation + =========== + + The GeoMapper class provides functions to aggregate data from one geocode + to another. The aggregation can be a simple one-to-one mapping or a + weighted aggregation. The weighted aggregation is useful when the data + being aggregated is a population-weighted quantity, such as visits or + cases. The aggregation is done by multiplying the data columns by the + weights and summing over the data columns. 
Note that the aggregation does + not adjust for missing or NA values in the data columns, + which is equivalent to a zero-fill. Example Usage - ========== + ============= The main GeoMapper object loads and stores crosswalk dataframes on-demand. - When replacing geocodes with a new one an aggregation step is performed on the data columns - to merge entries (i.e. in the case of a many to one mapping or a weighted mapping). This - requires a specification of the data columns, which are assumed to be all the columns that - are not the geocodes or the date column specified in date_col. + When replacing geocodes with a new one an aggregation step is performed on + the data columns to merge entries (i.e. in the case of a many to one + mapping or a weighted mapping). This requires a specification of the data + columns, which are assumed to be all the columns that are not the geocodes + or the date column specified in date_col. Example 1: to add a new column with a new geocode, possibly with weights: > gmpr = GeoMapper() - > df = gmpr.add_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id", + > df = gmpr.add_geocode(df, "fips", "zip", + from_col="fips", new_col="geo_id", date_col="timestamp", dropna=False) - Example 2: to replace a geocode column with a new one, aggregating the data with weights: + Example 2: to replace a geocode column with a new one, aggregating the data + with weights: > gmpr = GeoMapper() - > df = gmpr.replace_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id", + > df = gmpr.replace_geocode(df, "fips", "zip", + from_col="fips", new_col="geo_id", date_col="timestamp", dropna=False) """ @@ -113,7 +149,7 @@ def __init__(self, census_year: int = 2020): subkey for mainkey in self.CROSSWALK_FILENAMES for subkey in self.CROSSWALK_FILENAMES[mainkey] - }.union(set(self.CROSSWALK_FILENAMES.keys())) - set(["state", "pop"]) + }.union(set(self.CROSSWALK_FILENAMES.keys())) - {"state", "pop"} for from_code, to_codes in 
self.CROSSWALK_FILENAMES.items(): for to_code, file_path in to_codes.items(): @@ -135,7 +171,6 @@ def _load_crosswalk_from_file( "weight": float, **{geo: str for geo in self._geos - set("nation")}, } - usecols = [from_code, "pop"] if to_code == "pop" else None return pd.read_csv(stream, dtype=dtype, usecols=usecols) @@ -229,12 +264,7 @@ def add_geocode( ): """Add a new geocode column to a dataframe. - Currently supported conversions: - - fips -> state_code, state_id, state_name, zip, msa, hrr, nation, hhs, chng-fips - - chng-fips -> state_code, state_id, state_name - - zip -> state_code, state_id, state_name, fips, msa, hrr, nation, hhs - - state_x -> state_y (where x and y are in {code, id, name}), nation - - state_code -> hhs, nation + See class docstring for supported geocode transformations. Parameters --------- @@ -303,7 +333,7 @@ def add_geocode( df = df.merge(crosswalk, left_on=from_col, right_on=from_col, how="left") # Drop extra state columns - if new_code in state_codes and not from_code in state_codes: + if new_code in state_codes and from_code not in state_codes: state_codes.remove(new_code) df.drop(columns=state_codes, inplace=True) elif new_code in state_codes and from_code in state_codes: @@ -345,12 +375,7 @@ def replace_geocode( ) -> pd.DataFrame: """Replace a geocode column in a dataframe. - Currently supported conversions: - - fips -> chng-fips, state_code, state_id, state_name, zip, msa, hrr, nation - - chng-fips -> state_code, state_id, state_name - - zip -> state_code, state_id, state_name, fips, msa, hrr, nation - - state_x -> state_y (where x and y are in {code, id, name}), nation - - state_code -> hhs, nation + See class docstring for supported geocode transformations. 
Parameters --------- @@ -397,7 +422,7 @@ def replace_geocode( df[data_cols] = df[data_cols].multiply(df["weight"], axis=0) df.drop("weight", axis=1, inplace=True) - if not date_col is None: + if date_col is not None: df = df.groupby([date_col, new_col]).sum(numeric_only=True).reset_index() else: df = df.groupby([new_col]).sum(numeric_only=True).reset_index() @@ -575,8 +600,7 @@ def get_geos_within( Return all contained regions of the given type within the given container geocode. Given container_geocode (e.g "ca" for California) of type container_geocode_type - (e.g "state"), return: - - all (contained_geocode_type)s within container_geocode + (e.g. "state"), return all (contained_geocode_type)s within container_geocode. Supports these 4 combinations: - all states within a nation @@ -627,3 +651,55 @@ def get_geos_within( "must be one of (state, nation), (state, hhs), (county, state)" ", (fips, state), (chng-fips, state)" ) + + def aggregate_by_weighted_sum( + self, df: pd.DataFrame, to_geo: str, sensor_col: str, time_col: str, population_col: str + ) -> pd.DataFrame: + """Aggregate sensor, weighted by time-dependent population. + + Note: This function generates its own population weights and excludes + locations where the data is NA, which is effectively an extrapolation + assumption to the rest of the geos. This is in contrast to the + `replace_geocode` function, which assumes that the weights are already + present in the data and does not adjust for missing data (see the + docstring for the GeoMapper class). + + Parameters + --------- + df: pd.DataFrame + Input dataframe, assumed to have a sensor column (e.g. "visits"), a + to_geo column (e.g. "state"), and a population column (corresponding + to a from_geo, e.g. "wastewater collection site"). + to_geo: str + The column name of the geocode to aggregate to. + sensor_col: str + The column name of the sensor to aggregate. + time_col: str + The column name of the time column to group by. + population_col: str + The column name of the population to weight the sensor by. 
+ + Returns + --------- + agg_df: pd.DataFrame + A dataframe with the aggregated sensor values, weighted by population. + """ + # Don't modify the input dataframe + df = df.copy() + # Zero-out populations where the sensor is NA + df["_zeroed_pop"] = df[population_col] * df[sensor_col].abs().notna() + # Weight the sensor by the population + df["_weighted_sensor"] = df[sensor_col] * df["_zeroed_pop"] + agg_df = ( + df.groupby([time_col, to_geo]) + .agg( + { + "_zeroed_pop": "sum", + "_weighted_sensor": lambda x: x.sum(min_count=1), + } + ).assign( + _new_sensor = lambda x: x["_weighted_sensor"] / x["_zeroed_pop"] + ).reset_index() + .rename(columns={"_new_sensor": f"weighted_{sensor_col}"}) + .drop(columns=["_zeroed_pop", "_weighted_sensor"]) + ) + + return agg_df diff --git a/_delphi_utils_python/tests/test_geomap.py b/_delphi_utils_python/tests/test_geomap.py index ab86c143d..c968fd359 100644 --- a/_delphi_utils_python/tests/test_geomap.py +++ b/_delphi_utils_python/tests/test_geomap.py @@ -10,10 +10,12 @@ def geomapper(): return GeoMapper(census_year=2020) + @pytest.fixture(scope="class") def geomapper_2019(): return GeoMapper(census_year=2019) + class TestGeoMapper: fips_data = pd.DataFrame( { @@ -34,7 +36,8 @@ class TestGeoMapper: fips_data_3 = pd.DataFrame( { "fips": ["48059", "48253", "48441", "72003", "72005", "10999"], - "timestamp": [pd.Timestamp("2018-01-01")] * 3 + [pd.Timestamp("2018-01-03")] * 3, + "timestamp": [pd.Timestamp("2018-01-01")] * 3 + + [pd.Timestamp("2018-01-03")] * 3, "count": [1, 2, 3, 4, 8, 5], "total": [2, 4, 7, 11, 100, 10], } @@ -58,7 +61,8 @@ class TestGeoMapper: zip_data = pd.DataFrame( { "zip": ["45140", "95616", "95618"] * 2, - "timestamp": [pd.Timestamp("2018-01-01")] * 3 + [pd.Timestamp("2018-01-03")] * 3, + "timestamp": [pd.Timestamp("2018-01-01")] * 3 + + [pd.Timestamp("2018-01-03")] * 3, "count": [99, 345, 456, 100, 344, 442], } ) @@ -132,7 +136,7 @@ class TestGeoMapper: ) # Loading tests updated 8/26 - def 
test_crosswalks(self, geomapper): + def test_crosswalks(self, geomapper: GeoMapper): # These tests ensure that the one-to-many crosswalks have properly normalized weights # FIPS -> HRR is allowed to be an incomplete mapping, since only a fraction of a FIPS # code can not belong to an HRR @@ -152,33 +156,32 @@ def test_crosswalks(self, geomapper): cw = geomapper.get_crosswalk(from_code="zip", to_code="hhs") assert cw.groupby("zip")["weight"].sum().round(5).eq(1.0).all() - - def test_load_zip_fips_table(self, geomapper): + def test_load_zip_fips_table(self, geomapper: GeoMapper): fips_data = geomapper.get_crosswalk(from_code="zip", to_code="fips") assert set(fips_data.columns) == set(["zip", "fips", "weight"]) assert pd.api.types.is_string_dtype(fips_data.zip) assert pd.api.types.is_string_dtype(fips_data.fips) assert pd.api.types.is_float_dtype(fips_data.weight) - def test_load_state_table(self, geomapper): + def test_load_state_table(self, geomapper: GeoMapper): state_data = geomapper.get_crosswalk(from_code="state", to_code="state") assert tuple(state_data.columns) == ("state_code", "state_id", "state_name") assert state_data.shape[0] == 60 - def test_load_fips_msa_table(self, geomapper): + def test_load_fips_msa_table(self, geomapper: GeoMapper): msa_data = geomapper.get_crosswalk(from_code="fips", to_code="msa") assert tuple(msa_data.columns) == ("fips", "msa") - def test_load_fips_chngfips_table(self, geomapper): + def test_load_fips_chngfips_table(self, geomapper: GeoMapper): chngfips_data = geomapper.get_crosswalk(from_code="fips", to_code="chng-fips") assert tuple(chngfips_data.columns) == ("fips", "chng-fips") - def test_load_zip_hrr_table(self, geomapper): + def test_load_zip_hrr_table(self, geomapper: GeoMapper): zip_data = geomapper.get_crosswalk(from_code="zip", to_code="hrr") assert pd.api.types.is_string_dtype(zip_data["zip"]) assert pd.api.types.is_string_dtype(zip_data["hrr"]) - def test_megacounty(self, geomapper): + def test_megacounty(self, 
geomapper: GeoMapper): new_data = geomapper.fips_to_megacounty(self.mega_data, 6, 50) assert ( new_data[["count", "visits"]].sum() @@ -204,12 +207,18 @@ def test_megacounty(self, geomapper): "count": [8, 7, 3, 10021], } ) - pd.testing.assert_frame_equal(new_data.set_index("megafips").sort_index(axis=1), expected_df.set_index("megafips").sort_index(axis=1)) + pd.testing.assert_frame_equal( + new_data.set_index("megafips").sort_index(axis=1), + expected_df.set_index("megafips").sort_index(axis=1), + ) # chng-fips should have the same behavior when converting to megacounties. mega_county_groups = self.mega_data_3.copy() - mega_county_groups.fips.replace({1125:"01g01"}, inplace = True) + mega_county_groups.fips.replace({1125: "01g01"}, inplace=True) new_data = geomapper.fips_to_megacounty(self.mega_data_3, 4, 1) - pd.testing.assert_frame_equal(new_data.set_index("megafips").sort_index(axis=1), expected_df.set_index("megafips").sort_index(axis=1)) + pd.testing.assert_frame_equal( + new_data.set_index("megafips").sort_index(axis=1), + expected_df.set_index("megafips").sort_index(axis=1), + ) new_data = geomapper.fips_to_megacounty(self.mega_data_3, 4, 1, thr_col="count") expected_df = pd.DataFrame( @@ -220,14 +229,20 @@ def test_megacounty(self, geomapper): "count": [6, 5, 7, 10021], } ) - pd.testing.assert_frame_equal(new_data.set_index("megafips").sort_index(axis=1), expected_df.set_index("megafips").sort_index(axis=1)) + pd.testing.assert_frame_equal( + new_data.set_index("megafips").sort_index(axis=1), + expected_df.set_index("megafips").sort_index(axis=1), + ) # chng-fips should have the same behavior when converting to megacounties. 
mega_county_groups = self.mega_data_3.copy() - mega_county_groups.fips.replace({1123:"01g01"}, inplace = True) + mega_county_groups.fips.replace({1123: "01g01"}, inplace=True) new_data = geomapper.fips_to_megacounty(self.mega_data_3, 4, 1, thr_col="count") - pd.testing.assert_frame_equal(new_data.set_index("megafips").sort_index(axis=1), expected_df.set_index("megafips").sort_index(axis=1)) + pd.testing.assert_frame_equal( + new_data.set_index("megafips").sort_index(axis=1), + expected_df.set_index("megafips").sort_index(axis=1), + ) - def test_add_population_column(self, geomapper): + def test_add_population_column(self, geomapper: GeoMapper): new_data = geomapper.add_population_column(self.fips_data_3, "fips") assert new_data.shape == (5, 5) new_data = geomapper.add_population_column(self.zip_data, "zip") @@ -245,14 +260,18 @@ def test_add_population_column(self, geomapper): new_data = geomapper.add_population_column(self.nation_data, "nation") assert new_data.shape == (1, 3) - def test_add_geocode(self, geomapper): + def test_add_geocode(self, geomapper: GeoMapper): # state_code -> nation new_data = geomapper.add_geocode(self.zip_data, "zip", "state_code") new_data2 = geomapper.add_geocode(new_data, "state_code", "nation") assert new_data2["nation"].unique()[0] == "us" new_data = geomapper.replace_geocode(self.zip_data, "zip", "state_code") - new_data2 = geomapper.add_geocode(new_data, "state_code", "state_id", new_col="state") - new_data3 = geomapper.replace_geocode(new_data2, "state_code", "nation", new_col="geo_id") + new_data2 = geomapper.add_geocode( + new_data, "state_code", "state_id", new_col="state" + ) + new_data3 = geomapper.replace_geocode( + new_data2, "state_code", "nation", new_col="geo_id" + ) assert "state" not in new_data3.columns # state_code -> hhs @@ -264,11 +283,15 @@ def test_add_geocode(self, geomapper): new_data = geomapper.replace_geocode(self.zip_data, "zip", "state_name") new_data2 = geomapper.add_geocode(new_data, "state_name", 
"state_id") assert new_data2.shape == (4, 5) - new_data2 = geomapper.replace_geocode(new_data, "state_name", "state_id", new_col="abbr") + new_data2 = geomapper.replace_geocode( + new_data, "state_name", "state_id", new_col="abbr" + ) assert "abbr" in new_data2.columns # fips -> nation - new_data = geomapper.replace_geocode(self.fips_data_5, "fips", "nation", new_col="NATION") + new_data = geomapper.replace_geocode( + self.fips_data_5, "fips", "nation", new_col="NATION" + ) pd.testing.assert_frame_equal( new_data, pd.DataFrame().from_dict( @@ -278,15 +301,25 @@ def test_add_geocode(self, geomapper): "count": {0: 10024.0}, "total": {0: 100006.0}, } - ) + ), ) # fips -> chng-fips new_data = geomapper.add_geocode(self.fips_data_5, "fips", "chng-fips") - assert sorted(list(new_data["chng-fips"])) == ['01123', '18181', '48g19', '72003'] + assert sorted(list(new_data["chng-fips"])) == [ + "01123", + "18181", + "48g19", + "72003", + ] assert new_data["chng-fips"].size == self.fips_data_5.fips.size new_data = geomapper.replace_geocode(self.fips_data_5, "fips", "chng-fips") - assert sorted(list(new_data["chng-fips"])) == ['01123', '18181', '48g19', '72003'] + assert sorted(list(new_data["chng-fips"])) == [ + "01123", + "18181", + "48g19", + "72003", + ] assert new_data["chng-fips"].size == self.fips_data_5.fips.size # chng-fips -> state_id @@ -294,12 +327,12 @@ def test_add_geocode(self, geomapper): new_data2 = geomapper.add_geocode(new_data, "chng-fips", "state_id") assert new_data2["state_id"].unique().size == 4 assert new_data2["state_id"].size == self.fips_data_5.fips.size - assert sorted(list(new_data2["state_id"])) == ['al', 'in', 'pr', 'tx'] + assert sorted(list(new_data2["state_id"])) == ["al", "in", "pr", "tx"] new_data2 = geomapper.replace_geocode(new_data, "chng-fips", "state_id") assert new_data2["state_id"].unique().size == 4 assert new_data2["state_id"].size == 4 - assert sorted(list(new_data2["state_id"])) == ['al', 'in', 'pr', 'tx'] + assert 
sorted(list(new_data2["state_id"])) == ["al", "in", "pr", "tx"] # zip -> nation new_data = geomapper.replace_geocode(self.zip_data, "zip", "nation") @@ -315,7 +348,7 @@ def test_add_geocode(self, geomapper): "count": {0: 900, 1: 886}, "total": {0: 1800, 1: 1772}, } - ) + ), ) # hrr -> nation @@ -324,53 +357,84 @@ def test_add_geocode(self, geomapper): new_data2 = geomapper.replace_geocode(new_data, "hrr", "nation") # fips -> hrr (dropna=True/False check) - assert not geomapper.add_geocode(self.fips_data_3, "fips", "hrr").isna().any().any() - assert geomapper.add_geocode(self.fips_data_3, "fips", "hrr", dropna=False).isna().any().any() + assert ( + not geomapper.add_geocode(self.fips_data_3, "fips", "hrr") + .isna() + .any() + .any() + ) + assert ( + geomapper.add_geocode(self.fips_data_3, "fips", "hrr", dropna=False) + .isna() + .any() + .any() + ) # fips -> zip (date_col=None chech) - new_data = geomapper.replace_geocode(self.fips_data_5.drop(columns=["timestamp"]), "fips", "hrr", date_col=None) + new_data = geomapper.replace_geocode( + self.fips_data_5.drop(columns=["timestamp"]), "fips", "hrr", date_col=None + ) pd.testing.assert_frame_equal( new_data, pd.DataFrame().from_dict( { - 'hrr': {0: '1', 1: '183', 2: '184', 3: '382', 4: '7'}, - 'count': {0: 1.772347174163783, 1: 7157.392403522299, 2: 2863.607596477701, 3: 1.0, 4: 0.22765282583621685}, - 'total': {0: 3.544694348327566, 1: 71424.64801363471, 2: 28576.35198636529, 3: 1.0, 4: 0.4553056516724337} + "hrr": {0: "1", 1: "183", 2: "184", 3: "382", 4: "7"}, + "count": { + 0: 1.772347174163783, + 1: 7157.392403522299, + 2: 2863.607596477701, + 3: 1.0, + 4: 0.22765282583621685, + }, + "total": { + 0: 3.544694348327566, + 1: 71424.64801363471, + 2: 28576.35198636529, + 3: 1.0, + 4: 0.4553056516724337, + }, } - ) + ), ) # fips -> hhs - new_data = geomapper.replace_geocode(self.fips_data_3.drop(columns=["timestamp"]), - "fips", "hhs", date_col=None) + new_data = geomapper.replace_geocode( + 
self.fips_data_3.drop(columns=["timestamp"]), "fips", "hhs", date_col=None + ) pd.testing.assert_frame_equal( new_data, pd.DataFrame().from_dict( { "hhs": {0: "2", 1: "6"}, "count": {0: 12, 1: 6}, - "total": {0: 111, 1: 13} + "total": {0: 111, 1: 13}, } - ) + ), ) # zip -> hhs new_data = geomapper.replace_geocode(self.zip_data, "zip", "hhs") - new_data = new_data.round(10) # get rid of a floating point error with 99.00000000000001 + new_data = new_data.round( + 10 + ) # get rid of a floating point error with 99.00000000000001 pd.testing.assert_frame_equal( new_data, pd.DataFrame().from_dict( { - "timestamp": {0: pd.Timestamp("2018-01-01"), 1: pd.Timestamp("2018-01-01"), - 2: pd.Timestamp("2018-01-03"), 3: pd.Timestamp("2018-01-03")}, + "timestamp": { + 0: pd.Timestamp("2018-01-01"), + 1: pd.Timestamp("2018-01-01"), + 2: pd.Timestamp("2018-01-03"), + 3: pd.Timestamp("2018-01-03"), + }, "hhs": {0: "5", 1: "9", 2: "5", 3: "9"}, "count": {0: 99.0, 1: 801.0, 2: 100.0, 3: 786.0}, - "total": {0: 198.0, 1: 1602.0, 2: 200.0, 3: 1572.0} + "total": {0: 198.0, 1: 1602.0, 2: 200.0, 3: 1572.0}, } - ) + ), ) - def test_get_geos(self, geomapper): + def test_get_geos(self, geomapper: GeoMapper): assert geomapper.get_geo_values("nation") == {"us"} assert geomapper.get_geo_values("hhs") == set(str(i) for i in range(1, 11)) assert len(geomapper.get_geo_values("fips")) == 3293 @@ -378,20 +442,114 @@ def test_get_geos(self, geomapper): assert len(geomapper.get_geo_values("state_id")) == 60 assert len(geomapper.get_geo_values("zip")) == 32976 - def test_get_geos_2019(self, geomapper_2019): + def test_get_geos_2019(self, geomapper_2019: GeoMapper): assert len(geomapper_2019.get_geo_values("fips")) == 3292 assert len(geomapper_2019.get_geo_values("chng-fips")) == 2710 - def test_get_geos_within(self, geomapper): - assert len(geomapper.get_geos_within("us","state","nation")) == 60 - assert len(geomapper.get_geos_within("al","county","state")) == 68 - assert 
len(geomapper.get_geos_within("al","fips","state")) == 68 - assert geomapper.get_geos_within("al","fips","state") == geomapper.get_geos_within("al","county","state") - assert len(geomapper.get_geos_within("al","chng-fips","state")) == 66 - assert len(geomapper.get_geos_within("4","state","hhs")) == 8 - assert geomapper.get_geos_within("4","state","hhs") == {'al', 'fl', 'ga', 'ky', 'ms', 'nc', "tn", "sc"} + def test_get_geos_within(self, geomapper: GeoMapper): + assert len(geomapper.get_geos_within("us", "state", "nation")) == 60 + assert len(geomapper.get_geos_within("al", "county", "state")) == 68 + assert len(geomapper.get_geos_within("al", "fips", "state")) == 68 + assert geomapper.get_geos_within( + "al", "fips", "state" + ) == geomapper.get_geos_within("al", "county", "state") + assert len(geomapper.get_geos_within("al", "chng-fips", "state")) == 66 + assert len(geomapper.get_geos_within("4", "state", "hhs")) == 8 + assert geomapper.get_geos_within("4", "state", "hhs") == { + "al", + "fl", + "ga", + "ky", + "ms", + "nc", + "tn", + "sc", + } - def test_census_year_pop(self, geomapper, geomapper_2019): + def test_census_year_pop(self, geomapper: GeoMapper, geomapper_2019: GeoMapper): df = pd.DataFrame({"fips": ["01001"]}) assert geomapper.add_population_column(df, "fips").population[0] == 56145 assert geomapper_2019.add_population_column(df, "fips").population[0] == 55869 + + def test_aggregate_by_weighted_sum(self, geomapper: GeoMapper): + df = pd.DataFrame( + { + "timestamp": [0] * 7, + "state": ["al", "al", "ca", "ca", "nd", "me", "me"], + "a": [1, 2, 3, 4, 12, -2, 2], + "b": [5, 6, 7, np.nan, np.nan, -1, -2], + "population_served": [10, 5, 8, 1, 3, 1, 2], + } + ) + agg_df = geomapper.aggregate_by_weighted_sum( + df, + to_geo="state", + sensor_col="a", + time_col="timestamp", + population_col="population_served", + ) + agg_df_by_hand = pd.DataFrame( + { + "timestamp": [0] * 4, + "state": ["al", "ca", "me", "nd"], + "weighted_a": [ + (1 * 10 + 2 * 5) / 15, + 
(3 * 8 + 4 * 1) / 9, + (-2 * 1 + 2 * 2) / 3, + (12 * 3) / 3, + ], + } + ) + pd.testing.assert_frame_equal(agg_df, agg_df_by_hand) + agg_df = geomapper.aggregate_by_weighted_sum( + df, + to_geo="state", + sensor_col="b", + time_col="timestamp", + population_col="population_served", + ) + agg_df_by_hand = pd.DataFrame( + { + "timestamp": [0] * 4, + "state": ["al", "ca", "me", "nd"], + "weighted_b": [ + (5 * 10 + 6 * 5) / 15, + (7 * 8 + 4 * 0) / 8, + (-1 * 1 + -2 * 2) / 3, + (np.nan) / 3, + ], + } + ) + pd.testing.assert_frame_equal(agg_df, agg_df_by_hand) + + df = pd.DataFrame( + { + "state": [ + "al", + "al", + "ca", + "ca", + "nd", + ], + "nation": ["us"] * 5, + "timestamp": [0] * 3 + [1] * 2, + "a": [1, 2, 3, 4, 12], + "b": [5, 6, 7, np.nan, np.nan], + "population_served": [10, 5, 8, 1, 3], + } + ) + agg_df = geomapper.aggregate_by_weighted_sum( + df, + to_geo="nation", + sensor_col="a", + time_col="timestamp", + population_col="population_served", + ) + agg_df_by_hand = pd.DataFrame( + { + "timestamp": [0, 1], + "nation": ["us"] * 2, + "weighted_a": [(1 * 10 + 2 * 5 + 3 * 8) / 23, (1 * 4 + 3 * 12) / 4], + } + ) + pd.testing.assert_frame_equal(agg_df, agg_df_by_hand)
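The tests above exercise the NA-handling rule of `aggregate_by_weighted_sum`: rows whose sensor value is NA have their population zeroed out, so they drop from both the numerator and the denominator, rather than being zero-filled as in `replace_geocode`. A minimal pure-Python sketch of that rule (not the library implementation; `weighted_mean` is a hypothetical helper name):

```python
import math

def weighted_mean(values, populations):
    """Population-weighted mean that drops NA sensor values entirely.

    An NA value contributes to neither the numerator nor the denominator
    (its population is zeroed out), unlike a zero-fill, which would keep
    the population in the denominator and drag the mean toward zero.
    """
    num = den = 0.0
    for v, p in zip(values, populations):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue  # zero-out the population where the sensor is NA
        num += v * p
        den += p
    return num / den if den else float("nan")

# Mirrors the "ca" row for sensor "b" in the tests above:
# values [7, NaN], populations [8, 1] -> (7 * 8) / 8
print(weighted_mean([7.0, float("nan")], [8, 1]))  # 7.0
```

With a zero-fill instead, the same row would come out as (7 * 8) / 9, which is why the docstring calls the NA-dropping behavior an extrapolation assumption to the missing geos.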