First pass of the CDC Vaccination Indicator #1238
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
|
||
[MESSAGES CONTROL] | ||
|
||
disable=logging-format-interpolation, | ||
too-many-locals, | ||
too-many-arguments, | ||
# Allow pytest functions to be part of a class. | ||
no-self-use, | ||
# Allow pytest classes to have one test. | ||
too-few-public-methods | ||
|
||
[BASIC] | ||
|
||
# Allow arbitrarily short-named variables. | ||
variable-rgx=[a-z_][a-z0-9_]* | ||
argument-rgx=[a-z_][a-z0-9_]* | ||
attr-rgx=[a-z_][a-z0-9_]* | ||
|
||
[DESIGN] | ||
|
||
# Don't complain about pytest "unused" arguments. | ||
ignored-argument-names=(_.*|run_as_module) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
.PHONY: venv lint test clean | ||
|
||
dir = $(shell find ./delphi_* -name __init__.py | grep -o 'delphi_[_[:alnum:]]*') | ||
|
||
venv: | ||
python3.8 -m venv env | ||
|
||
install: venv | ||
. env/bin/activate; \ | ||
pip install wheel ; \ | ||
pip install -e ../_delphi_utils_python ;\ | ||
pip install -e . | ||
|
||
lint: | ||
. env/bin/activate; pylint $(dir) | ||
. env/bin/activate; pydocstyle $(dir) | ||
|
||
test: | ||
. env/bin/activate ;\ | ||
(cd tests && ../env/bin/pytest --cov=$(dir) --cov-report=term-missing) | ||
|
||
clean: | ||
rm -rf env | ||
rm -f params.json | ||
|
||
run: | ||
env/bin/python -m $(dir) | ||
env/bin/python -m delphi_utils.validator --dry_run | ||
env/bin/python -m delphi_utils.archive |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
# CDC Vaccinations | ||
|
||
This indicator provides the official vaccination counts in the US. We export the county-level | ||
daily vaccination rates data as-is, and publish the result as a COVIDcast signal. | ||
We also aggregate the data to the MSA, HRR, State, HHS Region, and Nation levels. | ||
For detailed information, see the file DETAILS.md contained in this directory. | ||
|
||
Note that individuals could be vaccinated outside of the US. Additionally, | ||
there is no county-level data for counties in Texas and Hawaii. Each state has some vaccination counts assigned to "unknown county". Some vaccination counts are assigned to "unknown state, unknown county". | ||
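
As a minimal sketch of how the "unknown county" rows can be folded into a state-level pseudo-FIPS code: the pull code in this indicator maps a raw FIPS of `"UNK"` to the state code times 1000 and then zero-pads every code to five digits. The helper name `conform_fips` below is hypothetical and only illustrates that rule for a single value.

```python
def conform_fips(raw_fips: str, state_code: int) -> str:
    """Return a five-digit FIPS string (hypothetical helper for illustration).

    Rows with an unknown county ("UNK") become XX000, where XX is the
    state code; a state code of 0 stands for "unknown state".
    """
    if raw_fips == "UNK":
        raw_fips = str(state_code * 1000)
    # Zero-pad to five digits so that "1001" becomes "01001".
    return f"{int(raw_fips):05d}"


print(conform_fips("UNK", 48))   # unknown county within state code 48
print(conform_fips("1001", 1))   # ordinary county code
print(conform_fips("UNK", 0))    # unknown state, unknown county
```

Under these assumptions, "unknown state, unknown county" rows end up under the code `00000`, which the pull code handles separately.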
|
||
|
||
## Running the Indicator | ||
|
||
The indicator is run by directly executing the Python module contained in this | ||
directory. The safest way to do this is to create a virtual environment, | ||
install the common DELPHI tools, and then install the module and its | ||
dependencies. To do this, run the following command from this directory: | ||
|
||
``` | ||
make install | ||
``` | ||
|
||
This command will install the package in editable mode, so you can make changes that | ||
will automatically propagate to the installed package. | ||
|
||
All of the user-changeable parameters are stored in `params.json`. To execute | ||
the module and produce the output datasets (by default, in `receiving`), run | ||
the following: | ||
|
||
``` | ||
env/bin/python -m delphi_cdc_vaccines | ||
``` | ||
|
||
If you want to enter the virtual environment in your shell, | ||
you can run `source env/bin/activate`. Run `deactivate` to leave the virtual environment. | ||
|
||
Once you are finished, you can remove the virtual environment and | ||
params file with the following: | ||
|
||
``` | ||
make clean | ||
``` | ||
|
||
## Testing the code | ||
|
||
To run static tests of the code style, run the following command: | ||
|
||
``` | ||
make lint | ||
``` | ||
|
||
Unit tests are also included in the module. To execute these, run the following | ||
command from this directory: | ||
|
||
``` | ||
make test | ||
``` | ||
|
||
To run individual tests, run the following: | ||
|
||
``` | ||
(cd tests && ../env/bin/pytest test_run.py --cov=delphi_cdc_vaccines --cov-report=term-missing) | ||
``` | ||
|
||
The output will show the number of unit tests that passed and failed, along | ||
with the percentage of code covered by the tests. | ||
|
||
None of the linting or unit tests should fail, and the number of code lines not covered by unit tests should be small and | ||
should not include critical sub-routines. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
## Code Review (Python) | ||
|
||
A code review of this module should include a careful look at the code and the | ||
output. To assist in the process, but certainly not in place of it, please | ||
check the following items. | ||
|
||
**Documentation** | ||
|
||
- [ ] the README.md file template is filled out and currently accurate; it is | ||
possible to load and test the code using only the instructions given | ||
- [ ] minimal docstrings (one line describing what the function does) are | ||
included for all functions; full docstrings describing the inputs and expected | ||
outputs should be given for non-trivial functions | ||
|
||
**Structure** | ||
|
||
- [ ] code should pass lint checks (`make lint`) | ||
- [ ] any required metadata files are checked into the repository and placed | ||
within the directory `static` | ||
- [ ] any intermediate files that are created and stored by the module should | ||
be placed in the directory `cache` | ||
- [ ] final expected output files to be uploaded to the API are placed in the | ||
`receiving` directory; output files should not be committed to the repository | ||
- [ ] all options and API keys are passed through the file `params.json` | ||
- [ ] template parameter file (`params.json.template`) is checked into the | ||
code; no personal (e.g., usernames) or private (e.g., API keys) information is | ||
included in this template file | ||
|
||
**Testing** | ||
|
||
- [ ] module can be installed in a new virtual environment (`make install`) | ||
- [ ] reasonably high level of unit test coverage covering all of the main logic | ||
of the code (e.g., missing coverage for raised errors that do not currently seem | ||
possible to reach are okay; missing coverage for options that will be needed are | ||
not) | ||
- [ ] all unit tests run without errors (`make test`) | ||
- [ ] indicator directory has been added to GitHub CI | ||
(`covidcast-indicators/.github/workflows/python-ci.yml`) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# -*- coding: utf-8 -*- | ||
"""Module to pull and clean indicators from the CDC source. | ||
|
||
This file defines the functions that are made public by the module. As the | ||
module is intended to be executed through the main method, these are primarily | ||
for testing. | ||
""" | ||
|
||
from __future__ import absolute_import | ||
from . import pull | ||
from . import run | ||
|
||
__version__ = "0.1.0" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# -*- coding: utf-8 -*- | ||
"""Call the function run_module when executed. | ||
|
||
This file indicates that calling the module (`python -m delphi_cdc_vaccines`) will | ||
call the function `run_module` found within the run.py file. There should be | ||
no need to change this template. | ||
""" | ||
|
||
from delphi_utils import read_params | ||
from .run import run_module # pragma: no cover | ||
|
||
run_module(read_params()) # pragma: no cover |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
"""Registry for variations.""" | ||
|
||
from itertools import product | ||
from delphi_utils import Smoother | ||
|
||
|
||
CUMULATIVE = 'cumulative' | ||
INCIDENCE = 'incidence' | ||
FREQUENCY = [CUMULATIVE, INCIDENCE] | ||
STATUS = ["tot", "part"] | ||
AGE = ["", "_12P", "_18P", "_65P"] | ||
|
||
SIGNALS = [f"{frequency}_counts_{status}_vaccine{age}" for | ||
frequency, status, age in product(FREQUENCY, STATUS, AGE)] | ||
DIFFERENCE_MAPPING = { | ||
f"{INCIDENCE}_counts_{status}_vaccine{age}": f"{CUMULATIVE}_counts_{status}_vaccine{age}" | ||
for status, age in product(STATUS, AGE) | ||
} | ||
SIGNALS = list(DIFFERENCE_MAPPING.keys()) + list(DIFFERENCE_MAPPING.values()) | ||
|
||
|
||
GEOS = [ | ||
"nation", | ||
"state", | ||
"hrr", | ||
"hhs", | ||
"msa" | ||
] | ||
|
||
SMOOTHERS = [ | ||
(Smoother("identity", impute_method=None), ""), | ||
(Smoother("moving_average", window_length=7), "_7dav"), | ||
] |
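
To make the registry above easier to review, here is a minimal standalone sketch of how `DIFFERENCE_MAPPING` and `SIGNALS` expand. It copies the constants from this file and leaves out `Smoother`, so it runs without `delphi_utils`; the printing loop is only for illustration.

```python
from itertools import product

CUMULATIVE = "cumulative"
INCIDENCE = "incidence"
STATUS = ["tot", "part"]
AGE = ["", "_12P", "_18P", "_65P"]

# Each incidence signal is derived from the cumulative signal of the
# same vaccination status and age group.
DIFFERENCE_MAPPING = {
    f"{INCIDENCE}_counts_{status}_vaccine{age}": f"{CUMULATIVE}_counts_{status}_vaccine{age}"
    for status, age in product(STATUS, AGE)
}
SIGNALS = list(DIFFERENCE_MAPPING.keys()) + list(DIFFERENCE_MAPPING.values())

for incidence_name, cumulative_name in DIFFERENCE_MAPPING.items():
    print(f"{incidence_name} <- diff({cumulative_name})")
print(len(SIGNALS), "signals in total")
```

With two statuses and four age groups, the mapping has eight entries, so `SIGNALS` holds sixteen names before any smoother suffix is applied.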
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
# -*- coding: utf-8 -*- | ||
"""Functions for pulling data from the CDC data website for vaccines.""" | ||
import hashlib | ||
from logging import Logger | ||
from delphi_utils.geomap import GeoMapper | ||
import numpy as np | ||
import pandas as pd | ||
from .constants import SIGNALS, DIFFERENCE_MAPPING | ||
|
||
|
||
|
||
def pull_cdcvacc_data(base_url: str, logger: Logger) -> pd.DataFrame: | ||
"""Pull the latest data from the CDC on vaccines and conform it into a dataset. | ||
|
||
The output dataset has: | ||
- Each row corresponds to (County, Date), denoted (FIPS, timestamp) | ||
- Each row additionally has columns that correspond to the counts or | ||
cumulative counts of vaccination status (fully vaccinated, | ||
partially vaccinated) of various age groups (all, 12+, 18+, 65+) | ||
from December 13th 2020 until the latest date | ||
|
||
Note that the raw dataset gives the `cumulative` metrics, from which | ||
we compute `counts` by taking first differences. Hence, `counts` | ||
may be negative. This is wholly dependent on the quality of the raw | ||
dataset. | ||
|
||
We filter the data such that we only keep rows with valid FIPS, or "FIPS" | ||
codes defined under the exceptions of the README. The current exceptions | ||
include: | ||
- XX000: statewise unallocated (raw FIPS "UNK"), where XX is the state code | ||
Parameters | ||
---------- | ||
base_url: str | ||
Base URL for pulling the CDC Vaccination Data | ||
logger: Logger | ||
Logger used to record metadata about the retrieved data | ||
Returns | ||
------- | ||
pd.DataFrame | ||
Dataframe as described above. | ||
""" | ||
# Columns to drop from the data frame. | ||
drop_columns = [ | ||
"date", | ||
"recip_state", | ||
"series_complete_pop_pct", | ||
"mmwr_week", | ||
"recip_county", | ||
"state_id" | ||
] | ||
|
||
|
||
# Read data | ||
df = pd.read_csv(base_url) | ||
logger.info("data retrieved from source", | ||
num_rows=df.shape[0], | ||
num_cols=df.shape[1], | ||
min_date=min(df['Date']), | ||
max_date=max(df['Date']), | ||
checksum=hashlib.sha256(pd.util.hash_pandas_object(df).values).hexdigest()) | ||
df.columns = [i.lower() for i in df.columns] | ||
|
||
df['recip_state'] = df['recip_state'].str.lower() | ||
drop_columns.extend([x for x in df.columns if ("pct" in x) or ("svi" in x)]) | ||
drop_columns = list(set(drop_columns)) | ||
df = GeoMapper().add_geocode(df, "state_id", "state_code", | ||
from_col="recip_state", new_col="state_id", dropna=False) | ||
df['state_id'] = df['state_id'].fillna('0').astype(int) | ||
# Change FIPS from "UNK" to XX000 for statewise unallocated vaccinations | ||
unassigned_index = (df["fips"] == "UNK") | ||
df.loc[unassigned_index, "fips"] = df["state_id"].loc[unassigned_index].values * 1000 | ||
|
||
# Conform FIPS | ||
df["fips"] = df["fips"].apply(lambda x: f"{int(x):05d}") | ||
df["timestamp"] = pd.to_datetime(df["date"]) | ||
# Drop unnecessary columns (state is pre-encoded in fips) | ||
try: | ||
df.drop(drop_columns, axis=1, inplace=True) | ||
except KeyError as e: | ||
raise ValueError( | ||
"Tried to drop non-existent columns. The dataset " | ||
"schema may have changed. Please investigate and " | ||
"amend drop_columns." | ||
) from e | ||
# Rename the remaining columns to the cumulative signal names | ||
df.columns = ["fips", | ||
"cumulative_counts_tot_vaccine", | ||
"cumulative_counts_tot_vaccine_12P", | ||
"cumulative_counts_tot_vaccine_18P", | ||
"cumulative_counts_tot_vaccine_65P", | ||
"cumulative_counts_part_vaccine", | ||
"cumulative_counts_part_vaccine_12P", | ||
"cumulative_counts_part_vaccine_18P", | ||
"cumulative_counts_part_vaccine_65P", | ||
"timestamp"] | ||
|
||
df_dummy = df.loc[(df["fips"] != '00000') & (df["timestamp"] == min(df["timestamp"]))].copy() | ||
# Handle FIPS 00000 separately | ||
df_oth = df.loc[((df["fips"] == '00000') & | ||
|
||
(df["timestamp"] == min(df[df['fips'] == '00000']['timestamp'])))].copy() | ||
df_dummy = pd.concat([df_dummy, df_oth]) | ||
df_dummy.loc[:, "timestamp"] = df_dummy.loc[:, "timestamp"] - pd.Timedelta(days=1) | ||
df_dummy.loc[:, ["cumulative_counts_tot_vaccine", | ||
"cumulative_counts_tot_vaccine_12P", | ||
"cumulative_counts_tot_vaccine_18P", | ||
"cumulative_counts_tot_vaccine_65P", | ||
"cumulative_counts_part_vaccine", | ||
"cumulative_counts_part_vaccine_12P", | ||
"cumulative_counts_part_vaccine_18P", | ||
"cumulative_counts_part_vaccine_65P", | ||
]] = 0 | ||
|
||
df = pd.concat([df_dummy, df]) | ||
# Obtain new_counts | ||
df.sort_values(["fips", "timestamp"], inplace=True) | ||
for to, from_d in DIFFERENCE_MAPPING.items(): | ||
df[to] = df[from_d].diff() | ||
Btw, you might like this version of taking diffs, grouped by geos.
I think the
I wonder if we can keep this method for now, but then later I'll look to fix the method and use it.
Sure, that's alright. A possible refactor for later. |
||
|
||
rem_list = [ x for x in list(df.columns) if x not in ['timestamp', 'fips'] ] | ||
# Handle edge cases where we diffed across fips | ||
mask = df["fips"] != df["fips"].shift(1) | ||
df.loc[mask, rem_list] = np.nan | ||
df.reset_index(inplace=True, drop=True) | ||
# Final sanity checks | ||
unique_days = df["timestamp"].unique() | ||
min_timestamp = min(unique_days) | ||
max_timestamp = max(unique_days) | ||
n_days = (max_timestamp - min_timestamp) / np.timedelta64(1, "D") + 1 | ||
if n_days != len(unique_days): | ||
raise ValueError( | ||
f"Not every day between {min_timestamp} and " | ||
f"{max_timestamp} is represented." | ||
) | ||
return df.loc[ | ||
df["timestamp"] >= min(df["timestamp"]), | ||
# Reorder | ||
["fips", "timestamp"] + SIGNALS, | ||
].reset_index(drop=True) |
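
The review thread above suggests taking diffs grouped by geo as a possible later refactor. A minimal sketch of that idea with plain pandas follows; the toy data frame and its values are made up for illustration, and this is not the helper the reviewer linked to. Grouping by `fips` before calling `diff()` means the first row of every county is `NaN` automatically, so no mask is needed to undo differences taken across county boundaries.

```python
import pandas as pd

# Toy cumulative counts for two counties (illustrative values only).
df = pd.DataFrame({
    "fips": ["01001", "01001", "01001", "01003", "01003"],
    "timestamp": pd.to_datetime(
        ["2020-12-13", "2020-12-14", "2020-12-15", "2020-12-13", "2020-12-14"]
    ),
    "cumulative_counts_tot_vaccine": [0, 5, 12, 0, 3],
})

df = df.sort_values(["fips", "timestamp"]).reset_index(drop=True)

# First differences within each county; the first row per county is NaN.
df["incidence_counts_tot_vaccine"] = (
    df.groupby("fips")["cumulative_counts_tot_vaccine"].diff()
)
print(df)
```

In the real module the same call could be applied once per entry of `DIFFERENCE_MAPPING`, assuming the frame is already sorted by `fips` and `timestamp`.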