Commit 03c6f45

Merge pull request #1363 from cmu-delphi/krivard/community_profile
New indicator: test positivity and volume from DSEW Community Profile Report
2 parents 4ac71f7 + 2be28c5 commit 03c6f45

File tree

19 files changed

+1208
-1
lines changed

.github/workflows/python-ci.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ jobs:
     if: github.event.pull_request.draft == false
     strategy:
       matrix:
-        packages: [_delphi_utils_python, changehc, claims_hosp, combo_cases_and_deaths, doctor_visits, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph_patterns, sir_complainsalot, usafacts]
+        packages: [_delphi_utils_python, changehc, claims_hosp, combo_cases_and_deaths, doctor_visits, dsew_community_profile, google_symptoms, hhs_hosp, hhs_facilities, jhu, nchs_mortality, nowcast, quidel, quidel_covidtest, safegraph_patterns, sir_complainsalot, usafacts]
     defaults:
       run:
         working-directory: ${{ matrix.packages }}
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+{
+  "common": {
+    "export_dir": "./receiving",
+    "log_filename": "dsew_cpr.log"
+  },
+  "indicator": {
+    "input_cache": "./input_cache",
+    "reports": "new"
+  },
+  "validation": {
+    "common": {
+      "data_source": "dsew_cpr",
+      "span_length": 14,
+      "min_expected_lag": {"all": "5"},
+      "max_expected_lag": {"all": "9"},
+      "dry_run": true,
+      "suppressed_errors": []
+    },
+    "static": {
+      "minimum_sample_size": 0,
+      "missing_se_allowed": true,
+      "missing_sample_size_allowed": true
+    },
+    "dynamic": {
+      "ref_window_size": 7,
+      "smoothed_signals": [
+        "naats_total_7dav",
+        "naats_positivity_7dav"
+      ]
+    }
+  }
+}

dsew_community_profile/.pylintrc

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
1+
2+
[MESSAGES CONTROL]
3+
4+
disable=logging-format-interpolation,
5+
too-many-locals,
6+
too-many-arguments,
7+
# Allow pytest functions to be part of a class.
8+
no-self-use,
9+
# Allow pytest classes to have one test.
10+
too-few-public-methods
11+
12+
[BASIC]
13+
14+
# Allow arbitrarily short-named variables.
15+
variable-rgx=[a-z_][a-z0-9_]*
16+
argument-rgx=[a-z_][a-z0-9_]*
17+
attr-rgx=[a-z_][a-z0-9_]*
18+
19+
[DESIGN]
20+
21+
# Don't complain about pytest "unused" arguments.
22+
ignored-argument-names=(_.*|run_as_module)

dsew_community_profile/DETAILS.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
1+
# Dataset layout
2+
3+
The Data Strategy and Execution Workgroup (DSEW) publishes a Community Profile
4+
Report each weekday, comprising a pair of files: an Excel workbook (.xlsx) and a
5+
PDF which shows select metrics from the workbook as time series charts and
6+
choropleth maps. These files are listed as attachments on the healthdata.gov
7+
site:
8+
9+
https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9
10+
11+
Each Excel file attachment has a filename. The filename contains a date,
12+
presumably the publish date. The attachment also has an alphanumeric
13+
assetId. Both the filename and the assetId are required for downloading the
14+
file. Whether this means that updated versions of a particular file may be
15+
uploaded by DSEW at later times is not known. The attachment does not explicitly
16+
list an upload timestamp. To be safe, we cache our downloads using both the
17+
assetId and the filename.
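
The caching rule above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names and the `<assetId>--<filename>` naming scheme are assumptions.

```python
import os


def cache_path(input_cache: str, asset_id: str, filename: str) -> str:
    """Return a cache path that changes if either the assetId or the filename changes."""
    # Hypothetical naming scheme: "<assetId>--<filename>"
    return os.path.join(input_cache, f"{asset_id}--{filename}")


def is_cached(input_cache: str, asset_id: str, filename: str) -> bool:
    """Report whether this exact (assetId, filename) pair was already downloaded."""
    return os.path.exists(cache_path(input_cache, asset_id, filename))
```

Keying on both values means a re-uploaded file with the same name but a new assetId is treated as a new download rather than a cache hit.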
18+
19+
# Workbook layout
20+
21+
Each Excel file is a workbook with multiple sheets. The exemplar file used in
22+
writing this indicator is "Community Profile Report 20211102.xlsx". The sheets
23+
include:
24+
25+
- User Notes: Instructions for using the workbook
26+
- Overview: US National figures for the last 5 weeks, plus monthly peaks back to
27+
April 2020
28+
- Regions*: Figures for FEMA regions (double-checked: they match HHS regions
29+
except that FEMA 2 does not include Palau while HHS 2 does)
30+
- States*: Figures for US states and territories
31+
- CBSAs*: Figures for US Core-Based Statistical Areas
32+
- Counties*: Figures for US counties
33+
- Weekly Transmission Categories: Lists of high, substantial, and moderate
34+
transmission states and territories
35+
- National Peaks: Monthly national peaks back to April 2020
36+
- National Historic: Daily national figures back to January 22 2020
37+
- Data Notes: Source and methods information for all metrics
38+
- Color Thresholds: Color-coding is used extensively in all sheets; these are
39+
the keys
40+
41+
The starred sheets above have nearly-identical column layouts, and together
42+
cover the county, MSA, state, and HHS geographical levels used in
43+
covidcast. Rather than aggregate them ourselves and risk a mismatch, this
44+
indicator lifts these geographical aggregations directly from the corresponding
45+
sheets of the workbook.
46+
47+
GeoMapper _is_ used to generate national figures from
48+
state, due to architectural differences between the starred sheets and the
49+
Overview sheet. If we discover that our nation-level figures differ too much
50+
from those listed in the Overview sheet, we can add dedicated parsing for the
51+
Overview sheet and remove GeoMapper from this indicator altogether.
52+
53+
# Sheet layout
54+
55+
## Headers
56+
57+
Each starred sheet has two rows of headers. The first row uses merged cells to
58+
group several columns together under a single "overheader". This overheader
59+
often includes the reference period for that group of columns, such as:
60+
61+
- CASES/DEATHS: LAST WEEK (October 26-November 1)
62+
- TESTING: LAST WEEK (October 24-30, Test Volume October 20-26)
63+
- TESTING: PREVIOUS WEEK (October 17-23, Test Volume October 13-19)
64+
65+
Overheaders have changed periodically since the first report. For example, the
66+
"TESTING: LAST WEEK" overheader above has also appeared as "VIRAL (RT-PCR) LAB
67+
TESTING: LAST WEEK", with and without a separate reference date for Test
68+
Volume. All known overheader forms are checked in test_pull.py.
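
As a rough illustration of how reference periods could be pulled out of the overheader forms listed above, here is a sketch using a regular expression. The pattern and the function name `parse_reference_periods` are assumptions made for this example; the indicator's real parsing code is not shown in this document and may differ.

```python
import re

# Matches "October 24-30" as well as "October 26-November 1".
PERIOD = re.compile(r"\b([A-Z][a-z]+) (\d{1,2})-(?:([A-Z][a-z]+) )?(\d{1,2})\b")


def parse_reference_periods(overheader: str) -> list:
    """Return (start_month, start_day, end_month, end_day) for each period found."""
    periods = []
    for start_month, start_day, end_month, end_day in PERIOD.findall(overheader):
        # When the end month is omitted, the period ends in the start month.
        periods.append((start_month, int(start_day), end_month or start_month, int(end_day)))
    return periods
```

An overheader with a separate Test Volume period yields two tuples, one per metric; an overheader with a single period yields one.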
69+
70+
The second row contains a header for each column. The headers uniquely identify
71+
each column included in the sheet. Column headers include spaces, and typically
72+
specify both the metric and the reference period over which it was calculated,
73+
such as:
74+
75+
- Total NAATs - last 7 days (may be an underestimate due to delayed reporting)
76+
- NAAT positivity rate - previous 7 days (may be an underestimate due to delayed
77+
reporting)
78+
79+
Column headers have also changed periodically since the first report. For
80+
example, the "Total NAATs - last 7 days" header above has also appeared as
81+
"Total RT-PCR diagnostic tests - last 7 days".
82+
83+
## Contents
84+
85+
Each starred sheet contains test positivity and total test volume figures for
86+
two reference periods, "last [week]" and "previous [week]". In some reports, the
87+
reference periods for test positivity and total test volume are the same; in
88+
others, they are different, such that the report contains figures for four
89+
distinct reference periods, two for each metric we extract.
90+
91+
# Time series conversions and parsing notes
92+
93+
## Reference date
94+
95+
The reference period in the overheader never includes the year. We guess the
96+
reference year by picking the same year as the publish date (i.e., the date
97+
extracted from the filename), and if the reference month is greater than the
98+
publish month, subtract 1 from the reference year. This adequately covers the
99+
December-January boundary.
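
The year-guessing rule above amounts to a single comparison. A minimal sketch, assuming a hypothetical helper name `guess_reference_year`:

```python
from datetime import date


def guess_reference_year(publish_date: date, reference_month: int) -> int:
    """Guess the year of a reference period that only states month and day."""
    # A reference month later than the publish month can only belong to the
    # previous year (e.g. a December period in a report published in January).
    if reference_month > publish_date.month:
        return publish_date.year - 1
    return publish_date.year
```

For a report published on 2022-01-03 with a reference period in December, this returns 2021.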
100+
101+
We select as reference date the end date of the reference period for each
102+
metric. Reference periods are always 7 days, so this indicator produces
103+
seven-day averages. We divide the total testing volume by seven and leave the
104+
test positivity alone.
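
The conversion described above might look like the sketch below. The function name and the output dictionary layout are assumptions for illustration; the signal names are the ones listed under `smoothed_signals` in the params template.

```python
from datetime import date


def to_seven_day_average(period_end: date, total_tests: float, positivity: float) -> dict:
    """Convert one 7-day reference period into seven-day-average values."""
    return {
        # The reference date is the end date of the reference period.
        "timestamp": period_end,
        # Test volume is a 7-day total, so divide by seven.
        "naats_total_7dav": total_tests / 7,
        # Positivity is already a rate over the period, so leave it alone.
        "naats_positivity_7dav": positivity,
    }
```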
105+
106+
## Geo ID
107+
108+
The Counties sheet lists FIPS codes numerically, such that FIPS with a leading
109+
zero only have four digits. We fix this by zero-filling to five characters.
110+
111+
MSAs are a subset of CBSAs. We fix this by selecting only CBSAs with type
112+
"Metropolitan".
113+
114+
Most of the starred sheets have the geo id as the first non-index column. The
115+
Region sheet has no such column. We fix this by generating the HHS ids from the
116+
index column instead.
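
The first two fixes above (zero-filling FIPS codes and keeping only metropolitan CBSAs) can be sketched in plain Python. The row layout and the `"type"` key are assumptions; the actual column name in the workbook may differ.

```python
def fips_from_number(code: int) -> str:
    """Restore the leading zero lost when a FIPS code is stored as a number."""
    return str(code).zfill(5)


def select_msas(cbsa_rows: list) -> list:
    """Keep only CBSA rows whose type is "Metropolitan" (hypothetical "type" key)."""
    return [row for row in cbsa_rows if row["type"] == "Metropolitan"]
```

For example, `fips_from_number(1001)` gives `"01001"`, while a code that already has five digits is returned unchanged.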
117+
118+
## Combining multiple reports
119+
120+
Each report file generates two reference dates for each metric, up to four
121+
reference dates total. Since it's not clear whether new versions of past files
122+
are ever made available, the default mode (params.indicator.reports="new")
123+
fetches any files that are not already in the input cache, then combines the
124+
results into a single data frame before exporting. This will generate correct
125+
behavior should (for instance) a previously-downloaded file get a new assetId.
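
A sketch of how the "new" mode could decide what to fetch, assuming each attachment is represented as an `(assetId, filename)` tuple and the cache contents are available as a set of the same tuples. Both representations are assumptions made for this example.

```python
def select_new_reports(available: list, cached: set) -> list:
    """Return attachments whose (assetId, filename) pair is not in the cache yet."""
    return [attachment for attachment in available if attachment not in cached]
```

Because the comparison uses the full pair, a previously downloaded filename that reappears under a new assetId is selected again.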
126+
127+
For the initial run on an empty input cache, and for runs configured to process
128+
a range of reports (using params.indicator.reports=YYYY-mm-dd--YYYY-mm-dd), this
129+
indicator makes no distinction between figures that came from different
130+
reports. That may not be what you want. If the covidcast issue date needs to
131+
match the date on the report filename, then the indicator must instead be run
132+
repeatedly, with equal start and end dates, keeping the output of each run
133+
separate.
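
One way to prepare such repeated runs is to generate one single-day `reports` value per date, as in the sketch below. The helper name is hypothetical, and writing each value into `params.json`, running the indicator, and keeping the outputs separate is left to the caller.

```python
from datetime import date, timedelta


def single_day_report_ranges(start: date, end: date):
    """Yield one "YYYY-mm-dd--YYYY-mm-dd" value per day, with equal start and end dates."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y-%m-%d")
        yield f"{stamp}--{stamp}"
        day += timedelta(days=1)
```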

dsew_community_profile/Makefile

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
.PHONY: venv lint test clean
2+
3+
dir = $(shell find ./delphi_* -name __init__.py | grep -o 'delphi_[_[:alnum:]]*')
4+
5+
venv:
6+
python3.8 -m venv env
7+
8+
install: venv
9+
. env/bin/activate; \
10+
pip install wheel ; \
11+
pip install -e ../_delphi_utils_python ;\
12+
pip install -e .
13+
14+
install-ci: venv
15+
. env/bin/activate; \
16+
pip install wheel ; \
17+
pip install ../_delphi_utils_python ;\
18+
pip install .
19+
20+
lint:
21+
. env/bin/activate; pylint $(dir)
22+
. env/bin/activate; pydocstyle $(dir)
23+
24+
test:
25+
. env/bin/activate ;\
26+
(cd tests && ../env/bin/pytest --cov=$(dir) --cov-report=term-missing)
27+
28+
clean:
29+
rm -rf env
30+
rm -f params.json

dsew_community_profile/README.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
1+
# COVID-19 Community Profile Report
2+
3+
The Data Strategy and Execution Workgroup (DSEW) publishes a Community Profile
4+
Report each weekday at this location:
5+
6+
https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9
7+
8+
This indicator extracts COVID-19 test figures from these reports.
9+
10+
Indicator-specific parameters:
11+
12+
* `input_cache`: a directory where Excel (.xlsx) files downloaded from
13+
healthdata.gov will be stored for posterity. Each file is 3.3 MB in size, so
14+
we expect this directory to require ~1GB of disk space for each year of
15+
operation.
16+
* `reports`: {new | all | YYYY-mm-dd--YYYY-mm-dd} a string indicating which
17+
reports to export. The default, "new", downloads and exports only reports not
18+
already found in the input cache. The "all" setting exports data for all
19+
available reports, downloading them to the input cache if necessary. The date
20+
range setting refers to the date listed in the filename for the report,
21+
presumably the publish date. Only reports named with a date within the
22+
specified range (inclusive) will be downloaded to the input cache if necessary
23+
and exported.
24+
* `export_start_date`: a YYYY-mm-dd string indicating the first date to export.
25+
* `export_end_date`: a YYYY-mm-dd string indicating the final date to export.
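
To make the three accepted forms of `reports` concrete, here is a sketch of a parser for that setting. The function name and return shape are assumptions for illustration and do not describe the module's actual interface.

```python
from datetime import datetime


def parse_reports_setting(reports: str):
    """Return (mode, date_range), where mode is "new", "all", or "range"."""
    if reports in ("new", "all"):
        return reports, None
    # Dates use single hyphens, so "--" cleanly separates start from end.
    start_text, end_text = reports.split("--")
    start = datetime.strptime(start_text, "%Y-%m-%d").date()
    end = datetime.strptime(end_text, "%Y-%m-%d").date()
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return "range", (start, end)
```

The range is inclusive at both ends, matching the description of the date range setting above.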
26+
27+
## Running the Indicator
28+
29+
The indicator is run by directly executing the Python module contained in this
30+
directory. The safest way to do this is to create a virtual environment,
31+
install the common DELPHI tools, and then install the module and its
32+
dependencies. To do this, run the following command from this directory:
33+
34+
```
35+
make install
36+
```
37+
38+
This command will install the package in editable mode, so you can make changes that
39+
will automatically propagate to the installed package.
40+
41+
All of the user-changeable parameters are stored in `params.json`. To execute
42+
the module and produce the output datasets (by default, in `receiving`), run
43+
the following:
44+
45+
```
46+
env/bin/python -m delphi_dsew_community_profile
47+
```
48+
49+
If you want to enter the virtual environment in your shell,
50+
you can run `source env/bin/activate`. Run `deactivate` to leave the virtual environment.
51+
52+
Once you are finished, you can remove the virtual environment and
53+
params file with the following:
54+
55+
```
56+
make clean
57+
```
58+
59+
## Testing the code
60+
61+
To run static tests of the code style, run the following command:
62+
63+
```
64+
make lint
65+
```
66+
67+
Unit tests are also included in the module. To execute these, run the following
68+
command from this directory:
69+
70+
```
71+
make test
72+
```
73+
74+
To run individual tests, run the following:
75+
76+
```
77+
(cd tests && ../env/bin/pytest <your_test>.py --cov=delphi_dsew_community_profile --cov-report=term-missing)
78+
```
79+
80+
The output will show the number of unit tests that passed and failed, along
81+
with the percentage of code covered by the tests.
82+
83+
None of the linting or unit tests should fail, and the code not covered by unit tests should be small and
84+
should not include critical sub-routines.

dsew_community_profile/REVIEW.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
## Code Review (Python)
2+
3+
A code review of this module should include a careful look at the code and the
4+
output. To assist in the process, but certainly not in place of it, please
5+
check the following items.
6+
7+
**Documentation**
8+
9+
- [ ] the README.md file template is filled out and currently accurate; it is
10+
possible to load and test the code using only the instructions given
11+
- [ ] minimal docstrings (one line describing what the function does) are
12+
included for all functions; full docstrings describing the inputs and expected
13+
outputs should be given for non-trivial functions
14+
15+
**Structure**
16+
17+
- [ ] code should pass lint checks (`make lint`)
18+
- [ ] any required metadata files are checked into the repository and placed
19+
within the directory `static`
20+
- [ ] any intermediate files that are created and stored by the module should
21+
be placed in the directory `cache`
22+
- [ ] final expected output files to be uploaded to the API are placed in the
23+
`receiving` directory; output files should not be committed to the repository
24+
- [ ] all options and API keys are passed through the file `params.json`
25+
- [ ] template parameter file (`params.json.template`) is checked into the
26+
code; no personal (i.e., usernames) or private (i.e., API keys) information is
27+
included in this template file
28+
29+
**Testing**
30+
31+
- [ ] module can be installed in a new virtual environment (`make install`)
32+
- [ ] reasonably high level of unit test coverage covering all of the main logic
33+
of the code (e.g., missing coverage for raised errors that do not currently seem
34+
possible to reach are okay; missing coverage for options that will be needed are
35+
not)
36+
- [ ] all unit tests run without errors (`make test`)
37+
- [ ] indicator directory has been added to GitHub CI
38+
(`covidcast-indicators/.github/workflows/python-ci.yml`)

dsew_community_profile/cache/.gitignore

Whitespace-only changes.
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
# -*- coding: utf-8 -*-
2+
"""Module to pull and clean indicators from the XXXXX source.
3+
4+
This file defines the functions that are made public by the module. As the
5+
module is intended to be executed through the main method, these are primarily
6+
for testing.
7+
"""
8+
9+
from __future__ import absolute_import
10+
11+
from . import run
12+
13+
__version__ = "0.1.0"
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
# -*- coding: utf-8 -*-
2+
"""Call the function run_module when executed.
3+
4+
This file indicates that calling the module (`python -m delphi_dsew_community_profile`) will
5+
call the function `run_module` found within the run.py file. There should be
6+
no need to change this template.
7+
"""
8+
9+
from delphi_utils import read_params
10+
from .run import run_module # pragma: no cover
11+
12+
run_module(read_params()) # pragma: no cover
