|
| 1 | +# Dataset layout |
| 2 | + |
| 3 | +The Data Strategy and Execution Workgroup (DSEW) publishes a Community Profile |
| 4 | +Report each weekday, comprising a pair of files: an Excel workbook (.xlsx) and a |
| 5 | +PDF which shows select metrics from the workbook as time series charts and |
| 6 | +choropleth maps. These files are listed as attachments on the healthdata.gov |
| 7 | +site: |
| 8 | + |
| 9 | +https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9 |
| 10 | + |
| 11 | +Each Excel file attachment has a filename. The filename contains a date, |
| 12 | +presumably the publish date. The attachment also has an alphanumeric |
| 13 | +assetId. Both the filename and the assetId are required for downloading the |
| 14 | +file. It is not known whether this means that DSEW may later upload updated |
| 15 | +versions of a particular file. The attachment does not explicitly |
| 16 | +list an upload timestamp. To be safe, we cache our downloads using both the |
| 17 | +assetId and the filename. |
| 18 | + |
| 19 | +# Workbook layout |
| 20 | + |
| 21 | +Each Excel file is a workbook with multiple sheets. The exemplar file used in |
| 22 | +writing this indicator is "Community Profile Report 20211102.xlsx". The sheets |
| 23 | +include: |
| 24 | + |
| 25 | +- User Notes: Instructions for using the workbook |
| 26 | +- Overview: US National figures for the last 5 weeks, plus monthly peaks back to |
| 27 | + April 2020 |
| 28 | +- Regions*: Figures for FEMA regions (double-checked: they match HHS regions |
| 29 | + except that FEMA 2 does not include Palau while HHS 2 does) |
| 30 | +- States*: Figures for US states and territories |
| 31 | +- CBSAs*: Figures for US Core-Based Statistical Areas |
| 32 | +- Counties*: Figures for US counties |
| 33 | +- Weekly Transmission Categories: Lists of high, substantial, and moderate |
| 34 | + transmission states and territories |
| 35 | +- National Peaks: Monthly national peaks back to April 2020 |
| 36 | +- National Historic: Daily national figures back to January 22, 2020 |
| 37 | +- Data Notes: Source and methods information for all metrics |
| 38 | +- Color Thresholds: Color-coding is used extensively in all sheets; these are |
| 39 | + the keys |
| 40 | + |
| 41 | +The starred sheets above have nearly-identical column layouts, and together |
| 42 | +cover the county, MSA, state, and HHS geographical levels used in |
| 43 | +covidcast. Rather than aggregate them ourselves and risk a mismatch, this |
| 44 | +indicator lifts these geographical aggregations directly from the corresponding |
| 45 | +sheets of the workbook. |
| 46 | + |
| 47 | +GeoMapper _is_ used to generate national figures from state-level |
| 48 | +figures, due to architectural differences between the starred sheets and the |
| 49 | +Overview sheet. If we discover that our nation-level figures differ too much |
| 50 | +from those listed in the Overview sheet, we can add dedicated parsing for the |
| 51 | +Overview sheet and remove GeoMapper from this indicator altogether. |
| 52 | + |
| 53 | +# Sheet layout |
| 54 | + |
| 55 | +## Headers |
| 56 | + |
| 57 | +Each starred sheet has two rows of headers. The first row uses merged cells to |
| 58 | +group several columns together under a single "overheader". This overheader |
| 59 | +often includes the reference period for that group of columns, such as: |
| 60 | + |
| 61 | +- CASES/DEATHS: LAST WEEK (October 26-November 1) |
| 62 | +- TESTING: LAST WEEK (October 24-30, Test Volume October 20-26) |
| 63 | +- TESTING: PREVIOUS WEEK (October 17-23, Test Volume October 13-19) |
| 64 | + |
| 65 | +Overheaders have changed periodically since the first report. For example, the |
| 66 | +"TESTING: LAST WEEK" overheader above has also appeared as "VIRAL (RT-PCR) LAB |
| 67 | +TESTING: LAST WEEK", with and without a separate reference date for Test |
| 68 | +Volume. All known overheader forms are checked in test_pull.py. |
| 69 | + |
| 70 | +The second row contains a header for each column. The headers uniquely identify |
| 71 | +each column included in the sheet. Column headers include spaces, and typically |
| 72 | +specify both the metric and the reference period over which it was calculated, |
| 73 | +such as: |
| 74 | + |
| 75 | +- Total NAATs - last 7 days (may be an underestimate due to delayed reporting) |
| 76 | +- NAAT positivity rate - previous 7 days (may be an underestimate due to delayed |
| 77 | + reporting) |
| 78 | + |
| 79 | +Column headers have also changed periodically since the first report. For |
| 80 | +example, the "Total NAATs - last 7 days" header above has also appeared as |
| 81 | +"Total RT-PCR diagnostic tests - last 7 days". |
| 82 | + |
| 83 | +## Contents |
| 84 | + |
| 85 | +Each starred sheet contains test positivity and total test volume figures for |
| 86 | +two reference periods, "last [week]" and "previous [week]". In some reports, the |
| 87 | +reference periods for test positivity and total test volume are the same; in |
| 88 | +others, they are different, such that the report contains figures for four |
| 89 | +distinct reference periods, two for each metric we extract. |
| 90 | + |
| 91 | +# Time series conversions and parsing notes |
| 92 | + |
| 93 | +## Reference date |
| 94 | + |
| 95 | +The reference period in the overheader never includes the year. We guess the |
| 96 | +reference year by picking the same year as the publish date (i.e., the date |
| 97 | +extracted from the filename), and if the reference month is greater than the |
| 98 | +publish month, subtract 1 from the reference year. This adequately covers the |
| 99 | +December-January boundary. |
| 100 | + |
| 101 | +We select as reference date the end date of the reference period for each |
| 102 | +metric. Reference periods are always 7 days, so this indicator produces |
| 103 | +seven-day averages. We divide the total testing volume by seven and leave the |
| 104 | +test positivity alone. |
| 105 | + |
| 106 | +## Geo ID |
| 107 | + |
| 108 | +The Counties sheet lists FIPS codes numerically, such that FIPS with a leading |
| 109 | +zero only have four digits. We fix this by zero-filling to five characters. |
| 110 | + |
| 111 | +MSAs are a subset of CBSAs, so we select only CBSAs with type |
| 112 | +"Metropolitan". |
| 113 | + |
| 114 | +Most of the starred sheets have the geo id as the first non-index column. The |
| 115 | +Regions sheet has no such column. We fix this by generating the HHS ids from the |
| 116 | +index column instead. |
| 117 | + |
| 118 | +## Combining multiple reports |
| 119 | + |
| 120 | +Each report file generates two reference dates for each metric, up to four |
| 121 | +reference dates total. Since it's not clear whether new versions of past files |
| 122 | +are ever made available, the default mode (params.indicator.reports="new") |
| 123 | +fetches any files that are not already in the input cache, then combines the |
| 124 | +results into a single data frame before exporting. This will generate correct |
| 125 | +behavior should (for instance) a previously-downloaded file get a new assetId. |
| 126 | + |
| 127 | +For the initial run on an empty input cache, and for runs configured to process |
| 128 | +a range of reports (using params.indicator.reports=YYYY-mm-dd--YYYY-mm-dd), this |
| 129 | +indicator makes no distinction between figures that came from different |
| 130 | +reports. That may not be what you want. If the covidcast issue date needs to |
| 131 | +match the date on the report filename, then the indicator must instead be run |
| 132 | +repeatedly, with equal start and end dates, keeping the output of each run |
| 133 | +separate. |
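
The year-guessing rule from the "Reference date" section can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names `guess_reference_date` and `daily_average` are hypothetical, and the sketch assumes the month and day have already been parsed out of the overheader.

```python
from datetime import date


def guess_reference_date(publish_date: date, ref_month: int, ref_day: int) -> date:
    """Attach a year to a month/day pair taken from an overheader.

    Hypothetical helper: start from the publish year (the date in the
    filename) and step back one year when the reference month is later
    than the publish month, which handles the December-January boundary.
    """
    year = publish_date.year
    if ref_month > publish_date.month:
        year -= 1
    return date(year, ref_month, ref_day)


def daily_average(weekly_total: float) -> float:
    """Convert a 7-day total test volume into a seven-day average."""
    return weekly_total / 7


# A report published in early January whose reference period ends in December.
print(guess_reference_date(date(2022, 1, 3), 12, 30))  # 2021-12-30
# The exemplar report date, with a reference period ending November 1.
print(guess_reference_date(date(2021, 11, 2), 11, 1))  # 2021-11-01
```

Test positivity is already a rate over the reference period, so only the total test volume would pass through something like `daily_average`.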
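
The two geo id fixes from the "Geo ID" section (zero-filling county FIPS codes and keeping only metropolitan CBSAs) might look like the sketch below. It uses plain Python rather than whatever data-frame code the indicator relies on, and the helper names and the `cbsa_type` key are assumptions made for illustration, not the workbook's real column names.

```python
def fips_to_geo_id(fips) -> str:
    """Zero-fill a numerically stored county FIPS code to five characters.

    Hypothetical helper: int() also accepts values such as 1001.0, which
    is how a spreadsheet reader may return numeric cells.
    """
    return str(int(fips)).zfill(5)


def select_msas(rows):
    """Keep only CBSA rows whose type is "Metropolitan".

    The "cbsa_type" key is an assumed name for the sheet's type column.
    """
    return [row for row in rows if row["cbsa_type"] == "Metropolitan"]


# Illustrative rows only, not data taken from a report.
cbsa_rows = [
    {"cbsa": "10180", "cbsa_type": "Metropolitan"},
    {"cbsa": "10100", "cbsa_type": "Micropolitan"},
]

print(fips_to_geo_id(1001))    # 01001
print(fips_to_geo_id(36061))   # 36061
print(select_msas(cbsa_rows))  # only the "Metropolitan" row remains
```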