Commit 43df798

Merge pull request #193 from cmu-delphi/quidel_covidtest
Quidel covidtest tests_per_device
2 parents 5a4d298 + 2167231 commit 43df798

23 files changed (+37995, -1457 lines)

quidel_covidtest/DETAILS.md

Lines changed: 23 additions & 5 deletions

````diff
@@ -4,8 +4,8 @@
 Starting May 9, 2020, we began getting Quidel COVID Test data and started reporting it from May 26, 2020 due to limitations in data volume. The data contain a number of features for every test, including localization at the 5-digit zip code level, a TestDate and StorageDate, patient age, and several identifiers that uniquely identify the device on which the test was performed (SofiaSerNum), the individual test (FluTestNum), and the result (ResultID). Multiple tests are stored on each device. The present Quidel COVID Test sensor concerns the positive rate in the test result.
 
 ### Signal names
-- raw_pct_positive: estimates of the percentage of positive tests in total tests
-- smoothed_pct_positive: same as in the first one, but where the estimates are formed by pooling together the last 7 days of data
+- covid_ag_raw_pct_positive: percent of tests returning positive that day
+- covid_ag_smoothed_pct_positive: same as above, but for the moving average of the most recent 7 days
 
 ### Estimating percent positive test proportion
 Let N be the number of total COVID tests taken over a given time period and a given location (the test result can be negative/positive/invalid). Let X be the number of tests taken with positive results in this location over the given time period. We are interested in estimating the percentage of positive tests, defined as:
@@ -35,10 +35,28 @@ p = 100 * X / N
 
 The estimated standard error is simply:
 ```
-se = 1/100 * sqrt{ p*(1-p)/N }
+se = 100 * sqrt{ p/100 * (1-p/100) / N }
 ```
 where we assume that, for each time point, the estimates follow a binomial distribution.
 
 
-### Temporal Pooling
-Additionally, as with the Quidel COVID Test signal, we consider smoothed estimates formed by pooling data over time. That is, daily, for each location, we first pool all data available in that location over the last 7 days, and we then recompute everything described in the last two subsections. Pooling in this data makes estimates available in more geographic areas, as many areas report very few tests per day, but have enough data to report when 7 days are considered.
+### Temporal and Spatial Pooling
+We conduct temporal and spatial pooling for the smoothed signal. The spatial pooling is described in the previous section: we shrink an estimate toward the state mean when the total number of tests is smaller than 50 for a given location on a given day. Additionally, as with the Quidel COVID Test signal, we consider smoothed estimates formed by pooling data over time. That is, daily, for each location, we first pool all data available in that location over the last 7 days, and we then recompute everything described in the last two subsections. Pooling in this data makes estimates available in more geographic areas.
+
+### Exceptions
+There are 9 special zip codes that appear in the Quidel COVID raw data but are temporarily excluded from our reports, since we do not have enough mapping information for them.
+
+| zip   | State | Number of Tests |
+|-------|-------|-----------------|
+| 78086 | TX    | 98              |
+| 20174 | VA    | 17              |
+| 48824 | MI    | 14              |
+| 32313 | FL    | 37              |
+| 29486 | SC    | 69              |
+| 75033 | TX    | 2318            |
+| 79430 | TX    | 43              |
+| 44325 | OH    | 56              |
+| 75072 | TX    | 63              |
+
+* Number of tests counted through 08-05-2020
+* Through 08-05-2020, these zip codes account for only 2,715 of 942,293 tests.
````
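The corrected standard-error formula above can be sanity-checked numerically; a minimal sketch (the function name `binomial_se` is ours, not from the indicator code):

```python
import math

def binomial_se(p, n):
    """Standard error of a percentage p (0-100) estimated from n tests,
    assuming the positive count is binomial:
    se = 100 * sqrt(p/100 * (1 - p/100) / n)."""
    return 100 * math.sqrt(p / 100 * (1 - p / 100) / n)

# At p = 50% the binomial variance is maximal: with N = 100 tests,
# se = 100 * sqrt(0.25 / 100) = 5 percentage points.
print(binomial_se(50, 100))  # → 5.0
```

Note that the pre-fix formula, `1/100 * sqrt(p*(1-p)/N)`, gives the same value only by coincidence of the two scale errors cancelling at particular p; expressing p as a proportion (p/100) and rescaling by 100 is the consistent form.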
Lines changed: 33 additions & 0 deletions (new file)

```python
"""Registry for constants"""
# global constants
MIN_OBS = 50  # minimum number of observations in order to compute a proportion
POOL_DAYS = 7  # number of days in the past (including today) to pool over
END_FROM_TODAY_MINUS = 5  # report data up to today minus X days
EXPORT_DAY_RANGE = 40  # number of dates to report

# Signal names
SMOOTHED_POSITIVE = "covid_ag_smoothed_pct_positive"
RAW_POSITIVE = "covid_ag_raw_pct_positive"
SMOOTHED_TEST_PER_DEVICE = "covid_ag_smoothed_test_per_device"
RAW_TEST_PER_DEVICE = "covid_ag_raw_test_per_device"

# Geo types
COUNTY = "county"
MSA = "msa"
HRR = "hrr"

GEO_RESOLUTIONS = [
    COUNTY,
    MSA,
    HRR
]
SENSORS = [
    SMOOTHED_POSITIVE,
    RAW_POSITIVE
    # SMOOTHED_TEST_PER_DEVICE,
    # RAW_TEST_PER_DEVICE
]
SMOOTHERS = {
    SMOOTHED_POSITIVE: (False, True),
    RAW_POSITIVE: (False, False)
    # SMOOTHED_TEST_PER_DEVICE: (True, True),
    # RAW_TEST_PER_DEVICE: (True, False)
}
```
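Judging from the sensor-generator signatures later in this PR, each `SMOOTHERS` value appears to pair a `device` flag with a `smooth` flag. A sketch of how a driver loop might consume the registry (the loop itself is our illustration, not code from this commit):

```python
# Inlined copies of the registry entries above.
SENSORS = ["covid_ag_smoothed_pct_positive", "covid_ag_raw_pct_positive"]
SMOOTHERS = {
    "covid_ag_smoothed_pct_positive": (False, True),   # (device, smooth)
    "covid_ag_raw_pct_positive": (False, False),
}

for sensor in SENSORS:
    # Unpack the flags; a real driver would pass them on to
    # generate_sensor_for_states(state_groups, smooth, device, ...).
    device, smooth = SMOOTHERS[sensor]
    print(sensor, "device:", device, "smooth:", smooth)
```

The commented-out `*_TEST_PER_DEVICE` entries follow the same pattern with `device=True`, which is presumably why they ship disabled until that signal is released.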

quidel_covidtest/delphi_quidel_covidtest/data_tools.py

Lines changed: 123 additions & 0 deletions

Two new functions, appended after `smoothed_positive_prop` (@@ -250,3 +250,126 @@):

```python
def raw_tests_per_device(devices, tests, min_obs):
    """
    Calculate the tests per device for a single geographic location,
    without any temporal smoothing.

    If on any day t, tests[t] < min_obs, then we report np.nan for that
    day. The second and third returned np.ndarray are the standard errors
    (currently all np.nan) and the sample size.

    Args:
        devices: np.ndarray[float]
            Number of devices, ordered in time, where each array element
            represents a subsequent day. If there were no devices, this
            should be zero (never np.nan).
        tests: np.ndarray[float]
            Number of tests performed. If there were no tests performed,
            this should be zero (never np.nan).
        min_obs: int
            Minimum number of observations in order to compute a ratio.

    Returns:
        np.ndarray
            Tests per device on each day, with the same length as devices
            and tests.
        np.ndarray
            Placeholder for standard errors.
        np.ndarray
            Sample size used to compute estimates.
    """
    devices = devices.astype(float)
    tests = tests.astype(float)
    if np.any(np.isnan(devices)) or np.any(np.isnan(tests)):
        raise ValueError('devices and tests should be non-negative '
                         'with no np.nan')
    if min_obs <= 0:
        raise ValueError('min_obs should be positive')
    tests[tests < min_obs] = np.nan  # suppress days below the minimum sample
    tests_per_device = tests / devices
    se = np.repeat(np.nan, len(devices))  # placeholder standard errors
    sample_size = tests

    return tests_per_device, se, sample_size


def smoothed_tests_per_device(devices, tests, min_obs, pool_days,
                              parent_devices=None, parent_tests=None):
    """
    Calculate the ratio of tests per device for a single geographic
    location, with temporal smoothing.

    For a given day t, if sum(tests[(t-pool_days+1):(t+1)]) < min_obs,
    then we 'borrow' min_obs - sum(tests[(t-pool_days+1):(t+1)])
    observations from the parents over the same timespan. Importantly, it
    will make sure NOT to borrow observations that are _already_ in the
    current geographic partition being considered.

    If min_obs is not satisfied over the pool_days and parent arrays are
    not provided, then we report np.nan.

    The second and third returned np.ndarray are the standard errors
    (currently all placeholder np.nan) and the reported sample_size.

    Args:
        devices: np.ndarray[float]
            Number of devices, ordered in time, where each array element
            represents a subsequent day. If there were no devices, this
            should be zero (never np.nan).
        tests: np.ndarray[float]
            Number of tests performed. If there were no tests performed,
            this should be zero (never np.nan).
        min_obs: int
            Minimum number of observations in order to compute a ratio.
        pool_days: int
            Number of days in the past (including today) over which to
            pool data.
        parent_devices: np.ndarray
            Like devices, but for the parent geographic partition (e.g.,
            state). If None, the parent is treated as having 0 devices.
        parent_tests: np.ndarray
            Like tests, but for the parent geographic partition (e.g.,
            state). If None, the parent is treated as having 0 tests.

    Returns:
        np.ndarray
            Tests per device after the pool_days pooling, with the same
            length as devices and tests.
        np.ndarray
            Standard errors, currently uniformly np.nan (placeholder).
        np.ndarray
            Effective sample size (after temporal and geographic pooling).
    """
    devices = devices.astype(float)
    tests = tests.astype(float)
    has_parent = (parent_devices is not None) and (parent_tests is not None)
    if has_parent:
        parent_devices = parent_devices.astype(float)
        parent_tests = parent_tests.astype(float)
    if np.any(np.isnan(devices)) or np.any(np.isnan(tests)):
        raise ValueError('devices and tests '
                         'should be non-negative with no np.nan')
    if has_parent:
        if np.any(np.isnan(parent_devices)) or np.any(np.isnan(parent_tests)):
            raise ValueError('parent devices and parent tests '
                             'should be non-negative with no np.nan')
    if min_obs <= 0:
        raise ValueError('min_obs should be positive')
    if (pool_days <= 0) or not isinstance(pool_days, int):
        raise ValueError('pool_days should be a positive int')
    # STEP 0: DO THE TEMPORAL POOLING
    tpooled_devices = _slide_window_sum(devices, pool_days)
    tpooled_tests = _slide_window_sum(tests, pool_days)
    if has_parent:
        tpooled_pdevices = _slide_window_sum(parent_devices, pool_days)
        tpooled_ptests = _slide_window_sum(parent_tests, pool_days)
        borrow_prop = _geographical_pooling(tpooled_tests, tpooled_ptests,
                                            min_obs)
        pooled_devices = (tpooled_devices
                          + borrow_prop * tpooled_pdevices)
        pooled_tests = (tpooled_tests
                        + borrow_prop * tpooled_ptests)
    else:
        pooled_devices = tpooled_devices
        pooled_tests = tpooled_tests
    ## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
    return raw_tests_per_device(pooled_devices, pooled_tests, min_obs)
```

quidel_covidtest/delphi_quidel_covidtest/export.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -32,7 +32,8 @@ def export_csv(df, geo_name, sensor, receiving_dir, start_date, end_date):
         t = pd.to_datetime(str(date))
         date_short = t.strftime('%Y%m%d')
         export_fn = f"{date_short}_{geo_name}_{sensor}.csv"
-        result_df = df[df["timestamp"] == date][["geo_id", "val", "se", "sample_size"]].dropna()
+        result_df = df[df["timestamp"] == date][["geo_id", "val", "se", "sample_size"]]
+        result_df = result_df[result_df["sample_size"].notnull()]
         result_df.to_csv(f"{receiving_dir}/{export_fn}",
                          index=False,
                          float_format="%.8f")
```
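The motivation for this change: `se` is uniformly `np.nan` for the tests-per-device signals, so a blanket `.dropna()` would discard every row; filtering on `sample_size` alone keeps rows whose value is reportable. A small sketch of the difference (column names match the export schema; the data are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "geo_id": ["ca", "tx", "ny"],
    "val": [20.0, 50.0, np.nan],
    "se": [np.nan, np.nan, np.nan],  # placeholder se, as for tests_per_device
    "sample_size": [100.0, 200.0, np.nan],
})

dropped = df.dropna()                   # drops every row: se is all NaN
kept = df[df["sample_size"].notnull()]  # keeps the two reportable rows

print(len(dropped), len(kept))
```

`dropna()` with default arguments removes a row if any column is NaN, which is too aggressive once a column is NaN by design.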

quidel_covidtest/delphi_quidel_covidtest/generate_sensor.py

Lines changed: 90 additions & 30 deletions

Both generators gain a `device` flag (tests-per-device vs. percent positive). `generate_sensor_for_states` now receives a prebuilt groupby object instead of a raw DataFrame and returns only the state DataFrame, and `generate_sensor_for_other_geores` falls back gracefully when a location's parent state is missing from `state_groups`. The updated code (@@ -3,85 +3,145 @@):

```python
"""
Functions to help generate sensor for different geographical levels
"""
import pandas as pd
from .data_tools import (fill_dates, raw_positive_prop,
                         smoothed_positive_prop,
                         smoothed_tests_per_device,
                         raw_tests_per_device)

MIN_OBS = 50  # minimum number of observations in order to compute a proportion
POOL_DAYS = 7

def generate_sensor_for_states(state_groups, smooth, device, first_date, last_date):
    """
    Fit over states.

    Args:
        state_groups: pd.core.groupby.generic.DataFrameGroupBy
            Test data grouped by "state_id"
        smooth: bool
            Raw or smoothed estimates
        device: bool
            tests_per_device or pct_positive
    Returns:
        df: pd.DataFrame
    """
    state_df = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size", "timestamp"])
    state_list = list(state_groups.groups.keys())
    for state in state_list:
        state_group = state_groups.get_group(state)
        state_group = state_group.drop(columns=["state_id"])
        state_group.set_index("timestamp", inplace=True)
        state_group = fill_dates(state_group, first_date, last_date)

        # smoothed tests per device
        if device and smooth:
            stat, se, sample_size = smoothed_tests_per_device(
                devices=state_group["numUniqueDevices"].values,
                tests=state_group['totalTest'].values,
                min_obs=MIN_OBS, pool_days=POOL_DAYS)
        # raw tests per device
        elif device and not smooth:
            stat, se, sample_size = raw_tests_per_device(
                devices=state_group["numUniqueDevices"].values,
                tests=state_group['totalTest'].values,
                min_obs=MIN_OBS)
        # smoothed pct positive
        elif not device and smooth:
            stat, se, sample_size = smoothed_positive_prop(
                tests=state_group['totalTest'].values,
                positives=state_group['positiveTest'].values,
                min_obs=MIN_OBS, pool_days=POOL_DAYS)
            stat = stat * 100
        # raw pct positive
        else:
            stat, se, sample_size = raw_positive_prop(
                tests=state_group['totalTest'].values,
                positives=state_group['positiveTest'].values,
                min_obs=MIN_OBS)
            stat = stat * 100

        se = se * 100
        state_df = state_df.append(pd.DataFrame({"geo_id": state,
                                                 "timestamp": state_group.index,
                                                 "val": stat,
                                                 "se": se,
                                                 "sample_size": sample_size}))
    return state_df

def generate_sensor_for_other_geores(state_groups, data, res_key, smooth,
                                     device, first_date, last_date):
    """
    Fit over counties/HRRs/MSAs.

    Args:
        data: pd.DataFrame
        res_key: "fips", "cbsa_id" or "hrrnum"
        smooth: bool
            Raw or smoothed estimates
        device: bool
            tests_per_device or pct_positive
    Returns:
        df: pd.DataFrame
    """
    res_df = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size"])
    res_groups = data.groupby(res_key)
    loc_list = list(res_groups.groups.keys())
    for loc in loc_list:
        has_parent = True  # reset for each location
        res_group = res_groups.get_group(loc)
        parent_state = res_group['state_id'].values[0]
        try:
            parent_group = state_groups.get_group(parent_state)
            res_group = res_group.merge(parent_group, how="left",
                                        on="timestamp", suffixes=('', '_parent'))
            res_group = res_group.drop(columns=[res_key, "state_id", "state_id_parent"])
        except KeyError:  # no parent state available for this location
            has_parent = False
            res_group = res_group.drop(columns=[res_key, "state_id"])
        res_group.set_index("timestamp", inplace=True)
        res_group = fill_dates(res_group, first_date, last_date)

        if smooth:
            if has_parent:
                if device:
                    stat, se, sample_size = smoothed_tests_per_device(
                        devices=res_group["numUniqueDevices"].values,
                        tests=res_group['totalTest'].values,
                        min_obs=MIN_OBS, pool_days=POOL_DAYS,
                        parent_devices=res_group["numUniqueDevices_parent"].values,
                        parent_tests=res_group["totalTest_parent"].values)
                else:
                    stat, se, sample_size = smoothed_positive_prop(
                        tests=res_group['totalTest'].values,
                        positives=res_group['positiveTest'].values,
                        min_obs=MIN_OBS, pool_days=POOL_DAYS,
                        parent_tests=res_group["totalTest_parent"].values,
                        parent_positives=res_group['positiveTest_parent'].values)
                    stat = stat * 100
            else:
                if device:
                    stat, se, sample_size = smoothed_tests_per_device(
                        devices=res_group["numUniqueDevices"].values,
                        tests=res_group['totalTest'].values,
                        min_obs=MIN_OBS, pool_days=POOL_DAYS)
                else:
                    stat, se, sample_size = smoothed_positive_prop(
                        tests=res_group['totalTest'].values,
                        positives=res_group['positiveTest'].values,
                        min_obs=MIN_OBS, pool_days=POOL_DAYS)
                    stat = stat * 100
        else:
            if device:
                stat, se, sample_size = raw_tests_per_device(
                    devices=res_group["numUniqueDevices"].values,
                    tests=res_group['totalTest'].values,
                    min_obs=MIN_OBS)
            else:
                stat, se, sample_size = raw_positive_prop(
                    tests=res_group['totalTest'].values,
                    positives=res_group['positiveTest'].values,
                    min_obs=MIN_OBS)
                stat = stat * 100

        se = se * 100
        res_df = res_df.append(pd.DataFrame({"geo_id": loc,
                                             "timestamp": res_group.index,
                                             "val": stat,
                                             "se": se,
                                             "sample_size": sample_size}))
    return res_df
```
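The new try/except in `generate_sensor_for_other_geores` guards against locations whose parent state has no group in `state_groups`: pandas' `DataFrameGroupBy.get_group` raises `KeyError` for a missing key. A minimal sketch of that fallback on toy data (the helper `parent_or_none` is ours):

```python
import pandas as pd

df = pd.DataFrame({"state_id": ["ca", "ca", "tx"],
                   "totalTest": [10, 20, 5]})
state_groups = df.groupby("state_id")

def parent_or_none(state):
    """Mirror the fallback: return the parent group, or None when absent."""
    try:
        return state_groups.get_group(state)
    except KeyError:
        return None

print(parent_or_none("ca") is not None)  # group exists
print(parent_or_none("pr") is not None)  # no 'pr' rows: KeyError is caught
```

When the parent is missing, the code simply skips the merge and calls the smoothed estimators without `parent_*` arguments, so no geographic borrowing occurs for that location.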
