
Commit ce2bcfc

Merge pull request #148 from cmu-delphi/combined-docs
First draft of combination indicator documentation
2 parents a66dac7 + 288e591 commit ce2bcfc

File tree

1 file changed: +143 −1 lines changed

docs/api/covidcast-signals/indicator-combination.md

Lines changed: 143 additions & 1 deletion
@@ -27,7 +27,11 @@ calculated or composed by Delphi. It is not a primary data source.

These signals combine Delphi's indicators---*not* including cases and deaths,
but including other signals expected to be related to the underlying rate of
coronavirus infection---to produce a single indicator. The goal is to provide a
single map of COVID-19 activity at each geographic level each day, one that
summarizes the other indicators so that users can study its fluctuations over
space and time without being overwhelmed by having to monitor many individual
sensors.

* `nmf_day_doc_fbc_fbs_ght`: This signal uses a rank-1 approximation, from a
nonnegative matrix factorization approach, to identify an underlying signal
@@ -46,6 +50,144 @@ coronavirus infection---to produce a single indicator.
`smoothed_adj_cli`. This signal is deprecated and is no longer updated as of
May 28, 2020.

### Estimation

Let $$x_{rs}(t)$$ denote the value of sensor $$s$$ on day $$t$$ in region $$r$$
(*not* necessarily mean centered). We aim to construct a single scalar latent
variable each day for every region $$r$$, denoted by $$z_{r}(t)$$, that can
simultaneously reconstruct all other reported sensors. We call this scalar
summary the *combined* indicator signal. Below we describe the method used to
compute the combined signal. For notational brevity, we fix a time $$t$$ and
drop the corresponding index.

#### Optimization Objective

At each time $$t$$ (which in our case has a resolution of one day), given a set
$$\mathcal{S}$$ of sensors (with $$S$$ sensors in total) and a set
$$\mathcal{R}$$ of regions (with $$R$$ regions in total), we aim to find a
combined indicator that best reconstructs all sensor observations after being
passed through a learned sensor-specific transformation $$g_{s}(\cdot)$$. That
is, for each time $$t$$, we minimize the sum of squared reconstruction errors:

$$
\sum_{s \in \mathcal{S}} \sum_{r \in \mathcal{R}} (x_{rs} - g_{s}(z_{r}))^2,
$$

over combined indicator values $$z = z_{1:R}$$ and candidate transformations
$$g = g_{1:S}$$. To solve the optimization problem, we constrain $$z$$ and
$$g$$ in specific ways, as explained next.

#### Optimization Constraints

We constrain the sensor-specific transformation $$g_{s}$$ to be a linear
function, $$g_{s}(x) = a_{s} x$$. Then the transformations $$g$$ are simply
various rescalings $$a = a_{1:S}$$, and the objective can be written more
succinctly as

$$
\min_{z \in \mathbb{R}^{R},\, a \in \mathbb{R}^S} \|X - z a^\top\|_F^2,
$$

where $$X$$ is the $$R \times S$$ matrix of sensor values $$x_{rs}$$ (each row
corresponds to a region and each column to a sensor). Additionally, to make the
parameters identifiable, we constrain the norm of $$a$$.

If $$u_1$$, $$v_1$$, and $$\sigma_1$$ denote the first left singular vector,
right singular vector, and singular value of $$X$$, then the well-known
solution to this objective is given by $$z = \sigma_1 u_1$$ and $$a = v_1$$,
which is similar to principal component analysis (PCA). However, in contrast to
PCA, we *do not remove the mean signal* from each column of $$X$$. To the
extent that the mean signals are important for reconstruction, they are
reflected in $$z$$.

Furthermore, since all sensors should be increasing in the combined indicator
value, we further constrain the entries of $$a$$ to be nonnegative. Similarly,
since all sensors are nonnegative, we also constrain the entries of $$z$$ to be
nonnegative. This effectively turns the optimization problem into a nonnegative
matrix factorization problem:

$$
\min_{z \in \mathbb{R}^{R}:\, z \ge 0,\ a \in \mathbb{R}^S:\, a \ge 0} \|X - z a^\top\|_F^2.
$$

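Because $$X$$ has nonnegative entries, the leading left and right singular
vectors can be chosen entrywise nonnegative, so the rank-1 fit above can be
read off a truncated SVD. The following is a minimal illustrative sketch (not
the production indicator code), assuming `numpy` and a complete, already-scaled
matrix `X`:

```python
import numpy as np

def combine_sensors(X):
    """Rank-1 nonnegative fit of an R x S sensor matrix X (no mean removal).

    Returns the combined indicator z (length R) and the sensor loadings a
    (length S) minimizing ||X - z a^T||_F^2 subject to z, a >= 0.
    """
    # Leading singular triplet of X (rows are regions, columns are sensors).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    u1, sigma1, v1 = U[:, 0], s[0], Vt[0, :]
    # The SVD fixes (u1, v1) only up to a joint sign flip; because X is
    # entrywise nonnegative, the leading vectors can be chosen nonnegative.
    if v1.sum() < 0:
        u1, v1 = -u1, -v1
    z = np.maximum(sigma1 * u1, 0)  # combined indicator, one value per region
    a = np.maximum(v1, 0)           # sensor rescalings, unit norm up to noise
    return z, a
```
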
### Data Preprocessing

#### Scaling of Data

Since different sensor values are on different scales, we perform global column
scaling. Before approximating the matrix $$X$$, we scale each column by the
root mean square of that sensor's observed entries over all of time (so that
the same scaling is applied each day).

Note that one might instead consider local column scaling before approximating
the matrix $$X$$, where we scale each column by its root mean square observed
entry on that day, so that each sensor has (observed) second moment equal to 1
each day. But this suffers from the issue that the locally scaled time series
for a given sensor and region can look quite different from the unscaled time
series. In particular, increasing trends can be replaced with decreasing trends
and, as a result, the derived combined indicator may look quite distinct from
any unscaled sensor. This can be avoided with global column scaling.

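As a concrete illustration, global column scaling could be computed as in the
sketch below (hypothetical array layout; the production code differs):

```python
import numpy as np

def global_column_scale(X_history):
    """Scale each sensor by the root mean square of its observed entries over all time.

    X_history: array of shape (T, R, S) -- days x regions x sensors,
    with np.nan marking unobserved entries.
    """
    # One scale factor per sensor, computed from every observed (day, region)
    # entry, so that the same scaling is applied on each day.
    scale = np.sqrt(np.nanmean(X_history ** 2, axis=(0, 1)))  # shape (S,)
    return X_history / scale, scale
```
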
#### Lags and Sporadic Missingness

The matrix $$X$$ is not necessarily complete, and entries may be missing.
Several forms of missingness arise in our data. On certain days, all
observations of a given sensor are missing due to release lag; for example,
Doctor Visits is released several days late. Also, for any given region and
sensor, a sensor may be available on some days but not others due to sample
size cutoffs. Additionally, on any given day, different sensors are observed in
different regions.

To ensure that our combined indicator value has comparable scaling over time
and is free from erratic jumps that are due only to missingness, we use the
following two imputation strategies (a rough code sketch follows the list):

* *Lag imputation*: if a sensor is missing for all regions on a given day, we
  copy all observations from the last day on which any observation was
  available for that sensor.
* *Recent imputation*: if a sensor value is missing on a given day but at least
  one of the past $$T$$ values is observed, we impute it with the most recent
  observed value. We limit $$T$$ to 7 days.

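A rough sketch of these two imputation steps, assuming each sensor's history is
stored as a pandas DataFrame of dates by regions (illustrative only, not the
production code):

```python
import pandas as pd

def impute_missingness(sensor_frames, T=7):
    """Illustrative lag and recent imputation.

    sensor_frames: dict mapping sensor name -> DataFrame of values,
    with rows indexed by date, columns by region, and NaN where unobserved.
    """
    imputed = {}
    for sensor, frame in sensor_frames.items():
        frame = frame.copy()
        last_seen = None  # last date with at least one observation
        for day in frame.index:
            if frame.loc[day].notna().any():
                last_seen = day
            elif last_seen is not None:
                # Lag imputation: the sensor is missing for all regions today,
                # so copy all observations from the last available day.
                frame.loc[day] = frame.loc[last_seen]
        # Recent imputation: fill remaining gaps per region with the most
        # recent observed value, but only if it is at most T days old.
        imputed[sensor] = frame.ffill(limit=T)
    return imputed
```
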
#### Persistent Missingness

Even with the above imputation strategies, we still have the issue that some
sensors are never available in a given region. The result is that combined
indicator values for that region may be on a completely different scale from
values in other regions with additional observed sensors. This can only be
overcome by regularizing or pooling information across space. Note that a very
similar problem occurs when a sensor is unavailable for a very long period of
time (so long that recent imputation is inadvisable and is avoided by setting
$$T = 7$$ days).

We deal with this problem by *geographic imputation*, where we impute values
from regions that share a higher level of aggregation (e.g., the median
observed score in an MSA or state), or by imputing values from megacounties
(since the counties in question are missing and hence should be reflected in
the rest-of-state estimate). The order in which we look to perform geographic
imputations is: observed values from megacounties, followed by median observed
values in the geographic hierarchy (county, MSA, state, or country). We chose
this imputation sequence among different options by evaluating how well each
mimicked the actual observed sensor values in validation experiments.

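A rough sketch of this imputation order for a single day and sensor, using
hypothetical `values` and `parents` lookup structures rather than the
pipeline's actual data structures:

```python
from statistics import median

def geographic_impute(region, values, parents):
    """Illustrative sketch of the geographic imputation order (one day, one sensor).

    values:  dict mapping region code -> observed sensor value for that day.
    parents: dict mapping region code -> dict with hypothetical "megacounty",
             "msa", "state", and "nation" codes for that region.
    """
    if region in values:
        return values[region]
    # 1. Prefer the megacounty ("rest of state") value, since the missing
    #    counties are reflected in the rest-of-state estimate.
    mega = parents[region].get("megacounty")
    if mega in values:
        return values[mega]
    # 2. Otherwise, take the median observed value among regions sharing
    #    successively coarser levels of aggregation.
    for level in ("msa", "state", "nation"):
        code = parents[region].get(level)
        if code is None:
            continue
        peers = [v for r, v in values.items()
                 if parents.get(r, {}).get(level) == code]
        if peers:
            return median(peers)
    return None
```
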
### Standard Errors

We compute standard errors for the combined indicator using the bootstrap. The
input data sources are resampled individually, and the combined indicator is
recomputed for the resampled input. The standard error is then the standard
deviation of the resampled combined indicators. We take $$B = 100$$ bootstrap
replicates and use the jackknife to verify that, for this number of replicates,
the estimated variance is small relative to the variance of variances.

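In outline, the bootstrap computation might look like the following sketch
(illustrative only; `resample_inputs` stands in for the source-specific
resampling schemes listed below, and `combine` for the rank-1 fit sketched
earlier):

```python
import numpy as np

def bootstrap_se(X, resample_inputs, combine, B=100, seed=0):
    """Illustrative bootstrap standard errors for the combined indicator.

    X: R x S matrix of (scaled, imputed) sensor values for one day.
    resample_inputs: function (X, rng) -> resampled copy of X, applying the
        source-specific resampling schemes.
    combine: function X -> (z, a), e.g. the rank-1 fit sketched earlier.
    """
    rng = np.random.default_rng(seed)
    replicates = []
    for _ in range(B):
        X_star = resample_inputs(X, rng)   # resample each input source
        z_star, _ = combine(X_star)        # recompute the combined indicator
        replicates.append(z_star)
    # Standard error for each region: standard deviation across replicates.
    return np.std(np.stack(replicates), axis=0)
```
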
The resampling method for each input source is as follows:

* *Doctor Visits:* We inject a single additional observation with value 0.5
  into the calculation of the proportion, and then resample from the binomial
  distribution, using the "Jeffreysized" proportion and sample size $$n+1$$.
* *Symptom Survey:* We first inject a single additional observation with value
  0.35 into the calculation of the proportion $$p$$. Then, we sample an average
  of $$n+1$$ independent $$\text{Binomial}(p, m)/m$$ variables, where we choose
  $$m$$ so that the household variance $$p(1-p)/m = n\text{SE}^2$$; this is
  equivalent to sampling $$\text{Binomial}(p, mn) / mn$$. The prior proportion
  of 0.35 was chosen to match the resampling distribution with the original
  distribution.
* *Facebook Community Survey:* We resample from the binomial distribution, with
  the reported proportion and sample size.
* *Google Health Trends:* Because we do not have access to the sampling
  distribution, we do not resample this signal.

## Compositional Signals: Confirmed Cases and Deaths

* **First issued:** 7 July 2020
