over combined indicator values $$z = z_{1:R}$$ and candidate transformations $$g
= g_{1:S}$$. To solve the optimization problem, we constrain $$z$$ and $$g$$ in
specific ways, as explained next.

#### Optimization Constraints

We constrain the sensor-specific transformation $$g_{s}$$ to be a linear
function such that $$g_{s}(x) = a_{s} x$$. Then the transformations $$g$$ are simply
various rescalings $$a = a_{1:S}$$, and the objective can be written more
succinctly as

$$
\min_{z \in \mathbb{R}^{R},\, a \in \mathbb{R}^S} \|X - z a^\top\|_F^2,
$$

where $$X$$ is the $$R \times S$$ matrix of sensor values $$x_{rs}$$ (each row
corresponds to a region and each column to a sensor). Additionally, to make the
parameters identifiable, we constrain the norm of $$a$$.

If $$u_1, v_1, \sigma_1$$ denote the first left singular vector, right singular
vector, and singular value of $$X$$, then the well-known solution to this
objective is $$z = \sigma_1 u_1$$ and $$a = v_1$$, which is similar to principal
component analysis (PCA). However, in contrast to PCA, we *do not remove the
mean signal* from each column of $$X$$. To the extent that the mean signals are
important for reconstruction, they are reflected in $$z$$.
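
To make the rank-one computation concrete, here is a minimal sketch in Python
(using NumPy); the matrix dimensions and variable names are illustrative, not
the production implementation.

```python
import numpy as np

# Hypothetical complete, globally scaled matrix X: 50 regions x 4 sensors.
rng = np.random.default_rng(0)
X = rng.random((50, 4))

# Rank-one fit without removing column means: take the leading singular triple.
U, sing_vals, Vt = np.linalg.svd(X, full_matrices=False)
z = sing_vals[0] * U[:, 0]   # combined indicator value for each region
a = Vt[0, :]                 # rescaling for each sensor

# SVD signs are arbitrary; since X is nonnegative, flip so that a, z >= 0.
if a.sum() < 0:
    a, z = -a, -z

# z a^T is the best rank-one approximation of X in Frobenius norm.
approx = np.outer(z, a)
```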
Furthermore, since all sensors should be increasing in the combined indicator
value, we further constrain the entries of $$a$$ to be nonnegative. Similarly,
since all sensors are nonnegative, we also constrain the entries of $$z$$ to be
nonnegative. This effectively turns the optimization problem into a nonnegative
matrix factorization problem, as follows:

$$
\min_{z \in \mathbb{R}^{R}:\, z \ge 0,\ a \in \mathbb{R}^S:\, a \ge 0} \|X - z a^\top\|_F^2.
$$
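
The nonnegative rank-one fit can be computed with any NMF solver. A hedged
sketch using scikit-learn with a single component (an assumption for
illustration; the production code may use a different solver):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical nonnegative, globally scaled matrix X: 50 regions x 4 sensors.
rng = np.random.default_rng(0)
X = rng.random((50, 4))

# One nonnegative component minimizing the Frobenius reconstruction error.
model = NMF(n_components=1, init="nndsvda", max_iter=500)
z = model.fit_transform(X)[:, 0]   # nonnegative combined indicator per region
a = model.components_[0, :]        # nonnegative rescaling per sensor
```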
### Data Preprocessing

#### Scaling of Data

Since different sensor values are on different scales, we perform global column
scaling. Before approximating the matrix $$X$$, we scale each column by the root
mean square of the observed entries of that sensor over all of time (so that the
same scaling is applied each day).
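
A minimal sketch of this scaling step, assuming the data for all days are
stacked in a single array with NaN marking missing entries (the array layout and
names are illustrative):

```python
import numpy as np

def global_rms_scale(X_all):
    """Scale each sensor column by the root mean square of its observed entries.

    X_all: array of shape (days, regions, sensors), with NaN for missing values.
    Returns the scaled array and the per-sensor scale factors.
    """
    # One RMS value per sensor, pooled over all days and regions, ignoring NaNs.
    rms = np.sqrt(np.nanmean(X_all ** 2, axis=(0, 1)))
    return X_all / rms, rms
```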
Note that one might instead consider local column scaling before approximating
the matrix $$X$$, where we scale each column by its root mean square observed
entry each day, so that each sensor has (observed) second moment equal to 1 on
each day. But this suffers from the issue that the locally scaled time series
for a given sensor and region can look quite different from the unscaled time
series. In particular, increasing trends can be replaced with decreasing trends
and, as a result, the derived combined indicator may look quite distinct from
any unscaled sensor. This is avoided with global column scaling.

#### Lags and Sporadic Missingness

The matrix $$X$$ is not necessarily complete; some entries may be missing.
Several forms of missingness arise in our data. On certain days, all
observations of a given sensor are missing due to release lag. For example,
Doctor Visits is released several days late. Also, for any given region and
sensor, values may be available on some days but not others due to sample size
cutoffs. Additionally, on any given day, different sensors are observed in
different regions.

To ensure that our combined indicator value has comparable scaling over time and
is free from erratic jumps that are due solely to missingness, we use the
following two imputation strategies (sketched in code below):

- *lag imputation*, where if a sensor is missing for all regions on a given day,
  we copy all observations from the last day on which any observation was
  available for that sensor;
- *recent imputation*, where if a sensor value is missing on a given day but at
  least one of the past $$T$$ values is observed, we impute it with the most
  recent value. We limit $$T$$ to 7 days.
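
A hedged sketch of these two rules, assuming each sensor's data are held in a
(days x regions) array with NaN for missing values (the layout, names, and loop
structure are illustrative, not the production code):

```python
import numpy as np

def lag_and_recent_impute(x, T=7):
    """Apply lag imputation, then recent imputation, to one sensor's array.

    x: float array of shape (days, regions) with NaN marking missing entries.
    """
    orig = x.copy()
    out = x.copy()
    days, regions = out.shape
    for t in range(days):
        # Lag imputation: if the sensor is missing for all regions on day t,
        # copy all observations from the last day with any observed value.
        if np.all(np.isnan(orig[t])):
            for s in range(t - 1, -1, -1):
                if not np.all(np.isnan(orig[s])):
                    out[t] = orig[s]
                    break
        # Recent imputation: fill a still-missing value with the most recent
        # observation from the past T days, if one exists.
        for r in range(regions):
            if np.isnan(out[t, r]):
                for s in range(t - 1, max(t - T, 0) - 1, -1):
                    if not np.isnan(orig[s, r]):
                        out[t, r] = orig[s, r]
                        break
    return out
```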
#### Persistent Missingness

Even with the above imputation strategies, some sensors are never available in a
given region. As a result, the combined indicator values for that region may be
on a completely different scale from values in other regions with additional
observed sensors. This can only be overcome by regularizing or pooling
information across space. Note that a very similar problem occurs when a sensor
is unavailable for a very long period of time (so long that recent imputation is
inadvisable and avoided by setting $$T = 7$$ days).

We deal with this problem by *geographic imputation*, where we impute values
from regions that share a higher level of aggregation (e.g., the median observed
score in an MSA or state), or from megacounties (since the counties in question
are missing and hence should be reflected in the rest-of-state estimate). The
order in which we attempt geographic imputation is: observed values from
megacounties, followed by median observed values up the geographic hierarchy
(county, MSA, state, or country). We chose this imputation sequence from among
several options by evaluating how well each mimicked the actual observed sensor
values in validation experiments.
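
The fallback order can be expressed as a simple cascade. A hedged sketch,
assuming the caller has already looked up the candidate values at each level
(the function and argument names are hypothetical):

```python
import numpy as np

def geographic_impute(megacounty_value, hierarchy_values):
    """Return the first available fallback for a persistently missing sensor.

    megacounty_value: the observed value for the corresponding megacounty
        ("rest of state"), or None if unavailable.
    hierarchy_values: lists of observed values at successively coarser levels
        (county, MSA, state, country); empty lists where nothing is observed.
    """
    # First preference: the observed megacounty value.
    if megacounty_value is not None and not np.isnan(megacounty_value):
        return float(megacounty_value)
    # Otherwise: the median of observed values, moving up the hierarchy.
    for values in hierarchy_values:
        if len(values) > 0:
            return float(np.median(values))
    return float("nan")
```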
### Standard Errors

We compute standard errors for the combined indicator using the bootstrap.
The input data sources are resampled individually, and the combined indicator
is recomputed for each resampled input. The standard error is then given by the
standard deviation of the resampled combined indicators. We take $$B = 100$$
bootstrap replicates and use the jackknife to verify that, for this number of
replicates, the variance of the estimated variance is small relative to the
estimated variance itself.
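
A minimal sketch of the bootstrap computation, assuming helper functions
`resample_inputs` (which redraws each source according to its scheme below) and
`combine` (which recomputes the combined indicator); both names are
placeholders:

```python
import numpy as np

def bootstrap_se(inputs, resample_inputs, combine, B=100, seed=0):
    """Bootstrap standard error of the combined indicator."""
    rng = np.random.default_rng(seed)
    replicates = np.array(
        [combine(resample_inputs(inputs, rng)) for _ in range(B)]
    )
    # Standard error: standard deviation across the B recomputed indicators.
    return replicates.std(axis=0, ddof=1)
```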
The resampling method for each input source is as follows (see the sketch after
this list):

- **Doctor Visits:** We inject a single additional observation with value 0.5
  into the calculation of the proportion, and then resample from the binomial
  distribution, using the "Jeffreysized" proportion and sample size $$n+1$$.
- **Symptom Survey:** We first inject a single additional observation with value
  0.35 into the calculation of the proportion $$p$$. Then, we sample an average
  of $$n+1$$ independent $$\text{Binomial}(p, m)/m$$ variables, where we choose
  $$m$$ so the household variance $$p(1-p)/m = n\,\text{SE}^2$$, which is
  equivalent to sampling $$\text{Binomial}(p, mn) / mn$$. The prior proportion
  of 0.35 was chosen to match the resampling distribution with the original
  distribution.
- **Facebook Community Survey:** We resample from the binomial distribution,
  with the reported proportion and sample size.
- **Google Health Trends:** Because we do not have access to the sampling
  distribution, we do not resample this signal.
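
As an illustration, here is a hedged sketch of the Doctor Visits resampling
step, assuming a reported proportion `p` out of `n` visits (variable names are
illustrative; the other resampled sources follow analogously):

```python
import numpy as np

def resample_doctor_visits(p, n, rng):
    """Resample a Doctor Visits proportion via a binomial draw."""
    # "Jeffreysized" proportion: add one pseudo-observation with value 0.5.
    p_jeff = (p * n + 0.5) / (n + 1)
    # Redraw from the binomial distribution with sample size n + 1.
    return rng.binomial(n + 1, p_jeff) / (n + 1)

rng = np.random.default_rng(0)
draws = [resample_doctor_visits(p=0.08, n=200, rng=rng) for _ in range(5)]
```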
## Compositional Signals: Confirmed Cases and Deaths