@@ -81,7 +81,7 @@ specific ways as explained next.
81
81
#### Optimization Constraints
82
82
83
83
We constrain the sensor-specific transformation $$g_{s}$$ to be a linear
84
-
function $$g_{s}(x) = a_{s} x$$. Then, the transformations $$g$$ are simply
84
+
function such that $$g_{s}(x) = a_{s} x$$. Then, the transformations $$g$$ are simply
85
85
various rescalings $$a = a_{1:S}$$, and the objective can be more succinctly
86
86
written as
87
87
@@ -132,48 +132,34 @@ sensor. This can be avoided with global column scaling.
132
132
#### Lags and Sporadic Missingness
133
133
134
134
The matrix $$X$$ is not necessarily complete and we may have entries missing.
135
-
Several forms of missingness arise in our data.
136
-
137
-
* On certain days, all observations of a given sensor are missing due to release
138
-
lag. For example, Doctor Visits is released several days late.
139
-
* On any given day, different sensors are observed in different regions.
140
-
* For any given region and sensor, the sensor may be available on some days but
141
-
not others due to sample size cutoffs.
135
+
Several forms of missingness arise in our data. On certain days, all observations of a given sensor are missing due to release lag. For example, Doctor Visits is released several days late. Also, for any given region and sensor, a sensor may be available on some days but not others due to sample size cutoffs. Additionally, on any given day, different sensors are observed in different regions.
142
136
143
137
To ensure that our combined indicator value has comparable scaling over time and
144
138
is free from erratic jumps that are just due to missingness, we use the
145
-
following strategies:
146
-
147
-
**Lag imputation*: If a sensor is missing for all regions on a given day, we
148
-
copy all observations from the last day on which any observation was available
149
-
for that sensor.
150
-
**Recent imputation*: If $$x_{rs}(t)$$ is missing but at least one of
151
-
$$x_{rs}(t-1), x_{rs}(t-2), \dots, x_{rs}(t-T)$$ is observed, impute
152
-
$$x_{rs}(t)$$ with the most recent of $$x_{rs}(t-1), x_{rs}(t-2), \dots,
153
-
x_{rs}(t-T)$$. We limit $$T$$ to be 7 days.
139
+
following imputation strategies:
140
+
*lag imputation*, where if a sensor is missing for all regions on a given day, we copy all observations from the last day on which any observation was available for that sensor;
141
+
and *recent imputation*, where if a sensor value is missing on a given day but at least one of the past $$T$$ values is observed, we impute it with the most recent observed value. We limit $$T$$ to be 7 days.
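
As a rough illustration of the two strategies described in this hunk, here is a minimal Python sketch. The function names `lag_impute` and `recent_impute`, the list-based data layout, and the use of `None` for a missing value are assumptions made for this example and do not come from the project's code.

```python
def lag_impute(days):
    """Lag imputation for one sensor.

    days[t][r] is the sensor's value in region r on day t (None = missing).
    If every region is missing on a day, copy the whole row from the last
    day on which any observation was available.
    """
    imputed = [list(day) for day in days]
    last_observed = None
    for t, day in enumerate(days):
        if any(v is not None for v in day):
            last_observed = day
        elif last_observed is not None:
            imputed[t] = list(last_observed)
    return imputed


def recent_impute(series, max_lag=7):
    """Recent imputation for one (region, sensor) series.

    A missing value on day t is filled with the most recent *observed*
    value among days t-1, ..., t-max_lag; otherwise it stays missing.
    """
    imputed = list(series)
    for t, value in enumerate(series):
        if value is not None:
            continue
        for lag in range(1, max_lag + 1):
            if t - lag < 0:
                break
            if series[t - lag] is not None:
                imputed[t] = series[t - lag]
                break
    return imputed
```

Looking back only at observed values, never at values that were themselves imputed, keeps a stale observation from being carried forward beyond `max_lag` days.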
154
142
155
143
#### Persistent Missingness
156
144
157
145
Even with the above imputation strategies, we still have issues that some
158
146
sensors are never available in a given region. The result is that combined
159
147
indicator values for that region may be on a completely different scale
160
148
from values in other regions with additional observed sensors. This can only be
161
-
overcome by regularizing / pooling information across space. Note that a very
149
+
overcome by regularizing or pooling information across space. Note that a very
162
150
similar problem occurs when a sensor is unavailable for a very long period of
163
151
time (so long that recent imputation is inadvisable and avoided by setting $$T =
164
152
7$$ days).
165
153
166
-
We deal with this problem by *geographic imputation* where we impute values from
167
-
regions that share a higher level of aggregation (e.g., the mean observed score
168
-
in an MSA / state), or by imputing values from megacounties (since the counties
154
+
We deal with this problem by *geographic imputation*, where we impute values from
155
+
regions that share a higher level of aggregation (e.g., the median observed score
156
+
in an MSA or state), or by imputing values from megacounties (since the counties
169
157
in question are missing and hence should be reflected in the rest of state
170
-
estimate). The order in which we try to perform geographic imputations is first
171
-
attempting to impute from observed values from megacounties, followed by median
172
-
observed values in the geographic hierarchy (county, MSA, and state), followed
173
-
finally by the median observed value over the entire country if all the previous
174
-
imputation attempts fail. This imputation order is chosen among different
175
-
options by evaluating their effectiveness to mimic the actual observed values in
176
-
validation experiments.
158
+
estimate). We attempt geographic imputations in the following order:
159
+
observed values from megacounties, followed by median
160
+
observed values in the geographic hierarchy (county, MSA, state, or country).
161
+
We chose this imputation sequence among different options by evaluating
162
+
their effectiveness at mimicking the actual observed sensor values in validation experiments.
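
The fallback order described in the revised paragraph could be sketched as follows. This is only an illustration under stated assumptions: `geographic_impute`, the level labels, and the sample numbers are hypothetical and are not taken from the project's implementation or data.

```python
from statistics import median


def geographic_impute(candidates):
    """Return (level, value) from the first level with any observed value.

    candidates is an ordered list of (level_name, values) pairs, e.g.
    megacounty first, then MSA, state, and country. Missing values are
    None. The imputed value is the median of the observed values at the
    first level that has at least one; for a single megacounty value the
    median is just that value.
    """
    for level, values in candidates:
        observed = [v for v in values if v is not None]
        if observed:
            return level, median(observed)
    return None, None


# Hypothetical example: nothing observed at the megacounty or MSA level,
# so the state-level median is used.
level, value = geographic_impute([
    ("megacounty", [None]),
    ("msa", [None, None]),
    ("state", [0.8, 1.2, 1.0]),
    ("country", [0.5, 0.9, 1.1, 1.3]),
])
```

Passing the levels as an ordered list keeps the fallback order explicit, which makes it easy to compare alternative orders in validation experiments.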