Commit 288e591

capnrefsmmat and ispnp authored
Apply suggestions from code review
Co-authored-by: ispnp <[email protected]>
1 parent 8e0453f commit 288e591

1 file changed

+15
-29
lines changed

docs/api/covidcast-signals/indicator-combination.md

Lines changed: 15 additions & 29 deletions
@@ -68,7 +68,7 @@ $$\mathcal{S}$$ of sensors (with total $$S$$ sensors) and a set $$\mathcal{R}$$
6868
of regions (with total $$R$$ regions), we aim to find a combined indicator that
6969
best reconstructs all sensor observations after being passed through a learned
7070
sensor-specific transformation $$g_{s}(\cdot)$$. That is, for each time $$t$$, we
71-
aim to minimize the sum of squared reconstruction error:
71+
minimize the sum of squared reconstruction error:
7272

7373
$$
7474
\sum_{s \in \mathcal{S}} \sum_{r \in \mathcal{R}} (x_{rs} - g_{s}(z_{r}))^2,
@@ -81,7 +81,7 @@ specific ways as explained next.
8181
#### Optimization Constraints
8282

8383
We constrain the sensor-specific transformation $$g_{s}$$ to be a linear
84-
function $$g_{s}(x) = a_{s} x$$. Then, the transformations $$g$$ are simply
84+
function such that $$g_{s}(x) = a_{s} x$$. Then, the transformations $$g$$ are simply
8585
various rescalings $$a = a_{1:S}$$, and the objective can be more succinctly
8686
written as
8787

@@ -132,48 +132,34 @@ sensor. This can be avoided with global column scaling.
132132
#### Lags and Sporadic Missingness
133133

134134
The matrix $$X$$ is not necessarily complete and we may have entries missing.
135-
Several forms of missingness arise in our data.
136-
137-
* On certain days, all observations of a given sensor are missing due to release
138-
lag. For example, Doctor Visits is released several days late.
139-
* On any given day, different sensors are observed in different regions.
140-
* For any given region and sensor, the sensor may be available on some days but
141-
not others due to sample size cutoffs.
135+
Several forms of missingness arise in our data. On certain days, all observations of a given sensor are missing due to release lag. For example, Doctor Visits is released several days late. Also, for any given region and sensor, a sensor may be available on some days but not others due to sample size cutoffs. Additionally, on any given day, different sensors are observed in different regions.
142136

143137
To ensure that our combined indicator value has comparable scaling over time and
144138
is free from erratic jumps that are just due to missingness, we use the
145-
following strategies:
146-
147-
* *Lag imputation*: If a sensor is missing for all regions on a given day, we
148-
copy all observations from the last day on which any observation was available
149-
for that sensor.
150-
* *Recent imputation*: If $$x_{rs}(t)$$ is missing but at least one of
151-
$$x_{rs}(t-1), x_{rs}(t-2), \dots, x_{rs}(t-T)$$ is observed, impute
152-
$$x_{rs}(t)$$ with the most recent of $$x_{rs}(t-1), x_{rs}(t-2), \dots,
153-
x_{rs}(t-T)$$. We limit $$T$$ to be 7 days.
139+
following imputation strategies:
140+
*lag imputation*, where if a sensor is missing for all regions on a given day, we copy all observations from the last day on which any observation was available for that sensor;
141+
*recent imputation*, where if a sensor value is missing on a given day but at least one of the past $$T$$ values is observed, we impute it with the most recent value. We limit $$T$$ to be 7 days.
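
Review note: the two imputation strategies described in this hunk can be sketched as follows. This is a minimal illustration, not the project's implementation. The function names `lag_impute` and `recent_impute`, the list-of-lists layout, and the use of `None` for a missing value are all assumptions made for the example; only the 7-day limit comes from the text.

```python
from typing import List, Optional

Value = Optional[float]  # None marks a missing observation (assumed encoding)


def lag_impute(days: List[List[Value]]) -> List[List[Value]]:
    """Lag imputation for one sensor; days[t][r] is region r on day t.

    If the sensor is missing in every region on a day, copy all observations
    from the last day on which any observation was available.
    """
    filled = [list(day) for day in days]
    last_observed = None
    for t, day in enumerate(days):
        if any(v is not None for v in day):
            last_observed = day
        elif last_observed is not None:
            filled[t] = list(last_observed)
    return filled


def recent_impute(series: List[Value], max_lag: int = 7) -> List[Value]:
    """Recent imputation for one (region, sensor) series.

    A missing value on day t is replaced by the most recent observed value
    among days t-1, ..., t-max_lag; otherwise it stays missing.
    """
    filled = list(series)
    for t, value in enumerate(series):
        if value is not None:
            continue
        # Search the original (not yet imputed) series, nearest day first.
        for lag in range(1, max_lag + 1):
            if t - lag < 0:
                break
            if series[t - lag] is not None:
                filled[t] = series[t - lag]
                break
    return filled


# Hypothetical usage with made-up numbers.
sensor_days = lag_impute([[1.0, None], [None, None], [None, 3.0]])
region_series = recent_impute([1.0, None, None, 2.0] + [None] * 8)
```

Searching the original series rather than the partially filled one keeps an imputed value from being carried forward beyond the `max_lag` window.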
154142

155143
#### Persistent Missingness
156144

157145
Even with the above imputation strategies, we still have issues that some
158146
sensors are never available in a given region. The result is that combined
159147
indicator values for that region may be on a completely different scale
160148
from values in other regions with additional observed sensors. This can only be
161-
overcome by regularizing / pooling information across space. Note that a very
149+
overcome by regularizing or pooling information across space. Note that a very
162150
similar problem occurs when a sensor is unavailable for a very long period of
163151
time (so long that recent imputation is inadvisable and avoided by setting $$T =
164152
7$$ days).
165153

166-
We deal with this problem by *geographic imputation* where we impute values from
167-
regions that share a higher level of aggregation (e.g., the mean observed score
168-
in an MSA / state), or by imputing values from megacounties (since the counties
154+
We deal with this problem by *geographic imputation*, where we impute values from
155+
regions that share a higher level of aggregation (e.g., the median observed score
156+
in an MSA or state), or by imputing values from megacounties (since the counties
169157
in question are missing and hence should be reflected in the rest of state
170-
estimate). The order in which we try to perform geographic imputations is first
171-
attempting to impute from observed values from megacounties, followed by median
172-
observed values in the geographic hierarchy (county, MSA, and state), followed
173-
finally by the median observed value over the entire country if all the previous
174-
imputation attempts fail. This imputation order is chosen among different
175-
options by evaluating their effectiveness to mimic the actual observed values in
176-
validation experiments.
158+
estimate). The order in which we look to perform geographic imputations is
159+
observed values from megacounties, followed by median
160+
observed values in the geographic hierarchy (county, MSA, state, or country).
161+
We chose this imputation sequence among different options by evaluating
162+
their effectiveness at mimicking the actual observed sensor values in validation experiments.
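
Review note: the fallback order described in this hunk might look like the sketch below. It is an illustration under stated assumptions, not the project's code. The function `geographic_impute`, the dictionary arguments (`observed`, `megacounty`, `msa_of`, `state_of`), and all region names are hypothetical. The sketch assumes the county-level step means "use the county's own value when it is observed", which the text does not spell out.

```python
from statistics import median
from typing import Dict, Optional


def geographic_impute(
    county: str,
    observed: Dict[str, float],    # county -> observed sensor value
    megacounty: Dict[str, float],  # state -> megacounty value, if available
    msa_of: Dict[str, str],        # county -> MSA (counties may be absent)
    state_of: Dict[str, str],      # county -> state
) -> Optional[float]:
    """Return a value for `county`, trying each fallback level in order."""
    # County level: nothing to impute if the value is observed.
    if county in observed:
        return observed[county]

    # First choice: the megacounty value for the county's state.
    state = state_of.get(county)
    if state in megacounty:
        return megacounty[state]

    # Next: median observed value within the same MSA, then the same state.
    msa = msa_of.get(county)
    if msa is not None:
        peers = [v for c, v in observed.items() if msa_of.get(c) == msa]
        if peers:
            return median(peers)
    if state is not None:
        peers = [v for c, v in observed.items() if state_of.get(c) == state]
        if peers:
            return median(peers)

    # Last resort: median over every observed county in the country.
    return median(observed.values()) if observed else None


# Hypothetical usage with made-up regions and values.
observed = {"A": 1.0, "B": 3.0, "C": 10.0}
msa_of = {"A": "m1", "B": "m1", "D": "m1"}
state_of = {"A": "s1", "B": "s1", "C": "s2", "D": "s1", "E": "s2", "F": "s3"}
megacounty = {"s3": 7.0}
value_for_d = geographic_impute("D", observed, megacounty, msa_of, state_of)
```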
177163

178164
### Standard Errors
179165
