
Commit 46d6a69

Merge remote-tracking branch 'upstream/main' into csv_acquisition
2 parents 445a34c + f4d55dd commit 46d6a69


48 files changed

+3028
-666
lines changed

docs/api/README.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
---
22
title: Delphi Epidata API
3-
nav_order: 2
3+
nav_order: 3
44
has_children: true
55
---
66

docs/api/covidcast-signals/doctor-visits.md

+21-16
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,12 @@ grand_parent: COVIDcast API
1313
* **Available for:** county, hrr, msa, state (see [geography coding docs](../covidcast_geography.md))
1414

1515
This data source is based on information about outpatient visits, provided to us
16-
by a national health system. Using this outpatient data, we estimate the
16+
by health system partners. Using this outpatient data, we estimate the
1717
percentage of COVID-related doctor's visits in a given location, on a given day.
1818

1919
| Signal | Description |
2020
| --- | --- |
21-
| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from a national health system, smoothed in time using a Gaussian linear smoother |
21+
| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from health system partners, smoothed in time using a Gaussian linear smoother |
2222
| `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) |
2323

2424
## Table of contents
@@ -29,10 +29,10 @@ percentage of COVID-related doctor's visits in a given location, on a given day.
2929

3030
## Lag and Backfill
3131

32-
Note that because doctor's visits may be reported to the health system several
33-
days after they occur, these signals are typically available with several days
34-
of lag. This means that estimates for a specific day are only available several
35-
days later.
32+
Note that because doctor's visits may be reported to the health system partners
33+
several days after they occur, these signals are typically available with
34+
several days of lag. This means that estimates for a specific day are only
35+
available several days later.
3636

3737
The amount of lag in reporting can vary, and not all visits are reported with
3838
the same lag. After we first report estimates for a specific date, further data
@@ -43,10 +43,11 @@ June 16th.
4343

4444
## Limitations
4545

46-
This data source is based on outpatient visit data provided to us by a national
47-
health system. The system can report on a portion of United States outpatient
48-
doctor's visits, but not all of them, and so this source only represents those
49-
visits known to them. Their coverage may vary across the United States.
46+
This data source is based on outpatient visit data provided to us by health
47+
system partners. The partners can report on a portion of United States
48+
outpatient doctor's visits, but not all of them, and so this source only
49+
represents those visits known to them. Their coverage may vary across the United
50+
States.
5051

5152
Standard errors are not available for this data source.
5253

@@ -115,16 +116,20 @@ the ratio between the doctor visit signals on Sunday and Monday would be a
115116
constant. Formally, we assume that
116117

117118
$$
118-
\log \mu_t = \alpha_{wd(t)} + \phi_t
119+
\begin{aligned}
120+
\mathbb{E}[Y_{it}] &= \mu_t\\
121+
\log \mu_t &= \alpha_{\text{wd}(t)} + \phi_t,
122+
\end{aligned}
119123
$$
120124

121-
where $$\mu_t$$ is the expected doctor visits percentage of CLI at time $$t$$,
122-
$$\alpha_{wd(t)}$$ is the weekday correction for the weekday of day $$t$$, and
125+
where $$Y_{it}$$ is the observed doctor visits percentage of CLI at time $$t$$,
126+
$$\text{wd}(t) \in \{0, \dots, 6\}$$ is the day-of-week of time $$t$$,
127+
$$\alpha_{\text{wd}(t)}$$ is the corresponding weekday correction, and
123128
$$\phi_t$$ is the corrected doctor visits percentage of CLI at time $$t$$.
124129

125-
For simplicity, we fit assume that the weekday parameters do not change over
126-
time or location. To fit the $$\alpha$$ parameters, we minimize the following
127-
convex objective function:
130+
For simplicity, we assume that the weekday parameters do not change over time or
131+
location. To fit the $$\alpha$$ parameters, we minimize the following convex
132+
objective function:
128133

129134
$$
130135
f(\alpha, \phi | \mu) = -\log \ell (\alpha,\phi|\mu) + \lambda ||\Delta^3 \phi||_1

docs/api/covidcast-signals/fb-survey.md

+10-2
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,8 @@ day.
2727

2828
| Signal | Description |
2929
| --- | --- |
30-
| `raw_cli` | Estimated percentage of people with COVID-like illness based on the [criteria below](#defining-household-ili-and-cli), with no smoothing or survey weighting |
31-
| `raw_ili` | Estimated percentage of people with influenza-like illness based on the [criteria below](#defining-household-ili-and-cli), with no smoothing or survey weighting |
30+
| `raw_cli` | Estimated percentage of people with COVID-like illness based on the [criteria below](#ili-and-cli-indicators), with no smoothing or survey weighting |
31+
| `raw_ili` | Estimated percentage of people with influenza-like illness based on the [criteria below](#ili-and-cli-indicators), with no smoothing or survey weighting |
3232
| `raw_wcli` | Estimated percentage of people with COVID-like illness; adjusted using survey weights [as described below](#survey-weighting) |
3333
| `raw_wili` | Estimated percentage of people with influenza-like illness; adjusted using survey weights [as described below](#survey-weighting) |
3434
| `raw_hh_cmnty_cli` | Estimated percentage of people reporting illness in their local community, as [described below](#estimating-community-cli), including their household, with no smoothing or survey weighting |
@@ -92,6 +92,14 @@ COVID-like illness or CLI is not a standard indicator. Through our discussions
9292
with the CDC, we chose to define it as: fever along with cough or shortness of
9393
breath or difficulty breathing.
9494

95+
Symptoms alone are not sufficient to diagnose influenza or coronavirus
96+
infections, and so these ILI and CLI indicators are *not* expected to be
97+
unbiased estimates of the true rate of influenza or coronavirus infections.
98+
These symptoms can be caused by many other conditions, and many true infections
99+
can be asymptomatic. Instead, we expect these indicators to be useful for
100+
comparison across the United States and across time, to determine where symptoms
101+
appear to be increasing.
102+
95103
### Defining Household ILI and CLI
96104

97105
For a single survey, we are interested in the quantities:

docs/api/covidcast-signals/ght.md

+14-8
Original file line numberDiff line numberDiff line change
@@ -32,14 +32,20 @@ numbers of COVID-related searches.
3232
## Estimation
3333

3434
We query the Google Health Trends API for overall searcher interest in a set of
35-
COVID-19 related terms which encompass the following topics: coronavirus
36-
symptoms; coronavirus help; coronavirus test-seeking; anosmia (lack of smell or
37-
taste). The API provides data at the Nielsen Designated Marketing Area (DMA)
38-
level and at the State level. This information reported by the API is unitless
39-
and pre-normalized for population size; i.e., the time series obtained for New
40-
York and Wyoming states are directly comparable. The public has access to a
41-
limited view of such information through [Google
42-
Trends](https://trends.google.com).
35+
COVID-19 related terms about anosmia (lack of smell or taste), which emerged as
36+
a symptom of the coronavirus. The specific terms are:
37+
38+
* "why cant i smell or taste"
39+
* "loss of smell"
40+
* "loss of taste"
41+
* Anosmia generally, by querying for topics linked by Google to the anosmia item
42+
in the Freebase knowledge graph (ID `/m/0m7pl`)
43+
44+
The API provides data at the Nielsen Designated Marketing Area (DMA) level and
45+
at the State level. This information reported by the API is unitless and
46+
pre-normalized for population size; i.e., the time series obtained for New York
47+
and Wyoming states are directly comparable. The public has access to a limited
48+
view of such information through [Google Trends](https://trends.google.com).
4349

4450
DMA-level data are aggregated to the MSA and HRR level through
4551
population-weighted averaging.

docs/api/covidcast-signals/google-survey.md

+233-1
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ grand_parent: COVIDcast API
55
---
66

77
# Google Symptom Surveys
8+
{: .no_toc}
89

910
* **Source name:** `google-survey`
1011
* **Number of data revisions since 19 May 2020:** 0
@@ -35,4 +36,235 @@ specific geographical areas as needed to support forecasting efforts.
3536
| Signal | Description |
3637
| --- | --- |
3738
| `raw_cli` | Estimated percentage of people who know someone in their community with COVID-like illness |
38-
| `smoothed_cli` | Estimated percentage of people who know someone in their community with COVID-like illness, smoothed in time |
39+
| `smoothed_cli` | Estimated percentage of people who know someone in their community with COVID-like illness, smoothed in time [as described below](#smoothing) |
40+
41+
## Table of contents
42+
{: .no_toc .text-delta}
43+
44+
1. TOC
45+
{:toc}
46+
47+
## Estimation
48+
49+
Let $$Y$$ be the number of people who know someone in their community with
50+
COVID-like illness or CLI, over a given time period and in a given location, and
51+
let $$N$$ be the number of people in this location who do *not* know someone in
52+
their community with CLI. We are interested in the proportion
53+
54+
$$
55+
p = \frac{Y}{Y+N}.
56+
$$
57+
58+
Since the Google Surveys system provides estimated counties for each respondent,
59+
we are able to report $$p$$ for counties, MSAs, HRRs, and states. Our current
60+
rule-of-thumb is to discard any estimate (whether at a county, MSA, HRR, or
61+
state level) that is composed of fewer than 100 survey responses.
62+
63+
At the county, MSA, and HRR levels, our estimation procedure is fairly
64+
simple, and is outlined below. Estimation for mega-counties and states is more
65+
complex, and deferred to the next subsection.
66+
67+
### County Level
68+
69+
Recall that we run surveys separately (in a stratified manner) in each county.
70+
In a given county, if $$Y$$ denotes the number of respondents who know someone
71+
in their community with CLI, $$N$$ denotes the total number who do not, and $$n
72+
= Y + N$$ the number of "yes" and "no" responses combined, then to estimate
73+
$$p$$ in the county, we simply use:
74+
75+
$$
76+
\hat{p} = \frac{Y}{n}.
77+
$$
78+
79+
Its estimated standard error is:
80+
81+
$$
82+
\widehat{\text{se}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
83+
$$
84+
85+
which is the plug-in estimate of the standard error of the estimator, treating
86+
$$n$$ as fixed.
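
A minimal sketch of the county-level calculation above, in Python. The function name `county_estimate` and its arguments are illustrative assumptions, not part of the pipeline; the 100-response cutoff follows the rule-of-thumb stated earlier.

```python
import math


def county_estimate(yes, no, min_responses=100):
    """Return (p_hat, se_hat) for one county, or None if too few responses.

    yes, no: counts of "yes" and "no" responses (hypothetical inputs).
    """
    n = yes + no
    if n < min_responses:
        # Estimates built from fewer than 100 responses are discarded.
        return None
    p_hat = yes / n
    # Plug-in standard error, treating n as fixed.
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se_hat
```

For example, 50 "yes" and 150 "no" responses give an estimate of 0.25, while a county with only 30 responses in total yields no estimate.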
87+
88+
### MSA and HRR Levels
89+
90+
Suppose a given MSA or HRR contains $$m$$ counties. In county $$i$$, let $$Y_i$$ denote the
91+
number of "yes" responses, $$N_i$$ denote the number of "no" responses, and
92+
$$n_i = Y_i + N_i$$ the total number of yes and no responses. Let $$\hat p_i =
93+
Y_i/n_i$$ be the estimate for each county. Also, let $$k_i$$ denote the
94+
population of county $$i$$, and let $$k = \sum_{i=1}^m k_i$$ denote the total
95+
population of all surveyed counties within the MSA or HRR.
96+
97+
Our estimator is then
98+
99+
$$
100+
\hat{p} = \sum_{i=1}^m \frac{k_i}{k} \hat{p}_i,
101+
$$
102+
103+
the standard stratified sampling estimate of $$p$$. Its estimated standard error
104+
is
105+
106+
$$
107+
\widehat{\text{se}}(\hat{p}) = \sqrt{\sum_{i=1}^m \Big(\frac{k_i}{k}\Big)^2
108+
\frac{\hat{p}_i(1-\hat{p}_i)}{n_i}},
109+
$$
110+
111+
again the plug-in estimate of standard error of the estimator.
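
The population-weighted estimator and its plug-in standard error can be sketched as follows. This is an illustration under stated assumptions: the function name `stratified_estimate` and the list-based inputs are hypothetical, and every county passed in is assumed to have at least one response.

```python
import math


def stratified_estimate(yes_counts, no_counts, populations):
    """Combine county estimates into one MSA or HRR estimate.

    yes_counts[i], no_counts[i]: Y_i and N_i for surveyed county i.
    populations[i]: k_i, the population of county i.
    Returns (p_hat, se_hat).
    """
    k = sum(populations)  # total population of the surveyed counties
    p_hat = 0.0
    variance = 0.0
    for y_i, no_i, k_i in zip(yes_counts, no_counts, populations):
        n_i = y_i + no_i
        p_i = y_i / n_i
        weight = k_i / k
        p_hat += weight * p_i
        variance += weight ** 2 * p_i * (1 - p_i) / n_i
    return p_hat, math.sqrt(variance)
```

With two counties of populations 3000 and 1000 and county estimates 0.25 and 0.30, the weights are 0.75 and 0.25, so the combined estimate is 0.2625.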
112+
113+
### State and Mega-County Estimates
114+
115+
State estimates are somewhat complicated by the multi-resolution nature of
116+
sampling within a state: recall that we run surveys directly in each state, but
117+
also directly in all of its counties with more than 100,000 population. In
118+
order to combine state and county level surveys into a state-level community
119+
%CLI estimate, we use a Bayesian approach.
120+
121+
For *every* county $$i$$ in the state, irrespective of whether the county was
122+
surveyed, let $$(Y_{c,i},N_{c,i})$$ represent the number of observed yes and no
123+
responses, and define $$n_{c,i} = Y_{c,i} + N_{c,i}$$. Let $$m_{c,i}$$ be the county
124+
population, and let $$p_{c,i}^*$$ represent the true fraction of individuals in
125+
county $$i$$ who would have responded yes (assuming all individuals would have
126+
responded yes or no and ignoring that "unsure" is a valid option). Note that for a
127+
county not surveyed, we have $$Y_{c,i} = N_{c,i} = n_{c,i} = 0$$.
128+
129+
For the state survey, let $$(Y_s,N_s)$$ be the number of observed yes and no
130+
responses, and define $$n_{s} = Y_{s} + N_{s}$$. Let $$m_{s}$$ be the state
131+
population, and let
132+
133+
$$
134+
p_{s}^* = \sum_{i} m_{c,i} p_{c,i}^*/m_s
135+
$$
136+
137+
represent the true fraction of individuals in the state who would have responded
138+
yes (assuming all individuals would have responded yes or no and ignoring that
139+
"unsure" is a valid option).
140+
141+
Suppose that we assume the county probabilities $$p_{c,i}^*$$ are drawn
142+
independently from a common $$\operatorname{Beta}(a,b)$$ prior.
143+
144+
Maximum a posteriori (MAP) estimates $$\hat{p}_{c,i}$$ of $$p_{c,i}^*$$
145+
can be obtained for all counties $$i$$ by maximizing
146+
147+
$$
148+
\begin{aligned}
149+
&Y_s \log(p_s) + N_s \log(1-p_s) +
150+
\sum_{i} \tilde{Y}_{c,i} \log (p_{c,i}) + \tilde{N}_{c,i} \log(1 - p_{c,i}) \\
151+
&=
152+
Y_s \log\left(\sum_{i} m_{c,i} p_{c,i}/m_s\right) + N_s \log\left(1-\sum_{i}
153+
m_{c,i} p_{c,i}/m_s\right) \\
154+
&+
155+
\sum_{i} \tilde{Y}_{c,i} \log (p_{c,i}) + \tilde{N}_{c,i} \log(1 - p_{c,i})
156+
\end{aligned}
157+
$$
158+
159+
over $$p_{c,i}$$ subject to
160+
161+
$$
162+
\begin{aligned}
163+
0 &\leq p_{c,i} & \forall i &: \tilde{Y}_{c,i} = 0 \text{ and} \\
164+
p_{c,i} &\leq 1 & \forall i &: \tilde{N}_{c,i} = 0,
165+
\end{aligned}
166+
$$
167+
168+
where
169+
170+
$$
171+
(\tilde{Y}_{c,i},\tilde{N}_{c,i},\tilde{n}_{c,i})=(Y_{c,i}+a-1, N_{c,i}+b-1, \tilde{Y}_{c,i}+\tilde{N}_{c,i})
172+
$$
173+
174+
are pseudo-counts induced by the prior.
175+
Then the MAP estimate for the state probability is given by
176+
177+
$$
178+
\hat{p}_s = \sum_{i} m_{c,i} \hat{p}_{c,i}/m_s.
179+
$$
180+
181+
For the megacounty, we can lump all unsurveyed counties together into a single
182+
"other" county with associated population $$m_o = \sum_{\text{unsurveyed } i}
183+
m_{c,i}$$ and estimated proportion given by
184+
185+
$$
186+
\hat{p}_o = \sum_{\text{unsurveyed } i} m_{c,i} \hat{p}_{c,i}/m_o.
187+
$$
188+
189+
Notably, the maximization problem is concave and coincides with maximum
190+
likelihood estimation when $$a = b = 1$$.
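
As a rough illustration, the objective being maximized can be evaluated in plain Python as below. The function name `map_objective` and its argument layout are assumptions made for this sketch; it only evaluates the objective at a candidate vector strictly inside $$(0, 1)$$ and does not include an optimizer. It also assumes the state population is the sum of the county populations passed in.

```python
import math


def map_objective(p, y_s, n_s, y_c, n_c, m_c, a, b):
    """Evaluate the MAP objective at candidate county probabilities p.

    p[i]: candidate p_{c,i}, strictly between 0 and 1.
    y_s, n_s: state-level yes and no counts.
    y_c[i], n_c[i]: county yes and no counts (0 for unsurveyed counties).
    m_c[i]: county populations; m_s is taken to be their sum.
    a, b: Beta prior hyperparameters.
    """
    m_s = sum(m_c)
    # State probability implied by the county probabilities.
    p_s = sum(m_i * p_i for m_i, p_i in zip(m_c, p)) / m_s
    value = y_s * math.log(p_s) + n_s * math.log(1 - p_s)
    for p_i, y_i, n_i in zip(p, y_c, n_c):
        y_tilde = y_i + a - 1  # pseudo-counts induced by the prior
        n_tilde = n_i + b - 1
        value += y_tilde * math.log(p_i) + n_tilde * math.log(1 - p_i)
    return value
```

Because the problem is concave, a general-purpose constrained solver applied to this function (negated, if the solver minimizes) should reach the maximizer from any interior starting point.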
191+
192+
#### Empirical Bayes and Prior Choice
193+
194+
Selecting $$a, b > 1$$ ensures that all pseudo-counts are non-zero and prevents
195+
degenerate estimates of the form $$p_{c,i} \in \{0,1\}$$ by shrinking each
196+
county estimate, even the unsurveyed ones, toward some relevant prior value.
197+
198+
We currently set the prior hyperparameters so that the prior mode
199+
$$\frac{a-1}{(a-1)+(b-1)}$$ matches the pooled mean of surveyed county
200+
proportions and each county receives $$\tilde{n}$$ additional pseudocounts from
201+
the prior:
202+
203+
$$
204+
\begin{aligned}
205+
a &= 1 + \tilde{n}\hat{\mu}\\
206+
b &= 1 + \tilde{n}(1-\hat{\mu}), \text{ for}\\
207+
\hat{\mu} &= \frac{\sum_{\text{surveyed } i} Y_{c,i}}{\sum_{\text{surveyed } i} n_{c,i}}.
208+
\end{aligned}
209+
$$
210+
211+
The number of pseudocounts $$\tilde n$$ is currently set to 5, although it may
212+
be possible to choose a value that varies to minimize mean squared error.
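
The hyperparameter choice above amounts to a few lines of arithmetic. The sketch below is illustrative only; the function name `prior_hyperparameters` and its inputs are assumptions, with the default of 5 pseudocounts taken from the text.

```python
def prior_hyperparameters(yes_counts, no_counts, n_tilde=5.0):
    """Return (a, b, mu_hat) for the Beta(a, b) prior.

    yes_counts, no_counts: Y_{c,i} and N_{c,i} for the surveyed counties.
    n_tilde: number of pseudocounts each county receives from the prior.
    """
    total_yes = sum(yes_counts)
    total = total_yes + sum(no_counts)
    mu_hat = total_yes / total  # pooled mean over surveyed counties
    a = 1 + n_tilde * mu_hat
    b = 1 + n_tilde * (1 - mu_hat)
    return a, b, mu_hat
```

For two surveyed counties with (20, 80) and (30, 70) yes and no responses, the pooled mean is 0.25, giving $$a = 2.25$$ and $$b = 4.75$$, and the prior mode $$(a-1)/((a-1)+(b-1))$$ is again 0.25.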
213+
214+
#### Modification for when State Survey is Missing
215+
216+
When state survey results are missing due to problems in the sampling process,
217+
the MAP estimate of the megacounties can be obtained by directly taking the
218+
prior mode:
219+
220+
$$
221+
\hat p_o = \frac{a-1}{(b-1)+(a-1)} = \hat \mu = \sum_{\text{surveyed } i}
222+
Y_{c,i} / \sum_{\text{surveyed } i} n_{c,i}
223+
$$
224+
225+
and the state MAP estimate is the weighted average of the individual
226+
county-level estimates, reproduced here:
227+
228+
$$
229+
\hat{p}_s = \frac{m_o \hat p_o + \sum_{\text{surveyed } i} m_{c,i} \hat
230+
p_{c,i}}{m_s} = \frac{\sum_{\text{surveyed } i} m_{c,i} \hat p_{c,i}}{m_s-m_o}.
231+
$$
232+
233+
Since this estimator is clearly biased, the variance is not representative of
234+
the amount of uncertainty in the estimate. Our alternative to reporting variance
235+
is to report the MSE of the MAP estimate:
236+
237+
$$
238+
\text{MSE}(\hat p_s) =
239+
\left(\sum_{\text{surveyed } i} \frac{\hat{p}_{c,i}
240+
\left(1-\hat{p}_{c,i}\right)}{n_i}\left(\frac{m_i}{m}\right)^2\right) +
241+
\left(\sum_{\text{unsurveyed } i} \frac{m_i}{m} \cdot (\hat{p}_{c,i} - p_{c,i})
242+
\right)^2,
243+
$$
244+
245+
using the pseudocount $$n_i = Y_{c,i} + N_{c,i} + \tilde n$$. Writing the latter
246+
bias term using the megacounty, an upper bound for this term is (using $$m =
247+
\sum_i m_i$$):
248+
249+
$$
250+
\left(\frac{m_o}{m}\right)^2(\hat{p}_o - p_o)^2 \le \left(\frac{m_o}{m}\right)^2
251+
\max\left((1-\hat{p}_o)^2, \hat{p}_o^2\right)
252+
$$
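
The upper bound on the bias term is straightforward to compute. A minimal sketch, with the function name `bias_upper_bound` and its arguments chosen here for illustration:

```python
def bias_upper_bound(p_o_hat, m_o, m):
    """Upper bound on the squared-bias term from the megacounty.

    p_o_hat: megacounty estimate.
    m_o: megacounty (unsurveyed) population.
    m: total population, the sum of all m_i.
    """
    # The true p_o lies in [0, 1], so |p_o_hat - p_o| is at most
    # max(1 - p_o_hat, p_o_hat).
    worst_case = max((1 - p_o_hat) ** 2, p_o_hat ** 2)
    return (m_o / m) ** 2 * worst_case
```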
253+
254+
The MSE assumes that the survey county data is random and that the prior
255+
parameters are fixed and not random, so the unsurveyed counties only contribute
256+
bias while the surveyed counties are unbiased for their respective county
257+
probabilities and contribute variance.
258+
259+
## Smoothing
260+
261+
Additionally, as with the Facebook surveys, we consider estimates formed by
262+
pooling data over time. That is, daily, for each location, we first pool all
263+
data available in that location over the last 5 days, and compute the estimates
264+
given above using all five days of data.
265+
266+
In contrast to the Facebook surveys, this pooling does not significantly change
267+
the availability of estimates, because our stratified sampling procedure
268+
(essentially always) delivers sufficient data at the county level---at least 100
269+
survey responses---to warrant their own estimates. However, the pooling
270+
procedure still does help by serving as a smoother.
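
The pooling step can be sketched as below. The function name `pooled_counts` and the dictionary input are hypothetical, and the sketch assumes the 5-day window includes the current day; the pooled counts would then be passed to the estimators described above.

```python
from datetime import date, timedelta


def pooled_counts(daily, day, window=5):
    """Sum yes and no counts over the `window` days ending on `day`.

    daily: dict mapping a date to a (yes, no) tuple for one location.
    Days with no data contribute nothing.
    """
    yes = 0
    no = 0
    for offset in range(window):
        y, n = daily.get(day - timedelta(days=offset), (0, 0))
        yes += y
        no += n
    return yes, no
```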
