Commit a522b68

first pass at estimation details, background, singal names
1 parent e8ea107 commit a522b68

File tree

1 file changed

+133
-69
lines changed

docs/api/covidcast-signals/youtube-survey.md

Lines changed: 133 additions & 69 deletions
Original file line number | Diff line number | Diff line change
@@ -17,42 +17,24 @@ grand_parent: COVIDcast Main Endpoint
1717

1818
## Overview
1919

20-
The Youtube-survey is a voluntary COVID-like illness 4-question survey that was part of a research study led by the Delphi group at Carnegie Mellon University. The survey consisted of the following introduction and questions:
20+
This data source is based on a short survey about COVID-19-like illness
21+
run by the Delphi group at Carnegie Mellon.
22+
Youtube directed a random sample of its users to these surveys, which were
23+
voluntary. Users age 18 or older were eligible to complete the surveys, and
24+
their survey responses are held by CMU. No individual survey responses are
25+
shared back to Youtube.
2126

22-
This voluntary survey is part of a research study led by the Delphi group at Carnegie Mellon University. Even if you are healthy, your responses may contribute to a better public health understanding of where the coronavirus pandemic is moving, to improve our local and national responses. The data captured does not include any personally identifiable information about you and your answers to all questions will remain confidential. Published results will be in aggregate and will not identify individual participants or their responses. This study is not conducted by YouTube and no individual responses will be shared back to YouTube. There are no foreseeable risks in participating and no compensation is offered. If you have any questions, contact: [email protected].
27+
This survey was an early version of the [COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/), collecting data only about COVID-19 symptoms. CTIS ran for much longer and is more detailed, also collecting belief and behavior data, and is recommended for most use cases. See our [surveys
28+
page](https://delphi.cmu.edu/covid19/ctis/) for more detail about how CTIS works.
2329

24-
Qualifying Questions
25-
You must be 18 years or older to take this survey. Are you 18 years or older?
26-
What is the ZIP Code of the city or town where you slept last night? [We mean the place where you are currently staying. This may be different from your usual residence.]
27-
What is your current ZIP code?
30+
[TODO note that indicators differ between the two surveys for unknown reasons]
2831

29-
List of Symptoms
30-
Fever (100°F or higher)
31-
Sore throat
32-
Cough
33-
Shortness of breath
34-
Difficulty breathing
32+
As of late April 2020, the number of Youtube survey responses we
33+
received each day was 4,000 to 7,000. Responses were too sparse at finer geographic levels, so this indicator is reported only at the state level. The survey ran from April 21, 2020 to June
34+
17, 2020, collecting about 159 thousand responses in the United States in that
35+
time.
3536

36-
Survey Question 1
37-
"How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)?
38-
39-
Survey Question 2
40-
"How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)?"
41-
42-
Survey Question 3
43-
"How many people in your household (including yourself) are experiencing at least one symptom from above?"
44-
45-
Survey Question 4
46-
"In the past 24 hours, have you or anyone in your household experienced any of the following:"
47-
48-
| Signal | Description |
49-
| --- | --- |
50-
| `smoothed_outpatient_covid` | Estimated percentage of outpatient doctor visits with confirmed COVID-19, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother <br/> **Earliest date available:** 2020-02-01 |
51-
| `smoothed_adj_outpatient_covid` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) <br/> **Earliest date available:** 2020-02-01 |
52-
| `smoothed_outpatient_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother <br/> **Earliest date available:** 2020-02-01 |
53-
| `smoothed_adj_outpatient_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) <br/> **Earliest date available:** 2020-02-01 |
54-
| `smoothed_outpatient_flu` | Estimated percentage of outpatient doctor visits with confirmed influenza, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother <br/> **Earliest issue available:** 2021-12-06 <br/> **Earliest date available:** 2020-02-01 |
55-
| `smoothed_adj_outpatient_flu` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) <br/> **Earliest issue available:** 2021-12-06 <br/> **Earliest date available:** 2020-02-01 |
37+
We produce [influenza-like and COVID-like illness indicators](#ili-and-cli-indicators) based on the survey data.
5638

5739
## Table of Contents
5840
{: .no_toc .text-delta}
@@ -62,44 +44,135 @@ Survey Question 4
6244

6345
## Survey Text and Questions
6446

65-
The survey starts with the following 5 questions:
47+
The survey contains the following 5 questions:
6648

67-
1. In the past 24 hours, have you or anyone in your household had any of the
68-
following (yes/no for each):
49+
1. In the past 24 hours, have you or anyone in your household experienced any of the following:
6950
- (a) Fever (100 °F or higher)
7051
- (b) Sore throat
7152
- (c) Cough
7253
- (d) Shortness of breath
7354
- (e) Difficulty breathing
74-
2. How many people in your household (including yourself) are sick (fever, along
75-
with at least one other symptom from the above list)?
76-
3. How many people are there in your household in total (including yourself)?
77-
*[Beginning in wave 4, this question asks respondents to break the number
78-
down into three age categories.]*
55+
2. How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)?
56+
3. How many people are there in your household (including yourself)?
7957
4. What is your current ZIP code?
80-
5. How many additional people in your local community that you know personally
81-
are sick (fever, along with at least one other symptom from the above list)?
58+
5. How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)?
59+
60+
61+
## ILI and CLI Indicators
62+
63+
We define COVID-like illness (fever, along with cough, or shortness of breath,
64+
or difficulty breathing) and influenza-like illness (fever, along with cough or
65+
sore throat) for use in forecasting and modeling. Using this survey data, we
66+
estimate the percentage of people (age 18 or older) who have a COVID-like
67+
illness, or influenza-like illness, in a given location, on a given day.
68+
69+
| Signals | Description |
70+
| --- | --- |
71+
| `raw_cli` and `smoothed_cli` | Estimated percentage of people with COVID-like illness <br/> **Earliest date available:** 2020-04-21 |
72+
| `raw_ili` and `smoothed_ili` | Estimated percentage of people with influenza-like illness <br/> **Earliest date available:** 2020-04-21 |
73+
74+
Influenza-like illness or ILI is a standard indicator, and is defined by the CDC
75+
as: fever along with sore throat or cough. From the list of symptoms from Q1 on
76+
our survey, this means a and (b or c).
77+
78+
COVID-like illness or CLI is not a standard indicator. Through our discussions
79+
with the CDC, we chose to define it as: fever along with cough or shortness of
80+
breath or difficulty breathing. From the list of symptoms from Q1 on
81+
our survey, this means a and (c or d or e).
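
The two definitions above can be sketched as boolean checks on the Q1 answers. This is a minimal illustration only: the function names and the choice to pass each symptom as a separate boolean are assumptions made for this example, not part of the survey pipeline.

```python
def meets_ili(fever, sore_throat, cough):
    """ILI: fever along with sore throat or cough, i.e. a and (b or c)."""
    return fever and (sore_throat or cough)


def meets_cli(fever, cough, shortness_of_breath, difficulty_breathing):
    """CLI: fever along with cough, shortness of breath, or difficulty
    breathing, i.e. a and (c or d or e)."""
    return fever and (cough or shortness_of_breath or difficulty_breathing)
```

Note that fever with cough satisfies both definitions, which is why the two indicators overlap heavily.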
82+
83+
Symptoms alone are not sufficient to diagnose influenza or coronavirus
84+
infections, and so these ILI and CLI indicators are *not* expected to be
85+
unbiased estimates of the true rate of influenza or coronavirus infections.
86+
These symptoms can be caused by many other conditions, and many true infections
87+
can be asymptomatic. Instead, we expect these indicators to be useful for
88+
comparison across the United States and across time, to determine where symptoms
89+
appear to be increasing.
8290

83-
Beyond these 5 questions, there are also many other questions that follow in the
84-
survey, which go into more detail on symptoms, contacts, risk factors, and
85-
demographics. These are used for many of our behavior and testing indicators
86-
below. The full text of the survey (including all deployed versions) can be
87-
found on our [questions and coding page](../../symptom-survey/coding.md).
91+
**Smoothing.** The signals beginning with `smoothed` estimate the same quantities as their
92+
`raw` partners, but are smoothed in time to reduce day-to-day sampling noise;
93+
see [details below](#smoothing). Crucially, because the smoothed signals combine
94+
information across multiple days, they have larger sample sizes and hence are
95+
available for more locations than the raw signals.
8896

89-
### Day-of-Week Adjustment
9097

98+
### Defining Household ILI and CLI
9199

100+
[TODO check]
92101

93-
### Backwards Padding
102+
For a single survey, we are interested in the quantities:
94103

104+
- $$X =$$ the number of people in the household with ILI;
105+
- $$Y =$$ the number of people in the household with CLI;
106+
- $$N =$$ the number of people in the household.
107+
108+
Note that $$N$$ comes directly from the answer to Q3, but neither $$X$$ nor
109+
$$Y$$ can be computed directly, because Q2 does not capture the
110+
precise symptom profile of each individual in the household; it only asks
111+
how many individuals have fever and at least one other symptom from the list.
112+
113+
We hence estimate $$X$$ and $$Y$$ with the following simple strategy. Consider
114+
ILI, without loss of generality (we apply the same strategy to CLI). Let $$Z$$
115+
be the answer to Q2.
116+
117+
- If the answer to Q1 does not meet the ILI definition, then we report $$X=0$$.
118+
- If the answer to Q1 does meet the ILI definition, then we report $$X = Z$$.
119+
120+
This can only "over count" (result in too large estimates of) the true $$X$$ and
121+
$$Y$$. For example, this happens when some members of the household experience
122+
ILI that does not also qualify as CLI, while others experience CLI that does not
123+
also qualify as ILI. In this case, for both $$X$$ and $$Y$$, our simple strategy
124+
would return the sum of both types of cases. However, given the extreme degree
125+
of overlap between the definitions of ILI and CLI, it is reasonable to believe
126+
that, if symptoms across all household members qualified as both ILI and CLI,
127+
each individual would have both, or neither---with neither being more common.
128+
Therefore we do not consider this "over counting" phenomenon practically
129+
problematic.
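
The strategy above, including the over-counting case, can be sketched as follows. The function name and the dictionary encoding of the Q1 answers (symptom letters mapped to booleans) are hypothetical choices for this illustration, not the pipeline's actual code.

```python
def household_counts(q1, z):
    """Estimate (X, Y), the household ILI and CLI counts, from one survey.

    q1 maps the Q1 symptom letters "a".."e" to True/False for the household;
    z is the Q2 answer (number of sick people in the household).
    """
    meets_ili = q1["a"] and (q1["b"] or q1["c"])
    meets_cli = q1["a"] and (q1["c"] or q1["d"] or q1["e"])
    x = z if meets_ili else 0
    y = z if meets_cli else 0
    return x, y


# Household reports fever, sore throat, and shortness of breath; Q2 answer is 2.
example = {"a": True, "b": True, "c": False, "d": True, "e": False}
print(household_counts(example, 2))  # prints (2, 2)
```

If one household member had fever with sore throat and the other had fever with shortness of breath, the true counts would be 1 ILI case and 1 CLI case, so in this example both estimates are over counted.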
130+
131+
132+
### Estimating Percent ILI and CLI
133+
134+
[TODO check]
135+
136+
Let $$x$$ and $$y$$ be the number of people with ILI and CLI, respectively, over
137+
a given time period, and in a given location (for example, the time period being
138+
a particular day, and a location being a particular state). Let $$n$$ be the
139+
total number of people in this location. We are interested in estimating the
140+
true ILI and CLI percentages, which we denote by $$p$$ and $$q$$, respectively:
141+
142+
$$
143+
p = 100 \cdot \frac{x}{n}
144+
\quad\text{and}\quad
145+
q = 100 \cdot \frac{y}{n}.
146+
$$
147+
148+
In a given aggregation unit (for example, daily-state), let $$X_i$$ and $$Y_i$$
149+
denote the number of ILI and CLI cases in the household, respectively (computed
150+
according to the simple strategy [described
151+
above](#defining-household-ili-and-cli)), and let $$N_i$$ denote the total
152+
number of people in the household, in survey $$i$$, out of $$m$$ surveys we
153+
collected. Then our unweighted estimates of $$p$$ and $$q$$ are:
154+
155+
$$
156+
\hat{p} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{X_i}{N_i}
157+
\quad\text{and}\quad
158+
\hat{q} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{Y_i}{N_i}.
159+
$$
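
As a rough sketch of the unweighted estimator above: the function name is hypothetical, and the code assumes every household size is at least 1, since the text does not say how surveys with a zero or missing household size are handled.

```python
def percent_estimate(counts, household_sizes):
    """Unweighted estimate: 100 times the mean of the per-survey fractions.

    counts holds X_i (or Y_i) and household_sizes holds N_i, one per survey.
    Assumes every N_i is at least 1.
    """
    if len(counts) != len(household_sizes):
        raise ValueError("counts and household_sizes must have equal length")
    m = len(counts)
    if m == 0:
        raise ValueError("at least one survey is required")
    total = sum(x / n for x, n in zip(counts, household_sizes))
    return 100 * total / m


# Three surveys with household fractions 0/2, 1/4, and 2/2.
p_hat = percent_estimate([0, 1, 2], [2, 4, 2])
```

Each survey contributes its household fraction with equal weight, so a two-person household counts as much as a six-person one.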
95160

96161

97162
### Smoothing
98163

164+
The smoothed versions of all `youtube-survey` signals (with `smoothed` prefix) are
165+
calculated using seven-day pooling. For example, the estimate reported for June
166+
7 in a specific geographical area is formed by
167+
collecting all surveys completed between June 1 and 7 (inclusive) and using that
168+
data in the estimation procedures described above.
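
The pooling window can be illustrated with a short sketch. The function name and the dictionary keyed by completion date are assumptions made for this example; only the seven-day inclusive window comes from the description above.

```python
from datetime import date, timedelta


def pooled_responses(responses_by_day, target_day, window=7):
    """Collect all surveys completed in the `window` days ending on
    target_day (inclusive)."""
    pooled = []
    for offset in range(window):
        day = target_day - timedelta(days=offset)
        pooled.extend(responses_by_day.get(day, []))
    return pooled


# For June 7, the window covers June 1 through June 7 inclusive.
responses = {
    date(2020, 5, 31): ["r0"],
    date(2020, 6, 1): ["r1"],
    date(2020, 6, 7): ["r2", "r3"],
    date(2020, 6, 8): ["r4"],
}
june_7_pool = pooled_responses(responses, date(2020, 6, 7))
```

The pooled surveys would then be passed to the same estimator used for the raw signals.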
99169

100170

101171
## Lag and Backfill
102172

173+
Lag is 1 day. Backfill continues for a couple of days.
174+
175+
[TODO more detail]
103176

104177

105178
## Limitations
@@ -111,16 +184,11 @@ limitations of this survey data.
111184
they are age 18 or older, they are currently located in the USA, and they are
112185
an active user of Youtube. The survey data does not report on children under
113186
age 18, and the Youtube adult user population may differ from the United
114-
States population generally in important ways. We use our [survey
115-
weighting](#survey-weighting-and-estimation) to adjust the estimates to match
116-
age and gender demographics by state, but this process doesn't adjust for
117-
other demographic biases we may not be aware of.
187+
States population generally in important ways. We don't adjust for any
188+
demographic biases.
118189
* **Non-response bias.** The survey is voluntary, and people who accept the
119190
invitation when it is presented to them on Youtube may be different from
120-
those who do not. The [survey weights provided by
121-
Youtube](#survey-weighting-and-estimation) attempt to model the probability
122-
of response for each user and hence adjust for this, but it is difficult to
123-
tell if these weights account for all possible non-response bias.
191+
those who do not.
124192
* **Social desirability.** Previous survey research has shown that people's
125193
responses to surveys are often biased by what responses they believe are
126194
socially desirable or acceptable. For example, if there is widespread
@@ -129,13 +197,13 @@ limitations of this survey data.
129197
expect the social desirability effect to be smaller, but it may still be
130198
present.
131199
* **False responses.** As with anything on the Internet, a small percentage of
132-
users give deliberately incorrect responses. We discard a small number of
200+
users give deliberately incorrect responses. [TODO check if true] We discard a small number of
133201
responses that are obviously false, but do **not** perform extensive
134-
filtering. However, the large size of the study, and our procedure for
202+
filtering. However, the large size of the study, and [TODO check if true] our procedure for
135203
ensuring that each respondent can only be counted once when they are invited
136204
to take the survey, prevents individual respondents from having a large effect
137205
on results.
138-
* **Repeat invitations.** Individual respondents can be invited by Youtube to
206+
* **Repeat invitations.** [TODO check] Individual respondents can be invited by Youtube to
139207
take the survey several times. Usually Youtube only re-invites a respondent
140208
after one month. Hence estimates of values on a single day are calculated
141209
using independent survey responses from unique respondents (or, at least,
@@ -151,16 +219,12 @@ even if point estimates are biased.
151219

152220
### Privacy Restrictions
153221

154-
To protect respondent privacy, we discard any estimate (whether at a county,
155-
MSA, HRR, or state level) that is based on fewer than 100 survey responses. For
222+
To protect respondent privacy, we discard any estimate that is based on fewer than 100 survey responses. For
156223
signals reported using a 7-day average (those beginning with `smoothed_`), this
157224
means a geographic area must have at least 100 responses in 7 days to be
158225
reported.
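
A minimal sketch of this rule is below. The function name, the constant name, and the input layout (a dictionary of value and sample-size pairs) are hypothetical; only the threshold of 100 responses comes from the text.

```python
MIN_RESPONSES = 100  # estimates based on fewer responses are discarded


def apply_privacy_filter(estimates):
    """Keep only estimates based on at least MIN_RESPONSES survey responses.

    estimates maps a geographic area to a (value, sample_size) pair.
    """
    return {
        geo: (value, n)
        for geo, (value, n) in estimates.items()
        if n >= MIN_RESPONSES
    }


daily = {"pa": (1.2, 250), "wy": (0.8, 40), "vt": (0.5, 100)}
reported = apply_privacy_filter(daily)
```

For a smoothed signal, the sample size passed in would be the count of responses over the 7-day window rather than a single day.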
159226

160-
This affects some items more than others. For instance, items about vaccine
161-
hesitancy reasons are only asked of respondents who are unvaccinated and
162-
hesitant, not to all survey respondents. It also affects some geographic areas
163-
more than others, particularly rural areas with low population densities. When
164-
doing analysis of county-level data, one should be aware that missing counties
165-
are typically more rural and less populous than those present in the data, which
166-
may introduce bias into the analysis.
227+
This affects some items more than others. It affects some geographic areas
228+
more than others, particularly areas with smaller populations. This effect is
229+
less pronounced with smoothed signals, since responses are pooled across a
230+
longer time period.
