From 86cdee5b402a6e8a649e18a235e537aee420f9e5 Mon Sep 17 00:00:00 2001 From: Tina Townes Date: Tue, 25 Jun 2024 01:43:21 -0400 Subject: [PATCH 01/10] Creates Youtube-survey doc page This is a draft of the Youtube-survey doc page. A lot of information is missing or assumed based off of (https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/fb-survey.html) and may very possibly altogether be incorrect. A comment from Katie from a May 4/7 2020 tooling team sprint states: "YouTube 4-question survey - this is essentially a copy of fb-survey without the weighted signals". --- docs/api/covidcast-signals/youtube-survey | 166 ++++++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 docs/api/covidcast-signals/youtube-survey diff --git a/docs/api/covidcast-signals/youtube-survey b/docs/api/covidcast-signals/youtube-survey new file mode 100644 index 000000000..a7dce715c --- /dev/null +++ b/docs/api/covidcast-signals/youtube-survey @@ -0,0 +1,166 @@ +--- +title: Youtube Survey +parent: Inactive Signals +grand_parent: COVIDcast Main Endpoint +--- + +# Youtube Survey +{: .no_toc} + +* **Source name:** `youtube-survey` +* **Earliest issue available:** April, 04, 2020 +* **Number of data revisions since May 19, 2020:** 0 +* **Date of last change:** Never +* **Available for:** state (see [geography coding docs](../covidcast_geography.md)) +* **Time type:** day (see [date format docs](../covidcast_times.md)) +* **License:** [CC BY-NC](../covidcast_licensing.md#creative-commons-attribution-noncommercial) + +## Overview + +The Youtube-survey is a voluntary COVID-like illness 4-question survey that was part of a research study led by the Delphi group at Carnegie Mellon University. The survey consisted of the following introduction and questions: + +This voluntary survey is part of a research study led by the Delphi group at Carnegie Mellon University. 
Even if you are healthy, your responses may contribute to a better public health understanding of where the coronavirus pandemic is moving, to improve our local and national responses. The data captured does not include any personally identifiable information about you and your answers to all questions will remain confidential. Published results will be in aggregate and will not identify individual participants or their responses. This study is not conducted by YouTube and no individual responses will be shared back to YouTube. There are no foreseeable risks in participating and no compensation is offered. If you have any questions, contact: delphi-admin-survey-yt@lists.andrew.cmu.edu. + +Qualifying Questions +You must be 18 years or older to take this survey. Are you 18 years or older? +What is the ZIP Code of the city or town where you slept last night? [We mean the place where you are currently staying. This may be different from your usual residence.] +What is your current ZIP code? + +List of Symptoms +Fever (100°F or higher) +Sore throat +Cough +Shortness of breath +Difficulty breathing + +Survey Question 1 +"How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)? + +Survey Question 2 +"How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)?" + +Survey Question 3 +"How many people in your household (including yourself) are experiencing at least one symptom from above?" + +Survey Question 4 +"In the past 24 hours, have you or anyone in your household experienced any of the following:" + +| Signal | Description | +| --- | --- | +| `smoothed_outpatient_covid` | Estimated percentage of outpatient doctor visits with confirmed COVID-19, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest date available:** 2020-02-01 | +| `smoothed_adj_outpatient_covid` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest date available:** 2020-02-01 | +| `smoothed_outpatient_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest date available:** 2020-02-01 | +| `smoothed_adj_outpatient_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest date available:** 2020-02-01 | +| `smoothed_outpatient_flu` | Estimated percentage of outpatient doctor visits with confirmed influenza, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest issue available:** 2021-12-06
**Earliest date available:** 2020-02-01 | +| `smoothed_adj_outpatient_flu` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest issue available:** 2021-12-06
**Earliest date available:** 2020-02-01 | + +## Table of Contents +{: .no_toc .text-delta} + +1. TOC +{:toc} + +## Survey Text and Questions + +The survey starts with the following 5 questions: + +1. In the past 24 hours, have you or anyone in your household had any of the + following (yes/no for each): + - (a) Fever (100 °F or higher) + - (b) Sore throat + - (c) Cough + - (d) Shortness of breath + - (e) Difficulty breathing +2. How many people in your household (including yourself) are sick (fever, along + with at least one other symptom from the above list)? +3. How many people are there in your household in total (including yourself)? + *[Beginning in wave 4, this question asks respondents to break the number + down into three age categories.]* +4. What is your current ZIP code? +5. How many additional people in your local community that you know personally + are sick (fever, along with at least one other symptom from the above list)? + +Beyond these 5 questions, there are also many other questions that follow in the +survey, which go into more detail on symptoms, contacts, risk factors, and +demographics. These are used for many of our behavior and testing indicators +below. The full text of the survey (including all deployed versions) can be +found on our [questions and coding page](../../symptom-survey/coding.md). + +### Day-of-Week Adjustment + + + +### Backwards Padding + + + +### Smoothing + + + +## Lag and Backfill + + + +## Limitations + +When interpreting the signals above, it is important to keep in mind several +limitations of this survey data. + +* **Survey population.** People are eligible to participate in the survey if + they are age 18 or older, they are currently located in the USA, and they are + an active user of Youtube. The survey data does not report on children under + age 18, and the Youtube adult user population may differ from the United + States population generally in important ways. 
We use our [survey + weighting](#survey-weighting-and-estimation) to adjust the estimates to match + age and gender demographics by state, but this process doesn't adjust for + other demographic biases we may not be aware of. +* **Non-response bias.** The survey is voluntary, and people who accept the + invitation when it is presented to them on Youtube may be different from + those who do not. The [survey weights provided by + Youtube](#survey-weighting-and-estimation) attempt to model the probability + of response for each user and hence adjust for this, but it is difficult to + tell if these weights account for all possible non-response bias. +* **Social desirability.** Previous survey research has shown that people's + responses to surveys are often biased by what responses they believe are + socially desirable or acceptable. For example, if it there is widespread + pressure to wear masks, respondents who do *not* wear masks may feel pressured + to answer that they *do*. This survey is anonymous and online, meaning we + expect the social desirability effect to be smaller, but it may still be + present. +* **False responses.** As with anything on the Internet, a small percentage of + users give deliberately incorrect responses. We discard a small number of + responses that are obviously false, but do **not** perform extensive + filtering. However, the large size of the study, and our procedure for + ensuring that each respondent can only be counted once when they are invited + to take the survey, prevents individual respondents from having a large effect + on results. +* **Repeat invitations.** Individual respondents can be invited by Youtube to + take the survey several times. Usually Youtube only re-invites a respondent + after one month. 
Hence estimates of values on a single day are calculated + using independent survey responses from unique respondents (or, at least, + unique Youtube accounts), whereas estimates from different months may involve + the same respondents. + +Whenever possible, you should compare this data to other independent sources. We +believe that while these biases may affect point estimates -- that is, they may +bias estimates on a specific day up or down -- the biases should not change +strongly over time. This means that *changes* in signals, such as increases or +decreases, are likely to represent true changes in the underlying population, +even if point estimates are biased. + +### Privacy Restrictions + +To protect respondent privacy, we discard any estimate (whether at a county, +MSA, HRR, or state level) that is based on fewer than 100 survey responses. For +signals reported using a 7-day average (those beginning with `smoothed_`), this +means a geographic area must have at least 100 responses in 7 days to be +reported. + +This affects some items more than others. For instance, items about vaccine +hesitancy reasons are only asked of respondents who are unvaccinated and +hesitant, not to all survey respondents. It also affects some geographic areas +more than others, particularly rural areas with low population densities. When +doing analysis of county-level data, one should be aware that missing counties +are typically more rural and less populous than those present in the data, which +may introduce bias into the analysis. 
\ No newline at end of file From 23eab56fa3e37594259f18b7430976d930183d77 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Tue, 30 Jul 2024 15:05:01 -0400 Subject: [PATCH 02/10] other endpoints intro --- docs/api/README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/api/README.md b/docs/api/README.md index 709d068e0..768c741fc 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -4,11 +4,12 @@ nav_order: 3 has_children: true --- -# Epidata API (Other Diseases) +# Other Endpoints (COVID-19 and Other Diseases) This is the home of [Delphi](https://delphi.cmu.edu/)'s epidemiological data API for tracking epidemics such as influenza, dengue, and norovirus. Note that -our work on COVID-19 is described in the [COVIDcast Epidata API documentation](covidcast.md). +additional data, especially related to COVID-19, is available in the +[main Epidata API (formerly known as COVIDcast)](covidcast.md). ## Table of Contents {: .no_toc .text-delta} From 8fc89849904a55ee98c5ad87cbd918a53b4135dc Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Tue, 30 Jul 2024 15:06:35 -0400 Subject: [PATCH 03/10] focus on covid --- docs/api/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/api/README.md b/docs/api/README.md index 768c741fc..a3a3986fa 100644 --- a/docs/api/README.md +++ b/docs/api/README.md @@ -8,7 +8,7 @@ has_children: true This is the home of [Delphi](https://delphi.cmu.edu/)'s epidemiological data API for tracking epidemics such as influenza, dengue, and norovirus. Note that -additional data, especially related to COVID-19, is available in the +additional data, including most COVID-19 signals, is available in the [main Epidata API (formerly known as COVIDcast)](covidcast.md). 
## Table of Contents From e8ea10791fdd8119b469ac94b775b94ed5c0c586 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Wed, 31 Jul 2024 17:31:04 -0400 Subject: [PATCH 04/10] add .md suffix --- docs/api/covidcast-signals/{youtube-survey => youtube-survey.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/api/covidcast-signals/{youtube-survey => youtube-survey.md} (100%) diff --git a/docs/api/covidcast-signals/youtube-survey b/docs/api/covidcast-signals/youtube-survey.md similarity index 100% rename from docs/api/covidcast-signals/youtube-survey rename to docs/api/covidcast-signals/youtube-survey.md From a522b681aa0d0c8833d2fc8b3646fdc85f67c68e Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Wed, 31 Jul 2024 19:17:11 -0400 Subject: [PATCH 05/10] first pass at estimation details, background, singal names --- docs/api/covidcast-signals/youtube-survey.md | 202 ++++++++++++------- 1 file changed, 133 insertions(+), 69 deletions(-) diff --git a/docs/api/covidcast-signals/youtube-survey.md b/docs/api/covidcast-signals/youtube-survey.md index a7dce715c..0523a9f42 100644 --- a/docs/api/covidcast-signals/youtube-survey.md +++ b/docs/api/covidcast-signals/youtube-survey.md @@ -17,42 +17,24 @@ grand_parent: COVIDcast Main Endpoint ## Overview -The Youtube-survey is a voluntary COVID-like illness 4-question survey that was part of a research study led by the Delphi group at Carnegie Mellon University. The survey consisted of the following introduction and questions: +This data source is based on a short survey about COVID-19-like illness +run by the Delphi group at Carnegie Mellon. +Youtube directed a random sample of its users to these surveys, which were +voluntary. Users age 18 or older were eligible to complete the surveys, and +their survey responses are held by CMU. No individual survey responses are +shared back to Youtube. 
-This voluntary survey is part of a research study led by the Delphi group at Carnegie Mellon University. Even if you are healthy, your responses may contribute to a better public health understanding of where the coronavirus pandemic is moving, to improve our local and national responses. The data captured does not include any personally identifiable information about you and your answers to all questions will remain confidential. Published results will be in aggregate and will not identify individual participants or their responses. This study is not conducted by YouTube and no individual responses will be shared back to YouTube. There are no foreseeable risks in participating and no compensation is offered. If you have any questions, contact: delphi-admin-survey-yt@lists.andrew.cmu.edu. +This survey was an early version of the [COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/), collecting data only about COVID-19 symptoms. CTIS is much longer-running and more detailed, also collecting belief and behavior data, and is recommended in most usecases. See our [surveys +page](https://delphi.cmu.edu/covid19/ctis/) for more detail about how CTIS works. -Qualifying Questions -You must be 18 years or older to take this survey. Are you 18 years or older? -What is the ZIP Code of the city or town where you slept last night? [We mean the place where you are currently staying. This may be different from your usual residence.] -What is your current ZIP code? +[TODO note that indicators differ between the two surveys for unknown reasons] -List of Symptoms -Fever (100°F or higher) -Sore throat -Cough -Shortness of breath -Difficulty breathing +As of late April 2020, the number of Youtube survey responses we +received each day was 4-7 thousand. This was sparse at finer geographic levels, so this indicator only reports at the state level. The survey ran from April 21, 2020 to June +17, 2020, collecting about 159 thousand responses in the United States in that +time. 
-Survey Question 1 -"How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)? - -Survey Question 2 -"How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)?" - -Survey Question 3 -"How many people in your household (including yourself) are experiencing at least one symptom from above?" - -Survey Question 4 -"In the past 24 hours, have you or anyone in your household experienced any of the following:" - -| Signal | Description | -| --- | --- | -| `smoothed_outpatient_covid` | Estimated percentage of outpatient doctor visits with confirmed COVID-19, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest date available:** 2020-02-01 | -| `smoothed_adj_outpatient_covid` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest date available:** 2020-02-01 | -| `smoothed_outpatient_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest date available:** 2020-02-01 | -| `smoothed_adj_outpatient_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest date available:** 2020-02-01 | -| `smoothed_outpatient_flu` | Estimated percentage of outpatient doctor visits with confirmed influenza, based on Change Healthcare claims data that has been de-identified in accordance with HIPAA privacy regulations, smoothed in time using a Gaussian linear smoother
**Earliest issue available:** 2021-12-06
**Earliest date available:** 2020-02-01 | -| `smoothed_adj_outpatient_flu` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment)
**Earliest issue available:** 2021-12-06
**Earliest date available:** 2020-02-01 | +We produce [influenza-like and COVID-like illness indicators](#ili-and-cli-indicators) based on the survey data. ## Table of Contents {: .no_toc .text-delta} @@ -62,44 +44,135 @@ Survey Question 4 ## Survey Text and Questions -The survey starts with the following 5 questions: +The survey contains the following 5 questions: -1. In the past 24 hours, have you or anyone in your household had any of the - following (yes/no for each): +1. In the past 24 hours, have you or anyone in your household experienced any of the following: - (a) Fever (100 °F or higher) - (b) Sore throat - (c) Cough - (d) Shortness of breath - (e) Difficulty breathing -2. How many people in your household (including yourself) are sick (fever, along - with at least one other symptom from the above list)? -3. How many people are there in your household in total (including yourself)? - *[Beginning in wave 4, this question asks respondents to break the number - down into three age categories.]* +2. How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)? +3. How many people are there in your household (including yourself)? 4. What is your current ZIP code? -5. How many additional people in your local community that you know personally - are sick (fever, along with at least one other symptom from the above list)? +5. How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)? + + +## ILI and CLI Indicators + +We define COVID-like illness (fever, along with cough, or shortness of breath, +or difficulty breathing) or influenza-like illness (fever, along with cough or +sore throat) for use in forecasting and modeling. Using this survey data, we +estimate the percentage of people (age 18 or older) who have a COVID-like +illness, or influenza-like illness, in a given location, on a given day. 
+
+| Signals | Description |
+| --- | --- |
+| `raw_cli` and `smoothed_cli` | Estimated percentage of people with COVID-like illness <br/> **Earliest date available:** 2020-04-21 |
+| `raw_ili` and `smoothed_ili` | Estimated percentage of people with influenza-like illness <br/> **Earliest date available:** 2020-04-21 |
+ +Influenza-like illness or ILI is a standard indicator, and is defined by the CDC +as: fever along with sore throat or cough. From the list of symptoms from Q1 on +our survey, this means a and (b or c). + +COVID-like illness or CLI is not a standard indicator. Through our discussions +with the CDC, we chose to define it as: fever along with cough or shortness of +breath or difficulty breathing. From the list of symptoms from Q1 on +our survey, this means a and (c or d or e). + +Symptoms alone are not sufficient to diagnose influenza or coronavirus +infections, and so these ILI and CLI indicators are *not* expected to be +unbiased estimates of the true rate of influenza or coronavirus infections. +These symptoms can be caused by many other conditions, and many true infections +can be asymptomatic. Instead, we expect these indicators to be useful for +comparison across the United States and across time, to determine where symptoms +appear to be increasing. -Beyond these 5 questions, there are also many other questions that follow in the -survey, which go into more detail on symptoms, contacts, risk factors, and -demographics. These are used for many of our behavior and testing indicators -below. The full text of the survey (including all deployed versions) can be -found on our [questions and coding page](../../symptom-survey/coding.md). +**Smoothing.** The signals beginning with `smoothed` estimate the same quantities as their +`raw` partners, but are smoothed in time to reduce day-to-day sampling noise; +see [details below](#smoothing). Crucially, because the smoothed signals combine +information across multiple days, they have larger sample sizes and hence are +available for more locations than the raw signals. 
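As an illustration only, the ILI and CLI definitions given in this hunk (a and (b or c); a and (c or d or e)) can be written as boolean predicates over the five yes/no answers to Q1. This is a minimal sketch: the function and parameter names are hypothetical and are not taken from the Delphi codebase.

```python
def meets_ili(fever: bool, sore_throat: bool, cough: bool) -> bool:
    """ILI as defined in the doc text: a and (b or c)."""
    return fever and (sore_throat or cough)


def meets_cli(fever: bool, cough: bool, shortness_of_breath: bool,
              difficulty_breathing: bool) -> bool:
    """CLI as defined in the doc text: a and (c or d or e)."""
    return fever and (cough or shortness_of_breath or difficulty_breathing)
```

Under these definitions, fever is required in both cases, and cough alone (with fever) satisfies both ILI and CLI, which is the overlap the doc text relies on.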
-### Day-of-Week Adjustment +### Defining Household ILI and CLI +[TODO check] -### Backwards Padding +For a single survey, we are interested in the quantities: +- $$X =$$ the number of people in the household with ILI; +- $$Y =$$ the number of people in the household with CLI; +- $$N =$$ the number of people in the household. + +Note that $$N$$ comes directly from the answer to Q3, but neither $$X$$ nor +$$Y$$ can be computed directly (because Q2 does not give an answer to the +precise symptomatic profile of all individuals in the household, it only asks +how many individuals have fever and at least one other symptom from the list). + +We hence estimate $$X$$ and $$Y$$ with the following simple strategy. Consider +ILI, without a loss of generality (we apply the same strategy to CLI). Let $$Z$$ +be the answer to Q2. + +- If the answer to Q1 does not meet the ILI definition, then we report $$X=0$$. +- If the answer to Q1 does meet the ILI definition, then we report $$X = Z$$. + +This can only "over count" (result in too large estimates of) the true $$X$$ and +$$Y$$. For example, this happens when some members of the household experience +ILI that does not also qualify as CLI, while others experience CLI that does not +also qualify as ILI. In this case, for both $$X$$ and $$Y$$, our simple strategy +would return the sum of both types of cases. However, given the extreme degree +of overlap between the definitions of ILI and CLI, it is reasonable to believe +that, if symptoms across all household members qualified as both ILI and CLI, +each individual would have both, or neither---with neither being more common. +Therefore we do not consider this "over counting" phenomenon practically +problematic. 
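The "simple strategy" described in this section, and the over-counting case it mentions, can be sketched as follows. This is a hedged illustration under stated assumptions, not the project's implementation: `household_count` and the example household are hypothetical, and the Q1 answers are assumed to be recorded as booleans.

```python
def household_count(q1_meets_definition: bool, q2_num_sick: int) -> int:
    """Report Z (the Q2 answer) if Q1 meets the definition, otherwise 0."""
    return q2_num_sick if q1_meets_definition else 0


# Hypothetical two-person household: one member has fever + sore throat
# (ILI only), the other has fever + shortness of breath (CLI only).
# Q1 is answered at the household level, so both symptom sets are merged.
q1 = {"fever": True, "sore_throat": True, "cough": False,
      "shortness_of_breath": True, "difficulty_breathing": False}
z = 2  # Q2: both members have fever plus at least one other listed symptom

ili = q1["fever"] and (q1["sore_throat"] or q1["cough"])
cli = q1["fever"] and (q1["cough"] or q1["shortness_of_breath"]
                       or q1["difficulty_breathing"])

x_hat = household_count(ili, z)  # estimated household ILI count
y_hat = household_count(cli, z)  # estimated household CLI count
```

In this constructed case the true counts would be one ILI case and one CLI case, while the strategy reports `z` for both, which is the over-counting behavior the text describes.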
+ + +### Estimating Percent ILI and CLI + +[TODO check] + +Let $$x$$ and $$y$$ be the number of people with ILI and CLI, respectively, over +a given time period, and in a given location (for example, the time period being +a particular day, and a location being a particular state). Let $$n$$ be the +total number of people in this location. We are interested in estimating the +true ILI and CLI percentages, which we denote by $$p$$ and $$q$$, respectively: + +$$ +p = 100 \cdot \frac{x}{n} +\quad\text{and}\quad +q = 100 \cdot \frac{y}{n}. +$$ + +In a given aggregation unit (for example, daily-state), let $$X_i$$ and $$Y_i$$ +denote number of ILI and CLI cases in the household, respectively (computed +according to the simple strategy [described +above](#defining-household-ili-and-cli)), and let $$N_i$$ denote the total +number of people in the household, in survey $$i$$, out of $$m$$ surveys we +collected. Then our unweighted estimates of $$p$$ and $$q$$ are: + +$$ +\hat{p} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{X_i}{N_i} +\quad\text{and}\quad +\hat{q} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{Y_i}{N_i}. +$$ ### Smoothing +The smoothed versions of all `youtube-survey` signals (with `smoothed` prefix) are +calculated using seven day pooling. For example, the estimate reported for June +7 in a specific geographical area is formed by +collecting all surveys completed between June 1 and 7 (inclusive) and using that +data in the estimation procedures described above. ## Lag and Backfill +Lag is 1 day. Backfill continues for a couple days. + +[TODO more detail] ## Limitations @@ -111,16 +184,11 @@ limitations of this survey data. they are age 18 or older, they are currently located in the USA, and they are an active user of Youtube. The survey data does not report on children under age 18, and the Youtube adult user population may differ from the United - States population generally in important ways. 
We use our [survey - weighting](#survey-weighting-and-estimation) to adjust the estimates to match - age and gender demographics by state, but this process doesn't adjust for - other demographic biases we may not be aware of. + States population generally in important ways. We don't adjust for any + demographic biases. * **Non-response bias.** The survey is voluntary, and people who accept the invitation when it is presented to them on Youtube may be different from - those who do not. The [survey weights provided by - Youtube](#survey-weighting-and-estimation) attempt to model the probability - of response for each user and hence adjust for this, but it is difficult to - tell if these weights account for all possible non-response bias. + those who do not. * **Social desirability.** Previous survey research has shown that people's responses to surveys are often biased by what responses they believe are socially desirable or acceptable. For example, if it there is widespread @@ -129,13 +197,13 @@ limitations of this survey data. expect the social desirability effect to be smaller, but it may still be present. * **False responses.** As with anything on the Internet, a small percentage of - users give deliberately incorrect responses. We discard a small number of + users give deliberately incorrect responses. [TODO check if true] We discard a small number of responses that are obviously false, but do **not** perform extensive - filtering. However, the large size of the study, and our procedure for + filtering. However, the large size of the study, and [TODO check if true] our procedure for ensuring that each respondent can only be counted once when they are invited to take the survey, prevents individual respondents from having a large effect on results. -* **Repeat invitations.** Individual respondents can be invited by Youtube to +* **Repeat invitations.** [TODO check] Individual respondents can be invited by Youtube to take the survey several times. 
Usually Youtube only re-invites a respondent after one month. Hence estimates of values on a single day are calculated using independent survey responses from unique respondents (or, at least, @@ -151,16 +219,12 @@ even if point estimates are biased. ### Privacy Restrictions -To protect respondent privacy, we discard any estimate (whether at a county, -MSA, HRR, or state level) that is based on fewer than 100 survey responses. For +To protect respondent privacy, we discard any estimate that is based on fewer than 100 survey responses. For signals reported using a 7-day average (those beginning with `smoothed_`), this means a geographic area must have at least 100 responses in 7 days to be reported. -This affects some items more than others. For instance, items about vaccine -hesitancy reasons are only asked of respondents who are unvaccinated and -hesitant, not to all survey respondents. It also affects some geographic areas -more than others, particularly rural areas with low population densities. When -doing analysis of county-level data, one should be aware that missing counties -are typically more rural and less populous than those present in the data, which -may introduce bias into the analysis. \ No newline at end of file +This affects some items more than others. It affects some geographic areas +more than others, particularly areas with smaller populations. This effect is +less pronounced with smoothed signals, since responses are pooled across a +longer time period. 
From 42c60d0230f039f1a6d3e1c6ae9379829f2014b9 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Wed, 31 Jul 2024 19:17:28 -0400 Subject: [PATCH 06/10] symptom numbers for fb --- docs/api/covidcast-signals/fb-survey.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/api/covidcast-signals/fb-survey.md b/docs/api/covidcast-signals/fb-survey.md index 35ac7401b..fee18715b 100644 --- a/docs/api/covidcast-signals/fb-survey.md +++ b/docs/api/covidcast-signals/fb-survey.md @@ -126,7 +126,8 @@ our survey, this means a and (b or c). COVID-like illness or CLI is not a standard indicator. Through our discussions with the CDC, we chose to define it as: fever along with cough or shortness of -breath or difficulty breathing. +breath or difficulty breathing. From the list of symptoms from Q1 on +our survey, this means a and (c or d or e). Symptoms alone are not sufficient to diagnose influenza or coronavirus infections, and so these ILI and CLI indicators are *not* expected to be From 6e416bff99298027a3ac24c74693acef73d5eb38 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Thu, 1 Aug 2024 10:38:12 -0400 Subject: [PATCH 07/10] intro --- docs/api/covidcast-signals/youtube-survey.md | 35 ++++++++++++-------- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/docs/api/covidcast-signals/youtube-survey.md b/docs/api/covidcast-signals/youtube-survey.md index 0523a9f42..0b3a44400 100644 --- a/docs/api/covidcast-signals/youtube-survey.md +++ b/docs/api/covidcast-signals/youtube-survey.md @@ -8,7 +8,7 @@ grand_parent: COVIDcast Main Endpoint {: .no_toc} * **Source name:** `youtube-survey` -* **Earliest issue available:** April, 04, 2020 +* **Earliest issue available:** May 01, 2020 * **Number of data revisions since May 19, 2020:** 0 * **Date of last change:** Never * **Available for:** state (see [geography coding docs](../covidcast_geography.md)) @@ -19,22 +19,31 
@@ grand_parent: COVIDcast Main Endpoint This data source is based on a short survey about COVID-19-like illness run by the Delphi group at Carnegie Mellon. -Youtube directed a random sample of its users to these surveys, which were +[Youtube directed](https://9to5google.com/2020/04/29/google-covid-19-cmu-research-survey/) +a random sample of its users to these surveys, which were voluntary. Users age 18 or older were eligible to complete the surveys, and their survey responses are held by CMU. No individual survey responses are shared back to Youtube. -This survey was an early version of the [COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/), collecting data only about COVID-19 symptoms. CTIS is much longer-running and more detailed, also collecting belief and behavior data, and is recommended in most usecases. See our [surveys -page](https://delphi.cmu.edu/covid19/ctis/) for more detail about how CTIS works. - -[TODO note that indicators differ between the two surveys for unknown reasons] - -As of late April 2020, the number of Youtube survey responses we -received each day was 4-7 thousand. This was sparse at finer geographic levels, so this indicator only reports at the state level. The survey ran from April 21, 2020 to June -17, 2020, collecting about 159 thousand responses in the United States in that -time. - -We produce [influenza-like and COVID-like illness indicators](#ili-and-cli-indicators) based on the survey data. +This survey was a pared-down version of the +[COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/), +collecting data only about COVID-19 symptoms. CTIS is much longer-running +and more detailed, also collecting belief and behavior data. See our +[surveys page](https://delphi.cmu.edu/covid19/ctis/) for more detail +about how CTIS works. + +The two surveys report some of the same metrics. 
While nominally the same, +note that values from the same dates differ between the two surveys for +[unknown reasons](#limitations). + +As of late April 2020, the number of Youtube survey responses we received each +day was 4-7 thousand. This was not enough coverage to report at finer +geographic levels, so this indicator only reports at the state level. The +survey ran from April 21, 2020 to June 17, 2020, collecting about 159 +thousand responses in the United States in that time. + +We produce [influenza-like and COVID-like illness indicators](#ili-and-cli-indicators) +based on the survey data. ## Table of Contents {: .no_toc .text-delta} From d7903456e67c8f307a4e5d12bb132d0aa6801fd5 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Thu, 1 Aug 2024 12:03:04 -0400 Subject: [PATCH 08/10] link to fb-survey for calculation info; add detail --- docs/api/covidcast-signals/youtube-survey.md | 106 ++++--------------- 1 file changed, 21 insertions(+), 85 deletions(-) diff --git a/docs/api/covidcast-signals/youtube-survey.md b/docs/api/covidcast-signals/youtube-survey.md index 0b3a44400..9b7105171 100644 --- a/docs/api/covidcast-signals/youtube-survey.md +++ b/docs/api/covidcast-signals/youtube-survey.md @@ -4,6 +4,8 @@ parent: Inactive Signals grand_parent: COVIDcast Main Endpoint --- +[//]: # (code at https://github.com/cmu-delphi/covid-19/tree/deeb4dc1e9a30622b415361ef6b99198e77d2a94/youtube) + # Youtube Survey {: .no_toc} @@ -28,7 +30,8 @@ shared back to Youtube. This survey was a pared-down version of the [COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/), collecting data only about COVID-19 symptoms. CTIS is much longer-running -and more detailed, also collecting belief and behavior data. See our +and more detailed, also collecting belief and behavior data. CTIS also reports +demographic-corrected versions of some metrics. 
See our [surveys page](https://delphi.cmu.edu/covid19/ctis/) for more detail about how CTIS works. @@ -97,76 +100,14 @@ can be asymptomatic. Instead, we expect these indicators to be useful for comparison across the United States and across time, to determine where symptoms appear to be increasing. -**Smoothing.** The signals beginning with `smoothed` estimate the same quantities as their -`raw` partners, but are smoothed in time to reduce day-to-day sampling noise; -see [details below](#smoothing). Crucially, because the smoothed signals combine -information across multiple days, they have larger sample sizes and hence are -available for more locations than the raw signals. - - -### Defining Household ILI and CLI - -[TODO check] - -For a single survey, we are interested in the quantities: - -- $$X =$$ the number of people in the household with ILI; -- $$Y =$$ the number of people in the household with CLI; -- $$N =$$ the number of people in the household. - -Note that $$N$$ comes directly from the answer to Q3, but neither $$X$$ nor -$$Y$$ can be computed directly (because Q2 does not give an answer to the -precise symptomatic profile of all individuals in the household, it only asks -how many individuals have fever and at least one other symptom from the list). - -We hence estimate $$X$$ and $$Y$$ with the following simple strategy. Consider -ILI, without a loss of generality (we apply the same strategy to CLI). Let $$Z$$ -be the answer to Q2. - -- If the answer to Q1 does not meet the ILI definition, then we report $$X=0$$. -- If the answer to Q1 does meet the ILI definition, then we report $$X = Z$$. - -This can only "over count" (result in too large estimates of) the true $$X$$ and -$$Y$$. For example, this happens when some members of the household experience -ILI that does not also qualify as CLI, while others experience CLI that does not -also qualify as ILI. 
In this case, for both $$X$$ and $$Y$$, our simple strategy -would return the sum of both types of cases. However, given the extreme degree -of overlap between the definitions of ILI and CLI, it is reasonable to believe -that, if symptoms across all household members qualified as both ILI and CLI, -each individual would have both, or neither---with neither being more common. -Therefore we do not consider this "over counting" phenomenon practically -problematic. +## Estimation ### Estimating Percent ILI and CLI -[TODO check] - -Let $$x$$ and $$y$$ be the number of people with ILI and CLI, respectively, over -a given time period, and in a given location (for example, the time period being -a particular day, and a location being a particular state). Let $$n$$ be the -total number of people in this location. We are interested in estimating the -true ILI and CLI percentages, which we denote by $$p$$ and $$q$$, respectively: - -$$ -p = 100 \cdot \frac{x}{n} -\quad\text{and}\quad -q = 100 \cdot \frac{y}{n}. -$$ - -In a given aggregation unit (for example, daily-state), let $$X_i$$ and $$Y_i$$ -denote number of ILI and CLI cases in the household, respectively (computed -according to the simple strategy [described -above](#defining-household-ili-and-cli)), and let $$N_i$$ denote the total -number of people in the household, in survey $$i$$, out of $$m$$ surveys we -collected. Then our unweighted estimates of $$p$$ and $$q$$ are: - -$$ -\hat{p} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{X_i}{N_i} -\quad\text{and}\quad -\hat{q} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{Y_i}{N_i}. -$$ - +Estimates are calculated using the +[same method as CTIS](./fb-survey#estimating-percent-ili-and-cli). +However, the Youtube survey does not do weighting. ### Smoothing @@ -174,14 +115,16 @@ The smoothed versions of all `youtube-survey` signals (with `smoothed` prefix) a calculated using seven day pooling. 
For example, the estimate reported for June 7 in a specific geographical area is formed by collecting all surveys completed between June 1 and 7 (inclusive) and using that -data in the estimation procedures described above. - +data in the estimation procedures described above. Because the smoothed signals combine +information across multiple days, they have larger sample sizes and hence are +available for more locations than the raw signals. ## Lag and Backfill -Lag is 1 day. Backfill continues for a couple days. - -[TODO more detail] +This indicator has a lag of 2 days. Reported values can be revised for one +day (corresponding to a lag of 3 days), due to how we receive survey +responses. However, these tend to be associated with minimal changes in +value. ## Limitations @@ -205,19 +148,6 @@ limitations of this survey data. to answer that they *do*. This survey is anonymous and online, meaning we expect the social desirability effect to be smaller, but it may still be present. -* **False responses.** As with anything on the Internet, a small percentage of - users give deliberately incorrect responses. [TODO check if true] We discard a small number of - responses that are obviously false, but do **not** perform extensive - filtering. However, the large size of the study, and [TODO check if true] our procedure for - ensuring that each respondent can only be counted once when they are invited - to take the survey, prevents individual respondents from having a large effect - on results. -* **Repeat invitations.** [TODO check] Individual respondents can be invited by Youtube to - take the survey several times. Usually Youtube only re-invites a respondent - after one month. Hence estimates of values on a single day are calculated - using independent survey responses from unique respondents (or, at least, - unique Youtube accounts), whereas estimates from different months may involve - the same respondents. 
Whenever possible, you should compare this data to other independent sources. We believe that while these biases may affect point estimates -- that is, they may @@ -237,3 +167,9 @@ This affects some items more than others. It affects some geographic areas more than others, particularly areas with smaller populations. This affect is less pronounced with smoothed signals, since responses are pooled across a longer time period. + + +## Source and Licensing + +This indicator aggregates responses from a Delphi-run survey that is hosted on the Youtube platform. +The data is licensed as [CC BY-NC](../covidcast_licensing.md#creative-commons-attribution-noncommercial). From d3c295142ea2f49a69b57ebafec079eee108ed2c Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Fri, 2 Aug 2024 09:56:52 -0400 Subject: [PATCH 09/10] Update Google Docs Meta Data (#1512) Co-authored-by: melange396 --- src/server/endpoints/covidcast_utils/db_signals.csv | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/server/endpoints/covidcast_utils/db_signals.csv b/src/server/endpoints/covidcast_utils/db_signals.csv index b14d477ae..9844a3ffb 100644 --- a/src/server/endpoints/covidcast_utils/db_signals.csv +++ b/src/server/endpoints/covidcast_utils/db_signals.csv @@ -1575,7 +1575,7 @@ NSSP does not report county-level data for all counties with reporting EDs; some The following states report no data through NSSP at the county level: CA, WA, AK, AZ, AL, CO, SD, ND, MO, AR, FL, OH, NH, CT, NJ. 
South Dakota, Missouri, and territories report no data through NSSP at the state level.",Percentage,percent,other,bad,FALSE,FALSE,FALSE,FALSE,FALSE,,,,,, -nssp,pct_ed_visits_rsv,FALSE,pct_ed_visits_rsv,FALSE,COVID Emergency Department Visits (Percent of total ED visits),TRUE,Percent of ED visits that had a discharge diagnosis code of rsv,Percent of ED visits that had a discharge diagnosis code of rsv,National Syndromic Surveillance Program,rsv,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. +nssp,pct_ed_visits_rsv,FALSE,pct_ed_visits_rsv,FALSE,RSV Emergency Department Visits (Percent of total ED visits),TRUE,Percent of ED visits that had a discharge diagnosis code of rsv,Percent of ED visits that had a discharge diagnosis code of rsv,National Syndromic Surveillance Program,rsv,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. NSSP does not report county-level data for all counties with reporting EDs; some reporting EDs are only included in state-level values. 
@@ -1596,14 +1596,14 @@ NSSP does not report county-level data for all counties with reporting EDs; some The following states report no data through NSSP at the county level: CA, WA, AK, AZ, AL, CO, SD, ND, MO, AR, FL, OH, NH, CT, NJ. South Dakota, Missouri, and territories report no data through NSSP at the state level.",Percentage,percent,other,bad,TRUE,FALSE,FALSE,FALSE,FALSE,,,,,, -nssp,pct_ed_visits_influenza,TRUE,smoothed_pct_ed_visits_influenza,FALSE,Influenza Emergency Department Visits (Percent of total ED visits),TRUE,3-week moving average of percent of ED visits that had a discharge diagnosis code of influenza,3-week moving average of percent of ED visits that had a discharge diagnosis code of influenza,National Syndromic Surveillance Program,flu,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. +nssp,pct_ed_visits_influenza,TRUE,smoothed_pct_ed_visits_influenza,FALSE,"Influenza Emergency Department Visits (Percent of total ED visits, 3-week average)",TRUE,3-week moving average of percent of ED visits that had a discharge diagnosis code of influenza,3-week moving average of percent of ED visits that had a discharge diagnosis code of influenza,National Syndromic Surveillance Program,flu,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. 
California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. NSSP does not report county-level data for all counties with reporting EDs; some reporting EDs are only included in state-level values. The following states report no data through NSSP at the county level: CA, WA, AK, AZ, AL, CO, SD, ND, MO, AR, FL, OH, NH, CT, NJ. South Dakota, Missouri, and territories report no data through NSSP at the state level.",Percentage,percent,other,bad,TRUE,FALSE,FALSE,FALSE,FALSE,,,,,, -nssp,pct_ed_visits_rsv,TRUE,smoothed_pct_ed_visits_rsv,FALSE,COVID Emergency Department Visits (Percent of total ED visits),TRUE,3-week moving average of percent of ED visits that had a discharge diagnosis code of rsv,3-week moving average of percent of ED visits that had a discharge diagnosis code of rsv,National Syndromic Surveillance Program,rsv,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. 
+nssp,pct_ed_visits_rsv,TRUE,smoothed_pct_ed_visits_rsv,FALSE,"RSV Emergency Department Visits (Percent of total ED visits, 3-week average)",TRUE,3-week moving average of percent of ED visits that had a discharge diagnosis code of rsv,3-week moving average of percent of ED visits that had a discharge diagnosis code of rsv,National Syndromic Surveillance Program,rsv,Hospitalizations,USA,"county,state,hrr,msa","hrr,msa",2022-10-01,,ongoing,,week,Week,weekly,,,All,None,hospitalized,,"Data is available for 78% of US emergency departments. California, Colorado, Missouri, Oklahoma, and Virginia have the most noticeable gaps in coverage, with many counties in those states having either no eligible EDs or having no recently reported data in NSSP. However, most states have some counties that do not contain any reporting EDs. NSSP does not report county-level data for all counties with reporting EDs; some reporting EDs are only included in state-level values. From 3ca502f5bd3a4387f47c71bd1722c07054e157a3 Mon Sep 17 00:00:00 2001 From: melange396 Date: Thu, 22 Aug 2024 16:27:21 +0000 Subject: [PATCH 10/10] chore: release delphi-epidata 4.1.26 --- .bumpversion.cfg | 2 +- dev/local/setup.cfg | 2 +- src/client/delphi_epidata.R | 2 +- src/client/delphi_epidata.js | 2 +- src/client/packaging/npm/package.json | 2 +- src/server/_config.py | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/.bumpversion.cfg b/.bumpversion.cfg index fa394e9d6..835b767fa 100644 --- a/.bumpversion.cfg +++ b/.bumpversion.cfg @@ -1,5 +1,5 @@ [bumpversion] -current_version = 4.1.25 +current_version = 4.1.26 commit = False tag = False diff --git a/dev/local/setup.cfg b/dev/local/setup.cfg index dd30723a4..a5587416d 100644 --- a/dev/local/setup.cfg +++ b/dev/local/setup.cfg @@ -1,6 +1,6 @@ [metadata] name = Delphi Development -version = 4.1.25 +version = 4.1.26 [options] packages = diff --git a/src/client/delphi_epidata.R b/src/client/delphi_epidata.R index fd461de00..665522b58 100644 --- 
a/src/client/delphi_epidata.R +++ b/src/client/delphi_epidata.R @@ -15,7 +15,7 @@ Epidata <- (function() { # API base url BASE_URL <- getOption('epidata.url', default = 'https://api.delphi.cmu.edu/epidata/') - client_version <- '4.1.25' + client_version <- '4.1.26' auth <- getOption("epidata.auth", default = NA) diff --git a/src/client/delphi_epidata.js b/src/client/delphi_epidata.js index 7afa235c0..b403eec73 100644 --- a/src/client/delphi_epidata.js +++ b/src/client/delphi_epidata.js @@ -22,7 +22,7 @@ } })(this, function (exports, fetchImpl, jQuery) { const BASE_URL = "https://api.delphi.cmu.edu/epidata/"; - const client_version = "4.1.25"; + const client_version = "4.1.26"; // Helper function to cast values and/or ranges to strings function _listitem(value) { diff --git a/src/client/packaging/npm/package.json b/src/client/packaging/npm/package.json index c88d0c6ec..2eb3d4972 100644 --- a/src/client/packaging/npm/package.json +++ b/src/client/packaging/npm/package.json @@ -2,7 +2,7 @@ "name": "delphi_epidata", "description": "Delphi Epidata API Client", "authors": "Delphi Group", - "version": "4.1.25", + "version": "4.1.26", "license": "MIT", "homepage": "https://github.com/cmu-delphi/delphi-epidata", "bugs": { diff --git a/src/server/_config.py b/src/server/_config.py index 7ca9ae486..17c0b79b6 100644 --- a/src/server/_config.py +++ b/src/server/_config.py @@ -7,7 +7,7 @@ load_dotenv() -VERSION = "4.1.25" +VERSION = "4.1.26" MAX_RESULTS = int(10e6) MAX_COMPATIBILITY_RESULTS = int(3650)