Google symptoms dap #28

Open · wants to merge 43 commits into main

43 commits
27302bd
Initial comparison of GHT indicator and GS anosmia + ageusia
nmdefries Sep 30, 2020
9a13562
Dig into GS missingness
nmdefries Oct 6, 2020
eb1362f
bug fix
nmdefries Oct 6, 2020
b863b62
Adding tests from symptoms survey blog post
nmdefries Oct 6, 2020
7dae33a
Adding tests from symptoms survey blog post
nmdefries Oct 6, 2020
c2147f4
Added additional alternative models to compare usefulness of GS vs GHT
nmdefries Oct 8, 2020
3571c40
Alternative models run with gurobi or read saved data
nmdefries Oct 9, 2020
051a1bc
Comparison of 4 models, with GS A+A and GHT. Supporting dataframes pr…
nmdefries Oct 12, 2020
a8dbfd2
Knit output for only cases vs GS A+A
nmdefries Oct 15, 2020
3cf1eaf
Knit output for cases vs cases + GS A+A vs cases + GHT vs cases + GS …
nmdefries Oct 15, 2020
ea7434e
Separated forecasting error by days ahead into pairwise model compari…
nmdefries Oct 15, 2020
37bacdb
Fixed days-ahead bug
nmdefries Oct 16, 2020
77f36d3
More commentary on days-ahead graphs
nmdefries Oct 16, 2020
43fc7ee
add file looking at cases vs cases + symptoms models at the county level
nmdefries Oct 27, 2020
4013181
remove extraneous files
nmdefries Nov 3, 2020
15da6cc
Add correlation comparison
Nov 11, 2020
8e7a794
add predictive power comparison
Nov 12, 2020
51c264e
rm the old combined predictive power comparison script
Nov 12, 2020
77cd4aa
add county level predictive power comparison
Nov 12, 2020
c319115
add county level comparison html
Nov 12, 2020
6e6595c
add msa level predictive power comparison
Nov 12, 2020
e842b88
add html version for msa level predictive power comparison
Nov 12, 2020
aca2ce9
remove geo-wise correlation, re-run time-series correlation analysis
Nov 17, 2020
4c8a18a
add raw error plot for days ahead
nmdefries Nov 17, 2020
8b94fd1
add exploration for other symptoms
Nov 23, 2020
820a256
fixed typos
Nov 23, 2020
5156788
fixed an error in the title of figures
Nov 23, 2020
44253b9
replace states with MSAs
Nov 23, 2020
1b47981
add predictive power analysis considering larger geographic scope
Dec 14, 2020
5701f8d
add scripts for sensorization
Jan 20, 2021
6c73020
add script for calculating rank correlations
Jan 20, 2021
8f828ad
add v0 report
Jan 29, 2021
642ab95
upload appendix for correlation analysis
Feb 2, 2021
ce9c391
update data source
Feb 3, 2021
494a2bf
update scripts
Feb 3, 2021
7621982
add appendix2
Feb 4, 2021
705936b
upload final report
Feb 4, 2021
540ff90
fixed an error
Feb 4, 2021
76e9b01
update docs
Feb 4, 2021
f8bb9bb
relocate scripts
Feb 4, 2021
63960e1
delete scripts outside the scripts folder
Feb 4, 2021
44c9d77
added explanations and fixed errors
Mar 4, 2021
ef0da2d
Add details for the issue of as_of date
Mar 4, 2021
1,159 changes: 1,159 additions & 0 deletions google_symptoms_exploration/03_compare_gs_ght.Rmd


695 changes: 695 additions & 0 deletions google_symptoms_exploration/03_compare_gs_ght.html


506 changes: 506 additions & 0 deletions google_symptoms_exploration/04_choose_symptoms.Rmd


305 changes: 305 additions & 0 deletions google_symptoms_exploration/05_correlations_comparison.Rmd
---
title: "Correlation analyses for Google-Symptoms"
author: "Jingjing"
date: "11/10/2020"
output:
html_document:
code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, cache=TRUE)
```

### County Level

#### Getting data from API

```{r, echo = FALSE, message = FALSE, cache=TRUE}
library(covidcast)
library(dplyr)
library(ggplot2)

# Fetch the following sources and signals from the API
sources = c("doctor-visits", "fb-survey", "fb-survey", "hospital-admissions")
signals = c("smoothed_adj_cli", "smoothed_cli", "smoothed_hh_cmnty_cli",
"smoothed_adj_covid19")
names = c("Doctor visits", "Facebook CLI", "Facebook CLI-in-community",
"Hospitalizations")

start_day = "2020-04-15"
end_day = NULL

df_signals = vector("list", length(signals))
for (i in 1:length(signals)) {
df_signals[[i]] = covidcast_signal(sources[i], signals[i], start_day, end_day)
}

# Fetch USAFacts confirmed case incidence proportion (smoothed with 7-day
# trailing average)
df_cases = covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
start_day, end_day)
n = length(signals) + 3
```

#### Get GS data
Read the locally saved smoothed search signals from Google-Symptoms: anosmia, ageusia, and combined_symptoms.

```{r, message=FALSE, warning=FALSE}
library(stringr)
dir = "../test_files/google-symptoms/"
df_signals[[n-2]] = read.csv(paste(dir,"county_ageusia_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date(start_day))
df_signals[[n-2]]$geo_value = str_pad(df_signals[[n-2]]$geo_value, width=5, side="left", pad="0")
sources[n-2] = "google-symptoms"
signals[n-2] = "ageusia_smoothed_search"
names[n-2] = "GS Ageusia Smoothed Search"

df_signals[[n-1]] = read.csv(paste(dir,"county_anosmia_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date(start_day))
df_signals[[n-1]]$geo_value = str_pad(df_signals[[n-1]]$geo_value, width=5, side="left", pad="0")
sources[n-1] = "google-symptoms"
signals[n-1] = "anosmia_smoothed_search"
names[n-1] = "GS Anosmia Smoothed Search"

df_signals[[n]] = read.csv(paste(dir,"county_combined_symptoms_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date(start_day))
df_signals[[n]]$geo_value = str_pad(df_signals[[n]]$geo_value, width=5, side="left", pad="0")
sources[n] = "google-symptoms"
signals[n] = "combined_symptoms_smoothed_search"
names[n] = "GS A+A Smoothed Search"

```
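The three blocks above differ only in the symptom name; a small helper would remove the duplication. A sketch (the helper name and its reuse are hypothetical; it assumes only the file naming convention already used above):

```{r, eval=FALSE}
# Hypothetical helper: read one locally saved GS signal, restrict to the
# analysis window, and zero-pad the FIPS codes to 5 characters.
read_gs_signal <- function(dir, geo, symptom, start_day) {
  df = read.csv(paste0(dir, geo, "_", symptom, "_smoothed_search.csv"),
                colClasses = c("geo_value" = "character", "time_value" = "Date")) %>%
    filter(time_value >= as.Date(start_day))
  df$geo_value = str_pad(df$geo_value, width = 5, side = "left", pad = "0")
  df
}

# Example reuse:
# df_signals[[n-1]] = read_gs_signal(dir, "county", "anosmia", start_day)
```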

```{r}
# Consider only counties with at least 500 cumulative cases
case_num = 500
geo_values = covidcast_signal("usa-facts", "confirmed_cumulative_num",
max(df_cases$time_value),
max(df_cases$time_value)) %>%
filter(value >= case_num) %>% pull(geo_value)
```

#### Correlations sliced by time (Geo-wise correlation)

Here we look at Spearman (rank) correlations between our signals and COVID-19 case incidence rates, sliced by time. That is, for each day, we compute the correlation between each signal and COVID-19 case incidence rates over all counties (with at least 500 cumulative cases).

```{r, message=FALSE, warning=FALSE, eval=FALSE}
df_cor = vector("list", n)
for (i in 1:n) {
df_cor[[i]] = covidcast_cor(df_signals[[i]] %>%
filter(geo_value %in% geo_values),
df_cases %>%
filter(geo_value %in% geo_values),
by = "time_value", method = "spearman")
df_cor[[i]]$signal = names[i]
}
df = do.call(rbind, df_cor)

ggplot(df, aes(x = time_value, y = value)) +
geom_line(aes(color = signal)) +
guides(color = guide_legend(nrow = 2)) +
labs(title = "Correlation between signals and case rates",
subtitle = sprintf("Over all counties with at least %i cumulative cases",
case_num), x = "Date", y = "Correlation") +
theme(legend.position = "bottom", legend.title = element_blank())


# The google-symptoms signals obtain relatively high county-level geo-wise correlations compared with the other signals, and there is no big difference among the 3 google-symptoms signals. This shows that though only ~100 counties are available, the estimates are of high quality within those areas. There are two sudden drops, in June and August, which are worth further exploration.
```

Note that the search volume is normalized by population and scaled by the maximum popularity across a specific time range within each geographic region, so the values are not comparable between regions. As a result, the geo-wise correlation is not meaningful for this signal, and the chunk above is left unevaluated.
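A toy illustration of why between-region comparisons break down (a sketch with made-up numbers; the scaling scheme is a simplification of the normalization described above):

```{r, eval=FALSE}
# Two regions with identical underlying search interest, each scaled by its
# own maximum over the scaling window.
raw_a = c(10, 20, 40)          # region A, arbitrary units
raw_b = c(10, 20, 40)          # region B, identical underlying interest
scaled_a = raw_a / 40 * 100    # region A's max is 40 -> 25, 50, 100
scaled_b = raw_b / 80 * 100    # region B's max is 80 -> 12.5, 25, 50
# On any fixed day the reported values differ even though the underlying
# interest is the same, so correlations computed across regions at a fixed
# time are not meaningful for this signal.
```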


#### Correlations sliced by county (Time-series correlation)
Now we look at Spearman (rank) correlations between our signals and COVID-19 case incidence rates, sliced by county. That is, for each county (with at least 500 cumulative cases), we compute the correlation between each signal and COVID-19 case incidence rates, over all time.

```{r, message=FALSE, warning=FALSE}
df_cor = vector("list", length(signals))
for (i in 1:length(signals)) {
df_cor[[i]] = covidcast_cor(df_signals[[i]] %>%
filter(geo_value %in% geo_values),
df_cases %>%
filter(geo_value %in% geo_values),
by = "geo_value", method = "spearman")
df_cor[[i]]$signal = names[i]
}
df = do.call(rbind, df_cor)

ggplot(df, aes(value)) +
geom_density(aes(color = signal, fill = signal), alpha = 0.4) +
guides(color = guide_legend(nrow = 2)) +
labs(title = "Correlation between signals and case rates",
subtitle = sprintf("Over all counties with at least %i cumulative cases",
case_num), x = "Correlation", y = "density") +
theme(legend.position = "bottom", legend.title = element_blank())
```

According to the density curves of the correlations for the different signals, the GS smoothed search signals obtain high time-series correlations for a large proportion of the regions with available estimates. The curves for GS A+A smoothed search and GS ageusia smoothed search are even more left-skewed than that of Facebook CLI-in-community.


```{r}
# Count the counties with at least one non-missing value for each GS signal
num_county_ageusia = length(unique(df_signals[[5]][!is.na(df_signals[[5]]$value),]$geo_value))
num_county_anosmia = length(unique(df_signals[[6]][!is.na(df_signals[[6]]$value),]$geo_value))
```

Only 91 counties are available for ageusia-related searches, while 109 counties are available for anosmia-related searches and the A+A search signal. Most of the available counties achieve good time-series correlations with the case data.
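One way to quantify "good" here is the share of available counties whose correlation exceeds some cutoff; a sketch, using an arbitrary cutoff of 0.5 (indices 5:7 are the GS signals in `df_cor`):

```{r, eval=FALSE}
# Share of available counties with time-series correlation above 0.5
for (i in 5:length(signals)) {
  share = mean(df_cor[[i]]$value > 0.5, na.rm = TRUE)
  cat(sprintf("%s: %.0f%% of available counties above 0.5\n",
              names[i], 100 * share))
}
```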

We can also look at choropleth maps to get a geographic sense of the correlation distribution for each signal.

```{r, message=FALSE, warning=FALSE}
# Coerce the correlation data frames into covidcast_signal objects so that
# the covidcast plot method can draw county choropleths of the correlations.
for (i in 5:length(signals)) {
  df_cor[[i]]$time_value = start_day
  df_cor[[i]]$issue = start_day
  attributes(df_cor[[i]])$metadata$geo_type = "county"
  class(df_cor[[i]]) = c("covidcast_signal", "data.frame")

  print(plot(df_cor[[i]], range = c(-1, 1), choro_col = cm.colors(10),
             title = sprintf("Correlations for %s", names[i])))
}
```


### Metro area analysis

#### Getting data from API
We fetch various signals from our API, from April 15 through the current day.

```{r, message=FALSE, warning=FALSE, cache=TRUE}
# Fetch the following sources and signals from the API
sources = c("doctor-visits", "fb-survey", "fb-survey", "ght",
"hospital-admissions", "indicator-combination")
signals = c("smoothed_adj_cli", "smoothed_cli", "smoothed_hh_cmnty_cli",
"smoothed_search", "smoothed_adj_covid19",
"nmf_day_doc_fbc_fbs_ght")
names = c("Doctor visits", "Facebook CLI", "Facebook CLI-in-community",
"Google trends", "Hospitalizations", "Combo indicator")

start_day = "2020-04-15"
end_day = NULL

df_signals = vector("list", length(signals))
for (i in 1:length(signals)) {
df_signals[[i]] = covidcast_signal(sources[i], signals[i], start_day, end_day,
geo_type = "msa")
}

# Fetch USAFacts confirmed case incidence proportion (smoothed with 7-day
# trailing average)
df_cases = covidcast_signal("usa-facts", "confirmed_7dav_incidence_prop",
start_day, end_day, geo_type = "msa")
n = length(signals) + 3
```

#### Get GS data

```{r, message=FALSE, warning=FALSE}
library(stringr)
dir = "../test_files/google-symptoms/"
df_signals[[n-2]] = read.csv(paste(dir,"msa_ageusia_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date("2020-04-15"))
df_signals[[n-2]]$geo_value = str_pad(df_signals[[n-2]]$geo_value, width=5, side="left", pad="0")
sources[n-2] = "google-symptoms"
signals[n-2] = "ageusia_smoothed_search"
names[n-2] = "GS Ageusia Smoothed Search"

df_signals[[n-1]] = read.csv(paste(dir,"msa_anosmia_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date("2020-04-15"))
df_signals[[n-1]]$geo_value = str_pad(df_signals[[n-1]]$geo_value, width=5, side="left", pad="0")
sources[n-1] = "google-symptoms"
signals[n-1] = "anosmia_smoothed_search"
names[n-1] = "GS Anosmia Smoothed Search"

df_signals[[n]] = read.csv(paste(dir,"msa_combined_symptoms_smoothed_search.csv", sep = ""),
colClasses=c("geo_value"="character", "time_value"="Date")) %>%
filter(time_value >= as.Date("2020-04-15"))
df_signals[[n]]$geo_value = str_pad(df_signals[[n]]$geo_value, width=5, side="left", pad="0")
sources[n] = "google-symptoms"
signals[n] = "combined_symptoms_smoothed_search"
names[n] = "GS A+A Smoothed Search"
```

```{r, message=FALSE, warning=FALSE}
# Consider only metro areas with at least 500 cumulative cases
case_num = 500
geo_values = covidcast_signal("usa-facts", "confirmed_cumulative_num",
max(df_cases$time_value),
max(df_cases$time_value),
geo_type = "msa") %>%
filter(value >= case_num) %>% pull(geo_value)

```

#### Correlations sliced by time (Geo-wise correlation)
Here we look at Spearman (rank) correlations between our signals and COVID-19 case incidence rates, sliced by time. That is, for each day, we compute the correlation between each signal and COVID-19 case incidence rates, over all metro areas (with at least 500 cumulative cases).

```{r, message=FALSE, warning=FALSE, eval=FALSE}

df_cor = vector("list", length(signals))
for (i in 1:length(signals)) {
df_cor[[i]] = covidcast_cor(df_signals[[i]] %>%
filter(geo_value %in% geo_values),
df_cases %>%
filter(geo_value %in% geo_values),
by = "time_value", method = "spearman")
df_cor[[i]]$signal = names[i]
}
df = do.call(rbind, df_cor)

ggplot(df, aes(x = time_value, y = value)) +
geom_line(aes(color = signal)) +
guides(color = guide_legend(nrow = 3)) +
labs(title = "Correlation between signals and case rates",
subtitle = sprintf("Over metro areas with at least %i cumulative cases",
case_num), x = "Date", y = "Correlation") +
theme(legend.position = "bottom", legend.title = element_blank())

# The google-symptoms signals do not obtain very competitive geo-wise correlations at the MSA level, which is intuitive since only ~100 counties are available for ~60 MSAs. Remember that we aggregate the county-level search volume by population-weighted average. Counties that fail to meet the privacy or quality thresholds contribute a value of 0 instead of NA during the aggregation from county level to MSA level. Thus, our MSA-level estimates are not the `actual` ones but are biased to some extent wherever a large proportion of counties within an MSA have missing values.
```
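A toy illustration of the zero-fill bias described in the comment above (a sketch with made-up populations and search values):

```{r, eval=FALSE}
# Hypothetical MSA of three counties; county 3 fails the reporting threshold.
pop    = c(500000, 300000, 200000)
search = c(40, 30, NA)                    # NA = suppressed county value

weighted.mean(search, pop, na.rm = TRUE)  # dropping the missing county: 36.25

search0 = ifelse(is.na(search), 0, search)
weighted.mean(search0, pop)               # zero-fill aggregation: 29, biased down
```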


As at the county level, the search volume is scaled within each region, so the values are not comparable between regions and the geo-wise correlation is not meaningful; the chunk above is likewise left unevaluated.


#### Correlations sliced by metro area (Time-series correlation)
Now we look at Spearman (rank) correlations between our signals and COVID-19 case incidence rates, sliced by metro area. That is, for each metro area (with at least 500 cumulative cases), we compute the correlation between each signal and COVID-19 case incidence rates, over all time.

```{r, message=FALSE, warning=FALSE}
df_cor = vector("list", length(signals))
for (i in 1:length(signals)) {
df_cor[[i]] = covidcast_cor(df_signals[[i]] %>%
filter(geo_value %in% geo_values),
df_cases %>%
filter(geo_value %in% geo_values),
by = "geo_value", method = "spearman")
df_cor[[i]]$signal = names[i]
}
df = do.call(rbind, df_cor)

ggplot(df, aes(value)) +
geom_density(aes(color = signal, fill = signal), alpha = 0.4) +
guides(color = guide_legend(nrow = 3)) +
labs(title = "Correlation between signals and case rates",
subtitle = sprintf("Over metro areas with at least %i cumulative cases",
case_num), x = "Correlation", y = "density") +
theme(legend.position = "bottom", legend.title = element_blank())
```

The time-series correlations at the MSA level are not as good as those at the county level. In general, though, the GS signals still obtain high correlations for a large proportion of the MSAs with available estimates, and their pdf curves are still highly left-skewed.


First we count how many MSAs have available estimates for each GS signal; then we can look at choropleth maps to get a geographic sense of the correlation distribution.

```{r, message=FALSE, warning=FALSE}
# Count the MSAs with at least one non-missing value for each GS signal
num_msa_ageusia = length(unique(df_signals[[7]][!is.na(df_signals[[7]]$value),]$geo_value))
num_msa_anosmia = length(unique(df_signals[[8]][!is.na(df_signals[[8]]$value),]$geo_value))
```

GS anosmia smoothed search has 64 MSAs available while GS ageusia smoothed search has 54 MSAs available.
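The choropleth chunk from the county section can be reused here; a sketch, assuming the covidcast plot method handles `geo_type = "msa"` the same way (indices 7:9 are the GS signals in `df_cor`):

```{r, eval=FALSE}
# Coerce the MSA correlation data frames into covidcast_signal objects so
# the covidcast plot method can draw choropleths of the correlations.
for (i in 7:length(signals)) {
  df_cor[[i]]$time_value = start_day
  df_cor[[i]]$issue = start_day
  attributes(df_cor[[i]])$metadata$geo_type = "msa"
  class(df_cor[[i]]) = c("covidcast_signal", "data.frame")

  print(plot(df_cor[[i]], range = c(-1, 1), choro_col = cm.colors(10),
             title = sprintf("Correlations for %s", names[i])))
}
```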
552 changes: 552 additions & 0 deletions google_symptoms_exploration/05_correlations_comparison.html

