diff --git a/.Rbuildignore b/.Rbuildignore new file mode 100644 index 00000000..91114bf2 --- /dev/null +++ b/.Rbuildignore @@ -0,0 +1,2 @@ +^.*\.Rproj$ +^\.Rproj\.user$ diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..f4f606b0 --- /dev/null +++ b/.gitignore @@ -0,0 +1,5 @@ +.Rproj.user +.Rhistory +.RData +.Ruserdata +*.Rproj diff --git a/DESCRIPTION b/DESCRIPTION index 8703c98c..791d2def 100755 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -4,6 +4,9 @@ Title: Tools for basic signal processing in epidemiology Version: 1.0.0 Authors@R: c( + person(given = "Daniel", + family = "McDonald", + role = "ctb"), person(given = "Ryan", family = "Tibshirani", role = c("aut", "cre"), diff --git a/docs/articles/correlations.html b/docs/articles/correlations.html index cf644f1e..6380ca1b 100644 --- a/docs/articles/correlations.html +++ b/docs/articles/correlations.html @@ -135,7 +135,8 @@

z1 <- sliced_cor(x, y, by = "time_value") -ggplot(z1, aes(x = time_value, y = value)) + geom_line() + +ggplot(z1, aes(x = time_value, y = value)) + + geom_line() + scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Correlation")

@@ -191,7 +192,8 @@

z %>% group_by(lag) %>% summarize(mean = Mean(value)) %>% - ggplot(aes(x = lag, y = mean)) + geom_line() + geom_point() + + ggplot(aes(x = lag, y = mean)) + + geom_line() + geom_point() + labs(x = "Lag", y = "Mean correlation")

We can see that some pretty clear curvature here in the mean correlation between case and death rates (where the correlations come from slicing by location) as a function of lag. The maximum occurs at a lag of somewhere around 17 days.

diff --git a/docs/articles/correlations_files/figure-html/unnamed-chunk-2-1.png b/docs/articles/correlations_files/figure-html/unnamed-chunk-2-1.png index a4ada68b..d2373ef1 100644 Binary files a/docs/articles/correlations_files/figure-html/unnamed-chunk-2-1.png and b/docs/articles/correlations_files/figure-html/unnamed-chunk-2-1.png differ diff --git a/docs/articles/correlations_files/figure-html/unnamed-chunk-3-1.png b/docs/articles/correlations_files/figure-html/unnamed-chunk-3-1.png index 972b456e..3f694b33 100644 Binary files a/docs/articles/correlations_files/figure-html/unnamed-chunk-3-1.png and b/docs/articles/correlations_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/articles/correlations_files/figure-html/unnamed-chunk-4-1.png b/docs/articles/correlations_files/figure-html/unnamed-chunk-4-1.png index 1e261d8b..083fd3d4 100644 Binary files a/docs/articles/correlations_files/figure-html/unnamed-chunk-4-1.png and b/docs/articles/correlations_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/articles/correlations_files/figure-html/unnamed-chunk-5-1.png b/docs/articles/correlations_files/figure-html/unnamed-chunk-5-1.png index 67f4ea61..d38249a0 100644 Binary files a/docs/articles/correlations_files/figure-html/unnamed-chunk-5-1.png and b/docs/articles/correlations_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/docs/articles/derivatives.html b/docs/articles/derivatives.html index 3133f891..1e41cce7 100644 --- a/docs/articles/derivatives.html +++ b/docs/articles/derivatives.html @@ -136,7 +136,6 @@

## 10 5.72 fl 2020-06-10 2020-10-30 0.308

We can see that a column deriv has been added to the output data frame, which contains the derivative estimates. Below we visualize these estimates in tandem with the signal itself. The purple ticks on the x-axis mark time points at which the derivative estimate exceeds a threshold (arbitrarily chosen) of 0.25. These seem to roughly but reasonably mark times of upswing in the underlying signal.

library(ggplot2)
-library(gridExtra)
 theme_set(theme_bw())
 threshold = 0.25
 
@@ -153,7 +152,7 @@ 

scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Derivative (linear regression)") -grid.arrange(p1, p2, nrow = 1)

+gridExtra::grid.arrange(p1, p2, nrow = 1)

@@ -179,7 +178,7 @@

scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Derivative (smoothing spline)") -grid.arrange(p1, p2, nrow = 1)

+gridExtra::grid.arrange(p1, p2, nrow = 1)

The estimated derivates—in purple for the smoothing spline with 8 degrees of freedom of 8, and in green for the one tuned by cross-validation—appear less smooth than those above, from linear regression. Using cross-validation offers more adaptivity to the time-varying level of smoothness, but this can sometimes result in erratic derivative estimates (big green spikes in the right plot).

@@ -206,7 +205,7 @@

scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Derivative (smoothing spline)") -grid.arrange(p1, p2, nrow = 1) +gridExtra::grid.arrange(p1, p2, nrow = 1)

The estimated derivates now appear a bit smoother than the last ones, from the smoothing spline methods. Again, using cross-validation offers a noticeable improvement in adapting to to the time-varying level of smoothness, and also does not appear to be suffering from the same volatility as in the smoothing spline case.

diff --git a/docs/articles/epitools.html b/docs/articles/epitools.html index f713d5fd..437d925d 100644 --- a/docs/articles/epitools.html +++ b/docs/articles/epitools.html @@ -127,8 +127,7 @@

issue: the time value at which the given signal value was issued.

A data frame or tibble can be converted into an object of class epi_signal, using as.epi_signal(), provided that it has (at least) the first 3 columns in the above list (if an issue column is not present, then today’s date is used for the issue dates). To learn more about the epi_signal format, you can read the documentation for as.epi_signal().

-

As an example, we’ll look at daily cumulative COVID-19 cases for 4 states (CA, FL, NY, and TX) in the U.S., over a year from mid 2020 to mid 2021, using the covidcast package to
-fetch this data from the COVIDcast API.

+

As an example, we’ll look at daily cumulative COVID-19 cases for 4 states (CA, FL, NY, and TX) in the U.S., over a year from mid 2020 to mid 2021, using the covidcast package to fetch this data from the COVIDcast API.

library(covidcast)
 
 case_data <- covidcast_signal(data_source = "jhu-csse",
@@ -233,12 +232,12 @@ 

## # A tibble: 6 × 4
 ##   value geo_value time_value issue     
 ##   <int> <chr>     <date>     <date>    
-## 1     1 ca        2003-02-23 2021-10-14
-## 2     0 ca        2003-02-24 2021-10-14
-## 3     0 ca        2003-02-25 2021-10-14
-## 4     1 ca        2003-02-26 2021-10-14
-## 5     0 ca        2003-02-27 2021-10-14
-## 6     1 ca        2003-02-28 2021-10-14
+## 1 1 ca 2003-02-23 2021-10-19 +## 2 0 ca 2003-02-24 2021-10-19 +## 3 0 ca 2003-02-25 2021-10-19 +## 4 1 ca 2003-02-26 2021-10-19 +## 5 0 ca 2003-02-27 2021-10-19 +## 6 1 ca 2003-02-28 2021-10-19

ggplot(x, aes(x = time_value, y = value)) +
   geom_col(aes(y = value)) +
   scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
diff --git a/docs/articles/epitools_files/figure-html/unnamed-chunk-7-1.png b/docs/articles/epitools_files/figure-html/unnamed-chunk-7-1.png
new file mode 100644
index 00000000..fac364a4
Binary files /dev/null and b/docs/articles/epitools_files/figure-html/unnamed-chunk-7-1.png differ
diff --git a/docs/articles/pct-change.html b/docs/articles/pct-change.html
index 00026194..c9f25642 100644
--- a/docs/articles/pct-change.html
+++ b/docs/articles/pct-change.html
@@ -146,33 +146,30 @@ 

## 21 3494 fl 2020-06-21 2020-10-14 86.8

We can see that a column pct_change column has been appended to the output data frame, which contains the percentage change values estimates. Next we plot these values alongside the signal itself. We highlight the periods in time for which the percentage change is above 10% (in red) and below -10% (in blue).

library(ggplot2)
-library(gridExtra)
 theme_set(theme_bw())
 
 threshold_upper = 10
 threshold_lower = -10
 
-p1 <- x %>%
-  ggplot(aes(x = time_value, y = value)) +
+p1 <- ggplot(x, aes(x = time_value, y = value)) +
   geom_tile(data = x %>% filter(pct_change >= threshold_upper),
             aes(x = time_value, y = 0, width = 7, height = Inf),
             fill = 2, alpha = 0.1) +
   geom_tile(data = x %>% filter(pct_change <= threshold_lower),
             aes(x = time_value, y = 0, width = 7, height = Inf),
-            fill = 4, alpha = 0.1) + geom_line() +
+            fill = 4, alpha = 0.1) +
   geom_line() +
   scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
   labs(x = "Date", y = "Reported COVID-19 cases")
 
-p2 <- x %>%
-  ggplot(aes(x = time_value, y = pct_change)) +
+p2 <- ggplot(x, aes(x = time_value, y = pct_change)) +
   geom_line() +
   geom_hline(yintercept = threshold_upper, linetype = 2, col = 2) +
   geom_hline(yintercept = threshold_lower, linetype = 2, col = 4) +
   scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
   labs(x = "Date", y = "Weekly percentage change")
 
-grid.arrange(p1, p2, nrow = 1)
+gridExtra::grid.arrange(p1, p2, nrow = 1)

@@ -183,27 +180,25 @@

slide_by_geo(slide_fun = ~ Mean(.x$value), n = 7, col_name = "value") %>% pct_change(n = 14) -p1 <- x %>% - ggplot(aes(x = time_value, y = value)) + +p1 <- ggplot(x, aes(x = time_value, y = value)) + geom_tile(data = x %>% filter(pct_change >= threshold_upper), aes(x = time_value, y = 0, width = 7, height = Inf), fill = 2, alpha = 0.1) + geom_tile(data = x %>% filter(pct_change <= threshold_lower), aes(x = time_value, y = 0, width = 7, height = Inf), - fill = 4, alpha = 0.1) + geom_line() + + fill = 4, alpha = 0.1) + geom_line() + scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Reported COVID-19 cases (7-day average)") -p2 <- x %>% - ggplot(aes(x = time_value, y = pct_change)) + +p2 <- ggplot(x, aes(x = time_value, y = pct_change)) + geom_line() + geom_hline(yintercept = threshold_upper, linetype = 2, col = 2) + geom_hline(yintercept = threshold_lower, linetype = 2, col = 4) + scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Weekly percentage change") -grid.arrange(p1, p2, nrow = 1)

+gridExtra::grid.arrange(p1, p2, nrow = 1)

diff --git a/docs/articles/pct-change_files/figure-html/unnamed-chunk-3-1.png b/docs/articles/pct-change_files/figure-html/unnamed-chunk-3-1.png index a058e0df..98ff3ada 100644 Binary files a/docs/articles/pct-change_files/figure-html/unnamed-chunk-3-1.png and b/docs/articles/pct-change_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/docs/articles/pct-change_files/figure-html/unnamed-chunk-4-1.png b/docs/articles/pct-change_files/figure-html/unnamed-chunk-4-1.png index 44c7d8c4..85db7e42 100644 Binary files a/docs/articles/pct-change_files/figure-html/unnamed-chunk-4-1.png and b/docs/articles/pct-change_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/docs/articles/slide.html b/docs/articles/slide.html index f3f9d9f7..0cd677a6 100644 --- a/docs/articles/slide.html +++ b/docs/articles/slide.html @@ -119,8 +119,8 @@

1. Slide a computation over signal values

Slide with a formula

We first demonstrate how to apply a 7-day trailing average to the daily counts in order to smooth the signal, by using a formula in the slide_fun argument of slide_by_geo().

-
slide_by_geo(x, slide_fun = ~ Mean(.x$value), n = 7) %>%
-  head(10)
+
x <- slide_by_geo(x, slide_fun = ~ Mean(.x$value), n = 7)
+head(x, 10)
## # A tibble: 10 × 5
 ##    value geo_value time_value issue      slide_value
 ##    <dbl> <chr>     <date>     <date>           <dbl>
@@ -136,27 +136,26 @@ 

## 10 3208 ca 2020-06-10 2021-09-16 2826.

The formula specified via slide_fun has access to all columns present in the original epi_signal data frame, and must refer to them with the prefix .x$. Here the function Mean() is a simple wrapper around mean() that omits NA values by default (provided by the epitools package).

Notice that slide_by_geo() returns a data frame with a new column appended that contains the results (from sliding the formula), named slide_value by default. We can instead specify a name up front using the col_name argument:

-
slide_by_geo(x, slide_fun = ~ Mean(.x$value), n = 7, col_name = "7dav") %>%
-  head(10)
-
## # A tibble: 10 × 5
-##    value geo_value time_value issue      `7dav`
-##    <dbl> <chr>     <date>     <date>      <dbl>
-##  1  2360 ca        2020-06-01 2021-09-27  2360 
-##  2  2372 ca        2020-06-02 2021-09-16  2366 
-##  3  2214 ca        2020-06-03 2021-09-16  2315.
-##  4  3011 ca        2020-06-04 2021-09-16  2489.
-##  5  3025 ca        2020-06-05 2021-09-27  2596.
-##  6  3046 ca        2020-06-06 2021-09-16  2671.
-##  7  2404 ca        2020-06-07 2021-09-16  2633.
-##  8  2385 ca        2020-06-08 2021-09-27  2637.
-##  9  2700 ca        2020-06-09 2021-09-16  2684.
-## 10  3208 ca        2020-06-10 2021-09-16  2826.
+
x <- slide_by_geo(x, slide_fun = ~ Mean(.x$value), n = 7, col_name = "7dav")
+head(x, 10)
+
## # A tibble: 10 × 6
+##    value geo_value time_value issue      slide_value `7dav`
+##    <dbl> <chr>     <date>     <date>           <dbl>  <dbl>
+##  1  2360 ca        2020-06-01 2021-09-27       2360   2360 
+##  2  2372 ca        2020-06-02 2021-09-16       2366   2366 
+##  3  2214 ca        2020-06-03 2021-09-16       2315.  2315.
+##  4  3011 ca        2020-06-04 2021-09-16       2489.  2489.
+##  5  3025 ca        2020-06-05 2021-09-27       2596.  2596.
+##  6  3046 ca        2020-06-06 2021-09-16       2671.  2671.
+##  7  2404 ca        2020-06-07 2021-09-16       2633.  2633.
+##  8  2385 ca        2020-06-08 2021-09-27       2637.  2637.
+##  9  2700 ca        2020-06-09 2021-09-16       2684.  2684.
+## 10  3208 ca        2020-06-10 2021-09-16       2826.  2826.

As a simple sanity check, we visualize the 7-day trailing averages computed on top of the original counts:

library(ggplot2)
 theme_set(theme_bw())
 
-slide_by_geo(x, slide_fun = ~ Mean(.x$value), n = 7, col_name = "7dav") %>%
-  ggplot(aes(x = time_value)) +
+ggplot(x, aes(x = time_value)) +
   geom_col(aes(y = value, fill = geo_value), alpha = 0.5) +
   geom_line(aes(y = `7dav`, col = geo_value), show.legend = FALSE) +
   facet_wrap(~ geo_value, scales = "free_y") +
@@ -168,23 +167,137 @@ 

Slide with a function

We can also pass a function for the slide_fun argument in slide_by_geo(). In this case, the passed function must have the following argument structure: x, a data frame the same column names as the original data frame; followed by any number of named additional arguments; and ending with ..., to capture general additional arguments. Recreating the last example of a 7-day trailing average:

-
slide_by_geo(x, slide_fun = function(x, ...) Mean(x$value), n = 7,
-             col_name = "7dav") %>%
-  head(10)
-
## # A tibble: 10 × 5
-##    value geo_value time_value issue      `7dav`
-##    <dbl> <chr>     <date>     <date>      <dbl>
-##  1  2360 ca        2020-06-01 2021-09-27  2360 
-##  2  2372 ca        2020-06-02 2021-09-16  2366 
-##  3  2214 ca        2020-06-03 2021-09-16  2315.
-##  4  3011 ca        2020-06-04 2021-09-16  2489.
-##  5  3025 ca        2020-06-05 2021-09-27  2596.
-##  6  3046 ca        2020-06-06 2021-09-16  2671.
-##  7  2404 ca        2020-06-07 2021-09-16  2633.
-##  8  2385 ca        2020-06-08 2021-09-27  2637.
-##  9  2700 ca        2020-06-09 2021-09-16  2684.
-## 10  3208 ca        2020-06-10 2021-09-16  2826.
-

As a more complicated example, todo

+
x <- slide_by_geo(x, slide_fun = function(x, ...) Mean(x$value), n = 7,
+                  col_name = "7dav")
+head(x, 10)
+
## # A tibble: 10 × 6
+##    value geo_value time_value issue      slide_value `7dav`
+##    <dbl> <chr>     <date>     <date>           <dbl>  <dbl>
+##  1  2360 ca        2020-06-01 2021-09-27       2360   2360 
+##  2  2372 ca        2020-06-02 2021-09-16       2366   2366 
+##  3  2214 ca        2020-06-03 2021-09-16       2315.  2315.
+##  4  3011 ca        2020-06-04 2021-09-16       2489.  2489.
+##  5  3025 ca        2020-06-05 2021-09-27       2596.  2596.
+##  6  3046 ca        2020-06-06 2021-09-16       2671.  2671.
+##  7  2404 ca        2020-06-07 2021-09-16       2633.  2633.
+##  8  2385 ca        2020-06-08 2021-09-27       2637.  2637.
+##  9  2700 ca        2020-06-09 2021-09-16       2684.  2684.
+## 10  3208 ca        2020-06-10 2021-09-16       2826.  2826.
+
+
+

+Building and running a local forecaster

+

As a more complicated example, we create a forecaster based on a local (in time) autoregression or AR model. AR models can be fit in numerous ways (using base R functions and various packages), but here we define it “by hand” both because it provides a more advanced example of sliding a function over an epi_signal object, and because it allows us to be a bit more flexible in defining a probabilistic forecaster: one that outputs not just a point prediction, but a notion of uncertainty around this. In particular, our forecaster will output a point prediction along with a 80% uncertainty band, represented by a predictive quantiles at the 10% and 90% levels (lower and upper endpoints of the uncertainty band).

+

The function defined below, local_ar(), is our local AR forecaster. The lags argument indicates which lags to use in the model, and ahead indicates how far ahead in the future to make forecasts (both are encoded in terms of the units of the time_value column; so, days, in the working epi_signal being considered in this vignette). The implementation here uses dplyr::lag(), dplyr::lead(), and purrr:map() to fit the AR model with tidy code.

+
local_ar <- function(v, lags = c(0, 7, 14), ahead = 7, min_train_window = 10,
+                     quantile_levels = c(0.1, 0.9), symmetrize = TRUE) {
+  # Return NA if insufficient training data
+  if (length(v) < min_train_window + max(lags) + ahead) {
+    return(data.frame(value = NA, level = c(NA, quantile_levels)))
+  }
+
+  # Perform other simple checks
+  stopifnot(all(lags >= 0), ahead > 0, max(lags) + ahead < length(v))
+
+  # Go and fit the AR model
+  y <- dplyr::lead(v, n = ahead)
+  x <- do.call(cbind, purrr::map(lags, ~ dplyr::lag(v, n = .x)))
+  fit <- lm(y ~ x, na.action = na.omit)
+
+  # Make a prediction
+  newx <- tail(x, 1)
+  point <- drop(c(1, newx) %*% coef(fit))
+
+  # Compute a band and return
+  r <- residuals(fit)
+  s <- ifelse(symmetrize, -1, NA) # Should the residuals be symmetrized?
+  q <- c(0, quantile(c(r, s * r), probs = quantile_levels, na.rm = TRUE))
+  return(data.frame(value = q + point, level = c(NA, quantile_levels)))
+}
+

Now we go ahead and slide this local forecaster over the working epi_signal of COVID-19 cases. Note that we actually model the 7dav column, to operate on the scale of smoothed COVID-19 cases. (This is clearly equivalent, up to a constant, to modeling weekly sums of COVID-19 cases.)

+
forecasts <- x %>%
+  slide_by_geo(slide_fun = ~ local_ar(.x$`7dav`), n = 63,
+               col_type = "list", col_name = "value") %>%
+  rename(forecast_date = time_value) %>%
+  mutate(target_date = forecast_date + 7) %>%
+  select(-issue, -`7dav`)
+
+tail(forecasts, 3)
+
## An `epi_signal` data frame with 3 rows and 5 columns.
+## 
+## name      : covid19_cases
+## geo_type  : state
+## time_type : day
+## 
+## # A tibble: 3 × 5
+##   value        geo_value forecast_date slide_value target_date
+##   <list>       <chr>     <date>              <dbl> <date>     
+## 1 <df [3 × 2]> tx        2021-05-29          1740. 2021-06-05 
+## 2 <df [3 × 2]> tx        2021-05-30          1692. 2021-06-06 
+## 3 <df [3 × 2]> tx        2021-05-31          1306. 2021-06-07
+

We cab see that the value column is a list column (as we specified in the call to slide_to_geo(), which is needed since local_ar() returns a data frame), and we can use tidyr::unnest() to unnest it.

+
forecasts <- tidyr::unnest(forecasts, value)
+tail(forecasts, 10)
+
## # A tibble: 10 × 6
+##    value level geo_value forecast_date slide_value target_date
+##    <dbl> <dbl> <chr>     <date>              <dbl> <date>     
+##  1 1981.   0.9 tx        2021-05-28          1802  2021-06-04 
+##  2 1721.  NA   tx        2021-05-29          1740. 2021-06-05 
+##  3 1447.   0.1 tx        2021-05-29          1740. 2021-06-05 
+##  4 1995.   0.9 tx        2021-05-29          1740. 2021-06-05 
+##  5 1694.  NA   tx        2021-05-30          1692. 2021-06-06 
+##  6 1415.   0.1 tx        2021-05-30          1692. 2021-06-06 
+##  7 1974.   0.9 tx        2021-05-30          1692. 2021-06-06 
+##  8 1181.  NA   tx        2021-05-31          1306. 2021-06-07 
+##  9  876.   0.1 tx        2021-05-31          1306. 2021-06-07 
+## 10 1487.   0.9 tx        2021-05-31          1306. 2021-06-07
+

The level column specifies the quantile level corresponding to the value column. Here NA means that the associated value entry is the point forecast and 0.1 and 0.9 mean that the associated value entries are the estimated quantiles at the 0.1 and 0.9 probability levels.

+

To finish off, we plot the forecasts at various times (spaced out by a month) over the last year, at multiple horizons (7, 14, and 28 days ahead). To do so, we encapsulate the process of generating forecasts into a simple function, so that we can call it a few times.

+
k_week_ahead <- function(x, ahead = 7) {
+  # Ensure each forecast task has 42 effective days of training data
+  arg_list = as.list(args(local_ar))
+  n = 42 + max(eval(arg_list$lags)) + ahead
+  return(x %>%
+           slide_by_geo(slide_fun = ~ local_ar(.x$`7dav`, ahead = ahead),
+                        n = n, col_type = "list", col_name = "value") %>%
+           dplyr::rename(forecast_date = time_value) %>%
+           mutate(target_date = forecast_date + ahead) %>%
+           select(-issue, -`7dav`) %>%
+           tidyr::unnest(value))
+}
+
+# First generate the forecasts, and bind them together
+forecasts <- bind_rows(k_week_ahead(x, ahead = 7),
+                       k_week_ahead(x, ahead = 14),
+                       k_week_ahead(x, ahead = 28))
+
+# Now plot some of them, spaced apart by 2 months
+dates <- seq(as.Date("2020-08-01"), as.Date("2021-06-01"), by = "2 months")
+
+bands <- forecasts %>%
+  filter(forecast_date %in% dates) %>%
+  group_by(forecast_date, target_date, geo_value) %>%
+  mutate(level = case_when(
+    is.na(level) ~ "point",
+    level < .5 ~ "lo",
+    TRUE ~ "hi")) %>%
+  tidyr::pivot_wider(names_from = level, values_from = value)
+
+p <- ggplot(bands, aes(x = target_date, y = point, group = forecast_date)) +
+  geom_ribbon(aes(ymin = lo, ymax = hi), fill = "orchid", alpha = 0.5) +
+  geom_line() + geom_point(size = 0.5) +
+  geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
+  facet_wrap(vars(geo_value), scales = "free_y") +
+  scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
+  labs(x = "Date", y = "Reported COVID-19 cases")
+
+gginnards::append_layers(
+  p, geom_line(data = x, aes(x = time_value, y = `7dav`),
+               inherit.aes = FALSE, color = "gray50"), pos = "bottom")
+

+

Two points are worth making. First, the AR forecaster here is not very good. At various points in time, we can see that its forecasts are both pretty volatile (its point predictions are all over the place) and overconfident (its bands are too narrow). This is only meant as a simple demo and slightly more sophisticated AR models can go a long way. For example, the overconfidence in this example is due to the fact that the bands are based on quantiles of training residuals, and when the AR model fits well to the training set, these can clearly be too small.

+

Second, the AR forecaster here is using using finalized data, meaning, it uses the signal values (reported COVID-19 cases) corresponding to the latest issue dates available, for both training models and making predictions. However, this is not reflective of the provisional nature of the data that it must cope with in a true forecast task. Training and making predictions on finalized data can lead to an overly optimismic sense of accuracy; see, for example, McDonald et al. (2021), and references therein.

+

Note that both of these issues (building more AR models, and properly handling provisional data in training models and making predictions) are addressed in the epipred package.

diff --git a/docs/articles/slide_files/figure-html/unnamed-chunk-10-1.png b/docs/articles/slide_files/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 00000000..55c2aafb Binary files /dev/null and b/docs/articles/slide_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/docs/articles/slide_files/figure-html/unnamed-chunk-9-1.png b/docs/articles/slide_files/figure-html/unnamed-chunk-9-1.png new file mode 100644 index 00000000..0bbbe432 Binary files /dev/null and b/docs/articles/slide_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/docs/authors.html b/docs/authors.html index 3194b5d4..66a88b08 100644 --- a/docs/authors.html +++ b/docs/authors.html @@ -135,6 +135,10 @@

Authors