Skip to content

mapply() error when trying to use percent_cli as a covariate in arx_forecaster() #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rachlobay opened this issue Aug 16, 2022 · 2 comments
Labels
bug Something isn't working P1 high priority

Comments

@rachlobay
Copy link
Contributor

rachlobay commented Aug 16, 2022

Once Issue #208) in epiprocess is addressed (by removing the problematic tibble::as_tibble() %>% line from the slide function for an archive in epiprocess), we get the following error when trying to use percent_cli as a covariate in arx_forecaster():

Error in mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE) : 
zero-length inputs cannot be mixed with those of non-zero length

Below is an example to reproduce this error (run after the related epiprocess issue is fixed):

library(epipredict)
library(epiprocess)
library(covidcast)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)

y <- covidcast_signals(
  c("doctor-visits", "jhu-csse"),
  c("smoothed_adj_cli", "confirmed_7dav_incidence_prop"),
  start_day = "2020-06-01",
  end_day = "2021-12-01",
  issues = c("2020-06-01", "2021-12-01"),
  geo_type = "state",
  geo_values = c("ca", "fl"))

z <- y[[1]] %>%
  select(geo_value, time_value, version = issue, percent_cli = value) %>%
  as_epi_archive()

z <- epix_merge(
  z, y[[2]] %>%
    select(geo_value, time_value, version = issue, case_rate = value) %>%
    as_epi_archive(), sync = "locf")

fc_time_values <- seq(as.Date("2020-08-01"), as.Date("2021-12-01"),
                      by = "1 month")
ahead = 7

# Old arx_forecaster works fine with percent_cli
z %>% epix_slide(fc = arx_forecaster(x = percent_cli, y = case_rate, 
                                     key_vars = geo_value, time_value = time_value,
                                     args = arx_args_list(ahead = ahead)),
                 n = 120, ref_time_values = fc_time_values)

# New forecaster - error in mapply()
z %>%
  epix_slide(function(x, ...)
    arx_forecaster(x, outcome = "case_rate",
                       predictors = c("case_rate", "percent_cli"),
                       args = arx_args_list(ahead = ahead))$predictions %>%
      select(-c(geo_value, time_value)),
    n = 120, ref_time_values = fc_time_values, new_col_name = "fc")

Please look into what is causing this error and what we can do to fix it. Possibly related to the abundance of missing values in percent_cli.

I haven’t looked into this much, but here are some initial thoughts: The error looks like it traces back to the call of map2 in line #109 of layer_residual.quantiles.R. And if one of the inputs is zero, then the mapply() function (which map2 uses) gives such an error… So are there alternatives to map2 that may be useful here? Or is there something more fundamentally wrong?

@rachlobay rachlobay added bug Something isn't working P1 high priority labels Aug 16, 2022
@rachlobay rachlobay changed the title mapply() error when trying to use percent_cli as a covariate in arx_epi_forecaster() mapply() error when trying to use percent_cli as a covariate in arx_forecaster() Aug 16, 2022
@dajmcdon
Copy link
Contributor

dajmcdon commented Aug 22, 2022

Update: I tracked the error to the missing data in percent_cli with proper as_of. Here's a simple example that gives the same result as with epix_slide():

library(epipredict) # frosting branch

jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value >= as.Date("2021-12-01"))

out <- arx_forecaster(jhu, "death_rate",
                      c("case_rate", "death_rate"))
#> Warning: The forecast_date is less than the most recent update date of the
#> data.forecast_date = 2021-12-31 while data is from 2022-05-31.
# still difficulty extracting the "untrained" recipe, which we want

r <- epi_recipe(jhu) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_epi_lag(case_rate, death_rate, lag = c(0, 7, 14)) %>%
  step_epi_naomit()

latest <- get_test_data(r, jhu)
predict(out$epi_workflow, new_data = latest) # works
#> Warning: The forecast_date is less than the most recent update date of the
#> data.forecast_date = 2021-12-31 while data is from 2022-05-31.
#> An `epi_df` object, 56 x 6 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 12:08:25
#> 
#> # A tibble: 56 × 6
#>    geo_value time_value  .pred         .pred_distn forecast_date target_date
#>  * <chr>     <date>      <dbl>              <dist> <date>        <date>     
#>  1 ak        2021-12-31 0.355  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  2 al        2021-12-31 0.325  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  3 ar        2021-12-31 0.496  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  4 as        2021-12-31 0.0836 [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  5 az        2021-12-31 0.614  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  6 ca        2021-12-31 0.327  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  7 co        2021-12-31 0.567  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  8 ct        2021-12-31 0.544  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#>  9 dc        2021-12-31 0.831  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#> 10 de        2021-12-31 0.607  [0.05, 0.95]<q-rng> 2021-12-31    2022-01-07 
#> # … with 46 more rows
#> # ℹ Use `print(n = ...)` to see more rows

latest_but_missing <- latest
latest_but_missing$case_rate[latest_but_missing$time_value > as.Date("2021-12-29")] <- NA
predict(out$epi_workflow, new_data = latest_but_missing) # same error
#> Error in `list_of()`:
#> ! Could not find common type for elements of `x`.

Created on 2022-08-22 by the reprex package (v2.0.1)


We've basically been lucky so far that all the signals are fully available at forecast time. This is in general false, as illustrated in the epix_slide() behaviour. We need to make some decision on how canned forecasters should operate (and maybe also, the rest) for this sort of real-time task. These are pretty much the same as the epix_merge() options:

  • Default to forbid (not my desired case)
  • Default to locf.
  • Default to truncate.
  • Default to na: keep the NAs (I think this doesn't really work here)

In any case, we should probably start creating some layers that give these different options (and maybe some other imputations).

Just FYI, the old version of arx_forecaster() implemented locf via tidyr::fill() inside of make_predictions().

@dajmcdon
Copy link
Contributor

closed by #134

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P1 high priority
Projects
None yet
Development

No branches or pull requests

2 participants