Future redesign of epi_slide and epix_slide #458

Open

dshemetov opened this issue Jun 6, 2024 · 5 comments
@dshemetov (Contributor) commented Jun 6, 2024

Related issue: #234

Logan and I did some brainstorming about what use cases epi_slide is trying to cover. We found that it doesn't behave gracefully when the slide computation function f returns data.frame/tibble outputs (which is something that may occur if you're trying to, say, make a forecast per ref_time_value). This led us to think that perhaps we shouldn't even recommend using epi_slide for forecasting use cases; perhaps epix_slide should be the go-to function for that (an epi_df can be quickly converted to a fake archive with version = time_value). This in turn led us to suspect that maybe epi_slide computation functions f should be limited to functions that return atomic vectors, like those covered by epi_slide_opt.
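(For concreteness, here's a rough, untested sketch of that epi_df-to-fake-archive conversion; `edf` stands for some epi_df:)

```r
library(dplyr)
library(epiprocess)

# Pretend nothing is ever revised: each row's version is its time_value.
fake_archive <- edf %>%
  mutate(version = time_value) %>%
  as_epi_archive()
```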

TODO

  • examine the use cases of epi_slide in our vignettes and catalogue them; we need to make sure that we have an alternative/workaround for its more exotic use cases
  • consider more specialized functions epi_slide_reframe or epi_slide_mutate, to imply and constrain the outputs of f to the user
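(Purely speculative sketches of what those signatures might look like, mirroring the dplyr mutate/reframe contracts; neither function exists yet:)

```r
# Hypothetical -- names and arguments are placeholders:
# epi_slide_mutate(): f must return exactly one value per ref_time_value,
# so outputs can be column-bound onto the input.
# edf %>% epi_slide_mutate(mean_cases = mean(cases), .window_size = 7)
#
# epi_slide_reframe(): f may return any number of rows per window,
# reframe()-style.
# edf %>% epi_slide_reframe(~ quantile_table(.x$cases), .window_size = 7)
```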

Side-note: formats for epix_slide to use when the slide function f returns data.frame/tibble outputs

Three possible output formats

# Deep
geo | time | slide_value (list)
--------------------------------------
a   | 1    | list(m = , w = , p = )
a   | 2    | list(m = , w = , p = )
a   | 3    | list(m = , w = , p = )

# Wide
geo | time | metadata (list) | workflow (list) | preds (list)
--------------------------------------------------------------
a   | 1    | ...             | ...             | ...
a   | 2    | ...             | ...             | ...
a   | 3    | ...             | ...             | ...

# Long
geo | time | name       | value (list)
----------------------------------------
a   | 1    | "metadata" | < >
a   | 1    | "workflow" | < >
a   | 1    | "preds"    | < >
a   | 2    | "metadata" | < >
...

The deep format is simple from the epi_slide writer's POV, but may be difficult for the user to index into. The wide format will have issues with name collisions (e.g., if the output has geo_value and time_value columns as well).
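(To illustrate the indexing friction with the deep format, using the column names from the sketch above; untested:)

```r
library(dplyr)
library(purrr)
library(tidyr)

# One row's predictions:
# df$slide_value[[1]]$p

# All rows' predictions, pulled out and flattened:
# df %>%
#   mutate(p = map(slide_value, "p"), slide_value = NULL) %>%
#   unnest(p)
```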

@brookslogan (Contributor) commented Jun 6, 2024

On formats for epix_slide to use: I've constructed some dummy examples below, plus at least one way per format to extract predictions and get them into a submission-ish format (the primary use case), or to extract coefficients in a long format, as if we were planning to debug by ggplotting the coefficients over forecast dates (maybe faceted by ahead?).

library(tidyverse)
library(epiprocess)
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(epipredict)
#> Loading required package: parsnip
#> 
#> Attaching package: 'epipredict'
#> The following object is masked from 'package:ggplot2':
#> 
#>     layer

# NOTE from ?arx_forecaster
jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value >= as.Date("2021-12-01"))
out <- arx_forecaster(
  jhu, "death_rate",
  c("case_rate", "death_rate")
)

# Here are potential formats we could get out of `epix_slide()`:

deep <- tibble(
  time_value = 1:3, # XXX actually, version...
  slide_value = rep(list(out), 3)
)

long <- deep %>%
  rowwise() %>%
  reframe(time_value, enframe(unclass(slide_value)))

wide <- long %>%
  pivot_wider(id_cols = time_value)

wide_packed <- wide %>%
  pack(slide_value = c(predictions, epi_workflow, metadata))

deep
#> # A tibble: 3 × 2
#>   time_value slide_value
#>        <int> <list>     
#> 1          1 <arx_fcst> 
#> 2          2 <arx_fcst> 
#> 3          3 <arx_fcst>
long
#> # A tibble: 9 × 3
#>   time_value name         value            
#>        <int> <chr>        <list>           
#> 1          1 predictions  <tibble [56 × 5]>
#> 2          1 epi_workflow <ep_wrkfl>       
#> 3          1 metadata     <named list [2]> 
#> 4          2 predictions  <tibble [56 × 5]>
#> 5          2 epi_workflow <ep_wrkfl>       
#> 6          2 metadata     <named list [2]> 
#> 7          3 predictions  <tibble [56 × 5]>
#> 8          3 epi_workflow <ep_wrkfl>       
#> 9          3 metadata     <named list [2]>
# NOTE there could also be an alternative long format using a single named list column, but I have no idea how to work with that.
wide
#> # A tibble: 3 × 4
#>   time_value predictions       epi_workflow metadata        
#>        <int> <list>            <list>       <list>          
#> 1          1 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
#> 2          2 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
#> 3          3 <tibble [56 × 5]> <ep_wrkfl>   <named list [2]>
wide_packed
#> # A tibble: 3 × 2
#>   time_value slide_value$predictions $epi_workflow $metadata       
#>        <int> <list>                  <list>        <list>          
#> 1          1 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
#> 2          2 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
#> 3          3 <tibble [56 × 5]>       <ep_wrkfl>    <named list [2]>
# Here's one way for each that we could use to get to submission-ish format:

preds <-
  deep %>%
  mutate(predictions = map(slide_value, "predictions"),
         slide_value = NULL) %>%
  unnest(predictions)
# (This one (`deep`) took a bit of time to think of. I think I knew the solution
# above was possible but ugly, and was hoping to think of a cleaner solution but
# failed and finally fell back on the above.)

# or

# deep %>%
#   mutate(slide_value = map(slide_value, unclass)) %>%
#   hoist(slide_value, "predictions") %>%
#   select(-slide_value) %>%
#   unnest(predictions)

# or

# deep %>%
#   mutate(predictions = map(slide_value, "predictions"),
#          .keep = "unused") %>%
#   unnest(predictions)

preds <-
  long %>%
  filter(name == "predictions") %>%
  select(-name) %>%
  unnest(value)

preds <-
  wide %>%
  select(time_value, predictions) %>%
  unnest(predictions)

preds <-
  wide_packed %>%
  mutate(predictions = slide_value$predictions,
         slide_value = NULL) %>%
  unnest(predictions)

# or

# preds <-
#   wide_packed %>%
#   rowwise() %>%
#   reframe(time_value, slide_value$predictions[[1L]])

# or ...
# Here's how we might extract coefficients in a long format:

get_coefs <- function(fit_ewf) {
  fit_ewf %>% workflows::extract_fit_engine() %>% coef()
}

coefs <-
  deep %>%
  rowwise() %>%
  reframe(time_value, slide_value$epi_workflow %>% get_coefs() %>% enframe())

coefs <-
  long %>%
  filter(name == "epi_workflow") %>%
  mutate(coefs = map(value, ~ get_coefs(.x) %>% enframe())) %>%
  unnest(coefs, names_sep = "_")

# or, in a better output format:

# coefs <-
#   long %>%
#   filter(name == "epi_workflow") %>%
#   mutate(name = NULL,
#          coefs = map(value, ~ get_coefs(.x) %>% enframe()),
#          value = NULL) %>%
#   unnest(coefs)

coefs <-
  wide %>%
  rowwise() %>%
  reframe(time_value, epi_workflow %>% get_coefs() %>% enframe())

coefs <-
  wide_packed %>%
  rowwise() %>%
  reframe(time_value, slide_value$epi_workflow[[1]] %>% get_coefs() %>% enframe())
# (This one (`wide_packed`) took a bit of time to think of.)

# or

# coefs <-
#   wide_packed %>%
#   transmute(time_value, coefs = map(slide_value$epi_workflow, ~ get_coefs(.x) %>% enframe())) %>%
#   unnest(coefs)

# or ....

Created on 2024-06-06 with reprex v2.0.2

Comment N+1: Non-long [and long] formats may tempt people to duplicate the original slide_value column(s) many times when unnesting, if they aren't careful to select out those columns first. This could lead to confusion at all the duplicated complex objects and, depending on how serialization & deserialization work, bloated disk or RAM usage when saving & loading these objects. (RAM size before saving shouldn't be an issue; the duplicates should alias memory properly.)

  • [I accidentally did this on a smaller scale with long format on the coefs, or maybe it was on purpose imagining what level of patience users might have.]
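(A toy illustration of that duplication pitfall; `heavy_fit` stands in for a workflow-sized object:)

```r
library(tidyverse)

toy <- tibble(
  time_value = 1:2,
  predictions = list(tibble(ahead = 1:3), tibble(ahead = 1:3)),
  heavy_fit = list("imagine a large fit object", "another one")
)

# Forgetting to drop heavy_fit first leaves a copy of it on every
# prediction row (6 rows here, each carrying a heavy_fit element):
toy %>% unnest(predictions)

# Safer: select out the heavy columns before unnesting.
toy %>%
  select(time_value, predictions) %>%
  unnest(predictions)
```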

Comment N+2: a lot of these felt awkward, trading off between rowwise() requiring manually attaching the time_value (actually, the version/ref_time_value) vs. group_by(time_value) requiring slide_value[[1]]. I don't like the former because it's arcane, and don't like the latter because it's ugly and also would silently drop data if I forgot an ID column (and there seems to be no simple unwrap() function in R for unwrapping length-1 lists & erroring on other lengths).
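(For what it's worth, the wished-for unwrap() would be a one-liner if we ever shipped it; hypothetical helper, not currently in any of our packages:)

```r
# Unwrap a length-1 list; error on any other length.
unwrap <- function(x) {
  if (!is.list(x) || length(x) != 1L) {
    stop("expected a list of length 1, got length ", length(x))
  }
  x[[1L]]
}
```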

Comment N+3: we could also change the output of arx_forecaster() in other ways. I think we mentioned just returning predictions with the other stuff as attributes, which didn't seem robust to rbinding etc. But we could also potentially have it be more of a tibble-like thing with its own convenience functions for extracting a submission-ish format or extracting coefficients, etc. One drawback would be that if you're not doing a slide you want more scalar-like behavior without having to insert [[1]]'s everywhere... preserving that is probably substantial trouble / boilerplate.
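(Spelling out the rbinding problem with the attributes approach: dplyr's bind_rows() does not carry over custom attributes, so the metadata would silently vanish on the first combine. A minimal demonstration:)

```r
library(dplyr)

x <- tibble(a = 1)
attr(x, "forecaster_metadata") <- list(as_of = as.Date("2024-06-06"))

y <- bind_rows(x, x)
attr(y, "forecaster_metadata") # NULL in recent dplyr: the attribute is dropped
```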

@brookslogan (Contributor) commented Jun 20, 2024

Unfortunately, accepting named lists on their own will likely not fully resolve the issue Ryan and Richard were running into: we want slide computations to be able to just output an arx_forecaster() result. That will probably work, but if you then try to pull out $predictions (in unnested/unpacked format), you'll get naming conflicts with any grouping columns and time_value columns, since epix_slide() also outputs these directly. The time_value part is easy to resolve: we call it version instead, which is both more accurate and more convenient. The grouping-columns part is gnarlier, especially since we don't control unnest.tibble / unnest.data.frame or whatever will be called. [If it were directly an issue with the group computation outputs, we could add special logic to make sure these columns agree and toss the duplicates.] [Workaround: removing/renaming before and/or after unnesting.]
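(A sketch of the removing/renaming workaround; column names illustrative, with `slide_out` standing for an epix_slide() result holding arx_forecaster() outputs:)

```r
library(tidyverse)

# Prefix the inner prediction columns so they can't collide with the
# outer grouping/version columns:
# slide_out %>%
#   mutate(predictions = map(slide_value, "predictions"),
#          slide_value = NULL) %>%
#   unnest(predictions, names_sep = "_")
```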

@brookslogan (Contributor) commented

Long format will also upset epi_slide() size checks unless we dance around it, though it's not recommended for forecasting anyway.

@brookslogan (Contributor) commented Oct 14, 2024

Status: slide improvements pass has specialized epi_slide, reworked how data frame outputs are processed, and allows list output.

  • The epi_slide specialization makes epi_slide_reframe less likely to serve a purpose; it would have been used for simultaneous geo aggregation, for example, but we currently forbid having multiple time series in the same group.

Remaining:

  • Decide whether to accept scalars and which format they should be put in, and work out sequential data-masking expression details. Right now a workaround for the deep format is fairly easy: the user just does return(list(scalar)).

@brookslogan (Contributor) commented Apr 8, 2025

Comment N+4:

  • A more convenient format to work with arx_forecaster-style output might be to have a list of tibbles, each as if a single element had been extracted out of the arx_forecaster result inside of the epix_slide computation. Transforming to this from other formats seems in the realm of tidyr unnest, hoist, etc., but I don't see anything quite like it.
  • Another possibility would be to have a wide format, but unchopped by $predictions. This would require some sort of specialization for epipredict output objects to know which thing to unchop by. It also risks potentially quite large save/reload sizes if serialization & deserialization do not replicate aliasing, since the fit objects would be repeated very many times. The pro is that for any prediction, you have the fit information right alongside.
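(Rough sketches of both bullets, reusing the `deep`/`wide` objects from the reprex above; untested:)

```r
library(tidyverse)

# Bullet 1: a list of prediction tibbles, one per forecast date:
# preds_list <- map(deep$slide_value, "predictions")

# Bullet 2: wide format unchopped by $predictions -- one row per
# prediction, with the fit/metadata list entries repeated alongside:
# wide %>% unnest(predictions)
```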
