Steps can potentially clobber epi_df #175


Closed
dajmcdon opened this issue Apr 26, 2023 · 6 comments · Fixed by #180
Labels
bug Something isn't working P0 do this immediately

Comments

@dajmcdon
Contributor

recipes::step_naomit() converts the data to a tibble. This is not generally a problem unless it happens before something that tries to read the epi_df metadata (e.g. step_training_window()).

This is potentially a major pain: any step that clobbers our epi_df could do this.

Find a way to log the epi_df more explicitly elsewhere in a way that’s accessible, but not removable.

library(epipredict)
#> Loading required package: epiprocess
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> Loading required package: parsnip


tib <- tibble::tibble(
  x = 1:10,
  y = 1:10,
  time_value = rep(seq(as.Date("2020-01-01"), by = 1,
                     length.out = 5), times = 2),
  geo_value = rep(c("ca", "hi"), each = 5)) %>%
  as_epi_df()

epi_recipe(y ~ x, data = tib) %>%
  step_training_window(n_recent = 3) %>%
  recipes::prep(tib) %>%
  recipes::bake(new_data = NULL)
#> An `epi_df` object, 6 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2023-04-26 15:30:42
#> 
#> # A tibble: 6 × 4
#>       x     y time_value geo_value
#> * <int> <int> <date>     <chr>    
#> 1     3     3 2020-01-03 ca       
#> 2     4     4 2020-01-04 ca       
#> 3     5     5 2020-01-05 ca       
#> 4     8     8 2020-01-03 hi       
#> 5     9     9 2020-01-04 hi       
#> 6    10    10 2020-01-05 hi

epi_recipe(y ~ x, data = tib) %>%
  recipes::step_naomit() %>%
  step_training_window(n_recent = 3) %>%
  recipes::prep(tib) %>%
  recipes::bake(new_data = NULL)
#> # A tibble: 3 × 4
#>       x     y time_value geo_value
#>   <int> <int> <date>     <chr>    
#> 1     9     9 2020-01-04 hi       
#> 2     5     5 2020-01-05 ca       
#> 3    10    10 2020-01-05 hi

Created on 2023-04-26 with reprex v2.0.2

@dshemetov
Contributor

dshemetov commented Apr 27, 2023

I am still very vague on R's classes and object attributes, so this is a very basic question: when a function like step_naomit() returns a tibble, is it just resetting the class attribute of the input object to one without epi_df, or is it making a completely new tibble and returning that? If recipes functions do the former (and are consistent about it), then it seems like we probably have the option of saving the epi_df class tag in a different attribute. But if it's the latter, then do we have alternatives?

@brookslogan
Contributor

brookslogan commented Apr 29, 2023

In most of R, the convention is [something like] shallow copy-on-write [unless a copy doesn't seem necessary]. The way it's doing it right now is basically creating a new "object" (SEXP) with pointers to new attributes (including the new class setting, and copying/aliasing some of the old attributes), while pointing to the same set of columns.

If the epi_df going into the step was ungrouped, our metadata attributes still hang around, which might help with getting back to an epi_df. But if it was grouped, then the attributes are tossed. (I'm working on getting some consistency with grouped epi_dfs in epiprocess, but it's more of a project than I'd hoped.)

But whether or not we have the old metadata, it doesn't look like we can make step_naomit() output an epi_df without messing up the behavior that as_tibble.epi_df should have in other contexts. That means either coding epi steps around all this epi_df dropping anyway, OR planning to have the epi_df-ness dropped and stay that way, and setting any roles and additional info we'll need in epi_recipe().

@brookslogan
Contributor

To try to answer @dshemetov's question more directly: if recipes consistently uses as_tibble() (I don't know if it does), then it will consistently do whatever as_tibble() does. as_tibble() is supposed to convert to the tibble class, but it's not well documented what it does with the rest of the attributes, including our metadata or any epi_df tag attribute you might consider. It appears that currently it would keep those other attributes for an ungrouped epi_df but drop them for a grouped epi_df (because, as we've implemented epi_df, it dispatches to as_tibble.grouped_df, which clears those attributes).
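A quick way to check this is to inspect which attributes survive the conversion. The session below is a hypothetical sketch (it assumes epiprocess is installed; exact attribute names and behavior may differ by version), not authoritative output:

```r
# Hypothetical session: inspect what as_tibble() keeps on an epi_df.
library(epiprocess)
library(dplyr)

edf <- as_epi_df(tibble::tibble(
  geo_value = "ca",
  time_value = as.Date("2020-01-01") + 0:4,
  x = 1:5
))

# Ungrouped: the epi_df class is dropped, but (currently) the metadata
# attribute still hangs around.
names(attributes(tibble::as_tibble(edf)))

# Grouped: dispatch lands on as_tibble.grouped_df(), which rebuilds the
# tibble and clears the metadata attribute.
names(attributes(tibble::as_tibble(group_by(edf, geo_value))))
```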

@dajmcdon
Contributor Author

The issue here is really a confluence of factors, so TL;DR: there's a proposed fix at the bottom. The rest of this is a description of some R functionality, as well as a bit about how the guts of recipes/prepping/baking interact.

  1. In R, many functions are actually "generics". These functions dispatch to "methods" based on the class of the first argument (technically these are S3 methods, which are the most common and easiest to use). An example is print(). You call print(x) and internally, R dispatches to print.class() based on the first element of class(x). If that's not available, it falls back to print.default(x). Good R package-writing practice means giving objects that you create (almost always lists) a class, and writing methods for any generics you think are likely to be tried by the user. One further wrinkle is that objects can have "subclasses": really, class(x) is a vector of classes with an ordering, and generics dispatch along the vector, starting from the first class and proceeding until they find one with an appropriate method.
  2. epi_df is a class that has attributes. It's also a subclass of tbl_df (the class for a data frame made with tibble::tibble()).
  3. Much of epipredict consists of methods that operate on epi_df's. Mostly, it needs two features: (1) dataframes with the epi_df class so that it can dispatch to the right methods, and (2) a guarantee that certain columns (time_value and geo_value) exist in that data. If an object has this class, it must have those two columns. So this is safe. The only possible way around it is for a user to do something like x <- 1:5 followed by class(x) <- 'epi_df'. Then all these methods would try to operate on that and probably throw uninterpretable errors. But this is pretty dumb.
  4. An epi_recipe is a subclass of recipe, so its class vector has those two classes. The object itself is a list with a bunch of other named lists inside. As you create the recipe, it updates all of this but doesn't do anything to any data.
  5. prep() (and bake()) are generic functions. So calling prep(r, dat) where r is an epi_recipe dispatches to prep.epi_recipe(r, dat).
  6. The basic idea of prep() is:

       # does preprocessing
       for (i in seq_along(r$steps)) dat <- prep(r$steps[[i]], dat)
       # does postprocessing

     Inside of prep.epi_recipe(), it calls prep.step_*() sequentially on the data, dispatching to the method associated with each of the steps you added to the epi_recipe. Importantly, the prep() generic only gets whatever is in r$steps[[i]] and not the rest of the recipe. This makes it so that I can use steps that operate on epi_dfs, but also those provided by any other R package in the wild.
  7. In the {recipes} package, many steps end with a line like tibble::as_tibble(dat) (the same is true of other related functions). This removes the epi_df subclass (along with other modifications). The problem is that the next step in the sequence can no longer depend on any of the properties the epi_df had. A workaround like writing the attributes elsewhere on the epi_df would be fragile: if the tidyverse team alters as_tibble(), it could eventually break the workaround.
  8. Now, as_tibble() is a generic, so I could do something like create epipredict::as_tibble.epi_df(), forcing such calls to do what I want rather than what they want. But this is not a good idea, for the reason in 7. It's also very unexpected for the user, similar to doing the dumb thing in 3. So this is bad practice.
  9. The functionality in 6 is great, because I can use anyone's steps. But I need it not to clobber the class in case future steps need the class.

I think that the correct fix (at least for now) is to modify prep.epi_recipe() and bake.epi_recipe() to reconstruct the epi_df after each step. This ensures no one else's step can destroy it in an unrecoverable manner. At the same time, we should ensure that any steps in {epipredict} that require an epi_df check for that up front. That way, if someone wants to use one in a bare recipe (rather than an epi_recipe), they could, unless we explicitly prevent it.
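As an illustrative sketch only (the function and helper names here are hypothetical, not the actual epipredict internals), the proposed fix amounts to wrapping the simplified loop from point 6 with a reconstruction pass:

```r
# Sketch: capture the metadata up front, then reconstruct the epi_df
# whenever a step clobbers it. Names are illustrative, not real internals.
prep_with_reconstruct <- function(r, dat) {
  old_meta <- attr(dat, "metadata")
  for (i in seq_along(r$steps)) {
    dat <- prep(r$steps[[i]], dat)  # simplified, as in point 6
    if (!inherits(dat, "epi_df")) {
      # Fails loudly if a foreign step removed geo_value or time_value.
      dat <- epiprocess::as_epi_df(dat, as_of = old_meta$as_of)
    }
  }
  dat
}
```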

This issue is one of many similar ones that I've run into. The logic of what this other group does is frequently undocumented, so I can't make mine match until it breaks.

@dajmcdon
Contributor Author

dajmcdon commented May 5, 2023

@dshemetov @brookslogan I'm working on a fix for this, but I wanted to see what you think.

When a recipe is prepped/baked, it loops through all the steps. These steps must output a tbl_df or it aborts (this is the behaviour in {recipes}). My plan is that we would then coerce back to epi_df. (We can't just check and abort, because any step from a different package would almost certainly fail.) The question is: how strict should this coercion be?

Some things to note: validating against the metadata is potentially tricky. The metadata is not necessarily enforced, so the data could be daily but have time_type = "weekly". This is because the constructor only tries to infer these when the user doesn't input them manually; if they do, they're not validated.

  1. We could try to force the metadata of the most recent epi_df and fail if we can't. Seemingly safe, because steps from other packages "probably" won't make changes to the required key columns. They could remove one for no reason, which would mean failure here. They could perhaps group_by() %>% summarise() such that a key column is removed, though that should probably fail as ill-advised (preferred would be a step_epi_*() that is safe and outputs an epi_df). They could probably also convert, say, daily data to weekly data, which, again, would fail.
  2. Same as 1, but warn if recoverable. As long as geo_value and time_value (and any other keys) still exist, we'd be ok.
  3. Rather than forcing the previous metadata, use as_epi_df(x, additional_metadata = previous_additional_metadata). This would guess the time type and geo type from the new data and automatically be "correct" even if they changed. But the change would be silent.
  4. Same as 3, but try to make the changes warn.

I'm sort of leaning toward 3. I have a bit of a preference for unexpected behaviour being documented rather than printed to the console. What do you guys think?
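Option 3 could be sketched roughly like this (names are illustrative; additional_metadata is the argument discussed above, and as_of is carried over since it can't be guessed from the data):

```r
# Sketch of option 3: re-derive geo_type/time_type from the new data,
# carrying over only as_of and the additional metadata. Silent if changed.
restore_epi_df <- function(new_tbl, old_meta) {
  epiprocess::as_epi_df(
    new_tbl,
    as_of = old_meta$as_of,
    additional_metadata = old_meta$additional_metadata
  )
}
```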

@brookslogan
Contributor

brookslogan commented May 5, 2023

  1. --- sounds decent. You might also consider if dplyr_reconstruct.epi_df (soon to hit dev from Make epix_slide more like group_modify epiprocess#311 if it's not already there) would be helpful.
  2. --- I was trying to describe problems with this but actually came up with two variants that might be decent. Thoughts?:
    4a. Identify specific potentially-problematic steps that should cause metadata updates (e.g., step_mutate), and provide "epi" versions of them (e.g., step_epi_mutate) that: (a) convert to epi_df at the beginning, (b) hypothetically do the right metadata updates because epiprocess is 100% bug-free, (c) record the metadata as the "previous metadata". If we can preserve epi_recipe-ness, then we could warn when they don't use the "epi" version here, or try to replace the non-epi steps with the epi equivalents.
    4b. Take the last idea from 4a but consider all steps "potentially problematic" --- make sure to preserve epi_recipe-ness, and do the epi_df conversion and metadata-recording after every one.

Side note: I also hit on some of this geo_type, time_type validation stuff when investigating ?dplyr_extending. I suspect the "right" approach is to have special classes for these columns that encode the type and ensure that the contents are consistent with the type, and remove the *_type attributes from epi_df. (Any changes in resolution would be done using functions that output another one of these special types.)
