Steps can potentially clobber epi_df #175


Closed
dajmcdon opened this issue Apr 26, 2023 · 6 comments · Fixed by #180
Labels
bug Something isn't working P0 do this immediately

Comments

@dajmcdon
Contributor

recipes::step_naomit() converts the data to a tibble. This is not generally a problem unless it happens before something that tries to read the epi_df metadata (e.g. step_training_window()).

This is potentially a major pain: any step that clobbers our epi_df could do this.

Find a way to log the epi_df more explicitly elsewhere in a way that’s accessible, but not removable.

library(epipredict)
#> Loading required package: epiprocess
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> Loading required package: parsnip


tib <- tibble::tibble(
  x = 1:10,
  y = 1:10,
  time_value = rep(seq(as.Date("2020-01-01"), by = 1,
                     length.out = 5), times = 2),
  geo_value = rep(c("ca", "hi"), each = 5)) %>%
  as_epi_df()

epi_recipe(y ~ x, data = tib) %>%
  step_training_window(n_recent = 3) %>%
  recipes::prep(tib) %>%
  recipes::bake(new_data = NULL)
#> An `epi_df` object, 6 x 4 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2023-04-26 15:30:42
#> 
#> # A tibble: 6 × 4
#>       x     y time_value geo_value
#> * <int> <int> <date>     <chr>    
#> 1     3     3 2020-01-03 ca       
#> 2     4     4 2020-01-04 ca       
#> 3     5     5 2020-01-05 ca       
#> 4     8     8 2020-01-03 hi       
#> 5     9     9 2020-01-04 hi       
#> 6    10    10 2020-01-05 hi

epi_recipe(y ~ x, data = tib) %>%
  recipes::step_naomit() %>%
  step_training_window(n_recent = 3) %>%
  recipes::prep(tib) %>%
  recipes::bake(new_data = NULL)
#> # A tibble: 3 × 4
#>       x     y time_value geo_value
#>   <int> <int> <date>     <chr>    
#> 1     9     9 2020-01-04 hi       
#> 2     5     5 2020-01-05 ca       
#> 3    10    10 2020-01-05 hi

Created on 2023-04-26 with reprex v2.0.2

@dshemetov
Contributor

dshemetov commented Apr 27, 2023

I am still very vague on R's classes and object attributes, so this is a very basic question: when a function like step_naomit() returns a tibble, is it just resetting the class attribute of the input object to one without epi_df, or is it making a completely new tibble and returning that? If recipes functions do the former (and are consistent about it), then it seems like we probably have the option of saving the epi_df class tag in a different attribute. But if it's the latter, then do we have alternatives?

@brookslogan
Contributor

brookslogan commented Apr 29, 2023

In most of R, the convention is [something like] shallow copy-on-write [unless a copy doesn't seem necessary]. The way it's doing it right now is basically creating a new "object" (SEXP) with pointers to new attributes (including the new class setting, and copying/aliasing some of the old attributes), while pointing to the same set of columns.

If the epi_df going into the step was ungrouped, our metadata attributes still hang around, which might help with getting back to an epi_df. But if it was grouped, then the attributes are tossed. (I'm working on getting some consistency with grouped epi_dfs in epiprocess, but it's more of a project than I'd hoped.)

But whether or not we have the old metadata, it doesn't look like we can make step_naomit() output an epi_df without messing up the behavior that as_tibble.epi_df should have in other contexts. That means either coding epi steps around all this epi_df dropping anyway, OR planning to have the epi_df-ness dropped and stay that way, and setting any roles and additional info we'll need in epi_recipe().

@brookslogan
Contributor

To try to answer @dshemetov's question more directly: if recipes consistently uses as_tibble() (I don't know if it does), then it will consistently do whatever as_tibble() does. as_tibble() is supposed to convert to the tibble class, but it's not well documented what it does with the rest of the attributes, including our metadata or any epi_df tag attribute you might consider. It appears that currently it would keep those other attributes for an ungrouped epi_df but drop them for a grouped epi_df (because, as we've implemented epi_df, it dispatches to as_tibble.grouped_df, which clears those attributes).
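A quick way to check this is to inspect which attributes survive the conversion. The session below is a hypothetical sketch (it assumes epiprocess is installed; exact attribute names and behavior may differ by version), not authoritative output:

```r
# Hypothetical session: inspect what as_tibble() keeps on an epi_df.
library(epiprocess)
library(dplyr)

edf <- as_epi_df(tibble::tibble(
  geo_value = "ca",
  time_value = as.Date("2020-01-01") + 0:4,
  x = 1:5
))

# Ungrouped: the epi_df class is dropped, but (currently) the metadata
# attribute still hangs around.
names(attributes(tibble::as_tibble(edf)))

# Grouped: dispatch lands on as_tibble.grouped_df(), which rebuilds the
# tibble and clears the metadata attribute.
names(attributes(tibble::as_tibble(group_by(edf, geo_value))))
```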

@dajmcdon
Contributor Author

The issue here is really a confluence of factors, so TL;DR: there's a proposed fix at the bottom. The rest of this is a description of some R functionality, as well as a bit about how the guts of recipes/prepping/baking interact.

  1. In R, many functions are actually "generics". These functions dispatch to "methods" based on the class of the first argument (technically these are S3 methods, which are the most common and easiest to use). An example is print(). You call print(x) and internally, R dispatches to print.class() based on the first element of class(x). If that's not available, it falls back to print.default(x). Good R package-writing practice means giving objects that you create (almost always lists) a class, and writing methods for any generics you think are likely to be tried by the user. One further wrinkle is that objects can have "subclasses": really, class(x) is a vector of classes with an ordering, and generics dispatch along the vector, starting from the first class and proceeding until they find one with an appropriate method.
  2. epi_df is a class that has attributes. It's also a subclass of tbl_df (the class for a data frame made with tibble::tibble()).
  3. Much of epipredict consists of methods that operate on epi_df's. Mostly, it needs two features: (1) dataframes with the epi_df class so that it can dispatch to the right methods, and (2) a guarantee that certain columns (time_value and geo_value) exist in that data. If an object has this class, it must have those two columns. So this is safe. The only possible way around it is for a user to do something like x <- 1:5 followed by class(x) <- 'epi_df'. Then all these methods would try to operate on that and probably throw uninterpretable errors. But this is pretty dumb.
  4. An epi_recipe is a subclass of recipe, so its class vector has those two classes. The object itself is a list with a bunch of other named lists inside. As you create the recipe, it updates all of this but doesn't do anything to any data.
  5. prep() (and bake()) are generic functions. So calling prep(r, dat) where r is an epi_recipe dispatches to prep.epi_recipe(r, dat).
  6. The basic idea of prep() is:

       # does preprocessing
       for (i in seq_along(r$steps)) dat <- prep(r$steps[[i]], dat)
       # does postprocessing

     Inside of prep.epi_recipe(), it calls prep.step_*() sequentially on the data, dispatching to the method associated with each of the steps you added to the epi_recipe. Importantly, the prep() generic only gets whatever is in r$steps[[i]] and not the rest of the recipe. This makes it so that I can use steps that operate on epi_dfs, but also those provided by any other R package in the wild.
  7. In the {recipes} package, many steps end with a line like tibble::as_tibble(dat) (the same is true of other related functions). This removes the epi_df subclass (along with other modifications). The problem is that the next step in the sequence can no longer depend on any of the properties the epi_df had. A workaround like writing the attributes elsewhere on the epi_df would be fragile: if the tidyverse team alters as_tibble(), it could eventually break the workaround.
  8. Now, as_tibble() is a generic, so I could do something like create epipredict::as_tibble.epi_df(), forcing such calls to do what I want rather than what they want. But this is not a good idea, for the reason in 7. It's also very unexpected for the user, similar to doing the dumb thing in 3. So this is bad practice.
  9. The functionality in 6 is great, because I can use anyone's steps. But I need it not to clobber the class in case future steps need the class.

I think that the correct fix (at least for now) is to modify prep.epi_recipe() and bake.epi_recipe() to reconstruct the epi_df after each step. This ensures no one else's step can destroy it in an unrecoverable manner. At the same time, we should ensure that any steps in {epipredict} that require an epi_df check for that up front. That way, if someone wants to use one in a bare recipe (rather than an epi_recipe), they could, unless we explicitly prevent it.
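As an illustrative sketch only (the function and helper names here are hypothetical, not the actual epipredict internals), the proposed fix amounts to wrapping the simplified loop from point 6 with a reconstruction pass:

```r
# Sketch: capture the metadata up front, then reconstruct the epi_df
# whenever a step clobbers it. Names are illustrative, not real internals.
prep_with_reconstruct <- function(r, dat) {
  old_meta <- attr(dat, "metadata")
  for (i in seq_along(r$steps)) {
    dat <- prep(r$steps[[i]], dat)  # simplified, as in point 6
    if (!inherits(dat, "epi_df")) {
      # Fails loudly if a foreign step removed geo_value or time_value.
      dat <- epiprocess::as_epi_df(dat, as_of = old_meta$as_of)
    }
  }
  dat
}
```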

This issue is one of many similar ones that I've run into. The logic of what this other group does is frequently undocumented, so I can't make mine match until it breaks.

@dajmcdon
Contributor Author

dajmcdon commented May 5, 2023

@dshemetov @brookslogan I'm working on a fix for this, but I wanted to see what you think.

When a recipe is prepped/baked, it loops through all the steps. These steps must output a tbl_df or it aborts (this is the behaviour in {recipes}). My plan is that we would then coerce back to epi_df. (We can't just check and abort, because any step from a different package would almost certainly fail.) The question is: how strict should this coercion be?

Some things to note: validating against the metadata is potentially tricky. The metadata is not necessarily enforced, so the data could be daily but have time_type = "weekly". This is because the constructor only tries to infer these when the user doesn't input them manually; if they do, they're not validated.

  1. We could try to force the metadata of the most recent epi_df and fail if we can't. Seemingly safe, because steps from other packages "probably" won't make changes to the required key columns. They could remove one for no reason, which would mean failure here. They could perhaps group_by() %>% summarise() such that a key column is removed, though that should probably fail as ill-advised (preferred would be a step_epi_*() that is safe and outputs an epi_df). They could probably also convert, say, daily data to weekly data, which, again, would fail.
  2. Same as 1, but warn if recoverable. As long as geo_value and time_value (and any other keys) still exist, we'd be ok.
  3. Rather than forcing the previous metadata, use as_epi_df(x, additional_metadata = previous_additional_metadata). This would guess the time type and geo type from the new data and automatically be "correct" even if they changed. But the change would be silent.
  4. Same as 3, but try to make the changes warn.

I'm sort of leaning toward 3. I have a bit of a preference for unexpected behaviour being documented rather than printed to the console. What do you guys think?
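Option 3 could be sketched roughly like this (names are illustrative; additional_metadata is the argument discussed above, and as_of is carried over since it can't be guessed from the data):

```r
# Sketch of option 3: re-derive geo_type/time_type from the new data,
# carrying over only as_of and the additional metadata. Silent if changed.
restore_epi_df <- function(new_tbl, old_meta) {
  epiprocess::as_epi_df(
    new_tbl,
    as_of = old_meta$as_of,
    additional_metadata = old_meta$additional_metadata
  )
}
```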

@brookslogan
Contributor

brookslogan commented May 5, 2023

  1. --- sounds decent. You might also consider if dplyr_reconstruct.epi_df (soon to hit dev from Make epix_slide more like group_modify epiprocess#311 if it's not already there) would be helpful.
  2. --- I was trying to describe problems with this but actually came up with two variants that might be decent. Thoughts?:
    4a. Identify specific potentially-problematic steps that should cause metadata updates (e.g., step_mutate), and provide "epi" versions of them (e.g., step_epi_mutate) that: (a) convert to epi_df at the beginning, (b) hypothetically do the right metadata updates because epiprocess is 100% bug-free, (c) record the metadata as the "previous metadata". If we can preserve epi_recipe-ness, then we could warn when they don't use the "epi" version here, or try to replace the non-epi steps with the epi equivalents.
    4b. Take the last idea from 4a but consider all steps "potentially problematic" --- make sure to preserve epi_recipe-ness, and do the epi_df conversion and metadata-recording after every one.

Side note: I also hit on some of this geo_type, time_type validation stuff when investigating ?dplyr_extending. I suspect the "right" approach is to have special classes for these columns that encode the type and ensure that the contents are consistent with the type, and remove the *_type attributes from epi_df. (Any changes in resolution would be done using functions that output another one of these special types.)
