Consider tidyselect in `epi_slide()` #18

brookslogan · 2021-10-26T17:02:27Z

Users may want to directly write slide_by_geo(newcol = f(oldcol), ........) rather than separately specifying the function and name. We could use rlang/tidyselect stuff to accomplish this.

transmute_scattered_ts and mutate_scattered_ts from #7 performs the time_value wrangling required to apply functions assuming that values correspond to increasing, evenly-spaced time values. One can basically just write a transmute statement and add the additional time window info requires. This might potentially help implement this revision. It would be group-by-geo followed by mutate_scattered_ts and ungroup. The mutation might be tricky as mutate_scattered_ts works on vectorized functions while slide_by_geo works on non-vectorized functions.

The text was updated successfully, but these errors were encountered:

ryantibs · 2021-10-27T03:26:52Z

Thanks @brookslogan. That seems like it could/would be great for usability and readability! If you can get a demo working I think it'd be really nice to see.

Note: I think what you're suggesting could replace the formula option in the slide function. It'd be much cleaner. However, in certain scenarios, I could imagine that you'd still want to be able to specify the function directly, i.e., when it's a complicated function that takes the whole input data frame (row-subsetted appropriately), like in some forecasting tasks.

ryantibs · 2021-11-09T13:58:22Z

Adding more thoughts to record them, even if just for my own sake. In a helpful side conversation, @earowang suggested something similar, where we refactor things so that there aren't separate functions for pct_change(), etc., but these operate within a call to epi_slide() (just like mutate()), as in:

covid_data %>% epi_slide(cases_pct_change = pct_change(cases, n = 14))

That is, pct_change() and deriv() become helpers to be used inside a call to epi_slide(), rather than standalone functions as they are now. (Here I guess I would rename estimate_deriv() to deriv().)

It seems like we could pull this off similarly to how @earowang implemented index_by() in the tsibble package, see: https://github.com/tidyverts/tsibble/blob/7261fbb858d5d1ab82cce7a7bebeffcb08e3915f/R/index-by.R#L83-L113.

One thing I'm uncertain of is whether the aforementioned implementation leverages tidyselect functionality, or does it differently using expression evaluation tools from rlang. I didn't study it careful enough to know (nor do I know enough about these topics to be able to tell).

ryantibs · 2021-11-13T17:53:59Z

Wondering what @earowang thinks about my last comment and the example I gave there? Is that along the lines of what you were thinking?

And do you have an opinion on what the path towards implementation would be---would it be good to follow the example you gave in index_by() as I highlighted above?

earowang · 2021-11-15T01:41:06Z

Does epi_slide() by default slide by geo_value?

Can we group currently available functions into two sections: (1) table functions handling data frames and (2) vector functions?

Regarding table functions, you may like to name them more generically, e.g. slide_by(). For vector functions, you could add epi_ prefix, e.g. epi_pct_change(), which enables autocompletion for less typing.

covid_data %>% 
  slide_by(cases_pct_change = epi_pct_change(cases, n = 14))

brookslogan · 2021-11-15T14:18:49Z

I don't think epi_slide groups by geo_value anymore after #14; the user will need to group_by first.

(Regarding naming I don't have a clear opinion right now. Maybe consider potential confusion introduced by mixing epi_ and non-epi_ functions in the same package when a few existing packages have the pattern of prefixing nearly all members (e.g., stringr::str_*).)

By tidyselect I wasn't being very precise. I think it makes sense to try to mirror the behavior of transmute and/or mutate; indeed, transmute_scattered_ts and mutate_scattered_ts from #7 comments are implemented on top of one of those.

Another detail to think about: what should be the output time_values? Users may consider using epi_slide to prepare covariate and/or response data, but things get tricky; consider:

epi.tibble %>% 
  group_by(geo) %>%
  epi_slide(x1= dplyr::lag(cases, 3L), x2 = dplyr::lag(cases, 10L))

Run on daily data with no missingness, the output will not include the most recent 3 time_values, and the first three time_values will have NA output. For real-time forecasting, the dropped time_values might include the test instance. This is undesirable. But what would be the interface to specify what the time_values should be? (I can't think of anything that avoids this dropped-data situation that doesn't involve manually specifying the output time_values, typing lead&lag values multiple times and potentially making mistakes, or first going through another function that users may not know about.)

Now consider also including y = dplyr::lead(cases, 14L); now there will be some instances with non-missing ys dropped, but I think this is less of a concern since we'd normally chop off these anyway because all the covariates will probably be lagging and be NA.

ryantibs · 2021-11-15T15:20:30Z

Thanks both! I like the idea of separating out at least conceptually into table functions and vector functions.

Re naming: I'm not sure it makes sense to prefix everything by epi, since many functions in this package will be generic ones. I'd prefer to keep the prefix to the few things that are actually special to the epi_df data structure (the latest name suggested for this data structure, see #27).

Re mutate: can you look at the code I linked above (in Earo's index_by() implementation) and see whether you think it's suitable, and/or how it differs from yours? I understand you want to mimic mutate(), and I think that's a good idea, but I'm just trying to reconcile what the template should be for how to do this.

Re time values: this is a good point. But I think any desired behavior could be accomplished with some combination of setting the following two arguments:

complete (should we require a complete window for sliding? see epi_slide behavior when not all data is available #21); and
na_rm (should the underlying slide function let NA inputs propagate into an NA output?).

brookslogan · 2021-11-16T18:36:16Z

Re time values: I think this is more about preventing users from accidentally doing something they don't want rather than being able to do what they want. For example, anything involving dplyr::lag or rolling-sum/apply functions are convenient but probably buggy --- they don't behave well when there are gaps, and have the issues with the output time_values potentially chopping off desired data. Maybe we could at least detect these types of functions and warn if the user uses them. But then we probably also would want an option to silence these warnings or to specify what lag windows they actually intend to use.

brookslogan · 2021-11-16T18:52:20Z

Re index_by: I think this is not the exact semantics we want for epi_slide/slide_by; it is essentially grouping by (or adding to the existing non-index groups) the time index or some coarsening thereof. We'd want something like that for aggregating daily data into epiweek resolution, but not for computing shifts or rolling-window operations. But we can try to mimic the interface! Haven't thought much, but if we already have our own custom class(es), then making this like a modified group_by/index_by does seem more natural.

ryantibs changed the title ~~Consider tidyselect in slide_by_geo~~ Consider tidyselect in epi_slide() Nov 13, 2021

ryantibs mentioned this issue Jan 14, 2022

Tidy slide #31

Merged

ryantibs closed this as completed in #31 Jan 14, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider tidyselect in `epi_slide()` #18

Consider tidyselect in `epi_slide()` #18

brookslogan commented Oct 26, 2021 •

edited

Loading

ryantibs commented Oct 27, 2021

ryantibs commented Nov 9, 2021

ryantibs commented Nov 13, 2021

earowang commented Nov 15, 2021

brookslogan commented Nov 15, 2021

ryantibs commented Nov 15, 2021

brookslogan commented Nov 16, 2021

brookslogan commented Nov 16, 2021

Consider tidyselect in epi_slide() #18

Consider tidyselect in epi_slide() #18

Comments

brookslogan commented Oct 26, 2021 • edited Loading

ryantibs commented Oct 27, 2021

ryantibs commented Nov 9, 2021

ryantibs commented Nov 13, 2021

earowang commented Nov 15, 2021

brookslogan commented Nov 15, 2021

ryantibs commented Nov 15, 2021

brookslogan commented Nov 16, 2021

brookslogan commented Nov 16, 2021

Consider tidyselect in `epi_slide()` #18

Consider tidyselect in `epi_slide()` #18

brookslogan commented Oct 26, 2021 •

edited

Loading