Skip to content

Consider tidyselect in epi_slide() #18

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brookslogan opened this issue Oct 26, 2021 · 8 comments · Fixed by #31
Closed

Consider tidyselect in epi_slide() #18

brookslogan opened this issue Oct 26, 2021 · 8 comments · Fixed by #31

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Oct 26, 2021

Users may want to directly write slide_by_geo(newcol = f(oldcol), ........) rather than separately specifying the function and name. We could use rlang/tidyselect stuff to accomplish this.

transmute_scattered_ts and mutate_scattered_ts from #7 performs the time_value wrangling required to apply functions assuming that values correspond to increasing, evenly-spaced time values. One can basically just write a transmute statement and add the additional time window info requires. This might potentially help implement this revision. It would be group-by-geo followed by mutate_scattered_ts and ungroup. The mutation might be tricky as mutate_scattered_ts works on vectorized functions while slide_by_geo works on non-vectorized functions.

@ryantibs
Copy link
Member

Thanks @brookslogan. That seems like it could/would be great for usability and readability! If you can get a demo working I think it'd be really nice to see.

Note: I think what you're suggesting could replace the formula option in the slide function. It'd be much cleaner. However, in certain scenarios, I could imagine that you'd still want to be able to specify the function directly, i.e., when it's a complicated function that takes the whole input data frame (row-subsetted appropriately), like in some forecasting tasks.

@ryantibs
Copy link
Member

ryantibs commented Nov 9, 2021

Adding more thoughts to record them, even if just for my own sake. In a helpful side conversation, @earowang suggested something similar, where we refactor things so that there aren't separate functions for pct_change(), etc., but these operate within a call to epi_slide() (just like mutate()), as in:

covid_data %>% epi_slide(cases_pct_change = pct_change(cases, n = 14))

That is, pct_change() and deriv() become helpers to be used inside a call to epi_slide(), rather than standalone functions as they are now. (Here I guess I would rename estimate_deriv() to deriv().)

It seems like we could pull this off similarly to how @earowang implemented index_by() in the tsibble package, see: https://github.com/tidyverts/tsibble/blob/7261fbb858d5d1ab82cce7a7bebeffcb08e3915f/R/index-by.R#L83-L113.

One thing I'm uncertain of is whether the aforementioned implementation leverages tidyselect functionality, or does it differently using expression evaluation tools from rlang. I didn't study it careful enough to know (nor do I know enough about these topics to be able to tell).

@ryantibs ryantibs changed the title Consider tidyselect in slide_by_geo Consider tidyselect in epi_slide() Nov 13, 2021
@ryantibs
Copy link
Member

Wondering what @earowang thinks about my last comment and the example I gave there? Is that along the lines of what you were thinking?

And do you have an opinion on what the path towards implementation would be---would it be good to follow the example you gave in index_by() as I highlighted above?

@earowang
Copy link

Does epi_slide() by default slide by geo_value?

Can we group currently available functions into two sections: (1) table functions handling data frames and (2) vector functions?

Regarding table functions, you may like to name them more generically, e.g. slide_by(). For vector functions, you could add epi_ prefix, e.g. epi_pct_change(), which enables autocompletion for less typing.

covid_data %>% 
  slide_by(cases_pct_change = epi_pct_change(cases, n = 14))

@brookslogan
Copy link
Contributor Author

I don't think epi_slide groups by geo_value anymore after #14; the user will need to group_by first.

(Regarding naming I don't have a clear opinion right now. Maybe consider potential confusion introduced by mixing epi_ and non-epi_ functions in the same package when a few existing packages have the pattern of prefixing nearly all members (e.g., stringr::str_*).)

By tidyselect I wasn't being very precise. I think it makes sense to try to mirror the behavior of transmute and/or mutate; indeed, transmute_scattered_ts and mutate_scattered_ts from #7 comments are implemented on top of one of those.

Another detail to think about: what should be the output time_values? Users may consider using epi_slide to prepare covariate and/or response data, but things get tricky; consider:

epi.tibble %>% 
  group_by(geo) %>%
  epi_slide(x1= dplyr::lag(cases, 3L), x2 = dplyr::lag(cases, 10L))

Run on daily data with no missingness, the output will not include the most recent 3 time_values, and the first three time_values will have NA output. For real-time forecasting, the dropped time_values might include the test instance. This is undesirable. But what would be the interface to specify what the time_values should be? (I can't think of anything that avoids this dropped-data situation that doesn't involve manually specifying the output time_values, typing lead&lag values multiple times and potentially making mistakes, or first going through another function that users may not know about.)

Now consider also including y = dplyr::lead(cases, 14L); now there will be some instances with non-missing ys dropped, but I think this is less of a concern since we'd normally chop off these anyway because all the covariates will probably be lagging and be NA.

@ryantibs
Copy link
Member

Thanks both! I like the idea of separating out at least conceptually into table functions and vector functions.

Re naming: I'm not sure it makes sense to prefix everything by epi, since many functions in this package will be generic ones. I'd prefer to keep the prefix to the few things that are actually special to the epi_df data structure (the latest name suggested for this data structure, see #27).

Re mutate: can you look at the code I linked above (in Earo's index_by() implementation) and see whether you think it's suitable, and/or how it differs from yours? I understand you want to mimic mutate(), and I think that's a good idea, but I'm just trying to reconcile what the template should be for how to do this.

Re time values: this is a good point. But I think any desired behavior could be accomplished with some combination of setting the following two arguments:

  1. complete (should we require a complete window for sliding? see epi_slide behavior when not all data is available #21); and
  2. na_rm (should the underlying slide function let NA inputs propagate into an NA output?).

@brookslogan
Copy link
Contributor Author

Re time values: I think this is more about preventing users from accidentally doing something they don't want rather than being able to do what they want. For example, anything involving dplyr::lag or rolling-sum/apply functions are convenient but probably buggy --- they don't behave well when there are gaps, and have the issues with the output time_values potentially chopping off desired data. Maybe we could at least detect these types of functions and warn if the user uses them. But then we probably also would want an option to silence these warnings or to specify what lag windows they actually intend to use.

@brookslogan
Copy link
Contributor Author

Re index_by: I think this is not the exact semantics we want for epi_slide/slide_by; it is essentially grouping by (or adding to the existing non-index groups) the time index or some coarsening thereof. We'd want something like that for aggregating daily data into epiweek resolution, but not for computing shifts or rolling-window operations. But we can try to mimic the interface! Haven't thought much, but if we already have our own custom class(es), then making this like a modified group_by/index_by does seem more natural.

@ryantibs ryantibs mentioned this issue Jan 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants