Skip to content

Make tidy sliding play nicely with data frame inputs #55

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ryantibs opened this issue Mar 10, 2022 · 10 comments
Open

Make tidy sliding play nicely with data frame inputs #55

ryantibs opened this issue Mar 10, 2022 · 10 comments
Labels
enhancement New feature or request P2 low priority

Comments

@ryantibs
Copy link
Member

ryantibs commented Mar 10, 2022

Tidy sliding is pretty convenient. But in some applications, I might want to actually be sliding a function that takes an entire data frame rather than pass in individual columns as arguments.

For example, let's suppose my_fun() is a complicated function that I've written that acts on a whole data frame, whose signature is:

my_fun(df, args)

Then I could do:

x %>% 
  epi_slide(f = my_fun, args = my_args, n = 120, new_col_name = "fc")

But I could NOT easily do some tidy sliding, because there's currently no way to get the tidy eval thing I've written to understand "pass the sliding data frame". It only understands "pass these sliding variables".

Ideally, we'd code in some special variable name that epi_slide() (and analogously epix_slide()) understands and treats in a special way, say .cur_df (prefixing a dot so that it doesn't conflict with variable names in the environment) and then we could do:

x %>% 
  epi_slide(fc = my_fun(.cur_df, args = my_args), n = 120)

which is more intuitive/convenient for folks who like tidy code.

@lcbrooks @dajmcdon @jacobbien @elray1 what do you think? Anybody want to take a stab at implementing it?

@ryantibs ryantibs added the enhancement New feature or request label Mar 10, 2022
@brookslogan
Copy link
Contributor

brookslogan commented Mar 10, 2022

We just ran into this as part of cmu-delphi/epipredict#8 and cmu-delphi/epipredict#3. Definitely want this in one way or another.

Thinking about the implementation makes me wonder if the interface of epi_slide is a bit complicated and whether we'd want to try to simplify now. Currently we have at least the function, formula, and missing(= quosure stuff on dots) cases for f, and we are looking at ways to directly reference both the .cur_df and its columns by name. And we might think about another case where we want to take .cur_df, transmute it to get multiple columns, and have those multiple columns appear in the result of the slide (although maybe this is just a list-column and a dplyr::unnest away for the user). Maybe it'd be simpler to only accept f as a function or formula from .cur_df to a list/df of results?

Implementation: I could work on this after forecaster calibration exploration, but that will take a bit.

.cur_df approach:

  • formula f case: not sure how [Ryan pointed out it's already handled; user get the df as.x]
  • function f case: not sure how [Ryan pointed out it's already handled; user gets the df]
  • missing f case (tidy case): this may just involve constructing a env arg for eval_tidy. Currently we use the default (might be a bug --- it seems like this would use some of the epi_slide internals in the env --- this might need to be eval_tidy(quo, x, caller_env())). That would change to eval_tidy(quo, x, eval_env), where eval_env is a child of caller_env() with the extra .cur_df binding set. [But maybe we need to look at why rlang uses .data and .env pronouns in their own namespaces rather than doing child environment construction.] [from the future: eval_tidy(quo, x, eval_env) isn't right; either first arg should be an expr and third arg should be passed, or first arg should be a quosure and the third arg should not be there.]

@ryantibs
Copy link
Member Author

ryantibs commented Mar 10, 2022

Thanks for the response. As for the epipredict issues: that's partly why I thought of it. Though I'm in the middle of writing a long comment on cmu-delphi/epipredict#8 that's going to argue against letting the user pass a straight data frame. So I think we should think of the use case in the current issue as just one we'd want to support in general.

Responses are helpful but I think may be a bit of confusion here? I don't see a way to simplify epi_slide() without removing functionality. I would not want to remove tidy sliding from epi_slide(). That's the most convenient way of doing things for many simple applications of sliding.

And the multicolumn case you're describing I believe would be handed by returning a list column and using the default argument as_list_col = FALSE, which unnests it in the output.

So to be clear, the .cur_df approach is ONLY intended to be for tidy sliding. When using a function or a formula you won't need it. When using a function, it already receives the entire data frame. When using a formula, you can access the entire data frame with .x (this is provided by slider package).

When using tidy sliding, I'm proposing you can access the entire data frame with .cur_df. But maybe we should change it to be .x to be consistent with the formula case.

@ryantibs
Copy link
Member Author

Just following up: the analogous implementation (whatever we do) should be something we carry over to epix_slide().

And in thinking that through, I realized I never implemented a formula option in epix_slide()! (Though the documentation claims it's possible, I just forgot to do it. You get it for free in epi_slide() because of slider.)

Should we implement that? Would give the greatest degree of symmetry in between the two sliding functions which I think we should try to maintain.

@brookslogan
Copy link
Contributor

brookslogan commented Mar 10, 2022

Thanks for clearing up my confusion on the current functionality.

So maybe "simplifying" would just be removing useful functionality. For [(epipred, other generalized)] coding on top of epi_slide we also want the formula/function interface[:] e.g., epi_slide(df, ~ .x %>% summarize(a = ......, b = ......., n = .....)) doesn't try to match n to the epi_slide argument rather than making a new column.

Implementation detail-wise we might need to check why rlang uses "pronouns" rather than child-env-creation for .data and .env. Maybe it's not as simple as I hoped.

@ryantibs
Copy link
Member Author

Sorry, I'm not sure I understand your current perspective.

  1. Are you suggesting we get rid of tidy evaluation for sliding?
  2. Are you suggesting we get rid of function/formulation evaluation for sliding?

I see a use case for both, and now given your implementation notes (thank you, helpful), I'm actually leaning towards doing nothing further right now for tidy sliding---meaning, tidy sliding can only access variables by name.

If you ever want to "get this whole data frame", then you should be able to do that with a formula or function, and that's one of their main use cases. If we wanted to do something as advanced as what you're suggesting, in your last comment (with piping) then it's best to use a function. I don't think pipes play nicely with formulas (my impression based on basic experimentation just now).

Thoughts?

@ryantibs
Copy link
Member Author

ryantibs commented Mar 10, 2022

p.s. I fixed the formula problem (that epix_slide() doesn't support it) in #58. Will merge a bit later.

If you think it's more fluid we can talk about this and other issues on the call later today. Thanks!

@brookslogan
Copy link
Contributor

Neither now, sorry for the noise. As you said, both have use cases.

The tidy eval + .cur_df does still seem like a convenient feature, but low priority compared, e.g., to #58.

@ryantibs
Copy link
Member Author

No worries, your thoughts & discussions points are helpful (here and always). So I think we can leave this issue open but plan not to attack it until we feel that it's risen to a much higher priority.

@brookslogan
Copy link
Contributor

brookslogan commented Oct 21, 2022

Note that some dplyr operations support cur_data(), cur_data_all() [deprecated; now use pick()], and some other functions to solve this problem. However:

  • I don't think group_modify is one of them
  • While we could use summarize instead of group_modify, its cur_data* functions give plain tibbles rather than epi_dfs, which differs from what we feed f functions/formulas
  • Using cur_data requires extra thought about nesting of dplyr pipelines; naive approaches can sometimes refer to data from the wrong pipeline.

@brookslogan
Copy link
Contributor

brookslogan commented May 3, 2024

--- CURRENT STATE: ---
epi_slide() and epix_slide() support .data pronoun in tidy computations. However, pronouns do not support full dplyr operations, and our slide functions do not support pick() to get actual data frames. [But we have a pronoun-like object .x with the full operations available now.]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request P2 low priority
Projects
None yet
Development

No branches or pull requests

3 participants