Skip to content

.f with tibble output and non-null new_col_name results in packed tibble column #534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dsweber2 opened this issue Oct 1, 2024 · 5 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@dsweber2
Copy link
Contributor

dsweber2 commented Oct 1, 2024

@dshemetov this is the issue I was talking about. I'm going to just route around it by not using step_epi_slide for now.

taking test_data from test_epi_slide.R

ress <- epi_slide(
  test_data,
  .window_size = 4,
  .f = function(slice, geo_key, ref_time_value) {
    tibble(a = mean(slice[["value"]]), b = median(slice[["value"]]))
  },
  .new_col_name = "foo"
)
ress %>% glimpse
Rows: 89
Columns: 4
Groups: geo_value [3]
$ geo_value  <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",…
$ time_value <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05$ value      <dbl> 0, 1, 4, 9, 16, 25, 36, 49, 64, 100, 121, 144, 169, 196, 22$ foo        <tibble[,2]> <tbl_df[26 x 2]>

ress
An `epi_df` object, 89 x 4 with metadata:
* geo_type  = custom
* time_type = day
* as_of     = 2020-01-31

# A tibble: 89 × 4
# Groups:   geo_value [3]
   geo_value time_value value foo$a    $b
 * <chr>     <date>     <dbl> <dbl> <dbl>
 1 a         2020-01-01     0  NA    NA  
 2 a         2020-01-02     1  NA    NA

with the case where new_col_name is NULL

ress2 <- epi_slide(
  test_data,
  .window_size = 4,
  .f = function(slice, geo_key, ref_time_value) {
    tibble(a = mean(slice[["value"]]), b = median(slice[["value"]]))
  }
)

ress2 %>% glimpse
Rows: 89
Columns: 5
Groups: geo_value [3]
$ geo_value  <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",…
$ time_value <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05$ value      <dbl> 0, 1, 4, 9, 16, 25, 36, 49, 64, 100, 121, 144, 169, 196, 22$ a          <dbl> NA, NA, NA, 3.5, 7.5, 13.5, 21.5, 31.5, 43.5, NA, NA, NA, 1$ b          <dbl> NA, NA, NA, 2.5, 6.5, 12.5, 20.5, 30.5, 42.5, NA, NA, NA, 1

I ran into this in step_epi_slide, where there is always a column name added, and where nested tibbles don't interact nicely with recipes. I don't know if this is intended behavior, but I think it would probably be better to be consistent about always appending normal-ish row data. in the case of a name being specified, I was expecting "fooa" and "foob", or maybe "foo_a" and "foo_b".

Also, the docs are thankfully pretty precise about the expected input of .f. They are not, however, very precise about the expected output.

@dsweber2 dsweber2 added bug Something isn't working enhancement New feature or request labels Oct 1, 2024
@dshemetov
Copy link
Contributor

Does having to do foo$column to select columns downstream pose problems for epipredict workflows in a way that foo_column doesn't or is it just unexpected?

As for the docs issue: this is mentioned in the Details section of the slide docs here.

, fn_returning_a_data_frame(.x) will unpack the output of the function into multiple columns in the result.

Named expressions evaluating to data frames will be placed into tidyr::packed columns.

Though admittedly this information is both:

  • buried in the already-extremely dense epi_slide documentation
  • is quite terse and probably doesn't make sense to people who aren't familiar with packing and unpacking (which was me until about a month ago when I dove into the slide changes)

We've been in maintenance mode wrt slide docs, just making sure they aren't incorrect, through the recent migrations, but we really need to take a pass through it for readability for people who are not already experts with tidyr::pack and other tidyselect features.

@dsweber2
Copy link
Contributor Author

dsweber2 commented Oct 3, 2024

yes, they came up with a pretty way to print it, but nested tibbles are not even dplyr-friendly. If you're e.g. listing the column names, you only get the top level, and not foo$column

> ex <- tibble(a = c(1,2), b = tibble(c=c(1,3),d=c(1,5)))
> ex
# A tibble: 2 × 2
      a   b$c    $d
  <dbl> <dbl> <dbl>
1     1     1     1
2     2     3     5
> names(ex)
[1] "a" "b"
> ex %>% select(ends_with("c"))
# A tibble: 2 × 0

As far as most functions are concerned, there are only 2 columns. Since recipes relies on column logic quite a bit, it can't handle nested tibbles.

@dsweber2
Copy link
Contributor Author

dsweber2 commented Oct 3, 2024

This issue is relatively low priority because it's possible to route around by just naming things yourself, either in the function or using rename, but we should probably avoid the nesting eventually.

@dshemetov
Copy link
Contributor

dshemetov commented Oct 3, 2024

Ah I see, so it creates a nested structure which doesn't seem to be well-supported by tidyselect. @brookslogan do you know of a workaround for this? I think the intention of removing names_sep and nesting like this was to avoid name collisions, not sure if we took into account downstream limitations of recipes.

@brookslogan
Copy link
Contributor

brookslogan commented Oct 3, 2024

-- Context --

This behavior was intentional. My goal was to

  • avoid the name prefix spam situations people got into with epix_slide for pseudoprospective forecasting; removing prefixing altogether seemed like one way
    • (& avoid some column name conflicts encountered when trying to use a simple workaround to the prefix spam, having slide comp output a tibble containing a list containing a tibble to create a list-of-tibbles column, and manually unnesting that list-of-tibbles column [this was just part of the workaround to avoid epi_slide broadcasting] as_list_col = TRUE, renaming, unnest-ing, since epix_slide(......, names_sep = NULL) would complain)
  • make out = f(.x) consistent with ~ f(.x), .new_col_name = "out", and consistent with tidyverse; this meant packed columns; try tibble(x = 1, y = tibble(x2 = x + x))
  • make .new_col_name = "myname" output a column with the name myname

Tidyeval here is following data-masking (used by mutate etc.), not tidyselect (used by epi_slide_opt, across).

-- My guess at tidyverse context --

yes, they came up with a pretty way to print it, but nested tibbles are not even dplyr-friendly. If you're e.g. listing the column names, you only get the top level, and not foo$column

I'm pretty sure this is intended behavior. It's meant to hide/bundle those columns.

As far as most functions are concerned, there are only 2 columns. Since recipes relies on column logic quite a bit, it can't handle nested tibbles.

Packed tibbles seem like they were meant to be a solution to a lot of such naming issues in tidyverse, though maybe recipes doesn't use this solution (and allowing for multiple roles is more flexible than packing). Instead of relying on naming schemes & patterns to maybe select the intended set of columns, you can pack all the relevant columns into one bundle when you generate them & reference them via the packed col with the friendly name. Or you can pack a column or columns with variable names into a column with a fixed name, rather than using/struggling with !! :=.

Tidyverse's solution to outputting new columns based on old ones + suffixes is by using across; you can use that with a little difficulty in slides.

Tidyverse's solution to prefixes is via an unpack function (or in across there's alternatively a .unpack arg); you could use that naturally with slides; slide_result %>% unpack(your_col_name, names_sep = "_"). (unnest also has this if you are working with a list of tibbles)

-- Workarounds --

  • unpack for prefixes; see above.
  • across for suffixes; see below.
  • add one (non-tibble) column at a time.
  • fully name columns inside computation and return a tibble, don't provide .new_col_name.

-- Idea --

  • We could add prefixing/suffixing features onto slides as an alternative to .new_col_name. Suffixes really seem to be what we want for epi_slide_opt.
  • We could try to support across more natively in slides; right now you'd have to do something like .f = ~ .x %>% dplyr::reframe(across(your_col_names, list(your_func_name = your_func)))} [and it's probably super slow if your_func is very fast (and your group x time windows are very small)]
  • We could add some sort of .renamer arg as an alternative to .new_col_name. Or maybe something like .names and/or .unpack in across(), though dplyr doesn't have those for verbs more comparable to slides like mutate and summarize, so there might be a good reason not to.

@brookslogan brookslogan changed the title .f with multi-line output and non-null new_col_name results in nested tibbles .f with multi-line output and non-null new_col_name results in packed tibbles Nov 12, 2024
@brookslogan brookslogan changed the title .f with multi-line output and non-null new_col_name results in packed tibbles .f with tibble output and non-null new_col_name results in packed tibble column Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants