`.f` with tibble output and non-null `new_col_name` results in packed tibble column #534

dsweber2 · 2024-10-01T16:25:59Z

@dshemetov this is the issue I was talking about. I'm going to just route around it by not using step_epi_slide for now.

taking test_data from test_epi_slide.R

ress <- epi_slide(
  test_data,
  .window_size = 4,
  .f = function(slice, geo_key, ref_time_value) {
    tibble(a = mean(slice[["value"]]), b = median(slice[["value"]]))
  },
  .new_col_name = "foo"
)
ress %>% glimpse
Rows: 89
Columns: 4
Groups: geo_value [3]
$ geo_value  <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",…
$ time_value <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05…
$ value      <dbl> 0, 1, 4, 9, 16, 25, 36, 49, 64, 100, 121, 144, 169, 196, 22…
$ foo        <tibble[,2]> <tbl_df[26 x 2]>

ress
An `epi_df` object, 89 x 4 with metadata:
* geo_type  = custom
* time_type = day
* as_of     = 2020-01-31

# A tibble: 89 × 4
# Groups:   geo_value [3]
   geo_value time_value value foo$a    $b
 * <chr>     <date>     <dbl> <dbl> <dbl>
 1 a         2020-01-01     0  NA    NA  
 2 a         2020-01-02     1  NA    NA

with the case where new_col_name is NULL

ress2 <- epi_slide(
  test_data,
  .window_size = 4,
  .f = function(slice, geo_key, ref_time_value) {
    tibble(a = mean(slice[["value"]]), b = median(slice[["value"]]))
  }
)

ress2 %>% glimpse
Rows: 89
Columns: 5
Groups: geo_value [3]
$ geo_value  <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",…
$ time_value <date> 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05…
$ value      <dbl> 0, 1, 4, 9, 16, 25, 36, 49, 64, 100, 121, 144, 169, 196, 22…
$ a          <dbl> NA, NA, NA, 3.5, 7.5, 13.5, 21.5, 31.5, 43.5, NA, NA, NA, 1…
$ b          <dbl> NA, NA, NA, 2.5, 6.5, 12.5, 20.5, 30.5, 42.5, NA, NA, NA, 1…

I ran into this in step_epi_slide, where there is always a column name added, and where nested tibbles don't interact nicely with recipes. I don't know if this is intended behavior, but I think it would probably be better to be consistent about always appending normal-ish row data. in the case of a name being specified, I was expecting "fooa" and "foob", or maybe "foo_a" and "foo_b".

Also, the docs are thankfully pretty precise about the expected input of .f. They are not, however, very precise about the expected output.

The text was updated successfully, but these errors were encountered:

dshemetov · 2024-10-02T23:35:10Z

Does having to do foo$column to select columns downstream pose problems for epipredict workflows in a way that foo_column doesn't or is it just unexpected?

As for the docs issue: this is mentioned in the Details section of the slide docs here.

, fn_returning_a_data_frame(.x) will unpack the output of the function into multiple columns in the result.

Named expressions evaluating to data frames will be placed into tidyr::packed columns.

Though admittedly this information is both:

buried in the already-extremely dense epi_slide documentation
is quite terse and probably doesn't make sense to people who aren't familiar with packing and unpacking (which was me until about a month ago when I dove into the slide changes)

We've been in maintenance mode wrt slide docs, just making sure they aren't incorrect, through the recent migrations, but we really need to take a pass through it for readability for people who are not already experts with tidyr::pack and other tidyselect features.

dsweber2 · 2024-10-03T15:36:02Z

yes, they came up with a pretty way to print it, but nested tibbles are not even dplyr-friendly. If you're e.g. listing the column names, you only get the top level, and not foo$column

> ex <- tibble(a = c(1,2), b = tibble(c=c(1,3),d=c(1,5)))
> ex
# A tibble: 2 × 2
      a   b$c    $d
  <dbl> <dbl> <dbl>
1     1     1     1
2     2     3     5
> names(ex)
[1] "a" "b"
> ex %>% select(ends_with("c"))
# A tibble: 2 × 0

As far as most functions are concerned, there are only 2 columns. Since recipes relies on column logic quite a bit, it can't handle nested tibbles.

dsweber2 · 2024-10-03T15:39:19Z

This issue is relatively low priority because it's possible to route around by just naming things yourself, either in the function or using rename, but we should probably avoid the nesting eventually.

dshemetov · 2024-10-03T18:32:01Z

Ah I see, so it creates a nested structure which doesn't seem to be well-supported by tidyselect. @brookslogan do you know of a workaround for this? I think the intention of removing names_sep and nesting like this was to avoid name collisions, not sure if we took into account downstream limitations of recipes.

brookslogan · 2024-10-03T21:51:38Z

-- Context --

This behavior was intentional. My goal was to

avoid the name prefix spam situations people got into with epix_slide for pseudoprospective forecasting; removing prefixing altogether seemed like one way
- (& avoid some column name conflicts encountered when trying to use a simple workaround to the prefix spam, having ~~slide comp output a tibble containing a list containing a tibble to create a list-of-tibbles column, and manually unnesting that list-of-tibbles column~~ [this was just part of the workaround to avoid epi_slide broadcasting] as_list_col = TRUE, renaming, unnest-ing, since epix_slide(......, names_sep = NULL) would complain)
make out = f(.x) consistent with ~ f(.x), .new_col_name = "out", and consistent with tidyverse; this meant packed columns; try tibble(x = 1, y = tibble(x2 = x + x))
make .new_col_name = "myname" output a column with the name myname

Tidyeval here is following data-masking (used by mutate etc.), not tidyselect (used by epi_slide_opt, across).

-- My guess at tidyverse context --

yes, they came up with a pretty way to print it, but nested tibbles are not even dplyr-friendly. If you're e.g. listing the column names, you only get the top level, and not foo$column

I'm pretty sure this is intended behavior. It's meant to hide/bundle those columns.

As far as most functions are concerned, there are only 2 columns. Since recipes relies on column logic quite a bit, it can't handle nested tibbles.

Packed tibbles seem like they were meant to be a solution to a lot of such naming issues in tidyverse, though maybe recipes doesn't use this solution (and allowing for multiple roles is more flexible than packing). Instead of relying on naming schemes & patterns to maybe select the intended set of columns, you can pack all the relevant columns into one bundle when you generate them & reference them via the packed col with the friendly name. Or you can pack a column or columns with variable names into a column with a fixed name, rather than using/struggling with !! :=.

Tidyverse's solution to outputting new columns based on old ones + suffixes is by using across; you can use that with a little difficulty in slides.

Tidyverse's solution to prefixes is via an unpack function (or in across there's alternatively a .unpack arg); you could use that naturally with slides; slide_result %>% unpack(your_col_name, names_sep = "_"). (unnest also has this if you are working with a list of tibbles)

-- Workarounds --

unpack for prefixes; see above.
across for suffixes; see below.
add one (non-tibble) column at a time.
fully name columns inside computation and return a tibble, don't provide .new_col_name.

-- Idea --

We could add prefixing/suffixing features onto slides as an alternative to .new_col_name. Suffixes really seem to be what we want for epi_slide_opt.
We could try to support across more natively in slides; right now you'd have to do something like .f = ~ .x %>% dplyr::reframe(across(your_col_names, list(your_func_name = your_func)))} [and it's probably super slow if your_func is very fast (and your group x time windows are very small)]
We could add some sort of .renamer arg as an alternative to .new_col_name. Or maybe something like .names and/or .unpack in across(), though dplyr doesn't have those for verbs more comparable to slides like mutate and summarize, so there might be a good reason not to.

dsweber2 added bug Something isn't working enhancement New feature or request labels Oct 1, 2024

brookslogan mentioned this issue Oct 14, 2024

Add .suffix (and .prefix?) arguments to slides #542

Open

brookslogan changed the title ~~.f with multi-line output and non-null new_col_name results in nested tibbles~~ .f with multi-line output and non-null new_col_name results in packed tibbles Nov 12, 2024

brookslogan changed the title ~~.f with multi-line output and non-null new_col_name results in packed tibbles~~ .f with tibble output and non-null new_col_name results in packed tibble column Nov 12, 2024

brookslogan mentioned this issue Nov 22, 2024

Adjust epi_slide_opt output naming; adjust epi_slide* autogrouping; fix tidyselect issue #564

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`.f` with tibble output and non-null `new_col_name` results in packed tibble column #534

`.f` with tibble output and non-null `new_col_name` results in packed tibble column #534

dsweber2 commented Oct 1, 2024 •

edited

Loading

dshemetov commented Oct 2, 2024

dsweber2 commented Oct 3, 2024 •

edited

Loading

dsweber2 commented Oct 3, 2024

dshemetov commented Oct 3, 2024 •

edited

Loading

brookslogan commented Oct 3, 2024 •

edited

Loading

.f with tibble output and non-null new_col_name results in packed tibble column #534

.f with tibble output and non-null new_col_name results in packed tibble column #534

Comments

dsweber2 commented Oct 1, 2024 • edited Loading

dshemetov commented Oct 2, 2024

dsweber2 commented Oct 3, 2024 • edited Loading

dsweber2 commented Oct 3, 2024

dshemetov commented Oct 3, 2024 • edited Loading

brookslogan commented Oct 3, 2024 • edited Loading

`.f` with tibble output and non-null `new_col_name` results in packed tibble column #534

`.f` with tibble output and non-null `new_col_name` results in packed tibble column #534

dsweber2 commented Oct 1, 2024 •

edited

Loading

dsweber2 commented Oct 3, 2024 •

edited

Loading

dshemetov commented Oct 3, 2024 •

edited

Loading

brookslogan commented Oct 3, 2024 •

edited

Loading