Skip to content

wip: rework slide window args #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 23, 2024
Merged

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Aug 14, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

Prototype of new slide interface. Fixing tests takes a long time, want to make sure we agree on inner logic and deprecation handling before I tackle all that. The core logic may be easier to read outside the diffs, so reproducing here:

  if (!is.null(before) || !is.null(after)) {
    cli_abort("`before` and `after` are deprecated for `epi_slide`. Use `n` instead.")
  }

  # Handle window arguments
  align <- match.arg(align)
  time_type <- attr(x, "metadata")$time_type
  validate_slide_window_arg(n, time_type)
  if (n == Inf) {
    if (align == "left") {
      before <- Inf
      if (time_type %in% c("day", "week")) {
        after <- as.difftime(0, units = glue::glue("{time_type}s"))
      } else {
        after <- 0
      }
    } else {
      cli_abort(
        "`epi_slide`: center and right alignment are not supported with an infinite window size."
      )
    }
  } else {
    if (align == "left") {
      before <- n - 1
      after <- 0
    } else if (align == "center") {
      # For n = 5, before = 2, after = 2. For n = 4, before = 2, after = 1.
      before <- floor(n / 2)
      after <- n - before - 1
    } else if (align == "right") {
      before <- 0
      after <- n - 1
    }
  }

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@brookslogan
Copy link
Contributor

brookslogan commented Aug 14, 2024

[Made checkboxes for completion/wontdo status.]

Snippet above looks pretty good; I haven't looked at actual diffs.

  • issue: I think right & left align cases are backward. The terminology in frollmean & others is right = trailing.
  • [musing: should we be requiring/allowing clearer names than right/center/left?]
  • note: not currently shown in sketch above, but default align should be trailing (say, align = c("right", "center", "left")).
  • todo: hear back from dajmcdon and/or search docs for whether we want/need to support non-right/center/left alignments.
  • partially separate issue: adding n back as an arg would mean going to be back to having a naming conflict if a user tries to pass an n arg to their computation and/or use data-masking tidyeval to straightforwardly define a column with name n. In older epiprocess, ryantibs and I separately ran into one/both of these, and at least one of these cases had a very confusing error message. I'm not sure if we want to try to incrementally move toward dot-prefixing arg names here, then follow up with dot-prefixing in epix_slide separately. Or see if it's possible to use other approaches to avoid the [confusing error messages on] n name conflicts; I think that's the main one to be worried about. (epi_slide_opt wouldn't need & wouldn't tidyverse-idiomatically use dot prefixes, as it's not passing tidyeval through ... [but it is forwarding ... to methods].)

misc. suggests:

  • identical for comparisons with Inf, like in existing code
  • rlang::arg_match instead of match.arg for better error messages, if you also agree that partial matches (e.g., "c" for "center") should be forbidden
  • maybe comment reminding us what kinds of things n can be somewhere? e.g., I initially forgot it could be a difftime when reading through the "center" calc (it does seem to work properly with difftimes).

@dajmcdon
Copy link
Contributor

Responding to @brookslogan's questions above:

  • I can't really think of a reason for non-left/center/right alignment. Though as soon as we drop support, one may appear...
  • On the n naming issue, the dot-prefix is an option. Also things like window_size or n_timesteps would be OK. I think that the issue we want to solve is the mental overhead of thinking "7 day trailing window" and then having to put before = 6. But maybe someone else really wants n specifically?

@dshemetov
Copy link
Contributor Author

I started thinking "left window" rather than "align on window's left side", oops. I think right/center/left names are fine. I thought about suggesting n change to window_size, but I don't want to bikeshed these details. Adding dot prefixes to slide arguments is also reasonable, that's the suggestion made in the tidy design guide.

@dshemetov
Copy link
Contributor Author

  • Went ahead a converted all match.arg to rlang::arg_match. Big fan of disabling partial matching in both arguments and values.
  • Went ahead and prepended all arguments with . and setup a clear error message if the old ones are used.
  • Addressed all other points, switched .n to .window_size and added documentation about what values is takes.

@brookslogan
Copy link
Contributor

brookslogan commented Aug 15, 2024

musing: Sorry for all the trouble, I'm not sure dot prefixing / name conflict stuff was actually necessary here after all. I think we do get a small benefit from it though if it is included.

  • idea: if there are lingering problems from including it in this PR, seems like we can delay it some more since I'm having trouble even remembering/reproducing the old issue. ("n" happens not to be the best symbol to try to search for ever.)
  • musing: The root cause in the past might(??) have been the fact that we had an n = 7 default at the time. But dot-prefixing/rare-naming in this PR does prevent some weird messages if someone tried to use n/window_size in ... and forgot that they needed to provide the window size to epi_slide. (I haven't been able to track down what the original thing that actually caused the issues Ryan(?) and I encountered was. I'm starting to suspect the original issues were maybe in epix_slide(n =) which wouldn't be an issue presently anyway.)

thoughts: re. naming: ... I don't know if these are bikeshedding or load-bearing signage. With epi_slide no longer closely coupled with epix_slide interface, it's more likely to be bikeshedding than before. [And, since we require this to be provided, and provided by name, it'll be an annoying and/or breaking change to rename it.]

  • I expect ryantibs would strongly favor n and would accept .n.
  • n is quite popular in a small set of packages I tested; this could be good or bad for n/.n but I don't have the energy to analyze it right now
reprex: grabbing some arg naming info
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tibble(pkgname = c("dplyr", "tidyr", "slider", "zoo", "xts", "data.table")) %>%
  mutate(member_value = lapply(pkgname, function(pkgname) as.list(asNamespace(pkgname)))) %>%
  mutate(member_name = lapply(member_value, names)) %>%
  tidyr::unchop(c(member_name, member_value)) %>%
  mutate(argname = lapply(member_value, function(x) {
    if (is.function(x)) {
      formalArgs(args(x))
    } else {
      character(0L)
    }
  }), member_value = NULL) %>%
  tidyr::unchop(argname) %>%
  filter(argname %in% c("n", ".n", "window", ".window", "window_size", ".window_size")) %>%
  print(n = 100L)
#> # A tibble: 74 × 3
#>    pkgname    member_name                argname
#>    <chr>      <chr>                      <chr>  
#>  1 dplyr      slice_min.data.frame       n      
#>  2 dplyr      vec_as_location_invert     n      
#>  3 dplyr      check_slice_unnamed_n_prop n      
#>  4 dplyr      sample_int                 n      
#>  5 dplyr      check_size                 n      
#>  6 dplyr      dplyr_new_data_frame       n      
#>  7 dplyr      shift_c                    n      
#>  8 dplyr      shift                      n      
#>  9 dplyr      check_slice_n_prop         n      
#> 10 dplyr      minimal_names              n      
#> 11 dplyr      ntile                      n      
#> 12 dplyr      slice_max                  n      
#> 13 dplyr      slice_tail                 n      
#> 14 dplyr      ntext                      n      
#> 15 dplyr      get_slice_size             n      
#> 16 dplyr      fmt_check_length_val       n      
#> 17 dplyr      progress_estimated         n      
#> 18 dplyr      slice_min                  n      
#> 19 dplyr      slice_tail.data.frame      n      
#> 20 dplyr      random_table_name          n      
#> 21 dplyr      shift_slice                n      
#> 22 dplyr      slice_max.data.frame       n      
#> 23 dplyr      nth                        n      
#> 24 dplyr      slice_sample               n      
#> 25 dplyr      lag                        n      
#> 26 dplyr      lead                       n      
#> 27 dplyr      top_n_rank                 n      
#> 28 dplyr      top_frac                   n      
#> 29 dplyr      slice_head                 n      
#> 30 dplyr      slice_sample.data.frame    n      
#> 31 dplyr      slice_head.data.frame      n      
#> 32 dplyr      top_n                      n      
#> 33 dplyr      check_length_val           n      
#> 34 dplyr      last_dplyr_warnings        n      
#> 35 tidyr      str_split_fixed            n      
#> 36 slider     vec_init_unspecified_list  n      
#> 37 slider     compute_from               n      
#> 38 slider     compute_to                 n      
#> 39 zoo        scale_y_yearqtr            n      
#> 40 zoo        scale_x_yearmon            n      
#> 41 zoo        make.par.list              n      
#> 42 zoo        yearqtr_trans              n      
#> 43 zoo        head.zoo                   n      
#> 44 zoo        scale_y_yearmon            n      
#> 45 zoo        yearmon_trans              n      
#> 46 zoo        scale_x_yearqtr            n      
#> 47 zoo        tail.zoo                   n      
#> 48 xts        shift.time                 n      
#> 49 xts        last.default               n      
#> 50 xts        naCheck                    n      
#> 51 xts        align.time.xts             n      
#> 52 xts        align.time.POSIXct         n      
#> 53 xts        shift.time.xts             n      
#> 54 xts        first.xts                  n      
#> 55 xts        last.xts                   n      
#> 56 xts        align.time.POSIXlt         n      
#> 57 xts        first.default              n      
#> 58 data.table froll                      n      
#> 59 data.table gtail                      n      
#> 60 data.table frollsum                   n      
#> 61 data.table first                      n      
#> 62 data.table g[                         n      
#> 63 data.table gshift                     n      
#> 64 data.table last                       n      
#> 65 data.table cedta                      n      
#> 66 data.table shift                      n      
#> 67 data.table head.data.table            n      
#> 68 data.table setalloccol                n      
#> 69 data.table alloc.col                  n      
#> 70 data.table frollapply                 n      
#> 71 data.table tail.data.table            n      
#> 72 data.table ghead                      n      
#> 73 data.table g[[                        n      
#> 74 data.table frollmean                  n

Created on 2024-08-15 with reprex v2.1.1

  • [.window_size is more accurate/evocative for current suboptimal epi_slide, while n / .n would be more accurate/evocative for epi_slide_{sum,mean,opt}]

  • I thought some time ago we found some Python/Rust library with names we liked for this and for n_recent?

  • tidyeval guidelines probably don't apply to current epi_slide_opt since we aren't forwarding or tidyevaluating ... [on second look, they do seem to apply since we are forwarding ... to the slider etc. functions] (and they have dplyr::slice_head(n =) for example)... but it seems like it'd be nice for our ease of documentation and maybe also also better for users if we did consider including it in any impending dot prefixing PRs, for a more consistent set of arg names.

  • todo: resolve current mix of .n and .window_size in current PR (I think .window_size is current arg name, but .n is in an error message somewhere).

  • todo: in epipredict and any other important downstream code, pin dependency on epiprocess until we work through any breaking changes (before/after hard deprecation, dot prefixing if we want to follow up with more such work and implement it as a breaking change, etc.)

@dshemetov
Copy link
Contributor Author

dshemetov commented Aug 15, 2024

My personal take is that these names don't matter as much as clear documentation around them and we should spend our very finite research time elsewhere (though I appreciate the thoroughness you're bringing to this question Logan). I'm not going to weigh this discussion down with more opinions about n or window_size, I think we should pick one and not spend more time here.

This PR fixes the key window arithmetic issues Dan pointed out and incrementally moves this package towards a tidy design best practice. I can address the todos above and I propose we move ahead.

As for "clear documentation" mentioned above, I hope to address that point with our renewed effort on documentation in the coming months.

Copy link
Contributor

@brookslogan brookslogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Browsing review requests and wasn't sure if this was still wip or not. Just dropped a few comments but realized it's probably still wip. If it's ready, please re-request!

@dshemetov dshemetov changed the base branch from dev to lcb/slide-improvements-2024-06 August 22, 2024 17:22
@dshemetov
Copy link
Contributor Author

Updated all the documentation that I could find, including vignettes. I'm sure there will be things I missed. Will fix issues as I spot them on the site in the documentation pass coming up.

@dshemetov dshemetov merged commit de948b5 into lcb/slide-improvements-2024-06 Aug 23, 2024
2 checks passed
@dshemetov dshemetov deleted the ds/slide branch August 23, 2024 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

epi_slide and epix_slide args
3 participants