Skip to content

Make epix_slide more like group_modify #311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
May 30, 2023
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
33 commits
Select commit Hold shift + click to select a range
d6b2639
Clarify archive grouping-related documentation
lcbrooks Mar 17, 2023
9a092af
Update epix_slide code to have `reframe`-like row behavior
lcbrooks Mar 16, 2023
d2edb2e
Add missing `grouped_epi_archive` test file
lcbrooks Mar 20, 2023
bf99f30
Update `.drop`+`epix_slide` given `reframe`-like `epix_slide`
lcbrooks Mar 20, 2023
74631af
If `epix_slide` outputs an `epi_df`, use `versions_end` for `as_of`
lcbrooks Mar 20, 2023
8039ed3
Update checks to also run on PRs (but not pushes) to `dev`
lcbrooks Mar 21, 2023
9d10b5f
Fix missing quotes in _pkgdown.yml
lcbrooks Mar 24, 2023
b8f768a
Update `epi[x]_slide` comparisons given epix like `reframe`
lcbrooks Mar 28, 2023
953717f
Remove `all_rows` for `epix_slide`; add deprecation tests
lcbrooks Mar 28, 2023
cb195c2
Always output ungrouped, no-metadata tibble from `epix_slide`, `as_ti…
lcbrooks May 4, 2023
b7a07e4
Refactor `ref_time_values` calc, test 0-row `f` outputs in `epix_slide`
lcbrooks May 4, 2023
c06f3ee
Fix DESCRIPTION version, update NEWS.md with `epix_slide` changes
lcbrooks May 4, 2023
b02b3a1
Change `epix_slide` to be more analogous to `group_modify`
lcbrooks May 4, 2023
442f3e9
Fix `epix_slide` `as_list_col=TRUE` outputting df-type col not list
lcbrooks May 5, 2023
e440a18
Add `epix_slide`-like-`group_modify` migration notes in NEWS.md
lcbrooks May 5, 2023
134b47c
Fixm roxygen copy-paste error for `group_modify`
lcbrooks May 5, 2023
a4baf7a
Address non-ASCII character issue from `check()`
lcbrooks May 5, 2023
6d0f060
Fix `guess_period` `@param` names
lcbrooks May 5, 2023
21a090b
Fix `as_tibble.epi_df` docs, make it have its own topic
lcbrooks May 5, 2023
6efd823
Fix `group_modify.epi_df` roxygen typo + missing `@param`s
lcbrooks May 5, 2023
a047b07
repl(epix_slide): describe migration in `all_rows` deprecation
lcbrooks May 19, 2023
26e9d2c
refactor: favor validation section over validation + `else if`
lcbrooks May 19, 2023
dbb658c
Fix epix_slide warning class naming inconsistencies
lcbrooks May 19, 2023
af49aae
Fix missing word in `advanced.Rmd` basic usage description
lcbrooks May 19, 2023
60ada45
Remove remaining references to `comp_effective_key_vars`
lcbrooks May 24, 2023
80d29c6
Add missing `epix_slide` tests for {fn,~,tidy} x {all_versions,not}
lcbrooks May 24, 2023
f70b911
Reword advanced.Rmd on epi vs epix slide number of output rows
lcbrooks May 24, 2023
de398b1
Track missing .Rd file
lcbrooks May 24, 2023
8985db2
Make `as_list_col=TRUE` consistent for vecs and dfs from slide comps
lcbrooks May 24, 2023
2014016
Merge branch 'dev' into lcb/make_epix_slide_more_like_reframe
nmdefries May 24, 2023
9ba843c
Fix `all_rows = TRUE` to work with move to `vctrs`
lcbrooks May 27, 2023
cf2ad2f
Update tests&NEWS.md about `epi_slide` `f` Date vec output
lcbrooks May 30, 2023
116bae0
Add NEWS.md entry for `as_list_col` `all_rows` `NA` -> `NULL`
lcbrooks May 30, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/R-CMD-check.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ on:
push:
branches: [main, master]
pull_request:
branches: [main, master]
branches: [main, master, dev]

name: R-CMD-check

Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/pkgdown.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ on:
push:
branches: [main, master]
pull_request:
branches: [main, master]
branches: [main, master, dev]
release:
types: [published]
workflow_dispatch:
Expand Down
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: epiprocess
Title: Tools for basic signal processing in epidemiology
Version: 0.6.0
Version: 0.6.0.9999
Authors@R: c(
person("Jacob", "Bien", role = "ctb"),
person("Logan", "Brooks", role = "aut"),
Expand Down Expand Up @@ -61,7 +61,7 @@ Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
RoxygenNote: 7.2.3
Depends:
R (>= 2.10)
URL: https://cmu-delphi.github.io/epiprocess/
Expand Down
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ S3method(as_epi_df,data.frame)
S3method(as_epi_df,epi_df)
S3method(as_epi_df,tbl_df)
S3method(as_epi_df,tbl_ts)
S3method(as_tibble,epi_df)
S3method(as_tsibble,epi_df)
S3method(dplyr_col_modify,col_modify_recorder_df)
S3method(dplyr_col_modify,epi_df)
Expand All @@ -17,6 +18,7 @@ S3method(group_by,epi_archive)
S3method(group_by,epi_df)
S3method(group_by,grouped_epi_archive)
S3method(group_by_drop_default,grouped_epi_archive)
S3method(group_modify,epi_df)
S3method(groups,grouped_epi_archive)
S3method(next_after,Date)
S3method(next_after,integer)
Expand Down Expand Up @@ -97,6 +99,7 @@ importFrom(rlang,sym)
importFrom(rlang,syms)
importFrom(stats,cor)
importFrom(stats,median)
importFrom(tibble,as_tibble)
importFrom(tidyr,unnest)
importFrom(tidyselect,eval_select)
importFrom(tidyselect,starts_with)
Expand Down
16 changes: 16 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,22 @@ Note that `epiprocess` uses the [Semantic Versioning
("semver")](https://semver.org/) scheme for all release versions, but any
inter-release development versions will include an additional ".9999" suffix.

## Breaking changes:

* `epix_slide` has been made more like `dplyr::group_modify`. It will no longer
perform element/row recycling for size stability, accepts slide computation
outputs containing any number of rows, and no longer supports `all_rows`.
* To keep the old behavior, manually perform row recycling within `f`
computations, and/or `left_join` a data frame representing the desired
output structure with the current `epix_slide()` result to obtain the
desired repetitions and completions expected with `all_rows = TRUE`.
* `epix_slide` will only output grouped or ungrouped tibbles. Previously, it
would sometimes output `epi_df`s, but not consistently, and not always with
the metadata desired. Future versions will revisit this design, and consider
more closely whether/when/how to output an `epi_df`.
* To keep the old behavior, convert the output of `epix_slide()` to `epi_df`
when desired and set the metadata appropriately.

# epiprocess 0.6.0

## Breaking changes:
Expand Down
6 changes: 3 additions & 3 deletions R/archive.R
Original file line number Diff line number Diff line change
Expand Up @@ -307,7 +307,7 @@ epi_archive =
Abort("compactify must be boolean or null.")
}

# Apply defaults and conduct checks and apply defaults for
# Apply defaults and conduct checks for
# `clobberable_versions_start`, `versions_end`:
if (missing(clobberable_versions_start)) {
clobberable_versions_start <- NA
Expand Down Expand Up @@ -640,7 +640,7 @@ epi_archive =
slide = function(f, ..., before, ref_time_values,
time_step, new_col_name = "slide_value",
as_list_col = FALSE, names_sep = "_",
all_rows = FALSE, all_versions = FALSE) {
all_versions = FALSE) {
# For an "ungrouped" slide, treat all rows as belonging to one big
# group (group by 0 vars), like `dplyr::summarize`, and let the
# resulting `grouped_epi_archive` handle the slide:
Expand All @@ -649,7 +649,7 @@ epi_archive =
before = before, ref_time_values = ref_time_values,
time_step = time_step, new_col_name = new_col_name,
as_list_col = as_list_col, names_sep = names_sep,
all_rows = all_rows, all_versions = all_versions
all_versions = all_versions
) %>%
# We want a slide on ungrouped archives to output something
# ungrouped, rather than retaining the trivial (0-variable)
Expand Down
162 changes: 54 additions & 108 deletions R/grouped_epi_archive.R
Original file line number Diff line number Diff line change
Expand Up @@ -115,11 +115,11 @@ grouped_epi_archive =
cat("Public `grouped_epi_archive` R6 methods:\n")
grouped_method_names = names(grouped_epi_archive$public_methods)
ungrouped_method_names = names(epi_archive$public_methods)
writeLines(wrap_varnames(initial = " Specialized `epi_archive` methods: ",
writeLines(wrap_varnames(initial = "\u2022 Specialized `epi_archive` methods: ",
intersect(grouped_method_names, ungrouped_method_names)))
writeLines(wrap_varnames(initial = " Exclusive to `grouped_epi_archive`: ",
writeLines(wrap_varnames(initial = "\u2022 Exclusive to `grouped_epi_archive`: ",
setdiff(grouped_method_names, ungrouped_method_names)))
writeLines(wrap_varnames(initial = " `ungroup` to use: ",
writeLines(wrap_varnames(initial = "\u2022 `ungroup` to use: ",
setdiff(ungrouped_method_names, grouped_method_names)))
}
# Return self invisibly for convenience in `$`-"pipe":
Expand Down Expand Up @@ -190,7 +190,11 @@ grouped_epi_archive =
slide = function(f, ..., before, ref_time_values,
time_step, new_col_name = "slide_value",
as_list_col = FALSE, names_sep = "_",
all_rows = FALSE, all_versions = FALSE) {
all_versions = FALSE) {
# Perform some deprecated argument checks without using `<param> =
# deprecated()` in the function signature, because they are from
# early development versions and much more likely to be clutter than
# informative in the signature.
if ("group_by" %in% nse_dots_names(...)) {
Abort("
The `group_by` argument to `slide` has been removed; please use
Expand All @@ -200,12 +204,19 @@ grouped_epi_archive =
this check is a false positive, but you will still need to use a
different column name here and rename the resulting column after
the slide.)
")
", class = "epiprocess__epix_slide_group_by_parameter_deprecated")
}
if ("all_rows" %in% nse_dots_names(...)) {
Abort("
The `all_rows` argument has been removed from `epix_slide` (but
is still supported in `epi_slide`). Since `epix_slide` now
allows any number of rows out of slide computations, it's
unclear how `all_rows=TRUE` should fill in missing results.
", class = "epiprocess__epix_slide_all_rows_parameter_deprecated")
}

if (missing(ref_time_values)) {
versions_with_updates = c(private$ungrouped$DT$version, private$ungrouped$versions_end)
ref_time_values = tidyr::full_seq(versions_with_updates, guess_period(versions_with_updates))
ref_time_values = epix_slide_ref_time_values_default(private$ungrouped)
} else if (length(ref_time_values) == 0L) {
Abort("`ref_time_values` must have at least one element.")
} else if (any(is.na(ref_time_values))) {
Expand Down Expand Up @@ -247,30 +258,14 @@ grouped_epi_archive =
if (! (rlang::is_string(names_sep) || is.null(names_sep)) ) {
Abort("`names_sep` must be a (single) string or NULL.")
}
if (!rlang::is_bool(all_rows)) {
Abort("`all_rows` must be TRUE or FALSE.")
}
if (!rlang::is_bool(all_versions)) {
Abort("`all_versions` must be TRUE or FALSE.")
}

# Each computation is expected to output a data frame with either
# one element/row total or one element/row per encountered
# nongrouping, nontime, nonversion key value. These nongrouping,
# nontime, nonversion key columns can be seen as the "effective" key
# of the computation; the computation might return an object that
# reports a different key or no key, but the "effective" key should
# still be a valid unique key for the data, and is something that we
# could use even with `.keep = FALSE`.
comp_effective_key_vars =
setdiff(key(private$ungrouped$DT),
c(private$vars, "time_value", "version"))

# Computation for one group, one time value
comp_one_grp = function(.data_group, .group_key,
f, ...,
ref_time_value,
comp_effective_key_vars,
new_col) {
# Carry out the specified computation
comp_value = f(.data_group, .group_key, ...)
Expand All @@ -282,77 +277,17 @@ grouped_epi_archive =
.data_group = .data_group$DT
}

# Calculate the number of output elements/rows we expect the
# computation to output: one per distinct "effective computation
# key variable" value encountered in the input.
#
# Note: this mirrors how `epi_slide` does things if we're using
# unique keys, but can diverge if using nonunique keys. The
# `epi_slide` approach of counting occurrences of the
# `ref_time_value` in the `time_value` column, which helps lines
# up the computation results with corresponding rows of the
# input data, wouldn't quite apply here: we'd want to line up
# with rows (from the same group) with `version` matching the
# `ref_time_value`, but would still need to summarize these rows
# somehow and drop the `time_value` input column, but this
# summarization requires something like a to-be-unique output
# key to determine a sensible number of rows to output (and the
# contents of those rows).
count =
if (length(comp_effective_key_vars) != 0L) {
comp_effective_key_vals_in_comp_input =
if (data.table::is.data.table(.data_group)) {
.data_group[, comp_effective_key_vars, with=FALSE]
} else {
.data_group[, comp_effective_key_vars]
}
sum(!duplicated(comp_effective_key_vals_in_comp_input))
} else {
# Same idea as above, but accounting for `duplicated` working
# differently (outputting `logical(0)`) on 0-column inputs
# rather than matching the number of rows. (Instead, we use
# the same count we would get if we were counting distinct
# values of a column defined as `rep(val, target_n_rows)`.)
if (nrow(.data_group) == 0L) {
0L
} else {
1L
}
}

# If we get back an atomic vector
if (is.atomic(comp_value)) {
if (length(comp_value) == 1) {
comp_value = rep(comp_value, count)
}
# If not a singleton, should be the right length, else abort
else if (length(comp_value) != count) {
Abort('If the slide computation returns an atomic vector, then it must have either (a) a single element, or (b) one element per distinct combination of key variables, excluding the `time_value`, `version`, and grouping variables, that is present in the first argument to the computation.')
}
}

# If we get back a data frame
else if (is.data.frame(comp_value)) {
if (nrow(comp_value) == 1) {
comp_value = rep(list(comp_value), count)
}
# If not a single row, should be the right length, else abort
else if (nrow(comp_value) != count) {
Abort("If the slide computation returns a data frame, then it must have a single row, or else one row per appearance of the reference time value in the local window.")
}
# Make into a list
else {
comp_value = split(comp_value, seq_len(nrow(comp_value)))
}
}

# If neither an atomic vector data frame, then abort
else {
if (! (is.atomic(comp_value) || is.data.frame(comp_value))) {
Abort("The slide computation must return an atomic vector or a data frame.")
}

if (is.data.frame(comp_value)) {
# Wrap in a list so that we get a list-type col rather than a
# data.frame-type col when `as_list_col = TRUE`:
comp_value <- list(comp_value)
}

# Label every result row with the `ref_time_value`:
return(tibble::tibble(time_value = rep(.env$ref_time_value, count),
return(tibble::tibble(time_value = .env$ref_time_value,
!!new_col := .env$comp_value))
}

Expand All @@ -374,24 +309,24 @@ grouped_epi_archive =
group_modify_fn = comp_one_grp
} else {
as_of_archive = as_of_raw
# We essentially want to `group_modify` the archive, but don't
# provide an implementation yet. Next best would be
# `group_modify` on its `$DT`, but that has different behavior
# based on whether or not `dtplyr` is loaded. Instead, go
# through a , trying to avoid copies.
# We essentially want to `group_modify` the archive, but
# haven't implemented this method yet. Next best would be
# `group_modify` on its `$DT`, but that has different
# behavior based on whether or not `dtplyr` is loaded.
# Instead, go through an ordinary data frame, trying to avoid
# copies.
if (address(as_of_archive$DT) == address(private$ungrouped$DT)) {
# `as_of` aliased its the full `$DT`; copy before mutating:
as_of_archive$DT <- copy(as_of_archive$DT)
}
dt_key = data.table::key(as_of_archive$DT)
as_of_df = as_of_archive$DT
data.table::setDF(as_of_df)

# Convert each subgroup chunk to an archive before running the calculation.
group_modify_fn = function(.data_group, .group_key,
f, ...,
ref_time_value,
comp_effective_key_vars,
new_col) {
# .data_group is coming from as_of_df as a tibble, but we
# want to feed `comp_one_grp` an `epi_archive` backed by a
Expand All @@ -402,19 +337,17 @@ grouped_epi_archive =
.data_group_archive$DT = .data_group
comp_one_grp(.data_group_archive, .group_key, f = f, ...,
ref_time_value = ref_time_value,
comp_effective_key_vars = comp_effective_key_vars,
new_col = new_col
)
}
}

return(
dplyr::group_by(as_of_df, dplyr::across(tidyselect::all_of(private$vars)),
.drop=private$drop) %>%
dplyr::group_modify(group_modify_fn,
f = f, ...,
ref_time_value = ref_time_value,
comp_effective_key_vars = comp_effective_key_vars,
new_col = new_col,
.keep = TRUE)
)
Expand Down Expand Up @@ -501,13 +434,26 @@ grouped_epi_archive =
if (!as_list_col) {
x = tidyr::unnest(x, !!new_col, names_sep = names_sep)
}

# Join to get all rows, if we need to, then return
if (all_rows) {
cols = c(private$vars, "time_value")
y = unique(private$ungrouped$DT[, ..cols])
x = dplyr::left_join(y, x, by = cols)
}

# if (is_epi_df(x)) {
# # The analogue of `epi_df`'s `as_of` metadata for an archive is
# # `<archive>$versions_end`, at least in the current absence of
# # separate fields/columns denoting the "archive version" with a
# # different resolution, or from the perspective of a different
# # stage of a data pipeline. The `as_of` that is automatically
# # derived won't always match; override:
# attr(x, "metadata")[["as_of"]] <- private$ungrouped$versions_end
# }

# XXX We need to work out when we want to return an `epi_df` and how
# to get appropriate keys (see #290, #223, #163). We'll probably
# need the commented-out code above if we ever output an `epi_df`.
# However, as a stopgap measure to have some more consistency across
# different ways of calling `epix_slide`, and to prevent `epi_df`
# output with invalid metadata, always output a (grouped or
# ungrouped) tibble.
x <- decay_epi_df(x)

return(x)
}
)
Expand Down
Loading