Skip to content

Ndefries/autocompletion slide #415

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,9 @@ importFrom(stats,median)
importFrom(tibble,as_tibble)
importFrom(tibble,new_tibble)
importFrom(tibble,validate_tibble)
importFrom(tidyr,complete)
importFrom(tidyr,expand)
importFrom(tidyr,nesting)
importFrom(tidyr,unnest)
importFrom(tidyselect,eval_select)
importFrom(tidyselect,starts_with)
Expand Down
103 changes: 96 additions & 7 deletions R/slide.R
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,15 @@
#' the missing marker is a `NULL` entry in the list column; for certain
#' operations, you might want to replace these `NULL` entries with a different
#' `NA` marker.
#' @param autocomplete_windows Should input data be completed so that each
#' window that the computation is applied to is guaranteed to have `before +
#' after + 1` (`n`) observations per epikey combination? Defaults to `TRUE`.
#' Turning this off makes it very easy to write an erroneous 7-day average
#' or 7-day sum. For example, if we `slide` the natural sum over ungrouped
#' data with `m` epikey combinations, we get the sum of less than `n * m`
#' things for the first `n - 1` time_values and wherever there's a gap in
#' availability (e.g. a missing geo for one or more dates).
#' @param fill as in [`tidyr::complete`].
#' @return An `epi_df` object given by appending a new column to `x`, named
#' according to the `new_col_name` argument.
#'
Expand Down Expand Up @@ -124,6 +133,8 @@
#' @importFrom lubridate days weeks
#' @importFrom dplyr bind_rows group_vars filter select
#' @importFrom rlang .data .env !! enquo enquos sym env missing_arg
#' @importFrom tidyr complete nesting expand
#' @importFrom checkmate assert_list
#' @export
#' @seealso [`epi_slide_mean`]
#' @examples
Expand Down Expand Up @@ -168,9 +179,21 @@
epi_slide <- function(x, f, ..., before, after, ref_time_values,
time_step,
new_col_name = "slide_value", as_list_col = FALSE,
names_sep = "_", all_rows = FALSE) {
names_sep = "_", all_rows = FALSE,
autocomplete_windows = TRUE, fill = list()) {
assert_class(x, "epi_df")

if (!autocomplete_windows) {
Warn(
c(
"Turning off `autocomplete_windows` makes it very easy to write an
erroneous 7-day average or 7-day sum that uses less than the expected
`n` observations per window."
),
class = "epiprocess__epi_slide__autocomplete_off"
)
}

if (missing(ref_time_values)) {
ref_time_values <- unique(x$time_value)
} else {
Expand Down Expand Up @@ -215,6 +238,66 @@ epi_slide <- function(x, f, ..., before, after, ref_time_values,
after <- 0L
}

if (autocomplete_windows) {
assert_list(fill, names = "unique")
fill_vars_not_in_x <- !(names(fill) %in% colnames(x))
if (any(fill_vars_not_in_x)) {
Warn(
c(
"Some names provided in `fill` do not correspond to column names in
the input data. These will be ignored."
),
class = "epiprocess__epi_slide__fill_vars_not_in_x",
epiprocess__fill = fill,
epiprocess__colnames_x = colnames(x),
epiprocess__fill_vars_not_in_x = fill_vars_not_in_x
)
}

key_cols <- key_colnames(x)
maybe_first_duplicate_key_row_index <- anyDuplicated(x, by = key_cols)
if (maybe_first_duplicate_key_row_index != 0L) {
Abort("`x` must have one row per unique combination of the key variables to use window completion. If you have additional key variables other than `geo_value`, `time_value`, and `version`, such as an age group column, please specify them in `other_keys`. Otherwise, check for duplicate rows and/or conflicting values for the same measurement.",
class = "epiprocess__epi_slide__epi_df_requires_unique_key_for_completion"
)
}

# Make a complete date sequence between min(x$time_value) and max
# (x$time_value), plus pad values. We need to include pad dates in
# the complete date sequence so that the first and last n - 1
# ref_time_values have a complete window to use.
date_seq_list <- full_date_seq(x, before, after, time_step)
all_dates <- c(
date_seq_list$pad_early_dates,
date_seq_list$all_dates,
date_seq_list$pad_late_dates
Comment on lines +271 to +273
Copy link
Contributor Author

@nmdefries nmdefries Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_date_seq returns all_dates and pad dates separately to support current epi_slide_mean implementation, but epi_slide_mean completion can be changed (combine all_dates and pad dates upfront, with or without switching to tidyr::complete) to simplify the bit here.

)

# A helper column marking real observations.
x$.real <- TRUE
# Fill the `.real` column with `FALSE` for rows added during completion.
fill$.real <- FALSE
Copy link
Contributor

@brookslogan brookslogan Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is .real handling still computationally significant? Wonder if in autocomplete case it could just go away. Maybe an optimization to pursue in a later version.

Copy link
Contributor Author

@nmdefries nmdefries Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Individual calls like this are quite fast. The prior .real handling was in f_wrapper around the user-passed computation; it was slow in aggregate because it got called a ton. This should be pretty fast since it only happens once.

I'm curious what the alternative to the .real col would be, if we still plan on dropping completed rows.


key_cols_no_time <- kill_time_value(key_cols)

# Add in rows for each present key column combination for all dates.
# - This happens within each group if the data is grouped. This
# doesn't change the impact of completion because we apply the
# computation by group as well.
# - If a geo first appears halfway through the dataset, it will be
# completed all the way back to the beginning of the data.
x <- tidyr::complete(x,
expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, this is complicated. Two issues. They feel a bit edge-case-y, but matter more if we're trying to have some consistency with epix_slide, or maybe a future epi_slide_reframe that doesn't try to match the existing keys in the output.

  • If reporting starts for one epikey later on, or if there are large gaps in its reporting, then even without data revisions, you could have later epi_slides on expanded epi_dfs outputting different results for earlier ref_time_values, as they will be considering additional epikeys. If grouping by all epikeys, then maybe this can't happen for epi_slide (but could for an analogous epix_slide). But if you don't group by anything, you might turn earlier national sums into a bunch of NAs.
  • Should we be using only the epikeys that have appeared, or the cross product of all seen geos x all seen age groups x etc.? If we have to pick one, I think the current approach of using only epikeys that have appeared is right; user can complete or join or whatever if they don't like that. (Again, for epi_slide, this probably only matters about whether we make not-grouped-by-all-epikeys epi_slides output NAs...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: I couldn't find where the matching-old-epikey-time_value/broadcasting part is. I wonder if it's using this completed x. If it is, then the two questions above actually are more relevant for epi_slide since it involves also changing the epikeysets output for earlier time values simply by adding an epikey in rows for later time_values, and would also mean outputting values for gaps, and maybe also NA's at the leading and trailing edges. Ideally we should consciously make decisions here, or make it configurable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you may be able to get rid of the expand call)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A full_seq may be missing in definition of all_dates. If there is a gap for all epikeys then completion may not fill it in.

Copy link
Contributor Author

@nmdefries nmdefries Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full_date_seq fn below should take care of that. The three cases (tsibble class or numeric, else time_step not defined, else time_step defined) all generate date sequences by stepping by time_step (or similar) between the input data's min and max time_values. They don't look at unique, existing time_values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key cols are broadcast backwards here:

expand(x, nesting(!!key_cols_no_time), data.frame(time_value = all_dates))

This wouldn't change output epikeysets for earlier time values. In the non-aggregating case, computations are both completed across and computed across all keycols, and we later (at the end of this function) filter out not-.real obs. Epikeysets that were broadcast back in time will be removed by the filter.

In the aggregating case (e.g. states -> national sum), we're outputting all-new epikeysets. We should end up with one row per group. As long as the input group data contains one .real = TRUE obs, the output should be kept (?)(I'm not sure how .real is handled right now -- need to check/implement that logic).

# `complete` checks that fill types match existing column types.
fill = fill,
# Existing missings will be replaced by `fill`, too.
explicit = TRUE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit = TRUE doesn't seem very natural to me. Could we also make this a parameter? (And maybe use a more sane name&default? Tradeoff: I guess that could also surprise people if fill makes them think we're mirroring tidyr defaults, or make them read more docs. Not sure what's preferable.)

)

# `complete` strips epi_df format and metadata. Restore them.
x <- reclass(x, attributes(x)$metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to use pre-recorded metadata beforehand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does/could metadata get modified by prior processing?

}

# If a custom time step is specified, then redefine units
if (!missing(time_step)) {
before <- time_step(before)
Expand Down Expand Up @@ -243,8 +326,8 @@ epi_slide <- function(x, f, ..., before, after, ref_time_values,
new_col) {
# Figure out which reference time values appear in the data group in the
# first place (we need to do this because it could differ based on the
# group, hence the setup/checks for the reference time values based on all
# the data could still be off):
# group, and reference time values found above are based on all of the
# data):
o <- ref_time_values %in% .data_group$time_value
starts <- starts[o]
stops <- stops[o]
Expand All @@ -263,7 +346,7 @@ epi_slide <- function(x, f, ..., before, after, ref_time_values,
)

# Now figure out which rows in the data group are in the reference time
# values; this will be useful for all sorts of checks that follow
# values; this will be used for checks below.
o <- .data_group$time_value %in% kept_ref_time_values
num_ref_rows <- sum(o)

Expand Down Expand Up @@ -337,7 +420,7 @@ epi_slide <- function(x, f, ..., before, after, ref_time_values,
# computation. `i` is contained in the `f_wrapper_factory` environment such
# that when called within `slide_one_grp` `i` is reset for every group.
f_wrapper_factory <- function(kept_ref_time_values) {
# Use `i` to advance through list of start dates.
# Use `i` to advance through list of reference dates.
i <- 1L
f_wrapper <- function(.x, .group_key, ...) {
.ref_time_value <- kept_ref_time_values[[i]]
Expand All @@ -362,6 +445,12 @@ epi_slide <- function(x, f, ..., before, after, ref_time_values,
x <- unnest(x, !!new_col, names_sep = names_sep)
}

# Remove any remaining phony observations.
if (autocomplete_windows) {
.x <- .x[.x$.real, ]
.x$.real <- NULL
}

return(x)
}

Expand Down Expand Up @@ -718,7 +807,7 @@ full_date_seq <- function(x, before, after, time_step) {
if (is.na(by)) {
Abort(
c(
"`frollmean` requires a full window to compute a result, but
"Current settings require a full window to compute a `slide` result, but
`time_type` associated with the epi_df was not mappable to period
type valid for creating a date sequence.",
"i" = c("The input data's `time_type` was probably `custom` or `day-time`.
Expand Down Expand Up @@ -773,4 +862,4 @@ full_date_seq <- function(x, before, after, time_step) {
pad_early_dates = pad_early_dates,
pad_late_dates = pad_late_dates
))
}
}
14 changes: 13 additions & 1 deletion man/epi_slide.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.