Created a preprocessing step that limits the size of the training window #53

Merged
merged 37 commits into from
Aug 29, 2022
37 commits
d558289
Added code to make training_window, roxygen comments and some tests
Jun 17, 2022
6bf446c
Removed id from user facing fun
Jun 22, 2022
975afdb
Added ID back to how it was before (encountered fatal error)
Jun 22, 2022
ae759f7
,
Jun 22, 2022
1dd28a3
Made some changes as requested
Jun 24, 2022
18d8dd5
Added ... to print as per warning
Jun 24, 2022
0b3ee3b
Updated doc
Jun 24, 2022
dad1580
dplyr::all_of()
Jun 24, 2022
20e5961
Added another test
Jun 25, 2022
6114f3b
Updated ex. that includes multiple keys
Jun 25, 2022
126b81b
Merge branch 'frosting' into 36-step_training_window
rachlobay Jun 25, 2022
431d9e6
Added space
Jun 25, 2022
cca063d
rlang::enquos
Jun 25, 2022
8fc9017
testing
Jun 25, 2022
d6273cb
removed tibble::as_tibble()
Jun 26, 2022
c03370c
Pulled all changes from frosting and tried to resolve conflict with n…
rachlobay Aug 12, 2022
ed12a04
Trying epi_juice soln to decay to tibble problem
rachlobay Aug 12, 2022
e4d07f1
Round 2 to try to get epi_juice to work
rachlobay Aug 12, 2022
0055837
<<- Make assign outside of fun
rachlobay Aug 12, 2022
1a74021
utils::
rachlobay Aug 12, 2022
d4d132e
Delete .gitignore 2
rachlobay Aug 12, 2022
38e8df9
Add sliding vignette from comp
rachlobay Aug 15, 2022
21dbb7f
Delete here.
rachlobay Aug 15, 2022
0fa65ef
Added bake.epi_recipe and removed related code in zzz
rachlobay Aug 16, 2022
f42e04e
Added formats as in original bake
rachlobay Aug 16, 2022
93fec95
Remove hopefully unnecessary call
rachlobay Aug 16, 2022
8e81e7c
Added roxygen doc and devtools::document()
rachlobay Aug 16, 2022
4581115
Add is_empty to namespace
rachlobay Aug 16, 2022
207733c
Added some necessary funs from recipes
rachlobay Aug 16, 2022
57d10ac
Changed documentation
rachlobay Aug 16, 2022
2044b62
Merge branch 'frosting' into 36-step_training_window
rachlobay Aug 16, 2022
e1ddfee
Added @param
rachlobay Aug 16, 2022
5ffadca
Merge branch '36-step_training_window' of https://github.com/cmu-delp…
rachlobay Aug 16, 2022
d4cd12a
Removed one abort message
rachlobay Aug 16, 2022
243e45e
recipes:::strings2factors
rachlobay Aug 16, 2022
3f81cb0
Remove some unnecessary comments
rachlobay Aug 16, 2022
57fcae7
See if now works after epiprocess genlasso switch
rachlobay Aug 17, 2022
9 changes: 9 additions & 0 deletions NAMESPACE
@@ -5,9 +5,11 @@ S3method(Ops,dist_quantiles)
S3method(apply_frosting,default)
S3method(apply_frosting,epi_workflow)
S3method(augment,epi_workflow)
S3method(bake,epi_recipe)
S3method(bake,step_epi_ahead)
S3method(bake,step_epi_lag)
S3method(bake,step_population_scaling)
S3method(bake,step_training_window)
S3method(detect_layer,frosting)
S3method(detect_layer,workflow)
S3method(epi_keys,default)
@@ -36,10 +38,12 @@ S3method(prep,epi_recipe)
S3method(prep,step_epi_ahead)
S3method(prep,step_epi_lag)
S3method(prep,step_population_scaling)
S3method(prep,step_training_window)
S3method(print,epi_workflow)
S3method(print,frosting)
S3method(print,step_epi_ahead)
S3method(print,step_epi_lag)
S3method(print,step_training_window)
S3method(quantile,dist_quantiles)
S3method(refresh_blueprint,default_epi_recipe_blueprint)
S3method(residuals,flatline)
@@ -105,6 +109,7 @@ export(step_epi_ahead)
export(step_epi_lag)
export(step_epi_naomit)
export(step_population_scaling)
export(step_training_window)
export(validate_layer)
import(distributional)
import(recipes)
@@ -119,7 +124,9 @@ importFrom(rlang,":=")
importFrom(rlang,`%||%`)
importFrom(rlang,abort)
importFrom(rlang,caller_env)
importFrom(rlang,is_empty)
importFrom(rlang,is_null)
importFrom(rlang,quos)
importFrom(stats,as.formula)
importFrom(stats,family)
importFrom(stats,lm)
@@ -130,4 +137,6 @@ importFrom(stats,predict)
importFrom(stats,qnorm)
importFrom(stats,quantile)
importFrom(stats,residuals)
importFrom(stats,setNames)
importFrom(tibble,is_tibble)
importFrom(tibble,tibble)
104 changes: 104 additions & 0 deletions R/bake.epi_recipe.R
@@ -0,0 +1,104 @@
#' Bake an epi_recipe
#'
#' @param object A trained object such as a [recipe()] with at least
#' one preprocessing operation.
#' @param new_data An `epi_df`, data frame or tibble to which the
#' preprocessing will be applied. If `NULL` is given to `new_data`,
#' the pre-processed _training data_ will be returned.
#' @param ... One or more selector functions to choose which variables will be
#' returned by the function. See \code{\link[recipes]{selections}} for
#' more details. If no selectors are given, the default is to
#' use [everything()].
#' @return An `epi_df` that may have different columns than the
#' original columns in `new_data`.
#' @importFrom rlang is_empty quos
#' @importFrom tibble is_tibble
#' @rdname bake
#' @export
bake.epi_recipe <- function(object, new_data, ...) {

if (rlang::is_missing(new_data)) {
rlang::abort("'new_data' must be either an epi_df or NULL. No value is not allowed.")
}

if (is.null(new_data)) {
return(epi_juice(object, ...))
}

if (!fully_trained(object)) {
rlang::abort("At least one step has not been trained. Please run `prep`.")
}

terms <- quos(...)
if (is_empty(terms)) {
terms <- quos(everything())
}

# In case someone used the deprecated `newdata`:
if (is.null(new_data) || is.null(ncol(new_data))) {
if (any(names(terms) == "newdata")) {
rlang::abort("Please use `new_data` instead of `newdata` with `bake`.")
} else {
rlang::abort("Please pass a data set to `new_data`.")
}
}

if (!is_tibble(new_data)) {
new_data <- tibble::as_tibble(new_data)
}

recipes:::check_role_requirements(object, new_data)

recipes:::check_nominal_type(new_data, object$orig_lvls)

# Drop completely new columns from `new_data` and reorder columns that do
# still exist to match the ordering used when training
original_names <- names(new_data)
original_training_names <- unique(object$var_info$variable)
bakeable_names <- intersect(original_training_names, original_names)
new_data <- new_data[, bakeable_names]

n_steps <- length(object$steps)

for (i in seq_len(n_steps)) {
step <- object$steps[[i]]

if (recipes:::is_skipable(step)) {
next
}

new_data <- bake(step, new_data = new_data)

if (!is_tibble(new_data)) {
abort("bake() methods should always return tibbles")
}
}

# Use `last_term_info`, which maintains info on all columns that got added
# and removed from the training data. This is important for skipped steps
# which might have resulted in columns not being added/removed in the test
# set.
info <- object$last_term_info

# Now reduce to only user selected columns
out_names <- recipes_eval_select(terms, new_data, info,
check_case_weights = FALSE)
new_data <- new_data[, out_names]

# The levels are not null when no nominal data are present or
# if strings_as_factors = FALSE in `prep`
if (!is.null(object$levels)) {
var_levels <- object$levels
var_levels <- var_levels[out_names]
check_values <-
vapply(var_levels, function(x) {
(!all(is.na(x)))
}, c(all = TRUE))
var_levels <- var_levels[check_values]
if (length(var_levels) > 0) {
new_data <- recipes:::strings2factors(new_data, var_levels)
}
}

new_data
}
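The step loop in `bake.epi_recipe()` above passes over any step flagged as skippable, which is why a step created with `skip = TRUE` (such as `step_training_window()`) changes the training data but leaves `new_data` alone. The following is a minimal sketch of that idea in base R. The `steps` list, the `fn` field and `apply_steps()` are hypothetical stand-ins for illustration, not part of recipes or of this package.

```r
# Toy "steps": each has a skip flag and a function to apply.
# These are hypothetical stand-ins, not real recipes step objects.
steps <- list(
  list(name = "training_window", skip = TRUE,
       fn = function(d) utils::tail(d, 2)),
  list(name = "double_x", skip = FALSE,
       fn = function(d) transform(d, x = x * 2))
)

# Apply every step when training; when baking new data, pass over
# steps whose skip flag is TRUE (mirroring the `next` in the loop above).
apply_steps <- function(data, steps, training) {
  for (step in steps) {
    if (!training && isTRUE(step$skip)) {
      next
    }
    data <- step$fn(data)
  }
  data
}

d <- data.frame(x = 1:4)
trained <- apply_steps(d, steps, training = TRUE)  # window applied, then doubled
baked <- apply_steps(d, steps, training = FALSE)   # window skipped, only doubled
```

In this sketch the training path keeps two rows while the baking path keeps all four, which is the behaviour one would expect from a row-filtering step that should not drop rows at prediction time.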
42 changes: 42 additions & 0 deletions R/epi_juice.R
@@ -0,0 +1,42 @@
#' Extract transformed training set
#'
#' @inheritParams bake.epi_recipe
epi_juice <- function(object, ...) {
if (!fully_trained(object)) {
rlang::abort("At least one step has not been trained. Please run `prep()`.")
}

if (!isTRUE(object$retained)) {
rlang::abort(paste0(
"Use `retain = TRUE` in `prep()` to be able ",
"to extract the training set"
))
}

terms <- quos(...)
if (is_empty(terms)) {
terms <- quos(everything())
}

# Get user requested columns
new_data <- object$template
out_names <- recipes_eval_select(terms, new_data, object$term_info,
check_case_weights = FALSE)
new_data <- new_data[, out_names]

# Since most models require factors, do the conversion from character
if (!is.null(object$levels)) {
var_levels <- object$levels
var_levels <- var_levels[out_names]
check_values <-
vapply(var_levels, function(x) {
(!all(is.na(x)))
}, c(all = TRUE))
var_levels <- var_levels[check_values]
if (length(var_levels) > 0) {
new_data <- recipes:::strings2factors(new_data, var_levels)
}
}

new_data
}
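Both `bake.epi_recipe()` and `epi_juice()` use the same `vapply()` filter to keep only the variables that actually carry factor levels before calling `recipes:::strings2factors()`. The sketch below shows what that filter does on a small hand-built list. The shape of `var_levels` here (one `values`/`ordered` pair per variable, with `NA` for non-nominal columns) is an assumption made for illustration and is not taken from the diff.

```r
# Hypothetical stand-in for `object$levels`; the real structure may differ.
var_levels <- list(
  geo_value = list(values = c("ca", "hi"), ordered = FALSE),
  x = list(values = NA, ordered = NA),
  y = list(values = NA, ordered = NA)
)

# Restrict to the columns the user asked for.
out_names <- c("geo_value", "x")
var_levels <- var_levels[out_names]

# TRUE for entries that hold at least one non-NA piece of level information.
has_levels <- vapply(var_levels, function(x) {
  !all(is.na(x))
}, c(all = TRUE))

# Only these variables would be passed on for string-to-factor conversion.
kept <- var_levels[has_levels]
```

Since the same block appears in two functions, one option would be to pull it into a small internal helper, though that is a design suggestion rather than something this PR does.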
104 changes: 104 additions & 0 deletions R/training_window.R
@@ -0,0 +1,104 @@
#' Limits the size of the training window to the most recent observations
#'
#' `step_training_window` creates a *specification* of a recipe step that
#' limits the size of the training window to the `n_recent` most recent
#' observations in `time_value` per group, where the groups are formed
#' based on the remaining `epi_keys`.
#'
#' @param recipe A recipe object. The step will be added to the
#' sequence of operations for this recipe.
#' @param role Not used by this step since no new variables are created.
#' @param trained A logical to indicate if the quantities for
#' preprocessing have been estimated.
#' @param n_recent An integer value that represents the number of most recent
#' observations that are to be kept in the training window per location.
#' The default value is 50.
#' @param id A character string that is unique to this step to identify it.
#' @template step-return
#'
#' @details Note that `step_epi_ahead()` and `step_epi_lag()` should come
#' after any filtering step.
#'
#' @export
#'
#' @examples
#' tib <- tibble::tibble(
#' x = 1:10, y = 1:10,
#' time_value = rep(seq(as.Date("2020-01-01"), by = 1,
#' length.out = 5), times = 2),
#' geo_value = rep(c("ca", "hi"), each = 5)
#' ) %>% epiprocess::as_epi_df()
#'
#' library(recipes)
#' epi_recipe(y ~ x, data = tib) %>%
#' step_training_window(n_recent = 3) %>%
#' prep(tib) %>%
#' bake(new_data = NULL)
step_training_window <-
function(recipe,
role = NA,
trained = FALSE,
n_recent = 50,
id = rand_id("training_window")) {

add_step(
recipe,
step_training_window_new(
role = role,
trained = trained,
n_recent = n_recent,
skip = TRUE,
id = id
)
)
}

step_training_window_new <-
function(role, trained, n_recent, skip, id) {
step(
subclass = "training_window",
role = role,
trained = trained,
n_recent = n_recent,
skip = skip,
id = id
)
}

#' @export
prep.step_training_window <- function(x, training, info = NULL, ...) {

step_training_window_new(
role = x$role,
trained = TRUE,
n_recent = x$n_recent,
skip = x$skip,
id = x$id
)
}

#' @export
bake.step_training_window <- function(object, new_data, ...) {
if (!all(object$n_recent == as.integer(object$n_recent))) {
rlang::abort("step_training_window requires 'n_recent' to be integer valued.")
}

# Group by every epi key except the time column
ek <- setdiff(epi_keys(new_data), "time_value")

new_data %>%
dplyr::group_by(dplyr::across(dplyr::all_of(ek))) %>%
dplyr::arrange(time_value) %>%
dplyr::slice_tail(n = object$n_recent) %>%
dplyr::ungroup()
}

#' @export
print.step_training_window <-
function(x, width = max(20, options()$width - 30), ...) {
title <- "Number of most recent observations per location used in training window "
n_recent <- x$n_recent
tr_obj <- format_selectors(rlang::enquos(n_recent), width)
recipes::print_step(tr_obj, rlang::enquos(n_recent),
x$trained, title, width)
invisible(x)
}
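The filtering in `bake.step_training_window()` above can be tried outside the recipe machinery. The sketch below reproduces the same `group_by()` / `arrange()` / `slice_tail()` pipeline on a plain tibble. `keep_recent()` is a hypothetical helper, and the grouping key is passed in by hand because `epi_keys()` is internal to this package.

```r
library(dplyr)

# Same toy data as the roxygen example, but kept as a plain tibble.
tib <- tibble::tibble(
  x = 1:10,
  time_value = rep(seq(as.Date("2020-01-01"), by = 1, length.out = 5),
                   times = 2),
  geo_value = rep(c("ca", "hi"), each = 5)
)

# Hypothetical helper: keep the `n_recent` latest rows per key group.
keep_recent <- function(data, keys, n_recent) {
  data %>%
    group_by(across(all_of(keys))) %>%
    arrange(time_value) %>%        # sort by time so the tail is the most recent
    slice_tail(n = n_recent) %>%   # applied within each group
    ungroup()
}

recent <- keep_recent(tib, keys = "geo_value", n_recent = 3)
```

With `n_recent = 3`, each of the two locations should retain its three latest dates, so `recent` has six rows in total.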
27 changes: 27 additions & 0 deletions man/bake.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

19 changes: 19 additions & 0 deletions man/epi_juice.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/epi_workflow.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 2 additions & 3 deletions man/flatline.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
