feat: check_enough_train_data #283
Merged
Commits (18)
cf65b1b wip: check_enough_train_data (dshemetov)
ad74faa feat: add check_enough_data (dshemetov)
f237ca5 doc: update check_enough_train_data docstring (dshemetov)
33e00ca fix: remove browser() (dshemetov)
d4d5189 typos and ambiguous names (dsweber2)
3662bc3 slightly more test structure (dsweber2)
fe2c91a refactor: use Dan's suggest in check_enough_train_data (dshemetov)
1846097 fix: check_enough_train_data, tests (dshemetov)
4bc17c4 feat: add check_enough_data to arx_forecaster (dshemetov)
3f1630d fix: add default n to check_enough_train_data, import dplyr funcs (dshemetov)
38f19da repo: ignore renv stuff (dshemetov)
c655bf6 doc: update NEWS (dshemetov)
437f5d1 doc: document (dshemetov)
edb15e3 refactor: rename check_enough_data args in arx_forecaster (dshemetov)
39bb81e feat: add check_enough_data to arx_classifier (dshemetov)
efde51e refactor: change the default n in check_enough_data to #predictors (dshemetov)
580ca5a repo: change version bump (dshemetov)
b869222 doc: document (dshemetov)
@@ -1,3 +1,5 @@
^renv$
^renv\.lock$
^epipredict\.Rproj$
^\.Rproj\.user$
^LICENSE\.md$
@@ -7,3 +7,6 @@ inst/doc
.DS_Store
/doc/
/Meta/
.Rprofile
renv.lock
renv/
@@ -1,44 +1,49 @@
# epipredict (development)

# epipredict 0.0.7.9000

- add `check_enough_train_data` that will error if training data is too small
- added `check_enough_train_data` to `arx_forecaster`

# epipredict 0.0.7

* simplify `layer_residual_quantiles()` to avoid timesuck in `utils::methods()`
- simplify `layer_residual_quantiles()` to avoid timesuck in `utils::methods()`

# epipredict 0.0.6

* rename the `dist_quantiles()` to be more descriptive, breaking change)
* removes previous `pivot_quantiles()` (now `*_wider()`, breaking change)
* add `pivot_quantiles_wider()` for easier plotting
* add complement `pivot_quantiles_longer()`
* add `cdc_baseline_forecaster()` and `flusight_hub_formatter()`
- rename the `dist_quantiles()` to be more descriptive, breaking change)
- removes previous `pivot_quantiles()` (now `*_wider()`, breaking change)
- add `pivot_quantiles_wider()` for easier plotting
- add complement `pivot_quantiles_longer()`
- add `cdc_baseline_forecaster()` and `flusight_hub_formatter()`

# epipredict 0.0.5

* add `smooth_quantile_reg()`
* improved printing of various methods / internals
* canned forecasters get a class
* fixed quantile bug in `flatline_forecaster()`
* add functionality to output the unfit workflow from the canned forecasters
- add `smooth_quantile_reg()`
- improved printing of various methods / internals
- canned forecasters get a class
- fixed quantile bug in `flatline_forecaster()`
- add functionality to output the unfit workflow from the canned forecasters

# epipredict 0.0.4

* add quantile_reg()
* clean up documentation bugs
* add smooth_quantile_reg()
* add classifier
* training window step debugged
* `min_train_window` argument removed from canned forecasters
- add quantile_reg()
- clean up documentation bugs
- add smooth_quantile_reg()
- add classifier
- training window step debugged
- `min_train_window` argument removed from canned forecasters

# epipredict 0.0.3

* add forecasters
* implement postprocessing
* vignettes avaliable
* arx_forecaster
* pkgdown
- add forecasters
- implement postprocessing
- vignettes avaliable
- arx_forecaster
- pkgdown

# epipredict 0.0.0.9000

* Publish public for easy navigation
* Two simple forecasters as test beds
* Working vignette
- Publish public for easy navigation
- Two simple forecasters as test beds
- Working vignette
@@ -0,0 +1,149 @@
#' Check the dataset contains enough data points.
#'
#' `check_enough_train_data` creates a *specification* of a recipe
#' operation that will check if variables contain enough data.
#'
#' @param recipe A recipe object. The check will be added to the
#' sequence of operations for this recipe.
#' @param ... One or more selector functions to choose variables
#' for this check. See [selections()] for more details.
#' @param n The minimum number of data points required for training.
#' @param epi_keys A character vector of column names on which to group the data
#' and check threshold within each group. Useful if your forecaster trains
#' per group (for example, per geo_value).
#' @param drop_na A logical for whether to count NA values as valid rows.
#' @param role Not used by this check since no new variables are
#' created.
#' @param trained A logical for whether the selectors in `...`
#' have been resolved by [prep()].
#' @param columns An internal argument that tracks which columns are evaluated
#' for this check. Should not be used by the user.
#' @param id A character string that is unique to this check to identify it.
#' @param skip A logical. Should the check be skipped when the
#' recipe is baked by [bake()]? While all operations are baked
#' when [prep()] is run, some operations may not be able to be
#' conducted on new data (e.g. processing the outcome variable(s)).
#' Care should be taken when using `skip = TRUE` as it may affect
#' the computations for subsequent operations.
#' @family checks
#' @export
#' @details This check will break the `bake` function if any of the checked
#' columns have not enough non-NA values. If the check passes, nothing is
#' changed to the data.
#'
#' # tidy() results
#'
#' When you [`tidy()`][tidy.recipe()] this check, a tibble with column
#' `terms` (the selectors or variables selected) is returned.
#'
check_enough_train_data <-
  function(recipe,
           ...,
           n = NULL,
           epi_keys = NULL,
           drop_na = TRUE,
           role = NA,
           trained = FALSE,
           columns = NULL,
           skip = TRUE,
           id = rand_id("enough_train_data")) {
    add_check(
      recipe,
      check_enough_train_data_new(
        n = n,
        epi_keys = epi_keys,
        drop_na = drop_na,
        terms = rlang::enquos(...),
        role = role,
        trained = trained,
        columns = columns,
        skip = skip,
        id = id
      )
    )
  }

check_enough_train_data_new <-
  function(n, epi_keys, drop_na, terms, role, trained, columns, skip, id) {
    check(
      subclass = "enough_train_data",
      prefix = "check_",
      n = n,
      epi_keys = epi_keys,
      drop_na = drop_na,
      terms = terms,
      role = role,
      trained = trained,
      columns = columns,
      skip = skip,
      id = id
    )
  }

#' @export
#' @importFrom dplyr group_by summarise ungroup across all_of n
#' @importFrom tidyr drop_na
prep.check_enough_train_data <- function(x, training, info = NULL, ...) {
  col_names <- recipes_eval_select(x$terms, training, info)
  if (is.null(x$n)) {
    x$n <- length(col_names) + 5
  }

  cols_not_enough_data <- training %>%
    {
      if (x$drop_na) {
        drop_na(.)
      } else {
        .
      }
    } %>%
    group_by(across(all_of(.env$x$epi_keys))) %>%
    summarise(across(all_of(.env$col_names), ~ n() < .env$x$n), .groups = "drop") %>%
    summarise(across(all_of(.env$col_names), any), .groups = "drop") %>%
    unlist() %>%
    names(.)[.]

  if (length(cols_not_enough_data) > 0) {
    cli::cli_abort(
      "The following columns don't have enough data to predict: {cols_not_enough_data}."
    )
  }

  check_enough_train_data_new(
    n = x$n,
    epi_keys = x$epi_keys,
    drop_na = x$drop_na,
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    columns = col_names,
    skip = x$skip,
    id = x$id
  )
}

#' @export
bake.check_enough_train_data <- function(object, new_data, ...) {
  new_data
}

#' @export
print.check_enough_train_data <- function(x, width = max(20, options()$width - 30), ...) {
  title <- paste0("Check enough data (n = ", x$n, ") for ")
  print_step(x$columns, x$terms, x$trained, title, width)
  invisible(x)
}

#' @export
tidy.check_enough_train_data <- function(x, ...) {
  if (is_trained(x)) {
    res <- tibble(terms = unname(x$columns))
  } else {
    res <- tibble(terms = sel2char(x$terms))
  }
  res$id <- x$id
  res$n <- x$n
  res$epi_keys <- x$epi_keys
  res$drop_na <- x$drop_na
  res
}
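
The counting logic in `prep.check_enough_train_data` can be hard to follow inside the pipe, so here is a minimal standalone sketch of the same idea using only dplyr and tidyr. The helper name `cols_with_too_few_rows` and its arguments `min_rows` and `keys` are hypothetical, made up for illustration, and are not part of epipredict. Note that, as in the diff above, `drop_na()` drops any row with an NA in any column, so every checked column is compared against the same per-group row count.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of epipredict): returns the names of `cols`
# for which at least one group has fewer than `min_rows` rows.
cols_with_too_few_rows <- function(data, cols, min_rows,
                                   keys = character(0), drop_na = TRUE) {
  if (drop_na) {
    # Drops every row that has an NA in any column, not just in `cols`.
    data <- tidyr::drop_na(data)
  }
  flagged <- data %>%
    group_by(across(all_of(keys))) %>%
    # One row per group: TRUE where the group's row count is below the threshold.
    summarise(across(all_of(cols), ~ dplyr::n() < min_rows), .groups = "drop") %>%
    # Collapse over groups: TRUE if any group fell short.
    summarise(across(all_of(cols), any))
  cols[unlist(flagged[cols])]
}

df <- data.frame(
  geo_value = c("ca", "ca", "ca", "ny", "ny", "ny"),
  cases = c(1, 2, 3, 4, NA, 6),
  deaths = c(0, 1, 0, 1, 0, 1)
)

# "ny" keeps only 2 complete rows, so both columns are reported.
too_few <- cols_with_too_few_rows(df, c("cases", "deaths"),
                                  min_rows = 3, keys = "geo_value")
# Without dropping NA rows, both groups have 3 rows and nothing is reported.
none <- cols_with_too_few_rows(df, c("cases", "deaths"),
                               min_rows = 3, keys = "geo_value",
                               drop_na = FALSE)
```

In this toy data `deaths` has no missing values, yet it is still reported when `drop_na = TRUE`, because the row is removed on account of the missing `cases` value.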
I'm wondering if `skip=TRUE` by default would solve the issue about running during `fit` vs `predict`? Looks like you have a test demonstrating it does do that! So we definitely have a functional check for training data, if not test data.

yup! I'd like to handle test data checking next, unclear if that will be possible.
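
For readers unfamiliar with the `skip` convention discussed in this thread, the sketch below illustrates it with `recipes::check_missing()`, used here only as a stand-in because it is a built-in check that accepts the same `skip` argument. It is not the check added in this PR, and the toy data frames are made up. The point it shows: a check with `skip = TRUE` still runs on the training data during `prep()`, but is skipped when `bake()` is applied to new data.

```r
library(recipes)

train <- data.frame(y = c(1, 2, 3, 4), x = c(0.5, 1.5, 2.5, 3.5))
new_data <- data.frame(y = c(5, 6), x = c(NA, 4.5))

# skip = TRUE: the check runs during prep() on the training data only.
rec_skip <- check_missing(recipe(y ~ x, data = train), x, skip = TRUE)
prepped_skip <- prep(rec_skip, training = train)
baked <- bake(prepped_skip, new_data = new_data) # check skipped, NA passes through

# skip = FALSE: the same check also runs in bake() and errors on the NA.
rec_strict <- check_missing(recipe(y ~ x, data = train), x, skip = FALSE)
prepped_strict <- prep(rec_strict, training = train)
strict_result <- tryCatch(
  bake(prepped_strict, new_data = new_data),
  error = function(e) "error"
)
```

This matches the behaviour described in the comments: with `skip = TRUE` the check guards the data used at fit time, while data supplied at predict time is not checked.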