wip: enforce epi_slide to work on grouped epi_dfs only #519

dshemetov · 2024-08-24T01:59:24Z

Checklist

Please:

Make sure this PR is against "dev", not "main" (unless this is a release
PR).
Request a review from one of the current main reviewers:
brookslogan, nmdefries.
Makes sure to bump the version number in DESCRIPTION. Always increment
the patch version number (the third number), unless you are making a
release PR from dev to main, in which case increment the minor version
number (the second number).
Describe changes made in NEWS.md, making sure breaking changes
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
1.7.2, then write your changes under the 1.8 heading).
See DEVELOPMENT.md for more information on the development
process.

Change explanations for reviewer

epi_slide requires the epi_df to be grouped on all key_cols (automatically groups too)
refactored epi_slide for readability
epi_slide_one_group is now defined outside epi_slide (as per @dsweber2's comment)
epi_slide_one_group completes the time index when sliding (matching epi_slide_opt); grouping also simplifies many of the internals
huge test coverage improvement for epi_slide - includes all valid parameter combinations
add sum_group_epi_df helper function to help aggregate across geos and other columns (since we don't allow epi_slide to handle that functionality anymore)

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

Resolves Warn if a non-version key is duplicated in epi_df or epi_archive #154

* works with grouped epi_dfs only * add .complete_only parameter * correct deprecation messages * add huge amounts of tests * add aggregate_epi_df * single data point per group epi_df now defaults to day time type

R/methods-epi_df.R

R/slide.R

R/epi_df.R

brookslogan · 2024-09-06T12:40:42Z

R/epi_df.R

+  duplicated_time_values <- x %>%
+    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
+    dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%
+    filter(n > 1)


Suggested change

duplicated_time_values <- x %>%

group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%

dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%

filter(n > 1)

if ("n" %in% other_keys) {

cli_abort("`other_keys` can't include \"n\"")

}

duplicated_time_values <- x %>%

count(across(all_of(c("geo_value", "time_value", other_keys)))) %>%

filter(n > 1)

for a guard + less typing after the guard (and hopefully higher performance? --- though if performance is an issue, we should probably be considering anyDuplicated or something custom to not slow down the happy case, and then doing the above or below to generate nicer feedback in the unhappy case), or

Suggested change

duplicated_time_values <- x %>%

group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%

dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%

filter(n > 1)

if ("n" %in% other_keys) {

cli_abort("`other_keys` can't include \"n\"")

}

duplicated_time_values <- x %>%

group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%

filter(dplyr::n() > 1) %>%

ungroup()

if you want to actually show data

todo: we should error or decay to tibble when we violate this in other operations. Just glancing, this probably could mean just updating the functions referenced in ?dplyr_extending that could lead to violations, and that probably means just updating dplyr_reconstruct.epi_df.

I'll add the guard, but I also renamed the n to something that will be less likely to collide .time_value_counts. Not too worried about performance rn, but we can easily refactor this section if it ever comes to that.

todo: we should error or decay to tibble when we violate this in other operations

we should talk through the ways this can get violated in other operations. a bit wary of decaying to tibble - silently losing epi_df-ness seems like it could lead to unpleasant surprises. will come back to this when i address other points.

hm, getting a call stack overflow cause i use dplyr functions in the duplicated time values check above. do i really have to implement my very own group_by for this?

Is this still an issue? Do you have a reprex?

I don't, but the gist is that if you group_by an epi_df inside dplyr_reconstruct, you end up in an infinite recursion.

I don't understand how that happens. Aren't we working off of a tibble input & haven't converted it yet?

Is it failing in the current state of the branch? Or do you have some local work on dplyr_reconstruct that's triggering the error?

I need to re-read dplyr_extending and see if we actually need to touch dplyr_reconstruct. We may only need it on dplyr_row_slice and maybe [.

If you're having trouble here, feel free to punt the invariant-preservation part to me, as I think I have some vague memories & experience here that would be helpful. [Either way, we could do this invariant-preservation part post-merge into the big slide improvements branch]

Caching in some stuff from memory:

We probably don't want to do a full key validity check in dplyr_reconstruct; we run that all the time and it could be expensive.

We might be able to get by with:

In dplyr_row_slice.epi_df: check that they have not requested the same row index more than once.

In [.epi_df when it selects columns: if any of the other_keys columns has been dropped, we need to do a full uniqueness check (or some optimization of it). If it's also selecting rows (using integer indices)... not sure if we also need that row slice check here or if row slice is already being called somehow.

?dplyr_extending seems to suggest we don't need a dplyr_col_modify implementation for this.

And if you're running into infinite loop things, then you may need to manually decay_epi_df, though it seems a bit fishy because these dplyr_extending things are supposed to be delivering tibbles to you to work with, so I'm wondering if something might be wrong.

R/grouped_epi_archive.R

R/slide.R

brookslogan

Making a lot of progress here! Does testing overhaul need much review? I've glossed over it for now.

This looks good but I'd like to check any changes made to preserve the epi_df unique key invariant & any changes to decaying behavior.

tests/testthat/test-epi_slide.R

R/slide.R

R/methods-epi_df.R

R/slide.R

tests/testthat/test-epi_slide.R

dshemetov · 2024-09-10T00:36:28Z

R/slide.R

* duplicated time values check in epi_df constructor improved * dplyr warning and unnecessary if in sum_groups_epi_df fixed * args in epi_slide are now validated in order of func signature * simplify deprecated check * error if .new_col_name is "geo_value" or "time_value" * better TODO comment over last part of epi_slide * comment about yearmonth - Inf weirdness * change tests few tests * remove complete_only and auto complete Co-authored-by: brookslogan <[email protected]>

dshemetov added 2 commits August 23, 2024 17:18

doc: improve the .f documentation for epi_slide and epix_slide

c0f6499

doc: switch centre to center

2f91a90

dshemetov marked this pull request as draft August 24, 2024 23:10

dshemetov force-pushed the ds/epi-slide-group branch 2 times, most recently from 57bafb0 to adb8504 Compare September 6, 2024 02:16

dshemetov marked this pull request as ready for review September 6, 2024 02:16

dshemetov requested a review from brookslogan September 6, 2024 02:16

feat: refactor epi_slide

dd5c769

* works with grouped epi_dfs only * add .complete_only parameter * correct deprecation messages * add huge amounts of tests * add aggregate_epi_df * single data point per group epi_df now defaults to day time type

dshemetov force-pushed the ds/epi-slide-group branch from adb8504 to dd5c769 Compare September 6, 2024 02:22