Skip to content

wip: enforce epi_slide to work on grouped epi_dfs only #519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Sep 18, 2024

Conversation

dshemetov
Copy link
Contributor

@dshemetov dshemetov commented Aug 24, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Makes sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

  • epi_slide requires the epi_df to be grouped on all key_cols (automatically groups too)
  • refactored epi_slide for readability
  • epi_slide_one_group is now defined outside epi_slide (as per @dsweber2's comment)
  • epi_slide_one_group completes the time index when sliding (matching epi_slide_opt); grouping also simplifies many of the internals
  • huge test coverage improvement for epi_slide - includes all valid parameter combinations
  • add sum_group_epi_df helper function to help aggregate across geos and other columns (since we don't allow epi_slide to handle that functionality anymore)

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch

@dshemetov dshemetov marked this pull request as draft August 24, 2024 23:10
@dshemetov dshemetov force-pushed the ds/epi-slide-group branch 2 times, most recently from 57bafb0 to adb8504 Compare September 6, 2024 02:16
@dshemetov dshemetov marked this pull request as ready for review September 6, 2024 02:16
* works with grouped epi_dfs only
* add .complete_only parameter
* correct deprecation messages
* add huge amounts of tests
* add aggregate_epi_df
* single data point per group epi_df now defaults to day time type
R/epi_df.R Outdated
Comment on lines 282 to 285
duplicated_time_values <- x %>%
group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%
filter(n > 1)
Copy link
Contributor

@brookslogan brookslogan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
duplicated_time_values <- x %>%
group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%
filter(n > 1)
if ("n" %in% other_keys) {
cli_abort("`other_keys` can't include \"n\"")
}
duplicated_time_values <- x %>%
count(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
filter(n > 1)

for a guard + less typing after the guard (and hopefully higher performance? --- though if performance is an issue, we should probably be considering anyDuplicated or something custom to not slow down the happy case, and then doing the above or below to generate nicer feedback in the unhappy case), or

Suggested change
duplicated_time_values <- x %>%
group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%
filter(n > 1)
if ("n" %in% other_keys) {
cli_abort("`other_keys` can't include \"n\"")
}
duplicated_time_values <- x %>%
group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
filter(dplyr::n() > 1) %>%
ungroup()

if you want to actually show data

Copy link
Contributor

@brookslogan brookslogan Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: we should error or decay to tibble when we violate this in other operations. Just glancing, this probably could mean just updating the functions referenced in ?dplyr_extending that could lead to violations, and that probably means just updating dplyr_reconstruct.epi_df.

Copy link
Contributor Author

@dshemetov dshemetov Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the guard, but I also renamed the n to something that will be less likely to collide .time_value_counts. Not too worried about performance rn, but we can easily refactor this section if it ever comes to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: we should error or decay to tibble when we violate this in other operations

we should talk through the ways this can get violated in other operations. a bit wary of decaying to tibble - silently losing epi_df-ness seems like it could lead to unpleasant surprises. will come back to this when i address other points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, getting a call stack overflow cause i use dplyr functions in the duplicated time values check above. do i really have to implement my very own group_by for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still an issue? Do you have a reprex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't, but the gist is that if you group_by an epi_df inside dplyr_reconstruct, you end up in an infinite recursion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how that happens. Aren't we working off of a tibble input & haven't converted it yet?

Copy link
Contributor

@brookslogan brookslogan Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it failing in the current state of the branch? Or do you have some local work on dplyr_reconstruct that's triggering the error?

I need to re-read dplyr_extending and see if we actually need to touch dplyr_reconstruct. We may only need it on dplyr_row_slice and maybe [.

If you're having trouble here, feel free to punt the invariant-preservation part to me, as I think I have some vague memories & experience here that would be helpful. [Either way, we could do this invariant-preservation part post-merge into the big slide improvements branch]

Copy link
Contributor

@brookslogan brookslogan Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching in some stuff from memory:

  • We probably don't want to do a full key validity check in dplyr_reconstruct; we run that all the time and it could be expensive.
  • We might be able to get by with:
    • In dplyr_row_slice.epi_df: check that they have not requested the same row index more than once.
    • In [.epi_df when it selects columns: if any of the other_keys columns has been dropped, we need to do a full uniqueness check (or some optimization of it). If it's also selecting rows (using integer indices)... not sure if we also need that row slice check here or if row slice is already being called somehow.
    • ?dplyr_extending seems to suggest we don't need a dplyr_col_modify implementation for this.

And if you're running into infinite loop things, then you may need to manually decay_epi_df, though it seems a bit fishy because these dplyr_extending things are supposed to be delivering tibbles to you to work with, so I'm wondering if something might be wrong.

Copy link
Contributor

@brookslogan brookslogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making a lot of progress here! Does testing overhaul need much review? I've glossed over it for now.

This looks good but I'd like to check any changes made to preserve the epi_df unique key invariant & any changes to decaying behavior.

@dshemetov
Copy link
Contributor Author

dshemetov commented Sep 10, 2024

TODO:

  • error when all groups return an empty tibble / all NAs
  • make .window_size required and remove the default of 1
  • rework .complete_only - remove and slide on contiguous windows by default
    • filter ref_time_values down to just the ones between min and max time values
    • update tests
    • make sure that epi_slide_opt behaves similarly here
  • a few more test updates
  • can we standardize the NA/NULL/list(NULL)/list(NA) outputs from epi_slide? answer: no, don't bother.
  • add complete(geo_value, time_value = full_seq) and other intended example of complete usage to documentation; can we create a wrapper like complete_canonical() that automatically makes decisions based on time_type?
  • arrange_col and arrange_row
  • debug and fix borked test
  • rebase and merge

* duplicated time values check in epi_df constructor improved
* dplyr warning and unnecessary if in sum_groups_epi_df fixed
* args in epi_slide are now validated in order of func signature
* simplify deprecated check
* error if .new_col_name is "geo_value" or "time_value"
* better TODO comment over last part of epi_slide
* comment about yearmonth - Inf weirdness
* change tests few tests
* remove complete_only and auto complete

Co-authored-by: brookslogan <[email protected]>
@dshemetov dshemetov mentioned this pull request Sep 13, 2024
8 tasks
@dshemetov dshemetov mentioned this pull request Sep 17, 2024
4 tasks
@dshemetov dshemetov merged commit 214100d into lcb/slide-improvements-2024-06 Sep 18, 2024
@dshemetov dshemetov deleted the ds/epi-slide-group branch September 18, 2024 18:20
@dshemetov dshemetov mentioned this pull request Sep 18, 2024
4 tasks
@dshemetov dshemetov mentioned this pull request Sep 27, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants