wip: enforce epi_slide to work on grouped epi_dfs only #519
Conversation
Force-pushed from 57bafb0 to adb8504.
* works with grouped epi_dfs only
* add .complete_only parameter
* correct deprecation messages
* add huge amounts of tests
* add aggregate_epi_df
* single data point per group epi_df now defaults to day time type
Force-pushed from adb8504 to dd5c769.
R/epi_df.R (outdated)
duplicated_time_values <- x %>%
  group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
  dplyr::summarize(n = dplyr::n(), .groups = "drop") %>%
  filter(n > 1)
Suggested change:

if ("n" %in% other_keys) {
  cli_abort("`other_keys` can't include \"n\"")
}
duplicated_time_values <- x %>%
  count(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
  filter(n > 1)
for a guard + less typing after the guard (and hopefully higher performance? Though if performance is an issue, we should probably be considering anyDuplicated() or something custom so we don't slow down the happy case, and then doing the above or below to generate nicer feedback in the unhappy case), or
if ("n" %in% other_keys) {
  cli_abort("`other_keys` can't include \"n\"")
}
duplicated_time_values <- x %>%
  group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
  filter(dplyr::n() > 1) %>%
  ungroup()
if you want to actually show data
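
Not part of the review comment itself, but a rough sketch of the anyDuplicated() fast-path idea mentioned above: keep the happy case cheap with a base R check, and only do the dplyr work (to generate nicer feedback) when duplicates are actually present. The exact error message and variable names here are placeholders.

key_cols <- c("geo_value", "time_value", other_keys)  # `x` and `other_keys` as in the code above
if (anyDuplicated(x[key_cols]) != 0L) {
  # unhappy path only: fall back to one of the dplyr-based suggestions above
  # so the error can show which key combinations are duplicated
  duplicated_time_values <- x %>%
    group_by(across(all_of(key_cols))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  cli_abort("`x` contains duplicate rows for the same geo_value / time_value / other_keys combination.")
}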
todo: we should error or decay to tibble when we violate this in other operations. Just glancing, this probably could mean just updating the functions referenced in ?dplyr_extending that could lead to violations, and that probably means just updating dplyr_reconstruct.epi_df.
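
To make that todo concrete, here is a purely hypothetical sketch (not the package's actual method) of a dplyr_reconstruct.epi_df that decays to a tibble when the unique-key invariant can no longer hold. The metadata access, the key-column names, and the decay-rather-than-error choice are all assumptions.

dplyr_reconstruct.epi_df <- function(data, template) {
  data <- NextMethod()  # let the default reconstruction restore basic attributes
  other_keys <- attr(template, "metadata")$other_keys
  ukey_cols <- c("geo_value", "time_value", other_keys)
  if (!all(ukey_cols %in% names(data)) || anyDuplicated(data[ukey_cols]) != 0L) {
    # a key column was dropped or key rows were duplicated: stop claiming epi_df-ness
    return(tibble::as_tibble(data))
  }
  data
}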
I'll add the guard, but I also renamed n to something that will be less likely to collide: .time_value_counts. Not too worried about performance right now, but we can easily refactor this section if it ever comes to that.
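
A minimal sketch of what the renamed counter could look like, assuming dplyr::count()'s name argument (this is an illustration, not a quote of the actual commit):

duplicated_time_values <- x %>%
  count(across(all_of(c("geo_value", "time_value", other_keys))), name = ".time_value_counts") %>%
  filter(.time_value_counts > 1)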
Re "todo: we should error or decay to tibble when we violate this in other operations": we should talk through the ways this can get violated in other operations. A bit wary of decaying to tibble: silently losing epi_df-ness seems like it could lead to unpleasant surprises. Will come back to this when I address other points.
Hm, getting a call stack overflow because I use dplyr functions in the duplicated-time-values check above. Do I really have to implement my very own group_by for this?
Is this still an issue? Do you have a reprex?
I don't, but the gist is that if you group_by an epi_df inside dplyr_reconstruct, you end up in an infinite recursion.
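
For illustration only (this is not the package's code), the shape of the recursion being described: if the reconstruct step itself runs dplyr verbs on something that still carries the epi_df class, those verbs can dispatch back into the same epi_df reconstruction machinery.

dplyr_reconstruct.epi_df <- function(data, template) {
  # the duplicate-key check below uses dplyr verbs; run on an epi_df they can
  # trigger dplyr_reconstruct.epi_df again, re-entering this function forever
  dups <- data %>%
    group_by(across(all_of(c("geo_value", "time_value")))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  if (nrow(dups) > 0) cli_abort("duplicate unique keys")
  data
}

One way out is to drop to a plain tibble first (e.g. tibble::as_tibble(data) or a decay_epi_df()-style helper), or to use a base check like anyDuplicated() that bypasses dplyr dispatch entirely.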
I don't understand how that happens. Aren't we working off of a tibble input & haven't converted it yet?
Is it failing in the current state of the branch? Or do you have some local work on dplyr_reconstruct that's triggering the error?
I need to re-read dplyr_extending and see if we actually need to touch dplyr_reconstruct. We may only need it on dplyr_row_slice and maybe [ (the subsetting operator).
If you're having trouble here, feel free to punt the invariant-preservation part to me, as I think I have some vague memories & experience here that would be helpful. [Either way, we could do this invariant-preservation part post-merge into the big slide improvements branch]
Caching in some stuff from memory:
- We probably don't want to do a full key validity check in dplyr_reconstruct; we run that all the time and it could be expensive.
- We might be able to get by with:
  - In dplyr_row_slice.epi_df: check that they have not requested the same row index more than once (sketched below).
  - In [.epi_df when it selects columns: if any of the other_keys columns has been dropped, we need to do a full uniqueness check (or some optimization of it). If it's also selecting rows (using integer indices)... not sure if we also need that row slice check here or if row slice is already being called somehow.
  - ?dplyr_extending seems to suggest we don't need a dplyr_col_modify implementation for this.

And if you're running into infinite loop things, then you may need to manually decay_epi_df, though it seems a bit fishy, because these dplyr_extending things are supposed to be delivering tibbles to you to work with, so I'm wondering if something might be wrong.
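
A minimal sketch of the dplyr_row_slice idea from the first sub-bullet above, under the assumptions that duplicated integer indices are the only violation to guard against there and that decaying (rather than erroring) is the desired fallback:

dplyr_row_slice.epi_df <- function(data, i, ...) {
  if (is.numeric(i) && anyDuplicated(i) != 0L) {
    # repeating a row index would duplicate unique keys, so hand back a
    # plain tibble result instead of an epi_df
    return(dplyr::dplyr_row_slice(tibble::as_tibble(data), i, ...))
  }
  NextMethod()
}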
Making a lot of progress here! Does the testing overhaul need much review? I've glossed over it for now.
This looks good, but I'd like to check any changes made to preserve the epi_df unique key invariant & any changes to decaying behavior.
TODO:
* duplicated time values check in epi_df constructor improved
* dplyr warning and unnecessary if in sum_groups_epi_df fixed
* args in epi_slide are now validated in order of func signature
* simplify deprecated check
* error if .new_col_name is "geo_value" or "time_value"
* better TODO comment over last part of epi_slide
* comment about yearmonth - Inf weirdness
* change a few tests
* remove complete_only and auto complete

Co-authored-by: brookslogan <[email protected]>
Force-pushed from 297beec to dfd49f5.
Checklist
Please:
- ... (unless this is a release PR).
- Request a review from one of the main reviewers: brookslogan, nmdefries.
- Update the version number in DESCRIPTION. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- Make sure breaking changes (backwards-incompatible changes to the documented interface) are noted, and collect the changes under the next release number (e.g. if you are on 1.7.2, then write your changes under the 1.8 heading).
- ... process.
Change explanations for reviewer
* sum_group_epi_df helper function to help aggregate across geos and other columns (since we don't allow epi_slide to handle that functionality anymore)

Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch:
* version key is duplicated in epi_df or epi_archive #154
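
As context for the aggregation helper mentioned above, a rough sketch of the kind of cross-geo aggregation such a helper is meant to cover, written in plain dplyr; the toy data, column names, and phrasing here are illustrative assumptions, not the helper's actual API:

library(dplyr)
library(epiprocess)

# toy epi_df with two geos; values are made up for illustration
toy <- as_epi_df(tibble::tibble(
  geo_value  = rep(c("ca", "ny"), each = 3),
  time_value = rep(as.Date("2024-01-01") + 0:2, times = 2),
  cases      = c(10, 20, 30, 1, 2, 3)
))

# aggregate across geos: one total per time_value
# (a grouped summarize like this is what the helper would wrap)
national <- toy %>%
  group_by(time_value) %>%
  summarize(cases = sum(cases), .groups = "drop")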