Speedups for epix_slide #386
Conversation
R/grouped_epi_archive.R (outdated diff)
```r
                     .drop = private$drop) %>%
        dplyr::group_modify(group_modify_fn,
                            f = f, ...,
                            ref_time_value = ref_time_value,
                            new_col = new_col,
                            .keep = TRUE)
    )
  })
}) %>% rbindlist()
```
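For context, a rough sketch of the pattern in this diff (not the package's exact code; the column names and the constant computation are just placeholders): compute one small tibble per reference time value inside `lapply()`, then combine the whole list once with `data.table::rbindlist()` instead of growing the result incrementally.

```r
library(data.table)
library(tibble)

ref_time_values <- as.Date("2012-01-01") + 0:2

# One result per reference time value; tibble() here stands in for the real
# grouped slide computation.
pieces <- lapply(ref_time_values, function(rtv) {
  tibble(time_value = rtv, slide_value = 5)
})

out <- rbindlist(pieces)       # single combine step; returns a data.table
out <- as_tibble(setDF(out))   # convert back to a tibble for the final result
```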
Nice speedup! Is this going to change the output class of our result? This might need to have a `setDF` + `as_tibble` (but check to make sure we can't possibly be aliasing something, e.g., in the 1-reference-time case).
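To illustrate the class concern, a minimal standalone sketch (not the PR's code): `rbindlist()` returns a `data.table`, so converting back to a tibble might look like this.

```r
library(data.table)
library(tibble)

parts <- list(tibble(x = 1:2), tibble(x = 3:4))

res <- rbindlist(parts)
class(res)            # "data.table" "data.frame"

setDF(res)            # drop the data.table class in place (no copy of columns)
res <- as_tibble(res)
class(res)            # "tbl_df" "tbl" "data.frame"
```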
I fixed the class aspect, but I'm not sure about the aliasing. My attempt at investigating this used `lobstr::ref` to compare addresses between the list of epi_dfs containing computed values and the tibble of computed values after using `rbindlist`:
```
> data$DT
# Input data
   geo_value sex time_value    version value
1:        al   M 2012-01-01 2012-01-01    25
2:        ca   M 2012-01-01 2012-01-01   100

> data %>%
+   epix_slide(f = ~ 5, before = 0, ref_time_values = as.Date("2012-01-01")) %>%
+   ungroup()

# Output of lobstr::ref(<list of epi_dfs containing computed values>,
#                       <tibble of computed values after using rbindlist>)
█ [1:0x559b7f330b08] <list>
└─█ [2:0x559b80ee4dd8] <tibble[,2]>
  ├─time_value = [3:0x559b7f330b40] <date>
  └─slide_value = █ [4:0x559b7f467918] <list>
    └─[5:0x559b7f331160] <dbl>

█ [6:0x559b80ee37d8] <tibble[,2]>
├─time_value = [7:0x559b7f4842f8] <date>
└─slide_value = █ [8:0x559b7f484288] <list>
  └─[5:0x559b7f331160]

# Output of epix_slide
# A tibble: 1 × 2
  time_value slide_value
* <date>           <dbl>
1 2012-01-01           5
```
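In that `lobstr::ref()` output, the repeated address `[5:0x559b7f331160]` is the computed value's double vector appearing under both objects, i.e. that underlying vector appears to be shared. A hypothetical minimal reproduction of the same kind of check (not the PR's code):

```r
library(data.table)
library(lobstr)
library(tibble)

piece <- tibble(time_value = as.Date("2012-01-01"), slide_value = list(5))
combined <- rbindlist(list(piece))

# Addresses that appear under both objects indicate shared (aliased) vectors.
ref(list(piece), combined)
```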
Does this maintain the speed benefits? `group_by` could potentially be costly. I'll try to double-check the no-aliasing part.
> Does this maintain the speed benefits? group_by could potentially be costly.

Yes, it seems to. I was also wondering about the time for that `group_by`, but it didn't seem to impact anything. Edit: it's called fewer times because it's not in the inner loop.
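If it ever needs checking, a rough, hypothetical way to compare the cost of the `group_by` call itself against the full grouped computation (not part of the PR):

```r
library(dplyr)
library(bench)

df <- tibble(g = sample(letters, 1e5, replace = TRUE), x = rnorm(1e5))

bench::mark(
  group_only  = group_by(df, g),
  full_modify = df %>%
    group_by(g) %>%
    group_modify(~ tibble(m = mean(.x$x))),
  check = FALSE  # the two expressions return different objects
)
```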
Great! Regarding aliasing:
- Actually, we don't need to avoid it; we are fine with our outputs aliasing stuff, I think.
- I've updated the docs to reflect the above and some other mutation & aliasing details. Please sanity-check. (I hope we can eventually simplify and make all these notes internal by providing a filter method and hiding away the DT from the user.)
- We probably already don't ever alias anyway:
  - `all_versions = FALSE` gets input that has gone through a DT -> tibble conversion using `as_tibble` with no `setDT`, which should prevent DT/column aliasing.
  - `all_versions = TRUE` has a manual step making sure we don't alias the DT (I'm actually not sure why now... maybe I was thinking we were going to be mutating like in `fill_through_version`), but it might potentially somehow alias the columns (I'm not sure if this is possible... but since the aliasing behavior of `unique` didn't appear to be documented before, we should probably assume it could change to alias columns instead of the DT).

(See the sketch below for why by-reference mutation is what makes aliasing matter in the first place.)
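A tiny illustration (hypothetical, not package code) of why aliasing only matters in combination with by-reference operations: `:=` and the `set*` functions bypass R's usual copy-on-modify, so an aliased data.table can be changed underneath another object.

```r
library(data.table)

x <- data.table(a = 1:3)
y <- x               # plain assignment: y and x share the same columns
y[, a := a * 10L]    # modify y by reference...
x$a                  # ...and x sees the change too: 10 20 30
```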
Thanks for adding all of those comments, they look good.
I'll remove some pipes as a final change.
Did removing pipes actually make a difference in your testing? I just tried re-introducing magrittr pipes to what I thought would be the highest-traffic pipe operations, the `group_by` and `group_modify` inside the `FUN` arg to `lapply`, but don't see any real change.
Also, it'd be nice if you could summarize which changes here seemed to have the highest impact, and note any other places you think might benefit significantly from optimization.
I don't see that de-piping changed runtime. Theoretically, extra function calls will always take more time, but the cost is just so low compared to everything else we're doing here.
On the other hand, not using pipes does make profiles easier to read. Using pipes somehow causes functions to be marked as not having a sensible parent call, so e.g. `group_modify` shows up in a spot on the profile that you don't expect and that doesn't match the code.
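A rough way to see that effect for yourself (a hypothetical setup, not the PR's profiling code): profile a piped and an equivalent de-piped expression and compare where the calls land in the flame graph.

```r
library(dplyr)
library(profvis)

df <- tibble(g = sample(letters, 1e6, replace = TRUE), x = rnorm(1e6))

profvis({
  piped    <- df %>% group_by(g) %>% summarize(m = mean(x))
  de_piped <- summarize(group_by(df, g), m = mean(x))
})
```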
These are nice cheap speedups. FYI, a couple of other potential avenues at a much higher cost: ...

These are probably recorded in another issue or issues somewhere. All definitely seem more involved than these super-simple improvements, but if any catch your eye, feel free to try them! (Probably separately; it will be nice to have these simple improvements first!)
@brookslogan If there are other particular ...

The above changes were based on a task from the examples, ...
Saw in profiles that messaging takes a fair amount of time.
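One cheap pattern for that (a hedged sketch with made-up names, not what epiprocess actually does): guard message construction behind an option so nothing gets formatted when messaging is turned off.

```r
# `hypothetical.verbose` and `maybe_inform()` are invented names for illustration.
maybe_inform <- function(make_msg) {
  if (isTRUE(getOption("hypothetical.verbose", TRUE))) {
    message(make_msg())
  }
}

maybe_inform(function() sprintf("computed values for %d reference time values", 123L))
```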
Seems like something that would cause more development work downstream as those functions change, so I'm not that thrilled about it. Question: are `group_map` and `group_split` called very often? Is there a way for us to get around this by calling them less often?
Yeah, it's a bad idea... but tempting since the change is so small 😀 We use ...
It does seem like an easy change for a ~10% speed boost! Hm, since they're used indirectly, does this mean loading epiprocess would have to overwrite those functions in the dplyr namespace?
Yep. I was hoping there could be a workaround, but it's unclear.
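For concreteness, overwriting a function in dplyr's namespace would look roughly like the following (a purely hypothetical sketch; not recommended, and not something this PR does), which is part of why it feels fragile:

```r
# A made-up, faster drop-in for dplyr::group_map; the body is omitted here.
patched_group_map <- function(.data, .f, ..., .keep = FALSE) {
  # ...reimplementation would go here...
}

# Patching the function inside dplyr's namespace at load time:
# utils::assignInNamespace("group_map", patched_group_map, ns = "dplyr")
```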
I don't know yet. I didn't expect ... Re. ...
@brookslogan if you review this before I'm back and everything looks good, feel free to merge.
Looks good to me! I have some trailing requests regarding documenting whether the de-piping was worth it and which optimizations had the most impact, but this seems ready to merge.
Slow bits in epix_slide from a test call. 23.5 s is the baseline runtime.
It was hard to tell in Dmitry's test profile which functions were coming from ...
Code to profile, and output of the baseline profile.
Remaining slow parts are doing ... We can't just swap in a different function here, so, like Logan was saying above (here and here), this gets into more demanding changes. We should figure out what particular types of requests/computations are slow, and how much slower than desired, before spending a lot of time on a rewrite.
Speeds `epix_slide` up ~2x on a series of test cases.