
Benchmark epi_archive operations with and without compactification #111


Closed

brookslogan opened this issue Jun 20, 2022 · 8 comments

Assignee: kenmawer
Labels: P2 low priority, performance

@brookslogan (Contributor)

Compactification (#101) should save space (at least once/if the original input data is garbage collected), but does it save (meaningful) time? We should benchmark some epi_archive operations (construction, as_of, epix_slide overhead, ...) with and without compactification, and compare against other operations (download time, costs of common f functions passed to epi_slide, etc.) to contextualize whether any changes are important. If benchmarking shows that compactification actually slows the epi_archive down considerably, then we should reconsider the default for compactification; if it shows a significant speedup, then we might want to show this off in a vignette (and perhaps move to an encapsulated design or a dtplyr approach, so that we can message, or stay silent, when compactifying by default, rather than warning).
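A sketch of the kind of comparison this asks for, assuming a version of as_epi_archive() that takes a compactify argument, and using synthetic data in which later versions merely re-report the same value (so compactification has redundant rows to drop):

```r
library(epiprocess)
library(microbenchmark)

# Synthetic archive-format data: geo_value / time_value / version plus a
# measurement column that never changes across versions, so every
# re-report after the first is redundant and compactifiable.
dv <- expand.grid(
  geo_value  = c("ca", "fl", "tx"),
  time_value = as.Date("2020-06-01") + 0:99,
  version    = as.Date("2020-06-01") + 0:120
)
dv <- dv[dv$version >= dv$time_value, ]
dv$percent_cli <- 1

# Compare construction time with and without compactification.
microbenchmark(
  compactified = as_epi_archive(dv, compactify = TRUE),
  full         = as_epi_archive(dv, compactify = FALSE),
  times = 10L
)
```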

brookslogan added the "P2 low priority" and "performance" labels on Jun 20, 2022
kenmawer self-assigned this on Jun 21, 2022
@kenmawer (Contributor) commented Jun 22, 2022

@brookslogan Are we concerned with user speed or system speed when it comes to performance?

[Screenshot: benchmark output comparing user and system time for dplyr::filter, subset, and bracket ([) filtering]

Given that we use dplyr::filter rather than subset or Python-style filtering with [ (which has lower system time but higher user time), this suggests that system speed is what matters most.

I know I should be testing epi_archive operations, but I just noticed something when trying to figure out how the relative performance on the user end may differ from the system end.
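For concreteness, here's a minimal sketch of the comparison I'm describing, on toy data (the column names are made up); system.time() reports all three components at once:

```r
library(data.table)

dt <- data.table(geo_value = rep(c("ca", "fl"), each = 5e5),
                 value = rnorm(1e6))

# "user" = CPU time spent in R, "system" = CPU time spent in OS calls,
# "elapsed" = wall-clock time.
system.time(dplyr::filter(dt, geo_value == "ca"))
system.time(subset(dt, geo_value == "ca"))
system.time(dt[geo_value == "ca"])
```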

@kenmawer (Contributor)

Finished with PR #118.

@brookslogan (Contributor, Author)

Do we use dplyr::filter with epi_archives? If so, that might be a bug; we probably want to use data.table-native operations. (Note that [ on a data.table has a special implementation; it's not the same as [ on a regular data.frame, because we've enabled .datatable.aware, and it may actually be multithreaded...)

I think it's natural to assume that filter on fewer entries will run faster. The benchmarks should test epi_archive operations as you said; what was the issue you noted?

Benchmarking has a lot of gotchas; we should probably use a pre-packaged solution like microbenchmark rather than rolling our own.
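For example, a minimal microbenchmark sketch comparing dplyr::filter against data.table's native [ on an archive's DT (here `archive` is assumed to be an epi_archive built elsewhere, and the geo_value filter is arbitrary):

```r
library(microbenchmark)
library(data.table)

DT <- archive$DT  # assumed: `archive` is an existing epi_archive

microbenchmark(
  dplyr_filter = dplyr::filter(DT, geo_value == "ca"),
  dt_bracket   = DT[geo_value == "ca"],
  times = 50L
)
```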

@kenmawer (Contributor)

@brookslogan No, I am not using filter (or subset, or Python-style [ filtering) on an epi_archive, but all three work on a data.table.

If we are only testing epi_archive, the question of benchmarking filter (or either of the other methods) is irrelevant. However, the issue I'm raising is that there are both user and system times; should I include both in the benchmarks, or only one?

@brookslogan (Contributor, Author)

Right, I meant filter on the archive's DT; it's natural to assume that filter would be faster when we've reduced the number of rows. (If it's not, that's very surprising, and probably indicates a bug.) Speed gains are less easy to claim without data when data.table may be doing something complicated internally, so that fewer rows isn't necessarily faster, e.g., maybe in as_of.

  • For our reassurance: is filter faster on the DT of a compactified epi_archive based on your tests? If so, great, and we should probably save the relevant code snippet in this thread or the performance issue thread, but it might be questionable for the main archive vignette. In fact, all the performance stuff might need to be its own vignette.
  • For the vignette: we want to test the performance of $as_of, and maybe $new (see the sketch after this list).
  • For the main archive vignette and roxygen documentation, we want to avoid claiming speed benefits if these benchmarks don't back them up. The space benefit is clear (well, once garbage collection has run, and ignoring any memory the OS doesn't reclaim after garbage collection....).
    I think you've already addressed the points up above except for maybe the $new benchmarking? But that's lower priority; it doesn't seem like it would be that much of a cost when $as_of could be called many times per $new from a single $slide.
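A sketch of the $as_of comparison (assuming `archive_compact` and `archive_full` were built from the same input with compactify = TRUE and compactify = FALSE, and that the version date is arbitrary):

```r
library(microbenchmark)

# Assumed: archive_compact / archive_full are epi_archives built from the
# same data, differing only in the compactify setting.
microbenchmark(
  compactified = archive_compact$as_of(as.Date("2021-06-01")),
  full         = archive_full$as_of(as.Date("2021-06-01")),
  times = 20L
)
```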

Regarding system vs. user vs. real: I think we care about system+user and real separately, depending on the use case:

  • Case 1: the user doesn't parallelize outside $as_of, and $slide doesn't use parallelism. Here we care about "real" time; some data.table operations benefit $as_of here through their internal use of parallelism.
  • Case 2: the user does parallelize their code, or $slide does use parallelism. Here we care about system+user time, which I think is what "CPU time" is supposed to mean (or the real time when the benchmark code mimics this type of parallelism). If the user is already parallelizing, data.table will try to detect this and turn off its internal parallelism, because it would probably hurt more than help (see the thread-control sketch below).
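One knob to hold fixed while benchmarking the two cases is data.table's internal thread count, via its getDTthreads()/setDTthreads(); a sketch:

```r
library(data.table)

getDTthreads()    # how many threads data.table will currently use
setDTthreads(1L)  # mimic Case 2: the caller owns all the parallelism
# ... run the Case 2 benchmarks here ...
setDTthreads(0L)  # restore the default (0 = all logical CPUs data.table allows)
```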

@kenmawer (Contributor) commented Jul 8, 2022

  • Yeah, I put filter separately as it's not an operation for epi_archive.
  • I updated that.
  • I would still need to figure out how to track memory usage.

How would I go about calculating "real" time? I had assumed that what I did with elapsed time corresponds to Case 2, i.e., that elapsed time is the sum of system and user time; is that right?
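For reference, R's system.time() separates the two: the "elapsed" entry is wall-clock ("real") time, while user + system approximates CPU time. A minimal sketch:

```r
res <- system.time(Sys.sleep(1))        # sleeping uses wall time, not CPU
res[["elapsed"]]                        # ~1.0: real (wall-clock) seconds
res[["user.self"]] + res[["sys.self"]]  # ~0: CPU seconds
```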

@brookslogan (Contributor, Author)

I think we discussed this via Slack; are there any remaining questions here? From reading about the different benchmarking methods, did you find that the one now used in the vignette is considered accurate and reliable?

@brookslogan (Contributor, Author)

Closing as completed by #101.
