
Benchmark epi_archive operations with and without compactification #111


Closed

brookslogan opened this issue Jun 20, 2022 · 8 comments

Assignee: kenmawer
Labels: P2 low priority, performance

@brookslogan (Contributor)

Compactification (#101) should save space (at least once/if the original input data is garbage collected), but does it save (meaningful) time? We should benchmark some epi_archive operations (construction, as_of, epix_slide overhead, ...) with and without compactification, and compare against other operations (download time, costs of common f functions passed to epi_slide, etc.) to contextualize whether any changes are important. If benchmarking shows that compactification actually slows the epi_archive down considerably, then we should reconsider the default for compactification; if it shows a significant speedup, then we might want to show this off in a vignette (and perhaps move to an encapsulated design or a dtplyr approach, so that we can message, or stay silent, when compactifying by default, rather than warning).
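A sketch of the kind of comparison this asks for, assuming a version of as_epi_archive() that takes a compactify argument, and using synthetic data in which later versions merely re-report the same value (so compactification has redundant rows to drop):

```r
library(epiprocess)
library(microbenchmark)

# Synthetic archive-format data: geo_value / time_value / version plus a
# measurement column that never changes across versions, so every
# re-report after the first is redundant and compactifiable.
dv <- expand.grid(
  geo_value  = c("ca", "fl", "tx"),
  time_value = as.Date("2020-06-01") + 0:99,
  version    = as.Date("2020-06-01") + 0:120
)
dv <- dv[dv$version >= dv$time_value, ]
dv$percent_cli <- 1

# Compare construction time with and without compactification.
microbenchmark(
  compactified = as_epi_archive(dv, compactify = TRUE),
  full         = as_epi_archive(dv, compactify = FALSE),
  times = 10L
)
```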

brookslogan added the "P2 low priority" and "performance" labels on Jun 20, 2022
kenmawer self-assigned this on Jun 21, 2022
@kenmawer (Contributor) commented Jun 22, 2022

@brookslogan Are we concerned with user speed or system speed when it comes to performance?

[Screenshot: benchmark output comparing user and system time for dplyr::filter, subset, and bracket ([) filtering]

Given that we use dplyr::filter rather than subset or Python-style filtering with [ (which has lower system time but higher user time), this suggests that system speed is what matters most.

I know I should be testing epi_archive operations, but I just noticed something when trying to figure out how the relative performance on the user end may differ from the system end.
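For concreteness, here's a minimal sketch of the comparison I'm describing, on toy data (the column names are made up); system.time() reports all three components at once:

```r
library(data.table)

dt <- data.table(geo_value = rep(c("ca", "fl"), each = 5e5),
                 value = rnorm(1e6))

# "user" = CPU time spent in R, "system" = CPU time spent in OS calls,
# "elapsed" = wall-clock time.
system.time(dplyr::filter(dt, geo_value == "ca"))
system.time(subset(dt, geo_value == "ca"))
system.time(dt[geo_value == "ca"])
```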

@kenmawer (Contributor)

Finished with PR #118.

@brookslogan (Contributor, Author)

Do we use dplyr::filter with epi_archives? If so, that might be a bug; we probably want to use data.table-native operations. (Note that [ on a data.table has a special implementation; it's not the same as [ on a regular data.frame, because we've enabled .datatable.aware, and it may actually be multithreaded...)

I think it's natural to assume that filter on fewer entries will run faster. The benchmarks should test epi_archive operations as you said; what was the issue you noted?

Benchmarking has a lot of gotchas; we should probably use a pre-packaged solution like microbenchmark rather than rolling our own.
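For example, a minimal microbenchmark sketch comparing dplyr::filter against data.table's native [ on an archive's DT (here `archive` is assumed to be an epi_archive built elsewhere, and the geo_value filter is arbitrary):

```r
library(microbenchmark)
library(data.table)

DT <- archive$DT  # assumed: `archive` is an existing epi_archive

microbenchmark(
  dplyr_filter = dplyr::filter(DT, geo_value == "ca"),
  dt_bracket   = DT[geo_value == "ca"],
  times = 50L
)
```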

@kenmawer (Contributor)

@brookslogan No, I am not using filter (or subset, or Python-style [ filtering) on an epi_archive, but all three work on a data.table.

If we are only testing epi_archive, the question of benchmarking filter (or either of the other methods) is irrelevant. However, the issue I'm raising is that there are both user and system times; should I include both in the benchmarks, or only one?

@brookslogan (Contributor, Author)

Right, I meant filter on the archive's DT; it's natural to assume that filter would be faster when we've reduced the number of rows. (If it's not, that's very surprising, and probably indicates a bug.) Speed gains are less easy to claim without data when data.table may be doing something complicated internally, so that fewer rows isn't necessarily faster, e.g., maybe in as_of.

  • For our reassurance: is filter faster on the DT of a compactified epi_archive based on your tests? If so, great, and we should probably save the relevant code snippet in this thread or the performance issue thread, but it might be questionable for the main archive vignette. In fact, all the performance stuff might need to be its own vignette.
  • For the vignette: we want to test the performance of $as_of, and maybe $new (see the sketch after this list).
  • For the main archive vignette and roxygen documentation, we want to avoid claiming speed benefits if these benchmarks don't back them up. The space benefit is clear (well, once garbage collection has run, and ignoring any memory the OS doesn't reclaim after garbage collection....).
    I think you've already addressed the points up above except for maybe the $new benchmarking? But that's lower priority; it doesn't seem like it would be that much of a cost when $as_of could be called many times per $new from a single $slide.
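A sketch of the $as_of comparison (assuming `archive_compact` and `archive_full` were built from the same input with compactify = TRUE and compactify = FALSE, and that the version date is arbitrary):

```r
library(microbenchmark)

# Assumed: archive_compact / archive_full are epi_archives built from the
# same data, differing only in the compactify setting.
microbenchmark(
  compactified = archive_compact$as_of(as.Date("2021-06-01")),
  full         = archive_full$as_of(as.Date("2021-06-01")),
  times = 20L
)
```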

Regarding system vs. user vs. real: I think we care about system+user and real separately, depending on the use case:

  • Case 1: the user doesn't parallelize outside $as_of, and $slide doesn't use parallelism. Here we care about "real" time; some data.table operations benefit $as_of here through their internal use of parallelism.
  • Case 2: the user does parallelize their code, or $slide does use parallelism. Here we care about system+user time, which I think is what "CPU time" is supposed to mean (or the real time when the benchmark code mimics this type of parallelism). If the user is already parallelizing, data.table will try to detect this and turn off its internal parallelism, because it would probably hurt more than help (see the thread-control sketch below).
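One knob to hold fixed while benchmarking the two cases is data.table's internal thread count, via its getDTthreads()/setDTthreads(); a sketch:

```r
library(data.table)

getDTthreads()    # how many threads data.table will currently use
setDTthreads(1L)  # mimic Case 2: the caller owns all the parallelism
# ... run the Case 2 benchmarks here ...
setDTthreads(0L)  # restore the default (0 = all logical CPUs data.table allows)
```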

@kenmawer (Contributor) commented Jul 8, 2022

  • Yeah, I put filter separately as it's not an operation for epi_archive.
  • I updated that.
  • I would still need to figure out how to track memory usage.

How would I go about calculating "real" time? I had assumed that what I did with elapsed time corresponds to Case 2, i.e., that elapsed time is the sum of system and user time; is that right?
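For reference, R's system.time() separates the two: the "elapsed" entry is wall-clock ("real") time, while user + system approximates CPU time. A minimal sketch:

```r
res <- system.time(Sys.sleep(1))        # sleeping uses wall time, not CPU
res[["elapsed"]]                        # ~1.0: real (wall-clock) seconds
res[["user.self"]] + res[["sys.self"]]  # ~0: CPU seconds
```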

@brookslogan (Contributor, Author)

I think we discussed this via Slack; are there any remaining questions here? From reading about the different benchmarking methods, did you find that the one now used in the vignette is considered accurate and reliable?

@brookslogan (Contributor, Author)

Closing as completed by #101.
