Skip to content

Speed up duplicate detection in as_epi_df() #560

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brookslogan opened this issue Oct 31, 2024 · 4 comments · Fixed by #564
Closed

Speed up duplicate detection in as_epi_df() #560

brookslogan opened this issue Oct 31, 2024 · 4 comments · Fixed by #564
Assignees

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Oct 31, 2024

When writing some code for archive to archive slides, as_epi_df was taking most of the time. I can/should probably avoid that with new_epi_df or an as_epi_df.data.table, but it'd probably still be nice to speed this up in case we/users want to have the convenience/security of as_epi_df.

Most of the time in as_epi_df appears to be spent in duplicate detection:
2024-10-31-072042_535x39_scrot

Here's some limited testing on duplicate check approaches; looks like we can speed duplicate checks up by >50x, for "medium"-sized inputs at least.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

dup_check1 <- function(x, other_keys) {
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(arranged_epikeytimes[-1L,], arranged_epikeytimes[-nrow(arranged_epikeytimes),]))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0

Created on 2024-10-31 with reprex v2.1.1

vctrs::vec_equal should keep this pretty general, though I don't know how it compares to less general approaches speed-wise.

I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on different sizes.

@brookslogan
Copy link
Contributor Author

brookslogan commented Nov 25, 2024

I ordered selection & filtering in probably a suboptimal way in dup_check3; I've experimented with adjusting that in dup_check4.

dup_check4 seems pretty consistent time-wise, though with fc_test1 dup_check2 seems to be much faster on tiny epi_dfs, and it's not like dup_check{3,4} are slow in absolute terms & they improve on dup_check1 there, so we're probably good to go ahead and change to dup_check{3,4}.

My R installation isn't configured for memory profiling, so no memory comparison results. But it should all be in temporary objects. Some smart-copy-on-write might be disabled permanently (and more likely in dup_check4 than dup_check3), but with dplyr it seems to get disabled super easily no matter what is done.

I'm planning to swap in the dup_check4 approach.

  • todo: fix bad performance and unclear interface on grouped tibbles (done in PR, by ensuring ungrouped)

Note: the bad tests are less important performance-wise, and they also aren't fair to dup_check1 as 2--4 will need to go back and prepare an error message. I don't think this changes the preferred approach.

Note 2: bench::mark defaults give noisy timings. I've attempted to ameliorate that without having to wait too long by setting a small min_time that will hypothetically still 6x the testing time.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epidatasets)
library(epipredict)
#> Loading required package: epiprocess
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> Loading required package: parsnip
#> Registered S3 method overwritten by 'epipredict':
#>   method            from   
#>   print.step_naomit recipes

dup_check1 <- function(x, other_keys) {
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(arranged_epikeytimes[-1L,], arranged_epikeytimes[-nrow(arranged_epikeytimes),]))
  }
}

dup_check4 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    ukey_names <- c("geo_value", "time_value", other_keys)
    arranged_ukeys <- arrange(x[ukey_names], across(all_of(ukey_names)))
    any(vctrs::vec_equal(arranged_ukeys[-1L,], arranged_ukeys[-nrow(arranged_ukeys),]))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character()),
  dup_check4(test_tbl, character()),
  min_time = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 293.93ms 298.49ms      3.29        NA     7.57
#> 2 dup_check2(test_tbl, character… 150.56ms 176.29ms      5.06        NA    11.4 
#> 3 dup_check3(test_tbl, character…   3.88ms   5.42ms    169.          NA    20.3 
#> 4 dup_check4(test_tbl, character…   3.78ms   3.92ms    224.          NA    22.3

objs <- tibble::lst(
  cases_deaths_subset,
  counts_subset,
  county_smoothed_cli_comparison,
  covid_case_death_rates,
  covid_case_death_rates_extended,
  covid_confirmed_cumulative_num,
  covid_incidence_county_subset,
  covid_incidence_outliers,
  ctis_covid_behaviours,
  )

for(nm in names(objs)) {
  print(nm)
  print(dup_check4(objs[[nm]], character()))
  print(bench::mark(
    dup_check1(objs[[nm]], character()),
    dup_check2(objs[[nm]], character()),
    dup_check3(objs[[nm]], character()),
    dup_check4(objs[[nm]], character()),
    min_time = 3
  ))
}
#> [1] "cases_deaths_subset"
#> [1] FALSE
#> # A tibble: 4 × 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dup_check… 35.89ms 37.02ms      26.9        NA     9.72    58    21      2.16s
#> 2 dup_check… 14.33ms 15.67ms      62.9        NA    14.4    144    33      2.29s
#> 3 dup_check…  3.04ms  3.16ms     302.         NA     4.49   874    13      2.89s
#> 4 dup_check…  3.01ms  3.13ms     310.         NA     4.12   904    12      2.92s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
#> [1] "counts_subset"
#> [1] FALSE
#> # A tibble: 4 × 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dup_check… 14.72ms 15.36ms      64.8        NA     6.64   166    17      2.56s
#> 2 dup_check…  3.74ms  3.98ms     251.         NA    12.0    689    33      2.75s
#> 3 dup_check…  2.65ms  2.83ms     351.         NA     4.82  1018    14       2.9s
#> 4 dup_check…  2.69ms  2.83ms     347.         NA     4.46  1010    13      2.91s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
#> [1] "county_smoothed_cli_comparison"
#> [1] FALSE
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 13
#>   expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dup_check1(objs[[n… 527.89ms  531.3ms      1.87        NA     7.47     6    24
#> 2 dup_check2(objs[[n… 256.02ms 305.89ms      2.85        NA    11.4      9    36
#> 3 dup_check3(objs[[n…   8.48ms   9.11ms     93.7         NA    18.3    282    55
#> 4 dup_check4(objs[[n…      8ms   8.63ms    102.          NA    13.3    307    40
#> # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#> #   time <list>, gc <list>
#> [1] "covid_case_death_rates"
#> [1] FALSE
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 13
#>   expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dup_check1(objs[[n… 159.46ms 163.32ms      6.09        NA     7.69    19    24
#> 2 dup_check2(objs[[n…  83.33ms  98.08ms      9.39        NA    11.3     29    35
#> 3 dup_check3(objs[[n…   4.36ms   5.08ms    185.          NA     8.65   556    26
#> 4 dup_check4(objs[[n…   4.17ms   4.63ms    191.          NA     8.32   574    25
#> # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#> #   time <list>, gc <list>
#> [1] "covid_case_death_rates_extended"
#> [1] FALSE
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 13
#>   expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dup_check1(objs[[n… 285.94ms 302.54ms      3.34        NA     7.89    11    26
#> 2 dup_check2(objs[[n… 153.32ms 189.23ms      4.89        NA    10.4     15    32
#> 3 dup_check3(objs[[n…   5.73ms   5.99ms    142.          NA    12.5    443    39
#> 4 dup_check4(objs[[n…   5.59ms   5.78ms    164.          NA    11.7    491    35
#> # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#> #   time <list>, gc <list>
#> [1] "covid_confirmed_cumulative_num"
#> [1] FALSE
#> # A tibble: 4 × 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dup_check… 23.94ms 24.87ms      39.9        NA     8.23    97    20      2.43s
#> 2 dup_check…  9.88ms 10.65ms      94.0        NA    14.0    222    33      2.36s
#> 3 dup_check…  2.34ms  2.44ms     394.         NA     4.48  1142    13       2.9s
#> 4 dup_check…  2.22ms  2.27ms     426.         NA     4.10  1246    12      2.93s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
#> [1] "covid_incidence_county_subset"
#> [1] FALSE
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 13
#>   expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 dup_check1(objs[[n… 133.29ms 136.23ms      7.35        NA     7.35    23    23
#> 2 dup_check2(objs[[n…  61.42ms  75.97ms     12.1         NA    10.8     37    33
#> 3 dup_check3(objs[[n…   4.14ms   4.41ms    206.          NA     7.33   618    22
#> 4 dup_check4(objs[[n…   3.82ms   4.17ms    216.          NA     6.99   648    21
#> # ℹ 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#> #   time <list>, gc <list>
#> [1] "covid_incidence_outliers"
#> [1] FALSE
#> # A tibble: 4 × 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dup_check… 12.38ms 12.78ms      77.4        NA     5.81   213    16      2.75s
#> 2 dup_check…  2.63ms  2.82ms     350.         NA    11.8    976    33      2.79s
#> 3 dup_check…  2.71ms   2.8ms     346.         NA     4.46  1009    13      2.91s
#> 4 dup_check…  2.66ms  2.79ms     348.         NA     4.81  1013    14      2.91s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
#> [1] "ctis_covid_behaviours"
#> [1] FALSE
#> # A tibble: 4 × 13
#>   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dup_check… 14.75ms 15.39ms      64.5        NA     6.21   166    16      2.58s
#> 2 dup_check…  3.77ms  4.04ms     246.         NA    12.0    675    33      2.74s
#> 3 dup_check…  2.75ms  2.85ms     342.         NA     4.81   994    14      2.91s
#> 4 dup_check…  2.74ms  2.82ms     347.         NA     4.46  1011    13      2.92s
#> # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

print(dup_check4(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys))
#> [1] FALSE

bench::mark(
  dup_check1(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check2(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check3(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check4(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(grad_employ_subset, … 18.94ms 19.92ms      49.8        NA     7.01
#> 2 "dup_check2(grad_employ_subset, … 12.97ms 14.04ms      70.6        NA    13.1 
#> 3 "dup_check3(grad_employ_subset, …  2.92ms  3.02ms     322.         NA     4.69
#> 4 "dup_check4(grad_employ_subset, …  2.84ms  3.13ms     312.         NA     4.46


bad1 <- rbind(grad_employ_subset, grad_employ_subset[800,])

bench::mark(
  dup_check1(bad1, attr(bad1, "metadata")$other_keys),
  dup_check2(bad1, attr(bad1, "metadata")$other_keys),
  dup_check3(bad1, attr(bad1, "metadata")$other_keys),
  dup_check4(bad1, attr(bad1, "metadata")$other_keys),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(bad1, attr(bad1, \"m… 18.74ms 19.27ms      51.4        NA     6.95
#> 2 "dup_check2(bad1, attr(bad1, \"m… 12.49ms 13.55ms      72.7        NA    13.5 
#> 3 "dup_check3(bad1, attr(bad1, \"m…   2.9ms  3.02ms     321.         NA     4.84
#> 4 "dup_check4(bad1, attr(bad1, \"m…  2.94ms  3.04ms     317.         NA     4.47

dup_check4(bad1, attr(bad1, "metadata")$other_keys)
#> [1] TRUE

# from ?arx_forecaster:
jhu <- covid_case_death_rates %>%
  dplyr::filter(time_value >= as.Date("2021-12-01"))
out <- arx_forecaster(jhu, "death_rate",
                      c("case_rate", "death_rate"),
                      trainer = quantile_reg(),
                      args_list = arx_args_list(quantile_levels = 1:9 / 10)
                      )

fc_test1 <- out$predictions %>%
  transmute(geo_value, time_value = target_date, .pred, .pred_distn) %>%
  as_epi_df(as_of = attr(jhu, "metadata")$as_of)

dup_check4(fc_test1, character())
#> [1] FALSE

bench::mark(
  dup_check1(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check2(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check3(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check4(fc_test1, attr(fc_test1, "metadata")$other_keys),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(fc_test1, attr(fc_…   7.79ms   8.19ms      118.        NA     5.75
#> 2 "dup_check2(fc_test1, attr(fc_… 301.24µs 322.18µs     3005.        NA     7.57
#> 3 "dup_check3(fc_test1, attr(fc_…   2.62ms   2.75ms      359.        NA     3.40
#> 4 "dup_check4(fc_test1, attr(fc_…   2.61ms   2.73ms      355.        NA     3.40

fc_test_bad1 <- rbind(fc_test1, fc_test1[30,])

dup_check4(fc_test_bad1, character())
#> [1] TRUE

bench::mark(
  dup_check1(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check2(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check3(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check4(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(fc_test_bad1, attr…   7.67ms   7.84ms      123.        NA     3.10
#> 2 "dup_check2(fc_test_bad1, attr… 305.54µs 323.89µs     3089.        NA     6.50
#> 3 "dup_check3(fc_test_bad1, attr…   2.63ms   2.76ms      360.        NA     3.40
#> 4 "dup_check4(fc_test_bad1, attr…   2.62ms   2.73ms      365.        NA     3.40

packed_key_ex1 <- cases_deaths_subset %>%
  mutate(geo_value = tibble(country = "us", state = .env$.data$geo_value))

dup_check4(packed_key_ex1, character())
#> [1] FALSE

bench::mark(
  dup_check1(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check2(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check3(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check4(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(packed_key_ex1, at…  37.65ms  41.21ms     23.8         NA     5.44
#> 2 "dup_check2(packed_key_ex1, at… 174.24ms 176.55ms      5.66        NA    42.5 
#> 3 "dup_check3(packed_key_ex1, at…   3.19ms   3.31ms    294.          NA     4.47
#> 4 "dup_check4(packed_key_ex1, at…   2.85ms   3.16ms    314.          NA     3.98

packed_key_bad1 <- cases_deaths_subset %>%
  mutate(geo_value = tibble(country = "us", state = .env$.data$geo_value)) %>%
  rbind({.})

dup_check4(packed_key_bad1, character())
#> [1] TRUE

bench::mark(
  dup_check1(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check2(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check3(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check4(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  min_time = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dup_check1(packed_key_bad1, a…  43.81ms   45.7ms     21.6         NA     4.99
#> 2 "dup_check2(packed_key_bad1, a… 332.53ms 339.56ms      2.93        NA     5.86
#> 3 "dup_check3(packed_key_bad1, a…   3.56ms   3.88ms    227.          NA     6.66
#> 4 "dup_check4(packed_key_bad1, a…   3.43ms   3.72ms    266.          NA     6.66

test_tbl2 <- test_tbl[1:1000,]

bench::mark(
  dup_check1(test_tbl2, character()),
  dup_check2(test_tbl2, character()),
  dup_check3(test_tbl2, character()),
  dup_check4(test_tbl2, character()),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl2, character(… 10.68ms 11.32ms      88.1        NA     4.62
#> 2 dup_check2(test_tbl2, character(…  3.54ms  3.78ms     268.         NA     8.57
#> 3 dup_check3(test_tbl2, character(…  2.02ms  2.16ms     459.         NA     3.05
#> 4 dup_check4(test_tbl2, character(…     2ms  2.14ms     469.         NA     3.40


test_tbl2 <- test_tbl[1:56,]

bench::mark(
  dup_check1(test_tbl2, character()),
  dup_check2(test_tbl2, character()),
  dup_check3(test_tbl2, character()),
  dup_check4(test_tbl2, character()),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl2, characte…   4.28ms   4.57ms      219.        NA     3.43
#> 2 dup_check2(test_tbl2, characte… 225.07µs 259.12µs     3853.        NA     7.33
#> 3 dup_check3(test_tbl2, characte…   1.95ms   2.09ms      482.        NA     3.40
#> 4 dup_check4(test_tbl2, characte…   1.95ms   2.09ms      480.        NA     3.05

test_tbl3 <- test_tbl[1:200,]

bench::mark(
  dup_check1(test_tbl3, character()),
  dup_check2(test_tbl3, character()),
  dup_check3(test_tbl3, character()),
  dup_check4(test_tbl3, character()),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl3, characte…   5.23ms   5.66ms      175.        NA     3.80
#> 2 dup_check2(test_tbl3, characte… 740.34µs 801.31µs     1252.        NA     7.94
#> 3 dup_check3(test_tbl3, characte…   2.03ms   2.14ms      459.        NA     3.06
#> 4 dup_check4(test_tbl3, characte…      2ms    2.1ms      472.        NA     3.40

test_tbl4 <- test_tbl[1:500,]

bench::mark(
  dup_check1(test_tbl4, character()),
  dup_check2(test_tbl4, character()),
  dup_check3(test_tbl4, character()),
  dup_check4(test_tbl4, character()),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl4, character())  7.33ms 7.82ms      126.        NA     4.03
#> 2 dup_check2(test_tbl4, character())  1.77ms 1.91ms      521.        NA     8.35
#> 3 dup_check3(test_tbl4, character())  2.04ms 2.15ms      449.        NA     3.07
#> 4 dup_check4(test_tbl4, character())  1.98ms 2.12ms      465.        NA     3.06

test_tbl5 <- test_tbl[1:1000,]

bench::mark(
  dup_check1(test_tbl5, character()),
  dup_check2(test_tbl5, character()),
  dup_check3(test_tbl5, character()),
  dup_check4(test_tbl5, character()),
  min_time = 3
)
#> # A tibble: 4 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl5, character(… 11.14ms  11.5ms      85.2        NA     4.63
#> 2 dup_check2(test_tbl5, character(…  3.52ms  3.79ms     265.         NA     8.54
#> 3 dup_check3(test_tbl5, character(…  2.02ms  2.15ms     463.         NA     3.06
#> 4 dup_check4(test_tbl5, character(…     2ms  2.14ms     469.         NA     3.40

Created on 2024-11-25 with reprex v2.1.1

@nmdefries
Copy link
Contributor

Memory results: Option 2 is always uses the least memory by a large factor. Option 3 sometimes uses less memory than option 1, option 4 always does (how much less varies).

  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

dup_check2 <- function(x, other_keys) {
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(arranged_epikeytimes[-1L,], arranged_epikeytimes[-nrow(arranged_epikeytimes),]))
  }
}

dup_check4 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    ukey_names <- c("geo_value", "time_value", other_keys)
    arranged_ukeys <- arrange(x[ukey_names], across(all_of(ukey_names)))
    any(vctrs::vec_equal(arranged_ukeys[-1L,], arranged_ukeys[-nrow(arranged_ukeys),]))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character()),
  dup_check4(test_tbl, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl, cha… 457.27ms 475.93ms      2.03     3.5MB     4.35     7    15      3.45s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl, cha… 246.21ms 281.69ms      3.28  821.97KB     6.89    10    21      3.05s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl, cha…   6.45ms   7.61ms    123.      4.02MB    14.0    371    42         3s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl, cha…   6.04ms   6.59ms    129.      3.39MB    12.3    387    37         3s <lgl>  <Rprofmem> <bench_tm> <tibble>

objs <- tibble::lst(
  cases_deaths_subset,
  counts_subset,
  county_smoothed_cli_comparison,
  covid_case_death_rates,
  covid_case_death_rates_extended,
  covid_confirmed_cumulative_num,
  covid_incidence_county_subset,
  covid_incidence_outliers,
  ctis_covid_behaviours,
)

for(nm in names(objs)) {
  print(nm)
  print(dup_check4(objs[[nm]], character()))
  print(bench::mark(
    dup_check1(objs[[nm]], character()),
    dup_check2(objs[[nm]], character()),
    dup_check3(objs[[nm]], character()),
    dup_check4(objs[[nm]], character()),
    min_time = 3
  ))
}
[1] "cases_deaths_subset"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha… 56.82ms 58.92ms      16.8   708.9KB     5.30    38    12      2.26s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha… 24.54ms 25.55ms      39.1    79.3KB     8.50    92    20      2.35s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  4.56ms  4.89ms     203.    577.1KB     2.74   591     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  4.47ms  4.83ms     205.      451KB     2.76   594     8       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "counts_subset"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha…  23.8ms 24.64ms      40.4   251.9KB     4.11   108    11      2.67s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha…  6.57ms     7ms     141.     28.5KB     6.51   391    18      2.76s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  4.15ms  4.49ms     221.    174.1KB     2.75   642     8      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  4.13ms  4.45ms     222.    157.5KB     2.74   648     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "county_smoothed_cli_comparison"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha… 747.9ms 766.4ms      1.22   12.16MB     4.89     4    16      3.27s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha… 432.8ms 509.9ms      1.99    1.23MB     7.95     7    28      3.52s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  13.2ms  13.9ms     54.1     9.31MB    14.3    163    43      3.01s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  12.5ms  12.9ms     68.8     7.85MB    14.0    207    42      3.01s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "covid_case_death_rates"
[1] FALSE
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], c… 251.98ms 255.85ms      3.90    3.94MB     4.88    12    15      3.08s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], c… 141.56ms 152.48ms      6.04  496.33KB     6.94    20    23      3.31s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], c…   6.83ms   7.19ms    125.      2.85MB     8.53   382    26      3.05s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], c…   6.64ms   7.06ms    130.      2.54MB     7.29   409    23      3.16s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "covid_case_death_rates_extended"
[1] FALSE
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], c… 456.01ms 463.02ms      2.15    6.71MB     5.23     7    17      3.25s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], c… 253.84ms 284.94ms      3.41  952.48KB     7.13    11    23      3.23s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], c…    9.1ms   9.62ms     90.4     4.97MB    12.0    272    36      3.01s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], c…   8.81ms   9.32ms     92.0      4.4MB     9.66   276    29         3s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "covid_confirmed_cumulative_num"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha… 38.48ms  39.6ms      25.2     342KB     4.88    62    12      2.46s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha… 16.97ms 17.65ms      56.5      54KB     7.91   143    20      2.53s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  3.47ms  3.73ms     265.      573KB     3.44   770    10      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  3.19ms  3.42ms     288.      330KB     2.38   847     7      2.94s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "covid_incidence_county_subset"
[1] FALSE
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], c… 205.96ms 205.96ms      4.86    2.94MB    68.0      1    14   205.96ms <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], c… 104.47ms 108.49ms      9.17  318.12KB    70.3      3    23   327.29ms <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], c…   6.47ms   6.93ms    140.      2.29MB     7.37   360    19      2.58s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], c…   6.09ms   6.47ms    154.      1.92MB     6.81   408    18      2.65s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "covid_incidence_outliers"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha… 19.49ms 20.23ms      49.1   154.5KB     3.64   135    10      2.75s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha…  4.45ms  4.74ms     209.     16.7KB     6.74   588    19      2.82s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  4.04ms  4.36ms     225.    105.4KB     2.74   657     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  4.03ms  4.36ms     226.     99.6KB     3.09   659     9      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
[1] "ctis_covid_behaviours"
[1] FALSE
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(objs[[nm]], cha… 23.43ms 24.41ms      40.8   251.9KB     4.08   110    11       2.7s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(objs[[nm]], cha…  6.44ms  6.86ms     145.     28.5KB     6.48   404    18      2.78s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(objs[[nm]], cha…  4.12ms  4.43ms     222.    174.1KB     2.91   612     8      2.75s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(objs[[nm]], cha…  4.07ms  4.38ms     220.    157.5KB     2.74   644     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>

print(dup_check4(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys))
[1] FALSE

bench::mark(
  dup_check1(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check2(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check3(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  dup_check4(grad_employ_subset, attr(grad_employ_subset, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(grad_employ_su… 29.95ms 30.77ms      32.3   307.2KB     4.18    85    11      2.63s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(grad_employ_su…  20.6ms 21.31ms      46.7    40.9KB     8.61   114    21      2.44s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(grad_employ_su…  4.39ms  4.72ms     210.    247.4KB     2.76   609     8       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(grad_employ_su…  4.37ms  4.69ms     209.    213.3KB     2.74   612     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>


bad1 <- rbind(grad_employ_subset, grad_employ_subset[800,])

bench::mark(
  dup_check1(bad1, attr(bad1, "metadata")$other_keys),
  dup_check2(bad1, attr(bad1, "metadata")$other_keys),
  dup_check3(bad1, attr(bad1, "metadata")$other_keys),
  dup_check4(bad1, attr(bad1, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(bad1, attr(bad… 30.42ms 31.49ms      31.2   318.6KB     3.81    82    10      2.62s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(bad1, attr(bad… 21.02ms 21.84ms      45.6    33.1KB     7.81   111    19      2.43s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(bad1, attr(bad…  4.41ms  4.73ms     208.    253.2KB     3.10   605     9       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(bad1, attr(bad…  4.36ms  4.68ms     211.    219.2KB     2.74   617     8      2.92s <lgl>  <Rprofmem> <bench_tm> <tibble>

dup_check4(bad1, attr(bad1, "metadata")$other_keys)
[1] TRUE

# from ?arx_forecaster:
jhu <- covid_case_death_rates %>%
  dplyr::filter(time_value >= as.Date("2021-12-01"))
out <- arx_forecaster(jhu, "death_rate",
                      c("case_rate", "death_rate"),
                      trainer = quantile_reg(),
                      args_list = arx_args_list(quantile_levels = 1:9 / 10)
)

fc_test1 <- out$predictions %>%
  transmute(geo_value, time_value = target_date, .pred, .pred_distn) %>%
  as_epi_df(as_of = attr(jhu, "metadata")$as_of)

dup_check4(fc_test1, character())
[1] FALSE

bench::mark(
  dup_check1(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check2(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check3(fc_test1, attr(fc_test1, "metadata")$other_keys),
  dup_check4(fc_test1, attr(fc_test1, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(fc_test1, at…  11.64ms  12.27ms      79.1    26.5KB     3.55   223    10      2.82s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(fc_test1, at… 480.94µs 520.45µs    1884.      1.3KB     6.64  5108    18      2.71s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(fc_test1, at…   3.98ms   4.24ms     233.     14.8KB     3.79   677    11       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(fc_test1, at…   3.93ms   4.17ms     236.     12.4KB     3.79   684    11       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>

fc_test_bad1 <- rbind(fc_test1, fc_test1[30,])

dup_check4(fc_test_bad1, character())
[1] TRUE

bench::mark(
  dup_check1(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check2(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check3(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  dup_check4(fc_test_bad1, attr(fc_test1, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(fc_test_bad1…  11.47ms  12.03ms      81.9   32.27KB     3.55   231    10      2.82s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(fc_test_bad1… 485.88µs 522.44µs    1829.     1.31KB     6.55  5309    19       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(fc_test_bad1…   3.98ms   4.25ms     231.    17.45KB     3.80   670    11       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(fc_test_bad1…   3.97ms   4.21ms     234.    14.97KB     3.80   677    11      2.89s <lgl>  <Rprofmem> <bench_tm> <tibble>

packed_key_ex1 <- cases_deaths_subset %>%
  mutate(geo_value = tibble(country = "us", state = .env$.data$geo_value))

dup_check4(packed_key_ex1, character())
[1] FALSE

bench::mark(
  dup_check1(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check2(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check3(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  dup_check4(packed_key_ex1, attr(packed_key_ex1, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(packed_key_e…   58.5ms  60.01ms     16.2     1.03MB     4.97    49    15      3.02s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(packed_key_e… 303.33ms 314.68ms      3.19  373.09KB     5.43    10    17      3.13s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(packed_key_e…   4.78ms   5.11ms    178.    848.86KB     3.66   534    11         3s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(packed_key_e…   4.66ms   4.98ms    196.    722.83KB     3.33   588    10         3s <lgl>  <Rprofmem> <bench_tm> <tibble>

packed_key_bad1 <- cases_deaths_subset %>%
  mutate(geo_value = tibble(country = "us", state = .env$.data$geo_value)) %>%
  rbind({.})

dup_check4(packed_key_bad1, character())
[1] TRUE

bench::mark(
  dup_check1(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check2(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check3(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  dup_check4(packed_key_bad1, attr(packed_key_bad1, "metadata")$other_keys),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 "dup_check1(packed_key_b…  71.53ms   73.8ms     13.2     3.43MB     5.28    40    16      3.03s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 "dup_check2(packed_key_b… 600.81ms 634.56ms      1.59  727.45KB     5.41     5    17      3.14s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 "dup_check3(packed_key_b…   5.81ms   6.18ms    146.       1.7MB     5.99   437    18         3s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 "dup_check4(packed_key_b…   5.64ms   6.03ms    150.      1.45MB     5.32   452    16         3s <lgl>  <Rprofmem> <bench_tm> <tibble>

test_tbl2 <- test_tbl[1:1000,]

bench::mark(
  dup_check1(test_tbl2, character()),
  dup_check2(test_tbl2, character()),
  dup_check3(test_tbl2, character()),
  dup_check4(test_tbl2, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl2, char… 18.07ms 18.76ms      53.0   125.8KB     5.22   142    14      2.68s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl2, char…  5.94ms  6.34ms     157.     15.9KB     8.83   426    24      2.72s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl2, char…  3.04ms  3.23ms     304.    137.7KB     3.44   883    10      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl2, char…  3.01ms   3.2ms     308.    121.9KB     3.80   891    11       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>


test_tbl2 <- test_tbl[1:56,]

bench::mark(
  dup_check1(test_tbl2, character()),
  dup_check2(test_tbl2, character()),
  dup_check3(test_tbl2, character()),
  dup_check4(test_tbl2, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl2, ch…   6.74ms   7.28ms      135.   15.77KB     3.85   385    11      2.86s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl2, ch… 406.36µs 437.33µs     2254.    1.03KB     7.26  6524    21      2.89s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl2, ch…   2.94ms   3.11ms      314.   11.48KB     3.66   859    10      2.73s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl2, ch…   2.91ms   3.08ms      320.   10.49KB     3.78   930    11      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>

test_tbl3 <- test_tbl[1:200,]

bench::mark(
  dup_check1(test_tbl3, character()),
  dup_check2(test_tbl3, character()),
  dup_check3(test_tbl3, character()),
  dup_check4(test_tbl3, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl3, charac… 8.45ms 9.01ms      109.   31.26KB     3.87   310    11      2.85s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl3, charac… 1.24ms  1.3ms      759.    3.66KB     8.40  2167    24      2.86s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl3, charac… 2.96ms 3.12ms      317.   32.99KB     3.78   922    11      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl3, charac… 2.93ms  3.1ms      318.   29.76KB     3.43   926    10      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>

test_tbl4 <- test_tbl[1:500,]

bench::mark(
  dup_check1(test_tbl4, character()),
  dup_check2(test_tbl4, character()),
  dup_check3(test_tbl4, character()),
  dup_check4(test_tbl4, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl4, char… 12.27ms 12.69ms      77.4    66.7KB     4.32   215    12      2.78s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl4, char…  2.98ms  3.13ms     318.        8KB     8.93   889    25       2.8s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl4, char…  2.98ms  3.16ms     312.     72.3KB     3.44   907    10      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl4, char…  2.96ms  3.12ms     314.     64.3KB     3.66   859    10      2.73s <lgl>  <Rprofmem> <bench_tm> <tibble>

test_tbl5 <- test_tbl[1:1000,]

bench::mark(
  dup_check1(test_tbl5, character()),
  dup_check2(test_tbl5, character()),
  dup_check3(test_tbl5, character()),
  dup_check4(test_tbl5, character()),
  min_time = 3
)
# # A tibble: 4 × 13
#   expression                      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr>                  <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 dup_check1(test_tbl5, char… 18.12ms  18.8ms      52.6   125.8KB     5.22   141    14      2.68s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 2 dup_check2(test_tbl5, char…  5.91ms  6.29ms     159.     15.9KB     8.82   432    24      2.72s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 3 dup_check3(test_tbl5, char…  3.04ms  3.26ms     303.    137.7KB     3.44   882    10      2.91s <lgl>  <Rprofmem> <bench_tm> <tibble>
# 4 dup_check4(test_tbl5, char…     3ms  3.22ms     305.    121.9KB     3.79   885    11       2.9s <lgl>  <Rprofmem> <bench_tm> <tibble>

@brookslogan
Copy link
Contributor Author

Thanks for checking on the memory usage! I did suspect 2 would be better on memory usage... dup_check4 seems to involve something around ~3x the original object size / ~6x the key size for the large objects; that's not very great.

One key copy is probably avoidable by working off the original number of rows and binding generalized-dplyr-lag. More from native code but compilation's a bad idea to introduce at this moment / maybe ever.

I'm not sure how much this needs optimized time or memory-wise. I think I noticed some long construction times so thought I would just throw this in. It still probably isn't fast enough to spam it in our hacky dplyr_extending implementation/approximation.

@brookslogan
Copy link
Contributor Author

brookslogan commented Jan 23, 2025

I'm considering trying to improve performance of duplicate key detection in as_epi_archive as well. A few notes from some initial exploration. anyDuplicated.data.table seems like it may have both speed close to dup_check4 and the low memory allocations of dup_check2; if we're already using the data.table method then there may not be an obvious way to try to improve performance.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
library(epiprocess)
#> Loading required package: epidatasets
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

x <- archive_cases_dv_subset
dup_check4 <- function(x) {
  if (nrow(x) <= 1L) {
    FALSE
  } else {
    ukey_names <- key(x)
    arranged_ukeys <- arrange(x[ukey_names], across(all_of(ukey_names)))
    any(vctrs::vec_equal(arranged_ukeys[-1L,], arranged_ukeys[-nrow(arranged_ukeys),]))
  }
}

DT_as_tbl <- as_tibble(archive_cases_dv_subset$DT)

benchmark <- bench::mark(
  as.logical(anyDuplicated(DT_as_tbl, by = key(DT))),
  as.logical(anyDuplicated(x$DT, by = key(x$DT))),
  dup_check4(x$DT)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

benchmark
#> # A tibble: 3 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 as.logical(anyDuplicated(DT_as_tbl…  1.31s  1.31s     0.762       2MB    12.2 
#> 2 as.logical(anyDuplicated(x$DT, by … 3.23ms 3.52ms   273.        1.5MB     7.96
#> 3 dup_check4(x$DT)                    3.02ms 3.18ms   293.       4.19MB     3.98

Created on 2025-01-23 with reprex v2.1.1

This might motivate us to consider a data.table-based approach in epi_df duplicated detection as well.

Second, some questions/issues with profiling. From ?bench::mark:

  memory: If 'TRUE' (the default when R is compiled with memory
          profiling), track memory allocations using
          'utils::Rprofmem()'. If 'FALSE' disable memory tracking.
  1. Is this potentially happening for us? It seems like we'd have to dig into native code backing methods + learn about what all there hits R heap to be able to determine this.
  2. Amount of memory allocated isn't as important as the max amount of memory held at any point in the process, since that max amount of memory used may persist, since in my experience R doesn't give memory back to the OS, and because this is what puts us in threat of running out of memory. --- So even if we tried to chunk up and evenly distribute the memory allocations (e.g., with each column and/or groups of rows processed separately; this might also allow for faster performance in the has-duplicate-keys case with early exits), mem_alloc may not be the thing to look for.

There's also vctrs::vec_duplicate_any() to try out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants