Remove R6 interface for `epi_archive`s #340

brookslogan · 2023-06-30T21:38:29Z

From discussion in #181 and off GitHub: in epi_archive we are currently paying the costs of maintaining a separate R6 interface and considering details of aliasing and mutation, but none of our operations actually benefit from the ability to use mutation, and we've eliminated or don't use some of the other benefits of R6 (keeping package namespace clean, print, cross-package inheritance details).

We should eliminate the R6 user interface. [All archive$<fn>() parts of the interface.]
We should refactor away all internals that rely on mutation to not do so. But unless we also move away from data.table (see below), we still need to be careful about documenting aliasing of DTs.
We should consider changing epi_archive and grouped_epi_archive to S3 classes, not R6.

Should grouped epi archives be of class c("grouped_epi_archive", "epi_archive") or just "grouped_epi_archive"? The former is more consistent with dplyr, but the latter is less error-prone.
- Let's go with just "grouped_epi_archive".
Is R6 providing better error signaling when users try to use functions we haven't thought of on epi_archives that might start "working" and doing wrong things as lists? (E.g., c? rbind?)
- This needs testing. If sensible-sounding S3 generics "work" (incorrectly) on the new S3-class list format, we should probably provide methods for them to throw appropriate errors until we actually provide real implementations. This also would mean there is still usefulness for listing actually-applicable methods like we do currently.
Consider technicalities with is_epi_archive etc. on saved archives from older versions. We don't have a format version attribute now, so we'd need to check if it's an R6 object for example. Maybe we should add an attr now indicating the "archive format version"?

Consider whether we should simultaneously try to move to tibbles + dplyr 1.1 joins, polars, or lazy_dt, so we don't have to mention aliasing at all, and to try to improve performance.

Initial testing indicates that dplyr 1.1 joins are significantly slower, and other alternatives may have comparable or greater baggage than data.table.
lazy_dt might be nice except it would mean simply loading epiprocess can break unrelated user code using dplyr + non-dtplyr data.tables.
Seems good to stay with data.table for now, but we will need to check that (i) none of our epix_ methods mutate DT without copying it first (or doing sneaky things aliasing columns), and (ii) that we warn the user that if they should do the same (minus the sneaky things part at least).

The text was updated successfully, but these errors were encountered:

brookslogan · 2023-06-30T21:39:28Z

This is blocking or slowing every additional feature we want to add to epi_archives.

dsweber2 · 2023-09-25T17:55:29Z

I would add this to the kanban, but I apparently don't have permissions here? Could someone add me as a writer? Better yet, the entire forecasting-maintainers list.

brookslogan · 2023-10-25T19:43:33Z

~~This is blocked by improvements to print and implementation of a filter method.~~ De-R6ification probably isn't actually strictly blocked by these, but if we can encapsulate or replace DT with something following the typical R memory model, then that might mean less documentation re-writing.

dshemetov · 2024-01-31T20:35:11Z

TODO:

recall if R6 removal and DT encapsulation are separable tasks (unclear if these were just put together by proximity or there was some detailed that coupled them)
break this issue into smaller actionable subtasks

brookslogan · 2024-02-01T00:10:26Z

Adding filter and encapsulating DT seem separable, at the cost of doing more documentation re-writes. Probably worth.

Attempt at task breakdown:

Potentially later:

Check that common R functions we might naturally think about calling on archives, but which we don't yet support (e.g., c and rbind at time of writing) don't silently break & return garbage (some garbage might be okay if it's not an archive-related class since that's more easily detectable, but it's still not ideal; but garbage with an archive class attached isn't). If they don't, add S3 methods to make sure that they loudly break or do the intended thing.
Add a list element or attr to describe what "version" of the epi_archive and grouped_epi_archive format these objects are using. Check that we're using a compatible version in every method.
Refactor away these non-exported internals. This can be done piece by piece, unlike the transition to the S3-classed list which needs to be simultaneous.
Consider either
- moving to dtplyr (preferably >= 1.3.0) lazy_dts instead of data.tables, or
- implementing a filter method and encapsulating the DT (but providing a way to [get at a copy of the DT data, or a reference to the DT with typical-R-idiomatic or well-documented memory models/warnings, etc. Originally I was hoping a destructure operation that "destroys" the archive as we "move" out the DT would work here, but not if we have other aliases to the DT (e.g., from an alias to the same archive being destructured), since those would need to be destroyed as well and that seems hard.]). [We should also have a convenient way to access the unique versions in an archive, so people can use a rule to pick out what versions they want to epix_slide over, especially when doing nested epix_slides for evaluating backfill-aware models.] [Would also be a good time to get rid of compactify default message spam.]

dshemetov · 2024-02-01T22:58:09Z

Could this work for lazy_dt: make dtplyr a dependency, but don't import anything from it, instead use dtplyr:: in places we expect lazy_dt?

brookslogan · 2024-02-03T04:44:53Z

No, unfortunately. Merely loading dtplyr via :: calls is enough to trigger dtplyr S3 stuff.

brookslogan · 2024-02-04T20:33:24Z

Oh wait! dtplyr 1.3.0 might have fixed this!

[Side note... I'm not sure dplyr on data.tables without dtplyr would actually be safe even if people do it. E.g., as_tibble breaks data.table's memory model, so other dplyr ops probably do this as well, and I wonder if e.g. [<- (or maybe even just +? seems less likely, copy elision seems harder to do here) within a dplyr mutate might cause R to look, see a refcount of 1 on the column (--- actually, no, this isn't sufficient; it'd also need to know refcount of the table & other "descendants" are also 1; the table might(?) be okay with mutation in some cases but copy-elision stuff isn't aware of that), and mutate in place. Or if you just say, as_tibble or mutate to add a column and get a tibble result, but later do a set on the data.table, if it'd skip refcount checking and simply mutate in-place & mutate the tibble result as well.

I think we have tests that fail by loading older dtplyrs, so this probably means we are doing something risky as well. Hopefully that's only in test code and not in the package code...]

dshemetov · 2024-02-06T00:35:58Z

Looking for places where mutable aliasing might occur, I searched the repo for uses of :=. Thankfully I found only a few:

in archive.R::fill_through_version, used on nonversion_key_vals_ever_recorded, which is derived from self$DT but it is explicitly copied on the line before if it shares a pointer with self$DT, so this usage looks safe
- side-note: so unique can still keep pointers to the same DT object? I suppose this is possible if the array does not change, meaning it is unique the whole time?
in slide.R::epi_slide::slide_one_group, used on .data_group with !!new_col := slide_values. .data_group is passed from group_modify, I assume a data.table object, which is possibly aliased to a DT
the rest were either in tests or in a vignette, constructing examples

brookslogan · 2024-02-06T10:17:48Z

slide.R is using dplyr/tidyverse's :=, not data.table's, and is not archive related, so that should be good.

data.table::set* functions also mutate, any of those?

Also, certain as.data.table or similar calls might(?) think they can take ownership, based on attrs set beforehand and/or explicit parameters [and if they think they need to sort might do it in place, mutating]. If we do use these though it would probably have to be knowingly and with a good amount of thought already into them already; I don't think data.table would do this without user effort involved.

dshemetov · 2024-02-14T01:35:35Z

Thanks. Going through the "See Also" part of the set* docs I found these functions and searched for them all (on this commit fd2339d):

setDF
- in grouped_epi_archive.R in the slide function line 331: seems safe; the object is already copied above on line 330, so this is just for easier operation with dplyr
- in grouped_epi_archive.R in the slide function line 365: seems safe; the object inside setDF has no other referents
setDT
- in grouped_epi_archive.R in the slide function line 342: seems unsafe; the by reference DT is fed into comp_one_group where f is user-supplied and could potentially modify the original object
- in methods-epi_archive.R in the epix_detailed_restricted_mutate function line 514: this seems safe; setDT is used on a temporary tibble object that's constructed earlier in the function, so there's no potential for downstream aliasing
setkey
- unused
setkeyv
- in archive.R in the initialize function line 347: seems safe, the object being modified is copied in as.data.table in the line prior
- in archive.R in the fill_through_version function line 576: seems safe, the object modified is an rbind of two data.tables with no other aliases
- in methods-epi_archive.R in epix_merge function lines 260, 261: in practice safe, the objects x_DT and y_DT are aliases of x$DT, y$DT in the case that sync="forbid", which have aliases outside the function, however this setkeyv just ensures that x_DT has the same keys as x$DT, which should be a noop for x$DT
- in methods-epi_archive.R in epix_merge function line 370: safe, since the object being modified is result_DT and has no other aliases
set
- in methods-epi_archive.R in epix_merge function lines 339, 358: safe, since the object being modified is result_DT and has no other aliases
setorder
- unused
setorderv
- unused
setattr
- in grouped_epi_archive.R in the slide function line 331: seems unsafe, see the line 342 comment above for setDT
- in methods-epi_archive.R in the epix_detailed_restricted_mutate function line 513: seems safe, see the line 514 comment above for setDT
setnames
- unused
setcolorder
- unused
setnafill
- one use in archive.Rmd that seems safe, not looking too deeply here
unique (undocumented aliasing edge cases here)
- in archive.R in epi_archive function line 511: seems safe, the object being operated on is a DT subset with the i component set, which means the table is copied and therefore has no other aliases
- in archive.R in epi_archive function line 561: seems safe, there is a memory address check below that copies if needed
- in slide.R in epi_slide function line 174: seems safe, extracting a column from a DT and assigning it produces a (CoM) copy
as_tibble
- in archive.R in the epi_archive function line 515: seems safe, the tibble object is copied from a DT with no other references
- in grouped_epi_archive.R in the slide function line 365: seems safe, the tibble object is copied from a DT created inside the same expression, so no other references
- a few other references around, might look another time

The following functions appear to be safe, even in the case of noop: subset, as.data.table (might need to check more about these extra attrs / parameters you mention, not sure which ones matter here).

brookslogan · 2024-02-14T02:03:11Z

setDT

in grouped_epi_archive.R in the slide function line 342: seems unsafe; the by reference DT is fed into comp_one_group where f is user-supplied and could potentially modify the original object

I might not be following the flow of operations correctly here, but I think this is probably safe. Let's say comp_one_grp + f mutates .data_group. When is .data_group possibly used after that? Only(?) by f storing away an alias to part/all of it, or by our subsequent sliding code, or by other code if .data_group aliases (parts of) an input. So let's say 1st is okay, and that 2nd ~~can read but not modify .data_group~~ can't be used so it won't mess up 1st and 1st / earlier modifications by f won't mess it up. For 2nd we just don't use it again. And 3rd should be avoided by L318 (and probably even earlier by group_modify except when everything's in one group)... assuming we haven't violated data.table's memory model and aliased columns without aliasing the containing DT. ----- okay, this reasoning is very ad-hoc and I should probably actually try to map things more to some sort of Rust equivalent to stop hand-waving.

brookslogan added op-semantics Operational semantics; many potentially breaking changes here performance blocking-something-else P1 medium priority labels Jun 30, 2023

This was referenced Jun 30, 2023

Revisit #67: are we happy with current reference semantics scheme #181

Closed

Unclutter epi_archive's print method + add preview of DT #341

Closed

dsweber2 mentioned this issue Sep 25, 2023

epi_archive to pure S3 cmu-delphi/epipredict#235

Closed

brookslogan added the blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete label Oct 25, 2023

nmdefries added this to the [CSTE] Improve scaling of epiprocess milestone Oct 25, 2023

dshemetov assigned dshemetov and nmdefries Feb 2, 2024

brookslogan mentioned this issue Feb 4, 2024

Fix failing tests/behavior when dtplyr < 1.3 is loaded, any mem violations if not #267

Open

This was referenced Mar 18, 2024

feat+wip: refactor epi_archive to use S3 #430

Closed

wip feat: replace R6 epi_archive with S3 implementation #431

Merged

brookslogan closed this as completed in #431 May 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove R6 interface for `epi_archive`s #340

Remove R6 interface for `epi_archive`s #340

brookslogan commented Jun 30, 2023 •

edited

Loading

brookslogan commented Jun 30, 2023

dsweber2 commented Sep 25, 2023

brookslogan commented Oct 25, 2023 •

edited

Loading

dshemetov commented Jan 31, 2024 •

edited

Loading

brookslogan commented Feb 1, 2024 •

edited by dshemetov

Loading

dshemetov commented Feb 1, 2024

brookslogan commented Feb 3, 2024 via email •

edited

Loading

brookslogan commented Feb 4, 2024 •

edited

Loading

dshemetov commented Feb 6, 2024 •

edited

Loading

brookslogan commented Feb 6, 2024 •

edited

Loading

dshemetov commented Feb 14, 2024

brookslogan commented Feb 14, 2024 •

edited

Loading

Remove R6 interface for epi_archives #340

Remove R6 interface for epi_archives #340

Comments

brookslogan commented Jun 30, 2023 • edited Loading

brookslogan commented Jun 30, 2023

dsweber2 commented Sep 25, 2023

brookslogan commented Oct 25, 2023 • edited Loading

dshemetov commented Jan 31, 2024 • edited Loading

brookslogan commented Feb 1, 2024 • edited by dshemetov Loading

dshemetov commented Feb 1, 2024

brookslogan commented Feb 3, 2024 via email • edited Loading

brookslogan commented Feb 4, 2024 • edited Loading

dshemetov commented Feb 6, 2024 • edited Loading

brookslogan commented Feb 6, 2024 • edited Loading

dshemetov commented Feb 14, 2024

brookslogan commented Feb 14, 2024 • edited Loading

Remove R6 interface for `epi_archive`s #340

Remove R6 interface for `epi_archive`s #340

brookslogan commented Jun 30, 2023 •

edited

Loading

brookslogan commented Oct 25, 2023 •

edited

Loading

dshemetov commented Jan 31, 2024 •

edited

Loading

brookslogan commented Feb 1, 2024 •

edited by dshemetov

Loading

brookslogan commented Feb 3, 2024 via email •

edited

Loading

brookslogan commented Feb 4, 2024 •

edited

Loading

dshemetov commented Feb 6, 2024 •

edited

Loading

brookslogan commented Feb 6, 2024 •

edited

Loading

brookslogan commented Feb 14, 2024 •

edited

Loading