Skip to content

Remove R6 interface for epi_archives #340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
brookslogan opened this issue Jun 30, 2023 · 12 comments · Fixed by #431
Closed
2 of 3 tasks

Remove R6 interface for epi_archives #340

brookslogan opened this issue Jun 30, 2023 · 12 comments · Fixed by #431
Assignees
Labels
blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete blocking-something-else op-semantics Operational semantics; many potentially breaking changes here P1 medium priority performance

Comments

@brookslogan
Copy link
Contributor

brookslogan commented Jun 30, 2023

From discussion in #181 and off GitHub: in epi_archive we are currently paying the costs of maintaining a separate R6 interface and considering details of aliasing and mutation, but none of our operations actually benefit from the ability to use mutation, and we've eliminated or don't use some of the other benefits of R6 (keeping package namespace clean, print, cross-package inheritance details).

  1. We should eliminate the R6 user interface. [All archive$<fn>() parts of the interface.]
  2. We should refactor away all internals that rely on mutation to not do so. But unless we also move away from data.table (see below), we still need to be careful about documenting aliasing of DTs.
  3. We should consider changing epi_archive and grouped_epi_archive to S3 classes, not R6.
  • Should grouped epi archives be of class c("grouped_epi_archive", "epi_archive") or just "grouped_epi_archive"? The former is more consistent with dplyr, but the latter is less error-prone.
    • Let's go with just "grouped_epi_archive".
  • Is R6 providing better error signaling when users try to use functions we haven't thought of on epi_archives that might start "working" and doing wrong things as lists? (E.g., c? rbind?)
    • This needs testing. If sensible-sounding S3 generics "work" (incorrectly) on the new S3-class list format, we should probably provide methods for them to throw appropriate errors until we actually provide real implementations. This also would mean there is still usefulness for listing actually-applicable methods like we do currently.
  • Consider technicalities with is_epi_archive etc. on saved archives from older versions. We don't have a format version attribute now, so we'd need to check if it's an R6 object for example. Maybe we should add an attr now indicating the "archive format version"?
  1. Consider whether we should simultaneously try to move to tibbles + dplyr 1.1 joins, polars, or lazy_dt, so we don't have to mention aliasing at all, and to try to improve performance.
  • Initial testing indicates that dplyr 1.1 joins are significantly slower, and other alternatives may have comparable or greater baggage than data.table.
  • lazy_dt might be nice except it would mean simply loading epiprocess can break unrelated user code using dplyr + non-dtplyr data.tables.
  • Seems good to stay with data.table for now, but we will need to check that (i) none of our epix_ methods mutate DT without copying it first (or doing sneaky things aliasing columns), and (ii) that we warn the user that if they should do the same (minus the sneaky things part at least).
@brookslogan brookslogan added op-semantics Operational semantics; many potentially breaking changes here performance blocking-something-else P1 medium priority labels Jun 30, 2023
@brookslogan
Copy link
Contributor Author

This is blocking or slowing every additional feature we want to add to epi_archives.

@dsweber2
Copy link
Contributor

I would add this to the kanban, but I apparently don't have permissions here? Could someone add me as a writer? Better yet, the entire forecasting-maintainers list.

@brookslogan brookslogan added the blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete label Oct 25, 2023
@brookslogan
Copy link
Contributor Author

brookslogan commented Oct 25, 2023

This is blocked by improvements to print and implementation of a filter method. De-R6ification probably isn't actually strictly blocked by these, but if we can encapsulate or replace DT with something following the typical R memory model, then that might mean less documentation re-writing.

@dshemetov
Copy link
Contributor

dshemetov commented Jan 31, 2024

TODO:

  • recall if R6 removal and DT encapsulation are separable tasks (unclear if these were just put together by proximity or there was some detailed that coupled them)
  • break this issue into smaller actionable subtasks

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 1, 2024

Adding filter and encapsulating DT seem separable, at the cost of doing more documentation re-writes. Probably worth.

Attempt at task breakdown:

  • Verify that S3 methods and any non-S3 epix_* functions cover all functionality for epi_archives and grouped_epi_archives, and that they do not involve mutating something that could alias something else (without copying first).
  • Turn epi_archive and grouped_epi_archive into S3-classed lists (class vector "epi_archive" and "grouped_epi_archive", respectively).
    • Make as_epi_archive construct something in this format. The elements of the (named) list should match the fields from the R6 version.
    • Turn all the R6 implementations into non-exported functions (with some special prefix or suffix to identify them).
      • When converting, add another parameter to each function for the (first) input archive (say, archive or x), refactor away self$ and private$ to instead access e.g. archive$fieldname or call fill_through_version_internals(archive, <other args>). [There's no need to create an internals function if the existing R6 method just calls an S3 or epix_* and isn't called by anything else.]
      • Get rid of all the clone-ing of the R6 object (but not the copying of the DT; that's still required).
    • Update any remaining calls to R6 methods to reference above non-exported function, e.g., within S3 methods and epix_* functions. (Not just within R/ source code, but also tests/, and, below, roxygen docs and vignettes.)
    • Get rid of any remaining clone-ing (but not copying) within S3 methods and epix_ functions.
    • Get rid of any remaining calls to R6 built-in stuff: e.g., print.epi_archive will not be able to use epi_archive$public_methods; it'd be nice to instead list all the applicable S3 methods and any applicable non-S3 epix_* functions.
    • Update roxygen.
      • Remove references to archives being R6 objects.
      • Remove uses of R6 interface.
      • Remove references to aliasing and mutation of the archive.
      • Tell user to never change any parts of the DT without copying it first. Or something more precise.
    • Update vignettes.
      • Remove references to archives being R6 objects.
      • Remove uses of R6 interface.
      • Remove references to aliasing and mutation of the archive.
      • If there's something talking about working with the DT, tell user to never change any parts of the DT without copying it first. Or something more precise.
      • Check that rendered vignettes otherwise look the same.

Potentially later:

  • Check that common R functions we might naturally think about calling on archives, but which we don't yet support (e.g., c and rbind at time of writing) don't silently break & return garbage (some garbage might be okay if it's not an archive-related class since that's more easily detectable, but it's still not ideal; but garbage with an archive class attached isn't). If they don't, add S3 methods to make sure that they loudly break or do the intended thing.
  • Add a list element or attr to describe what "version" of the epi_archive and grouped_epi_archive format these objects are using. Check that we're using a compatible version in every method.
  • Refactor away these non-exported internals. This can be done piece by piece, unlike the transition to the S3-classed list which needs to be simultaneous.
  • Consider either
    • moving to dtplyr (preferably >= 1.3.0) lazy_dts instead of data.tables, or
    • implementing a filter method and encapsulating the DT (but providing a way to [get at a copy of the DT data, or a reference to the DT with typical-R-idiomatic or well-documented memory models/warnings, etc. Originally I was hoping a destructure operation that "destroys" the archive as we "move" out the DT would work here, but not if we have other aliases to the DT (e.g., from an alias to the same archive being destructured), since those would need to be destroyed as well and that seems hard.]). [We should also have a convenient way to access the unique versions in an archive, so people can use a rule to pick out what versions they want to epix_slide over, especially when doing nested epix_slides for evaluating backfill-aware models.] [Would also be a good time to get rid of compactify default message spam.]

@dshemetov
Copy link
Contributor

Could this work for lazy_dt: make dtplyr a dependency, but don't import anything from it, instead use dtplyr:: in places we expect lazy_dt?

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 3, 2024 via email

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 4, 2024

Oh wait! dtplyr 1.3.0 might have fixed this!

[Side note... I'm not sure dplyr on data.tables without dtplyr would actually be safe even if people do it. E.g., as_tibble breaks data.table's memory model, so other dplyr ops probably do this as well, and I wonder if e.g. [<- (or maybe even just +? seems less likely, copy elision seems harder to do here) within a dplyr mutate might cause R to look, see a refcount of 1 on the column (--- actually, no, this isn't sufficient; it'd also need to know refcount of the table & other "descendants" are also 1; the table might(?) be okay with mutation in some cases but copy-elision stuff isn't aware of that), and mutate in place. Or if you just say, as_tibble or mutate to add a column and get a tibble result, but later do a set on the data.table, if it'd skip refcount checking and simply mutate in-place & mutate the tibble result as well.

I think we have tests that fail by loading older dtplyrs, so this probably means we are doing something risky as well. Hopefully that's only in test code and not in the package code...]

@dshemetov
Copy link
Contributor

dshemetov commented Feb 6, 2024

Looking for places where mutable aliasing might occur, I searched the repo for uses of :=. Thankfully I found only a few:

  • in archive.R::fill_through_version, used on nonversion_key_vals_ever_recorded, which is derived from self$DT but it is explicitly copied on the line before if it shares a pointer with self$DT, so this usage looks safe
    • side-note: so unique can still keep pointers to the same DT object? I suppose this is possible if the array does not change, meaning it is unique the whole time?
  • in slide.R::epi_slide::slide_one_group, used on .data_group with !!new_col := slide_values. .data_group is passed from group_modify, I assume a data.table object, which is possibly aliased to a DT
  • the rest were either in tests or in a vignette, constructing examples

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 6, 2024

slide.R is using dplyr/tidyverse's :=, not data.table's, and is not archive related, so that should be good.

data.table::set* functions also mutate, any of those?

Also, certain as.data.table or similar calls might(?) think they can take ownership, based on attrs set beforehand and/or explicit parameters [and if they think they need to sort might do it in place, mutating]. If we do use these though it would probably have to be knowingly and with a good amount of thought already into them already; I don't think data.table would do this without user effort involved.

@dshemetov
Copy link
Contributor

Thanks. Going through the "See Also" part of the set* docs I found these functions and searched for them all (on this commit fd2339d):

  • setDF
    • in grouped_epi_archive.R in the slide function line 331: seems safe; the object is already copied above on line 330, so this is just for easier operation with dplyr
    • in grouped_epi_archive.R in the slide function line 365: seems safe; the object inside setDF has no other referents
  • setDT
    • in grouped_epi_archive.R in the slide function line 342: seems unsafe; the by reference DT is fed into comp_one_group where f is user-supplied and could potentially modify the original object
    • in methods-epi_archive.R in the epix_detailed_restricted_mutate function line 514: this seems safe; setDT is used on a temporary tibble object that's constructed earlier in the function, so there's no potential for downstream aliasing
  • setkey
    • unused
  • setkeyv
    • in archive.R in the initialize function line 347: seems safe, the object being modified is copied in as.data.table in the line prior
    • in archive.R in the fill_through_version function line 576: seems safe, the object modified is an rbind of two data.tables with no other aliases
    • in methods-epi_archive.R in epix_merge function lines 260, 261: in practice safe, the objects x_DT and y_DT are aliases of x$DT, y$DT in the case that sync="forbid", which have aliases outside the function, however this setkeyv just ensures that x_DT has the same keys as x$DT, which should be a noop for x$DT
    • in methods-epi_archive.R in epix_merge function line 370: safe, since the object being modified is result_DT and has no other aliases
  • set
    • in methods-epi_archive.R in epix_merge function lines 339, 358: safe, since the object being modified is result_DT and has no other aliases
  • setorder
    • unused
  • setorderv
    • unused
  • setattr
    • in grouped_epi_archive.R in the slide function line 331: seems unsafe, see the line 342 comment above for setDT
    • in methods-epi_archive.R in the epix_detailed_restricted_mutate function line 513: seems safe, see the line 514 comment above for setDT
  • setnames
    • unused
  • setcolorder
    • unused
  • setnafill
    • one use in archive.Rmd that seems safe, not looking too deeply here
  • unique (undocumented aliasing edge cases here)
    • in archive.R in epi_archive function line 511: seems safe, the object being operated on is a DT subset with the i component set, which means the table is copied and therefore has no other aliases
    • in archive.R in epi_archive function line 561: seems safe, there is a memory address check below that copies if needed
    • in slide.R in epi_slide function line 174: seems safe, extracting a column from a DT and assigning it produces a (CoM) copy
  • as_tibble
    • in archive.R in the epi_archive function line 515: seems safe, the tibble object is copied from a DT with no other references
    • in grouped_epi_archive.R in the slide function line 365: seems safe, the tibble object is copied from a DT created inside the same expression, so no other references
    • a few other references around, might look another time

The following functions appear to be safe, even in the case of noop: subset, as.data.table (might need to check more about these extra attrs / parameters you mention, not sure which ones matter here).

@brookslogan
Copy link
Contributor Author

brookslogan commented Feb 14, 2024

  • setDT
    • in grouped_epi_archive.R in the slide function line 342: seems unsafe; the by reference DT is fed into comp_one_group where f is user-supplied and could potentially modify the original object

I might not be following the flow of operations correctly here, but I think this is probably safe. Let's say comp_one_grp + f mutates .data_group. When is .data_group possibly used after that? Only(?) by f storing away an alias to part/all of it, or by our subsequent sliding code, or by other code if .data_group aliases (parts of) an input. So let's say 1st is okay, and that 2nd can read but not modify .data_group can't be used so it won't mess up 1st and 1st / earlier modifications by f won't mess it up. For 2nd we just don't use it again. And 3rd should be avoided by L318 (and probably even earlier by group_modify except when everything's in one group)... assuming we haven't violated data.table's memory model and aliased columns without aliasing the containing DT. ----- okay, this reasoning is very ad-hoc and I should probably actually try to map things more to some sort of Rust equivalent to stop hand-waving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete blocking-something-else op-semantics Operational semantics; many potentially breaking changes here P1 medium priority performance
Projects
None yet
4 participants