-
Notifications
You must be signed in to change notification settings - Fork 8
Distinguish snapshot of signal vs. archive of signal, and as-of vs. issue #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This is an awesome suggestion Logan. I'm not tagging you simply because I'm hoping you'll stop reading GitHub this week and actually take some time off 🙂! I love this idea in general, and after mulled over it for a couple of days, I'm ready to go with it. In the spirit of moving quickly, I'm thinking about just refactoring A better and more efficient implementation of |
- As per the discussion on #8, from now on, an `epi_signal` object represents a single snapshot of a data set - Modify `as.epi_signal()` to reflect this change - Update documentation and vignettes accordingly
Regarding implementation:
For the main implementation I could pursue something along the lines of one of the "Benchmark" code is somewhere in here:
|
Excellent, thank you @brookslogan very useful! How do you feel about me just defining an This will definitely not be the most efficient, but then you can redesign using |
This seems fine, as long as it has the same semantics for the snapshot-extraction function. I think the Diff-based archive contents for a covidcast signal could be returned in the new clients by just issuing a huge query for all dates and all issues from say Jan 1, 1234 to Jan 1, 3456. Or in the old client extract from the below:
|
Just a heads up that I'm going to close this with #14. @brookslogan Can you please create a new issue about:
As we discussed, for this next round, we should just get this all implemented and not worry about efficiency. Then once this is done, we can rewrite the backend for efficiency. So that can also be a second new issue (which could only be completed after the first one), and for that, you can copy-paste relevant parts of your comments here. Thanks! |
Added final newline to epi_archive print statement.
The current
epi_signal
implementation containsissue
as a column, allowing it to contain the full revision history of a signal, but functions usingepi_signal
appear to work only with the latest issue, and expects that issue to constitute a full snapshot of the data as of that issue, rather than, e.g., just the new & updated values. It makes sense to separate out the concept of such a snapshot from the full revision history, giving these concepts different names, classes, functions, etc., and requiring the user to explicitly convert between them (for conceptual clarity and code readability).epi_signal
should add the metadata fieldas-of-issue
oras-of
or something along those lines. Maybe it should be renamed toepi_snapshot
? Theissue
column should be removed or renamed; some users might want to characterize how mature an observation is; some might do that withas_of_issue - time_value
, and others might want to uselatest_issue_that_updated_this_row - time_value
.epi_archive
(orepi_signal_archive
?) should be added. It would have a structure similar to the structure of the currentepi_signal
table, but might benefit from being based on adata.table
orR6
class wrapping adata.table
, and may also need or benefit from some extra metadata.epi.archive$snapshot_as_of(as.of.issue)
, which would return anepi_signal
or an error if theas.of.issue
isn't in the range covered by the archive. LOCF could be used to fill in the gaps between issues but not after the latest issue. For the LOCF, something like one of the below:unique(diff.DT[issue <= as.of.issue], by=prefixed.nonissue.key.names, fromLast=TRUE)[status == "new.or.updated"]
diff.DT[setkeyv(set(unique(diff.DT, by=prefixed.nonissue.key.names)[,prefixed.nonissue.key.names],,"issue",as.of.issue), c(prefixed.nonissue.key.names,"issue")), roll=TRUE][status == "new.or.updated"]
c(paste0("diff.", names(epi.signal)), "issue", "status")
, wherestatus
is either"new.or.updated"
or"removed"
. It could be made more complex to allow for more error checking or to provide extra info to users interested in revision modeling.max.issue
value, is required if the archive covers all updates through, e.g., Sep 15, but data did not change from Sep 2 to Sep 15, in order to not complain when the user requests Sep 14 snapshot but to complain when they request Sep 16 snapshot, and in order to complain when the user tries to "add" a Sep 15 snapshot a second time by accident. It is also necessary to enable a "reproducible" option; this option would assume that the latest recorded issue could be subject to revisions (e.g., because the data for an issue date was revised in the middle of that date, and it's either currently that date & before the change was made, or any time/day after the change was made & before the user found out about it, due to pipeline delays and data fetching frequencies). It is also necessary if we want to allow a user to revise the archive by adding a new snapshot in the middle of the currently recorded issue range.unique(diff.DT, by=prefixed.nonissue.key.names)[,prefixed.nonissue.key.names]
, to improve speed of the second query approach above. But users would also benefit from a function that returns this data, so that they know what geos× were ever included in a data set.as_of_issue
when inserting snapshots, as that would make the diffs very inefficient.epi_signal
s should continue to work onepi_signal
s; instead of working on the latest "issue" by default, they will require the user to provide just a single snapshot of the data.Secondly, we should disambiguate "issue". E.g., at least for
epi_signal
, it should be called "as.of.issue" or "as.of"; the archive could still use just "issue" or maybe it should also have a more specific term.The text was updated successfully, but these errors were encountered: