Should epi_signal inherit from tsibble? #7


Closed
ryantibs opened this issue Oct 15, 2021 · 12 comments

@ryantibs
Member

The tsibble package, as @jacobbien mentioned, looks like it has a structure that would be helpful for epi_signal to inherit from, rather than a plain tibble. A tsibble has the following structure:

  • key: for us this would be geo_value, I think;
  • index: for us this would be time_value, I think;
  • any other columns as measured variables: for us, we need at least value, here.

Generally, we should also check out the tidyverts (and yes, that's not a typo; click on the link). It would be nice to see if anything there just takes care of things we might need (or might have already implemented) in epitools, and soon, in epipred.

@jacobbien @dajmcdon @capnrefsmmat @brookslogan Looking through all of this would be a big job, but do any of you guys just want to each take a look at some part, and report back what you think here, on this issue?

@dajmcdon
Contributor

One potential advantage to {tsibble} is the fill_gaps() function that turns implicit missing dates in the time series into explicit NAs.
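For example (a minimal sketch; the geo_value/time_value names are just this package's convention, not anything tsibble requires):

library(tsibble)

x = tsibble(
  geo_value = "ca",
  time_value = as.Date(c("2021-01-01", "2021-01-02", "2021-01-04")),
  cases = c(1, 2, 4),
  key = geo_value, index = time_value
)
fill_gaps(x)  # adds an explicit row for 2021-01-03, with cases = NA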

@brookslogan
Contributor

brookslogan commented Oct 17, 2021

If we don't want to use tsibble, these helper functions are useful for calculating rollapply on an ordered & gap-filled version of a time series and then returning to the original row order and number:

#' Transmute time series data in unordered df with gaps as if ordered&regular
#'
#' Columns are defined as if the time series data were ordered and regular, but
#' the result preserves the row number and order of the original data frame.
#'
#' @details Within each group of the df (or the entire df if ungrouped),
#'   \link[dplyr:arrange]{arranges} by time, \link[tidyr:complete]{completes} it
#'   to make the time a regular sequence, performs the
#'   \link[dplyr:transmute]{transmutation} (preserving any grouping columns and
#'   tacking on the time column by default), then \link[dplyr:filter]{filters}
#'   and rearranges to match the original row number and order of the original
#'   data frame.
#'
#' @param .data a data frame (including grouped tibbles)
#' @param .time_col_selector \code{\link[dplyr:pull]{dplyr::pull}} argument to
#'   select the time column
#' @param .period period/stride/spacing of time values to use when completing
#'   the time column
#' @param ... \code{\link[dplyr:transmute]{dplyr::transmute}} arguments, which
#'   can assume that they are working on an arranged, completed version of each
#'   group of the data frame
#' @return a data frame with any grouping columns, the time column (by default),
#'   and the requested columns from `...`, with grouping and time columns in the
#'   original order
#'
#' @importFrom pipeR %>>%
#' @importFrom dplyr group_modify pull mutate transmute filter arrange across everything
#' @importFrom tidyr complete full_seq
#' @importFrom tibble tibble
#' @importFrom rlang enquos !!!
#' @importFrom assertthat assert_that
transmute_scattered_ts = function(.data, .time_col_selector, .period, ...) {
  main.dots = enquos(...) # [note from future: not sure why I'm not just forwarding the dots; that should work fine if we don't need to inspect them]
  .data %>>%
    group_modify(function(group.data, ...) {
      time.values = pull(group.data, {{.time_col_selector}})
      tibble(
        original_row_index = seq_along(time.values),
        t = time.values,
        data_wrapper = group.data
      ) %>>%
        {
          full.seq.t = full_seq(.[["t"]], .period)
          intermediate.result = complete(., t = full.seq.t)
          assert_that(nrow(intermediate.result) == length(full.seq.t),
                      msg = 'Time series data appears to contain duplicate time values')
          intermediate.result
        } %>>%
        mutate(data_wrapper = transmute(data_wrapper, !!!main.dots, {{.time_col_selector}})) %>>%
        filter(!is.na(original_row_index)) %>>%
        arrange(original_row_index) %>>%
        pull(data_wrapper)
    })
}
## XXX improve speed: maybe try a less group-wise approach, inventing new name
## in the original for original_row_index, and/or special-casing when the data
## frame is already sorted overall and/or within groups.
##
## todo decide on styling given other naming conventions involved

#' Mutate time series data in unordered df with gaps as if ordered&regular
#'
#' Works like \code{\link{transmute_scattered_ts}}, but preserving all the
#' original columns by default
mutate_scattered_ts = function(.data, .time_col_selector, .period, ...) {
  transmute_scattered_ts(.data, {{.time_col_selector}}, across(everything()), ..., .period=.period)
}
## todo use template or shared topic to document args here
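A usage sketch (hypothetical: assumes daily data in a geo_value/time_value/value format, and uses zoo::rollmeanr() as the rolling computation):

library(dplyr)

dat %>%
  group_by(geo_value) %>%
  mutate_scattered_ts(time_value, .period = 1L,
                      mean7 = zoo::rollmeanr(value, 7, fill = NA))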

@ryantibs
Member Author

So as I alluded to over on #12, #10, and #8, I'm thinking about making a big redesign on a branch to address all of those issues, as well as the current one: moving over to the tsibble format. I was reading a bit more about the tsibble format, and it appears to offer some advantages beyond those listed already, for example:

  • wide variety of built-in formats for the index column (time column), well beyond the two we already support in this package (day and week)
  • bunch of handy helper/processing functions associated with this, like tsibble::index_by() for aggregation

As suggested on the other issues, it seems like the best way forward is to stick with the tsibble notation exactly. So there would be:

  • one column designated as index (for the generalization of time variable)
  • zero or more columns designated as key (for the generalization of the geo variable)
  • any number of columns which count as measured variables

The column names can be anything and the designation as index and key happens at the level of metadata (handled by tsibble).
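For example (a minimal sketch, assuming a data frame dat with columns named time_value and geo_value):

library(tsibble)

x = as_tsibble(dat, index = time_value, key = geo_value)
index_var(x)  # "time_value" -- the index designation lives in the metadata
key_vars(x)   # "geo_value"  -- likewise for the key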

So now I'm left wondering: what, if any, is the purpose of defining an epi_signal class? It seems to fit exactly into the tsibble format already, with very few differences: e.g., it may just carry some additional metadata (for example, a metadata field specifying the issue date corresponding to the data set).

Somewhat relatedly, though really not the same question, I'm wondering how important it is to treat geographies explicitly anymore. Is geo value just one type of key, such that we don't even attempt to distinguish it from an arbitrary key? Or should we be providing built-in utilities for handling standard geo types (like aggregation or disaggregation between them)?

I want to spend a bit of time thinking here before doing anything. Just treating everything as a tsibble and defining no special class structure for this package would be remarkably simple, but it would also somehow feel vacuous and trivial: we'd be providing basically no standardization of data formats at all.

@elray1 @dajmcdon @jacobbien Your thoughts? (Again I'm sure it would be really useful to hear from Logan but not tagging him so that he can enjoy a couple days off!)

ryantibs mentioned this issue Oct 21, 2021
@jacobbien
Contributor

Yeah, I agree with @ryantibs that tsibble appears to do everything that's needed. I sort of like the idea of creating a class, perhaps called tsibble_us (or epi_signal_us?), that inherits from tsibble and enforces the key to be one of national, state, hrr, county, msa. Let me elaborate on what Ryan mentions above about facilitating geo-based aggregation.

At first it might seem this isn't really necessary. For example, consider this example from the tsibble documentation showing how index_by() + summarize() can be used to aggregate to a coarser index:

tourism %>%
  index_by(Year = ~ year(.)) %>%
  group_by(Region, State) %>%
  summarise(Total = sum(Trips))

If one wanted to aggregate to the Year-Region level instead, couldn't one just remove State from the group_by()? Well, there's a problem here: namely, what if some states are missing? The advantage of tsibble_us is that it would know if some states are missing (and likewise, if you try to aggregate counties to the state level, it would know if there are missing counties).

This is directly analogous to how tsibble worries a lot (so that the user doesn't have to!) about missing time values. In Handle implicit missingness with tsibble, Earo Wang describes four functions that have to do with handling missing time values: has_gaps, scan_gaps, count_gaps, and fill_gaps. Our class tsibble_us could do the same for missing locations. Imagine four corresponding functions for missing locations (let's call a missing location a "hole" for now for lack of a better term):

  • has_holes - are there missing locations? (E.g., if the key is at the county level, are all US counties there?) One could also have an argument for specifying a limited scope: e.g., if the scope is defined as CA, then %>% has_holes(scope = state("CA")) would only check whether there were missing CA counties in the data object.
  • scan_holes - what are the missing counties?
  • count_holes - how many missing counties are there?
  • fill_holes - this function could at minimum create NA time series for the missing counties. This means that if we aggregate from the county to state level, it will be apparent that CA was missing some counties. One could also imagine cases where filling in 0s makes sense. In fact, tsibble_us could even offer imputation based on the geo hierarchy (e.g., fill in with the average of all counties within the state) or based on spatial proximity. (See the sketch after this list.)
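A hypothetical sketch of the first and last of these (none of this exists yet; it assumes a reference vector us_county_fips of all county FIPS codes and our geo_value/time_value column conventions):

has_holes = function(x, reference = us_county_fips) {
  !all(reference %in% unique(x$geo_value))
}

fill_holes = function(x, reference = us_county_fips) {
  missing_geos = setdiff(reference, unique(x$geo_value))
  holes = tidyr::expand_grid(geo_value = missing_geos,
                             time_value = unique(x$time_value))
  dplyr::bind_rows(x, holes)  # measured columns become NA in the new rows
}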

tsibble_us could also have population size information for each geo value, so that weighted averages could be easy to do in aggregation.

An example: suppose dat is a tsibble_us object with daily county-level cases (three columns: Date, the index; County, the geo-key; and Cases, a column with the incidence rate per 100k). We could aggregate to state-level epiweek data with something like the following:

dat %>%
  fill_gaps() %>% 
  fill_holes() %>%
  index_by(Epiweek = ~ epiweek(.)) %>%
  geokey_by(State = ~ state(.)) %>%
  summarise(Aggregated_cases = population_weighted_mean(Cases))

Here geokey_by, state and population_weighted_mean would be tsibble_us functions that are based on the information contained in the package about the us geography.
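For instance, population_weighted_mean might look something like this minimal sketch (one possible signature; county_population() is a made-up lookup that would ship with the package):

population_weighted_mean = function(value, geo) {
  pop = county_population(geo)  # hypothetical population lookup
  sum(value * pop, na.rm = TRUE) / sum(pop[!is.na(value)])
}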

Obviously, instead of tsibble_us, one could write a more general tsibble_geo and then there could be location specific classes that inherit from it, like tsibble_us, tsibble_europe, tsibble_world, etc.

@elray1
Collaborator

elray1 commented Oct 21, 2021

Here are a few very quick/minor thoughts:

  • Tracking the metadata does seem valuable.
  • Jacob's ideas about spatial aggregation also seem worthwhile.
  • I think in terms of aggregating over time, we'll most often want to aggregate from daily observations to epiweek observations. I guess that could be done using a date-to-MMWR-week function in combination with the index_by %>% group_by %>% summarize construction Jacob mentioned. But maybe we could make a convenience function for that (see the sketch after this list).
  • If we want to think proactively about what things will look like when we get to porting this to Python -- I assume a python port of this will use pandas data frames, where index is essentially a reserved key word (and it might be used more generally than just for tracking the temporal index). It's not clear to me whether it's more important for the R version of the package to be consistent with tsibble (suggesting the use of index instead of time_value) or for a standard set of terminology to be developed that translates cleanly between python and R versions of epitools (suggesting the use of time_value instead of index). People in our group use both python and R for different projects, so anticipating that I'll get confused about this some day, I'm now slightly leaning toward using time_value instead of index. But this is not a strong feeling.
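The convenience function could wrap something like this sketch (assuming our geo_value/time_value/value column conventions, with lubridate::epiweek() as the date-to-MMWR-week function; note that epiweek() alone drops the year, so a real version would want a year-aware week index):

library(dplyr)
library(lubridate)
library(tsibble)

dat %>%
  as_tsibble(index = time_value, key = geo_value) %>%
  index_by(epiweek = ~ epiweek(.)) %>%
  group_by(geo_value) %>%
  summarise(value = sum(value))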

@ryantibs
Member Author

Thanks very much to both of you for helpful answers.

I'm still struggling, though. To be clear, I'm trying to play devil's advocate here, for both sides of the argument:

  • I agree completely that geo aggregation utilities would be nice, but why would such functions go in epitools? Why not a separate package specifically designed for that? Or even just a big PR that we, or somebody else, make to tsibble directly? What is "epi" about any of this?
  • Handling issue dates is about the only thing I see that gets close to being "epi" specific, though other time series data sets have revisions too (such as in economics). But at least it's a very prominent feature of epidemiological data.

So is there any point in defining a custom epi_signal class? Is it just the metadata that makes it special? I'm leaning towards yes, but I can see the argument both ways.

To go back to the specific question about the structure of the data frame itself: do we really let key and index be anything that tsibble can accept? Or should we require a bit more structure than that? For example, do we require a specific name for the index column? Do we require that at least one key be a geo value, and that it have a specific column name? Etc.

(Evan: I didn't really get your last comment, which seems related to this. Was it about renaming the index field in the metadata? Or a concern about requiring a column to be named index? Just to be clear: my understanding is that tsibble doesn't require the index to be literally named index in the data frame. It's just a metadata designation that a particular column serves as the index.)

@dajmcdon
Contributor

dajmcdon commented Oct 21, 2021

Also quick, minor. (Drafted pre-Ryan, but posted after)

  • I really like Jacob's spatial aggregation suggestions.
  • Related to spatial + python: I get that tsibble is "just for time series", so understanding index as something to do with time is fine. But the name index is pretty meaningless. I prefer time_value instead, and geokey makes for a nice complement. So I'll put my vote with Evan for replacing index.
  • One hiccup to consider for locations: isn't there some craziness with counties, where JHU breaks some up? NYC boroughs maybe? It's hard to fault tsibble for defining a gap in days as the most natural thing, but locations are more problematic. Do we expect states to include the US Virgin Islands? Is Guam separate from the Northern Marianas? Is Saipan its own thing? Plus the NY boroughs. Of course we could be specific that, for our purposes, we're using JHU's locations, 2020 Census populations, etc. But this could get crazy. Also, having been hounded by my humanist wife yesterday about how data people are to blame for the world's ills, I would be remiss if I didn't ask: what would a Puerto Rican think of us defining PR as a "state" (or not)? Time is hard to argue with, but here classification is weird.
  • One might also want to be able to use other indexes (age, gender). One could leave these to the user to handle with similar gap-filler functions. But given that index seems insufficient for our purposes (we need at least this functionality for time plus different spatial classes), maybe the way to proceed is to create some constructors that can build the family of classes appropriate to the different things we would want. I don't know if R has a good way to do this, but it would be nice to not have to manually create tsibble_us, tsibble_msa, ..., tsibble_us_age, etc. (See the sketch after this list.)
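One way R could support this is with a constructor factory; here's a hypothetical sketch (new_keyed_tsibble(), the class names, and the required columns are all made up for illustration):

new_keyed_tsibble = function(class_name, key_cols, index_col) {
  function(x) {
    stopifnot(all(c(key_cols, index_col) %in% names(x)))
    out = tsibble::as_tsibble(x, index = !!rlang::sym(index_col),
                              key = tidyselect::all_of(key_cols))
    class(out) = c(class_name, class(out))
    out
  }
}

as_tsibble_us = new_keyed_tsibble("tsibble_us", "geo_value", "time_value")
as_tsibble_us_age = new_keyed_tsibble("tsibble_us_age",
                                      c("geo_value", "age_group"), "time_value")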

@elray1
Collaborator

elray1 commented Oct 22, 2021

I basically don't have an opinion about Ryan's questions about whether we'd want to enforce more structure for the index and key in terms of requiring specific column names.

Brief follow-up on the index/time_value thread: my thinking is that the meaning and treatment of index differ between tsibble and pandas. In tsibble, an index is definitely a time_value and, I think, corresponds to exactly one column; in pandas, an index may track information from more than one column, and may track information other than a time_value. For example, any of the three constructions below is OK for a pandas data frame, but if I understand correctly only the first makes sense for a tsibble:

>>> import pandas as pd
>>> example_dat = {
...   'geo_val': ['ma', 'ma', 'pa', 'pa'],
...   'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
...   'value': [1, 2, 3, 4]
... }
>>> pd.DataFrame(example_dat,
...              columns = ['geo_val', 'value'],
...              index = example_dat['date'])
           geo_val  value
2021-01-01      ma      1
2021-01-02      ma      2
2021-01-01      pa      3
2021-01-02      pa      4
>>> pd.DataFrame(example_dat,
...              columns = ['date', 'value'],
...              index = [example_dat['geo_val']])
          date  value
ma  2021-01-01      1
ma  2021-01-02      2
pa  2021-01-01      3
pa  2021-01-02      4
>>> pd.DataFrame(example_dat,
...              columns = ['value'],
...              index = [example_dat['geo_val'], example_dat['date']])
               value
ma 2021-01-01      1
   2021-01-02      2
pa 2021-01-01      3
   2021-01-02      4

This is a minor point, but I think a small argument in favor of defining a class and using a name other than index for this concept is that things would then translate a little more precisely between R and python implementations.

@ryantibs
Member Author

Thanks @elray1. Just an update for you all @dajmcdon @jacobbien @brookslogan.

This morning I worked on defining epi_signal to inherit from tsibble, but I'm no longer sure that this is a good idea.

  • The tsibble class appears to be a fragile label: if x is a tsibble, then directly modifying entries like x$time_value[1] = as.Date("2020-07-01") drops the tsibble class from x (even if this maintains validity of the tsibble class). This is actually going to be quite annoying to deal with, because after such manipulations we could end up in a situation where we have an epi_signal that is NOT a tsibble, which could complicate some S3 methods code.
  • The tsibble class maintains that every observation is uniquely identified by the key and index columns. As such, there are a bunch of validity checks that appear to get run internally, and this might just create extra overhead for large data sets. Also, while this is not a bad thing to enforce in general, I'm not sure we'd always want it to hold for an epi_signal. I think we'd want it to be a "principle", but maybe not a "rule" that gets explicitly checked every time.

So I'm going to trash all my changes from this morning and revert back to having epi_signal inherit from tibble. I think this still works nicely because if we ever want to access the tsibble functionality (much of it is awesome) then we can always do things like

x %>%
  tsibble::as_tsibble(index = time_value, key = geo_value) %>%
  tsibble::index_by(...)

and so on. Meaning, just convert to a tsibble as needed on-the-fly. (Using either tsibble::as_tsibble() or tsibble::build_tsibble().)
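For instance, a weekly aggregation via an on-the-fly round trip could look like this sketch (assuming a value column; yearweek() is tsibble's year-week index):

x %>%
  tsibble::as_tsibble(index = time_value, key = geo_value) %>%
  tsibble::index_by(week = ~ tsibble::yearweek(.)) %>%
  dplyr::summarise(value = mean(value)) %>%
  tibble::as_tibble()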

I think we should still keep it in mind that we may want to refactor everything in the future to inherit from tsibble, but I think at this point it's going to slow me down a lot to deal with the issues mentioned above, so I'm going to move forward without it.

ryantibs added a commit that referenced this issue Oct 25, 2021
- As per the discussion on #7, I ditched the `value` column in an
  `epi_signal` object completely
- Aside from `geo_value` and `time_value`, the other columns are
  arbitrary and can be treated as measured variables
- This required refactoring `pct_change()`, `estimate_deriv()`,
  and `sliced_cor()` so that the user can specify the variables
  they want to act on
- In doing so, I used tidy evaluation (so the user just passes
  variable names directly), and extended this to the `by` arg of
  `sliced_cor()` as well
- This will dovetail nicely into allowing arbitrary groupings later
  on (for sliding by more than just geo, for example, as per #12)
- The other thing I did here (probably shouldn't have lumped it
  into the same commit though ...) was to broaden the allowed
  `time_type`s to include a new "day-time" type, coded as a
  `POSIXct` object (time on a given day, measured to the second)
- I expanded `slide_by_geo()` so that the user can specify the
  definition of a time step; so, for example, one can slide over
  the last 7 hours (if `time_type` was "day-time"), or the last 7
  minutes, or whatever
- Rewrote a lot of the documentation, updated the vignettes
@earowang

earowang commented Oct 25, 2021

Hi Team,

Thanks for considering leveraging {tsibble} for epi_signal. I'd like to elaborate on a couple of points raised by @ryantibs.

The tsibble class appears to be a fragile label: if x is a tsibble, then directly modifying entries like x$time_value[1] = as.Date("2020-07-01") drops the tsibble class from x.

This design is on purpose:

  1. when a tsibble is created, analysts rarely manipulate values in the time index and key.
  2. we encourage dplyr verbs over base methods like $<- to wrangle data, since users can articulate their data thinking through expressive code. Also, $<- can do multiple things at once and make results unpredictable for edge cases, so I didn't implement a rigorous method for it.
  3. {epitools} could provide its own $<- method to preserve a valid epi_signal, if needed (a sketch follows this list).
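A hypothetical sketch of what such a method could look like (validate_epi_signal() is a made-up validator):

`$<-.epi_signal` = function(x, name, value) {
  out = NextMethod()          # delegate the actual assignment
  class(out) = class(x)       # restore the class if it was stripped
  # validate_epi_signal(out)  # hypothetical: re-check key/index validity
  out
}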

As such there are a bunch of validity checks that appear to get run internally and this might create just extra overhead for large data sets.

We run validity checks when a tsibble is created, albeit it can be turned off with validate = FALSE. Once a tsibble is created, we don't validate for every operation, except when key and index values are being manipulated.

@ryantibs
Member Author

Thanks @earowang. It's great to have THE tsibble expert among our ranks!

What you say makes a lot of sense. I think for now, in the interest of moving quickly, I'm going to keep the tibble inheritance, but I'll write a vignette to show how you can cast to tsibble on-the-fly and then leverage some of the useful functions available in this format. Would love your comments and feedback when that's ready. (And, generally, your help in any way!)

@ryantibs
Member Author

Just a heads up that I'm going to close this with #14.

@jacobbien (Sorry to pick on you 🙂, but it's your "reward" for having all sorts of useful comments about these):

Can you please create two new issues? One on time aggregation and one on geo aggregation.

Time aggregation one: should be about coming up with examples to demo casting to tsibble on-the-fly and leveraging the handy functions there. And this should go into the first half of the "aggregation.Rmd" vignette.

Geo aggregation one: should outline the kind of functionality we want implemented (summarize/consolidate the nice suggestions you and others made here), and also have as part of it filling out the second half of the "aggregation.Rmd" vignette. Thanks!!
