Should epi_signal inherit from tsibble? #7


Closed
ryantibs opened this issue Oct 15, 2021 · 12 comments

@ryantibs
Member

The tsibble package, as @jacobbien mentioned, looks like it has a structure that would be helpful for epi_signal to inherit from, rather than a plain tibble. A tsibble has the following structure:

  • key: for us this would be geo_value, I think;
  • index: for us this would be time_value, I think;
  • any other columns as measured variables: for us, we need at least value, here.

Generally, we should also check out the tidyverts (and yes, that's not a typo; click on the link). It would be nice to see if anything there just takes care of things we might need (or might have already implemented) in epitools, and soon, in epipred.

@jacobbien @dajmcdon @capnrefsmmat @brookslogan Looking through all of this would be a big job, but do any of you guys just want to each take a look at some part, and report back what you think here, on this issue?

@dajmcdon
Contributor

One potential advantage to {tsibble} is the fill_gaps() function that turns implicit missing dates in the time series into explicit NAs.
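For example (a minimal sketch; the geo_value/time_value names are just this package's convention, not anything tsibble requires):

library(tsibble)

x = tsibble(
  geo_value = "ca",
  time_value = as.Date(c("2021-01-01", "2021-01-02", "2021-01-04")),
  cases = c(1, 2, 4),
  key = geo_value, index = time_value
)
fill_gaps(x)  # adds an explicit row for 2021-01-03, with cases = NA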

@brookslogan
Contributor

brookslogan commented Oct 17, 2021

If we don't want to use tsibble, these helper functions are useful for calculating rollapply on an ordered & gap-filled version of a time series and then returning to the original row order and number:

#' Transmute time series data in unordered df with gaps as if ordered&regular
#'
#' Columns are defined as if the time series data were ordered and regular, but
#' the result preserves the row number and order of the original data frame.
#'
#' @details Within each group of the df (or the entire df if ungrouped),
#'   \link[dplyr:arrange]{arranges} by time, \link[tidyr:complete]{completes} it
#'   to make the time a regular sequence, performs the
#'   \link[dplyr:transmute]{transmutation} (preserving any grouping columns and
#'   tacking on the time column by default), then \link[dplyr:filter]{filters}
#'   and rearranges to match the original row number and order of the original
#'   data frame.
#'
#' @param .data a data frame (including grouped tibbles)
#' @param .time_col_selector \code{\link[dplyr:pull]{dplyr::pull}} argument to
#'   select the time column
#' @param .period period/stride/spacing of time values to use when completing
#'   the time column
#' @param ... \code{\link[dplyr:transmute]{dplyr::transmute}} arguments, which
#'   can assume that they are working on an arranged, completed version of each
#'   group of the data frame
#' @return a data frame with any grouping columns, the time column (by default),
#'   and the requested columns from `...`, with grouping and time columns in the
#'   original order
#'
#' @importFrom pipeR %>>%
#' @importFrom dplyr group_modify pull mutate transmute filter arrange across everything
#' @importFrom tidyr complete full_seq
#' @importFrom tibble tibble
#' @importFrom rlang enquos !!!
#' @importFrom assertthat assert_that
transmute_scattered_ts = function(.data, .time_col_selector, .period, ...) {
  main.dots = enquos(...) # [note from future: not sure why I'm not just forwarding the dots; that should work fine if we don't need to inspect them]
  .data %>>%
    group_modify(function(group.data, ...) {
      time.values = pull(group.data, {{.time_col_selector}})
      tibble(
        original_row_index = seq_along(time.values),
        t = time.values,
        data_wrapper = group.data
      ) %>>%
        {
          full.seq.t = full_seq(.[["t"]], .period)
          intermediate.result = complete(., t = full.seq.t)
          assert_that(nrow(intermediate.result) == length(full.seq.t),
                      msg = 'Time series data appears to contain duplicate time values')
          intermediate.result
        } %>>%
        mutate(data_wrapper = transmute(data_wrapper, !!!main.dots, {{.time_col_selector}})) %>>%
        filter(!is.na(original_row_index)) %>>%
        arrange(original_row_index) %>>%
        pull(data_wrapper)
    })
}
## XXX improve speed: maybe try a less group-wise approach, inventing new name
## in the original for original_row_index, and/or special-casing when the data
## frame is already sorted overall and/or within groups.
##
## todo decide on styling given other naming conventions involved

#' Mutate time series data in unordered df with gaps as if ordered&regular
#'
#' Works like \code{\link{transmute_scattered_ts}}, but preserving all the
#' original columns by default
mutate_scattered_ts = function(.data, .time_col_selector, .period, ...) {
  transmute_scattered_ts(.data, {{.time_col_selector}}, across(everything()), ..., .period=.period)
}
## todo use template or shared topic to document args here
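A usage sketch (hypothetical: assumes daily data in a geo_value/time_value/value format, and uses zoo::rollmeanr() as the rolling computation):

library(dplyr)

dat %>%
  group_by(geo_value) %>%
  mutate_scattered_ts(time_value, .period = 1L,
                      mean7 = zoo::rollmeanr(value, 7, fill = NA))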

@ryantibs
Member Author

So as I alluded to over on #12, #10, and #8, I'm thinking about making a big redesign on a branch to address all of those issues, as well as the current one: moving over to the tsibble format. I was reading a bit more about the tsibble format, and it appears to offer some advantages beyond those listed already, for example:

  • wide variety of built-in formats for the index column (time column), well beyond the two we already support in this package (day and week)
  • bunch of handy helper/processing functions associated with this, like tsibble::index_by() for aggregation

As suggested on the other issues, it seems like the best way forward is to stick with the tsibble notation exactly. So there would be:

  • one column designated as index (for the generalization of time variable)
  • zero or more columns designated as key (for the generalization of the geo variable)
  • any number of columns which count as measured variables

The column names can be anything and the designation as index and key happens at the level of metadata (handled by tsibble).
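For example (a minimal sketch, assuming a data frame dat with columns named time_value and geo_value):

library(tsibble)

x = as_tsibble(dat, index = time_value, key = geo_value)
index_var(x)  # "time_value" -- the index designation lives in the metadata
key_vars(x)   # "geo_value"  -- likewise for the key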

So now I'm left wondering: what, if any, is the purpose of defining an epi_signal class? It seems to fit exactly into the tsibble format already, with very few differences: e.g., it may just carry some additional metadata (for example, a metadata field specifying the issue date corresponding to the data set).

Somewhat relatedly, though really not the same question, I'm wondering how important it is to treat geographies explicitly anymore. Is geo value just one type of key, such that we don't even attempt to distinguish it from an arbitrary key? Or should we be providing built-in utilities for handling standard geo types (like aggregation or disaggregation between them)?

I want to spend a bit of time thinking here before doing anything. Just treating everything as a tsibble and defining no special class structure for this package would be remarkably simple, but it would also somehow feel vacuous and trivial: we'd be providing basically no standardization of data formats at all.

@elray1 @dajmcdon @jacobbien Your thoughts? (Again I'm sure it would be really useful to hear from Logan but not tagging him so that he can enjoy a couple days off!)

ryantibs mentioned this issue Oct 21, 2021
@jacobbien
Contributor

Yeah, I agree with @ryantibs that tsibble appears to do everything that's needed. I sort of like the idea of creating a class, perhaps called tsibble_us (or epi_signal_us?), that inherits from tsibble and enforces the key to be one of national, state, hrr, county, msa. Let me elaborate on what Ryan mentions above about facilitating geo-based aggregation.

At first it might seem this isn't really necessary. For example, consider this example from the tsibble documentation showing how index_by() + summarize() can be used to aggregate to a coarser index:

tourism %>%
  index_by(Year = ~ year(.)) %>%
  group_by(Region, State) %>%
  summarise(Total = sum(Trips))

If one wanted to aggregate to the Year-Region level instead, couldn't one just remove State from the group_by()? Well, there's a problem here: namely, what if some states are missing? The advantage of tsibble_us is that it would know if some states are missing (and likewise, if you try to aggregate counties to the state level, it would know if there are missing counties).

This is directly analogous to how tsibble worries a lot (so that the user doesn't have to!) about missing time values. In Handle implicit missingness with tsibble, Earo Wang describes four functions that have to do with handling missing time values: has_gaps, scan_gaps, count_gaps, and fill_gaps. Our class tsibble_us could do the same for missing locations. Imagine four corresponding functions for missing locations (let's call a missing location a "hole" for now for lack of a better term):

  • has_holes - are there missing locations? (E.g., if the key is at the county level, are all US counties there?) One could also have an argument for specifying a limited scope: e.g., if the scope is defined as CA, then %>% has_holes(scope = state("CA")) would only check whether there were missing CA counties in the data object.
  • scan_holes - what are the missing counties?
  • count_holes - how many missing counties are there?
  • fill_holes - this function could at minimum create NA time series for the missing counties. This means that if we aggregate from the county to state level, it will be apparent that CA was missing some counties. One could also imagine cases where filling in 0s makes sense. In fact, tsibble_us could even offer imputation based on the geo hierarchy (e.g., fill in with the average of all counties within the state) or based on spatial proximity. (See the sketch after this list.)
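A hypothetical sketch of the first and last of these (none of this exists yet; it assumes a reference vector us_county_fips of all county FIPS codes and our geo_value/time_value column conventions):

has_holes = function(x, reference = us_county_fips) {
  !all(reference %in% unique(x$geo_value))
}

fill_holes = function(x, reference = us_county_fips) {
  missing_geos = setdiff(reference, unique(x$geo_value))
  holes = tidyr::expand_grid(geo_value = missing_geos,
                             time_value = unique(x$time_value))
  dplyr::bind_rows(x, holes)  # measured columns become NA in the new rows
}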

tsibble_us could also have population size information for each geo value, so that weighted averages could be easy to do in aggregation.

An example: suppose dat is a tsibble_us object with daily county-level cases (three columns: Date, the index; County, the geo-key; and Cases, a column with the incidence rate per 100k). We could aggregate to state-level epiweek data with something like the following:

dat %>%
  fill_gaps() %>% 
  fill_holes() %>%
  index_by(Epiweek = ~ epiweek(.)) %>%
  geokey_by(State = ~ state(.)) %>%
  summarise(Aggregated_cases = population_weighted_mean(Cases))

Here geokey_by, state and population_weighted_mean would be tsibble_us functions that are based on the information contained in the package about the us geography.
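For instance, population_weighted_mean might look something like this minimal sketch (one possible signature; county_population() is a made-up lookup that would ship with the package):

population_weighted_mean = function(value, geo) {
  pop = county_population(geo)  # hypothetical population lookup
  sum(value * pop, na.rm = TRUE) / sum(pop[!is.na(value)])
}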

Obviously, instead of tsibble_us, one could write a more general tsibble_geo and then there could be location specific classes that inherit from it, like tsibble_us, tsibble_europe, tsibble_world, etc.

@elray1
Collaborator

elray1 commented Oct 21, 2021

Here are a few very quick/minor thoughts:

  • Tracking the metadata does seem valuable.
  • Jacob's ideas about spatial aggregation also seem worthwhile.
  • I think in terms of aggregating over time, we'll most often want to aggregate from daily observations to epiweek observations. I guess that could be done using a date-to-MMWR-week function in combination with the index_by %>% group_by %>% summarize construction Jacob mentioned. But maybe we could make a convenience function for that (see the sketch after this list).
  • If we want to think proactively about what things will look like when we get to porting this to Python -- I assume a python port of this will use pandas data frames, where index is essentially a reserved key word (and it might be used more generally than just for tracking the temporal index). It's not clear to me whether it's more important for the R version of the package to be consistent with tsibble (suggesting the use of index instead of time_value) or for a standard set of terminology to be developed that translates cleanly between python and R versions of epitools (suggesting the use of time_value instead of index). People in our group use both python and R for different projects, so anticipating that I'll get confused about this some day, I'm now slightly leaning toward using time_value instead of index. But this is not a strong feeling.
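The convenience function could wrap something like this sketch (assuming our geo_value/time_value/value column conventions, with lubridate::epiweek() as the date-to-MMWR-week function; note that epiweek() alone drops the year, so a real version would want a year-aware week index):

library(dplyr)
library(lubridate)
library(tsibble)

dat %>%
  as_tsibble(index = time_value, key = geo_value) %>%
  index_by(epiweek = ~ epiweek(.)) %>%
  group_by(geo_value) %>%
  summarise(value = sum(value))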

@ryantibs
Member Author

Thanks very much to both of you for helpful answers.

I'm still struggling, though. To be clear, I'm trying to play devil's advocate here, for both sides of the argument:

  • I agree completely that geo aggregation utilities would be nice, but why would such functions go in epitools? Why not a separate package specifically designed for that? Or even just a big PR that we, or somebody else, make to tsibble directly? What is "epi" about any of this?
  • Handling issue dates is about the only thing I see that gets close to being "epi" specific, though other time series data sets have revisions too (such as in economics). But at least it's a very prominent feature of epidemiological data.

So is there any point in defining a custom epi_signal class? Is it just the metadata that makes it special? I'm leaning towards yes, but I can see the argument both ways.

To go back to the specific question about the structure of the data frame itself: do we really let key and index be anything that tsibble can accept? Or should we require a bit more structure than that? For example, do we require a specific name for the index column? Do we require that at least one key be a geo value, and that it have a specific column name? Etc.

(Evan: I didn't really get your last comment, which seems related to this. Was it about renaming the index field in the metadata? Or a concern about requiring a column to be named index? Just to be clear: my understanding is that tsibble doesn't require the index to be literally named index in the data frame. It's just a metadata designation that a particular column serves as the index.)

@dajmcdon
Contributor

dajmcdon commented Oct 21, 2021

Also quick, minor. (Drafted pre-Ryan, but posted after)

  • I really like Jacob's spatial aggregation suggestions.
  • Related to spatial + python: I get that tsibble is "just for time series", so understanding index as something to do with time is fine. But the name index is pretty meaningless. I prefer time_value instead, and geokey makes for a nice complement. So I'll put my vote with Evan for replacing index.
  • One hiccup to consider for locations: isn't there some craziness with counties, where JHU breaks some up? NYC boroughs maybe? It's hard to fault tsibble for defining a gap in days as the most natural thing, but locations are more problematic. Do we expect states to include the US Virgin Islands? Is Guam separate from the Northern Marianas? Is Saipan its own thing? Plus the NY boroughs. Of course we could be specific that, for our purposes, we're using JHU's locations, 2020 Census populations, etc. But this could get crazy. Also, having been hounded by my humanist wife yesterday about how data people are to blame for the world's ills, I would be remiss if I didn't ask: what would a Puerto Rican think of us defining PR as a "state" (or not)? Time is hard to argue with, but here classification is weird.
  • One might also want to be able to use other indexes (age, gender). One could leave these to the user to handle with similar gap-filler functions. But given that index seems insufficient for our purposes (we need at least this functionality for time plus different spatial classes), maybe the way to proceed is to create some constructors that can build the family of classes appropriate to the different things we would want. I don't know if R has a good way to do this, but it would be nice to not have to manually create tsibble_us, tsibble_msa, ..., tsibble_us_age, etc. (See the sketch after this list.)
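One way R could support this is with a constructor factory; here's a hypothetical sketch (new_keyed_tsibble(), the class names, and the required columns are all made up for illustration):

new_keyed_tsibble = function(class_name, key_cols, index_col) {
  function(x) {
    stopifnot(all(c(key_cols, index_col) %in% names(x)))
    out = tsibble::as_tsibble(x, index = !!rlang::sym(index_col),
                              key = tidyselect::all_of(key_cols))
    class(out) = c(class_name, class(out))
    out
  }
}

as_tsibble_us = new_keyed_tsibble("tsibble_us", "geo_value", "time_value")
as_tsibble_us_age = new_keyed_tsibble("tsibble_us_age",
                                      c("geo_value", "age_group"), "time_value")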

@elray1
Collaborator

elray1 commented Oct 22, 2021

I basically don't have an opinion about Ryan's questions about whether we'd want to enforce more structure for the index and key in terms of requiring specific column names.

Brief follow-up on the index/time_value thread: my thinking is that the meaning and treatment of index differ between tsibble and pandas. In tsibble, an index is definitely a time_value and, I think, corresponds to exactly one column; in pandas, an index may track information from more than one column, and may track information other than a time_value. For example, any of the three constructions below is OK for a pandas data frame, but if I understand correctly only the first makes sense for a tsibble:

>>> import pandas as pd
>>> example_dat = {
...   'geo_val': ['ma', 'ma', 'pa', 'pa'],
...   'date': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-02'],
...   'value': [1, 2, 3, 4]
... }
>>> pd.DataFrame(example_dat,
...              columns = ['geo_val', 'value'],
...              index = example_dat['date'])
           geo_val  value
2021-01-01      ma      1
2021-01-02      ma      2
2021-01-01      pa      3
2021-01-02      pa      4
>>> pd.DataFrame(example_dat,
...              columns = ['date', 'value'],
...              index = [example_dat['geo_val']])
          date  value
ma  2021-01-01      1
ma  2021-01-02      2
pa  2021-01-01      3
pa  2021-01-02      4
>>> pd.DataFrame(example_dat,
...              columns = ['value'],
...              index = [example_dat['geo_val'], example_dat['date']])
               value
ma 2021-01-01      1
   2021-01-02      2
pa 2021-01-01      3
   2021-01-02      4

This is a minor point, but I think a small argument in favor of defining a class and using a name other than index for this concept is that things would then translate a little more precisely between R and python implementations.

@ryantibs
Member Author

Thanks @elray1. Just an update for you all @dajmcdon @jacobbien @brookslogan.

This morning I worked on defining epi_signal to inherit from tsibble, but I'm no longer sure that this is a good idea.

  • The tsibble class appears to be a fragile label: if x is a tsibble, then directly modifying entries like x$time_value[1] = as.Date("2020-07-01") drops the tsibble class from x (even if this maintains validity of the tsibble class). This is actually going to be quite annoying to deal with, because after such manipulations we could end up in a situation where we have an epi_signal that is NOT a tsibble, which could complicate some S3 methods code.
  • The tsibble class maintains that every observation is uniquely identified by the key and index columns. As such, there are a bunch of validity checks that appear to get run internally, and this might just create extra overhead for large data sets. Also, while this is not a bad thing to enforce in general, I'm not sure we'd always want it to hold for an epi_signal. I think we'd want it to be a "principle", but maybe not a "rule" that gets explicitly checked every time.

So I'm going to trash all my changes from this morning and revert back to having epi_signal inherit from tibble. I think this still works nicely because if we ever want to access the tsibble functionality (much of it is awesome) then we can always do things like

x %>%
  tsibble::as_tsibble(index = time_value, key = geo_value) %>%
  tsibble::index_by(...)

and so on. Meaning, just convert to a tsibble as needed on-the-fly. (Using either tsibble::as_tsibble() or tsibble::build_tsibble().)
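For instance, a weekly aggregation via an on-the-fly round trip could look like this sketch (assuming a value column; yearweek() is tsibble's year-week index):

x %>%
  tsibble::as_tsibble(index = time_value, key = geo_value) %>%
  tsibble::index_by(week = ~ tsibble::yearweek(.)) %>%
  dplyr::summarise(value = mean(value)) %>%
  tibble::as_tibble()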

I think we should still keep it in mind that we may want to refactor everything in the future to inherit from tsibble, but I think at this point it's going to slow me down a lot to deal with the issues mentioned above, so I'm going to move forward without it.

ryantibs added a commit that referenced this issue Oct 25, 2021
- As per the discussion on #7, I ditched the `value` column in an
  `epi_signal` object completely
- Aside from `geo_value` and `time_value`, the other columns are
  arbitrary and can be treated as measured variables
- This required refactoring `pct_change()`, `estimate_deriv()`,
  and `sliced_cor()` so that the user can specify the variables
  they want to act on
- In doing so, I used tidy evaluation (so the user just passes
  variable names directly), and extended this to the `by` arg of
  `sliced_cor()` as well
- This will dovetail nicely into allowing arbitrary groupings later
  on (for sliding by more than just geo, for example, as per #12)
- The other thing I did here (probably shouldn't have lumped it
  into the same commit though ...) was to broaden the allowed
  `time_type`s to include a new "day-time" type, coded as a
  `POSIXct` object (time on a given day, measured to the second)
- I expanded `slide_by_geo()` so that the user can specify the
  definition of a time step; so, for example, one can slide over
  the last 7 hours (if `time_type` was "day-time"), or the last 7
  minutes, or whatever
- Rewrote a lot of the documentation, updated the vignettes
@earowang

earowang commented Oct 25, 2021

Hi Team,

Thanks for considering leveraging {tsibble} for epi_signal. I'd like to elaborate on a couple of points raised by @ryantibs.

The tsibble class appears to be a fragile label: if x is a tsibble, then directly modifying entries like x$time_value[1] = as.Date("2020-07-01") drops the tsibble class from x.

This design is on purpose:

  1. when a tsibble is created, analysts rarely manipulate values in the time index and key.
  2. we encourage dplyr verbs over base methods like $<- to wrangle data, since users can articulate their data thinking through expressive code. Also, $<- can do multiple things at once and make results unpredictable for edge cases, so I didn't implement a rigorous method for it.
  3. {epitools} could provide its own $<- method to preserve a valid epi_signal, if needed (a sketch follows this list).
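A hypothetical sketch of what such a method could look like (validate_epi_signal() is a made-up validator):

`$<-.epi_signal` = function(x, name, value) {
  out = NextMethod()          # delegate the actual assignment
  class(out) = class(x)       # restore the class if it was stripped
  # validate_epi_signal(out)  # hypothetical: re-check key/index validity
  out
}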

As such there are a bunch of validity checks that appear to get run internally and this might create just extra overhead for large data sets.

We run validity checks when a tsibble is created, albeit it can be turned off with validate = FALSE. Once a tsibble is created, we don't validate for every operation, except when key and index values are being manipulated.

@ryantibs
Member Author

Thanks @earowang. It's great to have THE tsibble expert among our ranks!

What you say makes a lot of sense. I think for now, in the interest of moving quickly, I'm going to keep the tibble inheritance, but I'll write a vignette to show how you can cast to tsibble on-the-fly and then leverage some of the useful functions available in this format. Would love your comments and feedback when that's ready. (And, generally, your help in any way!)

@ryantibs
Member Author

Just a heads up that I'm going to close this with #14.

@jacobbien (Sorry to pick on you 🙂, but it's your "reward" for having all sorts of useful comments about these):

Can you please create two new issues? One on time aggregation and one on geo aggregation.

Time aggregation one: should be about coming up with examples to demo casting to tsibble on-the-fly and leveraging the handy functions there. And this should go into the first half of the "aggregation.Rmd" vignette.

Geo aggregation one: should outline the kind of functionality we want implemented (summarize/consolidate the nice suggestions you and others made here), and also have as part of it filling out the second half of the "aggregation.Rmd" vignette. Thanks!!
