Skip to content

more flexible specification of columns identifying units of analysis for epi_signal #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elray1 opened this issue Oct 20, 2021 · 7 comments

Comments

@elray1
Copy link
Collaborator

elray1 commented Oct 20, 2021

The main example I'm imagining supporting is flusurv hospitalizations, which report hospitalization rates for each combination of location and age group. For example, a user might want to be able to apply sliding functions to each series corresponding to one combination of these factors. I think supporting this could happen with two changes:

  1. Allow users to specify something like unit_ids when creating an epi_signal. This should be a character vector of column names in the data set, and would be stored in the metadata. This could default to "geo_value" to match current behaviors. A user working with flusurv data might specify unit_ids = c("geo_value", "age_group").
  2. Maybe rename slide_by_geo to slide_by_unit? And in that function change the group_by to group by the id columns stored in the metadata.
@ryantibs
Copy link
Member

Yep, great idea, again @elray1. Thanks! Please check out my reply on #10, and see if you agree with where I ended up, then I can try to address the current suggestion in a refactor as well.

One note I forgot to mention on #10, applies equally-well here. I think making the structure more flexible as in #10 and the current issue is going to force us to forgo inheriting from tsibble, see #7. At least in the most general case (with multiple columns representing signals and units). I think tsibble has a single key and index. But I haven't read about it in detail. Maybe you could check out tsibble and see if you agree that it'd be too rigid? Would be good to know that we'd be forgoing that if we do indeed refactor as suggested here and on #10.

@elray1
Copy link
Collaborator Author

elray1 commented Oct 20, 2021

It looks like the key in tsibble can support multiple columns, so I think it should work ok. Maybe we should use the key terminology to match what's already done in tsibble rather than the unit terminology i suggested in this issue.

@ryantibs
Copy link
Member

Great! Thanks for checking that. So it looks the index column in tsibble must remain a single column (time_value for us) but I guess that's not a big deal.

So tsibble offers the flexibility we need then. And then it looks like then a single big refactor can be used to address #12, #10, and #7 all at once.

One minor question before I attempt anything, which I should probably ask on #7, but just for convenience let me ask here. Should we keep the column naming scheme index and key to be exactly as in tsibble? Meaning, no time_value and geo_value named columns?

@elray1
Copy link
Collaborator Author

elray1 commented Oct 20, 2021

Yeah, I guess I have a weak preference to generally match what's being done for tsibble and avoid introducing new terms.

@jacobbien
Copy link
Contributor

This sounds good to me. So the function would then be named slide_by_key?

This was referenced Oct 21, 2021
ryantibs added a commit that referenced this issue Oct 25, 2021
- As per the discussion on #7, I ditched the `value` column in an
  `epi_signal` object completely
- Aside from `geo_value` and `time_value`, the other columns are
  arbitrary and can be treated as measured variables
- This required refactoring `pct_change()`, `estimate_deriv()`,
  and `sliced_cor()` so that the user can specify the variables
  they want to act on
- In doing so, I used tidy evaluation (so the user just passes
  variable names directly), and extended this to the `by` arg of
  `sliced_cor()` as well
- This will dovetail nicely into allow arbitrary groupings later
  on (for sliding by more than just geo, for example, as per #12)
- The other thing I did here (probably shouldn't have lumped it
  into the same commit though ...) was to broaden the allowed
  `time_type`s to include a new "day-time" type, coded as a
  `POSIXct` object (time on a given day, measured to the second)
- I expanded `slide_by_geo()` so that the user can specify the
  definition of a time step; so, for example, one can slide over
  the last 7 hours (if `time_type` was "day-time"), or the last 7
  minutes, or whatever
- Rewrote a lot of the documentation, updated the vignettes
@brookslogan
Copy link
Contributor

Another type of "unit" to consider: forecasting tasks. E.g., in the AR example, ahead should be another part of the key; if we tried to use slide_by_geo on rbound results for multiple aheads, I would expect it to break / give invalid results.

ryantibs added a commit that referenced this issue Oct 27, 2021
- `epi_signal` is now `epi_tibble`. As suggested by Logan on #14,
  it might not be a good idea to call it use "signal, since it can
  contain more than one signal column
- Got rid of the grouping done internally by `slide_by_geo()`.
  Instead allow the user to do their own grouping ahead of time
- This jells with the discussion in #12, and it seems more in line
  with standard tidyverse-style data wrangling
- As a result, renamed the sliding function to `epi_slide()`
- And had to define S3 methods for grouping and joining for our
  `epi_tibble` objects, so that the class and metadata wouldn't
  be dropped by these actions
- Update documentation and vignettes
@ryantibs
Copy link
Member

Just a heads up that I'm going to close this with #14.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants