Skip to content

Big redesign #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 28, 2021
Merged

Big redesign #14

merged 7 commits into from
Oct 28, 2021

Conversation

ryantibs
Copy link
Member

Attempting to address many issues at once: #7 #8 #10 #12.

- As per the discussion on #8, from now on, an `epi_signal` object
  represents a single snapshot of a data set
- Modify `as.epi_signal()` to reflect this change
- Update documentation and vignettes accordingly
@ryantibs
Copy link
Member Author

It's worth noting that I decided to forgo the tsibble inheritance as in #8. I started making the switch this morning, but then decided this may not be a good idea (and it'd slow me down a bit right now). I explain in this comment.

- As per the discussion on #7, I ditched the `value` column in an
  `epi_signal` object completely
- Aside from `geo_value` and `time_value`, the other columns are
  arbitrary and can be treated as measured variables
- This required refactoring `pct_change()`, `estimate_deriv()`,
  and `sliced_cor()` so that the user can specify the variables
  they want to act on
- In doing so, I used tidy evaluation (so the user just passes
  variable names directly), and extended this to the `by` arg of
  `sliced_cor()` as well
- This will dovetail nicely into allow arbitrary groupings later
  on (for sliding by more than just geo, for example, as per #12)
- The other thing I did here (probably shouldn't have lumped it
  into the same commit though ...) was to broaden the allowed
  `time_type`s to include a new "day-time" type, coded as a
  `POSIXct` object (time on a given day, measured to the second)
- I expanded `slide_by_geo()` so that the user can specify the
  definition of a time step; so, for example, one can slide over
  the last 7 hours (if `time_type` was "day-time"), or the last 7
  minutes, or whatever
- Rewrote a lot of the documentation, updated the vignettes
- This was basically "for free" implementation-wise; just needed
  to update needed the documentation to reflect this
- Rename to `grouped_cor()` from `sliced_cor()`; update all uses
  of the word "slice" in documentation and vignettes accordingly
@ryantibs
Copy link
Member Author

ryantibs commented Oct 26, 2021

Just a note to myself or anybody else reading this: the last 2 commits were just me being silly with rebasing. I was trying to fix some typos in the commit messages, but clearly I need more practice with git rebase.

- `epi_signal` is now `epi_tibble`. As suggested by Logan on #14,
  it might not be a good idea to call it use "signal, since it can
  contain more than one signal column
- Got rid of the grouping done internally by `slide_by_geo()`.
  Instead allow the user to do their own grouping ahead of time
- This jells with the discussion in #12, and it seems more in line
  with standard tidyverse-style data wrangling
- As a result, renamed the sliding function to `epi_slide()`
- And had to define S3 methods for grouping and joining for our
  `epi_tibble` objects, so that the class and metadata wouldn't
  be dropped by these actions
- Update documentation and vignettes
- Create scaffolding for new vignettes on time and geo aggregation
  and on issue dates and `epi_archive` objects
- My notes on these vignettes also contains ideas about what to
  implement
- While I'm here, get rid of all join methods for the `epi_tibble`
  class. This was flawed because it's not clear how to handle the
  metadata
- Rewrite the correlations vignette accordingly
- Introduce an `ungroup()` method for the `epi_tibble` class, and
  fix a few small things to do with class and metadata assignment
@ryantibs
Copy link
Member Author

@brookslogan @elray1 @dajmcdon @jacobbien @capnrefsmmat I'm about done with this now.

The sequence of commits on this PR is not as clean as it should have been (too many big commits, and my rebasing attempts only clogged the presentation above). Oops, sorry ... next time I will remember to do it better.

So because the commits are too big and may be unwieldy to scan, here's what I did as a summary.

Renaming

  • Renamed a bunch of things (with explanation below as to why).
    • epi_signal --> epi_tibble
    • slide_by_geo() --> epi_slide()
    • sliced_cor() --> group_cor()

Handling of signal variables, groupings

  • Removed the value from an epi_tibble object, allowing for an arbitrary number of signal columns, as discussed on Think about how to handle multiple signals #10. As raised there, given a change to allowing multiple signal columns, it would make sense to move away from "signal" in the name of the object, hence the epi_tibble name.
  • Note that there is currently no special designation as to something being a signal column or not. This is my current preference is to make the data structure very lightweight: just geo_value, time_value, and then everything else. "Everything else" could be something you want to group on, like age_group, or something you want to do computations on, like covid_cases. But all functionality downstream allows you to flexibly handle this in a way that I think is easy.
  • Refactored all functions pct_change(), estimate_deriv(), grouped_cor(), so that you specify the variable(s) you would like to be operating on (since there is no one designated signal column).
  • Accordingly, grouped_cor() now takes a single epi_tibble object, and allows you to correlate any two specified columns.
  • I removed groupings from within the slide function, and hence renamed it to epi_slide(). It now works by having the user call group_by() first, as in:
     x %>% 
      group_by(geo_value, age_group) %>%
      epi_slide(...)
    
    This should address more flexible specification of columns identifying units of analysis for epi_signal #12. Of course, you can do the same thing with pct_change() and estimate_deriv().
  • In order to pull this off, I had to write S3 methods for group_by() and ungroup() for the epi_tibble (since these would otherwise drop the class and attributes). This was a great idea that Alex had. Except the first time I wrote it I got some infinite recursion that ended up with C stack errors and it took me a while to figure out how to get past this. So, thank you, Alex.
  • For grouped_cor(), I left the grouping inside, because there's some annoying stuff that would need to be done each time if it were pulled out. (Namely, regardless of how you group for the correlations, you always want to group by geo value before you compute lags or leads, and then ungroup, to prep for the whatever grouping you want to do for the correlation). Actually, now, you can do arbitrary groupings with the by argument (not only time or geo) in grouped_cor().
  • Note that epi_slide() (and the functions pct_change(), estimate_deriv() based on it) return an epi_tibble object; but grouped_cor() does not. I think this makes sense; the latter wouldn't necessarily have to be an epi_tibble, depending on how you grouped, you could have gotten rid of time of space variables completely (or both, if you didn't group at all).

Issues

  • As discussed on Distinguish snapshot of signal vs. archive of signal, and as-of vs. issue #8, I removed the issue column from the epi_tibble object. This now is in the metadata. It generally represents the latest issue among the issues of all individual observations.
  • I wrote an outline for a new vignette on issues and archive objects that describes what I want us to implement for the epi_archive object and the functionality as_of() that we provide to derive an epi_tibble from it, and functionality to slide over it.
  • I want us to proceed in a way that's very simple on the backend implementation at the moment (for example, an epi_archive is just an epi_signal with an issue column) but just gets the interface right for as_of(), epi_slide(), and so on. That can be on the "next" PR.
  • Then Logan can implement some fancy backend stuff that's much faster on the "next next" PR.

Tsibble

@ryantibs ryantibs closed this Oct 27, 2021
@ryantibs ryantibs reopened this Oct 27, 2021
@ryantibs
Copy link
Member Author

Well, I'm a bit of an idiot. I just closed this PR, by clicking "close with comment" instead of "comment". Luckily I could open it back up easily!

Anyway, I am planning on merging this soon. I've rewritten all the documentation and vignettes and checked that it all works.

I'll give you guys a little bit of time to chime in if you'd like ... but since nobody is really using this package yet, and we're still in the early stages, I feel like merging without an official review process is still OK. Probably soon though, we should implement some tests, and do things more properly.

@ryantibs ryantibs marked this pull request as ready for review October 27, 2021 03:15
- Changing what we highlight about what this function does; also,
  moving "cor" to the front of the name
- While I'm here, I snuck in a fix to the `NA` problem raised by
  Logan on #17. Now `pct_change()` and `estimate_deriv()` take as
  an argument `na_rm` with default value `TRUE`
@ryantibs
Copy link
Member Author

I just snuck in a few more fixes. And, I got the thumbs up on slack from a bunch of you. So I'm going to go ahead and merge this, which will close #7 #8 #10 #12 #17.

@ryantibs ryantibs merged commit d8fe095 into main Oct 28, 2021
@ryantibs ryantibs deleted the big-redesign branch January 5, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant