-
Notifications
You must be signed in to change notification settings - Fork 8
Think about how to handle multiple signals #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
A couple other options I was coincidentally pondering a couple days ago, not sure they have advantages over the above though:
|
Thanks @elray1 for jumping in with this great suggestion. I like many aspects of this proposal! I'm just not sure I see yet a reason to have this completely replace the current I see advantages to both. The disadvantages of the current format are the ones you listed. The disadvantage of the format that you're suggesting is that the column names themselves may have to contain info in some applications that you may want to use downstream, and extracting this info from the column names could be brittle. For example, for the simple AR forecaster which outputs three predictive quantiles, we'd have column names like So here is a suggestion about we could proceed.
Instead of subjecting the user to write their own wrangling code to go in between formats 1 and 2a and 2b, we could just write some convenience functions (based on The advantage to supporting formats 1 and 2a and 2b is flexibility. If you happen to like format 2b and I happen to like format 1, and they're both supported and we can easily transform back and forth between them, then we're both happy. The disadvantage is maintenance cost. If we have multiple formats that we support then it's more work to make sure that the package's functions work with all formats ... What do you think? |
After having slept on it, I think I already dislike my suggestion of having these three types purely for maintenance reasons. It's going to be a lot of work to support Also, your suggested data structure, So my general take at this point is that we should go with the structure you're suggesting (but just keep it called In the spirit of moving fast 🙂, I'm thinking about just implementing this later today. Any reactions @elray1 @dajmcdon @jacobbien? (I would tag Logan but I'm hoping he's actually going to stop reading GitHub these next few days and take some time off!). |
This sounds reasonable to me. A small hanging piece is whether we'd want to somehow track metadata for different signals in a list column (e.g. via a nested list?) so that they could be "unpacked" correctly when unnesting the column. But my instinct is to not worry about that for now. |
Of the various formats,
In my current forecast exploration, I have also been trying to use the nested format. I do prefer this over the unnamed-list-of-signal-df's format as filtering/joining to particular signals is easier. Its purpose currently seems to be just to get from a covidcast API response (list-of-dfs format) to the wide format, applying lags to covariates along the way. (E.g., we can specify the desired covariates along with the time shifts to apply in a tibble such as the following, and left-join it to the nest format to check that the desired signals are all there and to extract the desired features:
) But lags can be applied in the wide format as well. It's just that specifying them now would be a little awkward because one must rely on a certain naming convention and transform things like the tibble above to use such a naming convention. With the wide+metadata idea from Evan, maybe we could make it work more naturally. One use case that is always a bit awkward for me with wide data, though, is ggplotting. I am still not comfortable with the new |
A trivial yet important issue: naming. If there were still a class |
I agree completely with everything @brookslogan says above about the wide format for forecasting. There are a handful of functions inside Both assume that you have applied
|
Thanks for the pointers @dajmcdon. Yes, these functions are quite handy, but I expect some of the brittle-ness to come into play with the hhs data set.
So I was trying to patch things up and make the forecaster's expectations more explicit, and it seemed easier to move to the nested format and try to make similar functions that may eventually become PRs to the originals or something to put in |
Yes thanks for all the helpful points @brookslogan! And welcome back 🙂. In my PR #14, I've already moved everything to wide format, but without distinguishing it from the original class. Now everything's called |
By the way, as a follow-up comment, I don't know that it's super helpful to actually distinguish between wide and long formats, necessarily, in terms of object types. An The user has the ability "do stuff" to it, with the utility functions. These utilities do not explicitly check for whether of not every row is uniquely determined by I think I would rather write clear documentation and leave it up to the user to make sure that the intended application makes sense. And provide them with plenty of examples via vignettes of wrangling before and after applying the utilities (and maybe some extra utilities for convenient wrangling). I would also like to provide them the ability to modify the utilities so that they group on more than just What do you think? This is my current plan. |
Just a heads up that I'm going to close this with #14. |
A common use case will be models that want to interact with multiple signals, e.g., cases, hospitalizations, deaths, and perhaps other covariates like mobility data. It would be ideal if this package supported this use case in a complete way. I see two concerns at the moment:
Currently, the package requires a column to be named
value
, with the name of the signal stored as metadata. This is slightly awkward in a couple of ways:value
is not a very meaningful term; we have to inspect the metadata to figure out what it is the data frame is tracking, rather than just looking at the column names.(a) by adding extra columns tracking the things I want. But this doesn't seem ideal because
- we essentially have to pick one of these columns as being "special" though there may not be reason to do so
- metadata would only describe the variable in the value column, not the other columns
(b) by using a list column for the value column. But this doesn't seem ideal because:
- I think the user would need to unnest that list column before actually doing anything with the data like making plots or fitting models.
- there may not be enough flexibility in the metadata to track information about all the variables in the data set
Here's a proposal:
value
value_names
, a vector of names of columns in the data frame that contain signals of interest. This is required to have length at least 1value_names
. This has the advantage of optionally explicitly tracking things like issue dates and signal types separately for each signal in the data set. (For example, tracking issue dates separately may be beneficial if I'm modeling signals that don't all have regular release schedules that coincide on the same day.)The text was updated successfully, but these errors were encountered: