Skip to content

epi_df structure #482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dajmcdon opened this issue Jul 16, 2024 · 3 comments · Fixed by #511
Closed

epi_df structure #482

dajmcdon opened this issue Jul 16, 2024 · 3 comments · Fixed by #511
Labels
enhancement New feature or request

Comments

@dajmcdon
Copy link
Contributor

Possibly worth adding some sorting to as_epi_df. Some ideas,

  1. Columns are always time_value, geo_value, additional keys, other columns
  2. Rows always vary time_value fastest, then geo_value, etc.

For the second, perhaps add a dplyr::arrange.epi_df() method.

@dajmcdon dajmcdon added the enhancement New feature or request label Jul 16, 2024
@brookslogan
Copy link
Contributor

On point 1., I polled some preferences regarding data-fetching argument ordering which might be relevant.

On point 2., I'm a little confused. I think arrange already works properly on epi_dfs; is it not working for you, or are you saying it should be forbidden? Related: what do you think about these approaches?:

  • A. epi_dfs should always have rows sorted in this manner (so we should also make sure to maintain this order after any manipulations); arrange.epi_df() should output an explanatory error message.
  • B. epi_dfs should be sorted upon construction (maybe only by default, controlled by an arrange arg to as_epi_df), e.g., by calling arrange() at some point, but user should be able to later re-arrange() as they'd like
  • C. Users should be able to choose either behavior by using a constructor argument (e.g., keep_arranged).

@dajmcdon
Copy link
Contributor Author

Point 2.

Hmmm, yes, arrange() works. I guess I was imagining a function (possibly different from arrange() that would rearrange to a standard ordering if requested. We don't even have this in our saved data, for example, compare epiprocess::jhu_csse_daily_subset to epipredict::case_death_rate_subset.

@dshemetov
Copy link
Contributor

dshemetov commented Jul 20, 2024

RE point 1, we looked into similar polling when deciding for argument orders for epidatr and came to: location > time > other. Sorting by geo value and then time value seems most performant to me too, since group_by(geo_value) is common for time series operations. Do we not already do this in the constructor? Seems like something we should do. (I guess I'm assuming tibbles maintain some metadata about how they were sorted last that they can leverage to speedup groupby or other operations; not sure if this is actually the case.)

RE 2, I'm not really seeing the benefit of a "default" arrange function over a generic one that can accept different orders as arguments. Using existing arrange to sort at construction sounds good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants