validation for correct grouping structure #26

Closed · elray1 opened this issue Nov 11, 2021 · 7 comments

Comments

elray1 (Collaborator) commented Nov 11, 2021

As discussed in comments on PR #14, we are asking users to group their data according to any relevant key/identifier variables before calling functions like slide or detect_outliers, so that those functions know to operate on one group of observations at a time. It would be good to check at some point that users have done their grouping correctly, so that e.g. we're sliding over a series representing a signal for a single geo_value or a single combination of geo_value and age_group. The main way I can think of to check this is to verify that there are no duplicated values of time_value within the same group. I could imagine two ways this could go:

  1. We could write a helper function for this validation that gets called from any epitools functions that require the correct grouping structure to make sense (like slide and detect_outliers). For example, maybe something like this:
#' Validate that there are no duplicate values of `time_value` within the same group of an `epi_tibble`
#'
#' @param x an `epi_tibble` object
#'
#' @return invisibly, `TRUE` if the grouping structure is OK; otherwise, an error is thrown
#'
#' @importFrom dplyr %>% count pull
validate_epi_tibble_grouping = function(x) {
  # extract counts of the number of times each time_value appears within each group of `x`
  time_value_counts = x %>% count(time_value) %>% pull(n) %>% unique()

  # throw an error if any count other than (i.e., larger than) 1 appears within the same group
  if (!identical(time_value_counts, 1L)) {
    stop("x must not have duplicate values of `time_value` within grouping ",
         "levels. Do you need to add more grouping structure?")
  }

  return(invisible(TRUE))
}

Then, within slide, detect_outliers, correct_outliers, and any other relevant functions, we'd call validate_epi_tibble_grouping(x) before doing anything else.
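
For instance, a grouping-sensitive function could run the validation before doing any real work. A minimal sketch, with a hypothetical slide() signature (not the actual API):

slide = function(x, slide_fun, n = 7, ...) {
  # fail fast: throws if any time_value repeats within a group of x
  validate_epi_tibble_grouping(x)
  # ... the sliding computation itself would follow here ...
}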

Example output/start of test case:

> x <- covidcast_signal(data_source = "jhu-csse",
+                       signal = "confirmed_incidence_num",
+                       start_day = "2020-06-01",
+                       end_day = "2021-05-31",
+                       geo_type = "state",
+                       geo_values = c("fl", "nj"),
+                       as_of = "2021-10-28") %>%
+   select(geo_value, time_value, cases = value) %>%
+   as.epi_tibble()
Fetched day 2020-06-01 to 2021-05-31: num_entries = 730
> head(x)
# A tibble: 6 × 3
  geo_value time_value cases
  <chr>     <date>     <dbl>
1 fl        2020-06-01   667
2 nj        2020-06-01   486
3 fl        2020-06-02   617
4 nj        2020-06-02   658
5 fl        2020-06-03  1317
6 nj        2020-06-03   541
> x %>% validate_epi_tibble_grouping()
Error in validate_epi_tibble_grouping(.) : 
  x must not have duplicate values of `time_value` within grouping levels. Do you need to add more grouping structure?
> x %>% group_by(geo_value) %>% validate_epi_tibble_grouping()
> x %>% filter(geo_value == 'fl') %>% validate_epi_tibble_grouping()
  2. As a second approach, we could enforce that an appropriate grouping structure is set at the time the epi_tibble is created, and also validate that the rules are not broken any time the groupings are updated in epitools::group_by.epi_tibble... and, I guess, also by anything else that may incidentally introduce conflicts, e.g. bind_rows? A rough sketch of what that might look like is below.
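
A rough sketch of approach 2, assuming the validation is hooked into a group_by method for epi_tibble (the method below is illustrative, not actual package code):

group_by.epi_tibble = function(.data, ...) {
  # delegate to dplyr's usual grouping behavior first
  out = NextMethod()
  # then error if duplicate time_values remain within any group
  validate_epi_tibble_grouping(out)
  out
}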

I feel hazy enough about how approach 2 would work that I'm thinking the first approach might be safer. But I'm open to ideas about any part of this.

ryantibs (Member) commented

Thanks @elray1. Great points. I think the first solution is more in line with the philosophy we've been sticking to so far: keeping the data structure pretty minimal and putting the onus on the user (or on downstream functions) to make sure that things satisfy the proper requirements for certain functionality.

ryantibs (Member) commented

@elray1 Now that the package has developed a bit more, what's your current thinking on this? Is there something to do with the package in terms of new functionality, or new examples (in a vignette), or ??

elray1 (Collaborator, Author) commented Feb 24, 2022

Thanks for the reminder of this. Some thoughts:

  • The package has changed quite a bit since I wrote this! In particular, the switch to organizing functions like detect_outlr around variables rather than around tables means the proposal I made above is no longer directly applicable.
  • We've discussed on Slack that if an "incomplete grouping" is specified as the input to the slide function, the results may actually be meaningful, so I don't think we want that validation there.
  • But I think it probably makes sense to insert a check along these lines in the detect_outlr function (about here) verifying that there are no duplicate values in x; see the sketch after this list. I can submit a PR for that soon.
  • Perhaps a similar check should be done in growth_rate? Does that function produce meaningful results if it's provided with an epi_df that isn't grouped by all of the relevant key variables?
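
For reference, a minimal sketch of the duplicate check described in the third bullet, where x is the vector of time values for a single signal (the surrounding detect_outlr() code is assumed, not quoted):

# anyDuplicated() returns 0 when x has no repeated values
if (anyDuplicated(x) != 0) {
  stop("`x` cannot contain duplicate values. Did you group your `epi_df` ",
       "by all relevant key variables?")
}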

ryantibs (Member) commented Feb 24, 2022

Good points.

So in growth_rate(), I put in an option dup_rm, with default FALSE (along with na_rm, whose default is also FALSE), to signal whether we want to remove duplicates in x. And in the function body, that gets done here; a sketch of the idea is below. Maybe something like this makes sense for detect_outlr() as well?
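
Roughly, that removal might look like the following (an illustrative sketch, not the actual growth_rate() body):

# when dup_rm is set, drop later occurrences of duplicated x values,
# keeping y aligned with x
if (dup_rm) {
  keep = !duplicated(x)
  x = x[keep]
  y = y[keep]
}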

Notes (to myself, as much as anybody else ...) If dup_rm = FALSE and x contains duplicates, then in growth_rate(),

  • I actually think it won't necessarily break the percentage change and linear regression methods, the way I wrote them. They should probably work as is. So, e.g., you should be able to compute a growth rate with these methods, by pooling data across geo values.
  • (However, they may be inefficient in the sense that they perform the same computation multiple times in a row, since I'm doing purrr::map() on the x values ... this can be solved by looping over unique x values only and then recycling the results appropriately; see the sketch after this list. This is similar to what I'm doing in epi_slide(), for example.)
  • It could be that it breaks the smoothing spline and trend filtering methods, because these may throw an error if we have duplicate x values. Though they may be intelligent and just average the y points at duplicate x's, I'm not sure (can't remember).
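
A sketch of the "loop over unique x values and recycle" idea from the second bullet (compute_at() is a hypothetical stand-in for the per-point computation):

# compute once per unique x value, then map the results back to full length
ux = unique(x)
res_unique = purrr::map(ux, compute_at)  # one computation per unique x
res = res_unique[match(x, ux)]           # recycle results to the full length of x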

@qpmnguyen @rafaelcatoia Can you take a look at this (my bullet points), both for the sake of improving this for robustness/efficiency, and also for the sake of writing tests around this? For the former, you can open up a separate issue to improve the robustness/efficiency of growth_rate() with duplicate x values, when that duplication is passed by the user on purpose.

elray1 (Collaborator, Author) commented Feb 25, 2022

I just submitted a PR that addresses the part of this issue that's related to detect_outlr. Note that I went ahead with just throwing an error if there were duplicated values in x; for this function, I think it makes sense to work with one time series at a time and throw errors if anything suspicious is going on rather than trying to guess how to fix the user's mistake.

ryantibs (Member) commented Feb 25, 2022

Thanks! Will take a look shortly.

@qpmnguyen @rafaelcatoia Having slept on it, I'm not sure it's worth investigating how to get growth_rate() to work in situations where the user has passed duplicate x values on purpose. Initially I was thinking that this might be something they want to do to model a joint growth rate over locations, as an example, but even there I think it's probably simpler just to suitably preprocess/average the data and then call growth_rate() on a single time series. So probably not worth the effort to follow up on the things I suggested. Sorry for the back-and-forth.

ryantibs (Member) commented

Closed via #50 and #51.
