Skip to content

Fix var shifting in advanced.Rmd: by geo group, handling time gaps #53

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 4, 2022

Conversation

brookslogan
Copy link
Contributor

Fix problem with lags not being calculated by geo group. Also edit to handle any time gaps within a geo group, or omissions at beginning or end of data set due to lag=0 not being included or covered by the range of shifts.

Current approach uses a lot of full_joins (potentially slow?) and makes prob_arx involve the separate := functionality of both rlang and data.table, which could be confusing. An alternative (below) would be using pivoting to perform the join, shown below. But this would be slightly more fragile to derive other forecasters from, as it requires the variables to have the same type, and if a variable is mapped to 0 rows, it will be silently omitted rather than filled in with NAs.

  # Build features and response for the AR model, and then fit it
  dat =
    # Make each (covariate,lag) request into a row of a tibble...
    tibble::tibble(i = seq_len(ncol(x)), lag=lags) %>%
    tidyr::unchop(lag) %>%
    # ... assign names we know will won't conflict...
    dplyr::mutate(name = paste0("x",seq_len(nrow(.)))) %>%
    # ... turn each of these requests into a separate tibble with indexing columns...
    dplyr::mutate(tbl_for_var_shift = purrr::pmap(., function(i, lag, name) {
      tibble::tibble(
                geo_value = geo_value,
                time_value = time_value + lag, # shifting values back by `lag` = shifting times forward by `lag`
                value = x[,i]
              )
    })) %>%
    dplyr::select(name, tbl_for_var_shift) %>%
    # ... as well as one for the desired shift of y...
    dplyr::bind_rows(tibble::tibble(
                               name = "y",
                               tbl_for_var_shift = list(tibble(
                                 geo_value = geo_value,
                                 time_value = time_value - ahead, # shifting variable forward by `ahead` = subtracting `lead` from time_values
                                 value = y
                               ))
                             )) %>%
    #  ... and combine them together.
    tidyr::unnest(tbl_for_var_shift) %>%
    tidyr::pivot_wider(id_cols=c(geo_value, time_value), names_from=name)

@brookslogan brookslogan force-pushed the fix-shift-var-prep-in-advanced-vignette branch from 2105797 to b53b3c8 Compare March 3, 2022 13:18
@brookslogan
Copy link
Contributor Author

brookslogan commented Mar 3, 2022

The approach to handling time gaps sort of breaks the test time selection due to creating some extra time_values that weren't in the original [and not being properly sorted by time!].

New approach to shifting more easily encountered issues with unsorted training
`time_value`s, which could cause `nafill` to corrupt earlier times (although
those times would probably be non-test times).

New approach to shifting introduced `time_values`s to `dat` beyond the test time
value, and led to errors when used in combination with the old test time
selection code.  Fix this by setting the test time looking at the user
`time_value` argument and not the expanded set of `time_value`s in `dat`.
@brookslogan
Copy link
Contributor Author

brookslogan commented Mar 3, 2022

This still has time indexing issues. The problem is that if we are making a forecast at fc_time_value, we are still only getting data through, at the latest, fc_time_value-1L (could also be even earlier depending on the data sources involved). So the test_time_value does not match the fc_time_value, which is probably what we would have wanted.

Also, separately, in some debug examples, I'm seeing what looks like ~14 days of latency in the doctor-visits CLI data. I need to double-check whether I'm reading this correctly and whether it's another bug. [This latency is reflected in direct as_of queries as well. The question now is whether LOCFing over all that is all right or not for this example.]

lcbrooks and others added 4 commits March 3, 2022 14:19
- Clean-up/simplify Logan's reimplementation of prob_arx() in the
  advanced vignette
- Move that same prob_arx() forecaster over to taking list args so
  that we can eventually move the implementation over to epipredict
- Allow an option in all three vignettes (slide, archive, advanced)
  for no intercept (default is no intercept)
@ryantibs
Copy link
Member

ryantibs commented Mar 4, 2022

So I just made a few small changes to the forecasters (third to last commit message gives details). Turns out turning off the intercept makes them look better overall. Will merge to main. Thanks very much @lcbrooks for the help (and for catching the bugs!!).

This closes #52 (although I only did it in the advanced vignette).

@ryantibs ryantibs merged commit 62c6f08 into main Mar 4, 2022
@dshemetov dshemetov deleted the fix-shift-var-prep-in-advanced-vignette branch November 15, 2023 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants