bug in layer_add_forecast_date() #109

Closed
dajmcdon opened this issue Jul 20, 2022 · 5 comments · Fixed by #220
Labels: bug (Something isn't working), P0 (do this immediately)

Comments

@dajmcdon
Contributor

dajmcdon commented Jul 20, 2022

The default forecast_date (and target_date) seem wrong.

Using case_death_rate_subset, the forecast_date should default to 2021-12-31 (the most recent available date) but instead is 2022-01-28. I think this is because components$keys$time_value gets adjusted in the training data to add the ahead. We actually need the latest time_value in the test data (though if there isn't a lag 0, this wouldn't work either).

It may help to have access to the entire workflow at predict time, as suggested in #108.
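
The suspected mechanism can be illustrated with a toy example in base R. This is only a sketch under assumptions: `toy`, `ahead`, and `shifted_keys` are hypothetical names, and the arbitrary ahead of 7 is not meant to reproduce the 2022-01-28 value reported above. It shows only that taking the maximum of shifted keys overshoots the latest observed date by the size of the shift.

```r
# Toy illustration only: `toy`, `ahead`, and `shifted_keys` are hypothetical
# names, not objects from the package.
toy <- data.frame(
  geo_value = rep(c("ak", "ca"), each = 3),
  time_value = rep(as.Date("2021-12-29") + 0:2, times = 2)
)
ahead <- 7

# Assumed failure mode: the stored keys carry time_value shifted by `ahead`.
shifted_keys <- transform(toy, time_value = time_value + ahead)

buggy_default <- max(shifted_keys$time_value)  # read from the shifted keys
wanted_default <- max(toy$time_value)          # latest observed date

print(buggy_default)
print(wanted_default)
print(buggy_default - wanted_default)
```

If this is indeed what happens, the default has to be read from data whose time_value has not been shifted.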

dajmcdon added the bug and P0 labels on Jul 20, 2022
@rachlobay
Contributor

rachlobay commented Jul 20, 2022

Whoops... Posted in the related #108 by accident, so moving those comments over here.

Based on the ~July 6 discussion in the Components conundrum doc, the time_value from components$keys is currently used for both the forecast_date and target_date defaults. If you want the non-forged test data to be used instead, it still does not appear to be available in components, as I think you noted. For reference, I will post the example from that doc below to show what components contains for the layer_add_forecast_date test (it looks like the molded training data and the forged test data are there). If that is the case, components does not have what we need.

So what is the best way to get the original test data? Has something new been introduced that would make this a piece of cake? Right now, it isn't clear that object, components, the_recipe, or the_fit contains it. And even if we included the_frosting in slather(), would that contain the test data that is passed to predict()? I am not sure about that, so let me know if it does. Likewise for the_workflow.

We could always revert these to what I had a long time ago and have the user pass the (non-forged) test data to the layer directly as an argument, but I'd have thought there would be an easier way for something like this.

@rachlobay
Contributor

Components conundrum example setup (see test-layer_add_forecast_date.R):

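# Assumes the epipredict, recipes, and dplyr packages are installed.
library(epipredict)
library(recipes)
library(dplyr)

# Restrict the data to three states after 2021-11-01.
jhu <- case_death_rate_subset %>%
  dplyr::filter(time_value > "2021-11-01", geo_value %in% c("ak", "ca", "ny"))
# Lag the predictor by 0, 7, and 14 days and predict 7 days ahead.
r <- epi_recipe(jhu) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  step_naomit(all_predictors()) %>%
  step_naomit(all_outcomes(), skip = TRUE)
# Fit a linear model, then keep the last 15 days as test data.
wf <- epi_workflow(r, parsnip::linear_reg()) %>% fit(jhu)
latest <- jhu %>%
  dplyr::filter(time_value >= max(time_value) - 14)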

What (I think) we want to extract max(time_value) from:

> latest
An `epi_df` object, with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 45 × 4
   geo_value time_value case_rate death_rate
 * <chr>     <date>         <dbl>      <dbl>
 1 ak        2021-12-17      23.1      1.19 
 2 ca        2021-12-17      16.9      0.158
 3 ny        2021-12-17      67.1      0.310
 4 ak        2021-12-18      23.1      1.19 
 5 ca        2021-12-18      17.6      0.164
 6 ny        2021-12-18      71.0      0.309
 7 ak        2021-12-19      23.1      1.19 
 8 ca        2021-12-19      19.1      0.165
 9 ny        2021-12-19      84.1      0.319
10 ak        2021-12-20      23.2      1.17 
# … with 35 more rows

What components contains (this pertains to the test of the default forecast_date):

> components
$mold
$mold$predictors
# A tibble: 117 × 3
   lag_0_death_rate lag_7_death_rate lag_14_death_rate
              <dbl>            <dbl>             <dbl>
 1            0.336            1.74              0.395
 2            0.225            0.180             0.201
 3            0.168            0.192             0.171
 4            0.198            1.86              0.415
 5            0.198            0.210             0.186
 6            0.166            0.184             0.176
 7            0.198            1.80              0.376
 8            0.213            0.217             0.189
 9            0.169            0.191             0.177
10            0.593            1.78              0.316
# … with 107 more rows

$mold$outcomes
# A tibble: 117 × 1
   ahead_7_death_rate
                <dbl>
 1              0.474
 2              0.234
 3              0.170
 4              0.553
 5              0.251
 6              0.181
 7              0.553
 8              0.202
 9              0.166
10              0.296
# … with 107 more rows

$mold$blueprint
Recipe blueprint: 
 
# Predictors: 0 
  # Outcomes: 0 
   Intercept: FALSE 
Novel Levels: FALSE 
 Composition: tibble 

$mold$extras
$mold$extras$roles
$mold$extras$roles$time_value
# A tibble: 117 × 1
   time_value
   <date>    
 1 2021-11-16
 2 2021-11-16
 3 2021-11-16
 4 2021-11-17
 5 2021-11-17
 6 2021-11-17
 7 2021-11-18
 8 2021-11-18
 9 2021-11-18
10 2021-11-19
# … with 107 more rows

$mold$extras$roles$geo_value
# A tibble: 117 × 1
   geo_value
   <chr>    
 1 ak       
 2 ca       
 3 ny       
 4 ak       
 5 ca       
 6 ny       
 7 ak       
 8 ca       
 9 ny       
10 ak       
# … with 107 more rows

$mold$extras$roles$raw
# A tibble: 117 × 2
   case_rate death_rate
       <dbl>      <dbl>
 1      53.5      0.336
 2      13.2      0.225
 3      29.1      0.168
 4      53.3      0.198
 5      13.3      0.198
 6      29.8      0.166
 7      63.5      0.198
 8      14.5      0.213
 9      30.9      0.169
10      56.7      0.593
# … with 107 more rows




$forged
$forged$predictors
# A tibble: 108 × 3
   lag_0_death_rate lag_7_death_rate lag_14_death_rate
              <dbl>            <dbl>             <dbl>
 1               NA               NA                NA
 2               NA               NA                NA
 3               NA               NA                NA
 4               NA               NA                NA
 5               NA               NA                NA
 6               NA               NA                NA
 7               NA               NA                NA
 8               NA               NA                NA
 9               NA               NA                NA
10               NA               NA                NA
# … with 98 more rows

$forged$outcomes
NULL

$forged$extras
$forged$extras$roles
$forged$extras$roles$time_value
# A tibble: 108 × 1
   time_value
   <date>    
 1 2021-12-10
 2 2021-12-10
 3 2021-12-10
 4 2021-12-11
 5 2021-12-11
 6 2021-12-11
 7 2021-12-12
 8 2021-12-12
 9 2021-12-12
10 2021-12-13
# … with 98 more rows

$forged$extras$roles$geo_value
# A tibble: 108 × 1
   geo_value
   <chr>    
 1 ak       
 2 ca       
 3 ny       
 4 ak       
 5 ca       
 6 ny       
 7 ak       
 8 ca       
 9 ny       
10 ak       
# … with 98 more rows

$forged$extras$roles$raw
# A tibble: 108 × 2
   case_rate death_rate
       <dbl>      <dbl>
 1        NA         NA
 2        NA         NA
 3        NA         NA
 4        NA         NA
 5        NA         NA
 6        NA         NA
 7        NA         NA
 8        NA         NA
 9        NA         NA
10        NA         NA
# … with 98 more rows




$keys
An `epi_df` object, with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 108 × 2
   geo_value time_value
 * <chr>     <date>    
 1 ak        2021-12-10
 2 ca        2021-12-10
 3 ny        2021-12-10
 4 ak        2021-12-11
 5 ca        2021-12-11
 6 ny        2021-12-11
 7 ak        2021-12-12
 8 ca        2021-12-12
 9 ny        2021-12-12
10 ak        2021-12-13
# … with 98 more rows

$predictions
An `epi_df` object, with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2022-05-31 12:08:25

# A tibble: 108 × 3
   geo_value time_value .pred
   <chr>     <date>     <dbl>
 1 ak        2021-12-10    NA
 2 ca        2021-12-10    NA
 3 ny        2021-12-10    NA
 4 ak        2021-12-11    NA
 5 ca        2021-12-11    NA
 6 ny        2021-12-11    NA
 7 ak        2021-12-12    NA
 8 ca        2021-12-12    NA
 9 ny        2021-12-12    NA
10 ak        2021-12-13    NA
# … with 98 more rows

@dajmcdon
Contributor Author

dajmcdon commented Aug 3, 2022

It seems like the easiest thing at the moment is to adjust the slather() signature to also receive the test data (for all slather.layer_name() methods). Do you agree?
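
A minimal sketch of what such a signature change could look like, using plain S3 in base R. Everything here is an assumption made for illustration: the generic, the method body, the `new_data` argument name, and the toy inputs are simplified stand-ins, not the package's actual definitions.

```r
# Hypothetical sketch: a simplified stand-in for the generic, with an extra
# `new_data` argument carrying the unprocessed test data.
slather <- function(object, components, new_data, ...) {
  UseMethod("slather")
}

# Assumed method for a forecast-date layer: an explicit date wins; otherwise
# fall back to the latest time_value in the unprocessed test data.
slather.layer_add_forecast_date <- function(object, components, new_data, ...) {
  forecast_date <- object$forecast_date
  if (is.null(forecast_date)) {
    forecast_date <- max(new_data$time_value)
  }
  components$predictions$forecast_date <- as.Date(forecast_date)
  components
}

# Toy inputs standing in for what predict() would pass along.
layer <- structure(list(forecast_date = NULL), class = "layer_add_forecast_date")
new_data <- data.frame(
  geo_value = c("ak", "ca"),
  time_value = as.Date(c("2021-12-30", "2021-12-31"))
)
components <- list(
  predictions = data.frame(geo_value = c("ak", "ca"), .pred = c(0.5, 0.2))
)

out <- slather(layer, components, new_data)
print(out$predictions)
```

With this shape, each layer method decides for itself whether it needs `new_data`; methods that do not can simply ignore the argument.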

@dajmcdon
Contributor Author

dajmcdon commented Aug 3, 2022

I think we do that and also pass the entire workflow (as in #108).

One alternative (or addition) would be to add a step that detects the forecast date at training time. Then, given the workflow, we could look that up.
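
The training-time alternative could be sketched roughly as follows in base R. The names `train_forecast_date_step()` and `toy_workflow` are hypothetical, and a plain list stands in for the fitted workflow; this is an illustration of the idea, not how the package stores steps.

```r
# Hypothetical sketch: `train_forecast_date_step()` and `toy_workflow` are
# illustrative names, not part of the package.
train_forecast_date_step <- function(training) {
  # Record the most recent observation date seen while fitting.
  structure(
    list(forecast_date = max(training$time_value)),
    class = "forecast_date_step"
  )
}

training <- data.frame(
  geo_value = rep(c("ak", "ca"), each = 2),
  time_value = rep(as.Date(c("2021-12-30", "2021-12-31")), times = 2),
  death_rate = c(1.2, 1.1, 0.2, 0.3)
)

# A list stands in for the fitted workflow that a layer would receive.
toy_workflow <- list(steps = list(train_forecast_date_step(training)))

# At predict time, look for the recorded step and read its date.
is_fd_step <- vapply(
  toy_workflow$steps, inherits, logical(1), what = "forecast_date_step"
)
recorded_date <- toy_workflow$steps[[which(is_fd_step)[1]]]$forecast_date
print(recorded_date)
```

Note that a date recorded this way reflects the training data, so it matches the desired default only when the test data ends on the same date.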

@rachlobay
Contributor

rachlobay commented Aug 3, 2022

OK. I agree with the first suggestion to adjust the signature of slather() to also receive the test data. If you think that a step that detects the forecast date at training time will be useful beyond this function, then we could add that, but otherwise let's go for the simplest solution.
