To get a better handle on custom `epi_workflow()`s, let's recreate and then
modify the `four_week_ahead` example from the [landing
page](../index.html#motivating-example):

```{r make-four-forecasts, warning=FALSE}
training_data <- covid_case_death_rates |>
  # ...
four_week_ahead <- arx_forecaster(
  # ...
)
four_week_ahead$epi_workflow
```

# Anatomy of an `epi_workflow`

An `epi_workflow()` is an extension of a `workflows::workflow()` to handle panel
data and post-processing.
It consists of 3 components, shown above:

1. Preprocessor: a collection of transformations drawn from the available
   `{recipes}` [steps](https://recipes.tidymodels.org/reference/index.html).
   Think of it as a more flexible `formula` that you would pass to `lm()`:
   `y ~ x1 + log(x2) + lag(x1, 5)`.
   The above model has 6 of these steps.
   In general, there are 2 broad classes of transformation that `{recipes}`
   handles:
   - Transforms of both training and test data that are always applied.
     Examples include taking the log of a variable, leading or lagging,
     filtering out rows, handling dummy variables, calculating growth rates,
     etc.
   - Operations that fit parameters during training to apply during prediction,
     such as centering by the mean.
     This is a major benefit of `{recipes}`, since it prevents data leakage,
     where information about the test/predict time data "leaks" into the
     parameters. <!-- TODO unsure if worth even keeping, as we effectively
     can't have data leakage. -->
     However, the main mechanism we rely on to prevent data leakage is proper
     [backtesting](backtesting.html).
     For the case of centering, we need to store the mean of the predictor from
     the training data and use that value on the prediction data, rather than
     accidentally calculating the mean of the test predictor for centering.
     See the sketch just after this list for a concrete example.
2. Trainer: a `{parsnip}` model, used to choose
   between a wide collection of statistical models.
3. Postprocessor: unique to this package, and used to format and modify the
   prediction after the model has been fit.
   Each operation is a `layer_`, and the stack of layers is known as `frosting()`,
   continuing the metaphor of baking a cake established in the recipe.
   Some example operations include:
   - generate quantiles from purely point-prediction models,
   - undo operations done in the steps, such as converting back to counts from rates,
   - threshold so the forecast doesn't include negative values,
   - generally adapt the format of the prediction to its eventual use.

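Returning to the two classes of `{recipes}` transforms above, here is a minimal
sketch of the second class (a fitted parameter), assuming the standard
`recipes::step_center()` composes with an `epi_recipe()`; this chunk is purely
illustrative and not part of the original forecaster:

```{r centering_sketch, eval=FALSE}
# prep() computes the column mean from the training data and stores it...
centering_recipe <- epi_recipe(training_data) |>
  recipes::step_center(case_rate) |>
  prep(training_data)
# ...and bake() reuses that stored mean, even on later data
centering_recipe |> bake(training_data)
```
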
# Recreating `four_week_ahead` in an `epi_workflow()`

To be able to extend this beyond what `arx_forecaster()` itself will let us do,
we first need to understand how to recreate it using a custom `epi_workflow()`.

To use a custom workflow, there are a couple of steps (an overview sketch
follows the list):

1. define the `epi_recipe()`, which contains the preprocessing steps
2. define the `frosting()`, which contains the post-processing layers
3. combine these with a trainer such as `quantile_reg()` into an
   `epi_workflow()`, which we can then fit on the training data
4. `fit()` the workflow on some data
5. grab the right prediction data using `get_test_data()` and apply the fit
   workflow to generate a prediction

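As an overview, the pieces fit together roughly like this hypothetical skeleton
(`my_recipe` and `my_layers` are placeholders; each real object is built in the
sections below):

```{r workflow_skeleton, eval=FALSE}
# hypothetical end-to-end shape of a custom epi_workflow()
my_workflow <- epi_workflow(my_recipe, quantile_reg(), my_layers)
my_fit <- my_workflow |> fit(training_data)
my_test_data <- get_test_data(my_recipe, training_data)
my_fit |> predict(my_test_data)
```
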
## Define the `epi_recipe()`

To do this, we'll first take a look at the steps as they're found in
`four_week_ahead$epi_workflow`:

```{r extract_four_week_recipe}
hardhat::extract_recipe(four_week_ahead$epi_workflow)
```

So there are 6 steps we will need to recreate.
One thing to note about the extracted recipe is that it has already been
trained; for steps which have parameters, such as `recipes::step_BoxCox()`,
this means that their parameters have been calculated.
Before defining steps, we need to create an `epi_recipe()` to hold them:

```{r make_recipe}
four_week_recipe <- epi_recipe(
  covid_case_death_rates |>
    # ...
)
```

The dataset passed to `epi_recipe()` serves as a template, defining the columns
the recipe will operate on, such as `geo_value`, `time_value`, or
`other_keys`; it is typically easiest to just use the training data itself.

Then we add each step via pipes; in principle the order matters, though for this
recipe only `step_epi_naomit()` and `step_training_window()` depend on the steps
before them:

```{r make_steps}
four_week_recipe <- four_week_recipe |>
  # ...
  step_training_window()
```

One thing to note: we only added 5 steps here because `step_epi_naomit()` is
actually a wrapper around adding 2 base `step_naomit()`s, one for
`all_predictors()` and one for `all_outcomes()`, differing in their treatment at
predict time; roughly, it expands to the sketch below.

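This is a minimal sketch of that expansion, assuming the `skip` argument is what
distinguishes the two calls (the exact internals of `step_epi_naomit()` may
differ):

```{r naomit_sketch, eval=FALSE}
# roughly what step_epi_naomit() adds: drop missing predictors always,
# but skip the outcome check at predict time (outcomes don't exist yet)
four_week_recipe |>
  recipes::step_naomit(all_predictors(), skip = FALSE) |>
  recipes::step_naomit(all_outcomes(), skip = TRUE)
```
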
`step_epi_lag()` and `step_epi_ahead()` both accept tidy syntax, so if for
example we wanted the same lags for both `case_rate` and `death_rate`, we could
have done `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.

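For instance, the lag/ahead portion of a recipe could be condensed like this
sketch (`ahead = 4 * 7` is an assumption matching the four-week horizon):

```{r tidyselect_sketch, eval=FALSE}
# one step lags every column ending in "rate"; the ahead targets death_rate
epi_recipe(training_data) |>
  step_epi_lag(ends_with("rate"), lag = c(0, 7, 14)) |>
  step_epi_ahead(death_rate, ahead = 4 * 7)
```
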
In general for the `{recipes}` package, steps assign roles, such as `predictor`
and `outcome`, to columns, either by adding new columns or adjusting existing
ones.
`step_epi_lag()`, for example, creates a new column for each lag with the name
`lag_x_column_name` and assigns them as predictors, while `step_epi_ahead()`
creates `ahead_x_column_name` columns and assigns them as outcomes.

One way to inspect the roles assigned is to use `prep()`:

```{r prep_recipe}
four_week_recipe |>
  prep(training_data) |>
  bake(training_data)
```

This is also useful for debugging malfunctioning pipelines; if you define
`four_week_recipe` only up to the step that is misbehaving, you can get a
partial evaluation to see the exact data the step is being applied to.
It also allows you to see the exact data that the parsnip model is fitting on.

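For example, a truncated recipe like this sketch stops right after a lag step,
so baking it shows exactly what that step produced (a hypothetical debugging
aid, not part of the forecaster):

```{r debug_sketch, eval=FALSE}
# define the recipe only up to the suspect step...
partial_recipe <- epi_recipe(training_data) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14))
# ...then bake to inspect the intermediate data
partial_recipe |>
  prep(training_data) |>
  bake(training_data)
```
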
## Define the `frosting()`

Since the post-processing frosting layers[^1] are unique to this package, to
inspect them we use `extract_frosting()` from `{epipredict}`:

```{r inspect_fwa_layers, warning=FALSE}
epipredict::extract_frosting(four_week_ahead$epi_workflow)
```

The above gives us detailed descriptions of the arguments to the functions named
above in the postprocessor of `four_week_ahead$epi_workflow`.
Creating the layers is a similar process, except with frosting instead of a
`recipe`[^2]:

```{r make_frosting}
four_week_layers <- frosting() |>
  # ...
  layer_threshold()
```

Most layers will work for any engine or steps; `layer_predict()` you will want
to call in every case.
There are a couple of layers, however, which depend on whether the engine used
predicts quantiles or point estimates.

The layers that are only supported by point estimate engines (such as
`linear_reg()`) are:

- `layer_residual_quantiles()`: the preferred method of generating quantiles for
  non-quantile models, it uses the error residuals of the engine.
  This will work for most parsnip engines.
- `layer_predictive_distn()`: an alternate method of generating quantiles, it
  uses an approximate parametric distribution. This will work for linear
  regression specifically.

TODO check this

On the other hand, the layers that are only supported by quantile estimating
engines (such as `quantile_reg()`) are:

- `layer_quantile_distn()`: adds the specified quantiles.
  If they differ from the ones actually fit, they will be interpolated and/or
  extrapolated.
- `layer_point_from_distn()`: this adds the median quantile as a point estimate,
  and should be called after `layer_quantile_distn()` if called at all.

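Putting those two together, a frosting stack for a quantile engine might look
like this sketch (the quantile levels here are illustrative assumptions):

```{r quantile_frosting_sketch, eval=FALSE}
# a possible frosting for quantile_reg()
quantile_layers <- frosting() |>
  layer_predict() |>
  layer_quantile_distn(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
  layer_point_from_distn() |>
  layer_add_forecast_date() |>
  layer_add_target_date() |>
  layer_threshold()
```
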
## Fitting an `epi_workflow()`

Now we can combine the recipe, the trainer, and the frosting into a workflow:

```{r build_workflow}
four_week_workflow <- epi_workflow(
  # ...
)
```

And fit it to recreate `four_week_ahead$epi_workflow`:

```{r workflow_fitting}
fit_workflow <- four_week_workflow |> fit(training_data)
```

## Predicting

To do a prediction, we need to first narrow the dataset down to the relevant
datapoints:

```{r grab_data}
relevant_data <- get_test_data(
  four_week_recipe,
  training_data
)
```

With a fit workflow and test data in hand, we can actually make our predictions:

```{r workflow_pred}
fit_workflow |> predict(relevant_data)
```

We could also predict on the full training dataset:

```{r workflow_pred_training}
fit_workflow |> predict(training_data)
```

The resulting tibble is 800 rows long, however.
This produces forecasts for not just the actual `forecast_date`, but for every
day in the dataset for which there is enough data to actually make a prediction.
To narrow this down, we could filter to rows where the `time_value` matches the
`forecast_date`:

```{r workflow_pred_training_filter}
fit_workflow |>
  predict(training_data) |>
  filter(time_value == forecast_date)
```

This can be useful for cases where `get_test_data()` doesn't pull sufficient
data.

# Extending `four_week_ahead`

There are many ways we could modify `four_week_ahead`; one simple modification
would be to include a growth rate estimate as part of the model.
Another would be to include a time component in the prediction (useful if we
expect there to be a strong seasonal component).
Another would be to convert from rates to counts, for example if that were the
preferred prediction format.

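As a sketch of the first idea, the recipe could gain a `step_growth_rate()`
alongside the lags; the lag and ahead values here are illustrative assumptions,
and the actual definition used below is not shown in this excerpt:

```{r growth_rate_sketch, eval=FALSE}
# hypothetical: add a growth-rate column as an extra predictor
growth_rate_recipe <- epi_recipe(training_data) |>
  step_epi_lag(case_rate, lag = c(0, 7, 14)) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
  step_growth_rate(case_rate) |>
  step_epi_ahead(death_rate, ahead = 4 * 7) |>
  step_epi_naomit() |>
  step_training_window()
```
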
With a growth-rate recipe defined, we assemble, fit, and predict in the same way
as before:

```{r growth_rate_workflow}
growth_rate_workflow <- epi_workflow(
  # ...
)

relevant_data <- get_test_data(
  growth_rate_recipe,
  training_data
)
gr_fit_workflow <- growth_rate_workflow |> fit(training_data)
gr_predictions <- gr_fit_workflow |>
  predict(relevant_data) |>
  filter(time_value == forecast_date)
```

Plotting the results against the data:

```{r plot_gr_result}
# ... (plot construction elided)
result_plot
```

TODO `get_test_data` isn't actually working here...

[^1]: Think of baking a cake, where adding the frosting is the last step in the
process of actually baking.

[^2]: Note that the frosting doesn't require any information about the training
data, since it operates only on the output of the model.

## Population scaling

Suppose we're sending our predictions to someone who is looking to understand
counts, rather than rates.
Then we can adjust just the frosting to get a forecast for the counts from our
rates forecaster:

```{r rate_scale}
count_layers <-
  frosting() |>
  layer_predict() |>
  layer_residual_quantiles(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
  layer_population_scaling(
    .pred,
    .pred_distn,
    df = epidatasets::state_census,
    df_pop_col = "pop",
    create_new = FALSE,
    rate_rescaling = 1e5,
    by = c("geo_value" = "abbr")
  ) |>
  layer_add_forecast_date() |>
  layer_add_target_date() |>
  layer_threshold()
# building the new workflow
count_workflow <- epi_workflow(
  four_week_recipe,
  linear_reg(),
  count_layers
)
count_pred_data <- get_test_data(four_week_recipe, training_data)
count_predictions <- count_workflow |>
  fit(training_data) |>
  predict(count_pred_data)
count_predictions
```

The resulting counts are 2-3 orders of magnitude larger than the corresponding
rates above.
`df` represents the scaling value; in this case it is the state populations,
while `rate_rescaling` gives the denominator of the rate (our fit values were
per 100,000).

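To make the rescaling concrete, this is the arithmetic
`layer_population_scaling()` performs for each location, written out by hand
with made-up numbers:

```{r scaling_arithmetic, eval=FALSE}
# count = rate * population / rate_rescaling
rate <- 0.5  # hypothetical prediction: 0.5 deaths per 100,000
pop <- 39e6  # hypothetical state population
rate * pop / 1e5
```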