Commit bb2af5b

committed: revising custom_epiworkflows
1 parent 2f41842 commit bb2af5b

3 files changed: +119 −46 lines changed

vignettes/custom_epiworkflows.Rmd

Lines changed: 117 additions & 46 deletions
@@ -25,7 +25,7 @@ library(epidatr)
 
 To get a better handle on custom `epi_workflow()`s, let's recreate and then
 modify the example of `four_week_ahead` from the [landing
-page](../index.html#motivating-example),
+page](../index.html#motivating-example)
 
 ```{r make-four-forecasts, warning=FALSE}
 training_data <- covid_case_death_rates |>
@@ -43,7 +43,7 @@ four_week_ahead <- arx_forecaster(
 four_week_ahead$epi_workflow
 ```
 # Anatomy of an `epi_workflow`
-An `epi_workflow` is an extension of a `workflows::workflow()` to handle panel
+An `epi_workflow()` is an extension of a `workflows::workflow()` to handle panel
 data and post-processing.
 It consists of 3 components, shown above:
 
@@ -52,17 +52,21 @@ It consists of 3 components, shown above:
 steps](https://recipes.tidymodels.org/reference/index.html).
 Think of it as a more flexible `formula` that you would pass to `lm()`: `y ~
 x1 + log(x2) + lag(x1, 5)`.
+The above model has 6 of these steps.
 In general, there are 2 broad classes of transformation that `{recipes}`
 handles:
-- Transforms of both training and test data that are always applied.
+- Transforms of both training and test data that are always applied.
 Examples include taking the log of a variable, leading or lagging,
-filtering out rows, handling dummy variables, etc.
+filtering out rows, handling dummy variables, calculating growth rates,
+etc.
 - Operations that fit parameters during training to apply during prediction,
 such as centering by the mean.
 This is a major benefit of `{recipes}`, since it prevents data leakage,
 where information about the test/predict time data "leaks" into the
 parameters. <!-- TODO unsure if worth even keeping, as we effectively
 can't have data leakage. -->
+However, the main mechanism we rely on to prevent data leakage is proper
+[backtesting](backtesting.html).
 For the case of centering, we need to store the mean of the predictor from
 the training data and use that value on the prediction data rather than
 accidentally calculating the mean of the test predictor for centering.
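
As an aside, the distinction between these two classes of steps is easy to see with plain `{recipes}` on toy data; a minimal sketch (the data and steps here are illustrative, not from the vignette):

```r
library(recipes)

# Toy data standing in for real training and test sets
toy_train <- data.frame(y = rnorm(20), x = rexp(20) + 1)
toy_test <- data.frame(y = rnorm(5), x = rexp(5) + 1)

rec <- recipe(y ~ x, data = toy_train) |>
  step_log(x) |>    # class 1: always applied, nothing to learn
  step_center(x)    # class 2: learns the training mean at prep() time

trained <- prep(rec, training = toy_train)
# bake() reuses the mean stored from training rather than recomputing it
# from the test set, which is what prevents leakage in this example
bake(trained, new_data = toy_test)
```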
@@ -74,24 +78,27 @@ It consists of 3 components, shown above:
 between a wide collection of statistical models.
 3. Postprocessor: unique to this package, and used to format and modify the
 prediction after the model has been fit.
-Each operation is a layer, and the stack of layers is known as frosting,
+Each operation is a `layer_`, and the stack of layers is known as `frosting()`,
 continuing the metaphor of baking a cake established in the recipe.
 Some example operations include:
-- generate quantiles from purely point-prediction models,
-- undo operations done in the steps, such as convert back to counts from rates
-- threshold so the forecast doesn't include negative values
-- generally adapt the format of the prediction to it's eventual use.
+- generate quantiles from purely point-prediction models,
+- undo operations done in the steps, such as converting back to counts from rates,
+- threshold so the forecast doesn't include negative values,
+- generally adapt the format of the prediction to its eventual use.
 
 # Recreating `four_week_ahead` in an `epi_workflow()`
 To be able to extend this beyond what `arx_forecaster()` itself will let us do,
 we first need to understand how to recreate it using a custom `epi_workflow()`.
 
 To use a custom workflow, there are a couple of steps:
+
 1. define the `epi_recipe()`, which contains the preprocessing steps
 2. define the `frosting()`, which contains the post-processing layers
-3. Combine these with a trainer such as `quantile_reg()` into an `epi_workflow`, which we can then fit on the training data
-4. fit the workflow on some data
-5. grab the right prediction data using `get_test_data()` and apply the fit data to generate a prediction
+3. combine these with a trainer such as `quantile_reg()` into an
+`epi_workflow()`, which we can then fit on the training data
+4. `fit()` the workflow on some data
+5. grab the right prediction data using `get_test_data()` and apply the fitted
+workflow to generate a prediction
 
 ## Define the `epi_recipe()`
 To do this, we'll first take a look at the steps as they're found in
@@ -103,9 +110,9 @@ hardhat::extract_recipe(four_week_ahead$epi_workflow)
 
 So there are 6 steps we will need to recreate.
 One thing to note about the extracted recipe is that it has already been
-trained; for steps such as `recipes::step_BoxCox()`, this means that their
+trained; for steps that have parameters, such as `recipes::step_BoxCox()`, this means that their
 parameters have been calculated.
-Recreating this:
+Before defining steps, we need to create an `epi_recipe()` to hold them:
 ```{r make_recipe}
 four_week_recipe <- epi_recipe(
   covid_case_death_rates |>
@@ -120,7 +127,7 @@ or `other_keys`; it is typically easiest to just use the training data itself.
 
 
 Then we add each step via pipes; in principle the order matters, though for this
-recipe only `step_epi_naomit` and `step_training_window` depend on the steps
+recipe only `step_epi_naomit()` and `step_training_window()` depend on the steps
 before them:
 
 ```{r make_steps}
@@ -132,20 +139,21 @@ four_week_recipe <- four_week_recipe |>
 step_training_window()
 ```
 
-one thing to note: we only added 5 steps here because `step_epi_naomit()` is
-actually a wrapper around adding 2 base `step_naomit()`s, on for
+One thing to note: we only added 5 steps here because `step_epi_naomit()` is
+actually a wrapper around adding 2 base `step_naomit()`s, one for
 `all_predictors()` and one for `all_outcomes()`, differing in their treatment at
 predict time.
 
-`step_epi_lag` and `step_epi_ahead` both accept tidy syntax, so if we
-wanted the same lags for both `case_rate` and `death_rate`, we could have done
-`step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.
+`step_epi_lag()` and `step_epi_ahead()` both accept tidy syntax, so if for
+example we wanted the same lags for both `case_rate` and `death_rate`, we could
+have done `step_epi_lag(ends_with("rate"), lag = c(0, 7, 14))`.
 
 In general for the `{recipes}` package, steps assign roles, such as `predictor`
-and `outcome` to columns, either by adding new ones or adjusting existing ones.
-`step_epi_lag()` for example, creates a new column per lag with the name
+and `outcome`, to columns, either by adding new columns or adjusting existing
+ones.
+`step_epi_lag()`, for example, creates a new column for each lag with the name
 `lag_x_column_name` and assigns them as predictors, while `step_epi_ahead()`
-creates `ahead_column_name` columns and assigns them as outcomes.[^2]
+creates `ahead_x_column_name` columns and assigns them as outcomes.
 
 One way to inspect the roles assigned is to use `prep()`:
 
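As an illustrative aside, the tidy-select shorthand mentioned above would let the whole recipe be written a little more compactly; a hypothetical sketch, assuming the `training_data` and lag/ahead choices from earlier:

```r
# Hypothetical sketch: one step_epi_lag() call covering both rate columns
compact_recipe <- epi_recipe(training_data) |>
  step_epi_lag(ends_with("rate"), lag = c(0, 7, 14)) |>  # lags case_rate and death_rate
  step_epi_ahead(death_rate, ahead = 4 * 7) |>           # 4-week-ahead outcome
  step_epi_naomit() |>
  step_training_window()
```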
@@ -163,9 +171,14 @@ four_week_recipe |>
 bake(training_data)
 ```
 
+This is also useful for debugging malfunctioning pipelines; if you define
+`four_week_recipe` only up to the step that is misbehaving, you can get a
+partial evaluation to see the exact data the step is applied to (see the sketch below).
+It also allows you to see the exact data that the parsnip model is fitting on.
+
 ## Define the `frosting()`
-Since the post-processing frosting[^1] layers are unique to this package, to
-inspect them we use `extract_frosting` from `{epipredict}`:
+Since the post-processing frosting layers[^1] are unique to this package, to
+inspect them we use `extract_frosting()` from `{epipredict}`:
 
 ```{r inspect_fwa_layers, warning=FALSE}
 epipredict::extract_frosting(four_week_ahead$epi_workflow)
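
Relatedly, here is a hypothetical sketch of the partial-evaluation debugging trick mentioned above: truncate the recipe at the step you want to inspect, then `prep()` and `bake()` it (the truncation point here is made up):

```r
# Hypothetical sketch: stop the recipe right after a lag step to see
# exactly what data the next step would receive
partial_recipe <- epi_recipe(training_data) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14))

partial_recipe |>
  prep(training_data) |>
  bake(training_data)
```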
@@ -174,7 +187,7 @@ epipredict::extract_frosting(four_week_ahead$epi_workflow)
 The above gives us detailed descriptions of the arguments to the functions named
 above in the postprocessor of `four_week_ahead$epi_workflow`.
 Creating the layers is a similar process, except with frosting instead of a
-`recipe`:
+`recipe`[^2]:
 
 ```{r make_frosting}
 four_week_layers <- frosting() |>
@@ -185,26 +198,31 @@ four_week_layers <- frosting() |>
 layer_threshold()
 ```
 
+
 Most layers will work for any engine or steps; you will want to call
 `layer_predict()` in every case.
 There are a couple of layers, however, which depend on whether the engine used predicts quantiles or point estimates.
 
-The layers that are only supported by point estimate engines (such as `linear_reg()`):
+The layers that are only supported by point estimate engines (such as
+`linear_reg()`) are:
 
-- `layer_residual_quantiles()`: the preferred method of generating quantiles, it
-uses the error residuals of the engine. This will work for most parsnip engines.
+- `layer_residual_quantiles()`: the preferred method of generating quantiles for
+non-quantile models, it uses the error residuals of the engine.
+This will work for most parsnip engines.
 - `layer_predictive_distn()`: alternate method of generating quantiles, it uses
 an approximate parametric distribution. This will work for linear regression
 specifically.
 
 TODO check this
 
 On the other hand, the layers that are only supported by quantile estimating
-engines (such as `quantile_reg()`):
+engines (such as `quantile_reg()`) are:
 
-- `layer_quantile_distn()`
-- `layer_point_from_distn()`: this adds the median quantile as an point
-estimate[^3].
+- `layer_quantile_distn()`: adds the specified quantiles.
+If they differ from the ones actually fit, they will be interpolated and/or
+extrapolated.
+- `layer_point_from_distn()`: this adds the median quantile as a point estimate,
+and should be called after `layer_quantile_distn()` if called at all.
 
 ## Fitting an `epi_workflow()`
 
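For the quantile-engine case described above, a frosting might look something like the following sketch (the layer names are real `{epipredict}` layers, but the quantile levels are made up):

```r
# Hypothetical sketch: frosting for a quantile-estimating engine such as
# quantile_reg(), using the two quantile-specific layers described above
quantile_layers <- frosting() |>
  layer_predict() |>
  layer_quantile_distn(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
  layer_point_from_distn() |>  # after layer_quantile_distn(), as required
  layer_add_forecast_date() |>
  layer_add_target_date() |>
  layer_threshold()
```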
@@ -217,28 +235,23 @@ four_week_workflow <- epi_workflow(
 )
 ```
 
-And fit it[^4]:
+And fit it to recreate `four_week_ahead$epi_workflow`:
 ```{r workflow_fitting}
 fit_workflow <- four_week_workflow |> fit(training_data)
 ```
 
-This has recreated `four_week_ahead$epi_workflow`
-
 ## Predicting
 
 To do a prediction, we need to first narrow the dataset down to the relevant
 datapoints:
 
 ```{r grab_data}
 relevant_data <- get_test_data(
-  extract_preprocessor(fit_workflow),
+  four_week_recipe,
   training_data
 )
 ```
 
-Note the use of `extract_preprocessor()` rather than `extract_recipe()`; this
-gets the unfit version of the `epi_recipe`, which is necessary.
-
 With a fit workflow and test data in hand, we can actually make our predictions:
 
 ```{r workflow_pred}
@@ -252,15 +265,26 @@ predictions:
 fit_workflow |> predict(training_data)
 ```
 
-However, this produces forecasts for not just the actual `forecast_date`, but
-for every day in the dataset it has enough data to actually make a prediction.
+The resulting tibble is 800 rows long, however.
+This produces forecasts for not just the actual `forecast_date`, but for every
+day in the dataset it has enough data to actually make a prediction.
+To narrow this down, we could filter to rows where the `time_value` matches the `forecast_date`:
+
+```{r workflow_pred_training_filter}
+fit_workflow |>
+  predict(training_data) |>
+  filter(time_value == forecast_date)
+```
+
+This can be useful for cases where `get_test_data()` doesn't pull sufficient
+data.
 
 # Extending `four_week_ahead`
 
 There are many ways we could modify `four_week_ahead`; one simple modification
 would be to include a growth rate estimate as part of the model.
-Another would be to include a time component to the prediction if we expect
-there to be a strong seasonal variation.
+Another would be to include a time component to the prediction (useful if we
+expect there to be a strong seasonal component).
 Another would be to convert from rates to counts, for example if that were the
 preferred prediction format.
 
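The growth-rate variant is developed next (the actual `growth_rate_recipe` used below is defined in lines outside this diff); purely as a sketch, such a recipe might look like the following, assuming `{epipredict}`'s `step_growth_rate()`:

```r
# Hypothetical sketch: the earlier recipe, plus a growth rate of case_rate
# as an extra predictor; the real definition may differ
growth_rate_recipe_sketch <- epi_recipe(training_data) |>
  step_epi_lag(case_rate, lag = c(0, 7, 14)) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
  step_epi_ahead(death_rate, ahead = 4 * 7) |>
  step_growth_rate(case_rate) |>  # defaults to a relative-change estimate
  step_epi_naomit() |>
  step_training_window()
```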
@@ -313,11 +337,11 @@ growth_rate_workflow <- epi_workflow(
 )
 
 relevant_data <- get_test_data(
-  extract_preprocessor(fit_workflow),
+  growth_rate_recipe,
   training_data
 )
 gr_fit_workflow <- growth_rate_workflow |> fit(training_data)
-gr_predictions <- fit_workflow |>
+gr_predictions <- gr_fit_workflow |>
   predict(relevant_data) |>
   filter(time_value == forecast_date)
 ```
@@ -354,3 +378,50 @@ result_plot
 ```
 
 TODO `get_test_data` isn't actually working here...
+
+[^1]: Think of baking a cake, where adding the frosting is the last step in the process of actually baking.
+
+[^2]: Note that the frosting doesn't require any information about the training
+data, since the output of the model only depends on the model used.
+
+## Population scaling
+
+Suppose we're sending our predictions to someone who is looking to understand
+counts, rather than rates.
+Then we can adjust just the frosting to get a forecast for the counts from our
+rates forecaster:
+
+```{r rate_scale}
+count_layers <-
+  frosting() |>
+  layer_predict() |>
+  layer_residual_quantiles(quantile_levels = c(0.1, 0.25, 0.5, 0.75, 0.9)) |>
+  layer_population_scaling(
+    .pred,
+    .pred_distn,
+    df = epidatasets::state_census,
+    df_pop_col = "pop",
+    create_new = FALSE,
+    rate_rescaling = 1e5,
+    by = c("geo_value" = "abbr")
+  ) |>
+  layer_add_forecast_date() |>
+  layer_add_target_date() |>
+  layer_threshold()
+# building the new workflow
+count_workflow <- epi_workflow(
+  four_week_recipe,
+  linear_reg(),
+  count_layers
+)
+count_pred_data <- get_test_data(four_week_recipe, training_data)
+count_predictions <- count_workflow |>
+  fit(training_data) |>
+  predict(count_pred_data)
+count_predictions
+```
+
+These are 2-3 orders of magnitude larger than the corresponding rates above.
+`df` represents the scaling value; in this case it is the state populations,
+while `rate_rescaling` gives the denominator of the rate (our fit values were
+per 100,000).
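
As an aside, `{epipredict}` also has a recipe-side counterpart, `step_population_scaling()`, for the opposite direction: converting count inputs to per-capita rates before fitting. A hypothetical sketch, where the counts dataset and column names are made up:

```r
# Hypothetical sketch: if our input data were counts (training_counts,
# case_count, and death_count are placeholders), a recipe step could
# convert them to rates per 100k before any lagging
count_input_recipe <- epi_recipe(training_counts) |>
  step_population_scaling(
    case_count, death_count,
    df = epidatasets::state_census,
    df_pop_col = "pop",
    rate_rescaling = 1e5,
    by = c("geo_value" = "abbr"),
    create_new = FALSE
  )
```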

vignettes/epipredict.Rmd

Lines changed: 2 additions & 0 deletions
@@ -132,6 +132,8 @@ hardhat::extract_fit_engine(two_week_ahead$epi_workflow)
 
 The default trainer is `parsnip::linear_reg()`, which generates quantiles after
 the fact in the post-processing layers, rather than as part of the model.
+While this does work, it is generally preferable to use `quantile_reg()`, as the
+quantiles generated in this manner can be poorly behaved.
 `quantile_reg()`, on the other hand, directly estimates different linear models
 for each quantile, reflected in the 3 different columns for `tau` above.
 
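A hypothetical sketch of making that swap (the data object and quantile levels are placeholders; in older releases the argument may be `tau` rather than `quantile_levels`):

```r
# Hypothetical sketch: the same forecaster, but with a quantile regression
# trainer that fits one linear model per requested quantile level
quantile_fc <- arx_forecaster(
  training_data,
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  trainer = quantile_reg(quantile_levels = c(0.1, 0.5, 0.9))
)
```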