@@ -17,11 +17,11 @@ knitr::opts_chunk$set(
17
17
```
18
18
19
19
``` {r libraries}
20
- library(epiprocess)
21
- library(epipredict)
22
20
library(dplyr)
23
21
library(parsnip)
24
22
library(recipes)
23
+ library(epiprocess)
24
+ library(epipredict)
25
25
```
26
26
27
27
[ Panel data] ( https://en.wikipedia.org/wiki/Panel_data ) , or longitudinal data,
@@ -33,22 +33,21 @@ dataset, which contains daily state-wise measures of `case_rate` and
33
33
` death_rate ` for COVID-19 in 2021:
34
34
35
35
``` {r epi-panel-ex, include=T}
36
- head(case_death_rate_subset)
36
+ head(case_death_rate_subset, 3 )
37
37
```
38
38
39
39
` epipredict ` functions work with data in [ ` epi_df ` ] (
40
40
https://cmu-delphi.github.io/epiprocess/reference/epi_df.html )
41
41
format. Despite the stated goal and name of the package, other panel datasets
42
- are also valid candidates for ` epipredict ` functionality. Specifically, the
43
- ` epipredict ` framework and direct forecasters can work with any panel data, as
44
- long as it's in ` epi_df ` format.
42
+ are also valid candidates for ` epipredict ` functionality, as long as they are
43
+ in ` epi_df ` format.
45
44
46
45
``` {r employ-stats, include=F}
47
46
year_start <- min(grad_employ_subset$time_value)
48
47
year_end <- max(grad_employ_subset$time_value)
49
48
```
50
49
51
- ## Example panel data overview
50
+ # Example panel data overview
52
51
53
52
In this vignette, we will demonstrate using ` epipredict ` with employment data
54
53
from Statistics Canada. We will be using
@@ -80,7 +79,7 @@ just described:
80
79
``` {r employ-query, eval=F}
81
80
library(cansim)
82
81
83
- # Get original dataset
82
+ # Get statcan data using get_cansim, which returns a tibble
84
83
statcan_grad_employ <- get_cansim("37-10-0115-01")
85
84
86
85
gemploy <- statcan_grad_employ %>%
@@ -141,13 +140,12 @@ using [`as_epi_df`](
141
140
https://cmu-delphi.github.io/epiprocess/reference/as_epi_df.html )
142
141
with additional keys. In our case, the additional keys are ` age_group ` , ` fos `
143
142
and ` edu_qual ` . Note that in the above modifications, we encoded ` time_value `
144
- as type ` integer ` . This allows us to set ` time_type ` to ` "year" ` , and to ensure
143
+ as type ` integer ` . This lets us set ` time_type = "year" ` , and ensures that
145
144
lag and ahead modifications later on are using the correct time units. See the
146
145
[ ` epi_df ` documentation] (
147
146
https://cmu-delphi.github.io/epiprocess/reference/epi_df.html#time-types ) for
148
147
a list of all the ` type_type ` s available.
149
148
150
-
151
149
``` {r convert-to-epidf, eval=F}
152
150
grad_employ_subset <- gemploy %>%
153
151
tsibble::as_tsibble(
@@ -163,8 +161,22 @@ employ_rowcount <- format(nrow(grad_employ_subset), big.mark=",")
163
161
employ_colcount <- length(names(grad_employ_subset))
164
162
```
165
163
166
- The data contains ` r employ_rowcount ` rows and ` r employ_colcount ` columns. Now,
167
- we are ready to use ` grad_employ_subset ` with ` epipredict ` .
164
+ Now, we are ready to use ` grad_employ_subset ` with ` epipredict ` .
165
+ Our ` epi_df ` contains ` r employ_rowcount ` rows and ` r employ_colcount ` columns.
166
+ Here is a quick summary of the columns in our ` epi_df ` :
167
+
168
+ * ` time_value ` (time value): year in YYYY format
169
+ * ` geo_value ` (geo value): province in Canada
170
+ * ` num_graduates ` (raw, time series value): number of graduates
171
+ * ` med_income_2y ` (raw, time series value): median employment income 2 years
172
+ after graduation
173
+ * ` med_income_5y ` (raw, time series value): median employment income 5 years
174
+ after graduation
175
+ * ` age_group ` (key): one of two age groups, either 15 to 34 years, or 35 to 64
176
+ years
177
+ * ` fos ` (key): one of 60 unique fields of study
178
+ * ` edu_qual ` (key): one of 32 unique educational qualifications, e.g.,
179
+ "Master's disploma"
168
180
169
181
``` {r preview-data, include=T}
170
182
# Rename for simplicity
@@ -176,48 +188,83 @@ In the following sections, we will go over preprocessing the data in the
176
188
` epi_recipe ` framework, and fitting a model and making predictions within the
177
189
` epipredict ` framework and using the package's canned forecasters.
178
190
179
- ## Preprocessing
191
+ # Preprocessing
192
+
193
+ We will create an ` epi_recipe ` that adds one ` ahead ` column and 3 ` lag ` columns.
194
+ The ` ahead ` column tells us how many time units ahead to predict, and the ` lag `
195
+ columns tell us how many previous time points to include as covariates. We will
196
+ just work with one of the time series in our data for now: ` num_graduates ` .
180
197
181
- We will create a recipe that adds one ` ahead ` column and 3 ` lag ` columns.
198
+ In this preprocessing step, no computation really happens. It just provides
199
+ a series of steps that will be applied when using ` epi_workflow ` later. And note
200
+ that since we specified our ` time_type ` to be ` year ` , our ` lag ` and ` lead `
201
+ values are both in years.
182
202
183
203
``` {r make-recipe, include=T}
184
204
r <- epi_recipe(employ) %>%
185
205
step_epi_ahead(num_graduates, ahead = 1) %>% # lag & ahead units in years
186
- step_epi_lag(num_graduates, lag = c(0, 1, 1) ) %>%
206
+ step_epi_lag(num_graduates, lag = 0:2 ) %>%
187
207
step_epi_naomit()
188
208
r
189
209
```
190
210
191
- There is one ` raw ` role which includes our value column ` num_graduates ` , and two
192
- ` key ` roles which include our additional keys ` age_group ` , ` fos ` and
193
- ` edu_qual ` . Let's take a look at what these additional columns look like.
211
+ There are 3 ` raw ` roles which are our three lagged ` num_graduates ` columns, and
212
+ three ` key ` roles which are our additional keys ` age_group ` , ` fos ` and
213
+ ` edu_qual ` . Let's apply this recipe using ` prep ` and ` bake ` to see all of the
214
+ additional ` lag ` and ` ahead ` columns created.
194
215
195
216
``` {r view-preprocessed, include=T}
196
217
# Display a sample of the preprocessed data
197
218
baked_sample <- r %>% prep() %>% bake(new_data = employ) %>% sample_n(5)
198
219
baked_sample
199
220
```
200
221
201
- ## Model fitting and prediction
222
+ # Model fitting and prediction
223
+
224
+ ## Within recipes framework
225
+
226
+ We will look at a simple model: [ ` parsnip::linear_reg() ` ] (
227
+ https://parsnip.tidymodels.org/reference/linear_reg.html ) with the default
228
+ engine ` lm ` , which fits a linear regression using ordinary least squares.
229
+ Specifically, our model will be an autoregressive linear model with:
202
230
203
- ### Within recipes framework
231
+ * Outcome $y_ {t}$, the number of graduates at time $t$. Corresponds to
232
+ setting ` ahead = 1 ` in our recipe.
233
+ <!-- TODO i don't like the wording i used here -->
234
+ * Predictors $x_ {t-1}$, $x_ {t-2}$, and $x_ {t-3}$, the number of graduates in the
235
+ three consecutive years prior to time $t$. Corresponds to setting ` lag = 0:2 ` in
236
+ our recipe.
204
237
205
- We will look at a simple model: ` parsnip::linear_reg() ` with default engine
206
- ` lm ` . We can use ` epi_workflow ` with the ` epi_recipe ` we defined in the
207
- preprocessing section to fit an autoregressive linear model using lags at
208
- time $t$ (current), $t-1$ (last year), and $t-2$ (2 years ago).
238
+ The model is represented algebraically as follows:
239
+
240
+ \[
241
+ y_ {t+1} = \alpha_1 x_ {t} + \alpha_2 x_ {t-1} + \alpha_3 x_ {t-2} + \epsilon_t
242
+ \]
243
+
244
+ We will use ` epi_workflow ` with the ` epi_recipe ` we defined in the
245
+ preprocessing section to fit this model.
209
246
210
247
``` {r linearreg-wf, include=T}
211
248
wf_linreg <- epi_workflow(r, parsnip::linear_reg()) %>% fit(employ)
212
249
wf_linreg
213
250
```
214
251
252
+ Let's take a look at the model object
253
+
254
+ ``` {r linearreg-fit, include=T}
255
+ # extract the parsnip model object
256
+ wf_lr_fit <- wf_linreg$fit$fit
257
+ wf_lr_fit
258
+ ```
259
+
260
+ <!-- TODO add diagnostics -->
261
+
215
262
Now that we have our workflow, we can generate predictions from a subset of our
216
- data. For this demo, we will predict the employment counts from the last 12
217
- months of our dataset.
263
+ data. For this demo, we will predict the number of graduates from the last 2
264
+ years of our dataset.
218
265
219
266
``` {r linearreg-predict, include=T}
220
- latest <- employ %>% filter(time_value >= max(time_value) - 12 )
267
+ latest <- employ %>% filter(time_value >= max(time_value) - 2 )
221
268
preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
222
269
# Display a sample of the prediction values
223
270
head(preds)
@@ -226,7 +273,9 @@ head(preds)
226
273
Notice that ` predict ` still returns an ` epi_df ` with all of the keys that were
227
274
present in the original dataset.
228
275
229
- ### With canned forecasters
276
+ <!-- (TODO: residuals, predictions commentary) -->
277
+
278
+ ## With canned forecasters
230
279
231
280
Even though we aren't working with epidemiological data, canned forecasters
232
281
still work as expected, out of the box. We will demonstrate this with the simple
0 commit comments