@@ -23,6 +23,7 @@ library(recipes)
23
23
library(epiprocess)
24
24
library(epipredict)
25
25
library(ggplot2)
26
+ library(gridExtra)
26
27
```
27
28
28
29
[ Panel data] ( https://en.wikipedia.org/wiki/Panel_data ) , or longitudinal data,
@@ -50,8 +51,8 @@ year_end <- max(grad_employ_subset$time_value)
50
51
51
52
# Example panel data overview
52
53
53
- In this vignette, we will demonstrate using ` epipredict ` with employment data
54
- from Statistics Canada. We will be using
54
+ In this vignette, we will demonstrate using ` epipredict ` with employment panel
55
+ data from Statistics Canada. We will be using
55
56
[
56
57
Table 37-10-0115-01: Characteristics and median employment income of
57
58
longitudinal cohorts of postsecondary graduates two and five years after
@@ -61,7 +62,7 @@ from Statistics Canada. We will be using
61
62
62
63
The full dataset contains yearly median employment income two and five years
63
64
after graduation, and number of graduates. The data is stratified by
64
- variables such as geographic region (Canadian province), field of study , and
65
+ variables such as geographic region (Canadian province), education , and
65
66
age group. The year range of the dataset is ` r year_start ` to ` r year_end ` ,
66
67
inclusive. The full dataset also contains metadata that describes the
67
68
quality of data collected. For demonstration purposes, we make the following
@@ -72,7 +73,7 @@ modifications to get a subset of the full dataset:
72
73
* Only keep "good" or better quality data rows, as indicated by the [ ` STATUS ` ] (
73
74
https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol ) column
74
75
* Choose a subset of covariates and aggregate across the remaining ones. The
75
- chosen covariates are age group, field of study, and educational qualification.
76
+ chosen covariates are age group, and educational qualification.
76
77
77
78
Below is the query for obtaining the full data and code for subsetting it as we
78
79
just described:
@@ -144,7 +145,7 @@ gemploy <- statcan_grad_employ %>%
144
145
To use this data with ` epipredict ` , we need to convert it into ` epi_df ` format
145
146
using [ ` as_epi_df ` ] (
146
147
https://cmu-delphi.github.io/epiprocess/reference/as_epi_df.html )
147
- with additional keys. In our case, the additional keys are ` age_group ` , ` fos `
148
+ with additional keys. In our case, the additional keys are ` age_group ` ,
148
149
and ` edu_qual ` . Note that in the above modifications, we encoded ` time_value `
149
150
as type ` integer ` . This lets us set ` time_type = "year" ` , and ensures that
150
151
lag and ahead modifications later on are using the correct time units. See the
@@ -182,7 +183,6 @@ after graduation
182
183
after graduation
183
184
* ` age_group ` (key): one of two age groups, either 15 to 34 years, or 35 to 64
184
185
years
185
- * ` fos ` (key): one of 60 unique fields of study
186
186
* ` edu_qual ` (key): one of 32 unique educational qualifications, e.g.,
187
187
"Master's disploma"
188
188
@@ -196,16 +196,18 @@ In the following sections, we will go over preprocessing the data in the
196
196
` epi_recipe ` framework, and fitting a model and making predictions within the
197
197
` epipredict ` framework and using the package's canned forecasters.
198
198
199
- # Preprocessing
199
+ # A simple example
200
+ ## Preprocessing
200
201
201
- As a simple example, let's work with the ` num_graduates ` column for now.
202
+ As a simple example, let's work with the ` num_graduates ` column for now. We will
203
+ first pre-process by "standardizing" each numeric column by the total within
204
+ each group of keys. We do this since those raw numeric values will vary greatly
205
+ from province to province since there are large differences in population.
202
206
203
207
``` {r employ-small, include=T}
204
208
employ_small <- employ %>%
205
- # Incomplete data - exclude
206
- filter(geo_value != "Territories") %>%
207
- # Select groups where there are complete timeseries values
208
209
group_by(geo_value, age_group, edu_qual) %>%
210
+ # Select groups where there are complete timeseries values
209
211
filter(n() >= 6) %>%
210
212
mutate(
211
213
num_graduates_prop = num_graduates / sum(num_graduates),
@@ -231,16 +233,17 @@ employ_small %>%
231
233
ggtitle("Trend in # of Graduates by Age Group and Education in BC and ON")
232
234
```
233
235
234
- We will predict the number of graduates in the next year (time $t+1$) using an
235
- autoregressive model with three lags (i.e., an AR(3) model). Such a model is
236
- represented algebraically like this:
236
+ We will predict the "standardized" number of graduates (a proportion) in the
237
+ next year (time $t+1$) using an autoregressive model with three lags (i.e., an
238
+ AR(3) model). Such a model is represented algebraically like this:
237
239
238
240
\[
239
241
x_ {t+1} =
240
242
\phi_0 + \phi_1 x_ {t} + \phi_2 x_ {t-1} + \phi_3 x_ {t-2} + \epsilon_t
241
243
\]
242
244
243
- where $x_i$ is the number of graduates at time $i$, and the current time is $t$.
245
+ where $x_i$ is the proportion of graduates at time $i$, and the current time is
246
+ $t$.
244
247
245
248
In the preprocessing step, we need to create additional columns in ` employ ` for
246
249
each of $x_ {t+1}$, $x_ {t}$, $x_ {t-1}$, and $x_ {t-2}$. We do this via an
@@ -261,32 +264,25 @@ r <- epi_recipe(employ_small) %>%
261
264
r
262
265
```
263
266
264
- There are 3 ` raw ` roles which are our three lagged ` num_graduates ` columns, and
265
- three ` key ` roles which are our additional keys ` age_group ` , ` fos ` and
266
- ` edu_qual ` .
267
-
268
267
Let's apply this recipe using ` prep ` and ` bake ` to generate and view the ` lag `
269
268
and ` ahead ` columns.
270
269
271
270
``` {r view-preprocessed, include=T}
272
271
# Display a sample of the preprocessed data
273
- baked_sample <- r %>%
274
- prep() %>%
275
- bake(new_data = employ_small) %>%
272
+ baked_sample <- r %>% prep() %>% bake(new_data = employ_small) %>%
276
273
sample_n(5)
277
274
baked_sample
278
275
```
279
276
280
277
We can see that the ` prep ` and ` bake ` steps created new columns according to
281
278
our ` epi_recipe ` :
282
279
283
- - ` ahead_1_num_graduates ` corresponds to $x_ {t+1}$
284
- - ` lag_0_num_graduates ` , ` lag_1_num_graduates ` , and ` lag_2_num_graduates `
285
- correspond to $x_ {t}$, $x_ {t-1}$, and $x_ {t-2}$ respectively.
280
+ - ` ahead_1_num_graduates_prop ` corresponds to $x_ {t+1}$
281
+ - ` lag_0_num_graduates_prop ` , ` lag_1_num_graduates_prop ` , and
282
+ ` lag_2_num_graduates_prop ` correspond to $x_ {t}$, $x_ {t-1}$, and $x_ {t-2}$
283
+ respectively.
286
284
287
- # Model fitting and prediction
288
-
289
- ## Within recipes framework
285
+ ## Model fitting and prediction
290
286
291
287
Since our goal for now is to fit a simple autoregressive model, we can use
292
288
[ ` parsnip::linear_reg() ` ] (
@@ -306,8 +302,8 @@ wf_linreg
306
302
```
307
303
308
304
This output tells us the coefficients of the fitted model; for instance,
309
- the intercept is $\phi_0 = -2.2426 $ and the coefficient for $x_ {t}$ is
310
- $\phi_1 = 1.14401 $.
305
+ the intercept is $\phi_0 = 0.24804 $ and the coefficient for $x_ {t}$ is
306
+ $\phi_1 = 0.06648 $.
311
307
312
308
Now that we have our workflow, we can generate predictions from a subset of our
313
309
data. For this demo, we will predict the number of graduates using the last 2
@@ -320,24 +316,46 @@ preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
320
316
preds %>% head()
321
317
```
322
318
323
- We can do this using the ` augment ` function too:
319
+ We can do this using the ` augment ` function too. Note that ` predict ` and
320
+ ` augment ` both still return an ` epi_df ` with all of the keys that were present
321
+ in the original dataset.
322
+
324
323
``` {r linearreg-augment, include=T}
325
324
employ_small_with_preds <- augment(wf_linreg, latest)
326
325
employ_small_with_preds %>% head()
326
+ ```
327
+
328
+ ## AR(3) Model Diagnostics
329
+
330
+ First, we'll plot the residuals (that is, $y_ {t} - \hat{y}_ {t}$) against the
331
+ fitted values ($\hat{y}_ {t}$).
327
332
328
- employ_small_with_preds %>%
329
- mutate(resid = med_income_2y - .pred) %>%
330
- ggplot(aes(x = .pred, y = resid, color = geo_value)) +
331
- geom_point() +
333
+ ``` {r lienarreg-resid-plot, include=T, warning=F}
334
+ employ_small_with_preds <- employ_small_with_preds %>%
335
+ mutate(resid = num_graduates_prop - .pred)
336
+
337
+ p1 <- employ_small_with_preds %>%
338
+ ggplot(aes(x = .pred, y = resid)) +
339
+ geom_point(size = 1.5, alpha = .8) +
340
+ geom_smooth(method = "loess", color = "red", linetype = "dashed", size = .7) +
332
341
xlab("Fitted values") +
333
342
ylab("Residuals") +
334
- ggtitle("Plot of fitted values vs. residuals")
343
+ ggtitle("Plot of Fitted Values vs. Residuals in AR(3) Model")
344
+
345
+ p2 <- employ_small_with_preds %>%
346
+ ggplot(aes(sample = resid)) + stat_qq(alpha = .6) + stat_qq_line() +
347
+ ggtitle("Q-Q Plot of Residuals")
348
+
349
+ grid.arrange(p1, p2, ncol = 2)
335
350
```
336
351
337
- Notice that ` predict ` and ` augment ` both still returns an ` epi_df ` with all of
338
- the keys that were present in the original dataset.
352
+ The left plot shows us that the residuals are mostly clustered around zero,
353
+ but do not form an even band around the zero line, indicating that the variance
354
+ of the residuals is not constant. The right Q-Q plot shows us that the residuals
355
+ have heavier tails than a Normal distribution. So both the constant variance and
356
+ normality assumptions of the linear model have been violated.
339
357
340
- ## With canned forecasters
358
+ # Using canned forecasters
341
359
342
360
Even though we aren't working with epidemiological data, canned forecasters
343
361
still work as expected, out of the box. We will demonstrate this with the simple
0 commit comments