@@ -12,7 +12,8 @@ knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
- out.width = "100%"
+ out.width = "90%",
+ fig.align = "center"
)
```
@@ -117,7 +118,7 @@ In the following sections, we will go over pre-processing the data in the
`epi_recipe` framework, and fitting a model and making predictions within the
`epipredict` framework and using the package's canned forecasters.

- # Simple autoregressive with 3 lags to predict number of graduates in a year
+ # Autoregressive (AR) model to predict number of graduates in a year

## Pre-processing

@@ -165,21 +166,25 @@ next year (time $t+1$) using an autoregressive model with three lags (i.e., an
AR(3) model). Such a model is represented algebraically like this:

\[
- x_{t+1} =
- \alpha_0 + \alpha_1 x_{t} + \alpha_2 x_{t-1} + \alpha_3 x_{t-2} + \epsilon_t
+ y_{t+1,ijk} =
+ \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk} + \epsilon_{tijk}
\]

- where $x_i$ is the proportion of graduates at time $i$, and the current time is
- $t$.
+ where $y_{tijk}$ is the proportion of graduates at time $t$ in location $i$ and
+ age group $j$ with education quality $k$.

In the pre-processing step, we need to create additional columns in `employ` for
- each of $x_{t+1}$, $x_{t}$, $x_{t-1}$, and $x_{t-2}$. We do this via an
+ each of $y_{t+1,ijk}$, $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$.
+ We do this via an
`epi_recipe`. Note that creating an `epi_recipe` alone doesn't add these
outcome and predictor columns; the recipe just stores the instructions for
adding them.

- Our `epi_recipe` should add one `ahead` column representing $x_{t+1}$ and
- 3 `lag` columns representing $x_{t}$, $x_{t-1}$, and $x_{t-2}$. Also note that
+ Our `epi_recipe` should add one `ahead` column representing $y_{t+1,ijk}$ and
+ 3 `lag` columns representing $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$
+ (strictly speaking, the 0th "lag" is the current value and only two are true
+ lags, but the processing treats all three uniformly as lags).
+ Also note that
since we specified our `time_type` to be `year`, our `lag` and `lead`
values are both in years.
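
The definition of the recipe `r` used below falls outside these hunks. As a
hedged sketch only, it plausibly looks like the following (the verbs are real
`epipredict` functions; the outcome column `num_graduates_prop` is taken from
the surrounding text):

```r
library(epipredict)

# One `ahead` column for the outcome at t+1, three `lag` columns (0, 1, 2),
# then drop the rows left incomplete by the shifts
r <- epi_recipe(employ_small) %>%
  step_epi_ahead(num_graduates_prop, ahead = 1) %>%
  step_epi_lag(num_graduates_prop, lag = c(0, 1, 2)) %>%
  step_epi_naomit()
```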
@@ -209,9 +214,9 @@ r %>% bake_and_show_sample(employ_small)
We can see that the `prep` and `bake` steps created new columns according to
our `epi_recipe`:

- - `ahead_1_num_graduates_prop` corresponds to $x_{t+1}$
+ - `ahead_1_num_graduates_prop` corresponds to $y_{t+1,ijk}$
- `lag_0_num_graduates_prop`, `lag_1_num_graduates_prop`, and
- `lag_2_num_graduates_prop` correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$
+ `lag_2_num_graduates_prop` correspond to $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$
respectively.
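
`bake_and_show_sample()` appears to be a helper defined elsewhere in the
vignette. A hedged equivalent using the underlying `recipes` verbs might be:

```r
library(recipes) # prep() and bake()
library(dplyr)   # sample_n()

# prep() estimates anything the recipe needs from the training data;
# bake() then applies the steps, materializing the ahead/lag columns
r %>%
  prep(employ_small) %>%
  bake(new_data = employ_small) %>%
  sample_n(5)
```
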

## Model fitting and prediction

@@ -224,8 +229,8 @@ engine `lm`, which fits a linear regression using ordinary least squares.
We will use `epi_workflow` with the `epi_recipe` we defined in the
pre-processing section along with the `parsnip::linear_reg()` model. Note
that `epi_workflow` is a container and doesn't actually do the fitting. We have
- to pass the workflow into `fit()` to get our model coefficients
- $\alpha_i, i=0,...,3$.
+ to pass the workflow into `fit()` to get our estimated model coefficients
+ $\widehat{\alpha}_i,\ i=0,...,3$.

```{r linearreg-wf, include=T}
wf_linreg <- epi_workflow(r, linear_reg()) %>%
@@ -234,13 +239,13 @@ summary(extract_fit_engine(wf_linreg))
```
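
The middle of the chunk above sits outside these hunks; based on the prose, a
hedged sketch of the complete pattern is:

```r
# epi_workflow() only bundles the recipe and model; fit() estimates the
# coefficients on the training data
wf_linreg <- epi_workflow(r, linear_reg()) %>%
  fit(employ_small)

summary(extract_fit_engine(wf_linreg)) # inspect the underlying lm object
```
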
This output tells us the coefficients of the fitted model; for instance,
- the intercept is $\alpha_0 = 0.24804$ and the coefficient for $x_{t}$ is
- $\alpha_1 = 0.06648$. The summary also tells us that only the intercept and lags
- at 2 years and 3 years ago have coefficients significantly greater than zero.
-
- Extracting the 95% confidence intervals for the coefficients also leads us to
- the same conclusion: all the coefficients except for $\alpha_1$ (lag 0) contain
- 0.
+ the estimated intercept is $\widehat{\alpha}_0 =$ `r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for $y_{tijk}$ is
+ $\widehat{\alpha}_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`.
+ The summary also tells us that all estimated coefficients are significantly
+ different from zero. Extracting the 95% confidence intervals for the
+ coefficients leads us to the same conclusion: none of the intervals contain 0.

```{r}
confint(extract_fit_engine(wf_linreg))
@@ -258,20 +263,20 @@ preds %>% sample_n(5)
```

We can do this using the `augment` function too. Note that `predict` and
- `augment` both still return an `epi_df` with all of the keys that were present
- in the original dataset.
+ `augment` both still return an `epiprocess::epi_df` with all of the keys that
+ were present in the original dataset.

- ```{r linearreg-augment, include=T, eval=F}
+ ```{r linearreg-augment}
augment(wf_linreg, latest) %>% sample_n(5)
```

## Model diagnostics

- First, we'll plot the residuals (that is, $y_{t} - \hat{y}_{t}$) against the
- fitted values ($\hat{y}_{t}$).
+ First, we'll plot the residuals (that is, $y_{tijk} - \widehat{y}_{tijk}$)
+ against the fitted values ($\widehat{y}_{tijk}$).

```{r lienarreg-resid-plot, include=T, fig.height = 5, fig.width = 5}
- par(mfrow = c(2, 2))
+ par(mfrow = c(2, 2), mar = c(5, 3, 1.2, 0))
plot(extract_fit_engine(wf_linreg))
```
@@ -286,30 +291,33 @@ The Q-Q plot shows us that the residuals have heavier tails than a Normal
distribution. So the normality of residuals assumption doesn't hold either.
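
As an illustrative complement to the Q-Q plot (an addition, not part of the
original vignette), a formal test of residual normality:

```r
# Shapiro-Wilk test; a small p-value is evidence against normality,
# consistent with the heavy tails seen in the Q-Q plot
# (note: shapiro.test() accepts at most 5000 observations)
shapiro.test(residuals(extract_fit_engine(wf_linreg)))
```
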
Finally, the residuals vs. leverage plot shows us that we have a few influential
- points based on the Cooks distance (those outside the red dotted line).
+ points based on Cook's distance (those outside the red dotted line).

Since we appear to be violating the linear model assumptions, we might consider
transforming our data differently, using a non-linear model, or
something else.

- # Autoregressive model with exogenous inputs
+ # AR model with exogenous inputs
- Now suppose we want to model the 5-year employment income using 3 lags, while
+ Now suppose we want to model the 1-step-ahead 5-year employment income using
+ the current and two previous values, while
also incorporating information from the other two time-series in our dataset:
the 2-year employment income and the number of graduates in the previous 2
years. We would do this using an autoregressive model with exogenous inputs,
defined as follows:

\[
- y_{t+1} =
- \alpha_0 + \alpha_1 y_{t} + \alpha_2 y_{t-1} + \alpha_3 y_{t-2} +
- \beta_1 x_{t} + \beta_2 x_{t-1}
- \gamma_2 z_{t} + \gamma_2 z_{t-1}
+ \begin{aligned}
+ y_{t+1,ijk} &=
+ \alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk}\\
+ &\quad + \beta_1 x_{tijk} + \beta_2 x_{t-1,ijk}\\
+ &\quad + \gamma_1 z_{tijk} + \gamma_2 z_{t-1,ijk} + \epsilon_{tijk}
+ \end{aligned}
\]

- where $y_i$ is the 5-year median income (proportion) at time $i$,
- $x_i$ is the 2-year median income (proportion) at time $i$, and
- $z_i$ is the number of graduates (proportion) at time $i$.
+ where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in location $i$, age group $j$, with education quality $k$),
+ $x_{tijk}$ is the 2-year median income (proportion) at time $t$, and
+ $z_{tijk}$ is the number of graduates (proportion) at time $t$.

## Pre-processing
@@ -318,9 +326,9 @@ Again, we construct an `epi_recipe` detailing the pre-processing steps.

```{r custom-arx, include=T}
rx <- epi_recipe(employ_small) %>%
step_epi_ahead(med_income_5y_prop, ahead = 1) %>%
- # 5-year median income has 3 lags c(0,1, 2)
- step_epi_lag(med_income_5y_prop, lag = c(0, 1, 2)) %>%
- # But the two exogenous variables have 2 lags c(0,1)
+ # 5-year median income: current value and two lags, c(0, 1, 2)
+ step_epi_lag(med_income_5y_prop, lag = 0:2) %>%
+ # But the two exogenous variables have current values and one lag, c(0, 1)
step_epi_lag(med_income_2y_prop, lag = c(0, 1)) %>%
step_epi_lag(num_graduates_prop, lag = c(0, 1)) %>%
step_epi_naomit()
@@ -335,14 +343,15 @@ steps using a few [`frosting`](
https://cmu-delphi.github.io/epipredict/reference/frosting.html) layers to do
a few things:

- 1. Convert our predictions back to income values and number of graduates,
- rather than standardized proportions. We do this via the frosting layer
- `layer_population_scaling`.
- 2. Threshold our predictions to 0. We are predicting proportions, which can't
+ 1. Threshold our predictions to 0. We are predicting proportions, which can't
be negative. And the values transformed back to dollars and people can't be
negative either.
- 3. Generate prediction intervals based on residual quantiles, allowing us to
+ 1. Generate prediction intervals based on residual quantiles, allowing us to
quantify the uncertainty associated with future predicted values.
+ 1. Convert our predictions back to income values and number of graduates,
+ rather than standardized proportions. We do this via the frosting layer
+ `layer_population_scaling()` (a sketch of the full pipeline follows this
+ list).
+
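A hedged sketch of the frosting pipeline those layers describe. The
`frosting()` and `layer_*()` functions are real `epipredict` verbs, but
`scale_df`, its `total` column, and the `by` keys are illustrative assumptions
standing in for the standardization sums constructed in the next chunk:

```r
f <- frosting() %>%
  layer_predict() %>%                     # generate the point predictions
  layer_threshold(.pred, lower = 0) %>%   # clip negative predictions to 0
  layer_residual_quantiles(quantile_levels = c(0.05, 0.95)) %>% # 90% intervals
  layer_population_scaling(
    .pred,
    df = scale_df,       # assumed: one row per key with the standardizing sums
    by = c("geo_value", "age_group", "edu_qual"),
    df_pop_col = "total" # assumed name of the column holding those sums
  )

# Frosting is attached to a fitted workflow before predicting
wfx_linreg <- epi_workflow(rx, parsnip::linear_reg()) %>%
  fit(employ_small) %>%
  add_frosting(f)
```
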
```{r custom-arx-post, include=T}
# Create dataframe of the sums we used for standardizing
@@ -373,18 +382,16 @@ wfx_linreg <- epi_workflow(rx, parsnip::linear_reg()) %>%
summary(extract_fit_engine(wfx_linreg))
```

- Based on the summary output for this model, only the intercept term, 5-year
- median income from 3 years ago, and the 2-year median income from 1 year ago
- are significant linear predictors for today's 5-year median income at a 95%
- confidence level. Both lags for the number of graduates were insignificant.
+ Based on the summary output for this model, we can examine confidence intervals
+ and perform hypothesis tests as usual.

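For instance, the same `confint()` pattern shown earlier applies here:

```r
# 95% confidence intervals for the ARX coefficient estimates
confint(extract_fit_engine(wfx_linreg))
```
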
Let's take a look at the predictions along with their 90% prediction intervals.

```{r}
latest <- get_test_data(recipe = rx, x = employ_small)
predsx <- predict(wfx_linreg, latest)
- # Display values within prediction intervals
+ # Display predictions along with prediction intervals
predsx %>%
select(
geo_value, time_value, edu_qual, age_group,
@@ -419,8 +426,8 @@ by the keys in our `epi_df`.
In this first example, we'll use `flatline_forecaster` to make a simple
prediction of the 2-year median income for the next year, based on one previous
time point. This model is represented algebraically as:
- \[ y_{t+1} = \alpha_0 + \alpha_0 y_{t} \]
- where $y_i$ is the 2-year median income (proportion) at time $i$.
+ \[ y_{t+1,ijk} = y_{tijk} + \epsilon_{tijk} \]
+ where $y_{tijk}$ is the 2-year median income (proportion) at time $t$.

```{r flatline, include=T, warning=F}
out_fl <- flatline_forecaster(employ_small, "med_income_2y_prop",
0 commit comments