Commit f114c65

committed
edits, plots
1 parent 0428c40 commit f114c65

1 file changed

+57
-39
lines changed

vignettes/panel-data.Rmd

Lines changed: 57 additions & 39 deletions
@@ -23,6 +23,7 @@ library(recipes)
 library(epiprocess)
 library(epipredict)
 library(ggplot2)
+library(gridExtra)
 ```
 
 [Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data,
@@ -50,8 +51,8 @@ year_end <- max(grad_employ_subset$time_value)
 
 # Example panel data overview
 
-In this vignette, we will demonstrate using `epipredict` with employment data
-from Statistics Canada. We will be using
+In this vignette, we will demonstrate using `epipredict` with employment panel
+data from Statistics Canada. We will be using
 [
 Table 37-10-0115-01: Characteristics and median employment income of
 longitudinal cohorts of postsecondary graduates two and five years after
@@ -61,7 +62,7 @@ from Statistics Canada. We will be using
 
 The full dataset contains yearly median employment income two and five years
 after graduation, and number of graduates. The data is stratified by
-variables such as geographic region (Canadian province), field of study, and
+variables such as geographic region (Canadian province), education, and
 age group. The year range of the dataset is `r year_start` to `r year_end`,
 inclusive. The full dataset also contains metadata that describes the
 quality of data collected. For demonstration purposes, we make the following
@@ -72,7 +73,7 @@ modifications to get a subset of the full dataset:
 * Only keep "good" or better quality data rows, as indicated by the [`STATUS`](
 https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol) column
 * Choose a subset of covariates and aggregate across the remaining ones. The
-chosen covariates are age group, field of study, and educational qualification.
+chosen covariates are age group, and educational qualification.
 
 Below is the query for obtaining the full data and code for subsetting it as we
 just described:
@@ -144,7 +145,7 @@ gemploy <- statcan_grad_employ %>%
 To use this data with `epipredict`, we need to convert it into `epi_df` format
 using [`as_epi_df`](
 https://cmu-delphi.github.io/epiprocess/reference/as_epi_df.html)
-with additional keys. In our case, the additional keys are `age_group`, `fos`
+with additional keys. In our case, the additional keys are `age_group`,
 and `edu_qual`. Note that in the above modifications, we encoded `time_value`
 as type `integer`. This lets us set `time_type = "year"`, and ensures that
 lag and ahead modifications later on are using the correct time units. See the
@@ -182,7 +183,6 @@ after graduation
 after graduation
 * `age_group` (key): one of two age groups, either 15 to 34 years, or 35 to 64
 years
-* `fos` (key): one of 60 unique fields of study
 * `edu_qual` (key): one of 32 unique educational qualifications, e.g.,
 "Master's disploma"
 
@@ -196,16 +196,18 @@ In the following sections, we will go over preprocessing the data in the
 `epi_recipe` framework, and fitting a model and making predictions within the
 `epipredict` framework and using the package's canned forecasters.
 
-# Preprocessing
+# A simple example
+## Preprocessing
 
-As a simple example, let's work with the `num_graduates` column for now.
+As a simple example, let's work with the `num_graduates` column for now. We will
+first pre-process by "standardizing" each numeric column by the total within
+each group of keys. We do this since those raw numeric values will vary greatly
+from province to province since there are large differences in population.
 
 ```{r employ-small, include=T}
 employ_small <- employ %>%
-# Incomplete data - exclude
-filter(geo_value != "Territories") %>%
-# Select groups where there are complete timeseries values
 group_by(geo_value, age_group, edu_qual) %>%
+# Select groups where there are complete timeseries values
 filter(n() >= 6) %>%
 mutate(
 num_graduates_prop = num_graduates / sum(num_graduates),
@@ -231,16 +233,17 @@ employ_small %>%
 ggtitle("Trend in # of Graduates by Age Group and Education in BC and ON")
 ```
 
-We will predict the number of graduates in the next year (time $t+1$) using an
-autoregressive model with three lags (i.e., an AR(3) model). Such a model is
-represented algebraically like this:
+We will predict the "standardized" number of graduates (a proportion) in the
+next year (time $t+1$) using an autoregressive model with three lags (i.e., an
+AR(3) model). Such a model is represented algebraically like this:
 
 \[
 x_{t+1} =
 \phi_0 + \phi_1 x_{t} + \phi_2 x_{t-1} + \phi_3 x_{t-2} + \epsilon_t
 \]
 
-where $x_i$ is the number of graduates at time $i$, and the current time is $t$.
+where $x_i$ is the proportion of graduates at time $i$, and the current time is
+$t$.
 
 In the preprocessing step, we need to create additional columns in `employ` for
 each of $x_{t+1}$, $x_{t}$, $x_{t-1}$, and $x_{t-2}$. We do this via an
@@ -261,32 +264,25 @@ r <- epi_recipe(employ_small) %>%
 r
 ```
 
-There are 3 `raw` roles which are our three lagged `num_graduates` columns, and
-three `key` roles which are our additional keys `age_group`, `fos` and
-`edu_qual`.
-
 Let's apply this recipe using `prep` and `bake` to generate and view the `lag`
 and `ahead` columns.
 
 ```{r view-preprocessed, include=T}
 # Display a sample of the preprocessed data
-baked_sample <- r %>%
-prep() %>%
-bake(new_data = employ_small) %>%
+baked_sample <- r %>% prep() %>% bake(new_data = employ_small) %>%
 sample_n(5)
 baked_sample
 ```
 
 We can see that the `prep` and `bake` steps created new columns according to
 our `epi_recipe`:
 
-- `ahead_1_num_graduates` corresponds to $x_{t+1}$
-- `lag_0_num_graduates`, `lag_1_num_graduates`, and `lag_2_num_graduates`
-correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$ respectively.
+- `ahead_1_num_graduates_prop` corresponds to $x_{t+1}$
+- `lag_0_num_graduates_prop`, `lag_1_num_graduates_prop`, and
+`lag_2_num_graduates_prop` correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$
+respectively.
 
-# Model fitting and prediction
-
-## Within recipes framework
+## Model fitting and prediction
 
 Since our goal for now is to fit a simple autoregressive model, we can use
 [`parsnip::linear_reg()`](
@@ -306,8 +302,8 @@ wf_linreg
 ```
 
 This output tells us the coefficients of the fitted model; for instance,
-the intercept is $\phi_0 = -2.2426$ and the coefficient for $x_{t}$ is
-$\phi_1 = 1.14401$.
+the intercept is $\phi_0 = 0.24804$ and the coefficient for $x_{t}$ is
+$\phi_1 = 0.06648$.
 
 Now that we have our workflow, we can generate predictions from a subset of our
 data. For this demo, we will predict the number of graduates using the last 2
@@ -320,24 +316,46 @@ preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
 preds %>% head()
 ```
 
-We can do this using the `augment` function too:
+We can do this using the `augment` function too. Note that `predict` and
+`augment` both still return an `epi_df` with all of the keys that were present
+in the original dataset.
+
 ```{r linearreg-augment, include=T}
 employ_small_with_preds <- augment(wf_linreg, latest)
 employ_small_with_preds %>% head()
+```
+
+## AR(3) Model Diagnostics
+
+First, we'll plot the residuals (that is, $y_{t} - \hat{y}_{t}$) against the
+fitted values ($\hat{y}_{t}$).
 
-employ_small_with_preds %>%
-mutate(resid = med_income_2y - .pred) %>%
-ggplot(aes(x = .pred, y = resid, color = geo_value)) +
-geom_point() +
+```{r lienarreg-resid-plot, include=T, warning=F}
+employ_small_with_preds <- employ_small_with_preds %>%
+mutate(resid = num_graduates_prop - .pred)
+
+p1 <- employ_small_with_preds %>%
+ggplot(aes(x = .pred, y = resid)) +
+geom_point(size = 1.5, alpha = .8) +
+geom_smooth(method = "loess", color = "red", linetype = "dashed", size = .7) +
 xlab("Fitted values") +
 ylab("Residuals") +
-ggtitle("Plot of fitted values vs. residuals")
+ggtitle("Plot of Fitted Values vs. Residuals in AR(3) Model")
+
+p2 <- employ_small_with_preds %>%
+ggplot(aes(sample = resid)) + stat_qq(alpha = .6) + stat_qq_line() +
+ggtitle("Q-Q Plot of Residuals")
+
+grid.arrange(p1, p2, ncol = 2)
 ```
 
-Notice that `predict` and `augment` both still returns an `epi_df` with all of
-the keys that were present in the original dataset.
+The left plot shows us that the residuals are mostly clustered around zero,
+but do not form an even band around the zero line, indicating that the variance
+of the residuals is not constant. The right Q-Q plot shows us that the residuals
+have heavier tails than a Normal distribution. So both the constant variance and
+normality assumptions of the linear model have been violated.
 
-## With canned forecasters
+# Using canned forecasters
 
 Even though we aren't working with epidemiological data, canned forecasters
 still work as expected, out of the box. We will demonstrate this with the simple
