edits, plots

mgyliu · mgyliu · commit f114c656cccd · 2022-09-13T21:38:55.000-07:00
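
Note for readers of this diff: the new preprocessing text describes "standardizing" each numeric column by the total within each group of keys. The following is a minimal base-R sketch of that idea on a toy data frame. The data frame `toy`, its values, and the use of `ave()` are illustrative assumptions; the vignette itself does this with `group_by()` and `mutate()` on `employ`.

```r
# Toy panel data: hypothetical values, not taken from the Statistics Canada table
toy <- data.frame(
  geo_value     = rep(c("A", "B"), each = 3),
  time_value    = rep(2010:2012, times = 2),
  num_graduates = c(10, 20, 70, 1000, 3000, 6000)
)

# Divide each value by the total of its group, mirroring
# num_graduates / sum(num_graduates) inside group_by() %>% mutate().
# ave() returns the group total aligned with each row.
toy$num_graduates_prop <- toy$num_graduates /
  ave(toy$num_graduates, toy$geo_value, FUN = sum)

print(toy)
```

After this step the two groups are on the same scale even though their raw counts differ by two orders of magnitude, and the proportions within each group sum to 1.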
diff --git a/vignettes/panel-data.Rmd b/vignettes/panel-data.Rmd
@@ -23,6 +23,7 @@ library(recipes)
 library(epiprocess)
 library(epipredict)
 library(ggplot2)
+library(gridExtra)
 ```
 
 [Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data, 
@@ -50,8 +51,8 @@ year_end <- max(grad_employ_subset$time_value)
 
 # Example panel data overview
 
-In this vignette, we will demonstrate using `epipredict` with employment data 
-from Statistics Canada. We will be using 
+In this vignette, we will demonstrate using `epipredict` with employment panel
+data from Statistics Canada. We will be using 
 [
   Table 37-10-0115-01: Characteristics and median employment income of 
   longitudinal cohorts of postsecondary graduates two and five years after 
@@ -61,7 +62,7 @@ from Statistics Canada. We will be using
 
 The full dataset contains yearly median employment income two and five years 
 after graduation, and number of graduates. The data is stratified by 
-variables such as geographic region (Canadian province), field of study, and 
+variables such as geographic region (Canadian province), education, and 
 age group. The year range of the dataset is `r year_start` to `r year_end`, 
 inclusive. The full dataset also contains metadata that describes the 
 quality of data collected. For demonstration purposes, we make the following 
@@ -72,7 +73,7 @@ modifications to get a subset of the full dataset:
 * Only keep "good" or better quality data rows, as indicated by the [`STATUS`](
   https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol) column
 * Choose a subset of covariates and aggregate across the remaining ones. The 
-chosen covariates are age group, field of study, and educational qualification.
+chosen covariates are age group and educational qualification.
 
 Below is the query for obtaining the full data and code for subsetting it as we 
 just described:
@@ -144,7 +145,7 @@ gemploy <- statcan_grad_employ %>%
 To use this data with `epipredict`, we need to convert it into `epi_df` format 
 using [`as_epi_df`](
   https://cmu-delphi.github.io/epiprocess/reference/as_epi_df.html) 
-with additional keys. In our case, the additional keys are `age_group`, `fos` 
+with additional keys. In our case, the additional keys are `age_group`
 and `edu_qual`. Note that in the above modifications, we encoded `time_value` 
 as type `integer`. This lets us set `time_type = "year"`, and ensures that
 lag and ahead modifications later on are using the correct time units. See the
@@ -182,7 +183,6 @@ after graduation
 after graduation  
 * `age_group` (key): one of two age groups, either 15 to 34 years, or 35 to 64 
 years 
-* `fos` (key): one of 60 unique fields of study 
 * `edu_qual` (key): one of 32 unique educational qualifications, e.g., 
 "Master's disploma"
 
@@ -196,16 +196,18 @@ In the following sections, we will go over preprocessing the data in the
 `epi_recipe` framework, and fitting a model and making predictions within the 
 `epipredict` framework and using the package's canned forecasters.
 
-# Preprocessing 
+# A simple example
+## Preprocessing 
 
-As a simple example, let's work with the `num_graduates` column for now.
+As a simple example, let's work with the `num_graduates` column for now. We will 
+first preprocess by "standardizing" each numeric column by the total within 
+each group of keys. We do this because the raw numeric values vary greatly
+from province to province, owing to large differences in population.
 
 ```{r employ-small, include=T}
 employ_small <- employ %>%
-  # Incomplete data - exclude
-  filter(geo_value != "Territories") %>%
-  # Select groups where there are complete timeseries values
   group_by(geo_value, age_group, edu_qual) %>%
+  # Select groups where there are complete timeseries values
   filter(n() >= 6) %>%
   mutate(
     num_graduates_prop = num_graduates / sum(num_graduates),
@@ -231,16 +233,17 @@ employ_small %>%
   ggtitle("Trend in # of Graduates by Age Group and Education in BC and ON")
 ```
 
-We will predict the number of graduates in the next year (time $t+1$) using an 
-autoregressive model with three lags (i.e., an AR(3) model). Such a model is 
-represented algebraically like this: 
+We will predict the "standardized" number of graduates (a proportion) in the 
+next year (time $t+1$) using an autoregressive model with three lags (i.e., an 
+AR(3) model). Such a model is represented algebraically like this: 
 
 \[
   x_{t+1} = 
   \phi_0 + \phi_1 x_{t} + \phi_2 x_{t-1} + \phi_3 x_{t-2} + \epsilon_t  
 \]
 
-where $x_i$ is the number of graduates at time $i$, and the current time is $t$.
+where $x_i$ is the proportion of graduates at time $i$, and the current time is 
+$t$.
 
 In the preprocessing step, we need to create additional columns in `employ` for 
 each of $x_{t+1}$, $x_{t}$, $x_{t-1}$, and $x_{t-2}$. We do this via an 
@@ -261,32 +264,25 @@ r <- epi_recipe(employ_small) %>%
 r
 ```
 
-There are 3 `raw` roles which are our three lagged `num_graduates` columns, and 
-three `key` roles which are our additional keys `age_group`, `fos` and 
-`edu_qual`. 
-
 Let's apply this recipe using `prep` and `bake` to generate and view the `lag` 
 and `ahead` columns.
 
 ```{r view-preprocessed, include=T}
 # Display a sample of the preprocessed data
-baked_sample <- r %>%
-  prep() %>%
-  bake(new_data = employ_small) %>%
+baked_sample <- r %>% prep() %>% bake(new_data = employ_small) %>%
   sample_n(5)
 baked_sample
 ```
 
 We can see that the `prep` and `bake` steps created new columns according to 
 our `epi_recipe`:
 
-- `ahead_1_num_graduates` corresponds to $x_{t+1}$
-- `lag_0_num_graduates`, `lag_1_num_graduates`, and `lag_2_num_graduates` 
-correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$ respectively.
+- `ahead_1_num_graduates_prop` corresponds to $x_{t+1}$
+- `lag_0_num_graduates_prop`, `lag_1_num_graduates_prop`, and 
+`lag_2_num_graduates_prop` correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$ 
+respectively.
 
-# Model fitting and prediction
-
-## Within recipes framework 
+## Model fitting and prediction
 
 Since our goal for now is to fit a simple autoregressive model, we can use
 [`parsnip::linear_reg()`](
@@ -306,8 +302,8 @@ wf_linreg
 ```
 
 This output tells us the coefficients of the fitted model; for instance, 
-the intercept is $\phi_0 = -2.2426$ and the coefficient for $x_{t}$ is 
-$\phi_1 = 1.14401$.
+the intercept is $\phi_0 = 0.24804$ and the coefficient for $x_{t}$ is 
+$\phi_1 = 0.06648$.
 
 Now that we have our workflow, we can generate predictions from a subset of our 
 data. For this demo, we will predict the number of graduates using the last 2 
@@ -320,24 +316,46 @@ preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
 preds %>% head()
 ```
 
-We can do this using the `augment` function too:
+We can do this using the `augment` function too. Note that `predict` and 
+`augment` both still return an `epi_df` with all of the keys that were present 
+in the original dataset.
+
 ```{r linearreg-augment, include=T}
 employ_small_with_preds <- augment(wf_linreg, latest)
 employ_small_with_preds %>% head()
+```
+
+## AR(3) Model Diagnostics 
+
+First, we'll plot the residuals (that is, $x_{t} - \hat{x}_{t}$) against the 
+fitted values ($\hat{x}_{t}$).
 
-employ_small_with_preds %>%
-  mutate(resid = med_income_2y - .pred) %>%
-  ggplot(aes(x = .pred, y = resid, color = geo_value)) +
-  geom_point() +
+```{r linearreg-resid-plot, include=T, warning=F}
+employ_small_with_preds <- employ_small_with_preds %>% 
+  mutate(resid = num_graduates_prop - .pred) 
+  
+p1 <- employ_small_with_preds %>%
+  ggplot(aes(x = .pred, y = resid)) + 
+  geom_point(size = 1.5, alpha = .8) +
+  geom_smooth(method = "loess", color = "red", linetype = "dashed", size = .7) + 
   xlab("Fitted values") +
   ylab("Residuals") +
-  ggtitle("Plot of fitted values vs. residuals")
+  ggtitle("Plot of Fitted Values vs. Residuals in AR(3) Model")
+
+p2 <- employ_small_with_preds %>% 
+  ggplot(aes(sample = resid)) + stat_qq(alpha = .6) + stat_qq_line() +
+  ggtitle("Q-Q Plot of Residuals")
+
+grid.arrange(p1, p2, ncol = 2)
 ```
 
-Notice that `predict` and `augment` both still returns an `epi_df` with all of 
-the keys that were present in the original dataset. 
+The left plot shows that the residuals are mostly clustered around zero 
+but do not form an even band around the zero line, which suggests that their 
+variance is not constant. The Q-Q plot on the right shows that the residuals 
+have heavier tails than a Normal distribution. Both the constant-variance and
+normality assumptions of the linear model therefore appear to be violated.
 
-## With canned forecasters
+# Using canned forecasters
 
 Even though we aren't working with epidemiological data, canned forecasters 
 still work as expected, out of the box. We will demonstrate this with the simple