Skip to content

Commit 63e306a

Browse files
committed
add some math
1 parent 51471fa commit 63e306a

File tree

1 file changed

+77
-28
lines changed

1 file changed

+77
-28
lines changed

vignettes/panel-data.Rmd

Lines changed: 77 additions & 28 deletions
Original file line numberDiff line numberDiff line change
@@ -17,11 +17,11 @@ knitr::opts_chunk$set(
1717
```
1818

1919
```{r libraries}
20-
library(epiprocess)
21-
library(epipredict)
2220
library(dplyr)
2321
library(parsnip)
2422
library(recipes)
23+
library(epiprocess)
24+
library(epipredict)
2525
```
2626

2727
[Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data,
@@ -33,22 +33,21 @@ dataset, which contains daily state-wise measures of `case_rate` and
3333
`death_rate` for COVID-19 in 2021:
3434

3535
```{r epi-panel-ex, include=T}
36-
head(case_death_rate_subset)
36+
head(case_death_rate_subset, 3)
3737
```
3838

3939
`epipredict` functions work with data in [`epi_df`](
4040
https://cmu-delphi.github.io/epiprocess/reference/epi_df.html)
4141
format. Despite the stated goal and name of the package, other panel datasets
42-
are also valid candidates for `epipredict` functionality. Specifically, the
43-
`epipredict` framework and direct forecasters can work with any panel data, as
44-
long as it's in `epi_df` format.
42+
are also valid candidates for `epipredict` functionality, as long as they are
43+
in `epi_df` format.
4544

4645
```{r employ-stats, include=F}
4746
year_start <- min(grad_employ_subset$time_value)
4847
year_end <- max(grad_employ_subset$time_value)
4948
```
5049

51-
## Example panel data overview
50+
# Example panel data overview
5251

5352
In this vignette, we will demonstrate using `epipredict` with employment data
5453
from Statistics Canada. We will be using
@@ -80,7 +79,7 @@ just described:
8079
```{r employ-query, eval=F}
8180
library(cansim)
8281
83-
# Get original dataset
82+
# Get statcan data using get_cansim, which returns a tibble
8483
statcan_grad_employ <- get_cansim("37-10-0115-01")
8584
8685
gemploy <- statcan_grad_employ %>%
@@ -141,13 +140,12 @@ using [`as_epi_df`](
141140
https://cmu-delphi.github.io/epiprocess/reference/as_epi_df.html)
142141
with additional keys. In our case, the additional keys are `age_group`, `fos`
143142
and `edu_qual`. Note that in the above modifications, we encoded `time_value`
144-
as type `integer`. This allows us to set `time_type` to `"year"`, and to ensure
143+
as type `integer`. This lets us set `time_type = "year"`, and ensures that
145144
lag and ahead modifications later on are using the correct time units. See the
146145
[`epi_df` documentation](
147146
https://cmu-delphi.github.io/epiprocess/reference/epi_df.html#time-types) for
148147
a list of all the `type_type`s available.
149148

150-
151149
```{r convert-to-epidf, eval=F}
152150
grad_employ_subset <- gemploy %>%
153151
tsibble::as_tsibble(
@@ -163,8 +161,22 @@ employ_rowcount <- format(nrow(grad_employ_subset), big.mark=",")
163161
employ_colcount <- length(names(grad_employ_subset))
164162
```
165163

166-
The data contains `r employ_rowcount` rows and `r employ_colcount` columns. Now,
167-
we are ready to use `grad_employ_subset` with `epipredict`.
164+
Now, we are ready to use `grad_employ_subset` with `epipredict`.
165+
Our `epi_df` contains `r employ_rowcount` rows and `r employ_colcount` columns.
166+
Here is a quick summary of the columns in our `epi_df`:
167+
168+
* `time_value` (time value): year in YYYY format
169+
* `geo_value` (geo value): province in Canada
170+
* `num_graduates` (raw, time series value): number of graduates
171+
* `med_income_2y` (raw, time series value): median employment income 2 years
172+
after graduation
173+
* `med_income_5y` (raw, time series value): median employment income 5 years
174+
after graduation
175+
* `age_group` (key): one of two age groups, either 15 to 34 years, or 35 to 64
176+
years
177+
* `fos` (key): one of 60 unique fields of study
178+
* `edu_qual` (key): one of 32 unique educational qualifications, e.g.,
179+
"Master's disploma"
168180

169181
```{r preview-data, include=T}
170182
# Rename for simplicity
@@ -176,48 +188,83 @@ In the following sections, we will go over preprocessing the data in the
176188
`epi_recipe` framework, and fitting a model and making predictions within the
177189
`epipredict` framework and using the package's canned forecasters.
178190

179-
## Preprocessing
191+
# Preprocessing
192+
193+
We will create an `epi_recipe` that adds one `ahead` column and 3 `lag` columns.
194+
The `ahead` column tells us how many time units ahead to predict, and the `lag`
195+
columns tell us how many previous time points to include as covariates. We will
196+
just work with one of the time series in our data for now: `num_graduates`.
180197

181-
We will create a recipe that adds one `ahead` column and 3 `lag` columns.
198+
In this preprocessing step, no computation really happens. It just provides
199+
a series of steps that will be applied when using `epi_workflow` later. And note
200+
that since we specified our `time_type` to be `year`, our `lag` and `lead`
201+
values are both in years.
182202

183203
```{r make-recipe, include=T}
184204
r <- epi_recipe(employ) %>%
185205
step_epi_ahead(num_graduates, ahead = 1) %>% # lag & ahead units in years
186-
step_epi_lag(num_graduates, lag = c(0, 1, 1)) %>%
206+
step_epi_lag(num_graduates, lag = 0:2) %>%
187207
step_epi_naomit()
188208
r
189209
```
190210

191-
There is one `raw` role which includes our value column `num_graduates`, and two
192-
`key` roles which include our additional keys `age_group`, `fos` and
193-
`edu_qual`. Let's take a look at what these additional columns look like.
211+
There are 3 `raw` roles which are our three lagged `num_graduates` columns, and
212+
three `key` roles which are our additional keys `age_group`, `fos` and
213+
`edu_qual`. Let's apply this recipe using `prep` and `bake` to see all of the
214+
additional `lag` and `ahead` columns created.
194215

195216
```{r view-preprocessed, include=T}
196217
# Display a sample of the preprocessed data
197218
baked_sample <- r %>% prep() %>% bake(new_data = employ) %>% sample_n(5)
198219
baked_sample
199220
```
200221

201-
## Model fitting and prediction
222+
# Model fitting and prediction
223+
224+
## Within recipes framework
225+
226+
We will look at a simple model: [`parsnip::linear_reg()`](
227+
https://parsnip.tidymodels.org/reference/linear_reg.html) with the default
228+
engine `lm`, which fits a linear regression using ordinary least squares.
229+
Specifically, our model will be an autoregressive linear model with:
202230

203-
### Within recipes framework
231+
* Outcome $y_{t}$, the number of graduates at time $t$. Corresponds to
232+
setting `ahead = 1` in our recipe.
233+
<!-- TODO i don't like the wording i used here -->
234+
* Predictors $x_{t-1}$, $x_{t-2}$, and $x_{t-3}$, the number of graduates in the
235+
three consecutive years prior to time $t$. Corresponds to setting `lag = 0:2` in
236+
our recipe.
204237

205-
We will look at a simple model: `parsnip::linear_reg()` with default engine
206-
`lm`. We can use `epi_workflow` with the `epi_recipe` we defined in the
207-
preprocessing section to fit an autoregressive linear model using lags at
208-
time $t$ (current), $t-1$ (last year), and $t-2$ (2 years ago).
238+
The model is represented algebraically as follows:
239+
240+
\[
241+
y_{t+1} = \alpha_1 x_{t} + \alpha_2 x_{t-1} + \alpha_3 x_{t-2} + \epsilon_t
242+
\]
243+
244+
We will use `epi_workflow` with the `epi_recipe` we defined in the
245+
preprocessing section to fit this model.
209246

210247
```{r linearreg-wf, include=T}
211248
wf_linreg <- epi_workflow(r, parsnip::linear_reg()) %>% fit(employ)
212249
wf_linreg
213250
```
214251

252+
Let's take a look at the model object
253+
254+
```{r linearreg-fit, include=T}
255+
# extract the parsnip model object
256+
wf_lr_fit <- wf_linreg$fit$fit
257+
wf_lr_fit
258+
```
259+
260+
<!-- TODO add diagnostics -->
261+
215262
Now that we have our workflow, we can generate predictions from a subset of our
216-
data. For this demo, we will predict the employment counts from the last 12
217-
months of our dataset.
263+
data. For this demo, we will predict the number of graduates from the last 2
264+
years of our dataset.
218265

219266
```{r linearreg-predict, include=T}
220-
latest <- employ %>% filter(time_value >= max(time_value) - 12)
267+
latest <- employ %>% filter(time_value >= max(time_value) - 2)
221268
preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
222269
# Display a sample of the prediction values
223270
head(preds)
@@ -226,7 +273,9 @@ head(preds)
226273
Notice that `predict` still returns an `epi_df` with all of the keys that were
227274
present in the original dataset.
228275

229-
### With canned forecasters
276+
<!-- (TODO: residuals, predictions commentary) -->
277+
278+
## With canned forecasters
230279

231280
Even though we aren't working with epidemiological data, canned forecasters
232281
still work as expected, out of the box. We will demonstrate this with the simple

0 commit comments

Comments
 (0)