
Commit cc988cb

address @rachellobay review, adjust canned printing to handle other_keys
1 parent 1ee300b commit cc988cb

File tree

2 files changed: +69 -56 lines changed


R/canned-epipred.R

Lines changed: 11 additions & 5 deletions
@@ -67,11 +67,17 @@ print.canned_epipred <- function(x, name, ...) {
   )
   cli::cli_text("")
   cli::cli_text("Training data was an {.cls epi_df} with:")
-  cli::cli_ul(c(
-    "Geography: {.field {x$metadata$training$geo_type}},",
-    "Time type: {.field {x$metadata$training$time_type}},",
-    "Using data up-to-date as of: {.field {format(x$metadata$training$as_of)}}."
-  ))
+  fn_meta <- function() {
+    cli::cli_ul()
+    cli::cli_li("Geography: {.field {x$metadata$training$geo_type}},")
+    if (!is.null(x$metadata$training$other_keys)) {
+      cli::cli_li("Other keys: {.field {x$metadata$training$other_keys}},")
+    }
+    cli::cli_li("Time type: {.field {x$metadata$training$time_type}},")
+    cli::cli_li("Using data up-to-date as of: {.field {format(x$metadata$training$as_of)}}.")
+    cli::cli_end()
+  }
+  fn_meta()
   cli::cli_text("")
 
   cli::cli_rule("Predictions")

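The change above replaces a single `cli::cli_ul(c(...))` call with a small helper so that one list item can be emitted conditionally. The pattern can be sketched in isolation (the `print_meta()` helper and its `meta` argument are hypothetical names for illustration, not part of the commit):

```r
# Hypothetical sketch of the cli pattern used by the new fn_meta() helper:
# open a list with cli_ul(), emit items with cli_li() (one of them only
# when the corresponding field is present), then close it with cli_end().
print_meta <- function(meta) {
  cli::cli_ul()
  cli::cli_li("Geography: {.field {meta$geo_type}},")
  if (!is.null(meta$other_keys)) {
    cli::cli_li("Other keys: {.field {meta$other_keys}},")
  }
  cli::cli_li("Time type: {.field {meta$time_type}}.")
  cli::cli_end()
}

print_meta(list(geo_type = "state", time_type = "year"))
```

Defining the helper and invoking it immediately keeps the conditional item next to its neighbours while the enclosing `print` method stays linear.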
vignettes/panel-data.Rmd

Lines changed: 58 additions & 51 deletions
@@ -12,7 +12,8 @@ knitr::opts_chunk$set(
   echo = TRUE,
   collapse = TRUE,
   comment = "#>",
-  out.width = "100%"
+  out.width = "90%",
+  fig.align = "center"
 )
 ```
 
@@ -117,7 +118,7 @@ In the following sections, we will go over pre-processing the data in the
 `epi_recipe` framework, and fitting a model and making predictions within the
 `epipredict` framework and using the package's canned forecasters.
 
-# Simple autoregressive with 3 lags to predict number of graduates in a year
+# Autoregressive (AR) model to predict number of graduates in a year
 
 ## Pre-processing
 
@@ -165,21 +166,25 @@ next year (time $t+1$) using an autoregressive model with three lags (i.e., an
 AR(3) model). Such a model is represented algebraically like this:
 
 \[
-x_{t+1} =
-\alpha_0 + \alpha_1 x_{t} + \alpha_2 x_{t-1} + \alpha_3 x_{t-2} + \epsilon_t
+y_{t+1,ijk} =
+\alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk} + \epsilon_{tijk}
 \]
 
-where $x_i$ is the proportion of graduates at time $i$, and the current time is
-$t$.
+where $y_{tijk}$ is the proportion of graduates at time $t$ in location $i$ and
+age group $j$ with education quality $k$.
 
 In the pre-processing step, we need to create additional columns in `employ` for
-each of $x_{t+1}$, $x_{t}$, $x_{t-1}$, and $x_{t-2}$. We do this via an
+each of $y_{t+1,ijk}$, $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$.
+We do this via an
 `epi_recipe`. Note that creating an `epi_recipe` alone doesn't add these
 outcome and predictor columns; the recipe just stores the instructions for
 adding them.
 
-Our `epi_recipe` should add one `ahead` column representing $x_{t+1}$ and
-3 `lag` columns representing $x_{t}$, $x_{t-1}$, and $x_{t-2}$. Also note that
+Our `epi_recipe` should add one `ahead` column representing $y_{t+1,ijk}$ and
+3 `lag` columns representing $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$
+(it's more accurate to think of the 0th "lag" as the "current" value with 2 lags,
+but that's not quite how the processing works).
+Also note that
 since we specified our `time_type` to be `year`, our `lag` and `lead`
 values are both in years.
 
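As a sketch of what the ahead and lag columns contain, here is a hypothetical base-R illustration for a single series `y` (epipredict's `step_epi_ahead()` and `step_epi_lag()` do this per group across the `epi_df` keys; the `ahead_1`/`lag_*` names below are illustrative, not the package's generated names):

```r
# Build one "ahead" column and three "lag" columns for a toy series y.
y <- c(0.10, 0.12, 0.11, 0.15, 0.14, 0.16)
m <- embed(y, 3) # row for time t holds y_t, y_{t-1}, y_{t-2}, for t = 3, ..., 6
data.frame(
  ahead_1 = c(m[-1, 1], NA), # y_{t+1}; NA where the future is unobserved
  lag_0   = m[, 1],          # y_t
  lag_1   = m[, 2],          # y_{t-1}
  lag_2   = m[, 3]           # y_{t-2}
)
```

Rows left with an `NA` in any column are the ones a step like `step_epi_naomit()` would drop before fitting.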
@@ -209,9 +214,9 @@ r %>% bake_and_show_sample(employ_small)
 We can see that the `prep` and `bake` steps created new columns according to
 our `epi_recipe`:
 
-- `ahead_1_num_graduates_prop` corresponds to $x_{t+1}$
+- `ahead_1_num_graduates_prop` corresponds to $y_{t+1,ijk}$
 - `lag_0_num_graduates_prop`, `lag_1_num_graduates_prop`, and
-`lag_2_num_graduates_prop` correspond to $x_{t}$, $x_{t-1}$, and $x_{t-2}$
+`lag_2_num_graduates_prop` correspond to $y_{tijk}$, $y_{t-1,ijk}$, and $y_{t-2,ijk}$
 respectively.
 
 ## Model fitting and prediction
@@ -224,8 +229,8 @@ engine `lm`, which fits a linear regression using ordinary least squares.
 We will use `epi_workflow` with the `epi_recipe` we defined in the
 pre-processing section along with the `parsnip::linear_reg()` model. Note
 that `epi_workflow` is a container and doesn't actually do the fitting. We have
-to pass the workflow into `fit()` to get our model coefficients
-$\alpha_i, i=0,...,3$.
+to pass the workflow into `fit()` to get our estimated model coefficients
+$\widehat{\alpha}_i,\ i=0,...,3$.
 
 ```{r linearreg-wf, include=T}
 wf_linreg <- epi_workflow(r, linear_reg()) %>%
@@ -234,13 +239,13 @@ summary(extract_fit_engine(wf_linreg))
 ```
 
 This output tells us the coefficients of the fitted model; for instance,
-the intercept is $\alpha_0 = 0.24804$ and the coefficient for $x_{t}$ is
-$\alpha_1 = 0.06648$. The summary also tells us that only the intercept and lags
-at 2 years and 3 years ago have coefficients significantly greater than zero.
-
-Extracting the 95% confidence intervals for the coefficients also leads us to
-the same conclusion: all the coefficients except for $\alpha_1$ (lag 0) contain
-0.
+the estimated intercept is $\widehat{\alpha}_0 =$ `r round(coef(extract_fit_engine(wf_linreg))[1], 3)` and the coefficient for $y_{tijk}$ is
+$\widehat\alpha_1 =$ `r round(coef(extract_fit_engine(wf_linreg))[2], 3)`.
+The summary also tells us that all estimated coefficients are significantly
+different from zero. Extracting the 95% confidence intervals for the
+coefficients also leads us to
+the same conclusion: all the coefficient estimates are significantly different
+from 0.
 
 ```{r}
 confint(extract_fit_engine(wf_linreg))
@@ -258,20 +263,20 @@ preds %>% sample_n(5)
 ```
 
 We can do this using the `augment` function too. Note that `predict` and
-`augment` both still return an `epi_df` with all of the keys that were present
-in the original dataset.
+`augment` both still return an `epiprocess::epi_df` with all of the keys that
+were present in the original dataset.
 
-```{r linearreg-augment, include=T, eval=F}
+```{r linearreg-augment}
 augment(wf_linreg, latest) %>% sample_n(5)
 ```
 
 ## Model diagnostics
 
-First, we'll plot the residuals (that is, $y_{t} - \hat{y}_{t}$) against the
-fitted values ($\hat{y}_{t}$).
+First, we'll plot the residuals (that is, $y_{tijk} - \widehat{y}_{tijk}$)
+against the fitted values ($\widehat{y}_{tijk}$).
 
 ```{r lienarreg-resid-plot, include=T, fig.height = 5, fig.width = 5}
-par(mfrow = c(2, 2))
+par(mfrow = c(2, 2), mar = c(5, 3, 1.2, 0))
 plot(extract_fit_engine(wf_linreg))
 ```
 
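The diagnostic panel that `plot()` draws for an `lm` object can be previewed on a toy fit (hypothetical data, not the vignette's model): residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage with Cook's distance contours.

```r
# Toy lm fit; plot() on an lm object draws the four base-R diagnostics.
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2), mar = c(5, 3, 1.2, 0)) # 2x2 layout, tightened margins
plot(fit)
```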
@@ -286,30 +291,33 @@ The Q-Q plot shows us that the residuals have heavier tails than a Normal
 distribution. So the normality of residuals assumption doesn't hold either.
 
 Finally, the residuals vs. leverage plot shows us that we have a few influential
-points based on the Cooks distance (those outside the red dotted line).
+points based on the Cook's distance (those outside the red dotted line).
 
 Since we appear to be violating the linear model assumptions, we might consider
 transforming our data differently, or considering a non-linear model, or
 something else.
 
-# Autoregressive model with exogenous inputs
+# AR model with exogenous inputs
 
-Now suppose we want to model the 5-year employment income using 3 lags, while
+Now suppose we want to model the 1-step-ahead 5-year employment income using
+current and two previous values, while
 also incorporating information from the other two time-series in our dataset:
 the 2-year employment income and the number of graduates in the previous 2
 years. We would do this using an autoregressive model with exogenous inputs,
 defined as follows:
 
 \[
-y_{t+1} =
-\alpha_0 + \alpha_1 y_{t} + \alpha_2 y_{t-1} + \alpha_3 y_{t-2} +
-\beta_1 x_{t} + \beta_2 x_{t-1}
-\gamma_2 z_{t} + \gamma_2 z_{t-1}
+\begin{aligned}
+y_{t+1,ijk} &=
+\alpha_0 + \alpha_1 y_{tijk} + \alpha_2 y_{t-1,ijk} + \alpha_3 y_{t-2,ijk}\\
+&\quad + \beta_1 x_{tijk} + \beta_2 x_{t-1,ijk}\\
+&\quad + \gamma_1 z_{tijk} + \gamma_2 z_{t-1,ijk} + \epsilon_{tijk}
+\end{aligned}
 \]
 
-where $y_i$ is the 5-year median income (proportion) at time $i$,
-$x_i$ is the 2-year median income (proportion) at time $i$, and
-$z_i$ is the number of graduates (proportion) at time $i$.
+where $y_{tijk}$ is the 5-year median income (proportion) at time $t$ (in location $i$, age group $j$ with education quality $k$),
+$x_{tijk}$ is the 2-year median income (proportion) at time $t$, and
+$z_{tijk}$ is the number of graduates (proportion) at time $t$.
 
 ## Pre-processing
 
@@ -318,9 +326,9 @@ Again, we construct an `epi_recipe` detailing the pre-processing steps.
 ```{r custom-arx, include=T}
 rx <- epi_recipe(employ_small) %>%
   step_epi_ahead(med_income_5y_prop, ahead = 1) %>%
-  # 5-year median income has 3 lags c(0,1,2)
-  step_epi_lag(med_income_5y_prop, lag = c(0, 1, 2)) %>%
-  # But the two exogenous variables have 2 lags c(0,1)
+  # 5-year median income has current, and two lags c(0, 1, 2)
+  step_epi_lag(med_income_5y_prop, lag = 0:2) %>%
+  # But the two exogenous variables have current values, and 1 lag c(0, 1)
   step_epi_lag(med_income_2y_prop, lag = c(0, 1)) %>%
   step_epi_lag(num_graduates_prop, lag = c(0, 1)) %>%
   step_epi_naomit()
@@ -335,14 +343,15 @@ steps using a few [`frosting`](
 https://cmu-delphi.github.io/epipredict/reference/frosting.html) layers to do
 a few things:
 
-1. Convert our predictions back to income values and number of graduates,
-rather than standardized proportions. We do this via the frosting layer
-`layer_population_scaling`.
-2. Threshold our predictions to 0. We are predicting proportions, which can't
+1. Threshold our predictions to 0. We are predicting proportions, which can't
 be negative. And the transformed values back to dollars and people can't be
 negative either.
-3. Generate prediction intervals based on residual quantiles, allowing us to
+1. Generate prediction intervals based on residual quantiles, allowing us to
 quantify the uncertainty associated with future predicted values.
+1. Convert our predictions back to income values and number of graduates,
+rather than standardized proportions. We do this via the frosting layer
+`layer_population_scaling()`.
+
 
 ```{r custom-arx-post, include=T}
 # Create dataframe of the sums we used for standardizing
373382
summary(extract_fit_engine(wfx_linreg))
374383
```
375384

376-
Based on the summary output for this model, only the intercept term, 5-year
377-
median income from 3 years ago, and the 2-year median income from 1 year ago
378-
are significant linear predictors for today's 5-year median income at a 95%
379-
confidence level. Both lags for the number of graduates were insignificant.
385+
Based on the summary output for this model, we can examine confidence intervals
386+
and perform hypothesis tests as usual.
380387

381388
Let's take a look at the predictions along with their 90% prediction intervals.
382389

383390
```{r}
384391
latest <- get_test_data(recipe = rx, x = employ_small)
385392
predsx <- predict(wfx_linreg, latest)
386393
387-
# Display values within prediction intervals
394+
# Display predictions along with prediction intervals
388395
predsx %>%
389396
select(
390397
geo_value, time_value, edu_qual, age_group,
@@ -419,8 +426,8 @@ by the keys in our `epi_df`.
 In this first example, we'll use `flatline_forecaster` to make a simple
 prediction of the 2-year median income for the next year, based on one previous
 time point. This model is represented algebraically as:
-\[y_{t+1} = \alpha_0 + \alpha_0 y_{t}\]
-where $y_i$ is the 2-year median income (proportion) at time $i$.
+\[y_{t+1,ijk} = y_{tijk} + \epsilon_{tijk}\]
+where $y_{tijk}$ is the 2-year median income (proportion) at time $t$.
 
 ```{r flatline, include=T, warning=F}
 out_fl <- flatline_forecaster(employ_small, "med_income_2y_prop",

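The flatline idea itself is tiny: the point forecast carries the last observed value forward, and intervals can be built from quantiles of historical one-step changes. A hypothetical sketch of that idea (not epipredict's implementation):

```r
# Carry-forward point forecast plus a residual-quantile interval.
y <- c(0.61, 0.63, 0.60, 0.65, 0.64)
point <- y[length(y)] # y_{t+1} = y_t
steps <- diff(y)      # historical one-step changes
c(point = point, point + quantile(steps, c(0.05, 0.95)))
```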