Skip to content

Commit 0b797b0

Browse files
committed
wip vignette
1 parent bd47fa8 commit 0b797b0

File tree

1 file changed

+108
-30
lines changed

1 file changed

+108
-30
lines changed

vignettes/panel-data.Rmd

Lines changed: 108 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
---
2-
title: "Using epipredict on panel data from other contexts"
2+
title: "Using epipredict on non-epidemic panel data"
33
output: rmarkdown::html_vignette
44
vignette: >
5-
%\VignetteIndexEntry{Using epipredict on panel data from other contexts}
5+
%\VignetteIndexEntry{Using epipredict on non-epidemic panel data}
66
%\VignetteEngine{knitr::rmarkdown}
77
%\VignetteEncoding{UTF-8}
88
---
99

10-
```{r setup, include = F}
10+
```{r setup, include=F}
1111
knitr::opts_chunk$set(
1212
collapse = TRUE,
1313
comment = "#>",
@@ -16,23 +16,22 @@ knitr::opts_chunk$set(
1616
)
1717
```
1818

19-
```{r load-data, include = F}
20-
# TODO: temp - remove me when I figure out how to run this rmd without loading
21-
# perhaps need to commit statcan_employ_subset.rda first?
22-
devtools::load_all()
23-
```
24-
2519
```{r libraries}
2620
library(epiprocess)
2721
library(epipredict)
2822
library(dplyr)
2923
library(stringr)
3024
library(parsnip)
25+
library(recipes)
3126
```
3227

33-
[Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data, contains cross-sectional measurements of subjects over time. The `epipredict` package is most suitable for running forecasters on epidemiological data. However, other datasets with similar structures are also valid candidates for `epipredict` functionality.
28+
[Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data,
29+
contains cross-sectional measurements of subjects over time. The `epipredict`
30+
package is most suitable for running forecasters on epidemiological data.
31+
However, other datasets with similar structures are also valid candidates for
32+
`epipredict` functionality.
3433

35-
```{r employ-stats, include = F}
34+
```{r employ-stats, include=F}
3635
date_format <- "%B %Y"
3736
date_start <- format(as.Date(min(statcan_employ_subset$time_value)), date_format)
3837
date_end <- format(as.Date(max(statcan_employ_subset$time_value)), date_format)
@@ -41,9 +40,19 @@ uniq_employee_type <- paste(unique(statcan_employ_subset$employee_type), collaps
4140

4241
## Example panel data overview
4342

44-
In this vignette, we will demonstrate using `epipredict` with employment data from Statistics Canada. We will be using [Table 14-10-0220-01: Employment and average weekly earnings (including overtime) for all employees by industry, monthly, seasonally adjusted, Canada](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410022001#data). The full dataset contains monthly employment counts from `r date_start` to `r date_end`, and presents employment data stratified by geographic region, [NAICS industries](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1181553), and employee type. The full dataset also contains metadata that describes the quality of data collected. For demonstration purposes, we make the following modifications to get a subset of the full dataset:
45-
46-
* Only keep level 1 industries (2-digit codes) in the [NAICS hierarchy](https://www23.statcan.gc.ca/imdb/pUtil.pl?Function=getNote&Id=1181553&NT=45) and remove aggregated industry codes.
43+
In this vignette, we will demonstrate using `epipredict` with employment data from
44+
Statistics Canada. We will be using
45+
[Table 14-10-0220-01: Employment and average weekly earnings (including overtime) for all employees by industry, monthly, seasonally adjusted, Canada](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410022001#data).
46+
The full dataset contains monthly employment counts from `r date_start` to `r date_end`,
47+
and presents employment data stratified by geographic region,
48+
[NAICS industries](https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1181553),
49+
and employee type. The full dataset also contains metadata that describes the quality
50+
of data collected. For demonstration purposes, we make the following modifications to
51+
get a subset of the full dataset:
52+
53+
* Only keep level 1 industries (2-digit codes) in the
54+
[NAICS hierarchy](https://www23.statcan.gc.ca/imdb/pUtil.pl?Function=getNote&Id=1181553&NT=45)
55+
and remove aggregated industry codes.
4756
* Only keep province-level geographic region (the full data also has "Canada" as a region)
4857
* Only keep "good" or better quality data rows, as indicated by the [`STATUS`](https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol) column
4958

@@ -84,60 +93,129 @@ statcan_employ_subset_input <- statcan_employ %>%
8493
mutate(naics_industry = factor(naics_industry))
8594
```
8695

87-
To use this data with `epipredict`, we need to convert it into `epi_df` format using `as_epi_df` with additional keys. In our case, the additional keys are `employee_type` and `naics_industry`. Note that in the above modifications, we encoded `time_value` as type `tsibble::yearmonth`. This allows us to set `time_type` to `"yearmonth"` below, and to ensure lag and ahead modifications later on are using the correct time units.
96+
To use this data with `epipredict`, we need to convert it into `epi_df` format using
97+
`as_epi_df` with additional keys. In our case, the additional keys are `employee_type`
98+
and `naics_industry`. Note that in the above modifications, we encoded `time_value`
99+
as type `tsibble::yearmonth`. This allows us to set `time_type` to `"yearmonth"` below,
100+
and to ensure lag and ahead modifications later on are using the correct time units.
88101

89102
```{r convert-to-epidf, eval=F}
90103
statcan_employ_subset <- statcan_employ_subset_input %>%
91104
tsibble::as_tsibble(
92105
index=time_value,
93106
key=c(geo_value, employee_type, naics_industry)) %>%
94-
as_epi_df(time_type = "yearmonth", as_of="2022-07-22")
107+
as_epi_df(
108+
additional_metadata=c(other_keys=list("employee_type", "naics_industry")))
95109
```
96110

97111
```{r data-dim, include=F}
98112
employ_rowcount <- format(nrow(statcan_employ_subset), big.mark=",")
99113
employ_colcount <- length(names(statcan_employ_subset))
100114
```
101115

102-
The data contains `r employ_rowcount` rows and `r employ_colcount` columns. Now, we are ready to use `statcan_employ_subset` with `epipredict`.
116+
The data contains `r employ_rowcount` rows and `r employ_colcount` columns. Now, we are
117+
ready to use `statcan_employ_subset` with `epipredict`.
103118

104119
```{r preview-data, include=T}
105-
head(statcan_employ_subset)
120+
# Rename for simplicity
121+
employ <- statcan_employ_subset
122+
head(employ)
106123
```
107124

108-
In the following sections, we will go over preprocessing the data in the `epi_recipe` framework, fitting 3 types of models from the `parsnip` package, and making future predictions.
125+
In the following sections, we will go over preprocessing the data in the `epi_recipe`
126+
framework, fitting 3 types of models from the `parsnip` package, and making future
127+
predictions.
109128

110129
## Preprocessing
111130

112131
We will create a recipe that adds one `ahead` column and 3 `lag` columns.
113132

114-
```{r make-recipe, include = T}
115-
r <- epi_recipe(statcan_employ_subset) %>%
133+
```{r make-recipe, include=T}
134+
r <- epi_recipe(employ) %>%
116135
step_epi_ahead(ppl_count, ahead = 6) %>% # lag & ahead units in months
117136
step_epi_lag(ppl_count, lag = c(0, 6, 12)) %>%
118137
step_epi_naomit()
119138
r
120139
```
121140

122-
There is one `raw` role which includes our value column `ppl_count`, and two `key` roles which include our additional keys `employee_type` and `naics_industry`. Let's take a look at what these additional columns look like.
141+
There is one `raw` role which includes our value column `ppl_count`, and two `key`
142+
roles which include our additional keys `employee_type` and `naics_industry`. Let's
143+
take a look at what these additional columns look like.
123144

124-
```{r view-preprocessed, include = T}
125-
latest <- statcan_employ_subset %>% filter(time_value >= max(time_value) - 12)
145+
```{r view-preprocessed, include=T}
126146
# Display a sample of the preprocessed data
127-
r %>% prep() %>% bake(new_data = latest) %>% sample_n(5)
147+
baked_sample <- r %>% prep() %>% bake(new_data = employ) %>% sample_n(5)
148+
baked_sample
128149
```
129150

130151
## Model fitting and prediction
131152

132-
First we will look at a simple model: `parsnip::linear_reg()` with default engine `lm`. We can use `epi_workflow` with the above `epi_recipe` to fit a linear model using lags at time $t$ (current), $t-6$ months, and $t-12$ months.
133-
```{r linearreg-wf, include = T}
134-
wf_linreg <- epi_workflow(r, linear_reg()) %>% fit(statcan_employ_subset)
153+
### With direct forecasters
154+
155+
Even though we aren't working with epidemiological data, the `arx_epi_forecaster`
156+
still works out of the box. And specifying `args_list` still works in the same way.
157+
158+
```{r arx-epi, include=T, warning=F}
159+
demo_args_list = arx_args_list(lags = c(0L, 6L, 12L), ahead = 6L)
160+
161+
out_lr <- arx_epi_forecaster(
162+
employ,
163+
outcome = "ppl_count",
164+
predictors = c("ppl_count"),
165+
args_list = demo_args_list)
166+
167+
out_lr$predictions
168+
```
169+
170+
Other changes to the forecaster, like changing the engine, also work as expected.
171+
172+
```{r arx-epi-rf, include=T, warning=F}
173+
out_rf <- arx_epi_forecaster(
174+
employ,
175+
outcome = "ppl_count",
176+
predictors = c("ppl_count"),
177+
trainer = parsnip::rand_forest(mode="regression", trees=100),
178+
args_list = demo_args_list)
179+
180+
out_rf$predictions
181+
```
182+
183+
### Within recipes framework
184+
185+
First we will look at a simple model: `parsnip::linear_reg()` with default engine
186+
`lm`. We can use `epi_workflow` with the above `epi_recipe` to fit an autoregressive
187+
linear model using lags at time $t$ (current), $t-6$ months, and $t-12$ months.
188+
189+
```{r linearreg-wf, include=T}
190+
wf_linreg <- epi_workflow(r, parsnip::linear_reg()) %>% fit(employ)
135191
wf_linreg
192+
```
136193

194+
Now that we have our workflow, we can generate predictions from a subset of our data.
195+
For this demo, we will predict the employment counts from the last 12 months of our dataset.
196+
197+
```{r linearreg-predict, include=T}
198+
latest <- employ %>% filter(time_value >= max(time_value) - 15)
137199
preds <- stats::predict(wf_linreg, latest) %>% filter(!is.na(.pred))
138200
# Display a sample of the prediction values
139201
preds %>% sample_n(5)
140202
```
141203

142-
<!-- TODO: ggplot the above predictions against the actual values -->
143-
<!-- TODO: fit more complex AR model -->
204+
Notice that `predict` returns an `epi_df` with all of the keys that were present in the original dataset.
205+
206+
```{r plot-pred, include=F}
207+
# library(ggplot2)
208+
# joined <- full_join(statcan_employ_subset, preds, by=c("geo_value", "time_value", "employee_type", "naics_industry"))
209+
210+
# joined %>% filter(!is.na(.pred)) %>% select(time_value) %>% unique()
211+
# joined %>% dplyr::filter(
212+
# geo_value %in% c("British Columbia", "Ontario") &
213+
# naics_industry == "Real estate and rental and leasing [53]" &
214+
# employee_type == "All employees") %>%
215+
# ggplot() +
216+
# geom_line(aes(x = time_value, y = ppl_count)) +
217+
# geom_line(aes(x = time_value, y = preds)) +
218+
# facet_wrap(vars(geo_value), scales = "free_y", ncol = 1) +
219+
# scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
220+
# labs(x = "Date", y = "Number employed")
221+
```

0 commit comments

Comments
 (0)