wording changes

mgyliu · mgyliu · commit 51471fa46d93 · 2022-09-13T20:09:48.000-07:00
diff --git a/vignettes/panel-data.Rmd b/vignettes/panel-data.Rmd
@@ -20,13 +20,12 @@ knitr::opts_chunk$set(
 library(epiprocess)
 library(epipredict)
 library(dplyr)
-library(stringr)
 library(parsnip)
 library(recipes)
 ```
 
 [Panel data](https://en.wikipedia.org/wiki/Panel_data), or longitudinal data, 
-contains cross-sectional measurements of subjects over time. The `epipredict` 
+contain cross-sectional measurements of subjects over time. The `epipredict` 
 package is most suitable for running forecasters on epidemiological panel data. 
 A built-in example of this is the [`case_death_rate_subset`](
   https://cmu-delphi.github.io/epipredict/reference/case_death_rate_subset.html) 
@@ -41,8 +40,8 @@ head(case_death_rate_subset)
   https://cmu-delphi.github.io/epiprocess/reference/epi_df.html) 
 format. Despite the stated goal and name of the package, other panel datasets 
 are also valid candidates for `epipredict` functionality. Specifically, the 
-`epipredict` framework and direct forecasters are able to work with any panel 
-data, as long as it's in `epi_df` format, 
+`epipredict` framework and direct forecasters can work with any panel data, as 
+long as it's in `epi_df` format.
 
 ```{r employ-stats, include=F}
 year_start <- min(grad_employ_subset$time_value)
@@ -59,8 +58,9 @@ from Statistics Canada. We will be using
   graduation, by educational qualification and field of study (primary 
   groupings)
 ](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3710011501). 
+
 The full dataset contains yearly median employment income two and five years 
-after graduation, and number of graduates. The data is further stratified by 
+after graduation, and number of graduates. The data is stratified by 
 variables such as geographic region (Canadian province), field of study, and 
 age group. The year range of the dataset is `r year_start` to `r year_end`, 
 inclusive. The full dataset also contains metadata that describes the 
@@ -80,8 +80,11 @@ just described:
 ```{r employ-query, eval=F}
 library(cansim)
 
+# Get the original dataset from Statistics Canada
 statcan_grad_employ <- get_cansim("37-10-0115-01")
+
 gemploy <- statcan_grad_employ %>% 
+  # Drop some columns and rename the ones we keep
   select(c("REF_DATE", "GEO", "VALUE", "STATUS", "Educational qualification", 
     "Field of study", "Gender", "Age group", "Status of student in Canada", 
     "Characteristics after graduation", "Graduate statistics")) %>%
@@ -101,11 +104,13 @@ gemploy <- statcan_grad_employ %>%
   # `Graduate statistics` in the original data. Below we pivot the data 
   # wider so that each unique statistic can have its own column.
   mutate(
+    # Recode for easier pivoting
     grad_stat = recode_factor(
       grad_stat, 
       `Number of graduates` = "num_graduates", 
       `Median employment income two years after graduation` = "med_income_2y",
       `Median employment income five years after graduation` = "med_income_5y"),
+    # Years are originally strings; convert to ints for the later epi_df conversion
     time_value = as.integer(time_value)
   ) %>%
   pivot_wider(names_from = grad_stat, values_from = value) %>%
@@ -137,7 +142,11 @@ using [`as_epi_df`](
 with additional keys. In our case, the additional keys are `age_group`, `fos` 
 and `edu_qual`. Note that in the above modifications, we encoded `time_value` 
 as type `integer`. This allows us to set `time_type` to `"year"`, and to ensure 
-lag and ahead modifications later on are using the correct time units.
+lag and ahead modifications later on are using the correct time units. See the
+[`epi_df` documentation](
+  https://cmu-delphi.github.io/epiprocess/reference/epi_df.html#time-types) for 
+a list of all the `time_type`s available.
+
 
 ```{r convert-to-epidf, eval=F}
 grad_employ_subset <- gemploy %>%
@@ -235,24 +244,25 @@ out_fl <- flatline_forecaster(employ, "med_income_2y",
 augment(out_fl$epi_workflow, employ)
 ```
 
-```{r arx, include=T}
+```{r arx-lr, include=T}
 arx_args <- arx_args_list(
-    lags = c(0L, 1L, 2L), ahead = 1L, forecast_date = as.Date("2022-08-01"))
-out_arx <- arx_forecaster(employ, "med_income_2y", 
+    lags = c(0L, 1L), ahead = 1L, forecast_date = as.Date("2022-08-01"))
+
+out_arx_lr <- arx_forecaster(employ, "med_income_2y", 
   c("med_income_2y", "med_income_5y", "num_graduates"), 
   args_list = arx_args)
 
-out_arx$predictions 
+out_arx_lr$predictions 
 ```
 
 Other changes to the direct AR forecaster, like changing the engine, also work 
 as expected.
 
-```{r arx-epi-rf, include=F, warning=F}
-out_rf <- arx_forecaster(
-  employ, "med_income_2y", c("med_income_2y", "med_income_5y"), 
-  trainer = parsnip::rand_forest(mode="regression", trees=100),
-  args_list = args)
+```{r arx-rf, include=F, warning=F}
+out_arx_rf <- arx_forecaster(
+  employ, "med_income_2y", c("med_income_2y", "med_income_5y", "num_graduates"), 
+  trainer = parsnip::boost_tree(mode = "regression", trees = 20),
+  args_list = arx_args)
 
-out_rf$predictions 
+out_arx_rf$predictions 
 ```