From 0abdac97b0d8afe1db1dcb598ac1004f1bc57ca8 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 14:05:10 -0700 Subject: [PATCH 01/39] Add arx classifier vignette --- vignettes/arx-classifier-vignette.Rmd | 160 ++++++++++++++++++++++++++ 1 file changed, 160 insertions(+) create mode 100644 vignettes/arx-classifier-vignette.Rmd diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd new file mode 100644 index 000000000..ee64db0bc --- /dev/null +++ b/vignettes/arx-classifier-vignette.Rmd @@ -0,0 +1,160 @@ +--- +title: "arx_classifier" +--- + +```{r setup, include = FALSE} +knitr::opts_chunk$set( + echo = TRUE, + collapse = FALSE, + comment = "#>", + warning = FALSE, + message = FALSE, + cache = TRUE +) +``` + +## Load required packages + +```{r, message = FALSE, warning = FALSE} +library(tidyverse) +library(epipredict) +``` + +## Introducing the ARX classifier + +The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. + +To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing, using the case rates from 0, 7, and 14 days earlier as predictors:
+
+```{r}
+jhu <- case_death_rate_subset %>%
+  dplyr::filter(
+    time_value >= "2021-06-04",
+    time_value <= "2021-12-31",
+    geo_value %in% c("ca", "fl", "tx", "ny", "nj")
+  )
+
+out <- arx_classifier(jhu,
+  outcome = "case_rate",
+  predictors = "case_rate"
+)
+
+out$predictions
+```
+
+The key takeaway from the predictions is that there are two prediction classes: (-Inf, 0.25] and (0.25, Inf). This is because classification requires a discrete outcome. The discretization of the real-valued outcome is controlled by the `breaks` argument, which defaults to 0.25. Breaks are automatically extended to cover the entire real line; for example, the default break of 0.25 is silently extended to `breaks = c(-Inf, .25, Inf)` and, therefore, results in the two classes (-Inf, 0.25] and (0.25, Inf). The conversion of the outcome to these classes is handled internally. So if discrete classes already exist for the outcome in the `epi_df`, then we recommend coding a classifier from scratch using the `epi_workflow` framework for more control.
+
+The `trainer` is a `parsnip` model describing the type of estimation such that `mode = "classification"` is enforced. The two typical trainers are `parsnip::logistic_reg()` for two classes and `parsnip::multinom_reg()` for more than two classes.
+
+```{r}
+workflows::extract_spec_parsnip(out$epi_workflow)
+```
+
+From the parsnip model specification, we can see that the trainer used is logistic regression, which is expected for our binary outcome. More complicated trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be used (however, we will stick to the basics in this gentle introduction to the classifier). 
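+
+Since the `trainer` accepts any classification-mode `parsnip` specification, a different trainer could in principle be swapped in. As a sketch (not run here; it assumes the default engine for `parsnip::rand_forest()`, `ranger`, is installed if you evaluate it yourself):
+
+```{r, eval = FALSE}
+out_rf <- arx_classifier(jhu,
+  outcome = "case_rate",
+  predictors = "case_rate",
+  trainer = parsnip::rand_forest(mode = "classification")
+)
+```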
+
+If you use the default trainer of logistic regression for binary classification and you decide against the default break of 0.25, then you should input only one break so that there are two classification bins that properly dichotomize the outcome. For example, let's set a break of 0.5 instead of relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` argument in `arx_class_args_list()` as follows:
+
+```{r}
+out_break_0.5 <- arx_classifier(jhu,
+  outcome = "case_rate",
+  predictors = "case_rate",
+  args_list = arx_class_args_list(
+    breaks = 0.5
+  )
+)
+
+out_break_0.5$predictions
+```
+
+Indeed, we can observe that the two `.pred_class` values are now (-Inf, 0.5] and (0.5, Inf).
+
+```{r}
+?arx_class_args_list
+```
+
+Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These default to lags of 0, 7, and 14 days for the predictors and an ahead of 7 days past the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound on the number of training rows per key, as well as `forecast_date` and `target_date` to specify the date on which the forecast is created and the date for which it is intended, respectively. We will not dwell on these arguments here, as they are not unique to this classifier and not essential to understanding how it operates. Other arguments will be discussed as they become relevant; see the function documentation for a complete list and definitions.
+
+## Example of using the ARX classifier
+
+Now, to demonstrate the power and utility of this built-in ARX classifier, we will adapt the classification example that was written from scratch in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
However, to keep things simple and avoid a merely direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader.
+
+To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges of an outcome, as in hotspot prediction models that anticipate the direction of change (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is: do we expect case rates in the near future to be higher than they are at present?
+
+To answer this question, we can create a predictive model for upsurges and downsurges of case rates rather than for the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes
+
+$$\begin{align}
+Z_{l,t} = \left\{\begin{matrix}
+\text{up,} & \text{if } Y_{l,t}^\Delta > 0.25\\
+\text{not up,} & \text{otherwise}
+\end{matrix}\right.
+\end{align}$$
+where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta > 0.25$, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$. 
On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a greater than 25\% increase in the new cases over the past week.
+
+The logistic regression model we use to predict this binary response can be considered a simplification of the multinomial regression model presented in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html):
+
+$$\begin{align}
+\pi_{\text{up}}(x) &= \Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\
+\pi_{\text{not up}}(x) &= \Pr(Z_{l, t} = \text{not up}|x) = 1 - \Pr(Z_{l, t} = \text{up}|x) = \frac{1}{1 + e^{g_{\text{up}}(x)}}
+\end{align}$$
+where
+
+$$
+g_{\text{up}}(x) = \log\left( \frac{\Pr(Z_{l, t} = \text{up} \vert x)}{\Pr(Z_{l, t} = \text{not up} \vert x)} \right) = \beta_{10} + \beta_{11}t + \delta_{10}s_{\text{state}_1} + \delta_{11}s_{\text{state}_2} + \cdots + \beta_{12}Y_{l,t}^\Delta + \beta_{13}Y_{l,t-7}^\Delta + \beta_{14}Y_{l,t-14}^\Delta.
+$$
+
+We will now operate on the same subset of the `case_death_rate_subset` that we used in the example above. This time, we will use it to investigate whether the number of newly reported cases over the past 7 days has increased by more than 25\% compared to the preceding week for our sample of states.
+
+Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
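+
+To make the target definition concrete before handing the work to `arx_classifier()`, the relative change and its dichotomization can be checked by hand on a toy series (hypothetical numbers, base R only):
+
+```{r}
+y <- c(10, 11, 12, 13, 15, 18, 22, 30) # toy daily case rates for one location
+t <- length(y)
+y_delta <- (y[t] - y[t - 7]) / y[t - 7] # relative change over one week: (30 - 10) / 10 = 2
+ifelse(y_delta > 0.25, "up", "not up") # a 200% increase, so "up"
+```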
+
+```{r}
+log_res <- arx_classifier(
+  jhu,
+  outcome = "case_rate",
+  predictors = "case_rate",
+  args_list = arx_class_args_list(
+    breaks = 0.25 / 7 # division by 7 puts the weekly 25% break on the daily scale
+  )
+)
+
+workflows::extract_preprocessor(log_res$epi_workflow)
+```
+
+Comparing the pre-processing steps here to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials: transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`) and leading the response (`step_epi_ahead()`), both of which are constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`).
+
+On this topic, it is important to understand that we are not actually concerned with the case values themselves. Rather, we are concerned with whether the quantity of cases in the future is much larger than that in the present. For this reason, the outcome does not remain as cases; it is transformed using either growth rates (as both the predictors and outcome are in our example) or lagged differences. The latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md) and is conceptually easy to understand, since it is simply the change in the value over the horizon, but it is not the default. The default is `growth_rate`. One reason for this choice is that the growth rate is on a relative rather than an absolute scale, so it fosters comparability across locations without any conscious effort (when using `lag_difference`, on the other hand, one would need to take care to operate on rates per 100k rather than raw counts). We use `epiprocess::growth_rate()` to create the outcome, along with some of its additional arguments. 
One important argument for the growth rate calculation is the `method`. Only `rel_change`, the relative change, should be used as the method, because at prediction time the test data is the only data that is accessible, while the other methods require access to the training data.
+
+The other optional arguments for controlling the growth rate calculation (which can be passed in as `additional_gr_args`) can be found in the documentation for [`epiprocess::growth_rate()`](https://cmu-delphi.github.io/epiprocess/reference/growth_rate.html) and the related [vignette](https://cmu-delphi.github.io/epiprocess/articles/growth_rate.html).
+
+### Visualizing the results
+
+To visualize the prediction classes across states and target dates, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, such as our 7-day-ahead predictions, the "heatmap" would look more like a bar chart. So instead, let's get predictions for several aheads and plot a heatmap across the target dates. To get predictions across several ahead values, we will use `map()` in the same way that we did in other vignettes:
+
+```{r}
+multi_log_res <- map(1:14, ~ arx_classifier(
+  jhu,
+  outcome = "case_rate",
+  predictors = "case_rate",
+  args_list = arx_class_args_list(
+    breaks = 0.25 / 7,
+    ahead = .x,
+    horizon = .x
+  ))$predictions) %>% list_rbind()
+```
+
+Notice that `ahead` and `horizon` are what we change as we `map()`. The reason we must also change `horizon` is that it is passed to the `h` argument of `epiprocess::growth_rate()` and determines the amount of data used to calculate the growth rate. So, for nearly all intents and purposes, it should be the same as `ahead`. 
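+
+Before plotting, a quick tabulation can confirm that both classes actually occur among the state-days (a simple check using `dplyr::count()` on the `.pred_class` column):
+
+```{r}
+multi_log_res %>% count(.pred_class)
+```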
+
+We can plot the heatmap of the results over the aheads to see if there's anything novel or interesting:
+
+```{r}
+ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
+  geom_tile() +
+  ylab("State") +
+  xlab("Target date")
+```
+
+While there is a bit of variability early on, we can clearly see upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened.
+
+## A brief reflection
+
+The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 lines down to 3. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. The good thing is that `epipredict` provides both: a built-in `arx_classifier()` and the means to implement your own classifier from scratch using the `epi_workflow` framework. Which you choose depends on the circumstances. Our advice is to start with the built-in classifier for relatively simple projects and to implement your own when the modelling project increases in complexity. To get some practice coding up a classifier by hand, consider taking this opportunity to translate this binary classification model example to an `epi_workflow`, akin to the one in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
+

From e090eb8b2aeaf138377cca80ef254a9427aecc4c Mon Sep 17 00:00:00 2001
From: rachlobay <42976509+rachlobay@users.noreply.github.com>
Date: Mon, 8 Apr 2024 14:05:50 -0700
Subject: [PATCH 02/39] Remove

---
 vignettes/arx-classifier-vignette.Rmd | 160 --------------------------
 1 file changed, 160 deletions(-)
 delete mode 100644 vignettes/arx-classifier-vignette.Rmd

diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd
deleted file mode 100644
index ee64db0bc..000000000
--- a/vignettes/arx-classifier-vignette.Rmd
+++ /dev/null
@@ -1,160 +0,0 @@
----
-title: "arx_classifier"
----
-
-```{r setup, include = FALSE}
-knitr::opts_chunk$set(
-  echo = TRUE,
-  collapse = FALSE,
-  comment = "#>",
-  warning = FALSE,
-  message = FALSE,
-  cache = TRUE
-)
-```
-
-## Load required packages
-
-```{r, message = FALSE, warning = FALSE}
-library(tidyverse)
-library(epipredict)
-```
-
-## Introducing the ARX classifier
-
-The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value.
-
-To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing, using the case rates from 0, 7, and 14 days earlier as predictors:
-
-```{r}
-jhu <- case_death_rate_subset %>%
-  dplyr::filter(
-    time_value >= "2021-06-04",
-    time_value <= "2021-12-31",
-    geo_value %in% c("ca", "fl", "tx", "ny", "nj")
-  )
-
-out <- arx_classifier(jhu,
-  outcome = "case_rate",
-  predictors = "case_rate"
-)
-
-out$predictions
-```
-
-The key takeaway from the predictions is that there are two prediction classes: (-Inf, 0.25] and (0.25, Inf). This is because classification requires a discrete outcome. The discretization of the real-valued outcome is controlled by the `breaks` argument, which defaults to 0.25. Breaks are automatically extended to cover the entire real line; for example, the default break of 0.25 is silently extended to `breaks = c(-Inf, .25, Inf)` and, therefore, results in the two classes (-Inf, 0.25] and (0.25, Inf). The conversion of the outcome to these classes is handled internally. So if discrete classes already exist for the outcome in the `epi_df`, then we recommend coding a classifier from scratch using the `epi_workflow` framework for more control.
-
-The `trainer` is a `parsnip` model describing the type of estimation such that `mode = "classification"` is enforced. The two typical trainers are `parsnip::logistic_reg()` for two classes and `parsnip::multinom_reg()` for more than two classes.
-
-```{r}
-workflows::extract_spec_parsnip(out$epi_workflow)
-```
-
-From the parsnip model specification, we can see that the trainer used is logistic regression, which is expected for our binary outcome. More complicated trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be used (however, we will stick to the basics in this gentle introduction to the classifier). 
-
-If you use the default trainer of logistic regression for binary classification and you decide against the default break of 0.25, then you should input only one break so that there are two classification bins that properly dichotomize the outcome. For example, let's set a break of 0.5 instead of relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` argument in `arx_class_args_list()` as follows:
-
-```{r}
-out_break_0.5 <- arx_classifier(jhu,
-  outcome = "case_rate",
-  predictors = "case_rate",
-  args_list = arx_class_args_list(
-    breaks = 0.5
-  )
-)
-
-out_break_0.5$predictions
-```
-
-Indeed, we can observe that the two `.pred_class` values are now (-Inf, 0.5] and (0.5, Inf).
-
-```{r}
-?arx_class_args_list
-```
-
-Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These default to lags of 0, 7, and 14 days for the predictors and an ahead of 7 days past the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound on the number of training rows per key, as well as `forecast_date` and `target_date` to specify the date on which the forecast is created and the date for which it is intended, respectively. We will not dwell on these arguments here, as they are not unique to this classifier and not essential to understanding how it operates. Other arguments will be discussed as they become relevant; see the function documentation for a complete list and definitions.
-
-## Example of using the ARX classifier
-
-Now, to demonstrate the power and utility of this built-in ARX classifier, we will adapt the classification example that was written from scratch in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
However, to keep things simple and avoid a merely direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader.
-
-To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges of an outcome, as in hotspot prediction models that anticipate the direction of change (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is: do we expect case rates in the near future to be higher than they are at present?
-
-To answer this question, we can create a predictive model for upsurges and downsurges of case rates rather than for the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes
-
-$$\begin{align}
-Z_{l,t} = \left\{\begin{matrix}
-\text{up,} & \text{if } Y_{l,t}^\Delta > 0.25\\
-\text{not up,} & \text{otherwise}
-\end{matrix}\right.
-\end{align}$$
-where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta > 0.25$, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$. 
On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a greater than 25\% increase in the new cases over the past week.
-
-The logistic regression model we use to predict this binary response can be considered a simplification of the multinomial regression model presented in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html):
-
-$$\begin{align}
-\pi_{\text{up}}(x) &= \Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\
-\pi_{\text{not up}}(x) &= \Pr(Z_{l, t} = \text{not up}|x) = 1 - \Pr(Z_{l, t} = \text{up}|x) = \frac{1}{1 + e^{g_{\text{up}}(x)}}
-\end{align}$$
-where
-
-$$
-g_{\text{up}}(x) = \log\left( \frac{\Pr(Z_{l, t} = \text{up} \vert x)}{\Pr(Z_{l, t} = \text{not up} \vert x)} \right) = \beta_{10} + \beta_{11}t + \delta_{10}s_{\text{state}_1} + \delta_{11}s_{\text{state}_2} + \cdots + \beta_{12}Y_{l,t}^\Delta + \beta_{13}Y_{l,t-7}^\Delta + \beta_{14}Y_{l,t-14}^\Delta.
-$$
-
-We will now operate on the same subset of the `case_death_rate_subset` that we used in the example above. This time, we will use it to investigate whether the number of newly reported cases over the past 7 days has increased by more than 25\% compared to the preceding week for our sample of states.
-
-Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
-
-```{r}
-log_res <- arx_classifier(
-  jhu,
-  outcome = "case_rate",
-  predictors = "case_rate",
-  args_list = arx_class_args_list(
-    breaks = 0.25 / 7 # division by 7 puts the weekly 25% break on the daily scale
-  )
-)
-
-workflows::extract_preprocessor(log_res$epi_workflow)
-```
-
-Comparing the pre-processing steps here to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials: transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`) and leading the response (`step_epi_ahead()`), both of which are constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`).
-
-On this topic, it is important to understand that we are not actually concerned with the case values themselves. Rather, we are concerned with whether the quantity of cases in the future is much larger than that in the present. For this reason, the outcome does not remain as cases; it is transformed using either growth rates (as both the predictors and outcome are in our example) or lagged differences. The latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md) and is conceptually easy to understand, since it is simply the change in the value over the horizon, but it is not the default. The default is `growth_rate`. One reason for this choice is that the growth rate is on a relative rather than an absolute scale, so it fosters comparability across locations without any conscious effort (when using `lag_difference`, on the other hand, one would need to take care to operate on rates per 100k rather than raw counts). We use `epiprocess::growth_rate()` to create the outcome, along with some of its additional arguments. 
One important argument for the growth rate calculation is the `method`. Only `rel_change`, the relative change, should be used as the method, because at prediction time the test data is the only data that is accessible, while the other methods require access to the training data.
-
-The other optional arguments for controlling the growth rate calculation (which can be passed in as `additional_gr_args`) can be found in the documentation for [`epiprocess::growth_rate()`](https://cmu-delphi.github.io/epiprocess/reference/growth_rate.html) and the related [vignette](https://cmu-delphi.github.io/epiprocess/articles/growth_rate.html).
-
-### Visualizing the results
-
-To visualize the prediction classes across states and target dates, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, such as our 7-day-ahead predictions, the "heatmap" would look more like a bar chart. So instead, let's get predictions for several aheads and plot a heatmap across the target dates. To get predictions across several ahead values, we will use `map()` in the same way that we did in other vignettes:
-
-```{r}
-multi_log_res <- map(1:14, ~ arx_classifier(
-  jhu,
-  outcome = "case_rate",
-  predictors = "case_rate",
-  args_list = arx_class_args_list(
-    breaks = 0.25 / 7,
-    ahead = .x,
-    horizon = .x
-  ))$predictions) %>% list_rbind()
-```
-
-Notice that `ahead` and `horizon` are what we change as we `map()`. The reason we must also change `horizon` is that it is passed to the `h` argument of `epiprocess::growth_rate()` and determines the amount of data used to calculate the growth rate. So, for nearly all intents and purposes, it should be the same as `ahead`. 
-
-We can plot the heatmap of the results over the aheads to see if there's anything novel or interesting:
-
-```{r}
-ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
-  geom_tile() +
-  ylab("State") +
-  xlab("Target date")
-```
-
-While there is a bit of variability early on, we can clearly see upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened.
-
-## A brief reflection
-
-The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 lines down to 3. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. The good thing is that `epipredict` provides both: a built-in `arx_classifier()` and the means to implement your own classifier from scratch using the `epi_workflow` framework. Which you choose depends on the circumstances. Our advice is to start with the built-in classifier for relatively simple projects and to implement your own when the modelling project increases in complexity. To get some practice coding up a classifier by hand, consider taking this opportunity to translate this binary classification model example to an `epi_workflow`, akin to the one in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). 
-

From e7ec83651713feb1eb9d5e8ccbf543e8b9f92034 Mon Sep 17 00:00:00 2001
From: rachlobay <42976509+rachlobay@users.noreply.github.com>
Date: Mon, 8 Apr 2024 14:06:30 -0700
Subject: [PATCH 03/39] Try adding vignette again

---
 vignettes/arx-classifier-vignette.Rmd | 160 ++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 vignettes/arx-classifier-vignette.Rmd

diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd
new file mode 100644
index 000000000..ee64db0bc
--- /dev/null
+++ b/vignettes/arx-classifier-vignette.Rmd
@@ -0,0 +1,160 @@
+---
+title: "arx_classifier"
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  echo = TRUE,
+  collapse = FALSE,
+  comment = "#>",
+  warning = FALSE,
+  message = FALSE,
+  cache = TRUE
+)
+```
+
+## Load required packages
+
+```{r, message = FALSE, warning = FALSE}
+library(tidyverse)
+library(epipredict)
+```
+
+## Introducing the ARX classifier
+
+The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value.
+
+To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing, using the case rates from 0, 7, and 14 days earlier as predictors:
+
+```{r}
+jhu <- case_death_rate_subset %>%
+  dplyr::filter(
+    time_value >= "2021-06-04",
+    time_value <= "2021-12-31",
+    geo_value %in% c("ca", "fl", "tx", "ny", "nj")
+  )
+
+out <- arx_classifier(jhu,
+  outcome = "case_rate",
+  predictors = "case_rate"
+)
+
+out$predictions
+```
+
+The key takeaway from the predictions is that there are two prediction classes: (-Inf, 0.25] and (0.25, Inf). This is because classification requires a discrete outcome. The discretization of the real-valued outcome is controlled by the `breaks` argument, which defaults to 0.25. Breaks are automatically extended to cover the entire real line; for example, the default break of 0.25 is silently extended to `breaks = c(-Inf, .25, Inf)` and, therefore, results in the two classes (-Inf, 0.25] and (0.25, Inf). The conversion of the outcome to these classes is handled internally. So if discrete classes already exist for the outcome in the `epi_df`, then we recommend coding a classifier from scratch using the `epi_workflow` framework for more control.
+
+The `trainer` is a `parsnip` model describing the type of estimation such that `mode = "classification"` is enforced. The two typical trainers are `parsnip::logistic_reg()` for two classes and `parsnip::multinom_reg()` for more than two classes.
+
+```{r}
+workflows::extract_spec_parsnip(out$epi_workflow)
+```
+
+From the parsnip model specification, we can see that the trainer used is logistic regression, which is expected for our binary outcome. More complicated trainers like `parsnip::naive_Bayes()` or `parsnip::rand_forest()` may also be used (however, we will stick to the basics in this gentle introduction to the classifier). 
If you use the default trainer of logistic regression for binary classification and you decide against using the default break of 0.25, then you should only input one break so that there are two classification bins to properly dichotomize the outcome. For example, let's set a break of 0.5 instead of relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` argument in `arx_class_args_list()` as follows:

```{r}
out_break_0.5 <- arx_classifier(jhu,
  outcome = "case_rate",
  predictors = "case_rate",
  args_list = arx_class_args_list(
    breaks = 0.5
  )
)

out_break_0.5$predictions
```

Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, Inf).

```{r}
?arx_class_args_list
```

Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These have default values of 0, 7, and 14 days for the lags of the predictors and 7 days ahead of the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound for the number of training rows per key, as well as `forecast_date` and `target_date` to specify the date on which the forecast is created and the date for which the prediction is intended, respectively. We will not dwell on these arguments here as they are not unique to this classifier or absolutely essential to understanding how it operates. The remaining arguments will be discussed organically, as they are needed to serve our purposes. For information on any remaining arguments that are not discussed here, please see the function documentation for a complete list and their definitions.

## Example of using the ARX classifier

Now, to demonstrate the power and utility of this built-in ARX classifier, we will adapt the classification example that was written from scratch in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html).
However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader.

To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present? #%% Ok to say relative to the present? Also, should I be more precise here and say relative change or best to keep it simple?

To answer this question, we can create a predictive model for upsurges and downsurges of case rates rather than the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes

$$\begin{align}
Z_{l,t} = \left\{\begin{matrix}
\text{up,} & \text{if } Y_{l,t}^\Delta > 0.25\\
\text{not up,} & \text{otherwise}
\end{matrix}\right.
\end{align}$$
where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta > 0.25$, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$.
On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a more than 25\% increase in the new cases over the past week.

The logistic regression model we use to predict this binary response can be considered a simplification of the multinomial regression model presented in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html):

$$\begin{align}
\pi_{\text{up}}(x) &= \Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\
\pi_{\text{not up}}(x) &= \Pr(Z_{l, t} = \text{not up}|x) = 1 - \Pr(Z_{l, t} = \text{up}|x) = \frac{1}{1 + e^{g_{\text{up}}(x)}}
\end{align}$$
where

$$
g_{\text{up}}(x) = \log\left( \frac{\Pr(Z_{l, t} = \text{up} \vert x)}{\Pr(Z_{l, t} = \text{not up} \vert x)} \right) = \beta_{10} + \beta_{11}t + \delta_{10}s_{\text{state}_1} + \delta_{11}s_{\text{state}_2} + \cdots + \beta_{12}Y_{l,t}^\Delta + \beta_{13}Y_{l,t-7}^\Delta + \beta_{14}Y_{l,t-14}^\Delta.
$$

Now then, we will operate on the same subset of the `case_death_rate_subset` that we used in our above example. This time, we will use it to investigate whether the number of newly reported cases over the past 7 days has increased by at least 25\% compared to the preceding week for our sample of states.

Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html).
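Before looking at the built-in pipeline, the hotspot rule above can be sanity-checked on toy numbers (the case-rate values below are hypothetical, not taken from the data):

```{r}
# Hypothetical case rates for one location:
y_lag7 <- 40 # case rate at time t - 7
y_t <- 52 # case rate at time t

y_delta <- (y_t - y_lag7) / y_lag7 # relative weekly change: 0.3
if (y_delta > 0.25) "up" else "not up" # a 30% increase, so the class is "up"
```

A 30\% weekly increase exceeds the 0.25 threshold, so this location would be flagged as a hotspot at time $t$.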
```{r}
log_res <- arx_classifier(
  jhu,
  outcome = "case_rate",
  predictors = "case_rate",
  args_list = arx_class_args_list(
    breaks = 0.25 / 7 # division by 7 gives weekly not daily
  )
)

workflows::extract_preprocessor(log_res$epi_workflow)
```

Comparing the pre-processing steps for this to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials: transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`), leading the response (`step_epi_ahead()`), both of which are constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`).

On this topic, it is important to understand that we are not actually concerned about the case values themselves. Rather, we are concerned with whether the quantity of cases in the future is a lot larger than that in the present. For this reason, the outcome does not remain as cases, but rather it is transformed by using either growth rates (as the predictors and outcome in our example are) or lagged differences. While the latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), and it is conceptually easy to understand because it is simply the change of the value for the horizon, it is not the default. The default is `growth_rate`. One reason for this choice is that the growth rate is on a rate scale, not on the absolute scale, so it fosters comparability across locations without any conscious effort (on the other hand, when using `lag_difference` one would need to take care to operate on rates per 100k and not raw counts). We utilize `epiprocess::growth_rate()` to create the outcome using some of the additional arguments.
One important argument for the growth rate calculation is the `method`. Only `rel_change`, for relative change, should be used as the method because the test data is the only data that is accessible and the other methods require access to the training data. #%% Last sentence correct?

The other optional arguments for controlling the growth rate calculation (which can be supplied as `additional_gr_args`) can be found in the documentation for [`epiprocess::growth_rate()`](https://cmu-delphi.github.io/epiprocess/reference/growth_rate.html) and the related [vignette](https://cmu-delphi.github.io/epiprocess/articles/growth_rate.html).

### Visualizing the results

To visualize the prediction classes across the states for the target date, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, like our 7-day ahead predictions, then that would be a pretty sad heatmap (which would look more like a bar chart than a heatmap)... So instead of doing that, let's get predictions for several aheads and plot a heatmap across the target dates. To get the predictions across several ahead values, we will use the `map()` function in the same way that we did in other vignettes:

```{r}
multi_log_res <- map(1:14, ~ arx_classifier(
  jhu,
  outcome = "case_rate",
  predictors = "case_rate",
  args_list = arx_class_args_list(
    breaks = 0.25 / 7,
    ahead = .x,
    horizon = .x
  ))$predictions) %>% list_rbind()
```

Notice that the ahead and horizon are what we are changing when we use `map()`. The reason we must also change `horizon` is that it is passed to the `h` argument of `epiprocess::growth_rate()` and determines the amount of data used to calculate the growth rate. So, for nearly all intents and purposes, it should be the same as `ahead`.
We can plot the heatmap of the results over the aheads to see if there's anything novel or interesting:

```{r}
ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
  geom_tile() +
  ylab("State") +
  xlab("Target date")
```

While there is a bit of variability near the beginning, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened at the beginning of January 2022.

## A brief reflection

The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 lines down to 3. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both: a built-in `arx_classifier()` and the means to implement your own classifier from scratch by using the `epi_workflow` framework. Which you choose depends on the circumstances. Our advice is to start with the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider taking this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html).
#%% takes a complex turn, increases in complexity + From d974e43e6c43384c7e9c7df5eca08c595104abbd Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 14:24:41 -0700 Subject: [PATCH 04/39] Remove space --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index ee64db0bc..21892f087 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -24,7 +24,7 @@ library(epipredict) The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. -To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: +To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: ```{r} jhu <- case_death_rate_subset %>% From aeef3c31756b1ef799019e522defc8d02328d402 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 14:36:49 -0700 Subject: [PATCH 05/39] Spacing --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 21892f087..ee64db0bc 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -24,7 +24,7 @@ library(epipredict) The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. -To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: +To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: ```{r} jhu <- case_death_rate_subset %>% From 3df82b21a96cd5102e5a9be9079ecc0698bd9a5e Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 14:41:21 -0700 Subject: [PATCH 06/39] Try to get rid of error --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index ee64db0bc..21892f087 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -24,7 +24,7 @@ library(epipredict) The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. -To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: +To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: ```{r} jhu <- case_death_rate_subset %>% From e4d8fb6db47edf65cf078c612d041f3137d4a29f Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 14:49:59 -0700 Subject: [PATCH 07/39] Space --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 21892f087..001823fc2 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -31,7 +31,7 @@ jhu <- case_death_rate_subset %>% dplyr::filter( time_value >= "2021-06-04", time_value <= "2021-12-31", - geo_value %in% c("ca", "fl", "tx", "ny", "nj") + geo_value %in% c("ca", "fl", "tx", "ny", "nj") ) out <- arx_classifier(jhu, From 29afcc3960a63d455624b708a7cfdb9391cddb6c Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 15:14:23 -0700 Subject: [PATCH 08/39] Style file --- vignettes/arx-classifier-vignette.Rmd | 35 +++++++++++++-------------- 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 001823fc2..26a7e9471 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -31,12 +31,12 @@ jhu <- case_death_rate_subset %>% dplyr::filter( time_value >= "2021-06-04", time_value <= "2021-12-31", - geo_value %in% c("ca", "fl", "tx", "ny", "nj") - ) + geo_value %in% c("ca", "fl", "tx", "ny", "nj") + ) -out <- arx_classifier(jhu, - outcome = "case_rate", - predictors = "case_rate" +out <- arx_classifier(jhu, + outcome = "case_rate", + predictors = "case_rate" ) out$predictions @@ -55,12 +55,12 @@ From the parsnip model specification, we can see that the trainer used is logist If you use the default trainer of 
logistic regression for binary classification and you decide against using the default break of 0.25, then you should only input one break so that there are two classification bins to properly dichotomize the outcome. For example, let's set a break of 0.5 instead of relying on the default of 0.25. We can do this by passing 0.5 to the `breaks` argument in `arx_class_args_list()` as follows: ```{r} -out_break_0.5 <- arx_classifier(jhu, - outcome = "case_rate", - predictors = "case_rate", - args_list = arx_class_args_list( - breaks = 0.5 - ) +out_break_0.5 <- arx_classifier(jhu, + outcome = "case_rate", + predictors = "case_rate", + args_list = arx_class_args_list( + breaks = 0.5 + ) ) out_break_0.5$predictions @@ -129,15 +129,16 @@ The other optional arguments for controlling the growth rate calculation (that c To visualize the prediction classes across the states for the target date, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, like our 7-day ahead predictions, then that would be a pretty sad heatmap (which would look more like a bar chart than a heatmap)... So instead of doing that, let's get predictions for several aheads and plot a heatmap across the target dates. To get the predictions across several ahead values, we will use the map function in the same way that we did in other vignettes: ```{r} -multi_log_res <- map(1:14, ~ arx_classifier( +multi_log_res <- map(1:14, ~ arx_classifier( jhu, outcome = "case_rate", predictors = "case_rate", args_list = arx_class_args_list( - breaks = 0.25 / 7, + breaks = 0.25 / 7, ahead = .x, horizon = .x, - ))$predictions) %>% list_rbind() + ) +)$predictions) %>% list_rbind() ``` Notice that the ahead and horizon are what we are are changing when we use `map()`. The reason we must also change `horizon` is because that is passed to the `h` argument of `epiprocess::growth_rate()` and determines the amount of data used to calculate the growth rate. 
So, for nearly all intents and purposes, it should be the same as `ahead`. @@ -145,11 +146,10 @@ Notice that the ahead and horizon are what we are are changing when we use `map( We can plot a the heatmap of the results over the aheads to see if there's anything novel or interesting: ```{r} -ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) + - geom_tile() + +ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) + + geom_tile() + ylab("State") + xlab("Target date") - ``` While there is a bit of variability near to the beginning, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened at the beginning of January 2022. @@ -157,4 +157,3 @@ While there is a bit of variability near to the beginning, we can clearly see th ## A brief reflection The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose depends on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. 
To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). #%% takes a complex turn, increases in complexity - From 14082832096858c27292dc48e5cb8f0ca4a0cdbc Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 16:19:33 -0700 Subject: [PATCH 09/39] Add packages --- DESCRIPTION | 1 + vignettes/arx-classifier-vignette.Rmd | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 8dbe7ea7f..80e7de094 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -36,6 +36,7 @@ Imports: glue, hardhat (>= 1.3.0), magrittr, + purrr, quantreg, recipes (>= 1.0.4), rlang, diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 26a7e9471..d8001c7c8 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -16,7 +16,8 @@ knitr::opts_chunk$set( ## Load required packages ```{r, message = FALSE, warning = FALSE} -library(tidyverse) +library(dplyr) +library(purrr) library(epipredict) ``` From bbf85542a678d07c0e12a1c3021c30fadc97c119 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Mon, 8 Apr 2024 16:30:55 -0700 Subject: [PATCH 10/39] Add ggplot2 --- vignettes/arx-classifier-vignette.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index d8001c7c8..64fa7fd94 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -18,6 +18,7 @@ knitr::opts_chunk$set( ```{r, message = FALSE, warning = FALSE} library(dplyr) library(purrr) +library(ggplot2) library(epipredict) ``` From e76be9c5d0ee2a6d94519bc56b49f7e80246b545 
Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:44:08 -0700 Subject: [PATCH 11/39] - cache Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 64fa7fd94..e58e0fa7e 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -9,7 +9,7 @@ knitr::opts_chunk$set( comment = "#>", warning = FALSE, message = FALSE, - cache = TRUE +out.width = "100%" ) ``` From 2893035671f450cf6500d1e09cf0fbfc037856e2 Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:44:27 -0700 Subject: [PATCH 12/39] Update title header Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index e58e0fa7e..5ed9f7e01 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -1,5 +1,10 @@ --- -title: "arx_classifier" +title: "Auto-regressive classifier" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Auto-regressive classifier} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} From 05467cdf415c14cfb11e92dc0c14fa07dccd2bda Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:44:37 -0700 Subject: [PATCH 13/39] Update vignettes/arx-classifier-vignette.Rmd Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 5ed9f7e01..ceee9e895 100644 --- 
a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -29,7 +29,7 @@ library(epipredict) ## Introducing the ARX classifier -The `arx_classifier()` function is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. +The `arx_classifier()` is an autoregressive classification model for `epi_df` data that is used to predict a discrete class for each case under consideration. It is a direct forecaster in that it estimates the classes at a specific horizon or ahead value. To get a sense of how the `arx_classifier()` works, let's consider a simple example with minimal inputs. For this, we will use the built-in `case_death_rate_subset` that contains confirmed COVID-19 cases and deaths from JHU CSSE for all states over Dec 31, 2020 to Dec 31, 2021. From this, we'll take a subset of data for five states over June 4, 2021 to December 31, 2021. 
Our objective is to predict whether the case rates are increasing when considering the 0, 7 and 14 day case rates: From 18b744095cc3131e87f119e5815ac24d53ec9684 Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:44:52 -0700 Subject: [PATCH 14/39] Update vignettes/arx-classifier-vignette.Rmd Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index ceee9e895..ecd6dcb58 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -82,7 +82,7 @@ Additional arguments that may be supplied to `arx_class_args_list()` include the ## Example of using the ARX classifier -Now, to demonstrate the power and utility of this built-in arx classifier, we will adapt the classification example that was written from scratch in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader. +Now, to demonstrate the power and utility of this built-in arx classifier, we will adapt the classification example that was written from scratch in `vignette("preprocessing-and-models")`. However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader. To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). 
In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present? #%% Ok to say relative to the present? Also, should I be more precise here and say relative change or best to keep it simple? From 3015abbbb46fb061ec9eaec2799805175c4d48f7 Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:45:23 -0700 Subject: [PATCH 15/39] help() instead of question mark Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index ecd6dcb58..888e46f20 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -73,10 +73,7 @@ out_break_0.5 <- arx_classifier(jhu, out_break_0.5$predictions ``` Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, Inf). - -```{r} -?arx_class_args_list -``` +See `help(arx_class_args_list)` for other available modifications. Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These have default values of 0, 7, and 14 days for the lags of the predictors and 7 days ahead of the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound for the number of training rows per key as well as `forecast_date` and `target_date` to specify the date that the forecast is created and intended, respectively. We will not dwell on these arguments here as they are not unique to this classifier or absolutely essential to understanding how it operates. The remaining arguments will be discussed organically, as they are needed to serve our purposes. 
For information on any remaining arguments that are not discussed here, please see the function documentation for a complete list and their definitions. From c27f688492d07b799e6af6d2847ab0544ad098bd Mon Sep 17 00:00:00 2001 From: Rachel Lobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:45:41 -0700 Subject: [PATCH 16/39] Update vignettes/arx-classifier-vignette.Rmd Co-authored-by: Daniel McDonald --- vignettes/arx-classifier-vignette.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier-vignette.Rmd index 888e46f20..3caef3407 100644 --- a/vignettes/arx-classifier-vignette.Rmd +++ b/vignettes/arx-classifier-vignette.Rmd @@ -126,7 +126,7 @@ Comparing the pre-processing steps for this to those in the other vignette, we c On this topic, it is important to understand that we are not actually concerned about the case values themselves. Rather we are concerned whether the quantity of cases in the future is a lot larger than that in the present. For this reason, the outcome does not remain as cases, but rather it is transformed by using either growth rates (as the predictors and outcome in our example are) or lagged differences. While the latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), and it is conceptually easy to understand because it is simply the change of the value for the horizon, it is not the default. The default is `growth_rate`. One reason for this choice is because the growth rate is on a rate scale, not on the absolute scale, so it fosters comparability across locations without any conscious effort (on the other hand, when using the `lag_difference` one would need to take care to operate on rates per 100k and not raw counts). 
We utilize `epiprocess::growth_rate()` to create the outcome using some of the additional arguments. One important argument for the growth rate calculation is the `method`. Only `rel_change` for relative change should be used as the method because the test data is the only data that is accessible and the other methods require access to the training data. #%% Last sentence correct? -The other optional arguments for controlling the growth rate calculation (that can be inputted as `additional_gr_args`) can be found in the documentation for [`epiprocess::growth_rate()`](https://cmu-delphi.github.io/epiprocess/reference/growth_rate.html) and the related [vignette](https://cmu-delphi.github.io/epiprocess/articles/growth_rate.html). +The other optional arguments for controlling the growth rate calculation (that can be inputted as `additional_gr_args`) can be found in the documentation for `epiprocess::growth_rate()` and the related `vignette("growth_rate", package = "epiprocess")`. ### Visualizing the results From 0600ac3f9d696a927c970443ce11b30d03058a02 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:49:02 -0700 Subject: [PATCH 17/39] Add purrr to suggests --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 80e7de094..a246317f5 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -36,7 +36,6 @@ Imports: glue, hardhat (>= 1.3.0), magrittr, - purrr, quantreg, recipes (>= 1.0.4), rlang, @@ -56,6 +55,7 @@ Suggests: knitr, lubridate, poissonreg, + purrr, ranger, RcppRoll, rmarkdown, From 45da6d25556bcbe486d7b6b2231b2b83c8cc4533 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:49:46 -0700 Subject: [PATCH 18/39] Rename file to arx-classifier.Rmd --- vignettes/{arx-classifier-vignette.Rmd => arx-classifier.Rmd} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename 
vignettes/{arx-classifier-vignette.Rmd => arx-classifier.Rmd} (100%) diff --git a/vignettes/arx-classifier-vignette.Rmd b/vignettes/arx-classifier.Rmd similarity index 100% rename from vignettes/arx-classifier-vignette.Rmd rename to vignettes/arx-classifier.Rmd From 40f71b4b4009defabfa1490d22fd4be1f705ae56 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 12:52:28 -0700 Subject: [PATCH 19/39] Update references to vignettes in this package --- vignettes/arx-classifier.Rmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 3caef3407..3dbaf7438 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -81,7 +81,7 @@ Additional arguments that may be supplied to `arx_class_args_list()` include the Now, to demonstrate the power and utility of this built-in arx classifier, we will adapt the classification example that was written from scratch in `vignette("preprocessing-and-models")`. However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader. -To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present? #%% Ok to say relative to the present? Also, should I be more precise here and say relative change or best to keep it simple? 
+To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present?

To answer this question, we can create a predictive model for upsurges and downsurges of case rates rather than the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes

$$\begin{align}
Z_{l,t} = \left\{\begin{matrix}
\text{up}, & \text{if } Y_{l,t}^\Delta > 0.25 \\
\text{not up}, & \text{otherwise}
\end{matrix}\right.
\end{align}$$

where $Y_{l,t}^\Delta = (Y_{l, t} - Y_{l, t-7}) / Y_{l, t-7}$. If $Y_{l,t}^\Delta > 0.25$, meaning that the number of new cases over the week has increased by over 25\%, then $Z_{l,t}$ is up. This is the criterion for location $l$ to be a hotspot at time $t$. On the other hand, if $Y_{l,t}^\Delta \leq 0.25$, then $Z_{l,t}$ is categorized as not up, meaning that there has not been a >25\% increase in the new cases over the past week.
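The thresholding rule above is easy to sanity-check numerically. Here is a minimal sketch (in Python purely for illustration; the vignette's actual computation is done in R by `epipredict`, and the function name below is made up):

```python
def hotspot_class(y_now, y_prev_week):
    """Classify a location as "up" when the 7-day relative change
    (Y_t - Y_{t-7}) / Y_{t-7} exceeds 0.25, and "not up" otherwise."""
    delta = (y_now - y_prev_week) / y_prev_week
    return "up" if delta > 0.25 else "not up"

print(hotspot_class(130, 100))  # 30% weekly increase -> up
print(hotspot_class(110, 100))  # 10% weekly increase -> not up
```

Note that a week-over-week increase of exactly 25% is still "not up", since the rule uses a strict inequality.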
-The logistic regression model we use to predict this binary response can be considered to be a simplification of the multinomial regression model presented in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html): +The logistic regression model we use to predict this binary response can be considered to be a simplification of the multinomial regression model presented in `vignette("preprocessing-and-models")`: $$\begin{align} \pi_{\text{up}}(x) &= Pr(Z_{l, t} = \text{up}|x) = \frac{e^{g_{\text{up}}(x)}}{1 + e^{g_{\text{up}}(x)}}, \\ @@ -107,7 +107,7 @@ $$ Now then, we will operate on the same subset of the `case_death_rate_subset` that we used in our above example. This time, we will use it to investigate whether the number of newly reported cases over the past 7 days has increased by at least 25\% compared to the preceding week for our sample of states. -Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in the [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). +Notice that by using the `arx_classifier()` function we've completely eliminated the need to manually categorize the response variable and implement pre-processing steps, which was necessary in `vignette("preprocessing-and-models")`. 
```{r} log_res <- arx_classifier( @@ -124,7 +124,7 @@ workflows::extract_preprocessor(log_res$epi_workflow) Comparing the pre-processing steps for this to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials of transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`), leading the response (`step_epi_ahead()`), which are both constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`). -On this topic, it is important to understand that we are not actually concerned about the case values themselves. Rather we are concerned whether the quantity of cases in the future is a lot larger than that in the present. For this reason, the outcome does not remain as cases, but rather it is transformed by using either growth rates (as the predictors and outcome in our example are) or lagged differences. While the latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), and it is conceptually easy to understand because it is simply the change of the value for the horizon, it is not the default. The default is `growth_rate`. One reason for this choice is because the growth rate is on a rate scale, not on the absolute scale, so it fosters comparability across locations without any conscious effort (on the other hand, when using the `lag_difference` one would need to take care to operate on rates per 100k and not raw counts). We utilize `epiprocess::growth_rate()` to create the outcome using some of the additional arguments. One important argument for the growth rate calculation is the `method`. 
Only `rel_change` for relative change should be used as the method because the test data is the only data that is accessible and the other methods require access to the training data. #%% Last sentence correct? +On this topic, it is important to understand that we are not actually concerned about the case values themselves. Rather we are concerned whether the quantity of cases in the future is a lot larger than that in the present. For this reason, the outcome does not remain as cases, but rather it is transformed by using either growth rates (as the predictors and outcome in our example are) or lagged differences. While the latter is closer to the requirements for the [2022-23 CDC Flusight Hospitalization Experimental Target](https://github.com/cdcepi/Flusight-forecast-data/blob/745511c436923e1dc201dea0f4181f21a8217b52/data-experimental/README.md), and it is conceptually easy to understand because it is simply the change of the value for the horizon, it is not the default. The default is `growth_rate`. One reason for this choice is because the growth rate is on a rate scale, not on the absolute scale, so it fosters comparability across locations without any conscious effort (on the other hand, when using the `lag_difference` one would need to take care to operate on rates per 100k and not raw counts). We utilize `epiprocess::growth_rate()` to create the outcome using some of the additional arguments. One important argument for the growth rate calculation is the `method`. Only `rel_change` for relative change should be used as the method because the test data is the only data that is accessible and the other methods require access to the training data. The other optional arguments for controlling the growth rate calculation (that can be inputted as `additional_gr_args`) can be found in the documentation for `epiprocess::growth_rate()` and the related `vignette("growth_rate", package = "epiprocess")`. 
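The way a small set of breaks is silently extended to cover the whole real line, producing right-closed classes such as (-Inf, 0.25] and (0.25, Inf), can be sketched as follows (a hypothetical Python illustration of the interval logic only, not `epipredict`'s internals):

```python
import bisect

def discretize(delta, breaks=(0.25,)):
    """Assign a real-valued outcome to a right-closed interval; the breaks
    are extended with -inf and inf, yielding len(breaks) + 1 classes."""
    edges = (float("-inf"),) + tuple(breaks) + (float("inf"),)
    i = max(bisect.bisect_left(edges, delta), 1)  # delta lies in (edges[i-1], edges[i]]
    return f"({edges[i-1]}, {edges[i]}]"

print(discretize(0.1))  # (-inf, 0.25]
print(discretize(0.3))  # (0.25, inf]
```

Supplying two breaks, e.g. `breaks=(0.25, 0.5)`, would analogously yield three classes, which is how the three-category extension mentioned earlier would be set up.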
@@ -160,4 +160,4 @@ While there is a bit of variability near to the beginning, we can clearly see th ## A brief reflection -The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose depends on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in [Examples of Preprocessing and Models vignette](https://cmu-delphi.github.io/epipredict/articles/preprocessing-and-models.html). #%% takes a complex turn, increases in complexity +The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose depends on the circumstances. 
Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`. From 1a2f48545f1dd96765daf86492ac78a13216874c Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:09:52 -0700 Subject: [PATCH 20/39] Remove horizon stuff --- vignettes/arx-classifier.Rmd | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 3dbaf7438..cdafc9d1b 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -14,7 +14,7 @@ knitr::opts_chunk$set( comment = "#>", warning = FALSE, message = FALSE, -out.width = "100%" + out.width = "100%" ) ``` @@ -139,15 +139,12 @@ multi_log_res <- map(1:14, ~ arx_classifier( predictors = "case_rate", args_list = arx_class_args_list( breaks = 0.25 / 7, - ahead = .x, - horizon = .x, + ahead = .x ) )$predictions) %>% list_rbind() ``` -Notice that the ahead and horizon are what we are are changing when we use `map()`. The reason we must also change `horizon` is because that is passed to the `h` argument of `epiprocess::growth_rate()` and determines the amount of data used to calculate the growth rate. So, for nearly all intents and purposes, it should be the same as `ahead`. 
-We can plot a the heatmap of the results over the aheads to see if there's anything novel or interesting:
+We can plot the heatmap of the results over the aheads to see if there's anything novel or interesting to take away:

```{r}
ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
  geom_tile() +
  ylab("State") +
  xlab("Target date")
```

From 8092e0a7543ded0ac85199c33b2449e765d802b1 Mon Sep 17 00:00:00 2001
From: rachlobay <42976509+rachlobay@users.noreply.github.com>
Date: Fri, 26 Apr 2024 13:16:38 -0700
Subject: [PATCH 21/39] Add a bit about n_training

---
 vignettes/arx-classifier.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd
index cdafc9d1b..dba2e99ae 100644
--- a/vignettes/arx-classifier.Rmd
+++ b/vignettes/arx-classifier.Rmd
@@ -75,7 +75,7 @@ out_break_0.5$predictions
```
Indeed, we can observe that the two `.pred_class` are now (-Inf, 0.5] and (0.5, Inf). See `help(arx_class_args_list)` for other available modifications.

-Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These have default values of 0, 7, and 14 days for the lags of the predictors and 7 days ahead of the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound for the number of training rows per key as well as `forecast_date` and `target_date` to specify the date that the forecast is created and intended, respectively. We will not dwell on these arguments here as they are not unique to this classifier or absolutely essential to understanding how it operates. The remaining arguments will be discussed organically, as they are needed to serve our purposes. For information on any remaining arguments that are not discussed here, please see the function documentation for a complete list and their definitions.
+Additional arguments that may be supplied to `arx_class_args_list()` include the expected `lags` and `ahead` arguments for an autoregressive-type model. These have default values of 0, 7, and 14 days for the lags of the predictors and 7 days ahead of the forecast date for predicting the outcome. There is also `n_training` to indicate the upper bound for the number of training rows per key. If you would like some practice with using this, then remove the filtering command to obtain data within "2021-06-04" and "2021-12-31" and instead set `n_training` to be the number of days between these two dates, inclusive of the end points. The end results should be the same. In addition to `n_training`, there are `forecast_date` and `target_date` to specify the date that the forecast is created and intended, respectively. We will not dwell on such arguments here as they are not unique to this classifier or absolutely essential to understanding how it operates. The remaining arguments will be discussed organically, as they are needed to serve our purposes. For information on any remaining arguments that are not discussed here, please see the function documentation for a complete list and their definitions. 
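The inclusive day count suggested above is quick to verify (a Python sketch of the date arithmetic; the vignette itself works in R):

```python
from datetime import date

# Days from 2021-06-04 to 2021-12-31, inclusive of both end points:
# the n_training value that should reproduce the filtered data window.
n_training = (date(2021, 12, 31) - date(2021, 6, 4)).days + 1
print(n_training)  # 211
```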
## Example of using the ARX classifier From 5641c954de527dd3d9bdf40c5544ba56631eb427 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:22:18 -0700 Subject: [PATCH 22/39] remove surges --- vignettes/arx-classifier.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index dba2e99ae..45554fa12 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -81,9 +81,9 @@ Additional arguments that may be supplied to `arx_class_args_list()` include the Now, to demonstrate the power and utility of this built-in arx classifier, we will adapt the classification example that was written from scratch in `vignette("preprocessing-and-models")`. However, to keep things simple and not merely a direct translation, we will only consider two prediction categories and leave the extension to three as an exercise for the reader. -To motivate this example, a major use of autoregressive classification models is to predict upsurges or downsurges like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present? +To motivate this example, a major use of autoregressive classification models is to predict upswings or downswings like in hotspot prediction models to anticipate the direction of the outcome (see [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118) for more on these). In our case, one simple question that such models can help answer is... Do we expect that the future will have increased case rates or not relative to the present? 
-To answer this question, we can create a predictive model for upsurges and downsurges of case rates rather than the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. Using these variables, we define a categorical response variable with two classes +To answer this question, we can create a predictive model for upswings and downswings of case rates rather than the raw case rates themselves. For this situation, our target is to predict whether there is an increase in case rates or not. Following [McDonald, Bien, Green, Hu, et al. (2021)](https://www.pnas.org/doi/full/10.1073/pnas.2111453118), we look at the relative change between $Y_{l,t}$ and $Y_{l, t+a}$, where the former is the case rate at location $l$ at time $t$ and the latter is the rate for that location at time $t+a$. 
Using these variables, we define a categorical response variable with two classes $$\begin{align} Z_{l,t} = \left\{\begin{matrix} From b73ccac11e5a9175150f8a1407076d87de1ee900 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:26:42 -0700 Subject: [PATCH 23/39] Run use_this(purrr) and edit conclusion --- DESCRIPTION | 2 +- vignettes/arx-classifier.Rmd | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index a246317f5..80e7de094 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -36,6 +36,7 @@ Imports: glue, hardhat (>= 1.3.0), magrittr, + purrr, quantreg, recipes (>= 1.0.4), rlang, @@ -55,7 +56,6 @@ Suggests: knitr, lubridate, poissonreg, - purrr, ranger, RcppRoll, rmarkdown, diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 45554fa12..8be96fb4b 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -157,4 +157,4 @@ While there is a bit of variability near to the beginning, we can clearly see th ## A brief reflection -The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose depends on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. 
To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`. +The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose will depend on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`. From b3079dde3bd3dd786c2aa74b154c2497290c62c8 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:27:00 -0700 Subject: [PATCH 24/39] translating --- vignettes/arx-classifier.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 8be96fb4b..ff8696db2 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -157,4 +157,4 @@ While there is a bit of variability near to the beginning, we can clearly see th ## A brief reflection -The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. 
However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifer()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose will depend on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider to take this opportunity to translate this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`.
+The most noticeable benefit of using the `arx_classifier()` function is the simplification and reduction of the manual implementation of the classifier from about 30 down to 3 lines. However, as we noted before, the trade-off for simplicity is control over the precise pre-processing, post-processing, and additional features embedded in the coding of a classifier. So the good thing is that `epipredict` provides both - a built-in `arx_classifier()` or the means to implement your own classifier from scratch by using the `epi_workflow` framework. And which you choose will depend on the circumstances. Our advice is to start with using the built-in classifier for ostensibly simple projects and begin to implement your own when the modelling project takes a complicated turn. To get some practice on coding up a classifier by hand, consider translating this binary classification model example to an `epi_workflow`, akin to that in `vignette("preprocessing-and-models")`.
From 198d39716ebdf9578f3b6b6495b447c9534844a3 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:43:01 -0700 Subject: [PATCH 25/39] scale_colour_brewer() --- vignettes/arx-classifier.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index ff8696db2..0c3d0477a 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -119,7 +119,7 @@ log_res <- arx_classifier( ) ) -workflows::extract_preprocessor(log_res$epi_workflow) +log_res$epi_workflow ``` Comparing the pre-processing steps for this to those in the other vignette, we can see that they are not precisely the same, but they cover the same essentials of transforming `case_rate` to the growth rate scale (`step_growth_rate()`), lagging the predictors (`step_epi_lag()`), leading the response (`step_epi_ahead()`), which are both constructed from the growth rates, and constructing the binary classification response variable (`step_mutate()`). 
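The lag and lead bookkeeping behind `step_epi_lag()` and `step_epi_ahead()` amounts to shifting columns of the training data. A hypothetical sketch of the idea (plain Python, not the recipes implementation):

```python
def shift(series, k):
    """Shift a sequence by k steps: positive k lags it (pads the front),
    negative k leads it (pads the end)."""
    if k >= 0:
        return [None] * k + list(series[: len(series) - k])
    return list(series[-k:]) + [None] * (-k)

y = [1, 2, 3, 4, 5]
print(shift(y, 2))   # lagged predictor: [None, None, 1, 2, 3]
print(shift(y, -1))  # led outcome:      [2, 3, 4, 5, None]
```

Rows where a lag or lead falls outside the observed window are incomplete, which is why a direct forecaster trains only on the rows where all shifted columns are populated.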
@@ -149,6 +149,7 @@ We can plot a the heatmap of the results over the aheads to see if there's anyth ```{r} ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) + geom_tile() + + scale_colour_brewer() + ylab("State") + xlab("Target date") ``` From f1f80d3909c6bef9d67b3e87526a8f8828bc3f31 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:23:19 -0700 Subject: [PATCH 26/39] Try 10% increase --- vignettes/arx-classifier.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 0c3d0477a..d2dafc077 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -138,7 +138,7 @@ multi_log_res <- map(1:14, ~ arx_classifier( outcome = "case_rate", predictors = "case_rate", args_list = arx_class_args_list( - breaks = 0.25 / 7, + breaks = 0.25/7, ahead = .x ) )$predictions) %>% list_rbind() @@ -149,9 +149,9 @@ We can plot a the heatmap of the results over the aheads to see if there's anyth ```{r} ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) + geom_tile() + - scale_colour_brewer() + ylab("State") + - xlab("Target date") + xlab("Target date") + + scale_fill_brewer(palette = "Set1") ``` While there is a bit of variability near to the beginning, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened at the beginning of January 2022. 
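The map-and-bind pattern above, fitting one forecaster per `ahead` value and stacking the prediction rows into one table, reduces to a simple loop. Here is a hypothetical Python sketch of that shape, with a made-up stand-in for the real `arx_classifier()` call:

```python
def predictions_for(ahead, geos=("ca", "fl", "tx", "ny", "nj")):
    """Stand-in for arx_classifier(...)$predictions: one row per location."""
    return [{"geo_value": g, "ahead": ahead} for g in geos]

# analogue of map(1:40, ...) %>% list_rbind(): fit per ahead, then stack rows
multi = [row for a in range(1, 41) for row in predictions_for(a)]
print(len(multi))  # 200 rows: 40 aheads x 5 locations
```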
From 04553a6622af625115ce0feb2b29c31ac4b13a66 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:23:35 -0700 Subject: [PATCH 27/39] see previous --- vignettes/arx-classifier.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index d2dafc077..6152f34bd 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -138,7 +138,7 @@ multi_log_res <- map(1:14, ~ arx_classifier( outcome = "case_rate", predictors = "case_rate", args_list = arx_class_args_list( - breaks = 0.25/7, + breaks = 0.1, ahead = .x ) )$predictions) %>% list_rbind() From a78069ee50c5f57b652e74bb80633ea291b56ed2 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:29:05 -0700 Subject: [PATCH 28/39] description --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 80e7de094..e573bd902 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: epipredict Title: Basic epidemiology forecasting methods -Version: 0.0.12 +Version: 0.0.13 Authors@R: c( person("Daniel", "McDonald", , "daniel@stat.ubc.ca", role = c("aut", "cre")), person("Ryan", "Tibshirani", , "ryantibs@cmu.edu", role = "aut"), From fb6ba335411d8034c18c47b5d12a9e6417654f42 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:32:37 -0700 Subject: [PATCH 29/39] Try to pass check --- vignettes/arx-classifier.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 6152f34bd..63cac865f 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -141,7 +141,7 @@ multi_log_res <- map(1:14, ~ arx_classifier( breaks = 0.1, ahead = .x ) -)$predictions) %>% list_rbind() +)$predictions) %>% list_rbind() ``` We can 
plot the heatmap of the results over the aheads to see if there's anything novel or interesting to take away:

From 24ff1e7d26a8aae192e722fdb8f7f210fcc703e8 Mon Sep 17 00:00:00 2001
From: rachlobay <42976509+rachlobay@users.noreply.github.com>
Date: Fri, 26 Apr 2024 14:39:33 -0700
Subject: [PATCH 30/39] map over larger window

---
 vignettes/arx-classifier.Rmd | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd
index 63cac865f..1cbf316d7 100644
--- a/vignettes/arx-classifier.Rmd
+++ b/vignettes/arx-classifier.Rmd
@@ -133,12 +133,12 @@ To visualize the prediction classes across the states for the target date, we can plot our results as a heatmap. However, if we were to plot the results for only one target date, like our 7-day ahead predictions, then that would be a pretty sad heatmap (which would look more like a bar chart than a heatmap)... So instead of doing that, let's get predictions for several aheads and plot a heatmap across the target dates. To get the predictions across several ahead values, we will use the map function in the same way that we did in other vignettes:

```{r}
-multi_log_res <- map(1:14, ~ arx_classifier(
+multi_log_res <- map(1:40, ~ arx_classifier(
  jhu,
  outcome = "case_rate",
  predictors = "case_rate",
  args_list = arx_class_args_list(
-    breaks = 0.1,
+    breaks = 0.25 / 7, # division by 7 gives weekly not daily
     ahead = .x
  )
)$predictions) %>% list_rbind()

@@ -154,7 +154,7 @@ ggplot(multi_log_res, aes(target_date, geo_value, fill = .pred_class)) +
   scale_fill_brewer(palette = "Set1")
```

-While there is a bit of variability near to the beginning, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states.
So our results seem to align well with what actually happened at the beginning of January 2022. +While there is a bit of variability near the end, we can clearly see that there are upswings for all states starting from the beginning of January 2022, which we can recall was when there was a massive spike in cases for many states. So our results seem to align well with what actually happened at the beginning of January 2022. ## A brief reflection From 29205c3d391d388b69930bb8d0b77bddfd62d92c Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:41:51 -0700 Subject: [PATCH 31/39] description --- DESCRIPTION | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index e573bd902..d21127830 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: epipredict Title: Basic epidemiology forecasting methods -Version: 0.0.13 +Version: 0.0.12 Authors@R: c( person("Daniel", "McDonald", , "daniel@stat.ubc.ca", role = c("aut", "cre")), person("Ryan", "Tibshirani", , "ryantibs@cmu.edu", role = "aut"), @@ -23,7 +23,7 @@ URL: https://github.com/cmu-delphi/epipredict/, https://cmu-delphi.github.io/epipredict BugReports: https://github.com/cmu-delphi/epipredict/issues/ Depends: - epiprocess (>= 0.7.5), + epiprocess (>= 0.6.0), parsnip (>= 1.0.0), R (>= 3.5.0) Imports: @@ -32,11 +32,9 @@ Imports: distributional, dplyr, generics, - ggplot2, glue, hardhat (>= 1.3.0), magrittr, - purrr, quantreg, recipes (>= 1.0.4), rlang, @@ -53,6 +51,7 @@ Suggests: data.table, epidatr (>= 1.0.0), fs, + ggplot2, knitr, lubridate, poissonreg, From dd357e7ec81fb029f9db38aba074969e5f7ccdf0 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:42:23 -0700 Subject: [PATCH 32/39] Add purrr to suggests --- DESCRIPTION | 1 + 1 file changed, 1 insertion(+) diff --git a/DESCRIPTION b/DESCRIPTION index d21127830..b7795cdfc
100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -55,6 +55,7 @@ Suggests: knitr, lubridate, poissonreg, + purrr, ranger, RcppRoll, rmarkdown, From 3845afabfab40f0ec19d9de62368f5d2db01c13e Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:44:07 -0700 Subject: [PATCH 33/39] Style file --- vignettes/arx-classifier.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd index 1cbf316d7..79db765b8 100644 --- a/vignettes/arx-classifier.Rmd +++ b/vignettes/arx-classifier.Rmd @@ -141,7 +141,7 @@ multi_log_res <- map(1:40, ~ arx_classifier( breaks = 0.25 / 7, # division by 7 gives weekly not daily ahead = .x ) -)$predictions) %>% list_rbind() +)$predictions) %>% list_rbind() ``` We can plot the heatmap of the results over the aheads to see if there's anything novel or interesting to take away: From ec24435f4b8c9baa2546ff403d0737b4b43bf26f Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 14:54:48 -0700 Subject: [PATCH 34/39] Moved arx-classifier to articles --- man/arx_class_epi_workflow.Rd | 7 +++++-- man/arx_classifier.Rd | 7 +++++-- man/arx_fcast_epi_workflow.Rd | 9 +++++--- man/arx_forecaster.Rd | 9 +++++--- man/autoplot-epipred.Rd | 23 +++++++-------------- vignettes/{ => articles}/arx-classifier.Rmd | 0 6 files changed, 29 insertions(+), 26 deletions(-) rename vignettes/{ => articles}/arx-classifier.Rmd (100%) diff --git a/man/arx_class_epi_workflow.Rd b/man/arx_class_epi_workflow.Rd index e55e14160..bb5ac54ae 100644 --- a/man/arx_class_epi_workflow.Rd +++ b/man/arx_class_epi_workflow.Rd @@ -22,8 +22,11 @@ internally based on the \code{breaks} argument to \code{\link[=arx_class_args_li If discrete classes are already in the \code{epi_df}, it is recommended to code up a classifier from scratch using \code{\link[=epi_recipe]{epi_recipe()}}.} -\item{predictors}{A character
vector giving column(s) of predictor -variables.} +\item{predictors}{A character vector giving column(s) of predictor variables. +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used; the \code{outcome} is not added +automatically. If you want the outcome among the predictors, be sure to +specify it explicitly.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "classification"}. Typical values are diff --git a/man/arx_classifier.Rd b/man/arx_classifier.Rd index de487ec51..350352ae9 100644 --- a/man/arx_classifier.Rd +++ b/man/arx_classifier.Rd @@ -22,8 +22,11 @@ internally based on the \code{breaks} argument to \code{\link[=arx_class_args_li If discrete classes are already in the \code{epi_df}, it is recommended to code up a classifier from scratch using \code{\link[=epi_recipe]{epi_recipe()}}.} -\item{predictors}{A character vector giving column(s) of predictor -variables.} +\item{predictors}{A character vector giving column(s) of predictor variables. +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used; the \code{outcome} is not added +automatically. If you want the outcome among the predictors, be sure to +specify it explicitly.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "classification"}.
Typical values are diff --git a/man/arx_fcast_epi_workflow.Rd b/man/arx_fcast_epi_workflow.Rd index 8c76bcdd7..5db59497b 100644 --- a/man/arx_fcast_epi_workflow.Rd +++ b/man/arx_fcast_epi_workflow.Rd @@ -7,7 +7,7 @@ arx_fcast_epi_workflow( epi_data, outcome, - predictors, + predictors = outcome, trainer = NULL, args_list = arx_args_list() ) @@ -18,8 +18,11 @@ arx_fcast_epi_workflow( \item{outcome}{A character (scalar) specifying the outcome (in the \code{epi_df}).} -\item{predictors}{A character vector giving column(s) of predictor -variables.} +\item{predictors}{A character vector giving column(s) of predictor variables. +This defaults to the \code{outcome}. However, if manually specified, only those +variables specifically mentioned will be used; the \code{outcome} is not added +automatically. If you want the outcome among the predictors, be sure to +specify it explicitly.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "regression"}. May be \code{NULL} (the default).} diff --git a/man/arx_forecaster.Rd b/man/arx_forecaster.Rd index 7a042c65c..af05c0682 100644 --- a/man/arx_forecaster.Rd +++ b/man/arx_forecaster.Rd @@ -7,7 +7,7 @@ arx_forecaster( epi_data, outcome, - predictors, + predictors = outcome, trainer = parsnip::linear_reg(), args_list = arx_args_list() ) @@ -18,8 +18,11 @@ arx_forecaster( \item{outcome}{A character (scalar) specifying the outcome (in the \code{epi_df}).} -\item{predictors}{A character vector giving column(s) of predictor -variables.} +\item{predictors}{A character vector giving column(s) of predictor variables. +This defaults to the \code{outcome}. However, if manually specified, only those
+variables specifically mentioned will be used; the \code{outcome} is not added +automatically. If you want the outcome among the predictors, be sure to +specify it explicitly.} \item{trainer}{A \code{{parsnip}} model describing the type of estimation. For now, we enforce \code{mode = "regression"}.} diff --git a/man/autoplot-epipred.Rd b/man/autoplot-epipred.Rd index 80a552604..c8d00e147 100644 --- a/man/autoplot-epipred.Rd +++ b/man/autoplot-epipred.Rd @@ -39,20 +39,11 @@ More than 3 levels begins to be difficult to see.} \item{...}{Ignored} -\item{.color_by}{Which variables should determine the color(s) used to plot -lines. Options include: -\itemize{ -\item \code{all_keys} - the default uses the interaction of any key variables -including the \code{geo_value} -\item \code{geo_value} - \code{geo_value} only -\item \code{other_keys} - any available keys that are not \code{geo_value} -\item \code{.response} - the numeric variables (same as the y-axis) -\item \code{all} - uses the interaction of all keys and numeric variables -\item \code{none} - no coloring aesthetic is applied -}} - -\item{.facet_by}{Similar to \code{.color_by} except that the default is to display -each numeric variable on a separate facet} +\item{.color_by}{A character string indicating how to color the data. See +\code{epiprocess::autoplot.epi_df()} for more details.} + +\item{.facet_by}{A character string indicating how to facet the data. See +\code{epiprocess::autoplot.epi_df()} for more details.} \item{.base_color}{If available, prediction bands will be shown with this color.} @@ -60,8 +51,8 @@ color.} \item{.point_pred_color}{If available, point forecasts will be shown with this color.} -\item{.max_facets}{Cut down of the number of facets displayed. Especially -useful for testing when there are many \code{geo_value}'s or keys.} +\item{.max_facets}{The maximum number of facets to show.
If the number of +facets is greater than this value, only the top facets will be shown.} } \description{ For a fit workflow, the training data will be displayed, the response by diff --git a/vignettes/arx-classifier.Rmd b/vignettes/articles/arx-classifier.Rmd similarity index 100% rename from vignettes/arx-classifier.Rmd rename to vignettes/articles/arx-classifier.Rmd From b1adbd51f9e0b19b210a3d97552240eb3cbc488b Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:04:22 -0700 Subject: [PATCH 35/39] Please work --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index b7795cdfc..fbde4f285 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -32,6 +32,7 @@ Imports: distributional, dplyr, generics, + ggplot2, glue, hardhat (>= 1.3.0), magrittr, @@ -51,7 +52,6 @@ Suggests: data.table, epidatr (>= 1.0.0), fs, - ggplot2, knitr, lubridate, poissonreg, From cdd5a91c1a71e3931e8a5d5736e1759483b9b25f Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:17:37 -0700 Subject: [PATCH 36/39] test --- DESCRIPTION | 4 ++-- man/arx_fcast_epi_workflow.Rd | 2 +- man/autoplot-epipred.Rd | 10 ++-------- vignettes/{articles => }/arx-classifier.Rmd | 0 4 files changed, 5 insertions(+), 11 deletions(-) rename vignettes/{articles => }/arx-classifier.Rmd (100%) diff --git a/DESCRIPTION b/DESCRIPTION index c9c1ee1a7..b7795cdfc 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: epipredict Title: Basic epidemiology forecasting methods -Version: 0.0.13 +Version: 0.0.12 Authors@R: c( person("Daniel", "McDonald", , "daniel@stat.ubc.ca", role = c("aut", "cre")), person("Ryan", "Tibshirani", , "ryantibs@cmu.edu", role = "aut"), @@ -32,7 +32,6 @@ Imports: distributional, dplyr, generics, - ggplot2, glue, hardhat (>= 1.3.0), magrittr, @@ -52,6 +51,7 @@ Suggests: data.table, epidatr (>= 1.0.0), fs, + ggplot2, knitr, 
lubridate, poissonreg, diff --git a/man/arx_fcast_epi_workflow.Rd b/man/arx_fcast_epi_workflow.Rd index 40b4e843b..4ed279351 100644 --- a/man/arx_fcast_epi_workflow.Rd +++ b/man/arx_fcast_epi_workflow.Rd @@ -8,7 +8,7 @@ arx_fcast_epi_workflow( epi_data, outcome, predictors = outcome, - trainer = NULL, + trainer = parsnip::linear_reg(), args_list = arx_args_list() ) } diff --git a/man/autoplot-epipred.Rd b/man/autoplot-epipred.Rd index cd56f3c05..3362b9124 100644 --- a/man/autoplot-epipred.Rd +++ b/man/autoplot-epipred.Rd @@ -39,20 +39,14 @@ More than 3 levels begins to be difficult to see.} \item{...}{Ignored} -\item{.color_by}{A character string indicating how to color the data. See -\code{epiprocess::autoplot.epi_df()} for more details.} - -\item{.facet_by}{A character string indicating how to facet the data. See -\code{epiprocess::autoplot.epi_df()} for more details.} +\item{.facet_by}{Similar to \code{.color_by} except that the default is to +display the response.} \item{.base_color}{If available, prediction bands will be shown with this color.} \item{.point_pred_color}{If available, point forecasts will be shown with this color.} - -\item{.max_facets}{The maximum number of facets to show. 
If the number of -facets is greater than this value, only the top facets will be shown.} } \description{ For a fit workflow, the training data will be displayed, the response by diff --git a/vignettes/articles/arx-classifier.Rmd b/vignettes/arx-classifier.Rmd similarity index 100% rename from vignettes/articles/arx-classifier.Rmd rename to vignettes/arx-classifier.Rmd From 3c6a44a25a5cae66f708243f2e8766a33ce9ede3 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:19:06 -0700 Subject: [PATCH 37/39] make sure version is updated --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index b7795cdfc..0de58f6ae 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: epipredict Title: Basic epidemiology forecasting methods -Version: 0.0.12 +Version: 0.0.13 Authors@R: c( person("Daniel", "McDonald", , "daniel@stat.ubc.ca", role = c("aut", "cre")), person("Ryan", "Tibshirani", , "ryantibs@cmu.edu", role = "aut"), From fe1db833e9a493180a00eac9701e587bd603ba31 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 15:51:52 -0700 Subject: [PATCH 38/39] Use latest version of description --- DESCRIPTION | 4 ++-- man/autoplot-epipred.Rd | 15 +++++++++++++++ 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 0de58f6ae..0552d700d 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -23,7 +23,7 @@ URL: https://github.com/cmu-delphi/epipredict/, https://cmu-delphi.github.io/epipredict BugReports: https://github.com/cmu-delphi/epipredict/issues/ Depends: - epiprocess (>= 0.6.0), + epiprocess (>= 0.7.5), parsnip (>= 1.0.0), R (>= 3.5.0) Imports: @@ -32,6 +32,7 @@ Imports: distributional, dplyr, generics, + ggplot2, glue, hardhat (>= 1.3.0), magrittr, @@ -51,7 +52,6 @@ Suggests: data.table, epidatr (>= 1.0.0), fs, - ggplot2, knitr, lubridate, 
poissonreg, diff --git a/man/autoplot-epipred.Rd b/man/autoplot-epipred.Rd index 3362b9124..18000cd82 100644 --- a/man/autoplot-epipred.Rd +++ b/man/autoplot-epipred.Rd @@ -39,6 +39,18 @@ More than 3 levels begins to be difficult to see.} \item{...}{Ignored} +\item{.color_by}{Which variables should determine the color(s) used to plot +lines. Options include: +\itemize{ +\item \code{all_keys} - the default uses the interaction of any key variables +including the \code{geo_value} +\item \code{geo_value} - \code{geo_value} only +\item \code{other_keys} - any available keys that are not \code{geo_value} +\item \code{.response} - the numeric variables (same as the y-axis) +\item \code{all} - uses the interaction of all keys and numeric variables +\item \code{none} - no coloring aesthetic is applied +}} + \item{.facet_by}{Similar to \code{.color_by} except that the default is to display the response.} @@ -47,6 +59,9 @@ color.} \item{.point_pred_color}{If available, point forecasts will be shown with this color.} + +\item{.max_facets}{Cut down the number of facets displayed. Especially +useful for testing when there are many \code{geo_value}'s or keys.} } \description{ For a fit workflow, the training data will be displayed, the response by From 2b25c350943f018b3255bd5727acd5aa8efa5a26 Mon Sep 17 00:00:00 2001 From: rachlobay <42976509+rachlobay@users.noreply.github.com> Date: Fri, 26 Apr 2024 19:25:15 -0700 Subject: [PATCH 39/39] Move update vignette. May be parsnip warning --- vignettes/{ => articles}/update.Rmd | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename vignettes/{ => articles}/update.Rmd (100%) diff --git a/vignettes/update.Rmd b/vignettes/articles/update.Rmd similarity index 100% rename from vignettes/update.Rmd rename to vignettes/articles/update.Rmd