Commit 6f1891d

feat: add vignette snapshots
1 parent e74b7e7 commit 6f1891d

13 files changed: +4996 -50 lines changed

tests/testthat/_snaps/vignette-snapshot/advanced.html

+824
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/advanced.new.html

+825
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/aggregation.html

+613
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/archive.html

+699
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/archive.new.html

+696
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/epiprocess.html

+672
Large diffs are not rendered by default.

tests/testthat/_snaps/vignette-snapshot/slide.html

+614
Large diffs are not rendered by default.
+18
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
+# Vignettes that use epi_archives or epi_dfs.
+vignettes <- paste0(here::here("vignettes/"), c(
+  "advanced.Rmd",
+  "aggregation.Rmd",
+  "archive.Rmd",
+  "epiprocess.Rmd",
+  "slide.Rmd"
+))
+for (input_file in vignettes) {
+  test_that(paste0("snapshot vignette ", basename(input_file)), {
+    # skip("Skipping snapshot tests by default, as they are slow.")
+    output_file <- sub("\\.Rmd$", ".html", input_file)
+    withr::with_file(output_file, {
+      devtools::build_rmd(input_file)
+      expect_snapshot_file(output_file)
+    })
+  })
+}
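
The new test file follows a render-then-compare pattern: build each vignette to HTML, then compare the result against a stored snapshot. A minimal standalone sketch of the same pattern follows, assuming `testthat` (3rd edition) and `withr` are installed. The file name `hello.html` and its contents are made up for illustration, and a plain `writeLines()` stands in for `devtools::build_rmd()`.

```r
library(testthat)

test_that("snapshot of a generated HTML file", {
  # Snapshot expectations require the 3rd edition of testthat.
  local_edition(3)

  # Hypothetical stand-in for a rendered vignette: a temporary HTML file
  # that withr deletes when the test exits.
  output_file <- withr::local_tempfile(fileext = ".html")
  writeLines("<p>hello</p>", output_file)

  # Under a test runner, the first run records the file under _snaps/ and
  # later runs compare against it. Run interactively, testthat only reports
  # that the comparison is not available.
  expect_snapshot_file(output_file, name = "hello.html")
})
```

Passing an explicit `name` keeps the snapshot file name stable even though the temporary path differs on every run.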

vignettes/advanced.Rmd

+7-11
@@ -9,14 +9,13 @@ vignette: >
99

1010
In this vignette, we discuss how to use the sliding functionality in the
1111
`epiprocess` package with less common grouping schemes or with computations that
12-
have advanced output structures.
13-
The output of a slide computation should either be an atomic value/vector, or a
14-
data frame. This data frame can have multiple columns, multiple rows, or both.
12+
have advanced output structures. The output of a slide computation should either
13+
be an atomic value/vector, or a data frame. This data frame can have multiple
14+
columns, multiple rows, or both.
1515

1616
During basic usage (e.g., when all optional arguments are set to their defaults):
1717

1818
* `epi_slide(edf, <computation>, .....)`:
19-
2019
* keeps **all** columns of `edf`, adds computed column(s)
2120
* outputs **one row per row in `edf`** (recycling outputs from
2221
computations appropriately if there are multiple time series bundled
@@ -26,9 +25,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
2625
`dplyr::arrange(time_value, .by_group = TRUE)`**
2726
* outputs an **`epi_df`** if the required columns are present, otherwise a
2827
tibble
29-
3028
* `epix_slide(ea, <computation>, .....)`:
31-
3229
* keeps **grouping and `time_value`** columns of `ea`, adds computed
3330
column(s)
3431
* outputs **any number of rows** (computations are allowed to output any
@@ -40,6 +37,7 @@ During basic usage (e.g., when all optional arguments are set to their defaults)
4037
* outputs a **tibble**
4138

4239
These differences in basic behavior make some common slide operations require less boilerplate:
40+
4341
* predictors and targets calculated with `epi_slide` are automatically lined up
4442
with each other and with the signals from which they were calculated; and
4543
* computations for an `epix_slide` can output data frames with any number of
@@ -84,13 +82,14 @@ simple synthetic example.
8482
```{r message = FALSE}
8583
library(epiprocess)
8684
library(dplyr)
85+
set.seed(123)
8786
8887
edf <- tibble(
8988
geo_value = rep(c("ca", "fl", "pa"), each = 3),
9089
time_value = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
9190
x = seq_along(geo_value) + 0.01 * rnorm(length(geo_value)),
9291
) %>%
93-
as_epi_df()
92+
as_epi_df(as_of = as.Date("2024-03-20"))
9493
9594
# 2-day trailing average, per geo value
9695
edf %>%
@@ -338,7 +337,7 @@ library(data.table)
338337
library(ggplot2)
339338
theme_set(theme_bw())
340339
341-
x <- archive_cases_dv_subset_2$DT %>%
340+
x <- archive_cases_dv_subset$DT %>%
342341
filter(geo_value %in% c("ca", "fl")) %>%
343342
as_epi_archive(compactify = FALSE)
344343
```
@@ -525,10 +524,7 @@ separate ARX model on each state. As in the archive vignette, we can see a
525524
difference between version-aware (right column) and -unaware (left column)
526525
forecasting, as well.
527526

528-
529527
## Attribution
530528
The `case_rate_7d_av` data used in this document is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
531529

532530
The `percent_cli` data is a modified part of the [COVIDcast Epidata API Doctor Visits data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html). This dataset is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/). Copyright Delphi Research Group at Carnegie Mellon University 2020.
533-
534-

vignettes/aggregation.Rmd

+3-3
@@ -34,7 +34,7 @@ x <- pub_covidcast(
3434
) %>%
3535
select(geo_value, time_value, cases = value) %>%
3636
full_join(y, by = "geo_value") %>%
37-
as_epi_df()
37+
as_epi_df(as_of = as.Date("2024-03-20"))
3838
```
3939

4040
The data contains 16,212 rows and 5 columns.
@@ -192,7 +192,7 @@ running `epi_slide()` on the zero-filled data brings these trailing averages
192192

193193
```{r}
194194
xt %>%
195-
as_epi_df() %>%
195+
as_epi_df(as_of = as.Date("2024-03-20")) %>%
196196
group_by(geo_value) %>%
197197
epi_slide(cases_7dav = mean(cases), before = 6) %>%
198198
ungroup() %>%
@@ -203,7 +203,7 @@ xt %>%
203203
print(n = 7)
204204
205205
xt_filled %>%
206-
as_epi_df() %>%
206+
as_epi_df(as_of = as.Date("2024-03-20")) %>%
207207
group_by(geo_value) %>%
208208
epi_slide(cases_7dav = mean(cases), before = 6) %>%
209209
ungroup() %>%
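
The `set.seed(123)` and fixed `as_of = as.Date("2024-03-20")` edits in these vignettes appear to serve the snapshot tests: rendered HTML only stays stable between runs if random draws and recorded timestamps are pinned. A small base-R sketch of the idea follows. Here `make_meta()` is a hypothetical stand-in for a constructor that defaults a timestamp to the current time; it is not an `epiprocess` function.

```r
# Hypothetical constructor: records a timestamp that defaults to "now",
# so two unpinned calls made at different times would not match.
make_meta <- function(as_of = Sys.time()) {
  list(as_of = as_of)
}

# Pinning the timestamp gives identical results on every run.
pinned_a <- make_meta(as_of = as.Date("2024-03-20"))
pinned_b <- make_meta(as_of = as.Date("2024-03-20"))
print(identical(pinned_a, pinned_b))

# Seeding the RNG makes random draws repeatable as well.
set.seed(123)
draw_a <- rnorm(3)
set.seed(123)
draw_b <- rnorm(3)
print(identical(draw_a, draw_b))
```

Both comparisons hold because nothing in either computation depends on the wall clock or on unseeded random state.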

vignettes/archive.Rmd

+19-28
@@ -7,16 +7,16 @@ vignette: >
77
%\VignetteEncoding{UTF-8}
88
---
99

10-
In addition to the `epi_df` data structure, which we have been working with all
11-
along in these vignettes, the `epiprocess` package has a companion structure
12-
called `epi_archive`. In comparison to an `epi_df` object, which can be seen as
13-
storing a single snapshot of a data set with the most up-to-date signal values
14-
as of some given time, an `epi_archive` object stores the full version history
15-
of a data set. Many signals of interest for epidemiological tracking are subject
16-
to revision (some more than others), and paying attention to data revisions can
17-
be important for all sorts of downstream data analysis and modeling tasks.
18-
19-
This vignette walks through working with `epi_archive` objects and demonstrates
10+
In addition to the `epi_df` data structure, the `epiprocess` package has a
11+
companion structure called `epi_archive`. In comparison to an `epi_df` object,
12+
which can be seen as storing a single snapshot of a data set with the most
13+
up-to-date signal values as of some given time, an `epi_archive` object stores
14+
the full version history of a data set. Many signals of interest for
15+
epidemiological tracking are subject to revision (some more than others) and
16+
paying attention to data revisions can be important for all sorts of downstream
17+
data analysis and modeling tasks.
18+
19+
This vignette walks through working with `epi_archive()` objects and demonstrates
2020
some of their key functionality. We'll work with a signal on the percentage of
2121
doctor's visits with CLI (COVID-like illness) computed from medical insurance
2222
claims, available through the [COVIDcast
@@ -55,9 +55,8 @@ library(ggplot2)
5555

5656
## Getting data into `epi_archive` format
5757

58-
An <code><a href="../reference/epi_archive.html">epi_archive</a></code> object
59-
can be constructed from a data frame, data table, or tibble, provided that it
60-
has (at least) the following columns:
58+
An `epi_archive()` object can be constructed from a data frame, data table, or
59+
tibble, provided that it has (at least) the following columns:
6160

6261
* `geo_value`: the geographic value associated with each row of measurements.
6362
* `time_value`: the time value associated with each row of measurements.
@@ -71,7 +70,7 @@ As we can see from the above, the data frame returned by
7170
format, with `issue` playing the role of `version`. We can now use
7271
`as_epi_archive()` to bring it into `epi_archive` format. For removal of
7372
redundant version updates in `as_epi_archive` using compactify, please refer to
74-
the compactify vignette.
73+
the [compactify vignette](articles/compactify.html).
7574

7675
```{r, eval=FALSE}
7776
x <- dv %>%
@@ -91,10 +90,10 @@ class(x)
9190
print(x)
9291
```
9392

94-
An `epi_archive` is special kind of class called an R6 class. Its primary field
95-
is a data table `DT`, which is of class `data.table` (from the `data.table`
96-
package), and has columns `geo_value`, `time_value`, `version`, as well as any
97-
number of additional columns.
93+
An `epi_archive` consists of a primary field `DT`, which is a data table
94+
(from the `data.table` package) that has the columns `geo_value`, `time_value`,
95+
`version` (and possibly additional ones), and other metadata fields, such as
96+
`geo_type` and `time_type`.
9897

9998
```{r}
10099
class(x$DT)
@@ -112,9 +111,7 @@ key(x$DT)
112111
```
113112

114113
In general, the last version of each observation is carried forward (LOCF) to
115-
fill in data between recorded versions. **A word of caution:** R6 objects,
116-
unlike most other objects in R, have reference semantics. An important
117-
consequence of this is that objects are not copied when modified.
114+
fill in data between recorded versions.
118115

119116
```{r}
120117
original_value <- x$DT$percent_cli[1]
@@ -144,10 +141,7 @@ call (as it did in the case above).
144141

145142
A key method of an `epi_archive` class is `as_of()`, which generates a snapshot
146143
of the archive in `epi_df` format. This represents the most up-to-date values of
147-
the signal variables as of a given version. This can be accessed via `x$as_of()`
148-
for an `epi_archive` object `x`, but the package also provides a simple wrapper
149-
function `epix_as_of()` since this is likely a more familiar interface for users
150-
not familiar with R6 (or object-oriented programming).
144+
the signal variables as of a given version.
151145

152146
```{r}
153147
x_snapshot <- epix_as_of(x, max_version = as.Date("2021-06-01"))
@@ -215,9 +209,6 @@ they overestimate it (both states towards the beginning of 2021), though not
215209
quite as dramatically. Modeling the revision process, which is often called
216210
*backfill modeling*, is an important statistical problem in and of itself.
217211

218-
<!-- todo: refer to some project/code? perhaps we should think about writing a
219-
function in `epiprocess` or even a separate package? -->
220-
221212
## Merging `epi_archive` objects
222213

223214
Now we demonstrate how to merge two `epi_archive` objects together, e.g., so

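
The archive vignette text in this diff describes `as_of()` snapshots and last-observation-carried-forward (LOCF) between recorded versions. The base-R sketch below illustrates that idea on a toy version history. The helper `snapshot_as_of()` and all data values are made up for illustration; this is not how `epix_as_of()` is implemented.

```r
# Toy version history: each row is the value of (geo_value, time_value)
# as reported in `version`. All numbers are made up.
updates <- data.frame(
  geo_value = c("ca", "ca", "ca", "fl"),
  time_value = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02", "2020-06-01")),
  version = as.Date(c("2020-06-02", "2020-06-05", "2020-06-03", "2020-06-04")),
  percent_cli = c(1.0, 1.4, 2.0, 3.0)
)

# Snapshot as of `max_version`: drop later versions, then keep the most
# recent remaining version of each observation (LOCF across versions).
snapshot_as_of <- function(updates, max_version) {
  seen <- updates[updates$version <= max_version, ]
  seen <- seen[order(seen$geo_value, seen$time_value, seen$version), ]
  is_latest <- !duplicated(seen[c("geo_value", "time_value")], fromLast = TRUE)
  seen[is_latest, c("geo_value", "time_value", "percent_cli")]
}

# The early snapshot does not yet include the later revision for "ca"
# or the first report for "fl"; the late snapshot includes both.
early <- snapshot_as_of(updates, as.Date("2020-06-03"))
late <- snapshot_as_of(updates, as.Date("2020-06-05"))
print(early)
print(late)
```

Comparing `early` and `late` shows how the same `time_value` can carry different signal values depending on the version at which the archive is read.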
vignettes/compactify.Rmd

+1
@@ -106,6 +106,7 @@ slide_median <- function(my_ea) {
106106
107107
speeds <- rbind(speeds, speed_test(slide_median, "slide_median"))
108108
```
109+
109110
Here is a detailed performance comparison:
110111

111112
```{r}

vignettes/epiprocess.Rmd

+5-8
@@ -125,7 +125,7 @@ and `time_value` columns, respectively, but inferring the `as_of` field is not
125125
as easy. See the documentation for `as_epi_df()` for more details.
126126

127127
```{r}
128-
x <- as_epi_df(cases) %>%
128+
x <- as_epi_df(cases, as_of = as.Date("2024-03-20")) %>%
129129
select(geo_value, time_value, total_cases = value)
130130
131131
attributes(x)$metadata
@@ -169,7 +169,7 @@ data.frame(
169169
# misnamed
170170
reported_date = rep(seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), length.out = length(geo_value)),
171171
value = seq_along(geo_value) + 0.01 * withr::with_rng_version("3.0.0", withr::with_seed(42, length(geo_value)))
172-
) %>% as_epi_df()
172+
) %>% as_epi_df(as_of = as.Date("2024-03-20"))
173173
```
174174

175175
The columns can be renamed to match `epi_df` format. In the example below, notice there is also an additional key `pol`.
@@ -220,7 +220,7 @@ ex3 <- ex3 %>%
220220
state = rep(tolower("MA"), 6),
221221
pol = rep(c("blue", "swing", "swing"), each = 2)
222222
) %>%
223-
as_epi_df(additional_metadata = list(other_keys = c("state", "pol")))
223+
as_epi_df(additional_metadata = list(other_keys = c("state", "pol")), as_of = as.Date("2024-03-20"))
224224
225225
attr(ex3, "metadata")
226226
```
@@ -256,7 +256,7 @@ cases in Canada in 2003, from the
256256
x <- outbreaks::sars_canada_2003 %>%
257257
mutate(geo_value = "ca") %>%
258258
select(geo_value, time_value = date, starts_with("cases")) %>%
259-
as_epi_df(geo_type = "nation")
259+
as_epi_df(geo_type = "nation", as_of = as.Date("2024-03-20"))
260260
261261
head(x)
262262
@@ -303,7 +303,7 @@ x <- outbreaks::ebola_sierraleone_2014 %>%
303303
filter(cases == 1) %>%
304304
group_by(geo_value, time_value) %>%
305305
summarise(cases = sum(cases)) %>%
306-
as_epi_df(geo_type = "province")
306+
as_epi_df(geo_type = "province", as_of = as.Date("2024-03-20"))
307307
308308
ggplot(x, aes(x = time_value, y = cases)) +
309309
geom_col(aes(fill = geo_value), show.legend = FALSE) +
@@ -312,11 +312,8 @@ ggplot(x, aes(x = time_value, y = cases)) +
312312
labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")
313313
```
314314

315-
316-
317315
## Attribution
318316
This document contains a dataset that is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
319317

320318
[From the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html):
321319
These signals are taken directly from the JHU CSSE [COVID-19 GitHub repository](https://github.com/CSSEGISandData/COVID-19) without changes.
322-
