Commit 53d8201

add content to versioned data vignette
1 parent b6b0a6f commit 53d8201

1 file changed, +134 -0 lines changed
vignettes/versioned-data.Rmd

Lines changed: 134 additions & 0 deletions
@@ -13,3 +13,137 @@ options(tibble.print_min = 4L, tibble.print_max = 4L, max.print = 4L)
library(epidatr)
library(dplyr)
```

The Epidata API records not just each signal's estimate for a given location
on a given day, but also *when* that estimate was made, and all updates to that
estimate.

For example, let's look at the [doctor visits
signal](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)
from the [`covidcast` endpoint](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html),
which estimates the percentage of outpatient doctor visits that are
COVID-related. Consider a result row with `time_value` 2020-05-01 for
`geo_values = "pa"`. This is an estimate for Pennsylvania on
May 1, 2020. That estimate was *issued* on May 5, 2020, the delay being due to
the aggregation of data by our source and the time taken by the Epidata API to
ingest the data provided. Later, the estimate for May 1st could be updated,
perhaps because additional visit data from May 1st arrived at our source and was
reported to us. This constitutes a new *issue* of the data.
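
To make the vocabulary concrete, here is a minimal base R sketch of how `time_value`, `issue`, and lag relate to one another. The data frame `versions` and its values are hypothetical, invented for illustration; they are not output from the API.

```r
# Hypothetical issue history for one location and one reference date.
# The values are made up for illustration; they are not API output.
versions <- data.frame(
  geo_value  = "pa",
  time_value = as.Date("2020-05-01"),
  issue      = as.Date(c("2020-05-05", "2020-05-06", "2020-05-12")),
  value      = c(2.0, 2.4, 5.1)
)

# The lag is the number of days between the reference date (time_value)
# and the date on which that version of the estimate was issued.
versions$lag <- as.integer(versions$issue - versions$time_value)
versions
```

Each row is one issue of the same May 1 estimate; only the `issue`, `value`, and `lag` columns differ between rows.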

### Data known "as of" a specific date

By default, endpoint functions fetch the most recent issue available. This
is the best option for users who simply want to graph the latest data or
construct dashboards. But if we are interested in knowing *when* data was
reported, we can request specific data versions using the `as_of`, `issues`, or
`lag` arguments.

_Note_ that these are mutually exclusive; only one can be specified
at a time. Also, not all endpoints support all three parameters, so please
check the documentation for that specific endpoint.

First, we can request the data that was available *as of* a specific date, using
the `as_of` argument:

```{r}
epidata <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-01"),
  geo_type = "state",
  geo_values = "pa",
  as_of = "2020-05-07"
)
knitr::kable(epidata)
```

This shows that, as of May 7, the most recent estimate available was about
2.3%. If we don't specify `as_of`, we get the most recent estimate available:

```{r}
epidata <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-01"),
  geo_type = "state",
  geo_values = "pa"
)
knitr::kable(epidata)
```

Note the substantial change in the estimate, from less than 3% to almost 6%,
reflecting new data that became available after May 7 about visits *occurring on*
May 1. This illustrates the importance of issue date tracking, particularly
for forecasting tasks. To backtest a forecasting model on past data, it is
important to use the data that would have been available *at the time* the model
was or would have been fit, not data that arrived much later.
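
As a rough sketch of what an "as of" snapshot means for backtesting, the base R code below keeps, for each `time_value`, only the latest issue on or before a cutoff date. The helper `snapshot_as_of()` and the data in `versions` are hypothetical and purely illustrative; this is not how the API itself is implemented.

```r
# Hypothetical issue history (made-up values, not API output).
versions <- data.frame(
  time_value = as.Date("2020-05-01"),
  issue      = as.Date(c("2020-05-05", "2020-05-06", "2020-05-12")),
  value      = c(2.0, 2.4, 5.1)
)

# Hypothetical helper: for each time_value, keep the latest issue
# that was published on or before the `as_of` date.
snapshot_as_of <- function(versions, as_of) {
  known <- versions[versions$issue <= as.Date(as_of), , drop = FALSE]
  known <- known[order(known$time_value, known$issue), , drop = FALSE]
  # After sorting, the last row within each time_value is the newest issue.
  is_latest <- !duplicated(known$time_value, fromLast = TRUE)
  known[is_latest, , drop = FALSE]
}

snapshot_as_of(versions, "2020-05-07")$value # 2.4
snapshot_as_of(versions, "2020-05-31")$value # 5.1
```

A backtest run "on May 7" would see only the first value, even though a later revision exists; a cutoff earlier than the first issue returns no rows at all.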

### Multiple issues of observations

By using the `issues` argument, we can request all issues in a certain time
period:

```{r}
epidata <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-01"),
  geo_type = "state",
  geo_values = "pa",
  issues = epirange("2020-05-01", "2020-05-15")
)
knitr::kable(epidata)
```

This estimate was clearly updated many times as new data for May 1st arrived.

Note that these results include only data issued or updated between
2020-05-01 and 2020-05-15 (inclusive). If a value was first reported on
2020-04-15, and never updated, a query for issues between 2020-05-01 and
2020-05-15 will not include that value among its results.
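
The inclusive issue-range behavior described above can be sketched on toy data. The helper `issued_between()` and the values in `versions` are hypothetical and exist only to illustrate the filtering rule.

```r
# Hypothetical issue history (made-up values, not API output).
# The April 10 value was issued once, on April 15, and never updated.
versions <- data.frame(
  time_value = as.Date(c("2020-04-10", "2020-05-01", "2020-05-01")),
  issue      = as.Date(c("2020-04-15", "2020-05-05", "2020-05-12")),
  value      = c(1.2, 2.0, 5.1)
)

# Hypothetical helper: keep only versions whose issue date falls
# inside the inclusive range [start, end].
issued_between <- function(versions, start, end) {
  in_range <- versions$issue >= as.Date(start) & versions$issue <= as.Date(end)
  versions[in_range, , drop = FALSE]
}

result <- issued_between(versions, "2020-05-01", "2020-05-15")
result
```

The April 10 observation is absent from `result` because its only issue predates the requested range, while both issues of the May 1 observation are returned.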

### Observations issued with a specific lag

Finally, we can use the `lag` argument to request only data reported with a
certain lag. For example, requesting a lag of 7 days fetches only data issued
exactly 7 days after the corresponding `time_value`:

```{r}
epidata <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-01", "2020-05-07"),
  geo_type = "state",
  geo_values = "pa",
  lag = 7
)
knitr::kable(epidata)
```

Note that though this query requested all values between 2020-05-01 and
2020-05-07, May 3rd and May 4th were *not* included in the result set. This is
because the query will only include a result for May 3rd if a value was issued
on May 10th (a 7-day lag), but in fact the value was not updated on that day:

```{r}
epidata <- pub_covidcast(
  source = "doctor-visits",
  signals = "smoothed_adj_cli",
  time_type = "day",
  time_values = epirange("2020-05-03", "2020-05-03"),
  geo_type = "state",
  geo_values = "pa",
  issues = epirange("2020-05-09", "2020-05-15")
)
knitr::kable(epidata)
```
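
The exact-lag rule can also be sketched on toy data. The helper `with_lag()` and the values in `versions` are hypothetical; they only illustrate why a date drops out of the results when no issue falls exactly `lag` days after it.

```r
# Hypothetical issue history (made-up values, not API output).
versions <- data.frame(
  time_value = as.Date(c("2020-05-02", "2020-05-02", "2020-05-03", "2020-05-03")),
  issue      = as.Date(c("2020-05-06", "2020-05-09", "2020-05-07", "2020-05-11")),
  value      = c(2.1, 2.6, 2.2, 2.9)
)

# Hypothetical helper: keep only versions issued exactly `lag` days
# after their time_value.
with_lag <- function(versions, lag) {
  actual_lag <- as.integer(versions$issue - versions$time_value)
  versions[actual_lag == lag, , drop = FALSE]
}

week_old <- with_lag(versions, 7)
week_old
```

In this toy history, May 2 has an issue at a 7-day lag and is kept, while May 3 was issued at lags of 4 and 8 days only, so it does not appear.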