@@ -13,3 +13,137 @@ options(tibble.print_min = 4L, tibble.print_max = 4L, max.print = 4L)
library(epidatr)
library(dplyr)
```
+
+
+ The Epidata API records not just each signal's estimate for a given location
+ on a given day, but also *when* that estimate was made, and all updates to that
+ estimate.
+
+ For example, let's look at the [doctor visits
+ signal](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)
+ from the [`covidcast` endpoint](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html),
+ which estimates the percentage of outpatient doctor visits that are
+ COVID-related. Consider a result row with `time_value` 2020-05-01 for
+ `geo_values = "pa"`. This is an estimate for Pennsylvania on
+ May 1, 2020. That estimate was *issued* on May 5, 2020, the delay being due to
+ the aggregation of data by our source and the time taken by the Epidata API to
+ ingest the data provided. Later, the estimate for May 1st could be updated,
+ perhaps because additional visit data from May 1st arrived at our source and was
+ reported to us. This constitutes a new *issue* of the data.
+
+
+ ### Data known "as of" a specific date
+
+ By default, endpoint functions fetch the most recent issue available. This
+ is the best option for users who simply want to graph the latest data or
+ construct dashboards. But if we are interested in knowing *when* data was
+ reported, we can request specific data versions using the `as_of`, `issues`, or
+ `lag` arguments.
+
+ _Note_ that these are mutually exclusive; only one can be specified
+ at a time. Also, not all endpoints support all three parameters, so please
+ check the documentation for that specific endpoint.
+
+ First, we can request the data that was available *as of* a specific date, using
+ the `as_of` argument:
+
+
+ ```{r}
+ epidata <- pub_covidcast(
+   source = "doctor-visits",
+   signals = "smoothed_adj_cli",
+   time_type = "day",
+   time_values = epirange("2020-05-01", "2020-05-01"),
+   geo_type = "state",
+   geo_values = "pa",
+   as_of = "2020-05-07"
+ )
+ knitr::kable(epidata)
+ ```
+
+ This shows that an estimate of about 2.3% was issued on May 7. If we don't
+ specify `as_of`, we get the most recent estimate available:
+
+
+ ```{r}
+ epidata <- pub_covidcast(
+   source = "doctor-visits",
+   signals = "smoothed_adj_cli",
+   time_type = "day",
+   time_values = epirange("2020-05-01", "2020-05-01"),
+   geo_type = "state",
+   geo_values = "pa"
+ )
+ knitr::kable(epidata)
+ ```
+
+ Note the substantial change in the estimate, from less than 3% to almost 6%,
+ reflecting new data that became available after May 7 about visits *occurring on*
+ May 1. This illustrates the importance of issue date tracking, particularly
+ for forecasting tasks. To backtest a forecasting model on past data, it is
+ important to use the data that would have been available *at the time* the model
+ was or would have been fit, not data that arrived much later.
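
The "as of" idea can also be sketched offline. The block below is a minimal illustration, not how the Epidata API is implemented: it assumes a small table of versions with `time_value`, `issue`, and `value` columns, and the helper `snapshot_as_of()` and all of the numbers in it are hypothetical. It keeps, for each `time_value`, the latest issue made on or before a chosen date, which is the view a backtest should use.

```r
# Toy version history for one location. All values are invented for illustration.
versions <- data.frame(
  time_value = as.Date(c("2020-05-01", "2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02")),
  issue = as.Date(c("2020-05-05", "2020-05-07", "2020-05-12", "2020-05-06", "2020-05-09")),
  value = c(2.0, 2.5, 6.0, 2.1, 4.8)
)

# Hypothetical helper (not part of epidatr): for each time_value, keep the
# most recent issue made on or before `as_of`.
snapshot_as_of <- function(versions, as_of) {
  known <- versions[versions$issue <= as.Date(as_of), , drop = FALSE]
  known <- known[order(known$time_value, known$issue), , drop = FALSE]
  # After sorting, the last row within each time_value is its latest known issue.
  known[!duplicated(known$time_value, fromLast = TRUE), , drop = FALSE]
}

early <- snapshot_as_of(versions, "2020-05-07") # what a forecaster could see on May 7
final <- snapshot_as_of(versions, "2020-05-31") # what is known after later revisions
print(early)
print(final)
```

Comparing `early` with `final` in this toy table shows the same kind of gap as the two queries above: a model fit on `final` would be using revisions that did not exist on May 7.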
+
+
+ ### Multiple issues of observations
+
+ By using the `issues` argument, we can request all issues in a certain time
+ period:
+
+ ```{r}
+ epidata <- pub_covidcast(
+   source = "doctor-visits",
+   signals = "smoothed_adj_cli",
+   time_type = "day",
+   time_values = epirange("2020-05-01", "2020-05-01"),
+   geo_type = "state",
+   geo_values = "pa",
+   issues = epirange("2020-05-01", "2020-05-15")
+ )
+ knitr::kable(epidata)
+ ```
+
+ This estimate was clearly updated many times as new data for May 1st arrived.
+
+ Note that these results include only data issued or updated between
+ 2020-05-01 and 2020-05-15, inclusive. If a value was first reported on
+ 2020-04-15, and never updated, a query for issues between 2020-05-01 and
+ 2020-05-15 will not include that value among its results.
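
A small offline sketch may make the window rule concrete. It assumes the same kind of toy version table as a stand-in for the API's records; the helper `issues_between()` and every value are hypothetical. The filter is on the `issue` date only, over a closed interval, so a value issued before the window and never revised drops out.

```r
# Toy version history. All values are invented for illustration.
versions <- data.frame(
  time_value = as.Date(c("2020-04-10", "2020-05-01", "2020-05-01", "2020-05-01")),
  issue = as.Date(c("2020-04-15", "2020-05-05", "2020-05-12", "2020-05-20")),
  value = c(1.5, 2.0, 6.0, 6.2)
)

# Hypothetical helper (not part of epidatr): keep every issue whose date falls
# in the closed interval [from, to].
issues_between <- function(versions, from, to) {
  keep <- versions$issue >= as.Date(from) & versions$issue <= as.Date(to)
  versions[keep, , drop = FALSE]
}

window <- issues_between(versions, "2020-05-01", "2020-05-15")
print(window)
```

In this toy table the April 10 observation, issued once on April 15, is absent from `window`, and so is the May 20 revision of the May 1 observation.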
+
+
+ ### Observations issued with a specific lag
+
+ Finally, we can use the `lag` argument to request only data reported with a
+ certain lag. For example, requesting a lag of 7 days fetches only data issued
+ exactly 7 days after the corresponding `time_value`:
+
+ ```{r}
+ epidata <- pub_covidcast(
+   source = "doctor-visits",
+   signals = "smoothed_adj_cli",
+   time_type = "day",
+   time_values = epirange("2020-05-01", "2020-05-07"),
+   geo_type = "state",
+   geo_values = "pa",
+   lag = 7
+ )
+ knitr::kable(epidata)
+ ```
+
+ Note that although this query requested all values between 2020-05-01 and
+ 2020-05-07, May 3rd and May 4th were *not* included in the result set. This is
+ because the query will only include a result for May 3rd if a value was issued
+ on May 10th (a 7-day lag), but in fact the value was not updated on that day:
+
+ ```{r}
+ epidata <- pub_covidcast(
+   source = "doctor-visits",
+   signals = "smoothed_adj_cli",
+   time_type = "day",
+   time_values = epirange("2020-05-03", "2020-05-03"),
+   geo_type = "state",
+   geo_values = "pa",
+   issues = epirange("2020-05-09", "2020-05-15")
+ )
+ knitr::kable(epidata)
+ ```
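
The exact-lag rule described above can be sketched on a toy table as well. This is an illustration under assumptions, not the API's implementation: the table and its values are invented, and the lag is taken to be the number of days between `time_value` and `issue`.

```r
# Toy version history. All values are invented for illustration.
versions <- data.frame(
  time_value = as.Date(c("2020-05-01", "2020-05-01", "2020-05-03", "2020-05-03")),
  issue = as.Date(c("2020-05-05", "2020-05-08", "2020-05-09", "2020-05-12")),
  value = c(2.0, 2.5, 3.1, 3.4)
)

# Lag in whole days between the date described and the date the value was issued.
versions$lag <- as.integer(versions$issue - versions$time_value)

# Keep only rows issued exactly 7 days after their time_value.
lag7 <- versions[versions$lag == 7L, , drop = FALSE]
print(lag7)
```

Here May 3 has issues at lags of 6 and 9 days but none at exactly 7, so it does not appear in `lag7` even though the observation exists.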