Commit 66079cd

Merge pull request cmu-delphi#140 from cmu-delphi/technical-descriptions
Add technical descriptions to API documentation
2 parents be2991f + 9be2bbf commit 66079cd

8 files changed

+615
-65
lines changed

docs/_config.yml

Lines changed: 6 additions & 1 deletion
@@ -30,7 +30,12 @@ plugins:
3030

3131
remote_theme: pmarsceill/just-the-docs
3232

33-
# Just the Docs config
33+
## Just the Docs config
34+
# The theme compresses HTML (remove newlines, whitespace) by default, but this
35+
# breaks KaTeX, since kramdown inserts % in some places in the output.
36+
compress_html:
37+
ignore:
38+
envs: "all"
3439

3540
aux_links:
3641
"CMU Delphi Research Group":

docs/_includes/head_custom.html

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous">
2+
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script>
3+
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/mathtex-script-type.min.js" integrity="sha384-LJ2FmexL77rmGm6SIpxq7y+XA6bkLzGZEgCywzKOZG/ws4va9fUVu2neMjvc3zdv" crossorigin="anonymous"></script>

docs/api/covidcast-signals/_source-template.md

Lines changed: 35 additions & 6 deletions
@@ -5,6 +5,7 @@ grand_parent: COVIDcast API
55
---
66

77
# SOURCE NAME
8+
{: .no_toc}
89

910
* **Source name:** `SOURCE-API-NAME`
1011
* **First issued:** DATE RELEASED TO API
@@ -18,6 +19,40 @@ A brief description of what this source measures.
1819
| --- | --- |
1920
| `signal_name` | Brief description of the signal, including the units it is measured in and any smoothing that is applied |
2021

22+
## Table of contents
23+
{: .no_toc .text-delta}
24+
25+
1. TOC
26+
{:toc}
27+
28+
## Estimation
29+
30+
Describe how any relevant quantities are estimated---both statistically and in
31+
terms of the underlying features or inputs. (For example, if a signal is based
32+
on hospitalizations, what specific types of hospitalization are counted?)
33+
34+
If you need mathematics, we use KaTeX; you can see its supported LaTeX
35+
[here](https://katex.org/docs/supported.html). Inline math is done with *double*
36+
dollar signs, as in $$x = y/z$$, and display math by placing them with
37+
surrounding blank lines, as in
38+
39+
$$
40+
\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
41+
$$
42+
43+
Note that the blank lines are essential.
44+
45+
### Standard Error
46+
47+
If this signal is a random variable, e.g. if it is a survey or based on
48+
proportion estimates, describe here how its standard error is reported and the
49+
nature of the random variation.
50+
51+
### Smoothing
52+
53+
If the smoothing is unusual or involves extra steps beyond a simple moving
54+
average, describe it here.
55+
2156
## Limitations
2257

2358
Any limitations in the interpretation of this signal, such as limits in its
@@ -32,12 +67,6 @@ If this signal is reported with a consistent lag, describe it here.
3267
If this signal is regularly backfilled, describe the reason and nature of the
3368
backfill here.
3469

35-
## Standard Error
36-
37-
If this signal is a random variable, e.g. if it is a survey or based on
38-
proportion estimates, describe here how its standard error is reported and the
39-
nature of the random variation.
40-
4170
## Source
4271

4372
If the signal has specific licensing or sourcing that should be acknowledged,

docs/api/covidcast-signals/doctor-visits.md

Lines changed: 137 additions & 19 deletions
@@ -5,34 +5,34 @@ grand_parent: COVIDcast API
55
---
66

77
# Doctor Visits
8+
{: .no_toc}
89

910
* **Source name:** `doctor-visits`
1011
* **Number of data revisions since 19 May 2020:** 0
1112
* **Date of last change:** Never
1213
* **Available for:** county, hrr, msa, state (see [geography coding docs](../covidcast_geography.md))
1314

1415
This data source is based on information about outpatient visits, provided to us
15-
by healthcare partners. Using this outpatient data, we estimate the percentage
16-
of COVID-related doctor's visits in a given location, on a given day.
16+
by a national health system. Using this outpatient data, we estimate the
17+
percentage of COVID-related doctor's visits in a given location, on a given day.
1718

1819
| Signal | Description |
1920
| --- | --- |
20-
| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from healthcare partners, smoothed in time using a Gaussian linear smoother |
21-
| `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed (so that every day "looks like" a Monday)|
21+
| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from a national health system, smoothed in time using a Gaussian linear smoother |
22+
| `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) |
2223

23-
Day-of-week effects are removed by fitting a model to all data in the United
24-
States; the model includes a fixed effect for each day of the week, except
25-
Monday. Once these effects are estimated, they are subtracted from each
26-
geographic area's time series. This removes day-to-day variation that arises
27-
solely from clinic schedules, work schedules, and other variation in doctor's
28-
visits that arise solely because of the day of week.
24+
## Table of contents
25+
{: .no_toc .text-delta}
26+
27+
1. TOC
28+
{:toc}
2929

3030
## Lag and Backfill
3131

32-
Note that because doctor's visits may be reported to our healthcare partners
33-
several days after they occur, these signals are typically available with
34-
several days of lag. This means that estimates for a specific day are only
35-
available several days later.
32+
Note that because doctor's visits may be reported to the health system several
33+
days after they occur, these signals are typically available with several days
34+
of lag. This means that estimates for a specific day are only available several
35+
days later.
3636

3737
The amount of lag in reporting can vary, and not all visits are reported with
3838
the same lag. After we first report estimates for a specific date, further data
@@ -43,8 +43,126 @@ June 16th.
4343

4444
## Limitations
4545

46-
This data source is based on outpatient visit data provided to us by healthcare
47-
partners. Our partners can report on a portion of the United States healthcare
48-
market, but not all of it, and so this source only represents those visits known
49-
to our partners. Their coverage and market share may vary across the United
50-
States.
46+
This data source is based on outpatient visit data provided to us by a national
47+
health system. The system can report on a portion of United States outpatient
48+
doctor's visits, but not all of them, and so this source only represents those
49+
visits known to them. Their coverage may vary across the United States.
50+
51+
Standard errors are not available for this data source.
52+
53+
## Qualifying Conditions
54+
55+
We receive data on the following five categories of counts:
56+
57+
- Denominator: Daily count of all unique outpatient visits.
58+
- COVID-like: Daily count of all unique outpatient visits with primary ICD-10 code
59+
of any of: {U071, U072, B9729, J1281, Z03818, B342, J1289}.
60+
- Flu-like: Daily count of all unique outpatient visits with primary ICD-10 code
61+
of any of: {J22, B349}. The occurrence of these codes in an area is
62+
correlated with that area's historical influenza activity, but these are
63+
diagnostic codes not specific to influenza and can appear in COVID-19 cases.
64+
- Mixed: Daily count of all unique outpatient visits with primary ICD-10 code of
65+
any of: {Z20828, J129}. The occurrence of these codes in an area is
66+
correlated with a blend of that area's COVID-19 confirmed case counts and
67+
influenza behavior, and these are not diagnostic codes specific to either disease.
68+
- Flu: Daily count of all unique outpatient visits with primary ICD-10 code of
69+
any of: {J09\*, J10\*, J11\*}. The asterisk `*` indicates inclusion of all
70+
subcodes. This set of codes is assigned to influenza viruses.
71+
72+
If a patient has multiple visits on the same date (and hence multiple primary
73+
ICD-10 codes), then we count only one of them, chosen in descending order of priority: *Flu*,
74+
*COVID-like*, *Flu-like*, *Mixed*. This ordering tries to account for the most
75+
definitive confirmation, e.g. the codes assigned to *Flu* are only used for
76+
confirmed influenza cases, which are unrelated to the COVID-19 coronavirus.
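The priority rule above can be illustrated with a minimal sketch. The function name `category_for_day` and the use of plain strings for categories are assumptions made for illustration; they are not taken from the pipeline's actual code.

```python
# Priority order stated in the text, highest priority first.
PRIORITY = ["Flu", "COVID-like", "Flu-like", "Mixed"]


def category_for_day(visit_categories):
    """Pick the single category counted for one patient on one date.

    `visit_categories` is a hypothetical input: the category matched by the
    primary ICD-10 code of each of the patient's visits that day, or None
    for a visit that matches no category.
    """
    present = set(visit_categories)
    for category in PRIORITY:
        if category in present:
            return category
    # No qualifying code: the visits contribute only to the denominator.
    return None
```

For example, a patient with one *Mixed* visit and one *Flu* visit on the same date would be counted once, as *Flu*.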
77+
78+
## Estimation
79+
80+
### COVID-Like Illness
81+
82+
For a fixed location $$i$$ and time $$t$$, let $$Y_{it}^{\text{Covid-like}}$$,
83+
$$Y_{it}^{\text{Flu-like}}$$, $$Y_{it}^{\text{Mixed}}$$, $$Y_{it}^{\text{Flu}}$$
84+
denote the correspondingly named ICD-filtered counts and let $$N_{it}$$ be the
85+
total count of visits (the *Denominator*). Our estimate of the CLI percentage is
86+
given by
87+
88+
$$
89+
\hat p_{it} = 100 \cdot \frac{Y_{it}^{\text{Covid-like}} +
90+
\left((Y_{it}^{\text{Flu-like}} + Y_{it}^{\text{Mixed}}) -
91+
Y_{it}^{\text{Flu}}\right)}{N_{it}}
92+
$$
93+
94+
The estimated standard error is:
95+
96+
$$
97+
\widehat{\text{se}}(\hat{p}_{it}) = \sqrt{\frac{\hat{p}_{it}(1-\hat{p}_{it})}{N_{it}}}.
98+
$$
99+
100+
Note that the quantity above is not exactly correct, for several reasons,
101+
including the smoothing and day-of-week adjustments described below.
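As a rough illustration of the CLI percentage formula above, the following sketch computes the estimate from the five daily counts for one location and day. The function and argument names are hypothetical, and the sketch deliberately omits the standard error.

```python
def cli_percentage(covid_like, flu_like, mixed, flu, denominator):
    """Estimated CLI percentage for one location and day.

    Mirrors the formula in the text:
    100 * (COVID-like + ((Flu-like + Mixed) - Flu)) / Denominator.
    """
    if denominator <= 0:
        raise ValueError("denominator must be a positive visit count")
    numerator = covid_like + ((flu_like + mixed) - flu)
    return 100.0 * numerator / denominator
```

With 30 *COVID-like*, 15 *Flu-like*, 10 *Mixed*, and 5 *Flu* visits out of 1000 total visits, the numerator is 50 and the estimate is 5 percent.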
102+
103+
### Day-of-Week Adjustment
104+
105+
The fraction of visits due to CLI is dependent on the day of the week. On
106+
weekends, doctors see a higher percentage of acute conditions, so the percentage
107+
of CLI is higher. Each day of the week has a different behavior, and if we do
108+
not adjust for this effect, we will not be able to meaningfully compare the
109+
doctor visits signal across different days of the week. We use a Poisson
110+
regression model to produce a signal adjusted for this effect.
111+
112+
We assume that this weekday effect is multiplicative. For example, if the
113+
underlying rate of CLI on each Monday was the same as the previous Sunday, then
114+
the ratio between the doctor visit signals on Sunday and Monday would be a
115+
constant. Formally, we assume that
116+
117+
$$
118+
\log \mu_t = \alpha_{wd(t)} + \phi_t
119+
$$
120+
121+
where $$\mu_t$$ is the expected doctor visits percentage of CLI at time $$t$$,
122+
$$\alpha_{wd(t)}$$ is the weekday correction for the weekday of day $$t$$, and
123+
$$\phi_t$$ is the corrected doctor visits percentage of CLI at time $$t$$.
124+
125+
For simplicity, we assume that the weekday parameters do not change over
126+
time or location. To fit the $$\alpha$$ parameters, we minimize the following
127+
convex objective function:
128+
129+
$$
130+
f(\alpha, \phi | \mu) = -\log \ell (\alpha,\phi|\mu) + \lambda ||\Delta^3 \phi||_1
131+
$$
132+
133+
where $$\ell$$ is the Poisson likelihood and $$\Delta^3 \phi$$ denotes the third
134+
differences of $$\phi$$. For identifiability, we constrain the sum of $$\alpha$$
135+
to be zero by setting Sunday's fixed effect to be the negative sum of the other
136+
weekdays. The penalty term encourages the $$\phi$$ curve to be smooth and
137+
produces meaningful $$\alpha$$ values.
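Two pieces of the objective above are easy to show in isolation: the penalty term on third differences and the sum-to-zero constraint on the weekday effects. The sketch below is illustrative only. The function names are hypothetical, the ordering of the free parameters (Monday through Saturday, with Sunday appended last) is an assumption, and the Poisson likelihood and the optimization itself are not shown.

```python
def third_differences(phi):
    """Third-order differences: phi[t+3] - 3*phi[t+2] + 3*phi[t+1] - phi[t]."""
    return [
        phi[t + 3] - 3 * phi[t + 2] + 3 * phi[t + 1] - phi[t]
        for t in range(len(phi) - 3)
    ]


def smoothness_penalty(phi, lam):
    """The penalty term: lam times the L1 norm of the third differences."""
    return lam * sum(abs(d) for d in third_differences(phi))


def full_alpha(free_alpha):
    """Expand six free weekday effects to seven that sum to zero.

    Sunday's effect is set to the negative sum of the other six, as
    described in the text.
    """
    if len(free_alpha) != 6:
        raise ValueError("expected six free weekday effects")
    return list(free_alpha) + [-sum(free_alpha)]
```

Any quadratic sequence has zero third differences, so the penalty leaves locally quadratic trends in the curve unpenalized and only charges for departures from them.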
138+
139+
Once we have estimated values for $$\alpha$$ for each type of count $$k$$ in
140+
{*COVID-like*, *Flu-like*, *Mixed*, *Flu*}, we obtain the adjusted count
141+
142+
$$\dot{Y}_{it}^k = Y_{it}^k / \alpha_{wd(t)}.$$
143+
144+
We then use these adjusted counts to estimate the CLI percentage as described
145+
above.
146+
147+
### Backfill
148+
149+
To help with the reporting delay, we perform the following simple "backfill"
150+
correction on each location. At each time $$t$$, we consider the total visit
151+
count. If the value is less than a minimum sample threshold, we go back to the
152+
previous time $$t-1$$, and add this visit count to the previous total, again
153+
checking to see if the threshold has been met. If not, we continue to move
154+
backwards in time until we meet the threshold, and take the estimate at time
155+
$$t$$ to be the average over the smallest window that meets the threshold. We
156+
enforce a hard stop to consider only the past 7 days; if we have not yet met the
157+
threshold during that time bin, no estimate will be produced. If, for instance,
158+
at time $$t$$, the minimum sample threshold is already met, then the estimate
159+
only contains data from time $$t$$. This is a dynamic-length moving average,
160+
working backwards through time. The threshold is set at 500 observations.
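The dynamic-length window described above might look like the following sketch for a single location. Several details are assumptions rather than facts from the text: the function name, the treatment of the 7-day limit as including day `t` itself, and taking "the average over the window" to mean the ratio of pooled counts.

```python
def backfilled_estimate(numerators, denominators, t, threshold=500, max_days=7):
    """CLI percentage at day index `t`, pooled over the smallest window.

    `numerators` and `denominators` are daily counts for one location,
    indexed by day. Returns None when the threshold is not met within
    `max_days` days (assumed here to include day `t`).
    """
    num_total = 0.0
    den_total = 0.0
    for back in range(max_days):
        day = t - back
        if day < 0:
            break  # ran out of history before meeting the threshold
        num_total += numerators[day]
        den_total += denominators[day]
        if den_total >= threshold:
            # Smallest window meeting the threshold: stop extending it.
            return 100.0 * num_total / den_total
    return None
```

When day `t` alone already has at least 500 visits, the loop returns on its first iteration and the estimate uses only that day's data, matching the case described in the text.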
161+
162+
### Smoothing
163+
164+
To help with variability, we also employ a local linear regression filter with a
165+
Gaussian kernel. The bandwidth is fixed to approximately cover a rolling 7 day
166+
window, with the highest weight placed on the right edge of the window (the most
167+
recent timepoint). Given this smoothing step, the standard error estimate shown
168+
above is not exactly correct, as the calculation is done post-smoothing.
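A left-sided local linear regression with Gaussian weights can be sketched as below. This is a generic illustration of the technique, not the production smoother: the function name is hypothetical, and the `bandwidth` argument is left to the caller because the exact value used in the pipeline is not stated here.

```python
import math


def local_linear_smooth(values, bandwidth):
    """Smooth a daily series using only current and past observations.

    At each day t, fit a weighted least-squares line to days 0..t with
    Gaussian weights centered at t (so the most recent point has the
    highest weight), and report the fitted value at t.
    """
    smoothed = []
    for t in range(len(values)):
        days = range(t + 1)
        weights = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in days]
        total = sum(weights)
        mean_x = sum(w * s for w, s in zip(weights, days)) / total
        mean_y = sum(w * values[s] for w, s in zip(weights, days)) / total
        sxx = sum(w * (s - mean_x) ** 2 for w, s in zip(weights, days))
        sxy = sum(
            w * (s - mean_x) * (values[s] - mean_y)
            for w, s in zip(weights, days)
        )
        if sxx == 0.0:
            # Only one point available: no slope can be estimated.
            smoothed.append(mean_y)
        else:
            smoothed.append(mean_y + (sxy / sxx) * (t - mean_x))
    return smoothed
```

Because the fit is linear, a series that is already an exact straight line passes through unchanged, which is one quick sanity check for an implementation like this.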
