krivard
diff --git a/docs/_config.yml b/docs/_config.yml
Lines changed: 6 additions & 1 deletion
diff --git a/docs/_includes/head_custom.html b/docs/_includes/head_custom.html
Lines changed: 3 additions & 0 deletions
diff --git a/docs/api/covidcast-signals/_source-template.md b/docs/api/covidcast-signals/_source-template.md
Lines changed: 35 additions & 6 deletions
diff --git a/docs/api/covidcast-signals/doctor-visits.md b/docs/api/covidcast-signals/doctor-visits.md
Lines changed: 137 additions & 19 deletions
@@ -30,7 +30,12 @@ plugins:
 
 remote_theme: pmarsceill/just-the-docs
 
-# Just the Docs config
+## Just the Docs config
+# The theme compresses HTML (remove newlines, whitespace) by default, but this
+# breaks KaTeX, since kramdown inserts % in some places in the output.
+compress_html:
+  ignore:
+    envs: "all"
 
 aux_links:
   "CMU Delphi Research Group":
 
@@ -0,0 +1,3 @@
+<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous">
+<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script>
+<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/mathtex-script-type.min.js" integrity="sha384-LJ2FmexL77rmGm6SIpxq7y+XA6bkLzGZEgCywzKOZG/ws4va9fUVu2neMjvc3zdv" crossorigin="anonymous"></script>
@@ -5,6 +5,7 @@ grand_parent: COVIDcast API
 ---
 
 # SOURCE NAME
+{: .no_toc}
 
 * **Source name:** `SOURCE-API-NAME`
 * **First issued:** DATE RELEASED TO API
@@ -18,6 +19,40 @@ A brief description of what this source measures.
 | --- | --- |
 | `signal_name` | Brief description of the signal, including the units it is measured in and any smoothing that is applied |
 
+## Table of contents
+{: .no_toc .text-delta}
+
+1. TOC
+{:toc}
+
+## Estimation
+
+Describe how any relevant quantities are estimated---both statistically and in
+terms of the underlying features or inputs. (For example, if a signal is based
+on hospitalizations, what specific types of hospitalization are counted?)
+
+If you need mathematics, we use KaTeX; you can see its supported LaTeX
+[here](https://katex.org/docs/supported.html). Inline math is done with *double*
+dollar signs, as in $$x = y/z$$, and display math by placing them with
+surrounding blank lines, as in
+
+$$
+\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
+$$
+
+Note that the blank lines are essential.
+
+### Standard Error
+
+If this signal is a random variable, e.g. if it is a survey or based on
+proportion estimates, describe here how its standard error is reported and the
+nature of the random variation.
+
+### Smoothing
+
+If the smoothing is unusual or involves extra steps beyond a simple moving
+average, describe it here.
+
 ## Limitations
 
 Any limitations in the interpretation of this signal, such as limits in its
@@ -32,12 +67,6 @@ If this signal is reported with a consistent lag, describe it here.
 If this signal is regularly backfilled, describe the reason and nature of the
 backfill here.
 
-## Standard Error
-
-If this signal is a random variable, e.g. if it is a survey or based on
-proportion estimates, describe here how its standard error is reported and the
-nature of the random variation.
-
 ## Source
 
 If the signal has specific licensing or sourcing that should be acknowledged,
 
@@ -5,34 +5,34 @@ grand_parent: COVIDcast API
 ---
 
 # Doctor Visits
+{: .no_toc}
 
 * **Source name:** `doctor-visits`
 * **Number of data revisions since 19 May 2020:** 0
 * **Date of last change:** Never
 * **Available for:** county, hrr, msa, state (see [geography coding docs](../covidcast_geography.md))
 
 This data source is based on information about outpatient visits, provided to us
-by healthcare partners. Using this outpatient data, we estimate the percentage
-of COVID-related doctor's visits in a given location, on a given day.
+by a national health system. Using this outpatient data, we estimate the
+percentage of COVID-related doctor's visits in a given location, on a given day.
 
 | Signal | Description |
 | --- | --- |
-| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from healthcare partners, smoothed in time using a Gaussian linear smoother |
-| `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed (so that every day "looks like" a Monday)|
+| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from a national health system, smoothed in time using a Gaussian linear smoother |
+| `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) |
 
-Day-of-week effects are removed by fitting a model to all data in the United
-States; the model includes a fixed effect for each day of the week, except
-Monday. Once these effects are estimated, they are subtracted from each
-geographic area's time series. This removes day-to-day variation that arises
-solely from clinic schedules, work schedules, and other variation in doctor's
-visits that arise solely because of the day of week.
+## Table of contents
+{: .no_toc .text-delta}
+
+1. TOC
+{:toc}
 
 ## Lag and Backfill
 
-Note that because doctor's visits may be reported to our healthcare partners
-several days after they occur, these signals are typically available with
-several days of lag. This means that estimates for a specific day are only
-available several days later.
+Note that because doctor's visits may be reported to the health system several
+days after they occur, these signals are typically available with several days
+of lag. This means that estimates for a specific day are only available several
+days later.
 
 The amount of lag in reporting can vary, and not all visits are reported with
 the same lag. After we first report estimates for a specific date, further data
@@ -43,8 +43,126 @@ June 16th.
 
 ## Limitations
 
-This data source is based on outpatient visit data provided to us by healthcare
-partners. Our partners can report on a portion of the United States healthcare
-market, but not all of it, and so this source only represents those visits known
-to our partners. Their coverage and market share may vary across the United
-States.
+This data source is based on outpatient visit data provided to us by a national
+health system. The system can report on a portion of United States outpatient
+doctor's visits, but not all of them, and so this source only represents those
+visits known to it. Its coverage may vary across the United States.
+
+Standard errors are not available for this data source.
+
+## Qualifying Conditions
+
+We receive data on the following five categories of counts:
+
+- Denominator: Daily count of all unique outpatient visits.
+- COVID-like: Daily count of all unique outpatient visits with primary ICD-10 code
+	of any of: {U071, U072, B9729, J1281, Z03818, B342, J1289}.
+- Flu-like: Daily count of all unique outpatient visits with primary ICD-10 code
+	of any of: {J22, B349}. The occurrence of these codes in an area is
+	correlated with that area's historical influenza activity, but these
+	diagnostic codes are not specific to influenza and can appear in COVID-19
+	cases.
+- Mixed: Daily count of all unique outpatient visits with primary ICD-10 code of
+	any of: {Z20828, J129}. The occurrence of these codes in an area is
+	correlated with a blend of that area's COVID-19 confirmed case counts and
+	influenza behavior, and these diagnostic codes are not specific to either
+	disease.
+- Flu: Daily count of all unique outpatient visits with primary ICD-10 code of
+	any of: {J09\*, J10\*, J11\*}. The asterisk `*` indicates inclusion of all
+	subcodes. This set of codes is assigned to influenza viruses.
+
+If a patient has multiple visits on the same date (and hence multiple primary
+ICD-10 codes), then we count only one of them, chosen in the following
+descending order of priority: *Flu*, *COVID-like*, *Flu-like*, *Mixed*. This
+ordering tries to account for the most definitive confirmation, e.g. the codes
+assigned to *Flu* are only used for confirmed influenza cases, which are
+unrelated to the COVID-19 coronavirus.
+
+## Estimation
+
+### COVID-Like Illness
+
+For a fixed location $$i$$ and time $$t$$, let $$Y_{it}^{\text{Covid-like}}$$,
+$$Y_{it}^{\text{Flu-like}}$$, $$Y_{it}^{\text{Mixed}}$$, $$Y_{it}^{\text{Flu}}$$
+denote the correspondingly named ICD-filtered counts and let $$N_{it}$$ be the
+total count of visits (the *Denominator*). Our estimate of the CLI percentage is
+given by
+
+$$
+\hat p_{it} = 100 \cdot  \frac{Y_{it}^{\text{Covid-like}} +
+	\left((Y_{it}^{\text{Flu-like}} + Y_{it}^{\text{Mixed}}) -
+	Y_{it}^{\text{Flu}}\right)}{N_{it}}
+$$
+
+The estimated standard error is:
+
+$$
+\widehat{\text{se}}(\hat{p}_{it}) =  \sqrt{\frac{\hat{p}_{it}(1-\hat{p}_{it})}{N_{it}}}.
+$$
+
+Note that this quantity is only approximate since, among other things, it does
+not account for the smoothing and day-of-week adjustment steps applied to the
+signal.
+
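As a minimal sketch of the estimate above, the following Python computes the CLI percentage from the four filtered counts and the denominator. The function and argument names are hypothetical. The standard error is computed on a proportion scale (0 to 1) and converted back to percent, which is an assumption about how the binomial formula is meant to be applied, since the text does not state the scale.

```python
import math


def cli_percentage(covid_like, flu_like, mixed, flu, denominator):
    """CLI percentage: 100 * (COVID-like + (Flu-like + Mixed - Flu)) / N."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    numerator = covid_like + ((flu_like + mixed) - flu)
    return 100.0 * numerator / denominator


def binomial_se_percent(p_hat_percent, denominator):
    """Binomial standard error, in percent.

    Assumption: the formula is applied to the proportion p in [0, 1],
    then rescaled to percent.
    """
    p = p_hat_percent / 100.0
    if not 0.0 <= p <= 1.0:
        raise ValueError("estimate must lie between 0 and 100 percent")
    return 100.0 * math.sqrt(p * (1.0 - p) / denominator)


# Illustrative counts only, not real data.
p_hat = cli_percentage(covid_like=30, flu_like=10, mixed=5, flu=5, denominator=1000)
se = binomial_se_percent(p_hat, 1000)
```

Because the *Flu* count is subtracted, the numerator can in principle go negative when confirmed influenza visits dominate; the sketch raises an error in that case rather than guessing how the pipeline handles it.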
+### Day-of-Week Adjustment
+
+The fraction of visits due to CLI is dependent on the day of the week. On
+weekends, doctors see a higher percentage of acute conditions, so the percentage
+of CLI is higher. Each day of the week has a different behavior, and if we do
+not adjust for this effect, we will not be able to meaningfully compare the
+doctor visits signal across different days of the week. We use a Poisson
+regression model to produce a signal adjusted for this effect.
+
+We assume that this weekday effect is multiplicative. For example, if the
+underlying rate of CLI on each Monday was the same as the previous Sunday, then
+the ratio between the doctor visit signals on Sunday and Monday would be a
+constant. Formally, we assume that
+
+$$
+\log \mu_t = \alpha_{wd(t)} + \phi_t
+$$
+
+where $$\mu_t$$ is the expected doctor visits percentage of CLI at time $$t$$,
+$$\alpha_{wd(t)}$$ is the weekday correction for the weekday of day $$t$$, and
+$$\phi_t$$ is the corrected doctor visits percentage of CLI at time $$t$$.
+
+For simplicity, we assume that the weekday parameters do not change over
+time or location. To fit the $$\alpha$$ parameters, we minimize the following
+convex objective function:
+
+$$
+f(\alpha, \phi | \mu) = -\log \ell (\alpha,\phi|\mu) + \lambda ||\Delta^3 \phi||_1
+$$
+
+where $$\ell$$ is the Poisson likelihood and $$\Delta^3 \phi$$ denotes the third
+differences of $$\phi$$. For identifiability, we constrain the sum of $$\alpha$$
+to be zero by setting Sunday's fixed effect to be the negative sum of the other
+weekdays. The penalty term encourages the $$\phi$$ curve to be smooth and
+produces meaningful $$\alpha$$ values.
+
+Once we have estimated values for $$\alpha$$ for each type of count $$k$$ in
+{*COVID-like*, *Flu-like*, *Mixed*, *Flu*}, we obtain the adjusted count
+
+$$\dot{Y}_{it}^k = Y_{it}^k / \alpha_{wd(t)}.$$
+
+We then use these adjusted counts to estimate the CLI percentage as described
+above.
+
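The sum-to-zero constraint and the count adjustment can be sketched as follows. This does not fit the penalized Poisson model; it takes already-estimated weekday effects as input. All names are hypothetical. One assumption is worth flagging: because the model is written on the log scale, the sketch divides by the multiplicative factor `exp(alpha)`, whereas the formula in the text writes the division by `alpha` directly.

```python
import math
from datetime import date


def full_weekday_effects(free_effects):
    """Build all seven weekday effects from the six free ones.

    `free_effects` holds Monday..Saturday; Sunday is set to the negative
    sum of the others so that the effects sum to zero.
    """
    if len(free_effects) != 6:
        raise ValueError("expected six effects, Monday through Saturday")
    return list(free_effects) + [-sum(free_effects)]


def adjust_count(count, day, alpha):
    """Remove the weekday effect from a count.

    Assumption: alpha is on the log scale, so the multiplicative
    weekday factor is exp(alpha).
    """
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return count / math.exp(alpha[day.weekday()])


# Illustrative effect values only, not fitted parameters.
alpha = full_weekday_effects([0.1, 0.0, -0.05, 0.0, 0.05, 0.2])
monday_adjusted = adjust_count(100, date(2020, 6, 15), alpha)
sunday_adjusted = adjust_count(100, date(2020, 6, 14), alpha)
```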
+### Backfill
+
+To help with the reporting delay, we perform the following simple "backfill"
+correction on each location. At each time $$t$$, we consider the total visit
+count. If the value is less than a minimum sample threshold, we go back to the
+previous time $$t-1$$, and add this visit count to the previous total, again
+checking to see if the threshold has been met. If not, we continue to move
+backwards in time until we meet the threshold, and take the estimate at time
+$$t$$ to be the average over the smallest window that meets the threshold. We
+enforce a hard stop to consider only the past 7 days; if we have not yet met the
+threshold during that time bin, no estimate will be produced. If, for instance,
+at time $$t$$, the minimum sample threshold is already met, then the estimate
+only contains data from time $$t$$. This is a dynamic-length moving average,
+working backwards through time. The threshold is set at 500 observations.
+
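The dynamic-length window described above can be sketched in a few lines of Python. The function name and arguments are hypothetical. Two assumptions are made where the text is not explicit: the 7-day limit is taken to include day `t` itself, and the "average" over the window is taken to be the pooled ratio of summed counts to summed visits.

```python
def backfill_estimate(numerators, denominators, t, threshold=500, max_days=7):
    """Pool counts backwards from day t until the visit total meets the threshold.

    Returns the pooled percentage, or None if the threshold is not met
    within `max_days` days (assumed to include day t itself).
    """
    total_num = 0
    total_den = 0
    for lag in range(max_days):
        idx = t - lag
        if idx < 0:
            break  # ran out of history before meeting the threshold
        total_num += numerators[idx]
        total_den += denominators[idx]
        if total_den >= threshold:
            # Smallest window that meets the threshold.
            return 100.0 * total_num / total_den
    return None


# Illustrative counts only, not real data.
nums = [30, 15, 60]
dens = [300, 300, 600]
same_day = backfill_estimate(nums, dens, t=2)    # day 2 alone has 600 visits
two_days = backfill_estimate(nums, dens, t=1)    # needs days 1 and 0 pooled
too_sparse = backfill_estimate(nums, dens, t=0)  # only 300 visits available
```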
+### Smoothing
+
+To help with variability, we also employ a local linear regression filter with a
+Gaussian kernel. The bandwidth is fixed to approximately cover a rolling 7 day
+window, with the highest weight placed on the right edge of the window (the most
+recent timepoint). Given this smoothing step, the standard error estimate shown
+above is not exactly correct, as the calculation is done post-smoothing.
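A local linear fit with Gaussian weights, evaluated at the most recent timepoint, can be sketched as below. This is an illustration of the general technique rather than the pipeline's implementation. The function name is hypothetical and the bandwidth value of 3 days is an assumption, since the text only says the bandwidth approximately covers a 7-day window.

```python
import math


def local_linear_right_edge(values, bandwidth=3.0, window=7):
    """Weighted least-squares line over the last `window` points,
    evaluated at the right edge (the most recent timepoint).

    Weights are Gaussian in the distance from the right edge, so the
    most recent point gets the highest weight.
    """
    recent = list(values[-window:])
    n = len(recent)
    if n == 0:
        raise ValueError("need at least one value")
    xs = [i - (n - 1) for i in range(n)]  # offsets -(n-1), ..., 0
    ws = [math.exp(-0.5 * (x / bandwidth) ** 2) for x in xs]

    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, recent))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, recent))

    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:
        return sy / sw  # a single point: fall back to the weighted mean
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return intercept  # fitted value at offset 0, the right edge


# A linear series is reproduced exactly by a local linear fit.
linear = [2 + 3 * i for i in range(10)]
smoothed_last = local_linear_right_edge(linear)
```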