cmu-delphi
diff --git a/docs/api/README.md b/docs/api/README.md (+1 -1)
diff --git a/docs/api/covidcast-signals/doctor-visits.md b/docs/api/covidcast-signals/doctor-visits.md (+21 -16)
diff --git a/docs/api/covidcast-signals/fb-survey.md b/docs/api/covidcast-signals/fb-survey.md (+10 -2)
diff --git a/docs/api/covidcast-signals/ght.md b/docs/api/covidcast-signals/ght.md (+14 -8)
diff --git a/docs/api/covidcast-signals/google-survey.md b/docs/api/covidcast-signals/google-survey.md (+233 -1)
@@ -1,6 +1,6 @@
 ---
 title: Delphi Epidata API
-nav_order: 2
+nav_order: 3
 has_children: true
 ---
 
 
@@ -13,12 +13,12 @@ grand_parent: COVIDcast API
 * **Available for:** county, hrr, msa, state (see [geography coding docs](../covidcast_geography.md))
 
 This data source is based on information about outpatient visits, provided to us
-by a national health system. Using this outpatient data, we estimate the
+by health system partners. Using this outpatient data, we estimate the
 percentage of COVID-related doctor's visits in a given location, on a given day.
 
 | Signal | Description |
 | --- | --- |
-| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from a national health system, smoothed in time using a Gaussian linear smoother |
+| `smoothed_cli` | Estimated percentage of outpatient doctor visits primarily about COVID-related symptoms, based on data from health system partners, smoothed in time using a Gaussian linear smoother |
 | `smoothed_adj_cli` | Same, but with systematic day-of-week effects removed; see [details below](#day-of-week-adjustment) |
 
 ## Table of contents
@@ -29,10 +29,10 @@ percentage of COVID-related doctor's visits in a given location, on a given day.
 
 ## Lag and Backfill
 
-Note that because doctor's visits may be reported to the health system several
-days after they occur, these signals are typically available with several days
-of lag. This means that estimates for a specific day are only available several
-days later.
+Note that because doctor's visits may be reported to the health system partners
+several days after they occur, these signals are typically available with
+several days of lag. This means that estimates for a specific day are only
+available several days later.
 
 The amount of lag in reporting can vary, and not all visits are reported with
 the same lag. After we first report estimates for a specific date, further data
@@ -43,10 +43,11 @@ June 16th.
 
 ## Limitations
 
-This data source is based on outpatient visit data provided to us by a national
-health system. The system can report on a portion of United States outpatient
-doctor's visits, but not all of them, and so this source only represents those
-visits known to them. Their coverage may vary across the United States.
+This data source is based on outpatient visit data provided to us by health
+system partners. The partners can report on a portion of United States
+outpatient doctor's visits, but not all of them, and so this source only
+represents those visits known to them. Their coverage may vary across the United
+States.
 
 Standard errors are not available for this data source.
 
@@ -115,16 +116,20 @@ the ratio between the doctor visit signals on Sunday and Monday would be a
 constant. Formally, we assume that
 
 $$
-\log \mu_t = \alpha_{wd(t)} + \phi_t
+\begin{aligned}
+\mathbb{E}[Y_{it}] &= \mu_t\\
+\log \mu_t &= \alpha_{\text{wd}(t)} + \phi_t,
+\end{aligned}
 $$
 
-where $$\mu_t$$ is the expected doctor visits percentage of CLI at time $$t$$,
-$$\alpha_{wd(t)}$$ is the weekday correction for the weekday of day $$t$$, and
+where $$Y_{it}$$ is the observed doctor visits percentage of CLI in location
+$$i$$ at time $$t$$, $$\text{wd}(t) \in \{0, \dots, 6\}$$ is the day-of-week of
+time $$t$$, $$\alpha_{\text{wd}(t)}$$ is the corresponding weekday correction, and
 $$\phi_t$$ is the corrected doctor visits percentage of CLI at time $$t$$.
 
-For simplicity, we fit assume that the weekday parameters do not change over
-time or location. To fit the $$\alpha$$ parameters, we minimize the following
-convex objective function:
+For simplicity, we assume that the weekday parameters do not change over time or
+location. To fit the $$\alpha$$ parameters, we minimize the following convex
+objective function:
 
 $$
 f(\alpha, \phi | \mu) = -\log \ell (\alpha,\phi|\mu) + \lambda ||\Delta^3 \phi||_1
 
@@ -27,8 +27,8 @@ day.
 
 | Signal | Description |
 | --- | --- |
-| `raw_cli` | Estimated percentage of people with COVID-like illness based on the [criteria below](#defining-household-ili-and-cli), with no smoothing or survey weighting |
-| `raw_ili` | Estimated percentage of people with influenza-like illness based on the [criteria below](#defining-household-ili-and-cli), with no smoothing or survey weighting |
+| `raw_cli` | Estimated percentage of people with COVID-like illness based on the [criteria below](#ili-and-cli-indicators), with no smoothing or survey weighting |
+| `raw_ili` | Estimated percentage of people with influenza-like illness based on the [criteria below](#ili-and-cli-indicators), with no smoothing or survey weighting |
 | `raw_wcli` | Estimated percentage of people with COVID-like illness; adjusted using survey weights [as described below](#survey-weighting) |
 | `raw_wili` | Estimated percentage of people with influenza-like illness; adjusted using survey weights [as described below](#survey-weighting) |
 | `raw_hh_cmnty_cli` | Estimated percentage of people reporting illness in their local community, as [described below](#estimating-community-cli), including their household, with no smoothing or survey weighting |
@@ -92,6 +92,14 @@ COVID-like illness or CLI is not a standard indicator. Through our discussions
 with the CDC, we chose to define it as: fever along with cough or shortness of
 breath or difficulty breathing.
 
+Symptoms alone are not sufficient to diagnose influenza or coronavirus
+infections, and so these ILI and CLI indicators are *not* expected to be
+unbiased estimates of the true rate of influenza or coronavirus infections.
+These symptoms can be caused by many other conditions, and many true infections
+can be asymptomatic. Instead, we expect these indicators to be useful for
+comparison across the United States and across time, to determine where symptoms
+appear to be increasing.
+
 ### Defining Household ILI and CLI
 
 For a single survey, we are interested in the quantities:
 
@@ -32,14 +32,20 @@ numbers of COVID-related searches.
 ## Estimation
 
 We query the Google Health Trends API for overall searcher interest in a set of
-COVID-19 related terms which encompass the following topics: coronavirus
-symptoms; coronavirus help; coronavirus test-seeking; anosmia (lack of smell or
-taste). The API provides data at the Nielsen Designated Marketing Area (DMA)
-level and at the State level. This information reported by the API is unitless
-and pre-normalized for population size; i.e., the time series obtained for New
-York and Wyoming states are directly comparable. The public has access to a
-limited view of such information through [Google
-Trends](https://trends.google.com).
+COVID-19 related terms about anosmia (lack of smell or taste), which emerged as
+a symptom of the coronavirus. The specific terms are:
+
+* "why cant i smell or taste"
+* "loss of smell"
+* "loss of taste"
+* Anosmia generally, by querying for topics linked by Google to the anosmia item
+  in the Freebase knowledge graph (ID `/m/0m7pl`)
+
+The API provides data at the Nielsen Designated Marketing Area (DMA) level and
+at the State level. This information reported by the API is unitless and
+pre-normalized for population size; i.e., the time series obtained for New York
+and Wyoming states are directly comparable. The public has access to a limited
+view of such information through [Google Trends](https://trends.google.com).
 
 DMA-level data are aggregated to the MSA and HRR level through
 population-weighted averaging.
 
@@ -5,6 +5,7 @@ grand_parent: COVIDcast API
 ---
 
 # Google Symptom Surveys
+{: .no_toc}
 
 * **Source name:** `google-survey`
 * **Number of data revisions since 19 May 2020:** 0
@@ -35,4 +36,235 @@ specific geographical areas as needed to support forecasting efforts.
 | Signal | Description |
 | --- | --- |
 | `raw_cli` | Estimated percentage of people who know someone in their community with COVID-like illness |
-| `smoothed_cli` | Estimated percentage of people who know someone in their community with COVID-like illness, smoothed in time |
+| `smoothed_cli` | Estimated percentage of people who know someone in their community with COVID-like illness, smoothed in time [as described below](#smoothing) |
+
+## Table of contents
+{: .no_toc .text-delta}
+
+1. TOC
+{:toc}
+
+## Estimation
+
+Let $$Y$$ be the number of people who know someone in their community with
+COVID-like illness or CLI, over a given time period and in a given location, and
+let $$N$$ be the number of people in this location who do *not* know someone in
+their community with CLI. We are interested in the proportion
+
+$$
+p = \frac{Y}{Y+N}.
+$$
+
+Since the Google Surveys system provides an estimated county for each respondent,
+we are able to report $$p$$ for counties, MSAs, HRRs, and states. Our current
+rule-of-thumb is to discard any estimate (whether at a county, MSA, HRR, or
+state level) that is composed of fewer than 100 survey responses.
+
+At the county, MSA, and HRR levels, our estimation procedure is fairly
+simple, and is outlined below. Estimation for mega-counties and states is more
+complex, and deferred to a later subsection.
+
+### County Level
+
+Recall that we run surveys separately (in a stratified manner) in each county.
+In a given county, if $$Y$$ denotes the number of respondents who know someone
+in their community with CLI, $$N$$ denotes the total number who do not, and $$n
+= Y + N$$ the number of "yes" and "no" responses combined, then to estimate
+$$p$$ in the county, we simply use:
+
+$$
+\hat{p} = \frac{Y}{n}.
+$$
+
+Its estimated standard error is:
+
+$$
+\widehat{\text{se}}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},
+$$
+
+which is the plug-in estimate of the standard error of the estimator, treating
+$$n$$ as fixed.
+
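A minimal sketch of the county-level estimate and its plug-in standard error, as defined in the County Level subsection. The function name `county_estimate` and the constant `MIN_RESPONSES` are hypothetical; the 100-response cutoff is the rule of thumb stated in the Estimation section, and applying it to the count of yes and no responses is an assumption.

```python
import math

# Rule-of-thumb cutoff from the Estimation section (assumed here to apply
# to the combined number of "yes" and "no" responses).
MIN_RESPONSES = 100


def county_estimate(yes, no):
    """Return (p_hat, se_hat) for one county, or None if too few responses."""
    n = yes + no
    if n < MIN_RESPONSES:
        return None
    p_hat = yes / n
    # Plug-in standard error, treating n as fixed.
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se_hat
```

For example, `county_estimate(30, 170)` uses `n = 200` responses and gives `p_hat = 0.15`.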
+### MSA and HRR Levels
+
+Suppose a given MSA or HRR contains $$m$$ counties. In county $$i$$, let $$Y_i$$ denote the
+number of "yes" responses, $$N_i$$ denote the number of "no" responses, and
+$$n_i = Y_i + N_i$$ the total number of yes and no responses. Let $$\hat p_i =
+Y_i/n_i$$ be the estimate for each county. Also, let $$k_i$$ denote the
+population of county $$i$$, and let $$k = \sum_{i=1}^m k_i$$ denote the total
+population of all surveyed counties within the MSA or HRR.
+
+Our estimator is then
+
+$$
+\hat{p} = \sum_{i=1}^m \frac{k_i}{k} \hat{p}_i,
+$$
+
+the standard stratified sampling estimate of $$p$$. Its estimated standard error
+is
+
+$$
+\widehat{\text{se}}(\hat{p}) = \sqrt{\sum_{i=1}^m \Big(\frac{k_i}{k}\Big)^2
+  \frac{\hat{p}_i(1-\hat{p}_i)}{n_i}},
+$$
+
+again the plug-in estimate of standard error of the estimator.
+
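The stratified estimator for an MSA or HRR and its plug-in standard error can be sketched as follows. The function name `stratified_estimate` and the `(yes, no, population)` tuple layout are illustrative assumptions, not part of the pipeline; every surveyed county is assumed to have at least one response.

```python
import math


def stratified_estimate(counties):
    """Population-weighted estimate over surveyed counties.

    counties: list of (yes, no, population) tuples, one per surveyed county.
    Returns (p_hat, se_hat).
    """
    k = sum(pop for _, _, pop in counties)  # total surveyed population
    p_hat = 0.0
    var_hat = 0.0
    for yes, no, pop in counties:
        n_i = yes + no
        p_i = yes / n_i
        w_i = pop / k  # population weight k_i / k
        p_hat += w_i * p_i
        var_hat += w_i ** 2 * p_i * (1 - p_i) / n_i
    return p_hat, math.sqrt(var_hat)
```

With two counties `(20, 80, 1000)` and `(50, 50, 3000)`, the weights are 0.25 and 0.75, so the estimate is `0.25 * 0.2 + 0.75 * 0.5 = 0.425`.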
+### State and Mega-County Estimates
+
+State estimates are somewhat complicated by the multi-resolution nature of
+sampling within a state: recall that we run surveys directly in each state, but
+also directly in all of its counties with more than 100,000 population. In
+order to combine state and county level surveys into a state-level community
+%CLI estimate, we use a Bayesian approach.
+
+For *every* county $$i$$ in the state, irrespective of whether the county was
+surveyed, let $$(Y_{c,i},N_{c,i})$$ represent the number of observed yes and no
+responses, and define $$n_{c,i} = Y_{c,i} + N_{c,i}$$. Let $$m_{c,i}$$ be the county
+population, and let $$p_{c,i}^*$$ represent the true fraction of individuals in
+county $$i$$ who would have responded yes (assuming all individuals would have
+responded yes or no and ignoring that "unsure" is a valid option). Note that for a
+county not surveyed, we have $$Y_{c,i} = N_{c,i} = n_{c,i} = 0$$.
+
+For the state survey, let $$(Y_s,N_s)$$ be the number of observed yes and no
+responses, and define $$n_{s} = Y_{s} + N_{s}$$. Let $$m_{s}$$ be the state
+population, and let
+
+$$
+p_{s}^* = \sum_{i} m_{c,i} p_{c,i}^*/m_s
+$$
+
+represent the true fraction of individuals in the state who would have responded
+yes (assuming all individuals would have responded yes or no and ignoring that
+"unsure" is a valid option).
+
+Suppose that we assume the county probabilities $$p_{c,i}^*$$ are drawn
+independently from a common $$\operatorname{Beta}(a,b)$$ prior.
+
+Maximum a posteriori (MAP) estimates $$\hat{p}_{c,i}$$ of $$p_{c,i}^*$$
+can be obtained for all counties $$i$$ by maximizing
+
+$$
+\begin{aligned}
+    &Y_s \log(p_s) + N_s \log(1-p_s) +
+    \sum_{i} \left[\tilde{Y}_{c,i} \log (p_{c,i}) + \tilde{N}_{c,i} \log(1 - p_{c,i})\right] \\
+    &=
+    Y_s \log\left(\sum_{i} m_{c,i} p_{c,i}/m_s\right) + N_s \log\left(1-\sum_{i}
+    m_{c,i} p_{c,i}/m_s\right) \\
+    &+
+    \sum_{i} \left[\tilde{Y}_{c,i} \log (p_{c,i}) + \tilde{N}_{c,i} \log(1 - p_{c,i})\right]
+\end{aligned}
+$$
+
+over $$p_{c,i}$$ subject to
+
+$$
+\begin{aligned}
+    0 &\leq p_{c,i} & \forall i &: \tilde{Y}_{c,i} = 0 \text{ and} \\
+    p_{c,i} &\leq 1 & \forall i &: \tilde{N}_{c,i} = 0,
+\end{aligned}
+$$
+
+where
+
+$$
+(\tilde{Y}_{c,i},\tilde{N}_{c,i},\tilde{n}_{c,i})=(Y_{c,i}+a-1, N_{c,i}+b-1, \tilde{Y}_{c,i}+\tilde{N}_{c,i})
+$$
+
+are pseudo-counts induced by the prior.
+Then the MAP estimate for the state probability is given by
+
+$$
+\hat{p}_s = \sum_{i} m_{c,i} \hat{p}_{c,i}/m_s.
+$$
+
+For the megacounty, we can lump all unsurveyed counties together into a single
+"other" county with associated population $$m_o = \sum_{\text{unsurveyed } i}
+m_{c,i}$$ and estimated proportion given by
+
+$$
+\hat{p}_o = \sum_{\text{unsurveyed } i} m_{c,i} \hat{p}_{c,i}/m_o.
+$$
+
+Notably, the maximization problem is concave and coincides with maximum
+likelihood estimation when $$a = b = 1$$.
+
+#### Empirical Bayes and Prior Choice
+
+Selecting $$a, b > 1$$ ensures that all pseudo-counts are non-zero and prevents
+degenerate estimates of the form $$p_{c,i} \in \{0,1\}$$ by shrinking each
+county estimate, even the unsurveyed ones, toward some relevant prior value.
+
+We currently set the prior hyperparameters so that the prior mode
+$$\frac{a-1}{(a-1)+(b-1)}$$ matches the pooled mean of surveyed county
+proportions and each county receives $$\tilde{n}$$ additional pseudocounts from
+the prior:
+
+$$
+\begin{aligned}
+a &= 1 + \tilde{n}\hat{\mu}\\
+b &= 1 + \tilde{n}(1-\hat{\mu}), \text{ for}\\
+\hat{\mu} &= \frac{\sum_{\text{surveyed } i} Y_{c,i}}{\sum_{\text{surveyed } i} n_{c,i}}.
+\end{aligned}
+$$
+
+The number of pseudocounts $$\tilde n$$ is currently set to 5, although it may
+be possible to choose a value that varies to minimize mean squared error.
+
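The hyperparameter choice above can be sketched in a few lines. The function name `beta_prior` and the `(yes, no)` input layout are hypothetical; the default `n_tilde=5` is the value stated in the text.

```python
def beta_prior(surveyed, n_tilde=5):
    """Empirical Bayes choice of Beta(a, b) from surveyed counties.

    surveyed: list of (yes, no) response counts, one per surveyed county.
    Returns (a, b, mu_hat), where mu_hat is the pooled mean.
    """
    total_yes = sum(yes for yes, _ in surveyed)
    total_n = sum(yes + no for yes, no in surveyed)
    mu_hat = total_yes / total_n
    a = 1 + n_tilde * mu_hat
    b = 1 + n_tilde * (1 - mu_hat)
    return a, b, mu_hat
```

By construction the prior mode `(a - 1) / ((a - 1) + (b - 1))` equals `mu_hat`, and `(a - 1) + (b - 1)` equals `n_tilde`.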
+#### Modification for when State Survey is Missing
+
+When state survey results are missing due to problems in the sampling process,
+the MAP estimate of the megacounties can be obtained by directly taking the
+prior mode:
+
+$$
+\hat p_o = \frac{a-1}{(b-1)+(a-1)} = \hat \mu = \sum_{\text{surveyed } i}
+Y_{c,i} / \sum_{\text{surveyed } i} n_{c,i}
+$$
+
+and the state MAP estimate is the weighted average of the individual
+county-level estimates, reproduced here:
+
+$$
+\hat{p}_s = \frac{m_o \hat p_o + \sum_{\text{surveyed } i} m_{c,i} \hat
+p_{c,i}}{m_s} = \frac{\sum_{\text{surveyed } i} m_{c,i} \hat p_{c,i}}{m_s-m_o}.
+$$
+
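A sketch of the missing-state-survey case. Without the state terms, the objective separates by county, so each county's maximizer has the closed form $$\tilde{Y}_{c,i} / (\tilde{Y}_{c,i} + \tilde{N}_{c,i})$$; this closed form is derived from the formulas above rather than stated in the text, and for an unsurveyed county it reduces to the prior mode. The function name, argument layout, and the use of the first expression for $$\hat{p}_s$$ are assumptions made for illustration.

```python
def map_without_state_survey(surveyed, unsurveyed_pop, n_tilde=5):
    """MAP estimates when the state survey is missing.

    surveyed: list of (yes, no, population) for surveyed counties.
    unsurveyed_pop: total population m_o of the unsurveyed counties.
    Returns (county_hats, p_other, p_state).
    """
    total_yes = sum(yes for yes, _, _ in surveyed)
    total_n = sum(yes + no for yes, no, _ in surveyed)
    mu_hat = total_yes / total_n

    # Pseudocounts a - 1 and b - 1 induced by the Beta(a, b) prior.
    a_minus_1 = n_tilde * mu_hat
    b_minus_1 = n_tilde * (1 - mu_hat)

    # Separable per-county maximizer (derived, see lead-in).
    county_hats = [
        (yes + a_minus_1) / (yes + no + a_minus_1 + b_minus_1)
        for yes, no, _ in surveyed
    ]

    # Megacounty estimate is the prior mode, which equals mu_hat.
    p_other = mu_hat

    # State estimate: population-weighted average over all counties.
    m_s = unsurveyed_pop + sum(pop for _, _, pop in surveyed)
    weighted = sum(pop * p for (_, _, pop), p in zip(surveyed, county_hats))
    p_state = (unsurveyed_pop * p_other + weighted) / m_s
    return county_hats, p_other, p_state
```

Each surveyed county estimate is pulled from its raw proportion toward `mu_hat`, with the pull weakening as the number of responses grows relative to `n_tilde`.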
+Since this estimator is clearly biased, the variance is not representative of
+the amount of uncertainty in the estimate. Our alternative to reporting variance
+is to report the MSE of the MAP estimate:
+
+$$
+\text{MSE}(\hat p_s) =
+\left(\sum_{\text{surveyed } i} \frac{\hat{p}_{c,i}
+\left(1-\hat{p}_{c,i}\right)}{n_i}\left(\frac{m_i}{m}\right)^2\right) +
+\left(\sum_{\text{unsurveyed } i} \frac{m_i}{m} \cdot (\hat{p}_{c,i} - p_{c,i})
+\right)^2,
+$$
+
+using the pseudocount $$n_i = Y_{c,i} + N_{c,i} + \tilde n$$. Writing the latter
+bias term using the megacounty, an upper bound for this term is (using $$m =
+\sum_i m_i$$):
+
+$$
+\left(\frac{m_o}{m}\right)^2(\hat{p}_o - p_o)^2 \le \left(\frac{m_o}{m}\right)^2
+\max\left((1-\hat{p}_o)^2, \hat{p}_o^2\right)
+$$
+
+The MSE assumes that the survey county data is random and that the prior
+parameters are fixed and not random, so the unsurveyed counties only contribute
+bias while the surveyed counties are unbiased for their respective county
+probabilities and contribute variance.
+
+## Smoothing
+
+Additionally, as with the Facebook surveys, we consider estimates formed by
+pooling data over time.  That is, daily, for each location, we first pool all
+data available in that location over the last 5 days, and compute the estimates
+given above using all five days of data.
+
+In contrast to the Facebook surveys, this pooling does not significantly change
+the availability of estimates, because our stratified sampling procedure
+(essentially always) delivers sufficient data at the county level---at least 100
+survey responses---for each county to warrant its own estimate. However, the
+pooling procedure still helps by serving as a smoother.
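
The pooling step in the Smoothing section can be sketched for a single county as below. The function name `pooled_estimate` and the input layout are hypothetical, and treating "the last 5 days" as the five most recent entries, including the current day, is an assumption.

```python
def pooled_estimate(daily_counts, window=5):
    """Pool responses over the most recent `window` days, then estimate p.

    daily_counts: list of (yes, no) per day, oldest first.
    Returns the pooled proportion, or None if there are no responses.
    """
    recent = daily_counts[-window:]
    yes = sum(y for y, _ in recent)
    no = sum(n for _, n in recent)
    n = yes + no
    return yes / n if n > 0 else None
```

Pooling counts before dividing means days with more responses carry more weight than they would in a plain average of daily proportions.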