
GHT has misleading declining trend --- in areas with unexpectedly low volume? #67


Closed
brookslogan opened this issue Jun 5, 2020 · 10 comments
Labels: data quality (Missing data, weird data, broken data); Engineering (Used to filter issues when synching with Asana); modeling (Must coordinate with Modeling team)

Comments

@brookslogan

In MSAs like:

  • Beaumont-Port Arthur, TX
  • College Station-Bryan, TX
  • Sioux City, IA-NE-SD
  • Cheyenne, WY

It appears that the GHT signal is encountering isolated days with query volume above the reporting threshold, and the smoothing/averaging then makes it look like a declining trend over the following 7 days. Especially with the graph defaulting to showing only the last two weeks, this often reads as a (misleading) rapidly decreasing trend, or as strange spikes upward.

(I guess this goes back to real 0s vs. under-threshold values; I'm not sure what the status of handling these is.) It seems we need to adjust the smoothing to either skip over missed reporting (or perhaps both missed reporting and real 0s), e.g. if there are only 3 nonmissing days in the window, average over only those 3 with a 3 in the denominator rather than 7, or else report NAs for the smoothed estimates instead.
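A minimal sketch of the "average over only the nonmissing days" option, assuming missing/under-threshold days are encoded as NaN; the helper name and its `min_days` argument are hypothetical, not something from the pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch (not the pipeline's actual smoother): a 7-day trailing
# mean that averages only over the days that actually reported, with NaN
# standing in for missing / under-threshold days.
def trailing_mean_ignore_missing(values, window=7, min_days=1):
    s = pd.Series(values, dtype=float)
    # min_periods controls how many non-missing days are required; with
    # fewer than `min_days` observed days the result is NaN, rather than a
    # value diluted by a fixed denominator of 7.
    return s.rolling(window, min_periods=min_days).mean()

# A single reported value surrounded by missing days:
raw = [np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, np.nan]
print(trailing_mean_ignore_missing(raw).tolist())
# -> [nan, nan, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
# The isolated report is not diluted to 3/7 over the following week, and
# windows with no reports at all stay NaN.
```

Raising `min_days` (e.g. to 3) switches to the "report NAs instead" behavior whenever too few days in the window actually reported.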

@brookslogan
Author

The second potential issue here is why College Station isn't meeting the reporting threshold most of the time; I would think it's pretty populous but I am not certain.

@krivard
Contributor

krivard commented Jun 5, 2020

Yes; this is related to #36 -- the current smoothing is designed to handle regions with occasional missingness; it does poorly in areas that are usually missing and only occasionally have data.

@capnrefsmmat
Contributor

College Station is populous during the school year; otherwise it's a pretty small college town.

@brookslogan
Author

brookslogan commented Jun 6, 2020

Since we are doing "filtering" --- best estimate for time t using data up to time t (or t + lag) --- I don't think we can avoid spikes upward without sacrificing prediction accuracy or availability. For viz purposes, I would think a spike upward, plateau, then spike downward might mislead viewers less (maybe the viz team's user studies will give actual data on this), but this might also come at the cost of prediction accuracy. I guess there is not a clear course of action here that meets all purposes simultaneously. Maybe expanding the default viz time scale to include the last four weeks could help, but the issue would still remain if a signal is missing for the last 3 weeks in a row.

@huisaddison
Contributor

I agree with @brookslogan's last comment. (I will refer to "filtering := left smoothing" and "smoothing := symmetric smoothing" to avoid overloading the term "smoothing".)

Basically, I devised the smoothing method for GHT, which was required to be a left smoother, to be "as smooth as possible without sacrificing the ability to tell that there was a jump today". This leads to a jump when a point mass appears, followed by a taper to zero. I thought this would be preferable to turning a point mass, say on Monday, into a symmetric mass centered on Wednesday; a left smoother is, of course, not capable of keeping it as a point mass on Monday.
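To illustrate the shape (this is not the production GHT smoother, just a stand-in left smoother with decaying weights, namely an exponentially weighted trailing mean):

```python
import pandas as pd

# Illustration only, not the actual GHT smoother: a left smoother with
# decaying weights applied to a single point mass.
x = pd.Series([0, 0, 0, 7, 0, 0, 0, 0, 0, 0], dtype=float)  # spike on "Monday"
left_smoothed = x.ewm(halflife=2.0).mean()
print(left_smoothed.round(2).tolist())
# The output jumps on the day the mass appears and then decays back toward
# zero, which reads as a "rapidly declining trend" on a two-week map view.
```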

My personal view is that the appropriate thing to do for the map is to perform left smoothing for the present and symmetric smoothing for the past (which is the same as recomputing a smoother over all available data every day), and then to maintain well-documented, separate data sources that are only left-smoothed for end users who are constructing their own models. (More generally, this fits into the discussion of backfill, etc.)
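A rough sketch of that proposal, assuming a plain 7-day window (the real pipeline's smoother and backfill handling would differ):

```python
import pandas as pd

# Hedged sketch of the proposal above, not an implementation from the repo:
# each day, re-smooth the entire history with a symmetric window. Past days
# are revised toward centered smoothing; the most recent days, which lack
# future data, effectively fall back to a one-sided (left) average.
def resmooth_all(history, window=7):
    s = pd.Series(history, dtype=float)
    # center=True gives symmetric smoothing wherever both sides exist;
    # min_periods=1 lets the right edge use whatever data is available.
    return s.rolling(window, center=True, min_periods=1).mean()

history = [0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0]  # point mass on day 6
print(resmooth_all(history).round(2).tolist())
# -> [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
# The point mass becomes a symmetric plateau centered on the day it occurred
# instead of a jump followed by a week-long apparent decline.
```

The left-smoothed-only variants would then live alongside this as separately documented signals for modelers.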

@krivard krivard added the Triage Nominate for inclusion in the next release label Jun 11, 2020
@krivard krivard added this to the Ongoing milestone Jun 12, 2020
@krivard krivard removed the Triage Nominate for inclusion in the next release label Jun 12, 2020
@brookslogan
Author

I agree with @huisaddison that symmetric smoothers are more natural to plot. But this may be blocked on the addition of the issue and lag columns; I don't know how far along that work is.

Looking again at the map, though, I am not sure if I have correctly described the nature of the GHT patterns we are seeing. It looks like there can be spikes that are not left-smoothed. (But maybe this is from GHT returning different results for the same day?)

@krivard
Contributor

krivard commented Jul 6, 2020

Next:

  • Put a prototype centered smoother signal in as a wip signal

@jingjtang
Contributor

The new signal has been added here. With it, the declining trend becomes a "spike upward, plateau, then spike downward" pattern.

@krivard krivard added the modeling Must coordinate with Modeling team label Jul 8, 2020
@dshemetov
Contributor

Hi everyone! For the past couple of weeks I've been working on a smoothing-utility refactor along with implementing a new smoother (see #176). I am still getting up to speed on exactly what the challenges are, but I have applied some new methods to tackling the spiking behavior in this notebook. I would love to get your feedback!

The first two sections ("GHT MSA" and "Imputing") contain some plots directly relevant to this discussion. The rest of the notebook contains applications of a variety of other methods on other datasets.

@krivard
Contributor

krivard commented Aug 12, 2020

Just to clarify -- doctor-visits and hospital-admissions are smoothed by us, not by the provider, but they're in the same boat as fb-survey in that the DUA prohibits access to the source data outside CMU Delphi.

You can, however, get a conservative approximation of what smoothing the fb-survey signal would do by grabbing the raw signal variants that come out of the API. Once Alex and I finish resolving discrepancies between the old and new fb-survey codebases, the survey signals should be less sensitive to this problem: the new codebase is able to compute the smoothed signals before applying the minimum-sample-size thresholds, so the underlying signal is less choppy. The raw signals in the API have already had the minimum-sample-size thresholds applied, so they represent a worst-case scenario for choppiness.
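For anyone who wants to try that, here is a hedged sketch using the covidcast Python client; the particular signal (raw_cli), geography, and dates are just examples, and a self-computed 7-day average will not match Delphi's own smoothing since the thresholds have already been applied to the published raw values:

```python
from datetime import date

import covidcast  # Delphi COVIDcast API client

# Example only: pull a raw fb-survey variant and apply a simple 7-day
# trailing mean per location. Signal/geo/date choices are illustrative.
raw = covidcast.signal("fb-survey", "raw_cli",
                       date(2020, 7, 1), date(2020, 7, 31),
                       geo_type="msa")
raw = raw.sort_values(["geo_value", "time_value"])
raw["smoothed_diy"] = (
    raw.groupby("geo_value")["value"]
       .transform(lambda v: v.rolling(7, min_periods=1).mean())
)
# Because the minimum-sample-size thresholds were applied before these raw
# values were published, this is the conservative / worst-case approximation
# described above.
```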

@nmdefries nmdefries added the data quality Missing data, weird data, broken data label Nov 10, 2020
@SumitDELPHI SumitDELPHI added the Engineering Used to filter issues when synching with Asana label Dec 6, 2020
@krivard krivard closed this as completed Aug 24, 2021