
Commit 7270e9d

update plans

1 parent f09e3f5 commit 7270e9d

1 file changed: +9 −3 lines changed


validator/PLANS.md

Lines changed: 9 additions & 3 deletions
@@ -6,7 +6,7 @@
 * Recognized file name format
 * Recognized geographical type (county, state, etc)
 * Recognized geo id format (e.g. state is two lowercase letters)
-* Geo id has been seen before, in historical data
+* Geo id has been seen before in historical data
 * Missing geo type + signal + date combos based on the geo type + signal combos Covidcast metadata says should be available
 * Missing ‘val’ values
 * Negative ‘val’ values
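The format checks listed above could be sketched roughly as follows. This is an illustrative assumption, not the validator's actual implementation: only the state rule (two lowercase letters) comes from the plans; the county pattern (5-digit FIPS) and all function names are hypothetical.

```python
import re

# Hypothetical per-geo-type id patterns. "state" follows the rule stated
# in the plans; "county" (5-digit FIPS) is an assumed example.
GEO_ID_PATTERNS = {
    "state": re.compile(r"^[a-z]{2}$"),
    "county": re.compile(r"^\d{5}$"),
}

def check_geo_ids(geo_type, geo_ids):
    """Return the ids that do not match the recognized format."""
    pattern = GEO_ID_PATTERNS.get(geo_type)
    if pattern is None:
        raise ValueError(f"unrecognized geo type: {geo_type}")
    return [g for g in geo_ids if not pattern.match(g)]

def check_vals(vals):
    """Flag missing (None) and negative 'val' values by index."""
    missing = [i for i, v in enumerate(vals) if v is None]
    negative = [i for i, v in enumerate(vals) if v is not None and v < 0]
    return missing, negative
```

For example, `check_geo_ids("state", ["pa", "XX", "ca"])` flags `"XX"`, and `check_vals([1.0, None, -2.0])` reports index 1 as missing and index 2 as negative.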
@@ -22,6 +22,7 @@
 * Most recent date seen in source data is not older than most recent date seen in reference data
 * Similar number of obs per day as recent API data (static threshold)
 * Similar average value as API data (static threshold)
+* Outliers in cases and deaths signals using [this method](https://github.com/cmu-delphi/covidcast-forecast/tree/dev/corrections/data_corrections)
 * Source data for specified date range is empty
 * API data for specified date range is empty
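A minimal sketch of the static-threshold average comparison mentioned above, assuming (hypothetically) that the threshold is expressed as a maximum relative difference between the source and API means:

```python
def similar_average(source_vals, api_vals, threshold=0.5):
    """Static-threshold check: pass if the relative difference between
    the source-data mean and the reference (API) mean is within
    `threshold`. The 0.5 default is an illustrative assumption."""
    src_mean = sum(source_vals) / len(source_vals)
    api_mean = sum(api_vals) / len(api_vals)
    denom = max(abs(api_mean), 1e-9)  # guard against a zero reference mean
    return abs(src_mean - api_mean) / denom <= threshold
```

A static threshold like this is easy to implement but, as the "Larger issues" section below notes, tuning it against known anomalous days is what makes it useful.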

@@ -44,22 +45,26 @@

 ### Larger issues

+* Set up validator to use Sir-complains-a-lot alerting functionality on a signal-by-signal basis (should send alert output as a Slack message and "@" a set person), as a stop-gap before the logging server is ready
+  * This is [how Sir-CAL works](https://github.com/benjaminysmith/covidcast-indicators/blob/main/sir_complainsalot/delphi_sir_complainsalot/run.py)
+  * [Example output](https://delphi-org.slack.com/archives/C01E81A3YKF/p1605793508000100)
 * Improve errors and error report
   * Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
   * Easier suppression of many errors at once
     * Maybe store errors as a dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then next layer geo type, etc.
   * Nicer formatting for error “report”
+    * Potentially define a `__str__()` method in the ValidationError class
     * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
 * Check for erratic data sources that wrongly report all zeroes
   * E.g. the error with the Wisconsin data for the 10/26 forecasts
   * Wary of a purely static check for this
   * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
   * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
-  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
+  * Also partially captured by outlier checking, depending on `size_cut` setting. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
 * Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
 * If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
-  * E.g. most recent successful API pull -- might end up being a couple weeks older
   * Currently, any API fetch problem means comparative checks are skipped entirely.
+  * E.g. most recent successful API pull -- might end up being a couple weeks older
 * Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
   * Profiling (iterate)
   * Check if saving intermediate files will improve efficiency (currently a bottleneck at "individual file checks" section. Parallelize?)
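The dict-of-dicts error store and `ValidationError` formatting ideas above might look something like this. All names and the suppression scheme are illustrative assumptions, not the validator's actual classes:

```python
class ValidationError(Exception):
    """One validation failure, keyed by check string and geo type."""

    def __init__(self, check_name, geo_type, message):
        self.check_name = check_name
        self.geo_type = geo_type
        self.message = message

    def __str__(self):
        # Nicer formatting for the error "report".
        return f"{self.check_name} [{self.geo_type}]: {self.message}"

class ErrorStore:
    """Dict of dicts: check string -> geo type -> list of errors.
    Suppression is a set of (check_name, geo_type) pairs, so many
    errors of one kind can be silenced with a single entry."""

    def __init__(self, suppressed=()):
        self.errors = {}
        self.suppressed = set(suppressed)

    def add(self, err):
        if (err.check_name, err.geo_type) in self.suppressed:
            return
        self.errors.setdefault(err.check_name, {}) \
                   .setdefault(err.geo_type, []).append(err)

    def report(self):
        # Summarize repeated errors of one type into a single line,
        # while the keys still show how to suppress each group.
        lines = []
        for check, by_geo in self.errors.items():
            for geo, errs in by_geo.items():
                lines.append(f"{check} / {geo}: {len(errs)} error(s)")
        return "\n".join(lines)
```

Keying suppression on `(check_name, geo_type)` keeps the report summarizable while still making it clear how to silence each group individually.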
@@ -80,3 +85,4 @@
 * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
 * Correct p-values for multiple testing
   * Bonferroni would be easy but is sensitive to choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
+* Use prophet package? Would require 2-3 months of API data.
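The Benjamini-Hochberg correction mentioned above is the standard step-up procedure; a self-contained sketch (not code from the validator):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR control: sort p-values, find the largest
    rank k with p_(k) <= (k/m) * alpha, and declare the k smallest
    p-values significant. Returns a boolean list in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            significant[i] = True
    return significant
```

For example, with `p_values=[0.01, 0.02, 0.03, 0.5]` and `alpha=0.05`, the rank thresholds are 0.0125, 0.025, 0.0375, 0.05, so the first three pass and the last does not. Unlike Bonferroni, the cutoff adapts to how many small p-values there are, which is why it is less sensitive to the choice of "family".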
