
Commit 373af74

update plans. Create starter-issue section
1 parent e29e8a7 commit 373af74

File tree

1 file changed (+28, -22 lines)


validator/PLANS.md

Lines changed: 28 additions & 22 deletions
@@ -34,17 +34,38 @@
 * User can manually disable specific checks for specific datasets using a field in the params.json file
 * User can enable test mode (checks only a small number of data files) using a field in the params.json file
 
-## Checks + features wishlist, and problems to think about:
+## Checks + features wishlist, and problems to think about
+
+### Starter/small issues
 
-* Improve performance and reduce runtime (what's the target time? Just want to not be painfully slow...)
-    * Profiling (iterate)
-    * Check if saving intermediate files will improve efficiency (currently a bottleneck at "individual file checks" section. Parallelize?)
-    * Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
 * Check for duplicate rows
-* Check explicitly for large spikes (avg_val check can detect jumps in average value)
 * Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, have any values changed significantly within the "backfill" window (use the span_length setting)? See [this](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
 * Run check_missing_date_files (or similar) on every geo type-signal type pair separately in the comparative checks loop.
-* Different test thresholds for different files? Currently some control based on smoothed vs raw signals
+
+### Larger issues
+
+* Check whether [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct: not false positives, not overly verbose or repetitive
+* Check for erratic data sources that wrongly report all zeroes
+    * E.g. the error with the Wisconsin data for the 10/26 forecasts
+    * Wary of a purely static check for this
+    * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
+    * This test is partially captured by checking avgs in source vs. reference data, unless erroneous zeroes continue for more than a week
+    * Also partially captured by outlier checking. If zeroes aren't outliers, it's hard to say they're erroneous at all.
+* Easier suppression of many errors at once
+* Nicer formatting for the error “report”
+    * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
+* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
+* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
+    * E.g. the most recent successful API pull -- might end up being a couple weeks older
+    * Currently, any API fetch problem means the comparative checks are skipped entirely.
+* Improve performance and reduce runtime (no particular target, just avoid being painfully slow!)
+    * Profiling (iterate)
+    * Check if saving intermediate files will improve efficiency (currently a bottleneck at the "individual file checks" section. Parallelize?)
+    * Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
+* Ensure the validator runs on signals that require AWS credentials (iterate)
+
+### Longer-term issues
 
 * Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
 * Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
     * Any relevant anomaly detection packages already exist?
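
A few of the wishlist items above are concrete enough to sketch in code. None of the sketches below are part of this commit; function names, thresholds, and column layouts are assumptions. First, the "check for duplicate rows" starter item, assuming each data CSV is loaded into a pandas DataFrame:

```python
import pandas as pd

def check_duplicate_rows(df: pd.DataFrame, filename: str) -> list:
    """Flag exact duplicate rows in one loaded CSV; returns error messages."""
    dup_mask = df.duplicated(keep="first")  # True for every repeat of an earlier row
    if dup_mask.any():
        first = df.index[dup_mask][0]
        return [f"{filename}: {dup_mask.sum()} duplicate row(s), first at index {first}"]
    return []
```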
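
The backfill item could take roughly this shape: merge a new issue against what the API already holds for the same (geo, date) pairs inside the span_length window, then flag large relative changes. Column names follow the covidcast CSV conventions; the 50% threshold is a placeholder, not a tuned value:

```python
import pandas as pd

def check_backfill_changes(new_df: pd.DataFrame, api_df: pd.DataFrame,
                           rel_tol: float = 0.5) -> list:
    """Flag (geo_id, time_value) pairs whose value changed sharply vs. the API."""
    merged = new_df.merge(api_df, on=["geo_id", "time_value"],
                          suffixes=("_new", "_api"))
    denom = merged["val_api"].abs().clip(lower=1e-9)  # avoid divide-by-zero
    rel_change = (merged["val_new"] - merged["val_api"]).abs() / denom
    flagged = merged[rel_change > rel_tol]
    return [f"{row.geo_id} {row.time_value}: {row.val_api} -> {row.val_new}"
            for row in flagged.itertuples()]
```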
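
The all-zeroes item resists a purely static rule, as the bullets note, but a reference-relative version is possible: flag a geo series whose new data is entirely zero while the same series was nonzero in the API data. Again, column and function names are assumptions:

```python
import pandas as pd

def check_suspicious_zeroes(recent_df: pd.DataFrame,
                            reference_df: pd.DataFrame) -> list:
    """Flag geo series that are all-zero now but were nonzero in reference data."""
    errors = []
    for geo, group in recent_df.groupby("geo_id"):
        if (group["val"] == 0).all():
            ref = reference_df.loc[reference_df["geo_id"] == geo, "val"]
            if not ref.empty and (ref != 0).any():
                errors.append(f"{geo}: all-zero recent values, nonzero in reference")
    return errors
```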
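
For the report-formatting and bulk-suppression items, one direction (the error tuple layout here is invented for illustration) is to group raised errors by check name, so one check failing across many datasets collapses to a single line while the per-dataset keys needed for individual suppression stay visible:

```python
from collections import defaultdict

def summarize_errors(errors: list) -> list:
    """errors: (check_name, dataset_id, message) tuples -> one summary line per check."""
    by_check = defaultdict(list)
    for check_name, dataset_id, _message in errors:
        by_check[check_name].append(dataset_id)
    return [f"{check}: {len(ids)} dataset(s) affected: {', '.join(sorted(ids))}"
            for check, ids in by_check.items()]
```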
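
The substitute-data item might look like the following: cache each successful API pull and fall back to the cache when the live fetch fails, rather than skipping the comparative checks entirely. The cache location and the returned freshness tag are assumptions:

```python
import os
import pandas as pd

CACHE_DIR = "api_cache"  # hypothetical location for the last good pull

def get_reference_data(fetch_fn, source: str, signal: str):
    """fetch_fn() returns a DataFrame or raises on API failure."""
    path = os.path.join(CACHE_DIR, f"{source}_{signal}.parquet")
    try:
        df = fetch_fn()
        os.makedirs(CACHE_DIR, exist_ok=True)
        df.to_parquet(path)  # refresh the fallback copy
        return df, "live"
    except Exception:
        if os.path.exists(path):
            return pd.read_parquet(path), "cached"  # may be a couple weeks older
        return None, "unavailable"  # caller skips comparative checks
```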
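
And both halves of the `all_frames` question can be tried in a few lines; the toy frames below are illustrative only:

```python
import pandas as pd

# Toy per-file frames keyed by (geo_type, signal):
frames = {
    ("county", "raw_cli"): pd.DataFrame({"geo_id": ["42003"], "val": [1.2]}),
    ("state", "raw_cli"): pd.DataFrame({"geo_id": ["pa"], "val": [0.9]}),
}

# Option 1: plain dict access -- O(1) and self-documenting.
county_cli = frames[("county", "raw_cli")]

# Option 2: one MultiIndexed frame, which also supports cross-sections.
all_frames = pd.concat(frames, names=["geo_type", "signal"])
county_rows = all_frames.xs("county", level="geo_type")
```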
@@ -57,18 +78,3 @@
 * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
 * Correct p-values for multiple testing
     * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
-* Nicer formatting for error “report”.
-    * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
-* Easier suppression of many errors at once
-* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
-* Ensure validator runs on signals that require AWS credentials (iterate)
-* Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
-* If can't get data from API, do we want to use substitute data for the comparative checks instead?
-    * E.g. most recent successful API pull -- might end up being a couple weeks older
-    * Currently, any API fetch problems just doesn't do comparative checks at all.
-* Check for erratic data sources that wrongly report all zeroes
-    * E.g. the error with the Wisconsin data for the 10/26 forecasts
-    * Wary of a purely static check for this
-    * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
-    * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
-    * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
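
Since the Benjamini-Hochberg bullet above is easier to see than to describe, here is a minimal step-up implementation (a sketch; statsmodels' `multipletests(pvals, method="fdr_bh")` is the off-the-shelf equivalent):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha: float = 0.05) -> np.ndarray:
    """Return a boolean mask: True where the null is rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                         # sort p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m  # the BH step-up line
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest i with p_(i) <= alpha * i / m
        reject[order[:k + 1]] = True              # reject that p-value and all smaller ones
    return reject
```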
