* User can manually disable specific checks for specific datasets using a field in the params.json file
* User can enable test mode (checks only a small number of data files) using a field in the params.json file (see the sketch below)
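A minimal sketch of how the validator might read those two settings. The `validation` block and the field names (`test_mode`, `suppressed_checks`) are illustrative assumptions, not necessarily the actual params.json schema:

```python
import json

def load_validation_settings(path="params.json"):
    """Read validator settings, with permissive defaults when fields are absent.

    The field names below are hypothetical; the real params.json schema
    may use different keys.
    """
    with open(path) as f:
        params = json.load(f).get("validation", {})
    return {
        # Run against only a handful of data files when True.
        "test_mode": params.get("test_mode", False),
        # e.g. {"covid_ag_raw_pct_positive": ["check_missing_date_files"]}
        "suppressed_checks": params.get("suppressed_checks", {}),
    }

def is_check_suppressed(settings, dataset, check_name):
    """True if the user disabled this check for this dataset in params.json."""
    return check_name in settings["suppressed_checks"].get(dataset, [])
```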
## Checks + features wishlist, and problems to think about

### Starter/small issues

* Check for duplicate rows (see the first sketch after this list)
* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, check whether any values have changed significantly within the "backfill" window (use the span_length setting); see the second sketch after this list and [this discussion](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
* Run check_missing_date_files (or similar) on every geo type-signal type combination separately in the comparative checks loop.
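A minimal pandas sketch for the duplicate-row item above. It assumes each exported CSV has already been read into a dataframe with the usual `geo_id`, `val`, `se`, `sample_size` columns, and it separates exact duplicates from rows that repeat a `geo_id` with conflicting values:

```python
import pandas as pd

def check_duplicate_rows(df: pd.DataFrame, filename: str) -> list:
    """Return one warning string per kind of duplication found in a CSV."""
    warnings = []

    # Rows that are identical in every column are merely redundant.
    exact_dupes = df[df.duplicated(keep="first")]
    if not exact_dupes.empty:
        warnings.append(f"{filename}: {len(exact_dupes)} exactly duplicated rows")

    # Rows that repeat a geo_id but differ elsewhere disagree on the value.
    conflicting = df[df.duplicated(subset=["geo_id"], keep=False)
                     & ~df.duplicated(keep=False)]
    if not conflicting.empty:
        warnings.append(f"{filename}: repeated geo_id with conflicting values, "
                        f"e.g. {sorted(conflicting['geo_id'].unique())[:5]}")
    return warnings
```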
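For the backfill item, one possible starting point is below: compare the data being exported against what the API currently serves for the same signal, restricted to the backfill window. The column names, treating `span_length` as a `pd.Timedelta`, and the 50% relative-change cutoff are all assumptions to be tuned:

```python
import pandas as pd

def check_backfill_changes(new_df, api_df, span_length, rel_change_threshold=0.5):
    """Flag dates already in the API whose values changed sharply on re-issue.

    new_df, api_df: dataframes with geo_id, time_value (datetime), and val.
    span_length: pd.Timedelta covering the expected backfill window.
    """
    window_start = new_df["time_value"].max() - span_length
    merged = new_df.merge(api_df, on=["geo_id", "time_value"],
                          suffixes=("_new", "_api"))
    merged = merged[merged["time_value"] >= window_start]

    denom = merged["val_api"].abs().clip(lower=1e-9)  # avoid division by zero
    rel_change = (merged["val_new"] - merged["val_api"]).abs() / denom
    return merged.loc[rel_change > rel_change_threshold,
                      ["geo_id", "time_value", "val_api", "val_new"]]
```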
### Larger issues

* Check whether [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct: not false positives, and not overly verbose or repetitive
* Check for erratic data sources that wrongly report all zeroes (see the sketch after this list)
    * E.g. the error with the Wisconsin data for the 10/26 forecasts
    * Wary of a purely static check for this
    * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, or certain signals (deaths, since it's << cases)
    * This test is partially captured by checking averages in source vs. reference data, unless the erroneous zeroes continue for more than a week
    * Also partially captured by outlier checking. If the zeroes aren't outliers, then it's hard to say they're erroneous at all.
* Easier suppression of many errors at once
* Nicer formatting for the error “report”
    * E.g. if a single type of error is raised for many different datasets, summarize all the error messages into a single message? But it still has to be clear how to suppress each one individually
* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
    * E.g. the most recent successful API pull -- it might end up being a couple of weeks old
    * Currently, any API fetch problem means the comparative checks are skipped entirely.
* Improve performance and reduce runtime (no particular target; just avoid being painfully slow!)
    * Profiling (iterate)
    * Check whether saving intermediate files improves efficiency (currently a bottleneck in the "individual file checks" section. Parallelize?)
    * Make `all_frames` MultiIndex-ed by geo type and signal name? Or make a dict of dataframes keyed by geo type and signal name? Either may improve performance, or may just make access more readable. (A sketch of the dict variant follows this list.)
* Ensure the validator runs on signals that require AWS credentials (iterate)
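One possible shape for the all-zeroes check above, not a settled design: compare the recent data being validated against reference data for the same signal, and flag geo regions that suddenly report nothing but zeroes. The column names and the `min_run_days` cutoff are placeholders:

```python
import pandas as pd

def find_suspicious_zero_runs(source_df, reference_df, min_run_days=3):
    """Return geo_ids whose recent values are all zero despite nonzero history.

    source_df: recent data being validated (geo_id, time_value, val).
    reference_df: older data for the same signal, e.g. pulled from the API.
    """
    recent = source_df.groupby("geo_id")["val"]
    history_mean = reference_df.groupby("geo_id")["val"].mean()

    all_zero = recent.apply(lambda v: len(v) >= min_run_days and (v == 0).all())
    return [geo for geo, flagged in all_zero.items()
            if flagged and history_mean.get(geo, 0) > 0]
```

As the sub-items note, this overlaps with the source-vs-reference average check and with outlier detection, so it may only earn its keep for zero runs longer than the comparison window.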
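And for the `all_frames` idea in the performance item, a sketch of the dict variant, assuming the combined frame carries `geo_type` and `signal` columns (the MultiIndex variant would be `all_frames.set_index(["geo_type", "signal"])` instead):

```python
from typing import Dict, Tuple
import pandas as pd

def frames_by_key(all_frames: pd.DataFrame) -> Dict[Tuple[str, str], pd.DataFrame]:
    """Split the combined frame into a dict keyed by (geo_type, signal)."""
    return {key: group.drop(columns=["geo_type", "signal"])
            for key, group in all_frames.groupby(["geo_type", "signal"])}

# Lookups become a plain dict access instead of repeated boolean filtering:
#     frames = frames_by_key(all_frames)
#     county_frame = frames[("county", "some_signal_name")]
```

Whether this actually helps runtime, rather than just readability, is exactly what the profiling sub-item should answer.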
### Longer-term issues

* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
* Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
* Do any relevant anomaly detection packages already exist?
* Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for the same type of test (across different geo regions, e.g.) are "close" to significant
* Correct p-values for multiple testing (see the sketch below)
    * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
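A sketch of the Benjamini-Hochberg step for a single family of tests (e.g. the same check applied across all geo regions); statsmodels' `multipletests` with `method="fdr_bh"` provides this off the shelf, but the procedure is short enough to show directly:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)

    # Find the largest k with p_(k) <= (k / m) * alpha, then reject
    # the k smallest p-values (BH step-up procedure).
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject
```

This also gives a home to the "several p-values close to significant" idea above: BH can reject a block of moderately small p-values together even when none of them would clear a Bonferroni cutoff of alpha / m.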
|
60 |
| -* Nicer formatting for error “report”. |
61 |
| - * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually |
62 |
| -* Easier suppression of many errors at once |
63 |
| -* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior |
64 |
| -* Ensure validator runs on signals that require AWS credentials (iterate) |
65 |
| -* Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive |
66 |
| -* If can't get data from API, do we want to use substitute data for the comparative checks instead? |
67 |
| - * E.g. most recent successful API pull -- might end up being a couple weeks older |
68 |
| - * Currently, any API fetch problems just doesn't do comparative checks at all. |
69 |
| -* Check for erratic data sources that wrongly report all zeroes |
70 |
| - * E.g. the error with the Wisconsin data for the 10/26 forecasts |
71 |
| - * Wary of a purely static check for this |
72 |
| - * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases) |
73 |
| - * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week |
74 |
| - * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all. |
|
0 commit comments