Commit 4bbe787 ("update plans")
1 parent 9ac11d8

1 file changed: validator/PLANS.md (17 additions, 9 deletions)

@@ -6,6 +6,7 @@
 * Recognized file name format
 * Recognized geographical type (county, state, etc)
 * Recognized geo id format (e.g. state is two lowercase letters)
+* Geo id has been seen before, in historical data
 * Missing geo type + signal + date combos based on the geo type + signal combos Covidcast metadata says should be available
 * Missing ‘val’ values
 * Negative ‘val’ values
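
For illustration, a minimal pandas sketch of how the geo-id-format and ‘val’ checks in the hunk above might look. The `geo_id`/`val` column names follow the covidcast CSV convention, but the patterns and helper names are assumptions, not the validator's actual code:

```python
import pandas as pd

# Assumed per-geo-type formats, e.g. state = two lowercase letters,
# county = five-digit FIPS code. Hypothetical, not the validator's table.
GEO_ID_PATTERNS = {
    "state": r"^[a-z]{2}$",
    "county": r"^[0-9]{5}$",
}

def check_geo_id_format(df: pd.DataFrame, geo_type: str) -> pd.Series:
    """Return the geo_ids that do not match the recognized format for geo_type."""
    bad = ~df["geo_id"].astype(str).str.match(GEO_ID_PATTERNS[geo_type])
    return df.loc[bad, "geo_id"]

def check_val_column(df: pd.DataFrame) -> dict:
    """Count missing and negative 'val' values."""
    return {
        "missing_val": int(df["val"].isna().sum()),
        "negative_val": int((df["val"] < 0).sum()),
    }
```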
@@ -21,6 +22,8 @@
 * Most recent date seen in source data is not older than most recent date seen in reference data
 * Similar number of obs per day as recent API data (static threshold)
 * Similar average value as API data (static threshold)
+* Source data for specified date range is empty
+* API data for specified date range is empty


 ## Current features
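
A sketch of the comparative checks above, including the two new empty-range checks. `source_df`/`api_df`, the `time_value` column, and the thresholds are all placeholders, not the validator's real interface:

```python
import pandas as pd

def comparative_checks(source_df: pd.DataFrame, api_df: pd.DataFrame,
                       obs_tol: float = 0.5, avg_tol: float = 0.5) -> list:
    """Compare source data against recent API (reference) data with static thresholds."""
    errors = []

    # New checks: empty source or reference data for the specified date range.
    if source_df.empty:
        errors.append("source data for specified date range is empty")
    if api_df.empty:
        errors.append("API data for specified date range is empty")
    if errors:
        return errors

    # Most recent date in source data should not be older than in reference data.
    if source_df["time_value"].max() < api_df["time_value"].max():
        errors.append("most recent source date is older than reference data")

    # Similar number of observations per day (static threshold).
    src_per_day = source_df.groupby("time_value").size().mean()
    ref_per_day = api_df.groupby("time_value").size().mean()
    if abs(src_per_day - ref_per_day) > obs_tol * ref_per_day:
        errors.append("observations per day differ too much from API data")

    # Similar average value (static threshold).
    src_avg, ref_avg = source_df["val"].mean(), api_df["val"].mean()
    if abs(src_avg - ref_avg) > avg_tol * abs(ref_avg):
        errors.append("average value differs too much from API data")

    return errors
```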
@@ -33,15 +36,14 @@

 ## Checks + features wishlist, and problems to think about:

-* Improve performance and reduce runtime (what's the goal?)
+* Improve performance and reduce runtime (what's the target time?)
 * Profiling (iterate)
-* Check if saving intermediate files will improve efficiency (currently a bottleneck at "individual file checks" section)
-* Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance.
-* Which, if any, *specific* geo_ids are missing (get unique geo ids from historical data or delphi_utils)
+* Check if saving intermediate files will improve efficiency (currently a bottleneck at "individual file checks" section. Parallelize?)
+* Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
 * Check for duplicate rows
 * Check explicitly for large spikes (avg_val check can detect jumps in average value)
 * Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If date is already in the API, have any values changed significantly within the "backfill" window (use span_length setting). See [this](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
-* Run check_missing_dates on every geo type-signal type separately. Probably move check to geo_sig loop.
+* Run check_missing_date_files (or similar) on every geo type-signal type separately in comparative checks loop.
 * Different test thresholds for different files? Currently some control based on smoothed vs raw signals
 * Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
 * Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
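
On the `all_frames` bullet in the hunk above, both proposed shapes are cheap to build from a flat frame; the toy data, column names, and signal names below are illustrative only:

```python
import pandas as pd

# Toy stand-in for `all_frames`, assumed to carry geo_type and signal columns.
all_frames = pd.DataFrame({
    "geo_type": ["county", "county", "state"],
    "signal": ["sig_smoothed", "sig_smoothed", "sig_raw"],
    "geo_id": ["01001", "01003", "ak"],
    "val": [1.0, 2.0, 3.0],
})

# Option A: a dict keyed by (geo type, signal name), one frame per combo.
frames_by_key = {
    key: grp.reset_index(drop=True)
    for key, grp in all_frames.groupby(["geo_type", "signal"])
}

# Option B: one frame MultiIndex-ed by geo type and signal name.
all_frames_mi = all_frames.set_index(["geo_type", "signal"]).sort_index()

# Either option turns repeated boolean filtering in the geo_sig loop into a
# direct lookup, which may help performance and does read more clearly:
county_smoothed = frames_by_key[("county", "sig_smoothed")]
county_smoothed_mi = all_frames_mi.loc[("county", "sig_smoothed")]
```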
@@ -56,11 +58,17 @@
 * Correct p-values for multiple testing
   * Bonferroni would be easy but is sensitive to choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
 * Nicer formatting for error “report”.
-  * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each
+  * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
   * Easier suppression of many errors at once
-* Ensure validator runs on signals that require AWS credentials (iterate)
 * Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
+* Ensure validator runs on signals that require AWS credentials (iterate)
 * Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
-* If can't get data from API, do we want to use substitute data for the comparative checks instead? E.g. most recent successful API pull -- might end up being a couple weeks older
+* If can't get data from API, do we want to use substitute data for the comparative checks instead?
+  * E.g. most recent successful API pull -- might end up being a couple weeks older
   * Currently, any API fetch problem just skips the comparative checks entirely.
-* Potentially implement a check for erratic data sources that wrongly report all 0's (like the error with the Wisconsin data for the 10/26 forecasts)
+* Check for erratic data sources that wrongly report all zeroes
+  * E.g. the error with the Wisconsin data for the 10/26 forecasts
+  * Wary of a purely static check for this
+  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
+  * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
+  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
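
On the multiple-testing bullet above, both corrections are a single call in statsmodels; the p-values here are made up for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Illustrative p-values, one per test in the chosen "family".
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]

# Bonferroni: simple, but conservative and sensitive to family size.
reject_bonf, pvals_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead, and is
# less sensitive to how the family of tests is chosen.
reject_bh, pvals_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni rejects:", list(reject_bonf))
print("BH rejects:        ", list(reject_bh))
```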

0 commit comments
