* Recognized file name format
* Recognized geographical type (county, state, etc.)
* Recognized geo id format (e.g. state is two lowercase letters)
* Geo id has been seen before, in historical data
* Missing geo type + signal + date combos, based on the geo type + signal combos that Covidcast metadata says should be available
* Missing `val` values
* Negative `val` values
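
As a rough illustration, the static format checks above might start out like the sketch below. The regex patterns, row shape, and error strings are illustrative assumptions, not the validator's actual rules:

```python
import math
import re

# Hypothetical per-geo-type id patterns (assumptions for illustration);
# e.g. state ids are two lowercase letters, county ids are 5-digit FIPS codes.
GEO_ID_PATTERNS = {
    "state": re.compile(r"^[a-z]{2}$"),
    "county": re.compile(r"^[0-9]{5}$"),
}

def check_format(rows, geo_type):
    """Run basic static checks on rows of {'geo_id': ..., 'val': ...} dicts,
    returning a list of human-readable error strings."""
    errors = []
    pattern = GEO_ID_PATTERNS.get(geo_type)
    if pattern is None:
        # Unrecognized geo type: nothing else can be validated.
        errors.append(f"unrecognized geo type: {geo_type!r}")
        return errors
    for row in rows:
        if not pattern.match(str(row["geo_id"])):
            errors.append(f"malformed geo id: {row['geo_id']!r}")
        val = row["val"]
        if val is None or (isinstance(val, float) and math.isnan(val)):
            errors.append(f"missing 'val' for geo id {row['geo_id']!r}")
        elif val < 0:
            errors.append(f"negative 'val' for geo id {row['geo_id']!r}")
    return errors
```

A real implementation would also check the file name format and whether each geo id appears in historical data, which need the filename conventions and a reference id set not shown here.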

* Most recent date seen in source data is not older than most recent date seen in reference data
* Similar number of obs per day as recent API data (static threshold)
* Similar average value as API data (static threshold)
* Source data for specified date range is empty
* API data for specified date range is empty
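
A minimal sketch of these comparative checks, assuming simple ratio-based static thresholds (the column names, threshold values, and function shape are assumptions for illustration):

```python
import pandas as pd

def comparative_checks(source_df, api_df, max_obs_ratio=1.5, max_avg_ratio=2.0):
    """Compare a date range of source data against recent API (reference) data.

    The ratio thresholds are illustrative static values, not tuned ones.
    Expects 'time_value' and 'val' columns in both frames.
    """
    # Empty-data checks short-circuit: nothing else is meaningful without data.
    if source_df.empty:
        return ["source data for specified date range is empty"]
    if api_df.empty:
        return ["API data for specified date range is empty"]
    errors = []
    if source_df["time_value"].max() < api_df["time_value"].max():
        errors.append("most recent source date is older than most recent API date")
    # Similar number of observations per day (static threshold on the ratio).
    obs_ratio = len(source_df) / len(api_df)
    if not (1 / max_obs_ratio <= obs_ratio <= max_obs_ratio):
        errors.append(f"obs count ratio {obs_ratio:.2f} outside threshold")
    # Similar average value (static threshold on the ratio of means).
    avg_ratio = source_df["val"].mean() / api_df["val"].mean()
    if not (1 / max_avg_ratio <= avg_ratio <= max_avg_ratio):
        errors.append(f"average value ratio {avg_ratio:.2f} outside threshold")
    return errors
```

In practice the thresholds would likely differ by signal (smoothed vs raw, counts vs ratios), which is why static one-size-fits-all values appear in the wishlist as a problem to revisit.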

## Current features

## Checks + features wishlist, and problems to think about:

* Improve performance and reduce runtime (what's the target time?)
* Profiling (iterate)
  * Check if saving intermediate files will improve efficiency (currently a bottleneck at the "individual file checks" section; parallelize?)
  * Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance, or may just make access more readable.
* Check for duplicate rows
* Check explicitly for large spikes (the `avg_val` check can detect jumps in average value)
* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn't agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, have any values changed significantly within the "backfill" window (use the `span_length` setting)? See [this](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
* Run `check_missing_date_files` (or similar) on every geo type-signal type combination separately, in the comparative checks loop.
* Different test thresholds for different files? Currently there is some control based on smoothed vs raw signals.
* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
* Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days.
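
The duplicate-row and large-spike items in the list above could begin as something like the following sketch. The z-score approach, its threshold, and the column names are assumptions; the validator may well prefer a different outlier test:

```python
import pandas as pd

def check_duplicates_and_spikes(df, z_threshold=3.0):
    """Flag exact duplicate rows, and days whose average value is a large
    spike relative to the rest of the window (plain z-score on daily means;
    threshold and column names are illustrative)."""
    errors = []
    n_dups = int(df.duplicated().sum())
    if n_dups > 0:
        errors.append(f"{n_dups} duplicate row(s)")
    daily_avg = df.groupby("time_value")["val"].mean()
    std = daily_avg.std()
    # Skip the spike test for flat or single-day windows (std is 0 or NaN).
    if pd.notna(std) and std > 0:
        z = (daily_avg - daily_avg.mean()) / std
        for day, zscore in z.items():
            if abs(zscore) > z_threshold:
                errors.append(f"large spike in average value on {day}")
    return errors
```

A global z-score over a short window is crude; a rolling or robust (median/MAD) variant would handle trending signals better, at the cost of more tuning.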

* Correct p-values for multiple testing
  * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
* Nicer formatting for the error “report”
  * E.g. if a single type of error is raised for many different datasets, summarize all the error messages into a single message? But it still has to be clear how to suppress each one individually
* Easier suppression of many errors at once
* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
* Ensure the validator runs on signals that require AWS credentials (iterate)
* Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct: not false positives, not overly verbose or repetitive
* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
  * E.g. the most recent successful API pull -- might end up being a couple weeks older
  * Currently, any API fetch problem just skips the comparative checks entirely.
* Check for erratic data sources that wrongly report all zeroes
  * E.g. the error with the Wisconsin data for the 10/26 forecasts
  * Wary of a purely static check for this
  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, or certain signals (deaths, since it's << cases)
  * This test is partially captured by checking averages in source vs reference data, unless erroneous zeroes continue for more than a week
  * Also partially captured by outlier checking. If the zeroes aren't outliers, then it's hard to say they're erroneous at all.
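
One way the all-zeroes check might look, folding in the false-positive concern above by requiring the reference history to be clearly nonzero before flagging. The function shape and the `min_reference_mean` guard are illustrative assumptions, not a settled design:

```python
def check_all_zero(recent_vals, reference_vals, min_reference_mean=1.0):
    """Flag a geo/signal whose recent values are all zero even though its
    reference history is clearly nonzero.

    The min_reference_mean guard is one (assumed) way to avoid false
    positives for regions that legitimately hover near zero, e.g. small
    counties or low-count signals like deaths. Returns an error string,
    or None if nothing looks wrong.
    """
    if not recent_vals or not reference_vals:
        return None  # nothing to compare; the empty-data checks handle this
    if all(v == 0 for v in recent_vals):
        ref_mean = sum(reference_vals) / len(reference_vals)
        if ref_mean >= min_reference_mean:
            return f"all-zero recent values but reference mean is {ref_mean:.1f}"
    return None
```

This is intentionally not purely static: a county whose reference history is also near zero never trips it, which addresses the small-county/MSA concern, though a run of erroneous zeroes longer than the reference window would still slip through.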