# Validator checks and features

## Current checks for indicator source data

* Missing dates within the selected range
* Recognized file name format
* Recognized geographical type (county, state, etc.)
* Recognized geo id format (e.g. state is two lowercase letters)
* Specific geo id has been seen before, in historical data
* Missing geo type + signal + date combos based on the geo type + signal combos Covidcast metadata says should be available
* Missing ‘val’ values
* Negative ‘val’ values
* Out-of-range ‘val’ values (> 0 for all signals, <= 100 for percentages, <= 100,000 for proportions)
* Missing ‘se’ values
* Appropriate ‘se’ values, within a calculated reasonable range
* Stderr != 0
* If the signal value and stderr are both 0 (seen in Quidel data due to lack of Jeffreys correction, [issue 255](https://github.com/cmu-delphi/covidcast-indicators/issues/255#issuecomment-692196541))
* Missing ‘sample_size’ values
* Appropriate ‘sample_size’ values, ≥ 100 (default) or a user-defined threshold
* Most recent date seen in source data is recent enough, < 1 day ago (default) or user-defined on a per-signal basis
* Most recent date seen in source data is not in the future
* Most recent date seen in source data is not older than most recent date seen in reference data
* Similar number of observations per day as recent API data (static threshold)
* Similar average value as API data (static threshold)
* Source data for specified date range is empty
* API data for specified date range is empty

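A rough sketch of how a few of the row-level checks above might look for a single loaded CSV, assuming a pandas DataFrame with `val`, `se`, and `sample_size` columns; the function names and the way problems are reported are illustrative, not the validator's actual code.

```python
import pandas as pd

def check_val_column(df: pd.DataFrame, signal_type: str) -> list:
    """Row-level checks on the 'val' column; returns human-readable problems."""
    problems = []
    if df["val"].isnull().any():
        problems.append("missing 'val' values")
    if (df["val"] < 0).any():
        problems.append("negative 'val' values")
    # Out-of-range upper bounds depend on the signal type.
    upper = {"percent": 100, "proportion": 100_000}.get(signal_type)
    if upper is not None and (df["val"] > upper).any():
        problems.append(f"'val' values above {upper} for a {signal_type} signal")
    return problems

def check_se_and_sample_size(df: pd.DataFrame, min_sample_size: int = 100) -> list:
    """Checks on 'se' and 'sample_size', mirroring the list above."""
    problems = []
    # se = 0 alongside val = 0 suggests a missing Jeffreys correction (issue 255).
    if ((df["val"] == 0) & (df["se"] == 0)).any():
        problems.append("rows with val = 0 and se = 0")
    if (df["sample_size"] < min_sample_size).any():
        problems.append(f"'sample_size' below {min_sample_size}")
    return problems
```
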
## Current features

* Errors and warnings are summarized in a class attribute and printed on exit
* If any non-suppressed errors are raised, the validation process exits with non-zero status
* Various check settings are controllable via indicator-specific params.json files
* User can manually disable specific checks for specific datasets using a field in the params.json file (see the sketch below)
* User can enable test mode (checks only a small number of data files) using a field in the params.json file

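A minimal sketch of how the suppression and test-mode features might be read from params.json and applied. The `validation` section, the `suppressed_errors` and `test_mode` field names, the `(check_name, signal, geo_type)` triple format, and the `receiving/` directory are all assumptions for illustration, not the validator's actual schema.

```python
import glob
import json

# Load the indicator-specific params file (path and field names are illustrative).
with open("params.json") as f:
    params = json.load(f).get("validation", {})

# Hypothetical schema: each suppressed error is a [check_name, signal, geo_type] triple.
suppressed = {tuple(entry) for entry in params.get("suppressed_errors", [])}

def is_suppressed(check_name, signal, geo_type):
    """True if this specific check/signal/geo combination should not fail the run."""
    return (check_name, signal, geo_type) in suppressed

# Test mode: only look at a small number of export files to keep runs fast.
files_to_check = sorted(glob.glob("receiving/*.csv"))
if params.get("test_mode", False):
    files_to_check = files_to_check[:5]
```
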
## Checks + features wishlist, and problems to think about

### Starter/small issues

* Check for duplicate rows (see the sketch below)
* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a data point that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, check whether any values have changed significantly within the "backfill" window (use the span_length setting); see the sketch below. See [this discussion](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
* Run check_missing_date_files (or similar) on every geo type-signal type combination separately in the comparative checks loop.

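A minimal sketch of the two starter checks above, assuming the per-file data is a pandas DataFrame with `geo_id`, `val`, `se`, and `sample_size` columns and that `api_df` holds the corresponding rows previously pulled from the API; the column names and the 10% change threshold are assumptions for illustration.

```python
import pandas as pd

def check_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return any fully duplicated rows so they can be reported."""
    return df[df.duplicated(keep=False)]

def check_backfill_changes(df: pd.DataFrame, api_df: pd.DataFrame,
                           rel_change_threshold: float = 0.10) -> pd.DataFrame:
    """Flag rows whose value changed a lot relative to what the API already has.

    Both frames are assumed to cover the same signal, geo type, and date,
    with one row per geo_id.
    """
    merged = df.merge(api_df, on="geo_id", suffixes=("_new", "_api"))
    # Note: values that were 0 in the API divide to inf/NaN here; a real
    # check would handle that case explicitly.
    rel_change = (merged["val_new"] - merged["val_api"]).abs() / merged["val_api"].abs()
    return merged[rel_change > rel_change_threshold]
```
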
### Larger issues

* Expand framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data, e.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
* Build a backtesting framework so new checks can be run individually on historical indicator data to tune false positives, output verbosity, understand frequency of error raising, etc. Should pull data from the API the first time and save it locally in the `cache` dir.
* Add a DETAILS.md doc with detailed descriptions of what each check does and how. Will be especially important for statistical/anomaly detection checks.
* Improve errors and error report
  * Check that [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, and not overly verbose or repetitive
  * Easier suppression of many errors at once
  * Maybe store errors as a dict of dicts. Keys could be check strings (e.g. "check_bad_se"), with the next layer keyed by geo type, etc.
  * Nicer formatting for the error “report”.
    * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
* Check for erratic data sources that wrongly report all zeroes
  * E.g. the error with the Wisconsin data for the 10/26 forecasts
  * Wary of a purely static check for this
  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
  * This test is partially captured by checking averages in source vs reference data, unless erroneous zeroes continue for more than a week
  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
* Outlier detection (in progress; see the sketch after this list)
  * Current approach is tuned to daily cases and daily deaths; use it just on those signals?
  * The prophet package's detection is flexible, but it needs 2-3 months of historical data to fit on. May make sense to use if other statistical checks also need that much data.
* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
  * E.g. the most recent successful API pull -- which might end up being a couple of weeks old
  * Currently, any API fetch problem means the comparative checks are skipped entirely.
* Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
  * Profiling (iterate)
  * Save intermediate files?
  * Currently a bottleneck at the "individual file checks" section. Parallelize?
  * Make `all_frames` MultiIndex-ed by geo type and signal name? Or make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
* Ensure validator runs on signals that require AWS credentials (iterate)

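One way the prophet-based outlier check mentioned above could look, as a rough sketch rather than the validator's actual implementation: fit on a few months of history for one signal/geo series and flag observations that fall outside the model's prediction interval. The column names and the 99% interval width are assumptions.

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

def flag_outliers(series: pd.DataFrame, interval_width: float = 0.99) -> pd.DataFrame:
    """Flag points outside prophet's prediction interval.

    `series` is one signal for one geo region, with columns 'ds' (date)
    and 'y' (value), covering roughly 2-3 months of history.
    """
    model = Prophet(interval_width=interval_width)
    model.fit(series)
    forecast = model.predict(series[["ds"]])
    merged = series.merge(forecast[["ds", "yhat_lower", "yhat_upper"]], on="ds")
    return merged[(merged["y"] < merged["yhat_lower"]) |
                  (merged["y"] > merged["yhat_upper"])]
```
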
### Longer-term issues

* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
  * Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
  * Do any relevant anomaly detection packages already exist?
  * What sorts of hypothesis tests to use? See [time series trend analysis](https://www.genasis.cz/time-series/index.php?pg=home--trend-analysis).
  * See data-quality GitHub issues, Ryan’s [correlation notebook](https://github.com/cmu-delphi/covidcast/tree/main/R-notebooks), and Dmitry's [indicator validation notebook](https://github.com/cmu-delphi/covidcast-indicators/blob/deploy-jhu/testing_utils/indicator_validation.template.ipynb) for ideas
    * E.g. the doctor visits signal's correlation with cases decreasing
    * E.g. WY/RI missing or very low compared to historical data
* Use hypothesis-testing p-values to decide when to raise an error, instead of static thresholds. Many low but non-significant p-values should also raise an error. See [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1601307675021000?thread_ts=1600277030.103500&cid=CV1SYBC90) and [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1600978037007500?thread_ts=1600277030.103500&cid=CV1SYBC90) for more background.
  * Order raised exceptions by p-value
  * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
  * Correct p-values for multiple testing (see the sketch after this list)
    * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)

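A minimal sketch of how the multiple-testing correction above might be applied to one family of per-geo p-values before deciding which failures to surface as errors; the use of statsmodels, the dict input format, and the 0.05 alpha are assumptions for illustration.

```python
from statsmodels.stats.multitest import multipletests

def significant_checks(pvalues_by_geo: dict, alpha: float = 0.05):
    """Apply Benjamini-Hochberg FDR control to one family of per-geo tests.

    `pvalues_by_geo` maps geo_id -> raw p-value for the same type of check.
    Returns (geo_id, corrected p-value) pairs that remain significant,
    ordered by p-value so the most extreme failures are reported first.
    """
    geo_ids = sorted(pvalues_by_geo, key=pvalues_by_geo.get)
    pvals = [pvalues_by_geo[g] for g in geo_ids]
    reject, pvals_corrected, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return [(g, p) for g, p, r in zip(geo_ids, pvals_corrected, reject) if r]
```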