# FB-survey validation with a generic design to include other pipelines #155
@@ -0,0 +1,89 @@
# Validator checks and features

## Current checks for indicator source data

* Missing dates within the selected range
* Recognized file name format
* Recognized geographical type (county, state, etc.)
* Recognized geo id format (e.g. state is two lowercase letters)
* Specific geo id has been seen before, in historical data
* Missing geo type + signal + date combos, based on the geo type + signal combos that COVIDcast metadata says should be available
* Missing ‘val’ values
* Negative ‘val’ values
* Out-of-range ‘val’ values (> 0 for all signals, <= 100 for percents, <= 100,000 for proportions)
* Missing ‘se’ values
* Appropriate ‘se’ values, within a calculated reasonable range
* Stderr != 0
* Signal value and stderr both equal to 0 (seen in Quidel data due to lack of Jeffreys correction, [issue 255](https://github.com/cmu-delphi/covidcast-indicators/issues/255#issuecomment-692196541))
* Missing ‘sample_size’ values
* Appropriate ‘sample_size’ values, ≥ 100 (default) or a user-defined threshold
* Most recent date seen in source data is recent enough, < 1 day ago (default) or user-defined on a per-signal basis
* Most recent date seen in source data is not in the future
* Most recent date seen in source data is not older than the most recent date seen in the reference data
* Similar number of observations per day as recent API data (static threshold)
* Similar average value as recent API data (static threshold)
* Source data for the specified date range is empty
* API data for the specified date range is empty
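
Most of these checks boil down to a small method that tests one condition on a loaded CSV and records an error when it fails. A minimal sketch of one such check, assuming a pandas frame with the standard `geo_id`/`val`/`se`/`sample_size` columns; the function name, the error-tuple format, and the `_prop` suffix test are illustrative assumptions, not the module's exact API:

```
import pandas as pd

def check_bad_val(df: pd.DataFrame, filename: str, signal_type: str, raised: list):
    """Record an error tuple for missing, negative, or out-of-range 'val' entries."""
    if df["val"].isnull().any():
        raised.append(("check_val_missing", filename, signal_type))
    if (df["val"] < 0).any():
        raised.append(("check_val_negative", filename, signal_type))
    # Proportion signals are bounded above by 100,000; the "_prop" suffix is only
    # a stand-in for however proportion signals are actually identified.
    if signal_type.endswith("_prop") and (df["val"] > 100_000).any():
        raised.append(("check_val_lt_100k", filename, signal_type))
```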

## Current features

* Errors and warnings are summarized in a class attribute and printed on exit
* If any non-suppressed errors are raised, the validation process exits with a non-zero status
* Various check settings are controllable via indicator-specific params.json files
* Users can manually disable specific checks for specific datasets using a field in the params.json file
* Users can enable test mode (checks only a small number of data files) using a field in the params.json file

## Checks + features wishlist, and problems to think about

### Starter/small issues

* Check for duplicate rows
* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, check whether any values have changed significantly within the "backfill" window (use the span_length setting). See [this](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context, and the sketch after this list.
* Run check_missing_date_files (or similar) on every geo type + signal type combination separately in the comparative checks loop.

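A rough sketch of the backfill comparison described above, assuming the new data and the previously published API data have both been loaded into frames keyed by `geo_id` and `time_value`; the function name and the 50% relative-change threshold are purely illustrative:

```
import pandas as pd

def backfill_changes(source_df: pd.DataFrame, api_df: pd.DataFrame,
                     rel_threshold: float = 0.5) -> pd.DataFrame:
    """Return rows whose 'val' differs from the already-published API value
    by more than rel_threshold (relative change)."""
    merged = source_df.merge(api_df, on=["geo_id", "time_value"],
                             suffixes=("_new", "_api"))
    rel_change = (merged["val_new"] - merged["val_api"]).abs() / merged["val_api"].abs()
    # A previously-zero value that becomes nonzero yields inf and is also flagged.
    return merged[rel_change > rel_threshold]
```
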
### Larger issues

* Expand the framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data. E.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
* Build a backtesting framework so new checks can be run individually on historical indicator data to tune false positives and output verbosity, understand the frequency of error raising, etc. Should pull data from the API the first time and save it locally in the `cache` dir.
* Add a DETAILS.md doc with detailed descriptions of what each check does and how. Will be especially important for statistical/anomaly detection checks.
* Improve errors and the error report
  * Check whether [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, and not overly verbose or repetitive
  * Easier suppression of many errors at once
    * Maybe store errors as a dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then the next layer geo type, etc (see the sketch after this list)
  * Nicer formatting for the error “report”
    * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
* Check for erratic data sources that wrongly report all zeroes
  * E.g. the error with the Wisconsin data for the 10/26 forecasts
  * Wary of a purely static check for this
  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
  * This test is partially captured by checking averages in source vs reference data, unless erroneous zeroes continue for more than a week
  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
* Outlier detection (in progress)
  * The current approach is tuned to daily cases and daily deaths; should it be used only on those signals?
  * prophet (package) detection is flexible, but needs 2-3 months of historical data to fit on. May make sense to use if other statistical checks also need that much data.
* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
  * E.g. the most recent successful API pull -- might end up being a couple of weeks older
  * Currently, any API fetch problem means the comparative checks are skipped entirely.
* Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
  * Profiling (iterate)
  * Save intermediate files?
  * Currently a bottleneck at the "individual file checks" section. Parallelize?
  * Make `all_frames` MultiIndex-ed by geo type and signal name? Or make a dict of data indexed by geo type and signal name? May improve performance, or may just make access more readable.
* Ensure the validator runs on signals that require AWS credentials (iterate)

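A sketch of the dict-of-dicts error storage floated above, with the nesting keyed by check name and then geo type so whole groups of errors can be suppressed at once; the check names and messages here are made up for illustration:

```
from collections import defaultdict

# check name -> geo type -> list of error messages
raised = defaultdict(lambda: defaultdict(list))
raised["check_bad_se"]["county"].append("20201015: se out of expected range for 01001")
raised["check_bad_se"]["state"].append("20201015: se out of expected range for wy")

# Suppressing every error from one check for one geo type is then a single lookup.
suppressed = {("check_bad_se", "state")}
to_report = {(check, geo): msgs
             for check, by_geo in raised.items()
             for geo, msgs in by_geo.items()
             if (check, geo) not in suppressed}
print(to_report)
```
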
### Longer-term issues

* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
* Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
* Do any relevant anomaly detection packages already exist?
* What sorts of hypothesis tests to use? See [time series trend analysis](https://www.genasis.cz/time-series/index.php?pg=home--trend-analysis).
* See data-quality GitHub issues, Ryan’s [correlation notebook](https://github.com/cmu-delphi/covidcast/tree/main/R-notebooks), and Dmitry's [indicator validation notebook](https://github.com/cmu-delphi/covidcast-indicators/blob/deploy-jhu/testing_utils/indicator_validation.template.ipynb) for ideas
  * E.g. the Doctor Visits signal's correlation with cases decreasing
  * E.g. WY/RI missing or very low compared to historical data
* Use hypothesis-testing p-values to decide when to raise an error, instead of static thresholds. Many low but non-significant p-values should also raise an error. See [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1601307675021000?thread_ts=1600277030.103500&cid=CV1SYBC90) and [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1600978037007500?thread_ts=1600277030.103500&cid=CV1SYBC90) for more background.
  * Order raised exceptions by p-value
  * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (across different geo regions, e.g.) are "close" to significant
  * Correct p-values for multiple testing. Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to that choice ([comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)). A minimal Benjamini-Hochberg sketch follows below.
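
For reference, a self-contained Benjamini-Hochberg sketch; it is not part of the validator and is included only to make the correction procedure concrete:

```
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of p-values that remain significant after
    Benjamini-Hochberg false-discovery-rate correction at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k (1-indexed) with p_(k) <= (k / m) * alpha; reject hypotheses 1..k.
    thresholds = np.arange(1, m + 1) / m * alpha
    below = ranked <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        significant[order[:k + 1]] = True
    return significant

# Example: the first two p-values survive correction, the last two do not.
print(benjamini_hochberg([0.001, 0.01, 0.04, 0.5]))
```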
@@ -0,0 +1,112 @@
# Validator

The validator performs two main tasks:

1) Sanity checks on daily data generated from the pipeline of a specific data source.
2) Comparative analysis with recent data from the API, to detect any anomalies such as spikes or significant value differences.

Review comment: Thanks for writing such thorough documentation. How are you planning on integrating this with the rest of the system -- will this be added as an extra automation step?

Reply: Yes. The intended pipeline is:

The validator validates new source data in CSV format against data pulled from the [COVIDcast API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html).

## Running the Validator

The validator is run by executing the Python module contained in this
directory from the main directory of the indicator of interest.

The safest way to do this is to create a virtual environment,
install the common DELPHI tools, install the indicator module and its
dependencies, and then install the validator module and its
dependencies to the virtual environment.

To do this, navigate to the main directory of the indicator of interest and run the following code:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
pip install ../validator
```

To execute the module and validate source data (by default, in `receiving`), run the indicator to generate data files, then run
the validator, as follows:

```
env/bin/python -m delphi_INDICATORNAME
env/bin/python -m delphi_validator
```

Once you are finished with the code, you can deactivate the virtual environment
and (optionally) remove the environment itself.

```
deactivate
rm -r env
```

### Customization

All of the user-changeable parameters are stored in the `validation` field of the indicator's `params.json` file. If `params.json` does not already include a `validation` field, please copy the one provided in this module's `params.json.template`.

Please update the following settings:

* `data_source`: should match the [formatting](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html) used in COVIDcast API calls
* `end_date`: specifies the last date to be checked; if set to "latest", `end_date` will always be the current date
* `span_length`: specifies the number of days before `end_date` to check. `span_length` should be long enough to contain all recent source data that is still in the process of being updated (i.e. in the backfill period). For example, if the data source of interest has a 2-week lag before all reports are in for a given date, `span_length` should be 14 days
* `smoothed_signals`: list of the names of the signals that are smoothed (e.g. 7-day average)
* `expected_lag`: dictionary of signal name and integer pairs specifying the number of days of expected lag (the time between an event occurring and data about that event being published) for each signal
* `test_mode`: boolean; `true` checks only a small number of data files
* `suppressed_errors`: list of lists uniquely specifying errors that have been manually verified as false positives or acceptable deviations from what is expected

All other fields contain working defaults, to be modified as needed.

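As an illustration, a `validation` block combining these settings might look like the following; every value here is a placeholder, and `params.json.template` remains the authoritative reference for field names and defaults:

```
"validation": {
    "data_source": "usa-facts",
    "end_date": "latest",
    "span_length": 14,
    "smoothed_signals": ["confirmed_7dav_cumulative_num"],
    "expected_lag": {"confirmed_cumulative_num": 1},
    "test_mode": false,
    "suppressed_errors": [["check_val_missing", "20201015_county_confirmed_cumulative_num.csv"]]
}
```
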
## Testing the code

To test the code, please create a new virtual environment in the main module directory using the following procedure, similar to the above:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
```

To do a static test of the code style, it is recommended to run **pylint** on
the module. To do this, run the following from the main module directory:

```
env/bin/pylint delphi_validator
```

The most aggressive checks are turned off; only relatively important issues
should be raised, and they should be manually checked (or, better, fixed).

Unit tests are also included in the module. To execute these, run the following command from this directory:

```
(cd tests && ../env/bin/pytest --cov=delphi_validator --cov-report=term-missing)
```

The output will show the number of unit tests that passed and failed, along with the percentage of code covered by the tests. None of the tests should fail, and the code lines not covered by unit tests should be few and should not include critical sub-routines.

## Code tour

* run.py: reads the params.json fields, passes them to the validation process, and runs it
* datafetcher.py: methods for loading source and API data
* validate.py: methods for validating data, including the individual check methods and supporting functions
* errors.py: custom errors

## Adding checks

To add a new validation check, define the check as a `Validator` class method in `validate.py`. Each check should append a descriptive error message to the `raised` attribute if triggered. All checks should allow the user to override exception raising for a specific file using the `suppressed_errors` setting in `params.json`.

This feature requires that the `check_data_id` defined for an error uniquely identify that combination of check and test data. It usually takes the form of a tuple of strings containing the check method name and a test-data identifier: the test data filename, or the date, geo type, and signal name.

Add the newly defined check to the `validate()` method so it is executed. It should go in one of three sections:

* data sanity checks, where a data file is compared against static format settings,
* data trend and value checks, where a set of source data (one or several days) is compared against recent API data from the previous few days,
* data trend and value checks, where a set of source data is compared against long-term API data from the last few months

A short sketch of what this can look like follows below.
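
For instance, a new check and its registration might be sketched as follows; the class here is a stand-in, and the method body, the `check_data_id` contents, and the `suppressed_errors` comparison are illustrative assumptions rather than the module's exact conventions:

```
import pandas as pd

class ValidatorSketch:
    """Illustrative stand-in for the real Validator class."""

    def __init__(self, suppressed_errors=None):
        self.raised = []
        self.suppressed_errors = suppressed_errors or []

    def check_duplicate_rows(self, df, filename):
        """Append an error if the loaded CSV contains exact duplicate rows."""
        check_data_id = ("check_duplicate_rows", filename)
        if list(check_data_id) in self.suppressed_errors:
            return  # user has verified this error as acceptable for this file
        if df.duplicated().any():
            self.raised.append((check_data_id, "data file contains duplicate rows"))

    def validate(self, df, filename):
        # Section 1: data sanity checks against static format settings.
        self.check_duplicate_rows(df, filename)
        # ... comparative checks against recent and long-term API data would follow.
```

A corresponding `suppressed_errors` entry would then pair the check name with the filename, e.g. `["check_duplicate_rows", "20201015_county_signalname.csv"]` (again, hypothetical values).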
@@ -0,0 +1,37 @@
## Code Review (Python)

A code review of this module should include a careful look at the code and the
output. To assist in the process, but certainly not to replace it, please
check the following items.

**Documentation**

- [ ] the README.md file template is filled out and currently accurate; it is
possible to load and test the code using only the instructions given
- [ ] minimal docstrings (one line describing what the function does) are
included for all functions; full docstrings describing the inputs and expected
outputs should be given for non-trivial functions

**Structure**

- [ ] code should use 4 spaces for indentation; other style decisions are
flexible, but be consistent within a module
- [ ] any required metadata files are checked into the repository and placed
within the directory `static`
- [ ] any intermediate files that are created and stored by the module should
be placed in the directory `cache`
- [ ] all options and API keys are passed through the file `params.json`
- [ ] template parameter file (`params.json.template`) is checked into the
code; no personal (i.e., usernames) or private (i.e., API keys) information is
included in this template file

**Testing**

- [ ] module can be installed in a new virtual environment
- [ ] pylint with the default `.pylint` settings run over the module produces
minimal warnings; warnings that do exist have been confirmed as false positives
- [ ] reasonably high level of unit test coverage covering all of the main logic
of the code (e.g., missing coverage for raised errors that do not currently seem
possible to reach is okay; missing coverage for options that will be needed is
not)
- [ ] all unit tests run without errors
@@ -0,0 +1,13 @@
# -*- coding: utf-8 -*-
"""Module to validate indicator source data before uploading to the public COVIDcast API.

This file defines the functions that are made public by the module. As the
module is intended to be executed through the main method, these are primarily
for testing.
"""

from __future__ import absolute_import

from . import run

__version__ = "0.1.0"
@@ -0,0 +1,10 @@
# -*- coding: utf-8 -*-
"""Call the function run_module when executed.

This file indicates that running the module (`python -m delphi_validator`) will
call the function `run_module` found within the run.py file.
"""

from .run import run_module

run_module()
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
"""
Functions to get CSV filenames and data.
"""

import re
from os import listdir
from os.path import isfile, join
from itertools import product
import pandas as pd
import numpy as np

import covidcast
from .errors import APIDataFetchError

filename_regex = re.compile(
    r'^(?P<date>\d{8})_(?P<geo_type>\w+?)_(?P<signal>\w+)\.csv$')

Review comment: Not for this change necessarily -- but is this definition shared across all of our code? Would it make sense to extract this to a library somewhere instead?

Review comment: Or maybe the validator should accept the regex as a parameter?

Reply: Shared with epidata; there will be a lot of things duplicated between this package and the validation done by the epidata acquisition code. We should pull it out into a library as soon as we have a private package repo up and running.

def read_filenames(path):
    """
    Return a list of tuples of every filename and its regex match against the CSV
    filename format, for files in the specified directory.

    Arguments:
        - path: path to the directory containing CSV data files.

    Returns:
        - list of tuples
    """
    daily_filenames = [(f, filename_regex.match(f))
                       for f in listdir(path) if isfile(join(path, f))]
    return daily_filenames


def load_csv(path):
    """
    Load CSV with specified column types.
    """
    return pd.read_csv(
        path,
        dtype={
            'geo_id': str,
            'val': float,
            'se': float,
            'sample_size': float,
        })


def get_geo_signal_combos(data_source):
    """
    Get the list of geo type-signal type combinations that we expect to see, based on
    combinations reported as available by COVIDcast metadata.
    """
    meta = covidcast.metadata()
    source_meta = meta[meta['data_source'] == data_source]
    unique_signals = source_meta['signal'].unique().tolist()
    unique_geotypes = source_meta['geo_type'].unique().tolist()

    geo_signal_combos = list(product(unique_geotypes, unique_signals))
    print("Number of expected geo region-signal combinations:",
          len(geo_signal_combos))

    return geo_signal_combos


def fetch_api_reference(data_source, start_date, end_date, geo_type, signal_type):
    """
    Get and process API data for use as a reference. Formatting is changed
    to match that of the source data CSVs.
    """
    api_df = covidcast.signal(
        data_source, signal_type, start_date, end_date, geo_type)

    if not isinstance(api_df, pd.DataFrame):
        custom_msg = "Error fetching data from " + str(start_date) + \
            " to " + str(end_date) + \
            " for data source: " + data_source + \
            ", signal type: " + signal_type + \
            ", geo type: " + geo_type

        raise APIDataFetchError(custom_msg)

    column_names = ["geo_id", "val",
                    "se", "sample_size", "time_value"]

    # Replace None with NaN to make numerical manipulation easier.
    # Rename and reorder columns to match those in df_to_test.
    api_df = api_df.replace(
        to_replace=[None], value=np.nan
    ).rename(
        columns={'geo_value': "geo_id", 'stderr': 'se', 'value': 'val'}
    ).drop(
        ['direction', 'issue', 'lag'], axis=1
    ).reindex(columns=column_names)

    return api_df
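
For orientation, a small example of how these helpers could be exercised together, assuming the package is installed as `delphi_validator`; the data source, signal, and dates are placeholders, not values this PR configures:

```
from datetime import date
from delphi_validator.datafetcher import (
    read_filenames, get_geo_signal_combos, fetch_api_reference)

filenames = read_filenames("receiving")        # [(filename, regex match or None), ...]
combos = get_geo_signal_combos("usa-facts")    # expected (geo_type, signal) pairs
api_df = fetch_api_reference("usa-facts", date(2020, 10, 1), date(2020, 10, 14),
                             "county", "confirmed_cumulative_num")
```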

Review comment: Just curious -- have you tried running this on existing imports already? Specifically I'm wondering if you have a sense of how often this generates false positives (if any) so that the aggressiveness of the validations can be tuned.

Reply: Yes! I have run this on the USAFacts indicators.

The old Facebook-specific validator had false positives from the check comparing average values of source data vs the API. I increased the static threshold used in this check to require a larger difference in averages to trigger the error, and haven't yet had problems with it being too sensitive.

I do often see the "val column can't have any cell smaller than 0" error. Although we expect to see negative values occasionally, used to adjust for previous overreporting, it turns out they are fairly common. These aren't false positives per se, but perhaps, given the frequency of the error, we would only want to report it if there were a very large negative value or many small negative values.

The majority of the current checks look at formatting or very reasonable rule-based cutoffs (percents can't be >100, e.g.), so there shouldn't be a lot of tuning needed.

Comment: I'm going to put @chinandrew on surveying the other signals.