FB-survey validation with a generic design to include other pipelines #155

Merged: 154 commits, Nov 19, 2020
088f8aa
Initial commit for the validation
amartyabasu Jul 16, 2020
284ac9d
Correcting regular expression for filenames
amartyabasu Jul 16, 2020
ebd93a3
Breaking down nested dataframes for complete date range for more stre…
amartyabasu Jul 16, 2020
6babdba
Minor regex change for filename capture.
amartyabasu Jul 20, 2020
0b28b63
Adding custom exception files for readability
amartyabasu Jul 20, 2020
40a2c4d
Minor changes
amartyabasu Jul 20, 2020
77be3f0
Geotype validation logic correction
amartyabasu Jul 23, 2020
e398c88
Modifying val check
amartyabasu Jul 23, 2020
449c09f
Handling incorrect floating point se check
amartyabasu Jul 24, 2020
0161c47
Adding numpy dependency
amartyabasu Jul 24, 2020
a307086
Removing numpy ceiling func
amartyabasu Jul 24, 2020
a7d7a58
sample size filtering
amartyabasu Jul 24, 2020
45fbb0e
Reading signals and geo_types from metadata()
amartyabasu Jul 30, 2020
14fb6ce
Rapid rise check
amartyabasu Jul 30, 2020
2da9138
Minor change
amartyabasu Jul 30, 2020
1d378c2
Package compliance
amartyabasu Jul 30, 2020
c5edfd1
Average diff check correction
amartyabasu Aug 4, 2020
9d18363
Adding smoothing thresholds
amartyabasu Aug 4, 2020
45986ad
Edits for avg diff check
amartyabasu Aug 4, 2020
b8eae79
Readme edits
amartyabasu Aug 4, 2020
c8a5b7a
Tweaks to fetch metadata based on old/new pipeline.
amartyabasu Aug 7, 2020
7b23269
Removing prototype code.
amartyabasu Aug 7, 2020
6cae2b4
Fixing module build errors
amartyabasu Aug 10, 2020
d27b104
module name change in readme.
amartyabasu Aug 10, 2020
66091d1
Clean up
amartyabasu Aug 10, 2020
9c54f43
[wip] validation package\n*replace print/exit with raised exceptions
krivard Sep 4, 2020
59a2313
fixup! [wip] validation package\n*replace print/exit with raised exce…
krivard Sep 4, 2020
5572ef8
Remove old-codebase method; add pandas loader
krivard Sep 8, 2020
8eb1f6a
[wip] Added linting template info to README. Replaced instances of ef…
nmdefries Sep 9, 2020
7719f03
[wip] Defined function to get geo_sig_combo outside context of get_fi…
nmdefries Sep 10, 2020
147401a
[wip] Fetch data_source from params.json. Added as argument to valida…
nmdefries Sep 10, 2020
8157a6b
[wip] get_geo_sig_cmbo() now encapsulates covidcast metadata call
nmdefries Sep 10, 2020
00a249a
Set alternative to smooth_option for non-FB data
nmdefries Sep 14, 2020
d9cfa30
[wip] Runs but some check definitions are wrong
nmdefries Sep 14, 2020
e491f7c
Added docstring and comments to validate_daily(). Checked correctness…
nmdefries Sep 15, 2020
ae16037
Initial docstring in validate(). validate_daily() now called in valid…
nmdefries Sep 15, 2020
2be6a78
Debugged validate_daily(). Implemented and debugged check_min_allowed…
nmdefries Sep 16, 2020
0ff4fd7
Changed geo_id regex to use lowercase letters for state abbreviation
nmdefries Sep 17, 2020
ccf8015
Added check in check_bad_val for proprtion maximum allowed value. Add…
nmdefries Sep 17, 2020
0b9d39e
Changed check_bad_se to allow for NA standard errors. Added docstrings
nmdefries Sep 17, 2020
51302e3
Added general params argument to validate() to encapsulate miscellane…
nmdefries Sep 22, 2020
e820f97
Modified check_bad_sample_size() to allow user-defined minimum sample…
nmdefries Sep 22, 2020
7b0785f
Removed duplicate code in check_missing_dates(). Improved print state…
nmdefries Sep 22, 2020
089ce9c
[wip] Adding in 'semirecent' checks according to R reference code
nmdefries Sep 22, 2020
b9ce615
[wip] Generalize check_rapid_change()
nmdefries Sep 22, 2020
138cc5c
[wip] Replace Nones in API datapull with NAs to make numerical manipu…
nmdefries Sep 22, 2020
4e57493
[wip] Fixed calculation of mean_stddiff and mean_stdabsdiff in check_…
nmdefries Sep 23, 2020
9d2d043
Fixed check_avg_val_diffs(). Now runs like reference code. Added comm…
nmdefries Sep 23, 2020
7f6a919
Validator now uses a Validator class to allow validation to complete …
nmdefries Sep 24, 2020
a7e1d11
Print error messages one per line
nmdefries Sep 25, 2020
2f1f743
[wip] Added some docstrings
nmdefries Sep 28, 2020
cb374ca
[wip] Added docstrings to validate.py functions
nmdefries Sep 29, 2020
a2a00f1
[wip] Added docstrings and comments to datafetcher.py
nmdefries Sep 29, 2020
23d39bc
pylint recommended improvements
nmdefries Sep 29, 2020
517b6d9
Formatting improvements
nmdefries Sep 29, 2020
c5cce71
Updated README and template params
nmdefries Sep 29, 2020
37d93af
Updated README with new check info
nmdefries Sep 30, 2020
32fb967
Added unittesting info in README and directory
nmdefries Sep 30, 2020
2834df7
validate() comparison checks now match reference code
nmdefries Oct 1, 2020
ddb13c3
Removed unused datafetch functions. Added more identifying info to er…
nmdefries Oct 2, 2020
72e4b50
add current and planned features doc
nmdefries Oct 2, 2020
9950104
Added uniquely identifying check ids in standard format. Plan to use …
nmdefries Oct 3, 2020
ca0697f
Added API fetch and formatting function. Fixed few small bugs
nmdefries Oct 5, 2020
5942cbb
Moved params definitions to Validator init as attributes
nmdefries Oct 5, 2020
0f6c443
Added manual error suppression feature. Error messaging reports numbe…
nmdefries Oct 5, 2020
9d77795
Propogate attribute usage throughout class. Update documentation
nmdefries Oct 5, 2020
79e80ba
Linting improvements
nmdefries Oct 5, 2020
bfb3408
README, PLANS, tests updates. Moved ValidationError to errors.py
nmdefries Oct 8, 2020
d2162c6
unit tests
nmdefries Oct 9, 2020
93ff163
[wip] more tests. regex for catching wrong geo_ids is not sensitive e…
nmdefries Oct 9, 2020
9a0e8dc
sample size and se tests
nmdefries Oct 9, 2020
040a98e
rapid change and avg value comparison tests
nmdefries Oct 9, 2020
ab3586e
Fixed geo_id regex problems
nmdefries Oct 9, 2020
9c2f17d
Finished tests. Linting updates. Moved TODOs to plans.md document
nmdefries Oct 9, 2020
e6a2242
Updated params template
nmdefries Oct 9, 2020
c5a99cc
Flag kroc counter as temporary. Should be removed when production-ready
nmdefries Oct 10, 2020
919cedf
Added params.json flag to toggle test mode on and off. Test mode limi…
nmdefries Oct 10, 2020
175f229
plans.md update
nmdefries Oct 10, 2020
e8c7606
params.json update
nmdefries Oct 10, 2020
6357e66
Remove fb-survey "community" sieve.
nmdefries Oct 12, 2020
12e1318
Update regex format for national geo regions
nmdefries Oct 12, 2020
2478681
Added span_length setting and "latest" value for end_date to allow ea…
nmdefries Oct 13, 2020
caac16e
Update __main__.py
nmdefries Oct 13, 2020
4a7951e
Update __init__.py
nmdefries Oct 13, 2020
05c6b07
update params template
nmdefries Oct 14, 2020
42d6c6b
Added user settings to specify names of all smoothed signals for sele…
nmdefries Oct 14, 2020
37ad56b
Update plans doc. Add info about user settings to readme
nmdefries Oct 14, 2020
8027a19
updated tests
nmdefries Oct 15, 2020
0f0938b
plan updates
nmdefries Oct 15, 2020
15463c3
Lengthen abbreviated obj names. Print total number of tests run
nmdefries Oct 20, 2020
1b5deac
update plans
nmdefries Oct 20, 2020
00014e6
fix bug causing key error when geo-sig combo file is not found
nmdefries Oct 20, 2020
1d0a7a2
update data fetch error message
nmdefries Oct 20, 2020
41cb38f
update plans
nmdefries Oct 21, 2020
070d7c2
update plans
nmdefries Oct 21, 2020
c93d0b3
update plans
nmdefries Oct 21, 2020
1822e05
update plans
nmdefries Oct 21, 2020
12405a3
Update PLANS.md
Ananya-Joshi Oct 21, 2020
b4ca162
apply fix for check_avg_val index bug
nmdefries Oct 21, 2020
4410b99
update fix
nmdefries Oct 21, 2020
74a80ec
fix bug in check_avg_val
nmdefries Oct 21, 2020
2da1616
update state and national regex to allow capital or lowercase abbrevi…
nmdefries Oct 22, 2020
3ae75a1
Update PLANS.md
Ananya-Joshi Oct 22, 2020
4af0b1d
add raised_warnings (non-upload blocking) attribute and print output.…
nmdefries Oct 22, 2020
4bf5b3f
update plans
nmdefries Oct 22, 2020
02b73b6
Update PLANS.md
Ananya-Joshi Oct 22, 2020
f926079
check_avg_val fix
nmdefries Oct 22, 2020
c09c5a0
formatting
nmdefries Oct 22, 2020
014335f
update plans
nmdefries Oct 23, 2020
cfbda43
update plans
nmdefries Oct 23, 2020
183b130
add dma to geo type options
nmdefries Oct 23, 2020
9d4e008
allow error messages to include variable values
nmdefries Oct 23, 2020
b2cdc95
update plans
nmdefries Oct 23, 2020
7d47001
update plans
nmdefries Oct 23, 2020
3d27e29
update plans
nmdefries Oct 23, 2020
0fc164f
catch APIDataFetchError and add to list of errors, aborting the compa…
nmdefries Oct 23, 2020
03df082
update plans
nmdefries Oct 23, 2020
cd367fb
update plans. Import new error type
nmdefries Oct 26, 2020
40d7f63
Update PLANS.md
Ananya-Joshi Oct 26, 2020
cee585b
add try block toprint useful info if ZeroDivsionError is raised. chan…
nmdefries Oct 26, 2020
b3a2b9c
remove pdb import
nmdefries Oct 26, 2020
1bfd508
update plans
nmdefries Oct 27, 2020
b0756af
update plans
nmdefries Oct 27, 2020
02d0e24
update plans
nmdefries Oct 28, 2020
396b76d
fix missing leading zeroes in numeric geo_id if previously saved as f…
nmdefries Oct 30, 2020
dd4c76d
update plans. Fix zfill KeyError
nmdefries Oct 30, 2020
fd99f22
combine API calls. update tets and plans
nmdefries Oct 30, 2020
be920b1
thread API calls. update plans
nmdefries Nov 2, 2020
213fa60
use semaphore to improve API threading. Fix error in reference data s…
nmdefries Nov 2, 2020
6622bee
replace read_geo_signal_combo_files() by querying all-frames. Remove …
nmdefries Nov 2, 2020
5d8f220
First draft of geo_id validator
JedGrabman Nov 2, 2020
f2562e6
Fix typo (geo_value -> geo_id)
JedGrabman Nov 2, 2020
ac28c33
Minor fixes
JedGrabman Nov 3, 2020
0ea0a79
Allow setting static directory in template
JedGrabman Nov 3, 2020
69881a4
rename directory (dev -> scripts)
JedGrabman Nov 3, 2020
9ac11d8
update unittests with function name changes
nmdefries Nov 3, 2020
4bbe787
update plans
nmdefries Nov 4, 2020
e29e8a7
update plans
nmdefries Nov 4, 2020
06ba36c
Pass through filename for error message
JedGrabman Nov 4, 2020
373af74
update plans. Create starter-issue section
nmdefries Nov 4, 2020
61b0b99
update plans
nmdefries Nov 4, 2020
82feea8
update plans
nmdefries Nov 5, 2020
8984aac
update plans
nmdefries Nov 5, 2020
538d98e
Add file for national values
JedGrabman Nov 5, 2020
45046cb
Allow uppercase version of geo_id, but with warning
JedGrabman Nov 5, 2020
ad32bd3
Add tests for geo_id value checks
JedGrabman Nov 5, 2020
42ad65e
Add small error-message changes to tests
nmdefries Nov 5, 2020
728a060
Merge pull request #426 from JedGrabman/jedgrabman/geo_id_validation
nmdefries Nov 5, 2020
fb15373
fix tests
nmdefries Nov 6, 2020
22e88e2
Make geo_id checks robust
JedGrabman Nov 9, 2020
eb737f4
Merge pull request #470 from JedGrabman/jedgrabman/geo_id_validation
nmdefries Nov 9, 2020
b700138
update plans with feedback from modeling meeting
nmdefries Nov 9, 2020
af05d99
Merge branch 'main' into fb-package-validation
nmdefries Nov 18, 2020
e2c86a3
update plans, small clean-up items
nmdefries Nov 18, 2020
89 changes: 89 additions & 0 deletions validator/PLANS.md
@@ -0,0 +1,89 @@
# Validator checks and features

## Current checks for indicator source data
Review comment (Contributor):

Just curious -- have you tried running this on existing imports already?

Specifically I'm wondering if you have a sense of how often this generates false positives (if any) so that the aggressiveness of the validations can be tuned.

Reply (@nmdefries, Oct 20, 2020):

Yes! I have run this on the USAFacts indicators.

The old Facebook-specific validator had false positives from the check comparing average values of source data vs of the API. I increased the static threshold used in this check to require a larger difference in averages to trigger the error, and haven't yet had problems with it being too sensitive.

I do often see the "val column can't have any cell smaller than 0" error. Although we expect to see occasional negative values, used to adjust for previous overreporting, it turns out they are fairly common. These aren't false positives per se, but perhaps given the frequency of the error, we would only want to report it if there were a very large negative value or many small negative values.

The majority of the current checks look at formatting or very reasonable rule-based cutoffs (percents can't be >100, e.g.) so there shouldn't be a lot of tuning needed.

Reply (Contributor):

I'm going to put @chinandrew on surveying the other signals.


* Missing dates within the selected range
* Recognized file name format
* Recognized geographical type (county, state, etc)
* Recognized geo id format (e.g. state is two lowercase letters)
* Specific geo id has been seen before, in historical data
* Missing geo type + signal + date combos based on the geo type + signal combos Covidcast metadata says should be available
* Missing ‘val’ values
* Negative ‘val’ values
* Out-of-range ‘val’ values (>= 0 for all signals, <= 100 for percents, <= 100 000 for proportions)
* Missing ‘se’ values
* Appropriate ‘se’ values, within a calculated reasonable range
* Stderr != 0
* If signal and stderr both = 0 (seen in Quidel data due to lack of Jeffreys correction, [issue 255](https://github.com/cmu-delphi/covidcast-indicators/issues/255#issuecomment-692196541))
* Missing ‘sample_size’ values
* Appropriate ‘sample_size’ values, ≥ 100 (default) or user-defined threshold
* Most recent date seen in source data is recent enough, < 1 day ago (default) or user-defined on a per-signal basis
* Most recent date seen in source data is not in the future
* Most recent date seen in source data is not older than most recent date seen in reference data
* Similar number of obs per day as recent API data (static threshold)
* Similar average value as API data (static threshold)
* Source data for specified date range is empty
* API data for specified date range is empty
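Many of the checks above are simple rule-based cutoffs. As a hedged illustration of the range rules on the `val` column (the function name, messages, and the substring heuristic for detecting percent/proportion signals are invented for this sketch, not the validator's actual code):

```python
import pandas as pd

def check_val_in_range(df, signal_type):
    """Return error strings for bad 'val' cells, per the rules above:
    val must be >= 0 for all signals, <= 100 for percents, and
    <= 100000 for proportions.

    Hypothetical helper; the substring test on signal_type is a
    simplification for illustration only.
    """
    errors = []
    if df["val"].isnull().any():
        errors.append("val column has missing values")
    if (df["val"] < 0).any():
        errors.append("val column has negative values")
    if "pct" in signal_type and (df["val"] > 100).any():
        errors.append("percent val column has values > 100")
    if "prop" in signal_type and (df["val"] > 100000).any():
        errors.append("proportion val column has values > 100000")
    return errors
```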


## Current features

* Errors and warnings are summarized in class attribute and printed on exit
* If any non-suppressed errors are raised, the validation process exits with non-zero status
* Various check settings are controllable via indicator-specific params.json files
* User can manually disable specific checks for specific datasets using a field in the params.json file
* User can enable test mode (checks only a small number of data files) using a field in the params.json file

## Checks + features wishlist, and problems to think about

### Starter/small issues

* Check for duplicate rows
* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, check whether any values have changed significantly within the "backfill" window (using the span_length setting). See [this](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
* Run check_missing_date_files (or similar) on every geo type-signal type separately in comparative checks loop.
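The duplicate-row starter issue above is small enough to sketch directly (a standalone illustration with invented names, not the real API, which appends to a Validator instance):

```python
import pandas as pd

def check_duplicate_rows(df, filename):
    """Return warning strings if any fully duplicated rows exist.

    Sketch of the starter check suggested above; message format
    and function name are hypothetical.
    """
    dup_mask = df.duplicated(keep="first")
    if dup_mask.any():
        return [f"{filename}: {dup_mask.sum()} duplicate row(s) found"]
    return []
```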

### Larger issues

* Expand framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data. E.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
* Make backtesting framework so new checks can be run individually on historical indicator data to tune false positives, output verbosity, understand frequency of error raising, etc. Should pull data from API the first time and save locally in `cache` dir.
* Add DETAILS.md doc with detailed descriptions of what each check does and how. Will be especially important for statistical/anomaly detection checks.
* Improve errors and error report
* Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
* Easier suppression of many errors at once
* Maybe store errors as dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then next layer geo type, etc
* Nicer formatting for error “report”.
* E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
* Check for erratic data sources that wrongly report all zeroes
* E.g. the error with the Wisconsin data for the 10/26 forecasts
* Wary of a purely static check for this
* Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
* This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
* Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
* Outlier detection (in progress)
* Current approach is tuned to daily cases and daily deaths; use just on those signals?
* prophet (package) detection is flexible, but needs 2-3 months historical data to fit on. May make sense to use if other statistical checks also need that much data.
* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
* If can't get data from API, do we want to use substitute data for the comparative checks instead?
* E.g. most recent successful API pull -- might end up being a couple weeks older
* Currently, any API fetch problem means the comparative checks are skipped entirely.
* Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
* Profiling (iterate)
* Save intermediate files?
* Currently a bottleneck at "individual file checks" section. Parallelize?
* Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
* Ensure validator runs on signals that require AWS credentials (iterate)

### Longer-term issues

* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
* Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
* Any relevant anomaly detection packages already exist?
* What sorts of hypothesis tests to use? See [time series trend analysis](https://www.genasis.cz/time-series/index.php?pg=home--trend-analysis).
* See data-quality GitHub issues, Ryan’s [correlation notebook](https://github.com/cmu-delphi/covidcast/tree/main/R-notebooks), and Dmitry's [indicator validation notebook](https://github.com/cmu-delphi/covidcast-indicators/blob/deploy-jhu/testing_utils/indicator_validation.template.ipynb) for ideas
* E.g. Doctor visits decreasing correlation with cases
* E.g. WY/RI missing or very low compared to historical
* Use hypothesis testing p-values to decide when to raise error or not, instead of static thresholds. Many low but non-significant p-values will also raise error. See [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1601307675021000?thread_ts=1600277030.103500&cid=CV1SYBC90) and [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1600978037007500?thread_ts=1600277030.103500&cid=CV1SYBC90) for more background.
* Order raised exceptions by p-value
* Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
* Correct p-values for multiple testing
* Bonferroni would be easy but is sensitive to choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
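For reference, the Benjamini-Hochberg step-up procedure mentioned above can be sketched in a few lines (a standalone illustration, not part of the validator codebase):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean list: True where the null is rejected.

    Step-up procedure: find the largest rank k such that
    p_(k) <= (k/m) * alpha, then reject the k smallest p-values.
    """
    m = len(pvalues)
    # Sort p-values ascending, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject
```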
112 changes: 112 additions & 0 deletions validator/README.md
@@ -0,0 +1,112 @@
# Validator

The validator performs two main tasks:
Review comment (Contributor):

Thanks for writing such thorough documentation. How are you planning on integrating this with the rest of the system -- will this be added as an extra automation step?

Reply (Contributor):

Yes. The intended pipeline is:

  1. Run indicator
  2. Run validator <- this package
  3. Run archiver <- completed but not yet integrated

1) Sanity checks on daily data generated from the pipeline of a specific data
source.
2) Comparative analysis with recent data from the API
to detect any anomalies, such as spikes or significant value differences.

The validator validates new source data in CSV format against data pulled from the [COVIDcast API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html).


## Running the Validator

The validator is run by executing the Python module contained in this
directory from the main directory of the indicator of interest.

The safest way to do this is to create a virtual environment,
install the common DELPHI tools, install the indicator module and its
dependencies, and then install the validator module and its
dependencies to the virtual environment.

To do this, navigate to the main directory of the indicator of interest and run the following code:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
pip install ../validator
```

To execute the module and validate source data (by default, in `receiving`), run the indicator to generate data files, then run
the validator, as follows:

```
env/bin/python -m delphi_INDICATORNAME
env/bin/python -m delphi_validator
```

Once you are finished with the code, you can deactivate the virtual environment
and (optionally) remove the environment itself.

```
deactivate
rm -r env
```

### Customization

All of the user-changeable parameters are stored in the `validation` field of the indicator's `params.json` file. If `params.json` does not already include a `validation` field, please copy the one provided in this module's `params.json.template`.

Please update the following settings:

* `data_source`: should match the [formatting](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html) as used in COVIDcast API calls
* `end_date`: specifies the last date to be checked; if set to "latest", `end_date` will always be the current date
* `span_length`: specifies the number of days before the `end_date` to check. `span_length` should be long enough to contain all recent source data that is still in the process of being updated (i.e. in the backfill period). For example, if the data source of interest has a 2-week lag before all reports are in for a given date, `span_length` should be 14 days
* `smoothed_signals`: list of the names of the signals that are smoothed (e.g. 7-day average)
* `expected_lag`: dictionary of signal name-int pairs specifying the number of days of expected lag (time between event occurrence and when data about that event was published) for that signal
* `test_mode`: boolean; `true` checks only a small number of data files
* `suppressed_errors`: list of lists uniquely specifying errors that have been manually verified as false positives or acceptable deviations from expected values

All other fields contain working defaults, to be modified as needed.
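A hypothetical `validation` field combining the settings above might look like the following (the signal names, lag values, and source name are illustrative only, not taken from a real indicator's configuration):

```json
{
  "validation": {
    "data_source": "usa-facts",
    "end_date": "latest",
    "span_length": 14,
    "smoothed_signals": ["confirmed_7dav_cumulative_num"],
    "expected_lag": {"confirmed_cumulative_num": 1},
    "test_mode": false,
    "suppressed_errors": []
  }
}
```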

## Testing the code

To test the code, please create a new virtual environment in the main module directory using the following procedure, similar to above:

```
python -m venv env
source env/bin/activate
pip install ../_delphi_utils_python/.
pip install .
```

To do a static test of the code style, it is recommended to run **pylint** on
the module. To do this, run the following from the main module directory:

```
env/bin/pylint delphi_validator
```

The most aggressive checks are turned off; only relatively important issues
should be raised and they should be manually checked (or better, fixed).

Unit tests are also included in the module. To execute these, run the following command from this directory:

```
(cd tests && ../env/bin/pytest --cov=delphi_validator --cov-report=term-missing)
```

The output will show the number of unit tests that passed and failed, along with the percentage of code covered by the tests. None of the tests should fail, and the code lines not covered by unit tests should be few and should not include critical sub-routines.


## Code tour

* run.py: passes params.json fields to the validation process and runs it
* datafetcher.py: methods for loading source and API data
* validate.py: methods for validating data. Includes the individual check methods and supporting functions.
* errors.py: custom errors


## Adding checks

To add a new validation check, define the check as a `Validator` class method in `validate.py`. Each check should append a descriptive error message to the `raised` attribute if triggered. All checks should allow the user to override exception raising for a specific file using the `suppressed_errors` setting in `params.json`.

This feature requires that the `check_data_id` defined for an error uniquely identifies that combination of check and test data. This usually takes the form of a tuple of strings: the check method name plus identifiers for the test data, such as the filename, or the date, geo type, and signal name.

Add the newly defined check to the `validate()` method to be executed. It should go in one of three sections:

* data sanity checks where a data file is compared against static format settings,
* data trend and value checks where a set of source data (can be one or several days) is compared against recent API data, from the previous few days,
* data trend and value checks where a set of source data is compared against long term API data, from the last few months
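Putting the pattern above together, a minimal sketch of a check method might look like the following (the class shown is a stand-in with invented names; the real `Validator` class has many more attributes and checks):

```python
import datetime as dt

class Validator:
    """Minimal stand-in showing the check pattern described above."""

    def __init__(self, suppressed_errors=None):
        # suppressed_errors entries uniquely identify check + test data.
        self.suppressed_errors = set(suppressed_errors or [])
        self.raised = []

    def check_future_date(self, max_date, filename):
        """Append an error if the most recent date is in the future."""
        check_data_id = ("check_future_date", filename)
        if max_date > dt.date.today():
            if check_data_id in self.suppressed_errors:
                return  # user has verified this as acceptable
            self.raised.append(
                f"{check_data_id}: most recent date {max_date} is in the future")
```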
37 changes: 37 additions & 0 deletions validator/REVIEW.md
@@ -0,0 +1,37 @@
## Code Review (Python)

A code review of this module should include a careful look at the code and the
output. To assist in the process, but certainly not to replace it, please
check the following items.

**Documentation**

- [ ] the README.md file template is filled out and currently accurate; it is
possible to load and test the code using only the instructions given
- [ ] minimal docstrings (one line describing what the function does) are
included for all functions; full docstrings describing the inputs and expected
outputs should be given for non-trivial functions

**Structure**

- [ ] code should use 4 spaces for indentation; other style decisions are
flexible, but be consistent within a module
- [ ] any required metadata files are checked into the repository and placed
within the directory `static`
- [ ] any intermediate files that are created and stored by the module should
be placed in the directory `cache`
- [ ] all options and API keys are passed through the file `params.json`
- [ ] template parameter file (`params.json.template`) is checked into the
code; no personal (i.e., usernames) or private (i.e., API keys) information is
included in this template file

**Testing**

- [ ] module can be installed in a new virtual environment
- [ ] pylint with the default `.pylint` settings run over the module produces
minimal warnings; warnings that do exist have been confirmed as false positives
- [ ] reasonably high level of unit test coverage covering all of the main logic
of the code (e.g., missing coverage for raised errors that do not currently seem
possible to reach are okay; missing coverage for options that will be needed are
not)
- [ ] all unit tests run without errors
13 changes: 13 additions & 0 deletions validator/delphi_validator/__init__.py
@@ -0,0 +1,13 @@
# -*- coding: utf-8 -*-
"""Module to validate indicator source data before uploading to the public COVIDcast API.

This file defines the functions that are made public by the module. As the
module is intended to be executed through the main method, these are primarily
for testing.
"""

from __future__ import absolute_import

from . import run

__version__ = "0.1.0"
10 changes: 10 additions & 0 deletions validator/delphi_validator/__main__.py
@@ -0,0 +1,10 @@
# -*- coding: utf-8 -*-
"""Call the function run_module when executed.

This file indicates that running the module (`python -m delphi_validator`) will
call the function `run_module` found within the run.py file.
"""

from .run import run_module

run_module()
97 changes: 97 additions & 0 deletions validator/delphi_validator/datafetcher.py
@@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
"""
Functions to get CSV filenames and data.
"""

import re
from os import listdir
from os.path import isfile, join
from itertools import product
import pandas as pd
import numpy as np

import covidcast
from .errors import APIDataFetchError

filename_regex = re.compile(
r'^(?P<date>\d{8})_(?P<geo_type>\w+?)_(?P<signal>\w+)\.csv$')
Review comment (Contributor):

Not for this change necessarily -- but is this definition shared across all of our code? Would it make sense to extract this to a library somewhere instead?

Reply (Contributor):

Or maybe the validator should accept the regex as a parameter?

Reply (Contributor):

Shared with epidata; there will be a lot of things duplicated between this package and the validation done by the epidata acquisition code. We should pull it out into a library as soon as we have a private package repo up and running.



def read_filenames(path):
"""
Return a list of tuples of every filename and regex match to the CSV filename
format in the specified directory.

Arguments:
- path: path to the directory containing CSV data files.

Returns:
- list of tuples
"""
daily_filenames = [(f, filename_regex.match(f))
for f in listdir(path) if isfile(join(path, f))]
return daily_filenames


def load_csv(path):
"""
Load CSV with specified column types.
"""
return pd.read_csv(
path,
dtype={
'geo_id': str,
'val': float,
'se': float,
'sample_size': float,
})


def get_geo_signal_combos(data_source):
"""
Get list of geo type-signal type combinations that we expect to see, based on
combinations reported available by COVIDcast metadata.
"""
meta = covidcast.metadata()
source_meta = meta[meta['data_source'] == data_source]
unique_signals = source_meta['signal'].unique().tolist()
unique_geotypes = source_meta['geo_type'].unique().tolist()

geo_signal_combos = list(product(unique_geotypes, unique_signals))
print("Number of expected geo region-signal combinations:",
len(geo_signal_combos))

return geo_signal_combos


def fetch_api_reference(data_source, start_date, end_date, geo_type, signal_type):
"""
Get and process API data for use as a reference. Formatting is changed
to match that of source data CSVs.
"""
api_df = covidcast.signal(
data_source, signal_type, start_date, end_date, geo_type)

if not isinstance(api_df, pd.DataFrame):
custom_msg = "Error fetching data from " + str(start_date) + \
" to " + str(end_date) + \
"for data source: " + data_source + \
", signal type: " + signal_type + \
", geo type: " + geo_type

raise APIDataFetchError(custom_msg)

column_names = ["geo_id", "val",
"se", "sample_size", "time_value"]

# Replace None with NA to make numerical manipulation easier.
# Rename and reorder columns to match those in df_to_test.
api_df = api_df.replace(
to_replace=[None], value=np.nan
).rename(
columns={'geo_value': "geo_id", 'stderr': 'se', 'value': 'val'}
).drop(
['direction', 'issue', 'lag'], axis=1
).reindex(columns=column_names)

return api_df
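As a usage sketch of the filename pattern discussed in the review thread above, the regex can be exercised standalone (the pattern is reproduced from datafetcher.py; the helper function and sample filenames are invented for the demo):

```python
import re

# Same pattern as in datafetcher.py, reproduced for a standalone demo.
filename_regex = re.compile(
    r'^(?P<date>\d{8})_(?P<geo_type>\w+?)_(?P<signal>\w+)\.csv$')

def parse_filename(filename):
    """Return (date, geo_type, signal) or None if the name doesn't match."""
    match = filename_regex.match(filename)
    if match is None:
        return None
    return match.group("date"), match.group("geo_type"), match.group("signal")
```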