
Port the Facebook validation pipeline to be generic and automatable #59

Closed
2 of 4 tasks
capnrefsmmat opened this issue Jun 1, 2020 · 6 comments · Fixed by #155
Labels: CTIS (Improvements and reporting for CTIS)

Comments

@capnrefsmmat
Contributor

capnrefsmmat commented Jun 1, 2020

Currently, covid-19/facebook/prepare-extracts/covidalert-io-funs.R contains a validation pipeline for Facebook. As I understand it, it does the following checks when we prepare a new day of data to upload.

  1. Ensure that the old data in the API matches the newly generated old data; that is, if it's currently June 1, make sure we didn't unexpectedly change data for May 25th.
  2. Do sanity checks on the new data: the geography types are reasonable, the geo_ids have the right format, the values and SEs are in the correct range, sample sizes are present, dates aren't missing, etc.
  3. Verify the number of geographical regions reporting hasn't suddenly changed.
  4. Verify the average variable values haven't suddenly changed.
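
A rough sketch of how checks 3 and 4 might look, written in Python rather than the R of the existing script. The function name `sudden_change` and the 50% relative threshold are assumptions for illustration only; the existing pipeline may define "suddenly changed" differently.

```python
from statistics import mean


def sudden_change(history, latest, max_rel_change=0.5):
    """Return True if `latest` is far from the mean of `history`.

    `history` is a non-empty list of recent daily figures (for example the
    number of regions reporting, or the average signal value), and `latest`
    is today's figure. `max_rel_change` is a hypothetical threshold.
    """
    baseline = mean(history)
    if baseline == 0:
        # No meaningful relative change; flag any nonzero value.
        return latest != 0
    return abs(latest - baseline) / abs(baseline) > max_rel_change
```

For example, `sudden_change([100, 102, 98], 40)` flags the day, since 40 regions is 60% below the recent mean of 100, while `sudden_change([100, 102, 98], 101)` does not.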

Many of these checks can be made generic to multiple data sources and applied to our new pipeline. This would require us to:

  • adapt the script to work on Taylor's directory structure where data files are placed
  • provide for configuration files that specify the checks that apply to each data source
  • adapt the script to report all errors, rather than dying on the first one
  • make it easy to run automatically for each data source as part of its automation job
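
As an illustration of the second and third points, here is a minimal Python sketch of config-driven checks that collect every failure instead of stopping at the first. The column names (`geo_id`, `val`, `se`, `sample_size`) and the config keys (`geo_id_pattern`, `min_val`, `max_val`) are assumptions, not an agreed format.

```python
import re


def validate_rows(rows, config):
    """Run every check on every row and return a list of error strings."""
    errors = []
    geo_re = re.compile(config["geo_id_pattern"])
    for i, row in enumerate(rows):
        geo_id = str(row.get("geo_id", ""))
        if not geo_re.fullmatch(geo_id):
            errors.append(f"row {i}: malformed geo_id {geo_id!r}")

        val = row.get("val")
        if val is None or not config["min_val"] <= val <= config["max_val"]:
            errors.append(
                f"row {i}: val {val!r} outside "
                f"[{config['min_val']}, {config['max_val']}]"
            )

        se = row.get("se")
        if se is not None and se < 0:
            errors.append(f"row {i}: negative se {se!r}")

        if row.get("sample_size") is None:
            errors.append(f"row {i}: missing sample_size")
    # An empty list means every row passed every check.
    return errors
```

Because the function only appends to `errors` and never raises on bad data, a single run surfaces all problems in a file, and a different data source only needs a different `config`.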
@capnrefsmmat capnrefsmmat self-assigned this Jun 1, 2020
@capnrefsmmat
Contributor Author

I've begun a validator branch that will automate validating new data for one data source.

Only a very minimal skeleton is present; I have not ported the validation code, nor have I even run it.

My plan is to have a script that takes two arguments: the name of the data source (e.g. "fb-survey") and the date we want to validate. The params.json file will contain a set of parameters for each data source, such as the minimum and maximum allowable values, and these will be used to check the latest data in receiving for that source, as well as to verify that the data matches the API and the other checks we want to do.

The script will log any errors in a concise output format and then exit with nonzero status if there were any failures, to make it easy to plug into an automation pipeline.
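
A minimal sketch of that command-line shape, in Python, and not the code on the branch. The `--params` flag, the `min_val`/`max_val` keys, the exit codes, and the `load_values` hook (a stand-in for reading a source's files from receiving) are all assumptions made for illustration.

```python
import argparse
import json
import sys
from datetime import date


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Validate one day of one data source.")
    parser.add_argument("source", help='data source name, e.g. "fb-survey"')
    parser.add_argument("date", type=date.fromisoformat, help="day to validate, YYYY-MM-DD")
    parser.add_argument("--params", default="params.json", help="path to the parameter file")
    return parser.parse_args(argv)


def check_values(values, source_params):
    """Return one error string per value outside the configured range."""
    lo, hi = source_params["min_val"], source_params["max_val"]
    return [f"value {v} outside [{lo}, {hi}]" for v in values if not lo <= v <= hi]


def main(argv, load_values=lambda source, day: []):
    """Return 0 if all checks pass, 1 on validation failures, 2 on bad setup."""
    args = parse_args(argv)
    with open(args.params) as f:
        params = json.load(f)
    if args.source not in params:
        print(f"no parameters for source {args.source!r}", file=sys.stderr)
        return 2
    errors = check_values(load_values(args.source, args.date), params[args.source])
    for error in errors:
        # One concise line per failure, easy to grep in automation logs.
        print(f"{args.source} {args.date}: {error}", file=sys.stderr)
    return 1 if errors else 0
```

A script entry point would then call `sys.exit(main(sys.argv[1:]))`, so an automation job can branch on the exit status alone.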

@capnrefsmmat capnrefsmmat added the CTIS Improvements and reporting for CTIS label Jul 5, 2020
@krivard
Contributor

krivard commented Jul 7, 2020

Complications with architecture: validation needs to run before the CSV files are written, so that it has access to the day and other metadata.

Complications with checking against the real data.

@krivard
Contributor

krivard commented Jul 21, 2020

Candidate is up in #155

@krivard
Contributor

krivard commented Sep 9, 2020

Whack-a-mole to get the thing to run, using usafacts output as a test case.

effective_sample_size is not a thing anymore; either use sample_size or just drop the constraint.

@krivard
Contributor

krivard commented Sep 16, 2020

Runs! Now checking each criterion against the reference codebase, and building unit tests.

Developing a better procedure for reporting data quality issues: label "data quality"; issue template PR incoming.

@krivard
Contributor

krivard commented Oct 7, 2020

Finished handling for known anomalies. Cleaning up remaining TODOs, expect to flag for review end of week.

4 participants