Port the Facebook validation pipeline to be generic and automatable #59

capnrefsmmat · 2020-06-01T14:49:43Z

Currently, covid-19/facebook/prepare-extracts/covidalert-io-funs.R contains a validation pipeline for Facebook. As I understand it, it runs the following checks when we prepare a new day of data to upload:

Ensure that the old data in the API matches the newly generated old data; that is, if it's currently June 1, make sure we didn't unexpectedly change data for May 25th.
Run sanity checks on the new data: geography types are reasonable, geo_ids have the right format, values and SEs are in the correct range, sample sizes are present, no dates are missing, etc.
Verify the number of geographical regions reporting hasn't suddenly changed.
Verify the average variable values haven't suddenly changed.
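
For illustration, a few of the checks above could look roughly like this in Python. This is only a sketch and not the code in covidalert-io-funs.R: the geo_id patterns, the column names (`val`, `se`, `sample_size`), the default value range, and the 20% threshold are all assumptions made for the example.

```python
import re
from datetime import date, timedelta

# Hypothetical geo_id formats: 5-digit county FIPS codes, 2-letter state codes.
GEO_ID_PATTERNS = {
    "county": re.compile(r"^\d{5}$"),
    "state": re.compile(r"^[a-z]{2}$"),
}


def check_row(geo_type, row, min_val=0.0, max_val=100.0):
    """Return a list of problems found in one row (empty if the row is fine)."""
    problems = []
    pattern = GEO_ID_PATTERNS.get(geo_type)
    geo_id = str(row.get("geo_id", ""))
    if pattern is None:
        problems.append(f"unknown geo type {geo_type!r}")
    elif not pattern.match(geo_id):
        problems.append(f"bad {geo_type} geo_id {geo_id!r}")
    val = row.get("val")
    if val is None or not min_val <= val <= max_val:
        problems.append(f"val {val!r} outside [{min_val}, {max_val}]")
    se = row.get("se")
    if se is not None and se < 0:
        problems.append(f"negative se {se!r}")
    if row.get("sample_size") is None:
        problems.append("missing sample_size")
    return problems


def missing_dates(dates_present, start, end):
    """Return every date in [start, end] that has no data."""
    n_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in expected if d not in dates_present]


def region_count_jumped(previous_count, current_count, max_rel_change=0.2):
    """Flag a sudden change in the number of regions reporting."""
    if previous_count == 0:
        return current_count != 0
    return abs(current_count - previous_count) / previous_count > max_rel_change


# Example inputs (made up): a truncated county code and a gap on June 2.
print(check_row("county", {"geo_id": "4200", "val": 12.5, "se": 0.4, "sample_size": None}))
print(missing_dates({date(2020, 6, 1), date(2020, 6, 3)}, date(2020, 6, 1), date(2020, 6, 3)))
```

The same relative-change test could be reused for the average variable values, with a threshold chosen per signal.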

Many of these checks can be made generic to multiple data sources and applied to our new pipeline. This would require us to:

adapt the script to work on Taylor's directory structure where data files are placed
provide for configuration files that specify the checks that apply to each data source
adapt the script to report all errors, rather than dying on the first one
make it easy to run automatically for each data source as part of its automation job
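
One possible shape for the configuration and the report-all-errors behavior is sketched below. The configuration keys, check names, and sample rows are hypothetical and only meant to show the idea: each data source lists the checks that apply to it, and every check appends to a shared error list instead of stopping at the first failure.

```python
import json

# Hypothetical per-source configuration; keys and check names are illustrative.
CONFIG_TEXT = """
{
  "fb-survey": {"checks": ["value_range", "sample_size_present"],
                "min_val": 0, "max_val": 100},
  "usafacts":  {"checks": ["value_range"], "min_val": 0, "max_val": 1000000}
}
"""


def value_range(rows, cfg):
    """Report every row whose value falls outside the configured range."""
    return [f"{r['geo_id']}: val {r['val']} out of range"
            for r in rows if not cfg["min_val"] <= r["val"] <= cfg["max_val"]]


def sample_size_present(rows, cfg):
    """Report every row that has no sample size."""
    return [f"{r['geo_id']}: missing sample_size"
            for r in rows if r.get("sample_size") is None]


# Registry mapping the names used in the configuration to check functions.
CHECKS = {"value_range": value_range, "sample_size_present": sample_size_present}


def run_checks(source, rows, config):
    """Run every check configured for `source`, collecting all errors."""
    errors = []
    for name in config[source]["checks"]:
        check = CHECKS.get(name)
        if check is None:
            errors.append(f"unknown check {name!r}")
            continue
        errors.extend(check(rows, config[source]))
    return errors


config = json.loads(CONFIG_TEXT)
rows = [{"geo_id": "42003", "val": 150.0, "sample_size": None},
        {"geo_id": "42005", "val": 20.0, "sample_size": 80.0}]
fb_errors = run_checks("fb-survey", rows, config)
print(fb_errors)
```

Keeping the checks in a name-to-function registry means a new data source only needs a new configuration entry, not new code, as long as the checks it needs already exist.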


capnrefsmmat · 2020-06-11T21:14:21Z

I've begun a validator branch that will automate validating new data for one data source.

Only a very minimal skeleton is present; I have not ported the validation code, nor have I even run it.

My plan is to have a script that takes two arguments: the name of the data source (e.g. "fb-survey") and the date we want to validate. The params.json file will contain a set of parameters for each data source, such as the minimum and maximum allowable values. These will be used to check the latest data in receiving for that source, to verify that the data matches the API, and to run the other checks we want.
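
As a rough sketch of that interface (not what is on the branch): the argument names, the `--params` option, the layout of params.json, and the `mismatches` helper are all assumptions for illustration. The comparison takes plain dictionaries so that it does not depend on any particular API client.

```python
import argparse
import json
import math
from datetime import date


def parse_args(argv):
    """Parse the two positional arguments: data source name and date."""
    parser = argparse.ArgumentParser(description="Validate one day of data.")
    parser.add_argument("source", help='data source name, e.g. "fb-survey"')
    parser.add_argument("day", type=date.fromisoformat,
                        help="date to validate, YYYY-MM-DD")
    parser.add_argument("--params", default="params.json",
                        help="path to the parameter file")
    return parser.parse_args(argv)


def params_for(params_text, source):
    """Pick one source's parameters out of the params.json contents."""
    all_params = json.loads(params_text)
    if source not in all_params:
        raise KeyError(f"no parameters for data source {source!r}")
    return all_params[source]


def mismatches(new, reference, rel_tol=1e-9):
    """Compare {(geo_id, day): value} maps; return keys that disagree.

    `reference` stands in for values already served by the API; a key that
    is missing from `new` counts as a mismatch.
    """
    bad = []
    for key, ref_val in reference.items():
        new_val = new.get(key)
        if new_val is None or not math.isclose(new_val, ref_val, rel_tol=rel_tol):
            bad.append(key)
    return sorted(bad)
```

With this layout an invocation would look something like `python validate.py fb-survey 2020-06-01`, where `validate.py` is a placeholder script name.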

The script will log any errors in a concise output format and then exit with nonzero status if there were any failures, to make it easy to plug into an automation pipeline.
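
A minimal sketch of that reporting step, assuming errors are collected as (check name, message) pairs; the `FAIL` line format and the example messages are invented for illustration.

```python
import io
import sys


def report(errors, stream=sys.stderr):
    """Print one line per error plus a summary; return a process exit status."""
    for check_name, message in errors:
        print(f"FAIL {check_name}: {message}", file=stream)
    print(f"{len(errors)} check(s) failed", file=stream)
    return 1 if errors else 0


# Example with made-up failures, written to a buffer instead of stderr.
buffer = io.StringIO()
status = report([("value_range", "42003: val 150.0 out of range"),
                 ("geo_count", "county count dropped from 3100 to 2400")], buffer)
```

The script would then end with `sys.exit(report(errors))`, so an automation job can treat any nonzero status as a failed validation.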

krivard · 2020-07-07T18:41:22Z

Complications with architecture: validation needs to run before the CSV files are written to have access to the day and other metadata.

Complications with checking against the real data.

krivard · 2020-07-21T18:23:14Z

Candidate is up in #155

krivard · 2020-09-09T19:26:03Z

Whack-a-mole to get the thing to run, using usafacts output as a test case.

effective_sample_size is not a thing anymore; either use sample_size or just drop the constraint.

krivard · 2020-09-16T19:47:30Z

Runs! Now checking each criterion against the reference codebase, and building unit tests.

Developing a better procedure for reporting data quality issues: label "data quality"; issue template PR incoming.

krivard · 2020-10-07T19:30:20Z

Finished handling for known anomalies. Cleaning up remaining TODOs, expect to flag for review end of week.

capnrefsmmat self-assigned this Jun 1, 2020

capnrefsmmat assigned amartyabasu and unassigned capnrefsmmat Jun 26, 2020

capnrefsmmat added the CTIS label (Improvements and reporting for CTIS) Jul 5, 2020

krivard mentioned this issue Jul 16, 2020

Decide if we should switch to the weighted Facebook signals on the public maps #139

Closed

krivard mentioned this issue Aug 10, 2020

[fb-survey] synthetic dataset and unit tests #20

Closed

krivard assigned nmdefries and unassigned amartyabasu Sep 8, 2020

krivard mentioned this issue Sep 17, 2020

Adapt archive utility to run as a separate step in the pipeline #280

Closed

krivard mentioned this issue Oct 13, 2020

FB-survey validation with a generic design to include other pipelines #155

Merged

krivard closed this as completed in #155 Nov 19, 2020
