Update Quidel Covid pipeline #1452

Closed
jingjtang opened this issue Jan 7, 2022 · 8 comments · Fixed by #1467
Labels: data quality (Missing data, weird data, broken data)

Comments


We should make several updates to the Quidel Covid Test data at the current stage of COVID.

  • Consider adding age-group-specific signals for Quidel Covid
  • Add megacounty support to Quidel Covid
  • Add another minimum report threshold for Quidel Covid
  • All three items above can generate historical reports (especially the third one), now that we have deletion support in our system
  • Roni (and maybe I) will meet with Quidel next week(?) to see whether we can get variant information from Quidel (like for Flu)
jingjtang added the data quality label on Jan 7, 2022

krivard commented Jan 10, 2022

Ryan, Slack, 2022-01-10:

Metrics:

  • for each availability threshold: 50, 40, 30, 20 (one table per threshold)
  • for each age group, and also for the all-ages signal
  • % of locations available: averaged over time, at the county and state levels
  • correlations: timewise, averaged over locations, at the county and state levels (see the sketch after this list)
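
A minimal sketch of how these metrics might be computed, assuming a long-format pandas DataFrame with hypothetical columns geo_value, time_value, value (the Quidel signal) and ref_value (a reference signal to correlate against); none of these names come from the actual analysis code.

import pandas as pd

def pct_locations_available(df: pd.DataFrame, n_locations: int) -> float:
    # Share of locations reporting on each day, averaged over time.
    per_day = df.groupby("time_value")["geo_value"].nunique() / n_locations
    return per_day.mean()

def mean_timewise_correlation(df: pd.DataFrame) -> float:
    # Pearson correlation over time within each location, averaged over locations.
    per_geo = df.groupby("geo_value").apply(lambda g: g["value"].corr(g["ref_value"]))
    return per_geo.mean()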


krivard commented Jan 13, 2022

Conclusions:

  • stick with the current availability threshold of 50, for 1) consistency with what we have already published, 2) statistical considerations, and 3) little difference when the threshold is lowered
  • Use the age group breakdown: 0-4, 5-17, 18-49, 50-64, 65+ (see the sketch after this list)
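
A rough sketch of how the agreed age-group breakdown might be applied to test-level records, assuming a hypothetical DataFrame of individual tests with columns age, positive, geo_value, and time_value; the real indicator code may structure this differently.

import pandas as pd

AGE_BINS = [0, 5, 18, 50, 65, 200]
AGE_LABELS = ["0-4", "5-17", "18-49", "50-64", "65+"]

def pct_positive_by_age(tests: pd.DataFrame) -> pd.DataFrame:
    # Bucket individual test records into the agreed age groups, then compute
    # percent positive per (geo, day, age group).
    tests = tests.assign(
        age_group=pd.cut(tests["age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    )
    grouped = tests.groupby(["geo_value", "time_value", "age_group"], observed=True)
    out = grouped["positive"].agg(["sum", "count"]).reset_index()
    out["pct_positive"] = 100 * out["sum"] / out["count"]
    return out.rename(columns={"count": "sample_size"})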


krivard commented Feb 10, 2022

Remaining work on this:

  • Generate CSV listing all (signal, time_value, geo_type, geo_value, issue) to be removed from the database after this change. Total number of deletions needed: 23,180,625
  • Develop a method for deleting large numbers of covidcast rows as specified by a CSV file (see the sketch after this list)
  • Test on staging with 100, 1k, 10k lines to estimate timings
  • Decide whether to process deletions in batches or schedule downtime
  • Complete deletions in prod
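
A rough sketch of what CSV-driven batch deletion could look like, assuming a MariaDB/MySQL connection and a deletion CSV whose rows carry the key columns (source, signal, time_value, geo_type, geo_value, issue). This is an illustration only, not the actual Delphi automation step.

import csv
import mysql.connector  # assumes the MySQL/MariaDB connector package is available

DELETE_SQL = """
DELETE FROM covidcast
WHERE source=%s AND `signal`=%s AND time_type='day'
  AND time_value=%s AND geo_type=%s AND geo_value=%s AND issue=%s
"""

def delete_from_csv(path, conn, batch_size=10000):
    # Delete the rows named in the CSV, committing after each batch so
    # replication has a chance to catch up between batches.
    cur = conn.cursor()
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            batch.append((row["source"], row["signal"], row["time_value"],
                          row["geo_type"], row["geo_value"], row["issue"]))
            if len(batch) >= batch_size:
                cur.executemany(DELETE_SQL, batch)
                conn.commit()
                batch = []
    if batch:
        cur.executemany(DELETE_SQL, batch)
        conn.commit()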

krivard reopened this on Feb 10, 2022

krivard commented Mar 17, 2022

Deletions are complete, but checks are showing suspicious counts.

Here's what was done:

  1. Grab issue,source,time_value,geo_type,signal,geo_value,value,stderr,sample_size for quidel covid_ag_smoothed_pct_positive with issue<20220208 from MariaDB and store in a CSV file. edit: actually included issues up to 20220210
  2. Convert CSV file to batch issue format as "old data". edit: this version includes issues up to 20220208
  3. Reset the Quidel input cache. For each issue in the old data, in order chronologically, run the v0.3.2 (omicron) quidel indicator into a new export directory for that issue. Run the filesystem archiver, using the equivalent issue of the old data as the archivediffer cache. Now the contents of the export directory contain deletion annotations for any regions that were in the old data but are no longer generated by the new pipeline.
  4. Grab only the export files for covid_ag_smoothed_pct_positive which have deletion annotations.
  5. Convert the export files to deletion CSV format (geo_id,value,stderr,sample_size,issue,time_value,geo_type,signal,source)
  6. Split into batches of 1.5M rows each (see the sketch after this list)
  7. Delete 1 batch at a time using the [covidcast] Cycle Quidel deletions step. Wait for replication to recover between each batch. Spot-check a line in each batch before and after to make sure it exists before deleting and no longer exists after deleting.
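
A small sketch of step 6, splitting the deletions CSV into batches of roughly 1.5M rows each so every deletion cycle stays small enough for replication to keep up; the filenames and batch size here are illustrative, not the actual tooling.

import csv

def split_deletions(path, out_prefix, batch_size=1_500_000):
    # Write out_prefix_00.csv, out_prefix_01.csv, ... each holding at most
    # batch_size data rows plus a copy of the header.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        out, writer, n_rows, n_batch = None, None, 0, 0
        for row in reader:
            if writer is None or n_rows >= batch_size:
                if out is not None:
                    out.close()
                out = open(f"{out_prefix}_{n_batch:02d}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                n_rows, n_batch = 0, n_batch + 1
            writer.writerow(row)
            n_rows += 1
        if out is not None:
            out.close()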

The problem:

  • The "old data" CSV had 37,719,716 lines. edit: 36,453,057 lines for issue<20220208
  • The deletions CSV had 14,859,138 lines. edit: 14,817,614 lines for issue<20220208
  • The DB is currently showing 21,616,771 rows for quidel covid_ag_smoothed_pct_positive with issue<20220208
  • We'd expect the DB to show 22,860,086 rows instead. edit: 21,635,443 rows for issue<20220208 edit edit: 21,616,771 rows for issue<20220208 after re-running the filter
  • We're short 1,243,315 rows, or just over 5% edit: 18,672 rows, or 0.086% edit edit: We're perfect!

Current plan:

  • generate an "expected remaining" file containing the old data minus the deletions (see the sketch after this list)
  • sample 1000 rows (revised up from 200) from the expected remaining file
  • see if these rows can be found in the DB; expect <0.1% missing (revised down from 5%). which ones?
  • possibly check exhaustively 😥
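
A sketch of the first two plan items: build the "expected remaining" set as the old data minus the deletions, keyed on (source, signal, time_value, geo_type, geo_value, issue), then draw a random sample to spot-check against the DB. The filenames and the assumption that both CSVs expose these key columns are mine, not taken from the actual scripts.

import csv
import random

KEY = ("source", "signal", "time_value", "geo_type", "geo_value", "issue")

def load_keys(path):
    # Reduce each CSV row to its key tuple so the set difference is cheap.
    with open(path, newline="") as f:
        return {tuple(row[k] for k in KEY) for row in csv.DictReader(f)}

old = load_keys("old_data.csv")        # hypothetical filenames
deleted = load_keys("deletions.csv")
expected_remaining = old - deleted
sample = random.sample(sorted(expected_remaining), 1000)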

edit 20220317: Sample of 1000 rows found none missing. Trying a larger sample and preparing to check exhaustively.

edit 20220328: Exhaustive check complete. No missing rows identified. All rows from the 21,635,443-row expected set were matched to an equivalent row among the 21,616,771 in the DB, based on the following query:

SELECT geo_value,issue,time_value,geo_type,`signal`,source,value,missing_value 
FROM covidcast 
WHERE source="{}" and `signal`="{}" and time_type="day" 
AND time_value={} and geo_type="{}" and geo_value="{}" and issue={};
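
A minimal driver for the per-row check, assuming a DB-API connection and key tuples in the same (source, signal, time_value, geo_type, geo_value, issue) order as the sketch above; it uses a parameterized version of the query rather than string formatting, which is my choice and not necessarily how the check was actually run.

CHECK_SQL = """
SELECT geo_value, issue, time_value, geo_type, `signal`, source, value, missing_value
FROM covidcast
WHERE source=%s AND `signal`=%s AND time_type='day'
  AND time_value=%s AND geo_type=%s AND geo_value=%s AND issue=%s
"""

def find_missing(rows, conn):
    # Return the key tuples that cannot be found in the covidcast table.
    cur = conn.cursor()
    missing = []
    for key in rows:
        cur.execute(CHECK_SQL, key)
        if cur.fetchone() is None:
            missing.append(key)
    return missing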

The 21,635,443 rows were passed through sort -u and remained 21,635,443, so there are no duplicates. All lines of the 21,635,443 are for covid_ag_smoothed_pct_positive,quidel. However, 18,672 lines of the 21,635,443 turn out to be for issues >=20220208 🤦‍♀️ so the true expected count for issue<20220208 is 21,635,443 - 18,672 = 21,616,771, exactly matching the DB.

Mystery solved, we're good to release.


jingjtang commented Mar 17, 2022

@krivard what does this mean? Why do we have issues < 20200208?

  • The DB is currently showing 21,616,771 rows for quidel covid_ag_smoothed_pct_positive with issue<20200208

No worries, I think it is a typo.

So the deletion work actually deleted 1,243,315 more rows than expected. Reading the current process, it seems that the deletion work was not done by deleting the rows in the deletion CSV that I provided, but that the CSV was only used for checking. Is that correct?


jingjtang commented Mar 17, 2022

If the answer to the previous question is yes, this could be because there are some gaps between the deletion file and our previously released data. I remember that we updated this pipeline several times, and those updates could have generated different outputs.
e.g., I remember that we generated output from the very first date (2020-05-26) several times, and that is not accounted for in my deletion file.
So it's possible that the number of rows in my deletion file is smaller than what we actually needed to delete.


krivard commented Mar 17, 2022

Apologies -- I could not get the file you generated to match against the database, but I never followed up.

(also a lot of this is just notekeeping for me as I refine my analysis; I will continue to edit as I learn more)


krivard commented Mar 28, 2022

Details above -- deletions confirmed correct.

krivard closed this as completed on Mar 28, 2022