Update Quidel Covid pipeline #1452

Closed
jingjtang opened this issue Jan 7, 2022 · 8 comments · Fixed by #1467
Labels: data quality (Missing data, weird data, broken data)

Comments


We should make several updates to the Quidel Covid Test data at the current stage of COVID.

  • Consider adding age-group-specific signals for Quidel Covid
  • Add megacounty support to Quidel Covid
  • Add another minimum report threshold for Quidel Covid
  • All three items above can generate historical reports (especially the third one), now that we have deletion support in our system
  • Roni (and maybe I) will meet with Quidel next week(?) to see whether we can get variant information from Quidel (like for Flu)
jingjtang added the data quality label on Jan 7, 2022

krivard commented Jan 10, 2022

Ryan, Slack, 2022-01-10:

Metrics:

  • for each availability threshold: 50, 40, 30, 20 (one table per threshold)
  • for each age group, and also for the all-ages signal
  • % of locations available: averaged over time, at the county and state levels
  • correlations: timewise, averaged over locations, at the county and state levels (see the sketch after this list)
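
A minimal sketch of how these metrics might be computed, assuming a long-format pandas DataFrame with hypothetical columns geo_value, time_value, value (the Quidel signal) and ref_value (a reference signal to correlate against); none of these names come from the actual analysis code.

import pandas as pd

def pct_locations_available(df: pd.DataFrame, n_locations: int) -> float:
    # Share of locations reporting on each day, averaged over time.
    per_day = df.groupby("time_value")["geo_value"].nunique() / n_locations
    return per_day.mean()

def mean_timewise_correlation(df: pd.DataFrame) -> float:
    # Pearson correlation over time within each location, averaged over locations.
    per_geo = df.groupby("geo_value").apply(lambda g: g["value"].corr(g["ref_value"]))
    return per_geo.mean()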


krivard commented Jan 13, 2022

Conclusions:

  • stick with the current availability threshold of 50, for 1) consistency with what we have already published, 2) statistical considerations, and 3) little difference when the threshold is lowered
  • Use the age group breakdown: 0-4, 5-17, 18-49, 50-64, 65+ (see the sketch after this list)
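
A rough sketch of how the agreed age-group breakdown might be applied to test-level records, assuming a hypothetical DataFrame of individual tests with columns age, positive, geo_value, and time_value; the real indicator code may structure this differently.

import pandas as pd

AGE_BINS = [0, 5, 18, 50, 65, 200]
AGE_LABELS = ["0-4", "5-17", "18-49", "50-64", "65+"]

def pct_positive_by_age(tests: pd.DataFrame) -> pd.DataFrame:
    # Bucket individual test records into the agreed age groups, then compute
    # percent positive per (geo, day, age group).
    tests = tests.assign(
        age_group=pd.cut(tests["age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    )
    grouped = tests.groupby(["geo_value", "time_value", "age_group"], observed=True)
    out = grouped["positive"].agg(["sum", "count"]).reset_index()
    out["pct_positive"] = 100 * out["sum"] / out["count"]
    return out.rename(columns={"count": "sample_size"})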


krivard commented Feb 10, 2022

Remaining work on this:

  • Generate CSV listing all (signal, time_value, geo_type, geo_value, issue) to be removed from the database after this change. Total number of deletions needed: 23,180,625
  • Develop a method for deleting large numbers of covidcast rows as specified by a CSV file (see the sketch after this list)
  • Test on staging with 100, 1k, 10k lines to estimate timings
  • Decide whether to process deletions in batches or schedule downtime
  • Complete deletions in prod
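
A rough sketch of what CSV-driven batch deletion could look like, assuming a MariaDB/MySQL connection and a deletion CSV whose rows carry the key columns (source, signal, time_value, geo_type, geo_value, issue). This is an illustration only, not the actual Delphi automation step.

import csv
import mysql.connector  # assumes the MySQL/MariaDB connector package is available

DELETE_SQL = """
DELETE FROM covidcast
WHERE source=%s AND `signal`=%s AND time_type='day'
  AND time_value=%s AND geo_type=%s AND geo_value=%s AND issue=%s
"""

def delete_from_csv(path, conn, batch_size=10000):
    # Delete the rows named in the CSV, committing after each batch so
    # replication has a chance to catch up between batches.
    cur = conn.cursor()
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            batch.append((row["source"], row["signal"], row["time_value"],
                          row["geo_type"], row["geo_value"], row["issue"]))
            if len(batch) >= batch_size:
                cur.executemany(DELETE_SQL, batch)
                conn.commit()
                batch = []
    if batch:
        cur.executemany(DELETE_SQL, batch)
        conn.commit()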

krivard reopened this on Feb 10, 2022

krivard commented Mar 17, 2022

Deletions are complete, but checks are showing suspicious counts.

Here's what was done:

  1. Grab issue,source,time_value,geo_type,signal,geo_value,value,stderr,sample_size for quidel covid_ag_smoothed_pct_positive with issue<20220208 from MariaDB and store in a CSV file. edit: actually included issues up to 20220210
  2. Convert CSV file to batch issue format as "old data". edit: this version includes issues up to 20220208
  3. Reset the Quidel input cache. For each issue in the old data, in order chronologically, run the v0.3.2 (omicron) quidel indicator into a new export directory for that issue. Run the filesystem archiver, using the equivalent issue of the old data as the archivediffer cache. Now the contents of the export directory contain deletion annotations for any regions that were in the old data but are no longer generated by the new pipeline.
  4. Grab only the export files for covid_ag_smoothed_pct_positive which have deletion annotations.
  5. Convert the export files to deletion CSV format (geo_id,value,stderr,sample_size,issue,time_value,geo_type,signal,source)
  6. Split into batches of 1.5M rows each (see the sketch after this list)
  7. Delete 1 batch at a time using the [covidcast] Cycle Quidel deletions step. Wait for replication to recover between each batch. Spot-check a line in each batch before and after to make sure it exists before deleting and no longer exists after deleting.
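
A small sketch of step 6, splitting the deletions CSV into batches of roughly 1.5M rows each so every deletion cycle stays small enough for replication to keep up; the filenames and batch size here are illustrative, not the actual tooling.

import csv

def split_deletions(path, out_prefix, batch_size=1_500_000):
    # Write out_prefix_00.csv, out_prefix_01.csv, ... each holding at most
    # batch_size data rows plus a copy of the header.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        out, writer, n_rows, n_batch = None, None, 0, 0
        for row in reader:
            if writer is None or n_rows >= batch_size:
                if out is not None:
                    out.close()
                out = open(f"{out_prefix}_{n_batch:02d}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                n_rows, n_batch = 0, n_batch + 1
            writer.writerow(row)
            n_rows += 1
        if out is not None:
            out.close()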

The problem:

  • The "old data" CSV had 37,719,716 lines. edit: 36,453,057 lines for issue<20220208
  • The deletions CSV had 14,859,138 lines. edit: 14,817,614 lines for issue<20220208
  • The DB is currently showing 21,616,771 rows for quidel covid_ag_smoothed_pct_positive with issue<20220208
  • We'd expect the DB to show 22,860,086 rows instead. edit: 21,635,443 rows for issue<20220208 edit edit: 21,616,771 rows for issue<20220208 after re-running the filter
  • We're short 1,243,315 rows, or just over 5% edit: 18,672 rows, or 0.086% edit edit: We're perfect!

Current plan:

  • generate an "expected remaining" file containing the old data minus the deletions (see the sketch after this list)
  • sample 1000 rows (revised up from 200) from the expected remaining file
  • see if these rows can be found in the DB; expect <0.1% missing (revised down from 5%). which ones?
  • possibly check exhaustively 😥
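
A sketch of the first two plan items: build the "expected remaining" set as the old data minus the deletions, keyed on (source, signal, time_value, geo_type, geo_value, issue), then draw a random sample to spot-check against the DB. The filenames and the assumption that both CSVs expose these key columns are mine, not taken from the actual scripts.

import csv
import random

KEY = ("source", "signal", "time_value", "geo_type", "geo_value", "issue")

def load_keys(path):
    # Reduce each CSV row to its key tuple so the set difference is cheap.
    with open(path, newline="") as f:
        return {tuple(row[k] for k in KEY) for row in csv.DictReader(f)}

old = load_keys("old_data.csv")        # hypothetical filenames
deleted = load_keys("deletions.csv")
expected_remaining = old - deleted
sample = random.sample(sorted(expected_remaining), 1000)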

edit 20220317: Sample of 1000 rows found none missing. Trying a larger sample and preparing to check exhaustively.

edit 20220328: Exhaustive check complete. No missing rows identified. All rows from the 21,635,443-row expected set were matched to an equivalent row among the 21,616,771 in the DB, based on the following query:

SELECT geo_value,issue,time_value,geo_type,`signal`,source,value,missing_value 
FROM covidcast 
WHERE source="{}" and `signal`="{}" and time_type="day" 
AND time_value={} and geo_type="{}" and geo_value="{}" and issue={};
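
A minimal driver for the per-row check, assuming a DB-API connection and key tuples in the same (source, signal, time_value, geo_type, geo_value, issue) order as the sketch above; it uses a parameterized version of the query rather than string formatting, which is my choice and not necessarily how the check was actually run.

CHECK_SQL = """
SELECT geo_value, issue, time_value, geo_type, `signal`, source, value, missing_value
FROM covidcast
WHERE source=%s AND `signal`=%s AND time_type='day'
  AND time_value=%s AND geo_type=%s AND geo_value=%s AND issue=%s
"""

def find_missing(rows, conn):
    # Return the key tuples that cannot be found in the covidcast table.
    cur = conn.cursor()
    missing = []
    for key in rows:
        cur.execute(CHECK_SQL, key)
        if cur.fetchone() is None:
            missing.append(key)
    return missing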

The 21,635,443 rows were passed through sort -u and remained 21,635,443, so there are no duplicates. All lines of the 21,635,443 are for covid_ag_smoothed_pct_positive,quidel. However, 18,672 lines of the 21,635,443 turn out to be for issues >=20220208 🤦‍♀️ so the true expected count for issue<20220208 is 21,635,443 - 18,672 = 21,616,771, exactly matching the DB.

Mystery solved, we're good to release.


jingjtang commented Mar 17, 2022

@krivard what does this mean? Why do we have issues < 20200208?

  • The DB is currently showing 21,616,771 rows for quidel covid_ag_smoothed_pct_positive with issue<20200208

No worries, I think it is a typo.

So the deletion work actually deleted 1,243,315 more rows than expected. Reading the current process, it seems that the deletion work was not done by deleting the rows in the deletion CSV that I provided, but that the CSV was only used for checking. Is that correct?


jingjtang commented Mar 17, 2022

If the answer to the previous question is yes, this could be because there are some gaps between the deletion file and our previously released data. I remember that we updated this pipeline several times, and those updates could have generated different outputs.
e.g., I remember that we generated output from the very first date (2020-05-26) several times, and that is not accounted for in my deletion file.
So it's possible that the number of rows in my deletion file is smaller than what we actually needed to delete.


krivard commented Mar 17, 2022

Apologies -- I could not get the file you generated to match against the database, but I never followed up.

(also a lot of this is just notekeeping for me as I refine my analysis; I will continue to edit as I learn more)


krivard commented Mar 28, 2022

Details above -- deletions confirmed correct.

krivard closed this as completed on Mar 28, 2022