
Resolve performance issues related to data versioning (quick way) #158


Closed
krivard opened this issue Jul 22, 2020 · 33 comments · Fixed by cmu-delphi/delphi-epidata#187

krivard (Contributor) commented Jul 22, 2020

We did #24, but now the direction and meta cache updates take forever and run out of memory. Let's fix that.

krivard (Contributor, Author) commented Jul 22, 2020

  • A full-text index for wip_ detection is probably a dead end, but the upgrade to MariaDB 10.5 was a good idea.
  • The direction update crashed more verbosely; it's probably running out of memory. We doubled memory to 40GB and added tuning. It takes at least 5 hours to crash and builds a big temp table.

Really addressing this will probably take a more substantial engineering effort:

  • add an is_wip column?
  • add an is_most_recent_issue column?
  • add indices?

K will follow up with George, Wael, and Brian for a more detailed engineering conversation about this.

AlSaeed (Contributor) commented Jul 22, 2020

A possible way forward for the direction update is to compute the update for partitions of the data by the time-series key (source, signal, time_type, geo_type, geo_value), updating the direction one partition at a time. An added advantage is that we could then schedule the direction update for different partitions at different times of the day.
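A minimal sketch of one way such a partitioning could work, assuming a hash-based scheme (the partition count and function here are illustrative, not the scheme that was actually adopted):

    # Illustrative only: deterministically map each time-series key to one of N
    # partitions so the direction updater can process a single partition per run.
    import hashlib

    NUM_PARTITIONS = 4  # assumed value; the real count would follow from the data distribution

    def partition_for(source, signal, time_type, geo_type, geo_value,
                      num_partitions=NUM_PARTITIONS):
        """Assign a (source, signal, time_type, geo_type, geo_value) key to a partition."""
        key = "|".join([source, signal, time_type, geo_type, geo_value])
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_partitions

    # e.g. ingestion cycle k would update only rows whose key hashes to k % NUM_PARTITIONS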

krivard changed the title from "Resolve performance issues related to data versioning" to "Resolve performance issues related to data versioning (quick way)" on Jul 23, 2020
krivard (Contributor, Author) commented Jul 23, 2020

Results from meeting:

@AlSaeed to modify direction computation to handle only one partition per ingestion cycle, as described above. @korlaxxalrok to provide data distribution information to help determine an effective partitioning function. We expect this to reduce both memory and running time, at the expense of timeliness.

@melange396 to modify covidcast table definition as specified below, then modify (1) CSV import to fill the is_wip column appropriately, and (2) meta cache update to handle one (source, signal) pair at a time. We expect (1) to reduce running time, and (2) to reduce memory requirements at the expense of some additional running time overhead.
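As a rough illustration of the per-(source, signal) meta cache pass (not the actual delphi-epidata implementation; the `db` helper here is hypothetical):

    def update_meta_cache(db):
        """Compute metadata one (source, signal) pair at a time to bound memory use."""
        pairs = db.query("SELECT DISTINCT `source`, `signal` FROM `covidcast`")
        metadata = []
        for source, signal in pairs:
            # One small aggregate query per pair instead of one huge query over the
            # whole table: more round trips, but a much smaller working set.
            stats = db.query(
                "SELECT MIN(`time_value`), MAX(`time_value`), COUNT(1) "
                "FROM `covidcast` WHERE `source` = %s AND `signal` = %s",
                (source, signal))
            metadata.append((source, signal) + tuple(stats[0]))
        db.store_meta_cache(metadata)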

Table mods:

  • Rename timestamp1 to value_timestamp
  • Rename timestamp2 to direction_timestamp
  • Add flag column is_wip

This is a short-term fix to get us back on our feet so we can take our time and get the next one right: the long-term plan specified in #161.

krivard (Contributor, Author) commented Jul 23, 2020

Also add int columns:

  • missing_value
  • missing_std
  • missing_sample_size

melange396 (Contributor) commented:

What's the use case for those deleted/missing variables? Will that come from a CSV filename or a row value? Or does it get applied in post-processing, like the direction updater?

krivard (Contributor, Author) commented Jul 23, 2020

There are several circumstances that generate a missing value, and I'm still working on cataloguing them all (see also cmu-delphi/delphi-epidata#147 )

Some of them can arrive in the csv files during normal ingestion:

  • doctor-visits and hospital-admissions currently submit NA for stderr and sample_size, but this should be replaced with a missingness code for "restricted access"
  • currently any region not encountered or not meeting minimum sample size requirements is just left out of the csv file, but this should be replaced with a missingness code for "insufficient data"
  • ght and the cases/deaths indicators currently submit NA for stderr and sample_size, but this should be replaced with a missingness code for "not applicable" (there are no samples)

At least one of them cannot:

  • We know when generally to expect values to become available in the API, but sometimes there are pipeline failures. It would be nice to return a missingness code that tells the user "your query was properly formed, and normally we'd have that data for you by now, but it hasn't been dropped off yet and we're not sure why"

At some point we plan to have a unified API interface that invisibly handles both public + private queries; once that happens, I can also imagine returning "restricted access" if you ask for e.g. ZIP5-level data but don't provide an authentication key.
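To make the categories above concrete, here is a hypothetical sketch of missingness codes (names and values are illustrative; the actual proposal is tracked in cmu-delphi/delphi-epidata#147):

    from enum import IntEnum

    class Missingness(IntEnum):
        """Illustrative codes for why a value/stderr/sample_size is absent."""
        NOT_MISSING = 0        # an actual value is present
        NOT_APPLICABLE = 1     # e.g. ght, cases/deaths: there are no samples
        RESTRICTED_ACCESS = 2  # e.g. doctor-visits/hospital-admissions stderr and sample_size
        INSUFFICIENT_DATA = 3  # region dropped for not meeting minimum sample size
        UNAVAILABLE = 4        # expected by now, but the pipeline hasn't delivered it yet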

krivard (Contributor, Author) commented Jul 24, 2020

Partitioned directions are ready in cmu-delphi/delphi-epidata#161 for testing on staging

Meta cache update is nearly ready, but it's failing some mysterious, seemingly unrelated tests; W to take a look.

krivard (Contributor, Author) commented Jul 28, 2020

Partitions are taking ~35 minutes apiece; a test run without specifying a partition also took 35 minutes, so it's not clear how to run all partitions.

AlSaeed (Contributor) commented Jul 29, 2020

Partition 0 is failing in both cases because one time series, with signal 'wip_confirmed_7day_avg_cumulativ', doesn't have a stdev value, which means it has no data before Constants.SIGNAL_STDEV_MAX_DAY. The fix will depend on the intended behavior:

  • Would assigning a stdev of 0 to every time series with no data before Constants.SIGNAL_STDEV_MAX_DAY be an acceptable fix? Or is failure the correct behavior in this case?
  • If processing one partition fails, should processing of the remaining partitions continue? I assume the entire process should fail; otherwise we would have to check the log after every direction updater run to see whether some partitions failed.

krivard (Contributor, Author) commented Jul 29, 2020

Loads of options here, none of them particularly good.

  • Use a constant (any constant, not just 0) as a default reference stdev: Not great, since different signals are on wildly different scales. 0 is particularly problematic since it would mean everything gets classified as either increasing or decreasing, with no "steady".
  • Use the first month of data to determine the reference stdev for each signal. This is not a scalable calculation over the vast quantities of data we're faced with, so it will be prohibitively slow. For performance we want to use the same date bracket for everything.
  • Set Constants.SIGNAL_STDEV_MAX_DAY to be -30 days from whatever day the direction updater is being run, so it will keep up with current events as they continue to shift. This is future proof, but it will mean that directions may mysteriously change over time. For example, maybe three months ago the reference stdev was low, and a particular span of 7 days was marked as "increasing"; today, with an additional 3 months of data being used to compute the reference stdev, that same span of 7 days might be marked as "steady". Keep in mind nothing currently consumes the direction signal other than one particular mode in the COVIDcast map, and that mode doesn't permit comparing what the direction was when you looked at it a month ago with what it looks like now... but if someone is confused by it anyway, is that something we are willing to accept?

Quick fix:

  • Every time series with no data before Constants.SIGNAL_STDEV_MAX_DAY gets NA as its direction. This means that new indicators (like Quidel COVID tests) that don't have data before 4/22 will never receive a direction -- so we should consider this fix a temporary one.
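A minimal sketch of this quick fix, with assumed helper names (only Constants.SIGNAL_STDEV_MAX_DAY comes from the real code):

    from statistics import pstdev

    def reference_stdev_or_none(values, days, stdev_max_day):
        """Return the reference stdev computed from data before the cutoff, or None
        so the caller leaves the direction as NA for this time series."""
        early = [v for v, d in zip(values, days) if d < stdev_max_day]
        if not early:
            return None  # e.g. new indicators like Quidel get no direction for now
        return pstdev(early)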

Long-term:

  • Assign this as a data analysis project

krivard (Contributor, Author) commented Jul 29, 2020

Meta cache revision is available in cmu-delphi/delphi-epidata#167, but includes the column renames so we'll need to get the disk space upped before we can apply it.

krivard pinned this issue Jul 29, 2020
krivard (Contributor, Author) commented Jul 30, 2020

Working on adding a column to track which issue is the most recent.

Provisioning storage for further testing, including the column renames.

krivard (Contributor, Author) commented Jul 31, 2020

DB schema: the signal data type fix is done (varchar(32) -> varchar(64)); the column renames are in progress.

Once that's done, we'll update the direction code to use the new names, then merge the meta cache update changes and test them.

krivard (Contributor, Author) commented Aug 3, 2020

Setup branches are ready to merge.

Direction update still needs its tests fixed; conflicts are expected because we're adding two columns from two different branches.

amartyabasu unpinned this issue Aug 4, 2020
krivard (Contributor, Author) commented Aug 6, 2020

All staging PRs have been merged; next is to deploy to staging and test.

krivard (Contributor, Author) commented Aug 7, 2020

The is_most_recent_issue update ran overnight.

Remaining tests:

  • csv upload
  • direction update
  • meta cache update

korlaxxalrok (Contributor) commented:

The current lay of the land:

  • csv upload: Testing today, will run on a large batch of files to make sure there are no major issues.

  • direction update: Ran a test over the weekend on all partitions. It took ~970 minutes (16 hours). No memory errors, but this is still a long run time. I wonder if we can run this against multiple partitions at once. Seems to be about ~40 minutes/partition.

  • meta cache update: This seems good! Completed in ~98 minutes, which seems to be a big improvement over the current run time of ~420 minutes.

krivard (Contributor, Author) commented Aug 10, 2020

Assuming no large issues with csv uploading, this will work as the stopgap fix until we do the major table reorganization.

To deploy to main:

  • Upgrade MariaDB on prod
  • Table updates (couple of columns)
  • is_current_issue took 12 hours -- iterate on this in staging to figure out if this will affect API performance
  • Deploy code changes
  • Update Automation jobs

AlSaeed (Contributor) commented Aug 11, 2020

The fill_is_latest_issue script is only meant to be run once to initialize the is_latest_issue column. The integrity of the column is then maintained by the updated insert_or_update_batch query.

For the performance of direction update, I suggest adding the following key, and benchmarking at least one partition:
ALTER TABLE `covidcast` ADD KEY `by_is_latest_issue` (`source`, `signal`, `time_type`, `geo_type`, `geo_value`, `time_value`, `is_latest_issue`)
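For reference, a rough illustration of how an insert path can keep such a flag consistent (this is not the actual insert_or_update_batch query, just the idea): after writing a new issue of an observation, clear the flag on its older issues.

    # Illustrative SQL only; the real statement lives in insert_or_update_batch.
    CLEAR_OLDER_LATEST = """
        UPDATE `covidcast`
        SET `is_latest_issue` = 0
        WHERE `source` = %s AND `signal` = %s AND `time_type` = %s
          AND `geo_type` = %s AND `geo_value` = %s AND `time_value` = %s
          AND `issue` < %s AND `is_latest_issue` = 1
    """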

korlaxxalrok (Contributor) commented:

Adding more detail to what a cutover might look like:

  • Upgrade MariaDB - We may want to do this step separately from the others; that will give us time to fix anything that comes up. So maybe we do the database upgrade on one day, then the other steps on the next.

  • Table changes

    • ALTER ONLINE TABLE `covidcast` MODIFY `signal` varchar(64) NOT NULL;
    • ALTER ONLINE TABLE `covidcast` RENAME COLUMN `timestamp1` TO `value_updated_timestamp`;
    • ALTER ONLINE TABLE `covidcast` RENAME COLUMN `timestamp2` TO `direction_updated_timestamp`;
    • ALTER ONLINE TABLE `covidcast` ADD COLUMN `is_latest_issue` BINARY(1) NOT NULL DEFAULT (0);
  • Merge all relevant changes to main and deploy

  • Run the fill is_latest_issue operation [ python3 -m delphi.epidata.acquisition.covidcast.fill_is_latest_issue ]

    • This will take some time
    • It will not block API requests
    • I think metadata cache updates will still work, so we could temporarily run that outside the Automation loop, though if we aren't updating the DB with anything else it might not be necessary.
    • It will block writes to epidata.covidcast while it is running, so we need to take that into consideration and probably run it overnight while Automation is paused.

Once all the changes are made we should be able to start Automation.

Follow-on tasks would be to:

  • Deploy the direction update code that handles running against one partition per cycle (or, if we are not making the changes there, look at wrapping the current updater in a shell script that figures out which partition to run against; see the sketch below).

Once we know how long the is_latest_issue fill takes on staging, we can extrapolate for production; we probably have nearly 2x the rows there.
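For the wrapper idea mentioned above, a minimal sketch of how a run could deterministically pick its partition (the partition count and the updater's interface are assumptions, not the real delphi-epidata API):

    import datetime

    NUM_PARTITIONS = 24  # assumed partition count; one partition is updated per ingestion cycle

    def todays_partition(today=None, num_partitions=NUM_PARTITIONS):
        """Rotate through partitions deterministically, one per daily cycle."""
        today = today or datetime.date.today()
        return today.toordinal() % num_partitions

    # the chosen index would then be handed to the partitioned direction updater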

korlaxxalrok (Contributor) commented:

Hmm... I ran the is_latest_issue fill again on staging and it took ~237 minutes. Assuming it did the same amount of work as the first time it ran, we might expect it to take ~400 minutes on prod (336m rows in staging vs. 574m rows in production).
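(Roughly 237 min * 574/336 ≈ 405 min, assuming the run time scales linearly with row count.)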

krivard (Contributor, Author) commented Aug 13, 2020

Still need to check:

  • replication

2-day migration event sounds wise.

krivard (Contributor, Author) commented Aug 14, 2020

Evidence suggests we'll have to upgrade the replicas as well, ideally the same night we upgrade prod. We'll do it with a script, so a buddy shouldn't be necessary, but we'll keep this thread posted.

Estimated schedule:

  • Upgrades Wednesday evening
  • Deployment Thursday evening

krivard (Contributor, Author) commented Aug 20, 2020

  • Database upgrades were successful, modulo a surprise reactivation of the direction job; we switched it to a no-op temporarily.
  • Remaining transformations in progress.

krivard (Contributor, Author) commented Aug 21, 2020

  • Replicas choked on the varchar widening; we need to repair and then clone them
  • Batching those changes with the ones needed for scaling up for the FB data challenge

krivard (Contributor, Author) commented Aug 24, 2020

Adding more storage to prod tonight before we do the table renames, code push, and issue fill (probably tomorrow).

korlaxxalrok (Contributor) commented Aug 25, 2020

The storage has been added, so after Automation has a chance to right itself we can look at the final steps for this.

korlaxxalrok (Contributor) commented:

@krivard I have the tables renamed/created for this task. Next up is:

  • Merge the PR
  • Deploy
  • Run the is_latest_issue fill operation

The PR has reviewers, but I am not sure who might be available this week.

korlaxxalrok (Contributor) commented:

@krivard Ah! I see; I didn't notice those were just suggestions for reviewers, so maybe this is ready to go. I'm not going to just merge it, but once you have a chance to double-check it in some capacity, we can merge it.

krivard (Contributor, Author) commented Aug 25, 2020

Is that the branch currently on staging, and is staging working as it's meant to?

korlaxxalrok (Contributor) commented:

Yeah, the branch has been deployed there and should have been what we tested with (87b1984).

Looking back, we manually tested CSV upload, direction update, and the cache update.

  • I never posted an update about CSV upload in this issue thread, but as I recall I ran it on a huge collection of files and it succeeded.

  • Direction update took forever (as expected).

  • We saw an improvement in the cache update. Our mileage may vary in the production environment due to the more than 2x (and growing) data volume there.

I think we've done what we can for testing. Our staging env is not a super robust approximation of production, but the code ran there. We don't run an identical Automation flow so outliers could exist.

I think we can merge, deploy, and move forward (carefully), but I can definitely run another batch of tests if we think that is valuable.

krivard (Contributor, Author) commented Aug 25, 2020

Okay, I've double-checked that the API on staging returns data, and the meta queries work as well.

I can merge now, or wait for the morning if we're worried it will fail overnight.

korlaxxalrok (Contributor) commented:

I think we can go ahead and merge. I'll watch to see if we have any major issues.
