
[JIT]: just-in-time series computations #646


Draft: wants to merge 24 commits into ds/bump-pandas from jit_computations

Conversation

@dshemetov (Contributor) commented Jul 15, 2021

PRD for this work

Built on #1045.

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

Adds support for JIT streaming 7-day averaging and differencing calculations to the API server.
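
For intuition, a minimal sketch of the streaming idea (illustrative only, not the PR's actual code): derive day-over-day differences and trailing 7-day averages from a time-sorted row stream without materializing the full series.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

def diff_rows(rows: Iterable[Tuple[int, float]]) -> Iterator[Tuple[int, float]]:
    """Day-over-day differences; the first row has no predecessor and is dropped."""
    prev = None
    for time_value, value in rows:
        if prev is not None:
            yield time_value, value - prev
        prev = value

def smooth_rows(rows: Iterable[Tuple[int, float]], window: int = 7) -> Iterator[Tuple[int, float]]:
    """Trailing moving average over at most `window` values; only O(window) memory."""
    buf: deque = deque(maxlen=window)
    for time_value, value in rows:
        buf.append(value)
        yield time_value, sum(buf) / len(buf)
```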

TODO:

  • rebase to dev after Epidata v4.1 #954
  • figure out how to use the name aliasing feature and handle "source:*" requests
  • handle multiple signal requests
  • reimplement derivable signal logic: add a "derivable" column to the meta-table, use signal base for the "derivable" signal, and use "is_smoothed", "is_cumulative" to determine which transformations to apply
  • change the base signals in the db / meta table from incidence to cumulative
  • write integration tests
    • add an integration test with multiple geos
    • add integration test with geo:* option
  • add old API compatibility mode
  • update buffer to be cleaner and more memory efficient
  • add smoother params handling (more features in a future update)
  • add time-padding fetch logic
  • add time-gap handling (likely nan filling at first)
  • handle "issue", "as_of", and "lag" params - "issue" and "lag" are not supported, "as_of" is supported naturally
  • handle other endpoints
    • /trend
    • /trendseries
    • /correlation (unavailable because streaming is defeated by full-length time-series computations)
    • /csv
    • /backfill (multiple problems)
    • /meta (work in [JIT] meta caching system #995)
    • /coverage (out of scope)
    • /anomalies (only does a sheet lookup)

@dshemetov dshemetov marked this pull request as draft July 15, 2021 23:10
@dshemetov dshemetov requested review from krivard and sgratzl July 26, 2021 21:55
@dshemetov dshemetov marked this pull request as ready for review July 26, 2021 21:55
@dshemetov dshemetov requested a review from brookslogan July 27, 2021 16:08
@sgratzl (Member) left a comment

What is your plan for the other covidcast endpoints, like /meta, /trend, /backfill, ...?

@dshemetov (Contributor, Author) commented Jul 27, 2021

@sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

@sgratzl (Member) commented Jul 28, 2021

Another question regarding smoothing: when I request a smoothed signal starting at 2021-01-10, what will the first entry look like? Which unsmoothed values are used for computing the first smoothed value?

@sgratzl (Member) commented Jul 28, 2021

> @sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

That won't really work, since some of the endpoints use specialized SQL queries and don't operate on the normal data returned by the regular API endpoint, e.g. the backfill one as an extreme case.

@dshemetov (Contributor, Author) commented Jul 28, 2021

> Another question regarding smoothing: when I request a smoothed signal starting at 2021-01-10, what will the first entry look like? Which unsmoothed values are used for computing the first smoothed value?

I wrote a couple of options for this. By default, smoothing will use a dynamic window: e.g. 2021-01-10 will use just that day's value, 2021-01-11 will average that day's value and the previous one, and so on up to the window length. If pad_fill_value is given (floats and NaNs allowed), then (window_length - 1) many pad_fill_values will be prepended before 2021-01-10 and used for averaging. If a smoother kernel is provided (i.e. a list of floats), then padding follows similar logic, but weighted averaging is used (this assumes we can find a way to send a list of floats in an HTTP query string).
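
A rough sketch of those options (the parameter names follow this comment and are not necessarily the final API):

```python
from typing import Iterable, Iterator, List, Optional, Tuple

def smooth_with_padding(
    rows: Iterable[Tuple[int, float]],
    window_length: int = 7,
    pad_fill_value: Optional[float] = None,
    kernel: Optional[List[float]] = None,
) -> Iterator[Tuple[int, float]]:
    """Trailing average with the padding behavior described above."""
    # With a fill value, prepend (window_length - 1) copies so the first
    # requested day already sees a full window; a NaN fill makes the first
    # few outputs NaN.
    buf: List[float] = (
        [pad_fill_value] * (window_length - 1) if pad_fill_value is not None else []
    )
    for time_value, value in rows:
        buf = (buf + [value])[-window_length:]
        if kernel is None:
            # Default: dynamic window, averaging however many values we have.
            yield time_value, sum(buf) / len(buf)
        else:
            # Weighted average; assumes len(kernel) == window_length and a
            # padded (full-length) buffer.
            yield time_value, sum(w * v for w, v in zip(kernel, buf))
```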

In both diffing and smoothing, I am currently setting stderr and sample_size to None. Obtaining those from a group of values seems difficult.

@sgratzl (Member) commented Jul 28, 2021

> In both diffing and smoothing, I am currently setting stderr and sample_size to None. Obtaining those from a group of values seems difficult.

But we have the previous 7 days in the database. The values shouldn't change for 2021-01-10 regardless of whether I request the timeframes 2021-01-03--2021-01-12, 2021-01-07--2021-01-12, or 2021-01-10--2021-01-12. As far as I understand, they would in the current implementation, wouldn't they?

@dshemetov (Contributor, Author)

> @sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

> That won't really work, since some of the endpoints use specialized SQL queries and don't operate on the normal data returned by the regular API endpoint, e.g. the backfill one as an extreme case.

Hmm, I will need to learn more about the other endpoints and get back with some ideas.

@dshemetov (Contributor, Author) commented Jul 28, 2021

> But we have the previous 7 days in the database. The values shouldn't change for 2021-01-10 regardless of whether I request the timeframes 2021-01-03--2021-01-12, 2021-01-07--2021-01-12, or 2021-01-10--2021-01-12. As far as I understand, they would in the current implementation, wouldn't they?

You're right, they would change. I didn't want to complicate the smoothing logic in this first-pass implementation, and I think it can get quite complicated. If we want the smoothed values not to change, then we could modify the time range of the request to make the calculation.

> but we have the previous 7 days in the database

One complication is that this won't always be true: signals could have outages, or we could be requesting data at the "left" boundary of the signal's history. So suppose we always ask the db for extra days, given a smoothing window. The decision to pad would then need to be made when we first see the row with the earliest timestamp: if the smoothing window is not covered by the data, then padding is necessary. I'll write a mock of logic like this (sketch below).
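
A mock of that decision logic, with an assumed `fetch` callable standing in for the db query:

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List, Tuple

def fetch_padded(
    fetch: Callable[[date, date], Iterable[Tuple[date, float]]],
    start: date,
    end: date,
    window_length: int = 7,
) -> List[Tuple[date, float]]:
    """Ask the db for (window_length - 1) extra days; pad only if the rows
    returned don't actually cover the widened window."""
    padded_start = start - timedelta(days=window_length - 1)
    rows = list(fetch(padded_start, end))
    first_day = rows[0][0] if rows else end + timedelta(days=1)
    if first_day > padded_start:
        # The earliest row doesn't reach back far enough (an outage, or the
        # left boundary of the signal's history), so pad the head with NaNs.
        n_pad = (first_day - padded_start).days
        pad = [(padded_start + timedelta(days=i), float("nan")) for i in range(n_pad)]
        rows = pad + rows
    return rows
```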

@dshemetov (Contributor, Author) commented Jul 28, 2021

@sgratzl here are my notes for the other endpoints; please let me know if I misread the code.

  • /trend and /trendseries can either be rerouted to / or have the same _get_transformed_row_generator code as / inserted before the trend calculations
  • /correlation can do the same
  • /csv I don't understand this endpoint; what is the goal here? Why is the param parsing different? I assume it's doing something more than just returning a CSV version of the / endpoint response?
  • /backfill it appears that this endpoint returns all the issues in a specified range; I'm not sure that we can support this case for all time-series computations, more on that below*
  • /meta this requires reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it
  • /coverage this requires reporting geo-coverage statistics on all the signals; my thinking is similar to the /meta endpoint

*Say we have a query that returns multiple rows with the same (geo, source, signal, time_value) (e.g. an issue range). It is not clear to me how to order this in a way that would allow for sensible computations that rely on time-ordering. For example, suppose we have a query that returns the rows below.

| id | geo | source | signal | value | time_value | issue |
|----|-------|--------|--------|-------|------------|----------|
| 1 | 01234 | "src" | "sig" | 5 | 20210701 | 20210702 |
| 2 | 01234 | "src" | "sig" | 7 | 20210702 | 20210702 |
| 3 | 01234 | "src" | "sig" | 6 | 20210701 | 20210703 |

I don't know how to handle smoothing or diffing in this case, because the value "previous to" 20210702 is multiply defined. To be fair, one option is to report every possible combination (i.e. average rows 1 and 2 together and rows 2 and 3 together, using the max issue of the involved rows). However, this will not be doable in a streaming fashion due to the ordering of the "issue" column. These considerations lead me to conclude that derived signals should not be supported for requests with the issue parameter (note that as_of can still be supported).

@krivard (Contributor) commented Jul 29, 2021

> reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it

two alternatives:

@krivard (Contributor) commented Jul 29, 2021

> These considerations lead me to conclude that derived signals should not be supported for requests with the issue parameter

If we desperately wanted single-issue support, we could do the following:

  • Run a database query as-of the requested issue
  • Run a second database query as-of the previous issue
  • Generate a derived stream for each
  • Only return rows where the derived data for as-of the requested issue is different from the derived data for as-of the previous issue

It's utter 🗑️ and I have no idea how you'd stream it, but it's not impossible. Definitely wait and see whether we have a compelling use-case request first.
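
For concreteness, a non-streaming sketch of that comparison, with hypothetical `query_as_of` and `derive` helpers:

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

def single_issue_derived(
    query_as_of: Callable[[int], Iterable[Row]],  # assumed as-of db query
    derive: Callable[[Iterable[Row]], Iterable[Row]],  # diff/smooth pipeline
    issue: int,
    previous_issue: int,
) -> List[Row]:
    """Derived rows whose value changed between two consecutive issues."""
    current = {r["time_value"]: r for r in derive(query_as_of(issue))}
    previous = {r["time_value"]: r for r in derive(query_as_of(previous_issue))}
    return [
        row
        for t, row in current.items()
        if t not in previous or previous[t]["value"] != row["value"]
    ]
```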

@krivard (Contributor) commented Jul 29, 2021

& ftr I'm okay with making the backfill interface unavailable for derived signals

@dshemetov (Contributor, Author) commented Jul 29, 2021

> reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it

> two alternatives:

It occurs to me that these statistics may vary depending on the parameters given to the computation, e.g. the max for smoothing with which kernel? With which window length? The max for diffing with what boundary conditions? With what NaN-filling choice? I suppose we could provide the statistics only for the default parameter settings.

@dshemetov (Contributor, Author)

Thinking more about correlation computations: those would be hard to do in a streaming fashion. But it looks like our current implementation of correlations just loads the query into a dataframe anyway.

@krivard (Contributor) commented Jul 29, 2021

Is there already an upper limit on the number of points for correlations? If not, we should add one.

@dshemetov (Contributor, Author) commented Jul 29, 2021

@krivard I don't see a limit. Btw, pd.read_sql_query seems very relevant. @sgratzl do you know how intelligent this function is about lazy evaluation? Depending on whether it can handle streaming, it could make some of my custom-brew code obsolete. Which would be simultaneously a bummer and a relief?

It seems like you can give it a chunksize, so it will only grab a limited number of rows at a time. I'm not sure how that would interact with a windowed call (e.g. pd.DataFrame.rolling) on the dataframe. For example, if a window intersects two or more chunks, are they all loaded and discarded as needed? That seems like it would defeat the purpose of a chunksize. Besides, multiple-signal derivations on the same base signal would still require a buffer to do replay... hm.
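
For reference, `pd.read_sql_query` with a `chunksize` returns an iterator of DataFrames, so the window carry-over would have to be done by hand; a sketch (assuming a `value` column and a query sorted by `time_value`):

```python
import pandas as pd

def chunked_rolling_mean(conn, query: str, window: int = 7, chunksize: int = 10_000):
    """Chunked read with a manual tail carry-over so rolling windows can span
    chunk boundaries."""
    tail = None  # last (window - 1) rows of the previous chunk
    for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
        if tail is not None:
            chunk = pd.concat([tail, chunk], ignore_index=True)
        smoothed = chunk["value"].rolling(window, min_periods=1).mean()
        start = 0 if tail is None else len(tail)  # don't re-emit carried rows
        yield chunk.iloc[start:].assign(smoothed=smoothed.iloc[start:])
        tail = chunk.tail(window - 1)
```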

@krivard (Contributor) commented Sep 3, 2021

@dshemetov would you switch this to draft until it's ready to review?

@dshemetov dshemetov marked this pull request as draft September 3, 2021 18:54
@dshemetov (Contributor, Author)

@krivard so, trying to get this into an MVP state, I need to handle a few more endpoints. My thinking is: we can handle /trend, /trendseries, and possibly /csv with redirects to /. For /meta and /coverage, the simplest thing to do for now would be to not report the smoothed/diffed signals. We can add periodic caching of the results or another solution at a later time. What do you think?

@krivard (Contributor) commented Sep 7, 2021

> the simplest thing to do for now would be to not report the smoothed/diffed signals

I'm okay with calling /meta and /coverage out of scope for this PR.

That will, however, affect rollout. Before we delete the derivable cases/deaths signals from the database, we need to check how www-covidcast determines valid dates for cases/deaths. If www-covidcast expects meta information for the derived signals, then we'll need to put in a shim first, depending on what exactly the constraints are:

  • If www-covidcast just needs min and max time_value for all variants, then we can amend the metadata at query time just by copying the entry from the base signal (see the sketch after this list).
  • If it needs legit mean/min/max/stdev values, that gets trickier, but in that case we probably just do a post-pass during the meta update that fires up an API client and requests the derived time series.
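
A sketch of the first option, assuming dict-shaped meta rows and a hypothetical derived-to-base signal mapping:

```python
from typing import Dict, Iterable, List

def amend_meta_with_derived(
    meta_rows: Iterable[Dict[str, object]],
    derived_to_base: Dict[str, str],  # hypothetical, e.g. {"sig_7dav": "sig"}
) -> List[Dict[str, object]]:
    """Copy the base signal's meta entry under each derived signal's name."""
    rows = list(meta_rows)
    by_signal = {row["signal"]: row for row in rows}
    for derived, base in derived_to_base.items():
        if base in by_signal:
            rows.append({**by_signal[base], "signal": derived})
    return rows
```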

Regardless, this is out of scope for this PR, and possibly for the MVP in general.

@dshemetov dshemetov marked this pull request as ready for review September 11, 2021 01:20
@dshemetov dshemetov changed the title from "Feature: server side just-in-time time series computations" to "Feature: server side time series computations" Sep 17, 2021
@dshemetov dshemetov force-pushed the jit_computations branch 2 times, most recently from e679e50 to 517fead on September 28, 2021 11:19
dshemetov and others added 2 commits on December 8, 2022 14:11:
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
@dshemetov dshemetov marked this pull request as draft December 8, 2022 22:48
@dshemetov dshemetov changed the title from "Feature: just-in-time (JIT) series computations" to "[JIT]: just-in-time series computations" Dec 8, 2022
Successfully merging this pull request may close these issues: "Compute incidence at query time" and "Compute 7-day averages at query time in covidcast".