
[JIT]: just-in-time series computations #646


Draft: wants to merge 24 commits into ds/bump-pandas from jit_computations

Conversation

@dshemetov (Contributor) commented Jul 15, 2021

PRD for this work

Built on #1045.

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

Adds support for JIT streaming 7-day averaging and differencing calculations to the API server.
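
For intuition, a minimal sketch of the streaming idea (illustrative only, not the PR's actual code): derive day-over-day differences and trailing 7-day averages from a time-sorted row stream without materializing the full series.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

def diff_rows(rows: Iterable[Tuple[int, float]]) -> Iterator[Tuple[int, float]]:
    """Day-over-day differences; the first row has no predecessor and is dropped."""
    prev = None
    for time_value, value in rows:
        if prev is not None:
            yield time_value, value - prev
        prev = value

def smooth_rows(rows: Iterable[Tuple[int, float]], window: int = 7) -> Iterator[Tuple[int, float]]:
    """Trailing moving average over at most `window` values; only O(window) memory."""
    buf: deque = deque(maxlen=window)
    for time_value, value in rows:
        buf.append(value)
        yield time_value, sum(buf) / len(buf)
```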

TODO:

  • rebase to dev after Epidata v4.1 #954
  • figure out how to use the name aliasing feature and handle "source:*" requests
  • handle multiple signal requests
  • reimplement derivable signal logic: add a "derivable" column to the meta-table, use signal base for the "derivable" signal, and use "is_smoothed", "is_cumulative" to determine which transformations to apply
  • change the base signals in the db / meta table from incidence to cumulative
  • write integration tests
    • add an integration test with multiple geos
    • add integration test with geo:* option
  • add old API compatibility mode
  • update buffer to be cleaner and more memory efficient
  • add smoother params handling (more features in a future update)
  • add time-padding fetch logic
  • add time-gap handling (likely nan filling at first)
  • handle "issue", "as_of", and "lag" params - "issue" and "lag" are not supported, "as_of" is supported naturally
  • handle other endpoints
    • /trend
    • /trendseries
    • /correlation (unavailable because streaming is defeated by full-length time-series computations)
    • /csv
    • /backfill (multiple problems)
    • /meta (work in [JIT] meta caching system #995)
    • /coverage (out of scope)
    • /anomalies (only does a sheet lookup)

@dshemetov dshemetov marked this pull request as draft July 15, 2021 23:10
@dshemetov dshemetov requested review from krivard and sgratzl July 26, 2021 21:55
@dshemetov dshemetov marked this pull request as ready for review July 26, 2021 21:55
@dshemetov dshemetov requested a review from brookslogan July 27, 2021 16:08
@sgratzl (Member) left a comment

What is your plan for the other covidcast endpoints, like /meta, /trend, /backfill, ...?

@dshemetov (Contributor, Author) commented Jul 27, 2021

@sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

@sgratzl (Member) commented Jul 28, 2021

Another question regarding smoothing: when I request a smoothed signal starting at 2021-01-10, what will the first entry look like? Which unsmoothed values are used for computing the first smoothed value?

@sgratzl (Member) commented Jul 28, 2021

> @sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

That won't really work, since some of the endpoints use specialized SQL queries and don't operate on the normal data returned by the regular API endpoint, e.g. the backfill one as an extreme case.

@dshemetov (Contributor, Author) commented Jul 28, 2021

> Another question regarding smoothing: when I request a smoothed signal starting at 2021-01-10, what will the first entry look like? Which unsmoothed values are used for computing the first smoothed value?

I wrote a couple of options for this. By default, smoothing will use a dynamic window: e.g. 2021-01-10 will use just that day's value, 2021-01-11 will average that day's value and the previous one, and so on up to the window length. If pad_fill_value is given (floats and NaNs allowed), then (window_length - 1) many pad_fill_values will be prepended before 2021-01-10 and used for averaging. If a smoother kernel is provided (i.e. a list of floats), then padding follows similar logic, but weighted averaging is used (this assumes we can find a way to send a list of floats in an HTTP query string).
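
A rough sketch of those options (the parameter names follow this comment and are not necessarily the final API):

```python
from typing import Iterable, Iterator, List, Optional, Tuple

def smooth_with_padding(
    rows: Iterable[Tuple[int, float]],
    window_length: int = 7,
    pad_fill_value: Optional[float] = None,
    kernel: Optional[List[float]] = None,
) -> Iterator[Tuple[int, float]]:
    """Trailing average with the padding behavior described above."""
    # With a fill value, prepend (window_length - 1) copies so the first
    # requested day already sees a full window; a NaN fill makes the first
    # few outputs NaN.
    buf: List[float] = (
        [pad_fill_value] * (window_length - 1) if pad_fill_value is not None else []
    )
    for time_value, value in rows:
        buf = (buf + [value])[-window_length:]
        if kernel is None:
            # Default: dynamic window, averaging however many values we have.
            yield time_value, sum(buf) / len(buf)
        else:
            # Weighted average; assumes len(kernel) == window_length and a
            # padded (full-length) buffer.
            yield time_value, sum(w * v for w, v in zip(kernel, buf))
```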

In both diffing and smoothing, I am currently setting stderr and sample_size to None. Obtaining those from a group of values seems difficult.

@sgratzl (Member) commented Jul 28, 2021

> In both diffing and smoothing, I am currently setting stderr and sample_size to None. Obtaining those from a group of values seems difficult.

But we have the previous 7 days in the database. The values shouldn't change for 2021-01-10 regardless of whether I request the timeframes 2021-01-03--2021-01-12, 2021-01-07--2021-01-12, or 2021-01-10--2021-01-12. As far as I understand, they would in the current implementation, wouldn't they?

@dshemetov (Contributor, Author)

> @sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.

> That won't really work, since some of the endpoints use specialized SQL queries and don't operate on the normal data returned by the regular API endpoint, e.g. the backfill one as an extreme case.

Hmm, I will need to learn more about the other endpoints and get back with some ideas.

@dshemetov (Contributor, Author) commented Jul 28, 2021

> But we have the previous 7 days in the database. The values shouldn't change for 2021-01-10 regardless of whether I request the timeframes 2021-01-03--2021-01-12, 2021-01-07--2021-01-12, or 2021-01-10--2021-01-12. As far as I understand, they would in the current implementation, wouldn't they?

You're right, they would change. I didn't want to complicate the smoothing logic in this first-pass implementation, and I think it can get quite complicated. If we want the smoothed values not to change, then we could modify the time range of the request to make the calculation.

> but we have the previous 7 days in the database

One complication is that this won't always be true: signals could have outages, or we could be requesting data at the "left" boundary of the signal's history. So suppose we always ask the db for extra days, given a smoothing window. The decision to pad would then need to be made when we first see the row with the earliest timestamp: if the smoothing window is not covered by the data, then padding is necessary. I'll write a mock of logic like this (sketch below).
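
A mock of that decision logic, with an assumed `fetch` callable standing in for the db query:

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List, Tuple

def fetch_padded(
    fetch: Callable[[date, date], Iterable[Tuple[date, float]]],
    start: date,
    end: date,
    window_length: int = 7,
) -> List[Tuple[date, float]]:
    """Ask the db for (window_length - 1) extra days; pad only if the rows
    returned don't actually cover the widened window."""
    padded_start = start - timedelta(days=window_length - 1)
    rows = list(fetch(padded_start, end))
    first_day = rows[0][0] if rows else end + timedelta(days=1)
    if first_day > padded_start:
        # The earliest row doesn't reach back far enough (an outage, or the
        # left boundary of the signal's history), so pad the head with NaNs.
        n_pad = (first_day - padded_start).days
        pad = [(padded_start + timedelta(days=i), float("nan")) for i in range(n_pad)]
        rows = pad + rows
    return rows
```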

@dshemetov (Contributor, Author) commented Jul 28, 2021

@sgratzl here are my notes for the other endpoints; please let me know if I misread the code.

  • /trend and /trendseries can either be rerouted to / or have the same _get_transformed_row_generator code as / inserted before the trend calculations
  • /correlation can do the same
  • /csv I don't understand this endpoint; what is the goal here? Why is the param parsing different? I assume it's doing something more than just returning a CSV version of the / endpoint response?
  • /backfill it appears that this endpoint returns all the issues in a specified range; I'm not sure that we can support this case for all time-series computations, more on that below*
  • /meta this requires reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it
  • /coverage this requires reporting geo-coverage statistics on all the signals; my thinking is similar to the /meta endpoint

*Say we have a query that returns multiple rows with the same (geo, source, signal, time_value) (e.g. an issue range). It is not clear to me how to order this in a way that would allow for sensible computations that rely on time-ordering. For example, suppose we have a query that returns the rows below.

| id | geo | source | signal | value | time_value | issue |
|----|-------|--------|--------|-------|------------|----------|
| 1 | 01234 | "src" | "sig" | 5 | 20210701 | 20210702 |
| 2 | 01234 | "src" | "sig" | 7 | 20210702 | 20210702 |
| 3 | 01234 | "src" | "sig" | 6 | 20210701 | 20210703 |

I don't know how to handle smoothing or diffing in this case, because the value "previous to" 20210702 is multiply defined. To be fair, one option is to report every possible combination (i.e. average rows 1 and 2 together and rows 2 and 3 together, using the max issue of the involved rows). However, this will not be doable in a streaming fashion due to the ordering of the "issue" column. These considerations lead me to conclude that derived signals should not be supported for requests with the issue parameter (note that as_of can still be supported).

@krivard (Contributor) commented Jul 29, 2021

> reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it

two alternatives:

@krivard (Contributor) commented Jul 29, 2021

> These considerations lead me to conclude that derived signals should not be supported for requests with the issue parameter

If we desperately wanted single-issue support, we could do the following:

  • Run a database query as-of the requested issue
  • Run a second database query as-of the previous issue
  • Generate a derived stream for each
  • Only return rows where the derived data for as-of the requested issue is different from the derived data for as-of the previous issue

It's utter 🗑️ and I have no idea how you'd stream it, but it's not impossible. Definitely wait and see whether we have a compelling use-case request first.
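
For concreteness, a non-streaming sketch of that comparison, with hypothetical `query_as_of` and `derive` helpers:

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]

def single_issue_derived(
    query_as_of: Callable[[int], Iterable[Row]],  # assumed as-of db query
    derive: Callable[[Iterable[Row]], Iterable[Row]],  # diff/smooth pipeline
    issue: int,
    previous_issue: int,
) -> List[Row]:
    """Derived rows whose value changed between two consecutive issues."""
    current = {r["time_value"]: r for r in derive(query_as_of(issue))}
    previous = {r["time_value"]: r for r in derive(query_as_of(previous_issue))}
    return [
        row
        for t, row in current.items()
        if t not in previous or previous[t]["value"] != row["value"]
    ]
```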

@krivard (Contributor) commented Jul 29, 2021

& ftr I'm okay with making the backfill interface unavailable for derived signals

@dshemetov (Contributor, Author) commented Jul 29, 2021

> reporting max/min statistics on all the signals; if we shift over to dropping the smoothed/incidence signals from the table, this would require performing a full time-series calculation for each derived signal, which would be really expensive; one option would be to perform this computation once daily and cache it; otherwise we just might not support it

> two alternatives:

It occurs to me that these statistics may vary depending on the parameters given to the computation, e.g. the max for smoothing with which kernel? With which window length? The max for diffing with what boundary conditions? With what NaN-filling choice? I suppose we could provide the statistics only for the default parameter settings.

@dshemetov (Contributor, Author)

Thinking more about correlation computations: those would be hard to do in a streaming fashion. But it looks like our current implementation of correlations just loads the query into a dataframe anyway.

@krivard (Contributor) commented Jul 29, 2021

Is there already an upper limit on the number of points for correlations? If not, we should add one.

@dshemetov (Contributor, Author) commented Jul 29, 2021

@krivard I don't see a limit. Btw, pd.read_sql_query seems very relevant. @sgratzl do you know how intelligent this function is about lazy evaluation? Depending on whether it can handle streaming, it could make some of my custom-brew code obsolete. Which would be simultaneously a bummer and a relief?

It seems like you can give it a chunksize, so it will only grab a limited number of rows at a time. I'm not sure how that would interact with a windowed call (e.g. pd.DataFrame.rolling) on the dataframe. For example, if a window intersects two or more chunks, are they all loaded and discarded as needed? That seems like it would defeat the purpose of a chunksize. Besides, multiple-signal derivations on the same base signal would still require a buffer to do replay... hm.
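
For reference, `pd.read_sql_query` with a `chunksize` returns an iterator of DataFrames, so the window carry-over would have to be done by hand; a sketch (assuming a `value` column and a query sorted by `time_value`):

```python
import pandas as pd

def chunked_rolling_mean(conn, query: str, window: int = 7, chunksize: int = 10_000):
    """Chunked read with a manual tail carry-over so rolling windows can span
    chunk boundaries."""
    tail = None  # last (window - 1) rows of the previous chunk
    for chunk in pd.read_sql_query(query, conn, chunksize=chunksize):
        if tail is not None:
            chunk = pd.concat([tail, chunk], ignore_index=True)
        smoothed = chunk["value"].rolling(window, min_periods=1).mean()
        start = 0 if tail is None else len(tail)  # don't re-emit carried rows
        yield chunk.iloc[start:].assign(smoothed=smoothed.iloc[start:])
        tail = chunk.tail(window - 1)
```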

@krivard (Contributor) commented Sep 3, 2021

@dshemetov would you switch this to draft until it's ready to review?

@dshemetov dshemetov marked this pull request as draft September 3, 2021 18:54
@dshemetov (Contributor, Author)

@krivard so, trying to get this into an MVP state, I need to handle a few more endpoints. My thinking is: we can handle /trend, /trendseries, and possibly /csv with redirects to /. For /meta and /coverage, the simplest thing to do for now would be to not report the smoothed/diffed signals. We can add periodic caching of the results or another solution at a later time. What do you think?

@krivard (Contributor) commented Sep 7, 2021

> the simplest thing to do for now would be to not report the smoothed/diffed signals

I'm okay with calling /meta and /coverage out of scope for this PR.

That will, however, affect rollout. Before we delete the derivable cases/deaths signals from the database, we need to check how www-covidcast determines valid dates for cases/deaths. If www-covidcast expects meta information for the derived signals, then we'll need to put in a shim first, depending on what exactly the constraints are:

  • If www-covidcast just needs min and max time_value for all variants, then we can amend the metadata at query time just by copying the entry from the base signal (see the sketch after this list).
  • If it needs legit mean/min/max/stdev values, that gets trickier, but in that case we probably just do a post-pass during the meta update that fires up an API client and requests the derived time series.
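
A sketch of the first option, assuming dict-shaped meta rows and a hypothetical derived-to-base signal mapping:

```python
from typing import Dict, Iterable, List

def amend_meta_with_derived(
    meta_rows: Iterable[Dict[str, object]],
    derived_to_base: Dict[str, str],  # hypothetical, e.g. {"sig_7dav": "sig"}
) -> List[Dict[str, object]]:
    """Copy the base signal's meta entry under each derived signal's name."""
    rows = list(meta_rows)
    by_signal = {row["signal"]: row for row in rows}
    for derived, base in derived_to_base.items():
        if base in by_signal:
            rows.append({**by_signal[base], "signal": derived})
    return rows
```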

Regardless, this is out of scope for this PR, and possibly for the MVP in general.

@dshemetov dshemetov marked this pull request as ready for review September 11, 2021 01:20
@dshemetov dshemetov changed the title from "Feature: server side just-in-time time series computations" to "Feature: server side time series computations" Sep 17, 2021
@dshemetov dshemetov force-pushed the jit_computations branch 2 times, most recently from e679e50 to 517fead on September 28, 2021 11:19
dshemetov and others added 2 commits on December 8, 2022 14:11:
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
@dshemetov dshemetov marked this pull request as draft December 8, 2022 22:48
@dshemetov dshemetov changed the title from "Feature: just-in-time (JIT) series computations" to "[JIT]: just-in-time series computations" Dec 8, 2022
Successfully merging this pull request may close these issues: "Compute incidence at query time" and "Compute 7-day averages at query time in covidcast".