[JIT]: just-in-time series computations #646
base: ds/bump-pandas
Conversation
what is your plan for the other covidcast endpoints? like /meta /trend /backfill, ....
@sgratzl I haven't thought about it too much, but I wanted to redirect a call to /trend to / and then perform the trend calculations on the output there. Not sure if that's possible with Flask.
Another question regarding smoothing: when I request a smoothed signal starting at 2021-01-10, what will the first entry look like? Which unsmoothed values are used for computing the first smoothed value?
That won't really work, since some of the endpoints use specialized SQL queries and don't operate on the normal data returned by the regular API endpoint (e.g. the backfill one, as an extreme case).
I wrote a couple of options for this. By default, smoothing will use a dynamic window: e.g. 2021-01-10 will use just the value from that day, 2021-01-11 will average that day's value and the previous day's, and so on up to the window length. If pad_fill_value is given (floats and nans allowed), then (window_length - 1) many pad_fill_values will be prepended before 2021-01-10 and used for averaging. If a smoother kernel is provided (i.e. a list of floats), then padding follows similar logic, but a weighted average is used (this assumes we can find a way to send a list of floats in an HTTP query string). In both diffing and smoothing, I am currently setting stderr and sample_size to None; obtaining those from a group of values seems difficult.
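The options above can be sketched roughly like this. This is a minimal illustration, not the PR's actual code; the names `smooth`, `window_length`, `pad_fill_value`, and `kernel` mirror the discussion but the real signatures may differ.

```python
from typing import Iterable, List, Optional

def smooth(values: Iterable[float],
           window_length: int = 7,
           pad_fill_value: Optional[float] = None,
           kernel: Optional[List[float]] = None) -> List[float]:
    """Smooth a time-ordered series.

    Default: dynamic window (first output averages 1 value, second
    averages 2, ... up to window_length). If pad_fill_value is given,
    (window_length - 1) copies of it are prepended instead. If a kernel
    is given, a weighted average over the window is used.
    """
    vals = list(values)
    if pad_fill_value is not None:
        vals = [pad_fill_value] * (window_length - 1) + vals
        start = window_length - 1  # padded entries are inputs, not outputs
    else:
        start = 0
    out = []
    for i in range(start, len(vals)):
        window = vals[max(0, i - window_length + 1): i + 1]
        if kernel is not None:
            # when the dynamic window is still short, use the kernel's tail
            k = kernel[-len(window):]
            out.append(sum(w * v for w, v in zip(k, window)) / sum(k))
        else:
            out.append(sum(window) / len(window))
    return out
```

For example, with `window_length=2`, `smooth([1, 2, 3])` gives `[1.0, 1.5, 2.5]` (dynamic window), while `smooth([1, 2, 3], pad_fill_value=0)` gives `[0.5, 1.5, 2.5]`.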
But we have the previous 7 days in the database. The values shouldn't change for 2021-01-10 regardless of whether I request the timeframe 2021-01-03--2021-01-12, 2021-01-07--2021-01-12, or 2021-01-10--2021-01-12. As far as I understand, they would change in the current implementation, wouldn't they?
Hmm, I will need to learn more about the other endpoints and get back with some ideas.
You're right, they would change. I didn't want to complicate the smoothing logic in this first-pass implementation, and I think it can get quite complicated. If we want the smoothed values to not change, we could widen the time range of the request before making the calculation.
One complication is that this won't always work: the signal could have outages, or the request could sit on the "left" boundary of the data. So suppose we always ask the db for extra days, given a smoothing window. The decision to pad would then need to be made when you first see the row with the earliest timestamp: if the smoothing window is not covered by the data, padding is necessary. I'll write a mock of logic like this.
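The idea of widening the query and deciding on padding from the earliest returned row could look something like this (hypothetical helper names, sketched under the assumptions in the comment above, not the PR's implementation):

```python
from datetime import date, timedelta

def widen_query(start: date, end: date, window_length: int = 7):
    """Widen the DB query by (window_length - 1) days so the smoother
    sees a full window for the first requested day."""
    return start - timedelta(days=window_length - 1), end

def needs_padding(earliest_row_date: date, widened_start: date) -> bool:
    """Pad only if the extra rows didn't come back: the earliest row is
    later than the widened start (an outage, or the signal's true left
    boundary)."""
    return earliest_row_date > widened_start
```

E.g. a request for 2021-01-10 through 2021-01-12 with a 7-day window would query from 2021-01-04; if the earliest row returned is still 2021-01-10, padding kicks in.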
@sgratzl here are my notes for the other endpoints, please let me know if I misread the code.
Say we have a query that returns multiple rows with the same (geo, source, signal, time_value) (e.g. an issue range). It is not clear to me how to order these in a way that allows for sensible computations that rely on time-ordering. For example, suppose we have a query that returns the rows below.
I don't know how to handle smoothing or diffing in this case, because the value "previous to" 20210702 is multiply defined. To be fair, one option is to report every possible combination (i.e. average rows 2 and 1 together and 2 and 3 together, using the max issue of the involved rows). However, this will not be doable in a streaming fashion due to the ordering of the "issue" column. These considerations lead me to conclude that derived signals should not be supported for requests with the issue parameter (note that as_of can still be supported).
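To make the ambiguity concrete, here is a hypothetical set of rows (values invented for illustration):

```python
# Hypothetical rows for one (geo, source, signal) with an issue range.
rows = [
    {"time_value": 20210701, "issue": 20210702, "value": 1.0},  # row 1
    {"time_value": 20210701, "issue": 20210703, "value": 1.5},  # row 2 (revision)
    {"time_value": 20210702, "issue": 20210703, "value": 2.0},  # row 3
]

# The diff at 20210702 is multiply defined: which 20210701 value is
# "previous"? Reporting every combination gives one diff per candidate.
candidates = [r["value"] for r in rows if r["time_value"] == 20210701]
diffs = [rows[2]["value"] - v for v in candidates]
print(diffs)  # [1.0, 0.5]
```

Note that producing both candidates requires seeing all issues of 20210701 before emitting anything for 20210702, which is what breaks the streaming order.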
two alternatives:
If we desperately wanted single issue, we could do the following:
It's utter 🗑️ and I have no idea how you'd stream it, but it's not impossible. Definitely wait and see whether we get a compelling use-case request first.
& ftr I'm okay with making the backfill interface unavailable for derived signals
It occurs to me that these statistics may vary depending on the parameters given to the computation: max for smoothing with which kernel? With which window length? Max for diffing with what boundary conditions? With what nan-filling choice? I suppose we could provide the statistics only for the default parameter settings.
Thinking more about correlation computations: those would be hard to do in a streaming fashion. But it looks like our current implementation of correlations just loads the query into a dataframe anyway.
Is there already an upper limit on the number of points for correlations? If not, we should add one.
@krivard I don't see a limit. Btw, it seems like you can give it a chunksize, so it will only grab a limited number of rows at a time. Not sure how that would interact with requesting an e.g.
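For reference, pandas' `read_sql` accepts a `chunksize` argument that turns the result into an iterator of DataFrames, bounding the rows materialized at once. A minimal sketch against a toy in-memory SQLite table (illustrative only; the real query and connection differ):

```python
import sqlite3
import pandas as pd

# Toy in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signal (time_value INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO signal VALUES (?, ?)",
    [(20210101 + i, float(i)) for i in range(10)],
)

# With chunksize, read_sql yields DataFrames of at most `chunksize`
# rows instead of loading everything into one frame.
n_rows = 0
for chunk in pd.read_sql("SELECT * FROM signal", conn, chunksize=4):
    n_rows += len(chunk)  # chunks of 4, 4, 2 rows
print(n_rows)  # 10
```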
@dshemetov would you switch this to draft until it's ready to review? |
@krivard So, trying to get this into an MVP state, I need to handle a few more endpoints. My thinking is: we can handle trend, trendseries, and possibly csv with redirects to
I'm okay with calling That will, however, affect rollout. Before we delete the derivable cases/deaths signals from the database, we need to check how www-covidcast determines valid dates for cases/deaths. If www-covidcast expects meta information for the derived signals, then we'll need to put in a shim first, depending on what exactly the constraints are.
Regardless, out of scope for this PR, and possibly for the MVP in general.
* add smooth_diff
* add model updates
* add /trend endpoint
* add /trendseries endpoint
* add /csv endpoint
* params with utility functions
* update date utility functions
[JIT Variant]: Reduce iterator stack on base signals
PRD for this work
Built on #1045.
Prerequisites:
* dev branch
Summary
Adds support for JIT streaming 7-day averaging and differencing calculations to the API server.
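A minimal sketch of what streaming differencing and trailing averaging can look like, assuming time-ordered input rows (hypothetical helpers, not the PR's actual code; generators keep memory per row bounded by the window size):

```python
from collections import deque
from typing import Iterable, Iterator, Optional

def stream_diff(values: Iterable[float]) -> Iterator[Optional[float]]:
    """Day-over-day difference; None on the first day (no predecessor)."""
    prev = None
    for v in values:
        yield None if prev is None else v - prev
        prev = v

def stream_avg(values: Iterable[float], window: int = 7) -> Iterator[float]:
    """Dynamic-window trailing average; the deque holds at most
    `window` values, so memory stays constant as rows stream by."""
    buf: deque = deque(maxlen=window)
    for v in values:
        buf.append(v)
        yield sum(buf) / len(buf)
```

E.g. `list(stream_diff([3, 5, 4]))` gives `[None, 2, -1]`, and `list(stream_avg([1, 2, 3], window=2))` gives `[1.0, 1.5, 2.5]`.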
TODO:
* dev after Epidata v4.1 #954
* add smoother params handling
* more features in a future update:
  * /correlation (unavailable because streaming is defeated by full-length time-series computations)
  * /backfill (multiple problems)
  * /coverage (out of scope)
  * /anomalies (only does a sheet lookup)