kdb-like window join for pandas #13959
so I think some spelling is in order here, to make these easier to grok (should apply these generally to merging operations), maybe something like this:
as
so
👍 on Jeff's spelling. The original function seems to be doing a lot, better to split it up into components that are understandable. The
cc @wesm
A quick example, as much for my benefit:
Suppose I want to compute this:
The result would be
I'm going to take a crack at this now. I won't have Jeff's notation to start since I think that's a bigger conversation. But I do want to get this functionality as soon as possible.
Having a composable API for joins in general would be nice. How do you imagine the asof version would look? As an aside, the syntax for describing aggregates leaves a lot to be desired (this is not specific to this proposal). I spent a bunch of time thinking about how to do this (you can see what I came up with in some of the examples in http://docs.ibis-project.org/generated-notebooks/2.html#aggregating-joined-table-with-metrics-involving-more-than-one-base-reference). We may be able to study some other DSL (e.g. Blaze) to think about deferred join syntax. Adding any kind of expression DSL becomes a deep rabbit hole (e.g. deterministically resolving output types). Maybe something that should wait for pandas 2.0
Hi, what's the status on this guy?
@zak-b2c2 pandas is a community-led project, so contributions are what move it forward. You are welcome to submit a pull request.
I would like a time-based aggregating function that also acts as a join:
http://code.kx.com/wiki/Reference/wj
In pandas, this might look something like:
This would compute the total volume from df2 whose timestamps are within (-100, 100) ms of df1’s timestamps and whose tickers match.
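Until such a function exists, the same computation can be sketched with existing pandas operations. The helper name `window_join_sum`, the column names, and the sample data below are illustrative assumptions, not part of pandas or of this proposal, and the window is treated as inclusive at both ends. It pairs rows by ticker with a plain merge and then filters by time difference, so it uses memory proportional to the number of same-ticker pairs; a real implementation would want something smarter.

```python
import pandas as pd

def window_join_sum(left, right, on, by, value, lo, hi):
    """Hypothetical helper (not a pandas API): for each row of `left`, sum
    right[value] over rows with the same `by` key whose `on` timestamp lies
    in [left[on] + lo, left[on] + hi], both ends inclusive."""
    left = left.reset_index(drop=True)
    # Tag each left row so the aggregates can be aligned back afterwards.
    left_ids = left.assign(_row=left.index)[["_row", on, by]]
    # Pair every left row with all right rows that share the same key.
    pairs = left_ids.merge(right[[on, by, value]], on=by, suffixes=("", "_right"))
    # Keep only the pairs whose time difference falls inside the window.
    delta = pairs[on + "_right"] - pairs[on]
    in_window = pairs[(delta >= lo) & (delta <= hi)]
    sums = in_window.groupby("_row")[value].sum()
    # Left rows with no match in the window get a total of 0.
    return left.assign(**{value: sums.reindex(left.index, fill_value=0).to_numpy()})

# Made-up sample data for illustration only.
df1 = pd.DataFrame({
    "time": pd.to_datetime(["2016-08-01 09:30:00.000",
                            "2016-08-01 09:30:00.500",
                            "2016-08-01 09:30:00.000"]),
    "ticker": ["AAPL", "AAPL", "MSFT"],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2016-08-01 09:30:00.050",
                            "2016-08-01 09:30:00.090",
                            "2016-08-01 09:30:00.450",
                            "2016-08-01 09:30:01.000",
                            "2016-08-01 09:29:59.950"]),
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT"],
    "volume": [100, 200, 50, 75, 10],
})

result = window_join_sum(
    df1, df2, on="time", by="ticker", value="volume",
    lo=pd.Timedelta(milliseconds=-100), hi=pd.Timedelta(milliseconds=100),
)
print(result)
```

Each row of `df1` keeps its place in the output, which is what makes this a join rather than a plain groupby.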
I imagine I could specify different columns:
By the way, this is a more general form of the .rolling() functions @jreback wrote recently, since I can just use one DataFrame as both parameters:

My firm has custom business calendars for things like market hours as well as exchange holidays. The kdb version takes arrays of timestamps directly for the begin and end, which handles the general-purpose case of custom business calendars. So I imagine I could get the five-day average of volume with:
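A minimal sketch of the explicit begin/end idea, using only existing NumPy and pandas calls. The function name `window_mean` is hypothetical, the data is made up, and `pd.offsets.BDay` stands in for a custom business calendar, which would supply its own `begin` array. Windows are assumed to be inclusive at both ends.

```python
import numpy as np
import pandas as pd

def window_mean(times, values, begin, end):
    """Hypothetical helper (not a pandas API): for each i, the mean of
    `values` whose `times` fall in [begin[i], end[i]], both ends inclusive."""
    times = pd.DatetimeIndex(times).values
    order = np.argsort(times, kind="stable")
    sorted_times = times[order]
    sorted_values = np.asarray(values, dtype="float64")[order]
    # Prefix sums let each window total be read off in constant time.
    csum = np.concatenate(([0.0], np.cumsum(sorted_values)))
    lo = np.searchsorted(sorted_times, pd.DatetimeIndex(begin).values, side="left")
    hi = np.searchsorted(sorted_times, pd.DatetimeIndex(end).values, side="right")
    counts = hi - lo
    totals = csum[hi] - csum[lo]
    # Empty windows yield NaN rather than a division by zero.
    return np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)

# Ten business days of made-up daily volume, starting Monday 2016-08-01.
days = pd.date_range("2016-08-01", periods=10, freq="B")
volume = np.arange(10, 101, 10)

# Each window ends on the day itself and begins four business days earlier.
begin = days - 4 * pd.offsets.BDay()
five_day_avg = window_mean(days, volume, begin, days)
print(pd.Series(five_day_avg, index=days))
```

Because the windows are passed in as plain arrays, swapping in a calendar with exchange holidays only changes how `begin` and `end` are built, not the aggregation itself.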
I can get started on this right away if my proposal makes sense.