kdb-like window join for pandas #13959

Open
chrisaycock opened this issue Aug 10, 2016 · 7 comments
Labels
Datetime (Datetime data dtype) · Enhancement · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Window (rolling, ewma, expanding)

Comments

@chrisaycock
Contributor

I would like a time-based aggregating function that also acts as a join:

http://code.kx.com/wiki/Reference/wj

In pandas, this might look something like:

ms = pd.Timedelta(1, 'ms')
merge_window(df1, df2, on='time', by='ticker', time_range=(-100*ms, 100*ms), how=(np.sum, 'volume'))

This would compute the total volume from df2 whose timestamps are within (-100, 100) ms of df1’s timestamps and whose tickers match.
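Since merge_window does not exist in pandas, one way to pin down the proposed semantics is a naive reference sketch (all names here are hypothetical; this is O(len(left) × len(right)) and only meant to define the behavior, not to be fast):

```python
import numpy as np
import pandas as pd

def merge_window_naive(left, right, on, time_range, how, by=None):
    """Hypothetical reference implementation of the proposed merge_window."""
    func, col = how
    lo, hi = time_range
    out = []
    for _, row in left.iterrows():
        # rows of `right` whose timestamp falls in [t + lo, t + hi] ...
        mask = (right[on] >= row[on] + lo) & (right[on] <= row[on] + hi)
        # ... and whose `by` key matches, if a key was given
        if by is not None:
            mask &= right[by] == row[by]
        out.append(func(right.loc[mask, col]))
    return left.assign(**{col: out})
```

With this sketch, the first call above would attach a `volume` column to df1 holding the per-row windowed sums from df2.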

I imagine I could specify different columns:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms),
             how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

By the way, this is a more general form of the .rolling() functions @jreback wrote recently, since I can just pass one DataFrame as both arguments:

merge_window(df, df, on='time', time_range=(-5*ms, 0*ms), how=[(np.median, 'price'), (np.sum, 'volume')])
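For this trailing-window self-join case, current pandas can already express something close with a time-based .rolling() on the timestamp column. A small sketch (note the endpoint semantics: offset windows are half-open by default, so closed='both' is needed to match an inclusive range; a 5-second window is used here for readability):

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.date_range('2016-01-01 09:30:00', periods=5, freq='s'),
    'price': [10.0, 11.0, 12.0, 11.5, 13.0],
    'volume': [100, 200, 150, 300, 250],
})

# Trailing five-second window ending at each row's own timestamp,
# inclusive on both ends; requires `time` to be monotonic increasing.
roll = df.rolling('5s', on='time', closed='both')
out = df.assign(med_price=roll['price'].median(),
                tot_volume=roll['volume'].sum())
```

The general two-frame, by-key version in the proposal is what .rolling() cannot express today.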

My firm has custom business calendars for things like market hours as well as exchange holidays. The kdb version takes arrays of timestamps directly for the begin and end bounds, which handles the general-purpose case of custom business calendars. So I imagine I could get the five-day average volume with:

# internal functions
cal = get_our_internal_business_calendar()
begin = adjust_our_time(df.time, -5, cal)
end = adjust_our_time(df.time, 0, cal)

# use values directly
merge_window(df1, df2, on='time', begin_times=begin, end_times=end, how=(np.mean, 'volume'))
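A sketch of how the explicit begin/end form could be implemented efficiently (merge_window_bounds is a hypothetical name; it assumes the right frame is sorted ascending on its time column so np.searchsorted applies, and treats both bounds as inclusive):

```python
import numpy as np
import pandas as pd

def merge_window_bounds(left, right, on, begin_times, end_times, how):
    """Per-row [begin, end] window aggregation via binary search."""
    func, col = how
    times = right[on].to_numpy()  # must be sorted ascending
    # index of the first row at or after each begin, and one past
    # the last row at or before each end
    lo = np.searchsorted(times, np.asarray(begin_times), side='left')
    hi = np.searchsorted(times, np.asarray(end_times), side='right')
    vals = [func(right[col].iloc[i:j]) for i, j in zip(lo, hi)]
    return left.assign(**{col: vals})
```

The two binary searches cost O(m log n), and nothing in them assumes the bounds came from a fixed offset, so arbitrary calendar-adjusted begin/end arrays work unchanged.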

I can get started on this right away if my proposal makes sense.

@jreback added the Datetime, Reshaping, API Design, and Difficulty Advanced labels on Aug 10, 2016
@jreback jreback added this to the Next Major Release milestone Aug 10, 2016
@jreback
Contributor

jreback commented Aug 10, 2016

so I think some spelling is in order here to make these easier to grok (and it should apply generally to merging operations); maybe something like this:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms), how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

as

(df1.merging(df2, left_on='trade_time', right_on='exchange_time')
    .window(time=(0*ms, 500*ms))
    .agg({'total_volume':(np.sum, 'volume'), 
          'mean_volume':(np.mean, 'volume')})
)

so .merging is like a groupby/rolling/resample operator, and the next .window() gives the type of merge (e.g. it could also support .asof(), .ordered())

@TomAugspurger
Contributor

👍 on Jeff's spelling. The original function seems to be doing a lot, better to split it up into components that are understandable.

The .window step could be generalized somewhat. e.g. https://github.com/dgrtwo/fuzzyjoin has some examples on numeric and word distance.

@chrisaycock
Contributor Author

cc @wesm

@chrisaycock
Contributor Author

A quick example, as much for my benefit:

import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})

df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})

Suppose I want to compute this:

s = pd.Timedelta(1, 's')
merge_window(df1, df2, on='date', time_range=(-6*s, 2*s), how=(np.sum, 'value'))

The result would be

df1['value'] = [df2[(df2.date >= start) & (df2.date <= stop)]['value'].sum()
                for start, stop in zip(df1.date - 6*s, df1.date + 2*s)]

I'm going to take a crack at this now. I won't have Jeff's notation to start since I think that's a bigger conversation. But I do want to get this functionality as soon as possible.

@wesm
Member

wesm commented Aug 25, 2016

Having a composable API for joins in general would be nice. How do you imagine the asof version would look?

As an aside, the syntax for describing aggregates leaves a lot to be desired (this is not specific to this proposal). I spent a bunch of time thinking about how to do this (you can see what I came up with in some of the examples in http://docs.ibis-project.org/generated-notebooks/2.html#aggregating-joined-table-with-metrics-involving-more-than-one-base-reference). We may be able to study some other DSL (e.g. Blaze) to think about deferred join syntax. Adding any kind of expression DSL becomes a deep rabbit hole (e.g. deterministically resolving output types). Maybe something that should wait for pandas 2.0

@zak-b2c2

zak-b2c2 commented Feb 3, 2021

Hi, what's the status on this guy?

@jreback
Contributor

jreback commented Feb 3, 2021

@zak-b2c2 pandas is a community-led project, and so contributions move it forward

you are welcome to submit a pull request

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Projects: none yet
Development: no branches or pull requests

7 participants