kdb-like window join for pandas #13959

Open
chrisaycock opened this issue Aug 10, 2016 · 7 comments
Labels
Datetime (Datetime data dtype) · Enhancement · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Window (rolling, ewma, expanding)

Comments

@chrisaycock
Contributor

I would like a time-based aggregating function that also acts as a join:

http://code.kx.com/wiki/Reference/wj

In pandas, this might look something like:

ms = pd.Timedelta(1, 'ms')
merge_window(df1, df2, on='time', by='ticker', time_range=(-100*ms, 100*ms), how=(np.sum, 'volume'))

This would compute the total volume from df2 whose timestamps are within (-100, 100) ms of df1’s timestamps and whose tickers match.
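Since merge_window does not exist in pandas, one way to pin down the proposed semantics is a naive reference sketch (all names here are hypothetical; this is O(len(left) × len(right)) and only meant to define the behavior, not to be fast):

```python
import numpy as np
import pandas as pd

def merge_window_naive(left, right, on, time_range, how, by=None):
    """Hypothetical reference implementation of the proposed merge_window."""
    func, col = how
    lo, hi = time_range
    out = []
    for _, row in left.iterrows():
        # rows of `right` whose timestamp falls in [t + lo, t + hi] ...
        mask = (right[on] >= row[on] + lo) & (right[on] <= row[on] + hi)
        # ... and whose `by` key matches, if a key was given
        if by is not None:
            mask &= right[by] == row[by]
        out.append(func(right.loc[mask, col]))
    return left.assign(**{col: out})
```

With this sketch, the first call above would attach a `volume` column to df1 holding the per-row windowed sums from df2.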

I imagine I could specify different columns:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms),
             how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

By the way, this is a more general form of the .rolling() functions @jreback wrote recently, since I can just pass one DataFrame as both arguments:

merge_window(df, df, on='time', time_range=(-5*ms, 0*ms), how=[(np.median, 'price'), (np.sum, 'volume')])
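For this trailing-window self-join case, current pandas can already express something close with a time-based .rolling() on the timestamp column. A small sketch (note the endpoint semantics: offset windows are half-open by default, so closed='both' is needed to match an inclusive range; a 5-second window is used here for readability):

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.date_range('2016-01-01 09:30:00', periods=5, freq='s'),
    'price': [10.0, 11.0, 12.0, 11.5, 13.0],
    'volume': [100, 200, 150, 300, 250],
})

# Trailing five-second window ending at each row's own timestamp,
# inclusive on both ends; requires `time` to be monotonic increasing.
roll = df.rolling('5s', on='time', closed='both')
out = df.assign(med_price=roll['price'].median(),
                tot_volume=roll['volume'].sum())
```

The general two-frame, by-key version in the proposal is what .rolling() cannot express today.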

My firm has custom business calendars for things like market hours as well as exchange holidays. The kdb version takes arrays of timestamps directly for the begin and end bounds, which handles the general-purpose case of custom business calendars. So I imagine I could get the five-day average volume with:

# internal functions
cal = get_our_internal_business_calendar()
begin = adjust_our_time(df.time, -5, cal)
end = adjust_our_time(df.time, 0, cal)

# use values directly
merge_window(df1, df2, on='time', begin_times=begin, end_times=end, how=(np.mean, 'volume'))
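A sketch of how the explicit begin/end form could be implemented efficiently (merge_window_bounds is a hypothetical name; it assumes the right frame is sorted ascending on its time column so np.searchsorted applies, and treats both bounds as inclusive):

```python
import numpy as np
import pandas as pd

def merge_window_bounds(left, right, on, begin_times, end_times, how):
    """Per-row [begin, end] window aggregation via binary search."""
    func, col = how
    times = right[on].to_numpy()  # must be sorted ascending
    # index of the first row at or after each begin, and one past
    # the last row at or before each end
    lo = np.searchsorted(times, np.asarray(begin_times), side='left')
    hi = np.searchsorted(times, np.asarray(end_times), side='right')
    vals = [func(right[col].iloc[i:j]) for i, j in zip(lo, hi)]
    return left.assign(**{col: vals})
```

The two binary searches cost O(m log n), and nothing in them assumes the bounds came from a fixed offset, so arbitrary calendar-adjusted begin/end arrays work unchanged.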

I can get started on this right away if my proposal makes sense.

@jreback added the Datetime, Reshaping, API Design, and Difficulty Advanced labels on Aug 10, 2016
@jreback jreback added this to the Next Major Release milestone Aug 10, 2016
@jreback
Contributor

jreback commented Aug 10, 2016

so I think some spelling is in order here to make these easier to grok (and it should apply generally to merging operations); maybe something like this:

merge_window(df1, df2, left_on='trade_time', right_on='exchange_time', time_range=(0*ms, 500*ms), how={'total_volume':(np.sum, 'volume'), 'mean_volume':(np.mean, 'volume')})

as

(df1.merging(df2, left_on='trade_time', right_on='exchange_time')
    .window(time=(0*ms, 500*ms))
    .agg({'total_volume':(np.sum, 'volume'), 
          'mean_volume':(np.mean, 'volume')})
)

so .merging is like a groupby/rolling/resample operator, and the next .window() gives the type of merge (e.g. it could also support .asof(), .ordered())

@TomAugspurger
Contributor

👍 on Jeff's spelling. The original function seems to be doing a lot, better to split it up into components that are understandable.

The .window step could be generalized somewhat. e.g. https://github.com/dgrtwo/fuzzyjoin has some examples on numeric and word distance.

@chrisaycock
Contributor Author

cc @wesm

@chrisaycock
Contributor Author

A quick example, as much for my benefit:

import string

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s'),
                    'name': list(string.ascii_uppercase[:20])})

df2 = pd.DataFrame({'date': pd.date_range(start='2016-01-01 09:29:50', periods=15, freq='2s'),
                    'value': np.arange(15, dtype=np.float64)})

Suppose I want to compute this:

s = pd.Timedelta(1, 's')
merge_window(df1, df2, on='date', time_range=(-6*s, 2*s), how=(np.sum, 'value'))

The result would be

df1['value'] = [df2[(df2.date >= start) & (df2.date <= stop)]['value'].sum()
                for start, stop in zip(df1.date - 6*s, df1.date + 2*s)]

I'm going to take a crack at this now. I won't have Jeff's notation to start since I think that's a bigger conversation. But I do want to get this functionality as soon as possible.

@wesm
Member

wesm commented Aug 25, 2016

Having a composable API for joins in general would be nice. How do you imagine the asof version would look?

As an aside, the syntax for describing aggregates leaves a lot to be desired (this is not specific to this proposal). I spent a bunch of time thinking about how to do this (you can see what I came up with in some of the examples in http://docs.ibis-project.org/generated-notebooks/2.html#aggregating-joined-table-with-metrics-involving-more-than-one-base-reference). We may be able to study some other DSL (e.g. Blaze) to think about deferred join syntax. Adding any kind of expression DSL becomes a deep rabbit hole (e.g. deterministically resolving output types). Maybe something that should wait for pandas 2.0

@zak-b2c2

zak-b2c2 commented Feb 3, 2021

Hi, what's the status on this guy?

@jreback
Contributor

jreback commented Feb 3, 2021

@zak-b2c2 pandas is a community-led project, and so contributions move it forward

you are welcome to submit a pull request

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Projects: none yet
Development: no branches or pull requests

7 participants