
transform function is very slow in 0.14 - similar to issue #2121 #7383


Closed
aullrich2013 opened this issue Jun 6, 2014 · 3 comments · Fixed by #7463
Labels: Groupby, Performance
Milestone: 0.15.0

Comments

aullrich2013 commented Jun 6, 2014

I have a dataframe where I'm computing some summary items for groups, and it's convenient to merge those back into the parent dataframe. transform seems the most intuitive way to do this, but I'm seeing much worse performance than doing a manual aggregation and merge.

Starting from issue #2121 marked as closed:

import numpy as np
import pandas as pd
from pandas import Index, MultiIndex, DataFrame

n_dates = 20000
n_securities = 50
n_columns = 2
share_na = 0.1

dates = pd.date_range('1997-12-31', periods=n_dates, freq='B')
dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates))

secid_min = int('10000000', 16)
secid_max = int('F0000000', 16)
step = (secid_max - secid_min) // (n_securities - 1)
security_ids = map(lambda x: hex(x)[2:10].upper(), range(secid_min, secid_max + 1, step))

# each date repeated once per security: 20000 * 50 = 1,000,000 rows
data_index = MultiIndex(levels=[dates.values],
    labels=[[i for i in xrange(n_dates) for _ in xrange(n_securities)]],
    names=['date'])
n_data = len(data_index)

columns = Index(['factor{}'.format(i) for i in xrange(1, n_columns + 1)])

data = DataFrame(np.random.randn(n_data, n_columns), index=data_index, columns=columns)

grouped = data.groupby(level='date')

which gives us a (1000000, 2) dataframe of floats

%timeit grouped.transform(max)

1 loops, best of 3: 8.16 s per loop
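
(A hedged aside, not part of the original report: it may also be worth timing the string alias, since in later pandas versions passing the aggregation by name can dispatch to an optimized cython path; whether that helps on 0.14 is not verified here.)

%timeit grouped.transform('max')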

my alternative is to write something like this:

def mergeTransform(grouped, originalDF):
    # aggregate once per group, then broadcast back to every row via a merge
    item = grouped.max()
    colnames = item.columns
    # keep only the merged (aggregated) columns and restore the original names
    item = originalDF.merge(item, how='outer', left_index=True, right_index=True).ix[:, -item.shape[1]:]
    item.columns = colnames
    return item

which seems to run much faster despite all the indexing going on:

%%timeit 
mergeTransform(grouped, data)

10 loops, best of 3: 81.1 ms per loop
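
(A quick, hedged sanity check, not in the original report: the merged result should broadcast each group's max back onto every row, matching what transform returns.)

expected = grouped.transform(max)
result = mergeTransform(grouped, data)
print(result.shape == expected.shape)                 # expected: True
print(np.allclose(result.values, expected.values))    # expected: True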


jreback commented Jun 6, 2014

see #6496 as well (it's possible the merge is faster in some cases)
it depends on the group density - eg how many groups there are as a function of the total size

would welcome a pull request for these solutions - it's not hard

the existing transform will be faster if density is very high (eg lots of groups)

so it may make sense to separate these 2 cases - but it may not matter
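
(A rough, hedged illustration of the density point, not from the thread: build two frames with the same number of rows but very different numbers of groups, then time both approaches on each.)

import numpy as np
import pandas as pd

n_rows = 1000000

def make(n_groups):
    # n_rows rows spread over roughly n_groups groups
    df = pd.DataFrame({'key': np.random.randint(0, n_groups, n_rows),
                       'val': np.random.randn(n_rows)})
    return df, df.groupby('key')

few_df, few_g = make(100)        # low density: few, large groups
many_df, many_g = make(500000)   # high density: many, tiny groups

# then time each approach on both, e.g.:
# %timeit few_g.transform(max)
# %timeit few_df.merge(few_g.max(), left_on='key', right_index=True, how='left')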

jreback added this to the 0.15.0 milestone Jun 14, 2014

jreback commented Jun 14, 2014

#6496 is somewhat faster in this case (though it only really affects Series.transform)

So this change would go into NDFrame.transform in core/groupby.py

Want to do a PR to see if you can make this type of merging solution work in general?

e.g. put it in place and see if you can get the tests to pass?

we already have this benchmark in vbench, so you can see the effect
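
(For anyone following along, here is a rough, hedged sketch of what a merge-based transform could look like in user space for a named aggregation. It only illustrates the idea; it is not the implementation in #7463, and merge_transform is a hypothetical helper.)

def merge_transform(grouped, df, how='max'):
    # aggregate once per group, then broadcast the result back to every
    # row of the original frame with an index-aligned merge
    agg = grouped.agg(how)
    merged = df.merge(agg, how='left', left_index=True, right_index=True,
                      suffixes=('', '_agg'))
    out = merged[[c + '_agg' for c in agg.columns]]
    out.columns = agg.columns
    return out

# e.g. merge_transform(grouped, data, 'max') should broadcast the per-date max
# back onto all 1,000,000 rows, like grouped.transform(max) but via a merge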


jreback commented Jun 14, 2014

give it a try with #7463
