
transform function is very slow in 0.14 - similar to issue #2121 #7383


Closed
aullrich2013 opened this issue Jun 6, 2014 · 3 comments · Fixed by #7463
Labels: Groupby, Performance
Milestone: 0.15.0

Comments

aullrich2013 commented Jun 6, 2014

I have a dataframe where I'm computing some summary items for groups, and it's convenient to merge those back into the parent dataframe. transform seems the most intuitive way to do this, but I'm seeing much worse performance than doing a manual aggregation and merge.

Starting from issue #2121 marked as closed:

import numpy as np
import pandas as pd
from pandas import Index, MultiIndex, DataFrame

n_dates = 20000
n_securities = 50
n_columns = 2
share_na = 0.1

dates = pd.date_range('1997-12-31', periods=n_dates, freq='B')
dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates))

secid_min = int('10000000', 16)
secid_max = int('F0000000', 16)
step = (secid_max - secid_min) // (n_securities - 1)
security_ids = map(lambda x: hex(x)[2:10].upper(), range(secid_min, secid_max + 1, step))

# each date repeated once per security: 20000 * 50 = 1,000,000 rows
data_index = MultiIndex(levels=[dates.values],
    labels=[[i for i in xrange(n_dates) for _ in xrange(n_securities)]],
    names=['date'])
n_data = len(data_index)

columns = Index(['factor{}'.format(i) for i in xrange(1, n_columns + 1)])

data = DataFrame(np.random.randn(n_data, n_columns), index=data_index, columns=columns)

grouped = data.groupby(level='date')

which gives us a (1000000, 2) dataframe of floats

%timeit grouped.transform(max)

1 loops, best of 3: 8.16 s per loop
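
(A hedged aside, not part of the original report: it may also be worth timing the string alias, since in later pandas versions passing the aggregation by name can dispatch to an optimized cython path; whether that helps on 0.14 is not verified here.)

%timeit grouped.transform('max')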

my alternative is to write something like this:

def mergeTransform(grouped, originalDF):
    # aggregate once per group, then broadcast back to every row via a merge
    item = grouped.max()
    colnames = item.columns
    # keep only the merged (aggregated) columns and restore the original names
    item = originalDF.merge(item, how='outer', left_index=True, right_index=True).ix[:, -item.shape[1]:]
    item.columns = colnames
    return item

which seems to run much faster despite all the indexing going on:

%%timeit 
mergeTransform(grouped, data)

10 loops, best of 3: 81.1 ms per loop
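
(A quick, hedged sanity check, not in the original report: the merged result should broadcast each group's max back onto every row, matching what transform returns.)

expected = grouped.transform(max)
result = mergeTransform(grouped, data)
print(result.shape == expected.shape)                 # expected: True
print(np.allclose(result.values, expected.values))    # expected: True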


jreback commented Jun 6, 2014

see #6496 as well (it's possible the merge is faster in some cases)
it depends on the group density - eg how many groups there are as a function of the total size

would welcome a pull request for these solutions - it's not hard

the existing transform will be faster if density is very high (eg lots of groups)

so it may make sense to separate these 2 cases - but it may not matter
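
(A rough, hedged illustration of the density point, not from the thread: build two frames with the same number of rows but very different numbers of groups, then time both approaches on each.)

import numpy as np
import pandas as pd

n_rows = 1000000

def make(n_groups):
    # n_rows rows spread over roughly n_groups groups
    df = pd.DataFrame({'key': np.random.randint(0, n_groups, n_rows),
                       'val': np.random.randn(n_rows)})
    return df, df.groupby('key')

few_df, few_g = make(100)        # low density: few, large groups
many_df, many_g = make(500000)   # high density: many, tiny groups

# then time each approach on both, e.g.:
# %timeit few_g.transform(max)
# %timeit few_df.merge(few_g.max(), left_on='key', right_index=True, how='left')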

jreback added this to the 0.15.0 milestone Jun 14, 2014

jreback commented Jun 14, 2014

#6496 is somewhat faster in this case (though it only really affects Series.transform)

So this change would go into NDFrame.transform in core/groupby.py

Want to do a PR to see if you can make this type of merging solution work in general?

e.g. put it in place and see if you can get the tests to pass?

we already have this benchmark in vbench, so you can see the effect
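
(For anyone following along, here is a rough, hedged sketch of what a merge-based transform could look like in user space for a named aggregation. It only illustrates the idea; it is not the implementation in #7463, and merge_transform is a hypothetical helper.)

def merge_transform(grouped, df, how='max'):
    # aggregate once per group, then broadcast the result back to every
    # row of the original frame with an index-aligned merge
    agg = grouped.agg(how)
    merged = df.merge(agg, how='left', left_index=True, right_index=True,
                      suffixes=('', '_agg'))
    out = merged[[c + '_agg' for c in agg.columns]]
    out.columns = agg.columns
    return out

# e.g. merge_transform(grouped, data, 'max') should broadcast the per-date max
# back onto all 1,000,000 rows, like grouped.transform(max) but via a merge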


jreback commented Jun 14, 2014

give it a try with #7463
