Skip to content

PERF: Cythonize groupby transforms #4095 #10901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

chris-b1
Copy link
Contributor

closes - #4095 - adding Cythonized versions of shift, cumprod and cumsum, along with a helper GroupBy._cython_transform that other cythonized transform-ops could be added to.

I'm using fused types here instead of adding to generated.pyx - as far as I can tell that is working well, looking back through the old issues (#255) it looks like that wasn't done originally because of older versions of Cython?

0.16.2

In [17]: df = pd.DataFrame({'i':range(1000) * 1000, 
                            'f':np.linspace(0, 2000, 1000000),    
                            'f2':np.linspace(1, 2, 1000000)})

In [18]: %timeit df.groupby('i').cumsum()
1 loops, best of 3: 614 ms per loop

In [19]: %timeit df.groupby('i').shift()
1 loops, best of 3: 499 ms per loop

PR

In [2]: %timeit df.groupby('i').cumsum()
10 loops, best of 3: 58.6 ms per loop

In [3]: %timeit df.groupby('i').shift()
10 loops, best of 3: 73.4 ms per loop

@chris-b1
Copy link
Contributor Author

Travis build is failing with the message below - which seems like it might be an out of memory error?

 gcc: internal compiler error: Killed (program cc1)
 Please submit a full bug report, with preprocessed source if appropriate.

@chris-b1 chris-b1 force-pushed the cython-grouby-transform branch from 9875210 to 539d069 Compare August 29, 2015 22:25
@chris-b1
Copy link
Contributor Author

chris-b1 commented Sep 4, 2015

@jreback - any guesses on why the Travis build is failing? It builds fine locally on windows, also worked in a linux vm.

@jreback
Copy link
Contributor

jreback commented Sep 4, 2015

I think this doesn't build under cython 0.19.1 (our min version).

@jreback jreback added Groupby Performance Memory or execution speed performance labels Sep 4, 2015
@chris-b1
Copy link
Contributor Author

chris-b1 commented Sep 4, 2015

@jreback I'm not sure that's the problem, I just downgraded and it built locally.

@jreback
Copy link
Contributor

jreback commented Sep 4, 2015

looks pretty odd: https://travis-ci.org/pydata/pandas/jobs/77861783, gcc kill, no idea :<

@kawochen
Copy link
Contributor

kawochen commented Sep 7, 2015

This could indicate a gcc memory problem. Refactoring into smaller files should make it easier to compile.

@chris-b1 chris-b1 force-pushed the cython-grouby-transform branch from 539d069 to c671cab Compare September 19, 2015 03:50
@chris-b1 chris-b1 force-pushed the cython-grouby-transform branch from c671cab to de23ddb Compare October 4, 2015 17:31
@chris-b1
Copy link
Contributor Author

chris-b1 commented Oct 4, 2015

@jreback - this is actually building ok now; I didn't do anything but rebase.

@jreback
Copy link
Contributor

jreback commented Oct 4, 2015

why don't we put this in generate_code.py so all the groupby code is together (I know this doesn't actually need generation as it uses the fused types)

as an aside I suspect we can use fused types with quite a bit of the existing groupby code

@jreback jreback added this to the 0.17.1 milestone Oct 11, 2015
@@ -1139,6 +1184,49 @@ def _try_cast(self, result, obj):

return result

def _cython_transform(self, how, numeric_only=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this function here? it duplicates what .aggregate does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does a couple different things (looking up the cython func, creating accum which is an array where running-totals are stored) but a lot is duplicated. I'll try to factor out the common stuff.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2015

need some asv benches for this

@chris-b1 chris-b1 force-pushed the cython-grouby-transform branch from de23ddb to d94947e Compare October 11, 2015 20:07
@chris-b1
Copy link
Contributor Author

@jreback - updated for your comments, moved the cython into generated.pyx, added some benches, and refactored BaseGrouper to put the cythonized transform methods alongside aggregation.

  [b2a547e8] [d94947e]
-    10.37s     8.38ms      0.00  groupby.groupby_transform_cythonized.time_groupby_transform_shift
-    13.68s     6.76ms      0.00  groupby.groupby_transform_cythonized.time_groupby_transform_cumsum
-    13.59s     6.70ms      0.00  groupby.groupby_transform_cythonized.time_groupby_transform_cumprod

for dt in [dtype_str, 'object']:
f = getattr(_algos, "%s_%s" % (fname, dtype_str), None)
if f is not None:
return f

if dtype_str == 'float64':
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I can tell this was only here to handle group_median - which didn't have a dtype name in the function, but is float-only. I just renamed the function to group_median_float64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@chris-b1
Copy link
Contributor Author

@jreback, is there a standardized way to deal with integer overflows? running into that on cumprod - in 0.17.0 the overflowed values were just passed back through.

In [26]: pd.__version__
Out[26]: u'0.17.0'

In [27]: df = pd.DataFrame({'key': ['b'] * 100, 'value': 2})

In [32]: df.groupby('key').cumprod().dtypes
Out[32]: 
value    int64
dtype: object

In [28]: df.groupby('key').cumprod()
Out[28]: 
         value
0            2
1            4
2            8
3           16
4           32
5           64
6          128
7          256
8          512
9         1024
10        2048
11        4096
12        8192
13       16384
14       32768
15       65536
16      131072
17      262144
18      524288
19     1048576
20     2097152
21     4194304
22     8388608
23    16777216
24    33554432
25    67108864
26   134217728
27   268435456
28   536870912
29  1073741824
..         ...
70           0
71           0
72           0

In this case one of the overflowed values happens to be iNaT so the existing cython logic wants to promote this float (but still the wrong answer because it happens after the calc).

In [24]: pd.__version__
Out[24]: '0.17.0+21.g37ea209.dirty'

In [25]: df.groupby('key').cumprod().tail()
Out[25]: 
    value
95      0
96      0
97      0
98      0
99      0

In [26]: df.groupby('key').cumprod().dtypes
Out[26]: 
value    float64
dtype: object

Is the right thing just to define the calculation in terms of floats, and then try to convert back if we can? Or try to match the 0.17.0 behavior?

@jreback
Copy link
Contributor

jreback commented Oct 14, 2015

this should always be a float calc

@chris-b1
Copy link
Contributor Author

@jreback, if you want to take another look, I think I've got your comments addressed.

The reference tests (e.g. gb.cumsum() == gb.apply(lambda x:x.cumsum()) have gotten pretty complex - I probably need to add some simpler ones with hard-coded output too.

try:
result, names = self.grouper.transform(obj.values, how)
except AssertionError as e:
raise GroupByError(str(e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what hits this?

@jreback
Copy link
Contributor

jreback commented Nov 13, 2015

@chris-b1 ok, looks reasonable. pls rebase / squash. ping when green.

@jreback
Copy link
Contributor

jreback commented Nov 15, 2015

@chris-b1 if you could update would be gr8. trying to get this in for 0.17.1

@chris-b1 chris-b1 force-pushed the cython-grouby-transform branch from 0c48739 to 9d11734 Compare November 15, 2015 18:45
@chris-b1
Copy link
Contributor Author

@jreback - this is updated (also added whatsnew) - let me know if it needs anything else.

@jreback
Copy link
Contributor

jreback commented Nov 16, 2015

merged via b07dd0c

thanks!

@jreback jreback closed this Nov 16, 2015
@chris-b1 chris-b1 deleted the cython-grouby-transform branch December 4, 2015 02:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants