-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Cythonize groupby transforms #4095 #10901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Travis build is failing with the message below - which seems like it might be an out of memory error?
|
9875210
to
539d069
Compare
@jreback - any guesses on why the Travis build is failing? It builds fine locally on windows, also worked in a linux vm. |
I think this doesn't build under cython 0.19.1 (our min version). |
@jreback I'm not sure that's the problem, I just downgraded and it built locally. |
looks pretty odd: https://travis-ci.org/pydata/pandas/jobs/77861783, |
This could indicate a gcc memory problem. Refactoring into smaller files should make it easier to compile. |
539d069
to
c671cab
Compare
c671cab
to
de23ddb
Compare
@jreback - this is actually building ok now; I didn't do anything but rebase. |
why don't we put this in generate_code.py so all the groupby code is together (I know this doesn't actually need generation as it uses the fused types) as an aside I suspect we can use fused types with quite a bit of the existing groupby code |
@@ -1139,6 +1184,49 @@ def _try_cast(self, result, obj): | |||
|
|||
return result | |||
|
|||
def _cython_transform(self, how, numeric_only=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this function here? it duplicates what .aggregate
does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does a couple different things (looking up the cython func, creating accum
which is an array where running-totals are stored) but a lot is duplicated. I'll try to factor out the common stuff.
need some asv benches for this |
de23ddb
to
d94947e
Compare
@jreback - updated for your comments, moved the cython into generated.pyx, added some benches, and refactored
|
for dt in [dtype_str, 'object']: | ||
f = getattr(_algos, "%s_%s" % (fname, dtype_str), None) | ||
if f is not None: | ||
return f | ||
|
||
if dtype_str == 'float64': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I can tell this was only here to handle group_median
- which didn't have a dtype name in the function, but is float-only. I just renamed the function to group_median_float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@jreback, is there a standardized way to deal with integer overflows? running into that on In [26]: pd.__version__
Out[26]: u'0.17.0'
In [27]: df = pd.DataFrame({'key': ['b'] * 100, 'value': 2})
In [32]: df.groupby('key').cumprod().dtypes
Out[32]:
value int64
dtype: object
In [28]: df.groupby('key').cumprod()
Out[28]:
value
0 2
1 4
2 8
3 16
4 32
5 64
6 128
7 256
8 512
9 1024
10 2048
11 4096
12 8192
13 16384
14 32768
15 65536
16 131072
17 262144
18 524288
19 1048576
20 2097152
21 4194304
22 8388608
23 16777216
24 33554432
25 67108864
26 134217728
27 268435456
28 536870912
29 1073741824
.. ...
70 0
71 0
72 0 In this case one of the overflowed values happens to be In [24]: pd.__version__
Out[24]: '0.17.0+21.g37ea209.dirty'
In [25]: df.groupby('key').cumprod().tail()
Out[25]:
value
95 0
96 0
97 0
98 0
99 0
In [26]: df.groupby('key').cumprod().dtypes
Out[26]:
value float64
dtype: object Is the right thing just to define the calculation in terms of floats, and then try to convert back if we can? Or try to match the 0.17.0 behavior? |
this should always be a float calc |
@jreback, if you want to take another look, I think I've got your comments addressed. The reference tests (e.g. |
try: | ||
result, names = self.grouper.transform(obj.values, how) | ||
except AssertionError as e: | ||
raise GroupByError(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what hits this?
@chris-b1 ok, looks reasonable. pls rebase / squash. ping when green. |
@chris-b1 if you could update would be gr8. trying to get this in for 0.17.1 |
0c48739
to
9d11734
Compare
@jreback - this is updated (also added whatsnew) - let me know if it needs anything else. |
merged via b07dd0c thanks! |
closes - #4095 - adding Cythonized versions of
shift
,cumprod
andcumsum
, along with a helperGroupBy._cython_transform
that other cythonized transform-ops could be added to.I'm using fused types here instead of adding to
generated.pyx
- as far as I can tell that is working well, looking back through the old issues (#255) it looks like that wasn't done originally because of older versions of Cython?0.16.2
PR