PERF: Cythonize Groupby.cummin/cummax (#15048) #15053
Conversation
Random thought, but any presumption on performance vs bottleneck? http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_min (with window = len)
bottleneck will be way slower as it will be called in a python loop. The cython groupby routines have the indexers already and thus everything is a single cython loop.
Ah, you can't put the bn call in a
@MaximilianR no, nothing to do with that. The point of the groupby_* routines is that they take an indexer (already factorized) and make the results per-group. bottleneck is for a full 1-d array. Here we have many variable-sized (the groups) 1-d arrays.
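(For illustration only - a rough Python sketch of the point above, not the actual pandas internals: once the keys are factorized into integer labels, a single pass over the rows can maintain one running minimum per group, whereas a bottleneck-style window call would have to be issued once per group.)

    import numpy as np
    import pandas as pd

    keys = pd.Series(['a', 'b', 'a', 'b', 'a'])
    values = np.array([5.0, 2.0, 3.0, 1.0, 4.0])

    # factorize once: labels[i] is the integer group id of row i
    labels, uniques = pd.factorize(keys)

    # one pass over the data, one running minimum per group
    accum = np.full(len(uniques), np.inf)
    out = np.empty_like(values)
    for i, lab in enumerate(labels):
        accum[lab] = min(accum[lab], values[i])
        out[i] = accum[lab]

    print(out)  # [5. 2. 3. 1. 3.]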
lgtm. pls add a whatsnew note & look at the build log - there's a linting issue with a cython file.
    self.df.groupby('A').cummin()

def time_cummax(self):
    self.df.groupby('A').cummax()
There are already `time_cummin` and `time_cummax` functions below. Do these test something else specifically?
@@ -1417,6 +1418,26 @@ def cumsum(self, axis=0, *args, **kwargs):

    @Substitution(name='groupby')
    @Appender(_doc_template)
    def cummin(self, axis=0, *args, **kwargs):
        """Cumulative min for each group"""
        nv.validate_groupby_func('cummin', args, kwargs)
cummin and cummax are not numpy functions I think? So in that case this line (and the args, kwargs) are not needed?
Correct, so we probably don't need to pass it through `nv.validate_groupby_func`.
@@ -75,7 +75,7 @@

'last', 'first',
'head', 'tail', 'median',
'mean', 'sum', 'min', 'max',
'cumsum', 'cumprod', 'cummin', 'cummax', 'cumcount',
If cummin and cummax can be removed, should cumsum and cumprod also not be removed?
Do you mean "also be removed" instead of "also not be removed"? If so, you're probably correct, I'll try removing them to see if anything breaks.
if val == val:
    if j == 0:
        min_val = val
    elif val < min_val:
This logic doesn't seem right - I think you need to be comparing to a group-specific min, something like this?

    if val < accum[lab, j]:
        ...
Ah thanks! Was doing some local tests and noticed the results didn't exactly match the alternative `groupby.apply(np.minimum.accumulate)`. Will check this out tonight.
Force-pushed e811514 to 6a54faf
Current coverage is 84.75% (diff: 53.84%)

@@             master   #15053    diff @@
==========================================
  Files           145      145
  Lines         51220    51232     +12
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43415    43421      +6
- Misses         7805     7811      +6
  Partials          0        0
Added a whatsnew and a test. I don't think my implementation is entirely correct yet so I am expecting the test I added to fail. Is there a conventional way to debug cython code? Ideally, I'd like to drop a
copy this to python, debug and assert correctness, then cythonize it. note you will have to hack the calling function (to make sure it is imported / called correctly while you do this). there are ways to debug cython code, just google. @chrisaycock may be able to elaborate a bit.
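(A minimal sketch of that workflow, with made-up names - `group_cummin_py` is a hypothetical pure-Python prototype, checked against the apply-based reference before any Cython is written.)

    import numpy as np
    import pandas as pd

    def group_cummin_py(values, labels, ngroups):
        """Pure-Python prototype: cumulative min within each label group."""
        accum = np.full(ngroups, np.inf)      # running minimum per group
        out = np.empty(len(values))
        for i, lab in enumerate(labels):
            val = values[i]
            if val == val:                    # NaN fails this check (nan != nan)
                accum[lab] = min(accum[lab], val)
                out[i] = accum[lab]
            else:
                out[i] = np.nan
        return out

    df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                       'B': [3.0, np.nan, 2.0, 5.0, 1.0]})
    labels, uniques = pd.factorize(df['A'])
    result = group_cummin_py(df['B'].values, labels, len(uniques))

    # assert correctness against the existing apply-based path
    expected = df.groupby('A')['B'].apply(lambda x: x.cummin()).values
    np.testing.assert_array_equal(result, expected)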
The only way I've done debugging with Cython has been to add
Thanks for the tips @jreback and @chrisaycock!
Force-pushed 277182c to f0d1b24
Made the changes related to your comments @jorisvandenbossche, and added a whatsnew & passing tests @jreback. All green.
N, K = (<object> values).shape
accum = np.empty_like(accum)
accum[:] = _int64_max
this is not right; it needs to be nan if val != val
Since `val` is always set before the comparison, shouldn't `val == val` always equal `True`?

    val = values[i, j]
    if val == val:
        ...

I notice this pattern in `group_cumsum` and `group_cumprod` as well. Should those have `nan if val != val` as well?
nan != nan
Ah I see, thanks for the insight. I'll be sure to add that condition
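(For reference, the property being relied on: NaN is the one floating-point value that does not compare equal to itself, which is why `val == val` works as a missing-value check inside the loop.)

    import numpy as np

    val = np.nan
    print(val == val)   # False -> val is NaN
    print(val != val)   # True
    print(2.5 == 2.5)   # True  -> ordinary values pass the check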
This value (`_int64_max`) needs to be type-specific - e.g., here's a failing test case:

    In [7]: df = pd.DataFrame({'key': [1], 'value': 1e99})

    In [8]: df.groupby('key').cummin()
    Out[8]:
              value
    0  1.104865e-311
Thanks for the heads-up @chris-b1! Would that mean creating 3 different `group_cummin`/`group_cummax` functions to accommodate `int32`, `float32` and `float64`? (Like the other group_* functions in this file.) It doesn't seem possible to get the `dtype` directly from the values array.
Yeah, probably the easiest way would be to add to the templated code - look at `group_min`/`max` - it is already dumping the max/min value into the template.

The other option (just fyi) - `numeric` here is a Cython fused type - you can check and specialize on it like this:

    if numeric is int64_t:
        ...
    elif numeric is float32_t:
        ...
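(As a side note, the dtype-specific extremes that end up in the template can be looked up from NumPy; a small sketch, separate from the actual template machinery in the pandas source.)

    import numpy as np

    # exact integer bounds
    print(np.iinfo(np.int64).max)    # 9223372036854775807
    print(np.iinfo(np.int32).max)    # 2147483647

    # float bounds: finfo extremes, or +/-inf also work as sentinels
    print(np.finfo(np.float32).max)  # ~3.4028235e+38
    print(np.finfo(np.float64).max)  # ~1.7976931348623157e+308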
tm.assert_frame_equal(result, expected)

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': range(8, 2, -1)})
needs tests with nans and different dtypes
needs test against doing .apply(lambda x: x.cummin()) for validation
Force-pushed cb171c5 to 0b0df6b
Added additional tests and addressed the `val != val` condition @jreback. All green.
Force-pushed 8dc0aed to 2fc025b
Thanks for the tips @chris-b1, I was able to refactor the algorithm to handle dtype-specific max/min values.
Couple minor things, otherwise looks good to me
int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
This call isn't necessary, `accum` is already allocated to the right size/type.
        accum[lab, j] = min_val
        out[i, j] = accum[lab, j]
    # val = nan
    elif val != val:
Make this an else
int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
same as above
        accum[lab, j] = max_val
        out[i, j] = accum[lab, j]
    # val = nan
    elif val != val:
same as above
def test_cummin_cummax(self):
    # GH 15048
    # Test with different dtypes
    for dtype in [np.int32, np.float32, np.float64]:
test with int64 as well; datetime types should also work (e.g. datetime64, timedelta64). (if not pls comment; there are some issues I am working on elsewhere).
also test with `object` dtype (which we don't expect to work, I think it should raise). but we can handle this in another issue.
You're correct about using the `object` dtype. It raises this error:

    DataError: No numeric types to aggregate

Will test the other dtypes later tonight.
result = df.groupby('A').cummin()
expected = pd.DataFrame({'B': [3, 3, 3, 6, 6, 6]}).astype(dtype)
tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
put a blank line in between sections
expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
tm.assert_frame_equal(result, expected)

# Test min value for dtype
try:
instead of this, pass these in with the dtype, IOW change your loop to be

    for dtype, min_value in [(np.int32, np.iinfo(np.int32).min), (np.float32.....)]
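(Spelled out, that parametrized loop could look something like the sketch below; the int64/float64 entries beyond the two dtypes named above are my additions.)

    import numpy as np

    # pair each dtype with its minimum sentinel up front, no try/except needed
    dtype_min_pairs = [
        (np.int32, np.iinfo(np.int32).min),
        (np.int64, np.iinfo(np.int64).min),
        (np.float32, np.finfo(np.float32).min),
        (np.float64, np.finfo(np.float64).min),
    ]

    for dtype, min_value in dtype_min_pairs:
        # ... build the test frame with `min_value`, run cummin, and compare
        print(dtype, min_value)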
tm.assert_frame_equal(result, expected)

# Test max value for dtype
try:
    max_val = np.iinfo(dtype).max
same
expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)

# Some entries in column are nan
so this needs to work for float types (32 and 64), as well as datetime64/timedelta64 types (which use NaT).
tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)
also test where first element is nan.
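(A sketch of that case - hypothetical data, using the same apply-based reference as the other tests; `group_keys=False` just keeps the original row index so the frames line up.)

    import numpy as np
    import pandas as pd

    # first value in each group is NaN
    df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                       'B': [np.nan, 4.0, 3.0, np.nan, 2.0, 1.0]})

    result = df.groupby('A').cummin()
    expected = (df.groupby('A', group_keys=False).B
                  .apply(lambda x: x.cummin()).to_frame())
    pd.testing.assert_frame_equal(result, expected)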
Force-pushed e0a9cc3 to 8ee56ee
@jreback Added the additional nan test & restructured the tests to pass in min and max values. Additionally performing cummin/max on
pep8, removed args, kwargs
add whatsnew + test + changed logic
small error in algo
Use a more obvious test
Fixed algo & test passed
Add dtypes test
Add additional tests
handle nan case
Adapt max/min for different dtypes + tests
remove unnecessary comments
Added test & adjust algorithm
Fix linting errors
Force-pushed 8ee56ee to 5e8ba63
def group_cummin_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                          ndarray[{{dest_type2}}, ndim=2] values,
                          ndarray[int64_t] labels,
                          ndarray[{{dest_type2}}, ndim=2] accum):
so I am puzzled why we even pass in `accum` here at all (rather than just allocate it here). It is only used internally. This would affect the groupby code and cumsum, cumprod as well as these new funcs.
@mroeschke if you want to do a followup to remove it after would be great.
@jreback - I had done this pattern in the `cumsum` code. IIRC, it was because I was using fused types / memoryviews, and I didn't see a way to do the allocation. It would be straightforward to avoid in the templated code. By example:

    def func(numeric[:] arr, int a, int b):
        # how would you allocate an (a x b) array of type `numeric`?
ahh I see. ok, in the templated code this should be straightforward then. after seeing how easy templated code is, I much prefer it to fused types :>
thanks @mroeschke. The other dtypes are fixed in #15054 (e.g. datetime/timedelta). If you'd do that followup to remove `accum`, that would be great!
e.g. in #15054
the merged PR fails against master, not sure why the tests didn't catch this.
The culprit is this. This should only happen if it's an actual integer type on values.dtype (and NOT datetime64/timedelta64/bool).
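(To make that concrete - a hypothetical sketch of the kind of dtype guard being described, not the actual fix: only genuine integer dtypes should get the int64 fill, even though datetime64/timedelta64/bool are int64-backed internally.)

    import numpy as np

    def needs_int64_fill(dtype):
        # True only for real signed/unsigned integer dtypes; excludes
        # bool ('b'), datetime64 ('M') and timedelta64 ('m')
        return np.dtype(dtype).kind in 'iu'

    print(needs_int64_fill(np.int64))           # True
    print(needs_int64_fill('datetime64[ns]'))   # False
    print(needs_int64_fill(np.bool_))           # False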
Apologies for not flagging this last night. I made a naive assumption that Travis would catch this and the build would fail. Thanks for pointing out the root issue. I'll look into addressing it!
closes pandas-dev#15048

Author: Matt Roeschke <[email protected]>

Closes pandas-dev#15053 from mroeschke/improve_cummin_cummax and squashes the following commits:

5e8ba63 [Matt Roeschke] PERF: Cythonize Groupby.cummin/cummax (pandas-dev#15048)
git diff upstream/master | flake8 --diff
Still need to add some tests to see if this is equivalent to the old implementation (numpy.minimum.accumulate?). Any early feedback appreciated.