
PERF: Cythonize Groupby.cummin/cummax (#15048) #15053


Closed
wants to merge 1 commit into master from improve_cummin_cummax

Conversation

mroeschke
Member

@mroeschke mroeschke commented Jan 4, 2017


   before     after       ratio
  [6eb705f5] [e8115142]
-     1.08s     1.74ms      0.00  groupby.groupby_cummax_cummin.time_cummax
-     1.09s     1.73ms      0.00  groupby.groupby_cummax_cummin.time_cummin

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Still need to add some tests to see if this is equivalent to the old implementation (numpy.minimum.accumulate?). Any early feedback appreciated.
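For readers who want to sanity-check that equivalence, the per-group reference mentioned above can be reproduced with numpy.minimum.accumulate (toy data for illustration, not the PR's benchmark setup):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2, 2], "value": [3.0, 1.0, 5.0, 4.0]})

# Reference result: running minimum computed independently within each group
reference = df.groupby("key")["value"].transform(
    lambda s: np.minimum.accumulate(s.to_numpy())
)
print(reference.tolist())  # [3.0, 1.0, 5.0, 4.0]
```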

@max-sixty
Contributor

Random thought, but any presumption on performance vs bottleneck? http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_min (with window = len)

@jreback
Contributor

jreback commented Jan 4, 2017

bottleneck will be way slower as it will be called in a python loop. The cython groupby routines have the indexers already and thus everything is a single cython loop.

@max-sixty
Contributor

bottleneck will be way slower as it will be called in a python loop

Ah, you can't put the bn call in a with nogil context manager?

@jreback
Contributor

jreback commented Jan 4, 2017

@MaximilianR no, nothing to do with that.

The point of the groupby_* routines is that they take an indexer (already factorized) and compute the results per-group.

bottleneck is for a full 1-d array. Here we have many variable sized (the groups) 1-d arrays.
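A pure-Python model of that single-loop pattern (a sketch for illustration, not the actual pandas cython routine) might look like:

```python
import numpy as np

def group_cummin_sketch(values, labels, ngroups):
    """Hypothetical model of the cythonized loop: `labels` is the
    factorized group indexer; `accum` holds one running min per group."""
    accum = np.full(ngroups, np.inf)
    out = np.empty(len(values))
    for i, val in enumerate(values):
        lab = labels[i]
        if val == val:                # NaN never equals itself
            if val < accum[lab]:
                accum[lab] = val
            out[i] = accum[lab]
        else:                         # propagate missing values
            out[i] = np.nan
    return out

print(group_cummin_sketch(np.array([3.0, 1.0, 5.0, 4.0]),
                          np.array([0, 0, 1, 1]), 2))  # [3. 1. 5. 4.]
```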

@jreback jreback added Performance Memory or execution speed performance Groupby labels Jan 4, 2017
@jreback
Contributor

jreback commented Jan 4, 2017

lgtm. pls add a whatsnew note & look at the build log, linting issue with a cython file

        self.df.groupby('A').cummin()

    def time_cummax(self):
        self.df.groupby('A').cummax()
Member

There are already time_cummin and time_cummax functions below. Do these test something else specifically?

Member Author

Ah thanks for pointing that out. These benches were transferred directly from #15048, but they are essentially redundant compared to the existing benches: df.groupby('key')['value'].cummax() (old benchmark) vs df.groupby('key').cummax() (#15048)

@@ -1417,6 +1418,26 @@ def cumsum(self, axis=0, *args, **kwargs):

    @Substitution(name='groupby')
    @Appender(_doc_template)
    def cummin(self, axis=0, *args, **kwargs):
        """Cumulative min for each group"""
        nv.validate_groupby_func('cummin', args, kwargs)
Member

cummin and cummax are not numpy functions I think? So in that case this line (and the args, kwargs) are not needed?

Member Author

Correct, therefore probably don't need to pass it through nv.validate_groupby_func

@@ -75,7 +75,7 @@
'last', 'first',
'head', 'tail', 'median',
'mean', 'sum', 'min', 'max',
'cumsum', 'cumprod', 'cummin', 'cummax', 'cumcount',
Member

If cummin and cummax can be removed, should cumsum and cumprod also not be removed?

Member Author

Do you mean "also be removed" instead of "also not be removed"? If so, you're probably correct, I'll try removing them to see if anything breaks.

    if val == val:
        if j == 0:
            min_val = val
        elif val < min_val:
Contributor

This logic doesn't seem right - I think you need to be comparing to a group specific min, something like this?

if val < accum[lab, j]: 
   ...

Member Author

Ah thanks! Was doing some local tests and noticed the results didn't exactly match the alternative groupby.apply(np.minimum.accumulate). Will check this out tonight.

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from e811514 to 6a54faf Compare January 5, 2017 08:17
@codecov-io

codecov-io commented Jan 5, 2017

Current coverage is 84.75% (diff: 53.84%)

Merging #15053 into master will decrease coverage by <.01%

@@             master     #15053   diff @@
==========================================
  Files           145        145          
  Lines         51220      51232    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43415      43421     +6   
- Misses         7805       7811     +6   
  Partials          0          0          

Powered by Codecov. Last update c71f214...5e8ba63

@mroeschke
Member Author

mroeschke commented Jan 5, 2017

Added a whatsnew and a test. I don't think my implementation is entirely correct yet so I am expecting the test I added to fail.

Is there a conventional way to debug cython code? Ideally, I'd like to drop a pdb.set_trace()-like functionality to see what's going on.

@jreback
Contributor

jreback commented Jan 5, 2017

copy this to python, debug and assert correctness, then cythonize it. note you will have to hack the calling function (to make sure it is imported / called correctly while you do this).

there are ways to debug cython code, just google. @chrisaycock may be able to elaborate a bit.

@chrisaycock
Contributor

The only way I've done debugging with Cython has been to add print() statements and to run samples in the Python interpreter. So really ad hoc.

@mroeschke
Member Author

Thanks for the tips @jreback and @chrisaycock!

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from 277182c to f0d1b24 Compare January 7, 2017 09:03
@mroeschke mroeschke changed the title [WIP] PERF: Cythonize Groupby.cummin/cummax (#15048) PERF: Cythonize Groupby.cummin/cummax (#15048) Jan 7, 2017
@mroeschke
Member Author

Made the changes related to your comments @jorisvandenbossche, and added a whatsnew & passing tests @jreback. All green.


N, K = (<object> values).shape
accum = np.empty_like(accum)
accum[:] = _int64_max
Contributor

this is not right it needs to be nan if val != val

Member Author

Since val is always set before the comparison, shouldn't val == val always equal True?

val = values[i, j]
if val == val:
    ....

I notice this pattern in group_cumsum and group_cumprod as well. Should those have nan if val != val as well?

Contributor

nan != nan
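A one-liner makes the point concrete:

```python
nan = float("nan")
print(nan == nan)  # False: IEEE 754 NaN never compares equal to itself
print(nan != nan)  # True: so `val != val` is a cheap missing-value check
```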

Member Author

Ah I see, thanks for the insight. I'll be sure to add that condition

Contributor

This value (_int64_max) needs to be type specific - e.g., here's a failing test case

In [7]: df = pd.DataFrame({'key':[1], 'value': 1e99})

In [8]: df.groupby('key').cummin()
Out[8]: 
           value
0  1.104865e-311

Member Author

@mroeschke mroeschke Jan 9, 2017

Thanks for the heads-up @chris-b1! Would that mean creating 3 different group_cummin/group_cummax functions to accommodate int32, float32 and float64? (Like the other group_* functions in this file).

It doesn't seem possible to get the dtype directly from the values array.

Contributor

Yeah, probably the easiest way would be to add to the templated code - look at group_min/max - it is already dumping the max/min value into the template.

The other option (just fyi) - numeric here is a cython fused type - you can check and specialize on it like this.

if numeric is int64_t:
    ...
elif numeric is float32_t:
    ...

tm.assert_frame_equal(result, expected)

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': range(8, 2, -1)})
Contributor

needs tests with nans and different dtypes

needs test against doing .apply(lambda x: x.cummin()) for validation

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from cb171c5 to 0b0df6b Compare January 8, 2017 20:30
@mroeschke
Member Author

Added additional tests and addressed the val != val condition @jreback. All green.

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from 8dc0aed to 2fc025b Compare January 9, 2017 06:51
@mroeschke
Member Author

Thanks for the tips @chris-b1, I was able to refactor the algorithm to handle different max/min dtypes.

Contributor

@chris-b1 chris-b1 left a comment

Couple minor things, otherwise looks good to me

int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
Contributor

This call isn't necessary, accum is already allocated to the right size/type

accum[lab, j] = min_val
out[i, j] = accum[lab, j]
# val = nan
elif val != val:
Contributor

Make this an else

int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
Contributor

same as above

accum[lab, j] = max_val
out[i, j] = accum[lab, j]
# val = nan
elif val != val:
Contributor

same as above

def test_cummin_cummax(self):
# GH 15048
# Test with different dtypes
for dtype in [np.int32, np.float32, np.float64]:
Contributor

@jreback jreback Jan 9, 2017

test with int64, datetime types should work as well (e.g. datetime64, timedelta64). (if not pls comment; there are some issues I am working on elsewhere).

Contributor

also test with object dtype (which we don't expect to work, I think it should raise). but we can handle this in another issue.

Member Author

You're correct about using the object dtype. It raises this error:

DataError: No numeric types to aggregate

Will test the other dtypes later tonight.

result = df.groupby('A').cummin()
expected = pd.DataFrame({'B': [3, 3, 3, 6, 6, 6]}).astype(dtype)
tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
Contributor

put a blank line in between sections

expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
tm.assert_frame_equal(result, expected)
# Test min value for dtype
try:
Contributor

instead of this, pass these in with the dtype, IOW change your loop to be

for dtype, min_value in [(np.int32, np.iinfo(np.int32).min), (np.float32.....)]
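Spelled out fully, the suggested loop could look like this (an illustrative sketch; the elided pairs from the comment above are filled in here as an assumption, not as the PR's actual test code):

```python
import numpy as np

# Candidate (dtype, min_value) pairs for the parametrized test loop
cases = [
    (np.int32, np.iinfo(np.int32).min),
    (np.float32, np.finfo(np.float32).min),
    (np.float64, np.finfo(np.float64).min),
]
for dtype, min_value in cases:
    # each sentinel round-trips through its own dtype unchanged
    assert np.array([min_value], dtype=dtype)[0] == min_value
```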

tm.assert_frame_equal(result, expected)
# Test max value for dtype
try:
max_val = np.iinfo(dtype).max
Contributor

same

expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)

# Some entries in column are nan
Contributor

so this needs to work for float types (32 and 64). as well as datetime64/timedelta64 types (which use NaT)

tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)

Contributor

also test where first element is nan.


@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from e0a9cc3 to 8ee56ee Compare January 11, 2017 06:20
@mroeschke
Member Author

@jreback Added the additional nan test & restructured the tests to pass in min and max values.

Additionally, performing cummin/max on datetime, timedelta and object dtypes all raises the same error: DataError: No numeric types to aggregate

In [16]: df = pd.DataFrame({'A':[5]*5+[6]*5,'B':range(10)})

In [17]: df
Out[17]: 
   A  B
0  5  0
1  5  1
2  5  2
3  5  3
4  5  4
5  6  5
6  6  6
7  6  7
8  6  8
9  6  9

In [18]: df['B'] = df['B'].astype(object)

In [19]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

In [20]: df['B'] = pd.date_range(2015, freq='10min', periods=10)

In [21]: df
Out[21]: 
   A                             B
0  5 1970-01-01 00:00:00.000002015
1  5 1970-01-01 00:10:00.000002015
2  5 1970-01-01 00:20:00.000002015
3  5 1970-01-01 00:30:00.000002015
4  5 1970-01-01 00:40:00.000002015
5  6 1970-01-01 00:50:00.000002015
6  6 1970-01-01 01:00:00.000002015
7  6 1970-01-01 01:10:00.000002015
8  6 1970-01-01 01:20:00.000002015
9  6 1970-01-01 01:30:00.000002015

In [22]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

In [26]: df['B'] = [pd.Timedelta(n, unit='d') for n in range(10)]

In [27]: df
Out[27]: 
   A      B
0  5 0 days
1  5 1 days
2  5 2 days
3  5 3 days
4  5 4 days
5  6 5 days
6  6 6 days
7  6 7 days
8  6 8 days
9  6 9 days

In [28]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

pep8, removed args, kwargs

add whatsnew + test + changed logic

small error in algo

Use a more obvious test

Fixed algo & test passed

Add dtypes test

Add additional tests

handle nan case

Adapt max/min for different dtypes + tests

remove unnecessary comments

Added test & adjust algorithm

Fix linting errors
@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from 8ee56ee to 5e8ba63 Compare January 11, 2017 06:55
def group_cummin_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                          ndarray[{{dest_type2}}, ndim=2] values,
                          ndarray[int64_t] labels,
                          ndarray[{{dest_type2}}, ndim=2] accum):
Contributor

@jreback jreback Jan 11, 2017

so I am puzzled why we even pass in accum here at all (rather than just allocate it here). It is only used internally. This would affect the groupby code and cumsum, cumprod as well as these new funcs.

@mroeschke if you want to do a followup to remove it after would be great.

Contributor

@jreback - I had done this pattern in the cumsum code. IIRC, it was because I was using fused types / memoryviews, and I didn't see a way to do the allocation. It would be straightforward to avoid in the templated code.

By example

def func(numeric[:] arr, int a, int b):
    # how would you allocate an (a x b) array of type `numeric?`

Contributor

ahh I see. ok, in the templated code this should be straightforward then. after seeing how easy templated code is, I much prefer it to fused types :>

@jreback jreback added this to the 0.20.0 milestone Jan 11, 2017
@jreback jreback closed this in 0fe491d Jan 11, 2017
@jreback
Contributor

jreback commented Jan 11, 2017

thanks @mroeschke

The other dtypes are fixed in #15054 (e.g. datetime/timedelta).

If you'd do that followup to remove accum would be great!

@jreback
Contributor

jreback commented Jan 11, 2017

e.g. in #15054

In [2]: df
Out[2]: 
   group  int  float string category_string  category_int   datetime                datetimetz  timedelta
0      1    1    4.0      a               a             7 2013-01-01 2013-01-01 00:00:00-05:00   00:00:01
1      1    2    5.0      b               b             8 2013-01-02 2013-01-02 00:00:00-05:00   00:00:02
2      2    3    6.0      c               c             9 2013-01-03 2013-01-03 00:00:00-05:00   00:00:03

In [3]: df.groupby('group').cummin()
Out[3]: 
   int  float  category_int
0    1    4.0             7
1    1    4.0             7
2    3    6.0             9

In [4]: df.groupby('group').cummin(numeric_only=False)
Out[4]: 
   int  float  category_int   datetime                datetimetz  timedelta
0    1    4.0             7 2013-01-01 2013-01-01 05:00:00-05:00   00:00:01
1    1    4.0             7 2013-01-01 2013-01-01 05:00:00-05:00   00:00:01
2    3    6.0             9 2013-01-03 2013-01-03 05:00:00-05:00   00:00:03

@jreback
Contributor

jreback commented Jan 11, 2017

@mroeschke

the merged PR fails against master, not sure why the tests didn't catch this.

(pandas) bash-3.2$ nosetests pandas/tests/groupby/test_groupby.py  -m cummin_cummax -s 
F
======================================================================
FAIL: test_cummin_cummax (test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jreback/pandas/pandas/tests/groupby/test_groupby.py", line 5798, in test_cummin_cummax
    tm.assert_frame_equal(result, expected)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1313, in assert_frame_equal
    obj='DataFrame.iloc[:, {0}]'.format(i))
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1154, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  float64
[right]: int64

----------------------------------------------------------------------
Ran 1 test in 0.044s

FAILED (failures=1)

The culprit is this.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1895

This should only happen if it's an actual integer type on values.dtype (and NOT datetime64/timedelta64/bool)
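That guard can be sketched with NumPy dtype kind codes (hypothetical helper name, not the pandas internals):

```python
import numpy as np

def is_true_integer(dtype):
    """Hypothetical guard: True only for real integer dtypes, excluding
    bool ('b'), datetime64 ('M') and timedelta64 ('m') kind codes."""
    return np.dtype(dtype).kind in "iu"

print(is_true_integer(np.int64))          # True
print(is_true_integer(np.bool_))          # False
print(is_true_integer("datetime64[ns]"))  # False
```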

@mroeschke
Member Author

Apologies for not flagging this last night. I made a naive assumption that travis would catch this and the build would fail:

In [6]: df #np.int64 type
Out[6]: 
   A                    B
0  1                    3
1  1                    4
2  1 -9223372036854775808
3  1                    2
4  2                    2
5  2                    3
6  2 -9223372036854775808
7  2                    1

In [7]: df.groupby('A').cummin()
Out[7]: 
     B
0  3.0
1  3.0
2  NaN
3  NaN
4  2.0
5  2.0
6  NaN
7  NaN

Thanks for pointing out the root issue. I'll look into addressing it!

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#15048

Author: Matt Roeschke <[email protected]>

Closes pandas-dev#15053 from mroeschke/improve_cummin_cummax and squashes the following commits:

5e8ba63 [Matt Roeschke] PERF: Cythonize Groupby.cummin/cummax (pandas-dev#15048)
@mroeschke mroeschke deleted the improve_cummin_cummax branch December 20, 2017 02:04
Successfully merging this pull request may close these issues.

PERF: groupby-cummax,cummin