
PERF: Cythonize Groupby.cummin/cummax (#15048) #15053


Closed
wants to merge 1 commit into master from improve_cummin_cummax

Conversation

mroeschke
Member

@mroeschke mroeschke commented Jan 4, 2017


   before     after       ratio
  [6eb705f5] [e8115142]
-     1.08s     1.74ms      0.00  groupby.groupby_cummax_cummin.time_cummax
-     1.09s     1.73ms      0.00  groupby.groupby_cummax_cummin.time_cummin

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Still need to add some tests to see if this is equivalent to the old implementation (numpy.minimum.accumulate?). Any early feedback appreciated.
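For readers who want to sanity-check that equivalence, the per-group reference mentioned above can be reproduced with numpy.minimum.accumulate (toy data for illustration, not the PR's benchmark setup):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2, 2], "value": [3.0, 1.0, 5.0, 4.0]})

# Reference result: running minimum computed independently within each group
reference = df.groupby("key")["value"].transform(
    lambda s: np.minimum.accumulate(s.to_numpy())
)
print(reference.tolist())  # [3.0, 1.0, 5.0, 4.0]
```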

@max-sixty
Contributor

Random thought, but any presumption on performance vs bottleneck? http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_min (with window = len)

@jreback
Contributor

jreback commented Jan 4, 2017

bottleneck will be way slower as it will be called in a python loop. The cython groupby routines have the indexers already and thus everything is a single cython loop.

@max-sixty
Contributor

bottleneck will be way slower as it will be called in a python loop

Ah, you can't put the bn call in a with nogil context manager?

@jreback
Contributor

jreback commented Jan 4, 2017

@MaximilianR no, nothing to do with that.

The point of the groupby_* routines is that they take an indexer (already factorized) and compute the results per-group.

bottleneck is for a full 1-d array. Here we have many variable sized (the groups) 1-d arrays.
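A pure-Python model of that single-loop pattern (a sketch for illustration, not the actual pandas cython routine) might look like:

```python
import numpy as np

def group_cummin_sketch(values, labels, ngroups):
    """Hypothetical model of the cythonized loop: `labels` is the
    factorized group indexer; `accum` holds one running min per group."""
    accum = np.full(ngroups, np.inf)
    out = np.empty(len(values))
    for i, val in enumerate(values):
        lab = labels[i]
        if val == val:                # NaN never equals itself
            if val < accum[lab]:
                accum[lab] = val
            out[i] = accum[lab]
        else:                         # propagate missing values
            out[i] = np.nan
    return out

print(group_cummin_sketch(np.array([3.0, 1.0, 5.0, 4.0]),
                          np.array([0, 0, 1, 1]), 2))  # [3. 1. 5. 4.]
```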

@jreback jreback added Performance Memory or execution speed performance Groupby labels Jan 4, 2017
@jreback
Contributor

jreback commented Jan 4, 2017

lgtm. pls add a whatsnew note & look at the build log, linting issue with a cython file

        self.df.groupby('A').cummin()

    def time_cummax(self):
        self.df.groupby('A').cummax()
Member

There are already time_cummin and time_cummax functions below. Do these test something else specifically?

Member Author

Ah thanks for pointing that out. These benches were transferred directly from #15048, but they are essentially redundant compared to the existing benches: df.groupby('key')['value'].cummax() (old benchmark) vs df.groupby('key').cummax() (#15048)

@@ -1417,6 +1418,26 @@ def cumsum(self, axis=0, *args, **kwargs):

    @Substitution(name='groupby')
    @Appender(_doc_template)
    def cummin(self, axis=0, *args, **kwargs):
        """Cumulative min for each group"""
        nv.validate_groupby_func('cummin', args, kwargs)
Member

cummin and cummax are not numpy functions I think? So in that case this line (and the args, kwargs) are not needed?

Member Author

Correct, therefore probably don't need to pass it through nv.validate_groupby_func

@@ -75,7 +75,7 @@
'last', 'first',
'head', 'tail', 'median',
'mean', 'sum', 'min', 'max',
'cumsum', 'cumprod', 'cummin', 'cummax', 'cumcount',
Member

If cummin and cummax can be removed, should cumsum and cumprod also not be removed?

Member Author

Do you mean "also be removed" instead of "also not be removed"? If so, you're probably correct, I'll try removing them to see if anything breaks.

    if val == val:
        if j == 0:
            min_val = val
        elif val < min_val:
Contributor

This logic doesn't seem right - I think you need to be comparing to a group specific min, something like this?

if val < accum[lab, j]: 
   ...

Member Author

Ah thanks! Was doing some local tests and noticed the results didn't exactly match the alternative groupby.apply(np.minimum.accumulate). Will check this out tonight.

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from e811514 to 6a54faf Compare January 5, 2017 08:17
@codecov-io

codecov-io commented Jan 5, 2017

Current coverage is 84.75% (diff: 53.84%)

Merging #15053 into master will decrease coverage by <.01%

@@             master     #15053   diff @@
==========================================
  Files           145        145          
  Lines         51220      51232    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43415      43421     +6   
- Misses         7805       7811     +6   
  Partials          0          0          

Powered by Codecov. Last update c71f214...5e8ba63

@mroeschke
Member Author

mroeschke commented Jan 5, 2017

Added a whatsnew and a test. I don't think my implementation is entirely correct yet so I am expecting the test I added to fail.

Is there a conventional way to debug cython code? Ideally, I'd like to drop a pdb.set_trace()-like functionality to see what's going on.

@jreback
Contributor

jreback commented Jan 5, 2017

copy this to python, debug and assert correctness, then cythonize it. note you will have to hack the calling function (to make sure it is imported / called correctly while you do this).

there are ways to debug cython code, just google. @chrisaycock may be able to elaborate a bit.

@chrisaycock
Contributor

The only way I've done debugging with Cython has been to add print() statements and to run samples in the Python interpreter. So really ad hoc.

@mroeschke
Member Author

Thanks for the tips @jreback and @chrisaycock!

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from 277182c to f0d1b24 Compare January 7, 2017 09:03
@mroeschke mroeschke changed the title [WIP] PERF: Cythonize Groupby.cummin/cummax (#15048) PERF: Cythonize Groupby.cummin/cummax (#15048) Jan 7, 2017
@mroeschke
Member Author

Made the changes related to your comments @jorisvandenbossche, and added a whatsnew & passing tests @jreback. All green.


N, K = (<object> values).shape
accum = np.empty_like(accum)
accum[:] = _int64_max
Contributor

this is not right it needs to be nan if val != val

Member Author

Since val is always set before the comparison, shouldn't val == val always equal True?

val = values[i, j]
if val == val:
    ....

I notice this pattern in group_cumsum and group_cumprod as well. Should those have nan if val != val as well?

Contributor

nan != nan
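A one-liner makes the point concrete:

```python
nan = float("nan")
print(nan == nan)  # False: IEEE 754 NaN never compares equal to itself
print(nan != nan)  # True: so `val != val` is a cheap missing-value check
```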

Member Author

Ah I see, thanks for the insight. I'll be sure to add that condition

Contributor

This value (_int64_max) needs to be type specific - e.g., here's a failing test case

In [7]: df = pd.DataFrame({'key':[1], 'value': 1e99})

In [8]: df.groupby('key').cummin()
Out[8]: 
           value
0  1.104865e-311

Member Author

@mroeschke mroeschke Jan 9, 2017

Thanks for the heads-up @chris-b1! Would that mean creating 3 different group_cummin/group_cummax functions to accommodate int32, float32 and float64? (Like the other group_* functions in this file).

It doesn't seem possible to get the dtype directly from the values array.

Contributor

Yeah, probably the easiest way would be to add to the templated code - look at group_min/max - it is already dumping the max/min value into the template.

The other option (just fyi) - numeric here is a cython fused type - you can check and specialize on it like this.

if numeric is int64_t:
    ...
elif numeric is float32_t:
    ...

tm.assert_frame_equal(result, expected)

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': range(8, 2, -1)})
Contributor

needs tests with nans and different dtypes

needs test against doing .apply(lambda x: x.cummin()) for validation

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from cb171c5 to 0b0df6b Compare January 8, 2017 20:30
@mroeschke
Member Author

Added additional tests and addressed the val != val condition @jreback. All green.

@mroeschke mroeschke force-pushed the improve_cummin_cummax branch 2 times, most recently from 8dc0aed to 2fc025b Compare January 9, 2017 06:51
@mroeschke
Member Author

Thanks for the tips @chris-b1, I was able to refactor the algorithm to handle different max/min dtypes.

Contributor

@chris-b1 chris-b1 left a comment

Couple minor things, otherwise looks good to me

int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
Contributor

This call isn't necessary, accum is already allocated to the right size/type

accum[lab, j] = min_val
out[i, j] = accum[lab, j]
# val = nan
elif val != val:
Contributor

Make this an else

int64_t lab

N, K = (<object> values).shape
accum = np.empty_like(accum)
Contributor

same as above

accum[lab, j] = max_val
out[i, j] = accum[lab, j]
# val = nan
elif val != val:
Contributor

same as above

def test_cummin_cummax(self):
# GH 15048
# Test with different dtypes
for dtype in [np.int32, np.float32, np.float64]:
Contributor

@jreback jreback Jan 9, 2017

test with int64, datetime types should work as well (e.g. datetime64, timedelta64). (if not pls comment; there are some issues I am working on elsewhere).

Contributor

also test with object dtype (which we don't expect to work, I think it should raise). but we can handle this in another issue.

Member Author

You're correct about using the object dtype. It raises this error:

DataError: No numeric types to aggregate

Will test the other dtypes later tonight.

result = df.groupby('A').cummin()
expected = pd.DataFrame({'B': [3, 3, 3, 6, 6, 6]}).astype(dtype)
tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
Contributor

put a blank line in between sections

expected = df.groupby('A').B.apply(lambda x: x.cummin()).to_frame()
tm.assert_frame_equal(result, expected)
# Test min value for dtype
try:
Contributor

instead of this, pass these in with the dtype, IOW change your loop to be

for dtype, min_value in [(np.int32, np.iinfo(np.int32).min), (np.float32.....)]
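Spelled out fully, the suggested loop could look like this (an illustrative sketch; the elided pairs from the comment above are filled in here as an assumption, not as the PR's actual test code):

```python
import numpy as np

# Candidate (dtype, min_value) pairs for the parametrized test loop
cases = [
    (np.int32, np.iinfo(np.int32).min),
    (np.float32, np.finfo(np.float32).min),
    (np.float64, np.finfo(np.float64).min),
]
for dtype, min_value in cases:
    # each sentinel round-trips through its own dtype unchanged
    assert np.array([min_value], dtype=dtype)[0] == min_value
```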

tm.assert_frame_equal(result, expected)
# Test max value for dtype
try:
max_val = np.iinfo(dtype).max
Contributor

same

expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)

# Some entries in column are nan
Contributor

so this needs to work for float types (32 and 64). as well as datetime64/timedelta64 types (which use NaT)

tm.assert_frame_equal(result, expected)
expected = df.groupby('A').B.apply(lambda x: x.cummax()).to_frame()
tm.assert_frame_equal(result, expected)

Contributor

also test where first element is nan.


@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from e0a9cc3 to 8ee56ee Compare January 11, 2017 06:20
@mroeschke
Member Author

@jreback Added the additional nan test & restructured the tests to pass in min and max values.

Additionally, performing cummin/max on datetime, timedelta and object dtypes all raises the same error: DataError: No numeric types to aggregate

In [16]: df = pd.DataFrame({'A':[5]*5+[6]*5,'B':range(10)})

In [17]: df
Out[17]: 
   A  B
0  5  0
1  5  1
2  5  2
3  5  3
4  5  4
5  6  5
6  6  6
7  6  7
8  6  8
9  6  9

In [18]: df['B'] = df['B'].astype(object)

In [19]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

In [20]: df['B'] = pd.date_range(2015, freq='10min', periods=10)

In [21]: df
Out[21]: 
   A                             B
0  5 1970-01-01 00:00:00.000002015
1  5 1970-01-01 00:10:00.000002015
2  5 1970-01-01 00:20:00.000002015
3  5 1970-01-01 00:30:00.000002015
4  5 1970-01-01 00:40:00.000002015
5  6 1970-01-01 00:50:00.000002015
6  6 1970-01-01 01:00:00.000002015
7  6 1970-01-01 01:10:00.000002015
8  6 1970-01-01 01:20:00.000002015
9  6 1970-01-01 01:30:00.000002015

In [22]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

In [26]: df['B'] = [pd.Timedelta(n, unit='d') for n in range(10)]

In [27]: df
Out[27]: 
   A      B
0  5 0 days
1  5 1 days
2  5 2 days
3  5 3 days
4  5 4 days
5  6 5 days
6  6 6 days
7  6 7 days
8  6 8 days
9  6 9 days

In [28]: df.groupby('A').cummin()

DataError: No numeric types to aggregate

pep8, removed args, kwargs

add whatsnew + test + changed logic

small error in algo

Use a more obvious test

Fixed algo & test passed

Add dtypes test

Add additional tests

handle nan case

Adapt max/min for different dtypes + tests

remove unnecessary comments

Added test & adjust algorithm

Fix linting errors
@mroeschke mroeschke force-pushed the improve_cummin_cummax branch from 8ee56ee to 5e8ba63 Compare January 11, 2017 06:55
def group_cummin_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                          ndarray[{{dest_type2}}, ndim=2] values,
                          ndarray[int64_t] labels,
                          ndarray[{{dest_type2}}, ndim=2] accum):
Contributor

@jreback jreback Jan 11, 2017

so I am puzzled why we even pass in accum here at all (rather than just allocate it here). It is only used internally. This would affect the groupby code and cumsum, cumprod as well as these new funcs.

@mroeschke if you want to do a followup to remove it after would be great.

Contributor

@jreback - I had done this pattern in the cumsum code. IIRC, it was because I was using fused types / memoryviews, and I didn't see a way to do the allocation. It would be straightforward to avoid in the templated code.

By example

def func(numeric[:] arr, int a, int b):
    # how would you allocate an (a x b) array of type `numeric?`

Contributor

ahh I see. ok, in the templated code this should be straightforward then. after seeing how easy templated code is, I much prefer it to fused types :>

@jreback jreback added this to the 0.20.0 milestone Jan 11, 2017
@jreback jreback closed this in 0fe491d Jan 11, 2017
@jreback
Contributor

jreback commented Jan 11, 2017

thanks @mroeschke

The other dtypes are fixed in #15054 (e.g. datetime/timedelta).

If you'd do that followup to remove accum would be great!

@jreback
Contributor

jreback commented Jan 11, 2017

e.g. in #15054

In [2]: df
Out[2]: 
   group  int  float string category_string  category_int   datetime                datetimetz  timedelta
0      1    1    4.0      a               a             7 2013-01-01 2013-01-01 00:00:00-05:00   00:00:01
1      1    2    5.0      b               b             8 2013-01-02 2013-01-02 00:00:00-05:00   00:00:02
2      2    3    6.0      c               c             9 2013-01-03 2013-01-03 00:00:00-05:00   00:00:03

In [3]: df.groupby('group').cummin()
Out[3]: 
   int  float  category_int
0    1    4.0             7
1    1    4.0             7
2    3    6.0             9

In [4]: df.groupby('group').cummin(numeric_only=False)
Out[4]: 
   int  float  category_int   datetime                datetimetz  timedelta
0    1    4.0             7 2013-01-01 2013-01-01 05:00:00-05:00   00:00:01
1    1    4.0             7 2013-01-01 2013-01-01 05:00:00-05:00   00:00:01
2    3    6.0             9 2013-01-03 2013-01-03 05:00:00-05:00   00:00:03

@jreback
Contributor

jreback commented Jan 11, 2017

@mroeschke

the merged PR fails against master, not sure why the tests didn't catch this.

(pandas) bash-3.2$ nosetests pandas/tests/groupby/test_groupby.py  -m cummin_cummax -s 
F
======================================================================
FAIL: test_cummin_cummax (test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jreback/pandas/pandas/tests/groupby/test_groupby.py", line 5798, in test_cummin_cummax
    tm.assert_frame_equal(result, expected)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1313, in assert_frame_equal
    obj='DataFrame.iloc[:, {0}]'.format(i))
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1154, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "/Users/jreback/pandas/pandas/util/testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  float64
[right]: int64

----------------------------------------------------------------------
Ran 1 test in 0.044s

FAILED (failures=1)

The culprit is this.
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1895

This should only happen if it's an actual integer type on values.dtype (and NOT datetime64/timedelta64/bool)
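That guard can be sketched with NumPy dtype kind codes (hypothetical helper name, not the pandas internals):

```python
import numpy as np

def is_true_integer(dtype):
    """Hypothetical guard: True only for real integer dtypes, excluding
    bool ('b'), datetime64 ('M') and timedelta64 ('m') kind codes."""
    return np.dtype(dtype).kind in "iu"

print(is_true_integer(np.int64))          # True
print(is_true_integer(np.bool_))          # False
print(is_true_integer("datetime64[ns]"))  # False
```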

@mroeschke
Member Author

Apologies for not flagging this last night. I made a naive assumption that travis would catch this and the build would fail:

In [6]: df #np.int64 type
Out[6]: 
   A                    B
0  1                    3
1  1                    4
2  1 -9223372036854775808
3  1                    2
4  2                    2
5  2                    3
6  2 -9223372036854775808
7  2                    1

In [7]: df.groupby('A').cummin()
Out[7]: 
     B
0  3.0
1  3.0
2  NaN
3  NaN
4  2.0
5  2.0
6  NaN
7  NaN

Thanks for pointing out the root issue. I'll look into addressing it!

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
closes pandas-dev#15048

Author: Matt Roeschke <[email protected]>

Closes pandas-dev#15053 from mroeschke/improve_cummin_cummax and squashes the following commits:

5e8ba63 [Matt Roeschke] PERF: Cythonize Groupby.cummin/cummax (pandas-dev#15048)
@mroeschke mroeschke deleted the improve_cummin_cummax branch December 20, 2017 02:04
Successfully merging this pull request may close these issues.

PERF: groupby-cummax,cummin