REGR: cumsum regression with groupby call to agg #31802

mattharrison · 2020-02-08T01:30:16Z

Code Sample, a copy-pastable example if possible

I want to define a custom function that I can pass to the agg method. It uses the cumsum method, which appears to be problematic recently.

# Your code here
import pandas as pd

def max_test(s):
    return s.cumsum().max()
    #return s.max()

dummy_data = pd.DataFrame(
    {'AIRLINE': {0: 'WN', 1: 'UA', 2: 'MQ', 3: 'AA', 4: 'WN'},
     'ORG_AIR': {0: 'LAX', 1: 'DEN', 2: 'DFW', 3: 'DFW', 4: 'LAX'},
     'DIST': {0: 590, 1: 1452, 2: 641, 3: 1192, 4: 1363}})

gb = dummy_data.groupby(['AIRLINE', 'ORG_AIR'])

result = gb.agg(
    #'max' 
    max_test
)

print(result)

Problem description

Prior to Pandas 1.0rc this worked. It now raises an exception:

$ python /tmp/regpandas.py
Traceback (most recent call last):
  File "/tmp/regpandas.py", line 16, in <module>
    max_test
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 948, in aggregate
    return self._python_agg_general(func, *args, **kwargs)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 936, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 641, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 666, in _aggregate_series_fast
    result, counts = grouper.get_result()
  File "pandas/_libs/reduction.pyx", line 376, in pandas._libs.reduction.SeriesGrouper.get_result
  File "pandas/_libs/reduction.pyx", line 193, in pandas._libs.reduction._BaseGrouper._apply_to_group
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 913, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/tmp/regpandas.py", line 4, in max_test
    return s.cumsum().max()
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/generic.py", line 11331, in cum_func
    result = self._data.apply(na_accum_func)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 440, in apply
    applied = b.apply(f, **kwargs)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 403, in apply
    result = self.make_block(values=_block_shape(result, ndim=self.ndim))
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 273, in make_block
    return make_block(values, placement=placement, ndim=self.ndim)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3041, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/Users/matt/.env/pandas1/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 125, in __init__
    f"Wrong number of items passed {len(self.values)}, "
ValueError: Wrong number of items passed 2, placement implies 1

Expected Output

$ python /tmp/regpandas.py
                 DIST
AIRLINE ORG_AIR
AA      DFW      1192
MQ      DFW       641
UA      DEN      1452
WN      LAX      1953

Output of `pd.show_versions()`

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.6.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2020-02-08T10:28:23Z

@mattharrison Thanks for the clear report! I can confirm this is a regression.

cc @jbrockmendel this is another example of something failing in libreduction with another error than the specific one that is catched:

pandas/pandas/core/groupby/ops.py

Lines 640 to 648 in ca84bd0

    
           try: 
        
               return self._aggregate_series_fast(obj, func) 
        
           except ValueError as err: 
        
               if "Function does not reduce" in str(err): 
        
                   # raised in libreduction 
        
                   pass 
        
               else: 
        
                   raise 
        
           return self._aggregate_series_pure_python(obj, func)

jorisvandenbossche · 2020-02-11T13:35:19Z

@jbrockmendel can you take a look?

jbrockmendel · 2020-02-11T16:44:00Z

We expect cumsum().max() to be a reduction, but when we go through libreduction, on the last group it doesn't reduce. This suggests that we could catch this within lib-reduction and re-raise as ValueError("Function does not reduce").

jorisvandenbossche · 2020-02-11T23:11:46Z

We expect cumsum().max() to be a reduction, but when we go through libreduction, on the last group it doesn't reduce.

Can you explain this a bit more? The function is doing a max(), so I would expect this to always be a reduction?

jbrockmendel · 2020-02-12T01:38:46Z

Can you explain this a bit more? The function is doing a max(), so I would expect this to always be a reduction?

I, too, expect it to be a reduction.

I added in Block.apply a print(result, getattr(result, "shape", None)) just before calling _split_op_result and the ran the example here:

# Your code here
import pandas as pd

def max_test(s):
    return s.cumsum().max()

df = pd.DataFrame(
    {'AIRLINE': {0: 'WN', 1: 'UA', 2: 'MQ', 3: 'AA', 4: 'WN'},
     'ORG_AIR': {0: 'LAX', 1: 'DEN', 2: 'DFW', 3: 'DFW', 4: 'LAX'},
     'DIST': {0: 590, 1: 1452, 2: 641, 3: 1192, 4: 1363}})

gb = df.groupby(['AIRLINE', 'ORG_AIR'])

>>> result = gb.agg(max_test)
[] (0,)
[1192] (1,)
[641] (1,)
[1452] (1,)
[ 590 1953] (2,)
Traceback (most recent call last):
[...]

so maybe it is reducing, just along the wrong axis?

TomAugspurger · 2020-03-06T20:35:29Z

A bit simpler reproducer:

In [43]: df = pd.DataFrame({"A": ['a', 'a', 'a'], "B": ['a', 'b', 'b'], "C": [1, 1, 1]})

In [44]: def f(x):
    ...:     assert len(x) == len(x._data.blocks[0].mgr_locs)
    ...:     return 0
    ...:

In [45]: df.groupby(["A", "B"]).agg(f)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-45-6faf28de6e4f> in <module>
----> 1 df.groupby(["A", "B"]).agg(f)

~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    937             # grouper specific aggregations
    938             if self.grouper.nkeys > 1:
--> 939                 return self._python_agg_general(func, *args, **kwargs)
    940             elif args or kwargs:
    941                 result = self._aggregate_frame(func, *args, **kwargs)

~/sandbox/pandas/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    924             try:
    925                 # if this function is invalid for this dtype, we will ignore it.
--> 926                 result, counts = self.grouper.agg_series(obj, f)
    927             except TypeError:
    928                 continue

~/sandbox/pandas/pandas/core/groupby/ops.py in agg_series(self, obj, func)
    638
    639         try:
--> 640             return self._aggregate_series_fast(obj, func)
    641         except ValueError as err:
    642             if "Function does not reduce" in str(err):

~/sandbox/pandas/pandas/core/groupby/ops.py in _aggregate_series_fast(self, obj, func)
    663         group_index = algorithms.take_nd(group_index, indexer, allow_fill=False)
    664         grouper = libreduction.SeriesGrouper(obj, func, group_index, ngroups, dummy)
--> 665         result, counts = grouper.get_result()
    666         return result, counts
    667

~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
    373                         cached_typ, cached_ityp, islider, vslider)
    374
--> 375                     res, initialized = self._apply_to_group(cached_typ, cached_ityp,
    376                                                             islider, vslider,
    377                                                             group_size, initialized)

~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
    191         """
    192         cached_ityp._engine.clear_mapping()
--> 193         res = self.f(cached_typ)
    194         res = _extract_result(res)
    195         if not initialized:

~/sandbox/pandas/pandas/core/groupby/groupby.py in <lambda>(x)
    911     def _python_agg_general(self, func, *args, **kwargs):
    912         func = self._is_builtin_func(func)
--> 913         f = lambda x: func(x, *args, **kwargs)
    914
    915         # iterate through "columns" ex exclusions to populate output dict

<ipython-input-44-5151b7117ae2> in f(x)
      1 def f(x):
----> 2     assert len(x) == len(x._data.blocks[0].mgr_locs)
      3     return 0
      4

AssertionError:

I think the issue is roughly somewhere around

pandas/pandas/_libs/reduction.pyx

Line 179 in 54c5e9e

object.__setattr__(cached_typ._data._block, 'values', vslider.buf)

. We update the Block._values, but don't adjust mgr_locs / placement. Anything that creates a new block from a function applied to those values will fail. In this case, we're making a Series on the result of the cumsum.

I'm not 100% sure what the solution is. It seems weird to mutate the mgr_locs in libreduction. Do you have any thoughts on that @jbrockmendel?

jbrockmendel · 2020-03-07T03:25:21Z

It seems weird to mutate the mgr_locs in libreduction

That can only end in tears. I'd be inclined to failover to the non-fastpath pretty aggressively.

TomAugspurger · 2020-03-09T15:57:55Z

Yeah, that's my preference as well. By "weird" I should have probably put "crazy" :)

I'll see if I can get a somewhat targeted check for when we need to bail out.

TomAugspurger · 2020-03-09T18:52:13Z

Right now I'm pessimistic about a "targeted check" being possible. It seems like libreduction.SeriesGrouper is fundamentally at odds with user-defined callables. We're creating invalid Series / DataFrame objects, so anything relying on the wrong invariant may hit arbitrary exceptions.

I think our only two choices are to

Change the except ValueError in agg_series back to except Exception
Just always call aggregate_series_pure_python for UDFs, skipping aggregate_series_fast / libreduction.

TomAugspurger · 2020-03-09T19:00:15Z

I suppose the 3rd choice is to continue to play whack-a-mole on exceptions / messages in

pandas/pandas/core/groupby/ops.py

Lines 639 to 647 in 787dc8a

    
           try: 
        
               return self._aggregate_series_fast(obj, func) 
        
           except ValueError as err: 
        
               if "Function does not reduce" in str(err): 
        
                   # raised in libreduction 
        
                   pass 
        
               else: 
        
                   raise 
        
           return self._aggregate_series_pure_python(obj, func)

. That's probably the most conservative change...

Closes pandas-dev#31802 This "fixes" pandas-dev#31802 by expanding the number of cases where we swallow an exception in libreduction. Currently, we're creating an invalid Series in SeriesBinGrouper where the `.mgr_locs` doesn't match the values. See pandas-dev#31802 (comment) for more. For now, we simply catch more cases that fall back to Python. I've gone with a minimal change which addresses only issues hitting this exact exception. We might want to go broader, but that's not clear.

jbrockmendel · 2020-03-09T19:54:40Z

Just always call aggregate_series_pure_python for UDFs, skipping aggregate_series_fast / libreduction.

Maintenance-wise, this is by far the most appealing option, xref #32086. As of that PR, the performance penalty (as measured by our most-affected asv) of disabling fast_apply was 48x. Some perf PRs since then have gotten that down to 7x. I'm pretty sure I can shave that down to 5x without too much trouble.

Leaving aside the issue of one asv not being a representative sample, the question is how big a perf penalty are we willing to accept to get rid of this hassle?

jbrockmendel · 2020-03-09T19:59:18Z

object.setattr(cached_typ._data._block, 'values', vslider.buf)

I'm a little confused here. How are we getting to a point where vslider.buf has the wrong shape?

jorisvandenbossche · 2020-03-09T20:00:24Z

What's the difference maintenance wise between try/except the fast version (with a generic except in case of UDFs), and otherwise fallback to pure python version versus an if/else to only do the pure python version for UDFs?

jbrockmendel · 2020-03-09T20:12:54Z

What's the difference maintenance wise between try/except the fast version [...] (with a generic except in case of UDFs)

Non-surfaced bugs
The cython code path is very difficult to grok, not newcomer-friendly. Among other things it messes with ndarray internals, which is kinda sketchy.
We need to worry about fast/slow versions having matching behavior, similar to the failure in POC/REF: remove SeriesBinGrouper, SeriesGrouper #32083. (i'm pretty sure the libreduction._extract_array behavior/usage is subtly different from what we have in the non-cython paths)

jbrockmendel · 2020-03-09T20:14:22Z

we could use something like the raw keyword to let users opt in to a fastpath to operate only on the underlying arrays (i think we do this for DataFrame.apply). If we wanted to get fancy about it we could look at annotations on a UDF and see if is fastpath-friendly

jorisvandenbossche · 2020-03-09T20:26:02Z

But we need the cython code anyway? Or would we be able to remove quite some of it if it didn't need to support UDFs?

jbrockmendel · 2020-03-09T20:40:08Z

But we need the cython code anyway? Or would we be able to remove quite some of it if it didn't need to support UDFs?

If we got rid of fast_apply, we could rip out 136 lines of cython and about 60 more of non-cython, see #32086.

Also it looks like fast/slow paths give different behavior for tshift in tests.groupby.test_whitelist

jbrockmendel · 2020-03-09T21:08:01Z

@TomAugspurger after having looked at it more closely, updating mgr_locs is viable. Adding cached_type._data._block.mgr_locs = slice(len(vslider.buf))) to _update_cached_objs in libreduction gives us back a valid Series

TomAugspurger · 2020-03-09T21:24:21Z

Hmmm OK. Let's dig this hole a little deeper then...

jbrockmendel · 2020-03-09T22:28:45Z

the question is how big a perf penalty are we willing to accept to get rid of this hassle?

Any thoughts here? I've got it down to 4.5x locally

TomAugspurger · 2020-03-10T13:07:34Z

4.5x seems a bit rough (is that across the board, or just specific cases?)

I think I'd prefer to see something like what you proposed earlier:

we could use something like the raw keyword to let users opt in to a fastpath to operate only on the underlying arrays

It seems safeish to be doing these hacks on plain ndarrays. I think most of the fragility comes from trying to operate on more complex objects like Series & DataFrame. If we could ensure that the callable is only operating on ndarrays (via an opt-in keyword like raw) then the perf / maintenance tradeoff seems more palitable.

jbrockmendel · 2020-03-10T14:52:31Z

If we could ensure that the callable is only operating on ndarrays

This seems like a good tradeoff.

4.5x seems a bit rough (is that across the board, or just specific cases?)

A few weeks ago I ran the asvs with fast_apply disabled and chose the one that was most affected. Have been using that for comparison in optimization PRs since then. Most of the slowdown comes from slicing because we create new Series+SingleBlockManager+Index+Block objects on each of ~2k groups

TomAugspurger · 2020-03-10T14:58:34Z

Do you have a sense (without looking) for whether we're doing "pandas things" (accessing the index, blocks, etc.) in those methods? I guess that determines whether the raw keyword is even an option.

jbrockmendel · 2020-03-10T15:02:25Z

whether we're doing "pandas things" (accessing the index, blocks, etc.) in those methods?

which methods are you referring to?

TomAugspurger · 2020-03-10T16:12:46Z

The callable being applied. If I'm thinking about this the right way, that callable would start receiving an ndarray rather than a Series.

jbrockmendel · 2020-03-10T16:21:24Z

Gotcha. Yah, passing an ndarray to the UDF would allow for a lot of simplication

jorisvandenbossche added this to the 1.0.2 milestone Feb 8, 2020

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 8, 2020

jorisvandenbossche changed the title ~~cumsum regression with groupby call to agg~~ REGR: cumsum regression with groupby call to agg Feb 8, 2020

dsaxton mentioned this issue Feb 18, 2020

ValueError when grouping with max/min as aggregation functions (pandas-1.0.1) #32077

Closed

TomAugspurger mentioned this issue Mar 9, 2020

Ensure valid Block mutation in SeriesBinGrouper. #32561

Merged

jreback added Groupby Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 11, 2020

jreback closed this as completed in #32561 Mar 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REGR: cumsum regression with groupby call to agg #31802

REGR: cumsum regression with groupby call to agg #31802

mattharrison commented Feb 8, 2020 •

edited by jbrockmendel

Loading

jorisvandenbossche commented Feb 8, 2020

jorisvandenbossche commented Feb 11, 2020

jbrockmendel commented Feb 11, 2020

jorisvandenbossche commented Feb 11, 2020

jbrockmendel commented Feb 12, 2020

TomAugspurger commented Mar 6, 2020

jbrockmendel commented Mar 7, 2020

TomAugspurger commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jorisvandenbossche commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jorisvandenbossche commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

REGR: cumsum regression with groupby call to agg #31802

REGR: cumsum regression with groupby call to agg #31802

Comments

mattharrison commented Feb 8, 2020 • edited by jbrockmendel Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

jorisvandenbossche commented Feb 8, 2020

jorisvandenbossche commented Feb 11, 2020

jbrockmendel commented Feb 11, 2020

jorisvandenbossche commented Feb 11, 2020

jbrockmendel commented Feb 12, 2020

TomAugspurger commented Mar 6, 2020

jbrockmendel commented Mar 7, 2020

TomAugspurger commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jorisvandenbossche commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jorisvandenbossche commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

TomAugspurger commented Mar 9, 2020

jbrockmendel commented Mar 9, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

TomAugspurger commented Mar 10, 2020

jbrockmendel commented Mar 10, 2020

mattharrison commented Feb 8, 2020 •

edited by jbrockmendel

Loading

Output of `pd.show_versions()`