AssertionError when grouping with max/min as aggregation functions (pandas-1.0.0) #31522

Closed
marcevrard opened this issue Jan 31, 2020 · 12 comments · Fixed by #31616
Labels: Groupby, Regression (functionality that used to work in a prior pandas version)
Milestone: 1.0.1

Comments

@marcevrard

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').min()

Problem description

Since pandas 1.0.0, an AssertionError is raised when grouping a DataFrame by a key and aggregating with max/min. It works fine if only one column (other than the grouping key) has dtype object, but it fails when more than one object-dtype column is present (as in the example). This worked on previous versions of pandas (e.g., pandas 0.25.3).

Expected Output

      key2 key3    data1     data2
key1
a      one  six -0.67246   -1.6302
b      one  six -1.72628 -0.907298
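Until a fixed release is available, one possible workaround is to aggregate column by column through SeriesGroupBy, so a multi-column object block never reaches the failing block-wise path. A minimal sketch (the fixed data values here stand in for the random ones above):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': [0.5, -0.7, 1.2, -1.7, 0.3],
                   'data2': [0.1, -1.6, -0.9, 0.4, 2.0]})

# Aggregate each column separately (SeriesGroupBy), then recombine.
# This sidesteps the block-wise DataFrameGroupBy aggregation that
# trips the assert when an object block covers several columns.
gb = df.groupby('key1')
cols = [c for c in df.columns if c != 'key1']
result = pd.concat([gb[c].min() for c in cols], axis=1)
print(result)
```

This is a sketch, not the eventual fix; it simply avoids the code path where the multi-column object block is split.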

Output of pd.show_versions()

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0

@TomAugspurger
Contributor

Thanks for the report. That assert is from #29035 (cc @jbrockmendel)

We do min on object dtype, which is NotImplemented in Cython, so we fall back to the Python agg. Then in

                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

the `assert len(result._data.blocks) == 1` fails

(Pdb) pp result
     key2 key3
key1
a     one  six
b     one  six

and we fall through to the finally with a DataFrame.

FYI @marcevrard, we publish release candidates and nightly builds if you want to catch these before a release. You can watch pandas with "Releases only" selected if nightly builds aren't an option.

@TomAugspurger
Contributor

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks. This apparently isn't true for object blocks:

(Pdb) obj._data.blocks
(ObjectBlock: slice(0, 2, 1), 2 x 5, dtype: object,)
(Pdb) result._data.blocks
(ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object)

@TomAugspurger TomAugspurger added the Groupby and Regression labels on Jan 31, 2020
@TomAugspurger TomAugspurger added this to the 1.0.1 milestone Jan 31, 2020
@jbrockmendel
Member

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks.

I guess _split_and_operate would have to be called in there somehow. Easiest solution would be to raise TypeError, which should revert to the previous behavior. Longer-term we probably should be handling that case without raising.

@TomAugspurger
Contributor

Looking into this today.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 3, 2020
TomAugspurger added a commit that referenced this issue Feb 5, 2020
* REGR: Fixed AssertionError in groupby

Closes #31522
@marcevrard
Author

Thank you for the quick fix; I confirm it works again in version 1.0.1.

@zking1219

I'm still seeing this error in pandas-1.0.1:

    df_mh_spc6 = df_mh_spc5.groupby(['bldg_id'], as_index=False, sort=False).max()
env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1378: in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1003: in _cython_agg_general
    agg_blocks, agg_items = self._cython_agg_blocks(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d88db20>, how = 'max'
alt = <function amax at 0x111901d30>, numeric_only = False, min_count = -1

    def _cython_agg_blocks(
        self, how: str, alt=None, numeric_only: bool = True, min_count: int = -1
    ) -> "Tuple[List[Block], Index]":
        # TODO: the actual managing of mgr_locs is a PITA
        # here, it should happen via BlockManager.combine
    
        data: BlockManager = self._get_data_to_aggregate()
    
        if numeric_only:
            data = data.get_numeric_data(copy=False)
    
        agg_blocks: List[Block] = []
        new_items: List[np.ndarray] = []
        deleted_items: List[np.ndarray] = []
        # Some object-dtype blocks might be split into List[Block[T], Block[U]]
        split_items: List[np.ndarray] = []
        split_frames: List[DataFrame] = []
    
        no_result = object()
        for block in data.blocks:
            # Avoid inheriting result from earlier in the loop
            result = no_result
            locs = block.mgr_locs.as_array
            try:
                result, _ = self.grouper.aggregate(
                    block.values, how, axis=1, min_count=min_count
                )
            except NotImplementedError:
                # generally if we have numeric_only=False
                # and non-applicable functions
                # try to python agg
    
                if alt is None:
                    # we cannot perform the operation
                    # in an alternate way, exclude the block
                    assert how == "ohlc"
                    deleted_items.append(locs)
                    continue
    
                # call our grouper again with only this block
                obj = self.obj[data.items[locs]]
                if obj.shape[1] == 1:
                    # Avoid call to self.values that can occur in DataFrame
                    #  reductions; see GH#28949
                    obj = obj.iloc[:, 0]
    
                s = get_groupby(obj, self.grouper)
                try:
                    result = s.aggregate(lambda x: alt(x, axis=self.axis))
                except TypeError:
                    # we may have an exception in trying to aggregate
                    # continue and exclude the block
                    deleted_items.append(locs)
                    continue
                else:
                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    if len(result._data.blocks) != 1:
                        # We've split an object block! Everything we've assumed
                        # about a single block input returning a single block output
                        # is a lie. To keep the code-path for the typical non-split case
                        # clean, we choose to clean up this mess later on.
                        split_items.append(locs)
                        split_frames.append(result)
                        continue
    
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)
    
            assert not isinstance(result, DataFrame)
    
            if result is not no_result:
                # see if we can cast the block back to the original dtype
                result = maybe_downcast_numeric(result, block.dtype)
    
                if block.is_extension and isinstance(result, np.ndarray):
                    # e.g. block.values was an IntegerArray
                    # (1, N) case can occur if block.values was Categorical
                    #  and result is ndarray[object]
                    assert result.ndim == 1 or result.shape[0] == 1
                    try:
                        # Cast back if feasible
                        result = type(block.values)._from_sequence(
                            result.ravel(), dtype=block.values.dtype
                        )
                    except ValueError:
                        # reshape to be valid for non-Extension Block
                        result = result.reshape(1, -1)
    
                agg_block: Block = block.make_block(result)
    
            new_items.append(locs)
            agg_blocks.append(agg_block)
    
        if not (agg_blocks or split_frames):
            raise DataError("No numeric types to aggregate")
    
        if split_items:
            # Clean up the mess left over from split blocks.
            for locs, result in zip(split_items, split_frames):
>               assert len(locs) == result.shape[1]
E               AssertionError

env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1110: AssertionError

Any idea why this might still be happening, @TomAugspurger? It worked fine before pandas 1.0.0.

Thanks!

@TomAugspurger
Contributor

@zking1219 if you have a minimal example I'd recommend opening a new issue.

@zking1219

I can see how that might help, I'll work on putting one together. Thanks!

@willkochtitzky

I am still getting this error message running pandas 1.0.5. I switched back to 0.25.1 and it works just fine. My dataset is a little complicated and I don't have time to put together a minimal example now, but I thought you would want to know that this still seems to be a problem.

@prakass1

Seeing the same error in pandas 1.1.0 as well.

@SierprinskiFox

SierprinskiFox commented Aug 13, 2020

I got the same error message...
With pdb I managed to find out that there is a problem around the 1465th row :D
This command does not raise an AssertionError (in pdb):
datafile.head(1464).groupby("column_name").min()
But this one does:
datafile.head(1465).groupby("column_name").min()

Row 1465 has 43 values instead of 42. But when I deleted the extra column (i.e., the 43rd), nothing improved; I still get the same error message.
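Prefix bisections like the one above can be automated. Here is a minimal generic sketch; the `check` callable and the threshold it uses below are hypothetical stand-ins for whatever actually raises on your data:

```python
import pandas as pd

def first_failing_prefix(df, check):
    """Binary-search the smallest n for which check(df.head(n)) raises,
    assuming that once a prefix fails, every longer prefix fails too.
    Returns None if no prefix fails."""
    def fails(n):
        try:
            check(df.head(n))
            return False
        except Exception:
            return True
    lo, hi = 1, len(df)
    if not fails(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical example: a groupby-min check that raises once a
# sentinel condition enters the prefix (stands in for the real bug).
df = pd.DataFrame({"key": ["a", "b"] * 10, "val": range(20)})

def check(d):
    if d["val"].max() >= 15:  # pretend rows from here on trigger the bug
        raise AssertionError("boom")
    d.groupby("key").min()

print(first_failing_prefix(df, check))  # -> 16
```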

@prakass1

Seeing the same error in pandas 1.1.0 as well

I did not see the error again in 1.1.0. I reinstalled all the packages, and I also found that it had occurred because I had some NaN values in my data, which I have since fixed. Previously, I was not getting the assertion error even in the presence of NaN.
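As a minimal sketch for anyone checking this on their own data (the DataFrame below is hypothetical, reusing the `bldg_id` column name from the traceback above), columns carrying NaN can be located before aggregating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bldg_id": ["a", "a", "b"],
                   "height": [3.0, np.nan, 5.0],
                   "name": ["x", "y", None]})

# Columns that contain at least one missing value
nan_cols = df.columns[df.isna().any()].tolist()
print(nan_cols)  # -> ['height', 'name']

# Per-column counts of missing values
print(df.isna().sum())
```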
