AssertionError when grouping with max/min as aggregation functions (pandas-1.0.0) #31522

Closed
marcevrard opened this issue Jan 31, 2020 · 12 comments · Fixed by #31616
Labels: Groupby, Regression (functionality that used to work in a prior pandas version)
Milestone: 1.0.1

Comments

@marcevrard

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1').min()

Problem description

Since pandas 1.0.0, an AssertionError is raised when grouping a DataFrame by a key and aggregating with max/min. It works fine if only one column (other than the grouping key) has dtype object, but it fails when more than one object-dtype column is present (as in the example). This worked on previous versions of pandas (e.g., pandas 0.25.3).

Expected Output

      key2 key3    data1     data2
key1
a      one  six -0.67246   -1.6302
b      one  six -1.72628 -0.907298
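Until a fixed release is available, one possible workaround is to aggregate column by column through SeriesGroupBy, so a multi-column object block never reaches the failing block-wise path. A minimal sketch (the fixed data values here stand in for the random ones above):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'key3': ['three', 'three', 'three', 'six', 'six'],
                   'data1': [0.5, -0.7, 1.2, -1.7, 0.3],
                   'data2': [0.1, -1.6, -0.9, 0.4, 2.0]})

# Aggregate each column separately (SeriesGroupBy), then recombine.
# This sidesteps the block-wise DataFrameGroupBy aggregation that
# trips the assert when an object block covers several columns.
gb = df.groupby('key1')
cols = [c for c in df.columns if c != 'key1']
result = pd.concat([gb[c].min() for c in cols], axis=1)
print(result)
```

This is a sketch, not the eventual fix; it simply avoids the code path where the multi-column object block is split.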

Output of pd.show_versions()

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.0.0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200127
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
tabulate : 0.8.3
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0

@TomAugspurger
Contributor

Thanks for the report. That assert is from #29035 (cc @jbrockmendel)

We do min on object dtype, which is NotImplemented in Cython, so we fall back to the Python agg. Then in

                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)

the `assert len(result._data.blocks) == 1` fails

(Pdb) pp result
     key2 key3
key1
a     one  six
b     one  six

and we fall through to the finally with a DataFrame.

FYI @marcevrard, we publish release candidates and nightly builds if you want to catch these before a release. You can watch pandas with "Releases only" selected if nightly builds aren't an option.

@TomAugspurger
Contributor

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks. This apparently isn't true for object blocks:

(Pdb) obj._data.blocks
(ObjectBlock: slice(0, 2, 1), 2 x 5, dtype: object,)
(Pdb) result._data.blocks
(ObjectBlock: slice(0, 1, 1), 1 x 2, dtype: object, ObjectBlock: slice(1, 2, 1), 1 x 2, dtype: object)

@TomAugspurger TomAugspurger added the Groupby and Regression labels on Jan 31, 2020
@TomAugspurger TomAugspurger added this to the 1.0.1 milestone Jan 31, 2020
@jbrockmendel
Member

I guess the faulty assumption is that a groupby aggregation on a single Block won't split it into multiple blocks.

I guess _split_and_operate would have to be called in there somehow. Easiest solution would be to raise TypeError, which should revert to the previous behavior. Longer-term we probably should be handling that case without raising.

@TomAugspurger
Contributor

Looking into this today.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 3, 2020
TomAugspurger added a commit that referenced this issue Feb 5, 2020
* REGR: Fixed AssertionError in groupby

Closes #31522
@marcevrard
Author

Thank you for the quick fix; I confirm it works again in version 1.0.1.

@zking1219

I'm still seeing this error in pandas-1.0.1:

    df_mh_spc6 = df_mh_spc5.groupby(['bldg_id'], as_index=False, sort=False).max()
env/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1378: in f
    return self._cython_agg_general(alias, alt=npfunc, **kwargs)
env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1003: in _cython_agg_general
    agg_blocks, agg_items = self._cython_agg_blocks(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d88db20>, how = 'max'
alt = <function amax at 0x111901d30>, numeric_only = False, min_count = -1

    def _cython_agg_blocks(
        self, how: str, alt=None, numeric_only: bool = True, min_count: int = -1
    ) -> "Tuple[List[Block], Index]":
        # TODO: the actual managing of mgr_locs is a PITA
        # here, it should happen via BlockManager.combine
    
        data: BlockManager = self._get_data_to_aggregate()
    
        if numeric_only:
            data = data.get_numeric_data(copy=False)
    
        agg_blocks: List[Block] = []
        new_items: List[np.ndarray] = []
        deleted_items: List[np.ndarray] = []
        # Some object-dtype blocks might be split into List[Block[T], Block[U]]
        split_items: List[np.ndarray] = []
        split_frames: List[DataFrame] = []
    
        no_result = object()
        for block in data.blocks:
            # Avoid inheriting result from earlier in the loop
            result = no_result
            locs = block.mgr_locs.as_array
            try:
                result, _ = self.grouper.aggregate(
                    block.values, how, axis=1, min_count=min_count
                )
            except NotImplementedError:
                # generally if we have numeric_only=False
                # and non-applicable functions
                # try to python agg
    
                if alt is None:
                    # we cannot perform the operation
                    # in an alternate way, exclude the block
                    assert how == "ohlc"
                    deleted_items.append(locs)
                    continue
    
                # call our grouper again with only this block
                obj = self.obj[data.items[locs]]
                if obj.shape[1] == 1:
                    # Avoid call to self.values that can occur in DataFrame
                    #  reductions; see GH#28949
                    obj = obj.iloc[:, 0]
    
                s = get_groupby(obj, self.grouper)
                try:
                    result = s.aggregate(lambda x: alt(x, axis=self.axis))
                except TypeError:
                    # we may have an exception in trying to aggregate
                    # continue and exclude the block
                    deleted_items.append(locs)
                    continue
                else:
                    result = cast(DataFrame, result)
                    # unwrap DataFrame to get array
                    if len(result._data.blocks) != 1:
                        # We've split an object block! Everything we've assumed
                        # about a single block input returning a single block output
                        # is a lie. To keep the code-path for the typical non-split case
                        # clean, we choose to clean up this mess later on.
                        split_items.append(locs)
                        split_frames.append(result)
                        continue
    
                    assert len(result._data.blocks) == 1
                    result = result._data.blocks[0].values
                    if isinstance(result, np.ndarray) and result.ndim == 1:
                        result = result.reshape(1, -1)
    
            assert not isinstance(result, DataFrame)
    
            if result is not no_result:
                # see if we can cast the block back to the original dtype
                result = maybe_downcast_numeric(result, block.dtype)
    
                if block.is_extension and isinstance(result, np.ndarray):
                    # e.g. block.values was an IntegerArray
                    # (1, N) case can occur if block.values was Categorical
                    #  and result is ndarray[object]
                    assert result.ndim == 1 or result.shape[0] == 1
                    try:
                        # Cast back if feasible
                        result = type(block.values)._from_sequence(
                            result.ravel(), dtype=block.values.dtype
                        )
                    except ValueError:
                        # reshape to be valid for non-Extension Block
                        result = result.reshape(1, -1)
    
                agg_block: Block = block.make_block(result)
    
            new_items.append(locs)
            agg_blocks.append(agg_block)
    
        if not (agg_blocks or split_frames):
            raise DataError("No numeric types to aggregate")
    
        if split_items:
            # Clean up the mess left over from split blocks.
            for locs, result in zip(split_items, split_frames):
>               assert len(locs) == result.shape[1]
E               AssertionError

env/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1110: AssertionError

Any idea why this might still be happening, @TomAugspurger? It worked fine before pandas 1.0.0.

Thanks!

@TomAugspurger
Contributor

@zking1219 if you have a minimal example I'd recommend opening a new issue.

@zking1219

I can see how that might help, I'll work on putting one together. Thanks!

@willkochtitzky

I am still getting this error message running pandas 1.0.5. I switched back to 0.25.1 and it works just fine. My dataset is a little complicated and I don't have time to put together a minimal example now, but I thought you would want to know that this still seems to be a problem.

@prakass1

Seeing the same error in pandas 1.1.0 as well.

@SierprinskiFox

SierprinskiFox commented Aug 13, 2020

I got the same error message...
With pdb I managed to find out that there is a problem around the 1465th row :D
This command does not raise an AssertionError (in pdb):
datafile.head(1464).groupby("column_name").min()
But this one does:
datafile.head(1465).groupby("column_name").min()

Row 1465 has 43 values instead of 42. But when I deleted the extra column (i.e., the 43rd), nothing improved; I still get the same error message.
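Prefix bisections like the one above can be automated. Here is a minimal generic sketch; the `check` callable and the threshold it uses below are hypothetical stand-ins for whatever actually raises on your data:

```python
import pandas as pd

def first_failing_prefix(df, check):
    """Binary-search the smallest n for which check(df.head(n)) raises,
    assuming that once a prefix fails, every longer prefix fails too.
    Returns None if no prefix fails."""
    def fails(n):
        try:
            check(df.head(n))
            return False
        except Exception:
            return True
    lo, hi = 1, len(df)
    if not fails(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical example: a groupby-min check that raises once a
# sentinel condition enters the prefix (stands in for the real bug).
df = pd.DataFrame({"key": ["a", "b"] * 10, "val": range(20)})

def check(d):
    if d["val"].max() >= 15:  # pretend rows from here on trigger the bug
        raise AssertionError("boom")
    d.groupby("key").min()

print(first_failing_prefix(df, check))  # -> 16
```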

@prakass1

Seeing the same error in pandas 1.1.0 as well

I did not see the error again in 1.1.0. I reinstalled all the packages, and I also found that it had occurred because I had some NaN values in my data, which I have since fixed. Previously, I was not getting the assertion error even in the presence of NaN.
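As a minimal sketch for anyone checking this on their own data (the DataFrame below is hypothetical, reusing the `bldg_id` column name from the traceback above), columns carrying NaN can be located before aggregating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"bldg_id": ["a", "a", "b"],
                   "height": [3.0, np.nan, 5.0],
                   "name": ["x", "y", None]})

# Columns that contain at least one missing value
nan_cols = df.columns[df.isna().any()].tolist()
print(nan_cols)  # -> ['height', 'name']

# Per-column counts of missing values
print(df.isna().sum())
```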
