Skip to content

BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function #32040

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 12, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Feb 16, 2020

@MarcoGorelli MarcoGorelli force-pushed the issue-31777 branch 2 times, most recently from c4ea34e to 4556c63 Compare February 16, 2020 11:33
@MarcoGorelli MarcoGorelli changed the title fix setting of index BUG: fix setting of index Feb 16, 2020
@MarcoGorelli MarcoGorelli changed the title BUG: fix setting of index BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function Feb 16, 2020
@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs):
result = self._aggregate_frame(func)
else:
result.columns = Index(
result.columns.levels[0], name=self._selected_obj.columns.name
[i[:-1] if len(i) > 2 else i[0] for i in result.columns],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you actaully trying to do here?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Feb 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Currently, when we get here, if we have

>>> result.columns
MultiIndex([('B', '<lambda>'),
            ('C', '<lambda>'),
            ('D', '<lambda>')],
           )

then the current line of code on master ignores the last level:

>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Index(['B', 'C', 'D'], dtype='object')

However, if (as in the new test I've added) we have

>>> result.columns
MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>'),
            (2, 4, '<lambda>')],
           )

then the current line of code on master will only take the first level:

>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Int64Index([1, 2], dtype='int64')

However, what we need is

MultiIndex([(1, 3),
            (1, 4),
            (2, 3),
            (2, 4)],
           )

So, that's what I'm trying to do here - if there's only two levels, return the first one as an Index (as is currently done on master), and if there's more levels, return all but the last one as a MultiIndex

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?

I think there's a more general fix to be had here

Copy link
Member Author

@MarcoGorelli MarcoGorelli Feb 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would master arbitrarily select the first column index level from the caller?

I think the idea is to select everything except for the last level, which is the one containing the name of the function e.g. '<lambda>'

I think there's a more general fix to be had here

Seems plausible :)

Copy link
Member Author

@MarcoGorelli MarcoGorelli Feb 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd @jreback thanks for your reviews - have found a cleaner way to write this:

result.columns = result.columns.rename(
    [self._selected_obj.columns.name] * result.columns.nlevels
).droplevel(-1)

Before I update the tests according the tests, could you let me know if this seems like the correct patch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you not just copy the columns and drop level thereafter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd not sure I understand what you mean - as in,

result.columns.droplevel(-1).rename(self._selected_obj.columns.name)

?

The problem is that then, if we have

(Pdb) result.columns
MultiIndex([('B', '<lambda>'),
            ('C', '<lambda>'),
            ('D', '<lambda>')],
           )

then

(Pdb) result.columns.droplevel(-1)
Index(['B', 'C', 'D'], dtype='object')

but, if we have

(Pdb) result.columns
MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>'),
            (2, 4, '<lambda>')],
           )

then

(Pdb) result.columns.droplevel(-1)
MultiIndex([(1, 3),
            (1, 4),
            (2, 3),
            (2, 4)],
           )

In the first case, renaming can be done with e.g. a string:

result.columns.droplevel(-1).rename(self._selected_obj.columns.name)

but in the second case, it needs to be done with a list of strings:

result.columns.droplevel(-1).rename([self._selected_obj.columns.name]*2)

@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs):
result = self._aggregate_frame(func)
else:
result.columns = Index(
result.columns.levels[0], name=self._selected_obj.columns.name
[i[:-1] if len(i) > 2 else i[0] for i in result.columns],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?

I think there's a more general fix to be had here

@MarcoGorelli MarcoGorelli force-pushed the issue-31777 branch 2 times, most recently from afc94b7 to 093bcd5 Compare February 27, 2020 15:57
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd @jreback can you have another look at this? Seems like the revised version satisfies your concerns.

@TomAugspurger TomAugspurger added this to the 1.0.2 milestone Mar 10, 2020
grp = df.groupby(np.r_[np.zeros(2), np.ones(2)])
result = grp.agg(func)
expected_keys = [(1, 3), (1, 4), (2, 3), (2, 4)]
expected = pd.DataFrame(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than parametrize this test can you just construct the expectation from literal values? The parametrized values themselves don't differ enough to be useful but make it tougher to understand what is going on here

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd comment plus a questions, looks ok otherwise.

@jreback
Copy link
Contributor

jreback commented Mar 11, 2020

pls merge master as well.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Mar 11, 2020

@WillAyd @jreback thanks for your reviews.

Rather than parametrize this test can you just construct the expectation from literal values?

Sure, done. Simplified test a bit, but it still reproduces the error:

$ git checkout upstream/master -- pandas/core
$ pytest pandas/tests/groupby/aggregate/test_aggregate.py::test_multiindex_custom_func
Full output
========================================================================================= test session starts =========================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 3 items                                                                                                                                                                                     

pandas/tests/groupby/aggregate/test_aggregate.py FFF                                                                                                                                            [100%]

============================================================================================== FAILURES ===============================================================================================
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>0] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12170>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>1] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12200>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>2] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12290>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError

when is nlevels actually > 1 here?

Always.

E.g. if our columns are multiindices: if we run

In [1]: import pandas as pd                                                                                                                                                                            

In [2]:     data = [[1, 4, 2], [5, 7, 1]] 
   ...:     df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]])) 
   ...:                                                                                                                                                                                                

In [3]: import numpy as np                                                                                                                                                                             

In [4]: df.groupby(np.array([0, 1])).agg(lambda x: x.mean()) 

then when we get here, result.columns is

MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
           )

E.g. if our columns are not multiindices: if we run

In [5]: df = pd.DataFrame(data, columns=['col1', 'col2', 'col3']) 

In [6]: df.groupby(np.array([0, 1])).agg(lambda x: x.mean()) 

then when we get here, result.columns is

MultiIndex([('col1', '<lambda>'),
            ('col2', '<lambda>'),
            ('col3', '<lambda>')],
           )

Furthermore, I tried adding in

assert result.columns.nlevels > 1

and all the tests in pandas/tests/groupby still passed

@TomAugspurger
Copy link
Contributor

@MarcoGorelli there's a merge conflict in 1.0.2.rst if you could update.

@MarcoGorelli
Copy link
Member Author

@TomAugspurger sure, done (I presume the CI failure is unrelated?)

@jreback jreback merged commit ef7e720 into pandas-dev:master Mar 12, 2020
@jreback
Copy link
Contributor

jreback commented Mar 12, 2020

thanks @MarcoGorelli

@jreback
Copy link
Contributor

jreback commented Mar 12, 2020

@meeseeksdev backport 1.0.x

@lumberbot-app
Copy link

lumberbot-app bot commented Mar 12, 2020

Something went wrong ... Please have a look at my logs.

WillAyd pushed a commit that referenced this pull request Mar 12, 2020
…ndex columns breaks with custom function (#32648)

Co-authored-by: Marco Gorelli <[email protected]>
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
@MarcoGorelli MarcoGorelli deleted the issue-31777 branch July 12, 2020 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function
5 participants