BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function #32040

MarcoGorelli · 2020-02-16T11:32:31Z

closes GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function #31777
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2020-02-16T16:52:58Z

pandas/core/groupby/generic.py

@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs):
                    result = self._aggregate_frame(func)
                else:
                    result.columns = Index(
-                        result.columns.levels[0], name=self._selected_obj.columns.name
+                        [i[:-1] if len(i) > 2 else i[0] for i in result.columns],


what are you actually trying to do here?

@jreback Currently, when we get here, if we have

>>> result.columns
MultiIndex([('B', '<lambda>'), ('C', '<lambda>'), ('D', '<lambda>')], )

then the current line of code on master ignores the last level:

>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Index(['B', 'C', 'D'], dtype='object')

However, if (as in the new test I've added) we have

>>> result.columns
MultiIndex([(1, 3, '<lambda>'), (1, 4, '<lambda>'), (2, 3, '<lambda>'), (2, 4, '<lambda>')], )

then the current line of code on master will only take the first level:

>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Int64Index([1, 2], dtype='int64')

However, what we need is

MultiIndex([(1, 3), (1, 4), (2, 3), (2, 4)], )

So that's what I'm trying to do here: if there are only two levels, return the first one as an Index (as master currently does), and if there are more levels, return all but the last one as a MultiIndex.
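To make the mismatch concrete, here is a minimal sketch (not the pandas source) that builds the two column indexes quoted above by hand and compares `levels[0]` with the number of columns. The variable names are illustrative only.

```python
import pandas as pd

# Columns as they look after aggregating with a lambda: the last level
# holds the function name.
two_level = pd.MultiIndex.from_tuples(
    [("B", "<lambda>"), ("C", "<lambda>"), ("D", "<lambda>")]
)
three_level = pd.MultiIndex.from_tuples(
    [(1, 3, "<lambda>"), (1, 4, "<lambda>"), (2, 3, "<lambda>"), (2, 4, "<lambda>")]
)

# levels[0] holds the unique values of the first level, so it only lines up
# with the columns when every first-level value occurs exactly once.
print(len(two_level), len(two_level.levels[0]))      # same length
print(len(three_level), len(three_level.levels[0]))  # fewer labels than columns

# Dropping the last level keeps one label per column in both cases.
wanted = three_level.droplevel(-1)
print(list(wanted))
```

Assigning the shorter `levels[0]` as the new columns is what produces the length-mismatch error in the three-level case.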

Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?

I think there's a more general fix to be had here

Why would master arbitrarily select the first column index level from the caller?

I think the idea is to select everything except for the last level, which is the one containing the name of the function e.g. '<lambda>'

I think there's a more general fix to be had here

Seems plausible :)

@WillAyd @jreback thanks for your reviews - have found a cleaner way to write this:

result.columns = result.columns.rename(
    [self._selected_obj.columns.name] * result.columns.nlevels
).droplevel(-1)

Before I update the tests accordingly, could you let me know if this seems like the correct patch?

Can you not just copy the columns and drop level thereafter?

@WillAyd not sure I understand what you mean - as in,

result.columns.droplevel(-1).rename(self._selected_obj.columns.name)

?

The problem is that then, if we have

(Pdb) result.columns
MultiIndex([('B', '<lambda>'), ('C', '<lambda>'), ('D', '<lambda>')], )

then

(Pdb) result.columns.droplevel(-1)
Index(['B', 'C', 'D'], dtype='object')

but, if we have

(Pdb) result.columns
MultiIndex([(1, 3, '<lambda>'), (1, 4, '<lambda>'), (2, 3, '<lambda>'), (2, 4, '<lambda>')], )

then

(Pdb) result.columns.droplevel(-1)
MultiIndex([(1, 3), (1, 4), (2, 3), (2, 4)], )

In the first case, renaming can be done with e.g. a string:

result.columns.droplevel(-1).rename(self._selected_obj.columns.name)

but in the second case, it needs to be done with a list of strings:

result.columns.droplevel(-1).rename([self._selected_obj.columns.name]*2)
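One way to avoid branching on the number of levels is to name every level first and drop the last level afterwards, since the aggregated columns are a MultiIndex at that point. The following is a rough sketch of that idea outside of pandas internals; `drop_function_level` and the name `"cols"` are hypothetical and used only for illustration.

```python
import pandas as pd


def drop_function_level(columns: pd.MultiIndex, name=None) -> pd.Index:
    """Hypothetical helper: drop the trailing function-name level.

    Naming every level before dropping avoids branching on whether the
    result is a flat Index or a MultiIndex.
    """
    renamed = columns.rename([name] * columns.nlevels)
    return renamed.droplevel(-1)


flat = drop_function_level(
    pd.MultiIndex.from_tuples([("B", "<lambda>"), ("C", "<lambda>")]), name="cols"
)
nested = drop_function_level(
    pd.MultiIndex.from_tuples([(1, 3, "<lambda>"), (1, 4, "<lambda>")]), name="cols"
)

print(type(flat).__name__, flat.name)
print(type(nested).__name__, list(nested.names))
```

Because the rename happens while the object is still a MultiIndex, a list of names is valid in both cases, and `droplevel(-1)` decides whether the result collapses to a flat Index.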

pandas/tests/groupby/aggregate/test_aggregate.py

WillAyd · 2020-02-16T21:22:18Z

pandas/core/groupby/generic.py

@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs):
                    result = self._aggregate_frame(func)
                else:
                    result.columns = Index(
-                        result.columns.levels[0], name=self._selected_obj.columns.name
+                        [i[:-1] if len(i) > 2 else i[0] for i in result.columns],


Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?

I think there's a more general fix to be had here

pandas/core/groupby/generic.py

TomAugspurger

@WillAyd @jreback can you have another look at this? Seems like the revised version satisfies your concerns.

WillAyd · 2020-03-11T00:07:40Z

pandas/tests/groupby/aggregate/test_aggregate.py

+    grp = df.groupby(np.r_[np.zeros(2), np.ones(2)])
+    result = grp.agg(func)
+    expected_keys = [(1, 3), (1, 4), (2, 3), (2, 4)]
+    expected = pd.DataFrame(


Rather than parametrize this test can you just construct the expectation from literal values? The parametrized values themselves don't differ enough to be useful but make it tougher to understand what is going on here

pandas/core/groupby/generic.py

jreback

@WillAyd comment plus a question, looks ok otherwise.

jreback · 2020-03-11T01:58:43Z

pls merge master as well.

MarcoGorelli · 2020-03-11T09:59:07Z

@WillAyd @jreback thanks for your reviews.

Rather than parametrize this test can you just construct the expectation from literal values?

Sure, done. Simplified test a bit, but it still reproduces the error:

$ git checkout upstream/master -- pandas/core
$ pytest pandas/tests/groupby/aggregate/test_aggregate.py::test_multiindex_custom_func

Full output

========================================================================================= test session starts =========================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 3 items                                                                                                                                                                                     

pandas/tests/groupby/aggregate/test_aggregate.py FFF                                                                                                                                            [100%]

============================================================================================== FAILURES ===============================================================================================
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>0] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12170>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>1] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12200>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError
_______________________________________________________________________________ test_multiindex_custom_func[<lambda>2] ________________________________________________________________________________

func = <function <lambda> at 0x7fb76ca12290>

    @pytest.mark.parametrize(
        "func", [lambda s: s.mean(), lambda s: np.mean(s), lambda s: np.nanmean(s)]
    )
    def test_multiindex_custom_func(func):
        # GH 31777
        data = [[1, 4, 2], [5, 7, 1]]
        df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))
>       result = df.groupby(np.array([0, 1])).agg(func)

pandas/tests/groupby/aggregate/test_aggregate.py:701: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/groupby/generic.py:959: in aggregate
    result.columns.levels[0], name=self._selected_obj.columns.name
pandas/core/generic.py:5178: in __setattr__
    return object.__setattr__(self, name, value)
pandas/_libs/properties.pyx:67: in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
pandas/core/generic.py:565: in _set_axis
    self._data.set_axis(axis, labels)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = BlockManager
Items: MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
 ... 1, 1), 1 x 2, dtype: int64
IntBlock: slice(1, 2, 1), 1 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64
axis = 0, new_labels = Int64Index([1, 2], dtype='int64')

    def set_axis(self, axis: int, new_labels: Index) -> None:
        # Caller is responsible for ensuring we have an Index object.
        old_len = len(self.axes[axis])
        new_len = len(new_labels)
    
        if new_len != old_len:
            raise ValueError(
>               f"Length mismatch: Expected axis has {old_len} elements, new "
                f"values have {new_len} elements"
            )
E           ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

pandas/core/internals/managers.py:223: ValueError

when is nlevels actually > 1 here?

Always.

E.g. if our columns are a MultiIndex and we run

In [1]: import pandas as pd

In [2]: data = [[1, 4, 2], [5, 7, 1]]
   ...: df = pd.DataFrame(data, columns=pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]]))

In [3]: import numpy as np

In [4]: df.groupby(np.array([0, 1])).agg(lambda x: x.mean())

then when we get here, result.columns is

MultiIndex([(1, 3, '<lambda>'),
            (1, 4, '<lambda>'),
            (2, 3, '<lambda>')],
           )

E.g. if our columns are not a MultiIndex and we run

In [5]: df = pd.DataFrame(data, columns=['col1', 'col2', 'col3']) 

In [6]: df.groupby(np.array([0, 1])).agg(lambda x: x.mean())

then when we get here, result.columns is

MultiIndex([('col1', '<lambda>'),
            ('col2', '<lambda>'),
            ('col3', '<lambda>')],
           )

Furthermore, I tried adding in

assert result.columns.nlevels > 1

and all the tests in pandas/tests/groupby still passed
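As a quick check of the user-facing behaviour, the snippet below reruns the example above as a standalone script. It is a sketch that assumes a pandas release containing this fix (the milestone on this PR is 1.0.2); on such a release the aggregated frame should keep the original column labels.

```python
import numpy as np
import pandas as pd

data = [[1, 4, 2], [5, 7, 1]]
columns = pd.MultiIndex.from_arrays([[1, 1, 2], [3, 4, 3]])
df = pd.DataFrame(data, columns=columns)

# One row per group, so the mean of each group is just that row.
result = df.groupby(np.array([0, 1])).agg(lambda s: s.mean())

# With the fix, the column labels survive aggregation unchanged.
print(result.columns.tolist())
```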

TomAugspurger · 2020-03-11T14:24:43Z

@MarcoGorelli there's a merge conflict in 1.0.2.rst if you could update.

MarcoGorelli · 2020-03-11T16:11:27Z

@TomAugspurger sure, done (I presume the CI failure is unrelated?)

jreback · 2020-03-12T02:33:00Z

thanks @MarcoGorelli

jreback · 2020-03-12T02:33:13Z

@meeseeksdev backport 1.0.x

lumberbot-app · 2020-03-12T02:33:47Z

Something went wrong ... Please have a look at my logs.

MarcoGorelli force-pushed the issue-31777 branch 2 times, most recently from c4ea34e to 4556c63 Compare February 16, 2020 11:33

MarcoGorelli changed the title ~~fix setting of index~~ BUG: fix setting of index Feb 16, 2020

MarcoGorelli changed the title ~~BUG: fix setting of index~~ BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function Feb 16, 2020

jreback added Groupby MultiIndex labels Feb 16, 2020

jreback requested changes Feb 16, 2020

WillAyd requested changes Feb 16, 2020

MarcoGorelli force-pushed the issue-31777 branch from bef96e4 to 52adefd Compare February 17, 2020 11:20

jbrockmendel reviewed Feb 20, 2020

MarcoGorelli force-pushed the issue-31777 branch 2 times, most recently from afc94b7 to 093bcd5 Compare February 27, 2020 15:57

MarcoGorelli and others added 9 commits February 27, 2020 16:00

fix setting of index

b06f08f

simplify

1d573b3

rename

d8d588d

fix renaming

fd3854a

note gh number, parametrize test

e709a06

whatsnew

093bcd5

clearer comment

88ce23b

clearer comment

4e4bae8

Merge remote-tracking branch 'upstream/master' into issue-31777

ab86946

TomAugspurger approved these changes Mar 10, 2020

TomAugspurger added this to the 1.0.2 milestone Mar 10, 2020

WillAyd requested changes Mar 11, 2020

jreback reviewed Mar 11, 2020

jreback reviewed Mar 11, 2020

Marco Gorelli added 2 commits March 11, 2020 09:28

Merge remote-tracking branch 'upstream/master' into issue-31777

7345139

construct expected from literal values for legibility

73caec2

Marco Gorelli added 2 commits March 11, 2020 13:03

Merge remote-tracking branch 'upstream/master' into issue-31777

b024616

Use custom functions throughout

bf048fa

MarcoGorelli and others added 5 commits March 11, 2020 14:37

Merge branch 'master' into issue-31777

19bd5da

Update v1.0.2.rst

4389f6a

make whatsnew entry consistent

5beabac

Merge remote-tracking branch 'upstream/master' into issue-31777

08e5d7b

reinstate change

a9fae8d

Merge remote-tracking branch 'upstream/master' into issue-31777

801df5c

jreback approved these changes Mar 12, 2020

jreback merged commit ef7e720 into pandas-dev:master Mar 12, 2020

jreback added the Still Needs Manual Backport label Mar 12, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 12, 2020

Backport PR pandas-dev#32040: BUG: GroupBy aggregation of DataFrame w…

25cf8c7

…ith MultiIndex columns breaks with custom function

meeseeksmachine mentioned this pull request Mar 12, 2020

Backport PR #32040 on branch 1.0.x (BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function) #32648

Merged

jreback removed the Still Needs Manual Backport label Mar 12, 2020

WillAyd pushed a commit that referenced this pull request Mar 12, 2020

Backport PR #32040: BUG: GroupBy aggregation of DataFrame with MultiI…

5e52303

…ndex columns breaks with custom function (#32648) Co-authored-by: Marco Gorelli <[email protected]>

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks …

aaabeb9

…with custom function (pandas-dev#32040)

simonjayhawkins mentioned this pull request Jul 12, 2020

BUG: AttributeError when doing groupby with as_index=False on Empty DataFrame #35246

Closed

MarcoGorelli deleted the issue-31777 branch July 12, 2020 09:24
