BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

Merged
merged 3 commits into from
Mar 30, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -371,6 +371,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.nunique` in which the names of column levels were lost (:issue:`23222`)
- Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
- Bug in :func:`Series.groupby` where using ``groupby`` with a :class:`MultiIndex` Series with a list of labels equal to the length of the series caused incorrect grouping (:issue:`25704`)
Member

I think this is a little hard to read and potentially incorrect anyway. Are you sure that the stipulation about the length of the supplied labels relative to the length of the Series itself is required to reproduce the issue?

I did the following locally and still saw the problem:

    import pandas as pd

    index_array = [
        ['x', 'x'],
        ['a', 'b'],
        ['k', 'k'],
        ['j', 'j'],
    ]
    index_names = ['first', 'second', 'third', 'fourth']
    ri = pd.MultiIndex.from_arrays(index_array, names=index_names)
    s = pd.Series(data=[1, 2], index=ri)
    result = s.groupby(['first', 'third']).sum()

Contributor Author

Yeah, I obviously wasn't very clear, because your local example is the problem I was talking about. I may be confusing terminology somewhere.

Your Series has two data points, [1, 2].

The len of the list of labels in your groupby is also two, ['first', 'third'].

Those two lengths being equal is the problem.
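
The condition described in this comment can be checked with a short script. This is a minimal sketch, assuming a pandas version that already includes this fix; the names `keys`, `by_names`, and `by_level` are illustrative and do not come from the PR. It groups a two-row Series by a list of two level names, then by the same levels passed explicitly through `level=`.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['x', 'x'], ['a', 'b'], ['k', 'k']],
    names=['first', 'second', 'third'],
)
s = pd.Series([1, 2], index=index)

keys = ['first', 'third']
# The corner case: the list holds as many keys as the Series has rows.
assert len(keys) == len(s)

# Group by a plain list of level names (the call affected by the bug).
by_names = s.groupby(keys).sum()
# Group by the same levels, named explicitly through `level=`.
by_level = s.groupby(level=keys).sum()

print(by_names)
print(by_level)
```

On a fixed version both calls should produce a single group keyed by `('x', 'k')`. Passing `level=` names the index levels explicitly, so it does not depend on how a bare list of keys is interpreted.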

Member

OK, on a subsequent read I think this is actually right.

- Ensured that ordering of outputs in ``groupby`` aggregation functions is consistent across all versions of Python (:issue:`25692`)
- Bug in :func:`idxmax` and :func:`idxmin` on :meth:`DataFrame.groupby` with datetime column would return incorrect dtype (:issue:`25444`, :issue:`15306`)

2 changes: 2 additions & 0 deletions pandas/core/groupby/grouper.py
@@ -524,6 +524,8 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
        if isinstance(obj, DataFrame):
            all_in_columns_index = all(g in obj.columns or g in obj.index.names
                                       for g in keys)
        elif isinstance(obj, Series):
            all_in_columns_index = all(g in obj.index.names for g in keys)
        else:
            all_in_columns_index = False
    except Exception:
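
The effect of the new `elif` branch can be modeled without pandas. The sketch below is a simplified, hypothetical model and not the actual body of `_get_grouper`: the function `interpret_keys` and its `check_index_names` flag are invented for illustration, and the assumption that a list matching the axis length is otherwise read as one array of per-row labels comes from the discussion in this PR rather than from the hunk itself.

```python
def interpret_keys(keys, index_names, axis_length, check_index_names=True):
    """Hypothetical model of how a list of groupby keys is read for a Series.

    Returns "levels" when each key is taken as the name of an index level,
    and "labels" when the whole list is taken as one array of per-row labels.
    Error handling for unknown keys is deliberately left out.
    """
    match_axis_length = len(keys) == axis_length

    # Mirrors the added branch: are all keys names of index levels?
    # check_index_names=False models the behaviour before the fix,
    # where this check was never made for a Series.
    all_in_index = check_index_names and all(k in index_names for k in keys)

    if match_axis_length and not all_in_index:
        return "labels"
    return "levels"


names = ['first', 'second', 'third']

# Two keys, two rows, both keys are level names: grouped by level.
print(interpret_keys(['first', 'third'], names, axis_length=2))
# Same call without the index-name check: read as row labels.
print(interpret_keys(['first', 'third'], names, axis_length=2,
                     check_index_names=False))
# Keys that are not level names are still read as row labels.
print(interpret_keys(['u', 'v'], names, axis_length=2))
```

The model only captures the ambiguity that arises when the number of keys equals the length of the grouped axis. Other inputs handled by the real function (callables, `Grouper` objects, array-likes) are outside its scope.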
20 changes: 20 additions & 0 deletions pandas/tests/groupby/test_groupby.py
@@ -1719,3 +1719,23 @@ def test_groupby_empty_list_raises():
    msg = "Grouper and axis must be same length"
    with pytest.raises(ValueError, match=msg):
        df.groupby([[]])


def test_groupby_multiindex_series_keys_len_equal_group_axis():
    # GH 25704
    index_array = [
        ['x', 'x'],
        ['a', 'b'],
        ['k', 'k']
    ]
    index_names = ['first', 'second', 'third']
    ri = pd.MultiIndex.from_arrays(index_array, names=index_names)
    s = pd.Series(data=[1, 2], index=ri)
    result = s.groupby(['first', 'third']).sum()

    index_array = [['x'], ['k']]
    index_names = ['first', 'third']
    ei = pd.MultiIndex.from_arrays(index_array, names=index_names)
    expected = pd.Series([3], index=ei)

    assert_series_equal(result, expected)