Skip to content

BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 30, 2019

Conversation

ArtificialQualia
Copy link
Contributor

There was an issue with groupby when you used groupby on a Series with a list of labels that was equal to the length of the entire Series.

core/groupby/grouper.py:_get_grouper was not properly handling the case where the list of labels provided were all in the Series index names. The DataFrame case was being handled, but Series was previously missed in this case.

@codecov
Copy link

codecov bot commented Mar 15, 2019

Codecov Report

Merging #25743 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25743      +/-   ##
==========================================
+ Coverage   91.25%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52977    52979       +2     
==========================================
+ Hits        48343    48348       +5     
+ Misses       4634     4631       -3
Flag Coverage Δ
#multiple 89.83% <100%> (ø) ⬆️
#single 41.74% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 98.18% <100%> (-0.36%) ⬇️
pandas/core/indexes/base.py 96.78% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65c0441...6d0dfc1. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 15, 2019

Codecov Report

Merging #25743 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25743      +/-   ##
==========================================
+ Coverage   91.81%   91.81%   +<.01%     
==========================================
  Files         175      175              
  Lines       52578    52580       +2     
==========================================
+ Hits        48273    48275       +2     
  Misses       4305     4305
Flag Coverage Δ
#multiple 90.36% <100%> (ø) ⬆️
#single 41.89% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/grouper.py 98.16% <100%> (-0.36%) ⬇️
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.73% <0%> (+0.1%) ⬆️
pandas/core/indexes/base.py 96.79% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0ba498...322b2ce. Read the comment docs.

@@ -237,6 +237,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.nunique` in which the names of column levels were lost (:issue:`23222`)
- Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
- Bug in :func:`Series.groupby` where using ``groupby`` with a :class:`MultiIndex` Series with a list of labels equal to the length of the series caused incorrect grouping (:issue:`25704`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a little hard to read but potential incorrect anyway. Are you sure this stipulation about the length of the supplied labels w.r.t to the length of the series itself is a requirement to reproduce the issue?

I did the following locally and still saw the problem:

In [24]:     index_array = [
    ...:         ['x', 'x'],
    ...:         ['a', 'b'],
    ...:         ['k', 'k'],
    ...:         ['j', 'j'],
    ...:     ]
    ...:     index_names = ['first', 'second', 'third', 'fourth']
    ...:     ri = pd.MultiIndex.from_arrays(index_array, names=index_names)
    ...:     s = pd.Series(data=[1, 2], index=ri)
    ...:     result = s.groupby(['first', 'third']).sum()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I obviously wasn't very clear, because your local example is the problem I was talking about. I may be confusing terminology somewhere.

Your Series has two data points, [1, 2].

The len of the list of labels in your groupby is also two, ['first', 'third'].

Those two being equal are the problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK now actually on a subsequent read I think this is right here

@jreback jreback added this to the 0.25.0 milestone Mar 19, 2019
@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

lgtm. over to you @WillAyd

@jreback
Copy link
Contributor

jreback commented Mar 26, 2019

can update to @WillAyd comments

@jreback
Copy link
Contributor

jreback commented Mar 30, 2019

@ArtificialQualia can you merge master and update to comments

@ArtificialQualia
Copy link
Contributor Author

@WillAyd sorry, I had commented on your issue a while ago but I guess it wasn't showing up correctly since it was under my 'review'. I deleted the review and re-added my comment to you.

@WillAyd WillAyd merged commit 819fc15 into pandas-dev:master Mar 30, 2019
@WillAyd
Copy link
Member

WillAyd commented Mar 30, 2019

Nice work @ArtificialQualia !

@TomAugspurger
Copy link
Contributor

@WillAyd
Copy link
Member

WillAyd commented Apr 1, 2019

Thanks for the callout - @ArtificialQualia could you take a look?

@ArtificialQualia
Copy link
Contributor Author

@WillAyd it does appear this caused the regression due to those functions calling _get_grouper and then hitting all(g in obj.index.names for g in keys) when they didn't have to do that before.

However, with some refactoring of the if statements I can get this back down to the previous performance numbers (even slightly better in some cases). Do you want a separate issue/PR for that fix?

@WillAyd
Copy link
Member

WillAyd commented Apr 2, 2019

Yea sure separate PR makes sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy Not Throwing KeyError When Names Exist in MultiIndex
5 participants