BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

ArtificialQualia · 2019-03-15T22:14:51Z

closes GroupBy Not Throwing KeyError When Names Exist in MultiIndex #25704
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

There was an issue with groupby when you used groupby on a Series with a list of labels that was equal to the length of the entire Series.

core/groupby/grouper.py:_get_grouper was not properly handling the case where the list of labels provided were all in the Series index names. The DataFrame case was being handled, but Series was previously missed in this case.

codecov · 2019-03-15T22:51:07Z

Codecov Report

Merging #25743 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25743      +/-   ##
==========================================
+ Coverage   91.25%   91.25%   +<.01%     
==========================================
  Files         172      172              
  Lines       52977    52979       +2     
==========================================
+ Hits        48343    48348       +5     
+ Misses       4634     4631       -3

Flag	Coverage Δ
#multiple	`89.83% <100%> (ø)`	⬆️
#single	`41.74% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/grouper.py	`98.18% <100%> (-0.36%)`	⬇️
pandas/core/indexes/base.py	`96.78% <0%> (+0.21%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65c0441...6d0dfc1. Read the comment docs.

codecov · 2019-03-15T22:51:27Z

Codecov Report

Merging #25743 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25743      +/-   ##
==========================================
+ Coverage   91.81%   91.81%   +<.01%     
==========================================
  Files         175      175              
  Lines       52578    52580       +2     
==========================================
+ Hits        48273    48275       +2     
  Misses       4305     4305

Flag	Coverage Δ
#multiple	`90.36% <100%> (ø)`	⬆️
#single	`41.89% <0%> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/grouper.py	`98.16% <100%> (-0.36%)`	⬇️
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.73% <0%> (+0.1%)`	⬆️
pandas/core/indexes/base.py	`96.79% <0%> (+0.21%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0ba498...322b2ce. Read the comment docs.

WillAyd · 2019-03-18T04:41:27Z

doc/source/whatsnew/v0.25.0.rst

@@ -237,6 +237,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.nunique` in which the names of column levels were lost (:issue:`23222`)
 - Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
 - Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
+- Bug in :func:`Series.groupby` where using ``groupby`` with a :class:`MultiIndex` Series with a list of labels equal to the length of the series caused incorrect grouping (:issue:`25704`)


I think this is a little hard to read but potential incorrect anyway. Are you sure this stipulation about the length of the supplied labels w.r.t to the length of the series itself is a requirement to reproduce the issue?

I did the following locally and still saw the problem:

In [24]: index_array = [ ...: ['x', 'x'], ...: ['a', 'b'], ...: ['k', 'k'], ...: ['j', 'j'], ...: ] ...: index_names = ['first', 'second', 'third', 'fourth'] ...: ri = pd.MultiIndex.from_arrays(index_array, names=index_names) ...: s = pd.Series(data=[1, 2], index=ri) ...: result = s.groupby(['first', 'third']).sum()

Yeah, I obviously wasn't very clear, because your local example is the problem I was talking about. I may be confusing terminology somewhere.

Your Series has two data points, [1, 2].

The len of the list of labels in your groupby is also two, ['first', 'third'].

Those two being equal are the problem.

OK now actually on a subsequent read I think this is right here

jreback · 2019-03-19T21:57:07Z

lgtm. over to you @WillAyd

jreback · 2019-03-26T00:29:29Z

can update to @WillAyd comments

jreback · 2019-03-30T19:23:29Z

@ArtificialQualia can you merge master and update to comments

…x-groupby-multiindex

ArtificialQualia · 2019-03-30T20:45:26Z

@WillAyd sorry, I had commented on your issue a while ago but I guess it wasn't showing up correctly since it was under my 'review'. I deleted the review and re-added my comment to you.

WillAyd · 2019-03-30T23:45:38Z

Nice work @ArtificialQualia !

TomAugspurger · 2019-04-01T12:25:19Z

This may have caused a perf regression: http://pandas.pydata.org/speed/pandas/html/index.html#stat_ops.SeriesMultiIndexOps.time_op?p-op=%27sem%27&p-level=1&commits=819fc1525488b4da7f3d76051a85971a2e004fed

I haven’t looked closely.

WillAyd · 2019-04-01T16:12:36Z

Thanks for the callout - @ArtificialQualia could you take a look?

ArtificialQualia · 2019-04-01T21:00:45Z

@WillAyd it does appear this caused the regression due to those functions calling _get_grouper and then hitting all(g in obj.index.names for g in keys) when they didn't have to do that before.

However, with some refactoring of the if statements I can get this back down to the previous performance numbers (even slightly better in some cases). Do you want a separate issue/PR for that fix?

WillAyd · 2019-04-02T00:55:10Z

Yea sure separate PR makes sense

BUG: Fix groupby with MultiIndex Series corner case (pandas-dev#25704)

6d0dfc1

gfyoung added Bug Groupby MultiIndex labels Mar 16, 2019

WillAyd requested changes Mar 18, 2019

View reviewed changes

Merge branch 'master' into fix-groupby-multiindex

47c13c3

jreback added this to the 0.25.0 milestone Mar 19, 2019

Merge branch 'master' of https://github.com/pandas-dev/pandas into fi…

322b2ce

…x-groupby-multiindex

WillAyd approved these changes Mar 30, 2019

View reviewed changes

WillAyd merged commit 819fc15 into pandas-dev:master Mar 30, 2019

This was referenced Apr 2, 2019

Fix performance regression with Series operations #25952

Closed

PERF: Fix performance regression with Series statistical ops (#25952) #25953

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

ArtificialQualia commented Mar 15, 2019

codecov bot commented Mar 15, 2019

codecov bot commented Mar 15, 2019 •

edited

Loading

WillAyd Mar 18, 2019

ArtificialQualia Mar 30, 2019

WillAyd Mar 30, 2019

jreback commented Mar 19, 2019

jreback commented Mar 26, 2019

jreback commented Mar 30, 2019

ArtificialQualia commented Mar 30, 2019

WillAyd commented Mar 30, 2019

TomAugspurger commented Apr 1, 2019

WillAyd commented Apr 1, 2019

ArtificialQualia commented Apr 1, 2019

WillAyd commented Apr 2, 2019

BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743

Conversation

ArtificialQualia commented Mar 15, 2019

codecov bot commented Mar 15, 2019

Codecov Report

codecov bot commented Mar 15, 2019 • edited Loading

Codecov Report

WillAyd Mar 18, 2019

Choose a reason for hiding this comment

ArtificialQualia Mar 30, 2019

Choose a reason for hiding this comment

WillAyd Mar 30, 2019

Choose a reason for hiding this comment

jreback commented Mar 19, 2019

jreback commented Mar 26, 2019

jreback commented Mar 30, 2019

ArtificialQualia commented Mar 30, 2019

WillAyd commented Mar 30, 2019

TomAugspurger commented Apr 1, 2019

WillAyd commented Apr 1, 2019

ArtificialQualia commented Apr 1, 2019

WillAyd commented Apr 2, 2019

codecov bot commented Mar 15, 2019 •

edited

Loading