-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix groupby with MultiIndex Series corner case (#25704) #25743
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25743 +/- ##
==========================================
+ Coverage 91.25% 91.25% +<.01%
==========================================
Files 172 172
Lines 52977 52979 +2
==========================================
+ Hits 48343 48348 +5
+ Misses 4634 4631 -3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25743 +/- ##
==========================================
+ Coverage 91.81% 91.81% +<.01%
==========================================
Files 175 175
Lines 52578 52580 +2
==========================================
+ Hits 48273 48275 +2
Misses 4305 4305
Continue to review full report at Codecov.
|
@@ -237,6 +237,7 @@ Groupby/Resample/Rolling | |||
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.nunique` in which the names of column levels were lost (:issue:`23222`) | |||
- Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`) | |||
- Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`) | |||
- Bug in :func:`Series.groupby` where using ``groupby`` with a :class:`MultiIndex` Series with a list of labels equal to the length of the series caused incorrect grouping (:issue:`25704`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a little hard to read but potential incorrect anyway. Are you sure this stipulation about the length of the supplied labels w.r.t to the length of the series itself is a requirement to reproduce the issue?
I did the following locally and still saw the problem:
In [24]: index_array = [
...: ['x', 'x'],
...: ['a', 'b'],
...: ['k', 'k'],
...: ['j', 'j'],
...: ]
...: index_names = ['first', 'second', 'third', 'fourth']
...: ri = pd.MultiIndex.from_arrays(index_array, names=index_names)
...: s = pd.Series(data=[1, 2], index=ri)
...: result = s.groupby(['first', 'third']).sum()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I obviously wasn't very clear, because your local example is the problem I was talking about. I may be confusing terminology somewhere.
Your Series has two data points, [1, 2]
.
The len
of the list of labels in your groupby is also two, ['first', 'third']
.
Those two being equal are the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK now actually on a subsequent read I think this is right here
lgtm. over to you @WillAyd |
can update to @WillAyd comments |
@ArtificialQualia can you merge master and update to comments |
…x-groupby-multiindex
@WillAyd sorry, I had commented on your issue a while ago but I guess it wasn't showing up correctly since it was under my 'review'. I deleted the review and re-added my comment to you. |
Nice work @ArtificialQualia ! |
This may have caused a perf regression: http://pandas.pydata.org/speed/pandas/html/index.html#stat_ops.SeriesMultiIndexOps.time_op?p-op=%27sem%27&p-level=1&commits=819fc1525488b4da7f3d76051a85971a2e004fed I haven’t looked closely. |
Thanks for the callout - @ArtificialQualia could you take a look? |
@WillAyd it does appear this caused the regression due to those functions calling However, with some refactoring of the |
Yea sure separate PR makes sense |
git diff upstream/master -u -- "*.py" | flake8 --diff
There was an issue with
groupby
when you usedgroupby
on aSeries
with a list of labels that was equal to the length of the entireSeries
.core/groupby/grouper.py:_get_grouper
was not properly handling the case where the list of labels provided were all in theSeries
index names. TheDataFrame
case was being handled, butSeries
was previously missed in this case.