-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix GroupBy nth Handling with Observed=False #26419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/groupby.py
Outdated
result_index = self.grouper.result_index | ||
out.index = result_index[ids[mask]] | ||
|
||
if not self.observed and isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping to get this into _wrap_aggregated_output
(which is what wraps most of the other methods) but that method doesn't appear to do any reindexing so would have to mess with that to make it work. Figured this as a one-off was easier though ideally would have everything consolidated...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you just need a small change I think in result_index
. This is very similar to what we do with all_grouper
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC I think result_index
is fine and handles properly; the problem here is specific to nth
where the mask on the preceding lines essentially ignores the work that result_index
does to properly handle NA categorical data, hence the reindex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced here; this looks like just a bandaid to me and there is an underlying issue that needs fixing. also what about dropna=True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropna='all'
seems to have a separate set of issues. Here's behavior on master:
>>> grp.nth(0, dropna='all')
cat
a 1.0
b 2.0
c 3.0
Name: ser, dtype: float64
Which is not correct since b
and c
are never associated with those values.
Also trying to mix that with observed doesn't work:
>>> grp = df.groupby('cat', observed=True)['ser']
>>> grp.nth(0, dropna='all')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1773, in nth
result.index = self.grouper.result_index
File "/Users/williamayd/clones/pandas/pandas/core/generic.py", line 5144, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
obj._set_axis(self.axis, value)
File "/Users/williamayd/clones/pandas/pandas/core/series.py", line 381, in _set_axis
self._data.set_axis(axis, labels)
File "/Users/williamayd/clones/pandas/pandas/core/internals/managers.py", line 155, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements
So both weird though I'm not sure how we want to handle the combinations of observed and dropna; will open a separate issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #26454
Codecov Report
@@ Coverage Diff @@
## master #26419 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50739 50742 +3
==========================================
- Hits 46523 46522 -1
- Misses 4216 4220 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26419 +/- ##
==========================================
- Coverage 92.04% 91.72% -0.32%
==========================================
Files 180 174 -6
Lines 50714 50757 +43
==========================================
- Hits 46679 46559 -120
- Misses 4035 4198 +163
Continue to review full report at Codecov.
|
pandas/core/groupby/groupby.py
Outdated
result_index = self.grouper.result_index | ||
out.index = result_index[ids[mask]] | ||
|
||
if not self.observed and isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you just need a small change I think in result_index
. This is very similar to what we do with all_grouper
.
pandas/core/groupby/groupby.py
Outdated
result_index = self.grouper.result_index | ||
out.index = result_index[ids[mask]] | ||
|
||
if not self.observed and isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced here; this looks like just a bandaid to me and there is an underlying issue that needs fixing. also what about dropna=True?
I think this makes sense to wait until #26463 gets merged and use |
So doesn't look like merging in those changes helped; I think we have a few random reindexing ops scattered throughout the module still investigating better ways to share that functionality |
what's the status here? |
Haven't been able to address this comment yet https://github.com/pandas-dev/pandas/pull/26419/files#r285344403 so depending on how much of a blocker you think that is may or may not be ready for merge. I think as is gets us a lot closer to unifying nth / first / last so ultimately net benefit but open to debate |
@WillAyd what was that comment above? this PR looks good. |
Hmm link to comments in old commits might not work but if you look at the entire commit you'll see it as the first one there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ping on green.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -1141,6 +1141,7 @@ Groupby/resample/rolling | |||
- Improved :class:`pandas.core.window.Rolling`, :class:`pandas.core.window.Window` and :class:`pandas.core.window.EWM` functions to exclude nuisance columns from results instead of raising errors and raise a ``DataError`` only if all columns are nuisance (:issue:`12537`) | |||
- Bug in :meth:`pandas.core.window.Rolling.max` and :meth:`pandas.core.window.Rolling.min` where incorrect results are returned with an empty variable window (:issue:`26005`) | |||
- Raise a helpful exception when an unsupported weighted window function is used as an argument of :meth:`pandas.core.window.Window.aggregate` (:issue:`26597`) | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where ``observed=False`` was being ignored for Categorical groupers (:issue:`26385`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.25.1
Fixed the merge conflict. |
Thanks @WillAyd. I think this is an improvement on master. I believe there's an outstanding request for additional test coverage with |
Thanks @TomAugspurger for pushing this over the finish line |
* Added test coverage for observed=False with ops * Fixed issue with observed=False and nth
git diff upstream/master -u -- "*.py" | flake8 --diff