Fix GroupBy nth Handling with Observed=False #26419

WillAyd · 2019-05-16T00:51:52Z

- closes "GroupBy nth ignores observed keyword for Categorical" #26385
- tests added / passed
- [ ] passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

pep8speaks · 2019-05-16T00:51:56Z

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-19 17:22:58 UTC

WillAyd · 2019-05-16T00:53:07Z

pandas/core/groupby/groupby.py

+            result_index = self.grouper.result_index
+            out.index = result_index[ids[mask]]
+
+            if not self.observed and isinstance(


I was hoping to get this into _wrap_aggregated_output (which wraps most of the other methods), but that method doesn't appear to do any reindexing, so I would have to modify it to make this work. Doing this as a one-off seemed easier, though ideally everything would be consolidated.

I think you just need a small change in result_index. This is very similar to what we do with all_grouper.

IIUC result_index is fine and handles this properly; the problem is specific to nth, where the mask on the preceding lines essentially ignores the work that result_index does to properly handle NA categorical data, hence the reindex.

I am not convinced here; this looks like just a bandaid to me, and there is an underlying issue that needs fixing. Also, what about dropna=True?

dropna='all' seems to have a separate set of issues. Here's behavior on master:

    >>> grp.nth(0, dropna='all')
    cat
    a    1.0
    b    2.0
    c    3.0
    Name: ser, dtype: float64

Which is not correct since b and c are never associated with those values.

Also trying to mix that with observed doesn't work:

    >>> grp = df.groupby('cat', observed=True)['ser']
    >>> grp.nth(0, dropna='all')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1773, in nth
        result.index = self.grouper.result_index
      File "/Users/williamayd/clones/pandas/pandas/core/generic.py", line 5144, in __setattr__
        return object.__setattr__(self, name, value)
      File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
        obj._set_axis(self.axis, value)
      File "/Users/williamayd/clones/pandas/pandas/core/series.py", line 381, in _set_axis
        self._data.set_axis(axis, labels)
      File "/Users/williamayd/clones/pandas/pandas/core/internals/managers.py", line 155, in set_axis
        'values have {new} elements'.format(old=old_len, new=new_len))
    ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements

So both are weird, though I'm not sure how we want to handle the combinations of observed and dropna; I will open a separate issue.
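To make the exchange above concrete, here is a toy pure-Python model of the behaviour under discussion. It is not pandas code: `nth_per_group` and its parameters are hypothetical names, `None` stands in for NaN, and only non-negative `n` is handled. The sketch shows how a mask-style selection on its own drops unobserved categories, and how reindexing against the full category list restores them when observed=False.

```python
def nth_per_group(keys, values, categories, n=0, observed=False):
    """Toy model of a grouped nth with a categorical key (hypothetical helper).

    keys holds one category label per row, or None for a missing key.
    Returns a list of (category, value) pairs in category order.
    """
    # Step 1: mask-style selection. Only the row whose position inside its
    # group equals n survives, so unobserved categories never show up here.
    seen = {}
    picked = {}
    for key, value in zip(keys, values):
        if key is None:
            continue  # rows with a missing key belong to no group
        position = seen.get(key, 0)
        if position == n:
            picked[key] = value
        seen[key] = position + 1

    if observed:
        # Keep only categories that actually occur in the data.
        return [(c, picked[c]) for c in categories if c in picked]

    # Step 2 (observed=False): reindex against every category and fill the
    # gaps with None, which stands in for NaN in this sketch.
    return [(c, picked.get(c)) for c in categories]


# Assumed data shaped like the thread's example: only "a" is ever observed.
keys = ["a", None, None]
values = [1.0, 2.0, 3.0]
categories = ["a", "b", "c"]

print(nth_per_group(keys, values, categories, observed=True))
# [('a', 1.0)]
print(nth_per_group(keys, values, categories, observed=False))
# [('a', 1.0), ('b', None), ('c', None)]
```

Skipping step 2 corresponds to the reported bug: the result contains only the observed category even though observed=False was requested.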

pandas/tests/groupby/test_categorical.py

pandas/tests/groupby/conftest.py

pandas/core/groupby/groupby.py

codecov · 2019-05-16T01:30:58Z

Codecov Report

Merging #26419 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26419      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50739    50742       +3     
==========================================
- Hits        46523    46522       -1     
- Misses       4216     4220       +4

Flag	Coverage Δ
#multiple	`90.19% <100%> (ø)`	⬆️
#single	`41.16% <20%> (-0.16%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`97.25% <100%> (+0.01%)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.02% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80e2615...f671204. Read the comment docs.

codecov · 2019-05-16T01:30:59Z

Codecov Report

Merging #26419 into master will decrease coverage by 0.31%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26419      +/-   ##
==========================================
- Coverage   92.04%   91.72%   -0.32%     
==========================================
  Files         180      174       -6     
  Lines       50714    50757      +43     
==========================================
- Hits        46679    46559     -120     
- Misses       4035     4198     +163

Flag	Coverage Δ
#multiple	`90.23% <100%> (-0.45%)`	⬇️
#single	`41.69% <20%> (-0.27%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`97.25% <100%> (+0.07%)`	⬆️
pandas/plotting/_misc.py	`38.23% <0%> (-26.63%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-21.06%)`	⬇️
pandas/io/gcs.py	`80% <0%> (-20%)`	⬇️
pandas/io/s3.py	`89.47% <0%> (-10.53%)`	⬇️
pandas/core/groupby/base.py	`91.83% <0%> (-8.17%)`	⬇️
pandas/plotting/_core.py	`83.89% <0%> (-5.49%)`	⬇️
pandas/io/excel/_xlrd.py	`94.54% <0%> (-5.46%)`	⬇️
pandas/io/clipboard/clipboards.py	`31.64% <0%> (-3.14%)`	⬇️
pandas/core/internals/managers.py	`93.93% <0%> (-2.93%)`	⬇️
... and 102 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...ad729c5. Read the comment docs.

jreback · 2019-05-16T01:37:37Z

pandas/core/groupby/groupby.py

+            result_index = self.grouper.result_index
+            out.index = result_index[ids[mask]]
+
+            if not self.observed and isinstance(


I think you just need a small change in result_index. This is very similar to what we do with all_grouper.

pandas/tests/groupby/conftest.py

pandas/tests/groupby/test_categorical.py

jreback · 2019-05-18T14:32:30Z

pandas/core/groupby/groupby.py

+            result_index = self.grouper.result_index
+            out.index = result_index[ids[mask]]
+
+            if not self.observed and isinstance(


I am not convinced here; this looks like just a bandaid to me, and there is an underlying issue that needs fixing. Also, what about dropna=True?

WillAyd · 2019-05-19T22:48:16Z

I think it makes sense to wait until #26463 gets merged and then use _reindex_like from it once that gets moved into the base GroupBy class.

WillAyd · 2019-06-03T23:44:53Z

So it doesn't look like merging in those changes helped. I think we have a few random reindexing ops scattered throughout the module; still investigating better ways to share that functionality.

jreback · 2019-06-27T03:41:21Z

what's the status here?

WillAyd · 2019-06-27T03:46:21Z

I haven't been able to address this comment yet: https://github.com/pandas-dev/pandas/pull/26419/files#r285344403. Depending on how much of a blocker you think that is, this may or may not be ready for merge.

I think this as is gets us a lot closer to unifying nth / first / last, so it is ultimately a net benefit, but I'm open to debate.

TomAugspurger · 2019-07-15T13:29:28Z

@WillAyd linting issue: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=14466&view=logs&jobId=435ca956-9126-505f-566f-fff31072e2ba&taskId=ffc59b44-7438-5626-aa7a-09fa4141874b&lineStart=43&lineEnd=44&colStart=1&colEnd=1

jreback · 2019-07-15T21:19:52Z

@WillAyd what was that comment above? This PR looks good.

WillAyd · 2019-07-15T21:21:55Z

ee549ed#r285344403

WillAyd · 2019-07-15T21:24:58Z

Hmm, links to comments in old commits might not work, but if you look at the entire commit you'll see it as the first one there.

jreback

lgtm. ping on green.

jreback · 2019-07-25T22:38:24Z

doc/source/whatsnew/v0.25.0.rst

@@ -1141,6 +1141,7 @@ Groupby/resample/rolling
 - Improved :class:`pandas.core.window.Rolling`, :class:`pandas.core.window.Window` and :class:`pandas.core.window.EWM` functions to exclude nuisance columns from results instead of raising errors and raise a ``DataError`` only if all columns are nuisance (:issue:`12537`)
 - Bug in :meth:`pandas.core.window.Rolling.max` and :meth:`pandas.core.window.Rolling.min` where incorrect results are returned with an empty variable window (:issue:`26005`)
 - Raise a helpful exception when an unsupported weighted window function is used as an argument of :meth:`pandas.core.window.Window.aggregate` (:issue:`26597`)
+- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where ``observed=False`` was being ignored for Categorical groupers (:issue:`26385`)


move to 0.25.1

TomAugspurger · 2019-08-19T17:22:55Z

Fixed the merge conflict.


TomAugspurger · 2019-08-20T19:02:00Z

Thanks @WillAyd. I think this is an improvement on master.

I believe there's an outstanding request for additional test coverage with observed=True? Can you double check that when you have time, and perhaps open an issue for it?
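A check along the following lines could exercise both settings of observed. It is only a sketch: the data is an assumed reproduction shaped like the thread's example, `first_by_category` is a hypothetical helper, and it uses `first()` rather than `nth()` because the return shape of `nth` has differed between pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed reproduction data: one observed category, two rows with a missing key.
cat = pd.Categorical(["a", np.nan, np.nan], categories=["a", "b", "c"])
df = pd.DataFrame({"cat": cat, "ser": [1.0, 2.0, 3.0]})


def first_by_category(frame, observed):
    """Hypothetical helper: first value of 'ser' per category."""
    return frame.groupby("cat", observed=observed)["ser"].first()


observed_only = first_by_category(df, observed=True)
all_categories = first_by_category(df, observed=False)

# observed=True keeps only the categories that occur in the data.
assert list(observed_only.index) == ["a"]

# observed=False reports every category, with NaN for the unobserved ones.
assert list(all_categories.index) == ["a", "b", "c"]
assert all_categories.isna().tolist() == [False, True, True]
```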


WillAyd · 2019-08-23T12:15:10Z

Thanks @TomAugspurger for pushing this over the finish line

* Added test coverage for observed=False with ops
* Fixed issue with observed=False and nth

WillAyd added 4 commits May 15, 2019 16:04

Added test coverage for observed=False with ops

03ee26b

Fixed issue with observed=False and nth

ee549ed

Stubbed whatsnew note

f0a510d

Merge remote-tracking branch 'upstream/master' into nth-na-handling

f671204

WillAyd added the Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and Categorical (Categorical Data Type) labels May 16, 2019

WillAyd commented May 16, 2019


jreback requested changes May 16, 2019


WillAyd added 2 commits May 17, 2019 08:01

Merge remote-tracking branch 'upstream/master' into nth-na-handling

94dda01

lint fixup

e59a991

jreback requested changes May 18, 2019


WillAyd added 3 commits May 18, 2019 19:19

Simplified test

3677471

Merge remote-tracking branch 'upstream/master' into nth-na-handling

34c2f06

whatsnew whitespace fix

2ca34e3

Merge remote-tracking branch 'upstream/master' into nth-na-handling

d3e5efa

Merge remote-tracking branch 'upstream/master' into nth-na-handling

f9758b8

WillAyd added 2 commits June 26, 2019 22:47

Merge remote-tracking branch 'upstream/master' into nth-na-handling

ad729c5

Merge remote-tracking branch 'upstream/master' into nth-na-handling

5b7b6bc

WillAyd added 3 commits July 15, 2019 09:04

Merge remote-tracking branch 'upstream/master' into nth-na-handling

aff7327

blackify

47201fb

Merge remote-tracking branch 'upstream/master' into nth-na-handling

56822cc

Removed doc whitespace

1804e27

jreback requested changes Jul 25, 2019


jreback added this to the 0.25.1 milestone Jul 25, 2019

WillAyd added 2 commits July 25, 2019 15:39

Merge remote-tracking branch 'upstream/master' into nth-na-handling

4c2e413

moved whatsnew to 0.25.1

a837564

Merge remote-tracking branch 'upstream/master' into WillAyd-nth-na-handling

308e569

TomAugspurger merged commit d2031d7 into pandas-dev:master Aug 20, 2019

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 20, 2019

Backport PR pandas-dev#26419: Fix GroupBy nth Handling with Observed=False

8f530a1

meeseeksmachine mentioned this pull request Aug 20, 2019

Backport PR #26419 on branch 0.25.x (Fix GroupBy nth Handling with Observed=False) #28042

Merged

gfyoung pushed a commit that referenced this pull request Aug 20, 2019

Backport PR #26419: Fix GroupBy nth Handling with Observed=False (#28042)

3ac5bba

WillAyd deleted the nth-na-handling branch August 23, 2019 12:14

galuhsahid pushed a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019

Fix GroupBy nth Handling with Observed=False (pandas-dev#26419)

0c490e0

* Added test coverage for observed=False with ops
* Fixed issue with observed=False and nth
