Skip to content

Fix GroupBy nth Handling with Observed=False #26419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Aug 20, 2019

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented May 16, 2019

@WillAyd WillAyd added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Categorical Categorical Data Type labels May 16, 2019
@pep8speaks
Copy link

pep8speaks commented May 16, 2019

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-19 17:22:58 UTC

result_index = self.grouper.result_index
out.index = result_index[ids[mask]]

if not self.observed and isinstance(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to get this into _wrap_aggregated_output (which is what wraps most of the other methods) but that method doesn't appear to do any reindexing so would have to mess with that to make it work. Figured this as a one-off was easier though ideally would have everything consolidated...

Copy link
Contributor

@jreback jreback May 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you just need a small change I think in result_index. This is very similar to what we do with all_grouper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC I think result_index is fine and handles properly; the problem here is specific to nth where the mask on the preceding lines essentially ignores the work that result_index does to properly handle NA categorical data, hence the reindex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced here; this looks like just a bandaid to me and there is an underlying issue that needs fixing. also what about dropna=True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropna='all' seems to have a separate set of issues. Here's behavior on master:

>>> grp.nth(0, dropna='all')
cat
a    1.0
b    2.0
c    3.0
Name: ser, dtype: float64

Which is not correct since b and c are never associated with those values.

Also trying to mix that with observed doesn't work:

>>> grp = df.groupby('cat', observed=True)['ser']
>>> grp.nth(0, dropna='all')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1773, in nth
    result.index = self.grouper.result_index
  File "/Users/williamayd/clones/pandas/pandas/core/generic.py", line 5144, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
  File "/Users/williamayd/clones/pandas/pandas/core/series.py", line 381, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Users/williamayd/clones/pandas/pandas/core/internals/managers.py", line 155, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements

So both weird though I'm not sure how we want to handle the combinations of observed and dropna; will open a separate issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #26454

@codecov
Copy link

codecov bot commented May 16, 2019

Codecov Report

Merging #26419 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26419      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50739    50742       +3     
==========================================
- Hits        46523    46522       -1     
- Misses       4216     4220       +4
Flag Coverage Δ
#multiple 90.19% <100%> (ø) ⬆️
#single 41.16% <20%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.25% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80e2615...f671204. Read the comment docs.

@codecov
Copy link

codecov bot commented May 16, 2019

Codecov Report

Merging #26419 into master will decrease coverage by 0.31%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26419      +/-   ##
==========================================
- Coverage   92.04%   91.72%   -0.32%     
==========================================
  Files         180      174       -6     
  Lines       50714    50757      +43     
==========================================
- Hits        46679    46559     -120     
- Misses       4035     4198     +163
Flag Coverage Δ
#multiple 90.23% <100%> (-0.45%) ⬇️
#single 41.69% <20%> (-0.27%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/groupby.py 97.25% <100%> (+0.07%) ⬆️
pandas/plotting/_misc.py 38.23% <0%> (-26.63%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-21.06%) ⬇️
pandas/io/gcs.py 80% <0%> (-20%) ⬇️
pandas/io/s3.py 89.47% <0%> (-10.53%) ⬇️
pandas/core/groupby/base.py 91.83% <0%> (-8.17%) ⬇️
pandas/plotting/_core.py 83.89% <0%> (-5.49%) ⬇️
pandas/io/excel/_xlrd.py 94.54% <0%> (-5.46%) ⬇️
pandas/io/clipboard/clipboards.py 31.64% <0%> (-3.14%) ⬇️
pandas/core/internals/managers.py 93.93% <0%> (-2.93%) ⬇️
... and 102 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e955515...ad729c5. Read the comment docs.

result_index = self.grouper.result_index
out.index = result_index[ids[mask]]

if not self.observed and isinstance(
Copy link
Contributor

@jreback jreback May 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you just need a small change I think in result_index. This is very similar to what we do with all_grouper.

result_index = self.grouper.result_index
out.index = result_index[ids[mask]]

if not self.observed and isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced here; this looks like just a bandaid to me and there is an underlying issue that needs fixing. also what about dropna=True?

@WillAyd
Copy link
Member Author

WillAyd commented May 19, 2019

I think this makes sense to wait until #26463 gets merged and use _reindex_like from that once it gets moved into the base GroupBy class

@WillAyd
Copy link
Member Author

WillAyd commented Jun 3, 2019

So doesn't look like merging in those changes helped; I think we have a few random reindexing ops scattered throughout the module still investigating better ways to share that functionality

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

what's the status here?

@WillAyd
Copy link
Member Author

WillAyd commented Jun 27, 2019

Haven't been able to address this comment yet https://github.com/pandas-dev/pandas/pull/26419/files#r285344403 so depending on how much of a blocker you think that is may or may not be ready for merge.

I think as is gets us a lot closer to unifying nth / first / last so ultimately net benefit but open to debate

@jreback
Copy link
Contributor

jreback commented Jul 15, 2019

@WillAyd what was that comment above? this PR looks good.

@WillAyd
Copy link
Member Author

WillAyd commented Jul 15, 2019

ee549ed#r285344403

@WillAyd
Copy link
Member Author

WillAyd commented Jul 15, 2019

Hmm link to comments in old commits might not work but if you look at the entire commit you'll see it as the first one there

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green.

@@ -1141,6 +1141,7 @@ Groupby/resample/rolling
- Improved :class:`pandas.core.window.Rolling`, :class:`pandas.core.window.Window` and :class:`pandas.core.window.EWM` functions to exclude nuisance columns from results instead of raising errors and raise a ``DataError`` only if all columns are nuisance (:issue:`12537`)
- Bug in :meth:`pandas.core.window.Rolling.max` and :meth:`pandas.core.window.Rolling.min` where incorrect results are returned with an empty variable window (:issue:`26005`)
- Raise a helpful exception when an unsupported weighted window function is used as an argument of :meth:`pandas.core.window.Window.aggregate` (:issue:`26597`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where ``observed=False`` was being ignored for Categorical groupers (:issue:`26385`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.25.1

@jreback jreback added this to the 0.25.1 milestone Jul 25, 2019
@TomAugspurger
Copy link
Contributor

Fixed the merge conflict.

@TomAugspurger
Copy link
Contributor

Thanks @WillAyd. I think this is an improvement on master.

I believe there's an outstanding request for additional test coverage with observed=True? Can you double check that when you have time, and perhaps open an issue for it?

@TomAugspurger TomAugspurger merged commit d2031d7 into pandas-dev:master Aug 20, 2019
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 20, 2019
@WillAyd WillAyd deleted the nth-na-handling branch August 23, 2019 12:14
@WillAyd
Copy link
Member Author

WillAyd commented Aug 23, 2019

Thanks @TomAugspurger for pushing this over the finish line

galuhsahid pushed a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019
* Added test coverage for observed=False with ops

* Fixed issue with observed=False and nth
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GroupBy nth ignores observed keyword for Categorical
5 participants