Skip to content

BUG: Remove null values before sorting during groupby nunique calculation #27951

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 7, 2019

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Aug 16, 2019

@jschendel jschendel added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Aug 16, 2019
@jschendel jschendel added this to the 0.25.1 milestone Aug 16, 2019
@jschendel jschendel mentioned this pull request Aug 17, 2019
5 tasks
@TomAugspurger
Copy link
Contributor

Does this completely close #27904? Or should it remain open for the case of dropna=False?

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Aug 20, 2019

In the case of dropna=False, what should nunique return? As in, do the NaNs all count as distinct values? I wouldn't expect them to, although as np.nan == np.nan returns False, so I'm not sure

@TomAugspurger
Copy link
Contributor

Just one NA value I think.

@TomAugspurger
Copy link
Contributor

And to be clear, we can leave handling dropna=False to a followup PR. Just want to make sure that we have an open issue for it.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for the dropna=True case. Will leave the issue open for dropna=False, which seems broken still.

@TomAugspurger
Copy link
Contributor

Actually... I'm going to push this off the 0.25.1 milestone (sorry @MarcoGorelli)

Can you check a few things:

  1. How's the performance of this? I assume np.lexsort has to do some kinds of NA masking and filtering internally. Are we much slower than that?
  2. It looks like there's a now unnecessary if dropna starting around line 1171 now. IIUC, val should now have all the na values removed, so masking again shouldn't be necessary.

@TomAugspurger TomAugspurger modified the milestones: 0.25.1, 1.0 Aug 22, 2019
@MarcoGorelli
Copy link
Member Author

Sorry for the delay - yes, I agree with your decision, Tom :) I'll pick this up again next week

@pep8speaks
Copy link

pep8speaks commented Aug 22, 2019

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-06 15:50:47 UTC

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Aug 22, 2019

As noted as a comment in #27904, this works fine with numpy's own nat value...is it considered cheating just replace pandas.NaT with np.datetime64("NaT")? I tried that, and both the existing and the new tests (which include dropna=False) pass.

@WillAyd
Copy link
Member

WillAyd commented Aug 23, 2019

Haven't looked deeply yet but does this also impact Series and DataFrame methods? The location of the change would imply so hence my reason for asking

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment. Can you also add a whatsnew for v1.0.0

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Aug 23, 2019

Thanks for your review!

Sure, whatsnew entry added.

If I've understood your question, then the pandas.DataFrame.nunique and pandas.Series.nunique methods didn't have this issue to begin with, and are unaffected by this change.

@WillAyd
Copy link
Member

WillAyd commented Aug 23, 2019

Ah misread the file location but makes sense so thanks for confirming. Should be able to review more deeply over the next few days

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for delay in review. Looks pretty good some general comments

@@ -1143,6 +1143,9 @@ def nunique(self, dropna=True):

val = self.obj._internal_get_values()

# GH 27951
val[isna(val)] = np.datetime64("NaT")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the actual root of the issue is a bug in NumPy as described by @TomAugspurger where NaT values are not sorted as you'd expected

numpy/numpy#12629

So I think this works for now but maybe add a comment about NumPy bug 12629 for reference

@@ -1015,6 +1025,81 @@ def test_nunique_with_timegrouper():
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.

Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - have modified it so the DataFrame is constructed within the test

@MarcoGorelli MarcoGorelli changed the title Remove null values before sorting during groupby nunique calculation BUG: Remove null values before sorting during groupby nunique calculation Sep 6, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - @TomAugspurger mind another quick look?

@TomAugspurger TomAugspurger merged commit 820072a into pandas-dev:master Sep 7, 2019
@TomAugspurger
Copy link
Contributor

Thanks @MarcoGorelli!

Can you confirm that this fixed all of #27904, or are there outstanding tasks?

@MarcoGorelli
Copy link
Member Author

That's right - thanks for having taken the time to review my work!

@MarcoGorelli MarcoGorelli deleted the fix-nunique-groupby branch September 7, 2019 12:17
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby nunique() with dates vs datetimes in presence of NaTs
5 participants