BUG: Remove null values before sorting during groupby nunique calculation #27951

MarcoGorelli · 2019-08-16T16:52:28Z

Closes groupby nunique() with dates vs datetimes in presence of NaTs #27904
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

doc/source/whatsnew/v0.25.1.rst

pandas/core/groupby/generic.py

pandas/tests/groupby/test_function.py

TomAugspurger · 2019-08-19T21:44:29Z

Does this completely close #27904? Or should it remain open for the case of dropna=False?

MarcoGorelli · 2019-08-20T12:47:53Z

In the case of dropna=False, what should nunique return? As in, do the NaNs all count as distinct values? I wouldn't expect them to, but since np.nan == np.nan returns False, I'm not sure.

TomAugspurger · 2019-08-20T16:19:29Z

Just one NA value I think.

TomAugspurger · 2019-08-20T16:25:07Z

And to be clear, we can leave handling dropna=False to a followup PR. Just want to make sure that we have an open issue for it.
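To make the dropna=False semantics discussed above concrete, here is a minimal pure-Python sketch of "all missing values count as one NA value". It is not the pandas implementation; `is_null` and `nunique` are hypothetical helpers written only for illustration, with `is_null` standing in for pandas' `isna`.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (a stand-in for pandas' isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nunique(values, dropna=True):
    """Count distinct values, collapsing every missing value into one NA."""
    non_null = {v for v in values if not is_null(v)}
    has_null = any(is_null(v) for v in values)
    # With dropna=False, all missing values together contribute exactly one,
    # even though NaN != NaN under ordinary equality.
    return len(non_null) + (0 if dropna or not has_null else 1)


values = [1.0, float("nan"), 2.0, float("nan"), 1.0]
print(nunique(values))                # 2
print(nunique(values, dropna=False))  # 3
```

Checking for nulls explicitly, rather than relying on equality, is what keeps the two NaNs from being counted as two distinct values.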

TomAugspurger

This looks good for the dropna=True case. Will leave the issue open for dropna=False, which seems broken still.

TomAugspurger · 2019-08-22T13:22:07Z

Actually... I'm going to push this off the 0.25.1 milestone (sorry @MarcoGorelli)

Can you check a few things:

How's the performance of this? I assume np.lexsort has to do some kind of NA masking and filtering internally. Are we much slower than that?
It looks like there's a now-unnecessary if dropna starting around line 1171. IIUC, val should now have all the NA values removed, so masking again shouldn't be necessary.
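For readers following the two points above, the idea named in the PR title (remove nulls first, then sort and count changes) can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the code in pandas/core/groupby/generic.py, which works on NumPy arrays and, per the comment above, np.lexsort. The names `is_null` and `group_nunique` are hypothetical.

```python
import math


def is_null(value):
    """Treat None and float NaN as missing (a stand-in for pandas' isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def group_nunique(group_ids, values):
    """Per-group distinct count (dropna=True semantics) via sorting.

    Nulls are removed before sorting, so the sort never has to order a
    missing value against a real one.
    """
    pairs = [(g, v) for g, v in zip(group_ids, values) if not is_null(v)]
    pairs.sort()  # orders by group first, then by value within each group
    counts = {g: 0 for g in group_ids}  # groups holding only nulls stay at 0
    previous = None
    for pair in pairs:
        if pair != previous:  # a new (group, value) combination starts here
            counts[pair[0]] += 1
        previous = pair
    return counts


group_ids = ["a", "a", "a", "b", "b", "c"]
values = [1.0, float("nan"), 1.0, 2.0, 3.0, None]
print(group_nunique(group_ids, values))  # {'a': 1, 'b': 2, 'c': 0}
```

Because the nulls are filtered out up front, no second null mask is needed after the sort in this sketch, which mirrors the second point raised above.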

MarcoGorelli · 2019-08-22T14:28:10Z

Sorry for the delay - yes, I agree with your decision, Tom :) I'll pick this up again next week

pep8speaks · 2019-08-22T21:22:49Z

Hello @MarcoGorelli! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-06 15:50:47 UTC

MarcoGorelli · 2019-08-22T21:30:18Z

As noted in a comment on #27904, this works fine with numpy's own NaT value... is it considered cheating to just replace pandas.NaT with np.datetime64("NaT")? I tried that, and both the existing and the new tests (which include dropna=False) pass.

WillAyd · 2019-08-23T13:59:25Z

Haven't looked deeply yet but does this also impact Series and DataFrame methods? The location of the change would imply so hence my reason for asking

WillAyd

See above comment. Can you also add a whatsnew for v1.0.0

MarcoGorelli · 2019-08-23T14:47:36Z

Thanks for your review!

Sure, whatsnew entry added.

If I've understood your question, then the pandas.DataFrame.nunique and pandas.Series.nunique methods didn't have this issue to begin with, and are unaffected by this change.

WillAyd · 2019-08-23T14:51:51Z

Ah misread the file location but makes sense so thanks for confirming. Should be able to review more deeply over the next few days

WillAyd

Sorry for the delay in review. Looks pretty good; some general comments.

WillAyd · 2019-09-05T18:16:21Z

pandas/core/groupby/generic.py

@@ -1143,6 +1143,9 @@ def nunique(self, dropna=True):
        val = self.obj._internal_get_values()
+        # GH 27951
+        val[isna(val)] = np.datetime64("NaT")


I think the actual root of the issue is a bug in NumPy, as described by @TomAugspurger, where NaT values are not sorted as you'd expect

numpy/numpy#12629

So I think this works for now but maybe add a comment about NumPy bug 12629 for reference

WillAyd · 2019-09-05T18:17:39Z

pandas/tests/groupby/test_function.py

@@ -1015,6 +1025,81 @@ def test_nunique_with_timegrouper():
    tm.assert_series_equal(result, expected)


+@pytest.mark.parametrize(


The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.

Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected

Agreed - have modified it so the DataFrame is constructed within the test

WillAyd

lgtm - @TomAugspurger mind another quick look?

TomAugspurger · 2019-09-07T11:30:00Z

Thanks @MarcoGorelli!

Can you confirm that this fixed all of #27904, or are there outstanding tasks?

MarcoGorelli · 2019-09-07T12:17:34Z

That's right - thanks for having taken the time to review my work!

MarcoGorelli mentioned this pull request Aug 16, 2019

groupby nunique() with dates vs datetimes in presence of NaTs #27904

Closed

jschendel reviewed Aug 16, 2019

doc/source/whatsnew/v0.25.1.rst (outdated, resolved)

pandas/core/groupby/generic.py (outdated, resolved)

jschendel added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Datetime Datetime data dtype labels Aug 16, 2019

jschendel added this to the 0.25.1 milestone Aug 16, 2019

jschendel reviewed Aug 16, 2019

pandas/tests/groupby/test_function.py (outdated, resolved)

jschendel mentioned this pull request Aug 17, 2019

issue 27904 #27969

Closed

5 tasks

TomAugspurger approved these changes Aug 22, 2019

TomAugspurger modified the milestones: 0.25.1, 1.0 Aug 22, 2019

MarcoGorelli force-pushed the fix-nunique-groupby branch from 7018a8a to 4445c02 Compare August 22, 2019 21:22

WillAyd requested changes Aug 23, 2019

WillAyd requested changes Sep 5, 2019

Temporary fix

9b09add

MarcoGorelli force-pushed the fix-nunique-groupby branch from 6538f5e to 9b09add Compare September 5, 2019 21:15

MarcoGorelli and others added 2 commits September 5, 2019 22:51

Correct order of imports

ac9aa7e

Correct order to imports

20ec544

MarcoGorelli changed the title ~~Remove null values before sorting during groupby nunique calculation~~ BUG: Remove null values before sorting during groupby nunique calculation Sep 6, 2019

WillAyd approved these changes Sep 6, 2019

TomAugspurger merged commit 820072a into pandas-dev:master Sep 7, 2019

MarcoGorelli deleted the fix-nunique-groupby branch September 7, 2019 12:17

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Remove null values before sorting during groupby nunique calcula…

192d681

…tion (pandas-dev#27951) Closes pandas-dev#27904

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

BUG: Remove null values before sorting during groupby nunique calcula…

fa8b6bf

…tion (pandas-dev#27951) Closes pandas-dev#27904

dsaxton mentioned this pull request Feb 13, 2020

BUG: groupby-nunique modifies null values #31950

Closed
