BUG: Remove null values before sorting during groupby nunique calculation #27951

Merged
merged 3 commits into from
Sep 7, 2019
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -97,7 +97,7 @@ Datetimelike
- Bug in :meth:`Series.__setitem__` incorrectly casting ``np.timedelta64("NaT")`` to ``np.datetime64("NaT")`` when inserting into a :class:`Series` with datetime64 dtype (:issue:`27311`)
- Bug in :meth:`Series.dt` property lookups when the underlying data is read-only (:issue:`27529`)
- Bug in ``HDFStore.__getitem__`` incorrectly reading tz attribute created in Python 2 (:issue:`26443`)
-
- Bug in :meth:`pandas.core.groupby.SeriesGroupBy.nunique` where ``NaT`` values were interfering with the count of unique values (:issue:`27951`)


Timedelta
4 changes: 4 additions & 0 deletions pandas/core/groupby/generic.py
@@ -1147,6 +1147,10 @@ def nunique(self, dropna=True):

        val = self.obj._internal_get_values()

        # GH 27951
        # temporary fix while we wait for NumPy bug 12629 to be fixed
        val[isna(val)] = np.datetime64("NaT")
Member

I think the actual root of the issue is a bug in NumPy, as described by @TomAugspurger, where NaT values are not sorted as you'd expect:

numpy/numpy#12629

So I think this works for now, but maybe add a comment about NumPy bug 12629 for reference.


        try:
            sorter = np.lexsort((val, ids))
        except TypeError:  # catches object dtypes
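
The change above matters because the count is derived from sorted data: once values are ordered within each group, every change between neighbours marks a new distinct value, so nulls must end up next to each other to be counted once. A minimal pure-Python sketch of that sort-then-count idea follows. It is not the pandas implementation (which operates on arrays with `np.lexsort`); the function name `nunique_sorted` and the use of `None` as the null marker are assumptions made for illustration.

```python
import datetime as dt


def nunique_sorted(keys, values, dropna=True):
    """Count distinct values per key by sorting and counting value changes.

    Hypothetical helper for illustration: None stands in for a null such as NaT.
    """
    # Sort by key, then by null-ness, then by value. The boolean in the
    # middle decides the order whenever exactly one value is null, so a
    # null is never compared against a real value, and all nulls of a
    # group end up adjacent.
    rows = sorted(
        zip(keys, values),
        key=lambda kv: (kv[0], kv[1] is None, kv[1]),
    )

    counts = {}
    previous = object()  # sentinel that never equals a (key, value) pair
    for key, value in rows:
        counts.setdefault(key, 0)
        is_new = (key, value) != previous
        # A run of adjacent nulls counts once, or not at all with dropna.
        if is_new and not (dropna and value is None):
            counts[key] += 1
        previous = (key, value)
    return counts


day = dt.date(2019, 1, 1)
keys = ["x", "x", "x", "y", "y"]
values = [day, None, day, None, day]
print(nunique_sorted(keys, values, dropna=False))
print(nunique_sorted(keys, values, dropna=True))
```

If the nulls of a group were scattered between real values after sorting, each one would register as a separate change and inflate the count, which is the failure mode this PR works around.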
48 changes: 47 additions & 1 deletion pandas/tests/groupby/test_function.py
@@ -1,4 +1,5 @@
import builtins
import datetime as dt
from io import StringIO
from itertools import product
from string import ascii_lowercase
@@ -9,7 +10,16 @@
from pandas.errors import UnsupportedFunctionCall

import pandas as pd
from pandas import DataFrame, Index, MultiIndex, Series, Timestamp, date_range, isna
from pandas import (
    DataFrame,
    Index,
    MultiIndex,
    NaT,
    Series,
    Timestamp,
    date_range,
    isna,
)
import pandas.core.nanops as nanops
from pandas.util import _test_decorators as td, testing as tm

@@ -1015,6 +1025,42 @@ def test_nunique_with_timegrouper():
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
Member

The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each.

Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected

Member Author

Agreed - have modified it so the DataFrame is constructed within the test

    "key, data, dropna, expected",
    [
        (
            ["x", "x", "x"],
            [Timestamp("2019-01-01"), NaT, Timestamp("2019-01-01")],
            True,
            Series([1], index=pd.Index(["x"], name="key"), name="data"),
        ),
        (
            ["x", "x", "x"],
            [dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)],
            True,
            Series([1], index=pd.Index(["x"], name="key"), name="data"),
        ),
        (
            ["x", "x", "x", "y", "y"],
            [dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)],
            False,
            Series([2, 2], index=pd.Index(["x", "y"], name="key"), name="data"),
        ),
        (
            ["x", "x", "x", "x", "y"],
            [dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)],
            False,
            Series([2, 1], index=pd.Index(["x", "y"], name="key"), name="data"),
        ),
    ],
)
def test_nunique_with_NaT(key, data, dropna, expected):
    # GH 27951
    df = pd.DataFrame({"key": key, "data": data})
    result = df.groupby(["key"])["data"].nunique(dropna=dropna)
    tm.assert_series_equal(result, expected)
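
The behaviour checked by `test_nunique_with_NaT` can also be reproduced outside the test suite. The sketch below mirrors the first and third parametrized cases, assuming a pandas release that includes this fix. The variable names are illustrative, and the counts that the test expects for these inputs are 1 for `x` in the first case and 2 for both `x` and `y` in the third.

```python
import datetime as dt

import pandas as pd

# First test case: Timestamp data with one missing value in a single group.
df_ts = pd.DataFrame(
    {
        "key": ["x", "x", "x"],
        "data": [pd.Timestamp("2019-01-01"), pd.NaT, pd.Timestamp("2019-01-01")],
    }
)
ts_counts = df_ts.groupby("key")["data"].nunique(dropna=True)

# Third test case: datetime.date data (object dtype) with missing values
# in two groups, counting the missing value as its own distinct entry.
day = dt.date(2019, 1, 1)
df_date = pd.DataFrame(
    {
        "key": ["x", "x", "x", "y", "y"],
        "data": [day, pd.NaT, day, pd.NaT, day],
    }
)
date_counts = df_date.groupby("key")["data"].nunique(dropna=False)

print(ts_counts.to_dict())
print(date_counts.to_dict())
```

With `dropna=True` the missing entry is ignored, while `dropna=False` counts all missing entries in a group as a single additional value.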


def test_nunique_preserves_column_level_names():
    # GH 23222
    test = pd.DataFrame([1, 2, 2], columns=pd.Index(["A"], name="level_0"))