-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Remove null values before sorting during groupby nunique calculation #27951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
import builtins | ||
import datetime as dt | ||
from io import StringIO | ||
from itertools import product | ||
from string import ascii_lowercase | ||
|
@@ -9,7 +10,16 @@ | |
from pandas.errors import UnsupportedFunctionCall | ||
|
||
import pandas as pd | ||
from pandas import DataFrame, Index, MultiIndex, Series, Timestamp, date_range, isna | ||
from pandas import ( | ||
DataFrame, | ||
Index, | ||
MultiIndex, | ||
NaT, | ||
Series, | ||
Timestamp, | ||
date_range, | ||
isna, | ||
) | ||
import pandas.core.nanops as nanops | ||
from pandas.util import _test_decorators as td, testing as tm | ||
|
||
|
@@ -1015,6 +1025,42 @@ def test_nunique_with_timegrouper(): | |
tm.assert_series_equal(result, expected) | ||
|
||
|
||
@pytest.mark.parametrize( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The parametrization here is pretty repetitive, though I realize that you have three items at a time being sent through to keep the expectation different across each. Is there a way to more succinctly parametrize though? It's rather difficult to read this and find what's expected There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed - have modified it so the DataFrame is constructed within the test |
||
"key, data, dropna, expected", | ||
[ | ||
( | ||
["x", "x", "x"], | ||
[Timestamp("2019-01-01"), NaT, Timestamp("2019-01-01")], | ||
True, | ||
Series([1], index=pd.Index(["x"], name="key"), name="data"), | ||
), | ||
( | ||
["x", "x", "x"], | ||
[dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)], | ||
True, | ||
Series([1], index=pd.Index(["x"], name="key"), name="data"), | ||
), | ||
( | ||
["x", "x", "x", "y", "y"], | ||
[dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)], | ||
False, | ||
Series([2, 2], index=pd.Index(["x", "y"], name="key"), name="data"), | ||
), | ||
( | ||
["x", "x", "x", "x", "y"], | ||
[dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1), NaT, dt.date(2019, 1, 1)], | ||
False, | ||
Series([2, 1], index=pd.Index(["x", "y"], name="key"), name="data"), | ||
), | ||
], | ||
) | ||
def test_nunique_with_NaT(key, data, dropna, expected): | ||
# GH 27951 | ||
df = pd.DataFrame({"key": key, "data": data}) | ||
result = df.groupby(["key"])["data"].nunique(dropna=dropna) | ||
tm.assert_series_equal(result, expected) | ||
|
||
|
||
def test_nunique_preserves_column_level_names(): | ||
# GH 23222 | ||
test = pd.DataFrame([1, 2, 2], columns=pd.Index(["A"], name="level_0")) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the actual root of the issue is a bug in NumPy as described by @TomAugspurger where NaT values are not sorted as you'd expected
numpy/numpy#12629
So I think this works for now but maybe add a comment about NumPy bug 12629 for reference