-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Preserve None in Series unique #20893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #20893 +/- ##
=======================================
Coverage 91.81% 91.81%
=======================================
Files 153 153
Lines 49481 49481
=======================================
Hits 45430 45430
Misses 4051 4051
Continue to review full report at Codecov.
|
def test_unique_obj_na_preservation(self, nulls_fixture): | ||
# GH 20866 | ||
s = pd.Series(['foo', nulls_fixture]) | ||
assert s.iloc[1] is nulls_fixture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the .unique()
is missing from s
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a good reason that it is passing :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update and see if you can fixup the regression here? |
Updated to match pre-existing behavior. FWIW I don't think this is ideal as it still mangles Ref: |
Thanks for looking into this! Can you do a quick performance check? (just a
BTW, I agree we should maybe reconsider how missing-like values are treated here, as it is indeed not fully consistent that NaN / NaT are not kept separate. |
@@ -870,7 +870,7 @@ cdef class PyObjectHashTable(HashTable): | |||
for i in range(n): | |||
val = values[i] | |||
hash(val) | |||
if not checknull(val): | |||
if not checknull(val) or val is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a comment here, checkull is specifically designed to catch ALL nulls here.
# GH 20866 | ||
s = pd.Series(['foo', None]) | ||
result = s.unique() | ||
assert result[1] is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
construct array and compare. this should be in test_algorithms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to use strict_nan=True
in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
Here are the perf comps # Master
In [5]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [6]: %timeit pd.unique(arr)
2.72 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# f106b581570c18fb7ec57b941e744daef50b4460
In [4]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [5]: %timeit pd.unique(arr)
2.78 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
thanks @WillAyd |
git diff upstream/master -u -- "*.py" | flake8 --diff