Preserve None in Series unique #20893

WillAyd · 2018-05-01T04:00:33Z

closes #20866 (Change in behaviour of unique method regarding None values)
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2018-05-01T06:42:57Z

Codecov Report

Merging #20893 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #20893   +/-   ##
=======================================
  Coverage   91.81%   91.81%           
=======================================
  Files         153      153           
  Lines       49481    49481           
=======================================
  Hits        45430    45430           
  Misses       4051     4051

Flag	Coverage Δ
#multiple	90.21% <ø> (ø)
#single	41.85% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd4332f...f106b58.

jschendel · 2018-05-01T08:30:34Z

pandas/tests/series/test_analytics.py

+    def test_unique_obj_na_preservation(self, nulls_fixture):
+        # GH 20866
+        s = pd.Series(['foo', nulls_fixture])
+        assert s.iloc[1] is nulls_fixture


Looks like the .unique() call is missing from s.

jorisvandenbossche · 2018-05-01

Ah, that's a good reason that it is passing :-)

WillAyd · 2018-05-01

Whoops... Hmm, well, yes, now I am getting nan just as you are, so something strange is going on here. Going to bisect between d274d0b and b020891, as the behavior changed for me between those. Will post back results soon.
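
To spell out why the first version of the test could not fail, here is a minimal sketch. It assumes a pandas version in which unique() preserves None for object data (the behaviour this PR targets), and it passes dtype=object explicitly, which is an addition for this sketch so that the Series is not inferred as a string dtype on newer pandas.

```python
import pandas as pd

# dtype=object is an assumption added here; the test in the PR relies on inference.
s = pd.Series(['foo', None], dtype=object)

# This only inspects the Series itself, so it passes no matter what unique() does.
assert s.iloc[1] is None

# Calling unique() is what actually exercises the code path from GH 20866.
result = s.unique()
assert result[1] is None
```

The second assertion is the one that would have caught the None-to-nan conversion.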

jreback · 2018-05-04T10:36:19Z

can you update and see if you can fixup the regression here?

WillAyd · 2018-05-04T16:04:58Z

Updated to match pre-existing behavior. FWIW I don't think this is ideal, as it still mangles pd.NaT and np.nan, but I figured I'd revert to what it was for 0.23 and open a separate issue to ensure that the NA values remain distinct in a later release.

Ref:
#20866 (comment)

jorisvandenbossche · 2018-05-05T11:41:34Z

Thanks for looking into this!

Can you do a quick performance check? (just a %timeit before / after with a largish object series) Eg

In [81]: s = pd.Series(['a', 'b', 'c', None]*10000)

In [82]: %timeit s.unique()
3.18 ms ± 9.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

BTW, I agree we should maybe reconsider how missing-like values are treated here, as it is indeed not fully consistent that NaN / NaT are not kept separate.

jreback · 2018-05-05T12:46:51Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -870,7 +870,7 @@ cdef class PyObjectHashTable(HashTable):
        for i in range(n):
            val = values[i]
            hash(val)
-            if not checknull(val):
+            if not checknull(val) or val is None:


Needs a comment here; checknull is specifically designed to catch ALL nulls.
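
As a rough illustration of what the changed condition does, the loop can be modelled in plain Python. This is a sketch under assumptions, not the Cython implementation: is_null_like is a hypothetical stand-in for checknull that only recognises None and float NaN, and the elif branch that collapses the remaining nulls into a single NaN is assumed from the behaviour discussed in this thread rather than shown in the diff.

```python
import math


def is_null_like(val):
    """Hypothetical stand-in for checknull: only None and float NaN count as null."""
    return val is None or (isinstance(val, float) and math.isnan(val))


def unique_preserving_none(values):
    """Order-preserving unique that keeps None as None (model of the patched loop)."""
    seen = set()
    seen_nan = False
    uniques = []
    for val in values:
        hash(val)  # mirrors the diff: raises TypeError for unhashable input
        if not is_null_like(val) or val is None:
            # Regular values and None are stored unchanged.
            if val not in seen:
                seen.add(val)
                uniques.append(val)
        elif not seen_nan:
            # Assumed branch: every other null collapses into one NaN entry.
            seen_nan = True
            uniques.append(float("nan"))
    return uniques
```

With this model, unique_preserving_none(['a', float('nan'), None, float('nan')]) keeps None as its own entry, while the two NaN values collapse into one.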

jreback · 2018-05-05T12:47:31Z

pandas/tests/series/test_analytics.py

+        # GH 20866
+        s = pd.Series(['foo', None])
+        result = s.unique()
+        assert result[1] is None


Construct the expected array and compare against it. This should be in test_algorithms.

Make sure to use strict_nan=True in that case
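
The reason for strict_nan=True can be illustrated with a small pure-Python model of the comparison. equivalent and _kind are hypothetical helpers written for this sketch, not pandas functions; they only mimic the idea that a strict comparison treats None and NaN as different, while a lax one treats them as interchangeable.

```python
import math


def _kind(x):
    """Classify a value as 'none', 'nan', or an ordinary 'value'."""
    if x is None:
        return "none"
    if isinstance(x, float) and math.isnan(x):
        return "nan"
    return "value"


def equivalent(left, right, strict_nan=False):
    """Hypothetical model of an array comparison with a strict_nan switch."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        ka, kb = _kind(a), _kind(b)
        if ka == "value" or kb == "value":
            # Ordinary values must match exactly and never match a null.
            if ka != kb or a != b:
                return False
        elif strict_nan and ka != kb:
            # Both are nulls; in strict mode None and NaN are different.
            return False
    return True


expected = ['foo', None]
mangled = ['foo', float('nan')]  # what the regression would produce

lax = equivalent(mangled, expected)
strict = equivalent(mangled, expected, strict_nan=True)
```

A lax comparison would accept a result where None had been turned into NaN, which is the regression the test is meant to catch; the strict one rejects it.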

WillAyd · 2018-05-05T18:11:35Z

Here are the perf comps

# Master
In [5]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [6]: %timeit pd.unique(arr)
2.72 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# f106b581570c18fb7ec57b941e744daef50b4460
In [4]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [5]: %timeit pd.unique(arr)
2.78 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
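
Outside IPython, the same comparison can be approximated with the standard-library timeit module. This is only a sketch: time_unique is a hypothetical helper, the loop and repeat counts are chosen to mirror the %timeit output above, and absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape of data as above: 40,000 objects with 4 distinct values.
arr = np.array(['foo', 'bar', 'baz', None] * 10000)


def time_unique(values, number=100, repeat=7):
    """Return the best per-call time in milliseconds for pd.unique(values)."""
    timings = timeit.repeat(lambda: pd.unique(values), number=number, repeat=repeat)
    return min(timings) / number * 1e3


result = pd.unique(arr)
best_ms = time_unique(arr)
print(f"pd.unique: {best_ms:.2f} ms per call (best of 7 runs, 100 loops each)")
```

Running the script once on each commit gives the before / after numbers to compare.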

jreback · 2018-05-11T18:29:52Z

thanks @WillAyd

Added test for preservation of NA values in Series unique

f33a4c5

jorisvandenbossche mentioned this pull request May 1, 2018

Change in behaviour of unique method regarding None values #20866

Closed

jschendel reviewed May 1, 2018

Merge remote-tracking branch 'upstream/master' into uniq-tst

9736ecb

jreback added the Missing-data and Reshaping labels May 4, 2018

WillAyd added 2 commits May 4, 2018 08:30

Merge remote-tracking branch 'upstream/master' into uniq-tst

f791e5e

Reverted none to nan conversion in unique

1e83e60

WillAyd changed the title ~~Added test for preservation of NA values in Series unique~~ Preserve None in Series unique May 4, 2018

jreback requested changes May 5, 2018

WillAyd added 2 commits May 5, 2018 11:08

Moved test; added comments

df28111

Merge remote-tracking branch 'upstream/master' into uniq-tst

f106b58

jorisvandenbossche added this to the 0.23.0 milestone May 11, 2018

jreback approved these changes May 11, 2018

jreback merged commit 3d03fdb into pandas-dev:master May 11, 2018

WillAyd deleted the uniq-tst branch May 11, 2018 18:30

topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 13, 2018

Preserve None in Series unique (pandas-dev#20893)

6de48f6

topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 13, 2018

Preserve None in Series unique (pandas-dev#20893)

5e362d7
