Skip to content

Preserve None in Series unique #20893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 11, 2018
Merged

Preserve None in Series unique #20893

merged 6 commits into from
May 11, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented May 1, 2018

@codecov
Copy link

codecov bot commented May 1, 2018

Codecov Report

Merging #20893 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #20893   +/-   ##
=======================================
  Coverage   91.81%   91.81%           
=======================================
  Files         153      153           
  Lines       49481    49481           
=======================================
  Hits        45430    45430           
  Misses       4051     4051
Flag Coverage Δ
#multiple 90.21% <ø> (ø) ⬆️
#single 41.85% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd4332f...f106b58. Read the comment docs.

def test_unique_obj_na_preservation(self, nulls_fixture):
# GH 20866
s = pd.Series(['foo', nulls_fixture])
assert s.iloc[1] is nulls_fixture
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the .unique() is missing from s.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good reason that it is passing :-)

Copy link
Member Author

@WillAyd WillAyd May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops...hmm well yes now I am getting nan just as you are so something strange is going on here. Going to bisect between d274d0b and b020891 as the behavior changed for me between those. Will post back results soon

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 4, 2018
@jreback
Copy link
Contributor

jreback commented May 4, 2018

can you update and see if you can fixup the regression here?

@WillAyd WillAyd changed the title Added test for preservation of NA values in Series unique Preserve None in Series unique May 4, 2018
@WillAyd
Copy link
Member Author

WillAyd commented May 4, 2018

Updated to match pre-existing behavior. FWIW I don't think this is ideal as it still mangles pd.NaT and np.nan but I figured revert to what it was for 0.23 and I can open a separate issue to ensure that the NA values remain distinct in a later release

Ref:
#20866 (comment)

@jorisvandenbossche
Copy link
Member

Thanks for looking into this!

Can you do a quick performance check? (just a %timeit before / after with a largish object series) Eg

In [81]: s = pd.Series(['a', 'b', 'c', None]*10000)

In [82]: %timeit s.unique()
3.18 ms ± 9.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

BTW, I agree we should maybe reconsider how missing-like values are treated here, as it is indeed not fully consistent that NaN / NaT are not kept separate.

@@ -870,7 +870,7 @@ cdef class PyObjectHashTable(HashTable):
for i in range(n):
val = values[i]
hash(val)
if not checknull(val):
if not checknull(val) or val is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a comment here, checkull is specifically designed to catch ALL nulls here.

# GH 20866
s = pd.Series(['foo', None])
result = s.unique()
assert result[1] is None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

construct array and compare. this should be in test_algorithms

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to use strict_nan=True in that case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

@WillAyd
Copy link
Member Author

WillAyd commented May 5, 2018

Here are the perf comps

# Master
In [5]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [6]: %timeit pd.unique(arr)
2.72 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# f106b581570c18fb7ec57b941e744daef50b4460
In [4]: arr = np.array(['foo', 'bar', 'baz', None]* 10000)
In [5]: %timeit pd.unique(arr)
2.78 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone May 11, 2018
@jreback jreback merged commit 3d03fdb into pandas-dev:master May 11, 2018
@jreback
Copy link
Contributor

jreback commented May 11, 2018

thanks @WillAyd

@WillAyd WillAyd deleted the uniq-tst branch May 11, 2018 18:30
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 13, 2018
topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change in behaviour of unique method regarding None values
4 participants