Skip to content

Change in behaviour of unique method regarding None values #20866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Apr 29, 2018 · 9 comments · Fixed by #20893
Closed

Change in behaviour of unique method regarding None values #20866

jorisvandenbossche opened this issue Apr 29, 2018 · 9 comments · Fixed by #20893
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Blocker Blocking issue or pull request for an upcoming release Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@jorisvandenbossche
Copy link
Member

The behaviour of unique regarding None values recently changed on master (somewhere in the last month) :

In [1]: pd.__version__
Out[1]: '0.22.0'

In [2]: s = pd.Series(['test', None])

In [3]: s.unique()
Out[3]: array(['test', None], dtype=object)

vs

In [1]: pd.__version__
Out[1]: '0.23.0.dev0+807.g563a6ad'

In [2]: s = pd.Series(['test', None])

In [3]: s.unique()
Out[3]: array(['test', nan], dtype=object)

Not sure what the desired result is (we typically handle None like it is missing like NaN, but on the other hand, we also didn't coerce to it on construction, i.e. we do allow None values to be stored), but it is a breaking change (discovered it because geopandas tests started failing; geopandas/geopandas#711).

@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 29, 2018
@TomAugspurger TomAugspurger added Blocker Blocking issue or pull request for an upcoming release Regression Functionality that used to work in a prior pandas version Dtype Conversions Unexpected or buggy dtype conversions Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Apr 29, 2018
@WillAyd
Copy link
Member

WillAyd commented Apr 30, 2018

This may have subsequently been "fixed" on master

>>> pd.__version__
'0.23.0.dev0+813.gd274d0b22'
>>> s = pd.Series(['test', None])
>>> s.unique()
array(['test', None], dtype=object)

@jreback
Copy link
Contributor

jreback commented May 1, 2018

@WillAyd can see if there is a test (and if not put one up)?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 1, 2018

Hmm, strange, I did a full clean rebuild of master, and this is still giving nan for me:

In [1]: s = pd.Series(['test', None])

In [2]: s.unique()
Out[2]: array(['test', nan], dtype=object)

In [3]: pd.__version__
Out[3]: '0.23.0.dev0+815.gf799916'

@jschendel
Copy link
Member

I'm also getting the same output as @jorisvandenbossche.

@jorisvandenbossche
Copy link
Member Author

It's very strange that the tests are passing though (in the PR that @WillAyd did: #20893)

I was testing on linux / python 3.5 / numpy 1.13.3 (not sure what else might influence this), the failing tests on geopandas were using the linux travis build with python 3.6 and pandas installed with pip from github master.

@jorisvandenbossche
Copy link
Member Author

I suppose this might be the cause (not of the fact we are seeing different behaviour though): 7f7f3d4

@jschendel
Copy link
Member

@jorisvandenbossche : see my comment on the PR.

@WillAyd
Copy link
Member

WillAyd commented May 1, 2018

@jorisvandenbossche after reviewing some more I am able to reproduce the error you provided after doing a full clean / rebuild. After looking at it, I believe you are correct that 7f7f3d4 is causing the regression, as a clean / rebuild on the commit prior still returned None but a full clean / rebuild on that commit does return NaN.

@WillAyd
Copy link
Member

WillAyd commented May 1, 2018

Looking at this further I noticed that the behavior is somewhat inconsistent with the commit prior to the regression. Note below that None is preserved but pd.NaT becomes np.nan:

>>> pd.__version__
'0.23.0.dev0+752.g0bd8a5a62'
>>> pd.Series(['foo', None]).unique()
array(['foo', None], dtype=object)
>>> pd.Series(['foo', pd.NaT]).unique()
array(['foo', nan], dtype=object)
>>> pd.Series(['foo', np.nan]).unique()
array(['foo', nan], dtype=object)

On the commit with the regression, everything is np.nan:

>>> pd.__version__
'0.23.0.dev0+753.g7f7f3d49b'
>>> pd.Series(['foo', None]).unique()
array(['foo', nan], dtype=object)
>>> pd.Series(['foo', pd.NaT]).unique()
array(['foo', nan], dtype=object)
>>> pd.Series(['foo', np.nan]).unique()
array(['foo', nan], dtype=object)

The difference in the treatment of None has to do with the replacement of _checknan with checknull in the referenced PR. Do we care to revert this? IMO this is a pretty dark corner and if unique wasn't preserving pd.NaT before I don't take it as a guarantee that it should have preserved None either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Blocker Blocking issue or pull request for an upcoming release Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants