BUG: None is not equal to None #20442

mapio · 2018-03-21T20:16:41Z

If you have None in a series, it will not be considered equal to None (even in the same series).

A = pd.Series([None]*3)
A == A

gives

0    False
1    False
2    False
dtype: bool

but, of course

A.values == A.values

gives

array([ True,  True,  True])

I can't find any reference to this behaviour in the documentation and looks quite unnatural.

Output of `pd.show_versions()`

Here the output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jschendel · 2018-03-21T22:26:00Z

Also a bit strange that == is inconsistent with the eq method in this case:

In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'

In [3]: s = pd.Series([None]*3)

In [4]: s
Out[4]:
0    None
1    None
2    None
dtype: object

In [5]: s == s
Out[5]:
0    False
1    False
2    False
dtype: bool

In [6]: s.eq(s)
Out[6]:
0    True
1    True
2    True
dtype: bool

mroeschke · 2018-03-21T23:07:14Z

I believe this occurs because this comparison hits this block:

pandas/pandas/_libs/ops.pyx

Lines 162 to 163 in 17c1fad

    
           if checknull(x) or checknull(y): 
        
               result[i] = False

And the checknull function resolves down this path which considers None as null:

pandas/pandas/_libs/src/util.pxd

Lines 156 to 158 in 17c1fad

    
           cdef inline bint _checknull(object val): 
        
               try: 
        
                   return val is None or (cpython.PyFloat_Check(val) and val != val)

jreback · 2018-03-22T10:20:19Z

see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html

This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal (though not-consistent with python, BUT consistent with numpy) footing.

So this is not a bug, rather a consequence of stradling 2 conventions.

I suppose the documentation could be slightly enhanced.

jreback · 2018-03-22T10:22:17Z

however this should be consistent with this:In [7]: s = pd.Series([np.nan]*3)

In [8]: s == s
Out[8]: 
0    False
1    False
2    False
dtype: bool

In [9]: s.eq(s)
Out[9]: 
0    False
1    False
2    False
dtype: bool

so [6] form @jschendel should match [9] here. The incorrect case is actually .eq

jreback · 2018-03-22T10:23:24Z

here's the dichotomy in scalars

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

mapio · 2018-03-22T10:52:14Z

I do not agree with @jreback and the current status of equality for None in pandas.

I've seen the warning box and I completely agree that, according to IEEE floating-point standard, "a NaN is never equal to any other number (including another NaN)" (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).

I've no problems, if a data type of a series is numeric, to accept that NaNs are not equal and even that None get converted to NaNs.

But it a series contains objects (let them be string, list, or whatever), the equality should be consistent with Python equality for such objects.

It's a basic application of the principle of least astonishment. I don't see any benefit in mangling None if the data type is non numeric.

jreback · 2018-03-22T11:19:32Z

@mapio you are missing the point. None is NOT the default missing indicator for anything. np.nan IS. None is accepted for compatibility.

mapio · 2018-03-22T13:31:52Z

Where in the documentation is stated that None is the "missing indicator"? And, if this is the case, I find incredibly confusing that

>>> pd.Series([1,None,2])
0    1.0
1    NaN
2    2.0
dtype: float64

but

>>> pd.Series(['a', None, 'b'])
0       a
1    None
2       b
dtype: object

If you pick na.nan as a missing data indicator, than you must convert every None in a series to it, not only in numeric series, but also in series of objects. To be clear, I expect the second example to look

>>> pd.Series(['a', None, 'b'])
0       a
1    NaN
2       b
dtype: object

Given such a convention I would have probably still asked why you decide to translate a None to a NaN, but I would have not been surprised at all to have a False comparing the second entry of the two series.

Probably I'm missing the point, but the current situation is, at best, inconsistent (and as such, is very confusing).

TomAugspurger · 2018-03-22T14:26:38Z

http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

@mapio how familiar are you with NumPy / pandas type system? When you do Series([1, None, 2]), we infer it as floats. If you want python objects, specify that.

In [5]: s = pd.Series([1, None, 2], dtype=object)

In [6]: s
Out[6]:
0       1
1    None
2       2
dtype: object

But as you've seen, many places in pandas, especially in ops like ==, treat None as missing.

mapio · 2018-03-22T15:28:04Z

I'm quite familiar with NumPy. My point is the exact opposite of what you say.

Since when you build a series from object in Pandas it will correctly infer the object data type, None are left as such. But when you compare them (and only then), they are silently translated to NaNs and (due to IEEE way of handling equality among NaNs) they result to be different.

It makes complete sense in numeric series, where None are clearly converted to NaNs.

My point is:

it's ok to convert to NaNs if you want to have NaN the only way to represent missing values,
it's ok to compare NaNs as different (of course),
you must perform such conversion visibly every time you set a series element to None!

If you keep None as element of object series and convert them silently to NaNs before comparing them (with == or eq) you are surprising your users (and also your developers, given that this is reported as a bug with regard to eq).

No implicit conversion to NaN should happen. If you store objects, you should compare them as Python does.

jreback · 2018-03-23T10:36:50Z

this is a long ago argument. np.nan is the missing value indicator for ALL dtypes, including object. This is not going to change. Yes we allow None to be explicity passed for object dtypes and be preserved (but only in this case), as object can essentially hold anything.

So comparisons must follow the convention that np.nan behavior is what we do.'
leaving this issue open for the sole purpose of fixing [9] above.

mapio · 2018-03-23T10:48:22Z

If you choose to convert missing values to NaNs for ALL (allcaps your) dtypes, a decision I can understand and second), I still miss why this does not hold for object dtype!

Please, consider converting None to NaN also for object series, this will make missing values handling consistent and equality consistent. I can't see any drawback…

TomAugspurger · 2018-03-23T11:00:31Z

I think this will be hard to change as pandas is currently structured. We allow both None and NaN as the missing value indicator in object columns.

In [14]: a = pd.Series(['a', None, float('nan')])

In [15]: a
Out[15]:
0       a
1    None
2     NaN
dtype: object

In [16]: a == a
Out[16]:
0     True
1    False
2    False
dtype: bool

I can't see any drawback…

IIRC, people found it useful to not immediately convert None to NaN there. They wanted to keep their python scalars. I don't recall details though.

If we wanted to change this, which I'm not sure we do, I don't see a deprecation path to changing None to compare equal to itself, though perhaps you do.

How is this affecting you negatively?

mapio · 2018-03-23T11:23:39Z

I think that people finds useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of the Python types (like None)!

Consider this piece of code:

In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True

In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True

I expect that In [4] should tell me if the two series are equal member by member, vectorizing in some sense the equality. Currently they turn out not to be! But if I do test equality member by member as in In[5] instead I get (as expected) that the series are equal member by member .

Now, if I turn the series to numpy.arrays as in In[5] I got the expected result. Here NumPy is consistent with the vectorization in In[6].

As you can clearly see there is an inconsistency that is very hard to understand (you have to know, and there is no place in the docs where this is clearly stated) that:

pandas coverts all missing values except for None to NaN,
pandas (according to IEEE) treats NaN as different from themselves.

Having an inconsistent behaviour and leaving None being as such but comparing them as different violates the principle of least astonishment.

I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) to get why there was a difference when, looking at the corresponding rows, all the values where equal!

I don't think anyone can really think that an inconsistent behaviour is a good thing…

TomAugspurger · 2018-03-23T11:35:29Z

Do you see a way to deprecate this cleanly (along with the behavior of isna / notna, groupby, etc.)?

…

On Fri, Mar 23, 2018 at 6:23 AM, Massimo Santini ***@***.***> wrote: I think that people finds useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to *behave like Python objects*. This benefit is lost if you change the way equality works for *some* of the Python types (like None)! Consider this piece of code: In [1]: import pandas as pd In [2]: A = pd.Series(['a', None, 2]) In [3]: B = pd.Series(['a', None, 2]) In [4]: all(A == B) Out[4]: False In [5]: all(lambda a, b: a == b for a, b in zip(A, B)) Out[5]: True In [5]: all(A.values == B.values) Out[5]: True In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values)) Out[6]: True ´´´ I expect that `In [4]` should tell me if the two series are equal *member by member*, vectorizing in some sense the equality. Currently they turn out not to be! But if I do test equality member by member as in `In[5]` instead I get (as expected) that the series are equal *member by member* . Now, if I turn the series to `numpy.array`s as in `In[5]` I got the expected result. Here NumPy is consistent with the vectorization in `In[6]`. As you can clearly see there is an inconsistency that is very hard to understand (you have to know, and there is no place in the docs where this is clearly stated) that: * pandas coverts all missing values *except for* ´None´ to NaN,* pandas (according to IEEE) treats NaN as different from themselves. Having an inconsistent behaviour and leaving `None` being as such but comparing them as different violates the principle of least astonishment. I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) to get why there was a difference when, looking at the corresponding rows, all the values where equal! I don't think anyone can really think that an inconsistent behaviour is a good thing… — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#20442 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIi3CLCGdGyp5eaOvh7P8bjWEmLzEks5thNtEgaJpZM4S1qvp> .

mapio · 2018-03-23T17:42:53Z

I am just a Pandas user, not a developer. What I can suggest is to look for the code that converts None to NaN in numeric series constructors and copy it to object series constructors. And document it in the session of the documentation where the handling of missing values.

Observe that this is not likely to cause any particular harm. If one puts a None in a series it's not likely to invoke methods on it, so that changing it to np.nan will not make much of a difference. Especially since Series.isnull is (as stated in the documentation) an alias of Series.isna, so that code that was looking for None will still find it, whatever method it will call to test for it.

If some developer offers to fix the code, I volunteer to fix the docs 😄

walkerh · 2018-04-09T16:22:56Z

There is no way to address this without a major point release, since any change here will break backwards compatibility.

Nevertheless, this issue does imply the clear need for documentation improvement: Nones selectively behave as nans.

jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Difficulty Advanced labels Mar 22, 2018

jreback added this to the Next Major Release milestone Mar 22, 2018

jreback changed the title ~~None is not equal to None~~ BUG: None is not equal to None Mar 22, 2018

asdf8601 mentioned this issue Jun 19, 2018

Series dtype changes when a new row is added #21501

Closed

WillAyd mentioned this issue Jul 3, 2018

BUG: df.duplicated treats None as np.nan in object columns #21720

Open

jbrockmendel mentioned this issue Aug 10, 2018

DataFrame vs Series vs Index arithmetic Roundup #18824

Closed

59 tasks

jschendel mentioned this issue Apr 11, 2019

Problem with DataFrame.replace using None #26050

Closed

jbrockmendel removed Effort Low labels Oct 21, 2019

Karl-Wiese mentioned this issue Nov 25, 2020

ENH: add a keyword to the eq method to consider NAs equal #38063

Open

simonjayhawkins mentioned this issue Jul 9, 2022

BUG: Series with same index values, but in different orders, cannot be compared, but can be added #47554

Open

3 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

This was referenced Apr 29, 2024

None != None (NA != NA) iris-garden/test-process#1629

Closed

None != None (NA != NA) iris-garden/test-process#2215

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: None is not equal to None #20442

BUG: None is not equal to None #20442

mapio commented Mar 21, 2018

jschendel commented Mar 21, 2018

mroeschke commented Mar 21, 2018 •

edited

Loading

jreback commented Mar 22, 2018

jreback commented Mar 22, 2018

jreback commented Mar 22, 2018

mapio commented Mar 22, 2018

jreback commented Mar 22, 2018

mapio commented Mar 22, 2018 •

edited

Loading

TomAugspurger commented Mar 22, 2018 •

edited

Loading

mapio commented Mar 22, 2018

jreback commented Mar 23, 2018

mapio commented Mar 23, 2018

TomAugspurger commented Mar 23, 2018 •

edited

Loading

mapio commented Mar 23, 2018 •

edited

Loading

TomAugspurger commented Mar 23, 2018 via email

mapio commented Mar 23, 2018 •

edited

Loading

walkerh commented Apr 9, 2018

BUG: None is not equal to None #20442

BUG: None is not equal to None #20442

Comments

mapio commented Mar 21, 2018

Output of pd.show_versions()

jschendel commented Mar 21, 2018

mroeschke commented Mar 21, 2018 • edited Loading

jreback commented Mar 22, 2018

jreback commented Mar 22, 2018

jreback commented Mar 22, 2018

mapio commented Mar 22, 2018

jreback commented Mar 22, 2018

mapio commented Mar 22, 2018 • edited Loading

TomAugspurger commented Mar 22, 2018 • edited Loading

mapio commented Mar 22, 2018

jreback commented Mar 23, 2018

mapio commented Mar 23, 2018

TomAugspurger commented Mar 23, 2018 • edited Loading

mapio commented Mar 23, 2018 • edited Loading

TomAugspurger commented Mar 23, 2018 via email

mapio commented Mar 23, 2018 • edited Loading

walkerh commented Apr 9, 2018

Output of `pd.show_versions()`

mroeschke commented Mar 21, 2018 •

edited

Loading

mapio commented Mar 22, 2018 •

edited

Loading

TomAugspurger commented Mar 22, 2018 •

edited

Loading

TomAugspurger commented Mar 23, 2018 •

edited

Loading

mapio commented Mar 23, 2018 •

edited

Loading

mapio commented Mar 23, 2018 •

edited

Loading