BUG: None is not equal to None #20442
Also a bit strange that `s == s` and `s.eq(s)` disagree:

```
In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'

In [3]: s = pd.Series([None]*3)

In [4]: s
Out[4]:
0    None
1    None
2    None
dtype: object

In [5]: s == s
Out[5]:
0    False
1    False
2    False
dtype: bool

In [6]: s.eq(s)
Out[6]:
0    True
1    True
2    True
dtype: bool
```
---
I believe this occurs because this comparison hits this block (lines 162 to 163 at 17c1fad) and the check in `pandas/pandas/_libs/src/util.pxd` (lines 156 to 158 at 17c1fad).
---
See the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts `None` on the same footing as `NaN`. So this is not a bug, rather a consequence of straddling 2 conventions. I suppose the documentation could be slightly enhanced.
---
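The two conventions being straddled can be seen side by side in a short sketch (plain Python vs. pandas; this assumes the behaviour shown in the thread still holds):

```python
import numpy as np
import pandas as pd

# IEEE 754 convention: NaN never compares equal, not even to itself.
print(np.nan == np.nan)   # False

# Python convention: None is equal to itself.
print(None == None)       # True

# pandas extends the "missing values never compare equal" rule to None
# stored in an object Series, putting it on the same footing as NaN:
s = pd.Series([None, np.nan], dtype=object)
print((s == s).tolist())  # [False, False]
```
---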
However this should be consistent with this:

```
In [7]: s = pd.Series([np.nan]*3)
```

so [6] from @jschendel should match [9] here. The incorrect case is actually [6].
---
here's the dichotomy in scalars:
---
I do not agree with @jreback and the current status of equality for `None`. I've seen the warning box and I completely agree that, according to the IEEE floating-point standard, "a NaN is never equal to any other number (including another NaN)" (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic). I have no problem, if the data type of a series is numeric, accepting that NaNs are not equal, and even that `None` gets converted to NaN. But if a series contains objects (let them be strings, lists, or whatever), equality should be consistent with Python equality for such objects. It's a basic application of the principle of least astonishment. I don't see any benefit in mangling the Python semantics of `None`.
---
@mapio you are missing the point.
---
Where in the documentation is it stated that `None` is converted to NaN for numeric dtypes but kept as-is for the object dtype?

```
>>> pd.Series([1, None, 2])
0    1.0
1    NaN
2    2.0
dtype: float64
```

but

```
>>> pd.Series(['a', None, 'b'])
0       a
1    None
2       b
dtype: object
```

If you pick the convention that `None` becomes NaN, I would expect

```
>>> pd.Series(['a', None, 'b'])
0      a
1    NaN
2      b
dtype: object
```

Given such a convention I would have probably still asked why you decided to translate a `None` to NaN, but at least the behaviour would be uniform. Probably I'm missing the point, but the current situation is, at best, inconsistent (and as such, is very confusing).
---
http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing

@mapio how familiar are you with the NumPy / pandas type system? When you do `pd.Series([1, None, 2], dtype=object)`, the `None` is preserved:

```
In [5]: s = pd.Series([1, None, 2], dtype=object)

In [6]: s
Out[6]:
0       1
1    None
2       2
dtype: object
```

But as you've seen, many places in pandas, especially in ops like comparisons, treat `None` as a missing value, the same as NaN.
---
I'm quite familiar with NumPy. My point is the exact opposite of what you say. When you build a series from objects, pandas will correctly infer the `object` dtype. It makes complete sense in numeric series, where `None` is converted to NaN, that missing values don't compare equal. My point is: if you keep `None` in an object series, no implicit conversion to NaN should happen. If you store objects, you should compare them as Python does.
---
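The gap between the vectorized comparison and plain Python comparison, plus a null-aware workaround, can be sketched like this (the `eq_or_both_na` expression is my own illustration, not a pandas API):

```python
import pandas as pd

a = pd.Series(['a', None, 'b'])
b = pd.Series(['a', None, 'b'])

# Vectorized comparison: None is treated as missing, so it is not equal
# even to itself.
print((a == b).tolist())                 # [True, False, True]

# Plain Python, element by element: None == None is True.
print([x == y for x, y in zip(a, b)])    # [True, True, True]

# A null-aware equality: values equal, or missing in both series.
eq_or_both_na = (a == b) | (a.isna() & b.isna())
print(eq_or_both_na.tolist())            # [True, True, True]
```
---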
This argument was settled long ago. So comparisons must follow the convention that missing values do not compare equal.
---
If you choose to convert missing values to NaN for ALL (all caps yours) dtypes (a decision I can understand and second), I still don't see why this does not hold for the object dtype! Please consider converting `None` to NaN for object series too.
---
I think this will be hard to change as pandas is currently structured. We allow both None and NaN as the missing value indicator in object columns.

```
In [14]: a = pd.Series(['a', None, float('nan')])

In [15]: a
Out[15]:
0       a
1    None
2     NaN
dtype: object

In [16]: a == a
Out[16]:
0     True
1    False
2    False
dtype: bool
```

IIRC, people found it useful to not immediately convert `None` to NaN, since `None` can be a perfectly valid value in an object column. If we wanted to change this, which I'm not sure we do, I don't see a deprecation path for changing the comparison behavior. How is this affecting you negatively?
---
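Note that some pandas APIs already treat the two missing-value markers uniformly: `isna` flags both, and `Series.equals` considers missing values in matching positions equal, unlike element-wise `==` (a quick check, assuming current behaviour matches the thread):

```python
import pandas as pd

a = pd.Series(['a', None, float('nan')])
b = pd.Series(['a', None, float('nan')])

# isna() flags both None and NaN as missing:
print(a.isna().tolist())   # [False, True, True]

# Series.equals() treats missing values in the same positions as equal,
# unlike element-wise ==:
print(a.equals(b))         # True
print((a == b).tolist())   # [True, False, False]
```
---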
I think that people find it useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of the Python types (like `None`)! Consider this piece of code:

```
In [1]: import pandas as pd

In [2]: A = pd.Series(['a', None, 2])

In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False

In [5]: all(a == b for a, b in zip(A, B))
Out[5]: True

In [6]: all(A.values == B.values)
Out[6]: True

In [7]: all(a == b for a, b in zip(A.values, B.values))
Out[7]: True
```

I expect that `In [4]` should tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently it turns out they are not! But if I test equality member by member as in `In [5]`, I get (as expected) that the series are equal member by member.

Now, if I turn the series into `numpy.array`s as in `In [6]`, I get the expected result. Here NumPy is consistent with the member-by-member comparison in `In [7]`.

As you can clearly see, there is an inconsistency that is very hard to understand (you have to know it, and there is no place in the docs where this is clearly stated):

- pandas converts all missing values except for `None` to NaN;
- pandas (according to IEEE 754) treats NaNs as different from themselves.

Having an inconsistent behaviour, leaving `None` as such but comparing `None`s as different, violates the principle of least astonishment. I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) figuring out why there was a difference when, looking at the corresponding rows, all the values were equal! I don't think anyone can really think that an inconsistent behaviour is a good thing…
---
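For the dataframe-comparison use case described above, a null-aware check avoids the surprise (a user-side sketch; `DataFrame.equals` encapsulates the same idea):

```python
import pandas as pd

df1 = pd.DataFrame({'x': ['a', None, 'b'], 'y': [1.0, float('nan'), 2.0]})
df2 = pd.DataFrame({'x': ['a', None, 'b'], 'y': [1.0, float('nan'), 2.0]})

# Naive comparison flags every missing value as a difference:
print((df1 == df2).all().all())   # False

# Null-aware: cells equal, or missing in both frames.
same = (df1 == df2) | (df1.isna() & df2.isna())
print(same.all().all())           # True

# equals() does the null-aware comparison in one call:
print(df1.equals(df2))            # True
```
---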
Do you see a way to deprecate this cleanly (along with the behavior of isna / notna, groupby, etc.)?
---
I am just a pandas user, not a developer. What I can suggest is to look for the code that converts missing values to NaN and apply it to `None` in object series too. Observe that this is not likely to cause any particular harm: at least the behaviour would be consistent with the numeric case. If some developer offers to fix the code, I volunteer to fix the docs 😄
---
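Nothing stops a user from doing that normalization explicitly today; a minimal sketch (my own workaround, not a pandas change):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', None, 'b'])

# Replace None (and any other missing marker) with np.nan,
# keeping the object dtype:
normalized = s.where(s.notna(), np.nan)

print(normalized[1] is None)       # False
print(normalized.isna().tolist())  # [False, True, False]
```
---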
There is no way to address this without a major point release, since any change here would break backwards compatibility. Nevertheless, this issue does imply a clear need for documentation improvement.
---
If you have `None` in a series, it will not be considered equal to `None` (even in the same series):

```
>>> s = pd.Series([None])
>>> s == s
0    False
dtype: bool
```

but, of course,

```
>>> None == None
True
```

I can't find any reference to this behaviour in the documentation, and it looks quite unnatural.
Output of `pd.show_versions()`:
```
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```