pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119
Comments
I haven't debugged it yet, but the issue is probably similar to #21866 (with the difference that there is no special handling in [...]). The right fix would probably be to fix the behavior of the hash table rather than to implement workarounds.
Does this result in a higher-level bug for you? This is the same behavior as Python:
In [1]: float('nan') in {float('nan')}
Out[1]: False
I have used [...]. However, to be precise, my example corresponds to:
Or using float('nan')
Probably, this is the case because [...]. Not a very stable approach in any case, though... Also, I would prefer to be consistent with the sane pandas float64 behavior rather than with the somewhat erratic Python set behavior.
The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself.
cc @jreback: do we have precedent for treating np.nan as equal to itself?
NaN is never equal to itself.
The fact that `dictionary[my_nan]` gives a different result to
`dictionary[float('nan')]` is because Python's hashtable first checks for
object identity. If this doesn't turn up anything, then it falls back to
equality. `np.nan` happens to be a singleton, but IIRC, this isn't part of
the documented API.
So I guess my recommendation would be to not try to look up NaNs :)
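The identity-then-equality lookup described above can be seen with a plain dict. This is a minimal sketch using only the standard library; the names `my_nan` and `dictionary` follow the wording of the comment, and the stored value is arbitrary.

```python
# Dict lookups check object identity first and fall back to == only afterwards.
my_nan = float("nan")
dictionary = {my_nan: "stored"}

# NaN is never equal to itself under ==.
assert my_nan != my_nan

# Same object: found through the identity shortcut.
same_object_found = my_nan in dictionary

# A fresh NaN object: not identical, and == is False, so the lookup misses.
fresh_nan_found = float("nan") in dictionary

print(same_object_found, fresh_nan_found)  # True False
```

So whether a NaN key is found depends on which NaN object is used for the lookup, not on its value.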
On Mon, Jul 30, 2018 at 8:59 AM gfyoung wrote:
The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself.
cc @jreback: do we have precedent for treating np.nan as equal to itself?
@gfyoung pd.unique() treats all nans as being the same:
In order to behave consistently, a hash map/hash set requires (among other things) that the relation [...]. However, for floats under the IEEE 754 standard [...]. There are multiple ways to extend IEEE 754 [...]
Pandas opted for the second. Thus the behavior of [...]
So having a different behavior for NaNs as Python objects is at least surprising.
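To illustrate the point about hash sets, here is a small sketch of why plain `==` is not reflexive on floats, together with one possible extension that treats all NaNs as equal (the behavior attributed to pd.unique() earlier in the thread). The helper name `equal_all_nans_same` is hypothetical and is not a pandas function.

```python
import math


def equal_all_nans_same(a, b):
    """Hypothetical helper: == extended so that any two float NaNs are equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


samples = [1.0, 0.0, float("inf"), float("nan"), float("nan")]

# Plain == is not reflexive: the NaN samples are not equal to themselves.
plain_reflexive = all(x == x for x in samples)

# The extended relation is reflexive and symmetric on these samples.
extended_reflexive = all(equal_all_nans_same(x, x) for x in samples)
extended_symmetric = all(
    equal_all_nans_same(x, y) == equal_all_nans_same(y, x)
    for x in samples
    for y in samples
)

# A built-in set keeps two distinct NaN objects as two separate elements.
distinct_nans_in_set = len({float("nan"), float("nan")})

print(plain_reflexive, extended_reflexive, extended_symmetric, distinct_nans_in_set)
```

With the built-in relation, the set ends up holding both NaN objects, which is the Python-object behavior this issue is about.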
There are actually two different (even if somewhat related) issues:
So I assume there is a bug somewhere, which results in different nan-objects arriving in the hash-table, see #22160.
The Python way is to say that NaNs are not important enough to deserve special treatment, for which all other objects/classes would pay in performance. I'm not sure pandas can say NaNs are not important enough (NaN is probably the most frequently used value :)), so adding special handling for NaNs could be worth it in order to be consistent with the behavior of [...],
but this is obviously not my decision to make.
Actually, #22148 is at its core the second part of this issue: do we consider [...],
i.e. for different NaN objects, to be True or False (right now it is False)?
My proposal (i.e. a starting point for a discussion) is in PR #22207: if both objects are floats, then check whether they are both NaNs, the same way it is done for float64. The hash function in use already has the necessary behavior, so there is no need to change anything there. It doesn't prevent users from defining custom classes that behave like NaN and shooting themselves in the foot; it is probably too much to ask to handle such cases as well. As to performance, I could not see any worsening (I added some additional performance tests myself). The results are:
Btw, the (new) test above shows how easy it is to trigger the worst-case behavior of [...]. My takeaways from it:
This would also fix #22148.
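The comparison rule proposed above ("if both objects are floats, check whether they are both NaNs") can be sketched in pure Python. This is not the code from PR #22207, which lives in the pandas hash table; `NanAwareKey` and `isin_nan_aware` are hypothetical names used only for illustration. One assumption is spelled out in the code: since Python 3.10 the built-in hash of a float NaN depends on object identity, so the sketch assigns all NaNs one fixed hash.

```python
import math


class NanAwareKey:
    """Hypothetical wrapper: compares like the wrapped object, but all float NaNs are equal."""

    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def _is_float_nan(self):
        return isinstance(self.value, float) and math.isnan(self.value)

    def __hash__(self):
        # All float NaNs must share one hash so that they land in the same bucket.
        # Assumption: a fixed value is used because hash(float("nan")) depends on
        # object identity since Python 3.10.
        if self._is_float_nan():
            return 0
        return hash(self.value)

    def __eq__(self, other):
        if not isinstance(other, NanAwareKey):
            return NotImplemented
        # The proposed rule: two floats that are both NaN count as equal.
        if self._is_float_nan() and other._is_float_nan():
            return True
        return self.value == other.value


def isin_nan_aware(values, candidates):
    """Return, for each value, whether it occurs in candidates (NaNs match NaNs)."""
    lookup = {NanAwareKey(c) for c in candidates}
    return [NanAwareKey(v) in lookup for v in values]


result = isin_nan_aware([float("nan"), 1.0, "a"], [float("nan"), "a"])
print(result)  # [True, False, True]
```

As noted in the proposal, a user-defined class that mimics NaN would still not be matched by this rule, since only real floats are checked.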
Probably worth mentioning: [...]. However, IIRC the user could create another [...]. Enforcing a singleton would probably be a cleaner solution than trying to fix this esoteric possibility in the hash map.
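The singleton idea can be sketched as a normalization step: replace every float NaN with one shared object before it reaches a hash-based container, so that the identity shortcut always applies. This is only an illustration under that assumption; `CANONICAL_NAN` and `canonicalize_nans` are hypothetical names, not part of pandas or NumPy.

```python
import math

# Hypothetical single shared NaN object used for every NaN value.
CANONICAL_NAN = float("nan")


def canonicalize_nans(values):
    """Replace each float NaN with the one shared NaN object."""
    return [
        CANONICAL_NAN if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


values = canonicalize_nans([float("nan"), 2.0, float("nan")])
lookup = set(canonicalize_nans([float("nan"), float("nan")]))

# Every NaN is now the same object, so set membership finds it by identity.
membership = [v in lookup for v in values]
print(membership, len(lookup))  # [True, False, True] 1
```

The hash-based container itself stays unchanged in this approach; only its inputs are normalized.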
Example
Problem description
results in
Expected Output
However, I would expect the result to be
which is the case for example for
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.28.3
numpy: 1.13.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None