BUG: get_indexer_non_unique() does not always handle NaNs correctly #44465

Closed
3 tasks done
johannes-mueller opened this issue Nov 15, 2021 · 2 comments
Labels
Bug, Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@johannes-mueller
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

idx_no_nan = pd.Index([1, 2], dtype=np.float64)
idx_nan = pd.Index([1, 2, np.nan])

print("dtype=np.float64")
target = pd.Index([1, np.nan], dtype=np.float64)
print(idx_no_nan.get_indexer(target))
print(idx_no_nan.get_indexer_non_unique(target))
print(idx_nan.get_indexer(target))
print(idx_nan.get_indexer_non_unique(target))

print("dtype='object'")
target = pd.Index([1, np.nan], dtype='object')
print(idx_no_nan.get_indexer(target))
print(idx_no_nan.get_indexer_non_unique(target))
print(idx_nan.get_indexer(target))
print(idx_nan.get_indexer_non_unique(target))

Issue Description

Annotated script output

dtype=np.float64
[ 0 -1]
(array([ 0, -1]), array([1]))  # ← expected
[0 2]
(array([ 0, -1]), array([1]))  # ← would expect (array([0, 2]), array([], dtype=int64))
dtype='object'
[ 0 -1]
(array([0, 0, 1]), array([], dtype=int64)) # ← would expect (array([0, -1]), array([1]))
[0 2]
(array([0, 2]), array([], dtype=int64))  # ← expected

Expected Behavior

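Because both the indexes and the targets are unique, I would expect get_indexer() and get_indexer_non_unique() to give equivalent results: the first array returned by get_indexer_non_unique() should equal the output of get_indexer(), and the second array should list the target positions that were not found.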

dtype=np.float64
[ 0 -1]
(array([ 0, -1]), array([1]))
[0 2]
(array([0, 2]), array([], dtype=int64))
dtype='object'
[ 0 -1]
(array([0, -1]), array([1]))
[0 2]
(array([0, 2]), array([], dtype=int64))
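
The expectation above can be written as a small consistency check. This is only a sketch: the helper name `indexers_agree` is hypothetical and not part of pandas, and it assumes that both the index and the target are unique, as in the example in this report.

```python
import numpy as np
import pandas as pd


def indexers_agree(index: pd.Index, target: pd.Index) -> bool:
    """Check that get_indexer_non_unique() is consistent with get_indexer().

    Hypothetical helper for illustration; only meaningful when both
    `index` and `target` are unique.
    """
    expected = index.get_indexer(target)
    indexer, missing = index.get_indexer_non_unique(target)
    # Positions in `target` that get_indexer() reports as not found (-1).
    expected_missing = np.flatnonzero(expected == -1)
    return bool(
        np.array_equal(indexer, expected)
        and np.array_equal(missing, expected_missing)
    )


# A NaN-free case, where both methods are expected to agree.
print(indexers_agree(pd.Index([1.0, 2.0]), pd.Index([2.0, 3.0])))
```

Going by the script output reported above, calling this helper with idx_nan and the float64 target, or with idx_no_nan and the object target, would return False on the affected build.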

Installed Versions

INSTALLED VERSIONS
------------------
commit : a07561e
python : 3.8.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-90-lowlatency
Version : #101-Ubuntu SMP PREEMPT Fri Oct 15 20:57:56 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0.dev0+1117.ga07561e5a3
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.24.2
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.23.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.1
gcsfs : 2021.11.0
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.0
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.2
sqlalchemy : 1.4.26
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

johannes-mueller added the Bug and Needs Triage labels on Nov 15, 2021
@johannes-mueller
Contributor Author

The unexpected result for the object-dtype target is most probably an effect of numpy/numpy#15499.

@johannes-mueller
Contributor Author

It turns out that there are separate root causes for the two unexpected outputs. Will split this issue.
