Skip to content

IntervalIndex get_indexer may return -1 for values that are in the index #16410

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nateyoder opened this issue May 21, 2017 · 10 comments · Fixed by #31171
Closed

IntervalIndex get_indexer may return -1 for values that are in the index #16410

nateyoder opened this issue May 21, 2017 · 10 comments · Fixed by #31171
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@nateyoder
Copy link
Contributor

Code Sample, a copy-pastable example if possible

idx = pd.IntervalIndex.from_breaks(pd.np.arange(10))
target = pd.Interval(0, 1, closed='right')
# Indexer works in this case if index was sorted
a = idx.get_indexer([target])
assert a[0] == 0

idx = pd.Index(pd.np.roll(idx.values, 1))
a = idx.get_indexer([target])
assert tmp == idx[1]
# Even though the second value in the index is equal to the target `get_indexer` returns -1
a[0]
>>> -1

Problem description

I expected get_indexer on IntervalIndex is expected to return the index of matching values. However it does not always seem to do so.

Expected Output

In the above example I expected an output of a = np.array([1])

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 4111382
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+67.g4111382
pytest: 3.0.7
pip: 8.1.2
setuptools: 35.0.2
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.2
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented May 22, 2017

this should raise. We cannot efficiently handle non-monotonic indexes, though I think we actually try to do this.

@shoyer @buyology

@jreback jreback added this to the Next Major Release milestone May 22, 2017
@jreback
Copy link
Contributor

jreback commented May 22, 2017

@nateyoder welcome to have you take a look.

@nateyoder
Copy link
Contributor Author

@jreback sounds like your desired behavior for now is raising? If that is the case I can try to take a look.

@shoyer
Copy link
Member

shoyer commented May 22, 2017

Agreed, get_indexer should always do an exact lookup here.

@jreback
Copy link
Contributor

jreback commented May 22, 2017

actually .get_indexer() doesn't raise, so this is correct (meaning it was not found).

so this is correct (though certainly may not be handled correctly elsewhere), where are you seeing this (aside from the top-example)?

In [1]: Index([1, 2, 3]).get_indexer([10])
Out[1]: array([-1])

@nateyoder
Copy link
Contributor Author

The issue with this get_indexer call for IntervalIndex is that it returns -1 even though the interval IS in the index (as shown with the second assert in my example). If non-monotonic IntervalIndex objects are not handled in the code it seems to me like creating one should raise?

@jreback
Copy link
Contributor

jreback commented May 22, 2017

@nateyoder sorry, you are right. this should return the right value, but get_indexer() never raises by definition. (the -1 indicate missing values)

@shoyer
Copy link
Member

shoyer commented May 22, 2017 via email

@jschendel jschendel mentioned this issue Nov 26, 2017
3 tasks
@makbigc
Copy link
Contributor

makbigc commented Apr 7, 2019

The bug is still here for non-monotonic index.

In [1]: import pandas as pd                                                     

In [2]: idx1 = pd.IntervalIndex.from_tuples([(2,3), (4,5), (0,1)])              

In [3]: idx2 = pd.IntervalIndex.from_tuples([(0,1), (2,3), (6,7), (8,9)])       

In [4]: idx1.get_indexer(idx2)                                                  
Out[4]: array([-1, -1, -1, -1])
# The expected output is array([2, 0, -1, -1])

In [5]: idx1[1:]                                                                
Out[5]: 
IntervalIndex([(4, 5], (0, 1]],
              closed='right',
              dtype='interval[int64]')

In [6]: idx1.get_indexer(idx1[1:])                                              
Out[6]: array([ 0,  1, -1])
# The expected output is array([1, 2])

I guess this bug is due to that IntervalTree.get_indexer only returns the correct values for its right.
In my first example,

lindexer = self._engine.get_indexer(left.values)
rindexer = self._engine.get_indexer(right.values)

rindexer gets array([2, 0, -1, -1]), but lindexer gets array([-1, -1, -1, -1]).

Digging deeper into the IntervalTree.

if self.left[i] {{cmp_left}} point {{cmp_right}} self.right[i]:

Becuase IntervalIndex is right closed by default, {{cmp_left}} is <. left.values, which is point in the above code, is just excluded by the < operator. While right.values is included by <= operator.

@mroeschke
Copy link
Member

The original issue isn't fully copy pasteable, but @makbigc's examples look to work on master. Could use a test.

In [30]: In [2]: idx1 = pd.IntervalIndex.from_tuples([(2,3), (4,5), (0,1)])
    ...:
    ...: In [3]: idx2 = pd.IntervalIndex.from_tuples([(0,1), (2,3), (6,7), (8,9)])
    ...:
    ...: In [4]: idx1.get_indexer(idx2)
Out[30]: array([ 2,  0, -1, -1])

In [31]: In [6]: idx1.get_indexer(idx1[1:])
Out[31]: array([1, 2])

In [32]: pd.__version__
Out[32]: '0.26.0.dev0+627.gef77b5700'

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Interval Interval data type labels Oct 22, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Jan 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants