Skip to content

ENH: add IntervalIndex.get_loc_exact to look for exact matches only #19353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jan 22, 2018

This PR add a 'exact' option to the parameter method on IntervalIndex.get_loc for accepting exact matches only.

In addition to it's direct advantage, it will later make it possible to make a PR to allow IntervalIndex.get_indexer calls with the other index having non-Interval values, and hence allow mixing mix Interval with non-intervals in pd.Index(..., dtype=object) instances. This would fix the remaining issue with #19021.

@jschendel
Copy link
Member

jschendel commented Jan 23, 2018

There's a private method that looks like it's supposed to do this:

def _get_loc_only_exact_matches(self, key):
if isinstance(key, Interval):
if not self.is_unique:
raise ValueError("cannot index with a slice Interval"
" and a non-unique index")
# TODO: this expands to a tuple index, see if we can
# do better
return Index(self._multiindex.values).get_loc(key)
raise KeyError

It doesn't appear to be working at all though:

In [2]: ii = pd.interval_range(0, 4)

In [3]: ii[0]
Out[3]: Interval(0, 1, closed='right')

In [4]: ii.get_loc(ii[0])
Out[4]: 0

In [5]: ii._get_loc_only_exact_matches(ii[0])
---------------------------------------------------------------------------
KeyError: Interval(0, 1, closed='right')

The issue appears to be that Index(self._multiindex.values) creates an index of tuples, not intervals, so passing an Interval object will always raise a KeyError. Likewise, due to the first line, passing anything other than an Interval object will also raise a KeyError.

In [6]: pd.Index(ii._multiindex.values)
Out[6]: Index([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='object')

Might be best to implement this in _get_loc_only_exact_matches instead? It looks like IntervalIndex.get_slice_bound (inherited from Index) relies on _get_loc_only_exact_matches.

# GH19349
assert self.index.get_loc(Interval(0, 1), method='exact') == 0
with pytest.raises(KeyError):
self.index.get_loc(Interval(0, 0.5), method='exact')
Copy link
Member

@jschendel jschendel Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test that an Interval that only differs by closed raises a KeyError as well, e.g. Interval(0, 1, closed='neither')?

@codecov
Copy link

codecov bot commented Jan 23, 2018

Codecov Report

Merging #19353 into master will decrease coverage by 0.02%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19353      +/-   ##
==========================================
- Coverage   91.61%   91.59%   -0.03%     
==========================================
  Files         150      150              
  Lines       48807    48817      +10     
==========================================
- Hits        44716    44712       -4     
- Misses       4091     4105      +14
Flag Coverage Δ
#multiple 89.96% <83.33%> (-0.03%) ⬇️
#single 41.72% <16.66%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 96.38% <50%> (-0.06%) ⬇️
pandas/core/indexes/interval.py 92.33% <90%> (-0.07%) ⬇️
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 324379c...9045cc9. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Jan 23, 2018

yes, rather than adding kwargs to .get_loc much better to fix the private method here.

@jreback jreback added the Interval Interval data type label Jan 23, 2018
@topper-123
Copy link
Contributor Author

I can look into the private method.

@jreback , do you mean you are negative wrt. method='exact' in the public interface? I think it will be useful to have in many cases, e.g.

>>> ii = pd.interva_range(0, 3)
>>> ii.get_loc(pd.Interval(1,3))
slice(1, 3, None)
>>> ii.get_loc(pd.Interval(1,3), method='exact')
KeyError: ...

Above, we've answered two different questions, both reasonable to ask, IMO.

@topper-123
Copy link
Contributor Author

topper-123 commented Jan 23, 2018

In _get_loc_only_exact_matches I've converted the line return Index(self._multiindex.values).get_loc(key) to return super(IntervalIndex, self)._engine.get_loc(key). Another option would be to do self.astype(object).get_loc(key), which is more readable, but slower, presumedly because of the need to cast.

The super method is faster, around 13 ms, while the astype method is around 75 ms, so I've chosen the super method.

Also, _get_loc_only_exact_matches only worked for unique indices, while get_loc should work with non-unique indices, so some changes to accomodate non-unique values.

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch from bf1050d to a30e7c0 Compare January 23, 2018 16:40
@pep8speaks
Copy link

pep8speaks commented Jan 23, 2018

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 11, 2018 at 16:12 Hours UTC

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch from a30e7c0 to 3b9aae3 Compare January 23, 2018 16:41
@jreback
Copy link
Contributor

jreback commented Jan 24, 2018

@topper-123 we are not going to change the .get_loc api. it is very simple and needs to stay that way. As I said I am not above exposing the private function to make an exact match.

@topper-123
Copy link
Contributor Author

topper-123 commented Feb 2, 2018

@jreback , adding this keeps things simple IMO: If you look at the doc string for pd.Index`.get_loc, the method part is:

method : {None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
    * default: exact matches only.
    * ...

It's logical that IntervalIndex.get_loc should be able to revert back to exact matches only, which is the default for all other index types.

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch from 3b9aae3 to aaa9a7a Compare February 8, 2018 19:23
@topper-123
Copy link
Contributor Author

I've added a get_loc_exact method instead of using method='exact' in get_loc.

@jschendel
Copy link
Member

I think there are some caching issues when trying to use super(IntervalIndex, self)._engine. Since both IntervalIndex._engine and Index._engine are decorated with @cache_readonly, it looks like you can't actually create both, which can lead to some weird errors depending on which _engine gets created first.

As an example, when an IntervalIndex is not unique, get_loc uses self._engine. In such a scenario, you can't successfully use both get_loc and get_loc_exact, as whichever method you call second will fail.

Using get_loc first causes get_loc_exact to fail:

In [1]: import pandas as pd

In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])

In [3]: ii.get_loc(pd.Interval(0, 1))
Out[3]: array([0, 2], dtype=int64)

In [4]: ii.get_loc_exact(pd.Interval(0, 1))
---------------------------------------------------------------------------
TypeError: No matching signature found

Using get_loc_exact first causes get_loc to fail:

In [1]: import pandas as pd

In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])

In [3]: ii.get_loc_exact(pd.Interval(0, 1))
Out[3]: array([ True, False,  True], dtype=bool)

In [4]: ii.get_loc(pd.Interval(0, 1))
---------------------------------------------------------------------------
AttributeError: 'pandas._libs.index.ObjectEngine' object has no attribute 'get_loc_interval'

Not sure if there's a simple workaround here or not.

@topper-123
Copy link
Contributor Author

Nice catch.

I've changed the call to return self.astype(object).get_loc(key) and added a test for the condition you pointed out.

A shame, as the super method would have been faster if your issue would not have surfaced.

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch from aaa9a7a to 7590b87 Compare February 9, 2018 21:09
Returns
-------
loc : int if unique index, slice if monotonic index, else mask

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add Raises section

@@ -1003,6 +1004,47 @@ def get_loc(self, key, method=None):
else:
return self._engine.get_loc(key)

def get_loc_exact(self, key, method=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the same method in Index, dispatch to .get_loc

--------
get_loc
"""
return self.astype(object).get_loc(key, method=method)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

self.index.get_loc_exact(Interval(2, 3))
with pytest.raises(KeyError):
self.index.get_loc_exact(Interval(-1, 0, 'left'))
# See #19353#issuecomment-364295029
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this last test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests if the caching issue showed by @jschendel exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then write that as a comment. an issue reference is good, but a comment + reference even better. add blank lines in between sections of tests.

@@ -497,6 +497,23 @@ def test_get_loc_interval(self):
pytest.raises(KeyError, self.index.get_loc,
Interval(-1, 0, 'left'))

# To be removed, replaced by test_interval_new.py (see #16316, #16386)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding this comment?

@@ -507,6 +507,7 @@ Other API Changes
- Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
- :class:`DateOffset` objects render more simply, e.g. "<DateOffset: days=1>" instead of "<DateOffset: kwds={'days': 1}>" (:issue:`19403`)
- :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`)
- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to Other Enhancements

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add to the api.rst

--------
get_loc
"""
return self.astype(object).get_loc(key, method=method)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems quite inefficient. a) can you add an asv, b) I would do this in the cython code itself that implements get_loc (this can be a further refinement / PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The super method was faster, around 13 ms, while this is around 75 ms, so it's slower.

I don't have experience with cython, so can't take on those changes in get_loc, sorry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine an asv for now is prob best.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into asv and the speed issue.

>>> idx = IntervalIndex.from_breaks(np.arange(1000001))
>>> i = idx[80000]
>>> %t idx.get_loc(i)
25.2 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %t idx.get_loc_exact(i)
1.33 s ± 46.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I doubt this acceptable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, so this is much slower?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think this speed issue is beyound me, so I'm stuck. Maybe it's something wrong with Interval type conversion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why don't you just do .get_loc and then check if its exact after?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that works very nicely: 41.5 µs for get_loc_exact vs 33.1 µs for get_loc.

so there's no speed issue now at all.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 10, 2018
@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch 3 times, most recently from fc09b08 to eff6ec6 Compare February 10, 2018 22:09
raise KeyError(key)
else:
exact_matches = self[all_matches] == key
if (exact_matches).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this up to avoid duplication

exact_matches = self[all_matches] == key
if (exact_matches).all():
return all_matches
elif (exact_matches).any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can also be simplified i think

Copy link
Contributor Author

@topper-123 topper-123 Feb 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so: in the exact_matches.all() case, all_matches might be a slice object, so all_matches[exact_matches] would then fail.


# The below tests if get_loc_exact interferes with caching
# used for index.get_loc. See #19353#issuecomment-364295029
self.index.get_loc(Interval(0, 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert that this is something

@@ -136,7 +136,7 @@ def test_with_slices(self):
s[Interval(3, 6):]

expected = s.iloc[3:5]
result = s[[Interval(3, 6)]]
result = s[Interval(3, 6)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unneeded, I've changed it back.

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch 3 times, most recently from dd3611a to ce138b6 Compare February 10, 2018 22:30
@topper-123 topper-123 changed the title ENH: let IntervalIndex.get_loc look for exact matches only ENH: add IntervalIndex.get_loc_exact to look for exact matches only Feb 10, 2018
@topper-123
Copy link
Contributor Author

All concerns have been addessed, AFAICS.

if is_scalar(all_matches):
if exact_matches:
return all_matches
raise KeyError(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be simplified

@topper-123 topper-123 force-pushed the IntervalIndex.get_loc branch from ce138b6 to 44f3471 Compare February 10, 2018 23:38
@topper-123
Copy link
Contributor Author

green.

@jorisvandenbossche
Copy link
Member

Is it really needed to add a method to all Index objects only for use in IntervalIndex ?
Our public API is already huge, not sure this is needed to add even more. We have other examples where Index subclasses have specific type-related methods that are not available for other Index classes.

@@ -323,6 +323,7 @@ Other Enhancements

- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
- :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`)
- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob needs to be added to api.rst

def get_loc_exact(self, key, method=None):
"""Get integer location, slice or boolean mask for exact
matches only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah could remove this

@@ -229,6 +229,10 @@ def time_getitem_list(self, monotonic):
def time_loc_list(self, monotonic):
monotonic.loc[80000:]

def time_get_loc_exact(self, monotonic):
interval = monotonic.index[80000]
monotonic.index.get_loc_exact(interval)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should only run on II

@jschendel
Copy link
Member

I was just going over some of the tests in test_interval_new.py, and noticed the following comment in test_get_loc_interval, which indicates that the desired future behavior for get_loc is to only return exact matches:

def test_get_loc_interval(self, idx_side, side):
idx = IntervalIndex.from_tuples([(0, 1), (2, 3)], closed=idx_side)
for bound in [[0, 1], [1, 2], [2, 3], [3, 4],
[0, 2], [2.5, 3], [-1, 4]]:
# if get_loc is supplied an interval, it should only search
# for exact matches, not overlaps or covers, else KeyError.

Is this really the desired future behavior? If so, should we try putting the get_loc_exact behavior in this PR directly into the Interval cases for get_loc? Couldn't find any comments specifically about this after doing a quick search through the issue/pr that added the new test.

@jorisvandenbossche
Copy link
Member

@jschendel Yes, that was also my first thought when I saw this PR, and wanted to say: let's change get_loc to be exact as we discussed that we want it this way.
However, the comment you cite above is when it is passed an interval, then it is required to be exact (currently you can also pass a 'sub-interval'). But for non-interval scalars it still works as well, so eg df.loc[1] will still work to select all rows with intervalindex where the interval contains 1. So for scalars it does not need to be exact (at least, that is what we discussed then).

@jorisvandenbossche
Copy link
Member

So it is the test below the one you quoted (test_get_loc_scalar):

# if get_loc is supplied a scalar, it should return the index of
# the interval which contains the scalar, or KeyError.

@jschendel
Copy link
Member

@jorisvandenbossche : Yes, maybe it wasn't clear from my comment, but I was suggesting just modifying the Interval specific behavior of get_loc to force exactness. Since we're already writing the logic to do this, I thought it might not be that much additional work to just incorporate it in get_loc as well, which could lead to some simplifications for get_loc_exact. After looking over the code a bit more though, it might not be as straightforward as I initially thought.

@jorisvandenbossche
Copy link
Member

modifying the Interval specific behavior of get_loc to force exactness.

Ah, yes, I missed that. But that's a good idea

@topper-123 topper-123 closed this May 18, 2018
@topper-123 topper-123 deleted the IntervalIndex.get_loc branch May 18, 2018 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG/inconsistency: IntervalIndex.get_loc gives a location for non-exact inputs
5 participants