ENH: add IntervalIndex.get_loc_exact to look for exact matches only #19353

topper-123 · 2018-01-22T21:18:53Z

closes BUG/inconsistency: IntervalIndex.get_loc gives a location for non-exact inputs #19349
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR add a 'exact' option to the parameter method on IntervalIndex.get_loc for accepting exact matches only.

In addition to it's direct advantage, it will later make it possible to make a PR to allow IntervalIndex.get_indexer calls with the other index having non-Interval values, and hence allow mixing mix Interval with non-intervals in pd.Index(..., dtype=object) instances. This would fix the remaining issue with #19021.

jschendel · 2018-01-23T02:06:59Z

There's a private method that looks like it's supposed to do this:

pandas/pandas/core/indexes/interval.py

Lines 870 to 880 in 9872d67

    
           def _get_loc_only_exact_matches(self, key): 
        
               if isinstance(key, Interval): 
        
                   if not self.is_unique: 
        
                       raise ValueError("cannot index with a slice Interval" 
        
                                        " and a non-unique index") 
        
                   # TODO: this expands to a tuple index, see if we can 
        
                   # do better 
        
                   return Index(self._multiindex.values).get_loc(key) 
        
               raise KeyError

It doesn't appear to be working at all though:

In [2]: ii = pd.interval_range(0, 4)

In [3]: ii[0]
Out[3]: Interval(0, 1, closed='right')

In [4]: ii.get_loc(ii[0])
Out[4]: 0

In [5]: ii._get_loc_only_exact_matches(ii[0])
---------------------------------------------------------------------------
KeyError: Interval(0, 1, closed='right')

The issue appears to be that Index(self._multiindex.values) creates an index of tuples, not intervals, so passing an Interval object will always raise a KeyError. Likewise, due to the first line, passing anything other than an Interval object will also raise a KeyError.

In [6]: pd.Index(ii._multiindex.values)
Out[6]: Index([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='object')

Might be best to implement this in _get_loc_only_exact_matches instead? It looks like IntervalIndex.get_slice_bound (inherited from Index) relies on _get_loc_only_exact_matches.

jschendel · 2018-01-23T02:17:29Z

pandas/tests/indexes/interval/test_interval.py

+        # GH19349
+        assert self.index.get_loc(Interval(0, 1), method='exact') == 0
+        with pytest.raises(KeyError):
+            self.index.get_loc(Interval(0, 0.5), method='exact')


Can you also test that an Interval that only differs by closed raises a KeyError as well, e.g. Interval(0, 1, closed='neither')?

codecov · 2018-01-23T08:28:27Z

Codecov Report

Merging #19353 into master will decrease coverage by 0.02%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master   #19353      +/-   ##
==========================================
- Coverage   91.61%   91.59%   -0.03%     
==========================================
  Files         150      150              
  Lines       48807    48817      +10     
==========================================
- Hits        44716    44712       -4     
- Misses       4091     4105      +14

Flag	Coverage Δ
#multiple	`89.96% <83.33%> (-0.03%)`	⬇️
#single	`41.72% <16.66%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/base.py	`96.38% <50%> (-0.06%)`	⬇️
pandas/core/indexes/interval.py	`92.33% <90%> (-0.07%)`	⬇️
pandas/plotting/_converter.py	`65.22% <0%> (-1.74%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 324379c...9045cc9. Read the comment docs.

jreback · 2018-01-23T11:14:56Z

yes, rather than adding kwargs to .get_loc much better to fix the private method here.

topper-123 · 2018-01-23T12:54:06Z

I can look into the private method.

@jreback , do you mean you are negative wrt. method='exact' in the public interface? I think it will be useful to have in many cases, e.g.

>>> ii = pd.interva_range(0, 3)
>>> ii.get_loc(pd.Interval(1,3))
slice(1, 3, None)
>>> ii.get_loc(pd.Interval(1,3), method='exact')
KeyError: ...

Above, we've answered two different questions, both reasonable to ask, IMO.

topper-123 · 2018-01-23T16:40:11Z

In _get_loc_only_exact_matches I've converted the line return Index(self._multiindex.values).get_loc(key) to return super(IntervalIndex, self)._engine.get_loc(key). Another option would be to do self.astype(object).get_loc(key), which is more readable, but slower, presumedly because of the need to cast.

The super method is faster, around 13 ms, while the astype method is around 75 ms, so I've chosen the super method.

Also, _get_loc_only_exact_matches only worked for unique indices, while get_loc should work with non-unique indices, so some changes to accomodate non-unique values.

pep8speaks · 2018-01-23T16:40:38Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 11, 2018 at 16:12 Hours UTC

jreback · 2018-01-24T01:06:02Z

@topper-123 we are not going to change the .get_loc api. it is very simple and needs to stay that way. As I said I am not above exposing the private function to make an exact match.

topper-123 · 2018-02-02T22:04:27Z

@jreback , adding this keeps things simple IMO: If you look at the doc string for pd.Index`.get_loc, the method part is:

method : {None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional
    * default: exact matches only.
    * ...

It's logical that IntervalIndex.get_loc should be able to revert back to exact matches only, which is the default for all other index types.

topper-123 · 2018-02-08T19:25:11Z

I've added a get_loc_exact method instead of using method='exact' in get_loc.

jschendel · 2018-02-09T00:26:11Z

I think there are some caching issues when trying to use super(IntervalIndex, self)._engine. Since both IntervalIndex._engine and Index._engine are decorated with @cache_readonly, it looks like you can't actually create both, which can lead to some weird errors depending on which _engine gets created first.

As an example, when an IntervalIndex is not unique, get_loc uses self._engine. In such a scenario, you can't successfully use both get_loc and get_loc_exact, as whichever method you call second will fail.

Using get_loc first causes get_loc_exact to fail:

In [1]: import pandas as pd

In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])

In [3]: ii.get_loc(pd.Interval(0, 1))
Out[3]: array([0, 2], dtype=int64)

In [4]: ii.get_loc_exact(pd.Interval(0, 1))
---------------------------------------------------------------------------
TypeError: No matching signature found

Using get_loc_exact first causes get_loc to fail:

In [1]: import pandas as pd

In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])

In [3]: ii.get_loc_exact(pd.Interval(0, 1))
Out[3]: array([ True, False,  True], dtype=bool)

In [4]: ii.get_loc(pd.Interval(0, 1))
---------------------------------------------------------------------------
AttributeError: 'pandas._libs.index.ObjectEngine' object has no attribute 'get_loc_interval'

Not sure if there's a simple workaround here or not.

topper-123 · 2018-02-09T21:08:58Z

Nice catch.

I've changed the call to return self.astype(object).get_loc(key) and added a test for the condition you pointed out.

A shame, as the super method would have been faster if your issue would not have surfaced.

jreback · 2018-02-10T16:45:04Z

pandas/core/indexes/interval.py

+        Returns
+        -------
+        loc : int if unique index, slice if monotonic index, else mask
+


add Raises section

jreback · 2018-02-10T16:45:52Z

pandas/core/indexes/interval.py

@@ -1003,6 +1004,47 @@ def get_loc(self, key, method=None):
            else:
                return self._engine.get_loc(key)

+    def get_loc_exact(self, key, method=None):


add the same method in Index, dispatch to .get_loc

jreback · 2018-02-10T16:46:05Z

pandas/core/indexes/interval.py

+        --------
+        get_loc
+        """
+        return self.astype(object).get_loc(key, method=method)


jreback · 2018-02-10T16:46:27Z

pandas/tests/indexes/interval/test_interval.py

+            self.index.get_loc_exact(Interval(2, 3))
+        with pytest.raises(KeyError):
+            self.index.get_loc_exact(Interval(-1, 0, 'left'))
+        # See #19353#issuecomment-364295029


what is this last test?

this tests if the caching issue showed by @jschendel exists.

ok then write that as a comment. an issue reference is good, but a comment + reference even better. add blank lines in between sections of tests.

jreback · 2018-02-10T16:46:53Z

pandas/tests/indexes/interval/test_interval.py

@@ -497,6 +497,23 @@ def test_get_loc_interval(self):
        pytest.raises(KeyError, self.index.get_loc,
                      Interval(-1, 0, 'left'))

+    # To be removed, replaced by test_interval_new.py (see #16316, #16386)


why are you adding this comment?

jreback · 2018-02-10T16:47:40Z

doc/source/whatsnew/v0.23.0.txt

@@ -507,6 +507,7 @@ Other API Changes
 - Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`)
 - :class:`DateOffset` objects render more simply, e.g. "<DateOffset: days=1>" instead of "<DateOffset: kwds={'days': 1}>" (:issue:`19403`)
 - :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`)
+- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`)


move to Other Enhancements

add to the api.rst

jreback · 2018-02-10T16:49:26Z

pandas/core/indexes/interval.py

+        --------
+        get_loc
+        """
+        return self.astype(object).get_loc(key, method=method)


this seems quite inefficient. a) can you add an asv, b) I would do this in the cython code itself that implements get_loc (this can be a further refinement / PR).

The super method was faster, around 13 ms, while this is around 75 ms, so it's slower.

I don't have experience with cython, so can't take on those changes in get_loc, sorry.

ok that's fine an asv for now is prob best.

I've looked into asv and the speed issue.

>>> idx = IntervalIndex.from_breaks(np.arange(1000001)) >>> i = idx[80000] >>> %t idx.get_loc(i) 25.2 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) >>> %t idx.get_loc_exact(i) 1.33 s ± 46.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I doubt this acceptable?

wait, so this is much slower?

Yes. I think this speed issue is beyound me, so I'm stuck. Maybe it's something wrong with Interval type conversion?

so why don't you just do .get_loc and then check if its exact after?

Ok, that works very nicely: 41.5 µs for get_loc_exact vs 33.1 µs for get_loc.

so there's no speed issue now at all.

jreback · 2018-02-10T22:11:05Z

pandas/core/indexes/interval.py

+            raise KeyError(key)
+        else:
+            exact_matches = self[all_matches] == key
+            if (exact_matches).all():


you can move this up to avoid duplication

jreback · 2018-02-10T22:11:22Z

pandas/core/indexes/interval.py

+            exact_matches = self[all_matches] == key
+            if (exact_matches).all():
+                return all_matches
+            elif (exact_matches).any():


this can also be simplified i think

I don't think so: in the exact_matches.all() case, all_matches might be a slice object, so all_matches[exact_matches] would then fail.

jreback · 2018-02-10T22:11:47Z

pandas/tests/indexes/interval/test_interval.py

+
+        # The below tests if get_loc_exact interferes with caching
+        # used for index.get_loc. See #19353#issuecomment-364295029
+        self.index.get_loc(Interval(0, 1))


assert that this is something

jreback · 2018-02-10T22:12:02Z

pandas/tests/indexing/interval/test_interval.py

@@ -136,7 +136,7 @@ def test_with_slices(self):
            s[Interval(3, 6):]

        expected = s.iloc[3:5]
-        result = s[[Interval(3, 6)]]
+        result = s[Interval(3, 6)]


why is this changed?

This is unneeded, I've changed it back.

topper-123 · 2018-02-10T23:08:14Z

All concerns have been addessed, AFAICS.

jreback · 2018-02-10T23:12:43Z

pandas/core/indexes/interval.py

+        if is_scalar(all_matches):
+            if exact_matches:
+                return all_matches
+            raise KeyError(key)


this can be simplified

topper-123 · 2018-02-11T08:52:48Z

green.

jorisvandenbossche · 2018-02-11T18:10:18Z

Is it really needed to add a method to all Index objects only for use in IntervalIndex ?
Our public API is already huge, not sure this is needed to add even more. We have other examples where Index subclasses have specific type-related methods that are not available for other Index classes.

jreback · 2018-02-11T18:45:20Z

doc/source/whatsnew/v0.23.0.txt

@@ -323,6 +323,7 @@ Other Enhancements

 - ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`)
 - :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`)
+- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`)


prob needs to be added to api.rst

jreback · 2018-02-11T18:45:33Z

pandas/core/indexes/base.py

+    def get_loc_exact(self, key, method=None):
+        """Get integer location, slice or boolean mask for exact
+        matches only.
+


yeah could remove this

jreback · 2018-02-11T18:45:58Z

asv_bench/benchmarks/indexing.py

@@ -229,6 +229,10 @@ def time_getitem_list(self, monotonic):
    def time_loc_list(self, monotonic):
        monotonic.loc[80000:]

+    def time_get_loc_exact(self, monotonic):
+        interval = monotonic.index[80000]
+        monotonic.index.get_loc_exact(interval)


should only run on II

jschendel · 2018-02-11T21:01:44Z

I was just going over some of the tests in test_interval_new.py, and noticed the following comment in test_get_loc_interval, which indicates that the desired future behavior for get_loc is to only return exact matches:

pandas/pandas/tests/indexes/interval/test_interval_new.py

Lines 24 to 31 in 324379c

    
           def test_get_loc_interval(self, idx_side, side): 
        
               idx = IntervalIndex.from_tuples([(0, 1), (2, 3)], closed=idx_side) 
        
               for bound in [[0, 1], [1, 2], [2, 3], [3, 4], 
        
                             [0, 2], [2.5, 3], [-1, 4]]: 
        
                   # if get_loc is supplied an interval, it should only search 
        
                   # for exact matches, not overlaps or covers, else KeyError.

Is this really the desired future behavior? If so, should we try putting the get_loc_exact behavior in this PR directly into the Interval cases for get_loc? Couldn't find any comments specifically about this after doing a quick search through the issue/pr that added the new test.

jorisvandenbossche · 2018-02-11T21:12:00Z

@jschendel Yes, that was also my first thought when I saw this PR, and wanted to say: let's change get_loc to be exact as we discussed that we want it this way.
However, the comment you cite above is when it is passed an interval, then it is required to be exact (currently you can also pass a 'sub-interval'). But for non-interval scalars it still works as well, so eg df.loc[1] will still work to select all rows with intervalindex where the interval contains 1. So for scalars it does not need to be exact (at least, that is what we discussed then).

jorisvandenbossche · 2018-02-11T21:13:12Z

So it is the test below the one you quoted (test_get_loc_scalar):

pandas/pandas/tests/indexes/interval/test_interval_new.py

Lines 57 to 58 in 324379c

    
           # if get_loc is supplied a scalar, it should return the index of 
        
           # the interval which contains the scalar, or KeyError.

jschendel · 2018-02-11T21:49:07Z

@jorisvandenbossche : Yes, maybe it wasn't clear from my comment, but I was suggesting just modifying the Interval specific behavior of get_loc to force exactness. Since we're already writing the logic to do this, I thought it might not be that much additional work to just incorporate it in get_loc as well, which could lead to some simplifications for get_loc_exact. After looking over the code a bit more though, it might not be as straightforward as I initially thought.

jorisvandenbossche · 2018-02-11T21:50:36Z

modifying the Interval specific behavior of get_loc to force exactness.

Ah, yes, I missed that. But that's a good idea

topper-123 mentioned this pull request Jan 22, 2018

BUG/inconsistency: IntervalIndex.get_loc gives a location for non-exact inputs #19349

Open

topper-123 force-pushed the IntervalIndex.get_loc branch 2 times, most recently from 4904ac0 to bf1050d Compare January 22, 2018 22:42

jschendel reviewed Jan 23, 2018

View reviewed changes

jreback added the Interval Interval data type label Jan 23, 2018

topper-123 force-pushed the IntervalIndex.get_loc branch from bf1050d to a30e7c0 Compare January 23, 2018 16:40

topper-123 force-pushed the IntervalIndex.get_loc branch from a30e7c0 to 3b9aae3 Compare January 23, 2018 16:41

topper-123 force-pushed the IntervalIndex.get_loc branch from 3b9aae3 to aaa9a7a Compare February 8, 2018 19:23

topper-123 force-pushed the IntervalIndex.get_loc branch from aaa9a7a to 7590b87 Compare February 9, 2018 21:09

jreback requested changes Feb 10, 2018

View reviewed changes

jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 10, 2018

topper-123 force-pushed the IntervalIndex.get_loc branch 3 times, most recently from fc09b08 to eff6ec6 Compare February 10, 2018 22:09

jreback requested changes Feb 10, 2018

View reviewed changes

topper-123 force-pushed the IntervalIndex.get_loc branch 3 times, most recently from dd3611a to ce138b6 Compare February 10, 2018 22:30

topper-123 changed the title ~~ENH: let IntervalIndex.get_loc look for exact matches only~~ ENH: add IntervalIndex.get_loc_exact to look for exact matches only Feb 10, 2018

jreback requested changes Feb 10, 2018

View reviewed changes

add IntervalIndex.get_loc_exact

44f3471

topper-123 force-pushed the IntervalIndex.get_loc branch from ce138b6 to 44f3471 Compare February 10, 2018 23:38

jreback added 3 commits February 11, 2018 10:06

Merge branch 'master' into PR_TOOL_MERGE_PR_19353

780b41b

clean

613d6f5

typo

9045cc9

jreback requested changes Feb 11, 2018

View reviewed changes

topper-123 mentioned this pull request Feb 18, 2018

ENH: DataFrame.append preserves columns dtype if possible #19021

Merged

4 tasks

topper-123 closed this May 18, 2018

topper-123 deleted the IntervalIndex.get_loc branch May 18, 2018 18:09

ENH: add IntervalIndex.get_loc_exact to look for exact matches only #19353

ENH: add IntervalIndex.get_loc_exact to look for exact matches only #19353

Conversation

topper-123 commented Jan 22, 2018 • edited Loading

jschendel commented Jan 23, 2018 • edited Loading

jschendel Jan 23, 2018 • edited Loading

Choose a reason for hiding this comment

codecov bot commented Jan 23, 2018 • edited Loading

Codecov Report

jreback commented Jan 23, 2018

topper-123 commented Jan 23, 2018

topper-123 commented Jan 23, 2018 • edited Loading

pep8speaks commented Jan 23, 2018 • edited Loading

Comment last updated on February 11, 2018 at 16:12 Hours UTC

jreback commented Jan 24, 2018

topper-123 commented Feb 2, 2018 • edited Loading

topper-123 commented Feb 8, 2018

jschendel commented Feb 9, 2018

topper-123 commented Feb 9, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 Feb 10, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

topper-123 commented Feb 10, 2018

Choose a reason for hiding this comment

topper-123 commented Feb 11, 2018

jorisvandenbossche commented Feb 11, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jschendel commented Feb 11, 2018

jorisvandenbossche commented Feb 11, 2018

jorisvandenbossche commented Feb 11, 2018

jschendel commented Feb 11, 2018

jorisvandenbossche commented Feb 11, 2018

topper-123 commented Jan 22, 2018 •

edited

Loading

jschendel commented Jan 23, 2018 •

edited

Loading

jschendel Jan 23, 2018 •

edited

Loading

codecov bot commented Jan 23, 2018 •

edited

Loading

topper-123 commented Jan 23, 2018 •

edited

Loading

pep8speaks commented Jan 23, 2018 •

edited

Loading

topper-123 commented Feb 2, 2018 •

edited

Loading

topper-123 Feb 10, 2018 •

edited

Loading