-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: add IntervalIndex.get_loc_exact to look for exact matches only #19353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4904ac0
to
bf1050d
Compare
There's a private method that looks like it's supposed to do this: pandas/pandas/core/indexes/interval.py Lines 870 to 880 in 9872d67
It doesn't appear to be working at all though: In [2]: ii = pd.interval_range(0, 4)
In [3]: ii[0]
Out[3]: Interval(0, 1, closed='right')
In [4]: ii.get_loc(ii[0])
Out[4]: 0
In [5]: ii._get_loc_only_exact_matches(ii[0])
---------------------------------------------------------------------------
KeyError: Interval(0, 1, closed='right') The issue appears to be that In [6]: pd.Index(ii._multiindex.values)
Out[6]: Index([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='object') Might be best to implement this in |
# GH19349 | ||
assert self.index.get_loc(Interval(0, 1), method='exact') == 0 | ||
with pytest.raises(KeyError): | ||
self.index.get_loc(Interval(0, 0.5), method='exact') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test that an Interval
that only differs by closed
raises a KeyError
as well, e.g. Interval(0, 1, closed='neither')
?
Codecov Report
@@ Coverage Diff @@
## master #19353 +/- ##
==========================================
- Coverage 91.61% 91.59% -0.03%
==========================================
Files 150 150
Lines 48807 48817 +10
==========================================
- Hits 44716 44712 -4
- Misses 4091 4105 +14
Continue to review full report at Codecov.
|
yes, rather than adding kwargs to |
I can look into the private method. @jreback , do you mean you are negative wrt. >>> ii = pd.interva_range(0, 3)
>>> ii.get_loc(pd.Interval(1,3))
slice(1, 3, None)
>>> ii.get_loc(pd.Interval(1,3), method='exact')
KeyError: ... Above, we've answered two different questions, both reasonable to ask, IMO. |
In The super method is faster, around 13 ms, while the astype method is around 75 ms, so I've chosen the super method. Also, |
bf1050d
to
a30e7c0
Compare
Hello @topper-123! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 11, 2018 at 16:12 Hours UTC |
a30e7c0
to
3b9aae3
Compare
@topper-123 we are not going to change the |
@jreback , adding this keeps things simple IMO: If you look at the doc string for
It's logical that |
3b9aae3
to
aaa9a7a
Compare
I've added a |
I think there are some caching issues when trying to use As an example, when an Using In [1]: import pandas as pd
In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])
In [3]: ii.get_loc(pd.Interval(0, 1))
Out[3]: array([0, 2], dtype=int64)
In [4]: ii.get_loc_exact(pd.Interval(0, 1))
---------------------------------------------------------------------------
TypeError: No matching signature found Using In [1]: import pandas as pd
In [2]: ii = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])
In [3]: ii.get_loc_exact(pd.Interval(0, 1))
Out[3]: array([ True, False, True], dtype=bool)
In [4]: ii.get_loc(pd.Interval(0, 1))
---------------------------------------------------------------------------
AttributeError: 'pandas._libs.index.ObjectEngine' object has no attribute 'get_loc_interval' Not sure if there's a simple workaround here or not. |
Nice catch. I've changed the call to A shame, as the super method would have been faster if your issue would not have surfaced. |
aaa9a7a
to
7590b87
Compare
Returns | ||
------- | ||
loc : int if unique index, slice if monotonic index, else mask | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add Raises section
@@ -1003,6 +1004,47 @@ def get_loc(self, key, method=None): | |||
else: | |||
return self._engine.get_loc(key) | |||
|
|||
def get_loc_exact(self, key, method=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the same method in Index, dispatch to .get_loc
pandas/core/indexes/interval.py
Outdated
-------- | ||
get_loc | ||
""" | ||
return self.astype(object).get_loc(key, method=method) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
self.index.get_loc_exact(Interval(2, 3)) | ||
with pytest.raises(KeyError): | ||
self.index.get_loc_exact(Interval(-1, 0, 'left')) | ||
# See #19353#issuecomment-364295029 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this last test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests if the caching issue showed by @jschendel exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok then write that as a comment. an issue reference is good, but a comment + reference even better. add blank lines in between sections of tests.
@@ -497,6 +497,23 @@ def test_get_loc_interval(self): | |||
pytest.raises(KeyError, self.index.get_loc, | |||
Interval(-1, 0, 'left')) | |||
|
|||
# To be removed, replaced by test_interval_new.py (see #16316, #16386) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding this comment?
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -507,6 +507,7 @@ Other API Changes | |||
- Set operations (union, difference...) on :class:`IntervalIndex` with incompatible index types will now raise a ``TypeError`` rather than a ``ValueError`` (:issue:`19329`) | |||
- :class:`DateOffset` objects render more simply, e.g. "<DateOffset: days=1>" instead of "<DateOffset: kwds={'days': 1}>" (:issue:`19403`) | |||
- :func:`pandas.merge` provides a more informative error message when trying to merge on timezone-aware and timezone-naive columns (:issue:`15800`) | |||
- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to Other Enhancements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add to the api.rst
pandas/core/indexes/interval.py
Outdated
-------- | ||
get_loc | ||
""" | ||
return self.astype(object).get_loc(key, method=method) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems quite inefficient. a) can you add an asv, b) I would do this in the cython code itself that implements get_loc (this can be a further refinement / PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The super method was faster, around 13 ms, while this is around 75 ms, so it's slower.
I don't have experience with cython, so can't take on those changes in get_loc, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine an asv for now is prob best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked into asv and the speed issue.
>>> idx = IntervalIndex.from_breaks(np.arange(1000001))
>>> i = idx[80000]
>>> %t idx.get_loc(i)
25.2 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %t idx.get_loc_exact(i)
1.33 s ± 46.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I doubt this acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, so this is much slower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think this speed issue is beyound me, so I'm stuck. Maybe it's something wrong with Interval type conversion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so why don't you just do .get_loc and then check if its exact after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that works very nicely: 41.5 µs for get_loc_exact
vs 33.1 µs for get_loc
.
so there's no speed issue now at all.
fc09b08
to
eff6ec6
Compare
pandas/core/indexes/interval.py
Outdated
raise KeyError(key) | ||
else: | ||
exact_matches = self[all_matches] == key | ||
if (exact_matches).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can move this up to avoid duplication
pandas/core/indexes/interval.py
Outdated
exact_matches = self[all_matches] == key | ||
if (exact_matches).all(): | ||
return all_matches | ||
elif (exact_matches).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can also be simplified i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so: in the exact_matches.all()
case, all_matches
might be a slice object, so all_matches[exact_matches]
would then fail.
|
||
# The below tests if get_loc_exact interferes with caching | ||
# used for index.get_loc. See #19353#issuecomment-364295029 | ||
self.index.get_loc(Interval(0, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert that this is something
@@ -136,7 +136,7 @@ def test_with_slices(self): | |||
s[Interval(3, 6):] | |||
|
|||
expected = s.iloc[3:5] | |||
result = s[[Interval(3, 6)]] | |||
result = s[Interval(3, 6)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unneeded, I've changed it back.
dd3611a
to
ce138b6
Compare
All concerns have been addessed, AFAICS. |
pandas/core/indexes/interval.py
Outdated
if is_scalar(all_matches): | ||
if exact_matches: | ||
return all_matches | ||
raise KeyError(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be simplified
ce138b6
to
44f3471
Compare
green. |
Is it really needed to add a method to all Index objects only for use in |
@@ -323,6 +323,7 @@ Other Enhancements | |||
|
|||
- ``IntervalIndex.astype`` now supports conversions between subtypes when passed an ``IntervalDtype`` (:issue:`19197`) | |||
- :class:`IntervalIndex` and its associated constructor methods (``from_arrays``, ``from_breaks``, ``from_tuples``) have gained a ``dtype`` parameter (:issue:`19262`) | |||
- New :meth:`IntervalIndex.get_loc_exact` has been added to find exact Interval matches only (:issue:`19349`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob needs to be added to api.rst
def get_loc_exact(self, key, method=None): | ||
"""Get integer location, slice or boolean mask for exact | ||
matches only. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah could remove this
@@ -229,6 +229,10 @@ def time_getitem_list(self, monotonic): | |||
def time_loc_list(self, monotonic): | |||
monotonic.loc[80000:] | |||
|
|||
def time_get_loc_exact(self, monotonic): | |||
interval = monotonic.index[80000] | |||
monotonic.index.get_loc_exact(interval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should only run on II
I was just going over some of the tests in pandas/pandas/tests/indexes/interval/test_interval_new.py Lines 24 to 31 in 324379c
Is this really the desired future behavior? If so, should we try putting the |
@jschendel Yes, that was also my first thought when I saw this PR, and wanted to say: let's change |
So it is the test below the one you quoted ( pandas/pandas/tests/indexes/interval/test_interval_new.py Lines 57 to 58 in 324379c
|
@jorisvandenbossche : Yes, maybe it wasn't clear from my comment, but I was suggesting just modifying the |
Ah, yes, I missed that. But that's a good idea |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR add a 'exact' option to the parameter
method
onIntervalIndex.get_loc
for accepting exact matches only.In addition to it's direct advantage, it will later make it possible to make a PR to allow
IntervalIndex.get_indexer
calls with the other index having non-Interval values, and hence allow mixing mix Interval with non-intervals inpd.Index(..., dtype=object)
instances. This would fix the remaining issue with #19021.