-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Implement IntervalIndex.is_overlapping #23327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @jschendel! Thanks for submitting the PR.
|
@@ -583,6 +639,9 @@ def _maybe_convert_i8(self, key): | |||
else: | |||
# DatetimeIndex/TimedeltaIndex | |||
key_dtype, key_i8 = key.dtype, Index(key.asi8) | |||
if key.hasnans: | |||
# NaT's i8 value may be viewed as not NA (e.g. is_overlapping) | |||
key_i8 = key_i8.where(~key._isnan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, without this change a datetime-like IntervalIndex
with closed='both'
containing two or more instances of NaT
would be marked as overlapping due to the NaT
's.
Since NaT
is converted to -9223372036854775808 during i8 conversion, the IntervalTree
previously interpreted this as an Interval
of length zero (same start/end), which would include the point in the closed='both'
case. So, if two of these occurred they would be interpreted as overlapping at a point.
I've added a relevant _maybe_convert_i8
test for this behavior.
f73f4c1
to
310f114
Compare
310f114
to
4662229
Compare
Codecov Report
@@ Coverage Diff @@
## master #23327 +/- ##
==========================================
+ Coverage 92.31% 92.31% +<.01%
==========================================
Files 161 161
Lines 51483 51487 +4
==========================================
+ Hits 47525 47529 +4
Misses 3958 3958
Continue to review full report at Codecov.
|
4662229
to
e0c8d2e
Compare
pandas/_libs/intervaltree.pxi.in
Outdated
|
||
self._is_overlapping = False | ||
for previous, current in zip(self.left_sorter, self.left_sorter[1:]): | ||
# overlap if start of current interval < end of previous interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't this just be an array comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but wouldn't that generally be a bit slower? For the array comparison it'd be something like:
(self.left[self.left_sorter[1:]] < self.right[self.left_sorter[:-1]]).any()
Wouldn't the interior comparison of self.left[...] < self.right[...]
be computed in it's entirety before the any
is evaluated? Whereas the elementwise way it's currently written would terminate after the first successful <
evaluation? Or is there something I'm missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but it’s worth testing
as it’s a bit more idiomatic to use array slicing here and it might be faster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you're right. I did some timings with some non-cached versions of the loop and array methods (still cached the sorters since both methods use them in the same way), aptly named is_overlapping_loop
and is_overlapping_array
. Opted to switch to the array version since it's speed is consistent, and faster in a wider variety of scenarios.
When the overlap occurs at the beginning, the loop version is the fastest (pretty obvious):
In [2]: ii = pd.IntervalIndex.from_arrays(
...: np.arange(0, 2*10**5), np.arange(2, 2*10**5 + 2))
In [3]: ii._engine.left_sorter; ii._engine.right_sorter; pass # cache sorters
In [4]: %timeit ii._engine.is_overlapping_loop
980 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit ii._engine.is_overlapping_array
2.06 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If no overlaps are present, and both methods need to do full passes, the array method is faster:
In [6]: ii = pd.interval_range(0, 2*10**5)
In [7]: ii._engine.left_sorter; ii._engine.right_sorter; pass # cache sorters
In [8]: %timeit ii._engine.is_overlapping_loop
80.8 ms ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit ii._engine.is_overlapping_array
2.19 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If an overlap occurs midway through the array method is still faster, with the loop method taking about half the time as the non-overlapping version (makes sense):
In [10]: left = list(range(10**5)) + [10**5 - 2] + list(range(10**5, 2*10**5))
...: right = list(range(1, 10**5 + 1)) + [10**5 + 2] + list(range(10**5 + 1, 2*10**5 + 1))
...: ii = pd.IntervalIndex.from_arrays(left, right)
In [11]: ii._engine.left_sorter; ii._engine.right_sorter; pass # cache sorters
In [12]: %timeit ii._engine.is_overlapping_loop
39.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit ii._engine.is_overlapping_array
2.23 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
can you merge master |
@jschendel can you rebase |
e0c8d2e
to
dd63492
Compare
@jreback : rebased and switched to an array based implementation (see #23327 (comment)) |
ping @jreback |
thanks @jschendel nice! |
Thanks for the quick response to the ping @jreback! Sorry for the delay with this PR; got a bit sidetracked with the 32bit issues. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This is needed in the
get_indexer
implementation for the newIntervalIndex
behavior, as an overlappingIntervalIndex
may return non-unique indices for a given query; seems cleaner to implement separately. Also makes sense as a general attribute for anIntervalIndex
to have.