Skip to content

ENH: Implement IntervalIndex.is_overlapping #23327

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 29, 2018

Conversation

jschendel
Copy link
Member

This is needed in the get_indexer implementation for the new IntervalIndex behavior, as an overlapping IntervalIndex may return non-unique indices for a given query; seems cleaner to implement separately. Also makes sense as a general attribute for an IntervalIndex to have.

@jschendel jschendel added Enhancement Interval Interval data type labels Oct 25, 2018
@jschendel jschendel added this to the 0.24.0 milestone Oct 25, 2018
@pep8speaks
Copy link

Hello @jschendel! Thanks for submitting the PR.

@@ -583,6 +639,9 @@ def _maybe_convert_i8(self, key):
else:
# DatetimeIndex/TimedeltaIndex
key_dtype, key_i8 = key.dtype, Index(key.asi8)
if key.hasnans:
# NaT's i8 value may be viewed as not NA (e.g. is_overlapping)
key_i8 = key_i8.where(~key._isnan)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, without this change a datetime-like IntervalIndex with closed='both' containing two or more instances of NaT would be marked as overlapping due to the NaT's.

Since NaT is converted to -9223372036854775808 during i8 conversion, the IntervalTree previously interpreted this as an Interval of length zero (same start/end), which would include the point in the closed='both' case. So, if two of these occurred they would be interpreted as overlapping at a point.

I've added a relevant _maybe_convert_i8 test for this behavior.

@codecov
Copy link

codecov bot commented Oct 26, 2018

Codecov Report

Merging #23327 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23327      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51483    51487       +4     
==========================================
+ Hits        47525    47529       +4     
  Misses       3958     3958
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.43% <25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/interval.py 93.02% <ø> (ø) ⬆️
pandas/core/indexes/interval.py 94.73% <100%> (+0.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90961f2...dd63492. Read the comment docs.

@jschendel
Copy link
Member Author

ping @jreback : this is passing now that #23353 has been merged


self._is_overlapping = False
for previous, current in zip(self.left_sorter, self.left_sorter[1:]):
# overlap if start of current interval < end of previous interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't this just be an array comparison?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but wouldn't that generally be a bit slower? For the array comparison it'd be something like:

(self.left[self.left_sorter[1:]] < self.right[self.left_sorter[:-1]]).any()

Wouldn't the interior comparison of self.left[...] < self.right[...] be computed in it's entirety before the any is evaluated? Whereas the elementwise way it's currently written would terminate after the first successful < evaluation? Or is there something I'm missing here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but it’s worth testing
as it’s a bit more idiomatic to use array slicing here and it might be faster

Copy link
Member Author

@jschendel jschendel Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're right. I did some timings with some non-cached versions of the loop and array methods (still cached the sorters since both methods use them in the same way), aptly named is_overlapping_loop and is_overlapping_array. Opted to switch to the array version since it's speed is consistent, and faster in a wider variety of scenarios.

When the overlap occurs at the beginning, the loop version is the fastest (pretty obvious):

In [2]: ii = pd.IntervalIndex.from_arrays(
   ...:     np.arange(0, 2*10**5), np.arange(2, 2*10**5 + 2))

In [3]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [4]: %timeit ii._engine.is_overlapping_loop
980 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit ii._engine.is_overlapping_array
2.06 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If no overlaps are present, and both methods need to do full passes, the array method is faster:

In [6]: ii = pd.interval_range(0, 2*10**5)

In [7]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [8]: %timeit ii._engine.is_overlapping_loop
80.8 ms ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit ii._engine.is_overlapping_array
2.19 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If an overlap occurs midway through the array method is still faster, with the loop method taking about half the time as the non-overlapping version (makes sense):

In [10]: left = list(range(10**5)) + [10**5 - 2] + list(range(10**5, 2*10**5))
    ...: right = list(range(1, 10**5 + 1)) + [10**5 + 2] + list(range(10**5 + 1, 2*10**5 + 1))
    ...: ii = pd.IntervalIndex.from_arrays(left, right)

In [11]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [12]: %timeit ii._engine.is_overlapping_loop
39.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit ii._engine.is_overlapping_array
2.23 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

can you merge master

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

@jschendel can you rebase

@jreback jreback removed this from the 0.24.0 milestone Nov 25, 2018
@jschendel
Copy link
Member Author

@jreback : rebased and switched to an array based implementation (see #23327 (comment))

@jschendel
Copy link
Member Author

ping @jreback

@jreback jreback added this to the 0.24.0 milestone Nov 29, 2018
@jreback jreback merged commit 7653a6b into pandas-dev:master Nov 29, 2018
@jreback
Copy link
Contributor

jreback commented Nov 29, 2018

thanks @jschendel nice!

@jschendel
Copy link
Member Author

Thanks for the quick response to the ping @jreback! Sorry for the delay with this PR; got a bit sidetracked with the 32bit issues.

@jschendel jschendel deleted the ii-is-overlapping branch November 29, 2018 17:26
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add IntervalIndex.is_overlapping
3 participants