ENH: Implement IntervalIndex.is_overlapping #23327

jschendel · 2018-10-25T07:18:13Z

closes ENH: Add IntervalIndex.is_overlapping #23309
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is needed in the get_indexer implementation for the new IntervalIndex behavior, since an overlapping IntervalIndex may return non-unique indices for a given query; it seems cleaner to implement this separately. It also makes sense as a general attribute for an IntervalIndex to have.
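To illustrate why overlap matters for lookups, here is a minimal sketch that uses plain `(left, right)` tuples as a hypothetical stand-in for intervals; it is not the pandas implementation. The helper name `matching_positions` and the right-closed convention are assumptions made for the example.

```python
# Hypothetical stand-in: each interval is a (left, right) tuple, closed on the right.
def matching_positions(intervals, point):
    """Return the positions of every interval (left, right] that contains point."""
    return [i for i, (left, right) in enumerate(intervals) if left < point <= right]


non_overlapping = [(0, 1), (1, 2), (2, 3)]
overlapping = [(0, 2), (1, 3), (2, 4)]

# Without overlaps, a query point lands in at most one interval.
print(matching_positions(non_overlapping, 1.5))  # [1]

# With overlaps, the same query can match several positions.
print(matching_positions(overlapping, 1.5))  # [0, 1]
```

Knowing up front whether the intervals overlap lets a lookup decide whether a single position per query is guaranteed.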

pep8speaks · 2018-10-25T07:18:18Z

Hello @jschendel! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/arrays/interval.py !
There are no PEP8 issues in the file pandas/core/indexes/interval.py !
There are no PEP8 issues in the file pandas/tests/indexes/interval/test_interval.py !
There are no PEP8 issues in the file pandas/tests/indexes/interval/test_interval_tree.py !

jschendel · 2018-10-25T07:26:58Z

pandas/core/indexes/interval.py

@@ -583,6 +639,9 @@ def _maybe_convert_i8(self, key):
        else:
            # DatetimeIndex/TimedeltaIndex
            key_dtype, key_i8 = key.dtype, Index(key.asi8)
+            if key.hasnans:
+                # NaT's i8 value may be viewed as not NA (e.g. is_overlapping)
+                key_i8 = key_i8.where(~key._isnan)


Specifically, without this change a datetime-like IntervalIndex with closed='both' containing two or more instances of NaT would be marked as overlapping due to the NaT's.

Since NaT is converted to -9223372036854775808 during i8 conversion, the IntervalTree previously interpreted this as an Interval of length zero (same start/end), which would include the point in the closed='both' case. So, if two of these occurred they would be interpreted as overlapping at a point.
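The effect described above can be sketched in plain Python, assuming intervals are `(left, right)` integer pairs and that a masked (NA) interval is represented as `(None, None)`. The names `I8_NAT` and `has_overlap_closed_both` are hypothetical and only illustrate the idea; they are not part of pandas.

```python
I8_NAT = -9223372036854775808  # i8 value that NaT is converted to (from the comment above)


def has_overlap_closed_both(intervals):
    """Adjacent-pair overlap check for closed='both' intervals sorted by left.

    Intervals whose endpoints are None are treated as NA and skipped.
    """
    valid = sorted(iv for iv in intervals if iv[0] is not None and iv[1] is not None)
    # With both sides closed, touching endpoints count as an overlap, hence <=.
    return any(cur[0] <= prev[1] for prev, cur in zip(valid, valid[1:]))


# Two NaT entries left as raw i8 values look like two identical zero-length intervals.
raw = [(I8_NAT, I8_NAT), (0, 10), (I8_NAT, I8_NAT), (20, 30)]

# Masking the NaT entries as NA removes the spurious overlap.
masked = [(None, None) if iv == (I8_NAT, I8_NAT) else iv for iv in raw]

print(has_overlap_closed_both(raw))     # True
print(has_overlap_closed_both(masked))  # False
```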

I've added a relevant _maybe_convert_i8 test for this behavior.

codecov · 2018-10-26T04:40:36Z

Codecov Report

Merging #23327 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23327      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         161      161              
  Lines       51483    51487       +4     
==========================================
+ Hits        47525    47529       +4     
  Misses       3958     3958

Flag	Coverage Δ
#multiple	`90.7% <100%> (ø)`	⬆️
#single	`42.43% <25%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/interval.py	`93.02% <ø> (ø)`	⬆️
pandas/core/indexes/interval.py	`94.73% <100%> (+0.03%)`	⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90961f2...dd63492. Read the comment docs.

jschendel · 2018-10-30T16:54:49Z

ping @jreback : this is passing now that #23353 has been merged

jreback · 2018-10-31T12:57:06Z

pandas/_libs/intervaltree.pxi.in

+
+        self._is_overlapping = False
+        for previous, current in zip(self.left_sorter, self.left_sorter[1:]):
+            # overlap if start of current interval < end of previous interval


can't this just be an array comparison?

jschendel · 2018-10-31

Yes, but wouldn't that generally be a bit slower? For the array comparison it'd be something like:

(self.left[self.left_sorter[1:]] < self.right[self.left_sorter[:-1]]).any()

Wouldn't the interior comparison of self.left[...] < self.right[...] be computed in its entirety before the any is evaluated? Whereas the elementwise way it's currently written would terminate after the first successful < evaluation? Or is there something I'm missing here?

jreback · 2018-10-31

yes, but it's worth testing, as it's a bit more idiomatic to use array slicing here and it might be faster

jschendel · 2018-11-27

Looks like you're right. I did some timings with non-cached versions of the loop and array methods (the sorters are still cached, since both methods use them in the same way), aptly named is_overlapping_loop and is_overlapping_array. I opted to switch to the array version since its speed is consistent, and it is faster in a wider variety of scenarios.

When the overlap occurs at the beginning, the loop version is the fastest (pretty obvious):

In [2]: ii = pd.IntervalIndex.from_arrays(
   ...:     np.arange(0, 2*10**5), np.arange(2, 2*10**5 + 2))

In [3]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [4]: %timeit ii._engine.is_overlapping_loop
980 ns ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit ii._engine.is_overlapping_array
2.06 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If no overlaps are present, and both methods need to do full passes, the array method is faster:

In [6]: ii = pd.interval_range(0, 2*10**5)

In [7]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [8]: %timeit ii._engine.is_overlapping_loop
80.8 ms ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit ii._engine.is_overlapping_array
2.19 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If an overlap occurs midway through, the array method is still faster, with the loop method taking about half as long as in the non-overlapping case (makes sense):

In [10]: left = list(range(10**5)) + [10**5 - 2] + list(range(10**5, 2*10**5))
    ...: right = list(range(1, 10**5 + 1)) + [10**5 + 2] + list(range(10**5 + 1, 2*10**5 + 1))
    ...: ii = pd.IntervalIndex.from_arrays(left, right)

In [11]: ii._engine.left_sorter; ii._engine.right_sorter; pass  # cache sorters

In [12]: %timeit ii._engine.is_overlapping_loop
39.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit ii._engine.is_overlapping_array
2.23 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
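For readers following the trade-off in this review thread, the two strategies can be sketched in plain Python. This is a rough illustration under stated assumptions, not the Cython code from the PR. Lists stand in for the arrays, `left_sorter` is assumed to hold the positions that sort `left`, the strict `<` follows the comment in the diff, and `is_overlapping_full` is a hypothetical name for the full-pass variant.

```python
def is_overlapping_loop(left, right, left_sorter):
    """Elementwise check: stops at the first overlapping adjacent pair."""
    for previous, current in zip(left_sorter, left_sorter[1:]):
        # overlap if start of current interval < end of previous interval
        if left[current] < right[previous]:
            return True
    return False


def is_overlapping_full(left, right, left_sorter):
    """Full-pass check: builds every comparison first, then reduces with any()."""
    current = [left[i] for i in left_sorter[1:]]
    previous = [right[i] for i in left_sorter[:-1]]
    return any([c < p for c, p in zip(current, previous)])


def make_sorter(left):
    """Positions that would sort left (hypothetical helper for this sketch)."""
    return sorted(range(len(left)), key=left.__getitem__)


# Small versions of the first two timing scenarios above.
left_a, right_a = [0, 1, 2], [2, 3, 4]  # overlap at the very beginning
left_b, right_b = [0, 1, 2], [1, 2, 3]  # no overlaps

print(is_overlapping_loop(left_a, right_a, make_sorter(left_a)))  # True
print(is_overlapping_full(left_b, right_b, make_sorter(left_b)))  # False
```

Both functions return the same answer. The difference is that the loop can exit early, while the full-pass version always does the same amount of work, which is what makes its timing consistent across scenarios.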

jreback · 2018-11-18T18:43:16Z

can you merge master

jreback · 2018-11-23T03:19:51Z

@jschendel can you rebase

jschendel · 2018-11-27T03:15:02Z

@jreback : rebased and switched to an array based implementation (see #23327 (comment))

jschendel · 2018-11-29T17:16:57Z

ping @jreback

jreback · 2018-11-29T17:22:49Z

thanks @jschendel nice!

jschendel · 2018-11-29T17:26:21Z

Thanks for the quick response to the ping @jreback! Sorry for the delay with this PR; got a bit sidetracked with the 32bit issues.

jschendel added Enhancement Interval Interval data type labels Oct 25, 2018

jschendel added this to the 0.24.0 milestone Oct 25, 2018

jschendel commented Oct 25, 2018

jschendel force-pushed the ii-is-overlapping branch from f73f4c1 to 310f114 Compare October 25, 2018 07:38

jschendel mentioned this pull request Oct 26, 2018

BUG: Fix IntervalTree handling of NaN #23353

Merged

4 tasks

jschendel force-pushed the ii-is-overlapping branch from 310f114 to 4662229 Compare October 26, 2018 04:40

jschendel force-pushed the ii-is-overlapping branch from 4662229 to e0c8d2e Compare October 30, 2018 15:01

jreback requested changes Oct 31, 2018

jreback removed this from the 0.24.0 milestone Nov 25, 2018

jschendel added 2 commits November 26, 2018 17:15

ENH: Implement IntervalIndex.is_overlapping

16464be

switch to array impl

dd63492

jschendel force-pushed the ii-is-overlapping branch from e0c8d2e to dd63492 Compare November 27, 2018 03:13

jschendel mentioned this pull request Nov 28, 2018

BUG: pandas.cut should disallow overlapping IntervalIndex bins #23980

Closed

jreback added this to the 0.24.0 milestone Nov 29, 2018

jreback approved these changes Nov 29, 2018

jreback merged commit 7653a6b into pandas-dev:master Nov 29, 2018

jschendel deleted the ii-is-overlapping branch November 29, 2018 17:26

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH: Implement IntervalIndex.is_overlapping (pandas-dev#23327)

d769750

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

ENH: Implement IntervalIndex.is_overlapping (pandas-dev#23327)

0541576
