Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975

alexlenail · 2017-12-28T19:18:22Z

part 2/2 for Confusing (possibly buggy) IntervalIndex behavior #16316
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff

Working on parameterizing the tests. Will ping when ready for review.

pep8speaks · 2017-12-28T19:18:24Z

Hello @zfrenchee! Thanks for updating the PR.

In the file pandas/tests/indexes/interval/test_interval_overlaps_covers.py, following are the PEP8 issues :

Line 32:41: E128 continuation line under-indented for visual indent
Line 53:41: E128 continuation line under-indented for visual indent

Comment last updated on April 02, 2018 at 20:45 Hours UTC

jschendel · 2017-12-28T20:01:34Z

pandas/tests/indexes/interval/test_interval_new.py

+        # class Interval:
+        #     def covers(self, other: Interval) -> bool
+
+        assert Interval(1, 3).covers(Interval(1.5, 2.5))


I think Interval level tests would be more appropriate in pandas/tests/scalar/test_interval.py, which is where the other Interval specific tests are located.

jreback · 2017-12-28T20:03:21Z

pandas/tests/indexes/interval/test_interval_new.py

+        assert Interval(1, 3, closed='both').overlaps(Interval(1, 3, closed='both'))
+
+        assert Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='right'))
+        assert not Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='left'))


btw I would be very happy to have these in a different file, e.g.

pandas/tests/indexes/interval/test_overlap_covers.py or whatever (excluding @jschendel commet that the Interval tests need to go in the scalar tests); we can make a sub-dir there as well I think.

alexlenail · 2017-12-30T23:32:16Z

Ready for a look-over on structure @jreback @jschendel. Are the tests in the right place / well structured? Are there missing tests / cases?

jreback · 2017-12-30T23:54:38Z

pandas/tests/scalar/test_interval_new.py

+            }
+        }
+
+        assert ivl.overlaps(other) == should_overlap[other_side][ivl_side][ivl_range] == other.overlaps(ivl)


all of the tests from here on down should go in: pandas/tests/indexes/interval/test_overlaps.py (new file)

@jreback do you still want to see this?

jreback · 2017-12-30T23:55:44Z

what is the reason for both covers and overlaps?

alexlenail · 2017-12-31T00:01:27Z

@jreback (from @shoyer):

An interval covers another interval if all points in the second interval are found in the first interval.

An interval overlaps another interval if there exist any points found in both intervals.

-- (#16316 (comment))

I personally am interested in overlaps, but I can imagine people might want to ask what intervals are covered by their IntervalIndex or vice-versa. Does that answer the question?

jschendel · 2018-01-01T01:49:50Z

what is the reason for both covers and overlaps?

FWIW other implementations of Interval have both methods, e.g. the intervaltree package on pypi, and range types in postgres, albeit with a slightly different naming convention for covers. The intervaltree package uses a contains_interval method for covers. In postgres it's just an operator, @>, which is described in the docs as "contains range". I suppose we could maybe rename covers to be more in line with these, but I don't really have a strong opinion either way.

Is there any utility in adding a strict kwarg to covers? By that I basically mean strict=False is existing behavior (default), strict=True is existing behavior and no shared endpoints:

In [1]: iv = Interval(0, 1)

In [2]: iv.covers(iv, strict=False)
Out[2]: True

In [3]: iv.covers(iv, strict=True)
Out[3]: False

Could probably wait to see if there's demand for strict prior to implementing, unless there's a reason anyone wants it immediately implemented.

jreback · 2018-01-01T14:19:55Z

rebase and push again, fixed some hanging by Travis CI

codecov · 2018-01-01T17:00:31Z

Codecov Report

Merging #18975 into master will decrease coverage by 0.31%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #18975      +/-   ##
==========================================
- Coverage   91.82%    91.5%   -0.32%     
==========================================
  Files         152      150       -2     
  Lines       49248    48739     -509     
==========================================
- Hits        45222    44600     -622     
- Misses       4026     4139     +113

Flag	Coverage Δ
#multiple	`89.87% <ø> (-0.34%)`	⬇️
#single	`41.73% <ø> (-0.17%)`	⬇️

Impacted Files	Coverage Δ
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/plotting/_compat.py	`62% <0%> (-28.91%)`	⬇️
pandas/io/parquet.py	`52.29% <0%> (-19.51%)`	⬇️
pandas/core/missing.py	`84.3% <0%> (-7.35%)`	⬇️
pandas/plotting/_timeseries.py	`60.82% <0%> (-4.49%)`	⬇️
pandas/core/ops.py	`93.11% <0%> (-3.25%)`	⬇️
pandas/core/reshape/tile.py	`90.25% <0%> (-3.12%)`	⬇️
pandas/io/html.py	`85.98% <0%> (-2.81%)`	⬇️
pandas/io/formats/format.py	`96.24% <0%> (-2.01%)`	⬇️
pandas/core/internals.py	`94.53% <0%> (-1.01%)`	⬇️
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cdfce2b...d28f932. Read the comment docs.

gfyoung · 2018-01-02T09:18:00Z

@zfrenchee : Looks like you have some lint errors. Otherwise, Travis looks happy to me.

alexlenail · 2018-01-02T15:12:43Z

I'd like to revisit the interface we agreed upon (initially from @shoyer) which this PR introduces tests for. The interface was

class Interval:
    def covers(self, other: Interval) -> bool
    def covers(self, other: IntervalIndex) -> IntegerArray1D
    def overlaps(self, other: Interval) -> bool
    def overlaps(self, other: IntervalIndex) -> IntegerArray1D

class IntervalIndex:
    def covers(self, other: Interval) -> IntegerArray1D
    def covers(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]
    def overlaps(self, other: Interval) -> IntegerArray1D
    def overlaps(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]

This is the last call for modifications to that interface in this PR. @jreback @jorisvandenbossche @gfyoung . What do you think?

shoyer · 2018-01-02T17:18:00Z

Thinking about this again: there's some ambiguity about whether coverage for IntervalIndex.covers(IntervalIndex) is:

elementwise: does each interval in the first index cover each corresponding interval?
all: does each interval in the first index cover all of the second intervals?
any: does each interval in the first index cover any of the second intervals?

Most pandas operations work elementwise, so arguably that's clearest. That would suggest we should only return booleans/boolean-arrays, i.e.,

class Interval:
    def covers(self, other: Interval) -> bool
    def covers(self, other: IntervalIndex) -> BoolArray1D
    def overlaps(self, other: Interval) -> bool
    def overlaps(self, other: IntervalIndex) -> BoolArray1D

class IntervalIndex:
    def covers(self, other: Interval) -> BoolArray1D
    def covers(self, other: IntervalIndex) -> BoolArray1D
    def overlaps(self, other: Interval) -> BoolArray1D
    def overlaps(self, other: IntervalIndex) -> BoolArray1D

If we adopted this choice, we would save "covers all", "covers any", "overlaps all" and "overlaps any" for another method, which might be implemented in the future (if necessary).

alexlenail · 2018-01-03T14:15:34Z

Thanks for those comments @shoyer.

I'm personally rather attached to your original suggestion, because I think all these other methods can be expressed with short recipes derived from the methods in their current form in the spec. I'm not sure anyone will really be asking the question "which intervals in IntervalIndex A cover every interval in IntervalIndex B?" and even if they did want to ask that, using the current spec they could get indices back and then check if that index set is complete (2 lines instead of 1). The reverse doesn't work -- If we provide these methods, we can't ask the more basic question "which intervals from A cover which intervals from B?"

Are you thinking of a particular other method in the current pandas API to base your suggestion from? If so we have to weigh consistency as a factor. Let me know what you think.

…- same for .covers

alexlenail · 2018-01-11T03:54:51Z

@pep8speaks

alexlenail · 2018-01-11T04:00:15Z

@jreback @gfyoung I believe this is ready for merge.

jreback

you need to have Interval tests in pandas/tests/scalar/, but IntervalIndex tests in pandas/tests/indexes/interval

jreback · 2018-01-11T11:55:15Z

pandas/tests/scalar/test_interval_new.py

+            }
+        }
+
+        assert ivl.overlaps(other) == \


these are very hard to read and debug much better to do

result = ... expected = ... assert result == expected

jreback · 2018-01-13T20:40:39Z

@zfrenchee your upstream may be called something else, use git remote -v to see (our docs say to call it this, which is why we put this here)

alexlenail · 2018-01-13T21:46:14Z

@jreback how do I configure autopep so it conforms to pandas style? When I invoke it now, it makes >79 char lines.

jreback · 2018-01-13T21:56:41Z

I think you have to tell autopep to fix the lines, its not automatic:

autopep8 --max-line-length

note that we generally don't like to use autopep for line breaking. we do this manually (because autopep8 produces weird output).

Its not that much code you can easily fix it manually.

alexlenail · 2018-01-20T13:09:49Z

@jreback green.

shoyer · 2018-01-20T18:26:03Z

Before we merge this, can someone else weigh in on the API options here? See #18975 (comment) and the comment below

jorisvandenbossche · 2018-01-20T21:14:16Z

Most pandas operations work elementwise, so arguably that's clearest. That would suggest we should only return booleans/boolean-arrays, i.e.,

I agree with this. For me it looks very strange that a Index operation returns a tuple.

But I find it very hard to think about this without code. @zfrenchee could you write up a few hypothetical example of how you would use it? To have a bit more the idea how it would look like.

As I asked for in the other PR as well, I think it would also be helpful to clear out the design to actually write out the reference docs (the docstrings of those methods). We can eg add them now as a multi-line comment at the top of the test files. BTW, I think it would also be rather easy to actually implement the methods (a basic, possibly non-performant implementation). That might make it easier to experiment with.

jschendel · 2018-01-20T21:38:01Z

can someone else weigh in on the API options here

I agree with @shoyer that returning booleans/boolean-arrays makes more sense for overlaps and covers. I plan to implement an interval accessor (#16401) in the near future now that the new accessor interface PR has been merged. I haven't looked too closely at the implementation, but I think returning booleans would enable indexing syntax along the lines of df[df.interval_col.iv.overlaps(other)]. At the very least it seems like it would make this behavior consistent with the accessor.

That being said, I'm not opposed to adding the functionality suggested by @zfrenchee, but it might be better served with a different name. Perhaps with a prefix along the lines of get_*, e.g. IntervalIndex.get_overlaps(other), or maybe even a more generic method with a how parameter, e.g. IntervalIndex.something_generic(other, how='overlaps|covers|...').

Unrelated, but one thing that stands out to me is Interval.overlaps(IntervalIndex) (likewise covers). Looks like this is identical to IntervalIndex.overlaps(Interval), so having both seems a little redundant. Syntactically Interval.overlaps(IntervalIndex) feels a little strange to me, since you have a method on a scalar taking a non-scalar argument. I'd be fine removing Interval.overlaps(IntervalIndex) and just implementing IntervalIndex.overlaps(Interval) (likewise covers), but I don't have particularly strong feelings either way.

alexlenail · 2018-01-22T15:24:49Z

But I find it very hard to think about this without code. @zfrenchee could you write up a few hypothetical example of how you would use it? To have a bit more the idea how it would look like.

I'm hoping to do genome arithmetic, like BEDtools. Check out the BEDtools tools, e.g. intersect, merge, and closest. With the current implementation of overlaps (in order to do something like bedtools intersect) I would do

one, two = idx1.overlaps(idx2)
return pd.concat((idx1[one], idx2[two]), axis=1)

I think that the issue with the boolean overlaps function is that it's hard to trace what overlaps what. E.g. consider:

idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])

then if idx1.overlaps(idx2) returns a boolean mask on idx1....

pd.concat((idx1[idx1.overlaps(idx2)], idx2[idx2.overlaps(idx2)]))

misses half of the overlaps. Do you see what I mean? Let me know if that needs to be clearer. Basically, if it were booleans, you wouldn't necessarily get all pairs of overlaps out from the overlaps function. I think .overlaps() should be able to tell you what all the overlaps are.

I agree with this. For me it looks very strange that a Index operation returns a tuple.

I can't deny it looks off. However, for this index, for this operation, it might be the only thing that makes sense? Very open to alternatives that can find all the overlaps...

Unrelated, but one thing that stands out to me is Interval.overlaps(IntervalIndex) (likewise covers). Looks like this is identical to IntervalIndex.overlaps(Interval), so having both seems a little redundant. Syntactically Interval.overlaps(IntervalIndex) feels a little strange to me, since you have a method on a scalar taking a non-scalar argument.

This is a good point. We discussed this a little. I also don't have strong feelings, and agree that if we were to remove one of them, Interval.overlaps(IntervalIndex) would be the one. I'd love to find some other useful functionality for that function signature, but I wasn't able to come up with one. I feel very weakly that if the choices are keep it or remove it, I would keep it for now.

@shoyer @jreback @jschendel @jorisvandenbossche

alexlenail · 2018-01-27T20:22:51Z

bump @jreback @shoyer @jschendel @jorisvandenbossche

jorisvandenbossche · 2018-01-29T22:08:04Z

@zfrenchee I am still trying to understand the logic behind one, two = idx1.overlaps(idx2) (I mean what it does exactly). Can you show what the output would be for the example idx1 and idx2 ?

jorisvandenbossche · 2018-01-29T22:09:30Z

(sidenote: if we decide something like this it too complicated or inconsistent for as a method, we should make sure it is not too hard to write a function that does that)

alexlenail · 2018-01-29T22:15:47Z

@jorisvandenbossche: @shoyer 's original API plan, which I add tests for in this PR:

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
([0, 0, 1, 1], [0, 1, 0, 1])

as opposed to

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
[true, true]

alexlenail · 2018-01-29T22:17:48Z

I'm alright renaming the method .find_overlaps if that reduces the confusion for not returning boolean masks, or even (maybe) setting this as a method on dataframes with IntervalIndex indexes, rather than on the indexes themselves... I just really want this functionality!

jschendel · 2018-01-29T23:00:32Z

I am still trying to understand the logic behind one, two = idx1.overlaps(idx2) (I mean what it does exactly)

@jorisvandenbossche : IIUC the "plain english" interpretation is, looking at the two lists pairwise, specifically which interval in idx1 overlaps which interval in idx2. For example, in the output that @zfrenchee just posted:

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
([0, 0, 1, 1], [0, 1, 0, 1])

The first pairwise element, (0, 0), says that idx1[0] overlaps idx2[0]. And likewise for the other pairs, (0, 1), (1, 0), and (1, 1).

alexlenail · 2018-01-29T23:01:52Z

Thanks @jschendel. Sorry if I wasn't clear, @jorisvandenbossche.

jorisvandenbossche · 2018-01-29T23:13:23Z

OK, so you basically do a cross product combination, and then filter for the overlapping ones? So the length of the output varies depending on the data.
Personally I think we should not call that overlaps, but another method like you propose might be a good idea.

(this discussion is another reason I think it would be useful to actually write good docstrings for the functions your propose or even proof of concept implementation, that will makes things a lot easier to discuss)

jorisvandenbossche · 2018-01-29T23:17:06Z

BTW, you could see this as a 'join' based on the overlap. And you could then also join based on other ops like within or contains (like with spatial joins).

Would it actually be more useful to directly join two frames based on this? Instead of providing the indices with which you can then do the concat manually? (or are there case you rather want the indices and not directly the merged frames?)

alexlenail · 2018-01-30T15:03:16Z

@jorisvandenbossche

Would it actually be more useful to directly join two frames based on this? Instead of providing the indices with which you can then do the concat manually? (or are there case you rather want the indices and not directly the merged frames?)

I think there could be use cases for getting back the indices without doing the join, so I would vote to keep this as a two-step process. You might want to iterate through the zip of the result, and ask questions about each overlaps. You might want to just look at one side or the other. I personally will mostly likely just be using these functions solely for join operations via concat, but I think it's quite feasible others would have different uses for this.

alexlenail · 2018-01-30T15:07:14Z

Personally I think we should not call that overlaps, but another method like you propose might be a good idea.

I'm happy to name it anything, but I can't think of a better name: The problem this method solves is finding the overlaps between two lists of intervals, i.e. two IntervalIndexes. What else would you call it?

alexlenail · 2018-02-05T20:09:36Z

@jschendel @jorisvandenbossche @jreback

jorisvandenbossche · 2018-02-09T14:48:22Z

I'm happy to name it anything, but I can't think of a better name: The problem this method solves is finding the overlaps between two lists of intervals, i.e. two IntervalIndexes. What else would you call it?

It indeed does an overlap operation, but there are also other things for which this name would be the "obvious" name. To start with simply pd.IntervalIndex(..).overlaps(single_interval), which would logically return a single boolean array.
And given consistency in our API, it is IMO more logical to call that .overlaps.

Your proposal checks the overlap for all combinations, so we might use that notion in the name. Or something with 'join' as I mentioned above.

Another option would aso be to not have this a a method, but as a function. I would personally like that more (also given the return result), the only problem is then where to put it in the namespace.

jorisvandenbossche · 2018-02-09T14:49:10Z

See also my comment above #18975 (comment)

jreback · 2018-02-10T16:27:55Z

pandas/tests/indexes/interval/test_interval_overlaps_covers.py

+        tm.assert_numpy_array_equal(lidx, lidx_expected)
+        tm.assert_numpy_array_equal(ridx, ridx_expected)
+
+    def test_intervalIndex_covers_intervalIndex(self):


can you parameterize

jreback · 2018-02-10T16:28:02Z

pandas/tests/indexes/interval/test_interval_overlaps_covers.py

+        expected = (np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2]))
+        self._compare_tuple_of_numpy_array(result, expected)
+
+    def test_intervalIndex_overlaps_intervalIndex(self):


jreback · 2018-02-10T16:33:38Z

pandas/tests/scalar/test_interval_new.py

+                                             ivl_range):
+
+        # class Interval:
+        #     def overlaps(self, other: IntervalIndex) -> IntegerArray1D


the II tests need to go in pandas/tests/indexes/interval/test_interval_new.py

jreback · 2018-02-10T16:34:12Z

pandas/tests/scalar/test_interval_new.py

+        ivl = Interval(1, 3, closed=ivl_side)
+        other = Interval(1, 3, closed=oth_side)
+
+        should_cover = {


would like to ideally enocde these results IN the parameterization if possible.

jreback · 2018-09-25T15:40:25Z

we had a lot of good discussion about this, but is pretty stale now.

add back old overlaps / covers tests

fdd289d

jschendel reviewed Dec 28, 2017

View reviewed changes

jreback reviewed Dec 28, 2017

View reviewed changes

jreback added Interval Interval data type Testing pandas testing functions or related to the test suite labels Dec 28, 2017

update

3ea2191

alexlenail mentioned this pull request Dec 30, 2017

Confusing (possibly buggy) IntervalIndex behavior #16316

Closed

Tiny restructuring

d781820

jreback reviewed Dec 30, 2017

View reviewed changes

jreback closed this Dec 30, 2017

jreback reopened this Dec 30, 2017

Alexander Lenail added 2 commits January 1, 2018 11:59

tiny change

d62ac67

Merge remote-tracking branch 'pandas-dev/master'

a422725

Alexander Lenail added 3 commits January 3, 2018 09:30

collapse ivl.overlaps(idx) and idx.overlaps(ivl) into a single test -…

9f9ce2f

…- same for .covers

few random changes

b17f260

try to be compliant with pep

9750335

jreback requested changes Jan 11, 2018

View reviewed changes

Alexander Lenail added 2 commits January 19, 2018 18:07

free of lint errors?

fda5d5c

now looks totally free

5686b1d

jreback requested changes Feb 10, 2018

View reviewed changes

more parameterization

d28f932

jreback closed this Sep 25, 2018

jschendel mentioned this pull request Oct 2, 2018

ENH: Implement overlaps method for Interval-like #22939

Merged

3 tasks

sterlinm mentioned this pull request Aug 14, 2021

ENH: IntervalArray/IntervalIndex Range Joins #43031

Open

Uh oh!

Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975

Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975

Uh oh!

Conversation

alexlenail commented Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pep8speaks commented Dec 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated on April 02, 2018 at 20:45 Hours UTC

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alexlenail commented Dec 30, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 30, 2017

Uh oh!

alexlenail commented Dec 31, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jschendel commented Jan 1, 2018

Uh oh!

jreback commented Jan 1, 2018

Uh oh!

codecov bot commented Jan 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gfyoung commented Jan 2, 2018

Uh oh!

alexlenail commented Jan 2, 2018

Uh oh!

shoyer commented Jan 2, 2018

Uh oh!

alexlenail commented Jan 3, 2018

Uh oh!

alexlenail commented Jan 11, 2018

Uh oh!

alexlenail commented Jan 11, 2018

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Jan 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexlenail commented Jan 13, 2018

Uh oh!

jreback commented Jan 13, 2018

Uh oh!

alexlenail commented Jan 20, 2018

Uh oh!

shoyer commented Jan 20, 2018

Uh oh!

jorisvandenbossche commented Jan 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jschendel commented Jan 20, 2018

Uh oh!

alexlenail commented Jan 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexlenail commented Jan 27, 2018

Uh oh!

jorisvandenbossche commented Jan 29, 2018

Uh oh!

jorisvandenbossche commented Jan 29, 2018

Uh oh!

alexlenail commented Jan 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexlenail commented Dec 28, 2017 •

edited

Loading

pep8speaks commented Dec 28, 2017 •

edited

Loading

alexlenail commented Dec 31, 2017 •

edited

Loading

codecov bot commented Jan 1, 2018 •

edited

Loading

jreback commented Jan 13, 2018 •

edited

Loading

jorisvandenbossche commented Jan 20, 2018 •

edited

Loading

alexlenail commented Jan 22, 2018 •

edited

Loading

alexlenail commented Jan 29, 2018 •

edited

Loading