Skip to content

Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from

Conversation

alexlenail
Copy link
Contributor

@alexlenail alexlenail commented Dec 28, 2017

Working on parameterizing the tests. Will ping when ready for review.

@pep8speaks
Copy link

pep8speaks commented Dec 28, 2017

Hello @zfrenchee! Thanks for updating the PR.

Line 32:41: E128 continuation line under-indented for visual indent
Line 53:41: E128 continuation line under-indented for visual indent

Comment last updated on April 02, 2018 at 20:45 Hours UTC

# class Interval:
# def covers(self, other: Interval) -> bool

assert Interval(1, 3).covers(Interval(1.5, 2.5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Interval level tests would be more appropriate in pandas/tests/scalar/test_interval.py, which is where the other Interval specific tests are located.

assert Interval(1, 3, closed='both').overlaps(Interval(1, 3, closed='both'))

assert Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='right'))
assert not Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='left'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I would be very happy to have these in a different file, e.g.

pandas/tests/indexes/interval/test_overlap_covers.py or whatever (excluding @jschendel commet that the Interval tests need to go in the scalar tests); we can make a sub-dir there as well I think.

@jreback jreback added Interval Interval data type Testing pandas testing functions or related to the test suite labels Dec 28, 2017
@alexlenail
Copy link
Contributor Author

Ready for a look-over on structure @jreback @jschendel. Are the tests in the right place / well structured? Are there missing tests / cases?

}
}

assert ivl.overlaps(other) == should_overlap[other_side][ivl_side][ivl_range] == other.overlaps(ivl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of the tests from here on down should go in: pandas/tests/indexes/interval/test_overlaps.py (new file)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback do you still want to see this?

@jreback
Copy link
Contributor

jreback commented Dec 30, 2017

what is the reason for both covers and overlaps?

@jreback jreback closed this Dec 30, 2017
@jreback jreback reopened this Dec 30, 2017
@alexlenail
Copy link
Contributor Author

alexlenail commented Dec 31, 2017

@jreback (from @shoyer):

An interval covers another interval if all points in the second interval are found in the first interval.

An interval overlaps another interval if there exist any points found in both intervals.

-- (#16316 (comment))

I personally am interested in overlaps, but I can imagine people might want to ask what intervals are covered by their IntervalIndex or vice-versa. Does that answer the question?

@jschendel
Copy link
Member

what is the reason for both covers and overlaps?

FWIW other implementations of Interval have both methods, e.g. the intervaltree package on pypi, and range types in postgres, albeit with a slightly different naming convention for covers. The intervaltree package uses a contains_interval method for covers. In postgres it's just an operator, @>, which is described in the docs as "contains range". I suppose we could maybe rename covers to be more in line with these, but I don't really have a strong opinion either way.

Is there any utility in adding a strict kwarg to covers? By that I basically mean strict=False is existing behavior (default), strict=True is existing behavior and no shared endpoints:

In [1]: iv = Interval(0, 1)

In [2]: iv.covers(iv, strict=False)
Out[2]: True

In [3]: iv.covers(iv, strict=True)
Out[3]: False

Could probably wait to see if there's demand for strict prior to implementing, unless there's a reason anyone wants it immediately implemented.

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

rebase and push again, fixed some hanging by Travis CI

@codecov
Copy link

codecov bot commented Jan 1, 2018

Codecov Report

Merging #18975 into master will decrease coverage by 0.31%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18975      +/-   ##
==========================================
- Coverage   91.82%    91.5%   -0.32%     
==========================================
  Files         152      150       -2     
  Lines       49248    48739     -509     
==========================================
- Hits        45222    44600     -622     
- Misses       4026     4139     +113
Flag Coverage Δ
#multiple 89.87% <ø> (-0.34%) ⬇️
#single 41.73% <ø> (-0.17%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/plotting/_compat.py 62% <0%> (-28.91%) ⬇️
pandas/io/parquet.py 52.29% <0%> (-19.51%) ⬇️
pandas/core/missing.py 84.3% <0%> (-7.35%) ⬇️
pandas/plotting/_timeseries.py 60.82% <0%> (-4.49%) ⬇️
pandas/core/ops.py 93.11% <0%> (-3.25%) ⬇️
pandas/core/reshape/tile.py 90.25% <0%> (-3.12%) ⬇️
pandas/io/html.py 85.98% <0%> (-2.81%) ⬇️
pandas/io/formats/format.py 96.24% <0%> (-2.01%) ⬇️
pandas/core/internals.py 94.53% <0%> (-1.01%) ⬇️
... and 74 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cdfce2b...d28f932. Read the comment docs.

@gfyoung
Copy link
Member

gfyoung commented Jan 2, 2018

@zfrenchee : Looks like you have some lint errors. Otherwise, Travis looks happy to me.

@alexlenail
Copy link
Contributor Author

I'd like to revisit the interface we agreed upon (initially from @shoyer) which this PR introduces tests for. The interface was

class Interval:
    def covers(self, other: Interval) -> bool
    def covers(self, other: IntervalIndex) -> IntegerArray1D
    def overlaps(self, other: Interval) -> bool
    def overlaps(self, other: IntervalIndex) -> IntegerArray1D

class IntervalIndex:
    def covers(self, other: Interval) -> IntegerArray1D
    def covers(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]
    def overlaps(self, other: Interval) -> IntegerArray1D
    def overlaps(self, other: IntervalIndex) -> Tuple[IntegerArray1D, IntegerArray1D]

This is the last call for modifications to that interface in this PR. @jreback @jorisvandenbossche @gfyoung . What do you think?

@shoyer
Copy link
Member

shoyer commented Jan 2, 2018

Thinking about this again: there's some ambiguity about whether coverage for IntervalIndex.covers(IntervalIndex) is:

  • elementwise: does each interval in the first index cover each corresponding interval?
  • all: does each interval in the first index cover all of the second intervals?
  • any: does each interval in the first index cover any of the second intervals?

Most pandas operations work elementwise, so arguably that's clearest. That would suggest we should only return booleans/boolean-arrays, i.e.,

class Interval:
    def covers(self, other: Interval) -> bool
    def covers(self, other: IntervalIndex) -> BoolArray1D
    def overlaps(self, other: Interval) -> bool
    def overlaps(self, other: IntervalIndex) -> BoolArray1D

class IntervalIndex:
    def covers(self, other: Interval) -> BoolArray1D
    def covers(self, other: IntervalIndex) -> BoolArray1D
    def overlaps(self, other: Interval) -> BoolArray1D
    def overlaps(self, other: IntervalIndex) -> BoolArray1D

If we adopted this choice, we would save "covers all", "covers any", "overlaps all" and "overlaps any" for another method, which might be implemented in the future (if necessary).

@alexlenail
Copy link
Contributor Author

Thanks for those comments @shoyer.

I'm personally rather attached to your original suggestion, because I think all these other methods can be expressed with short recipes derived from the methods in their current form in the spec. I'm not sure anyone will really be asking the question "which intervals in IntervalIndex A cover every interval in IntervalIndex B?" and even if they did want to ask that, using the current spec they could get indices back and then check if that index set is complete (2 lines instead of 1). The reverse doesn't work -- If we provide these methods, we can't ask the more basic question "which intervals from A cover which intervals from B?"

Are you thinking of a particular other method in the current pandas API to base your suggestion from? If so we have to weigh consistency as a factor. Let me know what you think.

@alexlenail
Copy link
Contributor Author

@pep8speaks

@alexlenail
Copy link
Contributor Author

@jreback @gfyoung I believe this is ready for merge.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to have Interval tests in pandas/tests/scalar/, but IntervalIndex tests in pandas/tests/indexes/interval

}
}

assert ivl.overlaps(other) == \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are very hard to read and debug much better to do

result = ...
expected = ...
assert result == expected

@jreback
Copy link
Contributor

jreback commented Jan 13, 2018

@zfrenchee your upstream may be called something else, use git remote -v to see (our docs say to call it this, which is why we put this here)

@alexlenail
Copy link
Contributor Author

@jreback how do I configure autopep so it conforms to pandas style? When I invoke it now, it makes >79 char lines.

@jreback
Copy link
Contributor

jreback commented Jan 13, 2018

I think you have to tell autopep to fix the lines, its not automatic:

autopep8 --max-line-length

note that we generally don't like to use autopep for line breaking. we do this manually (because autopep8 produces weird output).

Its not that much code you can easily fix it manually.

@alexlenail
Copy link
Contributor Author

@jreback green.

@shoyer
Copy link
Member

shoyer commented Jan 20, 2018

Before we merge this, can someone else weigh in on the API options here? See #18975 (comment) and the comment below

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 20, 2018

Most pandas operations work elementwise, so arguably that's clearest. That would suggest we should only return booleans/boolean-arrays, i.e.,

I agree with this. For me it looks very strange that a Index operation returns a tuple.

But I find it very hard to think about this without code. @zfrenchee could you write up a few hypothetical example of how you would use it? To have a bit more the idea how it would look like.

As I asked for in the other PR as well, I think it would also be helpful to clear out the design to actually write out the reference docs (the docstrings of those methods). We can eg add them now as a multi-line comment at the top of the test files. BTW, I think it would also be rather easy to actually implement the methods (a basic, possibly non-performant implementation). That might make it easier to experiment with.

@jschendel
Copy link
Member

can someone else weigh in on the API options here

I agree with @shoyer that returning booleans/boolean-arrays makes more sense for overlaps and covers. I plan to implement an interval accessor (#16401) in the near future now that the new accessor interface PR has been merged. I haven't looked too closely at the implementation, but I think returning booleans would enable indexing syntax along the lines of df[df.interval_col.iv.overlaps(other)]. At the very least it seems like it would make this behavior consistent with the accessor.

That being said, I'm not opposed to adding the functionality suggested by @zfrenchee, but it might be better served with a different name. Perhaps with a prefix along the lines of get_*, e.g. IntervalIndex.get_overlaps(other), or maybe even a more generic method with a how parameter, e.g. IntervalIndex.something_generic(other, how='overlaps|covers|...').

Unrelated, but one thing that stands out to me is Interval.overlaps(IntervalIndex) (likewise covers). Looks like this is identical to IntervalIndex.overlaps(Interval), so having both seems a little redundant. Syntactically Interval.overlaps(IntervalIndex) feels a little strange to me, since you have a method on a scalar taking a non-scalar argument. I'd be fine removing Interval.overlaps(IntervalIndex) and just implementing IntervalIndex.overlaps(Interval) (likewise covers), but I don't have particularly strong feelings either way.

@alexlenail
Copy link
Contributor Author

alexlenail commented Jan 22, 2018

But I find it very hard to think about this without code. @zfrenchee could you write up a few hypothetical example of how you would use it? To have a bit more the idea how it would look like.

I'm hoping to do genome arithmetic, like BEDtools. Check out the BEDtools tools, e.g. intersect, merge, and closest. With the current implementation of overlaps (in order to do something like bedtools intersect) I would do

one, two = idx1.overlaps(idx2)
return pd.concat((idx1[one], idx2[two]), axis=1)

I think that the issue with the boolean overlaps function is that it's hard to trace what overlaps what. E.g. consider:

idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])

then if idx1.overlaps(idx2) returns a boolean mask on idx1....

pd.concat((idx1[idx1.overlaps(idx2)], idx2[idx2.overlaps(idx2)]))

misses half of the overlaps. Do you see what I mean? Let me know if that needs to be clearer. Basically, if it were booleans, you wouldn't necessarily get all pairs of overlaps out from the overlaps function. I think .overlaps() should be able to tell you what all the overlaps are.


I agree with this. For me it looks very strange that a Index operation returns a tuple.

I can't deny it looks off. However, for this index, for this operation, it might be the only thing that makes sense? Very open to alternatives that can find all the overlaps...


Unrelated, but one thing that stands out to me is Interval.overlaps(IntervalIndex) (likewise covers). Looks like this is identical to IntervalIndex.overlaps(Interval), so having both seems a little redundant. Syntactically Interval.overlaps(IntervalIndex) feels a little strange to me, since you have a method on a scalar taking a non-scalar argument.

This is a good point. We discussed this a little. I also don't have strong feelings, and agree that if we were to remove one of them, Interval.overlaps(IntervalIndex) would be the one. I'd love to find some other useful functionality for that function signature, but I wasn't able to come up with one. I feel very weakly that if the choices are keep it or remove it, I would keep it for now.

@shoyer @jreback @jschendel @jorisvandenbossche

@alexlenail
Copy link
Contributor Author

@jorisvandenbossche
Copy link
Member

@zfrenchee I am still trying to understand the logic behind one, two = idx1.overlaps(idx2) (I mean what it does exactly). Can you show what the output would be for the example idx1 and idx2 ?

@jorisvandenbossche
Copy link
Member

(sidenote: if we decide something like this it too complicated or inconsistent for as a method, we should make sure it is not too hard to write a function that does that)

@alexlenail
Copy link
Contributor Author

alexlenail commented Jan 29, 2018

@jorisvandenbossche: @shoyer 's original API plan, which I add tests for in this PR:

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
([0, 0, 1, 1], [0, 1, 0, 1])

as opposed to

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
[true, true]

@alexlenail
Copy link
Contributor Author

I'm alright renaming the method .find_overlaps if that reduces the confusion for not returning boolean masks, or even (maybe) setting this as a method on dataframes with IntervalIndex indexes, rather than on the indexes themselves... I just really want this functionality!

@jschendel
Copy link
Member

I am still trying to understand the logic behind one, two = idx1.overlaps(idx2) (I mean what it does exactly)

@jorisvandenbossche : IIUC the "plain english" interpretation is, looking at the two lists pairwise, specifically which interval in idx1 overlaps which interval in idx2. For example, in the output that @zfrenchee just posted:

>>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
([0, 0, 1, 1], [0, 1, 0, 1])

The first pairwise element, (0, 0), says that idx1[0] overlaps idx2[0]. And likewise for the other pairs, (0, 1), (1, 0), and (1, 1).

@alexlenail
Copy link
Contributor Author

Thanks @jschendel. Sorry if I wasn't clear, @jorisvandenbossche.

@jorisvandenbossche
Copy link
Member

OK, so you basically do a cross product combination, and then filter for the overlapping ones? So the length of the output varies depending on the data.
Personally I think we should not call that overlaps, but another method like you propose might be a good idea.

(this discussion is another reason I think it would be useful to actually write good docstrings for the functions your propose or even proof of concept implementation, that will makes things a lot easier to discuss)

@jorisvandenbossche
Copy link
Member

BTW, you could see this as a 'join' based on the overlap. And you could then also join based on other ops like within or contains (like with spatial joins).

Would it actually be more useful to directly join two frames based on this? Instead of providing the indices with which you can then do the concat manually? (or are there case you rather want the indices and not directly the merged frames?)

@alexlenail
Copy link
Contributor Author

@jorisvandenbossche

Would it actually be more useful to directly join two frames based on this? Instead of providing the indices with which you can then do the concat manually? (or are there case you rather want the indices and not directly the merged frames?)

I think there could be use cases for getting back the indices without doing the join, so I would vote to keep this as a two-step process. You might want to iterate through the zip of the result, and ask questions about each overlaps. You might want to just look at one side or the other. I personally will mostly likely just be using these functions solely for join operations via concat, but I think it's quite feasible others would have different uses for this.

@alexlenail
Copy link
Contributor Author

Personally I think we should not call that overlaps, but another method like you propose might be a good idea.

I'm happy to name it anything, but I can't think of a better name: The problem this method solves is finding the overlaps between two lists of intervals, i.e. two IntervalIndexes. What else would you call it?

@alexlenail
Copy link
Contributor Author

@jorisvandenbossche
Copy link
Member

I'm happy to name it anything, but I can't think of a better name: The problem this method solves is finding the overlaps between two lists of intervals, i.e. two IntervalIndexes. What else would you call it?

It indeed does an overlap operation, but there are also other things for which this name would be the "obvious" name. To start with simply pd.IntervalIndex(..).overlaps(single_interval), which would logically return a single boolean array.
And given consistency in our API, it is IMO more logical to call that .overlaps.

Your proposal checks the overlap for all combinations, so we might use that notion in the name. Or something with 'join' as I mentioned above.

Another option would aso be to not have this a a method, but as a function. I would personally like that more (also given the return result), the only problem is then where to put it in the namespace.

@jorisvandenbossche
Copy link
Member

See also my comment above #18975 (comment)

tm.assert_numpy_array_equal(lidx, lidx_expected)
tm.assert_numpy_array_equal(ridx, ridx_expected)

def test_intervalIndex_covers_intervalIndex(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize

expected = (np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2]))
self._compare_tuple_of_numpy_array(result, expected)

def test_intervalIndex_overlaps_intervalIndex(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

ivl_range):

# class Interval:
# def overlaps(self, other: IntervalIndex) -> IntegerArray1D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the II tests need to go in pandas/tests/indexes/interval/test_interval_new.py

ivl = Interval(1, 3, closed=ivl_side)
other = Interval(1, 3, closed=oth_side)

should_cover = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to ideally enocde these results IN the parameterization if possible.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

we had a lot of good discussion about this, but is pretty stale now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants