-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Add spec for new Interval / IntervalIndex methods: .overlaps(), .covers() #18975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @zfrenchee! Thanks for updating the PR.
Comment last updated on April 02, 2018 at 20:45 Hours UTC |
# class Interval: | ||
# def covers(self, other: Interval) -> bool | ||
|
||
assert Interval(1, 3).covers(Interval(1.5, 2.5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Interval
level tests would be more appropriate in pandas/tests/scalar/test_interval.py
, which is where the other Interval
specific tests are located.
assert Interval(1, 3, closed='both').overlaps(Interval(1, 3, closed='both')) | ||
|
||
assert Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='right')) | ||
assert not Interval(1, 3, closed='both').overlaps(Interval(-1, 1, closed='left')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw I would be very happy to have these in a different file, e.g.
pandas/tests/indexes/interval/test_overlap_covers.py
or whatever (excluding @jschendel commet that the Interval tests need to go in the scalar tests); we can make a sub-dir there as well I think.
Ready for a look-over on structure @jreback @jschendel. Are the tests in the right place / well structured? Are there missing tests / cases? |
} | ||
} | ||
|
||
assert ivl.overlaps(other) == should_overlap[other_side][ivl_side][ivl_range] == other.overlaps(ivl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of the tests from here on down should go in: pandas/tests/indexes/interval/test_overlaps.py
(new file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback do you still want to see this?
what is the reason for both covers and overlaps? |
-- (#16316 (comment)) I personally am interested in overlaps, but I can imagine people might want to ask what intervals are covered by their IntervalIndex or vice-versa. Does that answer the question? |
FWIW other implementations of Is there any utility in adding a In [1]: iv = Interval(0, 1)
In [2]: iv.covers(iv, strict=False)
Out[2]: True
In [3]: iv.covers(iv, strict=True)
Out[3]: False Could probably wait to see if there's demand for |
rebase and push again, fixed some hanging by Travis CI |
Codecov Report
@@ Coverage Diff @@
## master #18975 +/- ##
==========================================
- Coverage 91.82% 91.5% -0.32%
==========================================
Files 152 150 -2
Lines 49248 48739 -509
==========================================
- Hits 45222 44600 -622
- Misses 4026 4139 +113
Continue to review full report at Codecov.
|
@zfrenchee : Looks like you have some lint errors. Otherwise, Travis looks happy to me. |
I'd like to revisit the interface we agreed upon (initially from @shoyer) which this PR introduces tests for. The interface was
This is the last call for modifications to that interface in this PR. @jreback @jorisvandenbossche @gfyoung . What do you think? |
Thinking about this again: there's some ambiguity about whether coverage for
Most pandas operations work elementwise, so arguably that's clearest. That would suggest we should only return booleans/boolean-arrays, i.e.,
If we adopted this choice, we would save "covers all", "covers any", "overlaps all" and "overlaps any" for another method, which might be implemented in the future (if necessary). |
Thanks for those comments @shoyer. I'm personally rather attached to your original suggestion, because I think all these other methods can be expressed with short recipes derived from the methods in their current form in the spec. I'm not sure anyone will really be asking the question "which intervals in IntervalIndex A cover every interval in IntervalIndex B?" and even if they did want to ask that, using the current spec they could get indices back and then check if that index set is complete (2 lines instead of 1). The reverse doesn't work -- If we provide these methods, we can't ask the more basic question "which intervals from A cover which intervals from B?" Are you thinking of a particular other method in the current pandas API to base your suggestion from? If so we have to weigh consistency as a factor. Let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to have Interval tests in pandas/tests/scalar/, but IntervalIndex tests in pandas/tests/indexes/interval
} | ||
} | ||
|
||
assert ivl.overlaps(other) == \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are very hard to read and debug much better to do
result = ...
expected = ...
assert result == expected
@zfrenchee your upstream may be called something else, use |
@jreback how do I configure autopep so it conforms to pandas style? When I invoke it now, it makes >79 char lines. |
I think you have to tell autopep to fix the lines, its not automatic:
note that we generally don't like to use autopep for line breaking. we do this manually (because autopep8 produces weird output). Its not that much code you can easily fix it manually. |
@jreback green. |
Before we merge this, can someone else weigh in on the API options here? See #18975 (comment) and the comment below |
I agree with this. For me it looks very strange that a Index operation returns a tuple. But I find it very hard to think about this without code. @zfrenchee could you write up a few hypothetical example of how you would use it? To have a bit more the idea how it would look like. As I asked for in the other PR as well, I think it would also be helpful to clear out the design to actually write out the reference docs (the docstrings of those methods). We can eg add them now as a multi-line comment at the top of the test files. BTW, I think it would also be rather easy to actually implement the methods (a basic, possibly non-performant implementation). That might make it easier to experiment with. |
I agree with @shoyer that returning booleans/boolean-arrays makes more sense for That being said, I'm not opposed to adding the functionality suggested by @zfrenchee, but it might be better served with a different name. Perhaps with a prefix along the lines of Unrelated, but one thing that stands out to me is |
I'm hoping to do genome arithmetic, like BEDtools. Check out the BEDtools tools, e.g. intersect, merge, and closest. With the current implementation of overlaps (in order to do something like bedtools intersect) I would do
I think that the issue with the boolean overlaps function is that it's hard to trace what overlaps what. E.g. consider:
then if
misses half of the overlaps. Do you see what I mean? Let me know if that needs to be clearer. Basically, if it were booleans, you wouldn't necessarily get all pairs of overlaps out from the overlaps function. I think
I can't deny it looks off. However, for this index, for this operation, it might be the only thing that makes sense? Very open to alternatives that can find all the overlaps...
This is a good point. We discussed this a little. I also don't have strong feelings, and agree that if we were to remove one of them, |
@zfrenchee I am still trying to understand the logic behind |
(sidenote: if we decide something like this it too complicated or inconsistent for as a method, we should make sure it is not too hard to write a function that does that) |
@jorisvandenbossche: @shoyer 's original API plan, which I add tests for in this PR:
as opposed to
|
I'm alright renaming the method |
@jorisvandenbossche : IIUC the "plain english" interpretation is, looking at the two lists pairwise, specifically which interval in >>> idx1 = IntervalIndex.from_tuples([(1, 5), (2, 6)])
>>> idx2 = IntervalIndex.from_tuples([(0, 4), (3, 7)])
>>> idx1.overlaps(idx2)
([0, 0, 1, 1], [0, 1, 0, 1]) The first pairwise element, |
Thanks @jschendel. Sorry if I wasn't clear, @jorisvandenbossche. |
OK, so you basically do a cross product combination, and then filter for the overlapping ones? So the length of the output varies depending on the data. (this discussion is another reason I think it would be useful to actually write good docstrings for the functions your propose or even proof of concept implementation, that will makes things a lot easier to discuss) |
BTW, you could see this as a 'join' based on the overlap. And you could then also join based on other ops like within or contains (like with spatial joins). Would it actually be more useful to directly join two frames based on this? Instead of providing the indices with which you can then do the concat manually? (or are there case you rather want the indices and not directly the merged frames?) |
I think there could be use cases for getting back the indices without doing the join, so I would vote to keep this as a two-step process. You might want to iterate through the zip of the result, and ask questions about each overlaps. You might want to just look at one side or the other. I personally will mostly likely just be using these functions solely for join operations via |
I'm happy to name it anything, but I can't think of a better name: The problem this method solves is finding the overlaps between two lists of intervals, i.e. two IntervalIndexes. What else would you call it? |
It indeed does an overlap operation, but there are also other things for which this name would be the "obvious" name. To start with simply Your proposal checks the overlap for all combinations, so we might use that notion in the name. Or something with 'join' as I mentioned above. Another option would aso be to not have this a a method, but as a function. I would personally like that more (also given the return result), the only problem is then where to put it in the namespace. |
See also my comment above #18975 (comment) |
tm.assert_numpy_array_equal(lidx, lidx_expected) | ||
tm.assert_numpy_array_equal(ridx, ridx_expected) | ||
|
||
def test_intervalIndex_covers_intervalIndex(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize
expected = (np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2])) | ||
self._compare_tuple_of_numpy_array(result, expected) | ||
|
||
def test_intervalIndex_overlaps_intervalIndex(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
ivl_range): | ||
|
||
# class Interval: | ||
# def overlaps(self, other: IntervalIndex) -> IntegerArray1D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the II tests need to go in pandas/tests/indexes/interval/test_interval_new.py
ivl = Interval(1, 3, closed=ivl_side) | ||
other = Interval(1, 3, closed=oth_side) | ||
|
||
should_cover = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to ideally enocde these results IN the parameterization if possible.
we had a lot of good discussion about this, but is pretty stale now. |
git diff upstream/master -u -- "*.py" | flake8 --diff
Working on parameterizing the tests. Will ping when ready for review.