-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Confusing (possibly buggy) IntervalIndex behavior #16316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
can't u change the post to show code that constructs a minimal example |
Make your dataframe
Test all these:
I think that documentation as to what the behavior should be might be helpful. I can't make heads nor tails of what P.S. Still curious as to how you would address my meta-question about |
well docs are here: http://pandas.pydata.org/pandas-docs/stable/advanced.html#intervalindex Well we treat indexing with an what would be helpful is using this example:
is to enumerate what those cases about should do semantically. Further having selections like would be really helpful. |
further I still don't understand what you need from |
Sorry, let me include the output from those
Intuitively, I want to say that 'a' is from 1 to 5, closed interval. All good here.
"Do I have any intervals from 1 to 5?" Yes. No surprise
Do I have any intervals from -10 to 10? Yes. Oh, okay, so it's not exact matches. Okay.
Do I have any intervals from 3 to 5? Yes. Okay, that makes sense, it's a partial overlap with [1,5].
Do I have any intervals from 3 to 4?
I think this qualifies as "Confusing (possibly buggy) IntervalIndex behavior" but it might not -- I might just be thinking about this incorrectly. I'm happy to supplement the docs to clarify for people like me if that's the case. Thanks! |
was looking for more discussion of whether an Interval needs to be an exact match, or matches if other intervals are fully contained what happens if it partially overlaps? |
@jreback I think I might see the footprint of the bug. New dataframe: (same as before, just from 10 to 15)
|
So:
|
@zfrenchee I have enough examples, what I want to know is why you think this should work at all.
IOW, take cases and comment on if they should work or raise (KeyError or other)
|
I think there are two behaviors for So reasonable behavior 1 is:
As a non-pandas expert, I believe this is most in keeping with what loc currently does on other index types, but I think this also obliterates the reason for having intervals, since you've essentially reduced the interval down to a token which you're selecting for. The other possible behavior is that loc returns all overlaps:
I think what makes most sense is to use loc for the first of these, and define a new special
or more likely:
This would return something sensible, like the indices from Maybe it would be easier to define overlaps like so:
In that case it's a little harder to decide what to return, since you would want it to return the same thing as:
What do you think? @jreback |
The closest analogy here is probably partial string indexing into Datetimes. We accept In [83]: df = pd.DataFrame(['a', 'b'], columns=['col'], index=pd.IntervalIndex.from_tuples([(1, 5), (7, 8)]))
...: df
...:
Out[83]:
col
(1, 5] a
(7, 8] b
In [61]: df.loc[pd.Interval(1, 5)]
Out[61]:
col a
Name: (1, 5], dtype: object
In [62]: df.loc[3]
Out[62]:
col a
Name: (1, 5], dtype: object``` but, then In [50]: df.loc[pd.Interval(1, 4)] would raise a KeyError, since it doesn't match exactly. Users wishing to do indexing by passing an For more flexible indexing with iterables of def IntervalIndex.covers(self, other: Iterable) -> Array[bool]:
"""Boolean mask for whether items of `other` overlap with anything `self`.
Output is the same same shape as `other`"""
# maybe enhance `.contains` to do this?
def IntervalIndex.covered_by(self, other: Iterable) -> Array[Bool]:
"""Boolean mask for whether items in self overlap with anything in `other`.
Output is the same shape as `self`"""
# maybe modify `.isin` to do this? So to summarize:
|
This sounds good to me! I think the original intent was to match fully contained intervals only, but clearly that logic is not working right. In any case it's certainly better (simpler / more explicit) to switch to requiring specific methods for this functionality. It seems like we need may need at least three methods for handling |
I think your ideas represent a good compromise. A couple concerns though:
|
I could have been wrong about the current behavior. I didn't test that much.
For the second point, we might want both methods as a convenience for if you have a list of Intervals, rather than an IntervalIndex. I think your right that they are perfect opposites (and we could implement it that way)
… On May 13, 2017, at 9:40 AM, Alexander Lenail ***@***.***> wrote:
@TomAugspurger
I think your ideas represent a good compromise. A couple concerns though:
(which is what happens currently)
The current behavior is definitely wrong with respect to what you suggest it should be, so this isn't quite right. Do you agree?
I'm not sure whether we need both covers and covered_by given that they seem to be perfect opposites? a.covers(b) == b.covered_by(a) right? In which case you can just flip the expression and don't need both. Let me know if I'm misreading this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
This depends on how we represent the results of
In this model, I think the full "all matches" methods could be useful, but for large lists of intervals these will get very long, so there is also something to be said for "any match" methods. These would return at most one result for each element in the first IntervalIndex, i.e.,
Or possibly just 1D integer arrays of indices or a boolean array (giving matches in the first argument), but either way So I guess this leaves us with up to 5 potential interval indexing methods, which we might call:
If we do indeed add all of the second type of methods, we might do |
@shoyer I have a hard time seeing why you would design the I'm also not sure who would need "any match" who wouldn't want "all match" but I'm more open to this idea. I can't think of utility to it, but it seems very plausible there's some use case I'm not thinking of. |
This needs to be an |
I'm also not sure if this would actually be useful to anyone -- I was merely restating @TomAugspurger's proposal. If we can't think of a real use-case, then we certainly shouldn't bother. It sounds like we want two methods These would have type signatures (slightly abusing standard typing notation):
|
IIRC you suggested this actually. And the are not duplicates of each other.
|
@shoyer sorry I might have lost track -- could you re-disambiguate what |
An interval covers another interval if all points in the second interval are found in the first interval. An interval overlaps another interval if there exist any points found in both intervals. |
@zfrenchee the way to make this happen MUCH faster would be to write tests, that xfail. meaning that we except them to work when the feature is fixed. Many many tests would help here, covering the bases, and edge cases. This would cover the user facing results, e.g. what you expect for a certain operation. Much of the raw input is in this issue. This would also help refine the exact API.
|
@jreback Sounds good, I'm happy to write the tests. One question though: We've discussed both enhancements on this issue, as well as ways the current implementation is "broken". I'll add xfail tests for the enhancements, but should I alter current tests to bring them into line with what we've discussed here. If so, should I also mark them as xfail? |
yes |
@jreback would you mind chatting about this over email? I have a couple more questions which don't necessarily make sense to put on this issue. If that's alright, shoot me an email at [email protected]. =) |
If you have developer workflow questions, GitHub is still a good place for
them -- we can answer questions or point you to existing primers. It's okay
if issues go slightly off topic.
…On Thu, May 18, 2017 at 1:22 PM Alexander Lenail ***@***.***> wrote:
@jreback <https://github.com/jreback> would you mind chatting about this
over email? I have a couple more questions which don't necessarily make
sense to put on this issue. If that's alright, shoot me an email at
***@***.*** =)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16316 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1gT81WLgNG5yDTiFGdnqwmNUn_1Zks5r7H5VgaJpZM4NWGbI>
.
|
Alright.
|
I don't think so, you can certain put what you think if its different
no, things stay in the docs until things are actually removed for Panel, though would prob take a PR to remove them for 0.21.0 |
chiming in on this, as we are heavy users of as already been mentioned, the key verbs are examples from the postgres docs:
now that we have |
Thanks @buyology, the Postgres references are very helpful. Honestly, we probably should have consulted them in the original design process. Looking at this list, it looks like we do indeed need the contains and overlaps functionality (though I'm not sure we can call our methods contains, given that does significantly from the current method). The rest is provided by current functionality, once you pull out lower and upper bounds. |
@shoyer returning to
What were you thinking w.r.t. the relationship between |
It's been a while since I thought about it, but my initial thought would be that |
@shoyer okay, that's what I did, currently in #18975. Take a look if you get a chance =) Note @jreback that means the functions
isn't any different from the function
and
is the same as
in that the |
came across this library: https://github.com/AlexandreDecan/python-intervals looks to have some interesting interval semantics cc @jschendel |
@jschendel this might be a possible topic to work on during the sprint (if you are interested of course)? It would be good to have this indexing behaviour clean-up in 0.25 before 1.0 (as they will break some behaviour I think, not sure we can do with deprecations). |
Yeah, seems like a good topic to work on during the sprint. I've done a bit of work on this already but have been a bit lazy on finishing it up. |
In the above, I have a region that I'm querying for with a partially overlapping interval. The query succeeds when the interval is partially overlapping until it doesn't, throwing the key error:
I think this is particularly confusing because there doesn't seem to be any prominent difference between the
loc
s that succeed and theloc
that fails as far as I can tell. I know we had discussedloc
's behavior in this context but I'm not sure we came to a conclusion.By the way, my larger question is about how to find intersections between two
IntervalIndex
. It seems like thefind_intersections
function didn't make it into this release @jreback ? Let me know! =]The text was updated successfully, but these errors were encountered: