Skip to content

interval.overlaps() mishandles empty intervals #26893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ghost opened this issue Jun 16, 2019 · 8 comments
Open

interval.overlaps() mishandles empty intervals #26893

ghost opened this issue Jun 16, 2019 · 8 comments
Labels
Bug Interval Interval data type

Comments

@ghost
Copy link

ghost commented Jun 16, 2019

The special case where left=right is not checked properly by the constructor, only closed='both' (a point) or closed='neither' (an empty interval) makes sense but '[0,0)' or (0,0] are nonsensical. The overlaps method returns inconsistent results when encountering such Intervals.

import pandas as pd
pd.Interval(0,0,'left')

Update:
Examples (from below):

In [4]:  
   ...: import pandas as pd
   ...: from pandas import Interval
   ...: 
   ...: a=Interval(0,1,'both')
   ...: b=Interval(0,0,'right')
   ...: print(a.overlaps(b))
   ...: b=Interval(0,0,'left')
   ...: print(a.overlaps(b))
True
False
@simonjayhawkins simonjayhawkins added the Interval Interval data type label Jun 17, 2019
@simonjayhawkins
Copy link
Member

@pilkibun Thanks for the report

The overlaps method returns inconsistent results when encountering such Intervals.

can you give examples?

@simonjayhawkins simonjayhawkins added the Needs Info Clarification about behavior needed to assess issue label Jun 17, 2019
@TomAugspurger
Copy link
Contributor

cc @jschendel if you have thoughts on the expected behavior here. Should we disallow things like Interval(0, 0, closed='left'), or should we update the logic downstream (like .overlaps) to consider cases like these?

@ghost
Copy link
Author

ghost commented Jun 17, 2019

can you give examples?

In [4]:  
   ...: import pandas as pd
   ...: from pandas import Interval
   ...: 
   ...: a=Interval(0,1,'both')
   ...: b=Interval(0,0,'right')
   ...: print(a.overlaps(b))
   ...: b=Interval(0,0,'left')
   ...: print(a.overlaps(b))
True
False

Since the left side and right side of a point interval coincide, it cannot have one side open and side closed at the same time. overlaps returns a different answer based on which "side" of a point we look it, which strikes me as wrong.

This is a pathological corner case but programmatically you may end up creating such intervals,
in which case allowing this ambiguity could introduce subtle bugs.

With the PR change pandas will complain loudly to warn the user, and force them to handle the case explicitly as either an empty interval (x,x), or a single point [x,x].

@jschendel
Copy link
Member

jschendel commented Jun 17, 2019

Should we disallow things like Interval(0, 0, closed='left')?

I'm -1 on this, as it would be inconsistent with other interval implementations I've seen, which allow empty intervals. From what I can tell, other interval implementations tend to fall into two categories:

  1. left == right is allowed for any closed with support for empty intervals.
  2. closed='both' is only supported, so left == right isn't really an issue, but empty intervals are still supported.

Disallowing left == right for closed='left'/'right' would seem to diverge from other implementations. It seems like this would also lead to some internal inconsistencies: since IntervalIndex and IntervalArray require having the same closed, this would make empty intervals not supported in closed='left'/'right' arrays/indexes, but supported in the closed='neither' case with no obvious way to work around (i.e. in the scalar interval case you can just tell people to use closed='neither' but this is not the case for indexes/arrays).

Some examples of other interval implementations:

  • Using postgres range types:
    • SELECT int8range(0, 0, '(]') returns empty
    • SELECT int8range(0, -1, '(]') raises: Error: range lower bound must be less than or equal to range upper bound.
  • Using octave (open source matlab):
    • infsupdec () returns [Empty]_trv (I don't think you can specify closed != 'both'?)
    • infsupdec (0, -1) returns [NaI], along with some warnings (NaI = Not an Interval)

The above does raise a question as to whether or not we want to specifically add a custom EmptyInterval object similar to the other implementations above. Not sure if the additional complexity needed to support this offsets any decrease in ops complexity that having such an object would enable. I think the only real benefit would be a nicer repr? Adding an is_empty property is another alternative that might be a bit simpler.

Adding NaI seems like it'd be way too large of a burden to maintain, and I think the core team would have me killed for suggesting we add yet another NA representation, so that's likely off the table.

Or should we update the logic downstream (like .overlaps) to consider cases like these?

Yes, IMO the downstream logic should handle empty intervals.

[0,0) or (0,0] are nonsensical

Not really, from a mathematical perspective these definitions are valid and equivalent to (0, 0):

  • image

@TomAugspurger
Copy link
Contributor

Thanks for the writeup @jschendel. FWIW (not much), my intutiton matches yours here.

@simonjayhawkins simonjayhawkins removed the Needs Info Clarification about behavior needed to assess issue label Jun 17, 2019
@ghost
Copy link
Author

ghost commented Jun 18, 2019

Also noting #26894 (review) here, which belongs as part of the discussion here.

@ghost ghost changed the title Interval can create ill-defined point intervals Issues with Interval handling of left=right intervals Jun 18, 2019
@ghost
Copy link
Author

ghost commented Jun 18, 2019

I also just noticed that for postgres (0,0) == (999,999) == empty, i.e. empty intervals have no reference point on the line, they are just singletons:

postgres=# SELECT int8range(0, 0, '[)') = int8range(42, 42, '()');
 ?column? 
----------
 t
(1 row)

That makes me even more skeptical about the logic of having empty intervals included in an index,
for when you chop up an interval into bins for example. For one thing, you could end up with duplicates in your index, which is bad news. for another, having a singleton botches ordering, which means you can't have an ordered index (or you'd have to decide on an arbitary rule) for searchsorted to work.

@jschendel
Copy link
Member

But in that case, you must agree that nothing should ever overlap with either '(x,x]' or '[x,x)'. And I've already given an example where this is violated.

Yes, of course, but this is a bug in the overlaps method. Disallowing (x, x] and [x, x) won't fix the issue, as it's still present with (x, x):

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+750.gbaa77c33f'

In [2]: a = pd.Interval(-1, 1, closed='both')

In [3]: b = pd.Interval(0, 0, closed='neither')

In [4]: a.overlaps(b)
Out[4]: True

So the current implementation clearly does not conform to what you claim as a reference/correct/consistent implementation.

I never said that our implementation was consistent with postgres, or any other implementation. Ideally we'd conform to the common conventions, but intervals are still relatively new in pandas - there are still many features that need to be added, and corner cases that need to be addressed (as you've discovered here).

My point was disallowing (x, x] and [x, x) by making them raise takes us further away from common conventions associated with intervals. It'd be akin int8range(0, 0, '[)') and int8range(0, 0, '(]') erroring in postgres with int8range(0, 0, '()') being allowed.

To be clear, I'm NOT against changing the behavior of how pandas handles (x, x] and [x, x), e.g. perhaps creating an EmptyInterval object so that Interval(0, 0, closed='left') == Interval(0, 0, closed='right') == Interval(0, 0, closed='neither') == EmptyInterval(), provided it's done in a maintainable way. Having the constructor raise is what I'm against.

I don't think silently converting these to closed='neither' would be a clean solution, as my impression is that this would require a bit of special casing to make this fully compatible with a left/right closed IntervalArray or IntervalIndex. I haven't gone through the code to attempt doing this though, so perhaps it's not actually too bad.

For one thing, you could end up with duplicates in your index, which is bad news

Indexes in pandas generically support dupes, so this is something that will be accounted for regardless.

having a singleton botches ordering, which means you can't have an ordered index (or you'd have to decide on an arbitary rule)

Likewise indexes in pandas generically support missing values, which similarly botch ordering, and will be accounted for regardless. I'm guessing that empty intervals can handled in a similar manner.

I suppose this could be an argument for maintaining the existing behavior, as the current implementation does not botch ordering.

Can you can suggest offhand a real-world example where such an interval would be created and it isn't a bug?

This can come about via the contraction of intervals, i.e. restricting intervals so that they lie within a given observation interval, and then using the contracted intervals to scale things where you'd want empty intervals to count as zero. Admittedly, the machinery to do this idiomatically does not exist in pandas yet, but the methods to do so are things that have been discussed.

This doesn't necessitate using an IntervalIndex, and could be done with an IntervalArray instead, but a large portion of code is shared between the two, and pandas generally doesn't enforce restrictions on what an index can contain vs. an equivalent array.

@ghost ghost changed the title Issues with Interval handling of left=right intervals interval.overlaps() mishandles empty intervals Jul 8, 2019
@mroeschke mroeschke added the Bug label Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Interval Interval data type
Projects
None yet
4 participants