Skip to content

ENH: Arithmetic operations on intervals #43629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Hoeze opened this issue Sep 18, 2021 · 6 comments
Open

ENH: Arithmetic operations on intervals #43629

Hoeze opened this issue Sep 18, 2021 · 6 comments
Labels
Enhancement Interval Interval data type Needs Discussion Requires discussion from core team before further action Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@Hoeze
Copy link

Hoeze commented Sep 18, 2021

Is your feature request related to a problem?

I would like to be able to do arithmetic operations on intervals:

  • a + b:
    pd.Interval(3, 6) + pd.Interval(5, 7) == pd.Interval(3, 7)
  • a - b:
    pd.Interval(3, 6) - pd.Interval(5, 7) == pd.Interval(3, 5)
  • [a, b].union() / np.sum([a, b]):
    intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4), (5, 7)])
    intervals.sum() == pd.arrays.IntervalArray.from_tuples([(0, 4), (5, 7)])

API breaking implications

None AFAIK, since those methods are not implemented yet

@Hoeze Hoeze added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 18, 2021
@venaturum
Copy link
Contributor

@Hoeze I developed a package called staircase which allows a lot of manipulations with intervals and is designed to be closely aligned with pandas. You may find it handy? www.staircase.dev

@mroeschke
Copy link
Member

Is there a standard or convention for arithmetic with interval types? For example, do Postgres range types operate similarly?

@mroeschke mroeschke added Interval Interval data type Needs Discussion Requires discussion from core team before further action Numeric Operations Arithmetic, Comparison, and Logical operations and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 19, 2021
@Hoeze
Copy link
Author

Hoeze commented Sep 19, 2021

@venaturum Thanks for your suggestion.
Would you mind posting an example on how to use your library for this use case?

@mroeschke The details about Postgresql's range operators can be found here:
https://www.postgresql.org/docs/current/functions-range.html
PostgreSQL supports (amongst many others) the following operators:

  • +: range union
  • *: range intersection
  • -: range difference

Further operators are any kinds of overlap or contain operators.

I would not know whether PostgreSQL supports range aggregations, but I think that range union and range intersection across multiple intervals would be really helpful.

@venaturum
Copy link
Contributor

venaturum commented Sep 20, 2021

@Hoeze no problem. In the code snippets below I am assuming a notebook format where the last variable is printed to console. A jupyter notebook with these code examples is downloadable here: IntervalsWithStaircase.zip

The solution involves thinking about the problem in terms of step functions. We'll use a one-to-one mapping between sets of disjoint intervals and step functions, where for a set of intervals, the corresponding step function is f(x) = 1 if the point x belongs to one of the intervals and 0 otherwise.

The package represents step functions via a class called Stairs. This class has almost every binary operator you can think of defined, all results of which are Stairs objects themselves.

import pandas as pd
import staircase as sc

a = sc.Stairs(start=3, end=6)
b = sc.Stairs(start=5, end=7)

There are more parameters to the Stairs constructor which dictate "step-size" (default 1) and baseline-value (default 0) but start and end are all that is needed here.

You can use Stairs.plot, or Stairs.to_frame to understand the step function values

b.to_frame()
  start  end  value
0  -inf    5      0
1     5    7      1
2     7  inf      0

The union of intervals (a+b in your example) can be done by using a "logical or" operation:

result = a | b  # result will be a Stairs instance
result.to_frame()
  start  end  value
0  -inf    3    0.0
1     3    7    1.0
2     7  inf    0.0

For set difference (a-b in your example) the phrase "in a, and not in b" can be literally translated with the step functions:

result = a & ~b
result.to_frame()
  start  end  value
0  -inf    3    0.0
1     3    5    1.0
2     5  inf    0.0

The same result can be achieved by using b as a mask (which creates undefined values where b in non-zero) then filling the NAs with 0. It can also be achieved by multiplying instead of using the & operator, i.e.

result = a.mask(b).fillna(0)

or

result = a * ~b

For the union of many intervals, there are two approaches. The first is to create a step function for every interval and repeatedly apply the logical_or function.

from functools import reduce

intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4), (5, 7)])
result = reduce(sc.Stairs.logical_or, [sc.Stairs(start=i.left, end=i.right) for i in intervals])
result.to_frame()
  start  end  value
0  -inf    0    0.0
1     0    4    1.0
2     4    5    0.0
3     5    7    1.0
4     7  inf    0.0

The second approach is to add the step functions together (and values where the intervals overlap will have values of 2 or more), and then set non-zero values of the step function to 1. This approach may be favourable as it uses the ability to pass in vectors of start and end values for the intervals into the constructor of a single Stairs object.

result1 = sc.Stairs(start=intervals.left, end=intervals.right)
result1.to_frame()
  start  end  value
0  -inf    0      0
1     0    2      1
2     2    3      2
3     3    4      1
4     4    5      0
5     5    7      1
6     7  inf      0

This is where result1.plot() can come in handy to visualise the step function (requires matplotlib).

From here we can just use a relational operator to get a boolean valued step function:

result2 = result1 > 0
result2.to_frame()
  start  end  value
0  -inf    0      0
1     0    4      1
2     4    5      0
3     5    7      1
4     7  inf      0

We could have also used result1 != 0 or result1.make_boolean() to convert to the binary valued step function.

If you want to convert this step function back to an interval array then it can be done so like this:

intervals2 = (
    result2.to_frame()
    .query('value == 1')
    .pipe(lambda df: pd.arrays.IntervalArray.from_arrays(df.start, df.end))
)
print(intervals2)
<IntervalArray>
[(0, 4], (5, 7]]
Length: 2, dtype: interval[int64, right]

Is interval manipulation the primary use case for staircase? Probably not. Given vectors of "start" and "stop" times it makes quick work of things like queues, and asset utilisations. But maybe it's useful here depending on the context of your problem.

@venaturum
Copy link
Contributor

@Hoeze I've taken the above concepts with staircase and created a new package "piso" for set operations over pandas.Interval, pandas.arrays.IntervalArray and pandas.IntervalIndex.

https://piso.readthedocs.io

It's currently available through pypi. Will continue to expand the functionality over the coming weeks.

@t0uch9r455
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Interval Interval data type Needs Discussion Requires discussion from core team before further action Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants