vectorized operations with pd.Series of pd.Interval data #25177

Closed
smsaladi opened this issue Feb 6, 2019 · 3 comments
Labels
Duplicate Report Duplicate issue or pull request Interval Interval data type

Comments

@smsaladi
Contributor

smsaladi commented Feb 6, 2019

Is there a plan to allow vectorized operations on pd.Series of pd.Interval data in the future -- perhaps just for the syntactic sugar?

  • constructing the pd.Series of interval values
df['range'] = pd.Interval(df['start'], df['end'])
# instead of 
df['range'] = df.apply(lambda r: pd.Interval(r['start'], r['end']), axis=1)
  • working with pd.Series of interval values
10 in df['range'] 
# instead of
df['range'].apply(lambda x: 10 in x)
df['range'].length
# instead of
df['range'].apply(lambda x: x.length)
df['range'] + 10
# instead of
df['range'].apply(lambda x: x + 10)

I could imagine an interface similar to the pd.Series.str. For example, operations might look like:

df['range'].intv.contains(10)
df['range'].intv.length
df['range'].intv + 10
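
For illustration only, something like the proposed interface could be prototyped today with pandas' accessor registration hook. This is a minimal sketch: the accessor name `intv`, the class `IntervalAccessor`, and its members are hypothetical and follow the proposal above; they are not part of pandas. Arithmetic such as `+ 10` is left out.

```python
import numpy as np
import pandas as pd


# Hypothetical accessor: "intv" and its members follow the proposal above
# and are NOT part of pandas.
@pd.api.extensions.register_series_accessor("intv")
class IntervalAccessor:
    def __init__(self, series):
        if not isinstance(series.dtype, pd.IntervalDtype):
            raise AttributeError(".intv requires interval dtype")
        self._series = series
        self._arr = series.array  # the underlying IntervalArray

    def _wrap(self, values):
        # Return a Series that shares the original index.
        return pd.Series(np.asarray(values), index=self._series.index)

    @property
    def length(self):
        return self._wrap(self._arr.length)

    def contains(self, value):
        # A point is contained iff the degenerate interval [value, value]
        # overlaps the interval.
        point = pd.Interval(value, value, closed="both")
        return self._wrap(self._arr.overlaps(point))


s = pd.Series(pd.arrays.IntervalArray.from_arrays([0, 1, 4], [2, 3, 8]))
lengths = s.intv.length
has_two = s.intv.contains(2)
```

Wrapping every result in a Series keeps the return type uniform, which is one of the things a built-in accessor would presumably do as well.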
@mroeschke
Member

Yes; already noted in #16401. No explicit timeline on this feature but PRs always welcome!

@mroeschke mroeschke added Duplicate Report Duplicate issue or pull request Interval Interval data type labels Feb 6, 2019
@smsaladi
Contributor Author

smsaladi commented Feb 6, 2019

Didn't mean to open a duplicate issue...thanks!

@jschendel
Member

To expand on @mroeschke's comment a bit: a lot of this functionality was just added in 0.24.0 but is not documented well beyond the API reference, which should certainly be improved (xref #16400).

> constructing the pd.Series of interval values

This can be done via the IntervalArray (or IntervalIndex) constructor:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.0'

In [2]: df = pd.DataFrame({'start': [0, 1, 4], 'end': [2, 3, 8]}); df
Out[2]:
   start  end
0      0    2
1      1    3
2      4    8

In [3]: df['range'] = pd.arrays.IntervalArray.from_arrays(df['start'], df['end'])

In [4]: df
Out[4]:
   start  end   range
0      0    2  (0, 2]
1      1    3  (1, 3]
2      4    8  (4, 8]
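
The `IntervalIndex` route mentioned above looks much the same. A minimal sketch, assuming the same `start`/`end` columns; the choice of `closed="both"` is only there to show that the endpoint behavior can be set explicitly (the default is `"right"`):

```python
import pandas as pd

df = pd.DataFrame({"start": [0, 1, 4], "end": [2, 3, 8]})

# IntervalIndex.from_arrays mirrors IntervalArray.from_arrays; `closed`
# selects which endpoints are included in each interval.
df["range"] = pd.IntervalIndex.from_arrays(df["start"], df["end"], closed="both")

first = df["range"].iloc[0]  # a scalar pd.Interval
```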

> I could imagine an interface similar to the pd.Series.str

There is an open issue to add this (xref #16401). This can be more or less done currently via the array property:

In [5]: df['range'].array.length
Out[5]: Int64Index([2, 2, 4], dtype='int64', name='end')

In [6]: df['range'].array.mid
Out[6]: Float64Index([1.0, 2.0, 6.0], dtype='float64', name='start')

In [7]: df['range'].array.overlaps(pd.Interval(2.5, 5))
Out[7]: array([False,  True,  True])
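
The elementwise membership test from the original request (`10 in df['range']`) can be sketched along the same lines by comparing against the `left` and `right` endpoints. This assumes right-closed intervals, as in the example above; other `closed` settings would need different comparison operators:

```python
import pandas as pd

# closed="right" is the default, so each interval is (left, right].
arr = pd.arrays.IntervalArray.from_arrays([0, 1, 4], [2, 3, 8])

value = 2
# Elementwise "value in interval" for right-closed intervals.
mask = (arr.left < value) & (value <= arr.right)
```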

The main difference between a full interval accessor and the array property is that the accessor would work with categoricals as well and have a more consistent return type. I think the accessor would also be limited to elementwise operations, but I'm not sure.

There are still quite a few things that need work, and many open issues. For example, arithmetic operations don't work for IntervalArray, but they do for IntervalIndex:

In [8]: df['range'] + 1
---------------------------------------------------------------------------
TypeError: unsupported operand type(s) for +: 'IntervalArray' and 'int'

In [9]: pd.Index(df['range']) + 1
Out[9]:
IntervalIndex([(1, 3], (2, 4], (5, 9]],
              closed='right',
              dtype='interval[int64]')
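
An alternative that avoids interval arithmetic altogether is to shift the endpoints and rebuild the array. A minimal sketch, assuming only the `left`, `right`, and `closed` attributes and the `from_arrays` constructor:

```python
import pandas as pd

arr = pd.arrays.IntervalArray.from_arrays([0, 1, 4], [2, 3, 8])

# Shift both endpoints by 1 and rebuild, preserving the closed side.
shifted = pd.arrays.IntervalArray.from_arrays(
    arr.left + 1, arr.right + 1, closed=arr.closed
)
```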

There are also some suggested features (xref #19480, #21998), along with new specs for indexing behavior with intervals (xref #16316).

PRs are welcome to address any of the shortcomings mentioned here!
