ENH: Create Better IntervalDtype using PyArrow structs. #53033

Open
3 tasks done
randolf-scholz opened this issue May 2, 2023 · 6 comments
Labels
Arrow (pyarrow functionality), Enhancement, Interval (Interval data type)

Comments

@randolf-scholz
Contributor

randolf-scholz commented May 2, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.IntervalArray suffers from 3 major limitations:

  1. They are limited to data with the same closedness on both sides. (Edit: no longer the case, apparently.)
  2. All data points in an array are limited to the same closedness (i.e., the same array can store only closed intervals or only open intervals).
  3. Intervals do not allow missing values for their bounds.
    • In particular, one cannot represent unbounded intervals for data types that lack an actual infinity value, like int32.
  4. Some dtypes, such as string, are not allowed as the subtype.
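
The shared-closedness limitation can be seen with a small sketch. It assumes a recent pandas version; the variable names are illustrative only, and the exact error raised for mixed closedness may differ between versions.

```python
import pandas as pd

# A single closedness setting is stored for the whole array, not per element.
left_closed = pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)], closed="left")
array_closedness = left_closed.closed

# Building one array from intervals with different closedness is expected
# to fail with a ValueError.
mixed = [pd.Interval(0, 1, closed="left"), pd.Interval(1, 2, closed="right")]
try:
    pd.arrays.IntervalArray(mixed)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```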

One practical application for (1) that I am very interested in is storing information about the range of valid values for the columns of another DataFrame.

Feature Description

Given the better integration with pyarrow since pandas 2.0, we can recreate IntervalDtype using pyarrow.struct:

import pyarrow as pa

def arrow_interval_dtype(subtype):
    """Return a struct type holding both bounds and a per-bound inclusivity flag."""
    fields = [
        ("lower_bound", subtype),
        ("upper_bound", subtype),
        ("lower_inclusive", pa.bool_()),
        ("upper_inclusive", pa.bool_()),
    ]
    return pa.struct(fields)

Unlike the current IntervalDtype, this would address all 3 major limitations at once:

  1. Each element of the resulting StructArray can have its own closedness.
  2. PyArrow data types all support missing values.
  3. In principle, any ordered data type can be used as the subtype.

Alternative Solutions

None.

Additional Context

A common request is to add extra operations for interval dtypes.

Additionally, one could imagine having an IntervalUnion type that can represent finite unions of intervals, combining the interval type discussed here with the pyarrow list type. This type would naturally arise when performing unions of intervals, such as [0, 2]∪[3, 5]. The nice thing here is that the resulting space is mathematically closed under the standard set operations (union, intersection, complement, difference).

@randolf-scholz added the Enhancement and Needs Triage labels May 2, 2023
@jbrockmendel
Member

@jorisvandenbossche could the existing ArrowIntervalDtype be the solution here?

@lithomas1 added the Interval and Arrow labels and removed the Needs Triage label May 2, 2023
@Julian048
Contributor

@jbrockmendel Is this something pandas wants to pursue?

@randolf-scholz
Contributor Author

@jbrockmendel This issue is the only match when googling "ArrowIntervalDtype".

@Julian048
Contributor

Julian048 commented Aug 9, 2023

@randolf-scholz I was having trouble finding that as well.

Arrow does have an interval type, but only for time-related data, which is not what this issue is about.

@jbrockmendel
Member

I was referring to pandas.core.arrays.arrow.ArrowIntervalType

@randolf-scholz
Contributor Author

randolf-scholz commented Aug 9, 2023

@jbrockmendel I took a look at the code, and one immediate limitation seems to be that it again restricts all intervals in the array to the same closedness, while the proposal here would allow storing intervals of different closedness in the same array.


4 participants