REF: implement ArrowExtensionArray base class #46102

Merged
merged 8 commits into from
Feb 26, 2022

jbrockmendel
Member

xref #46008

@jbrockmendel
Member Author

Thoughts on where to locate things to avoid ImportErrors when pyarrow is not present? In principle could go in _mixins

@mroeschke
Member

> Thoughts on where to locate things to avoid ImportErrors when pyarrow is not present? In principle could go in _mixins

Does it make sense to just have this array in its own file and do the same

if not pa_version_under1p01:
    import pyarrow as pa
    import pyarrow.compute as 

trick ArrowStringArray does?
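A minimal sketch of that guarded-import pattern, for illustration only. The way `pa_version_under1p01` is computed here, the `pc` alias, and the `ArrowBackedArray` class are assumptions made for this example, not the pandas implementation. The idea is that the module always imports cleanly and the missing dependency only surfaces when the Arrow-backed class is actually used.

```python
# Hypothetical stand-in for pandas' internal flag: True when pyarrow is
# missing or older than 1.0.1.
try:
    import pyarrow as pa

    _parts = tuple(int(p) for p in pa.__version__.split(".")[:3] if p.isdigit())
    pa_version_under1p01 = _parts < (1, 0, 1)
except ImportError:
    pa_version_under1p01 = True

# Only touch pyarrow submodules when a usable version is present, so that
# importing this module never raises ImportError.
if not pa_version_under1p01:
    import pyarrow.compute as pc  # alias chosen for this sketch


class ArrowBackedArray:
    """Hypothetical array type; the import error is deferred to construction."""

    def __init__(self, values):
        if pa_version_under1p01:
            raise ImportError("pyarrow>=1.0.1 is required for this array type")
        self._data = pa.chunked_array([pa.array(values)])

    def __len__(self):
        return len(self._data)
```

With this layout, code that never constructs the Arrow-backed type is unaffected when pyarrow is absent.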

@mroeschke mroeschke added this to the 1.5 milestone Feb 26, 2022
@mroeschke mroeschke added Arrow pyarrow functionality Refactor Internal refactoring of code labels Feb 26, 2022
@jreback jreback merged commit 7dea5ae into pandas-dev:main Feb 26, 2022
@jbrockmendel jbrockmendel deleted the ref-arrow-array branch February 26, 2022 22:20
@sterlinm

Hi! Once this is released would you recommend third party developers of extension arrays use this class or is it intended to only be used privately? Thanks!

@mroeschke
Member

@sterlinm I think the intention is for this to be private, so that users can have the typical numeric, string, etc. dtype columns backed by pyarrow instead of numpy (e.g. dtype="int64[pyarrow]"). But if you or others have a compelling use case for making this non-private, we're definitely open to suggestions!
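As a rough illustration of the user-facing side, the sketch below builds a pyarrow-backed integer column through the dtype string. It assumes pandas >= 1.5 with pyarrow installed; the `HAVE_ARROW` flag and `make_arrow_series` helper are hypothetical names used only here, and the demo is skipped if either library is unavailable.

```python
# Assumes pandas >= 1.5 and pyarrow; otherwise the demo is skipped.
try:
    import pandas as pd
    import pyarrow  # noqa: F401  (only checked for availability)

    HAVE_ARROW = hasattr(pd, "ArrowDtype")
except ImportError:
    HAVE_ARROW = False


def make_arrow_series():
    """Build an integer Series stored in a pyarrow array rather than numpy."""
    return pd.Series([1, 2, None], dtype="int64[pyarrow]")


if HAVE_ARROW:
    ser = make_arrow_series()
    print(ser.dtype)           # the pyarrow-backed dtype
    print(ser.isna().tolist())  # the missing value is tracked by Arrow
```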

@sterlinm

sterlinm commented Apr 1, 2022

@mroeschke Thanks!

It looks to me like a really useful framework for anybody who wants to build an extension array that is backed by an Arrow array. I think the use case is pretty much the same as the general use case for Extension Arrays in the first place.

The example I had in mind was a datetime array that supports microsecond precision. I know this will probably land in pandas itself eventually, but basically any time I want to build an extension array, I'm going to want it backed by an Arrow array. It looks like you've worked out the quirks of mapping key lookups onto pa.ChunkedArray, and it would be nice to build on that rather than re-implementing it myself :)

It would serve a similar purpose to the Fletcher library, I think. https://fletcher.readthedocs.io/en/latest/

@sterlinm

sterlinm commented Jul 8, 2022

It looks like the next release of pyarrow is going to support customizing ExtensionScalars to control what as_py() returns. It would be great if that played nicely with the ArrowExtensionArray and ArrowDtype classes so they could be used non-privately. I'd be happy to look into helping with that once the pyarrow release is done.

apache/arrow#13454

@mroeschke
Member

ArrowExtensionArray.__getitem__ dispatches to as_py so a custom scalar should be returned there.

I haven't tested an ExtensionScalar/Type while developing the arrow stuff yet, but ideally this should work when calling pd.array([custom_scalars...], dtype=pd.ArrowDtype(pyarrow_dtype=pa.<extension_type>))
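To make the scalar path concrete, here is a small sketch using a built-in Arrow type rather than a custom extension type (which, as noted, is untested). It assumes pandas >= 1.5 with pyarrow installed; `HAVE_ARROW` and `first_and_missing` are hypothetical names for this example only.

```python
# Assumes pandas >= 1.5 and pyarrow; otherwise the demo is skipped.
try:
    import pandas as pd
    import pyarrow as pa

    HAVE_ARROW = hasattr(pd, "ArrowDtype")
except ImportError:
    HAVE_ARROW = False


def first_and_missing():
    """Return the first element and whether the last element is missing."""
    arr = pd.array([1, 2, None], dtype=pd.ArrowDtype(pa.int64()))
    # Scalar indexing converts the pyarrow scalar to a Python object;
    # a null slot comes back as a missing-value marker instead.
    return arr[0], pd.isna(arr[2])


if HAVE_ARROW:
    first, last_missing = first_and_missing()
    print(type(first).__name__, first, last_missing)
```

Whether a custom ExtensionScalar's as_py() result flows through the same path would still need to be checked against the pyarrow release that adds that hook.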

If you want to test drive the existing ArrowExtensionArray and ArrowDtype from the main branch, happy to have feedback!

https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/arrow/array.py
https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/arrow/dtype.py

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@sterlinm

Thanks! I'm going to experiment with using these to implement EAs for my own classes.

If I run into issues or have suggestions for making that easier, what's the best place for that discussion? Continue commenting here or open a new issue? Or none of the above? 🙂

I'd be happy to help address any issues I run into.

@mroeschke
Member

> If I run into issues or have suggestions for making that easier, what's the best place for that discussion? Continue commenting here or open a new issue? Or none of the above?

It would be easier to open separate GitHub issues for any feedback, ideally one issue per topic. Thanks!
