Should IntegerArray provide data / mask through an API? #34873
Comments
Yeah, but we have two optimizations planned that make this a bit tricky:
Just an FYI, Actually... can you use a combination of these two?
    data = array.to_numpy(na_value=0)
    mask = array.isna()
I think that's 0-copy access to both components... Edit: no it's definitely not zero-copy, since we insert |
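A minimal sketch of the two-call approach suggested above, assuming a small Int64 array built with pd.array (the input values and variable names are illustrative only). It passes an explicit dtype, which the comment above does not, so that the result is an integer ndarray rather than an object one, and it uses np.shares_memory to check whether the data aliases the backing buffer.

```python
import numpy as np
import pandas as pd

# Illustrative input; the values are made up for this sketch.
arr = pd.array([1, None, 3], dtype="Int64")

# Fill the missing slots with 0 and ask for a plain int64 ndarray.
data = arr.to_numpy(dtype="int64", na_value=0)
# Boolean ndarray that is True where a value is missing.
mask = arr.isna()

# `_data` is the private attribute discussed in this thread; it is used
# here only to check whether `data` shares memory with the backing buffer.
data_is_copy = not np.shares_memory(data, arr._data)
```

Because the missing slots have to be overwritten with na_value, the data comes back as a copy, which matches the "not zero-copy" correction in the comment.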
Thank you for the detailed answer! I think |
One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement cc @jorisvandenbossche if you have thoughts. |
Note that But, long term it would indeed be good to have some more official way to access this. |
Actually, we don't: pandas/pandas/core/arrays/masked.py Lines 225 to 226 in 8ba9c62
but that seems like a bug to me. A user mutating that BooleanArray afterwards should never update the original array. |
As a bit of context here: Contrary to most things I post here on the issue tracker, Arrow isn't involved in this, only I would like to avoid copies if possible, thus I would adapt these computations to just the mask as it is implemented by Taking a step back, an alternative for this use case would also be to provide an interface in pandas |
We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask. For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU. |
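To illustrate the bytemask versus bitmask distinction raised here, the following CPU-side NumPy sketch condenses a boolean bytemask into a packed bitmask. It assumes the pandas convention on input (True means missing) and an Arrow-style validity bitmap on output (a set bit means valid, least-significant bit first). The function name is hypothetical, and the sketch does not attempt the GPU-side conversion that cuDF would want.

```python
import numpy as np


def bytemask_to_validity_bitmask(mask: np.ndarray) -> np.ndarray:
    """Pack a boolean 'is missing' bytemask into an 'is valid' bitmask.

    Hypothetical helper for illustration; not part of pandas or cuDF.
    """
    # Invert: pandas marks missing values with True, the bitmap marks valid ones.
    valid = ~np.asarray(mask, dtype=bool)
    # Eight mask bytes become one output byte; the tail is zero-padded.
    return np.packbits(valid, bitorder="little")


# Nine entries, so the result needs two bytes.
mask = np.array([False, True, False, False, True, False, False, False, True])
bits = bytemask_to_validity_bitmask(mask)
```

The packed form uses one eighth of the memory, but individual entries no longer start on byte boundaries, which is why operations such as slicing at arbitrary offsets become harder to do without copying.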
I think the best we can do for now is say "use
And we'll agree among friends that pandas won't change the names? We could offer a public |
Thanks @TomAugspurger! We'll use FWIW on the cuDF side we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and in using bitmasks it becomes tricky to try to efficiently handle typical zero copy operations like slicing. What we've done is have |
I'm currently implementing some algorithms on top of the IntegerArray class with numba. Therefore I would need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case as this returns an object-typed array. For now, I'll keep using series.values._data and series.values._mask, but I'm aware that using private variables is not a long-term solution.

Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial, but it is still needed. Thus naming the accessor for it should be done carefully.