Skip to content

Should IntegerArray provide data / mask through an API? #34873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
xhochy opened this issue Jun 19, 2020 · 9 comments
Open

Should IntegerArray provide data / mask through an API? #34873

xhochy opened this issue Jun 19, 2020 · 9 comments
Labels
Enhancement NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@xhochy
Copy link
Contributor

xhochy commented Jun 19, 2020

I'm currently implementing some algorithms on top of the IntegerArray class with numba. Therefore I would need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case as this returns a object-typed array. For now, I'll keep using series.values._data and series.values._mask but I'm aware that using private variables is not a long-term solution.

Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial but still need. Thus naming the accessor for it should be done carefully.

@xhochy xhochy added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jun 19, 2020
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 19, 2020

Yeah, but we have two optimizations planned that makes this a bit tricky:

  1. Make ._mask optional. If there are no missing values then we can avoid allocating the array to store it.
  2. Make ._mask a bitmask rather than a bytemask.

series.values._mask

Just an FYI, series.array.isna() will give you _mask through a public API. Doesn't help you with ._data though :)

Actually... can you use a combination of these two?

data = array.to_numpy(na_value=0)
mask = array.isna()

I think that's 0-copy access to both components...

Edit: no it's definitely not zero-copy, since we insert na_value for the NAs.

@TomAugspurger TomAugspurger removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 19, 2020
@xhochy
Copy link
Contributor Author

xhochy commented Jun 22, 2020

Thank you for the detailed answer!

I think array.isna() is a pretty decent interface for me as it will be stable regarding the future performance optimization 2.. When 2. is implemented, having a public interface to retrieve the bitmask would also be nice but at least this way, my code won't break on implementation changes (it will just get a bit slower).

@TomAugspurger TomAugspurger added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jun 22, 2020
@TomAugspurger
Copy link
Contributor

One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement __arrow_array__, so you can get zero-copy access to it in a way that won't change if / when we update our implementation.

cc @jorisvandenbossche if you have thoughts.

@jorisvandenbossche
Copy link
Member

Note that array.isna() also gives a copy of the mask, so not necessarily optimal.
The __arrow_array__ is indeed zero-copy for the data, but not for the mask since that gets converted into a bitmask. If you need it as a bitmask, that might be fine, but if you actually need a numpy boolean mask for numba, that would give a double conversion bytemask-bitmask->bytemask.

But, long term it would indeed be good to have some more official way to access this.
We could add a return_mask=True/False option to to_numpy()?

@jorisvandenbossche
Copy link
Member

Note that array.isna() also gives a copy of the mask,

Actually, we don't:

def isna(self) -> np.ndarray:
return self._mask

but that seems a bug to me. As a user afterwards mutating that BooleanArray should never update the original array ..

@xhochy
Copy link
Contributor Author

xhochy commented Jun 22, 2020

As a bit context here: Contrary to most things I post here on the issue tracker, Arrow isn't involved in this, only pandas + numba. Using the __arrow_array__ interface for this would be for me on the same level as just accessing the private members of the array. I'm interesting in converting some existing computations that were yet implemented with numba on numpy float+int arrays to also support the new integer extension type.

I would like to avoid copies if possible, thus I would adapt these computations to just the mask as it is implemented by pandas. As long as it is a bytemask, take that and when the internal implementation switches, take the bitmask.

Taking a step back, an alternative for this use case would also be to provide an interface in pandas Series.apply_nonnull(numba_func_that_works_on_scalars) like it's done in Rolling.apply. This would solve my problem of having access to the raw arrays while removing the need for the user to check for nulls on byte- or bitmask. Especially as in the case of bitmask, I have not yet found a simple and performant numba access pattern.

@kkraus14
Copy link
Contributor

kkraus14 commented Jul 1, 2020

We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask.

For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU.

cc @brandon-b-miller

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 1, 2020

I think the best we can do for now is say "use ._mask and ._data with the caveats that"

  1. You should always check that ._mask is not None (to protect against it being optional in the future)
  2. You should always check that _mask.size / itemsize is what you want (to protect against it being a bitmask in the future)

And we'll agree among friends that pandas won't change the names?

We could offer a public .mask and .data but I'm not siure what value those would have over just using the private versions, since we know that the implementation is going to change anyway.

@kkraus14
Copy link
Contributor

kkraus14 commented Jul 2, 2020

Thanks @TomAugspurger! We'll use ._data and ._mask for now.

FWIW on the cuDF side what we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and in using bitmasks it becomes tricky to try to efficiently handle typical zero copy operations like slicing. What we've done is have .data / .mask nullify the offset by doing the pointer arithmetic / copy construction (if needed) respectively, but also provide a .base_data and .base_mask for when the underlying allocation is needed to be accessed zero copy without handling the offset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

No branches or pull requests

5 participants