Should IntegerArray provide data / mask through an API? #34873

xhochy · 2020-06-19T18:08:51Z

I'm currently implementing some algorithms on top of the IntegerArray class with numba. Therefore I would need to pass in the two separate backing NumPy arrays instead of the pandas class. Using series.to_numpy() isn't helpful in this case as this returns a object-typed array. For now, I'll keep using series.values._data and series.values._mask but I'm aware that using private variables is not a long-term solution.

Given that series.values._data may be undefined where the mask is True, I know that returning the raw data may be a bit controversial but still need. Thus naming the accessor for it should be done carefully.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-06-19T19:17:42Z

Yeah, but we have two optimizations planned that makes this a bit tricky:

Make ._mask optional. If there are no missing values then we can avoid allocating the array to store it.
Make ._mask a bitmask rather than a bytemask.

series.values._mask

Just an FYI, series.array.isna() will give you _mask through a public API. Doesn't help you with ._data though :)

Actually... can you use a combination of these two?

data = array.to_numpy(na_value=0)
mask = array.isna()

I think that's 0-copy access to both components...

Edit: no it's definitely not zero-copy, since we insert na_value for the NAs.

xhochy · 2020-06-22T09:15:15Z

Thank you for the detailed answer!

I think array.isna() is a pretty decent interface for me as it will be stable regarding the future performance optimization 2.. When 2. is implemented, having a public interface to retrieve the bitmask would also be nice but at least this way, my code won't break on implementation changes (it will just get a bit slower).

TomAugspurger · 2020-06-22T13:14:21Z

One more question @xhochy: do you want zero-copy access to the NumPy arrays, or would a pyarrow Array suffice? IntegerArray & BooleanArray do implement __arrow_array__, so you can get zero-copy access to it in a way that won't change if / when we update our implementation.

cc @jorisvandenbossche if you have thoughts.

jorisvandenbossche · 2020-06-22T13:23:39Z

Note that array.isna() also gives a copy of the mask, so not necessarily optimal.
The __arrow_array__ is indeed zero-copy for the data, but not for the mask since that gets converted into a bitmask. If you need it as a bitmask, that might be fine, but if you actually need a numpy boolean mask for numba, that would give a double conversion bytemask-bitmask->bytemask.

But, long term it would indeed be good to have some more official way to access this.
We could add a return_mask=True/False option to to_numpy()?

jorisvandenbossche · 2020-06-22T13:26:12Z

Note that array.isna() also gives a copy of the mask,

Actually, we don't:

pandas/pandas/core/arrays/masked.py

Lines 225 to 226 in 8ba9c62

    
           def isna(self) -> np.ndarray: 
        
               return self._mask

but that seems a bug to me. As a user afterwards mutating that BooleanArray should never update the original array ..

xhochy · 2020-06-22T14:02:31Z

As a bit context here: Contrary to most things I post here on the issue tracker, Arrow isn't involved in this, only pandas + numba. Using the __arrow_array__ interface for this would be for me on the same level as just accessing the private members of the array. I'm interesting in converting some existing computations that were yet implemented with numba on numpy float+int arrays to also support the new integer extension type.

I would like to avoid copies if possible, thus I would adapt these computations to just the mask as it is implemented by pandas. As long as it is a bytemask, take that and when the internal implementation switches, take the bitmask.

Taking a step back, an alternative for this use case would also be to provide an interface in pandas Series.apply_nonnull(numba_func_that_works_on_scalars) like it's done in Rolling.apply. This would solve my problem of having access to the raw arrays while removing the need for the user to check for nulls on byte- or bitmask. Especially as in the case of bitmask, I have not yet found a simple and performant numba access pattern.

kkraus14 · 2020-07-01T20:22:51Z

We've just upgraded cuDF to Pandas 1.0+ and we'd really love an API to get the buffers underneath IntegerArray / BooleanArray / etc. classes zero copy regardless of whether the mask is a bitmask or a bytemask.

For cuDF, if we're given a bytemask, we'd want to condense it down to a bitmask, but we'd want to do that on the GPU as opposed to the CPU.

cc @brandon-b-miller

TomAugspurger · 2020-07-01T20:36:39Z

I think the best we can do for now is say "use ._mask and ._data with the caveats that"

You should always check that ._mask is not None (to protect against it being optional in the future)
You should always check that _mask.size / itemsize is what you want (to protect against it being a bitmask in the future)

And we'll agree among friends that pandas won't change the names?

We could offer a public .mask and .data but I'm not siure what value those would have over just using the private versions, since we know that the implementation is going to change anyway.

kkraus14 · 2020-07-02T00:02:34Z

Thanks @TomAugspurger! We'll use ._data and ._mask for now.

FWIW on the cuDF side what we have similar needs for this (to hand off to things like numba kernels / cupy functions / custom user shenanigans), and in using bitmasks it becomes tricky to try to efficiently handle typical zero copy operations like slicing. What we've done is have .data / .mask nullify the offset by doing the pointer arithmetic / copy construction (if needed) respectively, but also provide a .base_data and .base_mask for when the underlying allocation is needed to be accessed zero copy without handling the offset.

xhochy added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jun 19, 2020

TomAugspurger removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jun 19, 2020

TomAugspurger added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jun 22, 2020

TomAugspurger mentioned this issue Jul 1, 2020

QST: Zero copy access to the mask of a nullable array #35089

Closed

2 tasks

jorisvandenbossche mentioned this issue Nov 16, 2020

ENH: Basis for a StringDtype using Arrow #35259

Merged

5 tasks

jorisvandenbossche mentioned this issue Mar 31, 2021

PERF/BUG: use masked algo in groupby cummin and cummax #40651

Merged

This was referenced Apr 13, 2021

API: Should isna return a copy for nullable types? #40935

Closed

BUG: isna not returning copy for MaskedArray #41020

Merged

mroeschke added Enhancement and removed Usage Question labels Aug 8, 2021

jorisvandenbossche mentioned this issue Apr 2, 2023

ENH: support mask for missing values when writing data / support writing pandas nullable dtypes geopandas/pyogrio#232

Merged

jorisvandenbossche mentioned this issue Aug 21, 2023

Bitmask Backed MaskedArray #54506

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should IntegerArray provide data / mask through an API? #34873

Should IntegerArray provide data / mask through an API? #34873

xhochy commented Jun 19, 2020

TomAugspurger commented Jun 19, 2020 •

edited

Loading

xhochy commented Jun 22, 2020

TomAugspurger commented Jun 22, 2020

jorisvandenbossche commented Jun 22, 2020

jorisvandenbossche commented Jun 22, 2020

xhochy commented Jun 22, 2020

kkraus14 commented Jul 1, 2020

TomAugspurger commented Jul 1, 2020 •

edited

Loading

kkraus14 commented Jul 2, 2020

Should IntegerArray provide data / mask through an API? #34873

Should IntegerArray provide data / mask through an API? #34873

Comments

xhochy commented Jun 19, 2020

TomAugspurger commented Jun 19, 2020 • edited Loading

xhochy commented Jun 22, 2020

TomAugspurger commented Jun 22, 2020

jorisvandenbossche commented Jun 22, 2020

jorisvandenbossche commented Jun 22, 2020

xhochy commented Jun 22, 2020

kkraus14 commented Jul 1, 2020

TomAugspurger commented Jul 1, 2020 • edited Loading

kkraus14 commented Jul 2, 2020

TomAugspurger commented Jun 19, 2020 •

edited

Loading

TomAugspurger commented Jul 1, 2020 •

edited

Loading