Skip to content

PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jorisvandenbossche
Copy link
Member

For arrays without missing values it doesn't change, but gives a decent speed-up when having missing values:

In [1]: arr1 = pa.array([True, False, True, True, False]*100_000)

In [2]: arr2 = pa.array([True, False, None, True, False]*100_000)

In [3]: %timeit pd.BooleanDtype().__from_arrow__(arr1)
9.34 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- master
9.31 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- PR

In [4]: %timeit pd.BooleanDtype().__from_arrow__(arr2)
60.6 ms ± 678 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
18.8 ms ± 99.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- PR

cc @simonjayhawkins

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 20, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Apr 20, 2021
@simonjayhawkins
Copy link
Member

@jorisvandenbossche i'm getting significant different timings and showing a much bigger improvement (30x with missing values) and same with * 1_000_000. can you double check you timings.

%timeit pd.BooleanDtype().__from_arrow__(arr1)
# 572 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
# 553 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   <-- PR

%timeit pd.BooleanDtype().__from_arrow__(arr2)
# 34.4 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
# 1.06 ms ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   <-- PR

buflist = arr.buffers()
data = pyarrow.BooleanArray.from_buffers(
arr.type, len(arr), [None, buflist[1]], offset=arr.offset
).to_numpy(zero_copy_only=False)
Copy link
Member

@simonjayhawkins simonjayhawkins Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to #41018 (comment)

if we do pass an incompatible type we now get a less helpful message

>>> import pandas as pd
>>> from pandas.core.arrays.string_arrow import ArrowStringDtype
>>> 
>>> s2 = pd.Series(["a", None, "1"], dtype="arrow_string")
>>> 
>>> arr = s2.values._data
>>> 
>>> arr
<pyarrow.lib.ChunkedArray object at 0x7f50a262ba90>
[
  [
    "a",
    null,
    "1"
  ]
]
>>> 
>>> pd.BooleanDtype().__from_arrow__(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simon/pandas/pandas/core/arrays/boolean.py", line 126, in __from_arrow__
    data = pyarrow.BooleanArray.from_buffers(
  File "pyarrow/array.pxi", line 947, in pyarrow.lib.Array.from_buffers
ValueError: Type's expected number of buffers (3) did not match the passed number (2).

on master

TypeError: Need to pass bool-like values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do pass an incompatible type we now get a less helpful message

Generally we should never do that ourselves, though.
It's possible to get that in a conversion of arrow -> pandas if you have outdated metadata:

>>> df = pd.DataFrame({'a': pd.array([True, False])})
>>> table = pa.table(df)
>>> new_table = table.cast(pa.schema([('a', pa.int8())], metadata=table.schema.metadata))
>>> new_table.to_pandas()
...
~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array)
    135                 mask = np.zeros(len(arr), dtype=bool)
    136 
--> 137             bool_arr = BooleanArray(data, mask)
    138             results.append(bool_arr)
    139 

~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __init__(self, values, mask, copy)
    289     def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False):
    290         if not (isinstance(values, np.ndarray) and values.dtype == np.bool_):
--> 291             raise TypeError(
    292                 "values should be boolean numpy array. Use "
    293                 "the 'pd.array' function instead"

TypeError: values should be boolean numpy array. Use the 'pd.array' function instead

>>> new_table = table.cast(pa.schema([('a', pa.string())], metadata=table.schema.metadata))
>>> new_table.to_pandas()
...
~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array)
    124         for arr in chunks:
    125             buflist = arr.buffers()
--> 126             data = pyarrow.BooleanArray.from_buffers(
    127                 arr.type, len(arr), [None, buflist[1]], offset=arr.offset
    128             ).to_numpy(zero_copy_only=False)

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.from_buffers()

ValueError: Type's expected number of buffers (3) did not match the passed number (2).

And I think this is maybe something to fix on the pyarrow side (it should ignore the metadata on error).

Now, it's also easy to add a arr.type check to ensure it is boolean arrow type, so will do that.

@jorisvandenbossche
Copy link
Member Author

i'm getting significant different timings and showing a much bigger improvement (30x with missing values)

Whoops, I was using my debug build of arrow, which makes things generally much slower, that should explain it.
But so all the better that the speed-up is larger! ;)

@jorisvandenbossche
Copy link
Member Author

@simonjayhawkins any further comments?

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche lgtm.

does this require a release note?

@jorisvandenbossche
Copy link
Member Author

Will add a small note

@jorisvandenbossche jorisvandenbossche merged commit 83e4332 into pandas-dev:master Apr 21, 2021
@jorisvandenbossche jorisvandenbossche deleted the perf-boolean-from-arrow branch April 21, 2021 16:09
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants