PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

jorisvandenbossche · 2021-04-20T07:52:10Z

For arrays without missing values it doesn't change, but gives a decent speed-up when having missing values:

In [1]: arr1 = pa.array([True, False, True, True, False]*100_000)

In [2]: arr2 = pa.array([True, False, None, True, False]*100_000)

In [3]: %timeit pd.BooleanDtype().__from_arrow__(arr1)
9.34 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- master
9.31 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- PR

In [4]: %timeit pd.BooleanDtype().__from_arrow__(arr2)
60.6 ms ± 678 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
18.8 ms ± 99.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- PR

cc @simonjayhawkins

…rray

simonjayhawkins · 2021-04-20T10:50:42Z

@jorisvandenbossche i'm getting significant different timings and showing a much bigger improvement (30x with missing values) and same with * 1_000_000. can you double check you timings.

%timeit pd.BooleanDtype().__from_arrow__(arr1)
# 572 µs ± 4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
# 553 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   <-- PR

%timeit pd.BooleanDtype().__from_arrow__(arr2)
# 34.4 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   <-- master
# 1.06 ms ± 2.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   <-- PR

simonjayhawkins · 2021-04-20T11:39:19Z

pandas/core/arrays/boolean.py

+            buflist = arr.buffers()
+            data = pyarrow.BooleanArray.from_buffers(
+                arr.type, len(arr), [None, buflist[1]], offset=arr.offset
+            ).to_numpy(zero_copy_only=False)


related to #41018 (comment)

if we do pass an incompatible type we now get a less helpful message

>>> import pandas as pd >>> from pandas.core.arrays.string_arrow import ArrowStringDtype >>> >>> s2 = pd.Series(["a", None, "1"], dtype="arrow_string") >>> >>> arr = s2.values._data >>> >>> arr <pyarrow.lib.ChunkedArray object at 0x7f50a262ba90> [ [ "a", null, "1" ] ] >>> >>> pd.BooleanDtype().__from_arrow__(arr) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/simon/pandas/pandas/core/arrays/boolean.py", line 126, in __from_arrow__ data = pyarrow.BooleanArray.from_buffers( File "pyarrow/array.pxi", line 947, in pyarrow.lib.Array.from_buffers ValueError: Type's expected number of buffers (3) did not match the passed number (2).

on master

TypeError: Need to pass bool-like values

if we do pass an incompatible type we now get a less helpful message

Generally we should never do that ourselves, though.
It's possible to get that in a conversion of arrow -> pandas if you have outdated metadata:

>>> df = pd.DataFrame({'a': pd.array([True, False])}) >>> table = pa.table(df) >>> new_table = table.cast(pa.schema([('a', pa.int8())], metadata=table.schema.metadata)) >>> new_table.to_pandas() ... ~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array) 135 mask = np.zeros(len(arr), dtype=bool) 136 --> 137 bool_arr = BooleanArray(data, mask) 138 results.append(bool_arr) 139 ~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __init__(self, values, mask, copy) 289 def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False): 290 if not (isinstance(values, np.ndarray) and values.dtype == np.bool_): --> 291 raise TypeError( 292 "values should be boolean numpy array. Use " 293 "the 'pd.array' function instead" TypeError: values should be boolean numpy array. Use the 'pd.array' function instead >>> new_table = table.cast(pa.schema([('a', pa.string())], metadata=table.schema.metadata)) >>> new_table.to_pandas() ... ~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array) 124 for arr in chunks: 125 buflist = arr.buffers() --> 126 data = pyarrow.BooleanArray.from_buffers( 127 arr.type, len(arr), [None, buflist[1]], offset=arr.offset 128 ).to_numpy(zero_copy_only=False) ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.from_buffers() ValueError: Type's expected number of buffers (3) did not match the passed number (2).

And I think this is maybe something to fix on the pyarrow side (it should ignore the metadata on error).

Now, it's also easy to add a arr.type check to ensure it is boolean arrow type, so will do that.

jorisvandenbossche · 2021-04-20T11:47:13Z

i'm getting significant different timings and showing a much bigger improvement (30x with missing values)

Whoops, I was using my debug build of arrow, which makes things generally much slower, that should explain it.
But so all the better that the speed-up is larger! ;)

jorisvandenbossche · 2021-04-21T13:26:11Z

@simonjayhawkins any further comments?

simonjayhawkins

Thanks @jorisvandenbossche lgtm.

does this require a release note?

jorisvandenbossche · 2021-04-21T13:30:41Z

Will add a small note

…rray (pandas-dev#41051)

PERF: optimize conversion from boolean Arrow array to masked BooleanA…

7e3ac0d

…rray

jorisvandenbossche added Performance Memory or execution speed performance IO Parquet parquet, feather NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 20, 2021

jorisvandenbossche added this to the 1.3 milestone Apr 20, 2021

jorisvandenbossche mentioned this pull request Apr 20, 2021

[ArrowStringArray] Use utf8_is_* functions from Apache Arrow if available #41041

Merged

simonjayhawkins reviewed Apr 20, 2021

View reviewed changes

check type and error + test

4b8bfd3

simonjayhawkins approved these changes Apr 21, 2021

View reviewed changes

add whatsnew note

6c17271

simonjayhawkins approved these changes Apr 21, 2021

View reviewed changes

simonjayhawkins mentioned this pull request Apr 21, 2021

[ArrowStringArray] implement ArrowStringArray._str_contains #41025

Merged

jorisvandenbossche merged commit 83e4332 into pandas-dev:master Apr 21, 2021

jorisvandenbossche deleted the perf-boolean-from-arrow branch April 21, 2021 16:09

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

PERF: optimize conversion from boolean Arrow array to masked BooleanA…

ac80e23

…rray (pandas-dev#41051)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

PERF: optimize conversion from boolean Arrow array to masked BooleanA…

d0b6587

…rray (pandas-dev#41051)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

jorisvandenbossche commented Apr 20, 2021

simonjayhawkins commented Apr 20, 2021

simonjayhawkins Apr 20, 2021 •

edited

Loading

jorisvandenbossche Apr 20, 2021

jorisvandenbossche commented Apr 20, 2021

jorisvandenbossche commented Apr 21, 2021

simonjayhawkins left a comment

jorisvandenbossche commented Apr 21, 2021

PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051

Conversation

jorisvandenbossche commented Apr 20, 2021

simonjayhawkins commented Apr 20, 2021

simonjayhawkins Apr 20, 2021 • edited Loading

Choose a reason for hiding this comment

jorisvandenbossche Apr 20, 2021

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 20, 2021

jorisvandenbossche commented Apr 21, 2021

simonjayhawkins left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 21, 2021

simonjayhawkins Apr 20, 2021 •

edited

Loading