PERF: optimize conversion from boolean Arrow array to masked BooleanArray #41051
@jorisvandenbossche I'm getting significantly different timings, showing a much bigger improvement (30x with missing values) and same with
buflist = arr.buffers()
data = pyarrow.BooleanArray.from_buffers(
    arr.type, len(arr), [None, buflist[1]], offset=arr.offset
).to_numpy(zero_copy_only=False)
related to #41018 (comment)
if we do pass an incompatible type we now get a less helpful message
>>> import pandas as pd
>>> from pandas.core.arrays.string_arrow import ArrowStringDtype
>>>
>>> s2 = pd.Series(["a", None, "1"], dtype="arrow_string")
>>>
>>> arr = s2.values._data
>>>
>>> arr
<pyarrow.lib.ChunkedArray object at 0x7f50a262ba90>
[
[
"a",
null,
"1"
]
]
>>>
>>> pd.BooleanDtype().__from_arrow__(arr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/simon/pandas/pandas/core/arrays/boolean.py", line 126, in __from_arrow__
data = pyarrow.BooleanArray.from_buffers(
File "pyarrow/array.pxi", line 947, in pyarrow.lib.Array.from_buffers
ValueError: Type's expected number of buffers (3) did not match the passed number (2).
On master, the same call raises:
TypeError: Need to pass bool-like values
> if we do pass an incompatible type we now get a less helpful message
Generally we should never do that ourselves, though.
It's possible to get that in a conversion of arrow -> pandas if you have outdated metadata:
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({'a': pd.array([True, False])})
>>> table = pa.table(df)
>>> new_table = table.cast(pa.schema([('a', pa.int8())], metadata=table.schema.metadata))
>>> new_table.to_pandas()
...
~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array)
135 mask = np.zeros(len(arr), dtype=bool)
136
--> 137 bool_arr = BooleanArray(data, mask)
138 results.append(bool_arr)
139
~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __init__(self, values, mask, copy)
289 def __init__(self, values: np.ndarray, mask: np.ndarray, copy: bool = False):
290 if not (isinstance(values, np.ndarray) and values.dtype == np.bool_):
--> 291 raise TypeError(
292 "values should be boolean numpy array. Use "
293 "the 'pd.array' function instead"
TypeError: values should be boolean numpy array. Use the 'pd.array' function instead
>>> new_table = table.cast(pa.schema([('a', pa.string())], metadata=table.schema.metadata))
>>> new_table.to_pandas()
...
~/scipy/repos/pandas-build-arrow/pandas/core/arrays/boolean.py in __from_arrow__(self, array)
124 for arr in chunks:
125 buflist = arr.buffers()
--> 126 data = pyarrow.BooleanArray.from_buffers(
127 arr.type, len(arr), [None, buflist[1]], offset=arr.offset
128 ).to_numpy(zero_copy_only=False)
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.from_buffers()
ValueError: Type's expected number of buffers (3) did not match the passed number (2).
And I think this is maybe something to fix on the pyarrow side (it should ignore the metadata on error).
Now, it's also easy to add an arr.type check to ensure it is a boolean Arrow type, so I will do that.
Whoops, I was using my debug build of Arrow, which makes things generally much slower; that should explain it.
@simonjayhawkins any further comments?
Thanks @jorisvandenbossche lgtm.
does this require a release note?
Will add a small note.
For arrays without missing values the timing doesn't change, but this gives a decent speed-up when there are missing values:
cc @simonjayhawkins