BUG: Arrow setitem segfaults when len > 145 000 #52075
Conversation
pandas/tests/extension/test_arrow.py
Outdated
def test_setitem_boolean_replace_with_mask_segfault():
    # GH#52059
    N = 145_000
    arr = ArrowExtensionArray(pa.chunked_array([np.array([True] * N)]))
nitpick: np.ones(N, dtype=bool) is about 1200x faster
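The speedup the reviewer mentions comes from skipping the intermediate Python list; a minimal sketch of the two ways to build the boolean test array (the exact 1200x factor is machine-dependent):

```python
import numpy as np

N = 145_000

# Materializes a 145k-element Python list first, then converts each
# element one by one: slow.
slow = np.array([True] * N)

# Allocates and fills the boolean buffer in a single vectorized call: fast.
fast = np.ones(N, dtype=bool)

# Both produce the same array contents.
assert (slow == fast).all()
```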
changed
pandas/core/arrays/arrow/array.py
Outdated
@@ -1634,6 +1634,9 @@ def _replace_with_mask(
     indices = pa.array(indices, type=pa.int64())
     replacements = replacements.take(indices)
     return cls._if_else(mask, replacements, values)
+    if isinstance(values, pa.ChunkedArray):
Could you add the pyarrow issue link here too?
(mentioned in #52059 (comment))
added
pandas/core/arrays/arrow/array.py
Outdated
@@ -1634,6 +1634,10 @@ def _replace_with_mask(
     indices = pa.array(indices, type=pa.int64())
     replacements = replacements.take(indices)
     return cls._if_else(mask, replacements, values)
+    if isinstance(values, pa.ChunkedArray):
You could limit it to combine the chunks only when the values have boolean type
done
N = 145_000
arr = ArrowExtensionArray(pa.chunked_array([np.ones((N,), dtype=np.bool_)]))
I don't think you need this > 145_000. Based on @lukemanley's report, this happens for any (even tiny) chunked array.
(I suppose the larger number came from the read path, where depending on the size it creates a chunked array or not?)
The test did not fail with a smaller size; this was confusing to me as well, but it seems it can only be made to fail through the Arrow functionality itself.
OK, that might be a different issue. But then this test is currently not strictly testing the replace_with_mask issue of returning invalid arrays, because the == comparison is not actually doing its work ... (another bug!)
Even with an N of 2, I see the invalid output, but the == still passes:
In [14]: N = 2
In [15]: arr = pd.arrays.ArrowExtensionArray(pa.chunked_array([np.ones((N,), dtype=np.bool_)]))
In [16]: expected = arr.copy()
In [17]: arr[np.zeros((N,), dtype=np.bool_)] = False
In [18]: expected._data
Out[18]:
<pyarrow.lib.ChunkedArray object at 0x7fa0f51e2de0>
[
[
true,
true
]
]
In [19]: arr._data
Out[19]:
<pyarrow.lib.ChunkedArray object at 0x7fa0f52e6390>
[
<Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0
/home/joris/scipy/repos/arrow/cpp/src/arrow/array/validate.cc:116 ValidateLayout(*data.type)>
]
In [20]: expected._data == arr._data
Out[20]: True
Using 145_000 causes a segfault on my machine without the change though, so this should be good enough as a test for now?
LGTM (I think the larger test size is okay, but would be nice to shrink in the future)
Owee, I'm MrMeeseeks, Look at me. There seems to be a conflict, please backport manually. Here are approximate instructions:
Apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the … label. If these instructions are inaccurate, feel free to suggest an improvement.
Thanks @phofl
* BUG: Arrow setitem segfaults when len > 145 000
* Add gh ref
* Address review
* Restrict to bool type
(cherry picked from commit 10000db)

… len > 145 000) (#52259)
* BUG: Arrow setitem segfaults when len > 145 000 (#52075)
* BUG: Arrow setitem segfaults when len > 145 000
* Add gh ref
* Address review
* Restrict to bool type
(cherry picked from commit 10000db)
* _data
Co-authored-by: Patrick Hoefler <[email protected]>
thx for doing the backport
Fixes .loc on boolean Series with dtype_backend=pyarrow on some dataframes.
closes #52059
Added an entry in the doc/source/whatsnew/vX.X.X.rst file.