BUG: Arrow setitem segfaults when len > 145 000 #52075


Merged: 5 commits merged into pandas-dev:main on Mar 27, 2023

Conversation

@phofl (Member) commented Mar 19, 2023

@phofl phofl added the Arrow pyarrow functionality label Mar 19, 2023
@phofl phofl added this to the 2.0 milestone Mar 19, 2023
def test_setitem_boolean_replace_with_mask_segfault():
# GH#52059
N = 145_000
arr = ArrowExtensionArray(pa.chunked_array([np.array([True] * N)]))
Member:

nitpick: np.ones(N, dtype=bool) is about 1200x faster
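As a rough illustration of the reviewer's point (timings vary by machine; the 1200x figure is the reviewer's):

```python
import timeit

import numpy as np

N = 145_000

# np.array([True] * N) first builds a Python list of N objects and then
# iterates over it; np.ones fills a contiguous buffer directly in C.
t_list = timeit.timeit(lambda: np.array([True] * N), number=20)
t_ones = timeit.timeit(lambda: np.ones(N, dtype=bool), number=20)

mask = np.ones(N, dtype=bool)
print(mask.dtype, t_ones < t_list)
```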

Member (author):

changed

@@ -1634,6 +1634,9 @@ def _replace_with_mask(
indices = pa.array(indices, type=pa.int64())
replacements = replacements.take(indices)
return cls._if_else(mask, replacements, values)
if isinstance(values, pa.ChunkedArray):
Member:

Could you add the pyarrow issue link here too?

(mentioned in #52059 (comment))

Member (author):

added

@@ -1634,6 +1634,10 @@ def _replace_with_mask(
indices = pa.array(indices, type=pa.int64())
replacements = replacements.take(indices)
return cls._if_else(mask, replacements, values)
if isinstance(values, pa.ChunkedArray):
Member:

You could limit it to combine the chunks only when the values have boolean type

Member (author):

done

Comment on lines +2375 to +2376
N = 145_000
arr = ArrowExtensionArray(pa.chunked_array([np.ones((N,), dtype=np.bool_)]))
Member:

I don't think you need this > 145_000. Based on @lukemanley's report, this happens for any (even tiny) chunked array.
(I suppose the larger number came from the reading path, where depending on the size it created a chunked array or not?)

Member (author):

The test did not fail with a smaller size. This was confusing to me as well, but it may be that this only fails through Arrow functionality.

@jorisvandenbossche (Member) commented Mar 21, 2023:

OK, that might be a different issue. But then this test is currently not strictly testing the replace_with_mask issue of returning invalid arrays, because it seems the == is not actually doing its work ... (another bug!)

So even with an N of 2, I see the invalid output, and the == just passes:

In [14]: N = 2

In [15]: arr = pd.arrays.ArrowExtensionArray(pa.chunked_array([np.ones((N,), dtype=np.bool_)]))

In [16]: expected = arr.copy()

In [17]: arr[np.zeros((N,), dtype=np.bool_)] = False

In [18]: expected._data
Out[18]: 
<pyarrow.lib.ChunkedArray object at 0x7fa0f51e2de0>
[
  [
    true,
    true
  ]
]

In [19]: arr._data
Out[19]: 
<pyarrow.lib.ChunkedArray object at 0x7fa0f52e6390>
[
<Invalid array: Buffer #1 too small in array of type bool and length 2: expected at least 1 byte(s), got 0
/home/joris/scipy/repos/arrow/cpp/src/arrow/array/validate.cc:116  ValidateLayout(*data.type)>
]

In [20]: expected._data == arr._data
Out[20]: True

Member (author):

Using 145_000 causes a segfault on my machine without the change though, so this should be good enough as a test for now?

@mroeschke (Member) left a comment:

LGTM (I think the larger test size is okay, but would be nice to shrink in the future)

@mroeschke mroeschke merged commit 10000db into pandas-dev:main Mar 27, 2023

lumberbot-app bot commented Mar 27, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
git checkout 2.0.x
git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
git cherry-pick -x -m1 10000db023208c1db0bba6a7d819bfe87dc49908
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #52075: BUG: Arrow setitem segfaults when len > 145 000'
  4. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-52075-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #52075 on branch 2.0.x (BUG: Arrow setitem segfaults when len > 145 000)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mroeschke (Member):

Thanks @phofl

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Mar 28, 2023
* BUG: Arrow setitem segfaults when len > 145 000

* Add gh ref

* Address review

* Restrict to bool type

(cherry picked from commit 10000db)
mroeschke added a commit that referenced this pull request Mar 28, 2023
… len > 145 000) (#52259)

* BUG: Arrow setitem segfaults when len > 145 000 (#52075)

* BUG: Arrow setitem segfaults when len > 145 000

* Add gh ref

* Address review

* Restrict to bool type

(cherry picked from commit 10000db)

* _data

---------

Co-authored-by: Patrick Hoefler <[email protected]>
@phofl phofl deleted the 52059 branch March 29, 2023 14:01
@phofl (Member, author) commented Mar 29, 2023:

thx for doing the backport

Labels
Arrow pyarrow functionality
Development

Successfully merging this pull request may close these issues.

BUG: python crashes on filtering with .loc on boolean Series with dtype_backend=pyarrow on some dataframes.
5 participants