BUG: indexing empty pyarrow backed object returning corrupt object #51741


Merged · 10 commits · Mar 8, 2023

Conversation

phofl (Member)

@phofl phofl commented Mar 2, 2023

Looks like an empty chunked array creates problems later on.

@phofl phofl added this to the 2.0 milestone Mar 2, 2023
@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves Arrow pyarrow functionality labels Mar 2, 2023
@phofl phofl requested a review from mroeschke March 2, 2023 10:19
@@ -349,7 +349,7 @@ def __getitem__(self, item: PositionalIndexer):
             pa_dtype = pa.string()
         else:
             pa_dtype = self._dtype.pyarrow_dtype
-        return type(self)(pa.chunked_array([], type=pa_dtype))
+        return type(self)(pa.array([], type=pa_dtype))
Member

A chunked array without chunks should in theory also work, so this might point to something else that is buggy?

Member

Looking at the error in #51734, it might be that the type needs to be specified in the chunked_array() call in _concat_same_type

phofl (Member Author) · Mar 3, 2023

This could also solve it, but IMO we should rather avoid returning something here that creates these problems. When iterating over a chunked array without chunks you get an empty list, which makes determining the dtype tricky, because we would have to implement upcasting logic when getting more than one object.

edit: forget what I’ve said about upcasting…

Member

but imo we should rather avoid returning something here that creates these problems

Yes, but my point is that we should rather ensure that this does not create these problems, because there are other ways such a chunked array can get created (e.g. coming from pyarrow).
A ChunkedArray itself also has a type attribute, so you don't need to get a chunk to determine the type.

phofl (Member Author)

Yep, I missed the _concat_same_type case; makes sense when we only have one type.

phofl (Member Author)

Changed


j-bennet commented Mar 3, 2023

I can confirm that it solves the original issue.

@@ -1012,7 +1012,11 @@ def _concat_same_type(
         ArrowExtensionArray
         """
         chunks = [array for ea in to_concat for array in ea._data.iterchunks()]
-        arr = pa.chunked_array(chunks)
+        if to_concat[0].dtype == "string":
Member

Is this needed specifically for StringDtype("pyarrow") and not ArrowDtype(pa.string())?

If so, could you add a comment to that effect?

phofl (Member Author)

Yes, since StringDtype does not have a pyarrow_dtype attribute.

phofl (Member Author)

Added

phofl added 2 commits March 7, 2023 22:28
# Conflicts:
#	pandas/tests/extension/test_arrow.py
@phofl phofl merged commit a07cb65 into pandas-dev:main Mar 8, 2023

lumberbot-app bot commented Mar 8, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
git checkout 2.0.x
git pull
  2. Cherry-pick the first parent of this PR on top of the older branch:
git cherry-pick -x -m1 a07cb65459f24446e82354854ff6658f29414c0a
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #51741: BUG: indexing empty pyarrow backed object returning corrupt object'
  4. Push to a named branch:
git push YOURFORK 2.0.x:auto-backport-of-pr-51741-on-2.0.x
  5. Create a PR against branch 2.0.x; I would have named this PR:

"Backport PR #51741 on branch 2.0.x (BUG: indexing empty pyarrow backed object returning corrupt object)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

Successfully merging this pull request may close these issues.

BUG: pd.concat fails with GroupBy.head() and pd.StringDtype["pyarrow"]
4 participants