BUG: Groupby not keeping string dtype for empty objects #55619

phofl · 2023-10-21T18:08:25Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pandas/core/arrays/base.py

rhshadrach · 2023-10-22T13:20:00Z

pandas/core/groupby/ops.py

-        npvalues = lib.maybe_convert_objects(result, try_float=False)
-        if preserve_dtype:
-            out = maybe_cast_pointwise_result(npvalues, obj.dtype, numeric_only=True)
+        if len(obj) == 0 and len(result) == 0 and isinstance(obj.dtype, ExtensionDtype):


What happens on an empty list of categoricals with observed=False? I think this is the only case where len(obj) == 0 but len(result) > 0.
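To make the question concrete, here is a minimal sketch of that case. It uses a plain float column rather than the string dtype so it does not depend on pyarrow; the frame and column names are illustrative assumptions, not taken from the PR's tests.

```python
import pandas as pd

# A categorical key with three categories, then sliced down to zero rows.
df = pd.DataFrame({"a": pd.Categorical([1], categories=[1, 2, 3]), "b": [1.0]})
empty = df.iloc[:0]

# observed=False keeps one output row per category, even with no input rows.
unobserved = empty.groupby("a", observed=False)["b"].min()
# observed=True only keeps categories that actually occur.
observed = empty.groupby("a", observed=True)["b"].min()

print(len(empty), len(unobserved), len(observed))
```

With observed=False the result has one row per category while the input has none, which is the len(obj) == 0 but len(result) > 0 situation described above.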

I didn't track this down specifically, but the test that was mentioned in the comment no longer passes through here.

Running on this PR as-is:

import pandas as pd
from pandas import DataFrame

func = "min"
dtype = "string[pyarrow_numpy]"
df = DataFrame({"a": ["a"], "b": "a", "c": "a"}, dtype=dtype)
df["a"] = pd.Categorical([1], categories=[1, 2, 3])
df = df.iloc[:0]
result = getattr(df.groupby("a", observed=False), func)()
print(result.dtypes)
# b    float64
# c    float64
# dtype: object

If you remove the condition len(result) == 0 in the line highlighted above, you get string[pyarrow_numpy] for the dtypes instead. However, the values in the DataFrame are still the float NaN. I didn't realize this was possible?
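A rough sketch of how the condition from the diff behaves in this case, with and without the len(result) == 0 part. The helper name and its `require_empty_result` parameter are hypothetical and exist only for illustration; the real check lives inline in pandas/core/groupby/ops.py, and what the branch does afterwards is not shown here.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def takes_empty_extension_path(obj, result, require_empty_result=True):
    """Hypothetical helper mirroring the condition shown in the diff."""
    if len(obj) != 0 or not isinstance(obj.dtype, ExtensionDtype):
        return False
    # Dropping the len(result) == 0 requirement lets non-empty results through.
    return len(result) == 0 or not require_empty_result


obj = pd.array([], dtype="string")            # empty extension array
result = np.array([np.nan, np.nan, np.nan])   # one slot per unobserved category

print(takes_empty_extension_path(obj, result))                              # False
print(takes_empty_extension_path(obj, result, require_empty_result=False))  # True
```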

The new string dtype uses NaN as its missing-value representation to keep NumPy semantics, so it will work if you feed it an array that is all NaN.
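A small sketch of feeding all-NaN input to a string extension dtype. As an assumption for portability, it uses the generic "string" alias as a stand-in, since "string[pyarrow_numpy]" needs pyarrow and a pandas version that ships it; with the stand-in, the missing-value sentinel that gets displayed may be <NA> rather than NaN.

```python
import numpy as np
import pandas as pd

# All-missing input with no actual strings in it.
values = [np.nan, np.nan]

# The string dtype accepts it and treats every element as missing.
arr = pd.array(values, dtype="string")

print(arr.dtype)
print(arr.isna())
```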

Ah - thanks

I think this still needs to be fixed (but it's okay if not for 2.1.2).

lithomas1 · 2023-10-26T13:10:56Z

thanks @phofl.

lumberbot-app · 2023-10-26T13:11:12Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.1.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 8afd868106dec889df89e3abed36af09bb7ddf8c

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

git commit -am 'Backport PR #55619: BUG: Groupby not keeping string dtype for empty objects'

Push to a named branch:

git push YOURFORK 2.1.x:auto-backport-of-pr-55619-on-2.1.x

Create a PR against branch 2.1.x, I would have named this PR:

"Backport PR #55619 on branch 2.1.x (BUG: Groupby not keeping string dtype for empty objects)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

…5619) * BUG: Groupby not keeping string dtype for empty objects * Fix --------- Co-authored-by: Thomas Li <[email protected]> (cherry picked from commit 8afd868)

…type for empty objects) (#55705) BUG: Groupby not keeping string dtype for empty objects (#55619) * BUG: Groupby not keeping string dtype for empty objects * Fix --------- Co-authored-by: Thomas Li <[email protected]> (cherry picked from commit 8afd868) Co-authored-by: Patrick Hoefler <[email protected]>

BUG: Groupby not keeping string dtype for empty objects

0cb459c

phofl added the Groupby and Strings (String extension data type and string data) labels Oct 21, 2023

phofl added this to the 2.1.2 milestone Oct 21, 2023

phofl requested a review from rhshadrach as a code owner October 21, 2023 18:08

Fix

506c2b2

rhshadrach reviewed Oct 22, 2023

rhshadrach reviewed Oct 22, 2023

lithomas1 requested a review from rhshadrach October 24, 2023 15:58

phofl and others added 2 commits October 24, 2023 21:51

Merge branch 'main' into string_dtype_groupby_len_zero

bd1c07d

Merge branch 'main' into string_dtype_groupby_len_zero

334dab9

lithomas1 approved these changes Oct 26, 2023

lithomas1 merged commit 8afd868 into pandas-dev:main Oct 26, 2023

lumberbot-app bot added the Still Needs Manual Backport label Oct 26, 2023

lithomas1 removed the Still Needs Manual Backport label Oct 26, 2023

lithomas1 mentioned this pull request Oct 26, 2023

Backport PR #55619 on branch 2.1.x (BUG: Groupby not keeping string dtype for empty objects) #55705

Merged

lithomas1 mentioned this pull request Dec 4, 2023

DOC: Move whatsnew #56320

Merged

phofl commented Oct 21, 2023

rhshadrach Oct 22, 2023

phofl Oct 22, 2023

rhshadrach Oct 24, 2023

phofl Oct 24, 2023

rhshadrach Oct 24, 2023

rhshadrach Oct 26, 2023

lithomas1 commented Oct 26, 2023

lumberbot-app bot commented Oct 26, 2023