TST(string dtype): Resolve xfail when grouping by nan column #60712

rhshadrach · 2025-01-13T02:26:48Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

At first glance the behavior perhaps seems odd, but I think we have to go with it while we infer dtype based on the non-NA values and accept None as an NA value for the inferred dtype. Namely, in

 df = DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]})

the columns are [None, "b", "c"] and so we infer this as strings with None being the corresponding NA-value, in this case np.nan. Thus we need to groupby np.nan.
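
As a minimal sketch of the above, assuming pandas and NumPy are installed: the branch on the first column label is only an illustrative way to cover both inference modes and is not taken from the test in this PR. The grouping key is picked from whatever label the constructor actually stored.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]})

# With string inference enabled, the None label is stored as np.nan;
# with object-dtype columns it stays None.
first_label = df.columns[0]
by = [None] if first_label is None else [np.nan]

# Group by the first column and sum the remaining ones.
result = df.groupby(by=by).sum()
print(result)
```

Either way the grouping runs over the first column, so the sums of "b" and "c" per group are the same under both inference modes.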

WillAyd · 2025-01-13T16:29:24Z

At first glance the behavior perhaps seems odd, but I think we have to go with it while we infer dtype based on the non-NA values and accept None as an NA value for the inferred dtype.

I think what is being tested here took advantage of the old implementation, giving None a special case as an object sentinel. In the long run, I don't see us supporting that natively for strings, so the coercion to a missing value indicator makes sense (and is consistent with constructors).

mroeschke · 2025-01-13T17:52:19Z

Thanks @rhshadrach

lumberbot-app · 2025-01-13T17:52:37Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 55a6d0a613897040fec1ae11adc15f5f04728032

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #60712: TST(string dtype): Resolve xfail when grouping by nan column'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-60712-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60712 on branch 2.3.x (TST(string dtype): Resolve xfail when grouping by nan column)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

rhshadrach · 2025-01-14T02:10:29Z

Backport PR: #60719

jorisvandenbossche · 2025-01-14T08:11:54Z

I do think that in the short term, we should maybe keep supporting df.groupby(by=[None]).sum()? (And potentially deprecate it later?)

There are similar tests with pivot or stack/unstack where we also refer to a column/index label with None. For indexing itself, we kept label-lookup working (df[None] gives the column with label NaN).

WillAyd · 2025-01-14T14:41:40Z

One of the problems with doing that, I think, would be breaking usability between the str and string types; the latter has never supported using None as a special case.

rhshadrach · 2025-01-18T20:00:33Z

I do think that in the short term, we should maybe keep supporting df.groupby(by=[None]).sum()?

For all dtypes? Or just str? What happens with object dtype - is None also treated as an NA value?

I'm a bit resistant to changing groupby behavior here. This seems to me to be quite an edge case, and is long standing behavior for other dtypes. E.g.

df = pd.DataFrame([[1, 1, 2], [3, 4, 5]], columns=pd.Index([1, pd.NA, 2], dtype="Int64"))
gb = df.groupby([None])

raises.

TST(string dtype): Resolve xfail when grouping by nan column

0d6ab2b

rhshadrach added the Testing (pandas testing functions or related to the test suite), Groupby, and Strings (String extension data type and string data) labels Jan 13, 2025

WillAyd approved these changes Jan 13, 2025

WillAyd added this to the 2.3 milestone Jan 13, 2025

mroeschke approved these changes Jan 13, 2025

mroeschke merged commit 55a6d0a into pandas-dev:main Jan 13, 2025
57 of 59 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Jan 13, 2025

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Jan 14, 2025

TST(string dtype): Resolve xfail when grouping by nan column (pandas-…

83ea015

…dev#60712) (cherry picked from commit 55a6d0a)

jorisvandenbossche removed the Still Needs Manual Backport label Jan 14, 2025

mroeschke pushed a commit that referenced this pull request Jan 14, 2025

[backport 2.3.x] TST(string dtype): Resolve xfail when grouping by na…

7374d09

…n column (#60712) (#60719) TST(string dtype): Resolve xfail when grouping by nan column (#60712) (cherry picked from commit 55a6d0a)
