TST(string dtype): Resolve xfail when grouping by nan column #60712

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

rhshadrach
Member

@rhshadrach rhshadrach commented Jan 13, 2025

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

At first glance the behavior perhaps seems odd, but I think we have to go with it while we infer dtype based on the non-NA values and accept None as an NA value for the inferred dtype. Namely, in

 df = DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]})

the columns are [None, "b", "c"] and so we infer this as strings, with None being converted to the corresponding NA value, in this case np.nan. Thus we need to groupby np.nan.
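A minimal sketch of the point above, assuming a pandas version in which grouping by a missing-valued column label is supported. It reads the label back from df.columns instead of hard-coding None or np.nan, so it should not depend on whether the columns end up as object dtype or as the inferred string dtype; the variable names `label` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]})

# The first column label is a missing value either way: None when the columns
# are object dtype, np.nan when they are inferred as the string dtype.
label = df.columns[0]
print(df.columns.dtype, repr(label), pd.isna(label))

# Group by whatever label the constructor actually stored.
result = df.groupby([label]).sum()
print(result)
```

Hard-coding `by=[None]` instead is exactly the case under discussion: whether it works depends on which dtype was inferred for the columns.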

@rhshadrach rhshadrach added Testing pandas testing functions or related to the test suite Groupby Strings String extension data type and string data labels Jan 13, 2025
@WillAyd
Member

WillAyd commented Jan 13, 2025

At first glance the behavior perhaps seems odd, but I think we have to go with it while we infer dtype based on the non-NA values and accept None as an NA value for the inferred dtype.

I think what is being tested here took advantage of the old implementation, giving None a special case as an object sentinel. In the long run, I don't see us supporting that natively for strings, so the coercion to a missing value indicator makes sense (and is consistent with constructors).

@WillAyd WillAyd added this to the 2.3 milestone Jan 13, 2025
@mroeschke mroeschke merged commit 55a6d0a into pandas-dev:main Jan 13, 2025
57 of 59 checks passed
@mroeschke
Member

Thanks @rhshadrach

lumberbot-app bot commented Jan 13, 2025

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
git checkout 2.3.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 55a6d0a613897040fec1ae11adc15f5f04728032
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #60712: TST(string dtype): Resolve xfail when grouping by nan column'
  4. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-60712-on-2.3.x
  5. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #60712 on branch 2.3.x (TST(string dtype): Resolve xfail when grouping by nan column)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

rhshadrach added a commit to rhshadrach/pandas that referenced this pull request Jan 14, 2025
@rhshadrach
Member Author

Backport PR: #60719

@jorisvandenbossche
Member

I do think that in the short term, we should maybe keep supporting df.groupby(by=[None]).sum()? (and potentially deprecate it later?)

There are similar tests with pivot or stack/unstack where we also refer to a column/index label with None. For indexing itself, we kept label-lookup working (df[None] gives the column with label NaN).
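The label lookup mentioned above can be checked with a short sketch. It assumes, per the comment, that df[None] resolves to the missing-valued column label whether the columns are object dtype or the inferred string dtype; treat it as illustrative rather than a statement of guaranteed behavior.

```python
import pandas as pd

df = pd.DataFrame({None: [1, 1, 2, 2], "b": [1, 1, 2, 3], "c": [4, 5, 6, 7]})

# Indexing with None looks up the column whose label is missing.
col = df[None]
print(col.tolist())
```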

@WillAyd
Member

WillAyd commented Jan 14, 2025

One of the problems with doing that, I think, would be breaking usability between the str and string types; the latter has never supported using None as a special case.

mroeschke pushed a commit that referenced this pull request Jan 14, 2025
…n column (#60712) (#60719)

TST(string dtype): Resolve xfail when grouping by nan column (#60712)

(cherry picked from commit 55a6d0a)
@rhshadrach
Member Author

I do think that in the short term, we should maybe keep supporting df.groupby(by=[None]).sum()?

For all dtypes? Or just str? What happens with object dtype - is None also treated as an NA value?

I'm a bit resistant to changing groupby behavior here. This seems to me to be quite an edge case, and it is long-standing behavior for other dtypes. E.g.

df = pd.DataFrame([[1, 1, 2], [3, 4, 5]], columns=pd.Index([1, pd.NA, 2], dtype="Int64"))
gb = df.groupby([None])

raises.
