Skip to content

ENH: Series.str.get_dummies() raise on string type (follow up to PR #59577) #59786

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 47 commits into from
Jan 22, 2025

Conversation

aaronchucarroll
Copy link
Contributor

@aaronchucarroll aaronchucarroll commented Sep 12, 2024

Follow-up on #59577 (comment)

aaronchucarroll and others added 30 commits August 21, 2024 14:26
@rhshadrach
Copy link
Member

@aaronchucarroll - I believe there is an issue related to this, is that right?

@aaronchucarroll
Copy link
Contributor Author

aaronchucarroll commented Sep 26, 2024

I don't think there exists an issue for this. These are the requested follow-up changes to my last PR (#59577)

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the followup!

@mroeschke - I've resolved your comments from #59577 where they have been taken care of. There are a few comments that could use your review.

@aaronchucarroll
Copy link
Contributor Author

@rhshadrach Please see new changes

@@ -2482,12 +2482,16 @@ def get_dummies(
1 False False False
2 True False True
"""
from pandas.core.dtypes.common import is_string_dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas.core.dtypes.common is already being imported from at the top of this file. Can you move this import up there.

@mroeschke mroeschke added the Strings String extension data type and string data label Oct 31, 2024
@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Nov 18, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aaronchucarroll. Would you have time to update based on the comments?

Comment on lines 2489 to 2490
if is_string_dtype(dtype):
raise ValueError("string dtype not supported, please use a numeric dtype")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if instead of explicitly disallowing specifically string dtype, should we allow any numeric dtype? (so raise an error if not is_numeric_dtype(dtype))
Because right now this would also allow the user to specify they want datetime dtype, for example, which also does not make much sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach thought here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that we can restrict to just numeric here.

Comment on lines 106 to 109
with pytest.raises(
ValueError, match="string dtype not supported, please use a numeric dtype"
):
s.str.get_dummies("|", dtype="str[pyarrow]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test above is sufficient (so keep test_get_dummies_with_str_dtype but can remove test_get_dummies_with_pa_str_dtype

@jorisvandenbossche jorisvandenbossche changed the title ENH: Follow up to PR #59577, Series.str.get_dummies() raise on str type ENH: Series.str.get_dummies() raise on string type (follow up to PR #59577) Nov 18, 2024
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@@ -434,7 +434,7 @@ def _str_get_dummies(self, sep: str = "|", dtype: NpDtype | None = None):
dummies_dtype = _dtype
else:
dummies_dtype = np.bool_
dummies = np.empty((len(arr), len(tags2)), dtype=dummies_dtype)
dummies = np.empty((len(arr), len(tags2)), dtype=dummies_dtype, order="F")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronchucarroll @rhshadrach do you remember why this change was included?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see that is based on a comment from @mroeschke at #59577 (comment)

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit 4c3b968 into pandas-dev:main Jan 22, 2025
54 checks passed
@rhshadrach
Copy link
Member

Thanks @aaronchucarroll and @jorisvandenbossche

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants