ENH: Consistent API between pd.get_dummies() and Series.str.get_dummies() #59235

Open
1 of 3 tasks
wany-oh opened this issue Jul 12, 2024 · 6 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@wany-oh
Contributor

wany-oh commented Jul 12, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Compared to pd.get_dummies(), Series.str.get_dummies() behaves quite differently and offers far less functionality. These inconsistencies are not user-friendly.

Feature Description

  1. The dtype of the DataFrame returned by Series.str.get_dummies() should be bool, not int64.

    import pandas as pd

    s = pd.Series(list('abca'))
    s.str.get_dummies()

    before:

       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    

    after (same as pd.get_dummies(s)):

           a      b      c
    0   True  False  False
    1  False   True  False
    2  False  False   True
    3   True  False  False
    
  2. prefix=, prefix_sep=, dummy_na=, sparse=, and dtype= arguments should be added to Series.str.get_dummies().

    import numpy as np
    import pandas as pd

    s = pd.Series(['a', 'b', np.nan])
    s.str.get_dummies(prefix="dummy", prefix_sep="=", dummy_na=True, dtype=float)

    after (same as pd.get_dummies(s, prefix="dummy", prefix_sep="=", dummy_na=True, dtype=float)):

       dummy=a  dummy=b  dummy=nan
    0      1.0      0.0        0.0
    1      0.0      1.0        0.0
    2      0.0      0.0        1.0
    

    Note: Among the arguments of pd.get_dummies(), the columns= argument is obviously not needed for Series.str.get_dummies(). Whether Series.str.get_dummies() needs a drop_first= argument is debatable since Series.str.get_dummies() can yield True in multiple columns unlike pd.get_dummies().
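
The proposed keyword arguments can be approximated today with existing calls. The following is a minimal sketch, not the pandas implementation: `str_get_dummies_compat` is a hypothetical helper name, its `bool` default mirrors item 1 of the proposal, and the `"nan"` column label is an assumption chosen to match the output shown above.

```python
import numpy as np
import pandas as pd


def str_get_dummies_compat(s, sep="|", prefix=None, prefix_sep="_",
                           dummy_na=False, dtype=bool):
    """Hypothetical helper (not part of pandas) emulating the proposed arguments."""
    # Existing behavior: one integer indicator column per distinct token.
    dummies = s.str.get_dummies(sep=sep)
    if dummy_na:
        # Missing values yield an all-zero row above, so add an explicit indicator.
        # Assumption: no token in the data is literally named "nan".
        dummies["nan"] = s.isna().astype("int64")
    if prefix is not None:
        dummies.columns = [f"{prefix}{prefix_sep}{col}" for col in dummies.columns]
    # Cast last so every column, including the NaN indicator, shares one dtype.
    return dummies.astype(dtype)


s = pd.Series(["a", "b", np.nan])
out = str_get_dummies_compat(s, prefix="dummy", prefix_sep="=",
                             dummy_na=True, dtype=float)
print(out)
```

Casting at the end keeps the helper independent of whatever dtype `Series.str.get_dummies()` returns by default.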

Alternative Solutions

While there are countless other ways to obtain DataFrames with the same result, none of them would bring consistency to the two methods. The only real alternative might be to simply deprecate Series.str.get_dummies().

Additional Context

No response

@wany-oh wany-oh added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 12, 2024
@Aloqeely
Member

Thanks for the suggestion! There is an open issue to make Series.str.get_dummies defer to pd.get_dummies (#19618), which would then allow the usage of all these args, but it has been quiet recently.
I'm ok with doing that, but is there a reason to not simply deprecate Series.str.get_dummies?

@Aloqeely Aloqeely added Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 14, 2024
@asishm
Contributor

asishm commented Jul 14, 2024

is there a reason to not simply deprecate Series.str.get_dummies?

Series.str.get_dummies has an extra sep argument
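
To illustrate what the `sep` argument provides, here is a small sketch with made-up data. It splits each string on the separator so a single row can be flagged in several columns, whereas `pd.get_dummies()` treats each whole string as one category.

```python
import pandas as pd

s = pd.Series(["a|b", "a", "b|c"])

# Splits each value on "|" and emits one indicator column per token,
# so a row may contain more than one 1.
multi = s.str.get_dummies(sep="|")
print(multi)

# Treats every distinct string as its own category: exactly one hit per row.
single = pd.get_dummies(s)
print(single)
```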

@aaronchucarroll
Contributor

take

@aaronchucarroll
Contributor

Do we want to change the return df to use booleans rather than 1s and 0s? This makes more sense for Series.str.get_dummies to be consistent with pd.get_dummies. But it presents backward compatibility issues.

@Aloqeely
Member

I don't think so. As you said, that might break user code, so it would require a deprecation.

This issue might need more discussion from pandas devs. Any thoughts @mroeschke?

@mroeschke
Member

Correct, changing the default data type would require a deprecation cycle.
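
For readers who want boolean output without any change to the default, an explicit cast is one option. This is only a sketch of a user-side workaround, not something proposed in the thread.

```python
import pandas as pd

s = pd.Series(list("abca"))

# Current default: integer 1/0 indicators.
as_int = s.str.get_dummies()

# Explicit cast to booleans on the caller's side.
as_bool = s.str.get_dummies().astype(bool)

# For comparison, the top-level function applied to the same data.
reference = pd.get_dummies(s)
print(as_bool)
```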

5 participants