ENH: Rename get_dummies to more inclusive language #48250

Closed
3 tasks
davidcavazos opened this issue Aug 25, 2022 · 12 comments
Labels
Enhancement · Needs Triage (issue that has not been reviewed by a pandas team member)

Comments

@davidcavazos

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The word "dummy" from the pd.get_dummies function can be offensive to some people and should be renamed.

Google's inclusive language word list marks it as a word that should not be used.

Feature Description

A good alternative name could be pd.get_indicator_variables, which would also be more explicit about what the function does.

Alternative Solutions

Alternatively, pd.get_one_hot or pd.get_one_hot_encoded could also be an option familiar to Machine Learning practitioners.

Google Trends shows "indicator variable" and "one-hot encoding" to be similarly popular, with "indicator variable" being slightly more popular.
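For context on why both "indicator variables" and "one-hot" describe the function, here is a minimal sketch of what pd.get_dummies returns today. The sample data is made up for illustration, and dtype=int is passed only so the output is 0/1 regardless of the pandas version in use.

```python
import pandas as pd

# Illustrative data; the values are made up for this example.
colors = pd.Series(["red", "blue", "red"], name="color")

# One indicator column per distinct value. dtype=int requests 0/1 integers,
# since the default output dtype has differed between pandas versions.
indicators = pd.get_dummies(colors, dtype=int)

print(indicators)
```

Each row has exactly one 1, in the column matching that row's original value, which is the "one-hot" property.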

Additional Context

No response

@davidcavazos davidcavazos added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2022
@MarcoGorelli
Member

MarcoGorelli commented Aug 25, 2022

Hey @davidcavazos

I'd also be in favour of renaming it, but this was discussed in #35724 and rejected at the time:

I think there isn't a settled alternative to dummy, and until there is, no change should be made. Once the world converges on an alternative and mostly stops using "dummy variable", that alternative should be adopted. This is more or less how the master->main change worked in pandas.

@MarcoGorelli
Member

Closing for now, then. If the world does converge on an alternative, which would allow #35724 to be reconsidered, then we can go with that.

@davidcavazos
Author

The world won't converge on an alternative if we don't start making the change. GitHub's master -> main only happened because they took the initiative.

@TheNeuralBit
Contributor

TheNeuralBit commented Aug 25, 2022

My takeaway from the previous discussion is that adding a separate get_indicators (or some other agreed-upon alternative) would be acceptable. From there we could either:

  • Deprecate and ultimately remove get_dummies, or
  • Prefer get_indicators in documentation to nudge users there

It seems the former was rejected, but the latter could be acceptable. Could we pursue that approach?

(Also, would you prefer that we continue the discussion in #35724?)

@TheNeuralBit
Contributor

A non-Google reference for "dummy" being non-inclusive: https://itconnect.uw.edu/guides-by-topic/identity-diversity-inclusion//inclusive-language-guide/

The origin of the word, “dummy,” is a person who cannot speak. Because the use of this word is often negatively associated with a disability, implying a person is worthless, ineffective or incapable, an alternative word should be used.

@MarcoGorelli
Member

A non-Google reference for "dummy" being non-inclusive: https://itconnect.uw.edu/guides-by-topic/identity-diversity-inclusion//inclusive-language-guide/

The origin of the word, “dummy,” is a person who cannot speak. Because the use of this word is often negatively associated with a disability, implying a person is worthless, ineffective or incapable, an alternative word should be used.

That's a nice reference, thanks!

(Also, would you prefer that we continue the discussion in #35724?)

Yes please, let's keep the discussion in one place - perhaps post the reference you linked above there?

@davidcavazos
Author

Thanks @TheNeuralBit. I think that either get_indicators (as per the previous discussion) or get_indicator_variables (longer, but more explicit) would work. For consistency with other pandas functions, which usually have shorter names, I think get_indicators might be more suitable.

I would personally create the new name, mark get_dummies as deprecated (but still usable), and then remove it completely in a future release.
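As a rough sketch of that deprecation path, written as user-side code rather than anything pandas ships: get_indicators and deprecated_alias below are hypothetical names, and the choice of FutureWarning is an assumption about how such a deprecation might be signalled.

```python
import functools
import warnings

import pandas as pd


def get_indicators(data, **kwargs):
    """Hypothetical new name; delegates to the existing implementation."""
    return pd.get_dummies(data, **kwargs)


def deprecated_alias(new_func, old_name):
    """Build a wrapper that warns about old_name, then calls new_func."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead.",
            FutureWarning,  # assumption: warning category is illustrative
            stacklevel=2,   # point the warning at the caller, not the wrapper
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


# The old name stays usable but warns on every call.
get_dummies = deprecated_alias(get_indicators, "get_dummies")
```

Calling get_dummies(pd.Series(["a", "b"])) here returns the same result as get_indicators while emitting the warning, so existing code keeps working until the old name is removed.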

@davidcavazos
Author

The University of Delaware also has a similar (although shorter) list, which recommends avoiding "dummy value":

http://www1.udel.edu/itwebdev/help/dei.html

@davidcavazos
Author

This document also mentions how it causes harm:

“Dummy” and similar terms stigmatize mental disabilities. The alternatives are clearer.

@MarcoGorelli
Member

Thanks @davidcavazos, appreciate the references. Could you post them in #35724 please, so we keep the discussion in one place?

@davidcavazos
Author

Sure, I just posted a summary of the key points.

3 participants