ENH: pd.get_dummies should not default to dtype np.uint8 #45848
Comments
Please provide the output of pd.show_versions and provide something reproducible. We have a bug template for bug reports, or an enhancement template if you think that this is more of an enhancement. |
To be clear, this is definitely an enhancement and not a bug since this is the documented behavior of The issue is that defaulting to I've included the requested information as well as an example below. If you need more info I'm happy to use a template, just provide a pointer where I could find one. Here's the output of pd.show_versions:
Here's an example of the issue:
returns
The intended behavior can be achieved by specifying any signed dtype as you can see here:
returns the generally expected result:
My (and many other people's) issue with this is that the default should not lead to unexpected results. It doesn't need to be a |
Thx, we would probably need a deprecation cycle if we would change that |
Makes sense to me. I appreciate you looking into this, thanks! |
From git log it looks like it's been like this for years, but I can't tell why uint8 was chosen over int8. I'd be in favour of deprecating the current default in favour of int8 |
I support this change. uint8 can lead to hard to track errors |
@pandas-dev/pandas-core anyone got any objections to changing the default type? Any reason to not just make the default type General suggestion for others who would like this changed: please use reactions to express support, no need to add comments just indicating that you'd also like to see this |
bool is a problem as it doesn't play nicely with missing values. We could certainly return Boolean. This would be a breaking change and so needs to wait for 2.0 (I think there is a tracking issue). |
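A minimal sketch of the missing-values concern, assuming a hypothetical three-element series (not data from this thread). A NumPy-backed bool column has no slot for a missing value, so pandas falls back to object dtype, whereas the nullable "boolean" extension dtype keeps a boolean type and stores the gap as pd.NA:

```python
import pandas as pd

# Hypothetical toy values, chosen only for illustration.
values = [True, False, None]

# Without an explicit dtype, the None forces a fallback to object dtype.
plain = pd.Series(values)

# The nullable extension dtype keeps a boolean type and records the gap as pd.NA.
nullable = pd.Series(values, dtype="boolean")

print(plain.dtype)
print(nullable.dtype)
print(nullable.isna().tolist())
```

This only illustrates the dtype behaviour being discussed; it says nothing about which default get_dummies should adopt.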
Sure, but why would |
IMO you either go with |
This isn't quite right. Any int dtype can wrap under the right conditions. It wouldn't happen subtracting 2 dummies, but you cannot know that there isn't some edge case out there. I agree that |
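To illustrate the point that signed integer dtypes can also wrap, here is a small sketch with made-up values. It is not an example taken from the thread, and, as the comment says, it cannot arise from subtracting two 0/1 dummy columns:

```python
import numpy as np

# Hypothetical values near the int8 upper limit (127).
a = np.array([100], dtype=np.int8)
b = np.array([100], dtype=np.int8)

# 100 + 100 = 200 does not fit in int8, so the result wraps to 200 - 256.
total = a + b

print(total.dtype, total.tolist())
```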
Another funny behavior of
The suggestion to use |
Thanks @bashtage
Regarding handling |
@willkurt do you want to open a PR for this? This'd involve:
Not strictly necessary, but I think Sounds like there's agreement on changing the default type away from If anyone wants to work on this, here's the contributing guide, and feel free to ask for help if anything's unclear |
Probably in the minority, but I think uint8 is a natural return type. bool would also be ok. Int64 and float are pretty heavy-handed - I think memory usage is really important here. |
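As a rough sketch of the memory argument, the per-element size of each candidate NumPy dtype can be compared directly. The element count below is an arbitrary assumption, and pandas' nullable Int64 would additionally carry a mask on top of the 8-byte values:

```python
import numpy as np

n = 1_000_000  # hypothetical number of dummy values

# Bytes needed for n elements of each candidate dtype.
sizes = {
    name: np.dtype(name).itemsize * n
    for name in ("uint8", "int8", "bool", "int64", "float64")
}

for name, nbytes in sizes.items():
    print(f"{name:8s} {nbytes:>10,d} bytes")
```

On this measure uint8, int8 and bool are identical at one byte per element, while int64 and float64 cost eight times as much.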
Hi everyone, I am starting in Open Source and willing to contribute. Can I work on this issue and can anyone help me get started? |
Hey - thanks, but there's already a PR open |
OK, can you suggest a beginner issue for me to work on? |
* ENH: Warn when dtype is not passed to get_dummies
* Edit get_dummies' dtype warning
* Add whatsnew entry for issue #45848
* Fix dtype warning test
* Suppress warnings in docs
* Edit whatsnew entry
Co-authored-by: Marco Edward Gorelli <[email protected]>
* Fix find_stack_level in get_dummies dtype warning
* Change the default dtype of get_dummies to bool
* Revert dtype(bool) change
* Move the changelog entry to v1.6.0.rst
* Move whatsnew entry to 'Other API changes'
Co-authored-by: Marco Edward Gorelli <[email protected]>
Co-authored-by: Marco Edward Gorelli <[email protected]>
Would this change also apply to |
that's a good point, |
I was caught by surprise the other day when doing some vector subtraction when using pd.get_dummies. The issue is that the default dtype is np.uint8 which means that cases where a 1 is subtracted from a 0 will result in 255.
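The wraparound described above can be reproduced with a short sketch. The toy series is a hypothetical stand-in rather than the example from the original report, and dtype is passed explicitly so the result does not depend on which default a given pandas version uses:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the example from the original report.
s = pd.Series(["a", "b", "a"])

# Request each dtype explicitly rather than relying on the version's default.
unsigned = pd.get_dummies(s, dtype=np.uint8)
signed = pd.get_dummies(s, dtype=np.int8)

# Column "a" is [1, 0, 1] and column "b" is [0, 1, 0].
unsigned_diff = unsigned["a"] - unsigned["b"]  # 0 - 1 wraps around in uint8
signed_diff = signed["a"] - signed["b"]        # 0 - 1 stays -1 in int8

print(unsigned_diff.tolist())  # [1, 255, 1]
print(signed_diff.tolist())    # [1, -1, 1]
```

Both variants use one byte per element, so switching to a signed type here does not increase memory use.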
I tweeted about this surprise (with an example of this issue) and the overwhelming response was that this felt like a pretty big surprise and, in most cases, undesired default behavior. Bill Tubbs then made a mention of this in another issue where it was recommended that a new issue be created for this.
My guess is that defaulting to np.uint8 is meant to reduce memory demands on what are potentially large, very sparse matrices. While I'm sure there are use cases that demand this, it seems like the risk of defaulting to np.uint8 outweighs the benefits, compared with just choosing a signed representation.