ENH: pd.get_dummies should not default to dtype np.uint8 #45848

Closed
willkurt opened this issue Feb 6, 2022 · 20 comments · Fixed by #48022
Labels: Enhancement, good first issue, Needs Discussion

Comments

@willkurt

willkurt commented Feb 6, 2022

I was caught by surprise the other day when doing some vector subtraction when using pd.get_dummies. The issue is that the default dtype is np.uint8 which means that cases where a 1 is subtracted from a 0 will result in 255.

I tweeted about this surprise (with an example of this issue) and the overwhelming response was that this felt like a pretty big surprise and, in most cases, undesired default behavior. Bill Tubbs then made a mention of this in another issue where it was recommended that a new issue be created for this.

My guess is that the default of np.uint8 is meant to reduce memory demands on what are potentially large, very sparse matrices. While I'm sure there are use cases that demand this, it seems like the risk of defaulting to np.uint8 outweighs the benefits compared with just choosing a signed representation.

@phofl
Member

phofl commented Feb 6, 2022

Please provide the output of pd.show_versions() and something reproducible.

We have a bug template for bug reports, or an enhancement template if you think that this is more of an enhancement.

@phofl phofl added the Needs Info label Feb 6, 2022
@willkurt
Author

willkurt commented Feb 6, 2022

To be clear, this is definitely an enhancement and not a bug since this is the documented behavior of get_dummies.

The issue is that defaulting to np.uint8 is not what most people expect, leads to unexpected results in pretty common use cases (subtracting vectors), and is considered to be a pretty severe 'gotcha'.

I've included the requested information as well as an example below. If you need more info I'm happy to use a template, just provide a pointer where I could find one.

Here's the output of pd.show_versions:

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.10.0.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 19.6.0
Version          : Darwin Kernel Version 19.6.0: Thu Sep 16 20:58:47 PDT 2021; root:xnu-6153.141.40.1~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.4
numpy            : 1.21.4
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.3.1
setuptools       : 57.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.3
IPython          : 7.29.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.5.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Here's an example of the issue:

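import pandas as pd

# One-hot encode two label sequences with the default dtype (uint8 in pandas 1.3.4).
vec1 = pd.get_dummies(["okay", "gotcha", "okay"])
vec2 = pd.get_dummies(["gotcha", "okay", "okay"])
diff_default = vec1 - vec2
diff_default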

returns

   gotcha  okay
0     255     1
1       1   255
2       0     0

The intended behavior can be achieved by specifying any signed dtype as you can see here:

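import numpy as np
import pandas as pd

# Same encoding, but with an explicit signed (floating-point) dtype.
vec1 = pd.get_dummies(["okay", "gotcha", "okay"],
                      dtype=np.float32)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"],
                      dtype=np.float32)
diff_correct = vec1 - vec2
diff_correct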

returns the generally expected result:

   gotcha  okay
0    -1.0   1.0
1     1.0  -1.0
2     0.0   0.0

My (and many other people's) issue with this is that the default should not lead to unexpected results. It doesn't need to be a np.float32, but it should be a signed dtype.

@phofl
Member

phofl commented Feb 6, 2022

Thanks, we would probably need a deprecation cycle if we changed that.

@phofl phofl added the Enhancement and Needs Discussion labels and removed the Needs Info label Feb 6, 2022
@willkurt
Author

willkurt commented Feb 6, 2022

Makes sense to me. I appreciate you looking into this, thanks!

@MarcoGorelli
Member

From git log it looks like it's been like this for years, but I can't tell why uint8 was chosen over int8.

I'd be in favour of deprecating the current default in favour of int8

@dhvalden

dhvalden commented Feb 6, 2022

I support this change. uint8 can lead to hard-to-track errors.

@MarcoGorelli
Member

MarcoGorelli commented Aug 4, 2022

@pandas-dev/pandas-core anyone got any objections to changing the default type? Any reason to not just make the default type bool? Then, the return type would be clear, and if people need to do arithmetic operations on the dummy values, they can do their own dtype conversion. But at least they wouldn't run into unexpected behaviour like this
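
A minimal sketch of the "do their own dtype conversion" idea, assuming the dummies are requested as bool explicitly through the existing dtype argument so the result does not depend on the installed pandas version. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Ask for boolean dummies explicitly rather than relying on the default dtype.
vec1 = pd.get_dummies(["okay", "gotcha", "okay"], dtype=bool)
vec2 = pd.get_dummies(["gotcha", "okay", "okay"], dtype=bool)

# Convert to a signed integer dtype before doing arithmetic on the dummies.
diff = vec1.astype(np.int8) - vec2.astype(np.int8)
print(diff)
```

The conversion step makes the arithmetic dtype an explicit choice by the caller, which is the trade-off this proposal accepts.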

General suggestion for others who would like this changed: please use reactions to express support, no need to add comments just indicating that you'd also like to see this

@jreback
Contributor

jreback commented Aug 4, 2022

bool is a problem as it doesn't play nice with missing values

could certainly return Boolean

this would be a breaking change and so needs to wait for 2.0 (I think there is a tracking issue)
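
To illustrate the missing-value concern, here is a small sketch. It assumes that "Boolean" above refers to pandas' nullable "boolean" extension dtype, which is an interpretation rather than something stated in the thread.

```python
import pandas as pd

# NumPy-backed bool has no representation for a missing value, so a Series
# that contains None falls back to object dtype.
plain = pd.Series([True, False, None])
print(plain.dtype)

# The nullable "boolean" extension dtype stores the gap as pd.NA instead.
nullable = pd.Series([True, False, None], dtype="boolean")
print(nullable.dtype)
print(nullable.isna().tolist())
```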

@MarcoGorelli
Member

> bool is a problem as it doesn't play nice with missing values

Sure, but why would get_dummies return a missing value anyway? Unless I'm missing something, the return values would always be 0 or 1

@bashtage
Contributor

bashtage commented Aug 4, 2022

IMO you either go with int64 or just make them float, if you want to move away from the idea of using the smallest int dtype that can represent the encoded categorical variable. float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine.

@bashtage
Contributor

bashtage commented Aug 4, 2022

> The intended behavior can be achieved by specifying any signed dtype as you can see here:

This isn't quite right. Any int dtype can wrap under the right conditions. It wouldn't happen subtracting 2 dummies, but you cannot know that there isn't some edge case out there.

I agree that uint is particularly problematic here since np.iinfo(dt).max is always 1 to the left of 0.
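
A short NumPy-only sketch of both points, using plain arrays rather than get_dummies output: unsigned integers wrap right next to 0, while signed integers also wrap, only far from the values 0 and 1.

```python
import numpy as np

# The largest uint8 value is what 0 - 1 wraps around to.
print(np.iinfo(np.uint8).max)

zero = np.array([0], dtype=np.uint8)
one = np.array([1], dtype=np.uint8)
wrapped_unsigned = zero - one
print(wrapped_unsigned)

# A signed dtype wraps as well, but only at the edge of its range.
top = np.array([np.iinfo(np.int8).max], dtype=np.int8)
wrapped_signed = top + np.array([1], dtype=np.int8)
print(wrapped_signed)
```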

@bashtage
Contributor

bashtage commented Aug 4, 2022

Another funny behavior of get_dummies


In [26]: pd.get_dummies(c,dummy_na=True)
Out[26]:
   a  b  NaN
0  1  0    0
1  0  1    0
2  1  0    0
3  0  0    1


In [25]: ~pd.get_dummies(c,dummy_na=True)
Out[25]:
     a    b  NaN
0  254  255  255
1  255  254  255
2  254  255  255
3  255  255  254

The suggestion to use bool would avoid this issue.
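
The difference can be reproduced with plain NumPy arrays. This sketch uses made-up values rather than the variable c above, whose definition is not shown.

```python
import numpy as np

dummies_uint8 = np.array([1, 0, 0], dtype=np.uint8)
# On an integer dtype, ~ flips every bit: 1 becomes 254 and 0 becomes 255.
inverted_uint8 = ~dummies_uint8
print(inverted_uint8)

dummies_bool = dummies_uint8.astype(bool)
# On a boolean array, ~ is logical negation.
inverted_bool = ~dummies_bool
print(inverted_bool)
```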

@MarcoGorelli
Member

Thanks @bashtage

> float is probably the most sensible (between int64 and float) since it has the same storage requirements and handles nan fine.

Regarding handling nan - is there an example of a case when get_dummies returns nan? If not, then bool should be fine, right?

@MarcoGorelli
Member

@willkurt do you want to open a PR for this? This'd involve:

  • in pandas/tests/reshape/test_get_dummies.py, for tests which don't already specify a dtype, pass np.dtype(np.uint8)
  • add a test which doesn't specify a dtype, and assert that a FutureWarning with a message like "the default dtype will change from 'uint8' to 'bool', please specify a dtype to silence this warning" is raised
  • in pandas/core/reshape/encoding.py, in _get_dummies_1d, add a FutureWarning with a message like the above if dtype wasn't passed by the user

Not strictly necessary, but I think dtype=None could also be changed to dtype=lib.no_default
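
The warning step could look roughly like the sketch below. It is not the pandas implementation: resolve_dummy_dtype and _NO_DEFAULT are hypothetical names, with _NO_DEFAULT standing in for the lib.no_default sentinel mentioned above.

```python
import warnings

import numpy as np

# Hypothetical sentinel standing in for pandas' internal lib.no_default.
_NO_DEFAULT = object()


def resolve_dummy_dtype(dtype=_NO_DEFAULT):
    """Return the dtype to use, warning if the caller did not pass one."""
    if dtype is _NO_DEFAULT:
        warnings.warn(
            "the default dtype will change from 'uint8' to 'bool', "
            "please specify a dtype to silence this warning",
            FutureWarning,
            stacklevel=2,
        )
        return np.dtype(np.uint8)
    return np.dtype(dtype)


# Omitting dtype triggers the warning; passing one explicitly does not.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = resolve_dummy_dtype()
    explicit = resolve_dummy_dtype(bool)
```

A sentinel default lets the function tell "dtype was not passed" apart from any value the caller might pass on purpose, which is why it is suggested over dtype=None.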

Sounds like there's agreement on changing the default type away from uint8, we can always revisit the message about what it'll be changed to in the PR review

If anyone wants to work on this, here's the contributing guide, and feel free to ask for help if anything's unclear

@WillAyd
Member

WillAyd commented Aug 4, 2022

Probably in the minority but I think uint8 is a natural return type. bool would also be ok

Int64 and float are pretty heavy handed - I think memory usage is really important here.
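
For a rough sense of the memory argument, this sketch compares per-value storage for the dtypes discussed in the thread. The array shape is an arbitrary illustration, not a figure from the discussion.

```python
import numpy as np

rows, cols = 1_000_000, 10  # illustrative dense shape, chosen arbitrarily

# Bytes needed to store one value of each candidate dtype.
sizes = {name: np.dtype(name).itemsize
         for name in ("bool", "uint8", "int8", "int64", "float64")}

for name, itemsize in sizes.items():
    total_mb = rows * cols * itemsize / 1e6
    print(f"{name:>8}: {itemsize} byte(s) per value, {total_mb:.0f} MB in total")
```

bool, uint8 and int8 all use one byte per value, so moving between them costs no memory, whereas int64 and float64 use eight.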

@Dev-Khant

Hi everyone, I am starting in Open Source and willing to contribute. Can I work on this issue and can anyone help me get started?

@MarcoGorelli
Member

Hey - thanks, but there's already a PR open

@Dev-Khant

OK, can you suggest any beginner issues for me to work on?

mroeschke pushed a commit that referenced this issue Oct 11, 2022
* ENH: Warn when dtype is not passed to get_dummies

* Edit get_dummies' dtype warning

* Add whatsnew entry for issue #45848

* Fix dtype warning test

* Suppress warnings in docs

* Edit whatsnew entry

Co-authored-by: Marco Edward Gorelli

* Fix find_stack_level in get_dummies dtype warning

* Change the default dtype of get_dummies to bool

* Revert dtype(bool) change

* Move the changelog entry to v1.6.0.rst

* Move whatsnew entry to 'Other API changes'

Co-authored-by: Marco Edward Gorelli
@wany-oh
Contributor

wany-oh commented Oct 21, 2022

Would this change also apply to Series.str.get_dummies()?

@MarcoGorelli
Member

that's a good point, .str.get_dummies still defaults to int64 - want to open a separate issue about changing that to bool too?
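
A quick way to check what Series.str.get_dummies returns on a given installation, and to convert it to bool in the meantime. The sample strings are made up, and the printed dtype depends on the pandas version in use.

```python
import pandas as pd

tags = pd.Series(["a|b", "a", "b|c"])

# Series.str.get_dummies splits each string on "|" by default.
str_dummies = tags.str.get_dummies()
print(str_dummies.dtypes)

# Explicit conversion for callers who want boolean indicators today.
as_bool = str_dummies.astype(bool)
print(as_bool)
```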

noatamir pushed a commit to noatamir/pandas that referenced this issue Nov 9, 2022