ENH: Allow different dtype in pandas.Series.str.get_dummies #47872

Closed
1 of 3 tasks
JeffersonQin opened this issue Jul 27, 2022 · 1 comment · Fixed by #59577
Labels: Enhancement; Performance (memory or execution speed performance); Strings (string extension data type and string data)

Comments

@JeffersonQin

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.Series.str.get_dummies always returns columns of dtype numpy.int64. It would be useful if other dtypes could be specified.

Feature Description

Add a new parameter to str.get_dummies (for example, dtype) that sets the dtype of the returned columns.

Alternative Solutions

N/A

Additional Context

Since pandas.Series.str.get_dummies is the easiest way in pandas to implement multi-hot encoding, it would be great if more data types were supported. The int64 used now can easily cause out-of-memory (OOM) errors on large inputs. Running into exactly this problem is what prompted me to request the feature:

Traceback (most recent call last):
  File "D:\CodeSpace\comp9727-assn2\preprocessing.py", line 13, in <module>
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 101, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\accessor.py", line 1919, in get_dummies
    result, name = self._data.array._str_get_dummies(sep)
  File "C:\Users\JeffersonQin\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\strings\object_array.py", line 369, in _str_get_dummies
    dummies = np.empty((len(arr), len(tags2)), dtype=np.int64)
numpy.core._exceptions.MemoryError: Unable to allocate 25.8 GiB for an array with shape (231637, 14942) and data type int64
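
The 25.8 GiB in the error message follows directly from the array shape and the element size. The sketch below redoes that arithmetic and compares it with a 1-byte element type; the helper name `dense_gib` is made up for illustration, and the item sizes (8 bytes for int64, 1 byte for uint8 and bool) are the standard NumPy ones.

```python
# Shape taken from the MemoryError in the traceback above.
ROWS, COLS = 231637, 14942

# Bytes per element for the dtypes being compared.
ITEMSIZE = {"int64": 8, "uint8": 1, "bool": 1}


def dense_gib(rows: int, cols: int, itemsize: int) -> float:
    """Size in GiB of a dense rows x cols array with the given element size."""
    return rows * cols * itemsize / 2**30


for name, size in ITEMSIZE.items():
    print(f"{name}: {dense_gib(ROWS, COLS, size):.1f} GiB")
# int64: 25.8 GiB
# uint8: 3.2 GiB
# bool: 3.2 GiB
```

A 1-byte dtype would cut the allocation by a factor of eight, although about 3.2 GiB is still needed for a dense result of this shape.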

@JeffersonQin added the Enhancement and Needs Triage labels on Jul 27, 2022
@mzeitlin11
Member

Thanks for the request @JeffersonQin! Sounds very reasonable to me - especially since pd.get_dummies already accepts a dtype argument (and defaults to uint8, so defaulting to int64 in str.get_dummies is potentially unexpected behavior).
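
To illustrate the existing `dtype` argument of pd.get_dummies mentioned here, the following is a minimal sketch that reproduces the output of str.get_dummies through split, explode and pd.get_dummies. The sample Series and variable names are made up for illustration. This is not necessarily a memory fix: the exploded intermediate has one row per tag occurrence, so its peak size can exceed that of the final result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "|"-separated tags per row.
s = pd.Series(["a|b", "b|c", "a"])

# str.get_dummies as discussed in this issue (int64 columns at the time).
wide = s.str.get_dummies(sep="|")

# One row per tag occurrence; the original row label is repeated in the index.
exploded = s.str.split("|").explode()

# pd.get_dummies accepts dtype; collapse back to one row per original label.
compact = pd.get_dummies(exploded, dtype=np.uint8).groupby(level=0).max()

print(compact.dtypes)
print((compact.to_numpy() == wide.to_numpy()).all())
```

Passing `dtype` explicitly avoids relying on the default of pd.get_dummies, which has differed between pandas versions.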

@mzeitlin11 added the Performance and Strings labels and removed the Needs Triage label on Aug 4, 2022
@mzeitlin11 added this to the Contributions Welcome milestone on Aug 4, 2022
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@rhshadrach added this to the 3.0 milestone on Sep 9, 2024