
ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args #19618


Open
randomgambit opened this issue Feb 9, 2018 · 9 comments
Labels
Categorical (Categorical Data Type), Enhancement, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Strings (String extension data type and string data)

Comments

@randomgambit

Hello pandas team, and thanks for making this package better day after day.

I was using the str.get_dummies method on a DataFrame column and realized that by default the dummies are encoded as int64.

This seems very inefficient to me: I ran into a memory error when trying to get dummies for a DataFrame with several million rows (and about 5k dummies). I had to create the dummies in chunks and use to_numeric() to coerce them to int8.

Would it be possible to natively have the dummies in int8 format so that they take very little space? In that case NaN would be coerced to 0 but that should be fine.

What do you think?
Thanks!
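
A minimal sketch of the chunked workaround described above, for readers who hit the same memory error. The helper name `chunked_str_dummies`, its parameters, and the two-pass design (collect the vocabulary first, then encode chunk by chunk) are assumptions for illustration, not code from this thread; it uses `astype("int8")` rather than `to_numeric()` for the downcast.

```python
import pandas as pd


def chunked_str_dummies(series, sep="|", chunk_size=100_000):
    """Hypothetical helper: build int8 dummies chunk by chunk."""
    # Pass 1: collect every distinct token so all chunks share the same columns.
    tokens = series.str.split(sep).explode().dropna().unique()
    columns = sorted(t for t in tokens if t != "")

    # Pass 2: encode each chunk, align it to the full vocabulary, downcast.
    parts = []
    for start in range(0, len(series), chunk_size):
        chunk = series.iloc[start:start + chunk_size]
        dummies = chunk.str.get_dummies(sep=sep)
        dummies = dummies.reindex(columns=columns, fill_value=0)
        parts.append(dummies.astype("int8"))
    return pd.concat(parts)


s = pd.Series(["a;b", "b;c", None, "a"])
out = chunked_str_dummies(s, sep=";", chunk_size=2)
```

Aligning every chunk to the same columns before concatenating avoids the NaN-filled float columns that `pd.concat` would otherwise create when chunks see different tokens. Missing values end up as all-zero rows, as suggested above.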

@jreback
Contributor

jreback commented Feb 9, 2018

what actually should happen is that .str.get_dummies should just call and defer to pd.get_dummies

which already does all of this:

Signature: pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Docstring:
Convert categorical variable into dummy/indicator variables

Parameters
----------
data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None
    String to append DataFrame column names
    Pass a list with length equal to the number of columns
    when calling get_dummies on a DataFrame. Alternatively, `prefix`
    can be a dictionary mapping column names to prefixes.
prefix_sep : string, default '_'
    If appending prefix, separator/delimiter to use. Or pass a
    list or dictionary as with `prefix.`
dummy_na : bool, default False
    Add a column to indicate NaNs, if False NaNs are ignored.
columns : list-like, default None
    Column names in the DataFrame to be encoded.
    If `columns` is None then all the columns with
    `object` or `category` dtype will be converted.
sparse : bool, default False
    Whether the dummy columns should be sparse or not.  Returns
    SparseDataFrame if `data` is a Series or if all columns are included.
    Otherwise returns a DataFrame with some SparseBlocks.
drop_first : bool, default False
    Whether to get k-1 dummies out of k categorical levels by removing the
    first level.

    .. versionadded:: 0.18.0

dtype : dtype, default np.uint8
    Data type for new columns. Only a single dtype is allowed.

    .. versionadded:: 0.23.0

Returns
-------
dummies : DataFrame or SparseDataFrame

@jreback jreback changed the title Series.str.get_dummies() use smallest int type? ENH: Series.str.get_dummies should defer to pd.get_dummies and pass thru args Feb 9, 2018
@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Strings String extension data type and string data Categorical Categorical Data Type Difficulty Intermediate labels Feb 9, 2018
@jreback jreback added this to the Next Major Release milestone Feb 9, 2018
@jreback
Contributor

jreback commented Feb 9, 2018

Want to do a PR? This is pretty straightforward.

@jreback
Contributor

jreback commented Feb 9, 2018

pd.get_dummies already does this by default

In [1]: s = pd.Series(list('aabbcdefga'))

In [2]: s.str.get_dummies()
Out[2]: 
   a  b  c  d  e  f  g
0  1  0  0  0  0  0  0
1  1  0  0  0  0  0  0
2  0  1  0  0  0  0  0
3  0  1  0  0  0  0  0
4  0  0  1  0  0  0  0
5  0  0  0  1  0  0  0
6  0  0  0  0  1  0  0
7  0  0  0  0  0  1  0
8  0  0  0  0  0  0  1
9  1  0  0  0  0  0  0

In [3]: pd.get_dummies(s)
Out[3]: 
   a  b  c  d  e  f  g
0  1  0  0  0  0  0  0
1  1  0  0  0  0  0  0
2  0  1  0  0  0  0  0
3  0  1  0  0  0  0  0
4  0  0  1  0  0  0  0
5  0  0  0  1  0  0  0
6  0  0  0  0  1  0  0
7  0  0  0  0  0  1  0
8  0  0  0  0  0  0  1
9  1  0  0  0  0  0  0

In [4]: pd.get_dummies(s).dtypes
Out[4]: 
a    uint8
b    uint8
c    uint8
d    uint8
e    uint8
f    uint8
g    uint8
dtype: object

In [5]: s.str.get_dummies().dtypes
Out[5]: 
a    int64
b    int64
c    int64
d    int64
e    int64
f    int64
g    int64
dtype: object
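
To put a number on the difference between the two dtypes shown above, here is a small sketch that measures the data footprint of each result. The dtypes are set explicitly (an assumption made so the comparison does not depend on the defaults of whichever pandas version is installed), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("aabbcdefga"))

# Force the two dtypes from the session above, independent of version defaults.
wide = s.str.get_dummies().astype(np.int64)
narrow = pd.get_dummies(s, dtype=np.uint8)

# Bytes used by the column data only (index excluded).
wide_bytes = int(wide.memory_usage(index=False).sum())
narrow_bytes = int(narrow.memory_usage(index=False).sum())

print(wide_bytes, narrow_bytes)  # 10 rows * 7 columns * 8 bytes vs. * 1 byte
```

The 8x ratio is what turns a few million rows with about 5k dummy columns into a memory error with int64.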

@randomgambit
Author

Thanks @jreback. I think the issue is that the str method lets me get dummies in the very common situation where each string holds several values joined by a separator.

Example

df = pd.DataFrame({'mystring' : ['JEFF;REBACK;PANDAS',
                                 'JEFFERSON;REBACKSON;PANDAS']}) 

df
Out[17]: 
                     mystring
0          JEFF;REBACK;PANDAS
1  JEFFERSON;REBACKSON;PANDAS

Now,

df.mystring.str.get_dummies(sep = ';')
Out[18]: 
   JEFF  JEFFERSON  PANDAS  REBACK  REBACKSON
0     1          0       1       1          0
1     0          1       1       0          1

while pd.get_dummies can't do that.

pd.get_dummies(df,prefix_sep = ';')
Out[19]: 
   mystring;JEFF;REBACK;PANDAS  mystring;JEFFERSON;REBACKSON;PANDAS
0                            1                                    0
1                            0                                    1

Thanks

@jreback
Contributor

jreback commented Feb 10, 2018

@randomgambit my point is that this can simply dispatch to the impl of get_dummies.
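
One way to picture this kind of dispatch, as a rough sketch only: split on the separator, explode to one token per row, let pd.get_dummies do the encoding (including its `dtype` argument), and collapse back to one row per original entry. The function name `str_get_dummies_via_pd` is hypothetical, this is not the pandas implementation or the code from any PR, and it assumes a unique index and no empty tokens.

```python
import numpy as np
import pandas as pd


def str_get_dummies_via_pd(series, sep="|", dtype=np.int8):
    """Hypothetical sketch: separator-aware dummies built on pd.get_dummies."""
    # One token per row; the original index label is repeated for each token.
    tokens = series.str.split(sep).explode()
    # pd.get_dummies handles the encoding and accepts a dtype.
    dummies = pd.get_dummies(tokens, dtype=dtype)
    # Collapse back to one row per original entry (assumes a unique index).
    return dummies.groupby(level=0).max()


df = pd.DataFrame({"mystring": ["JEFF;REBACK;PANDAS",
                                "JEFFERSON;REBACKSON;PANDAS"]})
result = str_get_dummies_via_pd(df["mystring"], sep=";")
```

On the example from the earlier comment this gives the same five columns as `df.mystring.str.get_dummies(sep=';')`, but with int8 values.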

@johnmalaty

@randomgambit did you find a solution to this issue? I have the same problem with my dataset.

@joshlk

joshlk commented Jun 6, 2019

Hi @jreback, I'm working on a fix that defers to pd.get_dummies in #26686.

@billtubbs

billtubbs commented Feb 6, 2022

FYI. Some people on Twitter are not happy that pd.get_dummies returns uint8 rather than int8. Would it be better for both functions/methods to return int8? (They say uint is dangerous because people might try to do some algebra with the uint8s instead of converting to floats first).
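
A small illustration of the concern, assuming the dtype is requested explicitly so the result does not depend on the installed version's default: subtracting one uint8 dummy column from another wraps around instead of going negative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b"])

unsigned = pd.get_dummies(s, dtype=np.uint8)
signed = pd.get_dummies(s, dtype=np.int8)

# Row 1 computes 0 - 1: uint8 wraps to 255, int8 gives -1.
diff_unsigned = unsigned["a"] - unsigned["b"]
diff_signed = signed["a"] - signed["b"]

print(diff_unsigned.tolist())
print(diff_signed.tolist())
```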

@MarcoGorelli
Member

@billtubbs that feels like a separate issue, could you open a new one specifically about the get_dummies return type please?


8 participants