ENH: allow get_dummies to accept dtype argument #18330
Changes from all commits
a333fa9
f84f83e
2737069
c412dae
b869afe
7038b31
c412be0
769b3b6
20556f2
b3ec885
9e5d0bb
b8ab365
9db17f2
ef7a473
67d346d
367e753
4e47860
bf8327c
f3abd2b
649d303
a7a60b7
bc192fd
d19d81f
158a317
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -240,7 +240,7 @@ values will be set to ``NaN``. | |
df3 | ||
df3.unstack() | ||
.. versionadded: 0.18.0 | ||
.. versionadded:: 0.18.0 | ||
|
||
Alternatively, unstack takes an optional ``fill_value`` argument, for specifying | ||
the value of missing data. | ||
|
@@ -634,6 +634,17 @@ When a column contains only one level, it will be omitted in the result. | |
pd.get_dummies(df, drop_first=True) | ||
By default new columns will have ``np.uint8`` dtype. To choose another dtype use ``dtype`` argument: | ||
|
||
.. ipython:: python | ||
df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]}) | ||
pd.get_dummies(df, dtype=bool).dtypes | ||
.. versionadded:: 0.22.0 | ||
can you move this to before the example? |
||
|
||
|
||
.. _reshaping.factorize: | ||
|
||
Factorizing values | ||
|
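The documentation hunk above describes the new ``dtype`` argument. As a minimal sketch of the behavior it documents (assuming a pandas release that includes this change; the variable name `dummies` is only illustrative), the requested dtype applies to the generated indicator columns, while columns that are not encoded keep their own dtype:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})

# Only the indicator columns built from 'A' take the requested dtype;
# the numeric column 'B' is passed through unchanged.
dummies = pd.get_dummies(df, dtype=bool)

print(dummies.dtypes)
```

Passing ``dtype`` explicitly, as here, keeps the result independent of whatever default a given pandas version uses.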
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,7 +10,7 @@ | |
from pandas.core.dtypes.common import ( | ||
_ensure_platform_int, | ||
is_list_like, is_bool_dtype, | ||
needs_i8_conversion, is_sparse) | ||
needs_i8_conversion, is_sparse, is_object_dtype) | ||
from pandas.core.dtypes.cast import maybe_promote | ||
from pandas.core.dtypes.missing import notna | ||
|
||
|
@@ -697,7 +697,7 @@ def _convert_level_number(level_num, columns): | |
|
||
|
||
def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | ||
columns=None, sparse=False, drop_first=False): | ||
columns=None, sparse=False, drop_first=False, dtype=None): | ||
Any reason not to use
I've tried to mirror the API of DataFrame, Series, Panel etc., where passing None explicitly is allowed and means "dtype will be inferred".
@jreback @TomAugspurger So this is the last question to answer. Do you accept my argument about None or should I change it to np.uint8?
this is ok here, it follows a similar style elsewhere |
||
""" | ||
Convert categorical variable into dummy/indicator variables | ||
|
@@ -728,6 +728,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |
.. versionadded:: 0.18.0 | ||
dtype : dtype, default np.uint8 | ||
Can we also accept arguments to
he is using
Yes, that should work already, I'll add it to the tests. |
||
Data type for new columns. Only a single dtype is allowed. | ||
.. versionadded:: 0.22.0 | ||
Returns | ||
------- | ||
dummies : DataFrame or SparseDataFrame | ||
|
@@ -783,6 +788,12 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |
3 0 0 | ||
4 0 0 | ||
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float) | ||
a b c | ||
0 1.0 0.0 0.0 | ||
1 0.0 1.0 0.0 | ||
2 0.0 0.0 1.0 | ||
See Also | ||
-------- | ||
Series.str.get_dummies | ||
|
@@ -835,20 +846,29 @@ def check_len(item, name): | |
|
||
dummy = _get_dummies_1d(data[col], prefix=pre, prefix_sep=sep, | ||
dummy_na=dummy_na, sparse=sparse, | ||
drop_first=drop_first) | ||
drop_first=drop_first, dtype=dtype) | ||
with_dummies.append(dummy) | ||
result = concat(with_dummies, axis=1) | ||
else: | ||
result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na, | ||
sparse=sparse, drop_first=drop_first) | ||
sparse=sparse, | ||
drop_first=drop_first, | ||
dtype=dtype) | ||
return result | ||
|
||
|
||
def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False, | ||
sparse=False, drop_first=False): | ||
sparse=False, drop_first=False, dtype=None): | ||
# Series avoids inconsistent NaN handling | ||
codes, levels = _factorize_from_iterable(Series(data)) | ||
|
||
if dtype is None: | ||
dtype = np.uint8 | ||
dtype = np.dtype(dtype) | ||
|
||
if is_object_dtype(dtype): | ||
raise ValueError("dtype=object is not a valid dtype for get_dummies") | ||
|
||
def get_empty_Frame(data, sparse): | ||
if isinstance(data, Series): | ||
index = data.index | ||
|
@@ -903,18 +923,18 @@ def get_empty_Frame(data, sparse): | |
sp_indices = sp_indices[1:] | ||
dummy_cols = dummy_cols[1:] | ||
for col, ixs in zip(dummy_cols, sp_indices): | ||
sarr = SparseArray(np.ones(len(ixs), dtype=np.uint8), | ||
sarr = SparseArray(np.ones(len(ixs), dtype=dtype), | ||
sparse_index=IntIndex(N, ixs), fill_value=0, | ||
dtype=np.uint8) | ||
dtype=dtype) | ||
sparse_series[col] = SparseSeries(data=sarr, index=index) | ||
|
||
out = SparseDataFrame(sparse_series, index=index, columns=dummy_cols, | ||
default_fill_value=0, | ||
dtype=np.uint8) | ||
dtype=dtype) | ||
return out | ||
|
||
else: | ||
dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=0) | ||
dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0) | ||
|
||
if not dummy_na: | ||
# reset NaN GH4446 | ||
|
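The dtype handling added to `_get_dummies_1d` in the hunks above can be illustrated outside pandas. The following is a rough sketch only: `resolve_dummy_dtype` and `one_hot` are hypothetical names introduced for illustration, not pandas functions. It mirrors the three steps in the diff (default `None` to `np.uint8`, normalize through `np.dtype`, reject `object`) and the dense `np.eye(...).take(codes, axis=0)` construction:

```python
import numpy as np


def resolve_dummy_dtype(dtype=None):
    """Hypothetical helper mirroring the dtype checks shown in the diff."""
    if dtype is None:
        # None means "use the default", as discussed in the review thread.
        dtype = np.uint8
    # np.dtype also accepts strings such as 'float32' and Python types.
    dtype = np.dtype(dtype)
    if dtype == np.dtype(object):
        raise ValueError("dtype=object is not a valid dtype for get_dummies")
    return dtype


def one_hot(codes, number_of_cols, dtype=None):
    """Hypothetical helper: pick one identity-matrix row per integer code."""
    dtype = resolve_dummy_dtype(dtype)
    return np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)


codes = np.array([0, 2, 1, 0])
mat = one_hot(codes, 3, dtype='float32')
print(mat.dtype)
print(mat)
```

Normalizing once at the top means every later use (the sparse arrays and the dense matrix) sees the same validated dtype.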
This was a typo, right? There are a couple more places where the second colon is missing:
hmm yes, looks that way. would be great if you can update those! (if you really want to, you could also add a lint rule to search for these and fail the build if they are found; the doc dir should be checked too).
Separate PR or this will do?
separate PR prob better. (the one you changed already is fine). I think we DO want to add some more generic checks for these formatting tags, I guess sphinx doesn't complain
Yeah, it's just comments for sphinx. I'll create an issue then, and see what I can do when I have time to look into it. Or somebody will pick it up before that, which is also fine :)
#18425
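The lint rule suggested in this thread could look roughly like the sketch below. This is an assumption about one possible check, not the rule that was eventually adopted: the name `find_bad_directives` and the list of directives are hypothetical. It flags reStructuredText directives written with a single colon, which Sphinx silently treats as comments:

```python
import re

# Directive names to check; this list is an assumption for illustration.
DIRECTIVES = ("versionadded", "versionchanged", "deprecated")

# Matches ".. versionadded: 0.18.0" but not ".. versionadded:: 0.18.0".
BAD_DIRECTIVE = re.compile(r"^\s*\.\.\s+(%s):(?!:)" % "|".join(DIRECTIVES))


def find_bad_directives(text):
    """Return (line_number, line) pairs for single-colon directives."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if BAD_DIRECTIVE.match(line)
    ]


sample = "\n".join([
    "df3.unstack()",
    ".. versionadded: 0.18.0",
    ".. versionadded:: 0.18.0",
])
problems = find_bad_directives(sample)
for number, line in problems:
    print("line %d: %s" % (number, line))
```

A build script could exit with a non-zero status whenever `problems` is non-empty, which would fail the build as proposed above.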