ENH: allow get_dummies to accept dtype argument #18330

Merged
merged 24 commits into from
Nov 22, 2017
Commits
a333fa9
TST: get_dummies dtype tests
Scorpil Nov 15, 2017
f84f83e
ENH: add dtype argument to get_dummies
Scorpil Nov 15, 2017
2737069
CLN: clean up lint errors
Scorpil Nov 16, 2017
c412dae
DOC: update whatsnew
Scorpil Nov 16, 2017
b869afe
DOC: improve get_dummies dtype documentation
Scorpil Nov 17, 2017
7038b31
TST: change get_dummies test setup
Scorpil Nov 18, 2017
c412be0
DOC: more info for dtype argument of get_dummies in whatsnew
Scorpil Nov 19, 2017
769b3b6
ENH: raise TypeError for object dtype on get_dummies
Scorpil Nov 19, 2017
20556f2
TST: better tests for get_dummies dtype
Scorpil Nov 19, 2017
b3ec885
CLN: cleanup reshape test style
Scorpil Nov 19, 2017
9e5d0bb
DOC: fix wording in whatsnew for get_dummies dtype argument
Scorpil Nov 19, 2017
b8ab365
CLN: Raise ValueError on invalid dtype
Scorpil Nov 19, 2017
9db17f2
TST: remove fixtures where not needed
Scorpil Nov 19, 2017
ef7a473
TST: remove dtype fixture from subset test
Scorpil Nov 19, 2017
67d346d
TST: fix bug in get_dummy tests under python3
Scorpil Nov 20, 2017
367e753
TST: Remove dtype fixture where not needed
Scorpil Nov 20, 2017
4e47860
CLN: move dtype logic to internal function in get_dummies
Scorpil Nov 20, 2017
bf8327c
DOC: add ref to get_dummies entry in whatsnew
Scorpil Nov 20, 2017
f3abd2b
DOC: remove extra space in whatsnew
Scorpil Nov 21, 2017
649d303
TST: change dtype on expected output instead of input
Scorpil Nov 21, 2017
a7a60b7
DOC: update whatsnew, change test name
Scorpil Nov 22, 2017
bc192fd
DOC: add get_dummies dtype argument description to reshaping.rst
Scorpil Nov 22, 2017
d19d81f
DOC: update whatsnew style, minore codestyle change
Scorpil Nov 22, 2017
158a317
DOC: fix typo and trigger tests
Scorpil Nov 22, 2017

13 changes: 12 additions & 1 deletion doc/source/reshaping.rst
@@ -240,7 +240,7 @@ values will be set to ``NaN``.
df3
df3.unstack()

.. versionadded: 0.18.0
.. versionadded:: 0.18.0
Contributor Author

This was a typo, right? There are a couple more places where the second colon is missing:

pandas/core/frame.py
4516:            .. versionadded: 0.18.0
4679:            .. versionadded: 0.16.1

pandas/core/generic.py
968:            .. versionadded: 0.21.0

pandas/core/series.py
1629:            .. versionadded: 0.19.0
2216:            .. versionadded: 0.18.0

pandas/core/tools/datetimes.py
117:        .. versionadded: 0.18.1
143:        .. versionadded: 0.16.1
181:        .. versionadded: 0.20.0
187:        .. versionadded: 0.22.0

pandas/tseries/offsets.py
778:    .. versionadded: 0.16.1
882:    .. versionadded: 0.18.1

Contributor

Hmm, yes, looks that way. It would be great if you could update those! (If you really want to, you could also add a lint rule that searches for these and fails the build if any are found, including in the doc directory.)

Contributor Author

A separate PR, or will this one do?

Contributor

A separate PR is probably better (the one you already changed is fine). I think we DO want to add some more generic checks for these formatting tags; I guess Sphinx doesn't complain.

Contributor Author

Yeah, to Sphinx these are just comments. I'll create an issue then, and see what I can do when I have time to look into it. Or somebody will pick it up before that, which is also fine :)
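
The lint rule suggested in this thread could look roughly like the sketch below. It is only an illustration, not the check pandas ended up using: the function name `find_malformed_directives` and the set of directive names in the pattern are assumptions made for this example.

```python
import re

# Matches a directive name followed by exactly one colon, e.g.
# ".. versionadded: 0.18.0". The directive list is an assumption for this sketch.
MALFORMED_DIRECTIVE = re.compile(
    r"^\s*\.\. (versionadded|versionchanged|deprecated):(?!:)"
)


def find_malformed_directives(text):
    """Return (line_number, stripped_line) pairs for single-colon directives."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if MALFORMED_DIRECTIVE.match(line):
            hits.append((lineno, line.strip()))
    return hits


sample = """\
    inplace : bool
        whether to modify `self` directly or return a copy

        .. versionadded: 0.21.0

    other : int
        .. versionadded:: 0.22.0
"""

# Only the single-colon line is reported; the well-formed one is ignored.
print(find_malformed_directives(sample))
```

A build script could run such a function over every `.py` and `.rst` file and exit non-zero when the returned list is non-empty.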

Alternatively, unstack takes an optional ``fill_value`` argument, for specifying
the value of missing data.
@@ -634,6 +634,17 @@ When a column contains only one level, it will be omitted in the result.

pd.get_dummies(df, drop_first=True)

By default new columns will have ``np.uint8`` dtype. To choose another dtype use ``dtype`` argument:

.. ipython:: python

df = pd.DataFrame({'A': list('abc'), 'B': [1.1, 2.2, 3.3]})

pd.get_dummies(df, dtype=bool).dtypes

.. versionadded:: 0.22.0
Contributor

Can you move this to before the example?



.. _reshaping.factorize:

Factorizing values
15 changes: 15 additions & 0 deletions doc/source/whatsnew/v0.22.0.txt
@@ -17,6 +17,21 @@ New features
-
-


.. _whatsnew_0210.enhancements.get_dummies_dtype:

``get_dummies`` now supports ``dtype`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`get_dummies` now accepts a ``dtype`` argument, which specifies a dtype for the new columns. The default remains uint8. (:issue:`18330`)

.. ipython:: python

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
pd.get_dummies(df, columns=['c']).dtypes
pd.get_dummies(df, columns=['c'], dtype=bool).dtypes


.. _whatsnew_0220.enhancements.other:

Other Enhancements
2 changes: 1 addition & 1 deletion pandas/core/generic.py
@@ -965,7 +965,7 @@ def _set_axis_name(self, name, axis=0, inplace=False):
inplace : bool
whether to modify `self` directly or return a copy

.. versionadded: 0.21.0
.. versionadded:: 0.21.0

Returns
-------
38 changes: 29 additions & 9 deletions pandas/core/reshape/reshape.py
@@ -10,7 +10,7 @@
from pandas.core.dtypes.common import (
_ensure_platform_int,
is_list_like, is_bool_dtype,
needs_i8_conversion, is_sparse)
needs_i8_conversion, is_sparse, is_object_dtype)
from pandas.core.dtypes.cast import maybe_promote
from pandas.core.dtypes.missing import notna

@@ -697,7 +697,7 @@ def _convert_level_number(level_num, columns):


def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
columns=None, sparse=False, drop_first=False):
columns=None, sparse=False, drop_first=False, dtype=None):
Contributor

Any reason not to use 'uint8' or np.uint8 here?

Contributor Author

I've tried to mirror the API of DataFrame, Series, Panel, etc., where passing None explicitly is allowed and means "dtype will be inferred".

Contributor Author

@jreback @TomAugspurger So this is the last question to answer. Do you accept my argument about None or should I change it to np.uint8?

Contributor

This is OK here; it follows a similar style elsewhere.
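
For readers following this thread, the `None`-as-default convention can be sketched as below. The helper name `resolve_dummy_dtype` is hypothetical and exists only for this illustration. It mirrors the first lines this PR adds to `_get_dummies_1d`, where `None` falls back to `np.uint8` and the result is normalised with `np.dtype`.

```python
import numpy as np


def resolve_dummy_dtype(dtype=None):
    """Hypothetical helper: map the public ``dtype`` argument to a NumPy dtype."""
    if dtype is None:
        # None means "the caller did not choose", so use the documented default.
        dtype = np.uint8
    return np.dtype(dtype)


print(resolve_dummy_dtype())       # uint8
print(resolve_dummy_dtype(None))   # uint8
print(resolve_dummy_dtype(float))  # float64
```

Keeping `None` in the signature lets callers forward an optional `dtype` unchanged, at the cost of the real default being visible only in the docstring.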

"""
Convert categorical variable into dummy/indicator variables

@@ -728,6 +728,11 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,

.. versionadded:: 0.18.0

dtype : dtype, default np.uint8
Contributor

Can we also accept arguments to np.dtype like the string 'i8', and handle those appropriately?

Contributor

he is using np.dtype()?

Contributor Author

Yes, that should work already, I'll add it to the tests.
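
As a quick illustration of why string arguments such as `'i8'` already work: `np.dtype` accepts type strings, Python types, and NumPy scalar types alike and returns the same kind of dtype object for each. The list of spellings below is chosen for this example only.

```python
import numpy as np

# Several spellings a caller might pass as ``dtype``.
specs = ['i8', 'uint8', bool, 'f4', np.float64, 'O']

# np.dtype() normalises each spelling to a dtype object.
normalised = [np.dtype(spec).name for spec in specs]
print(normalised)
# ['int64', 'uint8', 'bool', 'float32', 'float64', 'object']

# 'O' and the builtin ``object`` resolve to the same dtype, so a check made
# after normalisation covers both spellings.
print(np.dtype('O') == np.dtype(object))  # True
```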

Data type for new columns. Only a single dtype is allowed.

.. versionadded:: 0.22.0

Returns
-------
dummies : DataFrame or SparseDataFrame
@@ -783,6 +788,12 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False,
3 0 0
4 0 0

>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0

See Also
--------
Series.str.get_dummies
@@ -835,20 +846,29 @@ def check_len(item, name):

dummy = _get_dummies_1d(data[col], prefix=pre, prefix_sep=sep,
dummy_na=dummy_na, sparse=sparse,
drop_first=drop_first)
drop_first=drop_first, dtype=dtype)
with_dummies.append(dummy)
result = concat(with_dummies, axis=1)
else:
result = _get_dummies_1d(data, prefix, prefix_sep, dummy_na,
sparse=sparse, drop_first=drop_first)
sparse=sparse,
drop_first=drop_first,
dtype=dtype)
return result


def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False,
sparse=False, drop_first=False):
sparse=False, drop_first=False, dtype=None):
# Series avoids inconsistent NaN handling
codes, levels = _factorize_from_iterable(Series(data))

if dtype is None:
dtype = np.uint8
dtype = np.dtype(dtype)

if is_object_dtype(dtype):
raise ValueError("dtype=object is not a valid dtype for get_dummies")

def get_empty_Frame(data, sparse):
if isinstance(data, Series):
index = data.index
@@ -903,18 +923,18 @@ def get_empty_Frame(data, sparse):
sp_indices = sp_indices[1:]
dummy_cols = dummy_cols[1:]
for col, ixs in zip(dummy_cols, sp_indices):
sarr = SparseArray(np.ones(len(ixs), dtype=np.uint8),
sarr = SparseArray(np.ones(len(ixs), dtype=dtype),
sparse_index=IntIndex(N, ixs), fill_value=0,
dtype=np.uint8)
dtype=dtype)
sparse_series[col] = SparseSeries(data=sarr, index=index)

out = SparseDataFrame(sparse_series, index=index, columns=dummy_cols,
default_fill_value=0,
dtype=np.uint8)
dtype=dtype)
return out

else:
dummy_mat = np.eye(number_of_cols, dtype=np.uint8).take(codes, axis=0)
dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)

if not dummy_na:
# reset NaN GH4446
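
The dense branch above builds the indicator matrix with `np.eye(...).take(codes, axis=0)`. The standalone sketch below shows that step in isolation; the `codes` array and the choice of `np.float32` are made up for the example, and missing-value handling is left out.

```python
import numpy as np

# Integer codes for the values ['a', 'b', 'a', 'c'] with levels ['a', 'b', 'c'].
codes = np.array([0, 1, 0, 2])
number_of_cols = 3

# Row i of the identity matrix is the indicator vector for level i, so taking
# rows by code yields one indicator row per input value, already in the
# requested dtype.
dummy_mat = np.eye(number_of_cols, dtype=np.float32).take(codes, axis=0)

print(dummy_mat)
print(dummy_mat.dtype)  # float32
```

Because the dtype is applied when the identity matrix is created, no separate cast of the result is needed.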