API: Always return DataFrame from get_dummies #24284
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -708,6 +708,51 @@ Finally, a ``Series.sparse`` accessor was added to provide sparse-specific metho | |
s = pd.Series([0, 0, 1, 1, 1], dtype='Sparse[int]') | ||
s.sparse.density | ||
|
||
.. _whatsnew_0240.api_breaking.get_dummies: | ||
|
||
:func:`get_dummies` always returns a DataFrame | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Previously, when ``sparse=True`` was passed to :func:`get_dummies`, the return value could be either | ||
a :class:`DataFrame` or a :class:`SparseDataFrame`, depending on whether all or just a subset | ||
of the columns were dummy-encoded. Now, a :class:`DataFrame` is always returned. | ||
|
||
*Previous Behavior* | ||
|
||
The first :func:`get_dummies` call returned a :class:`DataFrame` because the column ``A`` | ||
was not dummy-encoded. When just ``["B", "C"]`` were passed to ``get_dummies``, | ||
all the columns were dummy-encoded, and a :class:`SparseDataFrame` was returned. | ||
|
||
.. ipython:: python | ||
|
||
In [2]: df = pd.DataFrame({"A": [1, 2], "B": ['a', 'b'], "C": ['a', 'a']}) | ||
|
||
In [3]: type(pd.get_dummies(df, sparse=True)) | ||
Out[3]: pandas.core.frame.DataFrame | ||
|
||
In [4]: type(pd.get_dummies(df[['B', 'C']], sparse=True)) | ||
Out[4]: pandas.core.sparse.frame.SparseDataFrame | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
df = pd.DataFrame({"A": [1, 2], "B": ['a', 'b'], "C": ['a', 'a']}) | ||
|
||
*New Behavior* | ||
|
||
Now, the return type is consistently a :class:`DataFrame`. | ||
|
||
.. ipython:: python | ||
|
||
type(pd.get_dummies(df, sparse=True)) | ||
type(pd.get_dummies(df[['B', 'C']], sparse=True)) | ||
|
||
.. note:: | ||
|
||
There's no difference in memory usage between a :class:`SparseDataFrame` | ||
Review comment: you might need to update the existing docs slightly and/or change the usage in previous whatsnew notes. |
||
and a :class:`DataFrame` with sparse values. The memory usage will | ||
be the same as in the previous version of pandas. | ||
|
||
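The whatsnew entry above can be checked with a short script. This is a minimal sketch, assuming a pandas version that includes this change and exposes `pd.SparseDtype`; the helper `is_sparse_column` is a hypothetical name used only for illustration and is not part of the PR.

```python
import pandas as pd

# Same frame as in the whatsnew example.
df = pd.DataFrame({"A": [1, 2], "B": ["a", "b"], "C": ["a", "a"]})

# Column "A" is numeric, so it is passed through rather than dummy-encoded.
mixed = pd.get_dummies(df, sparse=True)

# Every column here is dummy-encoded.
all_dummies = pd.get_dummies(df[["B", "C"]], sparse=True)


def is_sparse_column(frame, name):
    """Hypothetical helper: True if the column is backed by a sparse dtype."""
    return isinstance(frame[name].dtype, pd.SparseDtype)


# Both results are plain DataFrames; only the dummy columns are sparse.
print(type(mixed).__name__, type(all_dummies).__name__)
print({name: is_sparse_column(mixed, name) for name in mixed.columns})
print({name: is_sparse_column(all_dummies, name) for name in all_dummies.columns})
```

The return type no longer depends on which columns are encoded; sparsity is a per-column property that can be inspected through each column's dtype.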
.. _whatsnew_0240.api_breaking.frame_to_dict_index_orient: | ||
|
||
Raise ValueError in ``DataFrame.to_dict(orient='index')`` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,6 @@ | |
from pandas.core.sorting import ( | ||
compress_group_index, decons_obs_group_ids, get_compressed_ids, | ||
get_group_index) | ||
from pandas.core.sparse.api import SparseDataFrame, SparseSeries | ||
|
||
|
||
class _Unstacker(object): | ||
|
@@ -706,9 +705,8 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |
If `columns` is None then all the columns with | ||
`object` or `category` dtype will be converted. | ||
sparse : bool, default False | ||
Whether the dummy columns should be sparse or not. Returns | ||
SparseDataFrame if `data` is a Series or if all columns are included. | ||
Otherwise returns a DataFrame with some SparseBlocks. | ||
Whether the dummy-encoded columns should be backed by | ||
a :class:`SparseArray` (True) or a regular NumPy array (False). | ||
drop_first : bool, default False | ||
Whether to get k-1 dummies out of k categorical levels by removing the | ||
first level. | ||
|
@@ -722,7 +720,7 @@ def get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, | |
|
||
Returns | ||
------- | ||
dummies : DataFrame or SparseDataFrame | ||
dummies : DataFrame | ||
|
||
See Also | ||
-------- | ||
|
@@ -865,19 +863,16 @@ def _get_dummies_1d(data, prefix, prefix_sep='_', dummy_na=False, | |
if is_object_dtype(dtype): | ||
raise ValueError("dtype=object is not a valid dtype for get_dummies") | ||
|
||
def get_empty_Frame(data, sparse): | ||
def get_empty_Frame(data): | ||
Review comment: lowercase this, maybe make it a module-level function. |
||
if isinstance(data, Series): | ||
index = data.index | ||
else: | ||
index = np.arange(len(data)) | ||
if not sparse: | ||
return DataFrame(index=index) | ||
else: | ||
return SparseDataFrame(index=index, default_fill_value=0) | ||
return DataFrame(index=index) | ||
|
||
# if all NaN | ||
if not dummy_na and len(levels) == 0: | ||
return get_empty_Frame(data, sparse) | ||
return get_empty_Frame(data) | ||
Review comment: did you mean to capitalize? (or was that there before) |
||
|
||
codes = codes.copy() | ||
if dummy_na: | ||
|
@@ -886,7 +881,7 @@ def get_empty_Frame(data, sparse): | |
|
||
# if dummy_na, we just fake a nan level. drop_first will drop it again | ||
if drop_first and len(levels) == 1: | ||
return get_empty_Frame(data, sparse) | ||
return get_empty_Frame(data) | ||
|
||
number_of_cols = len(levels) | ||
|
||
|
@@ -933,11 +928,10 @@ def _make_col_name(prefix, prefix_sep, level): | |
sarr = SparseArray(np.ones(len(ixs), dtype=dtype), | ||
sparse_index=IntIndex(N, ixs), fill_value=0, | ||
dtype=dtype) | ||
sparse_series[col] = SparseSeries(data=sarr, index=index) | ||
sparse_series[col] = Series(data=sarr, index=index) | ||
|
||
out = SparseDataFrame(sparse_series, index=index, columns=dummy_cols, | ||
default_fill_value=0, | ||
dtype=dtype) | ||
out = DataFrame(sparse_series, index=index, columns=dummy_cols, | ||
dtype=dtype) | ||
return out | ||
|
||
else: | ||
|
Review comment: code-block
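The last hunk replaces `SparseSeries`/`SparseDataFrame` with a regular `Series`/`DataFrame` holding `SparseArray` values. The sketch below illustrates that idea in isolation; it is not the PR's implementation. It assumes `pd.arrays.SparseArray` is available and builds each array from a dense indicator vector rather than from an `IntIndex`, as the real code does. The names `values`, `levels`, and `columns` are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative input: one categorical column with three levels.
values = np.array(["a", "b", "a", "c"])
levels = ["a", "b", "c"]
index = pd.RangeIndex(len(values))

columns = {}
for level in levels:
    # Dense 0/1 indicator for this level, stored sparsely with fill_value=0.
    dense = (values == level).astype(np.uint8)
    sarr = pd.arrays.SparseArray(dense, fill_value=0)
    # A regular Series backed by a SparseArray, in place of a SparseSeries.
    columns[level] = pd.Series(sarr, index=index)

# A regular DataFrame whose columns keep their sparse dtype.
out = pd.DataFrame(columns, index=index)
print(out.dtypes)
```

Because the sparse storage lives in the column values rather than in the container type, downstream code can treat `out` as an ordinary `DataFrame` and reach sparse-specific behavior through the `.sparse` accessor.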