Initial draft: from_dummies #41902

Merged on Jun 30, 2022 (121 commits)
Changes from 110 commits
Commits
f3e6afe
Initial draft: from_dummies
pckSF Jun 9, 2021
c7c5588
Clean-up tests with fixtures
pckSF Jun 9, 2021
d06540f
Make tests more elegant
pckSF Jun 14, 2021
1fa4e8a
Remove variable argument
pckSF Jun 22, 2021
c7f8ec8
Remove dummy_na argument
pckSF Jun 22, 2021
3cc98ca
Remove loop over df rows
pckSF Jun 30, 2021
0e131c6
Add fillna and basic tests
pckSF Jul 2, 2021
9f74dc7
Fix testnames regarding nan and unassigned
pckSF Jul 2, 2021
442b340
Remove fillna
pckSF Jul 3, 2021
38cf04d
Add from_dummies docstring
pckSF Jul 11, 2021
8eccfab
Add docstring to _from_dummies_1d
pckSF Jul 11, 2021
fd027c5
Fix column behaviour
pckSF Jul 11, 2021
106ff3c
Update handling of unassigned rows
pckSF Jul 11, 2021
2019228
Start user_guide entry
pckSF Jul 17, 2021
be39c05
Draft reshaping user_guide entry
pckSF Jul 19, 2021
d406227
Fix: remove temp workspace separation
pckSF Jul 19, 2021
61a25e0
Add raise ValueError on unassigned values
pckSF Aug 5, 2021
1d104f8
Merge updates from upstream/master
pckSF Aug 11, 2021
5bcfbb4
Fix mypy issues
pckSF Aug 11, 2021
ca6200e
Fix docstring multi-line statements
pckSF Aug 11, 2021
bf17cdb
Add TypeError for wrong dropped_first type
pckSF Aug 29, 2021
92b5dae
Add tests for incomplete seperators
pckSF Sep 6, 2021
c2cd747
Add tests for complex prefix separators
pckSF Sep 6, 2021
dc50464
Remove magic handling of non-dummy columns
pckSF Sep 9, 2021
4d9cfd0
Removed to_series argument
pckSF Sep 9, 2021
82d6743
Renamed column argument to subset
pckSF Sep 9, 2021
153202d
Renamed tests to reflect the removal of to_series
pckSF Sep 9, 2021
d3dd9f7
Fix input data NA value test to account for subset
pckSF Sep 9, 2021
e6ec175
Renamed argument prefix_sep to just sep
pckSF Sep 9, 2021
ee6025d
Improve docstring for sep
pckSF Sep 9, 2021
4e741c8
Update user guide entry
pckSF Sep 9, 2021
1b4a8e9
Fix wrong variable name in docstring: d to df
pckSF Sep 9, 2021
90177be
Fix mypy issues
pckSF Sep 9, 2021
d58c668
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Sep 10, 2021
46457fa
Fix post upstream merge mypy issues
pckSF Sep 10, 2021
131f42b
Fix errors in user guide
pckSF Sep 10, 2021
1af65ac
Merge 'upstream/master' into add-from_dummies
pckSF Oct 6, 2021
6dacf53
Allow hashable categories
pckSF Oct 7, 2021
61edd30
Add None category to mixed_cats_basic test
pckSF Oct 16, 2021
04f360c
Add index to argument types and fix resulting mypy issues
pckSF Oct 21, 2021
7ff2f3b
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 16, 2021
56ea182
Remove list from dropped_first args
pckSF Nov 20, 2021
39a0199
Remove list from sep args
pckSF Nov 20, 2021
e05fe3f
Remove default category name
pckSF Nov 20, 2021
23f6c07
Adapt docstring examples to removal of list from sep and dropped_firs…
pckSF Nov 20, 2021
7190879
Update docstring: Remove default category name
pckSF Nov 20, 2021
012a1dd
Updaterst: Add missing word
pckSF Nov 20, 2021
52ed909
Add from_dummies to reshaping api
pckSF Nov 20, 2021
d8e4743
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 20, 2021
0cf35d8
Add: allow dropped_first to be any hashable type
pckSF Nov 20, 2021
b9303bc
Add: Temporary mypy fix
pckSF Nov 20, 2021
3207534
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 22, 2021
8089fe5
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Nov 24, 2021
55ad274
Add from_dummies to pandas __init__ file
pckSF Nov 27, 2021
1b17815
Add from_dummies to test_api tests
pckSF Nov 27, 2021
00c7b05
Fix docstring examples
pckSF Nov 27, 2021
07ba536
Adapt documentation to account for removal of list arguments
pckSF Nov 27, 2021
bbe41d0
Fix wrong parenthesis in docstring
pckSF Nov 27, 2021
329394b
Fix docstring example expected return
pckSF Nov 28, 2021
b83ac6a
Simplify from_dummies
pckSF Nov 29, 2021
1f5e1dc
Update user guide entry
pckSF Nov 29, 2021
8a3421b
Change arg dropped_first to implied_value
pckSF Nov 29, 2021
16cdaa0
Add dosctring note and test for boolean dummy values
pckSF Nov 29, 2021
174df1f
Fix docstring typo
pckSF Nov 29, 2021
e45d3f8
Change arg implied_value to implied_category
pckSF Nov 29, 2021
e83faed
Fix docstring format mistakes
pckSF Dec 4, 2021
1e12e6a
Replace argmax/min with idxmax/min
pckSF Dec 4, 2021
24e9899
Reduce complexity by using defaultdict
pckSF Dec 4, 2021
c8e7a7d
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Dec 4, 2021
0ac8fff
Ignore dependency based mypy errors
pckSF Dec 16, 2021
6af6cad
Merge remote-tracking branch 'upstream/master' into add-from_dummies
pckSF Dec 16, 2021
54fdcbd
Add Raises section to docstring
pckSF Dec 29, 2021
ced3ed0
Change implied_category to base_category
pckSF Jan 5, 2022
6db7744
Add proper reference to get_dummies in docstring
pckSF Jan 5, 2022
c84d973
Remove unnecessary copy of input data
pckSF Jan 5, 2022
842d335
Merge upstream master
pckSF Jan 5, 2022
8f91012
Fix docstring section order
pckSF Jan 5, 2022
84d5bd8
Remove redundant f-strings
pckSF Jan 10, 2022
fd0f985
Add check for 'data' type
pckSF Jan 10, 2022
6230d0f
Add TypeError for wrong data type to docstring
pckSF Jan 14, 2022
84a60f7
Add roundtrip tests get_dummies from_dummies
pckSF Jan 14, 2022
c78ef2a
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jan 14, 2022
52a9dea
Move from_dummies to encoding.py
pckSF Jan 29, 2022
bc658ba
Fix from_dummies import in test file
pckSF Jan 29, 2022
9fbca72
Update userguide versionadded to 1.5
pckSF Jan 30, 2022
2581fc9
Draft whats-new entry
pckSF Jan 31, 2022
85a0ed8
Change code-block to ipython
pckSF Jan 31, 2022
5b74039
Improve test names and organization
pckSF Feb 1, 2022
015ee94
Show DataFrames used in docstring examples
pckSF Feb 1, 2022
66c0292
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Feb 18, 2022
30b8ff1
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 3, 2022
b261656
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 18, 2022
555825b
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Mar 22, 2022
9d6e571
Merge from umstream/main
pckSF Apr 1, 2022
9f1bb8e
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 2, 2022
dc52985
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 3, 2022
e7d6828
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 20, 2022
ae9f3d2
Fix whatsnew entry typo
pckSF Apr 20, 2022
a59ed4e
Fix whats-new
pckSF Apr 28, 2022
66c7a64
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 28, 2022
76221f8
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Apr 30, 2022
7fa66b3
Change base_category to default_category
pckSF Jun 3, 2022
536f9c5
Merge updates from upstream/main
pckSF Jun 3, 2022
530889e
Add double ticks to render code in docstring
pckSF Jun 3, 2022
6536c65
Fix docstring typos and alignments
pckSF Jun 3, 2022
1272a23
Inline the check_len check for the default_vategory
pckSF Jun 3, 2022
fd3b115
Fix mypy issues by removing fixed ignores
pckSF Jun 3, 2022
bd5a118
Fix error encountered during docstring parsing
pckSF Jun 4, 2022
f7d08d0
Fix redundant backticks following :func:
pckSF Jun 4, 2022
c32e514
Add space before colon for numpydoc
pckSF Jun 4, 2022
0fda02f
Added pd.Categorical to See Also
pckSF Jun 6, 2022
62b09ae
Add version added
pckSF Jun 6, 2022
1dcdd9a
Add from_dummies to get_dummies see also
pckSF Jun 6, 2022
3c00690
Fix see also missing period error
pckSF Jun 6, 2022
4425b4a
Fix See Also of get_dummies
pckSF Jun 6, 2022
dc144f7
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jun 6, 2022
15503b0
Fix docs compiler error
pckSF Jun 22, 2022
61a348b
Merge from master
pckSF Jun 22, 2022
f06a45c
Fix default_category=0 bug and add corresponding tests
pckSF Jun 25, 2022
f3a0f83
Merge remote-tracking branch 'upstream/main' into add-from_dummies
pckSF Jun 25, 2022
23c133f
Use .loc[:, prefix_slice] instead of [prefix_slice]
pckSF Jun 25, 2022
1 change: 1 addition & 0 deletions doc/source/reference/general_functions.rst
@@ -23,6 +23,7 @@ Data manipulations
merge_asof
concat
get_dummies
from_dummies
factorize
unique
wide_to_long
24 changes: 24 additions & 0 deletions doc/source/user_guide/reshaping.rst
@@ -706,6 +706,30 @@ To choose another dtype, use the ``dtype`` argument:

pd.get_dummies(df, dtype=bool).dtypes

.. versionadded:: 1.5.0

To convert a "dummy" or "indicator" ``DataFrame`` back into a categorical
``DataFrame``, use :func:`~pandas.from_dummies`. For example, ``k`` columns of a
``DataFrame`` containing 1s and 0s are decoded into a single column with ``k``
distinct values:

.. ipython:: python

df = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})
df

pd.from_dummies(df, sep="_")

Dummy-coded data only requires ``k - 1`` categories to be included. In that case,
the ``k`` th category is the default category, implied by a row not being assigned
any of the other ``k - 1`` categories. It can be passed via ``default_category``.

.. ipython:: python

df = pd.DataFrame({"prefix_a": [0, 1, 0]})
df

pd.from_dummies(df, sep="_", default_category="b")
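
The relationship between ``k - 1`` dummy columns and ``default_category`` can be illustrated with a round trip through ``get_dummies``. This is a minimal sketch rather than part of the PR: the column name ``fruit`` and its values are hypothetical, and it assumes pandas 1.5 or later, where ``from_dummies`` is available.

```python
import pandas as pd

# Hypothetical example data; the column name "fruit" is illustrative only.
original = pd.DataFrame({"fruit": ["apple", "banana", "cherry", "apple"]})

# drop_first=True drops the first category ("apple"), leaving k - 1 = 2
# dummy columns: "fruit_banana" and "fruit_cherry".
dummies = pd.get_dummies(original, drop_first=True)

# Rows in which every dummy is zero are decoded to the default category,
# which restores the dropped "apple" level.
decoded = pd.from_dummies(dummies, sep="_", default_category="apple")

print(decoded)
```

Without ``default_category``, the same call would raise a ``ValueError``, because rows 0 and 3 have no category assigned.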

.. _reshaping.factorize:

19 changes: 19 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -100,6 +100,25 @@ as seen in the following example.
1 2021-01-02 08:00:00 4
2 2021-01-02 16:00:00 5

.. _whatsnew_150.enhancements.from_dummies:

from_dummies
^^^^^^^^^^^^

Added new function :func:`~pandas.from_dummies` to convert a dummy coded :class:`DataFrame` into a categorical :class:`DataFrame`.

Example:

.. ipython:: python

import pandas as pd

df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
"col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
"col2_c": [0, 0, 1]})

pd.from_dummies(df, sep="_")
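
Beyond the basic call, the docstring in this PR states that ``from_dummies`` raises a ``ValueError`` for rows with more than one category assigned and for rows with no category assigned when no default is given. The following is a small sketch of that behaviour, assuming pandas 1.5 or later; the frames ``multi`` and ``unassigned`` are made up for illustration.

```python
import pandas as pd

# Row 0 is assigned to both categories of "col1".
multi = pd.DataFrame({"col1_a": [1, 0], "col1_b": [1, 1]})

# Row 1 is assigned to no category of "col1".
unassigned = pd.DataFrame({"col1_a": [1, 0], "col1_b": [0, 0]})

# Collect the error messages instead of letting the exceptions propagate.
errors = []
for frame in (multi, unassigned):
    try:
        pd.from_dummies(frame, sep="_")
    except ValueError as err:
        errors.append(str(err))

# Supplying a default category resolves the unassigned case.
resolved = pd.from_dummies(unassigned, sep="_", default_category="x")

print(errors)
print(resolved)
```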

.. _whatsnew_150.enhancements.tar:

Reading directly from TAR archives
2 changes: 2 additions & 0 deletions pandas/__init__.py
@@ -128,6 +128,7 @@
pivot,
pivot_table,
get_dummies,
from_dummies,
cut,
qcut,
)
@@ -361,6 +362,7 @@ def __getattr__(name):
"eval",
"factorize",
"get_dummies",
"from_dummies",
"get_option",
"infer_freq",
"interval_range",
5 changes: 4 additions & 1 deletion pandas/core/reshape/api.py
@@ -1,7 +1,10 @@
# flake8: noqa:F401

from pandas.core.reshape.concat import concat
from pandas.core.reshape.encoding import get_dummies
from pandas.core.reshape.encoding import (
from_dummies,
get_dummies,
)
from pandas.core.reshape.melt import (
lreshape,
melt,
197 changes: 197 additions & 0 deletions pandas/core/reshape/encoding.py
@@ -1,6 +1,8 @@
from __future__ import annotations

from collections import defaultdict
import itertools
from typing import Hashable

import numpy as np

@@ -316,3 +318,198 @@ def get_empty_frame(data) -> DataFrame:
dummy_mat = dummy_mat[:, 1:]
dummy_cols = dummy_cols[1:]
return DataFrame(dummy_mat, index=index, columns=dummy_cols)


def from_dummies(
data: DataFrame,
sep: None | str = None,
default_category: None | Hashable | dict[str, Hashable] = None,
) -> DataFrame:
"""
Create a categorical ``DataFrame`` from a ``DataFrame`` of dummy variables.

Inverts the operation performed by :func:`~pandas.get_dummies`.

Parameters
----------
data : DataFrame
Data which contains dummy-coded variables in form of integer columns of
1's and 0's.
sep : str, default None
Separator used in the column names of the dummy categories, i.e. the
character indicating the separation of the categorical names from the prefixes.
For example, if your column names are 'prefix_A' and 'prefix_B',
you can separate the prefix from the category names by specifying sep='_'.
default_category : None, Hashable or dict of Hashables, default None
The default category is the implied category when a value has none of the
listed categories specified with a one, i.e. if all dummies in a row are
zero. Can be a single value for all variables or a dict mapping
the prefix of each variable to its default category.

Returns
-------
DataFrame
Categorical data decoded from the dummy input-data.

Raises
------
ValueError
* When the input ``DataFrame`` ``data`` contains NA values.
* When the input ``DataFrame`` ``data`` contains column names with separators
that do not match the separator specified with ``sep``.
* When a ``dict`` passed to ``default_category`` does not include an implied
category for each prefix.
* When a value in ``data`` has more than one category assigned to it.
* When ``default_category=None`` and a value in ``data`` has no category
assigned to it.
TypeError
* When the input ``data`` is not of type ``DataFrame``.
* When the input ``DataFrame`` ``data`` contains non-dummy data.
* When the passed ``sep`` is of a wrong data type.
* When the passed ``default_category`` is of a wrong data type.

See Also
--------
:func:`~pandas.get_dummies` : Convert ``Series`` or ``DataFrame`` to dummy
codes.

Notes
-----
The columns of the passed dummy data should only include 1's and 0's,
or boolean values.

Examples
--------
>>> df = pd.DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0],
... "c": [0, 0, 1, 0]})

>>> df
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0

>>> pd.from_dummies(df)
0 a
1 b
2 c
3 a

>>> df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
... "col2_c": [0, 0, 1]})

>>> df
col1_a col1_b col2_a col2_b col2_c
0 1 0 0 1 0
1 0 1 1 0 0
2 1 0 0 0 1

>>> pd.from_dummies(df, sep="_")
col1 col2
0 a b
1 b a
2 a c

>>> df = pd.DataFrame({"col1_a": [1, 0, 0], "col1_b": [0, 1, 0],
... "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
... "col2_c": [0, 0, 0]})

>>> df
col1_a col1_b col2_a col2_b col2_c
0 1 0 0 1 0
1 0 1 1 0 0
2 0 0 0 0 0

>>> pd.from_dummies(df, sep="_", default_category={"col1": "d", "col2": "e"})
col1 col2
0 a b
1 b a
2 d e
"""
from pandas.core.reshape.concat import concat

if not isinstance(data, DataFrame):
raise TypeError(
"Expected 'data' to be a 'DataFrame'; "
f"Received 'data' of type: {type(data).__name__}"
)

if data.isna().any().any():
raise ValueError(
"Dummy DataFrame contains NA value in column: "
f"'{data.isna().any().idxmax()}'"
)

# cast the dummy columns to the nullable boolean dtype; data that cannot be
# cast is reported as non-dummy data
try:
data_to_decode = data.astype("boolean", copy=False)
except TypeError:
raise TypeError("Passed DataFrame contains non-dummy data")

# collect prefixes and get lists to slice data for each prefix
variables_slice = defaultdict(list)
if sep is None:
variables_slice[""] = list(data.columns)
elif isinstance(sep, str):
for col in data_to_decode.columns:
prefix = col.split(sep)[0]
if len(prefix) == len(col):
raise ValueError(f"Separator not specified for column: {col}")
variables_slice[prefix].append(col)
else:
raise TypeError(
"Expected 'sep' to be of type 'str' or 'None'; "
f"Received 'sep' of type: {type(sep).__name__}"
)

if default_category is not None:
if isinstance(default_category, dict):
if not len(default_category) == len(variables_slice):
len_msg = (
f"Length of 'default_category' ({len(default_category)}) "
f"did not match the length of the columns being encoded "
f"({len(variables_slice)})"
)
raise ValueError(len_msg)
elif isinstance(default_category, Hashable):
default_category = dict(
zip(variables_slice, [default_category] * len(variables_slice))
)
else:
raise TypeError(
"Expected 'default_category' to be of type "
"'None', 'Hashable', or 'dict'; "
"Received 'default_category' of type: "
f"{type(default_category).__name__}"
)

cat_data = {}
for prefix, prefix_slice in variables_slice.items():
if sep is None:
cats = prefix_slice.copy()
else:
cats = [col[len(prefix + sep) :] for col in prefix_slice]
assigned = data_to_decode[prefix_slice].sum(axis=1)
if any(assigned > 1):
raise ValueError(
"Dummy DataFrame contains multi-assignment(s); "
f"First instance in row: {assigned.idxmax()}"
)
elif any(assigned == 0):
if isinstance(default_category, dict):
cats.append(default_category[prefix])
else:
raise ValueError(
"Dummy DataFrame contains unassigned value(s); "
f"First instance in row: {assigned.idxmin()}"
)
data_slice = concat((data_to_decode[prefix_slice], assigned == 0), axis=1)
else:
data_slice = data_to_decode[prefix_slice]
cats_array = np.array(cats, dtype="object")
# get indices of True entries along axis=1
cat_data[prefix] = cats_array[data_slice.to_numpy().nonzero()[1]]

return DataFrame(cat_data)
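
The decoding step at the end of ``from_dummies`` relies on every row having exactly one ``True`` entry once the "unassigned" indicator column has been appended. The idea can be sketched in plain NumPy. This is an illustrative simplification under stated assumptions, not the PR's implementation: the matrix, category names, and ``default`` value are hypothetical, and ``np.column_stack`` stands in for the ``concat`` call used in the function.

```python
import numpy as np

# Hypothetical dummy matrix for one variable: rows are observations,
# columns are the categories "a", "b", "c". The last row is unassigned.
dummies = np.array(
    [
        [True, False, False],
        [False, True, False],
        [False, False, False],
    ]
)
cats = ["a", "b", "c"]
default = "d"  # assumed default category for all-zero rows

# Count assignments per row; more than one would be ambiguous.
assigned = dummies.sum(axis=1)
if (assigned > 1).any():
    raise ValueError("multi-assignment")

# Append an indicator column that is True exactly where no category is set,
# so that every row ends up with exactly one True entry.
unassigned = assigned == 0
if unassigned.any():
    dummies = np.column_stack([dummies, unassigned])
    cats = cats + [default]

# nonzero() returns (row_indices, column_indices) in row-major order;
# with one True per row, the column indices line up with the rows.
decoded = np.array(cats, dtype="object")[dummies.nonzero()[1]]

print(decoded)
```

Because each row contributes exactly one column index, the result has one decoded category per input row, in the original row order.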
1 change: 1 addition & 0 deletions pandas/tests/api/test_api.py
@@ -116,6 +116,7 @@ class TestPDApi(Base):
"eval",
"factorize",
"get_dummies",
"from_dummies",
"infer_freq",
"isna",
"isnull",