Initial draft: from_dummies #41902
Changes from 16 commits
@@ -719,6 +719,58 @@ To choose another dtype, use the ``dtype`` argument:

    pd.get_dummies(df, dtype=bool).dtypes

To convert a "dummy" or "indicator" ``DataFrame`` into a categorical ``DataFrame``
(or a categorical ``Series``), for example to turn ``k`` columns of a ``DataFrame``
containing 1s and 0s into a ``DataFrame`` (or ``Series``) with ``k`` distinct values,
use :func:`~pandas.from_dummies`:

.. ipython:: python

    d = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})

    pd.from_dummies(d)
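As a point of comparison (not part of the PR), the same decoding can be sketched with operations that already exist in pandas: ``idxmax`` picks the column label of the 1 in each row, and the shared ``"prefix_"`` is then stripped off. A minimal sketch:

    import pandas as pd

    d = pd.DataFrame({"prefix_a": [0, 1, 0], "prefix_b": [1, 0, 1]})
    # column label of the 1 in each row, then drop the shared "prefix_"
    d.idxmax(axis=1).str.replace("prefix_", "", regex=False)
    # 0    b
    # 1    a
    # 2    b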
The ``k`` distinct values can also be represented with a ``dropped_first``
argument, which means that a row with no value assigned implies the value of
the dropped category:

Review comment: can you clarify this sentence, not sure exactly what the second 'be' means
Reply: True, that was weird - I rewrote this part completely

Review comment: vale -> value

.. ipython:: python

    d = pd.DataFrame({"prefix_a": [0, 1, 0]})

    pd.from_dummies(d, dropped_first="b")
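For readers unfamiliar with the dropped-category convention: this is the situation that ``get_dummies(..., drop_first=True)`` produces. A rough stand-alone illustration with existing pandas only (the ``prefix`` name and the data are made up for this sketch):

    import pandas as pd

    s = pd.Series(["b", "a", "b"])
    dummies = pd.get_dummies(s, prefix="prefix", drop_first=True)  # the "a" column is dropped
    dummies
    # an all-zero row can only mean the dropped category "a"
    dummies["prefix_b"].map({1: "b", 0: "a"})
    # 0    b
    # 1    a
    # 2    b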
The function is the inverse of :func:`pandas.get_dummies <pandas.reshape.get_dummies>`.

Review comment: what does this apply to?
Reply: I will remove this, it is a relic from when I tried to strictly invert all functionalities of the ``get_dummies`` function.

All non-dummy columns are included untouched in the output. You can control
which columns are included in the output with the ``columns`` argument.

Review comment: you don't need to show every option here, that's the purpose of the doc-string. i dont mind a complete example but just listing options of what it can do is not so useful.
Review comment: e.g. L769-773 are for the doc-string not here. even L763 is not very useful here.
Reply: True, I removed the parts that are more of a docstring, I hope it is more in line with being a user-guide now (or at least the right direction).
.. ipython:: python

    pd.get_dummies(df, columns=["C", "prefix_A", "prefix_B"])

Review comment: we call this subset elsewhere
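To illustrate the idea of decoding only a subset of columns while passing the rest through untouched, here is a small sketch using only existing pandas operations (the column names are invented for this example):

    import pandas as pd

    df = pd.DataFrame({"C": [1, 2, 3], "col_a": [1, 0, 1], "col_b": [0, 1, 0]})
    dummy_cols = ["col_a", "col_b"]
    # decode the selected dummy columns and keep "C" unchanged
    decoded = df[dummy_cols].idxmax(axis=1).str.replace("col_", "", regex=False)
    df.drop(columns=dummy_cols).assign(col=decoded)
    #    C col
    # 0  1   a
    # 1  2   b
    # 2  3   a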
You can pass values for the ``prefix_sep`` argument depending on how many or
how nested the prefix separators used in the column names are. By default the
prefix separator is assumed to be ``'_'``, however ``prefix_sep`` can be
specified in 3 ways:

Review comment: hmm does this match the naming in wide_to_long (which is called sep)
Reply: Thank you, I will change that and update the explanation to make it more clear (And steal a part of the good description).

* string: Use the same value for ``prefix_sep`` for each column to be decoded.
* list: Variables will be decoded at the first instance of a prefix separator
  from the list that is encountered in the column name.
* dict: Directly map prefix separators to prefixes. Can be used in case mixed
  separators are used within the variable name and to separate the variable from
  the prefix.

Review comment: umm this seems quite complicated. can you show the results here (in a comment)
Reply: Sure, I also updated the description (with examples). It is for a case where some characters are used as separators and as parts of variable names (mixed across different prefixes):

    >>> dummies = DataFrame(
        {
            "col1_a-a": [1, 0, 1],
            "col1_b-b": [0, 1, 0],
            "col_2-a": [0, 1, 0],
            "col_2-b": [1, 0, 0],
            "col_2-c": [0, 0, 1],
        },
    )
    >>> from_dummies(dummies, sep={"col1": "_", "col_2": "-"})
      col1 col_2
    0  a-a     b
    1  b-b     a
    2  a-a     c
.. ipython:: python

    simple = pd.get_dummies(df, prefix_sep="-")
    simple
    from_list = pd.get_dummies(df, prefix_sep=["_", "-"])
    from_list
    from_dict = pd.get_dummies(df, prefix_sep={"prefix1": "-", "prefix2": "_"})
    from_dict
.. _reshaping.factorize:

Factorizing values
@@ -1053,6 +1053,225 @@ def get_empty_frame(data) -> DataFrame:
    return DataFrame(dummy_mat, index=index, columns=dummy_cols)


def from_dummies(
    data: DataFrame,
    to_series: bool = False,
    prefix_sep: str | list[str] | dict[str, str] = "_",
    columns: None | list[str] = None,
    dropped_first: None | str | list[str] | dict[str, str] = None,
) -> Series | DataFrame:

Review comment: we should consider moving get_dummies / from_dummies to a separate file (in /reshape), could be a precursor PR.
Reply: I like that idea to improve clarity. What would be an elegant and obvious name for a collection of "reshape operations that change the data representation" - maybe
Review comment: or if its supposed to be a dummy operations file:

Review comment: let's just always return a DataFrame, much simpler
Reply: Good idea, and in line with this perspective as the

    """
    Create a categorical `Series` or `DataFrame` from a `DataFrame` of dummy
    variables.

    Inverts the operation performed by 'get_dummies'.

Review comment: can you add a ref so this links properly

    Parameters
    ----------
    data : `DataFrame`
        Data which contains dummy-coded variables.
    to_series : bool, default False
        Converts the input data to a categorical `Series`; converts the input data
        to a categorical `DataFrame` if False.
    prefix_sep : str, list of str, or dict of str, default '_'
        Separator/delimiter used in the column names of the dummy categories.
        Pass a list if multiple prefix separators are used in the column names.
        Alternatively, pass a dictionary to map prefix separators to prefixes if
        multiple and/or mixed separators are used in the column names.
    columns : None or list of str, default None
        The columns to convert from dummy-encoding and return as a categorical
        `DataFrame`.
        If `columns` is None then all dummy columns are converted and appended
        to the non-dummy columns.
    dropped_first : None, str, list of str, or dict of str, default None
        The implied value the dummy takes when all values are zero.
        Can be a single value for all variables, a list with a number of values
        equal to the number of dummy variables, or a dict directly mapping the
        dropped value to a prefix of a variable.

    Returns
    -------
    `Series` or `DataFrame`
        Categorical data decoded from the dummy input-data.

    See Also
    --------
    get_dummies : Convert `Series` or `DataFrame` to dummy codes.

Review comment: add a Raises section to show what errors happen
    Examples
    --------
    >>> d = pd.DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0],
    ...                   "c": [0, 0, 1, 0]})

    >>> pd.from_dummies(d, to_series=True)
    0    a
    1    b
    2    c
    3    a

    >>> d = pd.DataFrame({"C": [1, 2, 3], "col1_a": [1, 0, 1],
    ...                   "col1_b": [0, 1, 0], "col2_a": [0, 1, 0],
    ...                   "col2_b": [1, 0, 0], "col2_c": [0, 0, 1]})

    >>> pd.from_dummies(d)
       C col1 col2
    0  1    a    b
    1  2    b    a
    2  3    a    c

Review comment: this is way too magical on the naming, e.g how is it dropping the _
Reply: True, I will change it to always require a separator input if a separation is required/wanted (This should be solved together with me removing

    >>> d = pd.DataFrame({"C": [1, 2, 3], "col1_a": [1, 0, 0],
    ...                   "col1_b": [0, 1, 0], "col2_a": [0, 1, 0],
    ...                   "col2_b": [1, 0, 0], "col2_c": [0, 0, 0]})

    >>> pd.from_dummies(d, dropped_first=["d", "e"])
       C col1 col2
    0  1    a    b
    1  2    b    a
    2  3    d    e

    >>> d = pd.DataFrame({"col1_a-a": [1, 0, 1], "col1_b-b": [0, 1, 0],
    ...                   "col2-a_a": [0, 1, 0], "col2-b_b": [1, 0, 0],
    ...                   "col2-c_c": [0, 0, 1]})

    >>> pd.from_dummies(d, prefix_sep={"col1": "_", "col2": "-"})
      col1 col2
    0  a-a  b-b
    1  b-b  a-a
    2  a-a  c-c
    """
    from pandas.core.reshape.concat import concat

    if data.isna().any().any():
        raise ValueError(
            f"Dummy DataFrame contains NA value in column: "
            f"'{data.columns[data.isna().any().argmax()]}'"
        )

Review comment: is this tested?
Reply: Same as above
Review comment: argmax or idxmax?
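For context on the argmax/idxmax question above, the two differ only in what they return for a boolean Series indexed by column labels; a quick sketch with made-up data:

    import pandas as pd

    has_na = pd.Series([False, True, False], index=["col_a", "col_b", "col_c"])
    has_na.argmax()   # 1 -> integer position of the first True
    has_na.idxmax()   # 'col_b' -> index label of the first True

The draft indexes ``data.columns`` with the argmax position, which ends up at the same label.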
    if to_series:
        return _from_dummies_1d(data, dropped_first)

    data_to_decode: DataFrame
    if columns is None:
        columns = data.columns
    elif not is_list_like(columns):
        raise TypeError("Argument for parameter 'columns' must be list-like")
    # index data with a list of all columns that are dummies
    cat_columns = []
    non_cat_columns = []
    for col in columns:
        if any(ps in col for ps in prefix_sep):
            cat_columns.append(col)
        else:
            non_cat_columns.append(col)
    data_to_decode = data[cat_columns].astype("boolean")
    non_cat_data = data[non_cat_columns]

Comment: Working on fixing this I just realized that this is also quite magic. As I now plan to require a prefix separator argument if a separation is required, I could also set the default behaviour to expect dummy-only input data, which leaves two options:
I would choose option 1. as it is more in line with the actual purpose of the function.

    # get separator for each prefix and lists to slice data for each prefix

Review comment: umm this is very complicated. what are you actually trying to do here?
Reply: I want to get all columns that correspond to a specific prefix such that I can extract the values for each block. I do this here to avoid deep nesting (and checking whether or not a column belongs to a prefix) later on, when the value for each entry is determined.

    if isinstance(prefix_sep, dict):
        variables_slice = {prefix: [] for prefix in prefix_sep}
        for col in data_to_decode.columns:
            for prefix in prefix_sep:
                if prefix in col:
                    variables_slice[prefix].append(col)
    else:
        sep_for_prefix = {}
        variables_slice = {}

Review comment: could remove
Reply: Awesome advice, thank you very much :)

        for col in data_to_decode.columns:
            ps = [ps for ps in prefix_sep if ps in col][0]
            prefix = col.split(ps)[0]
            if prefix not in sep_for_prefix:
                sep_for_prefix[prefix] = ps
            if prefix not in variables_slice:
                variables_slice[prefix] = [col]
            else:
                variables_slice[prefix].append(col)
        prefix_sep = sep_for_prefix
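To unpack what the non-dict branch above is doing, here is a stand-alone sketch, with invented column names, of inferring one separator per prefix from a list of candidate separators and grouping the columns by prefix:

    candidate_seps = ["_", "-"]
    columns = ["col1_a", "col1_b", "col2-a", "col2-b"]

    sep_for_prefix = {}
    variables_slice = {}
    for col in columns:
        # the first candidate separator that occurs in the column name wins
        sep = next(s for s in candidate_seps if s in col)
        prefix = col.split(sep)[0]
        sep_for_prefix.setdefault(prefix, sep)
        variables_slice.setdefault(prefix, []).append(col)

    sep_for_prefix    # {'col1': '_', 'col2': '-'}
    variables_slice   # {'col1': ['col1_a', 'col1_b'], 'col2': ['col2-a', 'col2-b']}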
    # validate number of dropped_first
    def check_len(item, name) -> None:
        if not len(item) == len(variables_slice):
            len_msg = (
                f"Length of '{name}' ({len(item)}) did not match the "
                "length of the columns being encoded "
                f"({len(variables_slice)})."
            )
            raise ValueError(len_msg)

    if dropped_first:
        if isinstance(dropped_first, dict):
            check_len(dropped_first, "dropped_first")
        elif is_list_like(dropped_first):
            check_len(dropped_first, "dropped_first")
            dropped_first = dict(zip(variables_slice, dropped_first))
        else:
            dropped_first = dict(
                zip(variables_slice, [dropped_first] * len(variables_slice))
            )
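The normalisation above folds the scalar, list and dict forms of ``dropped_first`` into one dict keyed by prefix. In isolation (hypothetical prefixes):

    variables_slice = {"col1": ["col1_a", "col1_b"], "col2": ["col2_a", "col2_b"]}

    # scalar -> the same dropped value for every prefix
    dict(zip(variables_slice, ["x"] * len(variables_slice)))   # {'col1': 'x', 'col2': 'x'}
    # list -> one dropped value per prefix, in order
    dict(zip(variables_slice, ["x", "y"]))                     # {'col1': 'x', 'col2': 'y'}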
    cat_data = {}
    for prefix, prefix_slice in variables_slice.items():
        cats = [col[len(prefix + prefix_sep[prefix]) :] for col in prefix_slice]
        assigned = data_to_decode[prefix_slice].sum(axis=1)
        if any(assigned > 1):
            raise ValueError(
                f"Dummy DataFrame contains multi-assignment(s) for prefix: "
                f"'{prefix}' in row {assigned.argmax()}."
            )
        elif any(assigned == 0):
            if dropped_first:
                cats.append(dropped_first[prefix])
            else:
                cats.append("from_dummies_nan_placeholder_string")
            data_slice = concat((data_to_decode[prefix_slice], assigned == 0), axis=1)
        else:
            data_slice = data_to_decode[prefix_slice]
        cat_data[prefix] = data_slice.dot(cats)

Review comment: Couldn't you check this much earlier with a row sum after the conversion to boolean, e.g. if (data_to_decode.sum(1) > 1).any()?
Reply: Hmm, that only works if there are no prefixes/multiple variables, as each prefix slice has to be checked individually and

Review comment: maybe i am missing something, but this loop can overwrite
Reply: The
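The ``data_slice.dot(cats)`` call above is the actual decoding step: multiplying a 0/1 (or boolean) block by a sequence of string labels repeats each label zero or one times, and the row-wise sum concatenates the pieces, so every one-hot row collapses to its single label. A stand-alone sketch of the same trick with plain pandas (made-up data):

    import pandas as pd

    block = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]})
    # 1 * "a" -> "a", 0 * "b" -> "", and the row-wise sum concatenates the pieces
    block.dot(block.columns)
    # 0    a
    # 1    b
    # 2    c
    # dtype: object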
    categorical_df = concat((non_cat_data, DataFrame(cat_data)), axis=1)
    if dropped_first is None:
        categorical_df.replace(
            "from_dummies_nan_placeholder_string", np.nan, inplace=True
        )
    return categorical_df
def _from_dummies_1d(
    data: DataFrame,
    dropped_first: None | str = None,
) -> Series:
    """
    Helper function for from_dummies.

    Handles the conversion of dummy encoded data to a categorical `Series`.
    For parameters and usage see: from_dummies.
    """
    from pandas.core.reshape.concat import concat

    if dropped_first and not isinstance(dropped_first, str):
        raise ValueError("Only one dropped first value possible in 1D dummy DataFrame.")

    data = data.astype("boolean")
    cats = data.columns.tolist()
    assigned = data.sum(axis=1)
    if any(assigned > 1):
        raise ValueError(
            f"Dummy DataFrame contains multi-assignment in row {assigned.argmax()}."
        )
    elif any(assigned == 0):
        if dropped_first:
            cats.append(dropped_first)
        else:
            cats.append("from_dummies_nan_placeholder_string")
        data = concat((data, assigned == 0), axis=1)

    categorical_series = data.dot(cats)
    if dropped_first is None:
        categorical_series.replace(
            "from_dummies_nan_placeholder_string", np.nan, inplace=True
        )
    return categorical_series
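To see why the helper appends a placeholder category for all-zero rows, here is a compact sketch of the same idea with plain pandas (the ``placeholder`` name is made up; the draft uses its own sentinel string):

    import numpy as np
    import pandas as pd

    data = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]})   # the last row is all zero
    # add a column that is 1 exactly where no category is assigned,
    # so every row has exactly one 1 before the dot product
    augmented = data.assign(placeholder=(data.sum(axis=1) == 0).astype(int))
    augmented.dot(augmented.columns).replace("placeholder", np.nan)
    # 0      a
    # 1      b
    # 2    NaN
    # dtype: object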
def _reorder_for_extension_array_stack(
    arr: ExtensionArray, n_rows: int, n_columns: int
) -> ExtensionArray:
Review comment: add a versionadded 1.4.0 tag

Review comment: i find the parens a bit weird here, can you either say (or from a categorical 'Series') or drop them
Reply: Will remove all the Series (and to_series) related parts.