API/ENH: from_dummies #8745
Comments
We'll need to handle the case of a DataFrame with dummy columns and non-dummy columns. |
@TomAugspurger Can't we say that it is up to the user to provide the correct selection of columns? (and so error on non-dummy columns?) I am not really sold on |
@jorisvandenbossche, yeah, by "handle" I meant think about, and I think raising is the best solution, sorry. What to do with NaNs? |
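To make the "raise on non-dummy columns" idea concrete, here is a minimal sketch of an up-front check. The name `check_dummies` and the choice to reject NaN outright are assumptions for illustration only; the thread does not settle how NaNs should be treated.

```python
import numpy as np
import pandas as pd


def check_dummies(data: pd.DataFrame) -> None:
    """Hypothetical validator: raise unless `data` is a clean 0/1 dummy matrix."""
    # Assumption: NaN is rejected rather than mapped to a missing category.
    if data.isna().any().any():
        raise ValueError("dummy columns contain NaN")
    values = data.to_numpy()
    # Any value other than 0/1 means a non-dummy column was passed in.
    if not np.isin(values, [0, 1]).all():
        raise ValueError("non-dummy columns present: found values other than 0/1")
    # More than one 1 in a row cannot be mapped back to a single label.
    if (values.sum(axis=1) > 1).any():
        raise ValueError("some rows belong to more than one category")


good = pd.DataFrame({"a": [1, 0], "b": [0, 1]})
check_dummies(good)  # passes silently

bad = good.assign(price=[3.5, 7.0])
try:
    check_dummies(bad)
except ValueError as exc:
    print(exc)
```

With a check like this the user stays responsible for selecting the right columns, and a wrong selection fails early instead of producing nonsense labels.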
I like |
+1 |
Should the milestone be modified from 0.16.0 to 0.18.0? |
Here's a function for DataFrames (again from SO):
|
What kind of roundtrip-ability can we hope for here? Ideally we have
The problem is we lose the Categorical information when calling
    def from_dummies(data, categories, ordered):
        ...
Additionally it could be that
Where all of
Thoughts? That's kind of messy, but I don't see any way around it, and I think we should shoot for perfect roundtrip-ability. |
you can simply infer the categories (as they are the labels of the matrix). |
Categories you can get, but not whether it's ordered and what the ordering is if they are ordered. EDIT: Oh, you can't necessarily infer categories even since
|
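A small sketch of the roundtrip problem discussed here, assuming prefixed dummy columns (the example data and the `size` prefix are made up). The labels can be read back from the column names, but the ordering has to be supplied again by the caller.

```python
import pandas as pd

sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "low"],
        categories=["low", "mid", "high"],
        ordered=True,
    ),
    name="size",
)
dummies = pd.get_dummies(sizes, prefix="size", dtype=int)

# Naive inversion: label of the column holding the 1, minus the prefix.
labels = dummies.idxmax(axis=1).str.replace("size_", "", regex=False)

# The labels come back as plain strings, so categories and ordering
# must be passed in explicitly to rebuild the original Categorical.
restored = pd.Series(
    pd.Categorical(labels, categories=["low", "mid", "high"], ordered=True),
    name="size",
)
```

This is the reason for the `categories` and `ordered` arguments: nothing in the prefixed column labels says whether the original was ordered.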
@TomAugspurger What does the signature look like in the version you are working on? |
Current signature

    def from_dummies(data, categories=None, ordered=None, prefixes=None):
        '''
        The inverse transformation of ``pandas.get_dummies``.

        Parameters
        ----------
        data : DataFrame
        categories : Index or list of Indexes
        ordered : boolean or list of booleans
        prefixes : str or list of str

        Returns
        -------
        transformed : Series or DataFrame

        Notes
        -----
        To recover a Categorical, you must provide the categories and
        maybe whether it is ordered (default False).

        To invert a DataFrame that includes either
        multiple sets of dummy-encoded columns or a mixture of dummy-encoded
        columns and regular columns, you must specify ``prefixes``.
        '''

The default will be to return a regular Series where the values are the column labels (so int or str probably). To return a Categorical you pass in the categories. If I switched to returning a Categorical by default, we would need to provide a flag like
That's what my |
This is exactly what I'm looking for... any progress? Beta? Thanks! |
@jpgrossman I have a branch at https://github.com/TomAugspurger/pandas/tree/from_dummies, though it's been a while since I've looked at that. There are several changes I would make to that, so if you're interested you could use that as a starting point (maybe just the tests). |
Thank you Tom – will have a look at this soon. |
pull requests are welcome! |
Any update here? |
@liorshk I haven't had time. Would you have a chance to submit a PR? |
Here is a quick-and-dirty solution for the easiest case, using no prefix.

    import numpy as np
    import pandas as pd

    def from_dummies(data, categories, prefix_sep='_'):
        out = data.copy()
        for l in categories:
            # cols: dummy column names for l; labs: the same names with the "l_" prefix stripped
            cols, labs = [[c.replace(x, "") for c in data.columns if l + prefix_sep in c]
                          for x in ["", l + prefix_sep]]
            # per row, take the label of the column holding the largest value
            # (.values replaces as_matrix(), which was removed in pandas 1.0)
            out[l] = pd.Categorical(np.array(labs)[np.argmax(data[cols].values, axis=1)])
            out.drop(cols, axis=1, inplace=True)
        return out

Usage:

    categorical_cols = df.columns[df.dtypes.astype(str) == "category"]
    dummies = pd.get_dummies(df)
    original_df = from_dummies(dummies, categories=categorical_cols)

Please note that the transformed columns are appended at the end, hence the DataFrame will not be in the same order. I hope that helps some of you! |
Would it make more sense to provide an option in |
I have edited @kevin-winter's code in case someone has

    def from_dummies(data, categorical_cols, categorical_cols_first, prefix_sep='_'):
        out = data.copy()
        for col_parent in categorical_cols:
            # match on "parent_" so that a parent named "a" does not also pick up "ab_x"
            filter_col = [col for col in data if col.startswith(col_parent + prefix_sep)]
            cols_with_ones = np.argmax(data[filter_col].values, axis=1)
            org_col_values = []
            for row, col in enumerate(cols_with_ones):
                # all-zero row: fall back to the first category recorded in categorical_cols_first
                if col == 0 and data[filter_col].iloc[row, col] < 1:
                    org_col_values.append(categorical_cols_first.get(col_parent))
                else:
                    org_col_values.append(data[filter_col].columns[col].split(col_parent + prefix_sep, 1)[1])
            out[col_parent] = pd.Series(org_col_values).values
            out.drop(filter_col, axis=1, inplace=True)
        return out

    # first category (in sorted order) of each categorical column, taken from the original df
    categorical_cols_first = []
    for col in categorical_cols:
        categorical_cols_first.append(df[col].value_counts().sort_index().keys()[0])
    categorical_cols_first = dict(zip(categorical_cols, categorical_cols_first))

I wrote it quickly, so please comment if there is any bug. It worked for me though. |
I would raise an exception in @kevin-winter's function when data[cols] is empty, explaining that one of the provided cols is incorrect. |
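A rough sketch of that guard, combined with a loop-free way of reading the labels back. `invert_by_prefix` and its parameters are hypothetical names, not anything from pandas or from the functions posted above, and the sketch assumes string column labels and no dropped first level.

```python
import pandas as pd


def invert_by_prefix(data, prefixes, prefix_sep="_"):
    """Hypothetical helper: collapse each prefixed dummy group into one column."""
    out = data.copy()
    for prefix in prefixes:
        stem = prefix + prefix_sep
        cols = [c for c in data.columns if str(c).startswith(stem)]
        if not cols:
            # The guard suggested above: fail loudly on a wrong prefix.
            raise ValueError(f"no dummy columns found for prefix {prefix!r}")
        # Label of the column holding the 1, with the prefix stripped.
        labels = data[cols].idxmax(axis=1).str[len(stem):]
        out = out.drop(columns=cols)
        out[prefix] = labels
    return out


df = pd.DataFrame({"price": [1.5, 2.0], "color": ["red", "blue"]})
dummies = pd.get_dummies(df, columns=["color"], dtype=int)
restored = invert_by_prefix(dummies, prefixes=["color"])
```

Regular columns such as `price` pass through untouched, and the rebuilt column is appended at the end, as in the earlier versions.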
Seems like a popular request, I'll start working on this |
I failed to find this on a search, and so created a duplicate issue. My approach was to add
This implementation minimises loops in python (although there are a couple of whole-dataframe copies), but doesn't do a lot of nannying for incorrect inputs:

    import numpy as np
    import pandas as pd

    class Categorical:
        ...

        @classmethod
        def from_dummies(cls, df: pd.DataFrame, **kwargs):
            onehot = df.astype(bool)
            if (onehot.sum(axis=1) > 1).any():
                raise ValueError("Some rows belong to >1 category")
            index_into = pd.Series([np.nan] + list(onehot.columns))
            mult_by = np.arange(1, len(index_into))
            indexes = (onehot.astype(int) * mult_by).sum(axis=1)
            values = index_into[indexes]
            return cls(values, df.columns, **kwargs)
|
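For anyone who wants to try the same idea without patching `Categorical`, here is a standalone variant, sketched on the assumption that building the result through `pd.Categorical.from_codes` is acceptable. The function name is made up; as in the code above, all-zero rows become missing values.

```python
import numpy as np
import pandas as pd


def categorical_from_dummies(df: pd.DataFrame, ordered: bool = False) -> pd.Categorical:
    """Hypothetical standalone variant of the classmethod above."""
    onehot = df.astype(bool).to_numpy()
    if (onehot.sum(axis=1) > 1).any():
        raise ValueError("Some rows belong to >1 category")
    # Position of the single True per row; -1 marks all-zero rows,
    # which from_codes turns into NaN.
    codes = np.where(onehot.any(axis=1), onehot.argmax(axis=1), -1)
    return pd.Categorical.from_codes(codes, categories=df.columns, ordered=ordered)


dummies = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]})
result = categorical_from_dummies(dummies)
```

Working with integer codes avoids the intermediate lookup Series, at the cost of treating any truthy value as a 1.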
Think I'm taking this on, should be able to have a go tomorrow. For the sake of symmetry, I'd also like to give Categorical a Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, where one-hot encoded variables are of binary type? Is that a distinction we want to keep here? Users can always |
Why do you say they're float dtype?

    In [4]: pd.get_dummies(pd.Series([1, 2, 3])).dtypes
    Out[4]:
    1    uint8
    2    uint8
    3    uint8
    dtype: object
|
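On the dtype question, the output dtype of `get_dummies` can be chosen explicitly through its `dtype` argument, so users who want floats for a regression can ask for them. A short illustration follows; note that the default shown above is from the pandas version current at the time, and pandas 2.0 and later default to `bool`.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Request a specific dtype instead of relying on the version-dependent default.
as_float = pd.get_dummies(s, dtype=float)
as_uint8 = pd.get_dummies(s, dtype="uint8")

print(as_float.dtypes)
print(as_uint8.dtypes)
```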
I just had a look through some docs and it looked like the term "dummy variable" is used mainly in regression, in cases where you have a categorical variable but need to encode it as continuous (i.e. floating) for the purposes of that regression. The term "one-hot encoding" seems more commonly used in applications which deal in actual booleans. For both of them, the information itself is binary, of course. I may be completely making up that distinction, though. |
In my experience "one-hot encoding" and "dummy variables" are synonymous. |
Seems the scikit-learn docs would agree
|
take |
Closed by #41902 |
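For readers landing here later: assuming pandas 1.5 or newer, where a top-level `pd.from_dummies` is available, a basic roundtrip looks roughly like this. The example frame is made up, and this only covers the simple prefixed case without dropped levels.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": ["x", "x", "y"]})

# Columns become col1_a, col1_b, col2_x, col2_y.
dummies = pd.get_dummies(df, dtype=int)

# sep tells from_dummies where the prefix ends and the label begins.
restored = pd.from_dummies(dummies, sep="_")
print(restored)
```

The result holds the labels as plain values; as discussed earlier in the thread, Categorical dtype and ordering still have to be reapplied by the caller.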
Motivating from SO
This is the inverse of pd.get_dummies. So maybe invert_dummies is better? I think this name makes more sense though.
This seems a reasonable way to do it. Am I missing anything?
NB. this is buggy ATM.