API/ENH: from_dummies #8745

jreback · 2014-11-06T11:08:01Z

Motivating from SO

This is the inverse of pd.get_dummies. So maybe invert_dummies is better?
I think this name makes more sense though.

This seems a reasonable way to do it. Am I missing anything?

In [46]: s = Series(list('aaabbbccddefgh')).astype('category')

In [47]: s
Out[47]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

In [48]: df = pd.get_dummies(s)

In [49]: df
Out[49]: 
    a  b  c  d  e  f  g  h
0   1  0  0  0  0  0  0  0
1   1  0  0  0  0  0  0  0
2   1  0  0  0  0  0  0  0
3   0  1  0  0  0  0  0  0
4   0  1  0  0  0  0  0  0
5   0  1  0  0  0  0  0  0
6   0  0  1  0  0  0  0  0
7   0  0  1  0  0  0  0  0
8   0  0  0  1  0  0  0  0
9   0  0  0  1  0  0  0  0
10  0  0  0  0  1  0  0  0
11  0  0  0  0  0  1  0  0
12  0  0  0  0  0  0  1  0
13  0  0  0  0  0  0  0  1

In [50]: x = df.stack()

# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]: 
0     a
1     a
2     a
3     b
4     b
5     b
6     c
7     c
8     d
9     d
10    e
11    f
12    g
13    h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]

NB. this is buggy ATM.

In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1), categories=df.columns))
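
The stack-based inversion above can be collected into a small helper. This is only a sketch: `invert_dummies_via_stack` is a hypothetical name (not a pandas API), it assumes every row has exactly one nonzero entry, and it passes the column labels as the categories explicitly.

```python
import pandas as pd


def invert_dummies_via_stack(dummies):
    # Hypothetical helper, not part of pandas.
    # Assumes exactly one nonzero indicator per row.
    stacked = dummies.stack()
    # Keep only the (row, column) pairs where the indicator is set.
    hot = stacked[stacked != 0]
    labels = list(hot.index.get_level_values(1))
    cat = pd.Categorical(labels, categories=list(dummies.columns))
    return pd.Series(cat, index=hot.index.get_level_values(0))


s = pd.Series(list("aaabbbccddefgh"))
recovered = invert_dummies_via_stack(pd.get_dummies(s))
```

The categories come from the column labels, so any ordering information on the original Categorical is not restored here.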


TomAugspurger · 2014-11-06T13:12:23Z

We'll need to handle the case of a DataFrame with dummy columns and non-dummy columns.

jorisvandenbossche · 2014-11-06T13:20:32Z

@TomAugspurger Can't we say that it is up to the user to provide the correct selection of columns? (and so error on non-dummy columns?)

I am not really sold on get_categories (as this could also mean a lot of other things, you can get categories from other type of data than dummies), so something with 'dummies' in the name feels better (invert_dummies, from_dummies, .. or something with the meaning of 'condense/melt dummies')

TomAugspurger · 2014-11-06T13:48:19Z

@jorisvandenbossche, yeah, by "handle" I meant think about, and I think raising is the best solution, sorry.
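
A rough sketch of what "raising on non-dummy columns" could look like. `check_dummy_columns` is a hypothetical helper, and it assumes the dummies are stored as integer 0/1 values with exactly one indicator set per row.

```python
import pandas as pd


def check_dummy_columns(df):
    # Hypothetical validation helper, not part of pandas.
    # A dummy column is assumed to contain only the integers 0 and 1.
    bad = [col for col in df.columns if not df[col].isin([0, 1]).all()]
    if bad:
        raise ValueError(f"Non-dummy columns: {bad}")
    # Each row is assumed to belong to exactly one category.
    if not (df.astype(int).sum(axis=1) == 1).all():
        raise ValueError("Each row must have exactly one dummy column set")


dummies = pd.DataFrame({"a": [1, 0], "b": [0, 1]})
check_dummy_columns(dummies)  # passes silently

mixed = dummies.assign(price=[9.5, 3.0])
try:
    check_dummy_columns(mixed)
except ValueError as exc:
    message = str(exc)
```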

What to do with NaNs, e.g. pd.get_dummies(['a', 'b', np.nan], dummy_na=True)? We should probably have a symmetrical argument for from_dummies. (I'm not sure how Categorical handles a NaN as a category.)
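
To make the NaN question concrete, here is a minimal sketch of inverting a `dummy_na=True` encoding by hand. It assumes exactly one indicator is set per row and returns a plain Series rather than a Categorical, so the question of NaN as a category does not arise.

```python
import numpy as np
import pandas as pd

dummies = pd.get_dummies(pd.Series(["a", "b", np.nan]), dummy_na=True)

# Position of the set indicator in each row, mapped back to the column labels.
# The NaN column label turns back into a missing value.
codes = dummies.to_numpy().argmax(axis=1)
recovered = pd.Series(dummies.columns.to_numpy()[codes], index=dummies.index)
```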

jreback · 2014-11-06T14:17:43Z

I like from_dummies

metasyn · 2015-05-17T03:02:31Z

+1

pkch · 2015-11-30T09:48:30Z

Should the milestone be modified from 0.16.0 to 0.18.0?

hayd · 2015-12-30T06:03:45Z

Here's a function for DataFrames (again from SO):

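from collections import defaultdict

import numpy as np
import pandas as pd

def reverse_dummy(df_dummies):
    # Column positions and category labels, grouped by the prefix before the first "_".
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            # Columns without a separator are treated as non-dummy and passed through.
            pos["_"].append(i)

    # For each prefix, the position of the row maximum selects the category.
    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df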

TomAugspurger · 2016-01-09T21:18:25Z

What kind of roundtrip-ability can we hope for here? Ideally we have

x == pd.from_dummies(pd.get_dummies(x))

The problem is we lose the Categorical information when calling get_dummies.
In order to fully reconstruct a Categorical we would need to include the categories (if any, remember get_dummies will work on non-categorical) and the ordering when calling from_dummies.

def from_dummies(data, categories, ordered):
   ...

Additionally, it could be that data came from a DataFrame, so there might be multiple sets of dummy columns and non-dummy columns. In this case we have something like

def from_dummies(data, categories, ordered, prefixes):
    pass

where all of prefixes, categories and ordered are scalars or lists of the same length (with a special case for categories and ordered as scalars and prefixes=None, to handle inverting pd.get_dummies(Series)).

Thoughts? That's kind of messy, but I don't see any way around it and I think we should shoot for perfect roundtrip-ability.

jreback · 2016-01-09T21:25:43Z

you can simply infer the categories (as they are the labels of the matrix).

TomAugspurger · 2016-01-09T21:34:26Z

Categories you can get, but not whether it's ordered and what the ordering is if they are ordered.

EDIT: Oh, you can't necessarily infer categories even since pd.get_dummies(['a', 'a', 'b']) is the same as pd.get_dummies(pd.Series(pd.Categorical(['a', 'a', 'b'])))
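
A quick way to check this point is to compare the two encodings directly. The sketch below compares only the dummy values and the column labels, since those are what an inverse function would have to work from.

```python
import numpy as np
import pandas as pd

from_list = pd.get_dummies(["a", "a", "b"])
from_cat = pd.get_dummies(pd.Series(pd.Categorical(["a", "a", "b"])))

# Same indicator matrix and same column labels, so the input type
# cannot be recovered from the values and labels alone.
same_values = np.array_equal(from_list.to_numpy(), from_cat.to_numpy())
same_labels = list(from_list.columns) == list(from_cat.columns)
```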

jorisvandenbossche · 2016-01-10T14:52:38Z

@TomAugspurger What does the signature look like in the version you are working on?
Is the purpose to detect the different sets of dummies based on the column names (as the output of get_dummies looks like)?
Would it return object or category columns?

TomAugspurger · 2016-01-10T16:03:23Z

Current signature

def from_dummies(data, categories=None, ordered=None, prefixes=None):
    '''
    The inverse transformation of ``pandas.get_dummies``.

    Parameters
    ----------
    data : DataFrame
    categories : Index or list of Indexes
    ordered : boolean or list of booleans
    prefixes : str or list of str

    Returns
    -------
    transformed : Series or DataFrame

    Notes
    -----
    To recover a Categorical, you must provide the categories and,
    optionally, whether it is ordered (default False). To invert a
    DataFrame that includes either multiple sets of dummy-encoded columns
    or a mixture of dummy-encoded columns and regular columns, you must
    specify ``prefixes``.
    '''

The default will be to return a regular Series where the values are the column labels (so int or str probably). To return a Categorical you pass in the categories. If I switched to returning a Categorical by default, we would need to provide a flag like return_categorical to disable that.
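
A minimal sketch of that default behaviour for the single-Series case only. `from_dummies_sketch` is a hypothetical name, `prefixes` is left out, and the code assumes exactly one indicator is set per row; it is not the implementation from the branch.

```python
import pandas as pd


def from_dummies_sketch(data, categories=None, ordered=None):
    # Hypothetical single-variable version; not the branch implementation.
    # The position of the set indicator selects the column label for each row.
    codes = data.to_numpy().argmax(axis=1)
    labels = data.columns.to_numpy()[codes]
    if categories is None:
        # Default: a regular Series of column labels.
        return pd.Series(labels, index=data.index)
    # With categories given, rebuild a Categorical (unordered unless requested).
    cat = pd.Categorical(labels, categories=categories, ordered=bool(ordered))
    return pd.Series(cat, index=data.index)


dummies = pd.get_dummies(pd.Series(list("abca")))
plain = from_dummies_sketch(dummies)
as_cat = from_dummies_sketch(dummies, categories=["c", "b", "a"], ordered=True)
```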

Is the purpose to detect the different sets of dummies based on the column names

That's what my prefixes argument is for. If you have multiple dummy-encoded sets you use prefixes=["first_dummy_set", "second_set", ...] and that will find all the columns with that prefix. This will maybe fail (or succeed silently!) if you have a column name that happens to share a prefix... This is beginning to look pretty complicated.
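
The shared-prefix failure mode is easy to reproduce with plain string matching. In this sketch `columns_for_prefix` and the column names are made up for illustration, and "starts with prefix + separator" is an assumed matching rule.

```python
def columns_for_prefix(columns, prefix, sep="_"):
    # Hypothetical matching rule: a column belongs to a dummy set
    # if its name starts with prefix + sep.
    return [c for c in columns if c.startswith(prefix + sep)]


columns = ["size_S", "size_M", "size_class_A", "size_class_B"]

# "size" silently picks up the "size_class" dummies as well.
size_cols = columns_for_prefix(columns, "size")
size_class_cols = columns_for_prefix(columns, "size_class")
```

Here `size_cols` contains all four columns, which is the "succeed silently" case: nothing raises, but the "size" variable would be decoded from the wrong set of columns.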

jpgrossman · 2016-10-18T23:44:26Z

This is exactly what I'm looking for... any progress? Beta?

Thanks!

TomAugspurger · 2016-10-24T12:34:46Z

@jpgrossman I have a branch at https://github.com/TomAugspurger/pandas/tree/from_dummies, though it's been a while since I've looked at that. There are several changes I would make to that, so if you're interested you could use that as a starting point (maybe just the tests).

jpgrossman · 2016-10-25T18:23:37Z

Thank you Tom – will have a look at this soon.

jreback · 2017-01-28T02:24:08Z

pull requests are welcome!

liorshk · 2017-06-20T12:38:15Z

Any update here?
@TomAugspurger Your link doesn't work anymore

TomAugspurger · 2017-06-20T22:24:12Z

@liorshk I haven't had time. Would you have a chance to submit a PR?

kevin-winter · 2017-07-08T11:20:31Z

Here is a quick-and-dirty solution for the easiest case, using no prefix.

import numpy as np
import pandas as pd

def from_dummies(data, categories, prefix_sep='_'):
    out = data.copy()
    for name in categories:
        prefix = name + prefix_sep
        # Dummy columns for this variable, and the labels left after stripping the prefix.
        cols = [c for c in data.columns if c.startswith(prefix)]
        labs = [c[len(prefix):] for c in cols]
        out[name] = pd.Categorical(np.array(labs)[np.argmax(data[cols].values, axis=1)])
        out.drop(cols, axis=1, inplace=True)
    return out

Usage:

categorical_cols = df.columns[df.dtypes.astype(str) == "category"]
dummies = pd.get_dummies(df)
original_df = from_dummies(dummies, categories=categorical_cols)

Please note that the transformed columns are appended at the end, hence the columns of the DataFrame will not be in their original order. I hope that helps some of you!
Cheers!

joshlk · 2018-05-21T16:13:39Z

Would it make more sense to provide an option in get_dummies to also output a map between the original column name, new column name and categories? This could then be used to feed the reverse from_dummies function to recreate the old dataframe
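
One way this could look, as a sketch only: `get_dummies_with_map` is a hypothetical wrapper (no such option exists on `pd.get_dummies` as far as this thread shows), and the mapping is rebuilt from the observed values rather than returned by pandas.

```python
import pandas as pd


def get_dummies_with_map(df, columns, prefix_sep="_"):
    # Hypothetical wrapper, not a pandas API.
    dummies = pd.get_dummies(df, columns=columns, prefix_sep=prefix_sep, dtype=int)
    # original column -> {dummy column name -> category value}
    mapping = {
        col: {f"{col}{prefix_sep}{value}": value for value in df[col].dropna().unique()}
        for col in columns
    }
    return dummies, mapping


df = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]})
dummies, mapping = get_dummies_with_map(df, ["color"])
```

A reverse function could then look up each variable's dummy columns in `mapping` instead of guessing them from prefixes.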

raam93 · 2018-09-01T13:53:04Z

I have edited @kevin-winter 's code in case someone has drop_first=True in pd.get_dummies():
i.e., dummies = pd.get_dummies(df, drop_first=True)

import numpy as np
import pandas as pd

def from_dummies(data, categorical_cols, categorical_cols_first, prefix_sep='_'):
    out = data.copy()

    for col_parent in categorical_cols:
        # Match on prefix + separator so that "col" does not also pick up "col2_x".
        prefix = col_parent + prefix_sep
        filter_col = [col for col in data if col.startswith(prefix)]
        cols_with_ones = np.argmax(data[filter_col].values, axis=1)

        org_col_values = []
        for row, col in enumerate(cols_with_ones):
            # An all-zero row means the level dropped by drop_first=True.
            if col == 0 and data[filter_col].iloc[row, col] < 1:
                org_col_values.append(categorical_cols_first.get(col_parent))
            else:
                org_col_values.append(filter_col[col].split(prefix, 1)[1])

        out[col_parent] = pd.Series(org_col_values).values
        out.drop(filter_col, axis=1, inplace=True)

    return out

categorical_cols_first is a dictionary mapping each categorical variable to its first level, i.e. the level that pd.get_dummies() drops when drop_first=True:

categorical_cols_first = []
for col in categorical_cols:
    categorical_cols_first.append(df[col].value_counts().sort_index().keys()[0])
categorical_cols_first = dict(zip(categorical_cols, categorical_cols_first))

Wrote it quickly, so please comment if there is any bug. It worked for me though.
Hope this helps!

andreaaraldo · 2019-04-17T16:53:57Z

I would raise an exception in @kevin-winter's function in case data[cols] is empty, explaining that one of the provided cols is incorrect
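
A sketch of that check, written as a standalone helper so it can be dropped into either function above. `dummy_columns_for` is a hypothetical name, and the prefix-plus-separator matching is an assumption.

```python
def dummy_columns_for(columns, name, prefix_sep="_"):
    # Hypothetical helper: find the dummy columns of one variable
    # and fail loudly when there are none.
    cols = [c for c in columns if c.startswith(name + prefix_sep)]
    if not cols:
        raise ValueError(
            f"No dummy columns found for {name!r}; check the provided column names"
        )
    return cols


columns = ["color_red", "color_blue", "n"]
color_cols = dummy_columns_for(columns, "color")

try:
    dummy_columns_for(columns, "colour")
except ValueError as exc:
    error_message = str(exc)
```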

MarcoGorelli · 2020-02-02T13:15:31Z

Seems like a popular request, I'll start working on this

clbarnes · 2020-05-19T16:14:48Z

I failed to find this on a search, and so created a duplicate issue.

My approach was to add from_dummies as an alternate constructor for Categorical: that way it's clear what it creates, it's easy to discover and to find documentation for, and the additional arguments are passed straight to that object. And let's not forget, "Namespaces are one honking great idea -- let's do more of those!".

This implementation minimises loops in python (although there are a couple of whole-dataframe copies), but doesn't do a lot of nannying for incorrect inputs:

import numpy as np 
import pandas as pd

class Categorical:
    ...
    
    @classmethod
    def from_dummies(cls, df: pd.DataFrame, **kwargs):
        onehot = df.astype(bool)

        if (onehot.sum(axis=1) > 1).any():
            raise ValueError("Some rows belong to >1 category")

        index_into = pd.Series([np.nan] + list(onehot.columns))
        mult_by = np.arange(1, len(index_into))

        indexes = (onehot.astype(int) * mult_by).sum(axis=1)
        values = index_into[indexes]

        return cls(values, df.columns, **kwargs)

clbarnes · 2020-05-26T10:30:02Z

Think I'm taking this on, should be able to have a go tomorrow. For the sake of symmetry, I'd also like to give Categorical a to_dummies. If we go down that route, it might be nice to eventually deprecate the get_dummies free function so as to keep categorical-related functionality on the Categorical class and not duplicate API surface.

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, where one-hot encoded variables are of binary type? Is that a distinction we want to keep here? Users can always .astype(bool) on it.

TomAugspurger · 2020-05-26T11:29:05Z

Also just to check - strictly, dummy variables are of float type, and valued 0 and 1, where one-hot encoded variables are of binary type? Is that a distinction we want to keep here?

Why do you say they're float dtype?

In [4]: pd.get_dummies(pd.Series([1, 2, 3])).dtypes
Out[4]:
1    uint8
2    uint8
3    uint8
dtype: object
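
If the dtype matters, it can be requested explicitly through the `dtype` argument of `pd.get_dummies` rather than relying on the default, which has differed between pandas versions. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Same indicator matrix, stored as floats or as booleans.
as_float = pd.get_dummies(s, dtype=float)
as_bool = pd.get_dummies(s, dtype=bool)
```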

clbarnes · 2020-05-26T11:44:25Z

I just had a look through some docs and it looked like the term "dummy variable" is used mainly in regression, in cases where you have a categorical variable but need to encode it as continuous (i.e. floating) for the purposes of that regression. The term "one-hot encoding" seems more commonly used in applications which deal in actual booleans. For both of them, the information itself is binary, of course.

I may be completely making up that distinction, though.

TomAugspurger · 2020-05-26T12:45:27Z

In my experience "one-hot encoding" and "dummy variables" are synonymous.

MarcoGorelli · 2020-05-26T12:51:04Z

In my experience "one-hot encoding" and "dummy variables" are synonymous.

Seems the scikit-learn docs would agree

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

clbarnes · 2020-05-27T13:29:18Z

take

mroeschke · 2022-08-15T21:08:27Z

Closed by #41902
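
For readers arriving here later: a short usage sketch of the resulting `pd.from_dummies`, assuming pandas 1.5 or later is installed. The example data is made up, and `dtype=int` is passed only to keep the dummies as 0/1 integers.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "M"]})

# Encode, producing columns such as "color_red" and "size_S".
dummies = pd.get_dummies(df, dtype=int)

# Decode; sep tells from_dummies how to split each column name
# into variable name and category.
restored = pd.from_dummies(dummies, sep="_")
```

The restored columns hold plain labels, so Categorical dtype and ordering are not recovered, which matches the roundtrip limitation discussed above.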

jreback added the Bug, Enhancement, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), API Design, and Categorical (Categorical Data Type) labels Nov 6, 2014

jreback added this to the 0.15.2 milestone Nov 6, 2014

jreback changed the title ~~API/ENH: get_categories~~ API/ENH: from_dummies Nov 6, 2014

jreback modified the milestones: 0.16.0, 0.15.2 Nov 30, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Mar 8, 2015

reverse get dummies #7344

Closed

TomAugspurger mentioned this issue Jun 17, 2015

Implementing support for Panda's Categorical dtype scikit-learn/scikit-learn#4865

Closed

brookewenig mentioned this issue Apr 19, 2019

Reverse get_dummies databricks/koalas#99

Open

MarcoGorelli self-assigned this Feb 2, 2020

MarcoGorelli mentioned this issue Feb 7, 2020

ENH: add from_dummies #31795

Closed

dsaxton mentioned this issue May 19, 2020

ENH: One-hot decoding #34260

Closed

github-actions bot assigned clbarnes May 27, 2020

MarcoGorelli removed their assignment May 27, 2020

clbarnes mentioned this issue May 28, 2020

Categorical.(get|from)_dummies #34426

Closed

5 tasks

mroeschke removed the API Design, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Bug labels Apr 11, 2021

pckSF mentioned this issue Jun 9, 2021

Initial draft: from_dummies #41902

Merged

10 tasks

jreback modified the milestones: Contributions Welcome, 1.5 Feb 1, 2022

mroeschke closed this as completed Aug 15, 2022
