Implementing support for Pandas' Categorical dtype #4865


Closed

tgarc opened this issue Jun 15, 2015 · 5 comments

@tgarc

tgarc commented Jun 15, 2015

It seems that there is no support in sklearn for using Pandas' Categorical datatype directly when fitting models. The Categorical datatype is convenient since it encodes categorical data as integer codes and carries the category mapping along with the data. In addition, categorical encoding is purely a data handling/processing problem, so it seems more natural for it to be handled by Pandas. Is there anything in the pipeline to add support for this in the future? Would this require a lot of restructuring of the sklearn API to support?
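
For readers unfamiliar with the dtype, here is a minimal illustration of that mapping scheme (the column and values are made up for this sketch):

```python
import pandas as pd

# A Categorical column stores small integer codes plus a categories index,
# so the value-to-code mapping travels with the data itself.
s = pd.Series(["red", "green", "red", "blue"], dtype="category")

print(s.cat.categories)      # Index(['blue', 'green', 'red'], dtype='object')
print(s.cat.codes.tolist())  # [2, 1, 2, 0]
```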

@amueller
Member

I think these are at least two separate questions:

  1. Can / will sklearn support pandas DataFrames with categorical features as input?
  2. Can / will sklearn support categorical variables via pandas' categorical datatypes?

Taking them in turn:

  1. This would be more or less converting all categorical variables into one-hot encoded features, aka dummy columns (a minimal sketch follows this list). That is really easy for the user to do. We could do it "under the hood" in scikit-learn, but it would complicate the code and I don't see a great benefit.
  2. This is basically impossible. Having a categorical datatype would be nice for the trees, but I think pandas has no stable C-level interface, so we can't really tap into that. Even if there were one, it would still require a substantial rewrite of the tree code. I don't think it would be helpful anywhere else.
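
As a rough sketch of the user-side transform described in point 1 (the column names and toy values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red"]),
    "size": [1.0, 2.0, 3.0],
})

# Expand the categorical column into dummy/indicator columns; numeric
# columns pass through unchanged, so the result can go straight to fit().
X = pd.get_dummies(df, columns=["color"])
print(X.columns.tolist())  # ['size', 'color_green', 'color_red']
```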

@TomAugspurger
Contributor

Agreed with @amueller here. It'd be better for the user to do the transform themselves. On the pandas side we may add a couple of methods like from_dummies, which should make it easier to round-trip from a DataFrame (with categoricals) to scikit-learn and back.
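
For context, a sketch of that round trip; from_dummies did not exist when this was written and only landed in later pandas releases (1.5+), so treat it as illustrative:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category", name="label")

# Categorical -> dummy columns for scikit-learn ...
dummies = pd.get_dummies(s, prefix="label")

# ... and back to a single column afterwards (from_dummies is pandas >= 1.5).
restored = pd.from_dummies(dummies, sep="_")
print(restored["label"].tolist())  # ['a', 'b', 'a']
```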

@amueller
Member

Cool :)

@tgarc
Author

tgarc commented Jun 17, 2015

My issue really came from the fact that I have to keep two copies of a DataFrame (which may be very large) when exploring datasets with categorical data: an encoded copy for fitting models, and an unencoded copy for plotting and exploring the data. The encoded copy is basically useless except for passing to estimators. However, I realize that by using the pipeline module I can bypass this particular issue (see the sketch below).

I think it would be a nice feature to support categorical features as a datatype out of the box, but as you said it is most likely not worth the investment.
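
For reference, a sketch of that pipeline approach using APIs added well after this thread (ColumnTransformer and a string-capable OneHotEncoder arrived around scikit-learn 0.20); the toy data and choice of estimator are made up:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# The single, unencoded DataFrame used for both plotting and modelling.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red", "blue"]),
    "size": [1.0, 2.0, 3.0, 4.0],
})
y = [0, 1, 0, 1]

# The encoder runs inside the pipeline at fit/predict time, so no separate
# encoded copy of the DataFrame has to be kept around.
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["color"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)
model.fit(df, y)
```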

tgarc closed this as completed Jun 17, 2015
@amueller
Member

Well, the pipeline would still generate an intermediate encoded version, so you wouldn't save any RAM. Making sklearn work directly on the categorical representation wouldn't really make sense for anything but tree-based models.
