Implementing support for Pandas' Categorical dtype #4865


Closed

tgarc opened this issue Jun 15, 2015 · 5 comments

@tgarc

tgarc commented Jun 15, 2015

It seems that there is no support in sklearn for using Pandas' Categorical datatype directly when fitting models. The Categorical datatype is convenient since it encodes categorical data as integer codes and carries the category mapping along with the data. In addition, categorical encoding is purely a data handling/processing problem, so it seems more natural for it to be handled by Pandas. Is there anything in the pipeline to add support for this in the future? Would this require a lot of restructuring of the sklearn API to support?
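
For readers unfamiliar with the dtype, here is a minimal illustration of that mapping scheme (the column and values are made up for this sketch):

```python
import pandas as pd

# A Categorical column stores small integer codes plus a categories index,
# so the value-to-code mapping travels with the data itself.
s = pd.Series(["red", "green", "red", "blue"], dtype="category")

print(s.cat.categories)      # Index(['blue', 'green', 'red'], dtype='object')
print(s.cat.codes.tolist())  # [2, 1, 2, 0]
```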

@amueller
Member

I think these are at least two separate questions:

  1. Can / will sklearn support pandas DataFrames with categorical features as input?
  2. Can / will sklearn support categorical variables via pandas' categorical datatypes?

Taking them in turn:

  1. This would be more or less converting all categorical variables into one-hot encoded features, aka dummy columns (a minimal sketch follows this list). That is really easy for the user to do. We could do it "under the hood" in scikit-learn, but it would complicate the code and I don't see a great benefit.
  2. This is basically impossible. Having a categorical datatype would be nice for the trees, but I think pandas has no stable C-level interface, so we can't really tap into that. Even if there were one, it would still require a substantial rewrite of the tree code. I don't think it would be helpful anywhere else.
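
As a rough sketch of the user-side transform described in point 1 (the column names and toy values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red"]),
    "size": [1.0, 2.0, 3.0],
})

# Expand the categorical column into dummy/indicator columns; numeric
# columns pass through unchanged, so the result can go straight to fit().
X = pd.get_dummies(df, columns=["color"])
print(X.columns.tolist())  # ['size', 'color_green', 'color_red']
```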

@TomAugspurger
Contributor

Agreed with @amueller here. It'd be better for the user to do the transform themselves. On the pandas side we may add a couple of methods like from_dummies, which should make it easier to round-trip from a DataFrame (with categoricals) to scikit-learn and back.
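
For context, a sketch of that round trip; from_dummies did not exist when this was written and only landed in later pandas releases (1.5+), so treat it as illustrative:

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category", name="label")

# Categorical -> dummy columns for scikit-learn ...
dummies = pd.get_dummies(s, prefix="label")

# ... and back to a single column afterwards (from_dummies is pandas >= 1.5).
restored = pd.from_dummies(dummies, sep="_")
print(restored["label"].tolist())  # ['a', 'b', 'a']
```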

@amueller
Member

Cool :)

@tgarc
Author

tgarc commented Jun 17, 2015

My issue really came from the fact that I have to keep two copies of a DataFrame (which may be very large) when exploring datasets with categorical data: an encoded copy for fitting models, and an unencoded copy for plotting and exploring the data. The encoded copy is basically useless except for passing to estimators. However, I realize that by using the pipeline module I can bypass this particular issue (see the sketch below).

I think it would be a nice feature to support categorical features as a datatype out of the box, but as you said it is most likely not worth the investment.
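
For reference, a sketch of that pipeline approach using APIs added well after this thread (ColumnTransformer and a string-capable OneHotEncoder arrived around scikit-learn 0.20); the toy data and choice of estimator are made up:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# The single, unencoded DataFrame used for both plotting and modelling.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red", "blue"]),
    "size": [1.0, 2.0, 3.0, 4.0],
})
y = [0, 1, 0, 1]

# The encoder runs inside the pipeline at fit/predict time, so no separate
# encoded copy of the DataFrame has to be kept around.
model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["color"]),
        remainder="passthrough",
    ),
    LogisticRegression(),
)
model.fit(df, y)
```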

tgarc closed this as completed Jun 17, 2015
@amueller
Member

Well, the pipeline would still generate an intermediate encoded version, so you wouldn't save any RAM. Making sklearn work directly on the categorical representation wouldn't really make sense for anything but tree-based models.
