Implementing support for pandas' Categorical dtype #4865
Comments
I think these are at least two separate questions:
Agreed with @amueller here. It'd be better for the user to do the transform. In pandas, we may add a couple of methods, such as from_dummies, which should make it easier to round-trip from a DataFrame (with categoricals) to scikit-learn and back.
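To make the round-trip idea concrete, here is a minimal sketch, assuming a toy DataFrame with hypothetical column names. It does not use from_dummies, which was only a proposal at the time of this comment; the way back is done by hand with argmax and Categorical.from_codes.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "blue", "green"],
                            categories=["blue", "green", "red"]),
    "size": [1.0, 2.5, 0.7, 3.1],
})

# Forward: expand the categorical column into one indicator column per
# category (columns follow the order of df["color"].cat.categories).
dummies = pd.get_dummies(df["color"])
X = np.column_stack([dummies.to_numpy(dtype=float), df["size"].to_numpy()])

# X is now a plain float array that an estimator could consume.

# Backward: the position of the 1 in each indicator row is the category
# code, so the original column can be rebuilt from the stored categories.
n_cat = len(df["color"].cat.categories)
codes = X[:, :n_cat].argmax(axis=1)
restored = pd.Categorical.from_codes(codes,
                                     categories=df["color"].cat.categories)
```

The user still has to keep track of which columns of X belong to which categorical feature, which is the bookkeeping a helper like from_dummies would presumably take over.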
Cool :)
My issue really came from the fact that I have to keep two copies of a dataframe (which may be very large) when exploring datasets with categorical data: an encoded copy for fitting models, and an unencoded copy for plotting and exploring the data. The encoded copy is basically useless except for passing to estimators. However, I realize that by using the pipeline module I could bypass this particular issue. I think it would be a nice feature to support categorical features as a datatype out of the box, but as you said, it is most likely not worth the investment.
Well, the pipeline would generate an intermediate version, so you wouldn't save any RAM. Making sklearn work directly on the categorical representation wouldn't really make sense for anything but tree-based models.
It seems that there is no support in sklearn for using pandas' Categorical datatype directly when fitting models. The Categorical datatype is convenient since it encodes categorical data and contains the mapping scheme of the data. In addition, categorical encoding is purely a data handling/processing problem, so it seems more natural that it would be handled by pandas. Is there anything in the pipeline to add support for this in the future? Would this require a lot of restructuring of the sklearn API?
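For readers unfamiliar with the datatype, a minimal sketch of what "encodes the data and contains the mapping scheme" means in practice, using made-up labels:

```python
import pandas as pd

# Hypothetical labels; with dtype="category" pandas stores one integer
# code per row plus a single shared table of the distinct labels.
s = pd.Series(["low", "high", "medium", "low"], dtype="category")

codes = s.cat.codes.to_numpy()   # integer code for each row
categories = s.cat.categories    # Index of labels; position = code

# The mapping scheme travels with the column, so codes can be decoded
# without any external lookup table.
mapping = dict(enumerate(categories))
decoded = [mapping[c] for c in codes]
```

Note that the integer codes carry no meaningful order unless the categorical is declared ordered, so passing them to an estimator as-is would make most models treat them as numeric magnitudes.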