Wishlist: make get_dummies() usable for train / test framework #8918
Well, how about a pseudo-code example with inputs and outputs from a sample frame? That would be useful. |
@chrish42, an example would be great. FYI, scikit-learn has the OneHotEncoder class, which fits into their pipeline. Something like this should work?

```python
import pandas as pd
from sklearn.base import TransformerMixin  # note: TransformerMixin lives in sklearn.base, not sklearn.pipeline

class DummyEncoder(TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None, **kwargs):
        return self

    def transform(self, X, y=None, **kwargs):
        return pd.get_dummies(X, columns=self.columns)
```
Be careful with the ordering of the columns. |
@TomAugspurger, actually the compatibility with the sklearn processing pipeline itself is not the part that interests me. What I would like is the ability to save the transformation done by get_dummies() on a dataset, and then apply that transformation as-is (creating exactly the same columns), even if the second dataset only has a subset of the first one's values in some column, etc. That's actually what I meant by "usable in a train/test framework". Is this explanation clearer? (I can add an example if someone thinks that's still needed.) I'm aware of the |
I stumbled upon the same problem as @chrish42, and I found get_dummies giving me some headache.

Example of the limitations of the current get_dummies: Let us assume we work with data from the following df_train DataFrame. Then we are provided with a df_test DataFrame. Since I have never observed a "mercedes" value for the variable "car" in df_train, I would like to be able to get a one-hot encoding in which the column car_mercedes never appears. This could be solved by allowing get_dummies to receive an input dictionary stating the accepted values that we allow for each column. Returning to the previous example, we could give get_dummies such a dict of sets, expect it to return only the dummy columns for those accepted values, and expect get_dummies(df_test) without that argument to keep returning what it already returns. |
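A sketch of that scenario (the frames below are invented stand-ins, since the example tables are missing from the comment; only the "car" / "mercedes" names come from the thread). The last step shows the workaround later comments converge on, using a Categorical with a fixed category set in place of the proposed dict argument:

```python
import pandas as pd

df_train = pd.DataFrame({"car": ["seat", "bmw"]})
df_test = pd.DataFrame({"car": ["seat", "mercedes"]})

# Plain get_dummies derives columns from the values it observes:
print(pd.get_dummies(df_train).columns.tolist())  # no car_mercedes column
print(pd.get_dummies(df_test).columns.tolist())   # car_mercedes appears

# Restricting "car" to an accepted value set via a Categorical dtype:
accepted = {"car": ["seat", "bmw"]}
df_test2 = df_test.assign(car=pd.Categorical(df_test["car"], categories=accepted["car"]))
print(pd.get_dummies(df_test2).columns.tolist())  # car_mercedes never appears
```

The unaccepted "mercedes" value becomes NaN under the Categorical dtype, so its row encodes as all zeros rather than creating a new column.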
You simply need to make your variables Categorical
The original question is not well specified, so closing. |
And when you're going the other way, from the encoding back to Categorical, you'll use Categorical.from_codes. One more bit of unsolicited advice: if you care at all about accurate estimates of the coefficients on the categoricals, drop one of the encoded columns, or else you'll have multicollinearity with the intercept (if you have one).
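A sketch of both points, on invented data: round-tripping through integer codes with Categorical.from_codes, and using get_dummies' drop_first to avoid the dummy-variable trap:

```python
import pandas as pd

cats = ["a", "b", "c"]
s = pd.Categorical(["b", "a", "c", "a"], categories=cats)
codes = s.codes  # integer encoding of the values against the category list

# Going from the codes back to the original Categorical:
restored = pd.Categorical.from_codes(codes, categories=cats)
print(list(restored))

# Dropping one dummy column avoids collinearity with an intercept:
dummies = pd.get_dummies(pd.Series(s), drop_first=True)
print(dummies.columns.tolist())  # the first category's column is dropped
```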
|
@TomAugspurger @jreback I think I have run into the same problem lately, and I would like to state an example:

```python
train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ['Toyota', 'Mazda', 'BMW']})
print(pd.get_dummies(train_a, columns=['Make']))
#    IsBadBuy  Make_BMW  Make_Mazda  Make_Toyota
# 0         0         0           0            1
# 1         1         0           1            0
# 2         0         1           0            0

test_a = pd.DataFrame({"Make": ['Toyota', 'BMW']})
print(pd.get_dummies(test_a, columns=['Make']))
#    Make_BMW  Make_Toyota
# 0         0            1
# 1         1            0
```

Here, ideally, the Make_Mazda column should be preserved, as the ML algorithm would expect the same number of features, and the values we get in the test set will be a subset of those in the train set. |
Use a Categorical. That will expand to the correct number of columns. I gave a talk about this, if you're interested: https://m.youtube.com/watch?v=KLPtEBokqQ0
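Applying that suggestion to the Make example above: declare the column as Categorical with the full set of training categories before calling get_dummies, and the test frame expands to the same columns as the train frame, unseen values included.

```python
import pandas as pd

train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ["Toyota", "Mazda", "BMW"]})
test_a = pd.DataFrame({"Make": ["Toyota", "BMW"]})

# Fix the category set from the training data:
make_dtype = pd.CategoricalDtype(categories=sorted(train_a["Make"].unique()))
test_a["Make"] = test_a["Make"].astype(make_dtype)

dummies = pd.get_dummies(test_a, columns=["Make"])
print(dummies.columns.tolist())  # Make_Mazda is preserved even though unseen in test
```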
|
Thanks @TomAugspurger |
The PyData Chicago 2016 talk given by @TomAugspurger was really well done. He did a fantastic job of illustrating all the reasons why this issue/request should not be closed. IMHO either his class DummyEncoder or some reasonable equivalent should be included in Pandas proper. Yes I can go to his github and copy/emulate his class but it would be much nicer to just have it supported within the library. |
I think there's a need for a library that goes early in the data-modeling
pipeline and works well with pandas and scikit-learn.
But pandas doesn't depend on scikit-learn and vice-versa. I think there's
room for another library built on top of both.
|
Here's a little solution some of us worked on that may be helpful for some here: dummy variables with fit/transform capabilities. https://github.com/joeddav/get_smarties Feedback and contributions would be welcome! |
This appears related to #14017 |
I have created a solution that may be helpful for exactly this problem: one-hot encoding categorical variables in a train/test framework. It can also handle cases when the dataset is too large to fit in the machine's memory. https://github.com/yashu-seth/dummyPy You can also find a small tutorial on this here. |
@TomAugspurger This code doesn't work. When I go to transform my production single-record data, it only gives me the one-hot encoded column for the single value that is present. In my setup, I import the data from SQL into a pandas DataFrame with pyodbc, drop the 'return_flag' column from both the training set and the production set, and run both through DummyEncoder. |
So I think this thread is a bit messy, so I will try to summarize a simple solution here and how this is already possible. I will demonstrate on one column, but you can generalize it to many. In "fit" you just do:

```python
categories = sorted(training_data.iloc[:, column_index].value_counts(dropna=True).index)
```

You store `categories`. Then in "transform" you do:

```python
from pandas.api import types as pandas_types

categorical_data = testing_data.iloc[:, [column_index]].astype(
    pandas_types.CategoricalDtype(categories=categories),
)
one_hot_encoded = pandas.get_dummies(categorical_data)
```

And it will do one-hot encoding always with the same mapping of values to columns. If some categorical value was not present during training, it will be seen as NaN during testing. If some value is not seen during testing, its column will simply contain only zeros. |
That's very nice. I just wish everyone who wants to do this didn't have to discover it anew. ;-) |
The approach suggested by @mitar is a nice, short example. For a longer exploration of this issue here's a notebook that might be useful/helpful: https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb |
Saw the below code in an exercise of the Kaggle XGBoost tutorial. This does the trick.
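The code block itself is missing above. One technique that appears in Kaggle course material for this is DataFrame.align, so the following is a reconstruction under that assumption (invented data, not necessarily the exercise's exact code):

```python
import pandas as pd

train = pd.DataFrame({"Make": ["Toyota", "Mazda", "BMW"]})
test = pd.DataFrame({"Make": ["Toyota", "BMW"]})

X_train = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Left-align the test columns to the train columns;
# dummy columns missing from test are created and filled with 0.
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)
print(X_test.columns.tolist())  # same columns, same order, as X_train
```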
|
I have also faced the same issue multiple times. I have written a class below (taking ideas from this discussion) that made things easier for me. It is easy to initiate and use an instance of the encoder as well.
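The class itself did not make it into the comment; a minimal sketch of what such an encoder might look like (the name `StatefulDummyEncoder` and all details are my own, built on the fixed-categories idea from this thread, not the commenter's actual code):

```python
import pandas as pd

class StatefulDummyEncoder:
    """Learn dummy columns on fit; reproduce exactly those columns on transform."""

    def __init__(self, columns):
        self.columns = columns   # names of columns to one-hot encode
        self.categories_ = {}    # learned value set per column

    def fit(self, df):
        for col in self.columns:
            self.categories_[col] = sorted(df[col].dropna().unique())
        return self

    def transform(self, df):
        df = df.copy()
        for col in self.columns:
            # Fixing the categories guarantees the same output columns
            # regardless of which values are present in df.
            df[col] = pd.Categorical(df[col], categories=self.categories_[col])
        return pd.get_dummies(df, columns=self.columns)

enc = StatefulDummyEncoder(columns=["Make"])
train = pd.DataFrame({"Make": ["Toyota", "Mazda", "BMW"]})
test = pd.DataFrame({"Make": ["Toyota", "BMW"]})
cols = enc.fit(train).transform(test).columns.tolist()
print(cols)  # all three Make_* columns, even though test lacks Mazda
```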
|
Having get_dummies() in pandas is really nice, but to be useful for machine learning, it would need to be usable in a train/test framework (or "fit_transform" and "transform", in sklearn terminology). Let me know if this needs more explanation.
So, I guess this is a wishlist bug report to add that functionality to pandas. I can even create a pull request, if people agree this would be something useful to have in pandas (and are willing to coach a bit and do code review for what would be my first contribution to this project).