Skip to content

Raising SettingWithCopyWarning in cross_validate #16191

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
glemaitre opened this issue Jan 24, 2020 · 13 comments · Fixed by #20673
Closed

Raising SettingWithCopyWarning in cross_validate #16191

glemaitre opened this issue Jan 24, 2020 · 13 comments · Fixed by #20673

Comments

@glemaitre
Copy link
Member

glemaitre commented Jan 24, 2020

Before submitting a bug, please make sure the issue hasn't been already
addressed by searching through the past issues.

Describe the bug

Using a FunctionTransformer which add a column to a dataframe will raise a SettingWithCopyWarning when used in cross_validate`.

This is highly linked to #8723 but I think that in this case, we should do something about it since:

  • FunctionTransformer is stateless transformer
  • such pipeline could be a valid pipeline

Steps/Code to Reproduce

from sklearn.datasets import fetch_openml

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

def processing(X):
    # trigger a dummy processing which should be ok
    X.loc[:, 'new_age'] = X["age"].fillna(0).astype(int)
    return X

from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

func_trans = FunctionTransformer(processing)
encoder = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder(handle_unknown='ignore')
    ), ["pclass", "sex", "embarked"]),
    (make_pipeline(SimpleImputer(), StandardScaler()), ["age", "new_age"])
)
pipeline = make_pipeline(
    func_trans,
    encoder,
    LogisticRegression()
)

print("Fit the pipeline on the entire data")
pipeline.fit(X, y)

from sklearn.model_selection import cross_val_score

print("Apply cross-validation")
cross_val_score(pipeline, X, y)

Expected Results

Getting some SettingWithCopyWarning

@glemaitre glemaitre added the Bug label Jan 24, 2020
@jorisvandenbossche
Copy link
Member

This is indeed the same underlying issue as was discussed in #8723.
Some related discussions on the pandas side: pandas-dev/pandas#18801, pandas-dev/pandas#18799, pandas-dev/pandas#19102

@glemaitre
Copy link
Member Author

I would propose to set the attribute is_copy or _is_copy to False within cross-validate since this would be an internal thing.

@glemaitre
Copy link
Member Author

ping @jnothman @amueller

@jorisvandenbossche
Copy link
Member

It's setting it to None (not False), I think.
Indeed, that would be the solution that avoids an additional copy of the data. Just note that it is a private attribute, so I can't guarantee it will never change (but if you do it with hasattr, you should be able to write it in a safe way, that doesn't break if it changes in pandas).

I reopened pandas-dev/pandas#19102 to have a potential discussion for a public way to do this.

@jorisvandenbossche
Copy link
Member

Actually, now I think about it, there is a better way: you can use DataFrame.take instead of iloc in safe_indexing. In DataFrame.take, there is an option to ensure the _is_copy is not set, so you can ensure that the result is a "real" copy that will not generate those warnings.

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.take.html. The explanation of the is_copy argument is confusing / wrong, you need to use is_copy=False to get a "real" copy (without tracking the parent dataframe).
This keyword will be deprecated in 1.0, but take will then also by default do such a real copy. So you probably need a version check on whether to pass the argument or not. See pandas-dev/pandas#30784

@Henley13
Copy link
Contributor

Henley13 commented Jan 29, 2020

Hello, I can take this issue.

@jnothman
Copy link
Member

@Henley13 try making the comment "take" to assign yourself the issues

@thomasjpfan
Copy link
Member

I believe the canonical pandas way to deal with this is to make a copy:

def processing(X):
    return X.assign(new_age=X["age"].fillna(0).astype(int))

I see copying as making FunctionTransformer stateless, since it would not modify the input.

@amueller
Copy link
Member

The canonical pandas way leads to many unnecessary copies as I noted in the issue @jorisvandenbossche linked pandas-dev/pandas#18799

@thomasjpfan
Copy link
Member

For this specific example, I am thinking if FunctionTransformer should be changing the input in general. For simple transformations, we make a copy:

import numpy as np
from sklearn.preprocessing import FunctionTransformer
from numpy.testing import assert_array_equal

transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
X_trans = transformer.transform(X)

assert_array_equal(X, [[0, 1], [2, 3]])

But for the pandas example we have above:

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import FunctionTransformer

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

def processing(X):
    X.loc[:, "new_age"] = X["age"].fillna(0).astype(int)
    return X

X_trans = FunctionTransformer(processing).transform(X)

assert "new_age" in X.columns

It changes the input. Are we okay with this type of usage?

@jnothman
Copy link
Member

jnothman commented Feb 2, 2020 via email

@thomasjpfan
Copy link
Member

I am thinking about how we document the FunctionTransformer as "useful for stateless transformations" and modifying the input is not stateless. Imagine if we tried to do the same thing as the original issue but the input is a numpy array, we would be forced to create a new numpy array.

I see two options to resolve this issue is:

  1. Use private pandas API as shown in https://github.com/scikit-learn/scikit-learn/pull/16293/files
  2. Document that FunctionTransformer should return copies and modify the input only if you know what you are doing.

@linehammer
Copy link

The first thing you should understand is that SettingWithCopyWarning is a warning, and not an error. The real problem behind the warning is that it is generally difficult to predict whether a view or a copy is returned. In most cases, the warning was raised because you have chained two indexing operations together. The SettingWithCopyWarning was created to flag "chained assignment" operations. This is made easier to spot because you might be used [] (square brackets) twice, but the same would be true if you used other access methods such as .loc[] , .iloc[] and so on.

Moreover, you can change the behaviour of SettingWithCopyWarning warning using pd.options.mode.chained_assignment with three option "None/raise"/"warn".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment