
HistGradientBoostingClassifier does not support pd.Int64Dtype in v1.4.0 #28317

Closed
timvink opened this issue Jan 31, 2024 · 4 comments · Fixed by #28385
Comments

@timvink
Contributor

timvink commented Jan 31, 2024

Describe the bug

Fitting a HistGradientBoostingClassifier where one of the features has a pd.Int64Dtype dtype will give an error:

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'

Steps/Code to Reproduce

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train['i'] = 1
X_train['i'] = X_train['i'].astype(pd.Int64Dtype())
clf = LogisticRegression()
clf.fit(X_train, y_train) # all good
clf = RandomForestClassifier()
clf.fit(X_train, y_train) # all good
clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train) # breaks

Expected Results

No error is thrown.

Actual Results

The stack trace suggests this is related to HistGradientBoostingClassifier gaining support for categorical dtypes in v1.4.0.

stacktrace
File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:558, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    556 # time spent predicting X for gradient and hessians update
    557 acc_prediction_time = 0.0
--> 558 X, known_categories = self._preprocess_X(X, reset=True)
    559 y = _check_y(y, estimator=self)
    560 y = self._encode_y(y)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:271, in BaseHistGradientBoosting._preprocess_X(self, X, reset)
    268     return self._preprocessor.transform(X)
    270 # At this point, reset is False, which runs during `fit`.
--> 271 self.is_categorical_ = self._check_categorical_features(X)
    273 if self.is_categorical_ is None:
    274     self._preprocessor = None

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:374, in BaseHistGradientBoosting._check_categorical_features(self, X)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
--> 374         [
    375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:375, in <listcomp>(.0)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
    374         [
--> 375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:128, in PandasColumn.dtype(self)
    126     raise NotImplementedError("Non-string object dtypes are not supported yet")
    127 else:
--> 128     return self._dtype_from_pandasdtype(dtype)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:147, in PandasColumn._dtype_from_pandasdtype(self, dtype)
    145     byteorder = dtype.base.byteorder  # type: ignore[union-attr]
    146 else:
--> 147     byteorder = dtype.byteorder
    149 return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), byteorder

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'
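
The last frame can probably be reproduced without scikit-learn. Here is a minimal sketch, assuming only that pandas and NumPy are installed: NumPy dtypes expose a `byteorder` attribute, while the nullable `Int64Dtype` extension dtype does not, which matches the `dtype.byteorder` access that fails in the traceback above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

numpy_dtype = np.dtype("int64")
nullable_dtype = pd.Int64Dtype()

# NumPy dtypes carry a byte-order flag as a one-character string.
print(numpy_dtype.byteorder)

# The nullable pandas dtype is an extension dtype; per the traceback above,
# it has no `byteorder` attribute, so `dtype.byteorder` raises AttributeError.
has_byteorder = hasattr(nullable_dtype, "byteorder")
print(has_byteorder)

# The underlying NumPy dtype is still reachable through `numpy_dtype`.
print(nullable_dtype.numpy_dtype.byteorder)
```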

Versions

system information
System:
    python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
executable: /anaconda/envs/ds_data_schemas/bin/python
   machine: Linux-5.15.0-1053-azure-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX
@ogrisel
Member

ogrisel commented Jan 31, 2024

Thanks for the report. I also confirm that this problem did not exist in 1.3.2, so this can be considered a regression introduced in 1.4.0.

Adding the milestone for 1.4.1.
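
Until a fixed release is available, one possible user-side workaround is to cast nullable integer columns to a NumPy-backed dtype before calling `fit`. This is a sketch under assumptions, not something proposed in this thread: the helper name `to_numpy_backed` is hypothetical, and it only handles `Int64` columns, casting them to `float64` so that missing values become `NaN`.

```python
import pandas as pd


def to_numpy_backed(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with Int64 columns cast to float64."""
    out = df.copy()
    for name, dtype in out.dtypes.items():
        # Only nullable Int64 columns are touched; everything else is kept.
        if isinstance(dtype, pd.Int64Dtype):
            # pd.NA becomes NaN in the float64 result.
            out[name] = out[name].astype("float64")
    return out


df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],
        "i": pd.array([1, None, 3], dtype="Int64"),
    }
)
converted = to_numpy_backed(df)
print(converted.dtypes)
```

With the reproducer above, this would be used as `clf.fit(to_numpy_backed(X_train), y_train)`. HistGradientBoostingClassifier handles `NaN` values natively, so casting to `float64` should not require a separate imputation step.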

@ogrisel ogrisel added this to the 1.4.1 milestone Jan 31, 2024
@lesteve
Member

lesteve commented Feb 1, 2024

This may be a pandas bug actually, see pandas-dev/pandas#55069.

I am not too sure whether we want to be robust against this bug in the scikit-learn code. The pandas fix (pandas-dev/pandas#57173) seems reasonably small for now, so maybe it is worth having it in sklearn.utils.fixes?
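
To illustrate what being robust on the scikit-learn side could look like, here is a sketch that detects categorical columns from the pandas dtypes directly instead of going through the dataframe interchange protocol. It is not the code from #28385 or from pandas-dev/pandas#57173; the helper name `categorical_columns_mask_from_dtypes` is hypothetical and the approach is only an assumption about one possible direction.

```python
import numpy as np
import pandas as pd


def categorical_columns_mask_from_dtypes(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: flag categorical columns using pandas dtypes only."""
    # Inspecting df.dtypes never calls the interchange protocol, so the
    # nullable Int64 column is not asked for a `byteorder`.
    return np.asarray(
        [isinstance(dtype, pd.CategoricalDtype) for dtype in df.dtypes]
    )


df = pd.DataFrame(
    {
        "i": pd.array([1, 2, 3], dtype="Int64"),
        "c": pd.Categorical(["a", "b", "a"]),
        "x": [0.1, 0.2, 0.3],
    }
)
mask = categorical_columns_mask_from_dtypes(df)
print(mask)
```

The trade-off is that this only works for pandas dataframes, whereas the interchange protocol is meant to cover other dataframe libraries as well.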

@lesteve
Member

lesteve commented Feb 8, 2024

I opened a PR fixing this: #28385

@timvink
Contributor Author

timvink commented Feb 10, 2024

Thanks!

3 participants