`HistGradientBoostingClassifier` does not support `pd.Int64Dtype` in v1.4.0 #28317

timvink · 2024-01-31T09:07:34Z

Describe the bug

Fitting a HistGradientBoostingClassifier when one of the features has the pd.Int64Dtype dtype raises an error:

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'

Steps/Code to Reproduce

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train['i'] = 1
X_train['i'] = X_train['i'].astype(pd.Int64Dtype())
clf = LogisticRegression()
clf.fit(X_train, y_train)  # all good
clf = RandomForestClassifier()
clf.fit(X_train, y_train)  # all good
clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train)  # breaks with the AttributeError shown above

Expected Results

No error is thrown.

Actual Results

The stack trace suggests it is related to HistGradientBoostingClassifier getting support for categorical dtypes in v1.4.0.

stacktrace

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:558, in BaseHistGradientBoosting.fit(self, X, y, sample_weight)
    556 # time spent predicting X for gradient and hessians update
    557 acc_prediction_time = 0.0
--> 558 X, known_categories = self._preprocess_X(X, reset=True)
    559 y = _check_y(y, estimator=self)
    560 y = self._encode_y(y)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:271, in BaseHistGradientBoosting._preprocess_X(self, X, reset)
    268     return self._preprocessor.transform(X)
    270 # At this point, reset is False, which runs during `fit`.
--> 271 self.is_categorical_ = self._check_categorical_features(X)
    273 if self.is_categorical_ is None:
    274     self._preprocessor = None

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:374, in BaseHistGradientBoosting._check_categorical_features(self, X)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
--> 374         [
    375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:375, in <listcomp>(.0)
    371 if hasattr(X, "__dataframe__"):
    372     X_is_dataframe = True
    373     categorical_columns_mask = np.asarray(
    374         [
--> 375             c.dtype[0].name == "CATEGORICAL"
    376             for c in X.__dataframe__().get_columns()
    377         ]
    378     )
    379     X_has_categorical_columns = categorical_columns_mask.any()
    380 # pandas versions < 1.5.1 do not support the dataframe interchange
    381 # protocol so we inspect X.dtypes directly

File properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:128, in PandasColumn.dtype(self)
    126     raise NotImplementedError("Non-string object dtypes are not supported yet")
    127 else:
--> 128     return self._dtype_from_pandasdtype(dtype)

File /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/pandas/core/interchange/column.py:147, in PandasColumn._dtype_from_pandasdtype(self, dtype)
    145     byteorder = dtype.base.byteorder  # type: ignore[union-attr]
    146 else:
--> 147     byteorder = dtype.byteorder
    149 return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), byteorder

AttributeError: 'Int64Dtype' object has no attribute 'byteorder'

Versions

system information

System:
    python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
executable: /anaconda/envs/ds_data_schemas/bin/python
   machine: Linux-5.15.0-1053-azure-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.2
   setuptools: 69.0.3
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libgomp
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: SkylakeX

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libopenblas
       filepath: /anaconda/envs/ds_data_schemas/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: SkylakeX

ogrisel · 2024-01-31T16:21:13Z

Thanks for the report. I also confirm that this problem did not exist in 1.3.2, so this can be considered a regression introduced in 1.4.0.

Adding the milestone for 1.4.1.

lesteve · 2024-02-01T14:40:01Z

This may be a pandas bug actually, see pandas-dev/pandas#55069.
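
To check whether the failure comes from pandas alone, the interchange-protocol call from the stack trace can be run without scikit-learn. This is a minimal sketch, not part of the original report: the helper name `interchange_dtype_or_error` is hypothetical, and the outcome depends on the installed pandas version (with the pandas 2.2.0 from the report one would expect the AttributeError; a version containing the fix should return a dtype tuple).

```python
import pandas as pd

# One column with the nullable integer extension dtype used in the report.
df = pd.DataFrame({"i": pd.array([1, 2, 3], dtype="Int64")})


def interchange_dtype_or_error(frame, column_name):
    """Hypothetical helper: return the interchange dtype, or the exception raised."""
    try:
        # Same attribute access as `c.dtype` in the scikit-learn stack trace.
        return frame.__dataframe__().get_column_by_name(column_name).dtype
    except (AttributeError, NotImplementedError) as exc:
        return exc


result = interchange_dtype_or_error(df, "i")
print(type(result).__name__, result)
```

If this prints an AttributeError mentioning `byteorder`, the problem is reproducible with pandas only.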

I am not too sure whether we want to be robust against this bug in the scikit-learn code. The pandas fix (pandas-dev/pandas#57173) seems reasonably small for now, so maybe it is worth having it in sklearn.utils.fixes?
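
Until a fix is available, two ways of sidestepping the interchange protocol can be sketched as follows. This is an illustration under stated assumptions, not necessarily what scikit-learn or #28385 does: both helper names (`categorical_mask_from_dtypes`, `cast_nullable_ints`) are hypothetical. The first detects categorical columns from the pandas dtypes directly; the second is a user-side workaround that casts nullable integer columns to float64 before fitting, so that missing values become NaN.

```python
import numpy as np
import pandas as pd


def categorical_mask_from_dtypes(frame):
    """Hypothetical helper: flag categorical columns using pandas dtypes only."""
    return np.asarray(
        [isinstance(dtype, pd.CategoricalDtype) for dtype in frame.dtypes]
    )


def cast_nullable_ints(frame):
    """Hypothetical workaround: cast nullable integer columns to float64."""
    nullable_int_columns = [
        name
        for name, dtype in frame.dtypes.items()
        if pd.api.types.is_extension_array_dtype(dtype)
        and pd.api.types.is_integer_dtype(dtype)
    ]
    # astype returns a copy, so the caller's DataFrame is left untouched.
    return frame.astype({name: "float64" for name in nullable_int_columns})


frame = pd.DataFrame(
    {
        "i": pd.array([1, 2, None], dtype="Int64"),
        "c": pd.Categorical(["a", "b", "a"]),
        "f": [0.1, 0.2, 0.3],
    }
)
mask = categorical_mask_from_dtypes(frame)
converted = cast_nullable_ints(frame)
print(mask)
print(converted.dtypes)
```

Casting to float64 rather than int64 is a deliberate choice here: it keeps the sketch working when the nullable column contains missing values.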

lesteve · 2024-02-08T11:55:02Z

I opened a PR fixing this: #28385

timvink · 2024-02-10T18:46:23Z

Thanks!

timvink added the Bug and Needs Triage labels Jan 31, 2024

timvink mentioned this issue Jan 31, 2024: BUG: pandas int extension dtypes has no attribute byteorder pandas-dev/pandas#55069 (Closed)

ogrisel removed the Needs Triage label Jan 31, 2024

ogrisel added this to the 1.4.1 milestone Jan 31, 2024

ogrisel added the Regression label Jan 31, 2024

lesteve mentioned this issue Feb 8, 2024: FIX HistgradientBoosting with pandas extension dtypes #28385 (Merged)

jjerphan closed this as completed in #28385 Feb 9, 2024
