.. _slep_014:

==============================
SLEP014: Pandas In, Pandas Out
==============================

:Author: Thomas J Fan
:Status: Rejected
:Type: Standards Track
:Created: 2020-02-18

Abstract
########

This SLEP proposes using pandas DataFrames for propagating feature names
through ``scikit-learn`` transformers.
Motivation
##########

``scikit-learn`` is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the features. This interface can be cumbersome when
used together with a pipeline that operates on multiple columns::

    import pandas as pd
    import numpy as np
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'num': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num']

    ct = make_column_transformer(
        (OneHotEncoder(), orig_cat_cols), (StandardScaler(), orig_num_cols))
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    cat_names = (pipe['columntransformer']
                 .named_transformers_['onehotencoder']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]

The ``feature_names`` extracted above correspond to the features directly
passed into ``LogisticRegression``. As demonstrated above, the process of
extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
would need to be used to determine the column names that were selected.
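To make the extra bookkeeping concrete, the following sketch shows the manual
masking that ``get_support`` requires after a ``SelectKBest`` step. The data
and feature names here are made up for illustration, standing in for the
output of an upstream transformer::

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # Made-up feature names, standing in for the output of an upstream
    # transformer such as a ColumnTransformer
    feature_names = np.array(['letter_a', 'letter_b', 'letter_c',
                              'pet_dog', 'pet_snake', 'num'])

    X = np.array([[1, 0, 0, 1, 0, 1],
                  [0, 1, 0, 0, 1, 2],
                  [0, 0, 1, 1, 0, 3]])
    y = [0, 0, 1]

    selector = SelectKBest(chi2, k=2).fit(X, y)

    # get_support returns a boolean mask over the input columns, which must
    # be applied to the name array by hand to recover the selected names
    selected_names = feature_names[selector.get_support()]

Nothing in the pipeline carries the names along automatically; the user has to
keep ``feature_names`` in sync with every step by hand.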

Solution
########

The pandas DataFrame has been widely adopted by the Python data ecosystem to
store data with feature names. This SLEP proposes using a DataFrame to
track the feature names as the data is transformed. With this feature, the
API for extracting feature names would be::

    from sklearn import set_config
    set_config(pandas_in_out=True)

    pipe.fit(X, y)
    X_trans = pipe[:-1].transform(X)

    X_trans.columns.tolist()
    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

This SLEP proposes attaching feature names to the output of ``transform``. In
the above example, ``pipe[:-1].transform(X)`` propagates the feature names
through the multiple transformers.

This feature is only available through a soft dependency on pandas. Furthermore,
it will be opt-in with the configuration flag ``pandas_in_out``. By
default, ``pandas_in_out`` is set to ``False``, so the output of all
estimators remains an ndarray.
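The opt-in behavior could be sketched roughly as follows. This is a simplified
stand-in, not scikit-learn's actual configuration machinery; ``wrap_output``
and the ``_config`` dictionary are hypothetical names used only to illustrate
the flag-gated wrapping::

    import numpy as np
    import pandas as pd

    # Hypothetical global configuration, standing in for sklearn's config
    _config = {'pandas_in_out': False}

    def set_config(**kwargs):
        _config.update(kwargs)

    def wrap_output(X_trans, feature_names):
        """Wrap an ndarray result in a DataFrame when the flag is enabled."""
        if _config['pandas_in_out']:
            return pd.DataFrame(X_trans, columns=feature_names)
        return X_trans

    X = np.ones((3, 2))
    out_default = wrap_output(X, ['a', 'b'])  # flag off -> ndarray
    set_config(pandas_in_out=True)
    out_pandas = wrap_output(X, ['a', 'b'])   # flag on -> named DataFrame

With the flag off, nothing changes for existing users; with it on, every
``transform`` result carries its column names forward.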

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. These features are
made possible if this SLEP gets accepted.

1. Allowing estimators to treat columns differently based on name or dtype. For
   example, the categorical dtype is useful for tree building algorithms.

2. Storing feature names inside estimators for model inspection::

    from sklearn import set_config
    set_config(store_feature_names_in=True)

    pipe.fit(X, y)

    pipe['logisticregression'].feature_names_in_

3. Extracting the feature names of estimators in meta-estimators::

    from sklearn import set_config
    set_config(store_feature_names_in=True)

    est = BaggingClassifier(LogisticRegression())
    est.fit(X, y)

    # Gets the feature names used by an estimator in the ensemble
    est.estimators_[0].feature_names_in_

For options 2 and 3, the default value of the ``store_feature_names_in``
configuration flag is ``False``.
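As an illustration of point 1, receiving a DataFrame would let an estimator
distinguish columns by dtype using plain pandas. This is a generic pandas
sketch, not a proposed scikit-learn API::

    import pandas as pd

    X = pd.DataFrame({'letter': pd.Categorical(['a', 'b', 'c']),
                      'num': [1.0, 2.0, 3.0]})

    # With a DataFrame input, an estimator could branch on dtype, e.g.
    # routing categorical columns to a dedicated tree-splitting strategy
    cat_cols = X.select_dtypes(include='category').columns.tolist()
    num_cols = X.select_dtypes(include='number').columns.tolist()

With a bare ndarray, this dtype information is lost before the estimator
ever sees the data.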

Considerations
##############

Memory copies
-------------

As noted in `pandas #27211 <https://github.com/pandas-dev/pandas/issues/27211>`_,
there is no guarantee of a zero-copy round-trip when going from numpy to a
DataFrame and back. In other words, the following may lead to a memory copy in
a future version of ``pandas``::

    X = np.array(...)
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)
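Whether a given pandas version copies in this round-trip can be checked
directly with ``np.shares_memory``; the point of the issue above is that a
``True`` result today is not guaranteed to hold in future releases::

    import numpy as np
    import pandas as pd

    X = np.arange(12, dtype=np.float64).reshape(4, 3)
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)

    # True only when the round-trip avoided a copy; pandas makes no
    # guarantee about this either way
    shares = np.shares_memory(X, X_again)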

This is an issue for ``scikit-learn`` when estimators are placed into a
pipeline. For example, consider the following pipeline::

    set_config(pandas_in_out=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X, y)

Internally, ``StandardScaler.fit_transform`` will operate on an ndarray and
wrap the ndarray into a DataFrame as a return value. This will then be
piped into ``LogisticRegression.fit``, which calls ``check_array`` on the
DataFrame, which may lead to a memory copy in a future version of
``pandas``. This leads to unnecessary overhead from piping the data from one
estimator to another.

Sparse matrices
---------------

Traditionally, ``scikit-learn`` prefers to process sparse matrices in the
compressed sparse row (CSR) format. The `sparse data structure
<https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html>`_ in
pandas 1.0 only supports converting directly to the coordinate (COO) format.
Although this format was designed to convert quickly to the CSR or CSC
formats, the conversion process still needs to allocate additional memory.
This can be an issue with transformers such as ``OneHotEncoder``, whose
``transform`` has been optimized to construct a CSR matrix directly.
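The extra conversion step can be seen with pandas' sparse accessor: going from
a sparse DataFrame to CSR necessarily passes through COO. A small illustrative
sketch::

    import pandas as pd

    # A DataFrame with sparse columns (fill_value=0 so zeros are implicit)
    df = pd.DataFrame({'a': pd.arrays.SparseArray([0, 0, 1.0], fill_value=0),
                       'b': pd.arrays.SparseArray([1.0, 0, 0], fill_value=0)})

    # pandas can only export directly to the COO format...
    coo = df.sparse.to_coo()

    # ...so reaching scikit-learn's preferred CSR format requires a second
    # conversion, which allocates new index arrays
    csr = coo.tocsr()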

Backward compatibility
######################

The ``set_config(pandas_in_out=True)`` global configuration flag will be set to
``False`` by default to ensure backward compatibility. When this flag is
``False``, the output of all estimators will be an ndarray.

Community Adoption
##################

With the new ``pandas_in_out`` configuration flag, third party libraries may
need to query the configuration flag to be fully compliant with this SLEP.
Specifically, "to be fully compliant" entails the following policy:

1. If ``pandas_in_out=False``, then ``transform`` always returns a numpy array.
2. If ``pandas_in_out=True``, then ``transform`` returns a DataFrame if the
   input is a DataFrame.

This policy can either be enforced with ``check_estimator`` or not:

- **Enforce**: This increases the maintenance burden of third party libraries.
  This burden includes checking for the configuration flag, generating feature
  names, and including pandas as a dependency of their library.

- **Not enforce**: Currently, third party transformers can return a DataFrame
  or a numpy array, and this is mostly compatible with ``scikit-learn``. Users
  of third party transformers would not be able to access the features enabled
  by this SLEP.
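The two rules above could be expressed as a small helper that a third-party
``transform`` calls on its result before returning. This is an illustrative
sketch; ``finalize_transform_output`` is a made-up name, not a proposed API::

    import numpy as np
    import pandas as pd

    def finalize_transform_output(X_in, X_trans, feature_names, pandas_in_out):
        """Apply the proposed output policy to a transformer's result."""
        if pandas_in_out and isinstance(X_in, pd.DataFrame):
            # Rule 2: flag on and DataFrame in -> DataFrame out
            return pd.DataFrame(np.asarray(X_trans), columns=feature_names)
        # Rule 1: flag off -> always a numpy array
        return np.asarray(X_trans)

    X_in = pd.DataFrame({'num': [1.0, 2.0]})
    X_trans = np.array([[0.5], [1.5]])

    out_off = finalize_transform_output(X_in, X_trans, ['num'], False)
    out_on = finalize_transform_output(X_in, X_trans, ['num'], True)

Even in this small form, the helper shows the cost to third parties: every
transformer must query the flag and know its output feature names.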


Alternatives
############

This section lists alternative data structures that could be used, with their
advantages and disadvantages when compared to a pandas DataFrame.

InputArray
----------

The proposed ``InputArray`` described in
:ref:`SLEP012 Custom InputArray Data Structure <slep_012>` introduces a new
data structure for homogeneous data.

Pros
~~~~

- A thin wrapper around a numpy array or a sparse matrix with a minimal feature
  set that ``scikit-learn`` can evolve independently.

Cons
~~~~

- Introduces another data structure for data storage in the PyData ecosystem.
- Currently, the design only allows for homogeneous data.
- Increases maintenance responsibilities for ``scikit-learn``.

XArray Dataset
--------------

`xarray's Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_
is a multi-dimensional version of pandas' DataFrame.

Pros
~~~~

- Can be used for heterogeneous data.

Cons
~~~~

- ``scikit-learn`` does not require many of the features Dataset provides.
- Needs to be converted to a DataArray before it can be converted to a numpy
  array.
- The `conversion from a pandas DataFrame to a Dataset <http://xarray.pydata.org/en/stable/pandas.html>`_
  is not lossless. For example, categorical dtypes in a pandas DataFrame will
  lose their categorical information when converted to a Dataset.
- xarray does not have as much adoption as pandas, which increases the learning
  curve for using Dataset with ``scikit-learn``.

XArray DataArray
----------------

`xarray's DataArray <http://xarray.pydata.org/en/stable/data-structures.html#dataarray>`_
is a data structure that stores homogeneous data.

Pros
~~~~

- xarray guarantees that there will be no copies during round-trips from
  numpy. (`xarray #3077 <https://github.com/pydata/xarray/issues/3077>`_)

Cons
~~~~

- Can only be used for homogeneous data.
- As with xarray's Dataset, DataArray does not have as much adoption as pandas,
  which increases the learning curve for using DataArray with ``scikit-learn``.

References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
#########

This document has been placed in the public domain. [1]_