scikit-learn · adrinjalali · Nov 29, 2022 · Feb 19, 2020
diff --git a/slep014/proposal.rst b/slep014/proposal.rst
@@ -0,0 +1,202 @@
+.. _slep_014:
+
+==============================
+SLEP014: Pandas In, Pandas Out
+==============================
+
+:Author: Thomas J Fan
+:Status: Draft
+:Type: Standards Track
+:Created: 2020-02-18
+
+Abstract
+########
+
+This SLEP proposes using pandas DataFrames for propagating feature names
+through ``scikit-learn`` transformers.
+
+Motivation
+##########
+
+``scikit-learn`` is commonly used as a part of a larger data processing
+pipeline. When this pipeline is used to transform data, the result is a
+NumPy array, discarding column names. The current workflow for
+extracting the feature names requires calling ``get_feature_names`` on the
+transformer that created the feature. This interface can be cumbersome when
+the pipeline transforms several groups of columns::
+
+    import pandas as pd
+    import numpy as np
+    from sklearn.compose import make_column_transformer
+    from sklearn.preprocessing import OneHotEncoder, StandardScaler
+    from sklearn.pipeline import make_pipeline
+    from sklearn.linear_model import LogisticRegression
+
+    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
+                      'pet': ['dog', 'snake', 'dog'],
+                      'num': [1, 2, 3]})
+    y = [0, 0, 1]
+    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num']
+
+    ct = make_column_transformer(
+        (OneHotEncoder(), orig_cat_cols), (StandardScaler(), orig_num_cols))
+    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
+
+    cat_names = (pipe['columntransformer']
+                 .named_transformers_['onehotencoder']
+                 .get_feature_names(orig_cat_cols))
+
+    feature_names = np.r_[cat_names, orig_num_cols]
+
+The ``feature_names`` extracted above correspond to the features directly
+passed into ``LogisticRegression``. As demonstrated above, the process of
+extracting ``feature_names`` requires knowing the order of the selected
+categories in the ``ColumnTransformer``. Furthermore, if there is feature
+selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
+would need to be used to pick out the names of the columns that were selected.
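+
+The bookkeeping described above can be sketched without scikit-learn. The
+helpers below (``one_hot_names``, ``column_transformer_names``,
+``apply_support``) are hypothetical stand-ins, not library API, and the
+selection mask is made up for illustration. The sketch only shows that the
+caller has to know both the output order and the selection mask.
+

```python
# Toy stand-ins (not scikit-learn API) for the manual name bookkeeping.

def one_hot_names(column, categories):
    """Names a one-hot encoder would emit for one categorical column."""
    return [f"{column}_{category}" for category in sorted(categories)]


def column_transformer_names(data, cat_cols, num_cols):
    """Concatenate names in the order the column transformer emits them:
    encoded categorical columns first, then numeric columns unchanged."""
    names = []
    for col in cat_cols:
        names.extend(one_hot_names(col, set(data[col])))
    return names + list(num_cols)


def apply_support(names, support):
    """Keep only the names whose entry in a boolean selection mask is True,
    which is what a ``get_support``-style mask forces the caller to do."""
    return [name for name, keep in zip(names, support) if keep]


data = {
    "letter": ["a", "b", "c"],
    "pet": ["dog", "snake", "dog"],
    "num": [1, 2, 3],
}
feature_names = column_transformer_names(data, ["letter", "pet"], ["num"])

# Hypothetical mask, as if a selector had kept three of the six features.
support = [True, False, False, True, False, True]
selected_names = apply_support(feature_names, support)
```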
+
+Solution
+########
+
+The pandas DataFrame has been widely adopted by the Python Data ecosystem to
+store data with feature names. This SLEP proposes using a DataFrame to
+track the feature names as the data is transformed. With this feature, the
+API for extracting feature names would be::
+
+    from sklearn import set_config
+    set_config(pandas_in_out=True)
+
+    pipe.fit(X, y)
+    X_trans = pipe[:-1].transform(X)
+
+    X_trans.columns.tolist()
+    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']
+
+This SLEP proposes attaching feature names to the output of ``transform``. In
+the above example, ``pipe[:-1].transform(X)`` propagates the feature names
+through the multiple transformers.
+
+This feature is only available through a soft dependency on pandas. Furthermore,
+it will be opt-in through the configuration flag ``pandas_in_out``. By
+default, ``pandas_in_out`` is set to ``False``, so the output of every
+estimator is an ndarray.
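+
+The opt-in behaviour can be illustrated with a toy configuration registry.
+Everything below is an assumption made for illustration: ``_config``,
+``wrap_output`` and the dict standing in for a DataFrame are hypothetical,
+and the local ``set_config`` / ``config_context`` only imitate the shape of
+the scikit-learn functions of the same name.
+

```python
from contextlib import contextmanager

# Hypothetical module-level configuration; off by default, as proposed.
_config = {"pandas_in_out": False}


def get_config():
    """Return a copy of the current configuration."""
    return dict(_config)


def set_config(pandas_in_out=None):
    """Update the flag; ``None`` leaves it unchanged."""
    if pandas_in_out is not None:
        _config["pandas_in_out"] = bool(pandas_in_out)


@contextmanager
def config_context(**new_config):
    """Temporarily change the configuration, restoring it on exit."""
    old = get_config()
    set_config(**new_config)
    try:
        yield
    finally:
        set_config(**old)


def wrap_output(rows, names):
    """Return plain rows by default; attach names only when opted in."""
    if get_config()["pandas_in_out"]:
        # Stand-in for building a DataFrame with named columns.
        return {"columns": list(names), "data": rows}
    return rows


rows = [[0.0, 1.0], [1.0, 0.0]]
names = ["pet_dog", "pet_snake"]

default_out = wrap_output(rows, names)        # flag is False: plain rows
with config_context(pandas_in_out=True):
    opted_in_out = wrap_output(rows, names)   # flag is True: names attached
after_out = wrap_output(rows, names)          # flag restored to False
```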
+
+Enabling Functionality
+######################
+
+The following enhancements are **not** a part of this SLEP. These features are
+made possible if this SLEP gets accepted.
+
+1. Allow estimators to treat columns differently based on name or dtype. For
+   example, the categorical dtype is useful for tree building algorithms.
+
+2. Store feature names inside estimators for model inspection::
+
+    from sklearn import set_config
+    set_config(store_feature_names_in=True)
+
+    pipe.fit(X, y)
+
+    pipe['logisticregression'].feature_names_in_
+
+3. Allow for extracting the feature names of estimators in meta-estimators::
+
+    from sklearn import set_config
+    set_config(store_feature_names_in=True)
+
+    est = BaggingClassifier(LogisticRegression())
+    est.fit(X, y)
+
+    # Gets the feature names used by an estimator in the ensemble
+    est.estimators_[0].feature_names_in_
+
+For options 2 and 3, the default value of the configuration flag
+``store_feature_names_in`` is ``False``.
+
+Considerations
+##############
+
+Index alignment
+---------------
+
+Operations are index aligned when working with DataFrames. Internally,
+``scikit-learn`` will ignore the alignment by operating on the ndarray as
+suggested by `TomAugspurger <https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-573859151>`_::
+
+    def transform(self, X, y=None):
+        X, row_labels, input_type = check_array(X)
+        # X is a ndarray
+        result = ...
+        # some hypothetical function that recreates a DataFrame / DataArray,
+        # preserving row labels and attaching new feature names.
+        return construct_result(result, output_feature_names, row_labels, input_type)
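+
+One way the pattern above might look for DataFrame input is sketched below.
+This is an assumption-laden illustration rather than the proposed
+implementation: ``split_input`` is a hypothetical helper used in place of
+the three-value ``check_array`` call shown above, ``construct_result`` is
+given a simplified signature, and ``center_columns`` is a toy transform.
+

```python
import numpy as np
import pandas as pd


def split_input(X):
    """Hypothetical helper: return the ndarray, row labels and column names."""
    if isinstance(X, pd.DataFrame):
        return X.to_numpy(), X.index, list(X.columns)
    return np.asarray(X), None, None


def construct_result(values, feature_names, row_labels):
    """Hypothetical helper: rebuild a DataFrame around the computed ndarray."""
    return pd.DataFrame(values, index=row_labels, columns=feature_names)


def center_columns(X):
    """Toy transform: subtract each column's mean.

    The arithmetic runs on the bare ndarray, so no index alignment takes
    place; the row labels are re-attached afterwards in their input order.
    """
    values, row_labels, names = split_input(X)
    values = values.astype(float)
    centered = values - values.mean(axis=0)
    return construct_result(centered, names, row_labels)


# Row labels are deliberately unsorted to show that they pass through as-is.
X = pd.DataFrame(
    {"num": [1, 2, 3], "other": [10, 20, 30]},
    index=["r3", "r1", "r2"],
)
X_centered = center_columns(X)
```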
+
+Memory copies
+-------------
+
+As noted in `pandas #27211 <https://github.com/pandas-dev/pandas/issues/27211>`_,
+there is no guarantee of a zero-copy round-trip going from numpy to a
+DataFrame and back. In other words, the following may lead to a memory copy in
+a future version of ``pandas``::
+
+    X = np.array(...)
+    X_df = pd.DataFrame(X)
+    X_again = np.asarray(X_df)
+
+This is an issue for ``scikit-learn`` when estimators are placed into a
+pipeline. For example, consider the following pipeline::
+
+    set_config(pandas_in_out=True)
+    pipe = make_pipeline(StandardScaler(), LogisticRegression())
+    pipe.fit(X, y)
+
+Internally, ``StandardScaler.fit_transform`` will operate on an ndarray and
+wrap the ndarray into a DataFrame as a return value. This DataFrame will be
+piped into ``LogisticRegression.fit``, which calls ``check_array`` on the
+DataFrame, which may lead to a memory copy in a future version of
+``pandas``. This leads to unnecessary overhead from piping the data from one
+estimator to another.
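+
+Whether the numpy to DataFrame round trip copies on a given installation
+can be checked with ``np.shares_memory``. The snippet below is only a
+diagnostic sketch with a made-up input array. The outcome depends on the
+installed ``pandas`` version, so no particular result is claimed here.
+

```python
import numpy as np
import pandas as pd

# Small homogeneous float array, the easiest case for a zero-copy round trip.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

# True means the round trip reused the original buffer; False means at
# least one of the two conversions made a copy.
round_trip_is_zero_copy = bool(np.shares_memory(X, X_again))

# The values survive the round trip either way.
values_match = bool(np.array_equal(X, X_again))
```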
+
+Backward compatibility
+######################
+
+The ``pandas_in_out`` global configuration flag will be set to
+``False`` by default to ensure backward compatibility. When this flag is
+``False``, the output of all estimators will be an ndarray.
+
+Alternatives
+############
+
+- :ref:`SLEP012 Custom InputArray Data Structure <slep_012>`: This approach
+  adds another data structure in the Python Data ecosystem. This increases
+  the maintenance responsibilities of the ``scikit-learn`` library.
+
+- Use xarray's Dataset, ``xr.Dataset``: The pandas DataFrame is more widely used
+  in Python's Data ecosystem, which means more libraries are built with pandas
+  in mind. With xarray support, users will need to convert their DataFrame into
+  a ``xr.Dataset``. This conversion process will be lossy when working with
+  pandas categorical dtypes.
+
+In both alternatives, the output data structure will need to be converted into
+a pandas DataFrame to take advantage of the ecosystem built around pandas.
+
+A major advantage of both alternatives is that they do not have the memory
+copy issue. Since ``InputArray`` is designed from the ground up, we can
+guarantee that it does not make memory copies during round-trips from numpy.
+As stated in `xarray #3077 <https://github.com/pydata/xarray/issues/3077>`_,
+``xarray`` guarantees that there are no copies during round-trips from numpy.
+
+References and Footnotes
+------------------------
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+---------
+
+This document has been placed in the public domain. [1]_
diff --git a/under_review.rst b/under_review.rst
@@ -11,3 +11,4 @@ SLEPs under review
     slep007/proposal
     slep012/proposal
     slep013/proposal
+    slep014/proposal