Commit 25edba4

SLEP 014 Pandas in Pandas out (#37)
Co-authored-by: Joel Nothman <[email protected]>
1 parent f2f7418 commit 25edba4

File tree

3 files changed: +265 -5 lines changed

index.rst

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
    :maxdepth: 1
    :caption: Rejected
 
-   rejected
+   slep014/proposal
 
 .. toctree::
    :maxdepth: 1

rejected.rst

Lines changed: 0 additions & 4 deletions
This file was deleted.

slep014/proposal.rst

Lines changed: 264 additions & 0 deletions

.. _slep_014:

==============================
SLEP014: Pandas In, Pandas Out
==============================

:Author: Thomas J Fan
:Status: Rejected
:Type: Standards Track
:Created: 2020-02-18

Abstract
########

This SLEP proposes using pandas DataFrames for propagating feature names
through ``scikit-learn`` transformers.

Motivation
##########

``scikit-learn`` is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for extracting
the feature names requires calling ``get_feature_names`` on the transformer
that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::

    import pandas as pd
    import numpy as np
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'num': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num']

    ct = make_column_transformer(
        (OneHotEncoder(), orig_cat_cols), (StandardScaler(), orig_num_cols))
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    cat_names = (pipe['columntransformer']
                 .named_transformers_['onehotencoder']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]

The ``feature_names`` extracted above correspond to the features directly
passed into ``LogisticRegression``. As demonstrated above, the process of
extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support``
method would need to be used to determine the column names that were
selected, as sketched below.
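
For illustration, a minimal sketch of that extra bookkeeping, extending the
example above with a hypothetical ``SelectKBest`` step (the added step and
``k=3`` are assumptions, not part of the original example)::

    from sklearn.feature_selection import SelectKBest

    pipe = make_pipeline(ct, SelectKBest(k=3),
                         LogisticRegression()).fit(X, y)

    # get_support returns a boolean mask over the transformed columns,
    # which must be applied to the manually assembled feature_names
    mask = pipe['selectkbest'].get_support()
    selected_names = np.asarray(feature_names)[mask]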

Solution
########

The pandas DataFrame has been widely adopted by the Python data ecosystem to
store data with feature names. This SLEP proposes using a DataFrame to
track the feature names as the data is transformed. With this feature, the
API for extracting feature names would be::

    from sklearn import set_config
    set_config(pandas_in_out=True)

    pipe.fit(X, y)
    X_trans = pipe[:-1].transform(X)

    X_trans.columns.tolist()
    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

This SLEP proposes attaching feature names to the output of ``transform``. In
the above example, ``pipe[:-1].transform(X)`` propagates the feature names
through the multiple transformers.

This feature is only available through a soft dependency on pandas.
Furthermore, it is opt-in via the configuration flag ``pandas_in_out``. By
default, ``pandas_in_out`` is set to ``False``, so the output of all
estimators is an ndarray.

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. These features
would be made possible if this SLEP were accepted.

1. Allowing estimators to treat columns differently based on name or dtype.
   For example, the categorical dtype is useful for tree building algorithms
   (see the sketch after this list).

2. Storing feature names inside estimators for model inspection::

      from sklearn import set_config
      set_config(store_feature_names_in=True)

      pipe.fit(X, y)
      pipe['logisticregression'].feature_names_in_

3. Extracting the feature names of estimators in meta-estimators::

      from sklearn import set_config
      set_config(store_feature_names_in=True)

      est = BaggingClassifier(LogisticRegression())
      est.fit(X, y)

      # Gets the feature names used by an estimator in the ensemble
      est.estimators_[0].feature_names_in_

For options 2 and 3, the default value of the configuration flag
``store_feature_names_in`` is ``False``.
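
As a sketch of point 1, a dtype-aware transformer receiving a DataFrame could
branch on column dtypes; the selection below uses only existing pandas API,
while acting on it inside an estimator is the hypothetical part::

    import pandas as pd

    X = pd.DataFrame({'pet': pd.Categorical(['dog', 'snake', 'dog']),
                      'num': [1.0, 2.0, 3.0]})

    # a transformer could route these column groups differently
    cat_cols = X.select_dtypes(include='category').columns.tolist()
    num_cols = X.select_dtypes(include='number').columns.tolist()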

Considerations
##############

Memory copies
-------------

As noted in `pandas #27211 <https://github.com/pandas-dev/pandas/issues/27211>`_,
there is no guarantee of a zero-copy round-trip from numpy to a DataFrame and
back. In other words, the following may lead to a memory copy in a future
version of ``pandas``::

    X = np.array(...)
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)

This is an issue for ``scikit-learn`` when estimators are placed into a
pipeline. For example, consider the following pipeline::

    set_config(pandas_in_out=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X, y)

Internally, ``StandardScaler.fit_transform`` will operate on an ndarray and
wrap the ndarray in a DataFrame as its return value. This will be piped into
``LogisticRegression.fit``, which calls ``check_array`` on the DataFrame; that
call may trigger a memory copy in a future version of ``pandas``. This leads
to unnecessary overhead from piping the data from one estimator to another.
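
Whether the round-trip currently copies can be checked directly. A minimal
sketch; the ``True`` result is a pandas implementation detail, not a
guarantee::

    import numpy as np
    import pandas as pd

    X = np.random.rand(3, 2)
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)

    # True today for a single-dtype frame, but pandas does not promise it
    print(np.shares_memory(X, X_again))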

Sparse matrices
---------------

Traditionally, ``scikit-learn`` prefers to process sparse matrices in the
compressed sparse row (CSR) format. The
`sparse data structure <https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html>`_
in pandas 1.0 only supports converting directly to the coordinate (COO)
format. Although this format was designed to convert quickly to CSR or CSC,
the conversion still needs to allocate additional memory to store the result.
This can be an issue with transformers such as ``OneHotEncoder.transform``,
which has been optimized to construct a CSR matrix directly.
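
A short sketch of the round-trip in pandas 1.0, showing the extra COO hop
needed to recover a CSR matrix::

    import pandas as pd
    from scipy import sparse

    X_csr = sparse.random(5, 3, density=0.5, format='csr')
    X_df = pd.DataFrame.sparse.from_spmatrix(X_csr)

    # pandas only exposes a COO conversion; getting back to CSR
    # requires an extra conversion step (and extra memory)
    X_csr_again = X_df.sparse.to_coo().tocsr()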

Backward compatibility
######################

The global configuration flag ``pandas_in_out`` will be set to ``False`` by
default to ensure backward compatibility. When this flag is ``False``, the
output of all estimators will be an ndarray.

Community Adoption
##################

With the new ``pandas_in_out`` configuration flag, third party libraries may
need to query the configuration flag to be fully compliant with this SLEP.
Specifically, "to be fully compliant" entails the following policy:

1. If ``pandas_in_out=False``, then ``transform`` always returns a numpy
   array.
2. If ``pandas_in_out=True``, then ``transform`` returns a DataFrame if the
   input is a DataFrame.

This policy can either be enforced with ``check_estimator`` or not:

- **Enforce**: This increases the maintenance burden of third party
  libraries. This burden includes checking for the configuration flag,
  generating feature names, and adding pandas as a dependency.

- **Not enforce**: Currently, third party transformers can return a DataFrame
  or a numpy array, and this is mostly compatible with ``scikit-learn``.
  However, users of third party transformers would not be able to access the
  features enabled by this SLEP.
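
For illustration, a minimal sketch of a compliant third party transformer,
assuming the proposed flag existed (``pandas_in_out`` is not part of
scikit-learn today, hence the defensive ``.get``)::

    import numpy as np
    import pandas as pd
    from sklearn import get_config
    from sklearn.base import BaseEstimator, TransformerMixin

    class DoublingTransformer(BaseEstimator, TransformerMixin):
        """Toy transformer that doubles its input."""

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            X_out = np.asarray(X) * 2
            # return a DataFrame only when the flag is set and the
            # input carried feature names
            if (get_config().get('pandas_in_out', False)
                    and isinstance(X, pd.DataFrame)):
                return pd.DataFrame(X_out, columns=X.columns, index=X.index)
            return X_out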

Alternatives
############

This section lists alternative data structures, along with their advantages
and disadvantages compared to a pandas DataFrame.

InputArray
----------

The proposed ``InputArray`` described in
:ref:`SLEP012 Custom InputArray Data Structure <slep_012>` introduces a new
data structure for homogeneous data.

Pros
~~~~

- A thin wrapper around a numpy array or a sparse matrix, with a minimal
  feature set that ``scikit-learn`` can evolve independently.

Cons
~~~~

- Introduces another data structure for data storage in the PyData ecosystem.
- Currently, the design only allows for homogeneous data.
- Increases maintenance responsibilities for ``scikit-learn``.

XArray Dataset
--------------

`xarray's Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_
is a multi-dimensional version of pandas' DataFrame.

Pros
~~~~

- Can be used for heterogeneous data.

Cons
~~~~

- ``scikit-learn`` does not require many of the features Dataset provides.
- Needs to be converted to a DataArray before it can be converted to a numpy
  array.
- The `conversion from a pandas DataFrame to a Dataset <http://xarray.pydata.org/en/stable/pandas.html>`_
  is not lossless. For example, categorical dtypes in a pandas DataFrame
  lose their categorical information when converted to a Dataset.
- xarray does not have as much adoption as pandas, which increases the
  learning curve for using Dataset with ``scikit-learn``.
XArray DataArray
----------------

`xarray's DataArray <http://xarray.pydata.org/en/stable/data-structures.html#dataarray>`_
is a data structure that stores homogeneous data.

Pros
~~~~

- xarray guarantees that there will be no copies during round-trips from
  numpy (`xarray #3077 <https://github.com/pydata/xarray/issues/3077>`_), as
  sketched below.
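
A minimal sketch of that zero-copy round-trip::

    import numpy as np
    import xarray as xr

    X = np.random.rand(3, 2)
    X_da = xr.DataArray(X)
    X_again = np.asarray(X_da)

    # the DataArray wraps the original array, so no copy is made
    print(np.shares_memory(X, X_again))  # True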

Cons
~~~~

- Can only be used for homogeneous data.
- As with xarray's Dataset, DataArray does not have as much adoption as
  pandas, which increases the learning curve for using DataArray with
  ``scikit-learn``.

References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
#########

This document has been placed in the public domain. [1]_
