Commit 3a36b1d

Merge pull request #67 from jnothman/clone
SLEP017: clone override
2 parents 4ae4a15 + 3b9bd20 commit 3a36b1d

1 file changed

+183
-0
lines changed

slep017/proposal.rst

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
==================================================
Clone Override Protocol with ``__sklearn_clone__``
==================================================

:Author: Joel Nothman
:Status: Draft
:Type: Standards Track
:Created: 2022-03-19
:Resolution: (required for Accepted | Rejected | Withdrawn)

Abstract
--------

The ability to clone Scikit-learn estimators -- removing any state due to
previous fitting -- is essential to ensuring estimator configurations are
reusable across multiple instances in cross-validation.
A centralised implementation of :func:`sklearn.base.clone` regards
an estimator's constructor parameters as the state that should be copied.
This proposal allows for an estimator class to implement custom cloning
functionality with a ``__sklearn_clone__`` method, which will default to
the current ``clone`` behaviour.

Detailed description
--------------------

Cloning estimators is one way that Scikit-learn ensures that there is no
data leakage across data splits in cross-validation: by only copying an
estimator's configuration, with no data from previous fitting, the
estimator must fit with a cold start. Cloning an estimator often also
occurs prior to parallelism, ensuring that a minimal version of the
estimator -- without a large stored model -- is serialised and distributed.

Cloning is currently governed by the implementation of
:func:`sklearn.base.clone`, which recursively descends and copies the
parameters of the passed object. For an estimator, it constructs a new
instance of the estimator's class, passing to it cloned versions of the
parameter values returned by its ``get_params``. It then performs some
sanity checks to ensure that the values passed to the constructor are
identical to what is then returned by the clone's ``get_params``.
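
The parameter-based behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the Scikit-learn implementation: ``ToyEstimator`` and ``toy_clone`` are hypothetical names introduced here, and the real ``clone`` handles more cases.

```python
import copy


class ToyEstimator:
    """Hypothetical estimator following the get_params convention."""

    def __init__(self, alpha=1.0, steps=None):
        self.alpha = alpha
        self.steps = steps

    def get_params(self, deep=True):
        return {"alpha": self.alpha, "steps": self.steps}

    def fit(self, X):
        # Fitted state: not a constructor parameter, so it is not cloned.
        self.mean_ = sum(X) / len(X)
        return self


def toy_clone(obj):
    """Simplified stand-in for parameter-based cloning (an assumption)."""
    if isinstance(obj, (list, tuple, set, frozenset)):
        return type(obj)(toy_clone(item) for item in obj)
    if not hasattr(obj, "get_params"):
        return copy.deepcopy(obj)
    # Recursively clone the constructor parameters, then rebuild the object.
    params = {
        name: toy_clone(value)
        for name, value in obj.get_params(deep=False).items()
    }
    new_obj = type(obj)(**params)
    # Sanity check: the constructor must store its arguments unchanged.
    for name, value in new_obj.get_params(deep=False).items():
        if value is not params[name]:
            raise RuntimeError(f"constructor altered parameter {name!r}")
    return new_obj


fitted = ToyEstimator(alpha=0.5, steps=[1, 2]).fit([1.0, 2.0, 3.0])
fresh = toy_clone(fitted)
```

In this sketch ``fresh`` carries the same ``alpha`` and an independent copy of ``steps``, but has no ``mean_`` attribute, so it must be fitted from a cold start.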

The current equivalence between constructor parameters and what is cloned
means that whenever an estimator or library developer deems it necessary
to have further configuration of an estimator reproduced in a clone,
they must include this configuration as a constructor parameter.

Cases where this need has been raised in Scikit-learn development include:

* ensuring metadata requests are cloned with an estimator
* ensuring parameter spaces are cloned with an estimator
* building a simple wrapper that can "freeze" a pre-fitted estimator
* allowing existing options for using prefitted models in ensembles
  to work under cloning

The current design also limits the ability for an estimator developer to
define an exception to the sanity checks (see :issue:`15371`).

This proposal empowers estimator developers to extend the base implementation
of ``clone`` by providing a ``__sklearn_clone__`` method, which ``clone`` will
delegate to when available. The default implementation will match current
``clone`` behaviour. It will be provided through
``BaseEstimator.__sklearn_clone__`` but also
provided for estimators not inheriting from :obj:`~sklearn.base.BaseEstimator`.

This shifts the paradigm from ``clone`` being a fixed operation that
Scikit-learn must be able to perform on an estimator to ``clone`` being a
behaviour that each Scikit-learn compatible estimator may implement.

Developers who define ``__sklearn_clone__`` are responsible for
maintaining the fundamental properties of cloning. Ordinarily, they
can achieve this through use of ``super().__sklearn_clone__``. Core behaviours,
such as constructor parameters being preserved through ``clone`` operations,
can be ensured through estimator checks.
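
As an illustration of extending the default through ``super().__sklearn_clone__``, the sketch below carries a parameter search space, set through a method rather than the constructor, across cloning. All names here (``ToyBase``, ``SearchableScaler``, ``set_search_space``, ``_search_space``) are hypothetical stand-ins, not Scikit-learn API.

```python
import copy


def clone_parametrized(estimator):
    """Toy default: rebuild the estimator from its constructor parameters."""
    params = {k: copy.deepcopy(v) for k, v in estimator.get_params().items()}
    return type(estimator)(**params)


class ToyBase:
    """Hypothetical stand-in for a base class providing the default clone."""

    _param_names = ()

    def get_params(self, deep=True):
        return {name: getattr(self, name) for name in self._param_names}

    def __sklearn_clone__(self):
        return clone_parametrized(self)


class SearchableScaler(ToyBase):
    _param_names = ("factor",)

    def __init__(self, factor=1.0):
        self.factor = factor
        self._search_space = {}

    def set_search_space(self, **space):
        # Configuration supplied through a method, not through __init__.
        self._search_space = dict(space)
        return self

    def __sklearn_clone__(self):
        # Keep the default behaviour, then copy the extra configuration.
        new = super().__sklearn_clone__()
        new._search_space = copy.deepcopy(self._search_space)
        return new


scaler = SearchableScaler(factor=2.0).set_search_space(factor=[1.0, 2.0])
scaler.scale_ = 3.0  # pretend fitted state
copy_of_scaler = scaler.__sklearn_clone__()
```

Because the override starts from the default clone, constructor parameters are still preserved and fitted state (``scale_`` here) is still discarded.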

Implementation
--------------

Implementing this SLEP will require:

1. Factoring out ``clone_parametrized`` from ``clone``, being the portion of
   its implementation that handles objects with ``get_params``.
2. Modifying ``clone`` to call ``__sklearn_clone__`` when available on an
   object with ``get_params``, or ``clone_parametrized`` when not available.
3. Defining ``BaseEstimator.__sklearn_clone__`` to call ``clone_parametrized``.
4. Documenting the above.
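
Steps 1 to 3 might be sketched as follows. This is a simplified illustration under assumptions, not the eventual Scikit-learn code: ``ToyBaseEstimator``, ``DuckEstimator`` and ``CountingEstimator`` are hypothetical, and the toy ``clone`` omits the sanity checks and other cases handled by the real function.

```python
import copy


def clone_parametrized(estimator):
    """Step 1: the part of cloning that handles objects with get_params."""
    params = {
        name: clone(value)
        for name, value in estimator.get_params(deep=False).items()
    }
    return type(estimator)(**params)


def clone(obj):
    """Step 2: delegate to __sklearn_clone__ when the object provides it."""
    if isinstance(obj, (list, tuple)):
        return type(obj)(clone(item) for item in obj)
    if not hasattr(obj, "get_params"):
        return copy.deepcopy(obj)  # plain values are simply copied
    if hasattr(obj, "__sklearn_clone__"):
        return obj.__sklearn_clone__()
    return clone_parametrized(obj)


class ToyBaseEstimator:
    """Step 3: the base class default calls clone_parametrized."""

    def __sklearn_clone__(self):
        return clone_parametrized(self)


class DuckEstimator:
    """Has get_params but does not inherit from the toy base class."""

    def __init__(self, depth=3):
        self.depth = depth

    def get_params(self, deep=True):
        return {"depth": self.depth}


class CountingEstimator(ToyBaseEstimator):
    """Overrides the hook only to record that clone delegated to it."""

    clone_calls = 0

    def __init__(self, rate=0.1):
        self.rate = rate

    def get_params(self, deep=True):
        return {"rate": self.rate}

    def __sklearn_clone__(self):
        type(self).clone_calls += 1
        return super().__sklearn_clone__()


duck_copy = clone(DuckEstimator(depth=5))
counted_copy = clone(CountingEstimator(rate=0.5))
```

Both paths yield an unfitted copy with the same parameters; only ``CountingEstimator`` goes through its own ``__sklearn_clone__``.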

Backward compatibility
----------------------

No breakage is expected: the default ``__sklearn_clone__`` matches the
current ``clone`` behaviour.

Alternatives
------------

Instead of allowing estimators to override the entire clone process,
the core clone process could be obligatory, with the ability for an
estimator class to customise additional steps.

One API would allow for an estimator class to provide
``__sklearn__post_clone__(self, source)`` for operations in addition
to the core cloning, or ``__sklearn__clone_attrs__`` could be defined
on a class to specify additional attributes that should be copied for
that class and its descendants.
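
One possible reading of this alternative is sketched below. The hook names come from the text above, but their exact semantics are assumptions made for illustration: here ``__sklearn__post_clone__`` is called on the new object with the original as ``source``, and ``__sklearn__clone_attrs__`` is collected along the class hierarchy. ``clone_with_hooks`` and ``RoutedEstimator`` are hypothetical names.

```python
import copy


def clone_with_hooks(estimator):
    """Obligatory core clone followed by optional per-class extras."""
    # Core step, which an estimator could not override in this alternative.
    params = {k: copy.deepcopy(v) for k, v in estimator.get_params().items()}
    new = type(estimator)(**params)

    # Copy extra attributes declared by the class or any of its ancestors.
    for klass in type(estimator).__mro__:
        for attr in vars(klass).get("__sklearn__clone_attrs__", ()):
            if hasattr(estimator, attr):
                setattr(new, attr, copy.deepcopy(getattr(estimator, attr)))

    # Let the class run additional operations after the core cloning.
    post_clone = getattr(new, "__sklearn__post_clone__", None)
    if post_clone is not None:
        post_clone(estimator)
    return new


class RoutedEstimator:
    """Hypothetical estimator with configuration kept outside __init__."""

    __sklearn__clone_attrs__ = ("_requests",)

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self._requests = {}

    def get_params(self, deep=True):
        return {"alpha": self.alpha}

    def __sklearn__post_clone__(self, source):
        # Example extra step: remember which class the clone came from.
        self._cloned_from = type(source).__name__


original = RoutedEstimator(alpha=2.0)
original._requests["sample_weight"] = True
original.coef_ = [0.1]  # pretend fitted state
duplicate = clone_with_hooks(original)
```

In this design the core parameter cloning always runs, so an estimator can only add to it and cannot replace it.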

Alternative solutions include continuing to require developers to provide
sometimes-awkward constructor parameters for any clonable material, and
leaving Scikit-learn core developers with the exceptional ability to extend
the ``clone`` function as needed.

Discussion
----------

:issue:`5080` raised the proposal of polymorphism for ``clone`` as the right
way to provide an object-oriented API, and as a way to enable the
implementation of wrappers around estimators for model memoisation and
freezing.
The naming of ``__sklearn_clone__`` was further proposed and discussed in
:issue:`21838`.

Making cloning more flexible either enables or simplifies the design and
implementation of several features, including wrapping pre-fitted estimators,
and providing estimator configuration through methods without adding new
constructor arguments (e.g. through mixins).
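
For instance, a wrapper that "freezes" a pre-fitted estimator could be sketched as below. This is an illustration under assumptions rather than a proposed implementation: ``FrozenWrapper``, ``MeanPredictor`` and ``toy_clone`` are hypothetical, and returning ``self`` from ``__sklearn_clone__`` is only one possible design.

```python
import copy


def toy_clone(obj):
    """Toy clone: use __sklearn_clone__ if present, else rebuild from params."""
    if hasattr(obj, "__sklearn_clone__"):
        return obj.__sklearn_clone__()
    params = {k: copy.deepcopy(v) for k, v in obj.get_params().items()}
    return type(obj)(**params)


class MeanPredictor:
    """Hypothetical estimator that predicts the mean of its training data."""

    def get_params(self, deep=True):
        return {}

    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class FrozenWrapper:
    """Keeps a pre-fitted estimator intact when cloned."""

    def __init__(self, estimator):
        self.estimator = estimator

    def get_params(self, deep=True):
        return {"estimator": self.estimator}

    def __sklearn_clone__(self):
        # Cloning must not discard the wrapped model's fitted state.
        return self

    def fit(self, X, y=None):
        return self  # fitting is a no-op: the model is already fitted

    def predict(self, X):
        return self.estimator.predict(X)


model = MeanPredictor().fit([1.0, 3.0])
frozen_copy = toy_clone(FrozenWrapper(model))
plain_copy = toy_clone(model)
```

Here ``frozen_copy`` can still predict after cloning, whereas ``plain_copy`` has lost ``mean_`` and would need refitting.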

Related issues include:

- :issue:`6451`, :issue:`8710`, :issue:`19848`: CalibratedClassifierCV with
  prefitted base estimator
- :issue:`7382`: VotingClassifier with prefitted base estimator
- :issue:`16748`: Stacking estimator with prefitted base estimator
- :issue:`8370`, :issue:`9464`: generic estimator wrapper for model freezing
- :issue:`5082`: configuring parameter search spaces
- :issue:`16079`: configuring the routing of sample-aligned metadata
- :issue:`16185`: configuring selected parameters to not be deep-copied

Under the incumbent monolithic clone implementation, designing such additional
per-estimator configuration requires resolving whether to:

- adjust the monolithic ``clone`` to account for the new configuration
  attributes (an option only available to the Scikit-learn core developer
  team);
- add constructor parameters for each new configuration option; or
- not clone estimator configurations, and accept that some use cases may not
  be possible.

A more flexible cloning operation provides a simpler pattern for adding new
configuration options through mixins.
Note that adding new capabilities to *all* estimators remains
possible only through modifying the default ``__sklearn_clone__``
implementation.

There are, however, notable concerns in relation to this proposal.
Introducing a generic clone handler on each estimator gives a developer
complete freedom to disregard existing conventions regarding parameter
setting and construction in Scikit-learn.
In this vein, objections to :issue:`5080` cited the notion that "``clone``
has a simple contract," and that "extension to it would open the door to
violations of that contract" [2]_.

While these objections identify considerable risks, developers of many
public libraries regularly work around Scikit-learn conventions and
contracts, in part because they are backed into a "design corner",
wherein it is not always obvious how to build an acceptable UX while adhering
to established conventions; in this case, that everything to be cloned must
go into ``__init__``. This proposal paves a road for developers to
address functionality and UX limitations of the core library, rather than
inviting custom workarounds.

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.
.. _Open Publication License: https://www.opencontent.org/openpub/

.. [2] `Gael Varoquaux's comments on #5080 in 2015
   <https://github.com/scikit-learn/scikit-learn/issues/5080#issuecomment-127128808>`__


Copyright
---------

This document has been placed in the public domain. [1]_
