ENH Register models trained outside of Civis Platform (civisanalytics#242)

Stephen Hoover · web-flow · commit 242d2e241f13 · 2018-03-26T11:23:49.000-05:00
If you train a scikit-learn-compatible estimator outside of Civis Platform, you can use the new `ModelPipeline.register_pretrained_model` classmethod to upload it to Civis Platform and prepare it for scoring with CivisML. A new Custom Script introspects the metadata that CivisML needs and makes itself look enough like a CivisML training job that it can be used as input to a scoring job.
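
As a rough illustration of what the client sends to the registration script, the sketch below restates the argument-building logic from this diff as a standalone function. `build_registration_args` is a hypothetical helper written for this note, not part of the civis API. The file ID `1234` and the column names are made-up values, and the sketch checks for `str` where the diff uses `six.string_types`, so it assumes Python 3.

```python
def build_registration_args(model_file_id, dependent_variable=None,
                            features=None, dependencies=None,
                            skip_model_check=False, verbose=False):
    """Hypothetical helper mirroring how this commit assembles the
    registration script's arguments; not part of the civis API."""
    def as_list(value):
        # A bare string is treated as a one-element list.
        return [value] if isinstance(value, str) else value

    args = {'MODEL_FILE_ID': str(model_file_id),
            'SKIP_MODEL_CHECK': skip_model_check,
            'DEBUG': verbose}
    optional = {'TARGET_COLUMN': dependent_variable,
                'FEATURE_COLUMNS': features,
                'DEPENDENCIES': dependencies}
    for key, value in optional.items():
        if value is not None:
            # Multiple names are passed as one space-separated string.
            args[key] = ' '.join(as_list(value))
    return args


# Illustrative values only: 1234 is not a real Civis File ID.
args = build_registration_args(1234, dependent_variable='concrete',
                               features=['cement', 'water', 'age'])
```

Because names are joined with spaces, this scheme appears to assume that column and package names contain no spaces themselves.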
diff --git a/civis/ml/_model.py b/civis/ml/_model.py
@@ -42,8 +42,11 @@
                    9112: 9113,    # v1.1
                    8387: 9113,    # v1.0
                    7020: 7021,    # v0.5
+                   11028: 10616,  # v2.2 registration CHANGE ME
                    }
 _CIVISML_TEMPLATE = None  # CivisML training template to use
+REGISTRATION_TEMPLATES = [11028,  # v2.2 CHANGE ME
+                          ]
 
 
 class ModelError(RuntimeError):
@@ -631,10 +634,10 @@ class ModelPipeline:
         See :func:`~civis.resources._resources.Scripts.post_custom` for
         further documentation about email and URL notification.
     dependencies : array, optional
-        List of packages to install from PyPI or git repository (i.e., Github
+        List of packages to install from PyPI or git repository (e.g., GitHub
         or Bitbucket). If a private repo is specified, please include a
         ``git_token_name`` argument as well (see below). Make sure to pin
-        dependencies to a specific version, since dependecies will be
+        dependencies to a specific version, since dependencies will be
         reinstalled during every training and predict job.
     git_token_name : str, optional
         Name of remote git API token stored in Civis Platform as the password
@@ -713,6 +716,8 @@ def _get_template_ids(self, client):
         global _CIVISML_TEMPLATE
         if _CIVISML_TEMPLATE is None:
             for t_id in sorted(_PRED_TEMPLATES)[::-1]:
+                if t_id in REGISTRATION_TEMPLATES:
+                    continue
                 try:
                     # Check that we can access the template
                     client.templates.get_scripts(id=t_id)
@@ -783,6 +788,147 @@ def __setstate__(self, state):
         template_ids = self._get_template_ids(self._client)
         self.train_template_id, self.predict_template_id = template_ids
 
+    @classmethod
+    def register_pretrained_model(cls, model, dependent_variable=None,
+                                  features=None, primary_key=None,
+                                  model_name=None, dependencies=None,
+                                  git_token_name=None,
+                                  skip_model_check=False, verbose=False,
+                                  client=None):
+        """Use a fitted scikit-learn model with CivisML scoring
+
+        Use this function to set up your own fitted scikit-learn-compatible
+        Estimator object for scoring with CivisML. This function will
+        upload your model to Civis Platform and store enough metadata
+        about it that you can subsequently use it with a CivisML scoring job.
+
+        The only required input is the model itself, but you are strongly
+        recommended to also provide a list of feature names. Without a list
+        of feature names, CivisML will have to assume that your scoring
+        table contains only the features needed for scoring (perhaps also
+        with a primary key column), all in the correct order.
+
+        Parameters
+        ----------
+        model : sklearn.base.BaseEstimator or int
+            The model object. This must be a fitted scikit-learn compatible
+            Estimator object, or else the integer Civis File ID of a
+            pickle or joblib-serialized file which stores such an object.
+        dependent_variable : string or List[str], optional
+            The dependent variable of the training dataset.
+            For a multi-target problem, this should be a list of
+            column names of dependent variables.
+        features : string or List[str], optional
+            A list of column names of features which were used for training.
+            These will be used to ensure that tables input for prediction
+            have the correct features in the correct order.
+        primary_key : string, optional
+            The unique ID (primary key) of the scoring dataset
+        model_name : string, optional
+            The name of the Platform registration job. It will have
+            " Predict" added to become the Script title for predictions.
+        dependencies : array, optional
+            List of packages to install from PyPI or git repository (e.g.,
+            GitHub or Bitbucket). If a private repo is specified, please
+            include a ``git_token_name`` argument as well (see below).
+            Make sure to pin dependencies to a specific version, since
+            dependencies will be reinstalled during every predict job.
+        git_token_name : str, optional
+            Name of remote git API token stored in Civis Platform as
+            the password field in a custom platform credential.
+            Used only when installing private git repositories.
+        skip_model_check : bool, optional
+            If you're sure that your model will work with CivisML, but it
+            will fail the comprehensive verification, set this to True.
+        verbose : bool, optional
+            If True, supply debug outputs in Platform logs and make
+            prediction child jobs visible.
+        client : :class:`~civis.APIClient`, optional
+            If not provided, an :class:`~civis.APIClient` object will be
+            created from the :envvar:`CIVIS_API_KEY`.
+
+        Returns
+        -------
+        :class:`~civis.ml.ModelPipeline`
+
+        Examples
+        --------
+        This example assumes that you already have training data
+        ``X`` and ``y``, where ``X`` is a :class:`~pandas.DataFrame`.
+        >>> from civis.ml import ModelPipeline
+        >>> from sklearn.linear_model import Lasso
+        >>> est = Lasso().fit(X, y)
+        >>> model = ModelPipeline.register_pretrained_model(
+        ...     est, 'concrete', features=X.columns)
+        >>> model.predict(table_name='my.table', database_name='my-db')
+        """
+        client = client or APIClient()
+
+        if isinstance(dependent_variable, six.string_types):
+            dependent_variable = [dependent_variable]
+        if isinstance(features, six.string_types):
+            features = [features]
+        if isinstance(dependencies, six.string_types):
+            dependencies = [dependencies]
+        if not model_name:
+            model_name = ("Pretrained {} model for "
+                          "CivisML".format(model.__class__.__name__))
+            model_name = model_name[:255]  # Max size is 255 characters
+
+        if isinstance(model, (int, float, six.string_types)):
+            model_file_id = int(model)
+        else:
+            tempdir = tempfile.mkdtemp()
+            try:
+                fout = os.path.join(tempdir, 'model_for_civisml.pkl')
+                joblib.dump(model, fout, compress=3)
+                with open(fout, 'rb') as _fout:
+                    # NB: Using the name "estimator.pkl" means that
+                    # CivisML doesn't need to copy this input to a file
+                    # with a different name.
+                    model_file_id = cio.file_to_civis(_fout, 'estimator.pkl',
+                                                      client=client)
+            finally:
+                shutil.rmtree(tempdir)
+
+        args = {'MODEL_FILE_ID': str(model_file_id),
+                'SKIP_MODEL_CHECK': skip_model_check,
+                'DEBUG': verbose}
+        if dependent_variable is not None:
+            args['TARGET_COLUMN'] = ' '.join(dependent_variable)
+        if features is not None:
+            args['FEATURE_COLUMNS'] = ' '.join(features)
+        if dependencies is not None:
+            args['DEPENDENCIES'] = ' '.join(dependencies)
+        if git_token_name:
+            creds = find(client.credentials.list(),
+                         name=git_token_name,
+                         type='Custom')
+                if len(creds) != 1:
+                raise ValueError("Unique credential with name '{}' for "
+                                 "remote git hosting service not found!"
+                                 .format(git_token_name))
+            args['GIT_CRED'] = creds[0].id
+
+        template_id = max(REGISTRATION_TEMPLATES)
+        container = client.scripts.post_custom(
+            from_template_id=template_id,
+            name=model_name,
+            arguments=args)
+        log.info('Created custom script %s.', container.id)
+
+        run = client.scripts.post_custom_runs(container.id)
+        log.debug('Started job %s, run %s.', container.id, run.id)
+
+        fut = ModelFuture(container.id, run.id, client=client,
+                          poll_on_creation=False)
+        fut.result()
+        log.info('Model registration complete.')
+
+        mp = ModelPipeline.from_existing(fut.job_id, fut.run_id, client)
+        mp.primary_key = primary_key
+        return mp
+
     @classmethod
     def from_existing(cls, train_job_id, train_run_id='latest', client=None):
         """Create a :class:`ModelPipeline` object from existing model IDs
@@ -887,7 +1033,8 @@ def from_existing(cls, train_job_id, train_run_id='latest', client=None):
                           'prediction code. Prediction will either fail '
                           'immediately or succeed.'
                           % (train_job_id, __version__), RuntimeWarning)
-            p_id = max(_PRED_TEMPLATES.values())
+            p_id = max([v for k, v in _PRED_TEMPLATES.items()
+                        if k not in REGISTRATION_TEMPLATES])
         klass.predict_template_id = p_id
 
         return klass
diff --git a/civis/ml/tests/test_model.py b/civis/ml/tests/test_model.py
@@ -39,6 +39,10 @@
 from civis.ml import _model
 
 
+LATEST_TRAIN_TEMPLATE = 10582
+LATEST_PRED_TEMPLATE = 10583
+
+
 def setup_client_mock(script_id=-10, run_id=100, state='succeeded',
                       run_outputs=None):
     """Return a Mock set up for use in testing container scripts
@@ -682,7 +686,7 @@ def test_modelpipeline_init_newest():
     mp = _model.ModelPipeline(LogisticRegression(), 'test', etl=etl,
                               client=mock_client)
     assert mp.etl == etl
-    assert mp.train_template_id == max(_model._PRED_TEMPLATES)
+    assert mp.train_template_id == LATEST_TRAIN_TEMPLATE
     # clean up
     _model._CIVISML_TEMPLATE = None
 
@@ -787,16 +791,15 @@ def test_modelpipeline_classmethod_constructor_defaults(
 def test_modelpipeline_classmethod_constructor_future_train_version():
     # Test handling attempts to restore a model created with a newer
     # version of CivisML.
-    current_max_template = max(_model._PRED_TEMPLATES)
-    cont = container_response_stub(current_max_template + 1000)
+    cont = container_response_stub(LATEST_TRAIN_TEMPLATE + 1000)
     mock_client = mock.Mock()
     mock_client.scripts.get_containers.return_value = cont
     mock_client.credentials.get.return_value = Response({'name': 'Token'})
 
     # test everything is working fine
     with pytest.warns(RuntimeWarning):
         mp = _model.ModelPipeline.from_existing(1, 1, client=mock_client)
-    exp_p_id = _model._PRED_TEMPLATES[current_max_template]
+    exp_p_id = _model._PRED_TEMPLATES[LATEST_TRAIN_TEMPLATE]
     assert mp.predict_template_id == exp_p_id
 
 
@@ -892,7 +895,7 @@ def test_modelpipeline_train_df(mock_ccr, mock_stash, mp_setup):
     train_data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
     assert 'res' == mp.train(train_data)
     mock_stash.assert_called_once_with(
-        train_data, max(_model._PRED_TEMPLATES.keys()), client=mock.ANY)
+        train_data, LATEST_TRAIN_TEMPLATE, client=mock.ANY)
     assert mp.train_result_ == 'res'
 
 
diff --git a/docs/source/ml.rst b/docs/source/ml.rst
@@ -68,6 +68,9 @@ or by providing your own scikit-learn
 Note that whichever option you chose, CivisML will pre-process your
 data using either its default ETL, or ETL that you provide (see :ref:`custom-etl`).
 
+If you have already trained a scikit-learn model outside of Civis Platform,
+you can register it with Civis Platform as a CivisML model and then use it
+for scoring with CivisML. See :ref:`model-registration` for details.
 
 Pre-Defined Models
 ------------------
@@ -359,6 +362,32 @@ for solving a problem. For example:
   train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
   aucs = [tr.metrics['roc_auc'] for tr in train]  # Code blocks here
 
+..  _model-registration:
+
+Registering Models Trained Outside of Civis
+===========================================
+
+Instead of using CivisML to train your model, you may train any
+scikit-learn-compatible model outside of Civis Platform and use
+:meth:`civis.ml.ModelPipeline.register_pretrained_model` to register it
+as a CivisML model in Civis Platform. This will let you use Civis Platform
+to make predictions using your model, either to take advantage of distributed
+predictions on large datasets, or to create predictions as part of
+a workflow or service in Civis Platform.
+
+When registering a model trained outside of Civis Platform, you are
+strongly advised to provide an ordered list of feature names used
+for training. This will allow CivisML to ensure that tables of data
+input for predictions have the correct features in the correct order.
+If your model has more than one output, you should also provide a list
+of output names (the ``dependent_variable`` parameter) so that CivisML knows
+how many outputs to expect and how to name them in the table of predictions.
+
+If your model uses dependencies which aren't part of the default CivisML
+execution environment, you must provide them to the ``dependencies``
+parameter of the :meth:`~civis.ml.ModelPipeline.register_pretrained_model`
+function, just as with the :class:`~civis.ml.ModelPipeline` constructor.
+
 
 Object reference
 ================