From 87df158f0d6192d48366263e136e7a3eec2b068e Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Wed, 19 Jun 2019 12:56:50 -0700 Subject: [PATCH 01/14] doc: refactor overview section per improvement plan --- doc/overview.rst | 530 +++++++++++++++++++++++++++-------------------- 1 file changed, 304 insertions(+), 226 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index 5cd440d762..25b441d014 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -1,5 +1,6 @@ +############################## Using the SageMaker Python SDK -============================== +############################## SageMaker Python SDK provides several high-level abstractions for working with Amazon SageMaker. These are: @@ -8,13 +9,82 @@ SageMaker Python SDK provides several high-level abstractions for working with A - **Predictors**: Provide real-time inference and transformation using Python data-types against a SageMaker endpoint. - **Session**: Provides a collection of methods for working with SageMaker resources. -``Estimator`` and ``Model`` implementations for MXNet, TensorFlow, Chainer, PyTorch, and Amazon ML algorithms are included. +``Estimator`` and ``Model`` implementations for MXNet, TensorFlow, Chainer, PyTorch, scikit-Learn, Amazon SageMaker built-in algorithms, Reinforcement Learning, are included. There's also an ``Estimator`` that runs SageMaker compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK. .. contents:: + :depth: 2 + +******************************************* +Train a Model with the SageMaker Python SDK +******************************************* + +To train a model by using the SageMaker Pthon SDK, you: + +1. Prepare a training script +2. Create an estimator +3. Call the `fit` method of the estimator + +After you train a model, you can save it, and then serve the model as an endpoint to get real-time inferences or get inferences for an entire dataset by using batch transform. + +Prepare a Training script +========================= + +Your training script must be a Python 2.7 or 3.5 compatible source file. + +The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: + +* ``SM_MODEL_DIR``: A string that represents the path where the training job writes the model artifacts to. + After training, artifacts in this directory are uploaded to S3 for model hosting. +* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. +* ``SM_CHANNEL_XXXX``: A string that represents the path to the directory that contains the input data for the specified channel. + For example, if you specify two input channels in the MXNet estimator's ``fit`` call, named 'train' and 'test', the environment variables ``SM_CHANNEL_TRAIN`` and ``SM_CHANNEL_TEST`` are set. +* ``SM_HPS``: A json dump of the hyperparameters preserving json types (boolean, integer, etc.) + +For the exhaustive list of available environment variables, see the `SageMaker Containers documentation `__. + +A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to ``model_dir`` so that it can be deployed for inference later. +Hyperparameters are passed to your script as arguments and can be retrieved with an ``argparse.ArgumentParser`` instance. +For example, a training script might start with the following: + +.. 
code:: python + + import argparse + import os + import json + + if __name__ =='__main__': + + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. + parser.add_argument('--epochs', type=int, default=10) + parser.add_argument('--batch-size', type=int, default=100) + parser.add_argument('--learning-rate', type=float, default=0.1) + + # an alternative way to load hyperparameters via SM_HPS environment variable. + parser.add_argument('--sm-hps', type=json.loads, default=os.environ['SM_HPS']) + + # input data and model directories + parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) + parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN']) + parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST']) + + args, _ = parser.parse_known_args() + + # ... load from args.train and args.test, train a model, write model to args.model_dir. + +Because the SageMaker imports your training script, you should put your training code in a main guard (``if __name__=='__main__':``) if you are using the same script to host your model, +so that SageMaker does not inadvertently run your training code at the wrong point in execution. + +Note that SageMaker doesn't support argparse actions. +If you want to use, for example, boolean hyperparameters, you need to specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this hyperparameter when you create your estimator. + +For more on training environment variables, please visit `SageMaker Containers `_. + Using Estimators ----------------- +================ Here is an end to end example of how to use a SageMaker Estimator: @@ -84,8 +154,37 @@ For more `information `__ for more details about built-in metrics of each Amazon SageMaker algorithm. -Local Mode -~~~~~~~~~~ - -The SageMaker Python SDK supports local mode, which allows you to create estimators and deploy them to your local environment. -This is a great way to test your deep learning scripts before running them in SageMaker's managed training or hosting environments. -Local Mode is supported for frameworks images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and images you supply yourself. - -We can take the example in `Using Estimators <#using-estimators>`__ , and use either ``local`` or ``local_gpu`` as the instance type. - -.. code:: python - - from sagemaker.mxnet import MXNet - - # Configure an MXNet Estimator (no training happens yet) - mxnet_estimator = MXNet('train.py', - role='SageMakerRole', - train_instance_type='local', - train_instance_count=1, - framework_version='1.2.1') - - # In Local Mode, fit will pull the MXNet container Docker image and run it locally - mxnet_estimator.fit('s3://my_bucket/my_training_data/') - - # Alternatively, you can train using data in your local file system. This is only supported in Local mode. 
- mxnet_estimator.fit('file:///tmp/my_training_data') - - # Deploys the model that was generated by fit() to local endpoint in a container - mxnet_predictor = mxnet_estimator.deploy(initial_instance_count=1, instance_type='local') - - # Serializes data and makes a prediction request to the local endpoint - response = mxnet_predictor.predict(data) - - # Tears down the endpoint container and deletes the corresponding endpoint configuration - mxnet_predictor.delete_endpoint() - - # Deletes the model - mxnet_predictor.delete_model() - - -If you have an existing model and want to deploy it locally, don't specify a sagemaker_session argument to the ``MXNetModel`` constructor. -The correct session is generated when you call ``model.deploy()``. - -Here is an end-to-end example: - -.. code:: python - - import numpy - from sagemaker.mxnet import MXNetModel - - model_location = 's3://mybucket/my_model.tar.gz' - code_location = 's3://mybucket/sourcedir.tar.gz' - s3_model = MXNetModel(model_data=model_location, role='SageMakerRole', - entry_point='mnist.py', source_dir=code_location) - - predictor = s3_model.deploy(initial_instance_count=1, instance_type='local') - data = numpy.zeros(shape=(1, 1, 28, 28)) - predictor.predict(data) - - # Tear down the endpoint container and delete the corresponding endpoint configuration - predictor.delete_endpoint() - - # Deletes the model - predictor.delete_model() - - -If you don't want to deploy your model locally, you can also choose to perform a Local Batch Transform Job. This is -useful if you want to test your container before creating a Sagemaker Batch Transform Job. Note that the performance -will not match Batch Transform Jobs hosted on SageMaker but it is still a useful tool to ensure you have everything -right or if you are not dealing with huge amounts of data. - -Here is an end-to-end example: - -.. code:: python - - from sagemaker.mxnet import MXNet - - mxnet_estimator = MXNet('train.py', - role='SageMakerRole', - train_instance_type='local', - train_instance_count=1, - framework_version='1.2.1') - - mxnet_estimator.fit('file:///tmp/my_training_data') - transformer = mxnet_estimator.transformer(1, 'local', assemble_with='Line', max_payload=1) - transformer.transform('s3://my/transform/data, content_type='text/csv', split_type='Line') - transformer.wait() +BYO Docker Containers with SageMaker Estimators +----------------------------------------------- - # Deletes the SageMaker model - transformer.delete_model() +To use a Docker image that you created and use the SageMaker SDK for training, the easiest way is to use the dedicated ``Estimator`` class. +You can create an instance of the ``Estimator`` class with desired Docker image and use it as described in previous sections. +Please refer to the full example in the examples repo: -For detailed examples of running Docker in local mode, see: +:: -- `TensorFlow local mode example notebook `__. -- `MXNet local mode example notebook `__. + git clone https://github.com/awslabs/amazon-sagemaker-examples.git -A few important notes: -- Only one local mode endpoint can be running at a time. -- If you are using S3 data as input, it is pulled from S3 to your local environment. Ensure you have sufficient space to store the data locally. -- If you run into problems it often due to different Docker containers conflicting. Killing these containers and re-running often solves your problems. -- Local Mode requires Docker Compose and `nvidia-docker2 `__ for ``local_gpu``. 
-- Distributed training is not yet supported for ``local_gpu``. +The example notebook is located here: +``advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb`` Incremental Training -~~~~~~~~~~~~~~~~~~~~ +==================== Incremental training allows you to bring a pre-trained model into a SageMaker training job and use it as a starting point for a new model. There are several situations where you might want to do this: @@ -268,38 +279,45 @@ Currently, the following algorithms support incremental training: - Object Detection - Semantic Segmentation -Using SageMaker AlgorithmEstimators ------------------------------------ +************************************************ +Using Models Trained Outside of Amazon SageMaker +************************************************ -With the SageMaker Algorithm entities, you can create training jobs with just an ``algorithm_arn`` instead of -a training image. There is a dedicated ``AlgorithmEstimator`` class that accepts ``algorithm_arn`` as a -parameter, the rest of the arguments are similar to the other Estimator classes. This class also allows you to -consume algorithms that you have subscribed to in the AWS Marketplace. The AlgorithmEstimator performs -client-side validation on your inputs based on the algorithm's properties. +You can use models that you train outside of Amazon SageMaker, and model packages that you create or subscribe to in the AWS Marketplace to get inferences. -Here is an example: +BYO Model +========= + +You can create an endpoint from an existing model that you trained outside of Amazon Sagemaker. +That is, you can bring your own model: + +First, package the files for the trained model into a ``.tar.gz`` file, and upload the archive to S3. + +Next, create a ``Model`` object that corresponds to the framework that you are using: `MXNetModel `__ or `TensorFlowModel `__. + +Example code using ``MXNetModel``: .. code:: python - import sagemaker + from sagemaker.mxnet.model import MXNetModel - algo = sagemaker.AlgorithmEstimator( - algorithm_arn='arn:aws:sagemaker:us-west-2:1234567:algorithm/some-algorithm', - role='SageMakerRole', - train_instance_count=1, - train_instance_type='ml.c4.xlarge') + sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz', + role='arn:aws:iam::accid:sagemaker-role', + entry_point='entry_point.py') - train_input = algo.sagemaker_session.upload_data(path='/path/to/your/data') +After that, invoke the ``deploy()`` method on the ``Model``: - algo.fit({'training': train_input}) - algo.deploy(1, 'ml.m4.xlarge') +.. code:: python - # When you are done using your endpoint - algo.delete_endpoint() + predictor = sagemaker_model.deploy(initial_instance_count=1, + instance_type='ml.m4.xlarge') +This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is called. You can now get inferences just like with any other model deployed on Amazon SageMaker. + +A full example is available in the `Amazon SageMaker examples repository `__. Consuming SageMaker Model Packages ----------------------------------- +================================== SageMaker Model Packages are a way to specify and share information for how to create SageMaker Models. 
With a SageMaker Model Package that you have created or subscribed to in the AWS Marketplace, @@ -321,26 +339,9 @@ Here is an example: # When you are done using your endpoint model.sagemaker_session.delete_endpoint('my-endpoint') - -BYO Docker Containers with SageMaker Estimators ------------------------------------------------ - -To use a Docker image that you created and use the SageMaker SDK for training, the easiest way is to use the dedicated ``Estimator`` class. -You can create an instance of the ``Estimator`` class with desired Docker image and use it as described in previous sections. - -Please refer to the full example in the examples repo: - -:: - - git clone https://github.com/awslabs/amazon-sagemaker-examples.git - - -The example notebook is located here: -``advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb`` - - +******************************** SageMaker Automatic Model Tuning --------------------------------- +******************************** All of the estimators can be used with SageMaker Automatic Model Tuning, which performs hyperparameter tuning jobs. A hyperparameter tuning job finds the best version of a model by running many training jobs on your dataset using the algorithm with different values of hyperparameters within ranges @@ -439,9 +440,9 @@ For more detailed explanations of the classes that this library provides for aut - `API docs for HyperparameterTuner and parameter range classes `__ - `API docs for analytics classes `__ - +************************* SageMaker Batch Transform -------------------------- +************************* After you train a model, you can use Amazon SageMaker Batch Transform to perform inferences with the model. Batch Transform manages all necessary compute resources, including launching instances to deploy endpoints and deleting them afterward. @@ -473,9 +474,114 @@ You can also specify other attributes of your data, such as the content type. For more details about what can be specified here, see `API docs `__. +********** +Local Mode +********** +The SageMaker Python SDK supports local mode, which allows you to create estimators and deploy them to your local environment. +This is a great way to test your deep learning scripts before running them in SageMaker's managed training or hosting environments. +Local Mode is supported for frameworks images (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and images you supply yourself. + +We can take the example in `Using Estimators <#using-estimators>`__ , and use either ``local`` or ``local_gpu`` as the instance type. + +.. code:: python + + from sagemaker.mxnet import MXNet + + # Configure an MXNet Estimator (no training happens yet) + mxnet_estimator = MXNet('train.py', + role='SageMakerRole', + train_instance_type='local', + train_instance_count=1, + framework_version='1.2.1') + + # In Local Mode, fit will pull the MXNet container Docker image and run it locally + mxnet_estimator.fit('s3://my_bucket/my_training_data/') + + # Alternatively, you can train using data in your local file system. This is only supported in Local mode. 
+ mxnet_estimator.fit('file:///tmp/my_training_data') + + # Deploys the model that was generated by fit() to local endpoint in a container + mxnet_predictor = mxnet_estimator.deploy(initial_instance_count=1, instance_type='local') + + # Serializes data and makes a prediction request to the local endpoint + response = mxnet_predictor.predict(data) + + # Tears down the endpoint container and deletes the corresponding endpoint configuration + mxnet_predictor.delete_endpoint() + + # Deletes the model + mxnet_predictor.delete_model() + + +If you have an existing model and want to deploy it locally, don't specify a sagemaker_session argument to the ``MXNetModel`` constructor. +The correct session is generated when you call ``model.deploy()``. + +Here is an end-to-end example: + +.. code:: python + + import numpy + from sagemaker.mxnet import MXNetModel + + model_location = 's3://mybucket/my_model.tar.gz' + code_location = 's3://mybucket/sourcedir.tar.gz' + s3_model = MXNetModel(model_data=model_location, role='SageMakerRole', + entry_point='mnist.py', source_dir=code_location) + + predictor = s3_model.deploy(initial_instance_count=1, instance_type='local') + data = numpy.zeros(shape=(1, 1, 28, 28)) + predictor.predict(data) + + # Tear down the endpoint container and delete the corresponding endpoint configuration + predictor.delete_endpoint() + + # Deletes the model + predictor.delete_model() + + +If you don't want to deploy your model locally, you can also choose to perform a Local Batch Transform Job. This is +useful if you want to test your container before creating a Sagemaker Batch Transform Job. Note that the performance +will not match Batch Transform Jobs hosted on SageMaker but it is still a useful tool to ensure you have everything +right or if you are not dealing with huge amounts of data. + +Here is an end-to-end example: + +.. code:: python + + from sagemaker.mxnet import MXNet + + mxnet_estimator = MXNet('train.py', + role='SageMakerRole', + train_instance_type='local', + train_instance_count=1, + framework_version='1.2.1') + + mxnet_estimator.fit('file:///tmp/my_training_data') + transformer = mxnet_estimator.transformer(1, 'local', assemble_with='Line', max_payload=1) + transformer.transform('s3://my/transform/data, content_type='text/csv', split_type='Line') + transformer.wait() + + # Deletes the SageMaker model + transformer.delete_model() + + +For detailed examples of running Docker in local mode, see: + +- `TensorFlow local mode example notebook `__. +- `MXNet local mode example notebook `__. + +A few important notes: + +- Only one local mode endpoint can be running at a time. +- If you are using S3 data as input, it is pulled from S3 to your local environment. Ensure you have sufficient space to store the data locally. +- If you run into problems it often due to different Docker containers conflicting. Killing these containers and re-running often solves your problems. +- Local Mode requires Docker Compose and `nvidia-docker2 `__ for ``local_gpu``. +- Distributed training is not yet supported for ``local_gpu``. + +************************************** Secure Training and Inference with VPC --------------------------------------- +************************************** Amazon SageMaker allows you to control network traffic to and from model container instances using Amazon Virtual Private Cloud (VPC). You can configure SageMaker to use your own private VPC in order to further protect and monitor traffic. 
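+
+The details are covered later in this section; as a quick orientation, the following is a minimal sketch of pointing a framework estimator at your own VPC.
+The subnet and security group IDs are placeholders, and it assumes you pass them through the ``subnets`` and ``security_group_ids`` arguments accepted by the estimators:
+
+.. code:: python
+
+    from sagemaker.mxnet import MXNet
+
+    # The subnet and security group IDs below are placeholders for resources in your own VPC.
+    mxnet_vpc_estimator = MXNet('train.py',
+                                role='SageMakerRole',
+                                framework_version='1.2.1',
+                                train_instance_count=1,
+                                train_instance_type='ml.p2.xlarge',
+                                subnets=['subnet-1234abcd'],
+                                security_group_ids=['sg-12345678'])
+
+    # Training job containers run inside the VPC you specified.
+    mxnet_vpc_estimator.fit('s3://my-bucket/training-data/')
+
+    # Models and endpoints created from this estimator reuse the same VPC configuration by default.
+    predictor = mxnet_vpc_estimator.deploy(initial_instance_count=1, instance_type='ml.p2.xlarge')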
@@ -559,88 +665,10 @@ Likewise, when you create ``Transformer`` from the ``Estimator`` using ``transfo # Transform Job container instances will run in your VPC mxnet_vpc_transformer.transform('s3://my-bucket/batch-transform-input') - -FAQ ---- - -I want to train a SageMaker Estimator with local data, how do I do this? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Upload the data to S3 before training. You can use the AWS Command Line Tool (the aws cli) to achieve this. - -If you don't have the aws cli, you can install it using pip: - -:: - - pip install awscli --upgrade --user - -If you don't have pip or want to learn more about installing the aws cli, see the official `Amazon aws cli installation guide `__. - -After you install the AWS cli, you can upload a directory of files to S3 with the following command: - -:: - - aws s3 cp /tmp/foo/ s3://bucket/path - -For more information about using the aws cli for manipulating S3 resources, see `AWS cli command reference `__. - - -How do I make predictions against an existing endpoint? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Create a ``Predictor`` object and provide it with your endpoint name, -then call its ``predict()`` method with your input. - -You can use either the generic ``RealTimePredictor`` class, which by default does not perform any serialization/deserialization transformations on your input, -but can be configured to do so through constructor arguments: -http://sagemaker.readthedocs.io/en/stable/predictors.html - -Or you can use the TensorFlow / MXNet specific predictor classes, which have default serialization/deserialization logic: -http://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-predictor -http://sagemaker.readthedocs.io/en/stable/sagemaker.mxnet.html#mxnet-predictor - -Example code using the TensorFlow predictor: - -:: - - from sagemaker.tensorflow import TensorFlowPredictor - - predictor = TensorFlowPredictor('myexistingendpoint') - result = predictor.predict(['my request body']) - - -BYO Model ---------- -You can also create an endpoint from an existing model rather than training one. -That is, you can bring your own model: - -First, package the files for the trained model into a ``.tar.gz`` file, and upload the archive to S3. - -Next, create a ``Model`` object that corresponds to the framework that you are using: `MXNetModel `__ or `TensorFlowModel `__. - -Example code using ``MXNetModel``: - -.. code:: python - - from sagemaker.mxnet.model import MXNetModel - - sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz', - role='arn:aws:iam::accid:sagemaker-role', - entry_point='entry_point.py') - -After that, invoke the ``deploy()`` method on the ``Model``: - -.. code:: python - - predictor = sagemaker_model.deploy(initial_instance_count=1, - instance_type='ml.m4.xlarge') - -This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is called. You can now get inferences just like with any other model deployed on Amazon SageMaker. - -A full example is available in the `Amazon SageMaker examples repository `__. - - +******************* Inference Pipelines -------------------- +******************* + You can create a Pipeline for realtime or batch inference comprising of one or multiple model containers. 
This will help you to deploy an ML pipeline behind a single endpoint and you can have one API call perform pre-processing, model-scoring and post-processing on your data before returning it back as the response. @@ -706,11 +734,61 @@ For comprehensive examples on how to use Inference Pipelines please refer to the - `inference_pipeline_sparkml_xgboost_abalone.ipynb `__ - `inference_pipeline_sparkml_blazingtext_dbpedia.ipynb `__ +****************** SageMaker Workflow ------------------- +****************** You can use Apache Airflow to author, schedule and monitor SageMaker workflow. For more information, see `SageMaker Workflow in Apache Airflow`_. .. _SageMaker Workflow in Apache Airflow: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/workflow/README.rst + +*** +FAQ +*** + +I want to train a SageMaker Estimator with local data, how do I do this? +======================================================================== + +Upload the data to S3 before training. You can use the AWS Command Line Tool (the aws cli) to achieve this. + +If you don't have the aws cli, you can install it using pip: + +:: + + pip install awscli --upgrade --user + +If you don't have pip or want to learn more about installing the aws cli, see the official `Amazon aws cli installation guide `__. + +After you install the AWS cli, you can upload a directory of files to S3 with the following command: + +:: + + aws s3 cp /tmp/foo/ s3://bucket/path + +For more information about using the aws cli for manipulating S3 resources, see `AWS cli command reference `__. + + +How do I make predictions against an existing endpoint? +======================================================= + +Create a ``Predictor`` object and provide it with your endpoint name, +then call its ``predict()`` method with your input. + +You can use either the generic ``RealTimePredictor`` class, which by default does not perform any serialization/deserialization transformations on your input, +but can be configured to do so through constructor arguments: +http://sagemaker.readthedocs.io/en/stable/predictors.html + +Or you can use the TensorFlow / MXNet specific predictor classes, which have default serialization/deserialization logic: +http://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-predictor +http://sagemaker.readthedocs.io/en/stable/sagemaker.mxnet.html#mxnet-predictor + +Example code using the TensorFlow predictor: + +:: + + from sagemaker.tensorflow import TensorFlowPredictor + + predictor = TensorFlowPredictor('myexistingendpoint') + result = predictor.predict(['my request body']) \ No newline at end of file From 9276c3ceb2c2016bb75ec1973d60d25842b3b56d Mon Sep 17 00:00:00 2001 From: Eric Slesar <34587362+eslesar-aws@users.noreply.github.com> Date: Wed, 26 Jun 2019 11:02:43 -0700 Subject: [PATCH 02/14] Update doc/overview.rst Co-Authored-By: Marcio Vinicius dos Santos --- doc/overview.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index 6eb50d90cd..f6090e339c 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -30,7 +30,7 @@ After you train a model, you can save it, and then serve the model as an endpoin Prepare a Training script ========================= -Your training script must be a Python 2.7 or 3.5 compatible source file. +Your training script must be a Python 2.7 or 3.6 compatible source file. 
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: @@ -956,4 +956,4 @@ Example code using the TensorFlow predictor: from sagemaker.tensorflow import TensorFlowPredictor predictor = TensorFlowPredictor('myexistingendpoint') - result = predictor.predict(['my request body']) \ No newline at end of file + result = predictor.predict(['my request body']) From ace79290e5c0f76d6fb7dd2073643f90d5e347db Mon Sep 17 00:00:00 2001 From: Eric Slesar <34587362+eslesar-aws@users.noreply.github.com> Date: Wed, 26 Jun 2019 11:02:52 -0700 Subject: [PATCH 03/14] Update doc/overview.rst Co-Authored-By: Marcio Vinicius dos Santos --- doc/overview.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/overview.rst b/doc/overview.rst index f6090e339c..a07c7a82b2 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -9,7 +9,7 @@ SageMaker Python SDK provides several high-level abstractions for working with A - **Predictors**: Provide real-time inference and transformation using Python data-types against a SageMaker endpoint. - **Session**: Provides a collection of methods for working with SageMaker resources. -``Estimator`` and ``Model`` implementations for MXNet, TensorFlow, Chainer, PyTorch, scikit-Learn, Amazon SageMaker built-in algorithms, Reinforcement Learning, are included. +``Estimator`` and ``Model`` implementations for MXNet, TensorFlow, Chainer, PyTorch, scikit-learn, Amazon SageMaker built-in algorithms, Reinforcement Learning, are included. There's also an ``Estimator`` that runs SageMaker compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK. .. contents:: From 122f066fcc13b9cc2be26259c0a72397b8e9633a Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Wed, 26 Jun 2019 15:22:13 -0700 Subject: [PATCH 04/14] doc: made changes per feedback comments --- doc/overview.rst | 32 ++++++++++++++++++++++++++++++-- 1 file changed, 30 insertions(+), 2 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index 6eb50d90cd..4c11709854 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -19,11 +19,11 @@ There's also an ``Estimator`` that runs SageMaker compatible custom Docker conta Train a Model with the SageMaker Python SDK ******************************************* -To train a model by using the SageMaker Pthon SDK, you: +To train a model by using the SageMaker Python SDK, you: 1. Prepare a training script 2. Create an estimator -3. Call the `fit` method of the estimator +3. Call the ``fit`` method of the estimator After you train a model, you can save it, and then serve the model as an endpoint to get real-time inferences or get inferences for an entire dataset by using batch transform. @@ -277,6 +277,10 @@ Please refer to the full example in the examples repo: The example notebook is located here: ``advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb`` +You can also find this notebook in the **Advanced Functionality** folder of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. 
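+
+As a quick orientation before opening the notebook, the following is a minimal sketch of the pattern it walks through:
+training with your own image through the generic ``Estimator`` class. The ECR image URI and S3 paths are placeholders, not values from the notebook:
+
+.. code:: python
+
+    from sagemaker.estimator import Estimator
+
+    # Placeholder ECR image URI for a training image that you built and pushed yourself.
+    byo_estimator = Estimator(image_name='123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest',
+                              role='SageMakerRole',
+                              train_instance_count=1,
+                              train_instance_type='ml.c4.xlarge',
+                              output_path='s3://my-bucket/output')
+
+    # The 'training' channel is mounted in the container under /opt/ml/input/data/training.
+    byo_estimator.fit({'training': 's3://my-bucket/training-data/'})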
+ Incremental Training ==================== @@ -375,6 +379,10 @@ This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is A full example is available in the `Amazon SageMaker examples repository `__. +You can also find this notebook in the **Advanced Functionality** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. + Consuming SageMaker Model Packages ================================== @@ -494,6 +502,10 @@ For more detailed examples of running hyperparameter tuning jobs, see: - `Bringing your own estimator for hyperparameter tuning `__ - `Analyzing results `__ +You can also find these notebooks in the **Hyperprameter Tuning** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. + For more detailed explanations of the classes that this library provides for automatic model tuning, see: - `API docs for HyperparameterTuner and parameter range classes `__ @@ -630,6 +642,10 @@ For detailed examples of running Docker in local mode, see: - `TensorFlow local mode example notebook `__. - `MXNet local mode example notebook `__. +You can also find these notebooks in the **SageMaker Python SDK** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. + A few important notes: - Only one local mode endpoint can be running at a time. @@ -830,6 +846,10 @@ This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is A full example is available in the `Amazon SageMaker examples repository `__. +You can also find this notebook in the **Advanced Functionality** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. + Inference Pipelines ******************* @@ -857,6 +877,10 @@ For more information about how to train an XGBoost model, please refer to the XG .. _here: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb +You can also find this notebook in the **Introduction to Amazon Algorithms** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. + .. code:: python sm_model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge', endpoint_name=endpoint_name) @@ -899,6 +923,10 @@ For comprehensive examples on how to use Inference Pipelines please refer to the - `inference_pipeline_sparkml_xgboost_abalone.ipynb `__ - `inference_pipeline_sparkml_blazingtext_dbpedia.ipynb `__ +You can also find these notebooks in the **Advanced Functionality** section of the **SageMaker Examples** section in a notebook instance. +For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ +in the AWS documentation. 
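+
+If you only want to exercise an inference pipeline endpoint that is already deployed, the following is a minimal
+sketch using the generic ``RealTimePredictor``. The endpoint name is a placeholder, and it assumes the first
+container in the pipeline accepts CSV input and the last container returns JSON:
+
+.. code:: python
+
+    from sagemaker.predictor import RealTimePredictor, csv_serializer, json_deserializer
+
+    # 'my-inference-pipeline-endpoint' is a placeholder for the name of your deployed pipeline endpoint.
+    predictor = RealTimePredictor(endpoint='my-inference-pipeline-endpoint',
+                                  serializer=csv_serializer,
+                                  deserializer=json_deserializer,
+                                  content_type='text/csv',
+                                  accept='application/json')
+
+    # A single request runs the whole chain: pre-processing, model scoring, and post-processing.
+    result = predictor.predict([0.5, 3.2, 1.1])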
+ ****************** SageMaker Workflow ****************** From 9eca3ef4a92fd50edea3ded31c96fbac9a923aa4 Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 28 Jun 2019 12:25:04 -0700 Subject: [PATCH 05/14] doc: remove duplicate faq section and fixed heading --- doc/overview.rst | 51 +----------------------------------------------- 1 file changed, 1 insertion(+), 50 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index b1acb896f9..8832a57e43 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -516,7 +516,7 @@ SageMaker Batch Transform ************************* After you train a model, you can use Amazon SageMaker Batch Transform to perform inferences with the model. -Batch Transform manages all necessary compute resources, including launching instances to deploy endpoints and deleting them afterward. +Batch transform manages all necessary compute resources, including launching instances to deploy endpoints and deleting them afterward. You can read more about SageMaker Batch Transform in the `AWS documentation `__. If you trained the model using a SageMaker Python SDK estimator, @@ -767,55 +767,6 @@ A new training job channel, named ``code``, will be added with that S3 URI. Bef Once the training job begins, the training container will look at the offline input ``code`` channel to install dependencies and run the entry script. This isolates the training container, so no inbound or outbound network calls can be made. - -FAQ ---- - -I want to train a SageMaker Estimator with local data, how do I do this? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Upload the data to S3 before training. You can use the AWS Command Line Tool (the aws cli) to achieve this. - -If you don't have the aws cli, you can install it using pip: - -:: - - pip install awscli --upgrade --user - -If you don't have pip or want to learn more about installing the aws cli, see the official `Amazon aws cli installation guide `__. - -After you install the AWS cli, you can upload a directory of files to S3 with the following command: - -:: - - aws s3 cp /tmp/foo/ s3://bucket/path - -For more information about using the aws cli for manipulating S3 resources, see `AWS cli command reference `__. - - -How do I make predictions against an existing endpoint? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Create a ``Predictor`` object and provide it with your endpoint name, -then call its ``predict()`` method with your input. - -You can use either the generic ``RealTimePredictor`` class, which by default does not perform any serialization/deserialization transformations on your input, -but can be configured to do so through constructor arguments: -http://sagemaker.readthedocs.io/en/stable/predictors.html - -Or you can use the TensorFlow / MXNet specific predictor classes, which have default serialization/deserialization logic: -http://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-predictor -http://sagemaker.readthedocs.io/en/stable/sagemaker.mxnet.html#mxnet-predictor - -Example code using the TensorFlow predictor: - -:: - - from sagemaker.tensorflow import TensorFlowPredictor - - predictor = TensorFlowPredictor('myexistingendpoint') - result = predictor.predict(['my request body']) - - BYO Model --------- You can also create an endpoint from an existing model rather than training one. 
From 565ff7021714a6ddf8f087e8c617840ef0f9a12f Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 28 Jun 2019 12:35:47 -0700 Subject: [PATCH 06/14] doc: fix heading levels in overview.rst --- doc/overview.rst | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index 8832a57e43..784388f025 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -740,8 +740,10 @@ Likewise, when you create ``Transformer`` from the ``Estimator`` using ``transfo # Transform Job container instances will run in your VPC mxnet_vpc_transformer.transform('s3://my-bucket/batch-transform-input') +*********************************************************** Secure Training with Network Isolation (Internet-Free) Mode -------------------------------------------------------------------------- +*********************************************************** + You can enable network isolation mode when running training and inference on Amazon SageMaker. For more information about Amazon SageMaker network isolation mode, see the `SageMaker documentation on network isolation or internet-free mode `__. @@ -767,8 +769,10 @@ A new training job channel, named ``code``, will be added with that S3 URI. Bef Once the training job begins, the training container will look at the offline input ``code`` channel to install dependencies and run the entry script. This isolates the training container, so no inbound or outbound network calls can be made. +********* BYO Model ---------- +********* + You can also create an endpoint from an existing model rather than training one. That is, you can bring your own model: @@ -801,7 +805,7 @@ You can also find this notebook in the **Advanced Functionality** section of the For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ in the AWS documentation. - +******************* Inference Pipelines ******************* From da0868ee414f4621b8ec76506385c9c85375424e Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Mon, 8 Jul 2019 16:07:19 -0700 Subject: [PATCH 07/14] doc: update TensorFlow using topic --- doc/conf.py | 4 + doc/overview.rst | 36 --- doc/using_tf.rst | 815 +++++++++++++++++++++++++++++++++++++---------- 3 files changed, 643 insertions(+), 212 deletions(-) diff --git a/doc/conf.py b/doc/conf.py index f9705e0089..d16ba8d8a0 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -55,6 +55,7 @@ def __getattr__(cls, name): "sphinx.ext.coverage", "sphinx.ext.autosummary", "sphinx.ext.napoleon", + "sphinx.ext.autosectionlabel" ] # Add any paths that contain templates here, relative to this directory. @@ -90,3 +91,6 @@ def __getattr__(cls, name): # autosummary autosummary_generate = True + +#autosectionlabel +autosectionlabel_prefix_document = True diff --git a/doc/overview.rst b/doc/overview.rst index 784388f025..f354bd0946 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -769,42 +769,6 @@ A new training job channel, named ``code``, will be added with that S3 URI. Bef Once the training job begins, the training container will look at the offline input ``code`` channel to install dependencies and run the entry script. This isolates the training container, so no inbound or outbound network calls can be made. -********* -BYO Model -********* - -You can also create an endpoint from an existing model rather than training one. -That is, you can bring your own model: - -First, package the files for the trained model into a ``.tar.gz`` file, and upload the archive to S3. 
- -Next, create a ``Model`` object that corresponds to the framework that you are using: `MXNetModel `__ or `TensorFlowModel `__. - -Example code using ``MXNetModel``: - -.. code:: python - - from sagemaker.mxnet.model import MXNetModel - - sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz', - role='arn:aws:iam::accid:sagemaker-role', - entry_point='entry_point.py') - -After that, invoke the ``deploy()`` method on the ``Model``: - -.. code:: python - - predictor = sagemaker_model.deploy(initial_instance_count=1, - instance_type='ml.m4.xlarge') - -This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is called. You can now get inferences just like with any other model deployed on Amazon SageMaker. - -A full example is available in the `Amazon SageMaker examples repository `__. - -You can also find this notebook in the **Advanced Functionality** section of the **SageMaker Examples** section in a notebook instance. -For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ -in the AWS documentation. - ******************* Inference Pipelines ******************* diff --git a/doc/using_tf.rst b/doc/using_tf.rst index d2d9228153..681de5f9b4 100644 --- a/doc/using_tf.rst +++ b/doc/using_tf.rst @@ -1,28 +1,19 @@ -============================================== +############################################## Using TensorFlow with the SageMaker Python SDK -============================================== +############################################## TensorFlow SageMaker Estimators allow you to run your own TensorFlow training algorithms on SageMaker Learner, and to host your own TensorFlow models on SageMaker Hosting. -**Note:** This topic describes how to use script mode for TensorFlow versions 1.11 and later. -For Documentation of the previous Legacy Mode versions, see: - -* `1.4.1 `_ -* `1.5.0 `_ -* `1.6.0 `_ -* `1.7.0 `_ -* `1.8.0 `_ -* `1.9.0 `_ -* `1.10.0 `_ +For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`. .. warning:: We have added a new format of your TensorFlow training script with TensorFlow version 1.11. This new way gives the user script more flexibility. This new format is called Script Mode, as opposed to Legacy Mode, which is what we support with TensorFlow 1.11 and older versions. In addition we are adding Python 3 support with Script Mode. - Last supported version of Legacy Mode will be TensorFlow 1.12. + The last supported version of Legacy Mode will be TensorFlow 1.12. Script Mode is available with TensorFlow version 1.11 and newer. Make sure you refer to the correct version of this README when you prepare your script. You can find the Legacy Mode README `here `_. @@ -31,23 +22,27 @@ For Documentation of the previous Legacy Mode versions, see: Supported versions of TensorFlow for Elastic Inference: ``1.11.0``, ``1.12.0``. -Training with TensorFlow -~~~~~~~~~~~~~~~~~~~~~~~~ -Training TensorFlow models using ``sagemaker.tensorflow.TensorFlow`` is a two-step process. -First, you prepare your training script, then second, you run it on -SageMaker Learner via the ``sagemaker.tensorflow.TensorFlow`` estimator. +***************************** +Train a Model with TensorFlow +***************************** + +To train a TensorFlow model by using the SageMaker Python SDK: + +1. `Prepare a training script <#prepare-a-script-mode-training-script>`_ +2. `Create a ``sagemaker.tensorflow.TensorFlow`` estimator <#create-an-estimator>`_ +3. 
`Call the estimator's `fit` method <#call-fit>`_ -Preparing a Script Mode training script -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Prepare a Script Mode Training Script +====================================== Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: -* ``SM_MODEL_DIR``: A string that represents the local path where the training job can write the model artifacts to. +* ``SM_MODEL_DIR``: A string that represents the local path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. This is different than the ``model_dir`` - argument passed in your training script which is a S3 location. ``SM_MODEL_DIR`` is always set to ``/opt/ml/model``. + argument passed in your training script, which is an S3 location. ``SM_MODEL_DIR`` is always set to ``/opt/ml/model``. * ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. * ``SM_OUTPUT_DATA_DIR``: A string that represents the path to the directory to write output artifacts to. Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. @@ -55,7 +50,7 @@ The training script is very similar to a training script you might run outside o * ``SM_CHANNEL_XXXX``: A string that represents the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the TensorFlow estimator's ``fit`` call, named 'train' and 'test', the environment variables ``SM_CHANNEL_TRAIN`` and ``SM_CHANNEL_TEST`` are set. -For the exhaustive list of available environment variables, see the `SageMaker Containers documentation `__. +For the exhaustive list of available environment variables, see the `SageMaker Containers documentation `_. A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to ``SM_CHANNEL_TRAIN`` so that it can be deployed for inference later. Hyperparameters are passed to your script as arguments and can be retrieved with an ``argparse.ArgumentParser`` instance. @@ -88,17 +83,20 @@ Because the SageMaker imports your training script, putting your training launch is good practice. Note that SageMaker doesn't support argparse actions. -If you want to use, for example, boolean hyperparameters, you need to specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this hyperparameter when instantiating your TensorFlow estimator. +For example, if you want to use a boolean hyperparameter, specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this hyperparameter when you create the TensorFlow estimator. +For a complete example of a TensorFlow training script, see < + + Adapting your local TensorFlow script -''''''''''''''''''''''''''''''''''''' +------------------------------------- -If you have a TensorFlow training script that runs outside of SageMaker please follow the directions here: +If you have a TensorFlow training script that runs outside of SageMaker, do the following to adapt the script to run in SageMaker: 1. Make sure your script can handle ``--model_dir`` as an additional command line argument. 
If you did not specify a -location when the TensorFlow estimator is constructed a S3 location under the default training job bucket will be passed -in here. Distributed training with parameter servers requires you use the ``tf.estimator.train_and_evaluate`` API and -a S3 location is needed as the model directory during training. Here is an example: +location when you created the TensorFlow estimator, an S3 location under the default training job bucket is used. +Distributed training with parameter servers requires you to use the ``tf.estimator.train_and_evaluate`` API and +to provide an S3 location as the model directory during training. Here is an example: .. code:: python @@ -129,22 +127,21 @@ In your training script the channels will be stored in environment variables ``S ``output_path``. -Training with TensorFlow estimator -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Create an Estimator +=================== -Calling fit -''''''''''' +After you create your training script, create an instance of the :class:`sagemaker.tensorflow.TensorFlow` estimator. To use Script Mode, set at least one of these args - ``py_version='py3'`` - ``script_mode=True`` -Please note that when using Script Mode, your training script need to accept the following args: +When using Script Mode, your training script needs to accept the following args: - ``model_dir`` -Please note that the following args are not permitted when using Script Mode: +The following args are not permitted when using Script Mode: - ``checkpoint_path`` - ``training_steps`` @@ -160,15 +157,19 @@ Please note that the following args are not permitted when using Script Mode: framework_version='1.12', py_version='py3') tf_estimator.fit('s3://bucket/path/to/training/data') -Where the S3 url is a path to your training data, within Amazon S3. The -constructor keyword arguments define how SageMaker runs your training -script which we discussed earlier. +Where the S3 url is a path to your training data within Amazon S3. +The constructor keyword arguments define how SageMaker runs your training script. -You start your training script by calling ``fit`` on a ``TensorFlow`` estimator. ``fit`` takes +For more information about the sagemaker.tensorflow.TensorFlow estimator, see `sagemaker.tensorflow.TensorFlow Class`_. + +Call ``fit`` +============ + +You start your training script by calling the ``fit`` method on a ``TensorFlow`` estimator. ``fit`` takes both required and optional arguments. -Required argument -""""""""""""""""" +Required arguments +------------------ - ``inputs``: The S3 location(s) of datasets to be used for training. This can take one of two forms: @@ -177,7 +178,7 @@ Required argument - ``sagemaker.session.s3_input``: channel configuration for S3 data sources that can provide additional information as well as the path to the training dataset. See `the API docs `_ for full details. Optional arguments -"""""""""""""""""" +------------------ - ``wait (bool)``: Defaults to True, whether to block and wait for the training script to complete before returning. @@ -189,7 +190,7 @@ Optional arguments based on the training image name and current timestamp. What happens when fit is called -""""""""""""""""""""""""""""""" +------------------------------- Calling ``fit`` starts a SageMaker training job. The training job will execute the following. @@ -202,9 +203,9 @@ Calling ``fit`` starts a SageMaker training job. 
The training job will execute t - setup up distributed training environment if configured to use parameter server - starts asynchronous training -If the ``wait=False`` flag is passed to ``fit``, then it will return immediately. The training job will continue running -asynchronously. At a later time, a Tensorflow Estimator can be obtained by attaching to the existing training job. If -the training job is not finished it will start showing the standard output of training and wait until it completes. +If the ``wait=False`` flag is passed to ``fit``, then it returns immediately. The training job continues running +asynchronously. Later, a Tensorflow estimator can be obtained by attaching to the existing training job. +If the training job is not finished, it starts showing the standard output of training and wait until it completes. After attaching, the estimator can be deployed as usual. .. code:: python @@ -217,14 +218,14 @@ After attaching, the estimator can be deployed as usual. tf_estimator = TensorFlow.attach(training_job_name=training_job_name) Distributed Training -'''''''''''''''''''' +==================== To run your training job with multiple instances in a distributed fashion, set ``train_instance_count`` to a number larger than 1. We support two different types of distributed training, parameter server and Horovod. The ``distributions`` parameter is used to configure which distributed training strategy to use. Training with parameter servers -""""""""""""""""""""""""""""""" +------------------------------- If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server thread on each instance in the training cluster, and then executes your training code. You can find more information on @@ -242,7 +243,7 @@ To enable parameter server training: tf_estimator.fit('s3://bucket/path/to/training/data') Training with Horovod -""""""""""""""""""""" +--------------------- Horovod is a distributed training framework based on MPI. Horovod is only available with TensorFlow version ``1.12`` or newer. You can find more details at `Horovod README `__. @@ -278,80 +279,9 @@ In the below example we create an estimator to launch Horovod distributed traini }) tf_estimator.fit('s3://bucket/path/to/training/data') -sagemaker.tensorflow.TensorFlow class -''''''''''''''''''''''''''''''''''''' - -The ``TensorFlow`` constructor takes both required and optional arguments. - -Required: - -- ``entry_point (str)`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role (str)`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count (int)`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type (str)`` Type of EC2 instance to use for training, for - example, 'ml.c4.xlarge'. - -Optional: - -- ``source_dir (str)`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. -- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: ``[]``). 
- The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used - instead. Example: - - The following call - - >>> TensorFlow(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) - - results in the following inside the container: - - >>> opt/ml/code - >>> ├── train.py - >>> ├── common - >>> └── virtual-env - -- ``hyperparameters (dict[str, ANY])`` Hyperparameters that will be used for training. - Will be made accessible as command line arguments. -- ``train_volume_size (int)`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to the store training - data. -- ``train_max_run (int)`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``output_path (str)`` S3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the ``fit`` - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``base_job_name`` Name to assign for the training job that the ``fit`` - method launches. If not specified, the estimator generates a default - job name, based on the training image name and current timestamp. -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - ``framework_version`` and ``py_version``. Refer to: `SageMaker TensorFlow Docker Containers - <#sagemaker-tensorflow-docker-containers>`_ for details on what the official images support - and where to find the source code to build your custom image. -- ``script_mode (bool)`` Whether to use Script Mode or not. Script mode is the only available training mode in Python 3, - setting ``py_version`` to ``py3`` automatically sets ``script_mode`` to True. -- ``model_dir (str)`` Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training. - If not specified a S3 location will be generated under the training job's default bucket. And ``model_dir`` will be - passed in your training script as one of the command line arguments. -- ``distributions (dict)`` Configure your distribution strategy with this argument. Training with Pipe Mode using PipeModeDataset -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +============================================= Amazon SageMaker allows users to create training jobs using Pipe input mode. With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. @@ -419,7 +349,7 @@ You can learn more about ``PipeModeDataset`` in the sagemaker-tensorflow-extensi Training with MKL-DNN disabled -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +============================== SageMaker TensorFlow CPU images use TensorFlow built with Intel® MKL-DNN optimization. 
@@ -435,67 +365,600 @@ You can disable MKL-DNN optimization for TensorFlow ``1.8.0`` and above by setti os.environ['TF_DISABLE_MKL'] = '1' os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1' - -Deploying TensorFlow Serving models -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +******************************** +Deploy TensorFlow Serving models +******************************** After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel in the S3 location defined by ``output_path``. You can call ``deploy`` on a TensorFlow -estimator to create a SageMaker Endpoint. +estimator to create a SageMaker Endpoint, or you can call ``transformer`` to create a ``Transformer`` that you can use to run a batch transform job. Your model will be deployed to a TensorFlow Serving-based server. The server provides a super-set of the `TensorFlow Serving REST API `_. -See `Deploying to TensorFlow Serving Endpoints `_ to learn how to deploy your model and make inference requests. + +Deploy to a SageMaker Endpoint +============================== + +Deploying from an Estimator +--------------------------- + +After a TensorFlow estimator has been fit, it saves a TensorFlow +`SavedModel `_ bundle in +the S3 location defined by ``output_path``. You can call ``deploy`` on a TensorFlow +estimator object to create a SageMaker Endpoint: + +.. code:: python + + from sagemaker.tensorflow import TensorFlow + + estimator = TensorFlow(entry_point='tf-train.py', ..., train_instance_count=1, + train_instance_type='ml.c4.xlarge', framework_version='1.11') + + estimator.fit(inputs) + + predictor = estimator.deploy(initial_instance_count=1, + instance_type='ml.c5.xlarge', + endpoint_type='tensorflow-serving') + + +The code block above deploys a SageMaker Endpoint with one instance of the type 'ml.c5.xlarge'. + +What happens when deploy is called +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Calling ``deploy`` starts the process of creating a SageMaker Endpoint. This process includes the following steps. + +- Starts ``initial_instance_count`` EC2 instances of the type ``instance_type``. +- On each instance, it will do the following steps: + + - start a Docker container optimized for TensorFlow Serving, see `SageMaker TensorFlow Serving containers `_. + - start a `TensorFlow Serving` process configured to run your model. + - start an HTTP server that provides access to TensorFlow Server through the SageMaker InvokeEndpoint API. + + +When the ``deploy`` call finishes, the created SageMaker Endpoint is ready for prediction requests. The +`Making predictions against a SageMaker Endpoint`_ section will explain how to make prediction requests +against the Endpoint. + +Deploying directly from model artifacts +--------------------------------------- + +If you already have existing model artifacts in S3, you can skip training and deploy them directly to an endpoint: + +.. code:: python + + from sagemaker.tensorflow.serving import Model + + model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') + + predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge') + +Python-based TensorFlow serving on SageMaker has support for `Elastic Inference `__, which allows for inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance. In order to attach an Elastic Inference accelerator to your endpoint provide the accelerator type to accelerator_type to your deploy call. + +.. 
code:: python + + from sagemaker.tensorflow.serving import Model + + model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') + + predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge', accelerator_type='ml.eia1.medium') + +Making predictions against a SageMaker Endpoint +----------------------------------------------- + +Once you have the ``Predictor`` instance returned by ``model.deploy(...)`` or ``estimator.deploy(...)``, you +can send prediction requests to your Endpoint. + +The following code shows how to make a prediction request: + +.. code:: python + + input = { + 'instances': [1.0, 2.0, 5.0] + } + result = predictor.predict(input) + +The result object will contain a Python dict like this: + +.. code:: python + + { + 'predictions': [3.5, 4.0, 5.5] + } + +The formats of the input and the output data correspond directly to the request and response formats +of the ``Predict`` method in the `TensorFlow Serving REST API `_. + +If your SavedModel includes the right ``signature_def``, you can also make Classify or Regress requests: + +.. code:: python + + # input matches the Classify and Regress API + input = { + 'signature_name': 'tensorflow/serving/regress', + 'examples': [{'x': 1.0}, {'x': 2.0}] + } + + result = predictor.regress(input) # or predictor.classify(...) + + # result contains: + { + 'results': [3.5, 4.0] + } + +You can include multiple ``instances`` in your predict request (or multiple ``examples`` in +classify/regress requests) to get multiple prediction results in one request to your Endpoint: + +.. code:: python + + input = { + 'instances': [ + [1.0, 2.0, 5.0], + [1.0, 2.0, 5.0], + [1.0, 2.0, 5.0] + ] + } + result = predictor.predict(input) + + # result contains: + { + 'predictions': [ + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5] + ] + } + +If your application allows request grouping like this, it is **much** more efficient than making separate requests. + +See `Deploying to TensorFlow Serving Endpoints ` to learn how to deploy your model and make inference requests. + +Run a Batch Transform Job +========================= + +Batch transform allows you to get inferences for an entire dataset that is stored in an S3 bucket. + +For general information about using batch transform with the SageMaker Python SDK, see :ref:`overview:SageMaker Batch Transform`. +For information about SageMaker batch transform, see `Get Inferences for an Entire Dataset with Batch Transform ` in the AWS documentation. + +To run a batch transform job, you first create a ``Transformer`` object, and then call that object's ``transform`` method. + +Create a Transformer Object +--------------------------- + +If you used an estimator to train your model, you can call the ``transformer`` method of the estimator to create a ``Transformer`` object. +To use a model trained outside of SageMaker, you can package the model as a SageMaker model, and call the ``transformer`` method of the SageMaker model. +For information about how to package a model as a SageMaker model, see :ref:`overview:BYO Model`. +When you call the ``tranformer`` method, you specify the type and number of instances to use for the batch transform job, and the location where the results are stored in S3. + +For example: + +.. 
code:: python
+
+    bucket = 'my-bucket'     # The name of the S3 bucket where the results are stored
+    prefix = 'batch-results' # The folder in the S3 bucket where the results are stored
+
+    batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results
+
+    tf_transformer = tf_estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
+
+Call transform
+--------------
+
+After you create a ``Transformer`` object, you call that object's ``transform`` method to start a batch transform job.
+For example:
+
+.. code:: python
+
+    batch_input = 's3://{}/{}/test/examples'.format(bucket, prefix) # The location of the input dataset
+
+    tf_transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')
+
+In the example, the content type is CSV, and each line in the dataset is treated as a record to get a prediction for.
+
+Batch Transform Supported Data Formats
+--------------------------------------
+
+When you call the ``transform`` method to start a batch transform job,
+you specify the data format by providing a MIME type as the value for the ``content_type`` parameter.
+
+The following content formats are supported without custom input and output handling:
+
+* CSV - specify ``text/csv`` as the value of the ``content_type`` parameter.
+* JSON - specify ``application/json`` as the value of the ``content_type`` parameter.
+* JSON lines - specify ``application/jsonlines`` as the value of the ``content_type`` parameter.
+
+For detailed information about how TensorFlow Serving formats these data types for input and output, see :ref:`using_tf:TensorFlow Serving Input and Output`.
+
+You can also accept any custom data format by writing input and output functions, and including them in the ``inference.py`` file in your model.
+For information, see :ref:`using_tf:Create Python Scripts for Custom Input and Output Formats`.
+
+TensorFlow Serving Input and Output
+===================================
+
+The following sections describe the data formats that TensorFlow Serving endpoints and batch transform jobs accept,
+and how to write input and output functions to handle custom data formats.
+
+Supported Formats
+-----------------
+
+SageMaker's TensorFlow Serving endpoints can also accept some additional input formats that are not part of the
+TensorFlow REST API, including a simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and
+CSV data.
+
+Simplified JSON Input
+^^^^^^^^^^^^^^^^^^^^^
+
+The Endpoint will accept simplified JSON input that doesn't match the TensorFlow REST API's Predict request format.
+When the Endpoint receives data like this, it attempts to transform it into a valid
+Predict request, using a few simple rules:
+
+- Python values, dicts, and one-dimensional arrays are treated as the input value in a single-instance Predict request.
+- Multidimensional arrays are treated as multiple values in a multi-instance Predict request.
+
+Combined with the client-side ``Predictor`` object's JSON serialization, this allows you to make simple
+requests like this:
+
+.. code:: python
+
+    input = [
+      [1.0, 2.0, 5.0],
+      [1.0, 2.0, 5.0]
+    ]
+    result = predictor.predict(input)
+
+    # result contains:
+    {
+      'predictions': [
+        [3.5, 4.0, 5.5],
+        [3.5, 4.0, 5.5]
+      ]
+    }
+
+Or this:
+
+.. 
code:: python + + # 'x' must match name of input tensor in your SavedModel graph + # for models with multiple named inputs, just include all the keys in the input dict + input = { + 'x': [1.0, 2.0, 5.0] + } + + # result contains: + { + 'predictions': [ + [3.5, 4.0, 5.5] + ] + } + + +Line-delimited JSON +^^^^^^^^^^^^^^^^^^^ + +The Endpoint will accept line-delimited JSON objects (also known as "jsons" or "jsonlines" data). +The Endpoint treats each line as a separate instance in a multi-instance Predict request. To use +this feature from your python code, you need to create a ``Predictor`` instance that does not +try to serialize your input to JSON: + +.. code:: python + + # create a Predictor without JSON serialization + + predictor = Predictor('endpoint-name', serializer=None, content_type='application/jsonlines') + + input = '''{'x': [1.0, 2.0, 5.0]} + {'x': [1.0, 2.0, 5.0]} + {'x': [1.0, 2.0, 5.0]}''' + + result = predictor.predict(input) + + # result contains: + { + 'predictions': [ + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5] + ] + } + +This feature is especially useful if you are reading data from a file containing jsonlines data. + +**CSV (comma-separated values)** + +The Endpoint will accept CSV data. Each line is treated as a separate instance. This is a +compact format for representing multiple instances of 1-d array data. To use this feature +from your python code, you need to create a ``Predictor`` instance that can serialize +your input data to CSV format: + +.. code:: python + + # create a Predictor with JSON serialization + + predictor = Predictor('endpoint-name', serializer=sagemaker.predictor.csv_serializer) + + # CSV-formatted string input + input = '1.0,2.0,5.0\n1.0,2.0,5.0\n1.0,2.0,5.0' + + result = predictor.predict(input) + + # result contains: + { + 'predictions': [ + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5], + [3.5, 4.0, 5.5] + ] + } + +You can also use python arrays or numpy arrays as input and let the `csv_serializer` object +convert them to CSV, but the client-size CSV conversion is more sophisticated than the +CSV parsing on the Endpoint, so if you encounter conversion problems, try using one of the +JSON options instead. + + +Create Python Scripts for Custom Input and Output Formats +--------------------------------------------------------- + +You can add your customized Python code to process your input and output data: + +.. code:: + + from sagemaker.tensorflow.serving import Model + + model = Model(entry_point='inference.py', + model_data='s3://mybucket/model.tar.gz', + role='MySageMakerRole') + +How to implement the pre- and/or post-processing handler(s) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Your entry point file should implement either a pair of ``input_handler`` + and ``output_handler`` functions or a single ``handler`` function. + Note that if ``handler`` function is implemented, ``input_handler`` + and ``output_handler`` are ignored. + +To implement pre- and/or post-processing handler(s), use the Context +object that the Python service creates. The Context object is a namedtuple with the following attributes: + +- ``model_name (string)``: the name of the model to use for + inference. For example, 'half-plus-three' + +- ``model_version (string)``: version of the model. For example, '5' + +- ``method (string)``: inference method. 
For example, 'predict', + 'classify' or 'regress', for more information on methods, please see + `Classify and Regress + API `__ + and `Predict + API `__ + +- ``rest_uri (string)``: the TFS REST uri generated by the Python + service. For example, + 'http://localhost:8501/v1/models/half_plus_three:predict' + +- ``grpc_uri (string)``: the GRPC port number generated by the Python + service. For example, '9000' + +- ``custom_attributes (string)``: content of + 'X-Amzn-SageMaker-Custom-Attributes' header from the original + request. For example, + 'tfs-model-name=half*plus*\ three,tfs-method=predict' + +- ``request_content_type (string)``: the original request content type, + defaulted to 'application/json' if not provided + +- ``accept_header (string)``: the original request accept type, + defaulted to 'application/json' if not provided + +- ``content_length (int)``: content length of the original request + +The following code example implements ``input_handler`` and +``output_handler``. By providing these, the Python service posts the +request to the TFS REST URI with the data pre-processed by ``input_handler`` +and passes the response to ``output_handler`` for post-processing. + +.. code:: + + import json + + def input_handler(data, context): + """ Pre-process request input before it is sent to TensorFlow Serving REST API + Args: + data (obj): the request data, in format of dict or string + context (Context): an object containing request and configuration details + Returns: + (dict): a JSON-serializable dict that contains request body and headers + """ + if context.request_content_type == 'application/json': + # pass through json (assumes it's correctly formed) + d = data.read().decode('utf-8') + return d if len(d) else '' + + if context.request_content_type == 'text/csv': + # very simple csv handler + return json.dumps({ + 'instances': [float(x) for x in data.read().decode('utf-8').split(',')] + }) + + raise ValueError('{{"error": "unsupported content type {}"}}'.format( + context.request_content_type or "unknown")) + + + def output_handler(data, context): + """Post-process TensorFlow Serving output before it is returned to the client. + Args: + data (obj): the TensorFlow serving response + context (Context): an object containing request and configuration details + Returns: + (bytes, string): data to return to client, response content type + """ + if data.status_code != 200: + raise ValueError(data.content.decode('utf-8')) + + response_content_type = context.accept_header + prediction = data.content + return prediction, response_content_type + +You might want to have complete control over the request. +For example, you might want to make a TFS request (REST or GRPC) to the first model, +inspect the results, and then make a request to a second model. In this case, implement +the ``handler`` method instead of the ``input_handler`` and ``output_handler`` methods, as demonstrated +in the following code: + +.. code:: + + import json + import requests + + + def handler(data, context): + """Handle request. 
+ Args: + data (obj): the request data + context (Context): an object containing request and configuration details + Returns: + (bytes, string): data to return to client, (optional) response content type + """ + processed_input = _process_input(data, context) + response = requests.post(context.rest_uri, data=processed_input) + return _process_output(response, context) + + + def _process_input(data, context): + if context.request_content_type == 'application/json': + # pass through json (assumes it's correctly formed) + d = data.read().decode('utf-8') + return d if len(d) else '' + + if context.request_content_type == 'text/csv': + # very simple csv handler + return json.dumps({ + 'instances': [float(x) for x in data.read().decode('utf-8').split(',')] + }) + + raise ValueError('{{"error": "unsupported content type {}"}}'.format( + context.request_content_type or "unknown")) + + + def _process_output(data, context): + if data.status_code != 200: + raise ValueError(data.content.decode('utf-8')) + + response_content_type = context.accept_header + prediction = data.content + return prediction, response_content_type + +You can also bring in external dependencies to help with your data +processing. There are 2 ways to do this: + +1. If you included ``requirements.txt`` in your ``source_dir`` or in + your dependencies, the container installs the Python dependencies at runtime using ``pip install -r``: + +.. code:: + + from sagemaker.tensorflow.serving import Model + + model = Model(entry_point='inference.py', + dependencies=['requirements.txt'], + model_data='s3://mybucket/model.tar.gz', + role='MySageMakerRole') + + +2. If you are working in a network-isolation situation or if you don't + want to install dependencies at runtime every time your endpoint starts or a batch + transform job runs, you might want to put + pre-downloaded dependencies under a ``lib`` directory and this + directory as dependency. The container adds the modules to the Python + path. Note that if both ``lib`` and ``requirements.txt`` + are present in the model archive, the ``requirements.txt`` is ignored: + +.. code:: + + from sagemaker.tensorflow.serving import Model + + model = Model(entry_point='inference.py', + dependencies=['/path/to/folder/named/lib'], + model_data='s3://mybucket/model.tar.gz', + role='MySageMakerRole') + + +************************************* +sagemaker.tensorflow.TensorFlow Class +************************************* + +The following are the most commonly used ``TensorFlow`` constructor arguments. + +Required: + +- ``entry_point (str)`` Path (absolute or relative) to the Python file which + should be executed as the entry point to training. +- ``role (str)`` An AWS IAM role (either name or full ARN). The Amazon + SageMaker training jobs and APIs that create Amazon SageMaker + endpoints use this role to access training data and model artifacts. + After the endpoint is created, the inference code might use the IAM + role, if accessing AWS resource. +- ``train_instance_count (int)`` Number of Amazon EC2 instances to use for + training. +- ``train_instance_type (str)`` Type of EC2 instance to use for training, for + example, 'ml.c4.xlarge'. + +Optional: + +- ``source_dir (str)`` Path (absolute or relative) to a directory with any + other training source code dependencies including the entry point + file. Structure within this directory will be preserved when training + on SageMaker. 
+- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with
+  any additional libraries that will be exported to the container (default: ``[]``).
+  The library folders are copied to SageMaker in the same folder where the entry point is copied.
+  If ``source_dir`` points to S3, the code is uploaded and the S3 location is used instead.
+  Example:
+
+    The following call
+
+    >>> TensorFlow(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env'])
+
+    results in the following inside the container:
+
+    >>> opt/ml/code
+    >>> ├── train.py
+    >>> ├── common
+    >>> └── virtual-env
+
+- ``hyperparameters (dict[str, ANY])`` Hyperparameters that will be used for training.
+  They are made accessible as command-line arguments.
+- ``train_volume_size (int)`` Size in GB of the EBS volume to use for storing
+  input data during training. Must be large enough to store the training data.
+- ``train_max_run (int)`` Timeout in seconds for training, after which Amazon
+  SageMaker terminates the job regardless of its current status.
+- ``output_path (str)`` S3 location where you want the training result (model
+  artifacts and optional output files) saved. If not specified, results are
+  stored in a default bucket. If a bucket with the specified name does not
+  exist, the estimator creates the bucket during the ``fit`` method execution.
+- ``output_kms_key`` Optional KMS key ID used to encrypt the training output.
+- ``base_job_name`` Name to assign to the training job that the ``fit``
+  method launches. If not specified, the estimator generates a default
+  job name based on the training image name and the current timestamp.
+- ``image_name`` An alternative Docker image to use for training and
+  serving. If specified, the estimator uses this image for training and
+  hosting, instead of selecting the appropriate SageMaker official image based on
+  ``framework_version`` and ``py_version``. Refer to `SageMaker TensorFlow Docker containers `_ for details on what the official images support
+  and where to find the source code to build your custom image.
+- ``script_mode (bool)`` Whether to use Script Mode. Script Mode is the only available training mode in Python 3;
+  setting ``py_version`` to ``py3`` automatically sets ``script_mode`` to True.
+- ``model_dir (str)`` Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training.
+  If not specified, an S3 location is generated under the training job's default bucket, and ``model_dir`` is
+  passed to your training script as one of the command-line arguments.
+- ``distributions (dict)`` Configure your distribution strategy with this argument.
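+
+The following is a minimal sketch of how these arguments fit together; the entry point script,
+role, S3 paths, and hyperparameter names are placeholders rather than values required by the SDK:
+
+.. code:: python
+
+    from sagemaker.tensorflow import TensorFlow
+
+    tf_estimator = TensorFlow(entry_point='tf-train.py',          # your training script
+                              role='SageMakerRole',               # IAM role name or ARN
+                              train_instance_count=2,
+                              train_instance_type='ml.p3.2xlarge',
+                              framework_version='1.12',
+                              py_version='py3',                   # also enables Script Mode
+                              hyperparameters={'epochs': 10, 'batch-size': 256},
+                              output_path='s3://my-bucket/training-output',
+                              distributions={'parameter_server': {'enabled': True}})
+
+    tf_estimator.fit('s3://my-bucket/path/to/training/data')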
+ +************************************** SageMaker TensorFlow Docker containers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The containers include the following Python packages: - -+--------------------------------+---------------+-------------------+ -| Dependencies | Script Mode | Legacy Mode | -+--------------------------------+---------------+-------------------+ -| boto3 | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| botocore | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| CUDA (GPU image only) | 9.0 | 9.0 | -+--------------------------------+---------------+-------------------+ -| numpy | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| Pillow | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| scipy | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| sklean | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| h5py | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| pip | 18.1 | 18.1 | -+--------------------------------+---------------+-------------------+ -| curl | Latest | Latest | -+--------------------------------+---------------+-------------------+ -| tensorflow | 1.12.0 | 1.12.0 | -+--------------------------------+---------------+-------------------+ -| tensorflow-serving-api | 1.12.0 | None | -+--------------------------------+---------------+-------------------+ -| sagemaker-containers | >=2.3.5 | >=2.3.5 | -+--------------------------------+---------------+-------------------+ -| sagemaker-tensorflow-container | 1.0 | 1.0 | -+--------------------------------+---------------+-------------------+ -| Python | 2.7 or 3.6 | 2.7 | -+--------------------------------+---------------+-------------------+ - -Legacy Mode TensorFlow Docker images support Python 2.7. Script Mode TensorFlow Docker images support both Python 2.7 -and Python 3.6. The Docker images extend Ubuntu 16.04. - -You can select version of TensorFlow by passing a ``framework_version`` keyword arg to the TensorFlow Estimator constructor. Currently supported versions are listed in the table above. You can also set ``framework_version`` to only specify major and minor version, e.g ``'1.6'``, which will cause your training script to be run on the latest supported patch version of that minor version, which in this example would be 1.6.0. -Alternatively, you can build your own image by following the instructions in the SageMaker TensorFlow containers -repository, and passing ``image_name`` to the TensorFlow Estimator constructor. - -For more information on the contents of the images, see the SageMaker TensorFlow containers repositories here: - -- training: https://github.com/aws/sagemaker-tensorflow-container -- serving: https://github.com/aws/sagemaker-tensorflow-serving-container +************************************** + +For information about SageMaker TensorFlow Docker containers and their dependencies, see `SageMaker TensorFlow Docker containers `_. 
From e18384b6b8dfec5c857caadc2263a78c5f9757f1 Mon Sep 17 00:00:00 2001
From: eslesar-aws
Date: Thu, 18 Jul 2019 16:59:09 -0700
Subject: [PATCH 08/14] doc: Update using_tf.rst to address feedback

---
 doc/using_tf.rst | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/doc/using_tf.rst b/doc/using_tf.rst
index 681de5f9b4..1d4de934e4 100644
--- a/doc/using_tf.rst
+++ b/doc/using_tf.rst
@@ -528,9 +528,6 @@ Create a Transformer Object
 ---------------------------

 If you used an estimator to train your model, you can call the ``transformer`` method of the estimator to create a ``Transformer`` object.
-To use a model trained outside of SageMaker, you can package the model as a SageMaker model, and call the ``transformer`` method of the SageMaker model.
-For information about how to package a model as a SageMaker model, see :ref:`overview:BYO Model`.
-When you call the ``tranformer`` method, you specify the type and number of instances to use for the batch transform job, and the location where the results are stored in S3.

 For example:

@@ -543,6 +540,26 @@ For example:

     tf_transformer = tf_estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)

+To use a model trained outside of SageMaker, you can package the model as a SageMaker model, and call the ``transformer`` method of the SageMaker model.
+
+For example:
+
+.. code:: python
+
+    bucket = 'my-bucket'     # The name of the S3 bucket where the results are stored
+    prefix = 'batch-results' # The folder in the S3 bucket where the results are stored
+
+    batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results
+
+    tf_transformer = tensorflow_serving_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
+
+For information about how to package a model as a SageMaker model, see :ref:`overview:BYO Model`.
+When you call the ``transformer`` method, you specify the type and number of instances to use for the batch transform job, and the location where the results are stored in S3.
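+
+The following is a sketch of one way to create the ``tensorflow_serving_model`` object used in the
+example above from existing model artifacts; the S3 path and role are placeholders:
+
+.. code:: python
+
+    from sagemaker.tensorflow.serving import Model
+
+    # Package existing model artifacts in S3 as a SageMaker model
+    tensorflow_serving_model = Model(model_data='s3://mybucket/model.tar.gz',
+                                     role='MySageMakerRole')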
+ + + Call transform -------------- From 1ef4b8cff62b3cd6564b057a973baa53ba5079b8 Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Thu, 18 Jul 2019 17:43:15 -0700 Subject: [PATCH 09/14] doc: fix comment in conf.py per build log --- doc/conf.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/conf.py b/doc/conf.py index d16ba8d8a0..b8fc2dc3de 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -92,5 +92,5 @@ def __getattr__(cls, name): # autosummary autosummary_generate = True -#autosectionlabel -autosectionlabel_prefix_document = True +# autosectionlabel +autosectionlabel_prefix_document = True \ No newline at end of file From 18fcde1ccfb3a4d77ce46676f53bb468d95bc20c Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 19 Jul 2019 09:42:30 -0700 Subject: [PATCH 10/14] doc: add newline to conf.py to fix error --- doc/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/conf.py b/doc/conf.py index dfad47c7c7..b08e04006e 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -98,4 +98,4 @@ def __getattr__(cls, name): autosummary_generate = True # autosectionlabel -autosectionlabel_prefix_document = True \ No newline at end of file +autosectionlabel_prefix_document = True From 4b0d1817bb3804358d83565ed0a208445dc4826c Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 19 Jul 2019 14:11:26 -0700 Subject: [PATCH 11/14] doc: addressed feedback for PR --- doc/using_chainer.rst | 1396 ++++++++++++++-------------- doc/using_pytorch.rst | 1460 ++++++++++++++--------------- doc/using_rl.rst | 642 ++++++------- doc/using_sklearn.rst | 1276 +++++++++++++------------- doc/using_tf.rst | 1966 ++++++++++++++++++++-------------------- doc/using_workflow.rst | 330 +++---- 6 files changed, 3537 insertions(+), 3533 deletions(-) diff --git a/doc/using_chainer.rst b/doc/using_chainer.rst index ddee676e73..248071bb67 100644 --- a/doc/using_chainer.rst +++ b/doc/using_chainer.rst @@ -1,699 +1,699 @@ -=========================================== -Using Chainer with the SageMaker Python SDK -=========================================== - -.. contents:: - -With Chainer Estimators, you can train and host Chainer models on Amazon SageMaker. - -Supported versions of Chainer: ``4.0.0``, ``4.1.0``, ``5.0.0`` - -You can visit the Chainer repository at https://github.com/chainer/chainer. - - -Training with Chainer -~~~~~~~~~~~~~~~~~~~~~ - -Training Chainer models using ``Chainer`` Estimators is a two-step process: - -1. Prepare a Chainer script to run on SageMaker -2. Run this script on SageMaker via a ``Chainer`` Estimator. - - -First, you prepare your training script, then second, you run this on SageMaker via a ``Chainer`` Estimator. -You should prepare your script in a separate source file than the notebook, terminal session, or source file you're -using to submit the script to SageMaker via a ``Chainer`` Estimator. - -Suppose that you already have an Chainer training script called -``chainer-train.py``. You can run this script in SageMaker as follows: - -.. code:: python - - from sagemaker.chainer import Chainer - chainer_estimator = Chainer(entry_point='chainer-train.py', - role='SageMakerRole', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='5.0.0') - chainer_estimator.fit('s3://bucket/path/to/training/data') - -Where the S3 URL is a path to your training data, within Amazon S3. The constructor keyword arguments define how -SageMaker runs your training script and are discussed in detail in a later section. 
- -In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, -then how to run that script on SageMaker using a ``Chainer`` Estimator. - -Preparing the Chainer training script -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Your Chainer training script must be a Python 2.7 or 3.5 compatible source file. - -The training script is very similar to a training script you might run outside of SageMaker, but you -can access useful properties about the training environment through various environment variables, such as - -* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. - These artifacts are uploaded to S3 for model hosting. -* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. -* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may - include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed - and uploaded to S3 to the same S3 prefix as the model artifacts. - -Supposing two input channels, 'train' and 'test', were used in the call to the Chainer estimator's ``fit()`` method, -the following will be set, following the format "SM_CHANNEL_[channel_name]": - -* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel -* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. - -A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, -and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments -and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start -with the following: - -.. code:: python - - import argparse - import os - - if __name__ =='__main__': - - parser = argparse.ArgumentParser() - - # hyperparameters sent by the client are passed as command-line arguments to the script. - parser.add_argument('--epochs', type=int, default=50) - parser.add_argument('--batch-size', type=int, default=64) - parser.add_argument('--learning-rate', type=float, default=0.05) - - # Data, model, and output directories - parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) - parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) - parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN']) - parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST']) - - args, _ = parser.parse_known_args() - - # ... load from args.train and args.test, train a model, write model to args.model_dir. - -Because the SageMaker imports your training script, you should put your training code in a main guard -(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not -inadvertently run your training code at the wrong point in execution. - -For more on training environment variables, please visit https://github.com/aws/sagemaker-containers. - -Using third-party libraries -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -When running your training script on SageMaker, it will have access to some pre-installed third-party libraries including ``chainer``, ``numpy``, and ``cupy``. 
-For more information on the runtime environment, including specific package versions, see `SageMaker Chainer Docker containers <#sagemaker-chainer-docker-containers>`__. - -If there are other packages you want to use with your script, you can include a ``requirements.txt`` file in the same directory as your training script to install other dependencies at runtime. -A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. -For information about the format of a ``requirements.txt`` file, see `Requirements Files `__ in the pip documentation. - -Running a Chainer training script in SageMaker -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You run Chainer training scripts on SageMaker by creating ``Chainer`` Estimators. -SageMaker training of your script is invoked when you call ``fit`` on a ``Chainer`` Estimator. -The following code sample shows how you train a custom Chainer script "chainer-train.py", passing -in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel -directories ('train' and 'test'). - -.. code:: python - - chainer_estimator = Chainer('chainer-train.py', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='5.0.0', - hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1}) - chainer_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data', - 'test': 's3://my-data-bucket/path/to/my/test/data'}) - - -Chainer Estimators -^^^^^^^^^^^^^^^^^^ - -The `Chainer` constructor takes both required and optional arguments. - -Required arguments -'''''''''''''''''' - -The following are required arguments to the ``Chainer`` constructor. When you create a Chainer object, you must include -these in the constructor, either positionally or as keyword arguments. - -- ``entry_point`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type`` Type of EC2 instance to use for training, for - example, 'ml.m4.xlarge'. - -Optional arguments -'''''''''''''''''' - -The following are optional arguments. When you create a ``Chainer`` object, you can specify these as keyword arguments. - -- ``source_dir`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. -- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: []). - The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ```source_dir``` points to S3, code will be uploaded and the S3 location will be used - instead. 
Example: - - The following call - >>> Chainer(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) - results in the following inside the container: - - >>> $ ls - - >>> opt/ml/code - >>> ├── train.py - >>> ├── common - >>> └── virtual-env - -- ``hyperparameters`` Hyperparameters that will be used for training. - Will be made accessible as a dict[str, str] to the training code on - SageMaker. For convenience, accepts other types besides str, but - str() will be called on keys and values to convert them before - training. -- ``py_version`` Python version you want to use for executing your - model training code. -- ``train_volume_size`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to store training - data if input_mode='File' is used (which is the default). -- ``train_max_run`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``input_mode`` The input mode that the algorithm supports. Valid - modes: 'File' - Amazon SageMaker copies the training dataset from the - s3 location to a directory in the Docker container. 'Pipe' - Amazon - SageMaker streams data directly from s3 to the container via a Unix - named pipe. -- ``output_path`` s3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the fit() - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``job_name`` Name to assign for the training job that the fit() - method launches. If not specified, the estimator generates a default - job name, based on the training image name and current timestamp -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - framework_version and py_version. Refer to: `SageMaker Chainer Docker Containers - <#sagemaker-chainer-docker-containers>`__ for details on what the Official images support - and where to find the source code to build your custom image. - - -Distributed Chainer Training -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - - -Chainer allows you to train a model on multiple nodes using ChainerMN_, which distributes training with MPI. - -.. _ChainerMN: https://github.com/chainer/chainermn - -In order to run distributed Chainer training on SageMaker, your training script should use a ``chainermn`` Communicator -object to coordinate training between multiple hosts. - -SageMaker runs your script with ``mpirun`` if ``train_instance_count`` is greater than two. -The following are optional arguments modify how MPI runs your distributed training script. - -- ``use_mpi`` Boolean that overrides whether to run your training script with MPI. -- ``num_processes`` Integer that determines how many total processes to run with MPI. By default, this is equal to ``process_slots_per_host`` times the number of nodes. -- ``process_slots_per_host`` Integer that determines how many processes can be run on each host. By default, this is equal to one process per host on CPU instances, or one process per GPU on GPU instances. -- ``additional_mpi_options`` String of additional options to pass to the ``mpirun`` command. 
- - -Calling fit -^^^^^^^^^^^ - -You start your training script by calling ``fit`` on a ``Chainer`` Estimator. ``fit`` takes both required and optional -arguments. - -Required arguments -'''''''''''''''''' - -- ``inputs``: This can take one of the following forms: A string - s3 URI, for example ``s3://my-bucket/my-training-data``. In this - case, the s3 objects rooted at the ``my-training-data`` prefix will - be available in the default ``train`` channel. A dict from - string channel names to s3 URIs. In this case, the objects rooted at - each s3 prefix will available as files in each channel directory. - -For example: - -.. code:: python - - {'train':'s3://my-bucket/my-training-data', - 'eval':'s3://my-bucket/my-evaluation-data'} - -.. optional-arguments-1: - -Optional arguments -'''''''''''''''''' - -- ``wait``: Defaults to True, whether to block and wait for the - training script to complete before returning. -- ``logs``: Defaults to True, whether to show logs produced by training - job in the Python session. Only meaningful when wait is True. - - -Saving models -~~~~~~~~~~~~~ - -In order to save your trained Chainer model for deployment on SageMaker, your training script should save your model -to a certain filesystem path called `model_dir`. This value is accessible through the environment variable -``SM_MODEL_DIR``. The following code demonstrates how to save a trained Chainer model named ``model`` as -``model.npz`` at the end of training: - -.. code:: python - - import chainer - import argparse - import os - - if __name__=='__main__': - # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable. - parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) - args, _ = parser.parse_known_args() - - # ... train `model`, then save it to `model_dir` as file 'model.npz' - chainer.serializers.save_npz(os.path.join(args.model_dir, 'model.npz'), model) - -After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data -will available in the s3 ``output_path`` you specified when you created the Chainer Estimator. - -Deploying Chainer models -~~~~~~~~~~~~~~~~~~~~~~~~ - -After an Chainer Estimator has been fit, you can host the newly created model in SageMaker. - -After calling ``fit``, you can call ``deploy`` on a ``Chainer`` Estimator to create a SageMaker Endpoint. -The Endpoint runs a SageMaker-provided Chainer model server and hosts the model produced by your training script, -which was run when you called ``fit``. This was the model you saved to ``model_dir``. - -``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your Chainer model. -Each ``Predictor`` provides a ``predict`` method which can do inference with numpy arrays or Python lists. -Inference arrays or lists are serialized and sent to the Chainer model server by an ``InvokeEndpoint`` SageMaker -operation. - -``predict`` returns the result of inference against your model. By default, the inference result a NumPy array. - -.. 
code:: python - - # Train my estimator - chainer_estimator = Chainer(entry_point='train_and_deploy.py', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='5.0.0') - chainer_estimator.fit('s3://my_bucket/my_training_data/') - - # Deploy my estimator to a SageMaker Endpoint and get a Predictor - predictor = chainer_estimator.deploy(instance_type='ml.m4.xlarge', - initial_instance_count=1) - - # `data` is a NumPy array or a Python list. - # `response` is a NumPy array. - response = predictor.predict(data) - -You use the SageMaker Chainer model server to host your Chainer model when you call ``deploy`` on an ``Chainer`` -Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates. -You can access the name of the Endpoint by the ``name`` property on the returned ``Predictor``. - - -The SageMaker Chainer Model Server -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Chainer Endpoint you create with ``deploy`` runs a SageMaker Chainer model server. -The model server loads the model that was saved by your training script and performs inference on the model in response -to SageMaker InvokeEndpoint API calls. - -You can configure two components of the SageMaker Chainer model server: Model loading and model serving. -Model loading is the process of deserializing your saved model back into an Chainer model. -Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model. - -You configure the Chainer model server by defining functions in the Python source file you passed to the Chainer constructor. - -Model loading -^^^^^^^^^^^^^ - -Before a model can be served, it must be loaded. The SageMaker Chainer model server loads your model by invoking a -``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature: - -.. code:: python - - def model_fn(model_dir) - -SageMaker will inject the directory where your model files and sub-directories, saved by ``save``, have been mounted. -Your model function should return a model object that can be used for model serving. - -SageMaker provides automated serving functions that work with Gluon API ``net`` objects and Module API ``Module`` objects. If you return either of these types of objects, then you will be able to use the default serving request handling functions. - -The following code-snippet shows an example ``model_fn`` implementation. -This loads returns a Chainer Classifier from a multi-layer perceptron class ``MLP`` that extends ``chainer.Chain``. -It loads the model parameters from a ``model.npz`` file in the SageMaker model directory ``model_dir``. - -.. code:: python - - import chainer - import os - - def model_fn(model_dir): - chainer.config.train = False - model = chainer.links.Classifier(MLP(1000, 10)) - chainer.serializers.load_npz(os.path.join(model_dir, 'model.npz'), model) - return model.predictor - -Model serving -^^^^^^^^^^^^^ - -After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model. -Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. -The SageMaker Chainer model server breaks request handling into three steps: - - -- input processing, -- prediction, and -- output processing. - -In a similar way to model loading, you configure these steps by defining functions in your Python source file. 
- -Each step involves invoking a python function, with information about the request and the return-value from the previous -function in the chain. Inside the SageMaker Chainer model server, the process looks like: - -.. code:: python - - # Deserialize the Invoke request body into an object we can perform prediction on - input_object = input_fn(request_body, request_content_type) - - # Perform prediction on the deserialized object, with the loaded model - prediction = predict_fn(input_object, model) - - # Serialize the prediction result into the desired response content type - output = output_fn(prediction, response_content_type) - -The above code-sample shows the three function definitions: - -- ``input_fn``: Takes request data and deserializes the data into an - object for prediction. -- ``predict_fn``: Takes the deserialized request object and performs - inference against the loaded model. -- ``output_fn``: Takes the result of prediction and serializes this - according to the response content type. - -The SageMaker Chainer model server provides default implementations of these functions. -You can provide your own implementations for these functions in your hosting script. -If you omit any definition then the SageMaker Chainer model server will use its default implementation for that -function. - -The ``RealTimePredictor`` used by Chainer in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format -by default, with Content-Type ``application/x-npy``. The SageMaker Chainer model server can deserialize NPY-formatted -data (along with JSON and CSV data). - -If you rely solely on the SageMaker Chainer model server defaults, you get the following functionality: - -- Prediction on models that implement the ``__call__`` method -- Serialization and deserialization of NumPy arrays. - -The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on NumPy arrays. If your model expects -a NumPy array and returns a NumPy array, then these functions do not have to be overridden when sending NPY-formatted -data. - -In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. -We describe the input arguments and expected return types of each, so you can define your own implementations. - -Input processing -'''''''''''''''' - -When an InvokeEndpoint operation is made against an Endpoint running a SageMaker Chainer model server, -the model server receives two pieces of information: - -- The request Content-Type, for example "application/x-npy" -- The request data body, a byte array - -The SageMaker Chainer model server will invoke an "input_fn" function in your hosting script, -passing in this information. If you define an ``input_fn`` function definition, -it should return an object that can be passed to ``predict_fn`` and have the following signature: - -.. code:: python - - def input_fn(request_body, request_content_type) - -Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string - -The SageMaker Chainer model server provides a default implementation of ``input_fn``. -This function deserializes JSON, CSV, or NPY encoded data into a NumPy array. - -Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For Chainer, the Python SDK -defaults to sending prediction requests with this format. - -Default json deserialization requires ``request_body`` contain a single json list. -Sending multiple json objects within the same ``request_body`` is not supported. 
-The list must have a dimensionality compatible with the model loaded in ``model_fn``. -The list's shape must be identical to the model's input shape, for all dimensions after the first (which first -dimension is the batch size). - -Default csv deserialization requires ``request_body`` contain one or more lines of CSV numerical data. -The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension. - -The example below shows a custom ``input_fn`` for preparing pickled NumPy arrays. - -.. code:: python - - import numpy as np - - def input_fn(request_body, request_content_type): - """An input_fn that loads a pickled numpy array""" - if request_content_type == "application/python-pickle": - array = np.load(StringIO(request_body)) - return array - else: - # Handle other content-types here or raise an Exception - # if the content type is not supported. - pass - - - -Prediction -'''''''''' - -After the inference request has been deserialized by ``input_fn``, the SageMaker Chainer model server invokes -``predict_fn`` on the return value of ``input_fn``. - -As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker Chainer model server default. - -The ``predict_fn`` function has the following signature: - -.. code:: python - - def predict_fn(input_object, model) - -Where ``input_object`` is the object returned from ``input_fn`` and -``model`` is the model loaded by ``model_fn``. - -The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``, -and returns the resulting value. The return-type should be a NumPy array to be compatible with the default -``output_fn``. - -The example below shows an overridden ``predict_fn``. This model accepts a Python list and returns a tuple of -bounding boxes, labels, and scores from the model in a NumPy array. This ``predict_fn`` can rely on the default -``input_fn`` and ``output_fn`` because ``input_data`` is a NumPy array, and the return value of this function is -a NumPy array. - -.. code:: python - - import chainer - import numpy as np - - def predict_fn(input_data, model): - with chainer.using_config('train', False), chainer.no_backprop_mode(): - bboxes, labels, scores = model.predict([input_data]) - bbox, label, score = bboxes[0], labels[0], scores[0] - return np.array([bbox.tolist(), label, score]) - -If you implement your own prediction function, you should take care to ensure that: - -- The first argument is expected to be the return value from input_fn. - If you use the default input_fn, this will be a NumPy array. -- The second argument is the loaded model. -- The return value should be of the correct type to be passed as the - first argument to ``output_fn``. If you use the default - ``output_fn``, this should be a NumPy array. - -Output processing -''''''''''''''''' - -After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return-value from ``predict_fn`` -and the InvokeEndpoint requested response content-type. - -The ``output_fn`` has the following signature: - -.. code:: python - - def output_fn(prediction, content_type) - -Where ``prediction`` is the result of invoking ``predict_fn`` and -``content_type`` is the InvokeEndpoint requested response content-type. -The function should return a byte array of data serialized to content_type. - -The default implementation expects ``prediction`` to be an NumPy and can serialize the result to JSON, CSV, or NPY. 
-It accepts response content types of "application/json", "text/csv", and "application/x-npy". - -Working with existing model data and training jobs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Attaching to existing training jobs -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can attach an Chainer Estimator to an existing training job using the -``attach`` method. - -.. code:: python - - my_training_job_name = "MyAwesomeChainerTrainingJob" - chainer_estimator = Chainer.attach(my_training_job_name) - -After attaching, if the training job is in a Complete status, it can be -``deploy``\ ed to create a SageMaker Endpoint and return a -``Predictor``. If the training job is in progress, -attach will block and display log messages from the training job, until the training job completes. - -The ``attach`` method accepts the following arguments: - -- ``training_job_name (str):`` The name of the training job to attach - to. -- ``sagemaker_session (sagemaker.Session or None):`` The Session used - to interact with SageMaker - -Deploying Endpoints from model data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -As well as attaching to existing training jobs, you can deploy models directly from model data in S3. -The following code sample shows how to do this, using the ``ChainerModel`` class. - -.. code:: python - - chainer_model = ChainerModel(model_data="s3://bucket/model.tar.gz", role="SageMakerRole", - entry_point="transform_script.py") - - predictor = chainer_model.deploy(instance_type="ml.c4.xlarge", initial_instance_count=1) - -The ChainerModel constructor takes the following arguments: - -- ``model_data (str):`` An S3 location of a SageMaker model data - .tar.gz file -- ``image (str):`` A Docker image URI -- ``role (str):`` An IAM role name or Arn for SageMaker to access AWS - resources on your behalf. -- ``predictor_cls (callable[string,sagemaker.Session]):`` A function to - call to create a predictor. If not None, ``deploy`` will return the - result of invoking this function on the created endpoint name -- ``env (dict[string,string]):`` Environment variables to run with - ``image`` when hosted in SageMaker. -- ``name (str):`` The model name. If None, a default model name will be - selected on each ``deploy.`` -- ``entry_point (str):`` Path (absolute or relative) to the Python file - which should be executed as the entry point to model hosting. -- ``source_dir (str):`` Optional. Path (absolute or relative) to a - directory with any other training source code dependencies including - tne entry point file. Structure within this directory will be - preserved when training on SageMaker. -- ``enable_cloudwatch_metrics (boolean):`` Optional. If true, training - and hosting containers will generate Cloudwatch metrics under the - AWS/SageMakerContainer namespace. -- ``container_log_level (int):`` Log level to use within the container. - Valid values are defined in the Python logging module. -- ``code_location (str):`` Optional. Name of the S3 bucket where your - custom code will be uploaded to. If not specified, will use the - SageMaker default bucket created by sagemaker.Session. -- ``sagemaker_session (sagemaker.Session):`` The SageMaker Session - object, used for SageMaker interaction""" - -Your model data must be a .tar.gz file in S3. SageMaker Training Job model data is saved to .tar.gz files in S3, -however if you have local data you want to deploy, you can prepare the data yourself. 
- -Assuming you have a local directory containg your model data named "my_model" you can tar and gzip compress the file and -upload to S3 using the following commands: - -:: - - tar -czf model.tar.gz my_model - aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz - -This uploads the contents of my_model to a gzip compressed tar file to S3 in the bucket "my-bucket", with the key -"my-path/model.tar.gz". - -To run this command, you'll need the aws cli tool installed. Please refer to our `FAQ <#FAQ>`__ for more information on -installing this. - -Chainer Training Examples -~~~~~~~~~~~~~~~~~~~~~~~~~ - -Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using Chainer. -Please refer to: - -https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk - -These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the "sample notebooks" folder. - - -SageMaker Chainer Docker containers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several -libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control -the environment your script runs in. - -SageMaker runs Chainer Estimator scripts in either Python 2.7 or Python 3.5. You can select the Python version by -passing a py_version keyword arg to the Chainer Estimator constructor. Setting this to py3 (the default) will cause your -training script to be run on Python 3.5. Setting this to py2 will cause your training script to be run on Python 2.7 -This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy. - -The Chainer Docker images have the following dependencies installed: - -+-----------------------------+-------------+-------------+-------------+ -| Dependencies | chainer 4.0 | chainer 4.1 | chainer 5.0 | -+-----------------------------+-------------+-------------+-------------+ -| chainer | 4.0.0 | 4.1.0 | 5.0.0 | -+-----------------------------+-------------+-------------+-------------+ -| chainercv | 0.9.0 | 0.10.0 | 0.10.0 | -+-----------------------------+-------------+-------------+-------------+ -| chainermn | 1.2.0 | 1.3.0 | N/A | -+-----------------------------+-------------+-------------+-------------+ -| CUDA (GPU image only) | 9.0 | 9.0 | 9.0 | -+-----------------------------+-------------+-------------+-------------+ -| cupy | 4.0.0 | 4.1.0 | 5.0.0 | -+-----------------------------+-------------+-------------+-------------+ -| matplotlib | 2.2.0 | 2.2.0 | 2.2.0 | -+-----------------------------+-------------+-------------+-------------+ -| mpi4py | 3.0.0 | 3.0.0 | 3.0.0 | -+-----------------------------+-------------+-------------+-------------+ -| numpy | 1.14.3 | 1.15.3 | 1.15.4 | -+-----------------------------+-------------+-------------+-------------+ -| opencv-python | 3.4.0.12 | 3.4.0.12 | 3.4.0.12 | -+-----------------------------+-------------+-------------+-------------+ -| Pillow | 5.1.0 | 5.3.0 | 5.3.0 | -+-----------------------------+-------------+-------------+-------------+ -| Python | 2.7 or 3.5 | 2.7 or 3.5 | 2.7 or 3.5 | -+-----------------------------+-------------+-------------+-------------+ - -The Docker images extend Ubuntu 16.04. - -You must select a version of Chainer by passing a ``framework_version`` keyword arg to the Chainer Estimator -constructor. 
Currently supported versions are listed in the above table. You can also set framework_version to only -specify major and minor version, which will cause your training script to be run on the latest supported patch -version of that minor version. - -Alternatively, you can build your own image by following the instructions in the SageMaker Chainer containers -repository, and passing ``image_name`` to the Chainer Estimator constructor. - +=========================================== +Using Chainer with the SageMaker Python SDK +=========================================== + +.. contents:: + +With Chainer Estimators, you can train and host Chainer models on Amazon SageMaker. + +Supported versions of Chainer: ``4.0.0``, ``4.1.0``, ``5.0.0`` + +You can visit the Chainer repository at https://github.com/chainer/chainer. + + +Training with Chainer +~~~~~~~~~~~~~~~~~~~~~ + +Training Chainer models using ``Chainer`` Estimators is a two-step process: + +1. Prepare a Chainer script to run on SageMaker +2. Run this script on SageMaker via a ``Chainer`` Estimator. + + +First, you prepare your training script, then second, you run this on SageMaker via a ``Chainer`` Estimator. +You should prepare your script in a separate source file than the notebook, terminal session, or source file you're +using to submit the script to SageMaker via a ``Chainer`` Estimator. + +Suppose that you already have an Chainer training script called +``chainer-train.py``. You can run this script in SageMaker as follows: + +.. code:: python + + from sagemaker.chainer import Chainer + chainer_estimator = Chainer(entry_point='chainer-train.py', + role='SageMakerRole', + train_instance_type='ml.p3.2xlarge', + train_instance_count=1, + framework_version='5.0.0') + chainer_estimator.fit('s3://bucket/path/to/training/data') + +Where the S3 URL is a path to your training data, within Amazon S3. The constructor keyword arguments define how +SageMaker runs your training script and are discussed in detail in a later section. + +In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, +then how to run that script on SageMaker using a ``Chainer`` Estimator. + +Preparing the Chainer training script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Your Chainer training script must be a Python 2.7 or 3.5 compatible source file. + +The training script is very similar to a training script you might run outside of SageMaker, but you +can access useful properties about the training environment through various environment variables, such as + +* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. + These artifacts are uploaded to S3 for model hosting. +* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. +* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may + include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed + and uploaded to S3 to the same S3 prefix as the model artifacts. + +Supposing two input channels, 'train' and 'test', were used in the call to the Chainer estimator's ``fit()`` method, +the following will be set, following the format "SM_CHANNEL_[channel_name]": + +* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel +* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. 
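+
+For example, the channel names come from the ``fit`` call itself. A minimal sketch (using the ``chainer_estimator``
+created above; the bucket paths are placeholders) that would produce the ``SM_CHANNEL_TRAIN`` and ``SM_CHANNEL_TEST``
+variables described above might look like this:
+
+.. code:: python
+
+    # 'train' and 'test' become SM_CHANNEL_TRAIN and SM_CHANNEL_TEST inside the container
+    chainer_estimator.fit({'train': 's3://my-bucket/chainer/train',
+                           'test': 's3://my-bucket/chainer/test'})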
+ +A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, +and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments +and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start +with the following: + +.. code:: python + + import argparse + import os + + if __name__ =='__main__': + + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. + parser.add_argument('--epochs', type=int, default=50) + parser.add_argument('--batch-size', type=int, default=64) + parser.add_argument('--learning-rate', type=float, default=0.05) + + # Data, model, and output directories + parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) + parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) + parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN']) + parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST']) + + args, _ = parser.parse_known_args() + + # ... load from args.train and args.test, train a model, write model to args.model_dir. + +Because the SageMaker imports your training script, you should put your training code in a main guard +(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not +inadvertently run your training code at the wrong point in execution. + +For more on training environment variables, please visit https://github.com/aws/sagemaker-containers. + +Using third-party libraries +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When running your training script on SageMaker, it will have access to some pre-installed third-party libraries including ``chainer``, ``numpy``, and ``cupy``. +For more information on the runtime environment, including specific package versions, see `SageMaker Chainer Docker containers <#sagemaker-chainer-docker-containers>`__. + +If there are other packages you want to use with your script, you can include a ``requirements.txt`` file in the same directory as your training script to install other dependencies at runtime. +A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. +For information about the format of a ``requirements.txt`` file, see `Requirements Files `__ in the pip documentation. + +Running a Chainer training script in SageMaker +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You run Chainer training scripts on SageMaker by creating ``Chainer`` Estimators. +SageMaker training of your script is invoked when you call ``fit`` on a ``Chainer`` Estimator. +The following code sample shows how you train a custom Chainer script "chainer-train.py", passing +in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel +directories ('train' and 'test'). + +.. 
code:: python + + chainer_estimator = Chainer('chainer-train.py', + train_instance_type='ml.p3.2xlarge', + train_instance_count=1, + framework_version='5.0.0', + hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1}) + chainer_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data', + 'test': 's3://my-data-bucket/path/to/my/test/data'}) + + +Chainer Estimators +^^^^^^^^^^^^^^^^^^ + +The `Chainer` constructor takes both required and optional arguments. + +Required arguments +'''''''''''''''''' + +The following are required arguments to the ``Chainer`` constructor. When you create a Chainer object, you must include +these in the constructor, either positionally or as keyword arguments. + +- ``entry_point`` Path (absolute or relative) to the Python file which + should be executed as the entry point to training. +- ``role`` An AWS IAM role (either name or full ARN). The Amazon + SageMaker training jobs and APIs that create Amazon SageMaker + endpoints use this role to access training data and model artifacts. + After the endpoint is created, the inference code might use the IAM + role, if accessing AWS resource. +- ``train_instance_count`` Number of Amazon EC2 instances to use for + training. +- ``train_instance_type`` Type of EC2 instance to use for training, for + example, 'ml.m4.xlarge'. + +Optional arguments +'''''''''''''''''' + +The following are optional arguments. When you create a ``Chainer`` object, you can specify these as keyword arguments. + +- ``source_dir`` Path (absolute or relative) to a directory with any + other training source code dependencies including the entry point + file. Structure within this directory will be preserved when training + on SageMaker. +- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with + any additional libraries that will be exported to the container (default: []). + The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. + If the ```source_dir``` points to S3, code will be uploaded and the S3 location will be used + instead. Example: + + The following call + >>> Chainer(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) + results in the following inside the container: + + >>> $ ls + + >>> opt/ml/code + >>> ├── train.py + >>> ├── common + >>> └── virtual-env + +- ``hyperparameters`` Hyperparameters that will be used for training. + Will be made accessible as a dict[str, str] to the training code on + SageMaker. For convenience, accepts other types besides str, but + str() will be called on keys and values to convert them before + training. +- ``py_version`` Python version you want to use for executing your + model training code. +- ``train_volume_size`` Size in GB of the EBS volume to use for storing + input data during training. Must be large enough to store training + data if input_mode='File' is used (which is the default). +- ``train_max_run`` Timeout in seconds for training, after which Amazon + SageMaker terminates the job regardless of its current status. +- ``input_mode`` The input mode that the algorithm supports. Valid + modes: 'File' - Amazon SageMaker copies the training dataset from the + s3 location to a directory in the Docker container. 'Pipe' - Amazon + SageMaker streams data directly from s3 to the container via a Unix + named pipe. +- ``output_path`` s3 location where you want the training result (model + artifacts and optional output files) saved. 
If not specified, results + are stored to a default bucket. If the bucket with the specific name + does not exist, the estimator creates the bucket during the fit() + method execution. +- ``output_kms_key`` Optional KMS key ID to optionally encrypt training + output with. +- ``job_name`` Name to assign for the training job that the fit() + method launches. If not specified, the estimator generates a default + job name, based on the training image name and current timestamp +- ``image_name`` An alternative docker image to use for training and + serving. If specified, the estimator will use this image for training and + hosting, instead of selecting the appropriate SageMaker official image based on + framework_version and py_version. Refer to: `SageMaker Chainer Docker Containers + <#sagemaker-chainer-docker-containers>`__ for details on what the Official images support + and where to find the source code to build your custom image. + + +Distributed Chainer Training +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + +Chainer allows you to train a model on multiple nodes using ChainerMN_, which distributes training with MPI. + +.. _ChainerMN: https://github.com/chainer/chainermn + +In order to run distributed Chainer training on SageMaker, your training script should use a ``chainermn`` Communicator +object to coordinate training between multiple hosts. + +SageMaker runs your script with ``mpirun`` if ``train_instance_count`` is greater than two. +The following are optional arguments modify how MPI runs your distributed training script. + +- ``use_mpi`` Boolean that overrides whether to run your training script with MPI. +- ``num_processes`` Integer that determines how many total processes to run with MPI. By default, this is equal to ``process_slots_per_host`` times the number of nodes. +- ``process_slots_per_host`` Integer that determines how many processes can be run on each host. By default, this is equal to one process per host on CPU instances, or one process per GPU on GPU instances. +- ``additional_mpi_options`` String of additional options to pass to the ``mpirun`` command. + + +Calling fit +^^^^^^^^^^^ + +You start your training script by calling ``fit`` on a ``Chainer`` Estimator. ``fit`` takes both required and optional +arguments. + +Required arguments +'''''''''''''''''' + +- ``inputs``: This can take one of the following forms: A string + s3 URI, for example ``s3://my-bucket/my-training-data``. In this + case, the s3 objects rooted at the ``my-training-data`` prefix will + be available in the default ``train`` channel. A dict from + string channel names to s3 URIs. In this case, the objects rooted at + each s3 prefix will available as files in each channel directory. + +For example: + +.. code:: python + + {'train':'s3://my-bucket/my-training-data', + 'eval':'s3://my-bucket/my-evaluation-data'} + +.. optional-arguments-1: + +Optional arguments +'''''''''''''''''' + +- ``wait``: Defaults to True, whether to block and wait for the + training script to complete before returning. +- ``logs``: Defaults to True, whether to show logs produced by training + job in the Python session. Only meaningful when wait is True. + + +Saving models +~~~~~~~~~~~~~ + +In order to save your trained Chainer model for deployment on SageMaker, your training script should save your model +to a certain filesystem path called `model_dir`. This value is accessible through the environment variable +``SM_MODEL_DIR``. 
The following code demonstrates how to save a trained Chainer model named ``model`` as
+``model.npz`` at the end of training:
+
+.. code:: python
+
+    import chainer
+    import argparse
+    import os
+
+    if __name__=='__main__':
+        parser = argparse.ArgumentParser()
+
+        # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable.
+        parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
+        args, _ = parser.parse_known_args()
+
+        # ... train `model`, then save it to `model_dir` as file 'model.npz'
+        chainer.serializers.save_npz(os.path.join(args.model_dir, 'model.npz'), model)
+
+After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data
+will be available in the S3 ``output_path`` you specified when you created the Chainer Estimator.
+
+Deploying Chainer models
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+After a Chainer Estimator has been fit, you can host the newly created model in SageMaker.
+
+After calling ``fit``, you can call ``deploy`` on a ``Chainer`` Estimator to create a SageMaker Endpoint.
+The Endpoint runs a SageMaker-provided Chainer model server and hosts the model produced by your training script,
+which was run when you called ``fit``. This was the model you saved to ``model_dir``.
+
+``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your Chainer model.
+Each ``Predictor`` provides a ``predict`` method which can do inference with NumPy arrays or Python lists.
+Inference arrays or lists are serialized and sent to the Chainer model server by an ``InvokeEndpoint`` SageMaker
+operation.
+
+``predict`` returns the result of inference against your model. By default, the inference result is a NumPy array.
+
+.. code:: python
+
+    # Train my estimator
+    chainer_estimator = Chainer(entry_point='train_and_deploy.py',
+                                train_instance_type='ml.p3.2xlarge',
+                                train_instance_count=1,
+                                framework_version='5.0.0')
+    chainer_estimator.fit('s3://my_bucket/my_training_data/')
+
+    # Deploy my estimator to a SageMaker Endpoint and get a Predictor
+    predictor = chainer_estimator.deploy(instance_type='ml.m4.xlarge',
+                                         initial_instance_count=1)
+
+    # `data` is a NumPy array or a Python list.
+    # `response` is a NumPy array.
+    response = predictor.predict(data)
+
+You use the SageMaker Chainer model server to host your Chainer model when you call ``deploy`` on a ``Chainer``
+Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates.
+You can access the name of the Endpoint through the ``name`` property on the returned ``Predictor``.
+
+
+The SageMaker Chainer Model Server
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Chainer Endpoint you create with ``deploy`` runs a SageMaker Chainer model server.
+The model server loads the model that was saved by your training script and performs inference on the model in response
+to SageMaker InvokeEndpoint API calls.
+
+You can configure two components of the SageMaker Chainer model server: model loading and model serving.
+Model loading is the process of deserializing your saved model back into a Chainer model.
+Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model.
+
+You configure the Chainer model server by defining functions in the Python source file you passed to the Chainer constructor.
+
+Model loading
+^^^^^^^^^^^^^
+
+Before a model can be served, it must be loaded.
The SageMaker Chainer model server loads your model by invoking a
+``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature:
+
+.. code:: python
+
+    def model_fn(model_dir)
+
+SageMaker passes in the directory where your model files and sub-directories, saved by your training script, have been mounted.
+Your model function should return a model object that can be used for model serving.
+
+If the object you return can be called directly on the input, as Chainer links and chains can, you can rely on the
+default serving request handling functions.
+
+The following code snippet shows an example ``model_fn`` implementation.
+It builds a Chainer ``Classifier`` from a multi-layer perceptron class ``MLP`` that extends ``chainer.Chain``,
+loads the model parameters from a ``model.npz`` file in the SageMaker model directory ``model_dir``,
+and returns the underlying predictor.
+
+.. code:: python
+
+    import chainer
+    import os
+
+    def model_fn(model_dir):
+        chainer.config.train = False
+        model = chainer.links.Classifier(MLP(1000, 10))
+        chainer.serializers.load_npz(os.path.join(model_dir, 'model.npz'), model)
+        return model.predictor
+
+Model serving
+^^^^^^^^^^^^^
+
+After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model.
+Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls.
+The SageMaker Chainer model server breaks request handling into three steps:
+
+- input processing,
+- prediction, and
+- output processing.
+
+In a similar way to model loading, you configure these steps by defining functions in your Python source file.
+
+Each step involves invoking a Python function, with information about the request and the return value from the previous
+function in the chain. Inside the SageMaker Chainer model server, the process looks like:
+
+.. code:: python
+
+    # Deserialize the Invoke request body into an object we can perform prediction on
+    input_object = input_fn(request_body, request_content_type)
+
+    # Perform prediction on the deserialized object, with the loaded model
+    prediction = predict_fn(input_object, model)
+
+    # Serialize the prediction result into the desired response content type
+    output = output_fn(prediction, response_content_type)
+
+The above code sample shows the three function definitions:
+
+- ``input_fn``: Takes request data and deserializes the data into an
+  object for prediction.
+- ``predict_fn``: Takes the deserialized request object and performs
+  inference against the loaded model.
+- ``output_fn``: Takes the result of prediction and serializes this
+  according to the response content type.
+
+The SageMaker Chainer model server provides default implementations of these functions.
+You can provide your own implementations for these functions in your hosting script.
+If you omit any definition then the SageMaker Chainer model server will use its default implementation for that
+function.
+
+The ``RealTimePredictor`` used by Chainer in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format
+by default, with Content-Type ``application/x-npy``. The SageMaker Chainer model server can deserialize NPY-formatted
+data (along with JSON and CSV data).
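+
+If you prefer to send JSON instead of the default NPY format, you can swap the serializer on the returned predictor.
+The following is only a minimal sketch, assuming ``predictor`` was returned by ``chainer_estimator.deploy(...)``:
+
+.. code:: python
+
+    from sagemaker.predictor import json_serializer, json_deserializer
+
+    # send requests as "application/json" and parse JSON responses
+    predictor.content_type = 'application/json'
+    predictor.serializer = json_serializer
+    predictor.accept = 'application/json'
+    predictor.deserializer = json_deserializer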
+
+If you rely solely on the SageMaker Chainer model server defaults, you get the following functionality:
+
+- Prediction on models that implement the ``__call__`` method
+- Serialization and deserialization of NumPy arrays.
+
+The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on NumPy arrays. If your model expects
+a NumPy array and returns a NumPy array, then these functions do not have to be overridden when sending NPY-formatted
+data.
+
+In the following sections we describe the default implementations of ``input_fn``, ``predict_fn``, and ``output_fn``.
+We describe the input arguments and expected return types of each, so you can define your own implementations.
+
+Input processing
+''''''''''''''''
+
+When an InvokeEndpoint operation is made against an Endpoint running a SageMaker Chainer model server,
+the model server receives two pieces of information:
+
+- The request Content-Type, for example "application/x-npy"
+- The request data body, a byte array
+
+The SageMaker Chainer model server will invoke an ``input_fn`` function in your hosting script,
+passing in this information. If you provide your own ``input_fn``,
+it should return an object that can be passed to ``predict_fn`` and have the following signature:
+
+.. code:: python
+
+    def input_fn(request_body, request_content_type)
+
+Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string.
+
+The SageMaker Chainer model server provides a default implementation of ``input_fn``.
+This function deserializes JSON, CSV, or NPY encoded data into a NumPy array.
+
+Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For Chainer, the Python SDK
+defaults to sending prediction requests with this format.
+
+Default JSON deserialization requires ``request_body`` to contain a single JSON list.
+Sending multiple JSON objects within the same ``request_body`` is not supported.
+The list must have a dimensionality compatible with the model loaded in ``model_fn``.
+The list's shape must be identical to the model's input shape, for all dimensions after the first (the first
+dimension is the batch size).
+
+Default CSV deserialization requires ``request_body`` to contain one or more lines of CSV numerical data.
+The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension.
+
+The example below shows a custom ``input_fn`` for preparing pickled NumPy arrays.
+
+.. code:: python
+
+    import numpy as np
+    from six import BytesIO
+
+    def input_fn(request_body, request_content_type):
+        """An input_fn that loads a pickled numpy array"""
+        if request_content_type == "application/python-pickle":
+            array = np.load(BytesIO(request_body))
+            return array
+        else:
+            # Handle other content-types here or raise an Exception
+            # if the content type is not supported.
+            pass
+
+Prediction
+''''''''''
+
+After the inference request has been deserialized by ``input_fn``, the SageMaker Chainer model server invokes
+``predict_fn`` on the return value of ``input_fn``.
+
+As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker Chainer model server default.
+
+The ``predict_fn`` function has the following signature:
+
+.. code:: python
+
+    def predict_fn(input_object, model)
+
+Where ``input_object`` is the object returned from ``input_fn`` and
+``model`` is the model loaded by ``model_fn``.
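+
+As described below, the default implementation simply calls the model on the input. If your endpoint runs on a GPU
+instance, you may want to move the model and the input to the GPU inside your own ``predict_fn``. The following is
+only a sketch, not the model server's built-in behavior, and assumes the model returns a single ``chainer.Variable``:
+
+.. code:: python
+
+    import chainer
+    import numpy as np
+
+    def predict_fn(input_data, model):
+        # use the first GPU if one is available, otherwise stay on the CPU
+        # (in practice you would usually move the model to the GPU once, in model_fn)
+        if chainer.backends.cuda.available:
+            model.to_gpu(0)
+            input_data = chainer.backends.cuda.to_gpu(np.asarray(input_data), 0)
+        with chainer.using_config('train', False), chainer.no_backprop_mode():
+            output = model(input_data)
+        # always return a NumPy array so the default output_fn can serialize it
+        return chainer.backends.cuda.to_cpu(output.array)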
+ +The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``, +and returns the resulting value. The return-type should be a NumPy array to be compatible with the default +``output_fn``. + +The example below shows an overridden ``predict_fn``. This model accepts a Python list and returns a tuple of +bounding boxes, labels, and scores from the model in a NumPy array. This ``predict_fn`` can rely on the default +``input_fn`` and ``output_fn`` because ``input_data`` is a NumPy array, and the return value of this function is +a NumPy array. + +.. code:: python + + import chainer + import numpy as np + + def predict_fn(input_data, model): + with chainer.using_config('train', False), chainer.no_backprop_mode(): + bboxes, labels, scores = model.predict([input_data]) + bbox, label, score = bboxes[0], labels[0], scores[0] + return np.array([bbox.tolist(), label, score]) + +If you implement your own prediction function, you should take care to ensure that: + +- The first argument is expected to be the return value from input_fn. + If you use the default input_fn, this will be a NumPy array. +- The second argument is the loaded model. +- The return value should be of the correct type to be passed as the + first argument to ``output_fn``. If you use the default + ``output_fn``, this should be a NumPy array. + +Output processing +''''''''''''''''' + +After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return-value from ``predict_fn`` +and the InvokeEndpoint requested response content-type. + +The ``output_fn`` has the following signature: + +.. code:: python + + def output_fn(prediction, content_type) + +Where ``prediction`` is the result of invoking ``predict_fn`` and +``content_type`` is the InvokeEndpoint requested response content-type. +The function should return a byte array of data serialized to content_type. + +The default implementation expects ``prediction`` to be an NumPy and can serialize the result to JSON, CSV, or NPY. +It accepts response content types of "application/json", "text/csv", and "application/x-npy". + +Working with existing model data and training jobs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Attaching to existing training jobs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You can attach an Chainer Estimator to an existing training job using the +``attach`` method. + +.. code:: python + + my_training_job_name = "MyAwesomeChainerTrainingJob" + chainer_estimator = Chainer.attach(my_training_job_name) + +After attaching, if the training job is in a Complete status, it can be +``deploy``\ ed to create a SageMaker Endpoint and return a +``Predictor``. If the training job is in progress, +attach will block and display log messages from the training job, until the training job completes. + +The ``attach`` method accepts the following arguments: + +- ``training_job_name (str):`` The name of the training job to attach + to. +- ``sagemaker_session (sagemaker.Session or None):`` The Session used + to interact with SageMaker + +Deploying Endpoints from model data +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As well as attaching to existing training jobs, you can deploy models directly from model data in S3. +The following code sample shows how to do this, using the ``ChainerModel`` class. + +.. 
code:: python
+
+    chainer_model = ChainerModel(model_data="s3://bucket/model.tar.gz", role="SageMakerRole",
+                                 entry_point="transform_script.py")
+
+    predictor = chainer_model.deploy(instance_type="ml.c4.xlarge", initial_instance_count=1)
+
+The ChainerModel constructor takes the following arguments:
+
+- ``model_data (str):`` An S3 location of a SageMaker model data
+  .tar.gz file
+- ``image (str):`` A Docker image URI
+- ``role (str):`` An IAM role name or ARN for SageMaker to access AWS
+  resources on your behalf.
+- ``predictor_cls (callable[string,sagemaker.Session]):`` A function to
+  call to create a predictor. If not None, ``deploy`` will return the
+  result of invoking this function on the created endpoint name.
+- ``env (dict[string,string]):`` Environment variables to run with
+  ``image`` when hosted in SageMaker.
+- ``name (str):`` The model name. If None, a default model name will be
+  selected on each ``deploy``.
+- ``entry_point (str):`` Path (absolute or relative) to the Python file
+  which should be executed as the entry point to model hosting.
+- ``source_dir (str):`` Optional. Path (absolute or relative) to a
+  directory with any other training source code dependencies including
+  the entry point file. Structure within this directory will be
+  preserved when training on SageMaker.
+- ``enable_cloudwatch_metrics (boolean):`` Optional. If true, training
+  and hosting containers will generate CloudWatch metrics under the
+  AWS/SageMakerContainer namespace.
+- ``container_log_level (int):`` Log level to use within the container.
+  Valid values are defined in the Python logging module.
+- ``code_location (str):`` Optional. Name of the S3 bucket where your
+  custom code will be uploaded to. If not specified, will use the
+  SageMaker default bucket created by sagemaker.Session.
+- ``sagemaker_session (sagemaker.Session):`` The SageMaker Session
+  object, used for SageMaker interaction.
+
+Your model data must be a .tar.gz file in S3. SageMaker Training Job model data is saved to .tar.gz files in S3;
+however, if you have local data you want to deploy, you can prepare the data yourself.
+
+Assuming you have a local directory named "my_model" containing your model data, you can tar and gzip compress it and
+upload it to S3 using the following commands:
+
+::
+
+    tar -czf model.tar.gz my_model
+    aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz
+
+This packages the contents of my_model into a gzip-compressed tar file and uploads it to S3 in the bucket "my-bucket",
+with the key "my-path/model.tar.gz".
+
+To run these commands, you need the AWS CLI installed. Please refer to our `FAQ <#FAQ>`__ for more information on
+installing it.
+
+Chainer Training Examples
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using Chainer.
+Please refer to:
+
+https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk
+
+These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the "sample notebooks" folder.
+
+
+SageMaker Chainer Docker containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When you train a model or deploy it from a training script, SageMaker runs your Python script in a Docker container with several
+libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control
+the environment your script runs in.
+
+SageMaker runs Chainer Estimator scripts in either Python 2.7 or Python 3.5.
You can select the Python version by +passing a py_version keyword arg to the Chainer Estimator constructor. Setting this to py3 (the default) will cause your +training script to be run on Python 3.5. Setting this to py2 will cause your training script to be run on Python 2.7 +This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy. + +The Chainer Docker images have the following dependencies installed: + ++-----------------------------+-------------+-------------+-------------+ +| Dependencies | chainer 4.0 | chainer 4.1 | chainer 5.0 | ++-----------------------------+-------------+-------------+-------------+ +| chainer | 4.0.0 | 4.1.0 | 5.0.0 | ++-----------------------------+-------------+-------------+-------------+ +| chainercv | 0.9.0 | 0.10.0 | 0.10.0 | ++-----------------------------+-------------+-------------+-------------+ +| chainermn | 1.2.0 | 1.3.0 | N/A | ++-----------------------------+-------------+-------------+-------------+ +| CUDA (GPU image only) | 9.0 | 9.0 | 9.0 | ++-----------------------------+-------------+-------------+-------------+ +| cupy | 4.0.0 | 4.1.0 | 5.0.0 | ++-----------------------------+-------------+-------------+-------------+ +| matplotlib | 2.2.0 | 2.2.0 | 2.2.0 | ++-----------------------------+-------------+-------------+-------------+ +| mpi4py | 3.0.0 | 3.0.0 | 3.0.0 | ++-----------------------------+-------------+-------------+-------------+ +| numpy | 1.14.3 | 1.15.3 | 1.15.4 | ++-----------------------------+-------------+-------------+-------------+ +| opencv-python | 3.4.0.12 | 3.4.0.12 | 3.4.0.12 | ++-----------------------------+-------------+-------------+-------------+ +| Pillow | 5.1.0 | 5.3.0 | 5.3.0 | ++-----------------------------+-------------+-------------+-------------+ +| Python | 2.7 or 3.5 | 2.7 or 3.5 | 2.7 or 3.5 | ++-----------------------------+-------------+-------------+-------------+ + +The Docker images extend Ubuntu 16.04. + +You must select a version of Chainer by passing a ``framework_version`` keyword arg to the Chainer Estimator +constructor. Currently supported versions are listed in the above table. You can also set framework_version to only +specify major and minor version, which will cause your training script to be run on the latest supported patch +version of that minor version. + +Alternatively, you can build your own image by following the instructions in the SageMaker Chainer containers +repository, and passing ``image_name`` to the Chainer Estimator constructor. + You can visit the SageMaker Chainer containers repository here: https://github.com/aws/sagemaker-chainer-containers/ \ No newline at end of file diff --git a/doc/using_pytorch.rst b/doc/using_pytorch.rst index d9239cdbcd..4e89d58347 100644 --- a/doc/using_pytorch.rst +++ b/doc/using_pytorch.rst @@ -1,731 +1,731 @@ -=========================================== -Using PyTorch with the SageMaker Python SDK -=========================================== - -.. contents:: - -With PyTorch Estimators and Models, you can train and host PyTorch models on Amazon SageMaker. - -Supported versions of PyTorch: ``0.4.0``, ``1.0.0``. - -We recommend that you use the latest supported version, because that's where we focus most of our development efforts. - -You can visit the PyTorch repository at https://github.com/pytorch/pytorch. - -Training with PyTorch ------------------------- - -Training PyTorch models using ``PyTorch`` Estimators is a two-step process: - -1. 
Prepare a PyTorch script to run on SageMaker -2. Run this script on SageMaker via a ``PyTorch`` Estimator. - - -First, you prepare your training script, then second, you run this on SageMaker via a ``PyTorch`` Estimator. -You should prepare your script in a separate source file than the notebook, terminal session, or source file you're -using to submit the script to SageMaker via a ``PyTorch`` Estimator. This will be discussed in further detail below. - -Suppose that you already have a PyTorch training script called `pytorch-train.py`. -You can then setup a ``PyTorch`` Estimator with keyword arguments to point to this script and define how SageMaker runs it: - -.. code:: python - - from sagemaker.pytorch import PyTorch - - pytorch_estimator = PyTorch(entry_point='pytorch-train.py', - role='SageMakerRole', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='1.0.0') - -After that, you simply tell the estimator to start a training job and provide an S3 URL -that is the path to your training data within Amazon S3: - -.. code:: python - - pytorch_estimator.fit('s3://bucket/path/to/training/data') - -In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, -then how to run that script on SageMaker using a ``PyTorch`` Estimator. - - -Preparing the PyTorch Training Script -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Your PyTorch training script must be a Python 2.7 or 3.5 compatible source file. - -The training script is very similar to a training script you might run outside of SageMaker, but you -can access useful properties about the training environment through various environment variables, such as - -* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. - These artifacts are uploaded to S3 for model hosting. -* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. -* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may - include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed - and uploaded to S3 to the same S3 prefix as the model artifacts. - -Supposing two input channels, 'train' and 'test', were used in the call to the PyTorch estimator's ``fit`` method, -the following will be set, following the format "SM_CHANNEL_[channel_name]": - -* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel -* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. - -A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, -and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments -and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start -with the following: - -.. code:: python - - import argparse - import os - - if __name__ =='__main__': - - parser = argparse.ArgumentParser() - - # hyperparameters sent by the client are passed as command-line arguments to the script. 
- parser.add_argument('--epochs', type=int, default=50) - parser.add_argument('--batch-size', type=int, default=64) - parser.add_argument('--learning-rate', type=float, default=0.05) - parser.add_argument('--use-cuda', type=bool, default=False) - - # Data, model, and output directories - parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR']) - parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) - parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN']) - parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST']) - - args, _ = parser.parse_known_args() - - # ... load from args.train and args.test, train a model, write model to args.model_dir. - -Because the SageMaker imports your training script, you should put your training code in a main guard -(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not -inadvertently run your training code at the wrong point in execution. - -Note that SageMaker doesn't support argparse actions. If you want to use, for example, boolean hyperparameters, -you need to specify `type` as `bool` in your script and provide an explicit `True` or `False` value for this hyperparameter -when instantiating PyTorch Estimator. - -For more on training environment variables, please visit `SageMaker Containers `_. - -Using third-party libraries -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When running your training script on SageMaker, it will have access to some pre-installed third-party libraries including ``torch``, ``torchvisopm``, and ``numpy``. -For more information on the runtime environment, including specific package versions, see `SageMaker PyTorch Docker containers <#id4>`__. - -If there are other packages you want to use with your script, you can include a ``requirements.txt`` file in the same directory as your training script to install other dependencies at runtime. -A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. -For information about the format of a ``requirements.txt`` file, see `Requirements Files `__ in the pip documentation. - -Running a PyTorch training script in SageMaker -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You run PyTorch training scripts on SageMaker by creating ``PyTorch`` Estimators. -SageMaker training of your script is invoked when you call ``fit`` on a ``PyTorch`` Estimator. -The following code sample shows how you train a custom PyTorch script "pytorch-train.py", passing -in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel -directories ('train' and 'test'). - -.. code:: python - - pytorch_estimator = PyTorch('pytorch-train.py', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='1.0.0', - hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1}) - pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data', - 'test': 's3://my-data-bucket/path/to/my/test/data'}) - - -PyTorch Estimators ------------------- - -The `PyTorch` constructor takes both required and optional arguments. - -Required arguments -~~~~~~~~~~~~~~~~~~ - -The following are required arguments to the ``PyTorch`` constructor. When you create a PyTorch object, you must include -these in the constructor, either positionally or as keyword arguments. 
- -- ``entry_point`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type`` Type of EC2 instance to use for training, for - example, 'ml.m4.xlarge'. - -Optional arguments -~~~~~~~~~~~~~~~~~~ - -The following are optional arguments. When you create a ``PyTorch`` object, you can specify these as keyword arguments. - -- ``source_dir`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. -- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: []). - The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ```source_dir``` points to S3, code will be uploaded and the S3 location will be used - instead. Example: - - The following call - >>> PyTorch(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) - results in the following inside the container: - - >>> $ ls - - >>> opt/ml/code - >>> ├── train.py - >>> ├── common - >>> └── virtual-env - -- ``hyperparameters`` Hyperparameters that will be used for training. - Will be made accessible as a dict[str, str] to the training code on - SageMaker. For convenience, accepts other types besides strings, but - ``str`` will be called on keys and values to convert them before - training. -- ``py_version`` Python version you want to use for executing your - model training code. -- ``framework_version`` PyTorch version you want to use for executing - your model training code. You can find the list of supported versions - in `the section below <#sagemaker-pytorch-docker-containers>`__. -- ``train_volume_size`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to store training - data if input_mode='File' is used (which is the default). -- ``train_max_run`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``input_mode`` The input mode that the algorithm supports. Valid - modes: 'File' - Amazon SageMaker copies the training dataset from the - S3 location to a directory in the Docker container. 'Pipe' - Amazon - SageMaker streams data directly from S3 to the container via a Unix - named pipe. -- ``output_path`` S3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the ``fit`` - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``job_name`` Name to assign for the training job that the ``fit``` - method launches. 
If not specified, the estimator generates a default - job name, based on the training image name and current timestamp -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - framework_version and py_version. Refer to: `SageMaker PyTorch Docker Containers - <#id4>`_ for details on what the Official images support - and where to find the source code to build your custom image. - -Calling fit -~~~~~~~~~~~ - -You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional -arguments. - -Required arguments -'''''''''''''''''' - -- ``inputs``: This can take one of the following forms: A string - S3 URI, for example ``s3://my-bucket/my-training-data``. In this - case, the S3 objects rooted at the ``my-training-data`` prefix will - be available in the default ``train`` channel. A dict from - string channel names to S3 URIs. In this case, the objects rooted at - each S3 prefix will available as files in each channel directory. - -For example: - -.. code:: python - - {'train':'s3://my-bucket/my-training-data', - 'eval':'s3://my-bucket/my-evaluation-data'} - -.. optional-arguments-1: - -Optional arguments -'''''''''''''''''' - -- ``wait``: Defaults to True, whether to block and wait for the - training script to complete before returning. -- ``logs``: Defaults to True, whether to show logs produced by training - job in the Python session. Only meaningful when wait is True. - - -Distributed PyTorch Training ----------------------------- - -You can run a multi-machine, distributed PyTorch training using the PyTorch Estimator. By default, PyTorch objects will -submit single-machine training jobs to SageMaker. If you set ``train_instance_count`` to be greater than one, multi-machine -training jobs will be launched when ``fit`` is called. When you run multi-machine training, SageMaker will import your -training script and run it on each host in the cluster. - -To initialize distributed training in your script you would call ``dist.init_process_group`` providing desired backend -and rank and setting 'WORLD_SIZE' environment variable similar to how you would do it outside of SageMaker using -environment variable initialization: - -.. code:: python - - if args.distributed: - # Initialize the distributed environment. - world_size = len(args.hosts) - os.environ['WORLD_SIZE'] = str(world_size) - host_rank = args.hosts.index(args.current_host) - dist.init_process_group(backend=args.backend, rank=host_rank) - -SageMaker sets 'MASTER_ADDR' and 'MASTER_PORT' environment variables for you, but you can overwrite them. - -Supported backends: -- `gloo` and `tcp` for cpu instances -- `gloo` and `nccl` for gpu instances - -Saving models -------------- - -In order to save your trained PyTorch model for deployment on SageMaker, your training script should save your model -to a certain filesystem path called ``model_dir``. This value is accessible through the environment variable -``SM_MODEL_DIR``. The following code demonstrates how to save a trained PyTorch model named ``model`` as -``model.pth`` at the : - -.. code:: python - - import argparse - import os - import torch - - if __name__=='__main__': - # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable. 
- parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) - args, _ = parser.parse_known_args() - - # ... train `model`, then save it to `model_dir` - with open(os.path.join(args.model_dir, 'model.pth'), 'wb') as f: - torch.save(model.state_dict(), f) - -After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data -will be available in the S3 ``output_path`` you specified when you created the PyTorch Estimator. - -Deploying PyTorch Models ------------------------- - -After an PyTorch Estimator has been fit, you can host the newly created model in SageMaker. - -After calling ``fit``, you can call ``deploy`` on a ``PyTorch`` Estimator to create a SageMaker Endpoint. -The Endpoint runs a SageMaker-provided PyTorch model server and hosts the model produced by your training script, -which was run when you called ``fit``. This was the model you saved to ``model_dir``. - -``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your PyTorch model. -Each ``Predictor`` provides a ``predict`` method which can do inference with numpy arrays or Python lists. -Inference arrays or lists are serialized and sent to the PyTorch model server by an ``InvokeEndpoint`` SageMaker -operation. - -``predict`` returns the result of inference against your model. By default, the inference result a NumPy array. - -.. code:: python - - # Train my estimator - pytorch_estimator = PyTorch(entry_point='train_and_deploy.py', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1, - framework_version='1.0.0') - pytorch_estimator.fit('s3://my_bucket/my_training_data/') - - # Deploy my estimator to a SageMaker Endpoint and get a Predictor - predictor = pytorch_estimator.deploy(instance_type='ml.m4.xlarge', - initial_instance_count=1) - - # `data` is a NumPy array or a Python list. - # `response` is a NumPy array. - response = predictor.predict(data) - -You use the SageMaker PyTorch model server to host your PyTorch model when you call ``deploy`` on an ``PyTorch`` -Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates. -You can access the name of the Endpoint by the ``name`` property on the returned ``Predictor``. - - -The SageMaker PyTorch Model Server ----------------------------------- - -The PyTorch Endpoint you create with ``deploy`` runs a SageMaker PyTorch model server. -The model server loads the model that was saved by your training script and performs inference on the model in response -to SageMaker InvokeEndpoint API calls. - -You can configure two components of the SageMaker PyTorch model server: Model loading and model serving. -Model loading is the process of deserializing your saved model back into an PyTorch model. -Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model. - -You configure the PyTorch model server by defining functions in the Python source file you passed to the PyTorch constructor. - -Model loading -~~~~~~~~~~~~~ - -Before a model can be served, it must be loaded. The SageMaker PyTorch model server loads your model by invoking a -``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature: - -.. code:: python - - def model_fn(model_dir) - -SageMaker will inject the directory where your model files and sub-directories, saved by ``save``, have been mounted. 
-Your model function should return a model object that can be used for model serving. - -The following code-snippet shows an example ``model_fn`` implementation. -It loads the model parameters from a ``model.pth`` file in the SageMaker model directory ``model_dir``. - -.. code:: python - - import torch - import os - - def model_fn(model_dir): - model = Your_Model() - with open(os.path.join(model_dir, 'model.pth'), 'rb') as f: - model.load_state_dict(torch.load(f)) - return model - -Model serving -~~~~~~~~~~~~~ - -After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model. -Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. -The SageMaker PyTorch model server breaks request handling into three steps: - - -- input processing, -- prediction, and -- output processing. - -In a similar way to model loading, you configure these steps by defining functions in your Python source file. - -Each step involves invoking a python function, with information about the request and the return value from the previous -function in the chain. Inside the SageMaker PyTorch model server, the process looks like: - -.. code:: python - - # Deserialize the Invoke request body into an object we can perform prediction on - input_object = input_fn(request_body, request_content_type) - - # Perform prediction on the deserialized object, with the loaded model - prediction = predict_fn(input_object, model) - - # Serialize the prediction result into the desired response content type - output = output_fn(prediction, response_content_type) - -The above code sample shows the three function definitions: - -- ``input_fn``: Takes request data and deserializes the data into an - object for prediction. -- ``predict_fn``: Takes the deserialized request object and performs - inference against the loaded model. -- ``output_fn``: Takes the result of prediction and serializes this - according to the response content type. - -The SageMaker PyTorch model server provides default implementations of these functions. -You can provide your own implementations for these functions in your hosting script. -If you omit any definition then the SageMaker PyTorch model server will use its default implementation for that -function. - -The ``RealTimePredictor`` used by PyTorch in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format -by default, with Content-Type ``application/x-npy``. The SageMaker PyTorch model server can deserialize NPY-formatted -data (along with JSON and CSV data). - -If you rely solely on the SageMaker PyTorch model server defaults, you get the following functionality: - -- Prediction on models that implement the ``__call__`` method -- Serialization and deserialization of torch.Tensor. - -The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on torch.Tensors. If your model expects -a torch.Tensor and returns a torch.Tensor, then these functions do not have to be overridden when sending NPY-formatted -data. - -In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. -We describe the input arguments and expected return types of each, so you can define your own implementations. 
- -Input processing -'''''''''''''''' - -When an InvokeEndpoint operation is made against an Endpoint running a SageMaker PyTorch model server, -the model server receives two pieces of information: - -- The request Content-Type, for example "application/x-npy" -- The request data body, a byte array - -The SageMaker PyTorch model server will invoke an ``input_fn`` function in your hosting script, -passing in this information. If you define an ``input_fn`` function definition, -it should return an object that can be passed to ``predict_fn`` and have the following signature: - -.. code:: python - - def input_fn(request_body, request_content_type) - -Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string - -The SageMaker PyTorch model server provides a default implementation of ``input_fn``. -This function deserializes JSON, CSV, or NPY encoded data into a torch.Tensor. - -Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For PyTorch, the Python SDK -defaults to sending prediction requests with this format. - -Default JSON deserialization requires ``request_body`` contain a single json list. -Sending multiple JSON objects within the same ``request_body`` is not supported. -The list must have a dimensionality compatible with the model loaded in ``model_fn``. -The list's shape must be identical to the model's input shape, for all dimensions after the first (which first -dimension is the batch size). - -Default csv deserialization requires ``request_body`` contain one or more lines of CSV numerical data. -The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension. - -The example below shows a custom ``input_fn`` for preparing pickled torch.Tensor. - -.. code:: python - - import numpy as np - import torch - from six import BytesIO - - def input_fn(request_body, request_content_type): - """An input_fn that loads a pickled tensor""" - if request_content_type == 'application/python-pickle': - return torch.load(BytesIO(request_body)) - else: - # Handle other content-types here or raise an Exception - # if the content type is not supported. - pass - - - -Prediction -'''''''''' - -After the inference request has been deserialized by ``input_fn``, the SageMaker PyTorch model server invokes -``predict_fn`` on the return value of ``input_fn``. - -As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker PyTorch model server default. - -The ``predict_fn`` function has the following signature: - -.. code:: python - - def predict_fn(input_object, model) - -Where ``input_object`` is the object returned from ``input_fn`` and -``model`` is the model loaded by ``model_fn``. - -The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``, -and returns the resulting value. The return-type should be a torch.Tensor to be compatible with the default -``output_fn``. - -The example below shows an overridden ``predict_fn``: - -.. code:: python - - import torch - import numpy as np - - def predict_fn(input_data, model): - device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') - model.to(device) - model.eval() - with torch.no_grad(): - return model(input_data.to(device)) - -If you implement your own prediction function, you should take care to ensure that: - -- The first argument is expected to be the return value from input_fn. - If you use the default input_fn, this will be a torch.Tensor. 
-- The second argument is the loaded model. -- The return value should be of the correct type to be passed as the - first argument to ``output_fn``. If you use the default - ``output_fn``, this should be a torch.Tensor. - -Output processing -''''''''''''''''' - -After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return value from ``predict_fn`` -and the content type for the response, as specified by the InvokeEndpoint request. - -The ``output_fn`` has the following signature: - -.. code:: python - - def output_fn(prediction, content_type) - -Where ``prediction`` is the result of invoking ``predict_fn`` and -the content type for the response, as specified by the InvokeEndpoint request. -The function should return a byte array of data serialized to content_type. - -The default implementation expects ``prediction`` to be a torch.Tensor and can serialize the result to JSON, CSV, or NPY. -It accepts response content types of "application/json", "text/csv", and "application/x-npy". - -Working with Existing Model Data and Training Jobs --------------------------------------------------- - -Attaching to existing training jobs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can attach an PyTorch Estimator to an existing training job using the -``attach`` method. - -.. code:: python - - my_training_job_name = 'MyAwesomePyTorchTrainingJob' - pytorch_estimator = PyTorch.attach(my_training_job_name) - -After attaching, if the training job has finished with job status "Completed", it can be -``deploy``\ ed to create a SageMaker Endpoint and return a -``Predictor``. If the training job is in progress, -attach will block and display log messages from the training job, until the training job completes. - -The ``attach`` method accepts the following arguments: - -- ``training_job_name:`` The name of the training job to attach - to. -- ``sagemaker_session:`` The Session used - to interact with SageMaker - -Deploying Endpoints from model data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -As well as attaching to existing training jobs, you can deploy models directly from model data in S3. -The following code sample shows how to do this, using the ``PyTorchModel`` class. - -.. code:: python - - pytorch_model = PyTorchModel(model_data='s3://bucket/model.tar.gz', role='SageMakerRole', - entry_point='transform_script.py') - - predictor = pytorch_model.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1) - -The PyTorchModel constructor takes the following arguments: - -- ``model_dat:`` An S3 location of a SageMaker model data - .tar.gz file -- ``image:`` A Docker image URI -- ``role:`` An IAM role name or Arn for SageMaker to access AWS - resources on your behalf. -- ``predictor_cls:`` A function to - call to create a predictor. If not None, ``deploy`` will return the - result of invoking this function on the created endpoint name -- ``env:`` Environment variables to run with - ``image`` when hosted in SageMaker. -- ``name:`` The model name. If None, a default model name will be - selected on each ``deploy.`` -- ``entry_point:`` Path (absolute or relative) to the Python file - which should be executed as the entry point to model hosting. -- ``source_dir:`` Optional. Path (absolute or relative) to a - directory with any other training source code dependencies including - tne entry point file. Structure within this directory will be - preserved when training on SageMaker. -- ``enable_cloudwatch_metrics:`` Optional. 
If true, training - and hosting containers will generate Cloudwatch metrics under the - AWS/SageMakerContainer namespace. -- ``container_log_level:`` Log level to use within the container. - Valid values are defined in the Python logging module. -- ``code_location:`` Optional. Name of the S3 bucket where your - custom code will be uploaded to. If not specified, will use the - SageMaker default bucket created by sagemaker.Session. -- ``sagemaker_session:`` The SageMaker Session - object, used for SageMaker interaction - -Your model data must be a .tar.gz file in S3. SageMaker Training Job model data is saved to .tar.gz files in S3, -however if you have local data you want to deploy, you can prepare the data yourself. - -Assuming you have a local directory containg your model data named "my_model" you can tar and gzip compress the file and -upload to S3 using the following commands: - -:: - - tar -czf model.tar.gz my_model - aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz - -This uploads the contents of my_model to a gzip compressed tar file to S3 in the bucket "my-bucket", with the key -"my-path/model.tar.gz". - -To run this command, you'll need the AWS CLI tool installed. Please refer to our `FAQ`_ for more information on -installing this. - -.. _FAQ: ../../../README.rst#faq - -PyTorch Training Examples -------------------------- - -Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using PyTorch. -Please refer to: - -https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk - -These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the sample notebooks folder. - - -SageMaker PyTorch Docker containers ------------------------------------ - -When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several -libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control -the environment your script runs in. - -SageMaker runs PyTorch Estimator scripts in either Python 2 or Python 3. You can select the Python version by -passing a ``py_version`` keyword arg to the PyTorch Estimator constructor. Setting this to `py3` (the default) will cause your -training script to be run on Python 3.5. Setting this to `py2` will cause your training script to be run on Python 2.7 -This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy. 
- -The PyTorch Docker images have the following dependencies installed: - -+-----------------------------+---------------+-------------------+ -| Dependencies | pytorch 0.4.0 | pytorch 1.0.0 | -+-----------------------------+---------------+-------------------+ -| boto3 | >=1.7.35 | >=1.9.11 | -+-----------------------------+---------------+-------------------+ -| botocore | >=1.10.35 | >=1.12.11 | -+-----------------------------+---------------+-------------------+ -| CUDA (GPU image only) | 9.0 | 9.0 | -+-----------------------------+---------------+-------------------+ -| numpy | >=1.14.3 | >=1.15.2 | -+-----------------------------+---------------+-------------------+ -| Pillow | >=5.1.0 | >=5.2.0 | -+-----------------------------+---------------+-------------------+ -| pip | >=10.0.1 | >=18.0 | -+-----------------------------+---------------+-------------------+ -| python-dateutil | >=2.7.3 | >=2.7.3 | -+-----------------------------+---------------+-------------------+ -| retrying | >=1.3.3 | >=1.3.3 | -+-----------------------------+---------------+-------------------+ -| s3transfer | >=0.1.13 | >=0.1.13 | -+-----------------------------+---------------+-------------------+ -| sagemaker-containers | >=2.1.0 | >=2.1.0 | -+-----------------------------+---------------+-------------------+ -| sagemaker-pytorch-container | 1.0 | 1.0 | -+-----------------------------+---------------+-------------------+ -| setuptools | >=39.2.0 | >=40.4.3 | -+-----------------------------+---------------+-------------------+ -| six | >=1.11.0 | >=1.11.0 | -+-----------------------------+---------------+-------------------+ -| torch | 0.4.0 | 1.0.0 | -+-----------------------------+---------------+-------------------+ -| torchvision | 0.2.1 | 0.2.1 | -+-----------------------------+---------------+-------------------+ -| Python | 2.7 or 3.5 | 2.7 or 3.6 | -+-----------------------------+---------------+-------------------+ - -The Docker images extend Ubuntu 16.04. - -If you need to install other dependencies you can put them into `requirements.txt` file and put it in the source directory -(``source_dir``) you provide to the `PyTorch Estimator <#pytorch-estimators>`__. - -You can select version of PyTorch by passing a ``framework_version`` keyword arg to the PyTorch Estimator constructor. -Currently supported versions are listed in the above table. You can also set ``framework_version`` to only specify major and -minor version, which will cause your training script to be run on the latest supported patch version of that minor -version. - -Alternatively, you can build your own image by following the instructions in the SageMaker Chainer containers -repository, and passing ``image_name`` to the Chainer Estimator constructor. - +=========================================== +Using PyTorch with the SageMaker Python SDK +=========================================== + +.. contents:: + +With PyTorch Estimators and Models, you can train and host PyTorch models on Amazon SageMaker. + +Supported versions of PyTorch: ``0.4.0``, ``1.0.0``. + +We recommend that you use the latest supported version, because that's where we focus most of our development efforts. + +You can visit the PyTorch repository at https://github.com/pytorch/pytorch. + +Training with PyTorch +------------------------ + +Training PyTorch models using ``PyTorch`` Estimators is a two-step process: + +1. Prepare a PyTorch script to run on SageMaker +2. Run this script on SageMaker via a ``PyTorch`` Estimator. 
+ + +First, you prepare your training script, then second, you run this on SageMaker via a ``PyTorch`` Estimator. +You should prepare your script in a separate source file than the notebook, terminal session, or source file you're +using to submit the script to SageMaker via a ``PyTorch`` Estimator. This will be discussed in further detail below. + +Suppose that you already have a PyTorch training script called `pytorch-train.py`. +You can then setup a ``PyTorch`` Estimator with keyword arguments to point to this script and define how SageMaker runs it: + +.. code:: python + + from sagemaker.pytorch import PyTorch + + pytorch_estimator = PyTorch(entry_point='pytorch-train.py', + role='SageMakerRole', + train_instance_type='ml.p3.2xlarge', + train_instance_count=1, + framework_version='1.0.0') + +After that, you simply tell the estimator to start a training job and provide an S3 URL +that is the path to your training data within Amazon S3: + +.. code:: python + + pytorch_estimator.fit('s3://bucket/path/to/training/data') + +In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, +then how to run that script on SageMaker using a ``PyTorch`` Estimator. + + +Preparing the PyTorch Training Script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Your PyTorch training script must be a Python 2.7 or 3.5 compatible source file. + +The training script is very similar to a training script you might run outside of SageMaker, but you +can access useful properties about the training environment through various environment variables, such as + +* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. + These artifacts are uploaded to S3 for model hosting. +* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. +* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may + include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed + and uploaded to S3 to the same S3 prefix as the model artifacts. + +Supposing two input channels, 'train' and 'test', were used in the call to the PyTorch estimator's ``fit`` method, +the following will be set, following the format "SM_CHANNEL_[channel_name]": + +* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel +* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. + +A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, +and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments +and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start +with the following: + +.. code:: python + + import argparse + import os + + if __name__ =='__main__': + + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. 
+        parser.add_argument('--epochs', type=int, default=50)
+        parser.add_argument('--batch-size', type=int, default=64)
+        parser.add_argument('--learning-rate', type=float, default=0.05)
+        parser.add_argument('--use-cuda', type=bool, default=False)
+
+        # Data, model, and output directories
+        parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
+        parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
+        parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
+        parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])
+
+        args, _ = parser.parse_known_args()
+
+        # ... load from args.train and args.test, train a model, write model to args.model_dir.
+
+Because SageMaker imports your training script, you should put your training code in a main guard
+(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not
+inadvertently run your training code at the wrong point in execution.
+
+Note that SageMaker doesn't support argparse actions. If you want to use, for example, boolean hyperparameters,
+you need to specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this
+hyperparameter when instantiating the PyTorch Estimator.
+
+For more on training environment variables, please visit `SageMaker Containers `_.
+
+Using third-party libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When running your training script on SageMaker, it will have access to some pre-installed third-party libraries including ``torch``, ``torchvision``, and ``numpy``.
+For more information on the runtime environment, including specific package versions, see `SageMaker PyTorch Docker containers <#id4>`__.
+
+If there are other packages you want to use with your script, you can include a ``requirements.txt`` file in the same directory as your training script to install other dependencies at runtime.
+A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install.
+For information about the format of a ``requirements.txt`` file, see `Requirements Files `__ in the pip documentation.
+
+Running a PyTorch training script in SageMaker
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You run PyTorch training scripts on SageMaker by creating ``PyTorch`` Estimators.
+SageMaker training of your script is invoked when you call ``fit`` on a ``PyTorch`` Estimator.
+The following code sample shows how you train a custom PyTorch script "pytorch-train.py", passing
+in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel
+directories ('train' and 'test').
+
+.. code:: python
+
+    pytorch_estimator = PyTorch('pytorch-train.py',
+                                train_instance_type='ml.p3.2xlarge',
+                                train_instance_count=1,
+                                framework_version='1.0.0',
+                                hyperparameters={'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1})
+    pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
+                           'test': 's3://my-data-bucket/path/to/my/test/data'})
+
+
+PyTorch Estimators
+------------------
+
+The ``PyTorch`` constructor takes both required and optional arguments.
+
+Required arguments
+~~~~~~~~~~~~~~~~~~
+
+The following are required arguments to the ``PyTorch`` constructor. When you create a PyTorch object, you must include
+these in the constructor, either positionally or as keyword arguments.
+ +- ``entry_point`` Path (absolute or relative) to the Python file which + should be executed as the entry point to training. +- ``role`` An AWS IAM role (either name or full ARN). The Amazon + SageMaker training jobs and APIs that create Amazon SageMaker + endpoints use this role to access training data and model artifacts. + After the endpoint is created, the inference code might use the IAM + role, if accessing AWS resource. +- ``train_instance_count`` Number of Amazon EC2 instances to use for + training. +- ``train_instance_type`` Type of EC2 instance to use for training, for + example, 'ml.m4.xlarge'. + +Optional arguments +~~~~~~~~~~~~~~~~~~ + +The following are optional arguments. When you create a ``PyTorch`` object, you can specify these as keyword arguments. + +- ``source_dir`` Path (absolute or relative) to a directory with any + other training source code dependencies including the entry point + file. Structure within this directory will be preserved when training + on SageMaker. +- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with + any additional libraries that will be exported to the container (default: []). + The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. + If the ```source_dir``` points to S3, code will be uploaded and the S3 location will be used + instead. Example: + + The following call + >>> PyTorch(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) + results in the following inside the container: + + >>> $ ls + + >>> opt/ml/code + >>> ├── train.py + >>> ├── common + >>> └── virtual-env + +- ``hyperparameters`` Hyperparameters that will be used for training. + Will be made accessible as a dict[str, str] to the training code on + SageMaker. For convenience, accepts other types besides strings, but + ``str`` will be called on keys and values to convert them before + training. +- ``py_version`` Python version you want to use for executing your + model training code. +- ``framework_version`` PyTorch version you want to use for executing + your model training code. You can find the list of supported versions + in `the section below <#sagemaker-pytorch-docker-containers>`__. +- ``train_volume_size`` Size in GB of the EBS volume to use for storing + input data during training. Must be large enough to store training + data if input_mode='File' is used (which is the default). +- ``train_max_run`` Timeout in seconds for training, after which Amazon + SageMaker terminates the job regardless of its current status. +- ``input_mode`` The input mode that the algorithm supports. Valid + modes: 'File' - Amazon SageMaker copies the training dataset from the + S3 location to a directory in the Docker container. 'Pipe' - Amazon + SageMaker streams data directly from S3 to the container via a Unix + named pipe. +- ``output_path`` S3 location where you want the training result (model + artifacts and optional output files) saved. If not specified, results + are stored to a default bucket. If the bucket with the specific name + does not exist, the estimator creates the bucket during the ``fit`` + method execution. +- ``output_kms_key`` Optional KMS key ID to optionally encrypt training + output with. +- ``job_name`` Name to assign for the training job that the ``fit``` + method launches. 
If not specified, the estimator generates a default
+  job name, based on the training image name and current timestamp.
+- ``image_name`` An alternative Docker image to use for training and
+  serving. If specified, the estimator will use this image for training and
+  hosting, instead of selecting the appropriate SageMaker official image based on
+  framework_version and py_version. Refer to: `SageMaker PyTorch Docker Containers
+  <#id4>`_ for details on what the official images support
+  and where to find the source code to build your custom image.
+
+Calling fit
+~~~~~~~~~~~
+
+You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
+arguments.
+
+Required arguments
+''''''''''''''''''
+
+- ``inputs``: This can take one of the following forms: A string
+  S3 URI, for example ``s3://my-bucket/my-training-data``. In this
+  case, the S3 objects rooted at the ``my-training-data`` prefix will
+  be available in the default ``train`` channel. A dict from
+  string channel names to S3 URIs. In this case, the objects rooted at
+  each S3 prefix will be available as files in each channel directory.
+
+For example:
+
+.. code:: python
+
+   {'train':'s3://my-bucket/my-training-data',
+    'eval':'s3://my-bucket/my-evaluation-data'}
+
+.. optional-arguments-1:
+
+Optional arguments
+''''''''''''''''''
+
+- ``wait``: Defaults to True, whether to block and wait for the
+  training script to complete before returning.
+- ``logs``: Defaults to True, whether to show logs produced by the training
+  job in the Python session. Only meaningful when wait is True.
+
+
+Distributed PyTorch Training
+----------------------------
+
+You can run multi-machine, distributed PyTorch training using the PyTorch Estimator. By default, PyTorch objects will
+submit single-machine training jobs to SageMaker. If you set ``train_instance_count`` to be greater than one, multi-machine
+training jobs will be launched when ``fit`` is called. When you run multi-machine training, SageMaker will import your
+training script and run it on each host in the cluster.
+
+To initialize distributed training in your script, call ``dist.init_process_group``, providing the desired backend and
+rank, and set the 'WORLD_SIZE' environment variable, similar to how you would initialize distributed training outside
+of SageMaker using environment variable initialization:
+
+.. code:: python
+
+    if args.distributed:
+        # Initialize the distributed environment.
+        world_size = len(args.hosts)
+        os.environ['WORLD_SIZE'] = str(world_size)
+        host_rank = args.hosts.index(args.current_host)
+        dist.init_process_group(backend=args.backend, rank=host_rank)
+
+SageMaker sets 'MASTER_ADDR' and 'MASTER_PORT' environment variables for you, but you can overwrite them.
+
+Supported backends:
+
+- ``gloo`` and ``tcp`` for CPU instances
+- ``gloo`` and ``nccl`` for GPU instances
+
+Saving models
+-------------
+
+In order to save your trained PyTorch model for deployment on SageMaker, your training script should save your model
+to a certain filesystem path called ``model_dir``. This value is accessible through the environment variable
+``SM_MODEL_DIR``. The following code demonstrates how to save a trained PyTorch model named ``model`` as
+``model.pth`` in the location specified by ``model_dir``:
+
+.. code:: python
+
+    import argparse
+    import os
+    import torch
+
+    if __name__=='__main__':
+        parser = argparse.ArgumentParser()
+        # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable.
+        parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
+        args, _ = parser.parse_known_args()
+
+        # ... train `model`, then save it to `model_dir`
+        with open(os.path.join(args.model_dir, 'model.pth'), 'wb') as f:
+            torch.save(model.state_dict(), f)
+
+After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data
+will be available in the S3 ``output_path`` you specified when you created the PyTorch Estimator.
+
+Deploying PyTorch Models
+------------------------
+
+After a PyTorch Estimator has been fit, you can host the newly created model in SageMaker.
+
+After calling ``fit``, you can call ``deploy`` on a ``PyTorch`` Estimator to create a SageMaker Endpoint.
+The Endpoint runs a SageMaker-provided PyTorch model server and hosts the model produced by your training script,
+which was run when you called ``fit``. This was the model you saved to ``model_dir``.
+
+``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your PyTorch model.
+Each ``Predictor`` provides a ``predict`` method which can do inference with NumPy arrays or Python lists.
+Inference arrays or lists are serialized and sent to the PyTorch model server by an ``InvokeEndpoint`` SageMaker
+operation.
+
+``predict`` returns the result of inference against your model. By default, the inference result is a NumPy array.
+
+.. code:: python
+
+    # Train my estimator
+    pytorch_estimator = PyTorch(entry_point='train_and_deploy.py',
+                                train_instance_type='ml.p3.2xlarge',
+                                train_instance_count=1,
+                                framework_version='1.0.0')
+    pytorch_estimator.fit('s3://my_bucket/my_training_data/')
+
+    # Deploy my estimator to a SageMaker Endpoint and get a Predictor
+    predictor = pytorch_estimator.deploy(instance_type='ml.m4.xlarge',
+                                         initial_instance_count=1)
+
+    # `data` is a NumPy array or a Python list.
+    # `response` is a NumPy array.
+    response = predictor.predict(data)
+
+You use the SageMaker PyTorch model server to host your PyTorch model when you call ``deploy`` on a ``PyTorch``
+Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates.
+You can access the name of the Endpoint through the ``name`` property on the returned ``Predictor``.
+
+
+The SageMaker PyTorch Model Server
+----------------------------------
+
+The PyTorch Endpoint you create with ``deploy`` runs a SageMaker PyTorch model server.
+The model server loads the model that was saved by your training script and performs inference on the model in response
+to SageMaker InvokeEndpoint API calls.
+
+You can configure two components of the SageMaker PyTorch model server: model loading and model serving.
+Model loading is the process of deserializing your saved model back into a PyTorch model.
+Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model.
+
+You configure the PyTorch model server by defining functions in the Python source file you passed to the PyTorch constructor.
+
+Model loading
+~~~~~~~~~~~~~
+
+Before a model can be served, it must be loaded. The SageMaker PyTorch model server loads your model by invoking a
+``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature:
+
+.. code:: python
+
+    def model_fn(model_dir)
+
+SageMaker will inject the directory where your model files and sub-directories, saved by your training script, have been mounted.
+Your model function should return a model object that can be used for model serving. + +The following code-snippet shows an example ``model_fn`` implementation. +It loads the model parameters from a ``model.pth`` file in the SageMaker model directory ``model_dir``. + +.. code:: python + + import torch + import os + + def model_fn(model_dir): + model = Your_Model() + with open(os.path.join(model_dir, 'model.pth'), 'rb') as f: + model.load_state_dict(torch.load(f)) + return model + +Model serving +~~~~~~~~~~~~~ + +After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model. +Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. +The SageMaker PyTorch model server breaks request handling into three steps: + + +- input processing, +- prediction, and +- output processing. + +In a similar way to model loading, you configure these steps by defining functions in your Python source file. + +Each step involves invoking a python function, with information about the request and the return value from the previous +function in the chain. Inside the SageMaker PyTorch model server, the process looks like: + +.. code:: python + + # Deserialize the Invoke request body into an object we can perform prediction on + input_object = input_fn(request_body, request_content_type) + + # Perform prediction on the deserialized object, with the loaded model + prediction = predict_fn(input_object, model) + + # Serialize the prediction result into the desired response content type + output = output_fn(prediction, response_content_type) + +The above code sample shows the three function definitions: + +- ``input_fn``: Takes request data and deserializes the data into an + object for prediction. +- ``predict_fn``: Takes the deserialized request object and performs + inference against the loaded model. +- ``output_fn``: Takes the result of prediction and serializes this + according to the response content type. + +The SageMaker PyTorch model server provides default implementations of these functions. +You can provide your own implementations for these functions in your hosting script. +If you omit any definition then the SageMaker PyTorch model server will use its default implementation for that +function. + +The ``RealTimePredictor`` used by PyTorch in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format +by default, with Content-Type ``application/x-npy``. The SageMaker PyTorch model server can deserialize NPY-formatted +data (along with JSON and CSV data). + +If you rely solely on the SageMaker PyTorch model server defaults, you get the following functionality: + +- Prediction on models that implement the ``__call__`` method +- Serialization and deserialization of torch.Tensor. + +The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on torch.Tensors. If your model expects +a torch.Tensor and returns a torch.Tensor, then these functions do not have to be overridden when sending NPY-formatted +data. + +In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. +We describe the input arguments and expected return types of each, so you can define your own implementations. 
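+
+Before looking at each function individually, the following is a minimal, illustrative sketch (not the model server's
+default implementation) of how the three functions might fit together in a hosting script that only accepts and returns
+JSON. The content-type checks and the float-tensor conversion here are assumptions you would adapt to your own model:
+
+.. code:: python
+
+    import json
+    import torch
+
+    def input_fn(request_body, request_content_type):
+        # Deserialize a JSON-encoded list into a float tensor (assumes the model expects float input).
+        if request_content_type == 'application/json':
+            data = json.loads(request_body.decode('utf-8') if isinstance(request_body, bytes) else request_body)
+            return torch.tensor(data, dtype=torch.float32)
+        raise ValueError('Unsupported content type: {}'.format(request_content_type))
+
+    def predict_fn(input_object, model):
+        # Run the loaded model on the deserialized input without tracking gradients.
+        with torch.no_grad():
+            return model(input_object)
+
+    def output_fn(prediction, response_content_type):
+        # Serialize the prediction back into the requested content type.
+        if response_content_type == 'application/json':
+            return json.dumps(prediction.tolist())
+        raise ValueError('Unsupported content type: {}'.format(response_content_type))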
+ +Input processing +'''''''''''''''' + +When an InvokeEndpoint operation is made against an Endpoint running a SageMaker PyTorch model server, +the model server receives two pieces of information: + +- The request Content-Type, for example "application/x-npy" +- The request data body, a byte array + +The SageMaker PyTorch model server will invoke an ``input_fn`` function in your hosting script, +passing in this information. If you define an ``input_fn`` function definition, +it should return an object that can be passed to ``predict_fn`` and have the following signature: + +.. code:: python + + def input_fn(request_body, request_content_type) + +Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string + +The SageMaker PyTorch model server provides a default implementation of ``input_fn``. +This function deserializes JSON, CSV, or NPY encoded data into a torch.Tensor. + +Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For PyTorch, the Python SDK +defaults to sending prediction requests with this format. + +Default JSON deserialization requires ``request_body`` contain a single json list. +Sending multiple JSON objects within the same ``request_body`` is not supported. +The list must have a dimensionality compatible with the model loaded in ``model_fn``. +The list's shape must be identical to the model's input shape, for all dimensions after the first (which first +dimension is the batch size). + +Default csv deserialization requires ``request_body`` contain one or more lines of CSV numerical data. +The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension. + +The example below shows a custom ``input_fn`` for preparing pickled torch.Tensor. + +.. code:: python + + import numpy as np + import torch + from six import BytesIO + + def input_fn(request_body, request_content_type): + """An input_fn that loads a pickled tensor""" + if request_content_type == 'application/python-pickle': + return torch.load(BytesIO(request_body)) + else: + # Handle other content-types here or raise an Exception + # if the content type is not supported. + pass + + + +Prediction +'''''''''' + +After the inference request has been deserialized by ``input_fn``, the SageMaker PyTorch model server invokes +``predict_fn`` on the return value of ``input_fn``. + +As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker PyTorch model server default. + +The ``predict_fn`` function has the following signature: + +.. code:: python + + def predict_fn(input_object, model) + +Where ``input_object`` is the object returned from ``input_fn`` and +``model`` is the model loaded by ``model_fn``. + +The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``, +and returns the resulting value. The return-type should be a torch.Tensor to be compatible with the default +``output_fn``. + +The example below shows an overridden ``predict_fn``: + +.. code:: python + + import torch + import numpy as np + + def predict_fn(input_data, model): + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + model.to(device) + model.eval() + with torch.no_grad(): + return model(input_data.to(device)) + +If you implement your own prediction function, you should take care to ensure that: + +- The first argument is expected to be the return value from input_fn. + If you use the default input_fn, this will be a torch.Tensor. 
+- The second argument is the loaded model.
+- The return value should be of the correct type to be passed as the
+  first argument to ``output_fn``. If you use the default
+  ``output_fn``, this should be a torch.Tensor.
+
+Output processing
+'''''''''''''''''
+
+After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return value from ``predict_fn``
+and the content type for the response, as specified by the InvokeEndpoint request.
+
+The ``output_fn`` has the following signature:
+
+.. code:: python
+
+    def output_fn(prediction, content_type)
+
+Where ``prediction`` is the result of invoking ``predict_fn`` and
+``content_type`` is the content type for the response, as specified by the InvokeEndpoint request.
+The function should return a byte array of data serialized to content_type.
+
+The default implementation expects ``prediction`` to be a torch.Tensor and can serialize the result to JSON, CSV, or NPY.
+It accepts response content types of "application/json", "text/csv", and "application/x-npy".
+
+Working with Existing Model Data and Training Jobs
+--------------------------------------------------
+
+Attaching to existing training jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can attach a PyTorch Estimator to an existing training job using the
+``attach`` method.
+
+.. code:: python
+
+    my_training_job_name = 'MyAwesomePyTorchTrainingJob'
+    pytorch_estimator = PyTorch.attach(my_training_job_name)
+
+After attaching, if the training job has finished with job status "Completed", it can be
+``deploy``\ ed to create a SageMaker Endpoint and return a
+``Predictor``. If the training job is in progress,
+``attach`` will block and display log messages from the training job, until the training job completes.
+
+The ``attach`` method accepts the following arguments:
+
+- ``training_job_name:`` The name of the training job to attach
+  to.
+- ``sagemaker_session:`` The Session used
+  to interact with SageMaker.
+
+Deploying Endpoints from model data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As well as attaching to existing training jobs, you can deploy models directly from model data in S3.
+The following code sample shows how to do this, using the ``PyTorchModel`` class.
+
+.. code:: python
+
+    pytorch_model = PyTorchModel(model_data='s3://bucket/model.tar.gz', role='SageMakerRole',
+                                 entry_point='transform_script.py')
+
+    predictor = pytorch_model.deploy(instance_type='ml.c4.xlarge', initial_instance_count=1)
+
+The PyTorchModel constructor takes the following arguments:
+
+- ``model_data:`` An S3 location of a SageMaker model data
+  .tar.gz file.
+- ``image:`` A Docker image URI.
+- ``role:`` An IAM role name or ARN for SageMaker to access AWS
+  resources on your behalf.
+- ``predictor_cls:`` A function to
+  call to create a predictor. If not None, ``deploy`` will return the
+  result of invoking this function on the created endpoint name.
+- ``env:`` Environment variables to run with
+  ``image`` when hosted in SageMaker.
+- ``name:`` The model name. If None, a default model name will be
+  selected on each ``deploy``.
+- ``entry_point:`` Path (absolute or relative) to the Python file
+  which should be executed as the entry point to model hosting.
+- ``source_dir:`` Optional. Path (absolute or relative) to a
+  directory with any other training source code dependencies including
+  the entry point file. Structure within this directory will be
+  preserved when training on SageMaker.
+- ``enable_cloudwatch_metrics:`` Optional.
If true, training
+  and hosting containers will generate CloudWatch metrics under the
+  AWS/SageMakerContainer namespace.
+- ``container_log_level:`` Log level to use within the container.
+  Valid values are defined in the Python logging module.
+- ``code_location:`` Optional. Name of the S3 bucket where your
+  custom code will be uploaded to. If not specified, will use the
+  SageMaker default bucket created by sagemaker.Session.
+- ``sagemaker_session:`` The SageMaker Session
+  object, used for SageMaker interaction.
+
+Your model data must be a .tar.gz file in S3. SageMaker Training Job model data is saved to .tar.gz files in S3;
+however, if you have local model data that you want to deploy, you can prepare the data yourself.
+
+Assuming you have a local directory named "my_model" that contains your model data, you can compress it into a gzipped
+tar file and upload it to S3 using the following commands:
+
+::
+
+    tar -czf model.tar.gz my_model
+    aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz
+
+This compresses the contents of my_model into a gzipped tar file and uploads it to S3 in the bucket "my-bucket", with
+the key "my-path/model.tar.gz".
+
+To run these commands, you need the AWS CLI installed. Please refer to our `FAQ`_ for more information on
+installing it.
+
+.. _FAQ: ../../../README.rst#faq
+
+PyTorch Training Examples
+-------------------------
+
+Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using PyTorch.
+Please refer to:
+
+https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk
+
+These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the sample notebooks folder.
+
+
+SageMaker PyTorch Docker containers
+-----------------------------------
+
+When you train and deploy models with a training script, SageMaker runs your Python script in a Docker container with several
+libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control
+the environment your script runs in.
+
+SageMaker runs PyTorch Estimator scripts in either Python 2 or Python 3. You can select the Python version by
+passing a ``py_version`` keyword argument to the PyTorch Estimator constructor. Setting this to ``py3`` (the default) will cause your
+training script to be run on Python 3.5. Setting this to ``py2`` will cause your training script to be run on Python 2.7.
+This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy.
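+
+For example, the following sketch (the entry point, role, and instance settings are illustrative) pins both the Python
+version and the PyTorch version when constructing the estimator:
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    # Train with Python 3 and PyTorch 1.0.0; adjust these values to match your own script and account.
+    pytorch_estimator = PyTorch(entry_point='pytorch-train.py',
+                                role='SageMakerRole',
+                                train_instance_type='ml.p3.2xlarge',
+                                train_instance_count=1,
+                                framework_version='1.0.0',
+                                py_version='py3')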
+
+The PyTorch Docker images have the following dependencies installed:
+
++-----------------------------+---------------+-------------------+
+| Dependencies                | pytorch 0.4.0 | pytorch 1.0.0     |
++-----------------------------+---------------+-------------------+
+| boto3                       | >=1.7.35      | >=1.9.11          |
++-----------------------------+---------------+-------------------+
+| botocore                    | >=1.10.35     | >=1.12.11         |
++-----------------------------+---------------+-------------------+
+| CUDA (GPU image only)       | 9.0           | 9.0               |
++-----------------------------+---------------+-------------------+
+| numpy                       | >=1.14.3      | >=1.15.2          |
++-----------------------------+---------------+-------------------+
+| Pillow                      | >=5.1.0       | >=5.2.0           |
++-----------------------------+---------------+-------------------+
+| pip                         | >=10.0.1      | >=18.0            |
++-----------------------------+---------------+-------------------+
+| python-dateutil             | >=2.7.3       | >=2.7.3           |
++-----------------------------+---------------+-------------------+
+| retrying                    | >=1.3.3       | >=1.3.3           |
++-----------------------------+---------------+-------------------+
+| s3transfer                  | >=0.1.13      | >=0.1.13          |
++-----------------------------+---------------+-------------------+
+| sagemaker-containers        | >=2.1.0       | >=2.1.0           |
++-----------------------------+---------------+-------------------+
+| sagemaker-pytorch-container | 1.0           | 1.0               |
++-----------------------------+---------------+-------------------+
+| setuptools                  | >=39.2.0      | >=40.4.3          |
++-----------------------------+---------------+-------------------+
+| six                         | >=1.11.0      | >=1.11.0          |
++-----------------------------+---------------+-------------------+
+| torch                       | 0.4.0         | 1.0.0             |
++-----------------------------+---------------+-------------------+
+| torchvision                 | 0.2.1         | 0.2.1             |
++-----------------------------+---------------+-------------------+
+| Python                      | 2.7 or 3.5    | 2.7 or 3.6        |
++-----------------------------+---------------+-------------------+
+
+The Docker images extend Ubuntu 16.04.
+
+If you need to install other dependencies, you can put them into a ``requirements.txt`` file and put it in the source directory
+(``source_dir``) you provide to the `PyTorch Estimator <#pytorch-estimators>`__.
+
+You can select the version of PyTorch by passing a ``framework_version`` keyword argument to the PyTorch Estimator constructor.
+Currently supported versions are listed in the table above. You can also set ``framework_version`` to specify only the major and
+minor version, which will cause your training script to be run on the latest supported patch version of that minor
+version.
+
+Alternatively, you can build your own image by following the instructions in the SageMaker PyTorch containers
+repository, and passing ``image_name`` to the PyTorch Estimator constructor.
+
 You can visit `the SageMaker PyTorch containers repository `_.
\ No newline at end of file
diff --git a/doc/using_rl.rst b/doc/using_rl.rst
index b32ee0ed18..959d60b3ae 100644
--- a/doc/using_rl.rst
+++ b/doc/using_rl.rst
@@ -1,322 +1,322 @@
-==========================================================
-Using Reinforcement Learning with the SageMaker Python SDK
-==========================================================
-
-.. contents::
-
-With Reinforcement Learning (RL) Estimators, you can train reinforcement learning models on Amazon SageMaker.
-
-Supported versions of Coach: ``0.11.1``, ``0.10.1`` with TensorFlow, ``0.11.0`` with TensorFlow or MXNet.
-For more information about Coach, see https://github.com/NervanaSystems/coach - -Supported versions of Ray: ``0.5.3`` with TensorFlow. -For more information about Ray, see https://github.com/ray-project/ray - -RL Training ------------ - -Training RL models using ``RLEstimator`` is a two-step process: - -1. Prepare a training script to run on SageMaker -2. Run this script on SageMaker via an ``RlEstimator``. - -You should prepare your script in a separate source file than the notebook, terminal session, or source file you're -using to submit the script to SageMaker via an ``RlEstimator``. This will be discussed in further detail below. - -Suppose that you already have a training script called ``coach-train.py``. -You can then create an ``RLEstimator`` with keyword arguments to point to this script and define how SageMaker runs it: - -.. code:: python - - from sagemaker.rl import RLEstimator, RLToolkit, RLFramework - - rl_estimator = RLEstimator(entry_point='coach-train.py', - toolkit=RLToolkit.COACH, - toolkit_version='0.11.1', - framework=RLFramework.TENSORFLOW, - role='SageMakerRole', - train_instance_type='ml.p3.2xlarge', - train_instance_count=1) - -After that, you simply tell the estimator to start a training job: - -.. code:: python - - rl_estimator.fit() - -In the following sections, we'll discuss how to prepare a training script for execution on SageMaker -and how to run that script on SageMaker using ``RLEstimator``. - - -Preparing the RL Training Script -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Your RL training script must be a Python 3.5 compatible source file from MXNet framework or Python 3.6 for TensorFlow. - -The training script is very similar to a training script you might run outside of SageMaker, but you -can access useful properties about the training environment through various environment variables, such as - -* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. - These artifacts are uploaded to S3 for model hosting. -* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. -* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may - include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed - and uploaded to S3 to the same S3 prefix as the model artifacts. - -For the exhaustive list of available environment variables, see the -`SageMaker Containers documentation `__. - - -RL Estimators -------------- - -The ``RLEstimator`` constructor takes both required and optional arguments. - -Required arguments -~~~~~~~~~~~~~~~~~~ - -The following are required arguments to the ``RLEstimator`` constructor. When you create an instance of RLEstimator, you must include -these in the constructor, either positionally or as keyword arguments. - -- ``entry_point`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type`` Type of EC2 instance to use for training, for - example, 'ml.m4.xlarge'. 
- -You must as well include either: - -- ``toolkit`` RL toolkit (Ray RLlib or Coach) you want to use for executing your model training code. - -- ``toolkit_version`` RL toolkit version you want to be use for executing your model training code. - -- ``framework`` Framework (MXNet or TensorFlow) you want to be used as - a toolkit backed for reinforcement learning training. - -or provide: - -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - framework_version and py_version. Refer to: `SageMaker RL Docker Containers - <#sagemaker-rl-docker-containers>`_ for details on what the Official images support - and where to find the source code to build your custom image. - - -Optional arguments -~~~~~~~~~~~~~~~~~~ - -The following are optional arguments. When you create an ``RlEstimator`` object, you can specify these as keyword arguments. - -- ``source_dir`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. -- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: ``[]``). - The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used - instead. - - For example, the following call: - - .. code:: python - - >>> RLEstimator(entry_point='train.py', - toolkit=RLToolkit.COACH, - toolkit_version='0.11.0', - framework=RLFramework.TENSORFLOW, - dependencies=['my/libs/common', 'virtual-env']) - - results in the following inside the container: - - .. code:: bash - - >>> $ ls - - >>> opt/ml/code - >>> ├── train.py - >>> ├── common - >>> └── virtual-env - -- ``hyperparameters`` Hyperparameters that will be used for training. - Will be made accessible as a ``dict[str, str]`` to the training code on - SageMaker. For convenience, accepts other types besides strings, but - ``str`` will be called on keys and values to convert them before - training. -- ``train_volume_size`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to store training - data if ``input_mode='File'`` is used (which is the default). -- ``train_max_run`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``input_mode`` The input mode that the algorithm supports. Valid - modes: 'File' - Amazon SageMaker copies the training dataset from the - S3 location to a directory in the Docker container. 'Pipe' - Amazon - SageMaker streams data directly from S3 to the container via a Unix - named pipe. -- ``output_path`` S3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the ``fit`` - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``job_name`` Name to assign for the training job that the ``fit``` - method launches. 
If not specified, the estimator generates a default - job name, based on the training image name and current timestamp - -Calling fit -~~~~~~~~~~~ - -You start your training script by calling ``fit`` on an ``RLEstimator``. ``fit`` takes both a few optional -arguments. - -Optional arguments -'''''''''''''''''' - -- ``inputs``: This can take one of the following forms: A string - S3 URI, for example ``s3://my-bucket/my-training-data``. In this - case, the S3 objects rooted at the ``my-training-data`` prefix will - be available in the default ``train`` channel. A dict from - string channel names to S3 URIs. In this case, the objects rooted at - each S3 prefix will available as files in each channel directory. -- ``wait``: Defaults to True, whether to block and wait for the - training script to complete before returning. -- ``logs``: Defaults to True, whether to show logs produced by training - job in the Python session. Only meaningful when wait is True. - - -Distributed RL Training ------------------------ - -Amazon SageMaker RL supports multi-core and multi-instance distributed training. -Depending on your use case, training and/or environment rollout can be distributed. - -Please see the `Amazon SageMaker examples `_ -on how it can be done using different RL toolkits. - - -Saving models -------------- - -In order to save your trained PyTorch model for deployment on SageMaker, your training script should save your model -to a certain filesystem path ``/opt/ml/model``. This value is also accessible through the environment variable -``SM_MODEL_DIR``. - -Deploying RL Models -------------------- - -After an RL Estimator has been fit, you can host the newly created model in SageMaker. - -After calling ``fit``, you can call ``deploy`` on an ``RlEstimator`` Estimator to create a SageMaker Endpoint. -The Endpoint runs one of the SageMaker-provided model server based on the ``framework`` parameter -specified in the ``RLEstimator`` constructor and hosts the model produced by your training script, -which was run when you called ``fit``. This was the model you saved to ``model_dir``. -In case if ``image_name`` was specified it would use provided image for the deployment. - -``deploy`` returns a ``sagemaker.mxnet.MXNetPredictor`` for MXNet or -``sagemaker.tensorflow.serving.Predictor`` for TensorFlow. - -``predict`` returns the result of inference against your model. - -.. code:: python - - # Train my estimator - rl_estimator = RLEstimator(entry_point='coach-train.py', - toolkit=RLToolkit.COACH, - toolkit_version='0.11.0', - framework=RLFramework.MXNET, - role='SageMakerRole', - train_instance_type='ml.c4.2xlarge', - train_instance_count=1) - - rl_estimator.fit() - - # Deploy my estimator to a SageMaker Endpoint and get a MXNetPredictor - predictor = rl_estimator.deploy(instance_type='ml.m4.xlarge', - initial_instance_count=1) - - response = predictor.predict(data) - -For more information please see `The SageMaker MXNet Model Server `_ -and `Deploying to TensorFlow Serving Endpoints `_ documentation. - - -Working with Existing Training Jobs ------------------------------------ - -Attaching to existing training jobs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can attach an RL Estimator to an existing training job using the -``attach`` method. - -.. 
code:: python - - my_training_job_name = 'MyAwesomeRLTrainingJob' - rl_estimator = RLEstimator.attach(my_training_job_name) - -After attaching, if the training job has finished with job status "Completed", it can be -``deploy``\ ed to create a SageMaker Endpoint and return a ``Predictor``. If the training job is in progress, -attach will block and display log messages from the training job, until the training job completes. - -The ``attach`` method accepts the following arguments: - -- ``training_job_name:`` The name of the training job to attach - to. -- ``sagemaker_session:`` The Session used - to interact with SageMaker - -RL Training Examples --------------------- - -Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using RL. -Please refer to: - -https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning - -These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the sample notebooks folder. - - -SageMaker RL Docker Containers ------------------------------- - -When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several -libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control -the environment your script runs in. - -SageMaker runs RL Estimator scripts in either Python 3.5 for MXNet or Python 3.6 for TensorFlow. - -The Docker images have the following dependencies installed: - -+-------------------------+-------------------+-------------------+-------------------+ -| Dependencies | Coach 0.10.1 | Coach 0.11.0 | Ray 0.5.3 | -+-------------------------+-------------------+-------------------+-------------------+ -| Python | 3.6 | 3.5(MXNet) or | 3.6 | -| | | 3.6(TensorFlow) | | -+-------------------------+-------------------+-------------------+-------------------+ -| CUDA (GPU image only) | 9.0 | 9.0 | 9.0 | -+-------------------------+-------------------+-------------------+-------------------+ -| DL Framework | TensorFlow-1.11.0 | MXNet-1.3.0 or | TensorFlow-1.11.0 | -| | | TensorFlow-1.11.0 | | -+-------------------------+-------------------+-------------------+-------------------+ -| gym | 0.10.5 | 0.10.5 | 0.10.5 | -+-------------------------+-------------------+-------------------+-------------------+ - -The Docker images extend Ubuntu 16.04. - -You can select version of by passing a ``framework_version`` keyword arg to the RL Estimator constructor. -Currently supported versions are listed in the above table. You can also set ``framework_version`` to only specify major and -minor version, which will cause your training script to be run on the latest supported patch version of that minor -version. - -Alternatively, you can build your own image by following the instructions in the SageMaker RL containers -repository, and passing ``image_name`` to the RL Estimator constructor. - +========================================================== +Using Reinforcement Learning with the SageMaker Python SDK +========================================================== + +.. contents:: + +With Reinforcement Learning (RL) Estimators, you can train reinforcement learning models on Amazon SageMaker. + +Supported versions of Coach: ``0.11.1``, ``0.10.1`` with TensorFlow, ``0.11.0`` with TensorFlow or MXNet. +For more information about Coach, see https://github.com/NervanaSystems/coach + +Supported versions of Ray: ``0.5.3`` with TensorFlow. 
+For more information about Ray, see https://github.com/ray-project/ray
+
+RL Training
+-----------
+
+Training RL models using ``RLEstimator`` is a two-step process:
+
+1. Prepare a training script to run on SageMaker
+2. Run this script on SageMaker via an ``RLEstimator``.
+
+You should prepare your script in a source file separate from the notebook, terminal session, or source file you're
+using to submit the script to SageMaker via an ``RLEstimator``. This is discussed in further detail below.
+
+Suppose that you already have a training script called ``coach-train.py``.
+You can then create an ``RLEstimator`` with keyword arguments to point to this script and define how SageMaker runs it:
+
+.. code:: python
+
+    from sagemaker.rl import RLEstimator, RLToolkit, RLFramework
+
+    rl_estimator = RLEstimator(entry_point='coach-train.py',
+                               toolkit=RLToolkit.COACH,
+                               toolkit_version='0.11.1',
+                               framework=RLFramework.TENSORFLOW,
+                               role='SageMakerRole',
+                               train_instance_type='ml.p3.2xlarge',
+                               train_instance_count=1)
+
+After that, you simply tell the estimator to start a training job:
+
+.. code:: python
+
+    rl_estimator.fit()
+
+In the following sections, we'll discuss how to prepare a training script for execution on SageMaker
+and how to run that script on SageMaker using ``RLEstimator``.
+
+
+Preparing the RL Training Script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Your RL training script must be a Python 3.5 compatible source file if you use the MXNet framework, or a Python 3.6 compatible source file if you use TensorFlow.
+
+The training script is very similar to a training script you might run outside of SageMaker, but you
+can access useful properties about the training environment through various environment variables, such as:
+
+* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to.
+  These artifacts are uploaded to S3 for model hosting.
+* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host.
+* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may
+  include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed
+  and uploaded to S3 to the same S3 prefix as the model artifacts.
+
+For the exhaustive list of available environment variables, see the
+`SageMaker Containers documentation `__.
+
+
+RL Estimators
+-------------
+
+The ``RLEstimator`` constructor takes both required and optional arguments.
+
+Required arguments
+~~~~~~~~~~~~~~~~~~
+
+The following are required arguments to the ``RLEstimator`` constructor. When you create an ``RLEstimator`` instance, you must include
+these in the constructor, either positionally or as keyword arguments.
+
+- ``entry_point`` Path (absolute or relative) to the Python file which
+  should be executed as the entry point to training.
+- ``role`` An AWS IAM role (either name or full ARN). The Amazon
+  SageMaker training jobs and APIs that create Amazon SageMaker
+  endpoints use this role to access training data and model artifacts.
+  After the endpoint is created, the inference code might use the IAM
+  role, if it accesses other AWS resources.
+- ``train_instance_count`` Number of Amazon EC2 instances to use for
+  training.
+- ``train_instance_type`` Type of EC2 instance to use for training, for
+  example, 'ml.m4.xlarge'.
+
+You must also include either:
+
+- ``toolkit`` RL toolkit (Ray RLlib or Coach) you want to use for executing your model training code.
+
+- ``toolkit_version`` RL toolkit version you want to use for executing your model training code.
+
+- ``framework`` Framework (MXNet or TensorFlow) you want to use as
+  the toolkit backend for reinforcement learning training.
+
+or provide:
+
+- ``image_name`` An alternative docker image to use for training and
+  serving. If specified, the estimator will use this image for training and
+  hosting, instead of selecting the appropriate SageMaker official image based on
+  ``framework_version`` and ``py_version``. Refer to `SageMaker RL Docker Containers
+  <#sagemaker-rl-docker-containers>`_ for details on what the official images support
+  and where to find the source code to build your custom image.
+
+
+Optional arguments
+~~~~~~~~~~~~~~~~~~
+
+The following are optional arguments. When you create an ``RLEstimator`` object, you can specify these as keyword arguments.
+
+- ``source_dir`` Path (absolute or relative) to a directory with any
+  other training source code dependencies including the entry point
+  file. Structure within this directory will be preserved when training
+  on SageMaker.
+- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with
+  any additional libraries that will be exported to the container (default: ``[]``).
+  The library folders will be copied to SageMaker in the same folder where the entry point is copied.
+  If ``source_dir`` points to S3, code will be uploaded and the S3 location will be used
+  instead.
+
+  For example, the following call:
+
+  .. code:: python
+
+    >>> RLEstimator(entry_point='train.py',
+                    toolkit=RLToolkit.COACH,
+                    toolkit_version='0.11.0',
+                    framework=RLFramework.TENSORFLOW,
+                    dependencies=['my/libs/common', 'virtual-env'])
+
+  results in the following inside the container:
+
+  .. code:: bash
+
+    >>> $ ls
+
+    >>> opt/ml/code
+    >>> ├── train.py
+    >>> ├── common
+    >>> └── virtual-env
+
+- ``hyperparameters`` Hyperparameters that will be used for training.
+  Will be made accessible as a ``dict[str, str]`` to the training code on
+  SageMaker. For convenience, accepts other types besides strings, but
+  ``str`` will be called on keys and values to convert them before
+  training.
+- ``train_volume_size`` Size in GB of the EBS volume to use for storing
+  input data during training. Must be large enough to store training
+  data if ``input_mode='File'`` is used (which is the default).
+- ``train_max_run`` Timeout in seconds for training, after which Amazon
+  SageMaker terminates the job regardless of its current status.
+- ``input_mode`` The input mode that the algorithm supports. Valid
+  modes: 'File' - Amazon SageMaker copies the training dataset from the
+  S3 location to a directory in the Docker container. 'Pipe' - Amazon
+  SageMaker streams data directly from S3 to the container via a Unix
+  named pipe.
+- ``output_path`` S3 location where you want the training result (model
+  artifacts and optional output files) saved. If not specified, results
+  are stored to a default bucket. If the bucket with the specific name
+  does not exist, the estimator creates the bucket during the ``fit``
+  method execution.
+- ``output_kms_key`` Optional KMS key ID to encrypt training
+  output with.
+- ``job_name`` Name to assign to the training job that the ``fit``
+  method launches. If not specified, the estimator generates a default
+  job name, based on the training image name and current timestamp.
+
+Calling fit
+~~~~~~~~~~~
+
+You start your training script by calling ``fit`` on an ``RLEstimator``. ``fit`` takes a few optional
+arguments.
+
+Optional arguments
+''''''''''''''''''
+
+- ``inputs``: This can take one of the following forms: A string
+  S3 URI, for example ``s3://my-bucket/my-training-data``. In this
+  case, the S3 objects rooted at the ``my-training-data`` prefix will
+  be available in the default ``train`` channel. A dict from
+  string channel names to S3 URIs. In this case, the objects rooted at
+  each S3 prefix will be available as files in each channel directory.
+- ``wait``: Defaults to True, whether to block and wait for the
+  training script to complete before returning.
+- ``logs``: Defaults to True, whether to show logs produced by the training
+  job in the Python session. Only meaningful when ``wait`` is True.
+
+
+Distributed RL Training
+-----------------------
+
+Amazon SageMaker RL supports multi-core and multi-instance distributed training.
+Depending on your use case, training and/or environment rollout can be distributed.
+
+Please see the `Amazon SageMaker examples `_
+for how this can be done using different RL toolkits.
+
+
+Saving models
+-------------
+
+In order to save your trained RL model for deployment on SageMaker, your training script should save your model
+to the filesystem path ``/opt/ml/model``. This value is also accessible through the environment variable
+``SM_MODEL_DIR``.
+
+Deploying RL Models
+-------------------
+
+After an RL Estimator has been fit, you can host the newly created model in SageMaker.
+
+After calling ``fit``, you can call ``deploy`` on an ``RLEstimator`` to create a SageMaker Endpoint.
+The Endpoint runs one of the SageMaker-provided model servers, based on the ``framework`` parameter
+specified in the ``RLEstimator`` constructor, and hosts the model produced by your training script,
+which was run when you called ``fit``. This is the model you saved to ``model_dir``.
+If ``image_name`` was specified, the provided image is used for the deployment instead.
+
+``deploy`` returns a ``sagemaker.mxnet.MXNetPredictor`` for MXNet or
+``sagemaker.tensorflow.serving.Predictor`` for TensorFlow.
+
+``predict`` returns the result of inference against your model.
+
+.. code:: python
+
+    # Train my estimator
+    rl_estimator = RLEstimator(entry_point='coach-train.py',
+                               toolkit=RLToolkit.COACH,
+                               toolkit_version='0.11.0',
+                               framework=RLFramework.MXNET,
+                               role='SageMakerRole',
+                               train_instance_type='ml.c4.2xlarge',
+                               train_instance_count=1)
+
+    rl_estimator.fit()
+
+    # Deploy my estimator to a SageMaker Endpoint and get an MXNetPredictor
+    predictor = rl_estimator.deploy(instance_type='ml.m4.xlarge',
+                                    initial_instance_count=1)
+
+    response = predictor.predict(data)
+
+For more information, see `The SageMaker MXNet Model Server `_
+and `Deploying to TensorFlow Serving Endpoints `_ documentation.
+
+
+Working with Existing Training Jobs
+-----------------------------------
+
+Attaching to existing training jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can attach an RL Estimator to an existing training job using the
+``attach`` method.
+
+.. code:: python
+
+    my_training_job_name = 'MyAwesomeRLTrainingJob'
+    rl_estimator = RLEstimator.attach(my_training_job_name)
+
+After attaching, if the training job has finished with job status "Completed", it can be
+``deploy``\ ed to create a SageMaker Endpoint and return a ``Predictor``. If the training job is in progress,
+``attach`` will block and display log messages from the training job until the training job completes.
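+A minimal sketch of that workflow, assuming the completed training job ``MyAwesomeRLTrainingJob`` from the example above (the endpoint settings shown here are only illustrative):
+
+.. code:: python
+
+    # Attach to a training job that has already completed
+    rl_estimator = RLEstimator.attach('MyAwesomeRLTrainingJob')
+
+    # Host the attached model on a SageMaker Endpoint and get a Predictor
+    predictor = rl_estimator.deploy(instance_type='ml.m4.xlarge',
+                                    initial_instance_count=1)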
+ +The ``attach`` method accepts the following arguments: + +- ``training_job_name:`` The name of the training job to attach + to. +- ``sagemaker_session:`` The Session used + to interact with SageMaker + +RL Training Examples +-------------------- + +Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using RL. +Please refer to: + +https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning + +These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the sample notebooks folder. + + +SageMaker RL Docker Containers +------------------------------ + +When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several +libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control +the environment your script runs in. + +SageMaker runs RL Estimator scripts in either Python 3.5 for MXNet or Python 3.6 for TensorFlow. + +The Docker images have the following dependencies installed: + ++-------------------------+-------------------+-------------------+-------------------+ +| Dependencies | Coach 0.10.1 | Coach 0.11.0 | Ray 0.5.3 | ++-------------------------+-------------------+-------------------+-------------------+ +| Python | 3.6 | 3.5(MXNet) or | 3.6 | +| | | 3.6(TensorFlow) | | ++-------------------------+-------------------+-------------------+-------------------+ +| CUDA (GPU image only) | 9.0 | 9.0 | 9.0 | ++-------------------------+-------------------+-------------------+-------------------+ +| DL Framework | TensorFlow-1.11.0 | MXNet-1.3.0 or | TensorFlow-1.11.0 | +| | | TensorFlow-1.11.0 | | ++-------------------------+-------------------+-------------------+-------------------+ +| gym | 0.10.5 | 0.10.5 | 0.10.5 | ++-------------------------+-------------------+-------------------+-------------------+ + +The Docker images extend Ubuntu 16.04. + +You can select version of by passing a ``framework_version`` keyword arg to the RL Estimator constructor. +Currently supported versions are listed in the above table. You can also set ``framework_version`` to only specify major and +minor version, which will cause your training script to be run on the latest supported patch version of that minor +version. + +Alternatively, you can build your own image by following the instructions in the SageMaker RL containers +repository, and passing ``image_name`` to the RL Estimator constructor. + You can visit `the SageMaker RL containers repository `_. \ No newline at end of file diff --git a/doc/using_sklearn.rst b/doc/using_sklearn.rst index bf9c506b96..e67f5edaff 100644 --- a/doc/using_sklearn.rst +++ b/doc/using_sklearn.rst @@ -1,638 +1,638 @@ -================================================ -Using Scikit-learn with the SageMaker Python SDK -================================================ - -.. contents:: - -With Scikit-learn Estimators, you can train and host Scikit-learn models on Amazon SageMaker. - -Supported versions of Scikit-learn: ``0.20.0`` - -You can visit the Scikit-learn repository at https://github.com/scikit-learn/scikit-learn. - - -Training with Scikit-learn -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Training Scikit-learn models using ``SKLearn`` Estimators is a two-step process: - -1. Prepare a Scikit-learn script to run on SageMaker -2. Run this script on SageMaker via a ``SKLearn`` Estimator. 
- - -First, you prepare your training script, then second, you run this on SageMaker via a ``SKLearn`` Estimator. -You should prepare your script in a separate source file than the notebook, terminal session, or source file you're -using to submit the script to SageMaker via a ``SKLearn`` Estimator. - -Suppose that you already have an Scikit-learn training script called -``sklearn-train.py``. You can run this script in SageMaker as follows: - -.. code:: python - - from sagemaker.sklearn import SKLearn - sklearn_estimator = SKLearn(entry_point='sklearn-train.py', - role='SageMakerRole', - train_instance_type='ml.m4.xlarge', - framework_version='0.20.0') - sklearn_estimator.fit('s3://bucket/path/to/training/data') - -Where the S3 URL is a path to your training data, within Amazon S3. The constructor keyword arguments define how -SageMaker runs your training script and are discussed in detail in a later section. - -In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, -then how to run that script on SageMaker using a ``SKLearn`` Estimator. - -Preparing the Scikit-learn training script -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Your Scikit-learn training script must be a Python 2.7 or 3.5 compatible source file. - -The training script is very similar to a training script you might run outside of SageMaker, but you -can access useful properties about the training environment through various environment variables, such as - -* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. - These artifacts are uploaded to S3 for model hosting. -* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may - include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed - and uploaded to S3 to the same S3 prefix as the model artifacts. - -Supposing two input channels, 'train' and 'test', were used in the call to the Scikit-learn estimator's ``fit()`` method, -the following will be set, following the format "SM_CHANNEL_[channel_name]": - -* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel -* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. - -A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, -and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments -and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start -with the following: - -.. code:: python - - import argparse - import os - - if __name__ =='__main__': - - parser = argparse.ArgumentParser() - - # hyperparameters sent by the client are passed as command-line arguments to the script. - parser.add_argument('--epochs', type=int, default=50) - parser.add_argument('--batch-size', type=int, default=64) - parser.add_argument('--learning-rate', type=float, default=0.05) - - # Data, model, and output directories - parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR')) - parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR')) - parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) - parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) - - args, _ = parser.parse_known_args() - - # ... 
load from args.train and args.test, train a model, write model to args.model_dir. - -Because the SageMaker imports your training script, you should put your training code in a main guard -(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not -inadvertently run your training code at the wrong point in execution. - -For more on training environment variables, please visit https://github.com/aws/sagemaker-containers. - -Running a Scikit-learn training script in SageMaker -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You run Scikit-learn training scripts on SageMaker by creating ``SKLearn`` Estimators. -SageMaker training of your script is invoked when you call ``fit`` on a ``SKLearn`` Estimator. -The following code sample shows how you train a custom Scikit-learn script "sklearn-train.py", passing -in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel -directories ('train' and 'test'). - -.. code:: python - - sklearn_estimator = SKLearn('sklearn-train.py', - train_instance_type='ml.m4.xlarge', - framework_version='0.20.0', - hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1}) - sklearn_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data', - 'test': 's3://my-data-bucket/path/to/my/test/data'}) - - -Scikit-learn Estimators -^^^^^^^^^^^^^^^^^^^^^^^ - -The `SKLearn` constructor takes both required and optional arguments. - -Required arguments -'''''''''''''''''' - -The following are required arguments to the ``SKLearn`` constructor. When you create a Scikit-learn object, you must -include these in the constructor, either positionally or as keyword arguments. - -- ``entry_point`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_type`` Type of EC2 instance to use for training, for - example, 'ml.m4.xlarge'. Please note that Scikit-learn does not have GPU support. - -Optional arguments -'''''''''''''''''' - -The following are optional arguments. When you create a ``SKLearn`` object, you can specify these as keyword arguments. - -- ``source_dir`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. -- ``hyperparameters`` Hyperparameters that will be used for training. - Will be made accessible as a dict[str, str] to the training code on - SageMaker. For convenience, accepts other types besides str, but - str() will be called on keys and values to convert them before - training. -- ``py_version`` Python version you want to use for executing your - model training code. -- ``train_volume_size`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to store training - data if input_mode='File' is used (which is the default). -- ``train_max_run`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``input_mode`` The input mode that the algorithm supports. 
Valid - modes: 'File' - Amazon SageMaker copies the training dataset from the - s3 location to a directory in the Docker container. 'Pipe' - Amazon - SageMaker streams data directly from s3 to the container via a Unix - named pipe. -- ``output_path`` s3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the fit() - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``base_job_name`` Name to assign for the training job that the fit() - method launches. If not specified, the estimator generates a default - job name, based on the training image name and current timestamp -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - framework_version and py_version. Refer to: `SageMaker Scikit-learn Docker Containers - <#sagemaker-scikit-learn-docker-containers>`_ for details on what the official images support - and where to find the source code to build your custom image. - - -Calling fit -^^^^^^^^^^^ - -You start your training script by calling ``fit`` on a ``SKLearn`` Estimator. ``fit`` takes both required and optional -arguments. - -Required arguments -'''''''''''''''''' - -- ``inputs``: This can take one of the following forms: A string - s3 URI, for example ``s3://my-bucket/my-training-data``. In this - case, the s3 objects rooted at the ``my-training-data`` prefix will - be available in the default ``train`` channel. A dict from - string channel names to s3 URIs. In this case, the objects rooted at - each s3 prefix will available as files in each channel directory. - -For example: - -.. code:: python - - {'train':'s3://my-bucket/my-training-data', - 'eval':'s3://my-bucket/my-evaluation-data'} - -.. optional-arguments-1: - -Optional arguments -'''''''''''''''''' - -- ``wait``: Defaults to True, whether to block and wait for the - training script to complete before returning. -- ``logs``: Defaults to True, whether to show logs produced by training - job in the Python session. Only meaningful when wait is True. - - -Saving models -~~~~~~~~~~~~~ - -In order to save your trained Scikit-learn model for deployment on SageMaker, your training script should save your -model to a certain filesystem path called `model_dir`. This value is accessible through the environment variable -``SM_MODEL_DIR``. The following code demonstrates how to save a trained Scikit-learn model named ``model`` as -``model.joblib`` at the end of training: - -.. code:: python - - from sklearn.externals import joblib - import argparse - import os - - if __name__=='__main__': - # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable. - parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) - args, _ = parser.parse_known_args() - - # ... train classifier `clf`, then save it to `model_dir` as file 'model.joblib' - joblib.dump(clf, os.path.join(args.model_dir, "model.joblib")) - -After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data -will available in the s3 ``output_path`` you specified when you created the Scikit-learn Estimator. 
- -Deploying Scikit-learn models -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -After an Scikit-learn Estimator has been fit, you can host the newly created model in SageMaker. - -After calling ``fit``, you can call ``deploy`` on a ``SKLearn`` Estimator to create a SageMaker Endpoint. -The Endpoint runs a SageMaker-provided Scikit-learn model server and hosts the model produced by your training script, -which was run when you called ``fit``. This was the model you saved to ``model_dir``. - -``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your Scikit-learn -model. Each ``Predictor`` provides a ``predict`` method which can do inference with numpy arrays or Python lists. -Inference arrays or lists are serialized and sent to the Scikit-learn model server by an ``InvokeEndpoint`` SageMaker -operation. - -``predict`` returns the result of inference against your model. By default, the inference result a NumPy array. - -.. code:: python - - # Train my estimator - sklearn_estimator = SKLearn(entry_point='train_and_deploy.py', - train_instance_type='ml.m4.xlarge', - framework_version='0.20.0') - sklearn_estimator.fit('s3://my_bucket/my_training_data/') - - # Deploy my estimator to a SageMaker Endpoint and get a Predictor - predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge', - initial_instance_count=1) - - # `data` is a NumPy array or a Python list. - # `response` is a NumPy array. - response = predictor.predict(data) - -You use the SageMaker Scikit-learn model server to host your Scikit-learn model when you call ``deploy`` -on an ``SKLearn`` Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates. -You can access the name of the Endpoint by the ``name`` property on the returned ``Predictor``. - - -SageMaker Scikit-learn Model Server -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Scikit-learn Endpoint you create with ``deploy`` runs a SageMaker Scikit-learn model server. -The model server loads the model that was saved by your training script and performs inference on the model in response -to SageMaker InvokeEndpoint API calls. - -You can configure two components of the SageMaker Scikit-learn model server: Model loading and model serving. -Model loading is the process of deserializing your saved model back into an Scikit-learn model. -Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model. - -You configure the Scikit-learn model server by defining functions in the Python source file you passed to the -Scikit-learn constructor. - -Model loading -^^^^^^^^^^^^^ - -Before a model can be served, it must be loaded. The SageMaker Scikit-learn model server loads your model by invoking a -``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature: - -.. code:: python - - def model_fn(model_dir) - -SageMaker will inject the directory where your model files and sub-directories, saved by ``save``, have been mounted. -Your model function should return a model object that can be used for model serving. - -SageMaker provides automated serving functions that work with Gluon API ``net`` objects and Module API ``Module`` objects. If you return either of these types of objects, then you will be able to use the default serving request handling functions. - -The following code-snippet shows an example ``model_fn`` implementation. 
-This loads returns a Scikit-learn Classifier from a ``model.joblib`` file in the SageMaker model directory -``model_dir``. - -.. code:: python - - from sklearn.externals import joblib - import os - - def model_fn(model_dir): - clf = joblib.load(os.path.join(model_dir, "model.joblib")) - return clf - -Model serving -^^^^^^^^^^^^^ - -After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model. -Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. -The SageMaker Scikit-learn model server breaks request handling into three steps: - - -- input processing, -- prediction, and -- output processing. - -In a similar way to model loading, you configure these steps by defining functions in your Python source file. - -Each step involves invoking a python function, with information about the request and the return-value from the previous -function in the chain. Inside the SageMaker Scikit-learn model server, the process looks like: - -.. code:: python - - # Deserialize the Invoke request body into an object we can perform prediction on - input_object = input_fn(request_body, request_content_type) - - # Perform prediction on the deserialized object, with the loaded model - prediction = predict_fn(input_object, model) - - # Serialize the prediction result into the desired response content type - output = output_fn(prediction, response_content_type) - -The above code-sample shows the three function definitions: - -- ``input_fn``: Takes request data and deserializes the data into an - object for prediction. -- ``predict_fn``: Takes the deserialized request object and performs - inference against the loaded model. -- ``output_fn``: Takes the result of prediction and serializes this - according to the response content type. - -The SageMaker Scikit-learn model server provides default implementations of these functions. -You can provide your own implementations for these functions in your hosting script. -If you omit any definition then the SageMaker Scikit-learn model server will use its default implementation for that -function. - -The ``RealTimePredictor`` used by Scikit-learn in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format -by default, with Content-Type ``application/x-npy``. The SageMaker Scikit-learn model server can deserialize NPY-formatted -data (along with JSON and CSV data). - -If you rely solely on the SageMaker Scikit-learn model server defaults, you get the following functionality: - -- Prediction on models that implement the ``__call__`` method -- Serialization and deserialization of NumPy arrays. - -The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on NumPy arrays. If your model expects -a NumPy array and returns a NumPy array, then these functions do not have to be overridden when sending NPY-formatted -data. - -In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. -We describe the input arguments and expected return types of each, so you can define your own implementations. 
- -Input processing -'''''''''''''''' - -When an InvokeEndpoint operation is made against an Endpoint running a SageMaker Scikit-learn model server, -the model server receives two pieces of information: - -- The request Content-Type, for example "application/x-npy" -- The request data body, a byte array - -The SageMaker Scikit-learn model server will invoke an "input_fn" function in your hosting script, -passing in this information. If you define an ``input_fn`` function definition, -it should return an object that can be passed to ``predict_fn`` and have the following signature: - -.. code:: python - - def input_fn(request_body, request_content_type) - -Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string - -The SageMaker Scikit-learn model server provides a default implementation of ``input_fn``. -This function deserializes JSON, CSV, or NPY encoded data into a NumPy array. - -Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For Scikit-learn, the Python SDK -defaults to sending prediction requests with this format. - -Default json deserialization requires ``request_body`` contain a single json list. -Sending multiple json objects within the same ``request_body`` is not supported. -The list must have a dimensionality compatible with the model loaded in ``model_fn``. -The list's shape must be identical to the model's input shape, for all dimensions after the first (which first -dimension is the batch size). - -Default csv deserialization requires ``request_body`` contain one or more lines of CSV numerical data. -The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension. - -The example below shows a custom ``input_fn`` for preparing pickled NumPy arrays. - -.. code:: python - - import numpy as np - - def input_fn(request_body, request_content_type): - """An input_fn that loads a pickled numpy array""" - if request_content_type == "application/python-pickle": - array = np.load(StringIO(request_body)) - return array - else: - # Handle other content-types here or raise an Exception - # if the content type is not supported. - pass - - - -Prediction -'''''''''' - -After the inference request has been deserialized by ``input_fn``, the SageMaker Scikit-learn model server invokes -``predict_fn`` on the return value of ``input_fn``. - -As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker Scikit-learn model server default. - -The ``predict_fn`` function has the following signature: - -.. code:: python - - def predict_fn(input_object, model) - -Where ``input_object`` is the object returned from ``input_fn`` and -``model`` is the model loaded by ``model_fn``. - -The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``, -and returns the resulting value. The return-type should be a NumPy array to be compatible with the default -``output_fn``. - -The example below shows an overridden ``predict_fn`` for a Logistic Regression classifier. This model accepts a -Python list and returns a tuple of predictions and prediction probabilities from the model in a NumPy array. -This ``predict_fn`` can rely on the default ``input_fn`` and ``output_fn`` because ``input_data`` is a NumPy array, -and the return value of this function is a NumPy array. - -.. 
code:: python - - import sklearn - import numpy as np - - def predict_fn(input_data, model): - prediction = model.predict(input_data) - pred_prob = model.predict_proba(input_data) - return np.array([prediction, pred_prob]) - -If you implement your own prediction function, you should take care to ensure that: - -- The first argument is expected to be the return value from input_fn. - If you use the default input_fn, this will be a NumPy array. -- The second argument is the loaded model. -- The return value should be of the correct type to be passed as the - first argument to ``output_fn``. If you use the default - ``output_fn``, this should be a NumPy array. - -Output processing -''''''''''''''''' - -After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return-value from ``predict_fn`` -and the InvokeEndpoint requested response content-type. - -The ``output_fn`` has the following signature: - -.. code:: python - - def output_fn(prediction, content_type) - -Where ``prediction`` is the result of invoking ``predict_fn`` and -``content_type`` is the InvokeEndpoint requested response content-type. -The function should return a byte array of data serialized to content_type. - -The default implementation expects ``prediction`` to be an NumPy and can serialize the result to JSON, CSV, or NPY. -It accepts response content types of "application/json", "text/csv", and "application/x-npy". - -Working with existing model data and training jobs -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Attaching to existing training jobs -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -You can attach an Scikit-learn Estimator to an existing training job using the -``attach`` method. - -.. code:: python - - my_training_job_name = "MyAwesomeSKLearnTrainingJob" - sklearn_estimator = SKLearn.attach(my_training_job_name) - -After attaching, if the training job is in a Complete status, it can be -``deploy``\ ed to create a SageMaker Endpoint and return a -``Predictor``. If the training job is in progress, -attach will block and display log messages from the training job, until the training job completes. - -The ``attach`` method accepts the following arguments: - -- ``training_job_name (str):`` The name of the training job to attach - to. -- ``sagemaker_session (sagemaker.Session or None):`` The Session used - to interact with SageMaker - -Deploying Endpoints from model data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -As well as attaching to existing training jobs, you can deploy models directly from model data in S3. -The following code sample shows how to do this, using the ``SKLearnModel`` class. - -.. code:: python - - sklearn_model = SKLearnModel(model_data="s3://bucket/model.tar.gz", role="SageMakerRole", - entry_point="transform_script.py") - - predictor = sklearn_model.deploy(instance_type="ml.c4.xlarge", initial_instance_count=1) - -The sklearn_model constructor takes the following arguments: - -- ``model_data (str):`` An S3 location of a SageMaker model data - .tar.gz file -- ``image (str):`` A Docker image URI -- ``role (str):`` An IAM role name or Arn for SageMaker to access AWS - resources on your behalf. -- ``predictor_cls (callable[string,sagemaker.Session]):`` A function to - call to create a predictor. If not None, ``deploy`` will return the - result of invoking this function on the created endpoint name -- ``env (dict[string,string]):`` Environment variables to run with - ``image`` when hosted in SageMaker. -- ``name (str):`` The model name. 
If None, a default model name will be - selected on each ``deploy.`` -- ``entry_point (str):`` Path (absolute or relative) to the Python file - which should be executed as the entry point to model hosting. -- ``source_dir (str):`` Optional. Path (absolute or relative) to a - directory with any other training source code dependencies including - tne entry point file. Structure within this directory will be - preserved when training on SageMaker. -- ``enable_cloudwatch_metrics (boolean):`` Optional. If true, training - and hosting containers will generate Cloudwatch metrics under the - AWS/SageMakerContainer namespace. -- ``container_log_level (int):`` Log level to use within the container. - Valid values are defined in the Python logging module. -- ``code_location (str):`` Optional. Name of the S3 bucket where your - custom code will be uploaded to. If not specified, will use the - SageMaker default bucket created by sagemaker.Session. -- ``sagemaker_session (sagemaker.Session):`` The SageMaker Session - object, used for SageMaker interaction""" - -Your model data must be a .tar.gz file in S3. SageMaker Training Job model data is saved to .tar.gz files in S3, -however if you have local data you want to deploy, you can prepare the data yourself. - -Assuming you have a local directory containg your model data named "my_model" you can tar and gzip compress the file and -upload to S3 using the following commands: - -:: - - tar -czf model.tar.gz my_model - aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz - -This uploads the contents of my_model to a gzip compressed tar file to S3 in the bucket "my-bucket", with the key -"my-path/model.tar.gz". - -To run this command, you'll need the aws cli tool installed. Please refer to our `FAQ <#FAQ>`__ for more information on -installing this. - -Scikit-learn Training Examples -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Amazon provides an example Jupyter notebook that demonstrate end-to-end training on Amazon SageMaker using Scikit-learn. -Please refer to: - -https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk - -These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the "sample notebooks" folder. - - -SageMaker Scikit-learn Docker Containers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When training and deploying training scripts, SageMaker runs your Python script in a Docker container with several -libraries installed. When creating the Estimator and calling deploy to create the SageMaker Endpoint, you can control -the environment your script runs in. - -SageMaker runs Scikit-learn Estimator scripts in either Python 2.7 or Python 3.5. You can select the Python version by -passing a py_version keyword arg to the Scikit-learn Estimator constructor. Setting this to py3 (the default) will cause -your training script to be run on Python 3.5. Setting this to py2 will cause your training script to be run on Python 2.7 -This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy. 
- -The Scikit-learn Docker images have the following dependencies installed: - -+-----------------------------+-------------+ -| Dependencies | sklearn 0.2 | -+-----------------------------+-------------+ -| sklearn | 0.20.0 | -+-----------------------------+-------------+ -| sagemaker | 1.11.3 | -+-----------------------------+-------------+ -| sagemaker-containers | 2.2.4 | -+-----------------------------+-------------+ -| numpy | 1.15.2 | -+-----------------------------+-------------+ -| pandas | 0.23.4 | -+-----------------------------+-------------+ -| Pillow | 3.1.2 | -+-----------------------------+-------------+ -| Python | 2.7 or 3.5 | -+-----------------------------+-------------+ - -You can see the full list by calling ``pip freeze`` from the running Docker image. - -The Docker images extend Ubuntu 16.04. - -You can select version of Scikit-learn by passing a framework_version keyword arg to the Scikit-learn Estimator constructor. -Currently supported versions are listed in the above table. You can also set framework_version to only specify major and -minor version, which will cause your training script to be run on the latest supported patch version of that minor -version. - -Alternatively, you can build your own image by following the instructions in the SageMaker Scikit-learn containers -repository, and passing ``image_name`` to the Scikit-learn Estimator constructor. -sagemaker-containers -You can visit the SageMaker Scikit-learn containers repository here: https://github.com/aws/sagemaker-scikit-learn-container/ +================================================ +Using Scikit-learn with the SageMaker Python SDK +================================================ + +.. contents:: + +With Scikit-learn Estimators, you can train and host Scikit-learn models on Amazon SageMaker. + +Supported versions of Scikit-learn: ``0.20.0`` + +You can visit the Scikit-learn repository at https://github.com/scikit-learn/scikit-learn. + + +Training with Scikit-learn +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Training Scikit-learn models using ``SKLearn`` Estimators is a two-step process: + +1. Prepare a Scikit-learn script to run on SageMaker +2. Run this script on SageMaker via a ``SKLearn`` Estimator. + + +First, you prepare your training script, then second, you run this on SageMaker via a ``SKLearn`` Estimator. +You should prepare your script in a separate source file than the notebook, terminal session, or source file you're +using to submit the script to SageMaker via a ``SKLearn`` Estimator. + +Suppose that you already have an Scikit-learn training script called +``sklearn-train.py``. You can run this script in SageMaker as follows: + +.. code:: python + + from sagemaker.sklearn import SKLearn + sklearn_estimator = SKLearn(entry_point='sklearn-train.py', + role='SageMakerRole', + train_instance_type='ml.m4.xlarge', + framework_version='0.20.0') + sklearn_estimator.fit('s3://bucket/path/to/training/data') + +Where the S3 URL is a path to your training data, within Amazon S3. The constructor keyword arguments define how +SageMaker runs your training script and are discussed in detail in a later section. + +In the following sections, we'll discuss how to prepare a training script for execution on SageMaker, +then how to run that script on SageMaker using a ``SKLearn`` Estimator. + +Preparing the Scikit-learn training script +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Your Scikit-learn training script must be a Python 2.7 or 3.5 compatible source file. 
+ +The training script is very similar to a training script you might run outside of SageMaker, but you +can access useful properties about the training environment through various environment variables, such as + +* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. + These artifacts are uploaded to S3 for model hosting. +* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may + include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed + and uploaded to S3 to the same S3 prefix as the model artifacts. + +Supposing two input channels, 'train' and 'test', were used in the call to the Scikit-learn estimator's ``fit()`` method, +the following will be set, following the format "SM_CHANNEL_[channel_name]": + +* ``SM_CHANNEL_TRAIN``: A string representing the path to the directory containing data in the 'train' channel +* ``SM_CHANNEL_TEST``: Same as above, but for the 'test' channel. + +A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, +and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments +and can be retrieved with an argparse.ArgumentParser instance. For example, a training script might start +with the following: + +.. code:: python + + import argparse + import os + + if __name__ =='__main__': + + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. + parser.add_argument('--epochs', type=int, default=50) + parser.add_argument('--batch-size', type=int, default=64) + parser.add_argument('--learning-rate', type=float, default=0.05) + + # Data, model, and output directories + parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR')) + parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR')) + parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) + parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) + + args, _ = parser.parse_known_args() + + # ... load from args.train and args.test, train a model, write model to args.model_dir. + +Because the SageMaker imports your training script, you should put your training code in a main guard +(``if __name__=='__main__':``) if you are using the same script to host your model, so that SageMaker does not +inadvertently run your training code at the wrong point in execution. + +For more on training environment variables, please visit https://github.com/aws/sagemaker-containers. + +Running a Scikit-learn training script in SageMaker +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You run Scikit-learn training scripts on SageMaker by creating ``SKLearn`` Estimators. +SageMaker training of your script is invoked when you call ``fit`` on a ``SKLearn`` Estimator. +The following code sample shows how you train a custom Scikit-learn script "sklearn-train.py", passing +in three hyperparameters ('epochs', 'batch-size', and 'learning-rate'), and using two input channel +directories ('train' and 'test'). + +.. 
code:: python + + sklearn_estimator = SKLearn('sklearn-train.py', + train_instance_type='ml.m4.xlarge', + framework_version='0.20.0', + hyperparameters = {'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1}) + sklearn_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data', + 'test': 's3://my-data-bucket/path/to/my/test/data'}) + + +Scikit-learn Estimators +^^^^^^^^^^^^^^^^^^^^^^^ + +The `SKLearn` constructor takes both required and optional arguments. + +Required arguments +'''''''''''''''''' + +The following are required arguments to the ``SKLearn`` constructor. When you create a Scikit-learn object, you must +include these in the constructor, either positionally or as keyword arguments. + +- ``entry_point`` Path (absolute or relative) to the Python file which + should be executed as the entry point to training. +- ``role`` An AWS IAM role (either name or full ARN). The Amazon + SageMaker training jobs and APIs that create Amazon SageMaker + endpoints use this role to access training data and model artifacts. + After the endpoint is created, the inference code might use the IAM + role, if accessing AWS resource. +- ``train_instance_type`` Type of EC2 instance to use for training, for + example, 'ml.m4.xlarge'. Please note that Scikit-learn does not have GPU support. + +Optional arguments +'''''''''''''''''' + +The following are optional arguments. When you create a ``SKLearn`` object, you can specify these as keyword arguments. + +- ``source_dir`` Path (absolute or relative) to a directory with any + other training source code dependencies including the entry point + file. Structure within this directory will be preserved when training + on SageMaker. +- ``hyperparameters`` Hyperparameters that will be used for training. + Will be made accessible as a dict[str, str] to the training code on + SageMaker. For convenience, accepts other types besides str, but + str() will be called on keys and values to convert them before + training. +- ``py_version`` Python version you want to use for executing your + model training code. +- ``train_volume_size`` Size in GB of the EBS volume to use for storing + input data during training. Must be large enough to store training + data if input_mode='File' is used (which is the default). +- ``train_max_run`` Timeout in seconds for training, after which Amazon + SageMaker terminates the job regardless of its current status. +- ``input_mode`` The input mode that the algorithm supports. Valid + modes: 'File' - Amazon SageMaker copies the training dataset from the + s3 location to a directory in the Docker container. 'Pipe' - Amazon + SageMaker streams data directly from s3 to the container via a Unix + named pipe. +- ``output_path`` s3 location where you want the training result (model + artifacts and optional output files) saved. If not specified, results + are stored to a default bucket. If the bucket with the specific name + does not exist, the estimator creates the bucket during the fit() + method execution. +- ``output_kms_key`` Optional KMS key ID to optionally encrypt training + output with. +- ``base_job_name`` Name to assign for the training job that the fit() + method launches. If not specified, the estimator generates a default + job name, based on the training image name and current timestamp +- ``image_name`` An alternative docker image to use for training and + serving. 
If specified, the estimator will use this image for training and + hosting, instead of selecting the appropriate SageMaker official image based on + framework_version and py_version. Refer to: `SageMaker Scikit-learn Docker Containers + <#sagemaker-scikit-learn-docker-containers>`_ for details on what the official images support + and where to find the source code to build your custom image. + + +Calling fit +^^^^^^^^^^^ + +You start your training script by calling ``fit`` on a ``SKLearn`` Estimator. ``fit`` takes both required and optional +arguments. + +Required arguments +'''''''''''''''''' + +- ``inputs``: This can take one of the following forms: A string + s3 URI, for example ``s3://my-bucket/my-training-data``. In this + case, the s3 objects rooted at the ``my-training-data`` prefix will + be available in the default ``train`` channel. A dict from + string channel names to s3 URIs. In this case, the objects rooted at + each s3 prefix will available as files in each channel directory. + +For example: + +.. code:: python + + {'train':'s3://my-bucket/my-training-data', + 'eval':'s3://my-bucket/my-evaluation-data'} + +.. optional-arguments-1: + +Optional arguments +'''''''''''''''''' + +- ``wait``: Defaults to True, whether to block and wait for the + training script to complete before returning. +- ``logs``: Defaults to True, whether to show logs produced by training + job in the Python session. Only meaningful when wait is True. + + +Saving models +~~~~~~~~~~~~~ + +In order to save your trained Scikit-learn model for deployment on SageMaker, your training script should save your +model to a certain filesystem path called `model_dir`. This value is accessible through the environment variable +``SM_MODEL_DIR``. The following code demonstrates how to save a trained Scikit-learn model named ``model`` as +``model.joblib`` at the end of training: + +.. code:: python + + from sklearn.externals import joblib + import argparse + import os + + if __name__=='__main__': + # default to the value in environment variable `SM_MODEL_DIR`. Using args makes the script more portable. + parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR']) + args, _ = parser.parse_known_args() + + # ... train classifier `clf`, then save it to `model_dir` as file 'model.joblib' + joblib.dump(clf, os.path.join(args.model_dir, "model.joblib")) + +After your training job is complete, SageMaker will compress and upload the serialized model to S3, and your model data +will available in the s3 ``output_path`` you specified when you created the Scikit-learn Estimator. + +Deploying Scikit-learn models +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After an Scikit-learn Estimator has been fit, you can host the newly created model in SageMaker. + +After calling ``fit``, you can call ``deploy`` on a ``SKLearn`` Estimator to create a SageMaker Endpoint. +The Endpoint runs a SageMaker-provided Scikit-learn model server and hosts the model produced by your training script, +which was run when you called ``fit``. This was the model you saved to ``model_dir``. + +``deploy`` returns a ``Predictor`` object, which you can use to do inference on the Endpoint hosting your Scikit-learn +model. Each ``Predictor`` provides a ``predict`` method which can do inference with numpy arrays or Python lists. +Inference arrays or lists are serialized and sent to the Scikit-learn model server by an ``InvokeEndpoint`` SageMaker +operation. + +``predict`` returns the result of inference against your model. 
By default, the inference result a NumPy array. + +.. code:: python + + # Train my estimator + sklearn_estimator = SKLearn(entry_point='train_and_deploy.py', + train_instance_type='ml.m4.xlarge', + framework_version='0.20.0') + sklearn_estimator.fit('s3://my_bucket/my_training_data/') + + # Deploy my estimator to a SageMaker Endpoint and get a Predictor + predictor = sklearn_estimator.deploy(instance_type='ml.m4.xlarge', + initial_instance_count=1) + + # `data` is a NumPy array or a Python list. + # `response` is a NumPy array. + response = predictor.predict(data) + +You use the SageMaker Scikit-learn model server to host your Scikit-learn model when you call ``deploy`` +on an ``SKLearn`` Estimator. The model server runs inside a SageMaker Endpoint, which your call to ``deploy`` creates. +You can access the name of the Endpoint by the ``name`` property on the returned ``Predictor``. + + +SageMaker Scikit-learn Model Server +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Scikit-learn Endpoint you create with ``deploy`` runs a SageMaker Scikit-learn model server. +The model server loads the model that was saved by your training script and performs inference on the model in response +to SageMaker InvokeEndpoint API calls. + +You can configure two components of the SageMaker Scikit-learn model server: Model loading and model serving. +Model loading is the process of deserializing your saved model back into an Scikit-learn model. +Serving is the process of translating InvokeEndpoint requests to inference calls on the loaded model. + +You configure the Scikit-learn model server by defining functions in the Python source file you passed to the +Scikit-learn constructor. + +Model loading +^^^^^^^^^^^^^ + +Before a model can be served, it must be loaded. The SageMaker Scikit-learn model server loads your model by invoking a +``model_fn`` function that you must provide in your script. The ``model_fn`` should have the following signature: + +.. code:: python + + def model_fn(model_dir) + +SageMaker will inject the directory where your model files and sub-directories, saved by ``save``, have been mounted. +Your model function should return a model object that can be used for model serving. + +SageMaker provides automated serving functions that work with Gluon API ``net`` objects and Module API ``Module`` objects. If you return either of these types of objects, then you will be able to use the default serving request handling functions. + +The following code-snippet shows an example ``model_fn`` implementation. +This loads returns a Scikit-learn Classifier from a ``model.joblib`` file in the SageMaker model directory +``model_dir``. + +.. code:: python + + from sklearn.externals import joblib + import os + + def model_fn(model_dir): + clf = joblib.load(os.path.join(model_dir, "model.joblib")) + return clf + +Model serving +^^^^^^^^^^^^^ + +After the SageMaker model server has loaded your model by calling ``model_fn``, SageMaker will serve your model. +Model serving is the process of responding to inference requests, received by SageMaker InvokeEndpoint API calls. +The SageMaker Scikit-learn model server breaks request handling into three steps: + + +- input processing, +- prediction, and +- output processing. + +In a similar way to model loading, you configure these steps by defining functions in your Python source file. + +Each step involves invoking a python function, with information about the request and the return-value from the previous +function in the chain. 
Inside the SageMaker Scikit-learn model server, the process looks like: + +.. code:: python + + # Deserialize the Invoke request body into an object we can perform prediction on + input_object = input_fn(request_body, request_content_type) + + # Perform prediction on the deserialized object, with the loaded model + prediction = predict_fn(input_object, model) + + # Serialize the prediction result into the desired response content type + output = output_fn(prediction, response_content_type) + +The above code-sample shows the three function definitions: + +- ``input_fn``: Takes request data and deserializes the data into an + object for prediction. +- ``predict_fn``: Takes the deserialized request object and performs + inference against the loaded model. +- ``output_fn``: Takes the result of prediction and serializes this + according to the response content type. + +The SageMaker Scikit-learn model server provides default implementations of these functions. +You can provide your own implementations for these functions in your hosting script. +If you omit any definition then the SageMaker Scikit-learn model server will use its default implementation for that +function. + +The ``RealTimePredictor`` used by Scikit-learn in the SageMaker Python SDK serializes NumPy arrays to the `NPY `_ format +by default, with Content-Type ``application/x-npy``. The SageMaker Scikit-learn model server can deserialize NPY-formatted +data (along with JSON and CSV data). + +If you rely solely on the SageMaker Scikit-learn model server defaults, you get the following functionality: + +- Prediction on models that implement the ``__call__`` method +- Serialization and deserialization of NumPy arrays. + +The default ``input_fn`` and ``output_fn`` are meant to make it easy to predict on NumPy arrays. If your model expects +a NumPy array and returns a NumPy array, then these functions do not have to be overridden when sending NPY-formatted +data. + +In the following sections we describe the default implementations of input_fn, predict_fn, and output_fn. +We describe the input arguments and expected return types of each, so you can define your own implementations. + +Input processing +'''''''''''''''' + +When an InvokeEndpoint operation is made against an Endpoint running a SageMaker Scikit-learn model server, +the model server receives two pieces of information: + +- The request Content-Type, for example "application/x-npy" +- The request data body, a byte array + +The SageMaker Scikit-learn model server will invoke an "input_fn" function in your hosting script, +passing in this information. If you define an ``input_fn`` function definition, +it should return an object that can be passed to ``predict_fn`` and have the following signature: + +.. code:: python + + def input_fn(request_body, request_content_type) + +Where ``request_body`` is a byte buffer and ``request_content_type`` is a Python string + +The SageMaker Scikit-learn model server provides a default implementation of ``input_fn``. +This function deserializes JSON, CSV, or NPY encoded data into a NumPy array. + +Default NPY deserialization requires ``request_body`` to follow the `NPY `_ format. For Scikit-learn, the Python SDK +defaults to sending prediction requests with this format. + +Default json deserialization requires ``request_body`` contain a single json list. +Sending multiple json objects within the same ``request_body`` is not supported. +The list must have a dimensionality compatible with the model loaded in ``model_fn``. 
+The list's shape must be identical to the model's input shape for all dimensions after the first (the first
+dimension is the batch size).
+
+Default CSV deserialization requires ``request_body`` to contain one or more lines of CSV numerical data.
+The data is loaded into a two-dimensional array, where each line break defines the boundaries of the first dimension.
+
+The example below shows a custom ``input_fn`` for deserializing pickled NumPy arrays.
+
+.. code:: python
+
+    from io import BytesIO
+
+    import numpy as np
+
+    def input_fn(request_body, request_content_type):
+        """An input_fn that loads a pickled NumPy array"""
+        if request_content_type == "application/python-pickle":
+            array = np.load(BytesIO(request_body), allow_pickle=True)
+            return array
+        else:
+            # Handle other content-types here or raise an Exception
+            # if the content type is not supported.
+            pass
+
+
+
+Prediction
+''''''''''
+
+After the inference request has been deserialized by ``input_fn``, the SageMaker Scikit-learn model server invokes
+``predict_fn`` on the return value of ``input_fn``.
+
+As with ``input_fn``, you can define your own ``predict_fn`` or use the SageMaker Scikit-learn model server default.
+
+The ``predict_fn`` function has the following signature:
+
+.. code:: python
+
+    def predict_fn(input_object, model)
+
+Where ``input_object`` is the object returned from ``input_fn`` and
+``model`` is the model loaded by ``model_fn``.
+
+The default implementation of ``predict_fn`` invokes the loaded model's ``__call__`` function on ``input_object``,
+and returns the resulting value. The return type should be a NumPy array to be compatible with the default
+``output_fn``.
+
+The example below shows an overridden ``predict_fn`` for a Logistic Regression classifier. This model accepts a
+Python list or NumPy array and returns both predictions and prediction probabilities in a single NumPy array.
+This ``predict_fn`` can rely on the default ``input_fn`` and ``output_fn`` because ``input_data`` is a NumPy array,
+and the return value of this function is a NumPy array.
+
+.. code:: python
+
+    import sklearn
+    import numpy as np
+
+    def predict_fn(input_data, model):
+        prediction = model.predict(input_data)
+        pred_prob = model.predict_proba(input_data)
+        return np.array([prediction, pred_prob])
+
+If you implement your own prediction function, you should take care to ensure that:
+
+- The first argument is expected to be the return value from ``input_fn``.
+  If you use the default ``input_fn``, this will be a NumPy array.
+- The second argument is the loaded model.
+- The return value should be of the correct type to be passed as the
+  first argument to ``output_fn``. If you use the default
+  ``output_fn``, this should be a NumPy array.
+
+Output processing
+'''''''''''''''''
+
+After invoking ``predict_fn``, the model server invokes ``output_fn``, passing in the return value from ``predict_fn``
+and the InvokeEndpoint requested response content type.
+
+The ``output_fn`` has the following signature:
+
+.. code:: python
+
+    def output_fn(prediction, content_type)
+
+Where ``prediction`` is the result of invoking ``predict_fn`` and
+``content_type`` is the InvokeEndpoint requested response content type.
+The function should return a byte array of data serialized to ``content_type``.
+
+The default implementation expects ``prediction`` to be a NumPy array and can serialize the result to JSON, CSV, or NPY.
+It accepts response content types of "application/json", "text/csv", and "application/x-npy".
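+The sketch below (not the model server's built-in implementation) illustrates this contract with a hypothetical custom
+``output_fn`` that serializes the prediction returned by ``predict_fn`` to either JSON or CSV, depending on the
+requested response content type:
+
+.. code:: python
+
+    import json
+
+    import numpy as np
+
+    def output_fn(prediction, content_type):
+        """A minimal sketch of a custom output_fn for NumPy array predictions."""
+        results = np.asarray(prediction)
+        if content_type == "application/json":
+            return json.dumps(results.tolist())
+        if content_type == "text/csv":
+            # Each row of the prediction array becomes one CSV line.
+            return "\n".join(",".join(str(value) for value in np.atleast_1d(row)) for row in results)
+        raise ValueError("Unsupported response content type: {}".format(content_type))
+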
+
+Working with existing model data and training jobs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Attaching to existing training jobs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can attach a Scikit-learn Estimator to an existing training job using the
+``attach`` method.
+
+.. code:: python
+
+    my_training_job_name = "MyAwesomeSKLearnTrainingJob"
+    sklearn_estimator = SKLearn.attach(my_training_job_name)
+
+After attaching, if the training job is in a Complete status, it can be
+``deploy``\ ed to create a SageMaker Endpoint and return a
+``Predictor``. If the training job is in progress,
+``attach`` blocks and displays log messages from the training job until the training job completes.
+
+The ``attach`` method accepts the following arguments:
+
+- ``training_job_name (str):`` The name of the training job to attach
+  to.
+- ``sagemaker_session (sagemaker.Session or None):`` The Session used
+  to interact with SageMaker.
+
+Deploying Endpoints from model data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As well as attaching to existing training jobs, you can deploy models directly from model data in S3.
+The following code sample shows how to do this, using the ``SKLearnModel`` class.
+
+.. code:: python
+
+    sklearn_model = SKLearnModel(model_data="s3://bucket/model.tar.gz", role="SageMakerRole",
+                                 entry_point="transform_script.py")
+
+    predictor = sklearn_model.deploy(instance_type="ml.c4.xlarge", initial_instance_count=1)
+
+The ``SKLearnModel`` constructor takes the following arguments:
+
+- ``model_data (str):`` An S3 location of a SageMaker model data
+  .tar.gz file
+- ``image (str):`` A Docker image URI
+- ``role (str):`` An IAM role name or ARN for SageMaker to access AWS
+  resources on your behalf.
+- ``predictor_cls (callable[string, sagemaker.Session]):`` A function to
+  call to create a predictor. If not None, ``deploy`` will return the
+  result of invoking this function on the created endpoint name.
+- ``env (dict[string, string]):`` Environment variables to run with
+  ``image`` when hosted in SageMaker.
+- ``name (str):`` The model name. If None, a default model name will be
+  selected on each ``deploy``.
+- ``entry_point (str):`` Path (absolute or relative) to the Python file
+  which should be executed as the entry point to model hosting.
+- ``source_dir (str):`` Optional. Path (absolute or relative) to a
+  directory with any other training source code dependencies including
+  the entry point file. Structure within this directory will be
+  preserved when training on SageMaker.
+- ``enable_cloudwatch_metrics (boolean):`` Optional. If true, training
+  and hosting containers will generate CloudWatch metrics under the
+  AWS/SageMakerContainer namespace.
+- ``container_log_level (int):`` Log level to use within the container.
+  Valid values are defined in the Python logging module.
+- ``code_location (str):`` Optional. Name of the S3 bucket where your
+  custom code will be uploaded to. If not specified, the
+  SageMaker default bucket created by sagemaker.Session is used.
+- ``sagemaker_session (sagemaker.Session):`` The SageMaker Session
+  object, used for SageMaker interaction.
+
+Your model data must be a .tar.gz file in S3. SageMaker training jobs save model data as .tar.gz files in S3; however,
+if you have local model data that you want to deploy, you can prepare the archive yourself.
+
+Assuming you have a local directory named "my_model" that contains your model data, you can tar and gzip compress the
+directory and upload it to S3 using the following commands:
+
+::
+
+    tar -czf model.tar.gz my_model
+    aws s3 cp model.tar.gz s3://my-bucket/my-path/model.tar.gz
+
+This compresses the contents of my_model into a gzipped tar file and uploads it to S3 in the bucket "my-bucket", with
+the key "my-path/model.tar.gz".
+
+To run these commands, you need the AWS CLI installed. Please refer to our `FAQ <#FAQ>`__ for more information on
+installing it.
+
+Scikit-learn Training Examples
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Amazon provides example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using Scikit-learn.
+Please refer to:
+
+https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk
+
+These are also available in SageMaker Notebook Instance hosted Jupyter notebooks under the "sample notebooks" folder.
+
+
+SageMaker Scikit-learn Docker Containers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When you run training or host models, SageMaker runs your Python script in a Docker container with several
+libraries installed. When creating the Estimator and calling ``deploy`` to create the SageMaker Endpoint, you can control
+the environment your script runs in.
+
+SageMaker runs Scikit-learn Estimator scripts in either Python 2.7 or Python 3.5. You can select the Python version by
+passing a ``py_version`` keyword arg to the Scikit-learn Estimator constructor. Setting this to py3 (the default) causes
+your training script to be run on Python 3.5. Setting this to py2 causes your training script to be run on Python 2.7.
+This Python version applies to both the Training Job, created by ``fit``, and the Endpoint, created by ``deploy``.
+
+The Scikit-learn Docker images have the following dependencies installed:
+
++----------------------+----------------+
+| Dependencies         | sklearn 0.20.0 |
++----------------------+----------------+
+| sklearn              | 0.20.0         |
++----------------------+----------------+
+| sagemaker            | 1.11.3         |
++----------------------+----------------+
+| sagemaker-containers | 2.2.4          |
++----------------------+----------------+
+| numpy                | 1.15.2         |
++----------------------+----------------+
+| pandas               | 0.23.4         |
++----------------------+----------------+
+| Pillow               | 3.1.2          |
++----------------------+----------------+
+| Python               | 2.7 or 3.5     |
++----------------------+----------------+
+
+You can see the full list by calling ``pip freeze`` from the running Docker image.
+
+The Docker images extend Ubuntu 16.04.
+
+You can select the version of Scikit-learn by passing a ``framework_version`` keyword arg to the Scikit-learn Estimator
+constructor. Currently supported versions are listed in the above table. You can also set ``framework_version`` to
+specify only the major and minor version, which will cause your training script to be run on the latest supported patch
+version of that minor version.
+
+Alternatively, you can build your own image by following the instructions in the SageMaker Scikit-learn containers
+repository, and passing ``image_name`` to the Scikit-learn Estimator constructor.
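+For example, the following sketch shows both options side by side. The entry point, role, and ECR image URI below are
+placeholders, not values defined elsewhere in this documentation:
+
+.. code:: python
+
+    from sagemaker.sklearn.estimator import SKLearn
+
+    # Use the official SageMaker Scikit-learn image for a specific framework version
+    sklearn_estimator = SKLearn(entry_point='train.py',
+                                role='SageMakerRole',
+                                train_instance_count=1,
+                                train_instance_type='ml.c4.xlarge',
+                                framework_version='0.20.0',
+                                py_version='py3')
+
+    # Or train and host with a custom image built from the SageMaker Scikit-learn containers repository
+    custom_image_estimator = SKLearn(entry_point='train.py',
+                                     role='SageMakerRole',
+                                     train_instance_count=1,
+                                     train_instance_type='ml.c4.xlarge',
+                                     image_name='123456789012.dkr.ecr.us-west-2.amazonaws.com/my-sklearn:latest')
+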
+sagemaker-containers +You can visit the SageMaker Scikit-learn containers repository here: https://github.com/aws/sagemaker-scikit-learn-container/ diff --git a/doc/using_tf.rst b/doc/using_tf.rst index 1d4de934e4..db5d6ae141 100644 --- a/doc/using_tf.rst +++ b/doc/using_tf.rst @@ -1,981 +1,985 @@ -############################################## -Using TensorFlow with the SageMaker Python SDK -############################################## - -TensorFlow SageMaker Estimators allow you to run your own TensorFlow -training algorithms on SageMaker Learner, and to host your own TensorFlow -models on SageMaker Hosting. - -For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`. - -.. warning:: - We have added a new format of your TensorFlow training script with TensorFlow version 1.11. - This new way gives the user script more flexibility. - This new format is called Script Mode, as opposed to Legacy Mode, which is what we support with TensorFlow 1.11 and older versions. - In addition we are adding Python 3 support with Script Mode. - The last supported version of Legacy Mode will be TensorFlow 1.12. - Script Mode is available with TensorFlow version 1.11 and newer. - Make sure you refer to the correct version of this README when you prepare your script. - You can find the Legacy Mode README `here `_. - -.. contents:: - -Supported versions of TensorFlow for Elastic Inference: ``1.11.0``, ``1.12.0``. - - -***************************** -Train a Model with TensorFlow -***************************** - -To train a TensorFlow model by using the SageMaker Python SDK: - -1. `Prepare a training script <#prepare-a-script-mode-training-script>`_ -2. `Create a ``sagemaker.tensorflow.TensorFlow`` estimator <#create-an-estimator>`_ -3. `Call the estimator's `fit` method <#call-fit>`_ - -Prepare a Script Mode Training Script -====================================== - -Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file. - -The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: - -* ``SM_MODEL_DIR``: A string that represents the local path where the training job writes the model artifacts to. - After training, artifacts in this directory are uploaded to S3 for model hosting. This is different than the ``model_dir`` - argument passed in your training script, which is an S3 location. ``SM_MODEL_DIR`` is always set to ``/opt/ml/model``. -* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. -* ``SM_OUTPUT_DATA_DIR``: A string that represents the path to the directory to write output artifacts to. - Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. - These artifacts are compressed and uploaded to S3 to an S3 bucket with the same prefix as the model artifacts. -* ``SM_CHANNEL_XXXX``: A string that represents the path to the directory that contains the input data for the specified channel. - For example, if you specify two input channels in the TensorFlow estimator's ``fit`` call, named 'train' and 'test', the environment variables ``SM_CHANNEL_TRAIN`` and ``SM_CHANNEL_TEST`` are set. - -For the exhaustive list of available environment variables, see the `SageMaker Containers documentation `_. 
- -A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to ``SM_CHANNEL_TRAIN`` so that it can be deployed for inference later. -Hyperparameters are passed to your script as arguments and can be retrieved with an ``argparse.ArgumentParser`` instance. -For example, a training script might start with the following: - -.. code:: python - - import argparse - import os - - if __name__ =='__main__': - - parser = argparse.ArgumentParser() - - # hyperparameters sent by the client are passed as command-line arguments to the script. - parser.add_argument('--epochs', type=int, default=10) - parser.add_argument('--batch_size', type=int, default=100) - parser.add_argument('--learning_rate', type=float, default=0.1) - - # input data and model directories - parser.add_argument('--model_dir', type=str) - parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) - parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST')) - - args, _ = parser.parse_known_args() - - # ... load from args.train and args.test, train a model, write model to args.model_dir. - -Because the SageMaker imports your training script, putting your training launching code in a main guard (``if __name__=='__main__':``) -is good practice. - -Note that SageMaker doesn't support argparse actions. -For example, if you want to use a boolean hyperparameter, specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this hyperparameter when you create the TensorFlow estimator. - -For a complete example of a TensorFlow training script, see < - - -Adapting your local TensorFlow script -------------------------------------- - -If you have a TensorFlow training script that runs outside of SageMaker, do the following to adapt the script to run in SageMaker: - -1. Make sure your script can handle ``--model_dir`` as an additional command line argument. If you did not specify a -location when you created the TensorFlow estimator, an S3 location under the default training job bucket is used. -Distributed training with parameter servers requires you to use the ``tf.estimator.train_and_evaluate`` API and -to provide an S3 location as the model directory during training. Here is an example: - -.. code:: python - - estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir=args.model_dir) - ... - train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=1000) - eval_spec = tf.estimator.EvalSpec(eval_input_fn) - tf.estimator.train_and_evaluate(mnist_classifier, train_spec, eval_spec) - -2. Load input data from the input channels. The input channels are defined when ``fit`` is called. For example: - -.. code:: python - - estimator.fit({'train':'s3://my-bucket/my-training-data', - 'eval':'s3://my-bucket/my-evaluation-data'}) - -In your training script the channels will be stored in environment variables ``SM_CHANNEL_TRAIN`` and -``SM_CHANNEL_EVAL``. You can add them to your argument parsing logic like this: - -.. code:: python - - parser = argparse.ArgumentParser() - parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN')) - parser.add_argument('--eval', type=str, default=os.environ.get('SM_CHANNEL_EVAL')) - -3. Export your final model to path stored in environment variable ``SM_MODEL_DIR`` which should always be - ``/opt/ml/model``. At end of training SageMaker will upload the model file under ``/opt/ml/model`` to - ``output_path``. 
- - -Create an Estimator -=================== - -After you create your training script, create an instance of the :class:`sagemaker.tensorflow.TensorFlow` estimator. - -To use Script Mode, set at least one of these args - -- ``py_version='py3'`` -- ``script_mode=True`` - -When using Script Mode, your training script needs to accept the following args: - -- ``model_dir`` - -The following args are not permitted when using Script Mode: - -- ``checkpoint_path`` -- ``training_steps`` -- ``evaluation_steps`` -- ``requirements_file`` - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole', - train_instance_count=1, train_instance_type='ml.p2.xlarge', - framework_version='1.12', py_version='py3') - tf_estimator.fit('s3://bucket/path/to/training/data') - -Where the S3 url is a path to your training data within Amazon S3. -The constructor keyword arguments define how SageMaker runs your training script. - -For more information about the sagemaker.tensorflow.TensorFlow estimator, see `sagemaker.tensorflow.TensorFlow Class`_. - -Call ``fit`` -============ - -You start your training script by calling the ``fit`` method on a ``TensorFlow`` estimator. ``fit`` takes -both required and optional arguments. - -Required arguments ------------------- - -- ``inputs``: The S3 location(s) of datasets to be used for training. This can take one of two forms: - - - ``str``: An S3 URI, for example ``s3://my-bucket/my-training-data``, which indicates the dataset's location. - - ``dict[str, str]``: A dictionary mapping channel names to S3 locations, for example ``{'train': 's3://my-bucket/my-training-data/train', 'test': 's3://my-bucket/my-training-data/test'}`` - - ``sagemaker.session.s3_input``: channel configuration for S3 data sources that can provide additional information as well as the path to the training dataset. See `the API docs `_ for full details. - -Optional arguments ------------------- - -- ``wait (bool)``: Defaults to True, whether to block and wait for the - training script to complete before returning. - If set to False, it will return immediately, and can later be attached to. -- ``logs (bool)``: Defaults to True, whether to show logs produced by training - job in the Python session. Only meaningful when wait is True. -- ``run_tensorboard_locally (bool)``: Defaults to False. If set to True a Tensorboard command will be printed out. -- ``job_name (str)``: Training job name. If not specified, the estimator generates a default job name, - based on the training image name and current timestamp. - -What happens when fit is called -------------------------------- - -Calling ``fit`` starts a SageMaker training job. The training job will execute the following. - -- Starts ``train_instance_count`` EC2 instances of the type ``train_instance_type``. -- On each instance, it will do the following steps: - - - starts a Docker container optimized for TensorFlow. - - downloads the dataset. - - setup up training related environment varialbes - - setup up distributed training environment if configured to use parameter server - - starts asynchronous training - -If the ``wait=False`` flag is passed to ``fit``, then it returns immediately. The training job continues running -asynchronously. Later, a Tensorflow estimator can be obtained by attaching to the existing training job. -If the training job is not finished, it starts showing the standard output of training and wait until it completes. 
-After attaching, the estimator can be deployed as usual. - -.. code:: python - - tf_estimator.fit(your_input_data, wait=False) - training_job_name = tf_estimator.latest_training_job.name - - # after some time, or in a separate Python notebook, we can attach to it again. - - tf_estimator = TensorFlow.attach(training_job_name=training_job_name) - -Distributed Training -==================== - -To run your training job with multiple instances in a distributed fashion, set ``train_instance_count`` -to a number larger than 1. We support two different types of distributed training, parameter server and Horovod. -The ``distributions`` parameter is used to configure which distributed training strategy to use. - -Training with parameter servers -------------------------------- - -If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server -thread on each instance in the training cluster, and then executes your training code. You can find more information on -TensorFlow distributed training at `TensorFlow docs `__. -To enable parameter server training: - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole', - train_instance_count=2, train_instance_type='ml.p2.xlarge', - framework_version='1.11', py_version='py3', - distributions={'parameter_server': {'enabled': True}}) - tf_estimator.fit('s3://bucket/path/to/training/data') - -Training with Horovod ---------------------- - -Horovod is a distributed training framework based on MPI. Horovod is only available with TensorFlow version ``1.12`` or newer. -You can find more details at `Horovod README `__. - -The container sets up the MPI environment and executes the ``mpirun`` command enabling you to run any Horovod -training script with Script Mode. - -Training with ``MPI`` is configured by specifying following fields in ``distributions``: - -- ``enabled (bool)``: If set to ``True``, the MPI setup is performed and ``mpirun`` command is executed. -- ``processes_per_host (int)``: Number of processes MPI should launch on each host. Note, this should not be - greater than the available slots on the selected instance type. This flag should be set for the multi-cpu/gpu - training. -- ``custom_mpi_options (str)``: Any `mpirun` flag(s) can be passed in this field that will be added to the `mpirun` - command executed by SageMaker to launch distributed horovod training. - - -In the below example we create an estimator to launch Horovod distributed training with 2 processes on one host: - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole', - train_instance_count=1, train_instance_type='ml.p2.xlarge', - framework_version='1.12', py_version='py3', - distributions={ - 'mpi': { - 'enabled': True, - 'processes_per_host': 2, - 'custom_mpi_options': '--NCCL_DEBUG INFO' - } - }) - tf_estimator.fit('s3://bucket/path/to/training/data') - - -Training with Pipe Mode using PipeModeDataset -============================================= - -Amazon SageMaker allows users to create training jobs using Pipe input mode. -With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. -This means that your training jobs start sooner, finish quicker, and need less disk space. 
- -SageMaker TensorFlow provides an implementation of ``tf.data.Dataset`` that makes it easy to take advantage of Pipe -input mode in SageMaker. You can replace your ``tf.data.Dataset`` with a ``sagemaker_tensorflow.PipeModeDataset`` to -read TFRecords as they are streamed to your training instances. - -In your ``entry_point`` script, you can use ``PipeModeDataset`` like a ``Dataset``. In this example, we create a -``PipeModeDataset`` to read TFRecords from the 'training' channel: - - -.. code:: python - - from sagemaker_tensorflow import PipeModeDataset - - features = { - 'data': tf.FixedLenFeature([], tf.string), - 'labels': tf.FixedLenFeature([], tf.int64), - } - - def parse(record): - parsed = tf.parse_single_example(record, features) - return ({ - 'data': tf.decode_raw(parsed['data'], tf.float64) - }, parsed['labels']) - - def train_input_fn(training_dir, hyperparameters): - ds = PipeModeDataset(channel='training', record_format='TFRecord') - ds = ds.repeat(20) - ds = ds.prefetch(10) - ds = ds.map(parse, num_parallel_calls=10) - ds = ds.batch(64) - return ds - - -To run training job with Pipe input mode, pass in ``input_mode='Pipe'`` to your TensorFlow Estimator: - - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole', - training_steps=10000, evaluation_steps=100, - train_instance_count=1, train_instance_type='ml.p2.xlarge', - framework_version='1.10.0', input_mode='Pipe') - - tf_estimator.fit('s3://bucket/path/to/training/data') - - -If your TFRecords are compressed, you can train on Gzipped TF Records by passing in ``compression='Gzip'`` to the call to -``fit()``, and SageMaker will automatically unzip the records as data is streamed to your training instances: - -.. code:: python - - from sagemaker.session import s3_input - - train_s3_input = s3_input('s3://bucket/path/to/training/data', compression='Gzip') - tf_estimator.fit(train_s3_input) - - -You can learn more about ``PipeModeDataset`` in the sagemaker-tensorflow-extensions repository: https://github.com/aws/sagemaker-tensorflow-extensions - - -Training with MKL-DNN disabled -============================== - -SageMaker TensorFlow CPU images use TensorFlow built with Intel® MKL-DNN optimization. - -In certain cases you might be able to get a better performance by disabling this optimization -(`for example when using small models `_) - -You can disable MKL-DNN optimization for TensorFlow ``1.8.0`` and above by setting two following environment variables: - -.. code:: python - - import os - - os.environ['TF_DISABLE_MKL'] = '1' - os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1' - -******************************** -Deploy TensorFlow Serving models -******************************** - -After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel in -the S3 location defined by ``output_path``. You can call ``deploy`` on a TensorFlow -estimator to create a SageMaker Endpoint, or you can call ``transformer`` to create a ``Transformer`` that you can use to run a batch transform job. - -Your model will be deployed to a TensorFlow Serving-based server. The server provides a super-set of the -`TensorFlow Serving REST API `_. - - -Deploy to a SageMaker Endpoint -============================== - -Deploying from an Estimator ---------------------------- - -After a TensorFlow estimator has been fit, it saves a TensorFlow -`SavedModel `_ bundle in -the S3 location defined by ``output_path``. 
You can call ``deploy`` on a TensorFlow -estimator object to create a SageMaker Endpoint: - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - estimator = TensorFlow(entry_point='tf-train.py', ..., train_instance_count=1, - train_instance_type='ml.c4.xlarge', framework_version='1.11') - - estimator.fit(inputs) - - predictor = estimator.deploy(initial_instance_count=1, - instance_type='ml.c5.xlarge', - endpoint_type='tensorflow-serving') - - -The code block above deploys a SageMaker Endpoint with one instance of the type 'ml.c5.xlarge'. - -What happens when deploy is called -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Calling ``deploy`` starts the process of creating a SageMaker Endpoint. This process includes the following steps. - -- Starts ``initial_instance_count`` EC2 instances of the type ``instance_type``. -- On each instance, it will do the following steps: - - - start a Docker container optimized for TensorFlow Serving, see `SageMaker TensorFlow Serving containers `_. - - start a `TensorFlow Serving` process configured to run your model. - - start an HTTP server that provides access to TensorFlow Server through the SageMaker InvokeEndpoint API. - - -When the ``deploy`` call finishes, the created SageMaker Endpoint is ready for prediction requests. The -`Making predictions against a SageMaker Endpoint`_ section will explain how to make prediction requests -against the Endpoint. - -Deploying directly from model artifacts ---------------------------------------- - -If you already have existing model artifacts in S3, you can skip training and deploy them directly to an endpoint: - -.. code:: python - - from sagemaker.tensorflow.serving import Model - - model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') - - predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge') - -Python-based TensorFlow serving on SageMaker has support for `Elastic Inference `__, which allows for inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance. In order to attach an Elastic Inference accelerator to your endpoint provide the accelerator type to accelerator_type to your deploy call. - -.. code:: python - - from sagemaker.tensorflow.serving import Model - - model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') - - predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge', accelerator_type='ml.eia1.medium') - -Making predictions against a SageMaker Endpoint ------------------------------------------------ - -Once you have the ``Predictor`` instance returned by ``model.deploy(...)`` or ``estimator.deploy(...)``, you -can send prediction requests to your Endpoint. - -The following code shows how to make a prediction request: - -.. code:: python - - input = { - 'instances': [1.0, 2.0, 5.0] - } - result = predictor.predict(input) - -The result object will contain a Python dict like this: - -.. code:: python - - { - 'predictions': [3.5, 4.0, 5.5] - } - -The formats of the input and the output data correspond directly to the request and response formats -of the ``Predict`` method in the `TensorFlow Serving REST API `_. - -If your SavedModel includes the right ``signature_def``, you can also make Classify or Regress requests: - -.. code:: python - - # input matches the Classify and Regress API - input = { - 'signature_name': 'tensorflow/serving/regress', - 'examples': [{'x': 1.0}, {'x': 2.0}] - } - - result = predictor.regress(input) # or predictor.classify(...) 
- - # result contains: - { - 'results': [3.5, 4.0] - } - -You can include multiple ``instances`` in your predict request (or multiple ``examples`` in -classify/regress requests) to get multiple prediction results in one request to your Endpoint: - -.. code:: python - - input = { - 'instances': [ - [1.0, 2.0, 5.0], - [1.0, 2.0, 5.0], - [1.0, 2.0, 5.0] - ] - } - result = predictor.predict(input) - - # result contains: - { - 'predictions': [ - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5] - ] - } - -If your application allows request grouping like this, it is **much** more efficient than making separate requests. - -See `Deploying to TensorFlow Serving Endpoints ` to learn how to deploy your model and make inference requests. - -Run a Batch Transform Job -========================= - -Batch transform allows you to get inferences for an entire dataset that is stored in an S3 bucket. - -For general information about using batch transform with the SageMaker Python SDK, see :ref:`overview:SageMaker Batch Transform`. -For information about SageMaker batch transform, see `Get Inferences for an Entire Dataset with Batch Transform ` in the AWS documentation. - -To run a batch transform job, you first create a ``Transformer`` object, and then call that object's ``transform`` method. - -Create a Transformer Object ---------------------------- - -If you used an estimator to train your model, you can call the ``transformer`` method of the estimator to create a ``Transformer`` object. - -For example: - -.. code:: python - - bucket = myBucket # The name of the S3 bucket where the results are stored - prefix = 'batch-results' # The folder in the S3 bucket where the results are stored - - batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results - - tf_transformer = tf_estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge, output_path=batch_output) - -To use a model trained outside of SageMaker, you can package the model as a SageMaker model, and call the ``transformer`` method of the SageMaker model. - -For example: - -For example: - -.. code:: python - - bucket = myBucket # The name of the S3 bucket where the results are stored - prefix = 'batch-results' # The folder in the S3 bucket where the results are stored - - batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results - - tf_transformer = tensorflow_serving_model.transformer(instance_count=1, instance_type='ml.m4.xlarge, output_path=batch_output) - -For information about how to package a model as a SageMaker model, see :ref:`overview:BYO Model`. -When you call the ``tranformer`` method, you specify the type and number of instances to use for the batch transform job, and the location where the results are stored in S3. - - - -Call transform --------------- - -After you create a ``Transformer`` object, you call that object's ``transform`` method to start a batch transform job. -For example: - -.. code:: python - - batch_input = 's3://{}/{}/test/examples'.format(bucket, prefix) # The location of the input dataset - - tf_transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line') - -In the example, the content type is CSV, and each line in the dataset is treated as a record to get a predition for. 
- -Batch Transform Supported Data Formats --------------------------------------- - -When you call the ``tranform`` method to start a batch transform job, -you specify the data format by providing a MIME type as the value for the ``content_type`` parameter. - -The following content formats are supported without custom intput and output handling: - -* CSV - specify ``text/csv`` as the value of the ``content_type`` parameter. -* JSON - specify ``application/json`` as the value of the ``content_type`` parameter. -* JSON lines - specify ``application/jsonlines`` as the value of the ``content_type`` parameter. - -For detailed information about how TensorFlow Serving formats these data types for input and output, see :ref:`using_tf:TensorFlow Serving Input and Output`. - -You can also accept any custom data format by writing input and output functions, and include them in the ``inference.py`` file in your model. -For information, see :ref:`using_tf:Create Python Scripts for Custom Input and Output Formats`. - - -TensorFlow Serving Input and Output -=================================== - -The following sections describe the data formats that TensorFlow Serving endpoints and batch transform jobs accept, -and how to write input and output functions to input and output custom data formats. - -Supported Formats ------------------ - -SageMaker's TensforFlow Serving endpoints can also accept some additional input formats that are not part of the -TensorFlow REST API, including a simplified json format, line-delimited json objects ("jsons" or "jsonlines"), and -CSV data. - -Simplified JSON Input -^^^^^^^^^^^^^^^^^^^^^ - -The Endpoint will accept simplified JSON input that doesn't match the TensorFlow REST API's Predict request format. -When the Endpoint receives data like this, it will attempt to transform it into a valid -Predict request, using a few simple rules: - -- python value, dict, or one-dimensional arrays are treated as the input value in a single 'instance' Predict request. -- multidimensional arrays are treated as a multiple values in a multi-instance Predict request. - -Combined with the client-side ``Predictor`` object's JSON serialization, this allows you to make simple -requests like this: - -.. code:: python - - input = [ - [1.0, 2.0, 5.0], - [1.0, 2.0, 5.0] - ] - result = predictor.predict(input) - - # result contains: - { - 'predictions': [ - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5] - ] - } - -Or this: - -.. code:: python - - # 'x' must match name of input tensor in your SavedModel graph - # for models with multiple named inputs, just include all the keys in the input dict - input = { - 'x': [1.0, 2.0, 5.0] - } - - # result contains: - { - 'predictions': [ - [3.5, 4.0, 5.5] - ] - } - - -Line-delimited JSON -^^^^^^^^^^^^^^^^^^^ - -The Endpoint will accept line-delimited JSON objects (also known as "jsons" or "jsonlines" data). -The Endpoint treats each line as a separate instance in a multi-instance Predict request. To use -this feature from your python code, you need to create a ``Predictor`` instance that does not -try to serialize your input to JSON: - -.. 
code:: python - - # create a Predictor without JSON serialization - - predictor = Predictor('endpoint-name', serializer=None, content_type='application/jsonlines') - - input = '''{'x': [1.0, 2.0, 5.0]} - {'x': [1.0, 2.0, 5.0]} - {'x': [1.0, 2.0, 5.0]}''' - - result = predictor.predict(input) - - # result contains: - { - 'predictions': [ - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5] - ] - } - -This feature is especially useful if you are reading data from a file containing jsonlines data. - -**CSV (comma-separated values)** - -The Endpoint will accept CSV data. Each line is treated as a separate instance. This is a -compact format for representing multiple instances of 1-d array data. To use this feature -from your python code, you need to create a ``Predictor`` instance that can serialize -your input data to CSV format: - -.. code:: python - - # create a Predictor with JSON serialization - - predictor = Predictor('endpoint-name', serializer=sagemaker.predictor.csv_serializer) - - # CSV-formatted string input - input = '1.0,2.0,5.0\n1.0,2.0,5.0\n1.0,2.0,5.0' - - result = predictor.predict(input) - - # result contains: - { - 'predictions': [ - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5], - [3.5, 4.0, 5.5] - ] - } - -You can also use python arrays or numpy arrays as input and let the `csv_serializer` object -convert them to CSV, but the client-size CSV conversion is more sophisticated than the -CSV parsing on the Endpoint, so if you encounter conversion problems, try using one of the -JSON options instead. - - -Create Python Scripts for Custom Input and Output Formats ---------------------------------------------------------- - -You can add your customized Python code to process your input and output data: - -.. code:: - - from sagemaker.tensorflow.serving import Model - - model = Model(entry_point='inference.py', - model_data='s3://mybucket/model.tar.gz', - role='MySageMakerRole') - -How to implement the pre- and/or post-processing handler(s) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Your entry point file should implement either a pair of ``input_handler`` - and ``output_handler`` functions or a single ``handler`` function. - Note that if ``handler`` function is implemented, ``input_handler`` - and ``output_handler`` are ignored. - -To implement pre- and/or post-processing handler(s), use the Context -object that the Python service creates. The Context object is a namedtuple with the following attributes: - -- ``model_name (string)``: the name of the model to use for - inference. For example, 'half-plus-three' - -- ``model_version (string)``: version of the model. For example, '5' - -- ``method (string)``: inference method. For example, 'predict', - 'classify' or 'regress', for more information on methods, please see - `Classify and Regress - API `__ - and `Predict - API `__ - -- ``rest_uri (string)``: the TFS REST uri generated by the Python - service. For example, - 'http://localhost:8501/v1/models/half_plus_three:predict' - -- ``grpc_uri (string)``: the GRPC port number generated by the Python - service. For example, '9000' - -- ``custom_attributes (string)``: content of - 'X-Amzn-SageMaker-Custom-Attributes' header from the original - request. 
For example, - 'tfs-model-name=half*plus*\ three,tfs-method=predict' - -- ``request_content_type (string)``: the original request content type, - defaulted to 'application/json' if not provided - -- ``accept_header (string)``: the original request accept type, - defaulted to 'application/json' if not provided - -- ``content_length (int)``: content length of the original request - -The following code example implements ``input_handler`` and -``output_handler``. By providing these, the Python service posts the -request to the TFS REST URI with the data pre-processed by ``input_handler`` -and passes the response to ``output_handler`` for post-processing. - -.. code:: - - import json - - def input_handler(data, context): - """ Pre-process request input before it is sent to TensorFlow Serving REST API - Args: - data (obj): the request data, in format of dict or string - context (Context): an object containing request and configuration details - Returns: - (dict): a JSON-serializable dict that contains request body and headers - """ - if context.request_content_type == 'application/json': - # pass through json (assumes it's correctly formed) - d = data.read().decode('utf-8') - return d if len(d) else '' - - if context.request_content_type == 'text/csv': - # very simple csv handler - return json.dumps({ - 'instances': [float(x) for x in data.read().decode('utf-8').split(',')] - }) - - raise ValueError('{{"error": "unsupported content type {}"}}'.format( - context.request_content_type or "unknown")) - - - def output_handler(data, context): - """Post-process TensorFlow Serving output before it is returned to the client. - Args: - data (obj): the TensorFlow serving response - context (Context): an object containing request and configuration details - Returns: - (bytes, string): data to return to client, response content type - """ - if data.status_code != 200: - raise ValueError(data.content.decode('utf-8')) - - response_content_type = context.accept_header - prediction = data.content - return prediction, response_content_type - -You might want to have complete control over the request. -For example, you might want to make a TFS request (REST or GRPC) to the first model, -inspect the results, and then make a request to a second model. In this case, implement -the ``handler`` method instead of the ``input_handler`` and ``output_handler`` methods, as demonstrated -in the following code: - -.. code:: - - import json - import requests - - - def handler(data, context): - """Handle request. 
- Args: - data (obj): the request data - context (Context): an object containing request and configuration details - Returns: - (bytes, string): data to return to client, (optional) response content type - """ - processed_input = _process_input(data, context) - response = requests.post(context.rest_uri, data=processed_input) - return _process_output(response, context) - - - def _process_input(data, context): - if context.request_content_type == 'application/json': - # pass through json (assumes it's correctly formed) - d = data.read().decode('utf-8') - return d if len(d) else '' - - if context.request_content_type == 'text/csv': - # very simple csv handler - return json.dumps({ - 'instances': [float(x) for x in data.read().decode('utf-8').split(',')] - }) - - raise ValueError('{{"error": "unsupported content type {}"}}'.format( - context.request_content_type or "unknown")) - - - def _process_output(data, context): - if data.status_code != 200: - raise ValueError(data.content.decode('utf-8')) - - response_content_type = context.accept_header - prediction = data.content - return prediction, response_content_type - -You can also bring in external dependencies to help with your data -processing. There are 2 ways to do this: - -1. If you included ``requirements.txt`` in your ``source_dir`` or in - your dependencies, the container installs the Python dependencies at runtime using ``pip install -r``: - -.. code:: - - from sagemaker.tensorflow.serving import Model - - model = Model(entry_point='inference.py', - dependencies=['requirements.txt'], - model_data='s3://mybucket/model.tar.gz', - role='MySageMakerRole') - - -2. If you are working in a network-isolation situation or if you don't - want to install dependencies at runtime every time your endpoint starts or a batch - transform job runs, you might want to put - pre-downloaded dependencies under a ``lib`` directory and this - directory as dependency. The container adds the modules to the Python - path. Note that if both ``lib`` and ``requirements.txt`` - are present in the model archive, the ``requirements.txt`` is ignored: - -.. code:: - - from sagemaker.tensorflow.serving import Model - - model = Model(entry_point='inference.py', - dependencies=['/path/to/folder/named/lib'], - model_data='s3://mybucket/model.tar.gz', - role='MySageMakerRole') - - -************************************* -sagemaker.tensorflow.TensorFlow Class -************************************* - -The following are the most commonly used ``TensorFlow`` constructor arguments. - -Required: - -- ``entry_point (str)`` Path (absolute or relative) to the Python file which - should be executed as the entry point to training. -- ``role (str)`` An AWS IAM role (either name or full ARN). The Amazon - SageMaker training jobs and APIs that create Amazon SageMaker - endpoints use this role to access training data and model artifacts. - After the endpoint is created, the inference code might use the IAM - role, if accessing AWS resource. -- ``train_instance_count (int)`` Number of Amazon EC2 instances to use for - training. -- ``train_instance_type (str)`` Type of EC2 instance to use for training, for - example, 'ml.c4.xlarge'. - -Optional: - -- ``source_dir (str)`` Path (absolute or relative) to a directory with any - other training source code dependencies including the entry point - file. Structure within this directory will be preserved when training - on SageMaker. 
-- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with - any additional libraries that will be exported to the container (default: ``[]``). - The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. - If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used - instead. Example: - - The following call - - >>> TensorFlow(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) - - results in the following inside the container: - - >>> opt/ml/code - >>> ├── train.py - >>> ├── common - >>> └── virtual-env - -- ``hyperparameters (dict[str, ANY])`` Hyperparameters that will be used for training. - Will be made accessible as command line arguments. -- ``train_volume_size (int)`` Size in GB of the EBS volume to use for storing - input data during training. Must be large enough to the store training - data. -- ``train_max_run (int)`` Timeout in seconds for training, after which Amazon - SageMaker terminates the job regardless of its current status. -- ``output_path (str)`` S3 location where you want the training result (model - artifacts and optional output files) saved. If not specified, results - are stored to a default bucket. If the bucket with the specific name - does not exist, the estimator creates the bucket during the ``fit`` - method execution. -- ``output_kms_key`` Optional KMS key ID to optionally encrypt training - output with. -- ``base_job_name`` Name to assign for the training job that the ``fit`` - method launches. If not specified, the estimator generates a default - job name, based on the training image name and current timestamp. -- ``image_name`` An alternative docker image to use for training and - serving. If specified, the estimator will use this image for training and - hosting, instead of selecting the appropriate SageMaker official image based on - ``framework_version`` and ``py_version``. Refer to: `SageMaker TensorFlow Docker containers `_ for details on what the official images support - and where to find the source code to build your custom image. -- ``script_mode (bool)`` Whether to use Script Mode or not. Script mode is the only available training mode in Python 3, - setting ``py_version`` to ``py3`` automatically sets ``script_mode`` to True. -- ``model_dir (str)`` Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training. - If not specified a S3 location will be generated under the training job's default bucket. And ``model_dir`` will be - passed in your training script as one of the command line arguments. -- ``distributions (dict)`` Configure your distribution strategy with this argument. - -************************************** -SageMaker TensorFlow Docker containers -************************************** - -For information about SageMaker TensorFlow Docker containers and their dependencies, see `SageMaker TensorFlow Docker containers `_. +############################################## +Using TensorFlow with the SageMaker Python SDK +############################################## + +TensorFlow SageMaker Estimators allow you to run your own TensorFlow +training algorithms on SageMaker Learner, and to host your own TensorFlow +models on SageMaker Hosting. + +For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`. + +.. warning:: + We have added a new format of your TensorFlow training script with TensorFlow version 1.11. 
+ This new way gives the user script more flexibility. + This new format is called Script Mode, as opposed to Legacy Mode, which is what we support with TensorFlow 1.11 and older versions. + In addition we are adding Python 3 support with Script Mode. + The last supported version of Legacy Mode will be TensorFlow 1.12. + Script Mode is available with TensorFlow version 1.11 and newer. + Make sure you refer to the correct version of this README when you prepare your script. + You can find the Legacy Mode README `here `_. + +.. contents:: + +Supported versions of TensorFlow for Elastic Inference: ``1.11.0``, ``1.12.0``. + + +***************************** +Train a Model with TensorFlow +***************************** + +To train a TensorFlow model by using the SageMaker Python SDK: + +.. |create tf estimator| replace:: Create a ``sagemaker.tensorflow.TensorFlow estimator`` +.. _create tf estimator: #create-an-estimator + +.. |call fit| replace:: Call the estimator's ``fit`` method +.. _call fit: #call-the-fit-method + +1. `Prepare a training script <#prepare-a-script-mode-training-script>`_ +2. |create tf estimator|_ +3. |call fit|_ + +Prepare a Script Mode Training Script +====================================== + +Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file. + +The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following: + +* ``SM_MODEL_DIR``: A string that represents the local path where the training job writes the model artifacts to. + After training, artifacts in this directory are uploaded to S3 for model hosting. This is different than the ``model_dir`` + argument passed in your training script, which is an S3 location. ``SM_MODEL_DIR`` is always set to ``/opt/ml/model``. +* ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host. +* ``SM_OUTPUT_DATA_DIR``: A string that represents the path to the directory to write output artifacts to. + Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. + These artifacts are compressed and uploaded to S3 to an S3 bucket with the same prefix as the model artifacts. +* ``SM_CHANNEL_XXXX``: A string that represents the path to the directory that contains the input data for the specified channel. + For example, if you specify two input channels in the TensorFlow estimator's ``fit`` call, named 'train' and 'test', the environment variables ``SM_CHANNEL_TRAIN`` and ``SM_CHANNEL_TEST`` are set. + +For the exhaustive list of available environment variables, see the `SageMaker Containers documentation `_. + +A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to ``SM_CHANNEL_TRAIN`` so that it can be deployed for inference later. +Hyperparameters are passed to your script as arguments and can be retrieved with an ``argparse.ArgumentParser`` instance. +For example, a training script might start with the following: + +.. code:: python + + import argparse + import os + + if __name__ =='__main__': + + parser = argparse.ArgumentParser() + + # hyperparameters sent by the client are passed as command-line arguments to the script. 
+        parser.add_argument('--epochs', type=int, default=10)
+        parser.add_argument('--batch_size', type=int, default=100)
+        parser.add_argument('--learning_rate', type=float, default=0.1)
+
+        # input data and model directories
+        parser.add_argument('--model_dir', type=str)
+        parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
+        parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
+
+        args, _ = parser.parse_known_args()
+
+        # ... load from args.train and args.test, train a model, write model to args.model_dir.
+
+Because SageMaker imports your training script, putting your training launch code in a main guard (``if __name__ == '__main__':``)
+is good practice.
+
+Note that SageMaker doesn't support argparse actions.
+For example, if you want to use a boolean hyperparameter, specify ``type`` as ``bool`` in your script and provide an explicit ``True`` or ``False`` value for this hyperparameter when you create the TensorFlow estimator.
+
+For a complete example of a TensorFlow training script, see `mnist.py `__.
+
+
+Adapting your local TensorFlow script
+-------------------------------------
+
+If you have a TensorFlow training script that runs outside of SageMaker, do the following to adapt the script to run in SageMaker:
+
+1. Make sure your script can handle ``--model_dir`` as an additional command line argument. If you did not specify a
+location when you created the TensorFlow estimator, an S3 location under the default training job bucket is used.
+Distributed training with parameter servers requires you to use the ``tf.estimator.train_and_evaluate`` API and
+to provide an S3 location as the model directory during training. Here is an example:
+
+.. code:: python
+
+    estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir=args.model_dir)
+    ...
+    train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps=1000)
+    eval_spec = tf.estimator.EvalSpec(eval_input_fn)
+    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
+
+2. Load input data from the input channels. The input channels are defined when ``fit`` is called. For example:
+
+.. code:: python
+
+    estimator.fit({'train': 's3://my-bucket/my-training-data',
+                   'eval': 's3://my-bucket/my-evaluation-data'})
+
+In your training script, the channels are stored in the environment variables ``SM_CHANNEL_TRAIN`` and
+``SM_CHANNEL_EVAL``. You can add them to your argument parsing logic like this:
+
+.. code:: python
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
+    parser.add_argument('--eval', type=str, default=os.environ.get('SM_CHANNEL_EVAL'))
+
+3. Export your final model to the path stored in the environment variable ``SM_MODEL_DIR``, which is always
+   ``/opt/ml/model``. At the end of training, SageMaker uploads the model files under ``/opt/ml/model`` to
+   ``output_path``.
+
+
+Create an Estimator
+===================
+
+After you create your training script, create an instance of the :class:`sagemaker.tensorflow.TensorFlow` estimator.
+
+To use Script Mode, set at least one of these arguments:
+
+- ``py_version='py3'``
+- ``script_mode=True``
+
+When using Script Mode, your training script needs to accept the following argument:
+
+- ``model_dir``
+
+The following arguments are not permitted when using Script Mode:
+
+- ``checkpoint_path``
+- ``training_steps``
+- ``evaluation_steps``
+- ``requirements_file``
+
+.. code:: python
+
+    from sagemaker.tensorflow import TensorFlow
+
+    tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
+                              train_instance_count=1, train_instance_type='ml.p2.xlarge',
+                              framework_version='1.12', py_version='py3')
+    tf_estimator.fit('s3://bucket/path/to/training/data')
+
+In this example, the S3 URL passed to ``fit`` is the path to your training data within Amazon S3.
+The constructor keyword arguments define how SageMaker runs your training script.
+
+For more information about the sagemaker.tensorflow.TensorFlow estimator, see `sagemaker.tensorflow.TensorFlow Class`_.
+
+Call the fit Method
+===================
+
+You start your training script by calling the ``fit`` method on a ``TensorFlow`` estimator. ``fit`` takes
+both required and optional arguments.
+
+Required arguments
+------------------
+
+- ``inputs``: The S3 location(s) of datasets to be used for training. This can take one of the following forms:
+
+  - ``str``: An S3 URI, for example ``s3://my-bucket/my-training-data``, which indicates the dataset's location.
+  - ``dict[str, str]``: A dictionary mapping channel names to S3 locations, for example ``{'train': 's3://my-bucket/my-training-data/train', 'test': 's3://my-bucket/my-training-data/test'}``
+  - ``sagemaker.session.s3_input``: channel configuration for S3 data sources that can provide additional information as well as the path to the training dataset. See `the API docs `_ for full details.
+
+Optional arguments
+------------------
+
+- ``wait (bool)``: Defaults to True, whether to block and wait for the
+  training script to complete before returning.
+  If set to False, it returns immediately, and the training job can later be attached to.
+- ``logs (bool)``: Defaults to True, whether to show logs produced by the training
+  job in the Python session. Only meaningful when wait is True.
+- ``run_tensorboard_locally (bool)``: Defaults to False. If set to True, a TensorBoard command is printed out.
+- ``job_name (str)``: Training job name. If not specified, the estimator generates a default job name,
+  based on the training image name and current timestamp.
+
+What happens when fit is called
+-------------------------------
+
+Calling ``fit`` starts a SageMaker training job. The training job does the following:
+
+- Starts ``train_instance_count`` EC2 instances of the type ``train_instance_type``.
+- On each instance, does the following:
+
+  - starts a Docker container optimized for TensorFlow.
+  - downloads the dataset.
+  - sets up training-related environment variables.
+  - sets up the distributed training environment if configured to use a parameter server.
+  - starts asynchronous training.
+
+If the ``wait=False`` flag is passed to ``fit``, it returns immediately and the training job continues running
+asynchronously. Later, a TensorFlow estimator can be obtained by attaching to the existing training job.
+If the training job is not finished, attaching shows the standard output of training and waits until the job completes.
+After attaching, the estimator can be deployed as usual.
+
+.. code:: python
+
+    tf_estimator.fit(your_input_data, wait=False)
+    training_job_name = tf_estimator.latest_training_job.name
+
+    # after some time, or in a separate Python notebook, we can attach to it again.
+
+    tf_estimator = TensorFlow.attach(training_job_name=training_job_name)
+
+Distributed Training
+====================
+
+To run your training job with multiple instances in a distributed fashion, set ``train_instance_count``
+to a number larger than 1.
We support two different types of distributed training, parameter server and Horovod. +The ``distributions`` parameter is used to configure which distributed training strategy to use. + +Training with parameter servers +------------------------------- + +If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server +thread on each instance in the training cluster, and then executes your training code. You can find more information on +TensorFlow distributed training at `TensorFlow docs `__. +To enable parameter server training: + +.. code:: python + + from sagemaker.tensorflow import TensorFlow + + tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole', + train_instance_count=2, train_instance_type='ml.p2.xlarge', + framework_version='1.11', py_version='py3', + distributions={'parameter_server': {'enabled': True}}) + tf_estimator.fit('s3://bucket/path/to/training/data') + +Training with Horovod +--------------------- + +Horovod is a distributed training framework based on MPI. Horovod is only available with TensorFlow version ``1.12`` or newer. +You can find more details at `Horovod README `__. + +The container sets up the MPI environment and executes the ``mpirun`` command enabling you to run any Horovod +training script with Script Mode. + +Training with ``MPI`` is configured by specifying following fields in ``distributions``: + +- ``enabled (bool)``: If set to ``True``, the MPI setup is performed and ``mpirun`` command is executed. +- ``processes_per_host (int)``: Number of processes MPI should launch on each host. Note, this should not be + greater than the available slots on the selected instance type. This flag should be set for the multi-cpu/gpu + training. +- ``custom_mpi_options (str)``: Any `mpirun` flag(s) can be passed in this field that will be added to the `mpirun` + command executed by SageMaker to launch distributed horovod training. + + +In the below example we create an estimator to launch Horovod distributed training with 2 processes on one host: + +.. code:: python + + from sagemaker.tensorflow import TensorFlow + + tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole', + train_instance_count=1, train_instance_type='ml.p2.xlarge', + framework_version='1.12', py_version='py3', + distributions={ + 'mpi': { + 'enabled': True, + 'processes_per_host': 2, + 'custom_mpi_options': '--NCCL_DEBUG INFO' + } + }) + tf_estimator.fit('s3://bucket/path/to/training/data') + + +Training with Pipe Mode using PipeModeDataset +============================================= + +Amazon SageMaker allows users to create training jobs using Pipe input mode. +With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. +This means that your training jobs start sooner, finish quicker, and need less disk space. + +SageMaker TensorFlow provides an implementation of ``tf.data.Dataset`` that makes it easy to take advantage of Pipe +input mode in SageMaker. You can replace your ``tf.data.Dataset`` with a ``sagemaker_tensorflow.PipeModeDataset`` to +read TFRecords as they are streamed to your training instances. + +In your ``entry_point`` script, you can use ``PipeModeDataset`` like a ``Dataset``. In this example, we create a +``PipeModeDataset`` to read TFRecords from the 'training' channel: + + +.. 
code:: python + + from sagemaker_tensorflow import PipeModeDataset + + features = { + 'data': tf.FixedLenFeature([], tf.string), + 'labels': tf.FixedLenFeature([], tf.int64), + } + + def parse(record): + parsed = tf.parse_single_example(record, features) + return ({ + 'data': tf.decode_raw(parsed['data'], tf.float64) + }, parsed['labels']) + + def train_input_fn(training_dir, hyperparameters): + ds = PipeModeDataset(channel='training', record_format='TFRecord') + ds = ds.repeat(20) + ds = ds.prefetch(10) + ds = ds.map(parse, num_parallel_calls=10) + ds = ds.batch(64) + return ds + + +To run training job with Pipe input mode, pass in ``input_mode='Pipe'`` to your TensorFlow Estimator: + + +.. code:: python + + from sagemaker.tensorflow import TensorFlow + + tf_estimator = TensorFlow(entry_point='tf-train-with-pipemodedataset.py', role='SageMakerRole', + training_steps=10000, evaluation_steps=100, + train_instance_count=1, train_instance_type='ml.p2.xlarge', + framework_version='1.10.0', input_mode='Pipe') + + tf_estimator.fit('s3://bucket/path/to/training/data') + + +If your TFRecords are compressed, you can train on Gzipped TF Records by passing in ``compression='Gzip'`` to the call to +``fit()``, and SageMaker will automatically unzip the records as data is streamed to your training instances: + +.. code:: python + + from sagemaker.session import s3_input + + train_s3_input = s3_input('s3://bucket/path/to/training/data', compression='Gzip') + tf_estimator.fit(train_s3_input) + + +You can learn more about ``PipeModeDataset`` in the sagemaker-tensorflow-extensions repository: https://github.com/aws/sagemaker-tensorflow-extensions + + +Training with MKL-DNN disabled +============================== + +SageMaker TensorFlow CPU images use TensorFlow built with Intel® MKL-DNN optimization. + +In certain cases you might be able to get a better performance by disabling this optimization +(`for example when using small models `_) + +You can disable MKL-DNN optimization for TensorFlow ``1.8.0`` and above by setting two following environment variables: + +.. code:: python + + import os + + os.environ['TF_DISABLE_MKL'] = '1' + os.environ['TF_DISABLE_POOL_ALLOCATOR'] = '1' + +******************************** +Deploy TensorFlow Serving models +******************************** + +After a TensorFlow estimator has been fit, it saves a TensorFlow SavedModel in +the S3 location defined by ``output_path``. You can call ``deploy`` on a TensorFlow +estimator to create a SageMaker Endpoint, or you can call ``transformer`` to create a ``Transformer`` that you can use to run a batch transform job. + +Your model will be deployed to a TensorFlow Serving-based server. The server provides a super-set of the +`TensorFlow Serving REST API `_. + + +Deploy to a SageMaker Endpoint +============================== + +Deploying from an Estimator +--------------------------- + +After a TensorFlow estimator has been fit, it saves a TensorFlow +`SavedModel `_ bundle in +the S3 location defined by ``output_path``. You can call ``deploy`` on a TensorFlow +estimator object to create a SageMaker Endpoint: + +.. 
code:: python + + from sagemaker.tensorflow import TensorFlow + + estimator = TensorFlow(entry_point='tf-train.py', ..., train_instance_count=1, + train_instance_type='ml.c4.xlarge', framework_version='1.11') + + estimator.fit(inputs) + + predictor = estimator.deploy(initial_instance_count=1, + instance_type='ml.c5.xlarge', + endpoint_type='tensorflow-serving') + + +The code block above deploys a SageMaker Endpoint with one instance of the type 'ml.c5.xlarge'. + +What happens when deploy is called +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Calling ``deploy`` starts the process of creating a SageMaker Endpoint. This process includes the following steps. + +- Starts ``initial_instance_count`` EC2 instances of the type ``instance_type``. +- On each instance, it will do the following steps: + + - start a Docker container optimized for TensorFlow Serving, see `SageMaker TensorFlow Serving containers `_. + - start a `TensorFlow Serving` process configured to run your model. + - start an HTTP server that provides access to TensorFlow Server through the SageMaker InvokeEndpoint API. + + +When the ``deploy`` call finishes, the created SageMaker Endpoint is ready for prediction requests. The +`Making predictions against a SageMaker Endpoint`_ section will explain how to make prediction requests +against the Endpoint. + +Deploying directly from model artifacts +--------------------------------------- + +If you already have existing model artifacts in S3, you can skip training and deploy them directly to an endpoint: + +.. code:: python + + from sagemaker.tensorflow.serving import Model + + model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') + + predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge') + +Python-based TensorFlow serving on SageMaker has support for `Elastic Inference `__, which allows for inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance. In order to attach an Elastic Inference accelerator to your endpoint provide the accelerator type to accelerator_type to your deploy call. + +.. code:: python + + from sagemaker.tensorflow.serving import Model + + model = Model(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole') + + predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge', accelerator_type='ml.eia1.medium') + +Making predictions against a SageMaker Endpoint +----------------------------------------------- + +Once you have the ``Predictor`` instance returned by ``model.deploy(...)`` or ``estimator.deploy(...)``, you +can send prediction requests to your Endpoint. + +The following code shows how to make a prediction request: + +.. code:: python + + input = { + 'instances': [1.0, 2.0, 5.0] + } + result = predictor.predict(input) + +The result object will contain a Python dict like this: + +.. code:: python + + { + 'predictions': [3.5, 4.0, 5.5] + } + +The formats of the input and the output data correspond directly to the request and response formats +of the ``Predict`` method in the `TensorFlow Serving REST API `_. + +If your SavedModel includes the right ``signature_def``, you can also make Classify or Regress requests: + +.. code:: python + + # input matches the Classify and Regress API + input = { + 'signature_name': 'tensorflow/serving/regress', + 'examples': [{'x': 1.0}, {'x': 2.0}] + } + + result = predictor.regress(input) # or predictor.classify(...) 
+
+    # result contains:
+    {
+        'results': [3.5, 4.0]
+    }
+
+You can include multiple ``instances`` in your predict request (or multiple ``examples`` in
+classify/regress requests) to get multiple prediction results in one request to your Endpoint:
+
+.. code:: python
+
+    input = {
+        'instances': [
+            [1.0, 2.0, 5.0],
+            [1.0, 2.0, 5.0],
+            [1.0, 2.0, 5.0]
+        ]
+    }
+    result = predictor.predict(input)
+
+    # result contains:
+    {
+        'predictions': [
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5]
+        ]
+    }
+
+If your application allows request grouping like this, it is **much** more efficient than making separate requests.
+
+See `Deploying to TensorFlow Serving Endpoints ` to learn how to deploy your model and make inference requests.
+
+Run a Batch Transform Job
+=========================
+
+Batch transform allows you to get inferences for an entire dataset that is stored in an S3 bucket.
+
+For general information about using batch transform with the SageMaker Python SDK, see :ref:`overview:SageMaker Batch Transform`.
+For information about SageMaker batch transform, see `Get Inferences for an Entire Dataset with Batch Transform ` in the AWS documentation.
+
+To run a batch transform job, you first create a ``Transformer`` object, and then call that object's ``transform`` method.
+
+Create a Transformer Object
+---------------------------
+
+If you used an estimator to train your model, you can call the ``transformer`` method of the estimator to create a ``Transformer`` object.
+
+For example:
+
+.. code:: python
+
+    bucket = 'my-bucket' # The name of the S3 bucket where the results are stored
+    prefix = 'batch-results' # The folder in the S3 bucket where the results are stored
+
+    batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results
+
+    tf_transformer = tf_estimator.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
+
+To use a model trained outside of SageMaker, you can package the model as a SageMaker model, and call the ``transformer`` method of the SageMaker model.
+
+For example:
+
+.. code:: python
+
+    bucket = 'my-bucket' # The name of the S3 bucket where the results are stored
+    prefix = 'batch-results' # The folder in the S3 bucket where the results are stored
+
+    batch_output = 's3://{}/{}/results'.format(bucket, prefix) # The location to store the results
+
+    tf_transformer = tensorflow_serving_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)
+
+For information about how to package a model as a SageMaker model, see :ref:`overview:BYO Model`.
+When you call the ``transformer`` method, you specify the type and number of instances to use for the batch transform job, and the location where the results are stored in S3.
+
+Call transform
+--------------
+
+After you create a ``Transformer`` object, you call that object's ``transform`` method to start a batch transform job.
+For example:
+
+.. code:: python
+
+    batch_input = 's3://{}/{}/test/examples'.format(bucket, prefix) # The location of the input dataset
+
+    tf_transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')
+
+In the example, the content type is CSV, and each line in the dataset is treated as a record to get a prediction for.
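+
+The ``transform`` call starts the batch transform job asynchronously. The following is a minimal
+sketch of one way to wait for the job to finish and then list the results, assuming the ``bucket``,
+``prefix``, and ``tf_transformer`` variables from the examples above and that the AWS SDK for Python
+(Boto3) is available in your environment. Batch transform typically writes one ``<input file>.out``
+object per input file under ``output_path``:
+
+.. code:: python
+
+    import boto3
+
+    # block until the batch transform job completes
+    tf_transformer.wait()
+
+    # list the result objects that the job wrote under output_path
+    s3 = boto3.resource('s3')
+    for obj in s3.Bucket(bucket).objects.filter(Prefix='{}/results'.format(prefix)):
+        print(obj.key)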
+
+Batch Transform Supported Data Formats
+--------------------------------------
+
+When you call the ``transform`` method to start a batch transform job,
+you specify the data format by providing a MIME type as the value for the ``content_type`` parameter.
+
+The following content formats are supported without custom input and output handling:
+
+* CSV - specify ``text/csv`` as the value of the ``content_type`` parameter.
+* JSON - specify ``application/json`` as the value of the ``content_type`` parameter.
+* JSON lines - specify ``application/jsonlines`` as the value of the ``content_type`` parameter.
+
+For detailed information about how TensorFlow Serving formats these data types for input and output, see :ref:`using_tf:TensorFlow Serving Input and Output`.
+
+You can also accept any custom data format by writing input and output functions, and including them in the ``inference.py`` file in your model.
+For information, see :ref:`using_tf:Create Python Scripts for Custom Input and Output Formats`.
+
+
+TensorFlow Serving Input and Output
+===================================
+
+The following sections describe the data formats that TensorFlow Serving endpoints and batch transform jobs accept,
+and how to write input and output functions to handle custom data formats.
+
+Supported Formats
+-----------------
+
+SageMaker's TensorFlow Serving endpoints can also accept some additional input formats that are not part of the
+TensorFlow REST API, including a simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and
+CSV data.
+
+Simplified JSON Input
+^^^^^^^^^^^^^^^^^^^^^
+
+The Endpoint will accept simplified JSON input that doesn't match the TensorFlow REST API's Predict request format.
+When the Endpoint receives data like this, it will attempt to transform it into a valid
+Predict request, using a few simple rules:
+
+- Python values, dicts, and one-dimensional arrays are treated as the input value in a single-instance Predict request.
+- Multidimensional arrays are treated as multiple values in a multi-instance Predict request.
+
+Combined with the client-side ``Predictor`` object's JSON serialization, this allows you to make simple
+requests like this:
+
+.. code:: python
+
+    input = [
+        [1.0, 2.0, 5.0],
+        [1.0, 2.0, 5.0]
+    ]
+    result = predictor.predict(input)
+
+    # result contains:
+    {
+        'predictions': [
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5]
+        ]
+    }
+
+Or this:
+
+.. code:: python
+
+    # 'x' must match the name of an input tensor in your SavedModel graph
+    # for models with multiple named inputs, just include all the keys in the input dict
+    input = {
+        'x': [1.0, 2.0, 5.0]
+    }
+
+    # result contains:
+    {
+        'predictions': [
+            [3.5, 4.0, 5.5]
+        ]
+    }
+
+
+Line-delimited JSON
+^^^^^^^^^^^^^^^^^^^
+
+The Endpoint will accept line-delimited JSON objects (also known as "jsons" or "jsonlines" data).
+The Endpoint treats each line as a separate instance in a multi-instance Predict request. To use
+this feature from your Python code, you need to create a ``Predictor`` instance that does not
+try to serialize your input to JSON:
+
+.. code:: python
+
+    # create a Predictor without JSON serialization
+
+    predictor = Predictor('endpoint-name', serializer=None, content_type='application/jsonlines')
+
+    input = '''{"x": [1.0, 2.0, 5.0]}
+    {"x": [1.0, 2.0, 5.0]}
+    {"x": [1.0, 2.0, 5.0]}'''
+
+    result = predictor.predict(input)
+
+    # result contains:
+    {
+        'predictions': [
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5]
+        ]
+    }
+
+This feature is especially useful if you are reading data from a file containing jsonlines data.
+
+**CSV (comma-separated values)**
+
+The Endpoint will accept CSV data. Each line is treated as a separate instance. This is a
+compact format for representing multiple instances of 1-d array data. To use this feature
+from your Python code, you need to create a ``Predictor`` instance that can serialize
+your input data to CSV format:
+
+.. code:: python
+
+    # create a Predictor with CSV serialization
+
+    predictor = Predictor('endpoint-name', serializer=sagemaker.predictor.csv_serializer)
+
+    # CSV-formatted string input
+    input = '1.0,2.0,5.0\n1.0,2.0,5.0\n1.0,2.0,5.0'
+
+    result = predictor.predict(input)
+
+    # result contains:
+    {
+        'predictions': [
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5],
+            [3.5, 4.0, 5.5]
+        ]
+    }
+
+You can also use Python lists or numpy arrays as input and let the `csv_serializer` object
+convert them to CSV, but the client-side CSV conversion is more sophisticated than the
+CSV parsing on the Endpoint, so if you encounter conversion problems, try using one of the
+JSON options instead.
+
+
+Create Python Scripts for Custom Input and Output Formats
+----------------------------------------------------------
+
+You can add your customized Python code to process your input and output data:
+
+.. code::
+
+    from sagemaker.tensorflow.serving import Model
+
+    model = Model(entry_point='inference.py',
+                  model_data='s3://mybucket/model.tar.gz',
+                  role='MySageMakerRole')
+
+How to implement the pre- and/or post-processing handler(s)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Your entry point file should implement either a pair of ``input_handler``
+and ``output_handler`` functions or a single ``handler`` function.
+Note that if the ``handler`` function is implemented, ``input_handler``
+and ``output_handler`` are ignored.
+
+To implement pre- and/or post-processing handler(s), use the Context
+object that the Python service creates. The Context object is a namedtuple with the following attributes:
+
+- ``model_name (string)``: the name of the model to use for
+  inference. For example, 'half-plus-three'
+
+- ``model_version (string)``: version of the model. For example, '5'
+
+- ``method (string)``: inference method. For example, 'predict',
+  'classify' or 'regress'. For more information on methods, see the
+  `Classify and Regress
+  API `__
+  and the `Predict
+  API `__
+
+- ``rest_uri (string)``: the TFS REST uri generated by the Python
+  service. For example,
+  'http://localhost:8501/v1/models/half_plus_three:predict'
+
+- ``grpc_uri (string)``: the GRPC port number generated by the Python
+  service. For example, '9000'
+
+- ``custom_attributes (string)``: content of the
+  'X-Amzn-SageMaker-Custom-Attributes' header from the original
+  request.
For example, + 'tfs-model-name=half*plus*\ three,tfs-method=predict' + +- ``request_content_type (string)``: the original request content type, + defaulted to 'application/json' if not provided + +- ``accept_header (string)``: the original request accept type, + defaulted to 'application/json' if not provided + +- ``content_length (int)``: content length of the original request + +The following code example implements ``input_handler`` and +``output_handler``. By providing these, the Python service posts the +request to the TFS REST URI with the data pre-processed by ``input_handler`` +and passes the response to ``output_handler`` for post-processing. + +.. code:: + + import json + + def input_handler(data, context): + """ Pre-process request input before it is sent to TensorFlow Serving REST API + Args: + data (obj): the request data, in format of dict or string + context (Context): an object containing request and configuration details + Returns: + (dict): a JSON-serializable dict that contains request body and headers + """ + if context.request_content_type == 'application/json': + # pass through json (assumes it's correctly formed) + d = data.read().decode('utf-8') + return d if len(d) else '' + + if context.request_content_type == 'text/csv': + # very simple csv handler + return json.dumps({ + 'instances': [float(x) for x in data.read().decode('utf-8').split(',')] + }) + + raise ValueError('{{"error": "unsupported content type {}"}}'.format( + context.request_content_type or "unknown")) + + + def output_handler(data, context): + """Post-process TensorFlow Serving output before it is returned to the client. + Args: + data (obj): the TensorFlow serving response + context (Context): an object containing request and configuration details + Returns: + (bytes, string): data to return to client, response content type + """ + if data.status_code != 200: + raise ValueError(data.content.decode('utf-8')) + + response_content_type = context.accept_header + prediction = data.content + return prediction, response_content_type + +You might want to have complete control over the request. +For example, you might want to make a TFS request (REST or GRPC) to the first model, +inspect the results, and then make a request to a second model. In this case, implement +the ``handler`` method instead of the ``input_handler`` and ``output_handler`` methods, as demonstrated +in the following code: + +.. code:: + + import json + import requests + + + def handler(data, context): + """Handle request. 
+
+        Args:
+            data (obj): the request data
+            context (Context): an object containing request and configuration details
+        Returns:
+            (bytes, string): data to return to client, (optional) response content type
+        """
+        processed_input = _process_input(data, context)
+        response = requests.post(context.rest_uri, data=processed_input)
+        return _process_output(response, context)
+
+
+    def _process_input(data, context):
+        if context.request_content_type == 'application/json':
+            # pass through json (assumes it's correctly formed)
+            d = data.read().decode('utf-8')
+            return d if len(d) else ''
+
+        if context.request_content_type == 'text/csv':
+            # very simple csv handler
+            return json.dumps({
+                'instances': [float(x) for x in data.read().decode('utf-8').split(',')]
+            })
+
+        raise ValueError('{{"error": "unsupported content type {}"}}'.format(
+            context.request_content_type or "unknown"))
+
+
+    def _process_output(data, context):
+        if data.status_code != 200:
+            raise ValueError(data.content.decode('utf-8'))
+
+        response_content_type = context.accept_header
+        prediction = data.content
+        return prediction, response_content_type
+
+You can also bring in external dependencies to help with your data
+processing. There are two ways to do this:
+
+1. If you included ``requirements.txt`` in your ``source_dir`` or in
+   your dependencies, the container installs the Python dependencies at runtime using ``pip install -r``:
+
+.. code::
+
+    from sagemaker.tensorflow.serving import Model
+
+    model = Model(entry_point='inference.py',
+                  dependencies=['requirements.txt'],
+                  model_data='s3://mybucket/model.tar.gz',
+                  role='MySageMakerRole')
+
+
+2. If you are working in a network-isolation situation or if you don't
+   want to install dependencies at runtime every time your endpoint starts or a batch
+   transform job runs, you might want to put
+   pre-downloaded dependencies under a ``lib`` directory and include this
+   directory as a dependency. The container adds the modules to the Python
+   path. Note that if both ``lib`` and ``requirements.txt``
+   are present in the model archive, the ``requirements.txt`` is ignored:
+
+.. code::
+
+    from sagemaker.tensorflow.serving import Model
+
+    model = Model(entry_point='inference.py',
+                  dependencies=['/path/to/folder/named/lib'],
+                  model_data='s3://mybucket/model.tar.gz',
+                  role='MySageMakerRole')
+
+
+*************************************
+sagemaker.tensorflow.TensorFlow Class
+*************************************
+
+The following are the most commonly used ``TensorFlow`` constructor arguments.
+
+Required:
+
+- ``entry_point (str)`` Path (absolute or relative) to the Python file which
+  should be executed as the entry point to training.
+- ``role (str)`` An AWS IAM role (either name or full ARN). The Amazon
+  SageMaker training jobs and APIs that create Amazon SageMaker
+  endpoints use this role to access training data and model artifacts.
+  After the endpoint is created, the inference code might use the IAM
+  role, if it accesses AWS resources.
+- ``train_instance_count (int)`` Number of Amazon EC2 instances to use for
+  training.
+- ``train_instance_type (str)`` Type of EC2 instance to use for training, for
+  example, 'ml.c4.xlarge'.
+
+Optional:
+
+- ``source_dir (str)`` Path (absolute or relative) to a directory with any
+  other training source code dependencies including the entry point
+  file. Structure within this directory will be preserved when training
+  on SageMaker.
+- ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with + any additional libraries that will be exported to the container (default: ``[]``). + The library folders will be copied to SageMaker in the same folder where the entrypoint is copied. + If the ``source_dir`` points to S3, code will be uploaded and the S3 location will be used + instead. Example: + + The following call + + >>> TensorFlow(entry_point='train.py', dependencies=['my/libs/common', 'virtual-env']) + + results in the following inside the container: + + >>> opt/ml/code + >>> ├── train.py + >>> ├── common + >>> └── virtual-env + +- ``hyperparameters (dict[str, ANY])`` Hyperparameters that will be used for training. + Will be made accessible as command line arguments. +- ``train_volume_size (int)`` Size in GB of the EBS volume to use for storing + input data during training. Must be large enough to the store training + data. +- ``train_max_run (int)`` Timeout in seconds for training, after which Amazon + SageMaker terminates the job regardless of its current status. +- ``output_path (str)`` S3 location where you want the training result (model + artifacts and optional output files) saved. If not specified, results + are stored to a default bucket. If the bucket with the specific name + does not exist, the estimator creates the bucket during the ``fit`` + method execution. +- ``output_kms_key`` Optional KMS key ID to optionally encrypt training + output with. +- ``base_job_name`` Name to assign for the training job that the ``fit`` + method launches. If not specified, the estimator generates a default + job name, based on the training image name and current timestamp. +- ``image_name`` An alternative docker image to use for training and + serving. If specified, the estimator will use this image for training and + hosting, instead of selecting the appropriate SageMaker official image based on + ``framework_version`` and ``py_version``. Refer to: `SageMaker TensorFlow Docker containers `_ for details on what the official images support + and where to find the source code to build your custom image. +- ``script_mode (bool)`` Whether to use Script Mode or not. Script mode is the only available training mode in Python 3, + setting ``py_version`` to ``py3`` automatically sets ``script_mode`` to True. +- ``model_dir (str)`` Location where model data, checkpoint data, and TensorBoard checkpoints should be saved during training. + If not specified a S3 location will be generated under the training job's default bucket. And ``model_dir`` will be + passed in your training script as one of the command line arguments. +- ``distributions (dict)`` Configure your distribution strategy with this argument. + +************************************** +SageMaker TensorFlow Docker containers +************************************** + +For information about SageMaker TensorFlow Docker containers and their dependencies, see `SageMaker TensorFlow Docker containers `_. diff --git a/doc/using_workflow.rst b/doc/using_workflow.rst index 46dbedd612..8ae50e42da 100644 --- a/doc/using_workflow.rst +++ b/doc/using_workflow.rst @@ -1,166 +1,166 @@ -==================================== -SageMaker Workflow in Apache Airflow -==================================== - -Apache Airflow -~~~~~~~~~~~~~~ - -`Apache Airflow `_ -is a platform that enables you to programmatically author, schedule, and monitor workflows. 
Using Airflow, -you can build a workflow for SageMaker training, hyperparameter tuning, batch transform and endpoint deployment. -You can use any SageMaker deep learning framework or Amazon algorithms to perform above operations in Airflow. - -There are two ways to build a SageMaker workflow. Using Airflow SageMaker operators or using Airflow PythonOperator. - -1. SageMaker Operators: In Airflow 1.10.1, the SageMaker team contributed special operators for SageMaker operations. -Each operator takes a configuration dictionary that defines the corresponding operation. We provide APIs to generate -the configuration dictionary in the SageMaker Python SDK. Currently, the following SageMaker operators are supported: - -* ``SageMakerTrainingOperator`` -* ``SageMakerTuningOperator`` -* ``SageMakerModelOperator`` -* ``SageMakerTransformOperator`` -* ``SageMakerEndpointConfigOperator`` -* ``SageMakerEndpointOperator`` - -2. PythonOperator: Airflow built-in operator that executes Python callables. You can use the PythonOperator to execute -operations in the SageMaker Python SDK to create a SageMaker workflow. - -Using Airflow on AWS -~~~~~~~~~~~~~~~~~~~~ - -Turbine is an open-source AWS CloudFormation template that enables you to create an Airflow resource stack on AWS. -You can get it here: https://github.com/villasv/aws-airflow-stack - -Using Airflow SageMaker Operators -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Starting with Airflow 1.10.1, you can use SageMaker operators in Airflow. All SageMaker operators take a configuration -dictionary that can be generated by the SageMaker Python SDK. For example: - -.. code:: python - - import sagemaker - from sagemaker.tensorflow import TensorFlow - from sagemaker.workflow.airflow import training_config, transform_config_from_estimator - - estimator = TensorFlow(entry_point='tf_train.py', - role='sagemaker-role', - framework_version='1.11.0', - training_steps=1000, - evaluation_steps=100, - train_instance_count=2, - train_instance_type='ml.p2.xlarge') - - # train_config specifies SageMaker training configuration - train_config = training_config(estimator=estimator, - inputs=your_training_data_s3_uri) - - # trans_config specifies SageMaker batch transform configuration - # task_id specifies which operator the training job associatd with; task_type specifies whether the operator is a - # training operator or tuning operator - trans_config = transform_config_from_estimator(estimator=estimator, - task_id='tf_training', - task_type='training', - instance_count=1, - instance_type='ml.m4.xlarge', - data=your_transform_data_s3_uri, - content_type='text/csv') - -Now you can pass these configurations to the corresponding SageMaker operators and create the workflow: - -.. 
code:: python - - import airflow - from airflow import DAG - from airflow.contrib.operators.sagemaker_training_operator import SageMakerTrainingOperator - from airflow.contrib.operators.sagemaker_transform_operator import SageMakerTransformOperator - - default_args = { - 'owner': 'airflow', - 'start_date': airflow.utils.dates.days_ago(2), - 'provide_context': True - } - - dag = DAG('tensorflow_example', default_args=default_args, - schedule_interval='@once') - - train_op = SageMakerTrainingOperator( - task_id='tf_training', - config=train_config, - wait_for_completion=True, - dag=dag) - - transform_op = SageMakerTransformOperator( - task_id='tf_transform', - config=trans_config, - wait_for_completion=True, - dag=dag) - - transform_op.set_upstream(train_op) - -Using Airflow Python Operator -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -`Airflow PythonOperator `_ -is a built-in operator that can execute any Python callable. If you want to build the SageMaker workflow in a more -flexible way, write your python callables for SageMaker operations by using the SageMaker Python SDK. - -.. code:: python - - from sagemaker.tensorflow import TensorFlow - - # callable for SageMaker training in TensorFlow - def train(data, **context): - estimator = TensorFlow(entry_point='tf_train.py', - role='sagemaker-role', - framework_version='1.11.0', - training_steps=1000, - evaluation_steps=100, - train_instance_count=2, - train_instance_type='ml.p2.xlarge') - estimator.fit(data) - return estimator.latest_training_job.job_name - - # callable for SageMaker batch transform - def transform(data, **context): - training_job = context['ti'].xcom_pull(task_ids='training') - estimator = TensorFlow.attach(training_job) - transformer = estimator.transformer(instance_count=1, instance_type='ml.c4.xlarge') - transformer.transform(data, content_type='text/csv') - -Then build your workflow by using the PythonOperator with the Python callables defined above: - -.. code:: python - - import airflow - from airflow import DAG - from airflow.operators.python_operator import PythonOperator - - default_args = { - 'owner': 'airflow', - 'start_date': airflow.utils.dates.days_ago(2), - 'provide_context': True - } - - dag = DAG('tensorflow_example', default_args=default_args, - schedule_interval='@once') - - train_op = PythonOperator( - task_id='training', - python_callable=train, - op_args=[training_data_s3_uri], - provide_context=True, - dag=dag) - - transform_op = PythonOperator( - task_id='transform', - python_callable=transform, - op_args=[transform_data_s3_uri], - provide_context=True, - dag=dag) - - transform_op.set_upstream(train_op) - -A workflow that runs a SageMaker training job and a batch transform job is finished. You can customize your Python +==================================== +SageMaker Workflow in Apache Airflow +==================================== + +Apache Airflow +~~~~~~~~~~~~~~ + +`Apache Airflow `_ +is a platform that enables you to programmatically author, schedule, and monitor workflows. Using Airflow, +you can build a workflow for SageMaker training, hyperparameter tuning, batch transform and endpoint deployment. +You can use any SageMaker deep learning framework or Amazon algorithms to perform above operations in Airflow. + +There are two ways to build a SageMaker workflow. Using Airflow SageMaker operators or using Airflow PythonOperator. + +1. SageMaker Operators: In Airflow 1.10.1, the SageMaker team contributed special operators for SageMaker operations. 
+Each operator takes a configuration dictionary that defines the corresponding operation. We provide APIs to generate +the configuration dictionary in the SageMaker Python SDK. Currently, the following SageMaker operators are supported: + +* ``SageMakerTrainingOperator`` +* ``SageMakerTuningOperator`` +* ``SageMakerModelOperator`` +* ``SageMakerTransformOperator`` +* ``SageMakerEndpointConfigOperator`` +* ``SageMakerEndpointOperator`` + +2. PythonOperator: Airflow built-in operator that executes Python callables. You can use the PythonOperator to execute +operations in the SageMaker Python SDK to create a SageMaker workflow. + +Using Airflow on AWS +~~~~~~~~~~~~~~~~~~~~ + +Turbine is an open-source AWS CloudFormation template that enables you to create an Airflow resource stack on AWS. +You can get it here: https://github.com/villasv/aws-airflow-stack + +Using Airflow SageMaker Operators +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Starting with Airflow 1.10.1, you can use SageMaker operators in Airflow. All SageMaker operators take a configuration +dictionary that can be generated by the SageMaker Python SDK. For example: + +.. code:: python + + import sagemaker + from sagemaker.tensorflow import TensorFlow + from sagemaker.workflow.airflow import training_config, transform_config_from_estimator + + estimator = TensorFlow(entry_point='tf_train.py', + role='sagemaker-role', + framework_version='1.11.0', + training_steps=1000, + evaluation_steps=100, + train_instance_count=2, + train_instance_type='ml.p2.xlarge') + + # train_config specifies SageMaker training configuration + train_config = training_config(estimator=estimator, + inputs=your_training_data_s3_uri) + + # trans_config specifies SageMaker batch transform configuration + # task_id specifies which operator the training job associatd with; task_type specifies whether the operator is a + # training operator or tuning operator + trans_config = transform_config_from_estimator(estimator=estimator, + task_id='tf_training', + task_type='training', + instance_count=1, + instance_type='ml.m4.xlarge', + data=your_transform_data_s3_uri, + content_type='text/csv') + +Now you can pass these configurations to the corresponding SageMaker operators and create the workflow: + +.. code:: python + + import airflow + from airflow import DAG + from airflow.contrib.operators.sagemaker_training_operator import SageMakerTrainingOperator + from airflow.contrib.operators.sagemaker_transform_operator import SageMakerTransformOperator + + default_args = { + 'owner': 'airflow', + 'start_date': airflow.utils.dates.days_ago(2), + 'provide_context': True + } + + dag = DAG('tensorflow_example', default_args=default_args, + schedule_interval='@once') + + train_op = SageMakerTrainingOperator( + task_id='tf_training', + config=train_config, + wait_for_completion=True, + dag=dag) + + transform_op = SageMakerTransformOperator( + task_id='tf_transform', + config=trans_config, + wait_for_completion=True, + dag=dag) + + transform_op.set_upstream(train_op) + +Using Airflow Python Operator +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +`Airflow PythonOperator `_ +is a built-in operator that can execute any Python callable. If you want to build the SageMaker workflow in a more +flexible way, write your python callables for SageMaker operations by using the SageMaker Python SDK. + +.. 
code:: python + + from sagemaker.tensorflow import TensorFlow + + # callable for SageMaker training in TensorFlow + def train(data, **context): + estimator = TensorFlow(entry_point='tf_train.py', + role='sagemaker-role', + framework_version='1.11.0', + training_steps=1000, + evaluation_steps=100, + train_instance_count=2, + train_instance_type='ml.p2.xlarge') + estimator.fit(data) + return estimator.latest_training_job.job_name + + # callable for SageMaker batch transform + def transform(data, **context): + training_job = context['ti'].xcom_pull(task_ids='training') + estimator = TensorFlow.attach(training_job) + transformer = estimator.transformer(instance_count=1, instance_type='ml.c4.xlarge') + transformer.transform(data, content_type='text/csv') + +Then build your workflow by using the PythonOperator with the Python callables defined above: + +.. code:: python + + import airflow + from airflow import DAG + from airflow.operators.python_operator import PythonOperator + + default_args = { + 'owner': 'airflow', + 'start_date': airflow.utils.dates.days_ago(2), + 'provide_context': True + } + + dag = DAG('tensorflow_example', default_args=default_args, + schedule_interval='@once') + + train_op = PythonOperator( + task_id='training', + python_callable=train, + op_args=[training_data_s3_uri], + provide_context=True, + dag=dag) + + transform_op = PythonOperator( + task_id='transform', + python_callable=transform, + op_args=[transform_data_s3_uri], + provide_context=True, + dag=dag) + + transform_op.set_upstream(train_op) + +A workflow that runs a SageMaker training job and a batch transform job is finished. You can customize your Python callables with the SageMaker Python SDK according to your needs, and build more flexible and powerful workflows. \ No newline at end of file From 8df0760ab87fe804c438bc9c53060f1df97c0067 Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 19 Jul 2019 15:39:02 -0700 Subject: [PATCH 12/14] doc: update conf.py --- doc/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/conf.py b/doc/conf.py index b08e04006e..c0ff0e0c51 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -60,7 +60,7 @@ def __getattr__(cls, name): "sphinx.ext.coverage", "sphinx.ext.autosummary", "sphinx.ext.napoleon", - "sphinx.ext.autosectionlabel" + "sphinx.ext.autosectionlabel", ] # Add any paths that contain templates here, relative to this directory. From addced496b808b6868be65da44c91397f236969e Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 19 Jul 2019 15:44:28 -0700 Subject: [PATCH 13/14] doc: remove duplicate byom section in overview.rst --- doc/overview.rst | 53 ------------------------------------------------ 1 file changed, 53 deletions(-) diff --git a/doc/overview.rst b/doc/overview.rst index 69992d9f1f..cf525cd4a4 100644 --- a/doc/overview.rst +++ b/doc/overview.rst @@ -826,59 +826,6 @@ A new training job channel, named ``code``, will be added with that S3 URI. Bef Once the training job begins, the training container will look at the offline input ``code`` channel to install dependencies and run the entry script. This isolates the training container, so no inbound or outbound network calls can be made. -********* -BYO Model -********* - -You can also create an endpoint from an existing model rather than training one. -That is, you can bring your own model: - -First, package the files for the trained model into a ``.tar.gz`` file, and upload the archive to S3. 
- -Next, create a ``Model`` object that corresponds to the framework that you are using: `MXNetModel `__ or `TensorFlowModel `__. - -Example code using ``MXNetModel``: - -.. code:: python - - from sagemaker.mxnet.model import MXNetModel - - sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz', - role='arn:aws:iam::accid:sagemaker-role', - entry_point='entry_point.py') - -After that, invoke the ``deploy()`` method on the ``Model``: - -.. code:: python - - predictor = sagemaker_model.deploy(initial_instance_count=1, - instance_type='ml.m4.xlarge') - -This returns a predictor the same way an ``Estimator`` does when ``deploy()`` is called. You can now get inferences just like with any other model deployed on Amazon SageMaker. - -Git support is also available when you bring your own model, through which you can use inference scripts stored in your -Git repositories. The process is similar to using Git support for training jobs. You can simply provide ``git_config`` -when create the ``Model`` object, and let ``entry_point``, ``source_dir`` and ``dependencies`` (if needed) be relative -paths inside the Git repository: - -.. code:: python - - git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git', - 'branch': 'branch1', - 'commit': '4893e528afa4a790331e1b5286954f073b0f14a2'} - - sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz', - role='arn:aws:iam::accid:sagemaker-role', - entry_point='inference.py', - source_dir='mxnet', - git_config=git_config) - -A full example is available in the `Amazon SageMaker examples repository `__. - -You can also find this notebook in the **Advanced Functionality** section of the **SageMaker Examples** section in a notebook instance. -For information about using sample notebooks in a SageMaker notebook instance, see `Use Example Notebooks `__ -in the AWS documentation. - ******************* Inference Pipelines ******************* From f595cabd0a64429963bde440c2872d54f5dfdf9d Mon Sep 17 00:00:00 2001 From: eslesar-aws Date: Fri, 19 Jul 2019 15:57:52 -0700 Subject: [PATCH 14/14] doc: remove duplicate headings in several rst files --- doc/using_chainer.rst | 8 ++++---- doc/using_mxnet.rst | 8 ++++---- doc/using_pytorch.rst | 8 ++++---- doc/using_rl.rst | 6 +++--- doc/using_sklearn.rst | 8 ++++---- 5 files changed, 19 insertions(+), 19 deletions(-) diff --git a/doc/using_chainer.rst b/doc/using_chainer.rst index 248071bb67..067e83c1dd 100644 --- a/doc/using_chainer.rst +++ b/doc/using_chainer.rst @@ -239,8 +239,8 @@ Calling fit You start your training script by calling ``fit`` on a ``Chainer`` Estimator. ``fit`` takes both required and optional arguments. -Required arguments -'''''''''''''''''' +fit Required arguments +'''''''''''''''''''''' - ``inputs``: This can take one of the following forms: A string s3 URI, for example ``s3://my-bucket/my-training-data``. In this @@ -258,8 +258,8 @@ For example: .. optional-arguments-1: -Optional arguments -'''''''''''''''''' +fit Optional arguments +'''''''''''''''''''''' - ``wait``: Defaults to True, whether to block and wait for the training script to complete before returning. diff --git a/doc/using_mxnet.rst b/doc/using_mxnet.rst index a951b5ce43..d97640f726 100644 --- a/doc/using_mxnet.rst +++ b/doc/using_mxnet.rst @@ -411,8 +411,8 @@ Calling fit You start your training script by calling ``fit`` on an ``MXNet`` Estimator. ``fit`` takes both required and optional arguments. 
-Required argument -''''''''''''''''' +fit Required argument +''''''''''''''''''''' - ``inputs``: This can take one of the following forms: A string S3 URI, for example ``s3://my-bucket/my-training-data``. In this @@ -430,8 +430,8 @@ For example: .. optional-arguments-1: -Optional arguments -'''''''''''''''''' +fit Optional arguments +'''''''''''''''''''''' - ``wait``: Defaults to True, whether to block and wait for the training script to complete before returning. diff --git a/doc/using_pytorch.rst b/doc/using_pytorch.rst index 4e89d58347..6c72445571 100644 --- a/doc/using_pytorch.rst +++ b/doc/using_pytorch.rst @@ -232,8 +232,8 @@ Calling fit You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional arguments. -Required arguments -'''''''''''''''''' +fit Required arguments +'''''''''''''''''''''' - ``inputs``: This can take one of the following forms: A string S3 URI, for example ``s3://my-bucket/my-training-data``. In this @@ -251,8 +251,8 @@ For example: .. optional-arguments-1: -Optional arguments -'''''''''''''''''' +fit Optional arguments +'''''''''''''''''''''' - ``wait``: Defaults to True, whether to block and wait for the training script to complete before returning. diff --git a/doc/using_rl.rst b/doc/using_rl.rst index 959d60b3ae..c99420708f 100644 --- a/doc/using_rl.rst +++ b/doc/using_rl.rst @@ -174,11 +174,11 @@ The following are optional arguments. When you create an ``RlEstimator`` object, Calling fit ~~~~~~~~~~~ -You start your training script by calling ``fit`` on an ``RLEstimator``. ``fit`` takes both a few optional +You start your training script by calling ``fit`` on an ``RLEstimator``. ``fit`` takes a few optional arguments. -Optional arguments -'''''''''''''''''' +fit Optional arguments +'''''''''''''''''''''' - ``inputs``: This can take one of the following forms: A string S3 URI, for example ``s3://my-bucket/my-training-data``. In this diff --git a/doc/using_sklearn.rst b/doc/using_sklearn.rst index e67f5edaff..b72d2b4356 100644 --- a/doc/using_sklearn.rst +++ b/doc/using_sklearn.rst @@ -187,8 +187,8 @@ Calling fit You start your training script by calling ``fit`` on a ``SKLearn`` Estimator. ``fit`` takes both required and optional arguments. -Required arguments -'''''''''''''''''' +fit Required arguments +'''''''''''''''''''''' - ``inputs``: This can take one of the following forms: A string s3 URI, for example ``s3://my-bucket/my-training-data``. In this @@ -206,8 +206,8 @@ For example: .. optional-arguments-1: -Optional arguments -'''''''''''''''''' +fit Optional arguments +'''''''''''''''''''''' - ``wait``: Defaults to True, whether to block and wait for the training script to complete before returning.