doc: add some clarification to Processing docs #1600

Merged
merged 5 commits into from
Jun 17, 2020
115 changes: 62 additions & 53 deletions doc/amazon_sagemaker_processing.rst
@@ -1,6 +1,6 @@
##############################
###########################
Amazon SageMaker Processing
##############################
###########################


Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker.
@@ -24,76 +24,85 @@ The fastest way to run get started with Amazon SageMaker Processing is by runnin
You can run notebooks on Amazon SageMaker that demonstrate end-to-end examples of using processing jobs to perform data pre-processing, feature engineering and model evaluation steps. See `Learn More`_ at the bottom of this page for more in-depth information.


Data Pre-Processing and Model Evaluation with Scikit-Learn
==================================================================
Data Pre-Processing and Model Evaluation with scikit-learn
==========================================================

You can run a Scikit-Learn script to do data processing on SageMaker using the `SKLearnProcessor`_ class.

.. _SKLearnProcessor: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor
You can run a scikit-learn script to do data processing on SageMaker using the :class:`sagemaker.sklearn.processing.SKLearnProcessor` class.

You first create a ``SKLearnProcessor``

.. code:: python

    from sagemaker.sklearn.processing import SKLearnProcessor

    sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                         role='[Your SageMaker-compatible IAM role]',
                                         instance_type='ml.m5.xlarge',
                                         instance_count=1)
    sklearn_processor = SKLearnProcessor(
        framework_version="0.20.0",
        role="[Your SageMaker-compatible IAM role]",
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )

Then you can run a Scikit-Learn script ``preprocessing.py`` in a processing job. In this example, our script takes one input from S3 and one command-line argument, processes the data, then splits the data into two datasets for output. When the job is finished, we can retrieve the output from S3.
Then you can run a scikit-learn script ``preprocessing.py`` in a processing job. In this example, our script takes one input from S3 and one command-line argument, processes the data, then splits the data into two datasets for output. When the job is finished, we can retrieve the output from S3.

.. code:: python

    from sagemaker.processing import ProcessingInput, ProcessingOutput

    sklearn_processor.run(code='preprocessing.py',
                          inputs=[ProcessingInput(
                              source='s3://your-bucket/path/to/your/data,
                              destination='/opt/ml/processing/input')],
                          outputs=[ProcessingOutput(output_name='train_data',
                                                    source='/opt/ml/processing/train'),
                                   ProcessingOutput(output_name='test_data',
                                                    source='/opt/ml/processing/test')],
                          arguments=['--train-test-split-ratio', '0.2']
                          )
    sklearn_processor.run(
        code="preprocessing.py",
        inputs=[
            ProcessingInput(source="s3://your-bucket/path/to/your/data", destination="/opt/ml/processing/input"),
        ],
        outputs=[
            ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
        ],
        arguments=["--train-test-split-ratio", "0.2"],
    )

    preprocessing_job_description = sklearn_processor.jobs[-1].describe()
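
As a minimal sketch of retrieving the output locations once the job has finished, the helper below maps each output name to its S3 URI. It assumes the dictionary returned by ``describe()`` follows the ``DescribeProcessingJob`` response layout (``ProcessingOutputConfig`` → ``Outputs`` → ``S3Output`` → ``S3Uri``). The function name ``output_s3_uris`` and the sample dictionary are hypothetical and used only for illustration.

```python
def output_s3_uris(job_description):
    """Map each processing output name to the S3 URI it was uploaded to.

    Assumes the DescribeProcessingJob response layout.
    """
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Hypothetical, trimmed-down job description used only for illustration;
# a real one would come from sklearn_processor.jobs[-1].describe().
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {"S3Uri": "s3://your-bucket/job/output/train_data"},
            },
            {
                "OutputName": "test_data",
                "S3Output": {"S3Uri": "s3://your-bucket/job/output/test_data"},
            },
        ]
    }
}

uris = output_s3_uris(sample_description)
print(uris["train_data"])
```

With a real job you would pass ``preprocessing_job_description`` instead of the sample dictionary.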

For an in-depth look, please see the `Scikit-Learn Data Processing and Model Evaluation`_ example notebook.
For an in-depth look, please see the `Scikit-learn Data Processing and Model Evaluation`_ example notebook.

.. _Scikit-Learn Data Processing and Model Evaluation: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb
.. _Scikit-learn Data Processing and Model Evaluation: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb
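
To make the shape of a script like ``preprocessing.py`` more concrete, here is a rough, standard-library-only sketch of the steps described above: read the input from ``/opt/ml/processing/input``, parse ``--train-test-split-ratio``, and write the two splits to ``/opt/ml/processing/train`` and ``/opt/ml/processing/test``. It is not the script from the example notebook, which uses scikit-learn. The file names (``data.csv``, ``train.csv``, ``test.csv``), the default ratio, and all function names are assumptions made for illustration.

```python
import argparse
import csv
import os
import random


def parse_args(argv=None):
    """Parse the single command-line argument passed via `arguments=`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    return parser.parse_args(argv)


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows deterministically and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, rows):
    """Write rows to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def main(argv=None, base_dir="/opt/ml/processing"):
    args = parse_args(argv)

    # "data.csv" is a hypothetical file name under the input destination.
    with open(os.path.join(base_dir, "input", "data.csv"), newline="") as f:
        rows = list(csv.reader(f))

    train, test = split_rows(rows, args.train_test_split_ratio)

    # These directories match the `source` paths of the two ProcessingOutputs.
    write_csv(os.path.join(base_dir, "train", "train.csv"), train)
    write_csv(os.path.join(base_dir, "test", "test.csv"), test)


# In a real script, call main() under the usual `if __name__ == "__main__":` guard.
```

Anything the script writes under the two output directories is what the processing job uploads to S3 for the ``train_data`` and ``test_data`` outputs.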


Data Pre-Processing with Spark
==============================

You can use the `ScriptProcessor`_ class to run a script in a processing container, including your own container.

.. _ScriptProcessor: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ScriptProcessor
You can use the :class:`sagemaker.processing.ScriptProcessor` class to run a script in a processing container, including your own container.

This example shows how you can run a processing job inside of a container that can run a Spark script called ``preprocess.py`` by invoking a command ``/opt/program/submit`` inside the container.

.. code:: python

    from sagemaker.processing import ScriptProcessor, ProcessingInput

    spark_processor = ScriptProcessor(base_job_name='spark-preprocessor',
                                      image_uri='<ECR repository URI to your Spark processing image>',
                                      command=['/opt/program/submit'],
                                      role=role,
                                      instance_count=2,
                                      instance_type='ml.r5.xlarge',
                                      max_runtime_in_seconds=1200,
                                      env={'mode': 'python'})

    spark_processor.run(code='preprocess.py',
                        arguments=['s3_input_bucket', bucket,
                                   's3_input_key_prefix', input_prefix,
                                   's3_output_bucket', bucket,
                                   's3_output_key_prefix', input_preprocessed_prefix],
                        logs=False)
    spark_processor = ScriptProcessor(
        base_job_name="spark-preprocessor",
        image_uri="<ECR repository URI to your Spark processing image>",
        command=["/opt/program/submit"],
        role=role,
        instance_count=2,
        instance_type="ml.r5.xlarge",
        max_runtime_in_seconds=1200,
        env={"mode": "python"},
    )

    spark_processor.run(
        code="preprocess.py",
        arguments=[
            "s3_input_bucket",
            bucket,
            "s3_input_key_prefix",
            input_prefix,
            "s3_output_bucket",
            bucket,
            "s3_output_key_prefix",
            input_preprocessed_prefix,
        ],
        logs=False,
    )

For an in-depth look, please see the `Feature Transformation with Spark`_ example notebook.
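
The ``arguments`` list in the Spark example is a flat sequence of alternating keys and values. One plausible way for ``preprocess.py`` to consume it is to pair consecutive items into a dictionary, as sketched below. This is an assumption about how such a script could be written, not a description of the notebook's code; ``parse_key_value_args`` and the example values are hypothetical.

```python
def parse_key_value_args(argv):
    """Turn ['k1', 'v1', 'k2', 'v2'] into {'k1': 'v1', 'k2': 'v2'}."""
    if len(argv) % 2 != 0:
        raise ValueError("expected an even number of arguments (key/value pairs)")
    it = iter(argv)
    # zip() pulls two items at a time from the same iterator,
    # so each key is paired with the value that follows it.
    return dict(zip(it, it))


# Hypothetical values standing in for `bucket`, `input_prefix`, and
# `input_preprocessed_prefix`; inside the script you would pass sys.argv[1:].
example_argv = [
    "s3_input_bucket", "my-bucket",
    "s3_input_key_prefix", "raw/",
    "s3_output_bucket", "my-bucket",
    "s3_output_key_prefix", "preprocessed/",
]

args = parse_key_value_args(example_argv)
print(args["s3_input_key_prefix"])
```

Positional key/value pairs keep the call site simple, but they depend on the list staying well-formed, which is why the sketch rejects an odd number of items.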

@@ -106,19 +115,19 @@ Learn More
Processing class documentation
------------------------------

- ``Processor``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.Processor
- ``ScriptProcessor``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ScriptProcessor
- ``SKLearnProcessor``: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor
- ``ProcessingInput``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingInput
- ``ProcessingOutput``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingOutput
- ``ProcessingJob``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingJob
- :class:`sagemaker.processing.Processor`
- :class:`sagemaker.processing.ScriptProcessor`
- :class:`sagemaker.sklearn.processing.SKLearnProcessor`
- :class:`sagemaker.processing.ProcessingInput`
- :class:`sagemaker.processing.ProcessingOutput`
- :class:`sagemaker.processing.ProcessingJob`


Further documentation
---------------------

- Processing class documentation: https://sagemaker.readthedocs.io/en/stable/processing.html
- AWS Documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
- AWS Notebook examples: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing
- Processing API documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateProcessingJob.html
- Processing container specification: https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html
- `Processing class documentation <https://sagemaker.readthedocs.io/en/stable/processing.html>`_
- `AWS Documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html>`_
- `AWS Notebook examples <https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing>`_
- `Processing API documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateProcessingJob.html>`_
- `Processing container specification <https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html>`_
3 changes: 2 additions & 1 deletion src/sagemaker/processing.py
@@ -289,7 +289,8 @@ def __init__(
network_config=None,
):
"""Initializes a ``ScriptProcessor`` instance. The ``ScriptProcessor``
handles Amazon SageMaker Processing tasks for jobs using a machine learning framework.
handles Amazon SageMaker Processing tasks for jobs using a machine learning framework,
which allows for providing a script to be run as part of the Processing Job.

Args:
role (str): An AWS IAM role name or ARN. Amazon SageMaker Processing