:class:`sagemaker.workflow.model_step.ModelStep` uses the provided inputs to automatically detect if a repack is needed. If a repack is needed, :class:`sagemaker.workflow.steps.TrainingStep` is added to the step collection for that repack. Then, either :class:`sagemaker.workflow.steps.CreateModelStep` or :class:`sagemaker.workflow.step_collections.RegisterModelStep` will be chained after it.
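
By way of illustration, a minimal sketch of driving :class:`sagemaker.workflow.model_step.ModelStep` from a ``register()`` call might look like the following; the image URI, role ARN, model artifact location, and model package group name are placeholder assumptions.

.. code-block:: python

    from sagemaker.model import Model
    from sagemaker.workflow.model_step import ModelStep
    from sagemaker.workflow.pipeline_context import PipelineSession

    pipeline_session = PipelineSession()

    # Placeholder values; in a real pipeline, model_data typically comes from a
    # TrainingStep property such as step_train.properties.ModelArtifacts.S3ModelArtifacts.
    model = Model(
        image_uri="<inference-image-uri>",
        model_data="s3://my-bucket/model/model.tar.gz",
        sagemaker_session=pipeline_session,
        role="arn:aws:iam::111122223333:role/SageMakerRole",
    )

    # Under a PipelineSession, register() returns step arguments instead of calling
    # the API; ModelStep inspects them and adds a repack TrainingStep only if needed.
    register_args = model.register(
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.m5.xlarge"],
        transform_instances=["ml.m5.xlarge"],
        model_package_group_name="MyModelPackageGroup",
    )

    step_model_registration = ModelStep(
        name="MyRegisterModelStep",
        step_args=register_args,
    )
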

MonitorBatchTransform Step
===========================

MonitorBatchTransformStep is a new step type that allows customers to use SageMaker Model Monitor with batch transform jobs that are a part of their pipeline. Using this step, customers can set up the following monitors for their batch transform job: data quality, model quality, model bias, and feature attribution.
When configuring this step, customers have the flexibility to run the monitoring job before or after the transform job executes. There is an additional flag called :code:`fail_on_violation` which will fail the step if set to true and there is a monitoring violation, or will continue to execute the step if set to false.
Here is an example showing you how to configure a :class:`sagemaker.workflow.monitor_batch_transform_step.MonitorBatchTransformStep` with a Data Quality monitor.

.. code-block:: python

    from sagemaker.workflow.pipeline_context import PipelineSession

    from sagemaker.transformer import Transformer
    from sagemaker.model_monitor import DefaultModelMonitor
    from sagemaker.model_monitor.dataset_format import DatasetFormat
    from sagemaker.workflow.check_job_config import CheckJobConfig
    from sagemaker.workflow.quality_check_step import DataQualityCheckConfig

    from sagemaker.workflow.parameters import ParameterString

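
A minimal sketch of how these imports might be wired together is shown below; the role ARN, bucket locations, and model name are placeholder assumptions rather than values from a real pipeline.

.. code-block:: python

    from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep

    pipeline_session = PipelineSession()
    role = "arn:aws:iam::111122223333:role/SageMakerRole"  # placeholder execution role

    # Resource configuration shared by the monitoring job.
    check_job_config = CheckJobConfig(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=pipeline_session,
    )

    # Baseline configuration for the data quality monitor.
    data_quality_config = DataQualityCheckConfig(
        baseline_dataset="s3://my-bucket/baseline/dataset.csv",  # placeholder
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="s3://my-bucket/monitoring/reports",  # placeholder
    )

    # Batch transform job whose inputs and outputs are monitored.
    transform_input = ParameterString(
        name="transform_input",
        default_value="s3://my-bucket/transform/input",  # placeholder
    )
    transformer = Transformer(
        model_name="my-model",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
        accept="text/csv",
        assemble_with="Line",
        output_path="s3://my-bucket/transform/output",  # placeholder
        sagemaker_session=pipeline_session,
    )
    transform_arg = transformer.transform(
        data=transform_input,
        content_type="text/csv",
        split_type="Line",
    )

    transform_and_monitor_step = MonitorBatchTransformStep(
        name="MyMonitorBatchTransformStep",
        transform_step_args=transform_arg,
        monitor_configuration=data_quality_config,
        check_job_configuration=check_job_config,
        # Run the data quality check before the transform job, and keep the pipeline
        # running even if the check finds a violation.
        monitor_before_transform=True,
        fail_on_violation=False,
    )
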
The same example can be extended for model quality, bias, and feature attribution monitoring.
.. warning::

    Note that to run on-demand model quality, you will need to have the ground truth data ready. When running the transform job, include the ground truth inside your transform input, and join the transform inference input and output. Then you can indicate which attribute or column name/index points to the ground truth when running the monitoring job.

SageMaker Training on Trainium instances now supports the ``xla``
package through ``torchrun``. With this, you do not need to manually pass RANK,
WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. You can launch the training job using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.

.. note::

    This ``torch_distributed`` support is available
    in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
    To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.

    SageMaker Debugger and Profiler are currently not supported with Trainium instances.

Adapt Your Training Script to Initialize with the XLA backend
--------------------------------------------------------------

SageMaker takes care of ``'MASTER_ADDR'`` and ``'MASTER_PORT'`` for you via ``torchrun``.
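
As a rough illustration (assuming the ``torch_xla`` package that ships in the PyTorch Neuron containers), initializing the process group with the XLA backend might look like this:

.. code-block:: python

    import torch.distributed as dist

    # Importing this module registers the ``xla`` backend with torch.distributed;
    # it is provided by the PyTorch Neuron (torch-xla) packages in the Trainium DLCs.
    import torch_xla.distributed.xla_backend  # noqa: F401

    # torchrun, launched for you by SageMaker, exports RANK, WORLD_SIZE,
    # MASTER_ADDR, and MASTER_PORT, so no explicit arguments are needed here.
    dist.init_process_group("xla")
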
For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
**Currently Supported backends:**
- ``xla`` for Trainium (Trn1) instances
For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
Launching a Distributed Training Job on Trainium
------------------------------------------------
You can run multi-node distributed PyTorch training jobs on Trainium instances using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class with the ``torch_distributed`` distribution option.
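
A minimal sketch of such a launch follows; the entry point script name, role ARN, and data location are placeholder assumptions.

.. code-block:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder values: substitute your own training script, role, and data location.
    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each
        # worker when the torch_distributed distribution strategy is enabled.
        distribution={
            "torch_distributed": {
                "enabled": True,
            }
        },
    )

    pt_estimator.fit("s3://my-bucket/path/to/training/data")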