Commit dfdec6a

feature: Add support for torch_distributed distribution for Trainium training instances
1 parent 907f4ff commit dfdec6a

File tree

13 files changed: +688 / -22 lines

doc/amazon_sagemaker_model_building_pipeline.rst

+126
@@ -954,6 +954,132 @@ When model repacking is needed, :class:`sagemaker.workflow.model_step.ModelStep`
:class:`sagemaker.workflow.model_step.ModelStep` uses the provided inputs to automatically detect if a repack is needed. If a repack is needed, :class:`sagemaker.workflow.steps.TrainingStep` is added to the step collection for that repack. Then, either :class:`sagemaker.workflow.steps.CreateModelStep` or :class:`sagemaker.workflow.step_collections.RegisterModelStep` will be chained after it.

MonitorBatchTransform Step
==========================

MonitorBatchTransformStep is a new step type that allows customers to use SageMaker Model Monitor with batch transform jobs that are part of their pipeline. Using this step, customers can set up the following monitors for their batch transform job: data quality, model quality, model bias, and feature attribution.

When configuring this step, customers have the flexibility to run the monitoring job before or after the transform job executes. An additional flag, :code:`fail_on_violation`, controls what happens when a monitoring violation is found: if set to true, the step fails; if set to false, the step continues to execute.

Here is an example showing you how to configure a :class:`sagemaker.workflow.monitor_batch_transform_step.MonitorBatchTransformStep` with a Data Quality monitor.
.. code-block:: python

    from sagemaker.workflow.pipeline_context import PipelineSession

    from sagemaker.transformer import Transformer
    from sagemaker.model_monitor import DefaultModelMonitor
    from sagemaker.model_monitor.dataset_format import DatasetFormat
    from sagemaker.workflow.check_job_config import CheckJobConfig
    from sagemaker.workflow.quality_check_step import DataQualityCheckConfig

    from sagemaker.workflow.parameters import ParameterString

    pipeline_session = PipelineSession()

    transform_input_param = ParameterString(
        name="transform_input",
        default_value=f"s3://my-bucket/my-prefix/my-transform-input",
    )

    # the resource configuration for the monitoring job
    job_config = CheckJobConfig(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        ...
    )
The following code sample demonstrates how to set up an on-demand batch transform *data quality* monitor:

.. code-block:: python

    # configure your transformer
    transformer = Transformer(..., sagemaker_session=pipeline_session)
    transform_arg = transformer.transform(
        transform_input_param,
        content_type="text/csv",
        split_type="Line",
        ...
    )

    data_quality_config = DataQualityCheckConfig(
        baseline_dataset=transform_input_param,
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="s3://my-report-path",
    )

    from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep

    transform_and_monitor_step = MonitorBatchTransformStep(
        name="MyMonitorBatchTransformStep",
        transform_step_args=transform_arg,
        monitor_configuration=data_quality_config,
        check_job_configuration=job_config,
        # data quality only looks at the inputs, so there is
        # no need to wait for the transform output
        monitor_before_transform=True,
        # if a violation is detected during monitoring and you want to
        # skip it and continue running batch transform, set
        # fail_on_violation to false
        fail_on_violation=False,
        supplied_baseline_statistics="s3://my-baseline-statistics.json",
        supplied_baseline_constraints="s3://my-baseline-constraints.json",
    )
    ...
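Once configured, the step can be added to a pipeline definition and executed like any other step. The following is a minimal sketch (not part of this change), assuming ``role`` holds a valid IAM role ARN:

.. code-block:: python

    from sagemaker.workflow.pipeline import Pipeline

    # assemble the pipeline with the transform-and-monitor step and its parameter
    pipeline = Pipeline(
        name="MyMonitorBatchTransformPipeline",
        parameters=[transform_input_param],
        steps=[transform_and_monitor_step],
        sagemaker_session=pipeline_session,
    )

    # create or update the pipeline definition, then start an execution
    pipeline.upsert(role_arn=role)
    execution = pipeline.start()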
The same example can be extended for model quality, bias, and feature attribution monitoring.

.. warning::
    To run on-demand model quality monitoring, you need to have the ground truth data ready. When running the transform job, include the ground truth inside your transform input, and join the transform inference input and output. Then you can indicate which attribute or column name/index points to the ground truth when running the monitoring job.
.. code-block:: python

    from sagemaker.workflow.quality_check_step import ModelQualityCheckConfig

    transformer = Transformer(..., sagemaker_session=pipeline_session)

    transform_arg = transformer.transform(
        transform_input_param,
        content_type="text/csv",
        split_type="Line",
        # Note that we need to join both the inference input and output
        # into the transform outputs. The inference input needs to include the ground truth.
        # Details can be found here:
        # https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html
        join_source="Input",
        # We need to exclude the ground truth from the inference input
        # before passing it to the prediction model.
        # Assume the first column of our csv file is the ground truth.
        input_filter="$[1:]",
        ...
    )

    model_quality_config = ModelQualityCheckConfig(
        baseline_dataset=transformer.output_path,
        problem_type="BinaryClassification",
        dataset_format=DatasetFormat.csv(header=False),
        output_s3_uri="s3://my-output",
        # assume the model output is at column idx 10
        inference_attribute="_c10",
        # as pointed out previously, the first column is the ground truth
        ground_truth_attribute="_c0",
    )

    from sagemaker.workflow.monitor_batch_transform_step import MonitorBatchTransformStep

    transform_and_monitor_step = MonitorBatchTransformStep(
        name="MyMonitorBatchTransformStep",
        transform_step_args=transform_arg,
        monitor_configuration=model_quality_config,
        check_job_configuration=job_config,
        # the model quality job needs the transform outputs, therefore
        # monitor_before_transform cannot be true for model quality
        monitor_before_transform=False,
        fail_on_violation=True,
        supplied_baseline_statistics="s3://my-baseline-statistics.json",
        supplied_baseline_constraints="s3://my-baseline-constraints.json",
    )
    ...
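For intuition, the following sketch (not part of this change) illustrates what a joined CSV record is assumed to look like under these settings, with the ground truth in the first input column and the prediction appended by the join:

.. code-block:: python

    # Hypothetical 10-column transform input row (ground truth followed by 9 features):
    #   1,0.3,0.7,0.5,0.2,0.9,0.4,0.6,0.1,0.8
    # With join_source="Input", the inference result is appended to the original
    # input record, so the model output lands at column index 10 ("_c10") while
    # the ground truth stays at column index 0 ("_c0"):
    #   1,0.3,0.7,0.5,0.2,0.9,0.4,0.6,0.1,0.8,0.88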
=================
Example Notebooks
=================

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst

+11-2
@@ -729,7 +729,7 @@ smdistributed.modelparallel.torch APIs for Saving and Loading
* ``num_kept_partial_checkpoints`` (int) (default: None): The maximum number
  of partial checkpoints to keep on disk.

- .. function:: smdistributed.modelparallel.torch.resume_from_checkpoint(path, tag=None, partial=True, strict=True, load_optimizer_states=True, translate_function=None)
+ .. function:: smdistributed.modelparallel.torch.resume_from_checkpoint(path, tag=None, partial=True, strict=True, load_optimizer=True, load_sharded_optimizer_state=True, translate_function=None)

  While :class:`smdistributed.modelparallel.torch.load` loads saved
  model and optimizer objects, this function resumes from a saved checkpoint file.
@@ -742,7 +742,16 @@ smdistributed.modelparallel.torch APIs for Saving and Loading
* ``partial`` (boolean) (default: True): Whether to load the partial checkpoint.
* ``strict`` (boolean) (default: True): Load with strict load; no extra or
  missing keys are allowed.
- * ``load_optimizer_states`` (boolean) (default: True): Whether to load ``optimizer_states``.
+ * ``load_optimizer`` (boolean) (default: True): Whether to load the ``optimizer``.
+ * ``load_sharded_optimizer_state`` (boolean) (default: True): Whether to load
+   the sharded optimizer state of a model.
+   It can be used only when you activate
+   the `sharded data parallelism
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html>`_
+   feature of the SageMaker model parallel library.
+   When this is ``False``, the library only loads the FP16
+   states, such as FP32 master parameters and the loss scaling factor,
+   not the sharded optimizer states.
* ``translate_function`` (function) (default: None): Function to translate the full
  checkpoint into smdistributed.modelparallel format.
  For supported models, this is not required.
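For reference, a minimal usage sketch of the updated signature (not part of this change; the path and tag are placeholders and assume a partial checkpoint was saved earlier in the training script):

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Resume training from a previously saved partial checkpoint,
    # including the sharded optimizer state.
    smp.resume_from_checkpoint(
        path="/opt/ml/checkpoints",
        tag="step_1000",
        partial=True,
        strict=True,
        load_optimizer=True,
        load_sharded_optimizer_state=True,
    )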

doc/frameworks/pytorch/using_pytorch.rst

+101-1
@@ -262,7 +262,8 @@ during the PyTorch DDP initialization.
.. note::

-    The SageMaker PyTorch estimator doesn’t use ``torchrun`` for distributed training.
+    The SageMaker PyTorch estimator runs ``mpirun`` in the backend.
+    It doesn’t use ``torchrun`` for distributed training.

For more information about setting up PyTorch DDP in your training script,
see `Getting Started with Distributed Data Parallel
@@ -292,7 +293,106 @@ using two ``ml.p4d.24xlarge`` instances:
pt_estimator.fit("s3://bucket/path/to/training/data")

.. _distributed-pytorch-training-on-trainium:

Distributed PyTorch Training on Trainium
========================================

SageMaker Training on Trainium instances now supports the ``xla``
package through ``torchrun``. With this, you do not need to manually pass ``RANK``,
``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``. You can launch the training job using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.

.. note::

    This ``torch_distributed`` support is available
    in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
    To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.

    SageMaker Debugger and Profiler are currently not supported with Trainium instances.

Adapt Your Training Script to Initialize with the XLA backend
-------------------------------------------------------------

To initialize distributed training in your script, call
`torch.distributed.init_process_group
<https://pytorch.org/docs/master/distributed.html#torch.distributed.init_process_group>`_
with the ``xla`` backend as shown below.

.. code:: python

    import torch.distributed as dist

    dist.init_process_group('xla')

SageMaker takes care of ``'MASTER_ADDR'`` and ``'MASTER_PORT'`` for you via ``torchrun``.

For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
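A rough end-to-end sketch of such a training script entry point (not part of this change), assuming the PyTorch Neuron container provides the ``torch_xla`` package and that ``torchrun`` supplies the rank-related environment variables:

.. code:: python

    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the 'xla' backend

    def main():
        # torchrun, launched by SageMaker, sets RANK, WORLD_SIZE,
        # MASTER_ADDR, and MASTER_PORT before this runs.
        dist.init_process_group("xla")
        device = xm.xla_device()
        print(f"worker {dist.get_rank()} of {dist.get_world_size()} on {device}")
        # ... build the model, wrap it for data parallelism, and train ...

    if __name__ == "__main__":
        main()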
**Currently supported backends:**

- ``xla`` for Trainium (Trn1) instances

For up-to-date information on supported backends for Trainium instances, see the `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.

Launching a Distributed Training Job on Trainium
------------------------------------------------

You can run multi-node distributed PyTorch training jobs on Trainium instances using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
With ``instance_count=1``, the estimator submits a
single-node training job to SageMaker; with ``instance_count`` greater
than one, a multi-node training job is launched.

With the ``torch_distributed`` option, the SageMaker PyTorch estimator runs a SageMaker
training container for PyTorch Neuron, sets up the environment, and launches
the training job using the ``torchrun`` command on each worker with the given information.

**Examples**

The following examples show how to run a PyTorch training job using ``torch_distributed`` in SageMaker
on one ``ml.trn1.2xlarge`` instance and on two ``ml.trn1.32xlarge`` instances:

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")
*********************
Deploy PyTorch Models
doc/requirements.txt

+1
@@ -3,3 +3,4 @@ sphinx-rtd-theme==0.5.0
docutils==0.15.2
packaging==20.9
jinja2<3.1
+ schema==0.7.5

doc/workflows/pipelines/sagemaker.workflow.pipelines.rst

+2
@@ -132,6 +132,8 @@ Step Collections
.. autoclass:: sagemaker.workflow.model_step.ModelStep

+ .. autoclass:: sagemaker.workflow.monitor_batch_transform_step.MonitorBatchTransformStep

Steps
-----

requirements/extras/test_requirements.txt

+1-1
@@ -11,7 +11,7 @@ contextlib2==21.6.0
awslogs==0.14.0
black==22.3.0
stopit==1.1.2
- apache-airflow==2.4.0
+ apache-airflow==2.4.1
apache-airflow-providers-amazon==4.0.0
attrs==22.1.0
fabric==2.6.0
