
Commit 1896135

add all api docs
1 parent 864cb55 commit 1896135

8 files changed, +1129 −341 lines changed

doc/api/training/smd_model_parallel.rst

Lines changed: 17 additions & 8 deletions
@@ -9,32 +9,41 @@ allowing you to increase prediction accuracy by creating larger models with more
 You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
 across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.

-Use the following sections to learn more about the model parallelism and the library.
+See the following sections to learn more about the SageMaker model parallel library APIs.

 Use with the SageMaker Python SDK
 =================================

 Use the following page to learn how to configure and enable distributed model parallel
-when you configure an Amazon SageMaker Python SDK `Estimator`.
+when you construct an Amazon SageMaker Python SDK `Estimator`.

 .. toctree::
    :maxdepth: 1

    smd_model_parallel_general

-API Documentation
-=================
+The library's API to Adapt Training Scripts
+===========================================

-The library contains a Common API that is shared across frameworks, as well as APIs
-that are specific to supported frameworks, TensorFlow and PyTorch.
+The library contains a Common API that is shared across frameworks,
+as well as framework-specific APIs for TensorFlow and PyTorch.

-Select a version to see the API documentation for version. To use the library, reference the
+Select the latest or one of the previous versions of the API documentation
+depending on which version of the library you need to use.
+To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.

 .. toctree::
-   :maxdepth: 1
+   :maxdepth: 2

    smp_versions/latest.rst
+
+To find archived API documentation for the previous versions of the library,
+see the following link:
+
+.. toctree::
+   :maxdepth: 1
+
    smp_versions/archives.rst

 It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
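
For orientation, here is a minimal sketch (not part of this diff) of how distributed model parallel is typically enabled when constructing a SageMaker Python SDK `Estimator`. The instance type, framework/Python versions, and parameter values below are illustrative assumptions, not recommendations; see ``smd_model_parallel_general`` for the actual parameter list.

.. code:: python

   # Hypothetical sketch: enable the SageMaker model parallel library on a
   # PyTorch Estimator. All values below are placeholders.
   from sagemaker.pytorch import PyTorch

   smp_options = {
       "enabled": True,
       "parameters": {        # placeholder values, not recommendations
           "partitions": 2,   # pipeline-parallel degree
           "microbatches": 4,
           "ddp": True,       # combine with data parallelism
       },
   }

   estimator = PyTorch(
       entry_point="train.py",                  # your adapted training script
       role="<your SageMaker execution role>",
       instance_type="ml.p3.16xlarge",
       instance_count=1,
       framework_version="1.8.1",
       py_version="py36",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   # estimator.fit("s3://<bucket>/<training-data>")  # launch the training job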

doc/api/training/smd_model_parallel_general.rst

Lines changed: 300 additions & 298 deletions
Large diffs are not rendered by default.

doc/api/training/smp_versions/archives.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _smdmp-pt-version-archive:
+
 Version Archive
 ===============

doc/api/training/smp_versions/latest.rst

Lines changed: 1 addition & 0 deletions
@@ -9,4 +9,5 @@ To use the library, reference the Common API documentation alongside the framewo

    latest/smd_model_parallel_common_api
    latest/smd_model_parallel_pytorch
+   latest/smd_model_parallel_pytorch_tensor_parallel
    latest/smd_model_parallel_tensorflow

doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst

Lines changed: 66 additions & 17 deletions
@@ -258,23 +258,72 @@ MPI Basics

 The library exposes the following basic MPI primitives to its Python API:

-- ``smp.rank()``: The rank of the current process.
-- ``smp.size()``: The total number of processes.
-- ``smp.mp_rank()``: The rank of the process among the processes that
-  hold the current model replica.
-- ``smp.dp_rank()``: The rank of the process among the processes that
-  hold different replicas of the same model partition.
-- ``smp.dp_size()``: The total number of model replicas.
-- ``smp.local_rank()``: The rank among the processes on the current
-  instance.
-- ``smp.local_size()``: The total number of processes on the current
-  instance.
-- ``smp.get_mp_group()``: The list of ranks over which the current
-  model replica is partitioned.
-- ``smp.get_dp_group()``: The list of ranks that hold different
-  replicas of the same model partition.
-
-.. _communication_api:
+**Global**
+
+- ``smp.rank()``: The global rank of the current process.
+- ``smp.size()``: The total number of processes.
+- ``smp.get_world_process_group()``: The
+  ``torch.distributed.ProcessGroup`` that contains all processes.
+- `smp.CommGroup.WORLD <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_common_api.html#smp.CommGroup>`__:
+  The communication group corresponding to all processes.
+- ``smp.local_rank()``: The rank among the processes on the current instance.
+- ``smp.local_size()``: The total number of processes on the current instance.
+- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned.
+- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition.
+
+**Tensor Parallelism**
+
+- ``smp.tp_rank()``: The rank of the process within its
+  tensor-parallelism group.
+- ``smp.tp_size()``: The size of the tensor-parallelism group.
+- ``smp.get_tp_process_group()``: The
+  ``torch.distributed.ProcessGroup`` that contains the processes in the
+  current tensor-parallelism group.
+- ``smp.CommGroup.TP_GROUP``: The communication group corresponding to
+  the current tensor-parallelism group.
+
+**Pipeline Parallelism**
+
+- ``smp.pp_rank()``: The rank of the process within its
+  pipeline-parallelism group.
+- ``smp.pp_size()``: The size of the pipeline-parallelism group.
+- ``smp.get_pp_process_group()``: The ``torch.distributed.ProcessGroup``
+  that contains the processes in the current pipeline-parallelism group.
+- ``smp.CommGroup.PP_GROUP``: The communication group corresponding to
+  the current pipeline-parallelism group.
+
+**Reduced-Data Parallelism**
+
+- ``smp.rdp_rank()``: The rank of the process within its
+  reduced-data-parallelism group.
+- ``smp.rdp_size()``: The size of the reduced-data-parallelism group.
+- ``smp.get_rdp_process_group()``: The ``torch.distributed.ProcessGroup``
+  that contains the processes in the current reduced-data-parallelism
+  group.
+- ``smp.CommGroup.RDP_GROUP``: The communication group corresponding
+  to the current reduced-data-parallelism group.
+
+**Model Parallelism**
+
+- ``smp.mp_rank()``: The rank of the process within its model-parallelism
+  group.
+- ``smp.mp_size()``: The size of the model-parallelism group.
+- ``smp.get_mp_process_group()``: The ``torch.distributed.ProcessGroup``
+  that contains the processes in the current model-parallelism group.
+- ``smp.CommGroup.MP_GROUP``: The communication group corresponding to
+  the current model-parallelism group.
+
+**Data Parallelism**
+
+- ``smp.dp_rank()``: The rank of the process within its data-parallelism
+  group.
+- ``smp.dp_size()``: The size of the data-parallelism group.
+- ``smp.get_dp_process_group()``: The ``torch.distributed.ProcessGroup``
+  that contains the processes in the current data-parallelism group.
+- ``smp.CommGroup.DP_GROUP``: The communication group corresponding to
+  the current data-parallelism group.
+
+.. _communication_api:

 Communication API
 ^^^^^^^^^^^^^^^^^
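
As a quick illustration (not part of this diff), a minimal sketch of querying the primitives listed above from a PyTorch training script; it assumes the script is launched by SageMaker with the library configured and tensor/pipeline parallelism enabled.

.. code:: python

   # Illustrative sketch: report where the current process sits in each
   # parallelism group. Assumes the SageMaker model parallel library
   # (PyTorch flavor) is enabled for the job.
   import smdistributed.modelparallel.torch as smp

   smp.init()

   if smp.local_rank() == 0:  # print once per instance
       print(
           f"global {smp.rank()}/{smp.size()} | "
           f"tp {smp.tp_rank()}/{smp.tp_size()} | "
           f"pp {smp.pp_rank()}/{smp.pp_size()} | "
           f"dp {smp.dp_rank()}/{smp.dp_size()} | "
           f"rdp {smp.rdp_rank()}/{smp.rdp_size()}"
       )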

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst

Lines changed: 112 additions & 13 deletions
@@ -1,14 +1,8 @@
-.. admonition:: Contents
-
-   - :ref:`pytorch_saving_loading`
-   - :ref:`pytorch_saving_loading_instructions`
-
 PyTorch API
 ===========

-**Supported versions: 1.7.1, 1.8.1**
-
-This API document assumes you use the following import statements in your training scripts.
+To use the PyTorch-specific APIs for SageMaker distributed model parallelism,
+you need to add the following import statement at the top of your training script.

 .. code:: python

@@ -19,10 +13,10 @@ This API document assumes you use the following import statements in your traini

 Refer to
 `Modify a PyTorch Training Script
-<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt>`_
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html>`_
 to learn how to use the following API in your PyTorch training script.

-.. class:: smp.DistributedModel
+.. py:class:: smp.DistributedModel()

    A sub-class of ``torch.nn.Module`` which specifies the model to be
    partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is
@@ -42,7 +36,6 @@ This API document assumes you use the following import statements in your traini
    is \ ``model``) can only be made inside a ``smp.step``-decorated
    function.

-
    Since ``DistributedModel`` is a ``torch.nn.Module``, a forward pass can
    be performed by calling the \ ``DistributedModel`` object on the input
    tensors.
@@ -56,7 +49,6 @@ This API document assumes you use the following import statements in your traini
    arguments, replacing the PyTorch operations \ ``torch.Tensor.backward``
    or ``torch.autograd.backward``.

-
    The API for ``model.backward`` is very similar to
    ``torch.autograd.backward``. For example, the following
    ``backward`` calls:
@@ -90,7 +82,7 @@ This API document assumes you use the following import statements in your traini

    **Using DDP**

-   If DDP is enabled, do not not place a PyTorch
+   If DDP is enabled with the SageMaker model parallel library, do not place a PyTorch
    ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
    the ``DistributedModel`` wrapper will also handle data parallelism.

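To connect these pieces, a minimal sketch (not part of this diff) of the training-step pattern the API implies: ``model.backward`` is called inside an ``smp.step``-decorated function in place of ``loss.backward()``. ``Net`` and ``train_loader`` are placeholders, and the microbatch reduction follows the library's ``StepOutput`` pattern.

.. code:: python

   # Hypothetical sketch of the smp.step / model.backward pattern.
   import torch
   import torch.nn.functional as F
   import smdistributed.modelparallel.torch as smp

   @smp.step
   def train_step(model, data, target):
       output = model(data)
       loss = F.nll_loss(output, target, reduction="mean")
       model.backward(loss)        # instead of loss.backward()
       return loss

   smp.init()
   model = smp.DistributedModel(Net())               # Net is a placeholder nn.Module
   optimizer = smp.DistributedOptimizer(
       torch.optim.Adam(model.parameters(), lr=1e-3)  # create the optimizer after wrapping
   )

   for data, target in train_loader:                 # train_loader is a placeholder
       optimizer.zero_grad()
       loss_mb = train_step(model, data, target)     # returns per-microbatch outputs
       loss = loss_mb.reduce_mean()                  # average across microbatches
       optimizer.step()
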
@@ -284,6 +276,113 @@ This API document assumes you use the following import statements in your traini
    `register_comm_hook <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.register_comm_hook>`__
    in the PyTorch documentation.

+   **Behavior of** ``smp.DistributedModel`` **with Tensor Parallelism**
+
+   When a model is wrapped by ``smp.DistributedModel``, the library
+   immediately traverses the modules of the model object, and replaces the
+   modules that are supported for tensor parallelism with their distributed
+   counterparts. This replacement happens in place. If there are no other
+   references to the original modules in the script, they are
+   garbage-collected. The module attributes that previously referred to the
+   original submodules now refer to the distributed versions of those
+   submodules.
+
+   **Example:**
+
+   .. code:: python
+
+      # register DistributedSubmodule as the distributed version of Submodule
+      # (note this is a hypothetical example, smp.nn.DistributedSubmodule does not exist)
+      smp.tp_register_with_module(Submodule, smp.nn.DistributedSubmodule)
+
+      class MyModule(nn.Module):
+          def __init__(self):
+              ...
+              self.submodule = Submodule()
+              ...
+
+      # enabling tensor parallelism for the entire model
+      with smp.tensor_parallelism():
+          model = MyModule()
+
+      # here model.submodule is still a Submodule object
+      assert isinstance(model.submodule, Submodule)
+
+      model = smp.DistributedModel(model)
+
+      # now model.submodule is replaced with an equivalent instance
+      # of smp.nn.DistributedSubmodule
+      assert isinstance(model.module.submodule, smp.nn.DistributedSubmodule)
+
+   If ``pipeline_parallel_degree`` (equivalently, ``partitions``) is 1, the
+   placement of model partitions into GPUs and the initial broadcast of
+   model parameters and buffers across data-parallel ranks take place
+   immediately, because there is no need to wait for the model partition
+   when the ``smp.DistributedModel`` wrapper is called. For other cases
+   with ``pipeline_parallel_degree`` greater than 1, the broadcast and
+   device placement are deferred until the first call of an
+   ``smp.step``-decorated function, because that first call is when model
+   partitioning happens if pipeline parallelism is enabled.
+
+   Because of the module replacement during the ``smp.DistributedModel``
+   call, any ``load_state_dict`` calls on the model, as well as any direct
+   access to model parameters, such as during the optimizer creation,
+   should be done **after** the ``smp.DistributedModel`` call.
+
+   Since the broadcast of the model parameters and buffers happens
+   immediately during the ``smp.DistributedModel`` call when the degree of
+   pipeline parallelism is 1, using ``@smp.step`` decorators is not
+   required when tensor parallelism is used by itself (without pipeline
+   parallelism).
+
+   For more information about the library's tensor parallelism APIs for PyTorch,
+   see :ref:`smdmp-pytorch-tensor-parallel`.
+
+   **Additional Methods of** ``smp.DistributedModel`` **for Tensor Parallelism**
+
+   The following are the new methods of ``smp.DistributedModel``, in
+   addition to the ones listed in the
+   `documentation <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.DistributedModel>`__.
+
+   .. function:: distributed_modules()
+
+      - An iterator that runs over the set of distributed
+        (tensor-parallelized) modules in the model.
+
+   .. function:: is_distributed_parameter(param)
+
+      - Returns ``True`` if the given ``nn.Parameter`` is distributed over
+        tensor-parallel ranks.
+
+   .. function:: is_distributed_buffer(buf)
+
+      - Returns ``True`` if the given buffer is distributed over
+        tensor-parallel ranks.
+
+   .. function:: is_scaled_batch_parameter(param)
+
+      - Returns ``True`` if the given ``nn.Parameter`` operates on the
+        scaled batch (batch over the entire ``TP_GROUP``, and not only the
+        local batch).
+
+   .. function:: is_scaled_batch_buffer(buf)
+
+      - Returns ``True`` if the parameter corresponding to the given
+        buffer operates on the scaled batch (batch over the entire
+        ``TP_GROUP``, and not only the local batch).
+
+   .. function:: default_reducer_named_parameters()
+
+      - Returns an iterator that runs over ``(name, param)`` tuples, for
+        ``param`` that is allreduced over the ``DP_GROUP``.
+
+   .. function:: scaled_batch_reducer_named_parameters()
+
+      - Returns an iterator that runs over ``(name, param)`` tuples, for
+        ``param`` that is allreduced over the ``RDP_GROUP``.
+


 .. class:: smp.DistributedOptimizer
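
As a usage illustration (not part of this diff), a minimal sketch that exercises the tensor-parallelism methods listed above after wrapping a model. ``MyModule`` is a placeholder, and which submodules actually get distributed depends on the modules the library supports and the job configuration.

.. code:: python

   # Hypothetical sketch: inspect what was tensor-parallelized after wrapping.
   import torch.nn as nn
   import smdistributed.modelparallel.torch as smp

   smp.init()

   with smp.tensor_parallelism():       # request tensor parallelism for supported submodules
       model = MyModule()               # MyModule is a placeholder nn.Module

   model = smp.DistributedModel(model)  # supported modules are replaced here

   # Iterate over the modules that were replaced with distributed counterparts.
   for dist_module in model.distributed_modules():
       print(type(dist_module).__name__)

   # Count parameters that are sharded across the tensor-parallel group.
   num_dist = sum(1 for p in model.parameters() if model.is_distributed_parameter(p))
   print(f"{num_dist} parameters are distributed over TP_GROUP")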
