documentation: add doc for SMDDP collectives #3

Status: Open. Wants to merge 41 commits into base branch ``smddp-coll``.

Commits (41):
c97c467
fix: type hint of PySparkProcessor __init__ (#3297)
NivekNey Dec 2, 2022
de58941
fix: fix PySparkProcessor __init__ params type (#3354)
andre-marcos-perez Dec 2, 2022
41dd330
fix: Allow Py 3.7 for MMS Test Docker env (#3080)
shreyapandit Dec 2, 2022
1e23a3f
refactoring : using with statement (#3286)
maldil Dec 2, 2022
19efadf
Update local_requirements.txt PyYAML version (#3095)
shreyapandit Dec 2, 2022
76f7782
feature: Update TF 2.9 and TF 2.10 inference DLCs (#3465)
arjkesh Dec 2, 2022
fde0738
feature: Added transform with monitoring pipeline step in transformer…
keshav-chandak Dec 2, 2022
7f9f3b0
fix: Fix bug forcing uploaded tar to be named sourcedir (#3412)
claytonparnell Dec 2, 2022
5d59767
feature: Add Code Owners file (#3503)
navinsoni Dec 2, 2022
0f5cf18
prepare release v2.119.0
Dec 3, 2022
f1f0013
update development version to v2.119.1.dev0
Dec 3, 2022
bb4b689
feature: Add DXB region to frameworks by DLC (#3387)
RadhikaB-97 Dec 5, 2022
b68bcd9
fix: support idempotency for framework and spark processors (#3460)
brockwade633 Dec 5, 2022
32969da
feature: Update registries with new region account number mappings. (…
kenny-ezirim Dec 6, 2022
767da0a
feature: Adding support for SageMaker Training Compiler in PyTorch es…
Lokiiiiii Dec 7, 2022
d779d1b
feature: Add Neo image uri config for Pytorch 1.12 (#3507)
HappyAmazonian Dec 7, 2022
83327fb
prepare release v2.120.0
Dec 7, 2022
5bffb04
update development version to v2.120.1.dev0
Dec 7, 2022
b828396
feature: Algorithms Region Expansion OSU/DXB (#3508)
malav-shastri Dec 7, 2022
357f732
fix: Add constraints file for apache-airflow (#3510)
navinsoni Dec 7, 2022
a28d1dd
fix: FrameworkProcessor S3 uploads (#3493)
brockwade633 Dec 8, 2022
11d2475
prepare release v2.121.0
Dec 8, 2022
24171b5
update development version to v2.121.1.dev0
Dec 8, 2022
d5847d5
Fix: Differentiate SageMaker Training Compiler's PT DLCs from base PT…
Lokiiiiii Dec 8, 2022
3f6ea88
fix: Fix failing jumpstart cache unit tests (#3514)
evakravi Dec 8, 2022
4570aa6
fix: Pop out ModelPackageName from pipeline definition (#3472)
qidewenwhen Dec 9, 2022
959ea1a
prepare release v2.121.1
Dec 9, 2022
b2e8b66
update development version to v2.121.2.dev0
Dec 9, 2022
355975d
fix: Skip Bad Transform Test (#3521)
amzn-choeric Dec 9, 2022
fadc817
fix: Revert "fix: type hint of PySparkProcessor __init__" (#3524)
mufaddal-rohawala Dec 9, 2022
c5fc93f
change: Update for Tensorflow Serving 2.11 inference DLCs (#3509)
hballuru Dec 9, 2022
ec8da98
prepare release v2.121.2
Dec 12, 2022
0352122
update development version to v2.121.3.dev0
Dec 12, 2022
aabe852
Add use_accl option to pytorchddp distribution
vishwakaria Sep 7, 2022
cf54133
Update API spec to use communication_options
vishwakaria Nov 24, 2022
35e72a8
Remove extra whitespace in logs
vishwakaria Nov 28, 2022
6b52088
Fix unit tests
vishwakaria Nov 28, 2022
690a1d3
Update comments
vishwakaria Nov 28, 2022
c7e7f0d
Update logs
vishwakaria Nov 29, 2022
558e3f8
Update warning logs
vishwakaria Dec 5, 2022
83d754e
documentation: add doc for smddp collectives
mchoi8739 Dec 12, 2022
67 changes: 67 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,72 @@
# Changelog

## v2.121.2 (2022-12-12)

### Bug Fixes and Other Changes

* Update for Tensorflow Serving 2.11 inference DLCs
* Revert "fix: type hint of PySparkProcessor __init__"
* Skip Bad Transform Test

## v2.121.1 (2022-12-09)

### Bug Fixes and Other Changes

* Pop out ModelPackageName from pipeline definition
* Fix failing jumpstart cache unit tests

## v2.121.0 (2022-12-08)

### Features

* Algorithms Region Expansion OSU/DXB

### Bug Fixes and Other Changes

* FrameworkProcessor S3 uploads
* Add constraints file for apache-airflow

## v2.120.0 (2022-12-07)

### Features

* Add Neo image uri config for Pytorch 1.12
* Adding support for SageMaker Training Compiler in PyTorch estimator starting 1.12
* Update registries with new region account number mappings.
* Add DXB region to frameworks by DLC

### Bug Fixes and Other Changes

* support idempotency for framework and spark processors

## v2.119.0 (2022-12-03)

### Features

* Add Code Owners file
* Added transform with monitoring pipeline step in transformer
* Update TF 2.9 and TF 2.10 inference DLCs
* make estimator accept json file as modelparallel config
* SageMaker Training Compiler does not support p4de instances
* Add support for SparkML v3.3

### Bug Fixes and Other Changes

* Fix bug forcing uploaded tar to be named sourcedir
* Update local_requirements.txt PyYAML version
* refactoring : using with statement
* Allow Py 3.7 for MMS Test Docker env
* fix PySparkProcessor __init__ params type
* type hint of PySparkProcessor __init__
* Return ARM XGB/SKLearn tags if `image_scope` is `inference_graviton`
* Update scipy to 1.7.3 to support M1 development envs
* Fixing type hints for Spark processor that has instance type/count params in reverse order
* Add DeepAR ap-northeast-3 repository.
* Fix AsyncInferenceConfig documentation typo
* fix ml_inf to ml_inf1 in Neo multi-version support
* Fix type annotations
* add neo mvp region accounts

## v2.118.0 (2022-12-01)

### Features
1 change: 1 addition & 0 deletions CODEOWNERS
@@ -0,0 +1 @@
* @aws/sagemaker-ml-frameworks
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.118.1.dev0
2.121.3.dev0
70 changes: 66 additions & 4 deletions doc/frameworks/pytorch/using_pytorch.rst
@@ -140,6 +140,8 @@ The function of installing packages using ``requirements.txt`` is supported for

A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. For information about the format of a ``requirements.txt`` file, see `Requirements Files <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`__ in the pip documentation.
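For example, a minimal ``requirements.txt`` might look like the following (the package names and version pins here are illustrative, not taken from this repository):

```text
numpy>=1.21
pandas==1.3.5
scipy
```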

-----

Create an Estimator
===================

@@ -161,7 +163,7 @@ directories ('train' and 'test').
'test': 's3://my-data-bucket/path/to/my/test/data'})



-----

Call the fit Method
===================
@@ -196,6 +198,7 @@ fit Optional Arguments
- ``logs``: Defaults to True, whether to show logs produced by training
job in the Python session. Only meaningful when wait is True.

-----

Distributed PyTorch Training
============================
@@ -212,7 +215,7 @@ with the ``pytorchddp`` option as the distribution strategy.
.. note::

This PyTorch DDP support is available
in the SageMaker PyTorch Deep Learning Containers v1.12 and later.
in the SageMaker PyTorch Deep Learning Containers v1.11 and later.

Adapt Your Training Script
--------------------------
@@ -238,7 +241,6 @@ but you can also overwrite them.

**Supported backends:**

- ``gloo`` and ``tcp`` for CPU instances
- ``gloo`` and ``nccl`` for GPU instances

Launching a Distributed Training Job
@@ -268,7 +270,7 @@ during the PyTorch DDP initialization.
For more information about setting up PyTorch DDP in your training script,
see `Getting Started with Distributed Data Parallel
<https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
PyTorch documentation.
*PyTorch documentation*.

The following example shows how to run a PyTorch DDP training in SageMaker
using two ``ml.p4d.24xlarge`` instances:
@@ -293,6 +295,64 @@ using two ``ml.p4d.24xlarge`` instances:

pt_estimator.fit("s3://bucket/path/to/training/data")

SMDDP collectives for PyTorch DDP
---------------------------------

Starting with PyTorch 1.12.1, when you run PyTorch DDP training jobs on SageMaker,
you can benefit from performance gains by using *SMDDP collectives*
with *no code changes* in your PyTorch training script.
*SMDDP collectives* are the
collective communication primitives offered by the `SageMaker data parallelism library
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
The SMDDP collectives provide a GPU-based
AllReduce operation optimized for the AWS network infrastructure and the Amazon EC2
instance topology. These collectives are designed to address suboptimal performance
in PyTorch DDP distributed training when using NCCL, the general-purpose collective communications library
from NVIDIA, on AWS infrastructure.

The SMDDP collectives are enabled by default for the ``pytorchddp`` distribution option
if the job configuration is supported. To achieve optimal training performance with
faster communication, SageMaker compares the performance of the SMDDP and NCCL
collectives at runtime and automatically selects the backend that performs better.

This means that by default the backend parameter of the ``communication_options`` for
``pytorchddp`` distribution is set to ``auto``. The goal of the automated backend
selection is to abstract away the overhead of deciding the right communication backend
for your training job, while ensuring that you get the best possible training performance
at a lower cost.

The following code shows an example of setting up a PyTorch estimator
to run a PyTorch DDP training job on two ``ml.p4d.24xlarge`` instances
with the communication backend set to ``auto``.

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.1",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True,
                "communication_options": {  # Optional. This is set to auto by default.
                    "backend": "auto"
                }
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")

Review comment on lines +342 to +344

Owner: Can we make these lines bold?

Collaborator (author): Unfortunately, Sphinx code blocks don't support adding styles.

Collaborator (author, mchoi8739, Dec 12, 2022): In a code block, every character is rendered as literal code, so the bold markup itself would be shown instead of bold styling being applied to those lines.

If you want to turn off the automatic backend selection and force the use of NCCL,
you can explicitly set ``"backend": "nccl"`` as the communication backend.
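For instance, only the ``distribution`` dictionary would change relative to the estimator example above; a hypothetical sketch of that dictionary on its own:

```python
# Sketch: force NCCL instead of the default automatic backend selection.
# Only this dict differs from the ``auto`` example above; all other
# estimator arguments stay the same.
distribution = {
    "pytorchddp": {
        "enabled": True,
        "communication_options": {
            "backend": "nccl"  # "auto" (the default) lets SageMaker pick SMDDP or NCCL
        },
    }
}
```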

-----

.. _distributed-pytorch-training-on-trainium:

Distributed Training with PyTorch Neuron on Trn1 instances
@@ -408,6 +468,8 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:

pt_estimator.fit("s3://bucket/path/to/training/data")

-----

*********************
Deploy PyTorch Models
*********************
1 change: 1 addition & 0 deletions requirements/extras/test_requirements.txt
@@ -11,6 +11,7 @@ contextlib2==21.6.0
awslogs==0.14.0
black==22.3.0
stopit==1.1.2
# Update tox.ini to have correct version of airflow constraints file
apache-airflow==2.4.1
apache-airflow-providers-amazon==4.0.0
attrs==22.1.0
2 changes: 1 addition & 1 deletion setup.py
@@ -55,7 +55,7 @@ def read_requirements(filename):
"protobuf3-to-dict>=0.1.5,<1.0",
"smdebug_rulesconfig==1.0.1",
"importlib-metadata>=1.4.0,<5.0",
"packaging>=20.0",
"packaging==20.9",
"pandas",
"pathos",
"schema",
64 changes: 59 additions & 5 deletions src/sagemaker/fw_utils.py
@@ -129,9 +129,6 @@
}

PYTORCHDDP_SUPPORTED_FRAMEWORK_VERSIONS = [
"1.10",
"1.10.0",
"1.10.2",
"1.11",
"1.11.0",
"1.12",
@@ -145,6 +142,8 @@

TRAINIUM_SUPPORTED_DISTRIBUTION_STRATEGIES = ["torch_distributed"]

SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS = ("1.12.1",)
SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES = ("ml.p4d.24xlarge",)

SMDISTRIBUTED_SUPPORTED_STRATEGIES = ["dataparallel", "modelparallel"]

@@ -493,7 +492,7 @@ def framework_name_from_image(image_uri):
# We must support both the legacy and current image name format.
name_pattern = re.compile(
r"""^(?:sagemaker(?:-rl)?-)?
(tensorflow|mxnet|chainer|pytorch|scikit-learn|xgboost
(tensorflow|mxnet|chainer|pytorch|pytorch-trcomp|scikit-learn|xgboost
|huggingface-tensorflow|huggingface-pytorch
|huggingface-tensorflow-trcomp|huggingface-pytorch-trcomp)(?:-)?
(scriptmode|training)?
@@ -1043,7 +1042,6 @@ def validate_torch_distributed_distribution(
if not torch_distributed_enabled:
# Distribution strategy other than torch_distributed is selected
return

err_msg = ""
if not image_uri:
# ignore framework_version and py_version if image_uri is set
@@ -1082,6 +1080,62 @@
raise ValueError(err_msg)


def validate_smddp_collectives_support(
framework_version, py_version, image_uri, instance_type, instance_count
):
"""Check if SMDDP collective backend is supported for current invocation.

Args:
framework_version (str): A string representing the framework version selected.
py_version (str): A string representing the python version selected.
image_uri (str): A string representing a Docker image URI.
instance_type (str): SageMaker instance type.
instance_count (int): Number of training instances to use.

Returns false if:
`instance_type` is not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES or
`py_version` is not python3 or
`framework_version` is not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS or
`instance_count` is not greater than 1
"""
err_msg = ""
if not image_uri:
# ignore framework_version and py_version if image_uri is set
# in case image_uri is not set, then both are mandatory
if framework_version not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS:
err_msg += (
f"Provided framework_version {framework_version} is not supported. "
"Please specify one of the supported framework versions:"
f" {SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS}.\n"
)
if "py3" not in py_version:
err_msg += (
f"Provided py_version {py_version} is not supported. "
"Please specify py_version>=py3.\n"
)
if instance_type not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES:
err_msg += (
f"Provided instance_type {instance_type} is not supported. "
"Please specify one of the supported instance types:"
f"{SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES}.\n"
)
if instance_count == 1:
# Communication backend auto is not supported for single-node jobs
err_msg += (
"SMDDP Collective backend is not supported for single-node jobs. "
"Please increase instance_count to be greater than 1.\n"
)
if not err_msg:
return True
logger.warning(
"The system is not compatible or not configured to run SMDDP collectives optimized"
" for AWS infrastructure.\n%s",
err_msg,
)
logger.warning("Continuing model training with default NCCL communication backend.\n")
return False
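The checks above can be sketched as a minimal, standalone function (local stand-in names, not the real ``sagemaker.fw_utils`` module; warning logs are omitted for brevity):

```python
# Standalone sketch of the SMDDP-collectives support check shown in the
# diff above. The constants mirror the values added to fw_utils.
SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS = ("1.12.1",)
SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES = ("ml.p4d.24xlarge",)


def smddp_collectives_supported(
    framework_version, py_version, image_uri, instance_type, instance_count
):
    """Return True only when every SMDDP-collectives requirement is met."""
    if not image_uri:
        # framework_version and py_version only matter when image_uri is unset
        if framework_version not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS:
            return False
        if "py3" not in py_version:
            return False
    if instance_type not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES:
        return False
    if instance_count == 1:
        # the auto communication backend needs a multi-node job
        return False
    return True
```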


def python_deprecation_warning(framework, latest_supported_version):
"""Placeholder docstring"""
return PYTHON_2_DEPRECATION_WARNING.format(
5 changes: 2 additions & 3 deletions src/sagemaker/git_utils.py
@@ -279,9 +279,8 @@ def _run_clone_command(repo_url, dest_dir):
subprocess.check_call(["git", "clone", repo_url, dest_dir], env=my_env)
elif repo_url.startswith("git@"):
with tempfile.NamedTemporaryFile() as sshnoprompt:
write_pipe = open(sshnoprompt.name, "w")
write_pipe.write("ssh -oBatchMode=yes $@")
write_pipe.close()
with open(sshnoprompt.name, "w") as write_pipe:
write_pipe.write("ssh -oBatchMode=yes $@")
os.chmod(sshnoprompt.name, 0o511)
my_env["GIT_SSH"] = sshnoprompt.name
subprocess.check_call(["git", "clone", repo_url, dest_dir], env=my_env)
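The refactored ``with``-statement pattern from this hunk can be sketched in isolation as follows (the git invocation itself is elided, and the readback at the end exists only to demonstrate the result):

```python
import os
import tempfile

# Nested `with` blocks: the inner one guarantees the wrapper script is
# flushed and closed before anything reads it, and the outer one removes
# the temporary file when the block exits, even on error.
with tempfile.NamedTemporaryFile() as sshnoprompt:
    with open(sshnoprompt.name, "w") as write_pipe:
        write_pipe.write("ssh -oBatchMode=yes $@")
    os.chmod(sshnoprompt.name, 0o511)  # executable wrapper for GIT_SSH
    with open(sshnoprompt.name) as readback:
        wrapper = readback.read()
```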