|
10 | 10 | ]
|
11 | 11 | },
|
12 | 12 | {
|
13 |
| - "cell_type": "code", |
14 |
| - "execution_count": null, |
15 |
| - "metadata": {}, |
16 |
| - "outputs": [], |
17 |
| - "source": [ |
18 |
| - "! pip install \"pydantic>=2.0.0\" sagemaker-core" |
19 |
| - ] |
20 |
| - }, |
21 |
| - { |
22 |
| - "cell_type": "code", |
23 |
| - "execution_count": null, |
| 13 | + "cell_type": "markdown", |
24 | 14 | "metadata": {},
|
25 |
| - "outputs": [], |
26 | 15 | "source": [
|
27 |
| - "! pip install sagemaker-2.232.4.dev0.tar.gz" |
| 16 | + "# ModelTrainer\n", |
| 17 | + "The ModelTrainer is a new interface for training designed to tackle many of the challenges that exist in todays Estimator class. Some key features include:\n", |
| 18 | + "\n", |
| 19 | + "1. Improved Design Principles making the interface much more user friendly and approachable than previous Estimator.\n", |
| 20 | + "2. Remove Dependency on Training-Toolkit for Script Mode and BYOC, and bring the training toolkit’s runtime driver code within PySDK for a smoother 1-stop and an extendible distributed runner.\n", |
| 21 | + "3. Training Recipe support for easier setup of training for LLMs with PySDK" |
28 | 22 | ]
|
29 | 23 | },
|
30 | 24 | {
|
31 | 25 | "cell_type": "markdown",
|
32 | 26 | "metadata": {},
|
33 | 27 | "source": [
|
34 |
| - "## Script Mode Case - 1" |
| 28 | + "## ModelTrainer - Script Mode Case - 1\n", |
| 29 | + "\n", |
| 30 | + "This case show cases the minimal setup for a ModelTrainer. A user need only to provide a desired training image and the commands they wish to execute in the container using the `SourceCode` class object config.\n" |
35 | 31 | ]
|
36 | 32 | },
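The code cell for this first case is elided from the hunk above. A minimal sketch of what it might contain, reusing the PyTorch image URI and `SourceCode`/`ModelTrainer` names that appear in the later cells; the `command` value and the bare `train()` call are illustrative assumptions, not the notebook's actual code:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Any pullable training image works; this PyTorch DLC URI comes from a later cell.
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# Minimal setup: just the commands to execute inside the container (illustrative).
source_code = SourceCode(command="echo 'Hello World' && env")

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train()
```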
|
37 | 33 | {
|
|
68 | 64 | "cell_type": "markdown",
|
69 | 65 | "metadata": {},
|
70 | 66 | "source": [
|
71 |
| - "## Script Mode Case - 2" |
| 67 | + "## ModelTrainer - Script Mode Case - 2\n", |
| 68 | + "\n", |
| 69 | + "This case show cases an abstracted setup for script mode where a user can provide their training image and a `SourceCode` object config with path to their `source_dir`, `enty_script`, and any additional `requirements` to install in the training container." |
72 | 70 | ]
|
73 | 71 | },
|
74 | 72 | {
|
|
77 | 75 | "metadata": {},
|
78 | 76 | "outputs": [],
|
79 | 77 | "source": [
|
| 78 | + "from sagemaker.modules.train import ModelTrainer\n", |
80 | 79 | "from sagemaker.modules.configs import SourceCode\n",
|
81 | 80 | "\n",
|
| 81 | + "pytorch_image = \"763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310\"\n", |
| 82 | + "\n", |
82 | 83 | "source_code = SourceCode(\n",
|
83 | 84 | " source_dir=\"basic-script-mode\",\n",
|
84 | 85 | " requirements=\"requirements.txt\",\n",
|
|
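The hunk above truncates mid-cell; presumably the remainder wires this `source_code` into a `ModelTrainer` and starts the job. A sketch of that remainder under stated assumptions (the entry-script name is a placeholder, since the actual name is elided above):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# Abstracted script mode: a source directory, an entry script, and requirements.
source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="train.py",  # placeholder; the real script name is elided above
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train()
```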
244 | 245 | "cell_type": "markdown",
|
245 | 246 | "metadata": {},
|
246 | 247 | "source": [
|
247 |
| - "## Distributed Training - Case 1" |
| 248 | + "## ModelTrainer - Distributed Training - Case 1\n", |
| 249 | + "\n", |
| 250 | + "This use cases shows how a user could perform a more complex setup for DistributedTraining using `torchrun` and setting up commands for the execution directly using the `command` parameter in the `SourceCode` class" |
248 | 251 | ]
|
249 | 252 | },
|
250 | 253 | {
|
|
262 | 265 | "env[\"NCCL_SOCKET_IFNAME\"] = \"eth0\"\n",
|
263 | 266 | "env[\"NCCL_IB_DISABLE\"] = \"1\"\n",
|
264 | 267 | "env[\"NCCL_DEBUG\"] = \"WARN\"\n",
|
265 |
| - "env[\"FI_EFA_USE_DEVICE_RDMA\"] = \"1\"\n", |
266 |
| - "env[\"RDMAV_FORK_SAFE\"] = \"1\"\n", |
267 | 268 | "\n",
|
268 | 269 | "compute = Compute(\n",
|
269 | 270 | " instance_count=1,\n",
|
|
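The visible hunk covers the NCCL environment variables and the start of the `Compute` config; the `command`-based `torchrun` wiring the markdown describes is elided. A sketch under stated assumptions (source dir, script name, and instance type are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# NCCL settings carried over from the hunk above.
env = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_DEBUG": "WARN",
}

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder GPU instance
)

# Drive torchrun directly through the SourceCode command.
source_code = SourceCode(
    source_dir="distributed-training",  # placeholder
    command="torchrun --nnodes 1 --nproc_per_node 4 train.py",  # placeholder
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
    environment=env,
)

model_trainer.train()
```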
332 | 333 | "cell_type": "markdown",
|
333 | 334 | "metadata": {},
|
334 | 335 | "source": [
|
335 |
| - "## Distributed Training - Case 2" |
| 336 | + "## ModelTrainer - Distributed Training - Case 2\n", |
| 337 | + "\n", |
| 338 | + "This distributed training case showcases how a user could perform distributed training using an abstracted approach provided by the PySDK via the `Torchrun` distributed runner object." |
336 | 339 | ]
|
337 | 340 | },
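The code cell for this case is elided from the diff. A sketch of the abstracted runner approach, assuming the runner is passed through a `distributed` parameter on `ModelTrainer`; names other than `Torchrun` itself are placeholders:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode
from sagemaker.modules.distributed import Torchrun

source_code = SourceCode(
    source_dir="distributed-training",  # placeholder
    entry_script="train.py",            # placeholder
)

# Let the PySDK's Torchrun runner set up torchrun across the instances.
model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
    source_code=source_code,
    compute=Compute(instance_count=2, instance_type="ml.g5.12xlarge"),  # placeholders
    distributed=Torchrun(),
)

model_trainer.train()
```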
|
338 | 341 | {
|
|
421 | 424 | "cell_type": "markdown",
|
422 | 425 | "metadata": {},
|
423 | 426 | "source": [
|
424 |
| - "## Distributed Training - Case 3" |
| 427 | + "## ModelTrainer - Distributed Training - Case 3\n", |
| 428 | + "\n", |
| 429 | + "In this case, we show how distributed training job can be setup and ran using a more generic `MPI` distributed runner with some additional mpi options set through the `mpi_additional_options` parameter." |
425 | 430 | ]
|
426 | 431 | },
|
427 | 432 | {
|
|
472 | 477 | " entry_script=\"run_clm_no_trainer.py\",\n",
|
473 | 478 | ")\n",
|
474 | 479 | "\n",
|
475 |
| - "\n", |
476 | 480 | "# Run using MPI\n",
|
477 | 481 | "mpi = MPI(\n",
|
478 | 482 | " mpi_additional_options=[\n",
|
|
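The hunk cuts off inside `mpi_additional_options`. A sketch of how this case might be completed; the mpirun flags shown are illustrative placeholders, not the notebook's actual values:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode
from sagemaker.modules.distributed import MPI

source_code = SourceCode(
    source_dir="mpi-training",             # placeholder
    entry_script="run_clm_no_trainer.py",  # from the hunk above
)

# Run using MPI; these mpirun flags are illustrative only.
mpi = MPI(
    mpi_additional_options=[
        "-x",
        "NCCL_DEBUG=WARN",
    ]
)

model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
    source_code=source_code,
    compute=Compute(instance_count=2, instance_type="ml.p4d.24xlarge"),  # placeholder
    distributed=mpi,
)

model_trainer.train()
```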
510 | 514 | "cell_type": "markdown",
|
511 | 515 | "metadata": {},
|
512 | 516 | "source": [
|
513 |
| - "# ModelTrainer Recipes" |
| 517 | + "# ModelTrainer Recipes\n", |
| 518 | + "\n", |
| 519 | + "Training recipes is an abstracted way to create a ModelTrainer for training of LLM models via a recipe.yaml with configuration for the Trainer, model weights, etc." |
514 | 520 | ]
|
515 | 521 | },
|
516 | 522 | {
|
517 | 523 | "cell_type": "markdown",
|
518 | 524 | "metadata": {},
|
519 | 525 | "source": [
|
520 |
| - "## Training Recipes - Case 1" |
| 526 | + "## ModelTrainer - Training Recipes - Case 1\n", |
| 527 | + "\n", |
| 528 | + "This example showcases how a user could leverage SageMaker pre-defined training recipe `training/llama/hf_llama3_8b_seq8192_gpu` for training a llama3 model using synthetic data." |
521 | 529 | ]
|
522 | 530 | },
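The code cell for this recipe case is elided from the diff. A sketch assuming the PySDK exposes a `ModelTrainer.from_recipe` constructor that takes the recipe name and a compute config; the instance type is a placeholder:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

# Build a trainer from the pre-defined recipe; synthetic data means no input
# channels are needed.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="training/llama/hf_llama3_8b_seq8192_gpu",
    compute=Compute(instance_count=1, instance_type="ml.p5.48xlarge"),  # placeholder
)

model_trainer.train()
```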
|
523 | 531 | {
|
|
580 | 588 | "cell_type": "markdown",
|
581 | 589 | "metadata": {},
|
582 | 590 | "source": [
|
583 |
| - "## Training Recipes - Case 2" |
| 591 | + "## ModelTrainer - Training Recipes - Case 2\n", |
| 592 | + "\n", |
| 593 | + "This example showcases how a user can leverage the sagemaker recipe adaptors to train a model with configurations in a custom local `custom-recipe.yaml`." |
584 | 594 | ]
|
585 | 595 | },
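Analogously, a local YAML file can presumably be passed where the built-in recipe name went. A sketch, with the same `from_recipe` assumption as above:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

# Point the recipe adapter at a local YAML file instead of a built-in recipe.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="custom-recipe.yaml",
    compute=Compute(instance_count=1, instance_type="ml.p5.48xlarge"),  # placeholder
)

model_trainer.train()
```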
|
586 | 596 | {
|
|
615 | 625 | "cell_type": "markdown",
|
616 | 626 | "metadata": {},
|
617 | 627 | "source": [
|
618 |
| - "## Training Recipes - Case 3" |
| 628 | + "## ModelTrainer - Training Recipes - Case 3\n", |
| 629 | + "\n", |
| 630 | + "This usecase shows how a `neuronx-distributed-training` recipe from github url can be training on a trainum instance using custom data in an S3 bucket." |
619 | 631 | ]
|
620 | 632 | },
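A sketch for the Trainium case; the recipe URL, the S3 path, and the `InputData` channel wiring are all placeholders/assumptions:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData

# Recipe URL and S3 path are placeholders; substitute real values.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="https://github.com/<org>/neuronx-distributed-training/blob/main/<recipe>.yaml",
    compute=Compute(instance_count=1, instance_type="ml.trn1.32xlarge"),
)

model_trainer.train(
    input_data_config=[
        InputData(channel_name="train", data_source="s3://<bucket>/train/"),
    ],
)
```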
|
621 | 633 | {
|
|