This repository was archived by the owner on Oct 4, 2024. It is now read-only.

Commit 2c32c51

sentiment analysis notebook changes for SDK v2 etc.

1 parent: 3199d6a

File tree: 1 file changed (+20 −100 lines)

tf-sentiment-script-mode/sentiment-analysis.ipynb

Lines changed: 20 additions & 100 deletions
@@ -6,15 +6,15 @@
 "source": [
 "# Sentiment Analysis with TensorFlow 2\n",
 "\n",
-"Amazon SageMaker provides both built-in algorithms and an easy path to train your own custom models. Although the built-in algorithms cover many domains (computer vision, natural language processing etc.) and are easy to use (just provide your data), sometimes training a custom model is the preferred approach. This notebook will focus on training a custom model using TensorFlow 2. \n",
+"Amazon SageMaker provides both (1) built-in algorithms and (2) an easy path to train your own custom models. Although the built-in algorithms cover many domains (computer vision, natural language processing etc.) and are easy to use (just provide your data), sometimes training a custom model is the preferred approach. This notebook will focus on training a custom model using TensorFlow 2. \n",
 "\n",
-"Sentiment analysis is a very common text analytics task that involves determining whether a text sample is positive or negative about its subject. There are several different algorithms for performing this task, including statistical algorithms and deep learning algorithms. With respect to deep learning, a 1D Convolutional Neural Net (CNN) is sometimes used for this purpose. In this notebook we'll use a CNN built with TensorFlow 2 to perform sentiment analysis in Amazon SageMaker on the IMDB dataset, which consists of movie reviews labeled as having positive or negative sentiment. Three aspects of Amazon SageMaker will be demonstrated:\n",
+"Sentiment analysis is a very common text analytics task that determines whether a text sample is positive or negative about its subject. There are several different algorithms for performing this task, including older statistical algorithms and newer deep learning algorithms. With respect to deep learning, a 1D Convolutional Neural Net (CNN) is sometimes used for this purpose. In this notebook we'll use a CNN built with TensorFlow 2 to perform sentiment analysis in Amazon SageMaker on the IMDb dataset, which consists of movie reviews labeled as having positive or negative sentiment. Several aspects of Amazon SageMaker will be demonstrated:\n",
 "\n",
 "- How to use a SageMaker prebuilt TensorFlow 2 container with a custom model training script similar to one you would use outside SageMaker. This feature is known as Script Mode. \n",
-"- Local Mode training: this allows you to prototype and test your code on a low-cost notebook instance before creating a full scale training job on more powerful and expensive instances.\n",
-"- Hosted training: for full scale training on a complete dataset. \n",
-"- Distributed training: using a single, multi-GPU instance to speed up training. \n",
-"- Batch Transform for offline, asynchronous predictions on large batches of data. \n",
+"- Hosted training: for full scale training on a complete dataset on a separate, larger and more powerful SageMaker-managed GPU instance. \n",
+"- Distributed training: using multiple GPUs to speed up training. \n",
+"- Batch Transform: for offline, asynchronous predictions on large batches of data. \n",
+"- Instance type choices: many different kinds of CPU and GPU instances are available in SageMaker, and are applicable to different use cases.\n",
 "\n",
 "# Prepare the dataset\n",
 "\n",
@@ -54,7 +54,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Training data for both Local Mode training and hosted Training must be saved as files. Accordingly, we'll save the padded data to files, locally for now, and later to Amazon S3."
+"Next, we'll save the padded data to files, locally for now, and later to Amazon S3."
 ]
 },
 {
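For context, a minimal sketch of what preparing and saving the padded IMDb data locally might look like, assuming the notebook loads the dataset via tf.keras; the vocabulary size, sequence length, file names, and directory layout below are illustrative, not taken from the actual notebook cell:

```python
import os
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative values; the notebook chooses its own vocabulary size and sequence length.
vocab_size, maxlen = 20000, 400
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

# Save the padded data to local files first; it will be uploaded to S3 later.
train_dir, test_dir = 'data/train', 'data/test'
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)
np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(test_dir, 'x_test.npy'), x_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)
```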
@@ -86,33 +86,13 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Local Mode Training & Git Integration\n",
+"# SageMaker Training\n",
 "\n",
-"Amazon SageMaker’s Local Mode training feature is a convenient way to make sure your code is working as expected before moving on to full scale, hosted training. With Local Mode, you can run quick tests with just a sample of training data, and/or a small number of epochs (passes over the full training set), while avoiding the time and expense of attempting full scale hosted training using possibly buggy code. \n",
+"With our dataset prepared, we're now ready to set up a SageMaker hosted training job. The core concept of SageMaker hosted training is to use more powerful compute resources separate from the less powerful, lower cost notebook instance that you use for prototyping. Hosted training spins up one or more instances (i.e. a cluster) for training, and then tears the cluster down when training is complete, with billing per second for cluster up time. In general, hosted training is preferred for doing actual large-scale training on more powerful instances, especially for distributed training on a single large instance with multiple GPUs, or multiple instances each having multiple GPUs. \n",
 "\n",
-"### Prepare for Local Mode\n",
-"\n",
-"To train in Local Mode, it is necessary to have docker-compose or nvidia-docker-compose (for GPU) installed. Running the following script will install docker-compose or nvidia-docker-compose and configure the notebook environment for you."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/local_mode_setup.sh\n",
-"!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/daemon.json \n",
-"!/bin/bash ./local_mode_setup.sh"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
 "### Git Configuration\n",
 "\n",
-"Now we need a training script that can be used to train the model in Amazon SageMaker. In this example, we'll use Git integration. That is, you can specify a training script that is stored in a GitHub, AWS CodeCommit or another Git repository as the entry point so that you don't have to download the scripts locally. If you do so, any source directory and dependencies should be in the same repo if they are needed.\n",
+"To begin, we need a training script that can be used to train the model in Amazon SageMaker. In this example, we'll use Git integration. That is, you can specify a training script that is stored in a GitHub, AWS CodeCommit or another Git repository as the entry point so that you don't have to download the scripts locally. For this purpose, the source directory and dependencies should be in the same repository.\n",
 "\n",
 "To use Git integration, pass a dict `git_config` as a parameter when you create an Amazon SageMaker Estimator object. In the `git_config` parameter, you specify the fields `repo`, `branch` and `commit` to locate the specific repo you want to use. If you do not specify `commit` in `git_config`, the latest commit of the specified repo and branch will be used by default. Also, if authentication is required to access the repo, you can specify fields `2FA_enabled`, `username`, `password` and `token` accordingly.\n",
 "\n",
@@ -133,66 +113,6 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Local Mode Estimator setup\n",
-"\n",
-"The next step is to set up a TensorFlow Estimator for Local Mode training. The Estimator will be used to configure and start a training job. A key parameters for the Estimator is the `train_instance_type`, which is the kind of hardware on which training will run. In the case of Local Mode, we simply set this parameter to `local_gpu` to invoke Local Mode training on the GPU, or to `local` if the instance has a CPU. Other parameters of note are the algorithm’s hyperparameters, which are passed in as a dictionary, and a Boolean parameter indicating that we are using Script Mode."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"import sagemaker\n",
-"from sagemaker.tensorflow import TensorFlow\n",
-"\n",
-"\n",
-"model_dir = '/opt/ml/model'\n",
-"train_instance_type = 'local'\n",
-"hyperparameters = {'epochs': 1, 'batch_size': 64, 'learning_rate': 0.01}\n",
-"local_estimator = TensorFlow(\n",
-" git_config=git_config,\n",
-" source_dir='tf-sentiment-script-mode',\n",
-" entry_point='sentiment.py',\n",
-" model_dir=model_dir,\n",
-" train_instance_type=train_instance_type,\n",
-" train_instance_count=1,\n",
-" hyperparameters=hyperparameters,\n",
-" role=sagemaker.get_execution_role(),\n",
-" base_job_name='tf-sentiment',\n",
-" framework_version='2.1',\n",
-" py_version='py3',\n",
-" script_mode=True)"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Now we'll briefly train the model in Local Mode. Since this is just to make sure the code is working, we'll train for only one epoch. On a notebook instance using a general purpose CPU-based instance type such as m5.xlarge, this one epoch will take about 3 or 4 minutes. As you'll see from the logs below the cell when training is complete, even when trained for only one epoch, the accuracy of the model on training data is around 80%. "
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"inputs = {'train': f'file://{train_dir}',\n",
-" 'test': f'file://{test_dir}'}\n",
-"\n",
-"local_estimator.fit(inputs)"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"# Hosted Training\n",
-"\n",
-"After we've confirmed our code is working using Local Mode training, we can move on to use SageMaker's hosted training, which typically uses more powerful compute resources separate from your less powerful, lower cost notebook instance. Hosted training spins up one or more instances (i.e. a cluster) for training, and then tears the cluster down when training is complete. In general, hosted training is preferred for doing actual large-scale training on more powerful instances, especially for distributed training on a single instance with multiple GPUs or multiple instances each having multiple GPUs. \n",
-"\n",
 "### Upload data to S3\n",
 "\n",
 "Before starting hosted training, the data must be present in storage that can be accessed by SageMaker. The storage options are: Amazon S3 (object storage service), Amazon EFS (elastic NFS file system service), and Amazon FSx for Lustre (high-performance file system service). For this example, we'll upload the data to S3. "
@@ -222,9 +142,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Hosted training Estimator setup\n",
+"### Estimator setup\n",
 "\n",
-"With the training data now in S3, we're ready to set up an Estimator object for hosted training. It is similar to the Local Mode Estimator, except the `train_instance_type` has been set to a ML instance type instead of a local type for Local Mode. Additionally, we've increased the number of epochs for actual training, as opposed to just testing the code in Local Mode."
+"With the training data now in S3, we're ready to set up an Estimator object for hosted training. Most of the Estimator parameters are self-explanatory; further discussion of the instance type selection is below. The parameters most likely to change between different training jobs, the algorithm hyperparameters, are passed in as a dictionary."
 ]
 },
 {
@@ -244,8 +164,8 @@
 " source_dir='tf-sentiment-script-mode',\n",
 " entry_point='sentiment.py',\n",
 " model_dir=model_dir,\n",
-" train_instance_type=train_instance_type,\n",
-" train_instance_count=1,\n",
+" instance_type=train_instance_type,\n",
+" instance_count=1,\n",
 " hyperparameters=hyperparameters,\n",
 " role=sagemaker.get_execution_role(),\n",
 " base_job_name='tf-sentiment',\n",
@@ -260,9 +180,9 @@
 "source": [
 "### Distributed training on a single multi-GPU instance\n",
 "\n",
-"The instance type selected above, p3.8xlarge, contains four GPUs based on NVIDIA's V100 Tensor Core architecture. This presents an opportunity to do distributed training within a single multi-GPU instance, utilizing all four GPUs to reduce total training time compared to using a single GPU. Although using multiple instances also is a possibility, using a single multi-GPU instance may be more performant because it avoids extra network traffic necessary to coordinate multiple instances. For larger datasets and more complex models, using multiple instances may be a necessity, however, that is not the case here.\n",
+"The SageMaker instance type selected above, p3.8xlarge, contains four GPUs based on NVIDIA's V100 Tensor Core architecture. This presents an opportunity to do distributed training within a single multi-GPU instance, utilizing all four GPUs to reduce total training time compared to using a single GPU. Although using multiple instances also is a possibility, using a single multi-GPU instance may be more performant because it avoids extra network traffic necessary to coordinate multiple instances. For larger datasets and more complex models, using multiple instances may be a necessity, however, that is not the case here.\n",
 "\n",
-"To utilize all four GPUs on the instance, you don't need to do anything special in Amazon SageMaker: TensorFlow 2 itself will handle the details under the hood. TensorFlow 2 includes several native distribution strategies, including MirroredStrategy, which is well-suited for training a model using multiple GPUs on a single instance. To enable MirroredStrategy, simply add the following lines of code in your training script before you define and compile the model:\n",
+"To utilize all four GPUs on the instance, you don't need to do anything special in Amazon SageMaker: TensorFlow 2 itself will handle the details under the hood. TensorFlow 2 includes several native distribution strategies, including MirroredStrategy, which is well-suited for training a model using multiple GPUs on a single instance. To enable MirroredStrategy, we simply add the following lines of code in the training script before defining and compiling the model (this has already been done for this example):\n",
 "\n",
 "```python\n",
 "def get_model(learning_rate):\n",
@@ -281,7 +201,7 @@
 " return model\n",
 "```\n",
 "\n",
-"Additionally, the batch size was increased in the Estimator hyperparameters to account for the fact that batches are divided among multiple GPUs. If you are interested in reviewing the rest of the training code, it is at the GitHub repository referenced above in the `git_config` variable. "
+"Additionally, the batch size is increased in the Estimator hyperparameters to account for the fact that batches are divided among multiple GPUs. If you are interested in reviewing the rest of the training code, it is at the GitHub repository referenced above in the `git_config` variable. "
 ]
 },
 {
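Since the diff only shows the opening and closing lines of that snippet, here is a self-contained sketch of what enabling MirroredStrategy inside `get_model` could look like; the layer sizes and optimizer are illustrative, not copied from the actual sentiment.py script:

```python
import tensorflow as tf

def get_model(learning_rate):
    # MirroredStrategy replicates the model across all GPUs visible on the
    # instance and aggregates gradients across the replicas.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Illustrative 1D-CNN architecture; the real script's layers may differ.
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(input_dim=20000, output_dim=128),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Conv1D(128, 5, activation='relu'),
            tf.keras.layers.GlobalMaxPooling1D(),
            tf.keras.layers.Dense(1, activation='sigmoid')])
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                      loss='binary_crossentropy',
                      metrics=['accuracy'])
    return model
```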
@@ -290,7 +210,7 @@
 "source": [
 "### Start the hosted training job\n",
 "\n",
-"With the change in training instance type and increase in epochs, we simply call `fit` to start the actual hosted training. The training job should take around 5 minutes, including the time needed to spin up the training instance. At the end of hosted training, you'll see from the logs below the cell that accuracy on the training set has greatly increased, and accuracy on the validation set is approaching 90%. "
+"We simply call `fit` to start the actual hosted training. The training job should take around 5 minutes, including the time needed to spin up the training instance. At the end of hosted training, you'll see from the logs below the code cell that validation accuracy is approaching 90%, and the number of billable seconds (which should be in the neighborhood of 180). "
 ]
 },
 {
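A minimal sketch of that call, assuming the S3 channel URIs returned by `upload_data` in the sketch above:

```python
# The 'train' and 'test' channel names must match what the training script expects.
inputs = {'train': train_s3, 'test': test_s3}
estimator.fit(inputs)
```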
@@ -306,7 +226,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Although the training accuracy has improved, the validation accuracy plateaued, so the model might be overfitting now: it could be less able to generalize to data it has not yet seen. This is the case even though we are employing dropout as a regularization technique to reduce the possibility of overfitting. (See the training script at the GitHub repository referenced above.) For a production model, further experimentation would be necessary.\n",
+"The validation accuracy appears to have plateaued, so the model might be overfitting: it might be less able to generalize to data it has not yet seen. This is the case even though we are employing dropout as a regularization technique to reduce the possibility of overfitting. (See the training script at the GitHub repository referenced above.) For a production model, further experimentation would be necessary.\n",
 "\n",
 "TensorFlow 2's tf.keras API provides a convenient way to capture the history of model training. When the model was saved after training, the history was saved alongside it. To retrieve the history, we first download the trained model from the S3 bucket where SageMaker stored it. Models trained by SageMaker are always accessible in this way to be run anywhere. Next, we can unzip it to gain access to the history data structure, and then simply load the history as JSON:"
 ]
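A hedged sketch of that retrieval step; the history file name inside the model archive ('history.json' below) is an assumption, so use whatever name the training script actually chose when it saved the history:

```python
import json
import tarfile
from sagemaker.s3 import S3Downloader

# estimator.model_data is the S3 URI of the model.tar.gz written by the training job.
S3Downloader.download(estimator.model_data, './model')
with tarfile.open('./model/model.tar.gz') as tar:
    tar.extractall(path='./model')

# Assumed file name; the actual archive may store the history under a different name.
with open('./model/history.json') as f:
    history = json.load(f)
print(history.keys())
```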
@@ -522,7 +442,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.6"
+"version": "3.6.10"
 }
 },
 "nbformat": 4,

0 commit comments
