
Commit e37174d

Merge pull request aws#296 from juliodelgadoaws/master
Use SageMaker SDK RandomCutForest estimator
2 parents 6283c40 + b63c6a1 commit e37174d

introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb

Lines changed: 34 additions & 128 deletions
@@ -50,7 +50,7 @@
 "\n",
 "*This notebook was created and tested on an ml.m4.xlarge notebook instance.*\n",
 "\n",
-"Our first step is to setup our AWS credentials so that AWS SageMaker can store and access data. We also need some data to inspect and to train upon."
+"Our first step is to set up our AWS credentials so that AWS SageMaker can store and access training data and model artifacts. We also need some data to inspect and to train upon."
 ]
 },
 {
@@ -203,46 +203,6 @@
 "taxi_data[5952:6000]"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Store Data on S3\n",
-"\n",
-"The Random Cut Forest Algorithm accepts data in [RecordIO](https://mxnet.apache.org/api/python/io/io.html#module-mxnet.recordio) [Protobuf](https://developers.google.com/protocol-buffers/) format. The SageMaker Python API provides helper functions for easily converting your data into this format. Below we convert the taxicab data and upload it to the `bucket + prefix` Amazon S3 destination specified at the beginning of this notebook in the [Setup AWS Credentials](#Setup-AWS-Credentials) section."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"def convert_and_upload_training_data(ndarray, bucket, prefix, filename='data.pbr'):\n",
-"    import boto3\n",
-"    import os\n",
-"    from sagemaker.amazon.common import numpy_to_record_serializer\n",
-"    \n",
-"    # convert numpy array to Protobuf RecordIO format\n",
-"    serializer = numpy_to_record_serializer()\n",
-"    buffer = serializer(ndarray)\n",
-"    \n",
-"    # Upload to S3\n",
-"    s3_object = os.path.join(prefix, 'train', filename)\n",
-"    boto3.Session().resource('s3').Bucket(bucket).Object(s3_object).upload_fileobj(buffer)\n",
-"    \n",
-"    s3_path = 's3://{}/{}'.format(bucket, s3_object)\n",
-"    return s3_path\n",
-"\n",
-"\n",
-"# RCV only works on an array of values.\n",
-"s3_train_data = convert_and_upload_training_data(\n",
-"    taxi_data.value.as_matrix().reshape(-1,1),\n",
-"    bucket,\n",
-"    prefix)\n",
-"print('Uploaded data to {}'.format(s3_train_data))"
-]
-},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -251,30 +211,7 @@
 "\n",
 "***\n",
 "\n",
-"We have created a training data set and uploaded it to S3. Next, we configure a SageMaker training job to use the Random Cut Forest (RCF) algorithm on said training data.\n",
-"\n",
-"The first step is to specify the location of the Docker image containing the SageMaker Random Cut Forest algorithm. In order to minimize communication latency, we provide containers for each AWS region in which SageMaker is available. The code below automatically chooses an algorithm container based on the current region; that is, the region in which this notebook is run."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"import boto3\n",
-"\n",
-"# select the algorithm container based on this notebook's current location\n",
-"containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/randomcutforest:latest',\n",
-"              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest',\n",
-"              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/randomcutforest:latest',\n",
-"              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/randomcutforest:latest',\n",
-"              'ap-northeast-1': '351501993468.dkr.ecr.ap-northeast-1.amazonaws.com/randomcutforest:latest',\n",
-"              'ap-northeast-2': '835164637446.dkr.ecr.ap-northeast-2.amazonaws.com/randomcutforest:latest'}\n",
-"region_name = boto3.Session().region_name\n",
-"container = containers[region_name]\n",
-"\n",
-"print('Using SageMaker RCF container: {} ({})'.format(container, region_name))"
+"Next, we configure a SageMaker training job to train the Random Cut Forest (RCF) algorithm on the taxi cab data."
 ]
 },
 {
@@ -302,35 +239,21 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"from sagemaker import RandomCutForest\n",
+"\n",
 "session = sagemaker.Session()\n",
 "\n",
 "# specify general training job information\n",
-"rcf = sagemaker.estimator.Estimator(\n",
-"    container,\n",
-"    execution_role,\n",
-"    output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
-"    train_instance_count=1,\n",
-"    train_instance_type='ml.m4.xlarge',\n",
-"    sagemaker_session=session,\n",
-")\n",
-"\n",
-"# set algorithm-specific hyperparameters\n",
-"rcf.set_hyperparameters(\n",
-"    num_samples_per_tree = 200,\n",
-"    num_trees = 50,\n",
-"    feature_dim = 1,\n",
-")\n",
-"\n",
-"# RCF training requires sharded data. See documentation for\n",
-"# more information.\n",
-"s3_train_input = sagemaker.session.s3_input(\n",
-"    s3_train_data,\n",
-"    distribution='ShardedByS3Key',\n",
-"    content_type='application/x-recordio-protobuf',\n",
-")\n",
+"rcf = RandomCutForest(role=execution_role,\n",
+"                      train_instance_count=1,\n",
+"                      train_instance_type='ml.m4.xlarge',\n",
+"                      data_location='s3://{}/{}/'.format(bucket, prefix),\n",
+"                      output_path='s3://{}/{}/output'.format(bucket, prefix),\n",
+"                      num_samples_per_tree=512,\n",
+"                      num_trees=50)\n",
 "\n",
-"# run the training job on input data stored in S3\n",
-"rcf.fit({'train': s3_train_input})"
+"# automatically upload the training data to S3 and run the training job\n",
+"rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1,1)))"
 ]
 },
 {
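
For reference, a minimal sketch of how the training cell reads after this change, assuming `bucket`, `prefix`, `execution_role`, and the `taxi_data` DataFrame are defined earlier in the notebook (names are taken from the diff above; this is illustrative, not the committed cell verbatim):

import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()

# The SDK estimator resolves the RCF algorithm container for the current
# region, so the hand-maintained ECR image map removed above is no longer needed.
rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# record_set() serializes the numpy array to RecordIO-protobuf and uploads it
# under data_location, replacing the convert_and_upload_training_data() helper
# and the explicit s3_input configuration that this commit deletes.
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
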
@@ -408,15 +331,14 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from sagemaker.predictor import csv_serializer, json_deserializer\n",
 "\n",
 "rcf_inference.content_type = 'text/csv'\n",
 "rcf_inference.serializer = csv_serializer\n",
+"rcf_inference.accept = 'application/json'\n",
 "rcf_inference.deserializer = json_deserializer"
 ]
 },
@@ -436,9 +358,8 @@
 "outputs": [],
 "source": [
 "taxi_data_numpy = taxi_data.value.as_matrix().reshape(-1,1)\n",
-"results = rcf_inference.predict(taxi_data_numpy[:6])\n",
-"\n",
-"print(results)"
+"print(taxi_data_numpy[:6])\n",
+"results = rcf_inference.predict(taxi_data_numpy[:6])"
 ]
 },
 {
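
As a hedged illustration of what the `predict` call above returns once the CSV serializer and JSON deserializer are attached (the response shape is inferred from how scores are consumed later in this notebook; the numeric values are made up):

results = rcf_inference.predict(taxi_data_numpy[:6])

# results is a JSON-decoded dict with one anomaly score per input record, e.g.
# {'scores': [{'score': 0.98}, {'score': 1.01}, ...]}   (values illustrative)
scores = [datum['score'] for datum in results['scores']]
print(scores)
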
@@ -609,17 +530,11 @@
 "        shingled_data[n] = data[n:(n+shingle_size)]\n",
 "    return shingled_data\n",
 "\n",
-"# single data with shingle size=48 (one day) and upload to S3\n",
+"# shingle data with shingle size=48 (one day)\n",
 "shingle_size = 48\n",
 "prefix_shingled = 'sagemaker/randomcutforest_shingled'\n",
 "taxi_data_shingled = shingle(taxi_data.values[:,1], shingle_size)\n",
-"s3_train_data_shingled = convert_and_upload_training_data(\n",
-"    taxi_data_shingled,\n",
-"    bucket,\n",
-"    prefix_shingled)\n",
-"\n",
-"print(taxi_data_shingled)\n",
-"print(s3_train_data_shingled)"
+"print(taxi_data_shingled)"
 ]
 },
 {
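
The hunk above shows only the last two lines of the `shingle` helper. A minimal sketch of the full function, reconstructed from those lines and therefore illustrative rather than the committed source, might look like this:

import numpy as np

def shingle(data, shingle_size):
    # Slide a window of length shingle_size over the 1-D series, producing one
    # row per window position; with shingle_size=48 each row covers one day.
    num_data = len(data)
    shingled_data = np.zeros((num_data - shingle_size, shingle_size))
    for n in range(num_data - shingle_size):
        shingled_data[n] = data[n:(n + shingle_size)]
    return shingled_data
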
@@ -637,28 +552,17 @@
 "source": [
 "session = sagemaker.Session()\n",
 "\n",
-"rcf = sagemaker.estimator.Estimator(\n",
-"    container,\n",
-"    execution_role,\n",
-"    output_path='s3://{}/{}/output'.format(bucket, prefix_shingled),\n",
-"    train_instance_count=1,\n",
-"    train_instance_type='ml.m4.xlarge',\n",
-"    sagemaker_session=session,\n",
-")\n",
-"\n",
-"rcf.set_hyperparameters(\n",
-"    num_samples_per_tree = 200,\n",
-"    num_trees = 50,\n",
-"    feature_dim = shingle_size,\n",
-")\n",
-"\n",
-"s3_train_input = sagemaker.session.s3_input(\n",
-"    s3_train_data_shingled,\n",
-"    distribution='ShardedByS3Key',\n",
-"    content_type='application/x-recordio-protobuf',\n",
-")\n",
+"# specify general training job information\n",
+"rcf = RandomCutForest(role=execution_role,\n",
+"                      train_instance_count=1,\n",
+"                      train_instance_type='ml.m4.xlarge',\n",
+"                      data_location='s3://{}/{}/'.format(bucket, prefix_shingled),\n",
+"                      output_path='s3://{}/{}/output'.format(bucket, prefix_shingled),\n",
+"                      num_samples_per_tree=512,\n",
+"                      num_trees=50)\n",
 "\n",
-"rcf.fit({'train': s3_train_input})"
+"# automatically upload the training data to S3 and run the training job\n",
+"rcf.fit(rcf.record_set(taxi_data_shingled))"
 ]
 },
 {
@@ -676,6 +580,7 @@
 "\n",
 "rcf_inference.content_type = 'text/csv'\n",
 "rcf_inference.serializer = csv_serializer\n",
+"rcf_inference.accept = 'application/json'\n",
 "rcf_inference.deserializer = json_deserializer"
 ]
 },
@@ -692,12 +597,13 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"# Score the shingled datapoints\n",
 "results = rcf_inference.predict(taxi_data_shingled)\n",
 "scores = np.array([datum['score'] for datum in results['scores']])\n",
 "\n",
 "# compute the shingled score distribution and cutoff and determine anomalous scores\n",
-"score_mean = taxi_data.score.mean()\n",
-"score_std = taxi_data.score.std()\n",
+"score_mean = scores.mean()\n",
+"score_std = scores.std()\n",
 "score_cutoff = score_mean + 3*score_std\n",
 "\n",
 "anomalies = scores[scores > score_cutoff]\n",
@@ -747,7 +653,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see that with this particular shingle size, hyperparameter selection, and anomaly cutoff threshold that the shingled approach more clearly captures the major anomalous events: the spike at around t=6000 and the dip at around t=10000. In general, the number of trees, sample size, and anomaly score cutoff are all parameters that a data scientist may need experiment with in order to achieve desired results. The use of a labeled test dataset allows the used to obtain common accuracy metrics for anomaly detection algorithms. For more information about Amazon SageMaker Random Cut Forest see the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html)."
+"We see that with this particular shingle size, hyperparameter selection, and anomaly cutoff threshold, the shingled approach more clearly captures the major anomalous events: the spike at around t=6000 and the dips at around t=9000 and t=10000. In general, the number of trees, sample size, and anomaly score cutoff are all parameters that a data scientist may need to experiment with in order to achieve desired results. The use of a labeled test dataset allows the user to obtain common accuracy metrics for anomaly detection algorithms. For more information about Amazon SageMaker Random Cut Forest, see the [AWS Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html)."
 ]
 },
 {
