|
563 | 563 | " 'num_topics': str(num_topics),\n",
|
564 | 564 | " 'feature_dim': str(vocabulary_size),\n",
|
565 | 565 | " 'mini_batch_size': str(num_documents_training),\n",
|
566 |
| - " 'alpha0': str(0.1),\n", |
| 566 | + " 'alpha0': str(1.0),\n", |
567 | 567 | " },\n",
|
568 | 568 | " 'InputDataConfig': [\n",
|
569 | 569 | " {\n",
|
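The `alpha0` change above (from 0.1 to 1.0) adjusts the initial guess for the concentration parameter of the Dirichlet prior over topic mixtures. A minimal sketch of why this matters, using plain NumPy rather than anything from the notebook: values of `alpha0` below 1.0 favor sparse mixtures concentrated on a few topics, while `alpha0 = 1.0` spreads mass evenly over the probability simplex.

    import numpy as np

    # Sample topic mixtures from a symmetric Dirichlet prior over 5 topics.
    # Small alpha0 concentrates each draw on a few topics; alpha0 = 1.0 is
    # uniform over the probability simplex.
    rng = np.random.default_rng(0)
    for alpha0 in (0.1, 1.0):
        sample = rng.dirichlet(np.full(5, alpha0))
        print('alpha0 = {}: {}'.format(alpha0, np.round(sample, 2)))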
|
1048 | 1048 | "cell_type": "markdown",
|
1049 | 1049 | "metadata": {},
|
1050 | 1050 | "source": [
|
1051 |
| - "Let's also compute and plot the distribution of L1-errors from **all** of the test documents" |
| 1051 | + "In the eyeball-norm these look quite comparable.\n", |
| 1052 | + "\n", |
| 1053 | + "Let's be more scientific about this. Below we compute and plot the distribution of L1-errors from **all** of the test documents. Note that we send a new payload of test documents to the inference endpoint and apply the appropriate permutation to the output." |
1052 | 1054 | ]
|
1053 | 1055 | },
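The `permutation` applied to the inferred topic mixtures is computed earlier in the notebook, where the learned topics are matched against the known topics. As a hedged sketch of one way such an alignment could be derived (the helper `match_topics` below is illustrative, not the notebook's actual code), topics can be paired by minimizing the total L1 distance between topic-word distributions with the Hungarian algorithm:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_topics(known_beta, inferred_beta):
        # Each row of a beta matrix is one topic's distribution over the
        # vocabulary; cost[i, j] is the L1 distance between known topic i
        # and inferred topic j.
        cost = np.abs(known_beta[:, None, :] - inferred_beta[None, :, :]).sum(axis=2)
        row_ind, col_ind = linear_sum_assignment(cost)
        # col_ind[i] is the inferred topic matched to known topic i, so
        # inferred_mixtures[:, col_ind] reorders columns into known order.
        return col_ind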
|
1054 | 1056 | {
|
|
1061 | 1063 | "\n",
|
1062 | 1064 | "# create a payload containing all of the test documents and run inference again\n",
|
1063 | 1065 | "#\n",
|
1064 |
| - "# try switching between the test data set and a subset of the training data set\n", |
| 1066 | + "# TRY THIS:\n", |
| 1067 | + "# try switching between the test data set and a subset of the training\n", |
| 1068 | + "# data set. It is likely that LDA inference will perform better against\n", |
| 1069 | + "# the training set than the holdout test set.\n", |
1065 | 1070 | "#\n",
|
1066 |
| - "payload_documents = documents_test # Example 1\n", |
1067 |
| - "#payload_documents = documents_training[:600] # Example 2\n", |
| 1071 | + "payload_documents = documents_test # Example 1\n", |
| 1072 | + "known_topic_mixtures = topic_mixtures_test # Example 1\n", |
| 1073 | + "#payload_documents = documents_training[:600] # Example 2\n", |
| 1074 | + "#known_topic_mixtures = topic_mixtures_training[:600] # Example 2\n", |
1068 | 1075 | "\n",
|
1069 | 1076 | "print('Invoking endpoint...\\n')\n",
|
1070 | 1077 | "payload = np2csv(payload_documents)\n",
|
|
1078 | 1085 | "inferred_topic_mixtures_permuted = np.array([prediction['topic_mixture'] for prediction in results['predictions']])\n",
|
1079 | 1086 | "inferred_topic_mixtures = inferred_topic_mixtures_permuted[:,permutation]\n",
|
1080 | 1087 | "\n",
|
1081 |
| - "print('topics_mixtures_test.shape = {}'.format(topic_mixtures_test.shape))\n", |
| 1088 | + "print('known_topic_mixtures.shape = {}'.format(known_topic_mixtures.shape))\n", |
1082 | 1089 | "print('inferred_topic_mixtures.shape = {}\\n'.format(inferred_topic_mixtures.shape))"
|
1083 | 1090 | ]
|
1084 | 1091 | },
|
|
1088 | 1095 | "metadata": {},
|
1089 | 1096 | "outputs": [],
|
1090 | 1097 | "source": [
|
1091 |
| - "l1_errors = np.linalg.norm((inferred_topic_mixtures - topic_mixtures_test), 1, axis=1)\n", |
| 1098 | + "l1_errors = np.linalg.norm((inferred_topic_mixtures - known_topic_mixtures), 1, axis=1)\n", |
1092 | 1099 | "\n",
|
1093 | 1100 | "# plot the error frequency\n",
|
1094 | 1101 | "fig, ax_frequency = plt.subplots()\n",
|
|
1122 | 1129 | "fig.set_dpi(110)"
|
1123 | 1130 | ]
|
1124 | 1131 | },
|
| 1132 | + { |
| 1133 | + "cell_type": "markdown", |
| 1134 | + "metadata": {}, |
| 1135 | + "source": [ |
| 1136 | + "Machine learning algorithms are not perfect, and the data above suggests that this is true of SageMaker LDA as well. With more documents and some hyperparameter tuning we can obtain more accurate results against the known topic-mixtures.\n", |
| 1137 | + "\n", |
| 1138 | + "For now, let's investigate the document-topic mixture pairs on which inference did well, as well as those on which it did not. Below we retrieve a document and topic mixture corresponding to a small L1-error, as well as one with a large L1-error." |
| 1139 | + ] |
| 1140 | + }, |
| 1141 | + { |
| 1142 | + "cell_type": "code", |
| 1143 | + "execution_count": null, |
| 1144 | + "metadata": {}, |
| 1145 | + "outputs": [], |
| 1146 | + "source": [ |
| 1147 | + "N = 6\n", |
| 1148 | + "\n", |
| 1149 | + "good_idx = (l1_errors < 0.1)\n", |
| 1150 | + "good_documents = payload_documents[good_idx][:N]\n", |
| 1151 | + "good_topic_mixtures = inferred_topic_mixtures[good_idx][:N]\n", |
| 1152 | + "\n", |
| 1153 | + "poor_idx = (l1_errors > 0.4)\n", |
| 1154 | + "poor_documents = payload_documents[poor_idx][:N]\n", |
| 1155 | + "poor_topic_mixtures = inferred_topic_mixtures[poor_idx][:N]" |
| 1156 | + ] |
| 1157 | + }, |
| 1158 | + { |
| 1159 | + "cell_type": "code", |
| 1160 | + "execution_count": null, |
| 1161 | + "metadata": {}, |
| 1162 | + "outputs": [], |
| 1163 | + "source": [ |
| 1164 | + "%matplotlib inline\n", |
| 1165 | + "\n", |
| 1166 | + "fig = plot_lda_topics(good_documents, 2, 3, topic_mixtures=good_topic_mixtures)\n", |
| 1167 | + "fig.suptitle('Documents With Accurate Inferred Topic-Mixtures')\n", |
| 1168 | + "fig.set_dpi(120)" |
| 1169 | + ] |
| 1170 | + }, |
| 1171 | + { |
| 1172 | + "cell_type": "code", |
| 1173 | + "execution_count": null, |
| 1174 | + "metadata": {}, |
| 1175 | + "outputs": [], |
| 1176 | + "source": [ |
| 1177 | + "%matplotlib inline\n", |
| 1178 | + "\n", |
| 1179 | + "fig = plot_lda_topics(poor_documents, 2, 3, topic_mixtures=poor_topic_mixtures)\n", |
| 1180 | + "fig.suptitle('Documents With Inaccurate Inferred Topic-Mixtures')\n", |
| 1181 | + "fig.set_dpi(120)" |
| 1182 | + ] |
| 1183 | + }, |
| 1184 | + { |
| 1185 | + "cell_type": "markdown", |
| 1186 | + "metadata": {}, |
| 1187 | + "source": [ |
| 1188 | + "In this example set, the documents on which inference was less accurate tend to have denser topic-mixtures. This makes sense when extrapolated to real-world datasets: it can be difficult to nail down which topics are represented in a document when the document uses words from a large subset of the vocabulary." |
| 1189 | + ] |
| 1190 | + }, |
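One rough way to test the "denser mixture" observation above, as a sketch built on variables already defined in the notebook (`known_topic_mixtures`, `good_idx`, `poor_idx`) rather than an actual cell from it, is to compare the entropy of the known topic mixtures behind the accurate and inaccurate inferences; denser mixtures carry higher entropy:

    import numpy as np

    def mixture_entropy(mixtures, eps=1e-12):
        # Shannon entropy of each row; higher values mean the mixture
        # spreads its mass over more of the topics.
        p = np.clip(mixtures, eps, 1.0)
        return -(p * np.log(p)).sum(axis=1)

    print('mean entropy, accurate  :', mixture_entropy(known_topic_mixtures[good_idx]).mean())
    print('mean entropy, inaccurate:', mixture_entropy(known_topic_mixtures[poor_idx]).mean())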
1125 | 1191 | {
|
1126 | 1192 | "cell_type": "markdown",
|
1127 | 1193 | "metadata": {},
|
|