
Commit 0dcbc01

Committed Jan 24, 2014
Merge branch 'master' of git://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
2 parents 22933e3 + 55f8df4

File tree

3 files changed: +12 −12 lines


Chapter4_TheGreatestTheoremNeverTold/LawOfLargeNumbers.ipynb

Lines changed: 2 additions & 2 deletions
@@ -489,7 +489,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scraping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
+"One way to determine a prior on the upvote ratio is to look at the historical distribution of upvote ratios. This can be accomplished by scraping Reddit's comments and determining a distribution. There are a few problems with this technique though:\n",
 "\n",
 "1. Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes (see the \"triangular plot\" in the above Kaggle dataset), effectively skewing our distribution to the extremes. One could try to only use comments with votes greater than some threshold. Again, problems are encountered. There is a tradeoff between number of comments available to use and a higher threshold with associated ratio precision. \n",
 "2. Biased data: Reddit is composed of different subpages, called subreddits. Two examples are *r/aww*, which posts pics of cute animals, and *r/politics*. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friend and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same. \n",
@@ -674,7 +674,7 @@
 "\n",
 "### Sorting!\n",
 "\n",
-"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean bad choice though. This is because the mean does not take into account the uncertainty of distributions.\n",
+"We have been ignoring the goal of this exercise: how do we sort the comments from *best to worst*? Of course, we cannot sort distributions, we must sort scalar numbers. There are many ways to distill a distribution down to a scalar: expressing the distribution through its expected value, or mean, is one way. Choosing the mean is a bad choice though. This is because the mean does not take into account the uncertainty of distributions.\n",
 "\n",
 "I suggest using the *95% least plausible value*, defined as the value such that there is only a 5% chance the true parameter is lower (think of the lower bound on the 95% credible region). Below are the posterior distributions with the 95% least-plausible value plotted:"
 ]
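For context, a minimal sketch of the sorting rule this hunk describes: rank comments by the 5th percentile of their posterior upvote-ratio samples. The `posterior_samples` dict and its Beta draws are illustrative stand-ins, not the notebook's actual variables.

    import numpy as np

    # Illustrative posteriors: a heavily-voted comment (tight) and a
    # lightly-voted one (wide).
    posterior_samples = {
        "comment_a": np.random.beta(20, 5, size=10000),
        "comment_b": np.random.beta(3, 1, size=10000),
    }

    # The 95% least plausible value is the 5th percentile: there is only
    # a 5% chance the true ratio falls below it.
    least_plausible = {c: np.percentile(s, 5) for c, s in posterior_samples.items()}

    # Sort comments from best to worst by this scalar.
    ranking = sorted(least_plausible, key=least_plausible.get, reverse=True)
    print(ranking)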

Chapter5_LossFunctions/LossFunctions.ipynb

Lines changed: 2 additions & 2 deletions
@@ -119,7 +119,7 @@
 "\n",
 "Notice that measuring your loss via an *expected value* uses more information from the distribution than the MAP estimate which, if you recall, will only find the maximum value of the distribution and ignore the shape of the distribution. Ignoring information can over-expose yourself to tail risks, like the unlikely hurricane, and leaves your estimate ignorant of how ignorant you really are about the parameter.\n",
 "\n",
-"Similarly, compare this with frequentist methods, that traditionally only aim to minimize the error, and not considering the *loss associated with the result of that error*. Compound this with the fact that frequentist methods are almost guaranteed to never be absolutely accurate. Bayesian point estimates fix this by planning ahead: your estimate is going to be wrong, you might as well err on the right side of wrong."
+"Similarly, compare this with frequentist methods, that traditionally only aim to minimize the error, and do not consider the *loss associated with the result of that error*. Compound this with the fact that frequentist methods are almost guaranteed to never be absolutely accurate. Bayesian point estimates fix this by planning ahead: your estimate is going to be wrong, you might as well err on the right side of wrong."
 ]
 },
 {
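A short sketch of the contrast drawn in this hunk: the expected loss averages a loss function over the whole posterior, while the MAP estimate keeps only the distribution's peak. The skewed posterior and squared-error loss below are illustrative assumptions, not the notebook's example.

    import numpy as np

    posterior_samples = np.random.exponential(1., size=10000)  # skewed posterior

    def expected_loss(estimate, samples):
        # Average squared-error loss over every plausible parameter value.
        return ((estimate - samples) ** 2).mean()

    map_estimate = 0.  # the exponential's density peaks at zero
    grid = np.linspace(0, 3, 301)
    bayes_action = grid[np.argmin([expected_loss(e, posterior_samples) for e in grid])]

    # The MAP estimate ignores the long right tail and incurs a larger
    # expected loss than the estimate chosen by minimizing it.
    print(expected_loss(map_estimate, posterior_samples),
          expected_loss(bayes_action, posterior_samples))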
@@ -583,7 +583,7 @@
 "def stock_loss( true_return, yhat, alpha = 100. ):\n",
 "    if true_return*yhat < 0:\n",
 "        #opposite signs, not good\n",
-"        return alpha*yhat**2 - sign( true_return )*yhat \\\n",
+"        return alpha*yhat**2 - np.sign( true_return )*yhat \\\n",
 "            + abs( true_return ) \n",
 "    else:\n",
 "        return abs( true_return - yhat )\n",

Chapter6_Priorities/Priors.ipynb

Lines changed: 8 additions & 8 deletions
@@ -112,7 +112,7 @@
 "source": [
 "### Decision, decisions...\n",
 "\n",
-"The choice, either *objective* or *subjective* mostly depend on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:\n",
+"The choice, either *objective* or *subjective* mostly depends on the problem being solved, but there are a few cases where one is preferred over the other. In instances of scientific research, the choice of an objective prior is obvious. This eliminates any biases in the results, and two researchers who might have differing prior opinions would feel an objective prior is fair. Consider a more extreme situation:\n",
 "\n",
 "> A tobacco company publishes a report with a Bayesian methodology that retreated 60 years of medical research on tobacco use. Would you believe the results? Unlikely. The researchers probably chose a subjective prior that too strongly biased results in their favor.\n",
 "\n",
@@ -145,7 +145,7 @@
 "source": [
 "### Empirical Bayes\n",
 "\n",
-"While not a true Bayesian method, *empirical Bayes* is a trick that combines frequentist and Bayesian inference. As mentioned previously, for (almost) every inference problem there is a Bayesian method and a frequentist method. The significant difference between the two is that Bayesian methods have a prior distribution, with hyperparameters $\\alpha$, while empirical methods do not have any notion of a prior. Empirical Bayes combines the two methods by using frequentist methods to select $\\alpha$, and then proceeding with Bayesian methods on the original problem. \n",
+"While not a true Bayesian method, *empirical Bayes* is a trick that combines frequentist and Bayesian inference. As mentioned previously, for (almost) every inference problem there is a Bayesian method and a frequentist method. The significant difference between the two is that Bayesian methods have a prior distribution, with hyperparameters $\\alpha$, while empirical methods do not have any notion of a prior. Empirical Bayes combines the two methods by using frequentist methods to select $\\alpha$, and then proceeds with Bayesian methods on the original problem. \n",
 "\n",
 "A very simple example follows: suppose we wish to estimate the parameter $\\mu$ of a Normal distribution, with $\\sigma = 5$. Since $\\mu$ could range over the whole real line, we can use a Normal distribution as a prior for $\\mu$. How to select the prior's hyperparameters, denoted ($\\mu_p, \\sigma_p^2$)? The $\\sigma_p^2$ parameter can be chosen to reflect the uncertainty we have. For $\\mu_p$, we have two options:\n",
 "\n",
@@ -265,7 +265,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"One thing to notice is that the symmetry of these matrices. The Wishart distribution can be a little troubling to deal with, but we will use it in an example later."
+"One thing to notice is the symmetry of these matrices. The Wishart distribution can be a little troubling to deal with, but we will use it in an example later."
 ]
 },
 {
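A quick check of the symmetry this hunk points out, using scipy's Wishart distribution as a stand-in for however the matrices above were generated:

    import numpy as np
    from scipy.stats import wishart

    sample = wishart.rvs(df=4, scale=np.eye(3))   # one 3x3 draw
    print(np.allclose(sample, sample.T))          # Wishart draws are symmetric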
@@ -536,7 +536,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Note that we don't real care how accurate we become about inference of the hidden probabilities &mdash; for this problem we are more interested in choosing the best bandit (or more accurately, becoming *more confident* in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be) but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.\n",
+"Note that we don't really care how accurate we become about the inference of the hidden probabilities &mdash; for this problem we are more interested in choosing the best bandit (or more accurately, becoming *more confident* in choosing the best bandit). For this reason, the distribution of the red bandit is very wide (representing ignorance about what that hidden probability might be) but we are reasonably confident that it is not the best, so the algorithm chooses to ignore it.\n",
 "\n",
 "From the above, we can see that after 1000 pulls, the majority of the \"blue\" function leads the pack, hence we will almost always choose this arm. This is good, as this arm is indeed the best.\n",
 "\n",
@@ -865,7 +865,7 @@
 "\n",
 "- If interested in the *minimum* probability (eg: where prizes are a bad thing), simply choose $B = \text{argmin} \; X_b$ and proceed.\n",
 "\n",
-"- Adding learning rates: Suppose the underlying environment may change over time. Technically the standard Bayesian Bandit algorithm would self-update itself (awesome) by noting that what it thought was the best is starting to fail more often, we can motivate the algorithm to learn changing environments quicker. We simply need to add a *rate* term upon updating:\n",
+"- Adding learning rates: Suppose the underlying environment may change over time. Technically the standard Bayesian Bandit algorithm would self-update itself (awesome) by noting that what it thought was the best is starting to fail more often. We can motivate the algorithm to learn changing environments quicker by simply adding a *rate* term upon updating:\n",
 "\n",
     self.wins[ choice ] = rate*self.wins[ choice ] + result
     self.trials[ choice ] = rate*self.trials[ choice ] + 1
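A sketch of where that rate term might sit in a minimal Thompson-sampling bandit; only the two update lines come from the notebook, the surrounding class scaffolding is assumed:

    import numpy as np

    class RatedBandit:
        def __init__(self, n_bandits, rate=0.99):
            self.wins = np.zeros(n_bandits)
            self.trials = np.zeros(n_bandits)
            self.rate = rate   # rate < 1 discounts old observations

        def choose(self):
            # Thompson sampling: draw from each arm's Beta posterior.
            samples = np.random.beta(1 + self.wins, 1 + self.trials - self.wins)
            return np.argmax(samples)

        def update(self, choice, result):
            # Discounting past counts lets the posterior "forget" a
            # changing environment; rate = 1 recovers the standard update.
            self.wins[choice] = self.rate * self.wins[choice] + result
            self.trials[choice] = self.rate * self.trials[choice] + 1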
@@ -874,7 +874,7 @@
 "\n",
 "- Hierarchical algorithms: We can setup a Bayesian Bandit algorithm on top of smaller bandit algorithms. Suppose we have $N$ Bayesian Bandit models, each varying in some behavior (for example different `rate` parameters, representing varying sensitivity to changing environments). On top of these $N$ models is another Bayesian Bandit learner that will select a sub-Bayesian Bandit. This chosen Bayesian Bandit will then make an internal choice as to which machine to pull. The super-Bayesian Bandit updates itself depending on whether the sub-Bayesian Bandit was correct or not. \n",
 "\n",
-"- Extending the rewards, denoted $y_a$ for bandit $a$, to random variables from a distribution $f_{y_a}(y)$ is straightforward. More generally, this problem can be rephrased as \"Find the bandit with the largest expected value\", as playing the bandit with the largest expected value is optimal. In the case above, $f_{y_a}$ was Bernoulli with probability $p_a$, hence the expected value for an bandit is equal to $p_a$, which is why it looks like we are aiming to maximize the probability of winning. If $f$ is not Bernoulli, and it is non-negative, which can be accomplished apriori by shifting the distribution (we assume we know $f$), then the algorithm behaves as before:\n",
+"- Extending the rewards, denoted $y_a$ for bandit $a$, to random variables from a distribution $f_{y_a}(y)$ is straightforward. More generally, this problem can be rephrased as \"Find the bandit with the largest expected value\", as playing the bandit with the largest expected value is optimal. In the case above, $f_{y_a}$ was Bernoulli with probability $p_a$, hence the expected value for a bandit is equal to $p_a$, which is why it looks like we are aiming to maximize the probability of winning. If $f$ is not Bernoulli, and it is non-negative, which can be accomplished apriori by shifting the distribution (we assume we know $f$), then the algorithm behaves as before:\n",
 "\n",
     For each round, 
     
@@ -887,7 +887,7 @@
 "\n",
 "- There has been some interest in extending the Bayesian Bandit algorithm to commenting systems. Recall in Chapter 4, we developed a ranking algorithm based on the Bayesian lower-bound of the proportion of upvotes to total votes. One problem with this approach is that it will bias the top rankings towards older comments, since older comments naturally have more votes (and hence the lower-bound is tighter to the true proportion). This creates a positive feedback cycle where older comments gain more votes, hence are displayed more often, hence gain more votes, etc. This pushes any new, potentially better comments, towards the bottom. J. Neufeld proposes a system to remedy this that uses a Bayesian Bandit solution.\n",
 "\n",
-"His proposal is to consider each comment as a Bandit, with a the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\max$ sample, the comments are ranked according the the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
+"His proposal is to consider each comment as a Bandit, with the number of pulls equal to the number of votes cast, and number of rewards as the number of upvotes, hence creating a $\text{Beta}(1+U,1+D)$ posterior. As visitors visit the page, samples are drawn from each bandit/comment, but instead of displaying the comment with the $\max$ sample, the comments are ranked according to the ranking of their respective samples. From J. Neufeld's blog [7]:\n",
 "\n",
 " > [The] resulting ranking algorithm is quite straightforward, each new time the comments page is loaded, the score for each comment is sampled from a $\text{Beta}(1+U,1+D)$, comments are then ranked by this score in descending order... This randomization has a unique benefit in that even untouched comments $(U=1,D=0)$ have some chance of being seen even in threads with 5000+ comments (something that is not happening now), but, at the same time, the user is not likely to be inundated with rating these new comments. "
 ]
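A minimal sketch of the ranking the quoted passage describes, assuming `comments` is a list of (upvotes, downvotes) pairs (illustrative data, not Neufeld's code):

    import numpy as np

    comments = [(50, 4), (3, 0), (120, 30), (0, 0)]   # (U, D) per comment

    # On each page load, draw one score per comment from Beta(1+U, 1+D)...
    scores = [np.random.beta(1 + U, 1 + D) for U, D in comments]

    # ...and display comments in descending order of their sampled scores.
    print(np.argsort(scores)[::-1])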
@@ -998,7 +998,7 @@
 "\n",
 "The *expected daily return* of a stock is denoted $\mu = E[ r_t ] $. Obviously, stocks with high expected returns are desirable. Unfortunately, stock returns are so filled with noise that it is very hard to estimate this parameter. Furthermore, the parameter might change over time (consider the rises and falls of AAPL stock), hence it is unwise to use a large historical dataset. \n",
 "\n",
-"Historically, the expected return has been estimated by using the sample mean. This is a bad idea. As mentioned, the sample mean of a small dataset size has enormous potential to be very wrong (again, see Chapter 4 for full details). Thus Bayesian inference is the correct procedure here, since we are able to see our uncertainty along with probable values.\n",
+"Historically, the expected return has been estimated by using the sample mean. This is a bad idea. As mentioned, the sample mean of a small sized dataset has enormous potential to be very wrong (again, see Chapter 4 for full details). Thus Bayesian inference is the correct procedure here, since we are able to see our uncertainty along with probable values.\n",
 "\n",
 "For this exercise, we will be examining the daily returns of the AAPL, GOOG, MSFT and AMZN. Before we pull in the data, suppose we ask our a stock fund manager (an expert in finance, but see [10] ), \n",
 "\n",
