
Commit a7435fb

Spellcheck all notebooks (#492)
1 parent 2c721e6 commit a7435fb

26 files changed, +36 -36 lines changed

examples/case_studies/BART_introduction.ipynb (+1 -1)

@@ -279,7 +279,7 @@
 "id": "8633b7b4",
 "metadata": {},
 "source": [
-"The next figure shows 3 trees. As we can see these are very simple function and definitely not very good approximators by themselves. Inspecting individuals trees is generally not necessary when working with BART, we are showing them just so we can gain further intuition on the inner workins of BART."
+"The next figure shows 3 trees. As we can see these are very simple function and definitely not very good approximators by themselves. Inspecting individuals trees is generally not necessary when working with BART, we are showing them just so we can gain further intuition on the inner workings of BART."
 ]
 },
 {

examples/case_studies/BART_introduction.myst.md (+1 -1)

@@ -117,7 +117,7 @@ The following figure shows two samples of $\mu$ from the posterior.
 plt.step(x_data, idata_coal.posterior["μ"].sel(chain=0, draw=[3, 10]).T);
 ```

-The next figure shows 3 trees. As we can see these are very simple function and definitely not very good approximators by themselves. Inspecting individuals trees is generally not necessary when working with BART, we are showing them just so we can gain further intuition on the inner workins of BART.
+The next figure shows 3 trees. As we can see these are very simple function and definitely not very good approximators by themselves. Inspecting individuals trees is generally not necessary when working with BART, we are showing them just so we can gain further intuition on the inner workings of BART.

 ```{code-cell} ipython3
 bart_trees = μ_.owner.op.all_trees

examples/case_studies/binning.ipynb (+1 -1)

@@ -758,7 +758,7 @@
 "id": "71c3cf64",
 "metadata": {},
 "source": [
-"Pretty good! And we can access the posterior mean estimates (stored as [xarray](http://xarray.pydata.org/en/stable/index.html) types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (`chain`), and one for the sample number (`draw`). We can calculate the overal posterior average with `.mean(dim=[\"draw\", \"chain\"])`."
+"Pretty good! And we can access the posterior mean estimates (stored as [xarray](http://xarray.pydata.org/en/stable/index.html) types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (`chain`), and one for the sample number (`draw`). We can calculate the overall posterior average with `.mean(dim=[\"draw\", \"chain\"])`."
 ]
 },
 {

examples/case_studies/binning.myst.md (+1 -1)

@@ -299,7 +299,7 @@ Recall that we used `mu = -2` and `sigma = 2` to generate the data.
 az.plot_posterior(trace1, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma]);
 ```

-Pretty good! And we can access the posterior mean estimates (stored as [xarray](http://xarray.pydata.org/en/stable/index.html) types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (`chain`), and one for the sample number (`draw`). We can calculate the overal posterior average with `.mean(dim=["draw", "chain"])`.
+Pretty good! And we can access the posterior mean estimates (stored as [xarray](http://xarray.pydata.org/en/stable/index.html) types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (`chain`), and one for the sample number (`draw`). We can calculate the overall posterior average with `.mean(dim=["draw", "chain"])`.

 ```{code-cell} ipython3
 :tags: []
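The `.mean(dim=["draw", "chain"])` call referenced in this hunk reduces over both sampling dimensions of the xarray posterior in one step. Below is a minimal sketch of that reduction using a small synthetic posterior in place of the notebook's `trace1`; the variable names are illustrative only.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the notebook's posterior: MCMC samples arrive
# back as a (chain, draw) matrix for each variable.
rng = np.random.default_rng(0)
posterior = xr.Dataset(
    {
        "mu": (("chain", "draw"), rng.normal(-2.0, 0.1, size=(4, 1000))),
        "sigma": (("chain", "draw"), rng.normal(2.0, 0.1, size=(4, 1000))),
    }
)

# Averaging over both the draw and chain dimensions collapses the 2D sample
# matrix into a single overall posterior mean per variable.
posterior_mean = posterior.mean(dim=["draw", "chain"])
print(float(posterior_mean["mu"]), float(posterior_mean["sigma"]))
```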

examples/case_studies/conditional-autoregressive-model.ipynb (+2 -2)

@@ -286,11 +286,11 @@
 " self.mode = 0.0\n",
 "\n",
 " def get_mu(self, x):\n",
-" def weigth_mu(w, a):\n",
+" def weight_mu(w, a):\n",
 " a1 = tt.cast(a, \"int32\")\n",
 " return tt.sum(w * x[a1]) / tt.sum(w)\n",
 "\n",
-" mu_w, _ = scan(fn=weigth_mu, sequences=[self.w, self.a])\n",
+" mu_w, _ = scan(fn=weight_mu, sequences=[self.w, self.a])\n",
 "\n",
 " return mu_w\n",
 "\n",

examples/case_studies/conditional-autoregressive-model.myst.md (+2 -2)

@@ -220,11 +220,11 @@ class CAR(distribution.Continuous):
 self.mode = 0.0

 def get_mu(self, x):
-def weigth_mu(w, a):
+def weight_mu(w, a):
 a1 = tt.cast(a, "int32")
 return tt.sum(w * x[a1]) / tt.sum(w)

-mu_w, _ = scan(fn=weigth_mu, sequences=[self.w, self.a])
+mu_w, _ = scan(fn=weight_mu, sequences=[self.w, self.a])

 return mu_w
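The `weight_mu` helper renamed in this hunk computes, for each area, the weighted average of its neighbours' values; the `scan` applies it row by row over the weight and adjacency arrays. A plain NumPy sketch of the same computation, using illustrative names and a ragged neighbour structure rather than the padded arrays the class actually scans over:

```python
import numpy as np


def weighted_neighbor_mean(x, w, a):
    """Illustrative NumPy analogue of the scanned weight_mu helper:
    for each area, the weighted mean of its neighbours' values.
    x: values for all areas; w: per-area neighbour weights;
    a: per-area neighbour indices (same ragged structure as w)."""
    return np.array(
        [np.sum(wi * x[np.asarray(ai, dtype=int)]) / np.sum(wi) for wi, ai in zip(w, a)]
    )


# Tiny example: area 0 neighbours areas 1 and 2, area 1 neighbours area 0, etc.
x = np.array([1.0, 2.0, 3.0])
w = [np.array([1.0, 1.0]), np.array([1.0]), np.array([1.0, 1.0])]
a = [np.array([1, 2]), np.array([0]), np.array([0, 1])]
print(weighted_neighbor_mean(x, w, a))  # [2.5 1.  1.5]
```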

examples/case_studies/hierarchical_partial_pooling.ipynb (+1 -1)

@@ -42,7 +42,7 @@
 "\n",
 "It may be possible to cluster groups of \"similar\" players, and estimate group averages, but using a hierarchical modeling approach is a natural way of sharing information that does not involve identifying *ad hoc* clusters.\n",
 "\n",
-"The idea of hierarchical partial pooling is to model the global performance, and use that estimate to parameterize a population of players that accounts for differences among the players' performances. This tradeoff between global and individual performance will be automatically tuned by the model. Also, uncertainty due to different number of at bats for each player (*i.e.* informatino) will be automatically accounted for, by shrinking those estimates closer to the global mean.\n",
+"The idea of hierarchical partial pooling is to model the global performance, and use that estimate to parameterize a population of players that accounts for differences among the players' performances. This tradeoff between global and individual performance will be automatically tuned by the model. Also, uncertainty due to different number of at bats for each player (*i.e.* information) will be automatically accounted for, by shrinking those estimates closer to the global mean.\n",
 "\n",
 "For far more in-depth discussion please refer to Stan [tutorial](http://mc-stan.org/documentation/case-studies/pool-binary-trials.html) {cite:p}`carpenter2016hierarchical` on the subject. The model and parameter values were taken from that example."
 ]

examples/case_studies/hierarchical_partial_pooling.myst.md (+1 -1)

@@ -45,7 +45,7 @@ Of course, neither approach is realistic. Clearly, all players aren't equally sk

 It may be possible to cluster groups of "similar" players, and estimate group averages, but using a hierarchical modeling approach is a natural way of sharing information that does not involve identifying *ad hoc* clusters.

-The idea of hierarchical partial pooling is to model the global performance, and use that estimate to parameterize a population of players that accounts for differences among the players' performances. This tradeoff between global and individual performance will be automatically tuned by the model. Also, uncertainty due to different number of at bats for each player (*i.e.* informatino) will be automatically accounted for, by shrinking those estimates closer to the global mean.
+The idea of hierarchical partial pooling is to model the global performance, and use that estimate to parameterize a population of players that accounts for differences among the players' performances. This tradeoff between global and individual performance will be automatically tuned by the model. Also, uncertainty due to different number of at bats for each player (*i.e.* information) will be automatically accounted for, by shrinking those estimates closer to the global mean.

 For far more in-depth discussion please refer to Stan [tutorial](http://mc-stan.org/documentation/case-studies/pool-binary-trials.html) {cite:p}`carpenter2016hierarchical` on the subject. The model and parameter values were taken from that example.

examples/case_studies/item_response_nba.ipynb (+2 -2)

@@ -248,7 +248,7 @@
 "output_type": "stream",
 "text": [
 "Number of observed plays: 46861\n",
-"Number of disadvanteged players: 770\n",
+"Number of disadvantaged players: 770\n",
 "Number of committing players: 789\n",
 "Global probability of a foul being called: 23.3%\n",
 "\n",

@@ -378,7 +378,7 @@
 "\n",
 "# Display of main dataframe with some statistics\n",
 "print(f\"Number of observed plays: {len(df)}\")\n",
-"print(f\"Number of disadvanteged players: {len(disadvantaged)}\")\n",
+"print(f\"Number of disadvantaged players: {len(disadvantaged)}\")\n",
 "print(f\"Number of committing players: {len(committing)}\")\n",
 "print(f\"Global probability of a foul being called: \" f\"{100*round(df.foul_called.mean(),3)}%\\n\\n\")\n",
 "df.head()"

examples/case_studies/item_response_nba.myst.md (+1 -1)

@@ -140,7 +140,7 @@ df.index.name = "play_id"

 # Display of main dataframe with some statistics
 print(f"Number of observed plays: {len(df)}")
-print(f"Number of disadvanteged players: {len(disadvantaged)}")
+print(f"Number of disadvantaged players: {len(disadvantaged)}")
 print(f"Number of committing players: {len(committing)}")
 print(f"Global probability of a foul being called: " f"{100*round(df.foul_called.mean(),3)}%\n\n")
 df.head()

examples/case_studies/multilevel_modeling.ipynb (+1 -1)

@@ -1026,7 +1026,7 @@
 "Neither of these models are satisfactory:\n",
 "\n",
 "* If we are trying to identify high-radon counties, pooling is useless -- because, by definition, the pooled model estimates radon at the state-level. In other words, pooling leads to maximal *underfitting*: the variation across counties is not taken into account and only the overall population is estimated.\n",
-"* We do not trust extreme unpooled estimates produced by models using few observations. This leads to maximal *overfitting*: only the within-county variations are taken into account and the overall population (i.e the state-level, which tells us about similarites across counties) is not estimated. \n",
+"* We do not trust extreme unpooled estimates produced by models using few observations. This leads to maximal *overfitting*: only the within-county variations are taken into account and the overall population (i.e the state-level, which tells us about similarities across counties) is not estimated. \n",
 "\n",
 "This issue is acute for small sample sizes, as seen above: in counties where we have few floor measurements, if radon levels are higher for those data points than for basement ones (Aitkin, Koochiching, Ramsey), the model will estimate that radon levels are higher in floors than basements for these counties. But we shouldn't trust this conclusion, because both scientific knowledge and the situation in other counties tell us that it is usually the reverse (basement radon > floor radon). So unless we have a lot of observations telling us otherwise for a given county, we should be skeptical and shrink our county-estimates to the state-estimates -- in other words, we should balance between cluster-level and population-level information, and the amount of shrinkage will depend on how extreme and how numerous the data in each cluster are. \n",
 "\n",

examples/case_studies/multilevel_modeling.myst.md (+1 -1)

@@ -341,7 +341,7 @@ for i, c in enumerate(sample_counties):
 Neither of these models are satisfactory:

 * If we are trying to identify high-radon counties, pooling is useless -- because, by definition, the pooled model estimates radon at the state-level. In other words, pooling leads to maximal *underfitting*: the variation across counties is not taken into account and only the overall population is estimated.
-* We do not trust extreme unpooled estimates produced by models using few observations. This leads to maximal *overfitting*: only the within-county variations are taken into account and the overall population (i.e the state-level, which tells us about similarites across counties) is not estimated.
+* We do not trust extreme unpooled estimates produced by models using few observations. This leads to maximal *overfitting*: only the within-county variations are taken into account and the overall population (i.e the state-level, which tells us about similarities across counties) is not estimated.

 This issue is acute for small sample sizes, as seen above: in counties where we have few floor measurements, if radon levels are higher for those data points than for basement ones (Aitkin, Koochiching, Ramsey), the model will estimate that radon levels are higher in floors than basements for these counties. But we shouldn't trust this conclusion, because both scientific knowledge and the situation in other counties tell us that it is usually the reverse (basement radon > floor radon). So unless we have a lot of observations telling us otherwise for a given county, we should be skeptical and shrink our county-estimates to the state-estimates -- in other words, we should balance between cluster-level and population-level information, and the amount of shrinkage will depend on how extreme and how numerous the data in each cluster are.

examples/case_studies/probabilistic_matrix_factorization.ipynb (+2 -2)

@@ -1058,7 +1058,7 @@
 "source": [
 "### Training Data vs. Test Data\n",
 "\n",
-"The next thing we need to do is split our data into a training set and a test set. Matrix factorization techniques use [transductive learning](http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29) rather than inductive learning. So we produce a test set by taking a random sample of the cells in the full $N \\times M$ data matrix. The values selected as test samples are replaced with `nan` values in a copy of the original data matrix to produce the training set. Since we'll be producing random splits, let's also write out the train/test sets generated. This will allow us to replicate our results. We'd like to be able to idenfity which split is which, so we'll take a hash of the indices selected for testing and use that to save the data."
+"The next thing we need to do is split our data into a training set and a test set. Matrix factorization techniques use [transductive learning](http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29) rather than inductive learning. So we produce a test set by taking a random sample of the cells in the full $N \\times M$ data matrix. The values selected as test samples are replaced with `nan` values in a copy of the original data matrix to produce the training set. Since we'll be producing random splits, let's also write out the train/test sets generated. This will allow us to replicate our results. We'd like to be able to identify which split is which, so we'll take a hash of the indices selected for testing and use that to save the data."
 ]
 },
 {

@@ -1512,7 +1512,7 @@
 "source": [
 "It appears we get convergence of $U$ and $V$ after about the default tuning. When testing for convergence, we also want to see convergence of the particular statistics we are looking for, since different characteristics of the posterior may converge at different rates. Let's also do a traceplot of the RSME. We'll compute RMSE for both the train and the test set, even though the convergence is indicated by RMSE on the training set alone. In addition, let's compute a running RMSE on the train/test sets to see how aggregate performance improves or decreases as we continue to sample.\n",
 "\n",
-"Notice here that we are sampling from 1 chain only, which makes the convergence statisitcs like $\\hat{R}$ impossible (we can still compute the split-rhat but the purpose is different). The reason of not sampling multiple chain is that PMF might not have unique solution. Thus without constraints, the solutions are at best symmetrical, at worse identical under any rotation, in any case subject to label switching. In fact if we sample from multiple chains we will see large $\\hat{R}$ indicating the sampler is exploring different solutions in different part of parameter space."
+"Notice here that we are sampling from 1 chain only, which makes the convergence statistics like $\\hat{R}$ impossible (we can still compute the split-rhat but the purpose is different). The reason of not sampling multiple chain is that PMF might not have unique solution. Thus without constraints, the solutions are at best symmetrical, at worse identical under any rotation, in any case subject to label switching. In fact if we sample from multiple chains we will see large $\\hat{R}$ indicating the sampler is exploring different solutions in different part of parameter space."
 ]
 },
 {

examples/case_studies/probabilistic_matrix_factorization.myst.md (+2 -2)

@@ -486,7 +486,7 @@ def rmse(test_data, predicted):

 ### Training Data vs. Test Data

-The next thing we need to do is split our data into a training set and a test set. Matrix factorization techniques use [transductive learning](http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29) rather than inductive learning. So we produce a test set by taking a random sample of the cells in the full $N \times M$ data matrix. The values selected as test samples are replaced with `nan` values in a copy of the original data matrix to produce the training set. Since we'll be producing random splits, let's also write out the train/test sets generated. This will allow us to replicate our results. We'd like to be able to idenfity which split is which, so we'll take a hash of the indices selected for testing and use that to save the data.
+The next thing we need to do is split our data into a training set and a test set. Matrix factorization techniques use [transductive learning](http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29) rather than inductive learning. So we produce a test set by taking a random sample of the cells in the full $N \times M$ data matrix. The values selected as test samples are replaced with `nan` values in a copy of the original data matrix to produce the training set. Since we'll be producing random splits, let's also write out the train/test sets generated. This will allow us to replicate our results. We'd like to be able to identify which split is which, so we'll take a hash of the indices selected for testing and use that to save the data.

 ```{code-cell} ipython3
 # Define a function for splitting train/test data.

@@ -668,7 +668,7 @@ pmf.traceplot()

 It appears we get convergence of $U$ and $V$ after about the default tuning. When testing for convergence, we also want to see convergence of the particular statistics we are looking for, since different characteristics of the posterior may converge at different rates. Let's also do a traceplot of the RSME. We'll compute RMSE for both the train and the test set, even though the convergence is indicated by RMSE on the training set alone. In addition, let's compute a running RMSE on the train/test sets to see how aggregate performance improves or decreases as we continue to sample.

-Notice here that we are sampling from 1 chain only, which makes the convergence statisitcs like $\hat{R}$ impossible (we can still compute the split-rhat but the purpose is different). The reason of not sampling multiple chain is that PMF might not have unique solution. Thus without constraints, the solutions are at best symmetrical, at worse identical under any rotation, in any case subject to label switching. In fact if we sample from multiple chains we will see large $\hat{R}$ indicating the sampler is exploring different solutions in different part of parameter space.
+Notice here that we are sampling from 1 chain only, which makes the convergence statistics like $\hat{R}$ impossible (we can still compute the split-rhat but the purpose is different). The reason of not sampling multiple chain is that PMF might not have unique solution. Thus without constraints, the solutions are at best symmetrical, at worse identical under any rotation, in any case subject to label switching. In fact if we sample from multiple chains we will see large $\hat{R}$ indicating the sampler is exploring different solutions in different part of parameter space.

 ```{code-cell} ipython3
 def _running_rmse(pmf_model, test_data, train_data, plot=True):
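The split described in the first hunk above keeps the full $N \times M$ matrix shape: a random sample of observed cells is copied into a test matrix, those same cells are blanked out with `nan` in the training copy, and a hash of the test indices identifies the split. A minimal sketch under those assumptions; this is a hypothetical helper, not the notebook's own splitting function.

```python
import hashlib

import numpy as np


def split_train_test_cells(data, test_frac=0.1, seed=None):
    """Hypothetical helper sketching the split described above: hold out a
    random sample of observed cells, blank them out in the training copy,
    and hash the held-out indices so the split can be identified later."""
    rng = np.random.default_rng(seed)
    train = data.astype(float)                        # float copy so nan can be stored
    observed = np.argwhere(~np.isnan(train))          # only observed cells are eligible
    n_test = int(len(observed) * test_frac)
    chosen = observed[rng.choice(len(observed), size=n_test, replace=False)]

    test = np.full_like(train, np.nan)
    for i, j in chosen:
        test[i, j] = train[i, j]                      # held-out cells keep their values...
        train[i, j] = np.nan                          # ...and become missing in the training set
    split_id = hashlib.sha1(chosen.tobytes()).hexdigest()[:8]
    return train, test, split_id
```

The returned `split_id` could then go into the file names used to write out the train/test pair, which is what lets a particular random split be identified and replicated, as the text describes.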

examples/gaussian_processes/MOGP-Coregion-Hadamard.ipynb (+3 -3)

@@ -178,13 +178,13 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"There are 262 pichers, in 182 game dates\n"
+"There are 262 pitchers, in 182 game dates\n"
 ]
 }
 ],
 "source": [
 "print(\n",
-" f\"There are {df['pitcher_name'].nunique()} pichers, in {df['game_date'].nunique()} game dates\"\n",
+" f\"There are {df['pitcher_name'].nunique()} pitchers, in {df['game_date'].nunique()} game dates\"\n",
 ")"
 ]
 },

@@ -699,7 +699,7 @@
 "adf = adf.sort_values([\"output_idx\", \"x\"])\n",
 "X = adf[\n",
 " [\"x\", \"output_idx\"]\n",
-"].values # Input data includes the index of game dates, and the index of picthers\n",
+"].values # Input data includes the index of game dates, and the index of pitchers\n",
 "Y = adf[\"avg_spin_rate\"].values # Output data includes the average spin rate of pitchers\n",
 "X.shape, Y.shape"
