diff --git a/examples/generalized_linear_models/GLM-hierarchical.ipynb b/examples/generalized_linear_models/GLM-hierarchical.ipynb index 8e658b243..960eb32d1 100644 --- a/examples/generalized_linear_models/GLM-hierarchical.ipynb +++ b/examples/generalized_linear_models/GLM-hierarchical.ipynb @@ -4,7 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# GLM: Hierarchical Linear Regression" + "(notebook_name)=\n", + "# GLM: Hierarchical Linear Regression\n", + "\n", + ":::{post} May 20, 2022\n", + ":tags: generalized linear model, hierarchical model\n", + ":category: intermediate\n", + ":author: Danne Elbers, Thomas Wiecki, Chris Fonnesbeck\n", + ":::" ] }, { @@ -13,16 +20,14 @@ "source": [ "(c) 2016 by Danne Elbers, Thomas Wiecki\n", "\n", - "> This tutorial is adapted from a [blog post by Danne Elbers and Thomas Wiecki called \"The Best Of Both Worlds: Hierarchical Linear Regression in PyMC\"](http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/).\n", - "\n", - "Today's blog post is co-written by [Danne Elbers](http://www.linkedin.com/pub/danne-elbers/69/3a2/7ba) who is doing her masters thesis with me on computational psychiatry using Bayesian modeling. This post also borrows heavily from a [Notebook](http://nbviewer.ipython.org/github/fonnesbeck/multilevel_modeling/blob/master/multilevel_modeling.ipynb?create=1) by [Chris Fonnesbeck](http://biostat.mc.vanderbilt.edu/wiki/Main/ChrisFonnesbeck).\n", + "This tutorial is adapted from a blog post by [Danne Elbers](http://www.linkedin.com/pub/danne-elbers/69/3a2/7ba) and Thomas Wiecki called [\"The Best Of Both Worlds: Hierarchical Linear Regression in PyMC\"](http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/). It also borrows heavily from a notebook {ref}`multilevel_modeling` by [Chris Fonnesbeck](http://biostat.mc.vanderbilt.edu/wiki/Main/ChrisFonnesbeck).\n", "\n", - "The power of Bayesian modelling really clicked for me when I was first introduced to hierarchical modelling. 
In this blog post we will:\n", + "In this tutorial we:\n", " \n", - " * provide and intuitive explanation of hierarchical/multi-level Bayesian modeling;\n", + " * provide an intuitive explanation of hierarchical/multi-level Bayesian modeling;\n", " * show how this type of model can easily be built and estimated in [PyMC](https://github.com/pymc-devs/pymc);\n", - " * demonstrate the advantage of using hierarchical Bayesian modelling, as opposed to non-hierarchical Bayesian modelling by comparing the two\n", - " * visualize the \"shrinkage effect\" (explained below)\n", + " * demonstrate the advantage of using hierarchical Bayesian modelling, as opposed to non-hierarchical Bayesian modelling, by comparing the two;\n", + " * visualize the \"shrinkage effect\" (explained below);\n", " * highlight connections to the frequentist version of this model.\n", "\n", "Having multiple sets of related measurements comes up all the time. In mathematical psychology, for example, you test multiple subjects on the same task. We then want to estimate a computational/mathematical model that describes the behavior on the task by a set of parameters. We could thus fit a model to each subject individually, assuming they share no similarities; or, pool all the data and estimate one model assuming all subjects are identical. Hierarchical modeling allows the best of both worlds by modeling subjects' similarities but also allowing estimation of individual parameters. As an aside, software from our lab, [HDDM](http://ski.cog.brown.edu/hddm_docs/), allows hierarchical Bayesian estimation of a widely used decision making model in psychology. In this blog post, however, we will use a more classical example of [hierarchical linear regression](http://en.wikipedia.org/wiki/Hierarchical_linear_modeling) to predict radon levels in houses.\n", @@ -34,10 +39,11 @@ "\n", "## The Dataset\n", "\n", - "Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. 
In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the highest cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement and to differ in amount present among types of soil.\n", "Here we'll investigate this differences and try to make predictions of radonlevels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different measurements are taken, ranging from 2 to 116 measurements per county. \n", + "Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the highest cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement and to differ in amount present among types of soil. Here we'll investigate these differences and try to make predictions of radon levels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different measurements are taken, ranging from 2 to 116 measurements per county. \n", "\n", - ""
<xarray.Dataset>\n", - "Dimensions:     (chain_draw: 8000, county: 85)\n", + "Dimensions:     (chain_draw: 4000, county: 85)\n", "Coordinates:\n", "  * county      (county) <U17 'AITKIN' 'ANOKA' ... 'WRIGHT' 'YELLOW MEDICINE'\n", "  * chain_draw  (chain_draw) MultiIndex\n", - "  - chain       (chain_draw) int64 0 0 0 0 0 0 0 0 0 0 0 ... 3 3 3 3 3 3 3 3 3 3\n", + "  - chain       (chain_draw) int64 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1\n", "  - draw        (chain_draw) int64 0 1 2 3 4 5 ... 1994 1995 1996 1997 1998 1999\n", "Data variables:\n", - "    mu_a        (chain_draw) float64 1.403 1.438 1.447 ... 1.531 1.556 1.548\n", - "    sigma_a     (chain_draw) float64 0.3293 0.2812 0.2438 ... 0.3564 0.3729\n", - "    mu_b        (chain_draw) float64 -0.74 -0.6791 -0.5542 ... -0.7281 -0.7413\n", - "    sigma_b     (chain_draw) float64 0.2061 0.2563 0.3042 ... 0.1551 0.1586\n", - "    a           (county, chain_draw) float64 1.406 1.61 1.202 ... 1.33 1.421\n", - "    b           (county, chain_draw) float64 -0.6981 -1.15 ... -0.9213 -0.8833\n", - "    eps         (chain_draw) float64 0.7185 0.739 0.7169 ... 0.744 0.696 0.6974\n", + "    mu_a        (chain_draw) float64 1.547 1.453 1.466 ... 1.508 1.492 1.504\n", + "    mu_b        (chain_draw) float64 -0.6138 -0.5662 -0.6135 ... -0.6099 -0.5082\n", + "    a           (county, chain_draw) float64 0.7825 1.473 1.276 ... 1.065 1.391\n", + "    b           (county, chain_draw) float64 -0.3848 -0.2479 ... 0.04303 -1.376\n", + "    sigma_a     (chain_draw) float64 0.3429 0.2987 0.3025 ... 0.3968 0.4025\n", + "    sigma_b     (chain_draw) float64 0.4143 0.3577 0.4727 ... 0.2915 0.3387\n", + "    eps         (chain_draw) float64 0.7549 0.7269 0.7278 ... 0.6997 0.7104\n", "Attributes:\n", - "    created_at:                 2022-01-09T13:50:31.244473\n", - "    arviz_version:              0.11.4\n", + "    created_at:                 2022-09-30T17:12:00.070351\n", + "    arviz_version:              0.12.1\n", "    inference_library:          pymc\n", - "    inference_library_version:  4.0.0b1\n", - "    sampling_time:              62.45755052566528\n", + "    inference_library_version:  4.1.2\n", + "    sampling_time:              94.50930380821228\n", "    tuning_steps:               2000
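The `chain_draw` MultiIndex in the posterior repr above is what results from stacking the separate `chain` and `draw` dimensions into one (e.g. with xarray's `Dataset.stack`); that is why going from 4 chains to 2 chains of 2000 draws shrinks `chain_draw` from 8000 to 4000 in this diff. A minimal NumPy sketch of that flattening — `stack_chain_draw` is a hypothetical helper written for illustration, not a function from the notebook:

```python
import numpy as np

def stack_chain_draw(samples):
    """Flatten samples shaped (chain, draw, ...) into (chain_draw, ...),
    returning the chain and draw index of every flattened sample."""
    n_chain, n_draw = samples.shape[:2]
    chain_idx = np.repeat(np.arange(n_chain), n_draw)  # 0 0 0 ... 1 1 1 ... as in the repr
    draw_idx = np.tile(np.arange(n_draw), n_chain)     # 0 1 2 ... 1999 0 1 2 ...
    flat = samples.reshape(n_chain * n_draw, *samples.shape[2:])
    return flat, chain_idx, draw_idx

# The old trace in the diff: 4 chains x 2000 draws -> chain_draw: 8000.
flat, chain_idx, draw_idx = stack_chain_draw(np.zeros((4, 2000)))
print(flat.shape)   # (8000,)

# The new trace: 2 chains x 2000 draws -> chain_draw: 4000.
flat2, _, _ = stack_chain_draw(np.zeros((2, 2000)))
print(flat2.shape)  # (4000,)
```

The `(chain_idx, draw_idx)` pair plays the role of the MultiIndex coordinates shown in the Dataset repr.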
<xarray.Dataset>\n", "Dimensions:  (obs_id: 919)\n", "Coordinates:\n", - "  * obs_id   (obs_id) int32 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918\n", + "  * obs_id   (obs_id) int64 0 1 2 3 4 5 6 7 ... 911 912 913 914 915 916 917 918\n", "    floor    (obs_id) float64 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0\n", "Data variables:\n", "    y        (obs_id) float64 0.8329 0.8329 1.099 0.09531 ... 1.629 1.335 1.099\n", "Attributes:\n", - "    created_at:                 2022-01-09T13:49:18.417187\n", - "    arviz_version:              0.11.4\n", + "    created_at:                 2022-09-30T17:10:20.627345\n", + "    arviz_version:              0.12.1\n", "    inference_library:          pymc\n", - "    inference_library_version:  4.0.0b1\n", + "    inference_library_version:  4.1.2
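Not shown in this hunk: the constant data above (`floor` as a 0/1 indicator for ground floor vs. basement, `y` as log-radon) feed the varying-intercept, varying-slope regression whose mean function is log(radon) = a_county + b_county · floor. A small sketch of that prediction step; the parameter values below are invented for illustration, not posterior means from the notebook:

```python
# Hypothetical per-county posterior means (illustrative values only).
a = {"AITKIN": 1.1, "ANOKA": 0.9}    # intercept: expected log-radon in the basement (floor=0)
b = {"AITKIN": -0.6, "ANOKA": -0.5}  # slope: shift when measuring on the ground floor (floor=1)

def predict_log_radon(county: str, floor: int) -> float:
    """Mean prediction of the hierarchical GLM: log(radon) = a_county + b_county * floor."""
    return a[county] + b[county] * floor

basement = predict_log_radon("AITKIN", 0)
ground = predict_log_radon("AITKIN", 1)
print(basement, ground)  # with a negative slope, ground-floor readings are predicted lower
```

In the hierarchical model each county's `a` and `b` are drawn from shared group-level distributions (`mu_a`/`sigma_a`, `mu_b`/`sigma_b` in the posterior above), which is what pools information across counties.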