
Commit c7a349e

Authored by OriolAbril, conorhassan, kuvychko, and reshamas
Update GLM-robust notebook (#472)
* Update GLM-robust.ipynb
* Update myst_nb and suggest changes to nb
* Update GLM-robust.ipynb
* updated pair
* Update notebook references (Co-authored-by: Oriol Abril-Pla <[email protected]>)
* updated GLM-robust.ipynb
* Combine all PRs into one (Co-authored-by: Reshama Shaikh <[email protected]>)
* remove from pre-commit exceptions
* fix pre-commit

Co-authored-by: conorhassan <[email protected]>
Co-authored-by: kuvychko <[email protected]>
Co-authored-by: Reshama Shaikh <[email protected]>
1 parent 6623c92 commit c7a349e

3 files changed (+357 -192 lines)

.pre-commit-config.yaml

-1
@@ -77,7 +77,6 @@ repos:
           examples/generalized_linear_models/GLM-logistic.ipynb|
           examples/generalized_linear_models/GLM-out-of-sample-predictions.ipynb|
           examples/generalized_linear_models/GLM-poisson-regression.ipynb|
-          examples/generalized_linear_models/GLM-robust.ipynb|
           examples/generalized_linear_models/GLM-rolling-regression.ipynb|
           examples/generalized_linear_models/GLM-simpsons-paradox.ipynb|
           examples/howto/api_quickstart.ipynb|

examples/generalized_linear_models/GLM-robust.ipynb

+249 -148
Large diffs are not rendered by default.

myst_nbs/generalized_linear_models/GLM-robust.myst.md

+108 -43
@@ -6,47 +6,56 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.13.7
 kernelspec:
-  display_name: Python 3
+  display_name: Python 3 (ipykernel)
   language: python
   name: python3
 ---
 
+(GLM-robust)=
 # GLM: Robust Linear Regression
 
-This tutorial first appeard as a post in small series on Bayesian GLMs on:
+:::{post} August, 2013
+:tags: regression, linear model, robust
+:category: beginner
+:author: Thomas Wiecki, Chris Fonnesbeck, Abhipsha Das, Conor Hassan, Igor Kuvychko, Reshama Shaikh, Oriol Abril Pla
+:::
 
-1. [The Inference Button: Bayesian GLMs made easy with PyMC3](http://twiecki.github.com/blog/2013/08/12/bayesian-glms-1/)
-2. [This world is far from Normal(ly distributed): Robust Regression in PyMC3](http://twiecki.github.io/blog/2013/08/27/bayesian-glms-2/)
-3. [The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3](http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)
++++
+
+# GLM: Robust Linear Regression
+
+This tutorial is the second of a three-part series on Bayesian *generalized linear models (GLMs)* that first appeared on [Thomas Wiecki's blog](https://twiecki.io/):
+
+1. {ref}`Linear Regression <pymc:GLM_linear>`
+2. {ref}`Robust Linear Regression <GLM-robust>`
+3. {ref}`Hierarchical Linear Regression <GLM-hierarchical>`
 
 In this blog post I will write about:
 
 - How a few outliers can largely affect the fit of linear regression models.
 - How replacing the normal likelihood with Student T distribution produces robust regression.
-- How this can easily be done with the `Bambi` library by passing a `family` object or passing the family name as an argument.
-
-This is the second part of a series on Bayesian GLMs (click [here for part I about linear regression](http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/)). In this prior post I described how minimizing the squared distance of the regression line is the same as maximizing the likelihood of a Normal distribution with the mean coming from the regression line. This latter probabilistic expression allows us to easily formulate a Bayesian linear regression model.
+
+In the {ref}`linear regression tutorial <pymc:GLM_linear>` I described how minimizing the squared distance of the regression line is the same as maximizing the likelihood of a Normal distribution with the mean coming from the regression line. This latter probabilistic expression allows us to easily formulate a Bayesian linear regression model.
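To make the equivalence above concrete: with independent observations and a Normal likelihood centered on the regression line (writing $\alpha$ and $\beta$ for the intercept and slope, matching the `intercept` and `slope` parameters used later in the notebook), the log-likelihood is, up to constants, the negative sum of squared residuals, so maximizing it over $\alpha$ and $\beta$ is exactly ordinary least squares:

$$
y_i \sim \mathcal{N}(\alpha + \beta x_i,\ \sigma^2)
\quad\Longrightarrow\quad
\log p(\mathbf{y} \mid \alpha, \beta, \sigma)
= -\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \alpha - \beta x_i\right)^2
- \frac{n}{2}\log\!\left(2\pi\sigma^2\right).
$$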
 
 This worked splendidly on simulated data. The problem with simulated data though is that it's, well, simulated. In the real world things tend to get more messy and assumptions like normality are easily violated by a few outliers.
 
 Lets see what happens if we add some outliers to our simulated data from the last post.
 
 +++
 
-Again, import our modules.
+First, let's import our modules.
 
 ```{code-cell} ipython3
 %matplotlib inline
 
+import aesara
+import aesara.tensor as at
 import arviz as az
-import bambi as bmb
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
-import pymc3 as pm
-import theano
-
-print(f"Running on pymc3 v{pm.__version__}")
+import pymc as pm
+import xarray as xr
 ```
 
 ```{code-cell} ipython3
@@ -79,7 +88,7 @@ data = pd.DataFrame(dict(x=x_out, y=y_out))
 Plot the data together with the true regression line (the three points in the upper left corner are the outliers we added).
 
 ```{code-cell} ipython3
-fig = plt.figure(figsize=(7, 7))
+fig = plt.figure(figsize=(7, 5))
 ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
 ax.plot(x_out, y_out, "x", label="sampled data")
 ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
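The cell that generates `x_out` and `y_out` lies outside the hunks shown in this diff. For readers following along, here is a minimal sketch of such a data-generating step; the true parameter values, sample size, noise scale, and outlier coordinates are illustrative assumptions, and only the variable names (`x`, `true_regression_line`, `x_out`, `y_out`, `data`) are taken from the surrounding code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# assumed true parameters and sample size (not taken from the notebook)
size = 100
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
true_regression_line = true_intercept + true_slope * x
y = true_regression_line + rng.normal(scale=0.5, size=size)

# append a few outliers in the upper-left corner of the plot (coordinates are assumptions)
x_out = np.append(x, [0.1, 0.15, 0.2])
y_out = np.append(y, [8, 6, 9])

data = pd.DataFrame(dict(x=x_out, y=y_out))
```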
@@ -88,60 +97,95 @@ plt.legend(loc=0);
 
 ## Robust Regression
 
+### Normal Likelihood
 
-Lets see what happens if we estimate our Bayesian linear regression model using the `bambi`. This function takes a [`formulae`](https://bambinos.github.io/formulae/api_reference.html) string to describe the linear model and adds a Normal likelihood by default.
+Let's see what happens if we estimate our Bayesian linear regression model by fitting a regression model with a normal likelihood.
+Note that the Bambi library provides an easy-to-use interface, so an equivalent model can be built in one line of code.
+A version of this same notebook using Bambi is available at {doc}`bambi's docs <bambi:notebooks/t_regression>`.
 
 ```{code-cell} ipython3
-model = bmb.Model("y ~ x", data)
-trace = model.fit(draws=2000, cores=2)
+with pm.Model() as model:
+    xdata = pm.ConstantData("x", x_out, dims="obs_id")
+
+    # define priors
+    intercept = pm.Normal("intercept", mu=0, sigma=1)
+    slope = pm.Normal("slope", mu=0, sigma=1)
+    sigma = pm.HalfCauchy("sigma", beta=10)
+
+    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
+
+    # define likelihood
+    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y_out, dims="obs_id")
+
+    # inference
+    trace = pm.sample(tune=2000)
 ```
 
-To evaluate the fit, I am plotting the posterior predictive regression lines by taking regression parameters from the posterior distribution and plotting a regression line for each (this is all done inside of `plot_posterior_predictive()`).
+To evaluate the fit, the code below takes 20 draws of the regression parameters from the posterior distribution and plots the corresponding posterior predictive regression lines.
 
 ```{code-cell} ipython3
-plt.figure(figsize=(7, 5))
-plt.plot(x_out, y_out, "x", label="data")
-pm.plot_posterior_predictive_glm(trace, samples=100, label="posterior predictive regression lines")
-plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")
-
-plt.legend(loc=0);
+post = az.extract(trace, num_samples=20)
+x_plot = xr.DataArray(np.linspace(x_out.min(), x_out.max(), 100), dims="plot_id")
+lines = post["intercept"] + post["slope"] * x_plot
+
+plt.scatter(x_out, y_out, label="data")
+plt.plot(x_plot, lines.transpose(), alpha=0.4, color="C1")
+plt.plot(x, true_regression_line, label="True regression line", lw=3.0, c="C2")
+plt.legend(loc=0)
+plt.title("Posterior predictive for normal likelihood");
 ```
 
 As you can see, the fit is quite skewed and we have a fair amount of uncertainty in our estimate as indicated by the wide range of different posterior predictive regression lines. Why is this? The reason is that the normal distribution does not have a lot of mass in the tails and consequently, an outlier will affect the fit strongly.
 
 A Frequentist would estimate a [Robust Regression](http://en.wikipedia.org/wiki/Robust_regression) and use a non-quadratic distance measure to evaluate the fit.
 
-But what's a Bayesian to do? Since the problem is the light tails of the Normal distribution we can instead assume that our data is not normally distributed but instead distributed according to the [Student T distribution](http://en.wikipedia.org/wiki/Student%27s_t-distribution) which has heavier tails as shown next (I read about this trick in ["The Kruschke"](http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/), aka the puppy-book; but I think [Gelman](http://www.stat.columbia.edu/~gelman/book/) was the first to formulate this).
+But what's a Bayesian to do? Since the problem is the light tails of the Normal distribution we can instead assume that our data is not normally distributed but instead distributed according to the [Student T distribution](http://en.wikipedia.org/wiki/Student%27s_t-distribution) which has heavier tails as shown next {cite:p}`gelman2013bayesian,kruschke2014doing`.
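The "heavier tails" claim can be made precise by comparing the two standardized density kernels: the Normal kernel decays exponentially in $x^2$, while the Student-T kernel with $\nu$ degrees of freedom decays only polynomially (roughly like $|x|^{-(\nu+1)}$), so extreme observations keep non-negligible probability; as $\nu \to \infty$ the Student-T converges back to the Normal:

$$
p_{\mathcal{N}}(x) \propto \exp\!\left(-\frac{x^2}{2}\right),
\qquad
p_{t_\nu}(x) \propto \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}.
$$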
 
 Lets look at those two distributions to get a feel for them.
 
 ```{code-cell} ipython3
 normal_dist = pm.Normal.dist(mu=0, sigma=1)
 t_dist = pm.StudentT.dist(mu=0, lam=1, nu=1)
 x_eval = np.linspace(-8, 8, 300)
-plt.plot(x_eval, theano.tensor.exp(normal_dist.logp(x_eval)).eval(), label="Normal", lw=2.0)
-plt.plot(x_eval, theano.tensor.exp(t_dist.logp(x_eval)).eval(), label="Student T", lw=2.0)
+plt.plot(x_eval, pm.math.exp(pm.logp(normal_dist, x_eval)).eval(), label="Normal", lw=2.0)
+plt.plot(x_eval, pm.math.exp(pm.logp(t_dist, x_eval)).eval(), label="Student T", lw=2.0)
 plt.xlabel("x")
 plt.ylabel("Probability density")
 plt.legend();
 ```
 
 As you can see, the probability of values far away from the mean (0 in this case) are much more likely under the `T` distribution than under the Normal distribution.
 
-To define the usage of a T distribution in `Bambi` we can pass the distribution name to the `family` argument -- `t` -- that specifies that our data is Student T-distributed. Note that this is the same syntax as `R` and `statsmodels` use.
+Below is a PyMC model with the `likelihood` term following a `StudentT` distribution with $\nu=3$ degrees of freedom, as opposed to a `Normal` distribution.
 
 ```{code-cell} ipython3
-model_robust = bmb.Model("y ~ x", data, family="t")
-model_robust.set_priors({"nu": bmb.Prior("Gamma", alpha=3, beta=1)})
-trace_robust = model_robust.fit(draws=2000, cores=2)
+with pm.Model() as robust_model:
+    xdata = pm.ConstantData("x", x_out, dims="obs_id")
+
+    # define priors
+    intercept = pm.Normal("intercept", mu=0, sigma=1)
+    slope = pm.Normal("slope", mu=0, sigma=1)
+    sigma = pm.HalfCauchy("sigma", beta=10)
+
+    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
+
+    # define likelihood
+    likelihood = pm.StudentT("y", mu=mu, sigma=sigma, nu=3, observed=y_out, dims="obs_id")
+
+    # inference
+    robust_trace = pm.sample(tune=4000)
 ```
 
 ```{code-cell} ipython3
-plt.figure(figsize=(7, 5))
-plt.plot(x_out, y_out, "x")
-pm.plot_posterior_predictive_glm(trace_robust, label="posterior predictive regression lines")
-plt.plot(x, true_regression_line, label="true regression line", lw=3.0, c="y")
-plt.legend();
+robust_post = az.extract(robust_trace, num_samples=20)
+x_plot = xr.DataArray(np.linspace(x_out.min(), x_out.max(), 100), dims="plot_id")
+robust_lines = robust_post["intercept"] + robust_post["slope"] * x_plot
+
+plt.scatter(x_out, y_out, label="data")
+plt.plot(x_plot, robust_lines.transpose(), alpha=0.4, color="C1")
+plt.plot(x, true_regression_line, label="True regression line", lw=3.0, c="C2")
+plt.legend(loc=0)
+plt.title("Posterior predictive for Student-T likelihood")
 ```
 
 There, much better! The outliers are barely influencing our estimation at all because our likelihood function assumes that outliers are much more probable than under the Normal distribution.
@@ -150,20 +194,41 @@ There, much better! The outliers are barely influencing our estimation at all be
 
 ## Summary
 
-- `Bambi` allows you to pass in a `family` argument that contains information about the likelihood.
 - By changing the likelihood from a Normal distribution to a Student T distribution -- which has more mass in the tails -- we can perform *Robust Regression*.
 
-The next post will be about logistic regression in PyMC3 and what the posterior and oatmeal have in common.
-
 *Extensions*:
 
 - The Student-T distribution has, besides the mean and variance, a third parameter called *degrees of freedom* that describes how much mass should be put into the tails. Here it is set to 1 which gives maximum mass to the tails (setting this to infinity results in a Normal distribution!). One could easily place a prior on this rather than fixing it which I leave as an exercise for the reader ;).
-- T distributions can be used as priors as well. I will show this in a future post on hierarchical GLMs.
-- How do we test if our data is normal or violates that assumption in an important way? Check out this [great blog post](http://allendowney.blogspot.com/2013/08/are-my-data-normal.html) by Allen Downey.
+- T distributions can be used as priors as well. See {ref}`GLM-hierarchical`.
+- How do we test if our data is normal or violates that assumption in an important way? Check out this [great blog post](http://allendowney.blogspot.com/2013/08/are-my-data-normal.html) by Allen Downey.
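The first extension (placing a prior on the degrees of freedom rather than fixing it) is left as an exercise in the notebook; a minimal sketch of one way to do it with the same PyMC model structure is shown below. The model name `robust_model_nu` and the `Gamma(alpha=2, beta=0.1)` hyperparameters are illustrative assumptions, not part of the notebook.

```python
with pm.Model() as robust_model_nu:
    xdata = pm.ConstantData("x", x_out, dims="obs_id")

    # same priors as the robust model above
    intercept = pm.Normal("intercept", mu=0, sigma=1)
    slope = pm.Normal("slope", mu=0, sigma=1)
    sigma = pm.HalfCauchy("sigma", beta=10)

    # prior on the degrees of freedom instead of fixing nu=3 (hyperparameters are assumptions)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)

    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
    pm.StudentT("y", mu=mu, sigma=sigma, nu=nu, observed=y_out, dims="obs_id")

    robust_nu_trace = pm.sample(tune=4000)
```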
+
++++
+
+## Authors
 
-Author: [Thomas Wiecki](https://twitter.com/twiecki)
+* Adapted from [Thomas Wiecki's](https://twitter.com/twiecki) blog
+* Updated by @fonnesbeck in September 2016 (pymc#1378)
+* Updated by @chiral-carbon in August 2021 (pymc-examples#205)
+* Updated by Conor Hassan, Igor Kuvychko, Reshama Shaikh and [Oriol Abril Pla](https://oriolabrilpla.cat/en/) in 2022
+
++++
+
+## References
+
+:::{bibliography}
+:filter: docname in docnames
+:::
+
+## Watermark
 
 ```{code-cell} ipython3
 %load_ext watermark
 %watermark -n -u -v -iv -w -p xarray
 ```
+
+:::{include} ../page_footer.md
+:::
+
+```{code-cell} ipython3
+
+```
