Skip to content

Commit 75e0b9a

Browse files
Benjamin T. Vincentdrbenvincent
Benjamin T. Vincent
andauthored
update GLM out of sample prediction notebook to v4 (#370)
* create truncated regression example * delete truncated regression example from main branch * create truncated regression example * delete truncated regression example from main branch * create truncated regression example * delete truncated regression example from main branch * fix incorrect statement about pm.NormalMixture * update to v4 * remove a commented out line of code + add myst file * spelling * revisions based on feedback * fix: outputs disappeared for some reason * updates based on feedback * revert to using labels for coords Co-authored-by: Benjamin T. Vincent <[email protected]>
1 parent 23393e6 commit 75e0b9a

File tree

2 files changed

+613
-115686
lines changed

2 files changed

+613
-115686
lines changed

examples/generalized_linear_models/GLM-out-of-sample-predictions.ipynb

+480-115,558
Large diffs are not rendered by default.

myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md

+133-128
Original file line numberDiff line numberDiff line change
@@ -6,87 +6,70 @@ jupytext:
66
format_version: 0.13
77
jupytext_version: 1.13.7
88
kernelspec:
9-
display_name: 'Python 3.7.6 64-bit (''website_projects'': conda)'
10-
metadata:
11-
interpreter:
12-
hash: fbddea5140024843998ae64bf59a7579a9660d103062604797ea5984366c686c
9+
display_name: Python 3.9.12 ('pymc-dev-py39')
10+
language: python
1311
name: python3
1412
---
1513

16-
# GLM in PyMC3: Out-Of-Sample Predictions
14+
(GLM-out-of-sample-predictions)=
15+
# Out-Of-Sample Predictions
1716

18-
In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in the model definition using [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input!
19-
20-
**Resources**
21-
22-
23-
- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)
24-
25-
- In particular check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)
26-
27-
- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
28-
29-
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
30-
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)
17+
:::{post} June, 2022
18+
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
19+
:category: beginner
20+
:::
3121

3222
+++
3323

34-
## Prepare Notebook
24+
For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input! But note that this GLM sub-module was deprecated in favour of [`bambi`](https://github.com/bambinos/bambi). But this notebook implements a 'raw' PyMC model.
3525

3626
```{code-cell} ipython3
27+
import arviz as az
3728
import matplotlib.pyplot as plt
3829
import numpy as np
3930
import pandas as pd
40-
import seaborn as sns
41-
42-
sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"})
43-
sns.set_palette(palette="deep")
44-
sns_c = sns.color_palette(palette="deep")
45-
46-
import arviz as az
4731
import patsy
48-
import pymc3 as pm
32+
import pymc as pm
33+
import seaborn as sns
4934
50-
from pymc3 import glm
35+
from scipy.special import expit as inverse_logit
36+
from sklearn.metrics import RocCurveDisplay, accuracy_score, auc, roc_curve
37+
from sklearn.model_selection import train_test_split
38+
```
5139

52-
plt.rcParams["figure.figsize"] = [7, 6]
53-
plt.rcParams["figure.dpi"] = 100
40+
```{code-cell} ipython3
41+
RANDOM_SEED = 8927
42+
rng = np.random.default_rng(RANDOM_SEED)
43+
az.style.use("arviz-darkgrid")
5444
```
5545

5646
## Generate Sample Data
5747

5848
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
5949

6050
```{code-cell} ipython3
61-
SEED = 42
62-
np.random.seed(SEED)
63-
64-
# Number of data points.
51+
# Number of data points
6552
n = 250
66-
# Create features.
67-
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
68-
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
69-
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
70-
# Define target variable.
53+
# Create features
54+
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
55+
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
56+
# Define target variable
7157
intercept = -0.5
7258
beta_x1 = 1
7359
beta_x2 = -1
7460
beta_interaction = 2
7561
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
76-
p = 1 / (1 + np.exp(-z))
77-
y = np.random.binomial(n=1, p=p, size=n)
78-
62+
p = inverse_logit(z)
63+
# note binimial with n=1 is equal to a bernoulli
64+
y = rng.binomial(n=1, p=p, size=n)
7965
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
80-
8166
df.head()
8267
```
8368

8469
Let us do some exploration of the data:
8570

8671
```{code-cell} ipython3
87-
sns.pairplot(
88-
data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]}
89-
);
72+
sns.pairplot(data=df, kind="scatter");
9073
```
9174

9275
- $x_1$ and $x_2$ are not correlated.
@@ -95,91 +78,85 @@ sns.pairplot(
9578

9679
```{code-cell} ipython3
9780
fig, ax = plt.subplots()
98-
sns_c_div = sns.diverging_palette(240, 10, n=2)
99-
sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]])
100-
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
81+
sns.scatterplot(x="x1", y="x2", data=df, hue="y")
82+
ax.legend(title="y")
10183
ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9));
10284
```
10385

10486
## Prepare Data for Modeling
10587

106-
I wanted to use the *`classmethod`* `from_formula` (see [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use *`class`* `pymc3.glm.linear.GLM` (this was motivated by going into the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).
107-
10888
```{code-cell} ipython3
109-
# Define model formula.
110-
formula = "y ~ x1 * x2"
111-
# Create features.
112-
y, x = patsy.dmatrices(formula_like=formula, data=df)
89+
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
11390
y = np.asarray(y).flatten()
11491
labels = x.design_info.column_names
11592
x = np.asarray(x)
11693
```
11794

118-
As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
119-
120-
```{code-cell} ipython3
121-
print(f"labels = {labels}")
122-
```
123-
12495
Now we do a train-test split.
12596

12697
```{code-cell} ipython3
127-
from sklearn.model_selection import train_test_split
128-
129-
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED)
98+
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
13099
```
131100

132101
## Define and Fit the Model
133102

134-
We now specify the model in PyMC3.
103+
We now specify the model in PyMC.
135104

136105
```{code-cell} ipython3
137-
with pm.Model() as model:
138-
# Set data container.
139-
data = pm.Data("data", x_train)
140-
# Define GLM family.
141-
family = pm.glm.families.Binomial()
142-
# Set priors.
143-
priors = {
144-
"Intercept": pm.Normal.dist(mu=0, sd=10),
145-
"x1": pm.Normal.dist(mu=0, sd=10),
146-
"x2": pm.Normal.dist(mu=0, sd=10),
147-
"x1:x2": pm.Normal.dist(mu=0, sd=10),
148-
}
149-
# Specify model.
150-
glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors)
151-
# Configure sampler.
152-
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
106+
COORDS = {"coeffs": labels}
107+
108+
with pm.Model(coords=COORDS) as model:
109+
# data containers
110+
X = pm.MutableData("X", x_train)
111+
y = pm.MutableData("y", y_train)
112+
# priors
113+
b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
114+
# linear model
115+
mu = pm.math.dot(X, b)
116+
# link function
117+
p = pm.Deterministic("p", pm.math.invlogit(mu))
118+
# likelihood
119+
pm.Bernoulli("obs", p=p, observed=y)
120+
121+
pm.model_to_graphviz(model)
153122
```
154123

155124
```{code-cell} ipython3
156-
# Plot chains.
157-
az.plot_trace(data=trace);
125+
with model:
126+
idata = pm.sample()
158127
```
159128

160129
```{code-cell} ipython3
161-
az.summary(trace)
130+
az.plot_trace(idata, var_names="b", compact=False);
162131
```
163132

164133
The chains look good.
165134

166-
+++
135+
```{code-cell} ipython3
136+
az.summary(idata, var_names="b")
137+
```
138+
139+
And we do a good job of recovering the true parameters for this simulated dataset.
140+
141+
```{code-cell} ipython3
142+
az.plot_posterior(
143+
idata, var_names=["b"], ref_val=[intercept, beta_x1, beta_x2, beta_interaction], figsize=(15, 4)
144+
);
145+
```
167146

168147
## Generate Out-Of-Sample Predictions
169148

170149
Now we generate predictions on the test set.
171150

172151
```{code-cell} ipython3
173-
# Update data reference.
174-
pm.set_data({"data": x_test}, model=model)
175-
# Generate posterior samples.
176-
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
152+
with model:
153+
pm.set_data({"X": x_test, "y": y_test})
154+
idata.extend(pm.sample_posterior_predictive(idata))
177155
```
178156

179157
```{code-cell} ipython3
180-
# Compute the point prediction by taking the mean
181-
# and defining the category via a threshold.
182-
p_test_pred = ppc_test["y"].mean(axis=0)
158+
# Compute the point prediction by taking the mean and defining the category via a threshold.
159+
p_test_pred = idata.posterior_predictive["obs"].mean(dim=["chain", "draw"])
183160
y_test_pred = (p_test_pred >= 0.5).astype("int")
184161
```
185162

@@ -188,24 +165,20 @@ y_test_pred = (p_test_pred >= 0.5).astype("int")
188165
First let us compute the accuracy on the test set.
189166

190167
```{code-cell} ipython3
191-
from sklearn.metrics import accuracy_score
192-
193168
print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
194169
```
195170

196171
Next, we plot the [roc curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [auc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
197172

198173
```{code-cell} ipython3
199-
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
200-
201174
fpr, tpr, thresholds = roc_curve(
202175
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
203176
)
204177
roc_auc = auc(fpr, tpr)
205178
206179
fig, ax = plt.subplots()
207180
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
208-
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
181+
roc_display = roc_display.plot(ax=ax, marker="o", markersize=4)
209182
ax.set(title="ROC");
210183
```
211184

@@ -238,60 +211,92 @@ Observe that this curve is a hyperbola centered at the singularity point $x_1 =
238211
Let us now plot the model decision boundary using a grid:
239212

240213
```{code-cell} ipython3
241-
# Construct grid.
242-
x1_grid = np.linspace(start=-9, stop=9, num=300)
243-
x2_grid = x1_grid
244-
245-
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
246-
247-
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
248-
249-
# Create features on the grid.
250-
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
251-
252-
x_grid_ext = np.asarray(x_grid_ext)
214+
def make_grid():
215+
x1_grid = np.linspace(start=-9, stop=9, num=300)
216+
x2_grid = x1_grid
217+
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
218+
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
219+
return x1_grid, x2_grid, x_grid
220+
221+
222+
x1_grid, x2_grid, x_grid = make_grid()
223+
224+
with model:
225+
# Create features on the grid.
226+
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
227+
x_grid_ext = np.asarray(x_grid_ext)
228+
# set the observed variables
229+
pm.set_data({"X": x_grid_ext})
230+
# calculate pushforward values of `p`
231+
ppc_grid = pm.sample_posterior_predictive(idata, var_names=["p"])
232+
```
253233

254-
# Generate model predictions on the grid.
255-
pm.set_data({"data": x_grid_ext}, model=model)
256-
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
234+
```{code-cell} ipython3
235+
# grid of predictions
236+
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
237+
grid_df["p"] = ppc_grid.posterior_predictive.p.mean(dim=["chain", "draw"])
238+
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
257239
```
258240

259241
Now we compute the model decision boundary on the grid for visualization purposes.
260242

261243
```{code-cell} ipython3
262-
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
263-
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
264-
bd_grid = numerator / denominator
265-
266-
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
267-
grid_df["p"] = ppc_grid["y"].mean(axis=0)
268-
grid_df.sort_values("p", inplace=True)
269-
270-
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
244+
def calc_decision_boundary(idata, x1_grid):
245+
# posterior mean of coefficients
246+
intercept = idata.posterior["b"].sel(coeffs="Intercept").mean().data
247+
b1 = idata.posterior["b"].sel(coeffs="x1").mean().data
248+
b2 = idata.posterior["b"].sel(coeffs="x2").mean().data
249+
b1b2 = idata.posterior["b"].sel(coeffs="x1:x2").mean().data
250+
# decision boundary equation
251+
return -(intercept + b1 * x1_grid) / (b2 + b1b2 * x1_grid)
271252
```
272253

273254
We finally get the plot and the predictions on the test set:
274255

275256
```{code-cell} ipython3
276257
fig, ax = plt.subplots()
277-
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
258+
259+
# data
278260
sns.scatterplot(
279261
x=x_test[:, 1].flatten(),
280262
y=x_test[:, 2].flatten(),
281263
hue=y_test,
282-
palette=[sns_c_div[0], sns_c_div[-1]],
283264
ax=ax,
284265
)
285-
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
286-
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)
266+
267+
# decision boundary
268+
ax.plot(x1_grid, calc_decision_boundary(idata, x1_grid), color="black", linestyle=":")
269+
270+
# grid of predictions
271+
ax.contourf(x1_grid, x2_grid, p_grid, alpha=0.3)
272+
287273
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
288-
ax.lines[0].set_linestyle("dotted")
289274
ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2");
290275
```
291276

292-
**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a `Deterministic` variable as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb).
277+
Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC).
278+
279+
+++
280+
281+
## References
282+
283+
- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
284+
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
285+
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)
286+
287+
+++
288+
289+
## Authors
290+
- Created by [Juan Orduz](https://github.com/juanitorduz).
291+
- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) to PyMC v4 in June 2022
292+
293+
+++
294+
295+
## Watermark
293296

294297
```{code-cell} ipython3
295298
%load_ext watermark
296-
%watermark -n -u -v -iv -w
299+
%watermark -n -u -v -iv -w -p aesara,aeppl
297300
```
301+
302+
:::{include} ../page_footer.md :::

0 commit comments

Comments
 (0)