Skip to content

Commit 5238dc3

Browse files
authored
Bring back updated GLM out of sample prediction notebook (#486)
* bring back the updated notebook * proper formatting of page footer * update date + add re-executed line in end author block * remove mention of GLM submodule. Also mention of Bambi is now irrelevant.
1 parent c98bd70 commit 5238dc3

File tree

2 files changed

+547
-115696
lines changed

2 files changed

+547
-115696
lines changed

examples/generalized_linear_models/GLM-out-of-sample-predictions.ipynb

+414-115,565
Large diffs are not rendered by default.

examples/generalized_linear_models/GLM-out-of-sample-predictions.myst.md

+133-131
Original file line numberDiff line numberDiff line change
@@ -5,87 +5,66 @@ jupytext:
55
format_name: myst
66
format_version: 0.13
77
kernelspec:
8-
display_name: 'Python 3.7.6 64-bit (''website_projects'': conda)'
9-
metadata:
10-
interpreter:
11-
hash: fbddea5140024843998ae64bf59a7579a9660d103062604797ea5984366c686c
8+
display_name: pymc_env
9+
language: python
1210
name: python3
1311
---
1412

15-
# GLM in PyMC3: Out-Of-Sample Predictions
13+
(GLM-out-of-sample-predictions)=
14+
# Out-Of-Sample Predictions
1615

17-
In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in the model definition using [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input!
18-
19-
**Resources**
20-
21-
22-
- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)
23-
24-
- In particular check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)
25-
26-
- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
27-
28-
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
29-
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)
30-
31-
+++
32-
33-
## Prepare Notebook
16+
:::{post} December, 2022
17+
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
18+
:category: beginner
19+
:::
3420

3521
```{code-cell} ipython3
22+
import arviz as az
3623
import matplotlib.pyplot as plt
3724
import numpy as np
3825
import pandas as pd
39-
import seaborn as sns
40-
41-
sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"})
42-
sns.set_palette(palette="deep")
43-
sns_c = sns.color_palette(palette="deep")
44-
45-
import arviz as az
4626
import patsy
47-
import pymc3 as pm
27+
import pymc as pm
28+
import seaborn as sns
4829
49-
from pymc3 import glm
30+
from scipy.special import expit as inverse_logit
31+
from sklearn.metrics import RocCurveDisplay, accuracy_score, auc, roc_curve
32+
from sklearn.model_selection import train_test_split
33+
```
5034

51-
plt.rcParams["figure.figsize"] = [7, 6]
52-
plt.rcParams["figure.dpi"] = 100
35+
```{code-cell} ipython3
36+
RANDOM_SEED = 8927
37+
rng = np.random.default_rng(RANDOM_SEED)
38+
az.style.use("arviz-darkgrid")
5339
```
5440

5541
## Generate Sample Data
5642

5743
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
5844

5945
```{code-cell} ipython3
60-
SEED = 42
61-
np.random.seed(SEED)
62-
63-
# Number of data points.
46+
# Number of data points
6447
n = 250
65-
# Create features.
66-
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
67-
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
68-
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
69-
# Define target variable.
48+
# Create features
49+
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
50+
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
51+
# Define target variable
7052
intercept = -0.5
7153
beta_x1 = 1
7254
beta_x2 = -1
7355
beta_interaction = 2
7456
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
75-
p = 1 / (1 + np.exp(-z))
76-
y = np.random.binomial(n=1, p=p, size=n)
77-
57+
p = inverse_logit(z)
58+
# note binimial with n=1 is equal to a bernoulli
59+
y = rng.binomial(n=1, p=p, size=n)
7860
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
79-
8061
df.head()
8162
```
8263

8364
Let us do some exploration of the data:
8465

8566
```{code-cell} ipython3
86-
sns.pairplot(
87-
data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]}
88-
);
67+
sns.pairplot(data=df, kind="scatter");
8968
```
9069

9170
- $x_1$ and $x_2$ are not correlated.
@@ -94,91 +73,85 @@ sns.pairplot(
9473

9574
```{code-cell} ipython3
9675
fig, ax = plt.subplots()
97-
sns_c_div = sns.diverging_palette(240, 10, n=2)
98-
sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]])
99-
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
76+
sns.scatterplot(x="x1", y="x2", data=df, hue="y")
77+
ax.legend(title="y")
10078
ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9));
10179
```
10280

10381
## Prepare Data for Modeling
10482

105-
I wanted to use the *`classmethod`* `from_formula` (see [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use *`class`* `pymc3.glm.linear.GLM` (this was motivated by going into the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).
106-
10783
```{code-cell} ipython3
108-
# Define model formula.
109-
formula = "y ~ x1 * x2"
110-
# Create features.
111-
y, x = patsy.dmatrices(formula_like=formula, data=df)
84+
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
11285
y = np.asarray(y).flatten()
11386
labels = x.design_info.column_names
11487
x = np.asarray(x)
11588
```
11689

117-
As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
118-
119-
```{code-cell} ipython3
120-
print(f"labels = {labels}")
121-
```
122-
12390
Now we do a train-test split.
12491

12592
```{code-cell} ipython3
126-
from sklearn.model_selection import train_test_split
127-
128-
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED)
93+
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
12994
```
13095

13196
## Define and Fit the Model
13297

133-
We now specify the model in PyMC3.
98+
We now specify the model in PyMC.
13499

135100
```{code-cell} ipython3
136-
with pm.Model() as model:
137-
# Set data container.
138-
data = pm.Data("data", x_train)
139-
# Define GLM family.
140-
family = pm.glm.families.Binomial()
141-
# Set priors.
142-
priors = {
143-
"Intercept": pm.Normal.dist(mu=0, sd=10),
144-
"x1": pm.Normal.dist(mu=0, sd=10),
145-
"x2": pm.Normal.dist(mu=0, sd=10),
146-
"x1:x2": pm.Normal.dist(mu=0, sd=10),
147-
}
148-
# Specify model.
149-
glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors)
150-
# Configure sampler.
151-
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
101+
COORDS = {"coeffs": labels}
102+
103+
with pm.Model(coords=COORDS) as model:
104+
# data containers
105+
X = pm.MutableData("X", x_train)
106+
y = pm.MutableData("y", y_train)
107+
# priors
108+
b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
109+
# linear model
110+
mu = pm.math.dot(X, b)
111+
# link function
112+
p = pm.Deterministic("p", pm.math.invlogit(mu))
113+
# likelihood
114+
pm.Bernoulli("obs", p=p, observed=y)
115+
116+
pm.model_to_graphviz(model)
152117
```
153118

154119
```{code-cell} ipython3
155-
# Plot chains.
156-
az.plot_trace(data=trace);
120+
with model:
121+
idata = pm.sample()
157122
```
158123

159124
```{code-cell} ipython3
160-
az.summary(trace)
125+
az.plot_trace(idata, var_names="b", compact=False);
161126
```
162127

163128
The chains look good.
164129

165-
+++
130+
```{code-cell} ipython3
131+
az.summary(idata, var_names="b")
132+
```
133+
134+
And we do a good job of recovering the true parameters for this simulated dataset.
135+
136+
```{code-cell} ipython3
137+
az.plot_posterior(
138+
idata, var_names=["b"], ref_val=[intercept, beta_x1, beta_x2, beta_interaction], figsize=(15, 4)
139+
);
140+
```
166141

167142
## Generate Out-Of-Sample Predictions
168143

169144
Now we generate predictions on the test set.
170145

171146
```{code-cell} ipython3
172-
# Update data reference.
173-
pm.set_data({"data": x_test}, model=model)
174-
# Generate posterior samples.
175-
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
147+
with model:
148+
pm.set_data({"X": x_test, "y": y_test})
149+
idata.extend(pm.sample_posterior_predictive(idata))
176150
```
177151

178152
```{code-cell} ipython3
179-
# Compute the point prediction by taking the mean
180-
# and defining the category via a threshold.
181-
p_test_pred = ppc_test["y"].mean(axis=0)
153+
# Compute the point prediction by taking the mean and defining the category via a threshold.
154+
p_test_pred = idata.posterior_predictive["obs"].mean(dim=["chain", "draw"])
182155
y_test_pred = (p_test_pred >= 0.5).astype("int")
183156
```
184157

@@ -187,24 +160,20 @@ y_test_pred = (p_test_pred >= 0.5).astype("int")
187160
First let us compute the accuracy on the test set.
188161

189162
```{code-cell} ipython3
190-
from sklearn.metrics import accuracy_score
191-
192163
print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
193164
```
194165

195166
Next, we plot the [roc curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [auc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
196167

197168
```{code-cell} ipython3
198-
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
199-
200169
fpr, tpr, thresholds = roc_curve(
201170
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
202171
)
203172
roc_auc = auc(fpr, tpr)
204173
205174
fig, ax = plt.subplots()
206175
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
207-
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
176+
roc_display = roc_display.plot(ax=ax, marker="o", markersize=4)
208177
ax.set(title="ROC");
209178
```
210179

@@ -237,60 +206,93 @@ Observe that this curve is a hyperbola centered at the singularity point $x_1 =
237206
Let us now plot the model decision boundary using a grid:
238207

239208
```{code-cell} ipython3
240-
# Construct grid.
241-
x1_grid = np.linspace(start=-9, stop=9, num=300)
242-
x2_grid = x1_grid
243-
244-
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
245-
246-
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
247-
248-
# Create features on the grid.
249-
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
250-
251-
x_grid_ext = np.asarray(x_grid_ext)
209+
def make_grid():
210+
x1_grid = np.linspace(start=-9, stop=9, num=300)
211+
x2_grid = x1_grid
212+
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
213+
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
214+
return x1_grid, x2_grid, x_grid
215+
216+
217+
x1_grid, x2_grid, x_grid = make_grid()
218+
219+
with model:
220+
# Create features on the grid.
221+
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
222+
x_grid_ext = np.asarray(x_grid_ext)
223+
# set the observed variables
224+
pm.set_data({"X": x_grid_ext})
225+
# calculate pushforward values of `p`
226+
ppc_grid = pm.sample_posterior_predictive(idata, var_names=["p"])
227+
```
252228

253-
# Generate model predictions on the grid.
254-
pm.set_data({"data": x_grid_ext}, model=model)
255-
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
229+
```{code-cell} ipython3
230+
# grid of predictions
231+
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
232+
grid_df["p"] = ppc_grid.posterior_predictive.p.mean(dim=["chain", "draw"])
233+
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
256234
```
257235

258236
Now we compute the model decision boundary on the grid for visualization purposes.
259237

260238
```{code-cell} ipython3
261-
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
262-
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
263-
bd_grid = numerator / denominator
264-
265-
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
266-
grid_df["p"] = ppc_grid["y"].mean(axis=0)
267-
grid_df.sort_values("p", inplace=True)
268-
269-
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
239+
def calc_decision_boundary(idata, x1_grid):
240+
# posterior mean of coefficients
241+
intercept = idata.posterior["b"].sel(coeffs="Intercept").mean().data
242+
b1 = idata.posterior["b"].sel(coeffs="x1").mean().data
243+
b2 = idata.posterior["b"].sel(coeffs="x2").mean().data
244+
b1b2 = idata.posterior["b"].sel(coeffs="x1:x2").mean().data
245+
# decision boundary equation
246+
return -(intercept + b1 * x1_grid) / (b2 + b1b2 * x1_grid)
270247
```
271248

272249
We finally get the plot and the predictions on the test set:
273250

274251
```{code-cell} ipython3
275252
fig, ax = plt.subplots()
276-
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)
253+
254+
# data
277255
sns.scatterplot(
278256
x=x_test[:, 1].flatten(),
279257
y=x_test[:, 2].flatten(),
280258
hue=y_test,
281-
palette=[sns_c_div[0], sns_c_div[-1]],
282259
ax=ax,
283260
)
284-
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
285-
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)
261+
262+
# decision boundary
263+
ax.plot(x1_grid, calc_decision_boundary(idata, x1_grid), color="black", linestyle=":")
264+
265+
# grid of predictions
266+
ax.contourf(x1_grid, x2_grid, p_grid, alpha=0.3)
267+
286268
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
287-
ax.lines[0].set_linestyle("dotted")
288269
ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2");
289270
```
290271

291-
**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a `Deterministic` variable as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb).
272+
Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC).
273+
274+
+++
275+
276+
## References
277+
278+
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
279+
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)
280+
281+
+++
282+
283+
## Authors
284+
- Created by [Juan Orduz](https://github.com/juanitorduz).
285+
- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) to PyMC v4 in June 2022
286+
- Re-executed by [Benjamin T. Vincent](https://github.com/drbenvincent) with PyMC v5 in December 2022
287+
288+
+++
289+
290+
## Watermark
292291

293292
```{code-cell} ipython3
294293
%load_ext watermark
295-
%watermark -n -u -v -iv -w
294+
%watermark -n -u -v -iv -w -p pytensor
296295
```
296+
297+
:::{include} ../page_footer.md
298+
:::

0 commit comments

Comments
 (0)