update GLM out of sample prediction notebook to v4 (#370)
* create truncated regression example
* delete truncated regression example from main branch
* fix incorrect statement about pm.NormalMixture
* update to v4
* remove a commented out line of code + add myst file
* spelling
* revisions based on feedback
* fix: outputs disappeared for some reason
* updates based on feedback
* revert to using labels for coords
Co-authored-by: Benjamin T. Vincent <[email protected]>
In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in defining models with [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as this makes the model-evaluation loop faster (it is easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample; for many applications we need predictions on out-of-sample data. This experiment was motivated by the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/). Thank you all for your input!
**Resources**
- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)
  - In particular, check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
:category: beginner
:::
+++
## Prepare Notebook
Note that the GLM sub-module that motivated the original thread has since been deprecated in favour of [`bambi`](https://github.com/bambinos/bambi); this notebook therefore implements a 'raw' PyMC model.

I originally wanted to use the *classmethod* `from_formula` (see the [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way, please let me know!). As a workaround, I create the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use the class `pymc3.glm.linear.GLM` (this was motivated by reading the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).
```{code-cell} ipython3
# Create features from the model formula.
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
```
As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
```{code-cell} ipython3
print(f"labels = {labels}")
```
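
For the formula above (assuming `x1` and `x2` are numeric columns of `df`), patsy expands the product into main effects plus an interaction term, so the labels should look like this:

```{code-cell} ipython3
# Sanity check (assumes numeric x1 and x2): "x1 * x2" expands to
# the intercept, both main effects, and the interaction term.
assert labels == ["Intercept", "x1", "x2", "x1:x2"]
```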
Now we do a train-test split.
```{code-cell} ipython3
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
```
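
The model itself is elided from this section, so here is a minimal sketch (my construction, not the notebook's verbatim code) of a 'raw' PyMC v4 logistic regression. It uses the patsy `labels` as coordinate values and `pm.MutableData` containers so the test set can be swapped in for out-of-sample predictions; the names `beta`, `p`, and `obs` are illustrative.

```{code-cell} ipython3
import numpy as np
import pymc as pm

coords = {"coeffs": labels}

with pm.Model(coords=coords) as model:
    # Mutable data containers let us swap in the test set later.
    X = pm.MutableData("X", x_train)
    y_obs = pm.MutableData("y_obs", y_train)
    # One coefficient per design-matrix column (including the intercept).
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, dims="coeffs")
    # Logistic link: probability of the positive class.
    p = pm.Deterministic("p", pm.math.invlogit(pm.math.dot(X, beta)))
    # Bernoulli likelihood.
    pm.Bernoulli("obs", p=p, observed=y_obs)
    idata = pm.sample()

# Out-of-sample predictions: swap in the test data and re-sample the
# posterior predictive (the observed values passed here are dummies).
with model:
    pm.set_data({"X": x_test, "y_obs": np.zeros(x_test.shape[0])})
    idata.extend(pm.sample_posterior_predictive(idata, var_names=["p", "obs"]))
```

Because `pm.set_data` resizes the data containers, the posterior-predictive `p` then holds one test-set probability vector per posterior draw.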
Next, we plot the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
```{code-cell} ipython3
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
```
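
A sketch of how these could be wired together, assuming the test-set probabilities live in `idata.posterior_predictive["p"]` as in the model sketch above (again my construction, not necessarily the notebook's code):

```{code-cell} ipython3
# Point score per test observation: the posterior-predictive mean of "p"
# (assumes idata was extended as in the sketch above).
p_test_mean = idata.posterior_predictive["p"].mean(dim=("chain", "draw")).values

fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=p_test_mean)
roc_auc = auc(fpr, tpr)
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot()
```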
**Remark:** We computed the model decision boundary using the mean of the posterior samples, but we can generate a better (and more informative!) plot if we use the complete distribution (and similarly for other metrics, such as accuracy and AUC). One way of doing this is to compute and store the relevant quantity inside the model definition as a `Deterministic` variable, as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb).
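
As a hedged illustration of that idea (same assumptions as above about where `p` lives), one can compute the metric once per posterior draw and inspect its distribution instead of a single point value:

```{code-cell} ipython3
import matplotlib.pyplot as plt

# Stack (chain, draw) into one sample axis -> shape (n_test, n_samples).
p_draws = idata.posterior_predictive["p"].stack(sample=("chain", "draw")).values

# One AUC per posterior draw gives a distribution over the metric.
aucs = []
for i in range(p_draws.shape[1]):
    fpr_i, tpr_i, _ = roc_curve(y_true=y_test, y_score=p_draws[:, i])
    aucs.append(auc(fpr_i, tpr_i))

plt.hist(aucs, bins=30)
plt.xlabel("AUC")
plt.ylabel("count")
plt.title("Distribution of test-set AUC across posterior draws");
```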
+++
## References
- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)