Bring back updated GLM out of sample prediction notebook (#486)
* bring back the updated notebook
* proper formatting of page footer
* update date + add re-executed line in end author block
* remove mention of the GLM submodule; the mention of Bambi is also no longer relevant
In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in defining models via [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as this makes the model-evaluation loop faster (it is easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample, while many applications require predictions on out-of-sample data. This experiment was motivated by the discussion in the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io](https://discourse.pymc.io/). Thank you all for your input!
**Resources**

- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)
  - In particular, check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)
- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.
- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
I wanted to use the *`classmethod`* `from_formula` (see the [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way, please let me know!). As a workaround, I create the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use the *`class`* `pymc3.glm.linear.GLM` (this was motivated by reading the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).
```{code-cell} ipython3
# Create the design matrices from a patsy formula.
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
```
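To see how the formula expands into a design matrix, and how the stored `design_info` can be reused to build the same features for new (out-of-sample) data, here is a small standalone sketch; the toy `df` below is made up purely for illustration and is not the notebook's data:

```python
import numpy as np
import pandas as pd
import patsy

# Toy data, for illustration only.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=5), "x2": rng.normal(size=5)})
df["y"] = (df["x1"] + df["x2"] > 0).astype(int)

# The interaction formula expands into four columns.
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
labels = x.design_info.column_names
print(labels)  # ['Intercept', 'x1', 'x2', 'x1:x2']

# Reuse the stored design_info to build features for unseen data,
# guaranteeing the same column order and encodings.
new_df = pd.DataFrame({"x1": [0.1], "x2": [-0.2]})
(x_new,) = patsy.build_design_matrices([x.design_info], new_df)
print(np.asarray(x_new))
```

Building the test-set features through `design_info` (rather than calling `patsy.dmatrices` again on the new data) is what keeps the out-of-sample matrix aligned with the one used for fitting.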
As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.
```{code-cell} ipython3
print(f"labels = {labels}")
```
Now we do a train-test split.
```{code-cell} ipython3
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
```
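Once the sampler has produced a posterior, the out-of-sample prediction step itself is just matrix algebra: push each posterior draw of the coefficients through the test design matrix and average the resulting probabilities. A minimal numpy sketch, with made-up posterior samples standing in for the PyMC3 trace:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 4 features (Intercept, x1, x2, x1:x2), 1000 posterior draws.
n_features, n_draws = 4, 1000
beta_samples = rng.normal(size=(n_draws, n_features))  # fake "trace"

# Hypothetical out-of-sample design matrix (2 test points).
x_test = np.array([[1.0, 0.5, -0.3, -0.15],
                   [1.0, -1.0, 0.2, -0.2]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Posterior predictive probability per test point: average the
# per-draw probabilities, not the logits.
p_test = sigmoid(x_test @ beta_samples.T).mean(axis=1)
y_pred = (p_test > 0.5).astype(int)
print(p_test.shape)  # (2,)
```

Averaging after the sigmoid (rather than averaging the coefficients first) is what makes this a genuine posterior-predictive mean.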
Next, we plot the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
```{code-cell} ipython3
from sklearn.metrics import RocCurveDisplay, auc, roc_curve
```
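As a quick reminder of how these two functions fit together, here is a tiny self-contained example on made-up labels and scores (not the notebook's predictions), following the standard scikit-learn usage:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Made-up true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
p_scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps the decision threshold; auc integrates the curve.
fpr, tpr, thresholds = roc_curve(y_true, p_scores)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))  # 0.75
```

In the notebook, `p_scores` would be the posterior-predictive mean probabilities on the test set.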
**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete posterior distribution (similarly for other metrics like accuracy and AUC). One way of doing this is to store and compute it inside the model definition as a `Deterministic` variable.

+++
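To make the remark concrete: instead of one accuracy number, each posterior draw yields its own accuracy, and the collection of these values is a distribution we can summarize. A numpy sketch with fake per-draw probabilities standing in for the notebook's trace:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake per-draw test-set probabilities and labels, for illustration only.
n_draws, n_test = 500, 40
p_draws = rng.uniform(size=(n_draws, n_test))
y_test = rng.integers(0, 2, size=n_test)

# One accuracy value per posterior draw -> a distribution, not a point.
acc_draws = ((p_draws > 0.5) == y_test).mean(axis=1)
print(acc_draws.shape)  # (500,)

# Summarize with, e.g., a 94% credible interval.
lo, hi = np.quantile(acc_draws, [0.03, 0.97])
```

The same pattern works for AUC or any other metric computed draw by draw.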
## References

- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)