
Commit 2aaddde

ferrine authored and OriolAbril committed
Rerun vi notebooks and remove outdated ones
Squashed commit messages:
* rerunning bayesian_neural_network_advi (repeated 6 times)
* rerunning empirical approx
* reorder imports
* update the notebook (repeated twice)
* update minibatch example
* rerun empirical approx
* run pre commit
* GLM rerun
* update glm
* remove references to pymc3
* remove tags
1 parent: 5c21572

11 files changed: +897 -3964 lines

examples/variational_inference/GLM-hierarchical-advi-minibatch.ipynb: +96 -86 (large diff not rendered by default)
examples/variational_inference/bayesian_neural_network_advi.ipynb: +223 -92 (large diff not rendered by default)
examples/variational_inference/convolutional_vae_keras_advi.ipynb: -761 (file deleted)
examples/variational_inference/empirical-approx-overview.ipynb: +137 -149 (large diff not rendered by default)
examples/variational_inference/lda-advi-aevb.ipynb: -849 (file deleted)
examples/variational_inference/normalizing_flows_overview.ipynb: -1,288 (file deleted)
examples/variational_inference/variational_api_quickstart.ipynb: +377 -685 (large diff not rendered by default)

myst_nbs/variational_inference/GLM-hierarchical-advi-minibatch.myst.md: +15 -9
@@ -6,9 +6,9 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.13.7
 kernelspec:
-  display_name: Python 3
+  display_name: pymc
   language: python
-  name: python3
+  name: pymc
 ---

 # GLM: Mini-batch ADVI on hierarchical regression model
@@ -23,28 +23,29 @@ kernelspec:
 Unlike Gaussian mixture models, (hierarchical) regression models have independent variables. These variables affect the likelihood function, but are not random variables. When using mini-batch, we should take care of that.

 ```{code-cell} ipython3
-%env THEANO_FLAGS=device=cpu, floatX=float32, warn_float64=ignore
+%env AESARA_FLAGS=device=cpu, floatX=float32, warn_float64=ignore

 import os

+import aesara
+import aesara.tensor as at
 import arviz as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
-import pymc3 as pm
+import pymc as pm
 import seaborn as sns
-import theano
-import theano.tensor as tt

 from scipy import stats

-print(f"Running on PyMC3 v{pm.__version__}")
+print(f"Running on PyMC v{pm.__version__}")
 ```

 ```{code-cell} ipython3
 %config InlineBackend.figure_format = 'retina'
 RANDOM_SEED = 8927
 rng = np.random.default_rng(RANDOM_SEED)
+pm.set_at_rng(RANDOM_SEED)
 az.style.use("arviz-darkgrid")
 ```
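The context sentence in this hunk notes that independent variables "affect the likelihood function, but are not random variables" and need care under mini-batching, while the commit itself only touches imports here. The following is a minimal, hypothetical sketch (not part of the diff) of the mini-batch pattern the notebook relies on, assuming a recent PyMC where `pm.Minibatch` returns aligned mini-batch views and observed variables accept `total_size`; the data arrays are placeholders, not the radon data.

```python
import numpy as np
import pymc as pm

# Placeholder stand-ins for the notebook's radon data (hypothetical, not from the diff)
floor = np.random.randint(0, 2, size=919).astype("float64")
log_radon = np.random.randn(919)

# Aligned mini-batch views of the independent variable and the observations
floor_mb, log_radon_mb = pm.Minibatch(floor, log_radon, batch_size=100)

with pm.Model() as minibatch_model:
    intercept = pm.Normal("intercept", 0.0, 10.0)
    slope = pm.Normal("slope", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    # total_size rescales the mini-batch likelihood back to the full data set,
    # which is the "care" required for non-random independent variables
    pm.Normal(
        "obs",
        mu=intercept + slope * floor_mb,
        sigma=sigma,
        observed=log_radon_mb,
        total_size=len(log_radon),
    )
    approx = pm.fit(10_000)     # ADVI by default
    idata = approx.sample(500)  # InferenceData draws from the fitted approximation
```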

@@ -127,7 +128,7 @@ Then, run ADVI with mini-batch.
 ```{code-cell} ipython3
 with hierarchical_model:
     approx = pm.fit(100000, callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-4)])
-    idata_advi = az.from_pymc3(approx.sample(500))
+    idata_advi = approx.sample(500)
 ```

 Check the trace of ELBO and compare the result with MCMC.
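The change above works because `approx.sample()` now returns an `InferenceData` object directly, so the `az.from_pymc3` conversion is no longer needed. A short, hypothetical illustration (not in the diff) of the follow-up check the last context line refers to, assuming `approx` was produced by `pm.fit` as in the cell above:

```python
import arviz as az
import matplotlib.pyplot as plt

idata_advi = approx.sample(500)    # already an InferenceData object, usable with ArviZ as-is
az.summary(idata_advi, kind="stats")

# pm.fit records the negative ELBO at every iteration on the approximation object
plt.plot(approx.hist)
plt.xlabel("iteration")
plt.ylabel("-ELBO")
plt.show()
```
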
@@ -158,7 +159,12 @@ with pm.Model(coords=coords):
     # essentially, this is what init='advi' does
     step = pm.NUTS(scaling=approx.cov.eval(), is_cov=True)
     hierarchical_trace = pm.sample(
-        2000, step, start=approx.sample()[0], progressbar=True, return_inferencedata=True
+        2000,
+        step,
+        # sampling different initial values from the trace
+        initvals=list(approx.sample(return_inferencedata=False, size=4)[i] for i in range(4)),
+        progressbar=True,
+        return_inferencedata=True,
     )
 ```
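The replacement of `start=` with `initvals=` reflects the renamed `pm.sample` argument, which accepts either a single dict or a sequence with one dict per chain. A self-contained, hypothetical sketch (not from the diff) of the per-chain form with made-up starting values:

```python
import pymc as pm

with pm.Model() as toy_model:
    mu = pm.Normal("mu", 0.0, 1.0)
    # one starting-value dict per chain, e.g. taken from draws of an ADVI approximation
    idata = pm.sample(
        1000,
        chains=4,
        initvals=[{"mu": v} for v in (0.1, -0.2, 0.0, 0.3)],
        progressbar=False,
    )
```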

myst_nbs/variational_inference/bayesian_neural_network_advi.myst.md: +5 -5
@@ -35,13 +35,13 @@ Unfortunately, when it comes to traditional ML problems like classification or (

 ### Deep Learning

-Now in its third renaissance, neural networks have been making headlines repeatadly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013playing`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders {cite:p}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.
+Now in its third renaissance, neural networks have been making headlines repeatadly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013playing`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders {cite:p}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonneat.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

 A large part of the innoviation in deep learning is the ability to train these extremely complex models. This rests on several pillars:
 * Speed: facilitating the GPU allowed for much faster processing.
 * Software: frameworks like [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/) allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
 * Learning algorithms: training on sub-sets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting.
-* Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html).
+* Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for [MDNs](http://cbonneat.github.io/MDN_EDWARD_KERAS_TF.html).

 ### Bridging Deep Learning and Probabilistic Programming
 On one hand we have Probabilistic Programming which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also [Dustin Tran's](https://twitter.com/dustinvtran) [blog post](http://dustintran.com/blog/a-quick-update-edward-and-some-motivations/).
@@ -51,7 +51,7 @@ While this would allow Probabilistic Programming to be applied to a much wider s
 * **Uncertainty in representations**: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network.
 * **Regularization with priors**: Weights are often L2-regularized to avoid overfitting, this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).
 * **Transfer learning with informed priors**: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet {cite:p}`szegedy2014going`.
-* **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see [Hierarchical Linear Regression in PyMC3](https://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition being that all cars from a certain manufactures share certain similarities so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- *e.g.* early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
+* **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see [Hierarchical Linear Regression in PyMC](https://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition being that all cars from a certain manufactures share certain similarities so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- *e.g.* early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
 * **Other hybrid architectures**: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.

 +++
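The "Regularization with priors" bullet above states the standard equivalence between L2 weight decay and a zero-mean Gaussian prior. A tiny, hypothetical illustration of that bullet in PyMC terms (not part of this commit, made-up data):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                        # made-up features
y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) > 0).astype("int64")   # made-up labels

with pm.Model() as logistic_l2:
    # sigma plays the role of the inverse L2 penalty: smaller sigma means stronger shrinkage
    w = pm.Normal("w", mu=0.0, sigma=1.0, shape=5)
    b = pm.Normal("b", mu=0.0, sigma=1.0)
    pm.Bernoulli("obs", logit_p=pm.math.dot(X, w) + b, observed=y)
```
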
@@ -109,7 +109,7 @@ ax.set(xlabel="X", ylabel="Y", title="Toy binary classification data set");

 ### Model specification

-A neural network is quite simple. The basic unit is a [perceptron](https://en.wikipedia.org/wiki/Perceptron) which is nothing more than [logistic regression](http://pymc-devs.github.io/pymc3/notebooks/posterior_predictive.html#Prediction). We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.
+A neural network is quite simple. The basic unit is a [perceptron](https://en.wikipedia.org/wiki/Perceptron) which is nothing more than [logistic regression](http://pymc-devs.github.io/pymc/notebooks/posterior_predictive.html#Prediction). We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.

 ```{code-cell} ipython3
 ---
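The changed sentence above only updates a link, but since the surrounding prose describes the architecture (perceptrons as logistic regressions, stacked into two hidden layers of 5 neurons), here is a compressed, hypothetical sketch of such a model; this is not the notebook's exact code, and the data arrays are placeholders:

```python
import numpy as np
import pymc as pm

X_train = np.random.randn(100, 2).astype("float32")             # placeholder features
Y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype("float32")  # placeholder labels
n_hidden = 5

with pm.Model() as neural_network:
    # Normal(0, 1) priors on all weights: input -> hidden1 -> hidden2 -> output
    w_in_1 = pm.Normal("w_in_1", 0.0, sigma=1.0, shape=(X_train.shape[1], n_hidden))
    w_1_2 = pm.Normal("w_1_2", 0.0, sigma=1.0, shape=(n_hidden, n_hidden))
    w_2_out = pm.Normal("w_2_out", 0.0, sigma=1.0, shape=(n_hidden,))

    act_1 = pm.math.tanh(pm.math.dot(X_train, w_in_1))
    act_2 = pm.math.tanh(pm.math.dot(act_1, w_1_2))
    act_out = pm.math.sigmoid(pm.math.dot(act_2, w_2_out))

    pm.Bernoulli("out", p=act_out, observed=Y_train)
    approx = pm.fit(20_000, method="advi")  # fit the weights with ADVI
```
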
@@ -325,7 +325,7 @@ You might argue that the above network isn't really deep, but note that we could

 ## Acknowledgements

-[Taku Yoshioka](https://github.com/taku-y) did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to the thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.
+[Taku Yoshioka](https://github.com/taku-y) did a lot of work on ADVI in PyMC, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to the thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.

 +++

myst_nbs/variational_inference/empirical-approx-overview.myst.md: +20 -22
@@ -6,32 +6,37 @@ jupytext:
     format_version: 0.13
     jupytext_version: 1.13.7
 kernelspec:
-  display_name: Python PyMC3 (Dev)
+  display_name: pymc
   language: python
-  name: pymc3-dev-py38
+  name: pymc
 ---

 # Empirical Approximation overview

+:::
+:tags: variational inference
+:category: intermediate
+:::
+
 For most models we use sampling MCMC algorithms like Metropolis or NUTS. In PyMC3 we got used to store traces of MCMC samples and then do analysis using them. There is a similar concept for the variational inference submodule in PyMC3: *Empirical*. This type of approximation stores particles for the SVGD sampler. There is no difference between independent SVGD particles and MCMC samples. *Empirical* acts as a bridge between MCMC sampling output and full-fledged VI utils like `apply_replacements` or `sample_node`. For the interface description, see [variational_api_quickstart](variational_api_quickstart.ipynb). Here we will just focus on `Emprical` and give an overview of specific things for the *Empirical* approximation

 ```{code-cell} ipython3
+import aesara
 import arviz as az
 import matplotlib.pyplot as plt
 import numpy as np
-import pymc3 as pm
-import theano
+import pymc as pm

 from pandas import DataFrame

-print(f"Running on PyMC3 v{pm.__version__}")
+print(f"Running on PyMC v{pm.__version__}")
 ```

 ```{code-cell} ipython3
 %config InlineBackend.figure_format = 'retina'
 az.style.use("arviz-darkgrid")
 np.random.seed(42)
-pm.set_tt_rng(42)
+pm.set_at_rng(42)
 ```

 ## Multimodal density
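The paragraph in this hunk describes *Empirical* as a bridge between MCMC output and the VI utilities. A minimal, hypothetical sketch of that bridge (not in the diff), assuming the post-migration API where a plain `MultiTrace` is obtained with `return_inferencedata=False`:

```python
import pymc as pm

with pm.Model() as toy:
    pm.Normal("x", 0.0, 1.0)
    # Empirical wraps a MultiTrace, so keep the raw trace rather than InferenceData
    trace = pm.sample(1000, return_inferencedata=False, progressbar=False)

with toy:
    approx = pm.Empirical(trace)  # the MCMC draws become the approximation's particles

approx.sample(500)   # draw from the particle-based approximation like any other approximation
approx.histogram     # the underlying shared storage of particles
```
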
@@ -43,12 +48,14 @@ mu = pm.floatX([-0.3, 0.5])
 sd = pm.floatX([0.1, 0.1])

 with pm.Model() as model:
-    x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd, dtype=theano.config.floatX)
-    trace = pm.sample(50000)
+    x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd)
+    # Empirical approx does not support inference data
+    trace = pm.sample(50000, return_inferencedata=False)
+    idata = pm.to_inference_data(trace)
 ```

 ```{code-cell} ipython3
-az.plot_trace(trace);
+az.plot_trace(idata);
 ```

 Great. First having a trace we can create `Empirical` approx
@@ -66,7 +73,7 @@ with model:
 approx
 ```

-This type of approximation has it's own underlying storage for samples that is `theano.shared` itself
+This type of approximation has it's own underlying storage for samples that is `aesara.shared` itself

 ```{code-cell} ipython3
 approx.histogram
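Since the sentence above only swaps `theano.shared` for `aesara.shared`, a brief hypothetical illustration of what that storage looks like in practice, assuming `approx` is the `Empirical` object from the earlier cells:

```python
# approx.histogram is a shared variable holding one flattened parameter vector per
# particle; shared variables expose their contents via get_value()
particles = approx.histogram.get_value()
print(type(approx.histogram))   # an aesara SharedVariable
print(particles.shape)          # (n_particles, n_flattened_parameters)
```
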
@@ -113,7 +120,8 @@ mu = pm.floatX([0.0, 0.0])
 cov = pm.floatX([[1, 0.5], [0.5, 1.0]])
 with pm.Model() as model:
     pm.MvNormal("x", mu=mu, cov=cov, shape=2)
-    trace = pm.sample(1000)
+    trace = pm.sample(1000, return_inferencedata=False)
+    idata = pm.to_inference_data(trace)
 ```

 ```{code-cell} ipython3
@@ -126,17 +134,7 @@ az.plot_trace(approx.sample(10000));
 ```

 ```{code-cell} ipython3
-import seaborn as sns
-```
-
-```{code-cell} ipython3
-kdeViz_df = DataFrame(
-    data=approx.sample(1000)["x"], columns=["First Dimension", "Second Dimension"]
-)
-```
-
-```{code-cell} ipython3
-sns.kdeplot(data=kdeViz_df, x="First Dimension", y="Second Dimension")
+az.plot_pair(data=approx.sample(10000))
 plt.show()
 ```
