
Commit d4afe49

fix links
1 parent 81a25b4 commit d4afe49

2 files changed: +14 −12 lines changed

examples/references.bib

Lines changed: 7 additions & 5 deletions
@@ -73,11 +73,13 @@ @book{mcelreath2018statistical
 publisher={Chapman and Hall/CRC}
 }
 
-@article{mnih2013atari,
-title={Playing Atari with Deep Reinforcement Learning},
-author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin},
-url={http://arxiv.org/abs/1312.5602},
-year={2013}
+@misc{mnih2013playing,
+title={Playing Atari with Deep Reinforcement Learning},
+author={Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and Daan Wierstra and Martin Riedmiller},
+year={2013},
+eprint={1312.5602},
+archivePrefix={arXiv},
+primaryClass={cs.LG}
 }
 
 @article{silver2016masteringgo,

examples/variational_inference/bayesian_neural_network_advi.ipynb

Lines changed: 7 additions & 7 deletions
@@ -32,7 +32,7 @@
 "\n",
 "### Deep Learning\n",
 "\n",
-"Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013atari`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with {cite:t}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows, as the statistical properties are still not fully understood.\n",
+"Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013playing`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders {cite:p}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows, as the statistical properties are still not fully understood.\n",
 "\n",
 "A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:\n",
 "* Speed: leveraging the GPU allowed for much faster processing.\n",
@@ -47,8 +47,8 @@
 "* **Uncertainty in predictions**: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.\n",
 "* **Uncertainty in representations**: We also get uncertainty estimates of our weights, which could inform us about the stability of the learned representations of the network.\n",
 "* **Regularization with priors**: Weights are often L2-regularized to avoid overfitting; this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).\n",
-"* **Transfer learning with informed priors**: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like {cite:t}`szegedy2014going`.\n",
-"* **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on [Hierarchical Linear Regression in PyMC3](http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)). Applied to Neural Networks, in hierarchical data sets we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.\n",
+"* **Transfer learning with informed priors**: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet {cite:p}`szegedy2014going`.\n",
+"* **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on [Hierarchical Linear Regression in PyMC3](https://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)). Applied to Neural Networks, in hierarchical data sets we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.\n",
 "* **Other hybrid architectures**: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge."
 ]
 },
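
The "Regularization with priors" bullet in the hunk above notes that an L2 penalty on the weights corresponds to a zero-mean Gaussian prior. As a minimal sketch of that correspondence (not taken from this commit; `n_in`, `n_hidden`, and the variable names are illustrative), one weight matrix of such a Bayesian network could be declared in PyMC3 like this:

```python
import pymc3 as pm

n_in, n_hidden = 5, 3  # illustrative layer sizes

with pm.Model() as bnn:
    # A zero-mean Normal prior on the weights plays the role of an L2 penalty;
    # its standard deviation sets the regularization strength.
    weights_in_1 = pm.Normal("w_in_1", mu=0.0, sd=1.0, shape=(n_in, n_hidden))
    # A Laplace or spike-and-slab prior here would instead push weights toward
    # sparsity, much like an L1 penalty.
```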
@@ -246,9 +246,9 @@
 "source": [
 "### Variational Inference: Scaling model complexity\n",
 "\n",
-"We could now just run an MCMC sampler like {class}`pymc.step_methods.hmc.nuts`, which works pretty well in this case, but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.\n",
+"We could now just run an MCMC sampler like {class}`~pymc3.step_methods.hmc.nuts.NUTS`, which works pretty well in this case, but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.\n",
 "\n",
-"Instead, we will use the brand-new {class}`pymc.variational.inference` variational inference algorithm which was recently added to `PyMC3`, and updated to use the operator variational inference (OPVI) framework. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior."
+"Instead, we will use the {class}`~pymc3.variational.inference.ADVI` variational inference algorithm which was added to `PyMC3`, and updated to use the operator variational inference (OPVI) framework. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior."
 ]
 },
 {
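
As a rough, hedged illustration of the two options this hunk contrasts, assuming the `neural_network` model context built earlier in the notebook (the draw and iteration counts below are arbitrary):

```python
import pymc3 as pm

with neural_network:
    # Option 1: NUTS -- accurate, but increasingly slow for deeper architectures.
    # trace = pm.sample(draws=1000, tune=1000)

    # Option 2: mean-field ADVI (built on the OPVI framework) -- much faster and
    # more scalable, at the cost of ignoring posterior correlations.
    approx = pm.fit(n=30000, method=pm.ADVI())
    trace = approx.sample(draws=5000)
```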
@@ -382,7 +382,7 @@
 "source": [
 "Now that we have trained our model, let's predict on the hold-out set using a posterior predictive check (PPC).\n",
 "\n",
-"1. We can use {class}`pymc.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).\n",
+"1. We can use {func}`~pymc3.sampling.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).\n",
 "2. It is better to get the node directly and build the theano graph using our approximation (`approx.sample_node`); this gives us a big speed-up."
 ]
 },
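
A hedged sketch of the two PPC routes listed above, reusing objects assumed from earlier notebook cells (`neural_network`, the shared containers `ann_input`/`ann_output`, hold-out data `X_test`/`Y_test`, and the ADVI `approx` and `trace`); the draw counts are arbitrary:

```python
import pymc3 as pm

# Route 1: swap the hold-out data into the shared containers, then sample the
# observed node "out" from the posterior predictive.
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
with neural_network:
    ppc = pm.sample_posterior_predictive(trace, samples=500)
pred = ppc["out"].mean(axis=0) > 0.5

# Route 2: build a symbolic node straight from the approximation with
# approx.sample_node and evaluate it; this skips the per-draw overhead of
# route 1 and is typically much faster.
proba_node = approx.sample_node(neural_network.out.distribution.p, size=500)
pred_fast = proba_node.eval().mean(axis=0) > 0.5
```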
@@ -859,7 +859,7 @@
 "source": [
 "(c) 2017 by Thomas Wiecki, updated by Maxim Kochurov\n",
 "\n",
-"Original blog post: http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/"
+"Original [blog post](https://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/)."
 ]
 },
 {
