.. _gp_guide:

******************
Gaussian Processes
******************

GP Basics
=========

Sometimes an unknown parameter or variable in a model is not a scalar value or
a fixed-length vector, but a *function*. A Gaussian process (GP) can be used
as a prior probability distribution whose support is over the space of
continuous functions. A GP prior on the function :math:`f(x)` is usually written,

.. math::

   f(x) \sim \mathcal{GP}(m(x), \, k(x, x')) \,.

The function values are modeled as a draw from a multivariate normal
distribution that is parameterized by the mean function, :math:`m(x)`, and the
covariance function, :math:`k(x, x')`. Gaussian processes are a convenient
choice as priors over functions due to the marginalization and conditioning
properties of the multivariate normal distribution. Usually, the marginal
distribution over :math:`f(x)` is evaluated during the inference step. The
conditional distribution is then used for predicting the function values
:math:`f(x_*)` at new points, :math:`x_*`.

The joint distribution of :math:`f(x)` and :math:`f(x_*)` is multivariate
normal,

.. math::

   \begin{bmatrix} f(x) \\ f(x_*) \\ \end{bmatrix} \sim
   \text{N}\left(
     \begin{bmatrix} m(x) \\ m(x_*) \\ \end{bmatrix} \,,
     \begin{bmatrix} k(x, x')   &  k(x, x_*)    \\
                     k(x_*, x)  &  k(x_*, x_*') \\ \end{bmatrix}
   \right) \,.

Starting from the joint distribution, one obtains the marginal distribution
of :math:`f(x)` as :math:`\text{N}(m(x),\, k(x, x'))`. The conditional
distribution is

.. math::

   f(x_*) \mid f(x) \sim \text{N}\left( k(x_*, x) k(x, x)^{-1} [f(x) - m(x)] + m(x_*) ,\,
     k(x_*, x_*) - k(x_*, x) k(x, x)^{-1} k(x, x_*) \right) \,.

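As a concrete illustration of these formulas, the following NumPy sketch computes
the conditional mean and covariance directly, assuming a zero mean function and an
exponentiated quadratic kernel; the data, the helper function :code:`k`, and the
hyperparameter values are placeholders chosen only for the example::

    import numpy as np

    def k(a, b, ls=1.0):
        # exponentiated quadratic covariance between two sets of 1-D points
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

    x = np.linspace(0, 5, 20)        # observed input locations
    x_star = np.linspace(5, 7, 10)   # new input locations
    f = np.sin(x)                    # hypothetical observed function values

    K = k(x, x) + 1e-8 * np.eye(len(x))  # add jitter for numerical stability
    K_s = k(x, x_star)
    K_ss = k(x_star, x_star)

    # conditional mean and covariance of f(x_*) given f(x), with m(x) = 0
    mu_star = K_s.T @ np.linalg.solve(K, f)
    cov_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)
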
.. note::

   For more information on GPs, check out the book `Gaussian Processes for
   Machine Learning <http://www.gaussianprocess.org/gpml/>`_ by Rasmussen &
   Williams, or `this introduction <http://www.inference.org.uk/mackay/gpB.pdf>`_
   by D. Mackay.

PyMC is a great environment for working with fully Bayesian Gaussian Process
models. GPs in PyMC have a clear syntax and are highly composable; many
predefined covariance functions (or kernels) and mean functions are included,
along with several GP implementations. GPs are treated as distributions that
can be used within larger or hierarchical models, not just as standalone
regression models.

Mean and covariance functions
=============================

Those who have used the GPy or GPflow Python packages will find the syntax for
constructing mean and covariance functions somewhat familiar. When first
instantiated, the mean and covariance functions are parameterized, but not yet
given their inputs. The covariance functions must additionally be provided
with the number of dimensions of the input matrix, and a list that indexes
which of those dimensions they are to operate on. The reason for this design
is so that covariance functions can be constructed that are combinations of
other covariance functions.

For example, to construct an exponentiated quadratic covariance function that
operates on the second and third columns of a three-column matrix representing
three predictor variables::

    ls = [2, 5]  # the lengthscales
    cov_func = pm.gp.cov.ExpQuad(input_dim=3, ls=ls, active_dims=[1, 2])

Here the :code:`ls` (lengthscale) parameter is two dimensional, allowing the
second and third dimensions to have different lengthscales. The reason we have
to specify :code:`input_dim`, the total number of columns of :code:`X`, and
:code:`active_dims`, which of those columns or dimensions the covariance
function will act on, is that :code:`cov_func` hasn't actually seen the
input data yet. The :code:`active_dims` argument is optional, and defaults to
all columns of the matrix of inputs.

Covariance functions in PyMC closely follow the algebraic rules for kernels,
which allows users to combine covariance functions into new ones, for example:

- The sum of two covariance functions is also a covariance function::

    cov_func = pm.gp.cov.ExpQuad(...) + pm.gp.cov.ExpQuad(...)

- The product of two covariance functions is also a covariance function::

    cov_func = pm.gp.cov.ExpQuad(...) * pm.gp.cov.Periodic(...)

- The product (or sum) of a covariance function with a scalar is a
  covariance function::

    cov_func = eta**2 * pm.gp.cov.Matern32(...)

After the covariance function is defined, it is a callable that is evaluated
with :code:`cov_func(x, x)` (or :code:`mean_func(x)`). Since PyMC is built on
top of PyTensor, it is relatively easy to define and experiment with
non-standard covariance and mean functions. For more information check out the
:ref:`tutorial <GP-MeansAndCovs>` on covariance functions.
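
For instance, a combined covariance function can be built from the predefined
kernels and evaluated on a set of inputs; the particular kernels, hyperparameter
values, and input grid below are placeholders chosen only for illustration::

    import numpy as np
    import pymc as pm

    X = np.linspace(0, 10, 50)[:, None]

    # a scaled Matern 3/2 kernel plus a small white-noise term
    eta = 2.0
    cov_func = (eta**2 * pm.gp.cov.Matern32(input_dim=1, ls=1.5)
                + pm.gp.cov.WhiteNoise(1e-6))

    K = cov_func(X, X).eval()              # full covariance matrix as a NumPy array
    K_cross = cov_func(X, X[:10]).eval()   # cross-covariance between X and a subset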

GP Implementations
==================

PyMC includes several GP implementations, including marginal and latent
variable models and also some fast approximations. Their usage all follows a
similar pattern: First, a GP is instantiated with a mean function and a
covariance function. Then, GP objects can be added together, allowing for
function characteristics to be carefully modeled and separated. Finally, one
of the `prior`, `marginal_likelihood`, or `conditional` methods is called on
the GP object to actually construct the PyMC random variable that represents
the function prior.

Using :code:`gp.Latent` for the example, the syntax to first specify the GP
is::

    gp = pm.gp.Latent(mean_func, cov_func)

The first argument is the mean function and the second is the covariance
function. We've made the GP object, but we haven't made clear which function
it is to be a prior for, what the inputs are, or what parameters it will be
conditioned on.

.. note::

   The :code:`gp.Marginal` class and similar don't have a :code:`prior` method.
   Instead they have a :code:`marginal_likelihood` method that is used similarly,
   but has additional required arguments, such as the observed data or noise,
   depending on the implementation. See the notebooks for examples. The
   :code:`conditional` method works similarly.

Calling the `prior` method will create a PyMC random variable that represents
the latent function :math:`f(x) = \mathbf{f}`::

    f = gp.prior("f", X)

:code:`f` is a random variable that can be used within a PyMC model like any
other type of random variable. The first argument is the name of the random
variable representing the function we are placing the prior over.
The second argument is the inputs to the function that the prior is over,
:code:`X`. The inputs are usually known and present in the data, but they can
also be PyMC random variables. If the inputs are a PyTensor tensor or a
PyMC random variable, the :code:`shape` needs to be given.

Usually at this point, inference is performed on the model. The
:code:`conditional` method creates the conditional, or predictive,
distribution over the latent function at arbitrary :math:`x_*` input points,
:math:`f(x_*)`. To construct the conditional distribution we write::

    f_star = gp.conditional("f_star", X_star)

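Putting the pieces together, a minimal sketch of how :code:`gp.Latent` might be
used end to end looks like the following; the data, the prior on the lengthscale,
and the Bernoulli likelihood are placeholder choices for illustration, not a
prescribed recipe::

    import numpy as np
    import pymc as pm

    # hypothetical data: 30 one-dimensional inputs with binary outcomes
    X = np.linspace(0, 10, 30)[:, None]
    y = np.random.binomial(1, 0.5, size=30)
    X_star = np.linspace(10, 12, 15)[:, None]  # new inputs for prediction

    with pm.Model() as model:
        ls = pm.Gamma("ls", alpha=2.0, beta=1.0)
        cov_func = pm.gp.cov.ExpQuad(input_dim=1, ls=ls)
        gp = pm.gp.Latent(cov_func=cov_func)

        # latent function values at the observed inputs
        f = gp.prior("f", X=X)

        # the latent GP is used as the logit of a Bernoulli likelihood
        pm.Bernoulli("obs", logit_p=f, observed=y)

        idata = pm.sample()

    with model:
        # predictive distribution of the latent function at the new inputs
        f_star = gp.conditional("f_star", X_star)
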
.. _additive_gp:

Additive GPs
============

The GP implementation in PyMC is constructed so that it is easy to define
additive GPs and sample from individual GP components. We can write::

    gp1 = pm.gp.Marginal(mean_func1, cov_func1)
    gp2 = pm.gp.Marginal(mean_func2, cov_func2)
    gp3 = gp1 + gp2

The GP objects have to be of the same type; :code:`gp.Marginal` cannot
be added to :code:`gp.Latent`.

Consider two independent GP distributed functions, :math:`f_1(x) \sim
\mathcal{GP}\left(m_1(x),\, k_1(x, x')\right)` and :math:`f_2(x) \sim
\mathcal{GP}\left( m_2(x),\, k_2(x, x')\right)`. The joint distribution of
:math:`f_1,\, f_1^*,\, f_2,\, f_2^*,\, f_1 + f_2` and :math:`f_1^* + f_2^*` is

.. math::

   \begin{bmatrix} f_1 \\ f_1^* \\ f_2 \\ f_2^*
   \\ f_1 + f_2 \\ f_1^* + f_2^* \end{bmatrix} \sim
   \text{N}\left(
     \begin{bmatrix} m_1 \\ m_1^* \\ m_2 \\ m_2^* \\
     m_1 + m_2 \\ m_1^* + m_2^* \\ \end{bmatrix} \,,\,
     \begin{bmatrix}
       K_1       & K_1^*    & 0         & 0        & K_1                 & K_1^*             \\
       K_1^{*^T} & K_1^{**} & 0         & 0        & K_1^*               & K_1^{**}          \\
       0         & 0        & K_2       & K_2^*    & K_2                 & K_2^{*}           \\
       0         & 0        & K_2^{*^T} & K_2^{**} & K_2^{*}             & K_2^{**}          \\
       K_1       & K_1^{*}  & K_2       & K_2^{*}  & K_1 + K_2           & K_1^{*} + K_2^{*} \\
       K_1^{*^T} & K_1^{**} & K_2^{*^T} & K_2^{**} & K_1^{*^T}+K_2^{*^T} & K_1^{**}+K_2^{**}
     \end{bmatrix}
   \right) \,.

Using the joint distribution to obtain the conditional distribution of :math:`f_1^*`
with the contribution due to :math:`f_1 + f_2` factored out, we get

.. math::

   f_1^* \mid f_1 + f_2 \sim \text{N}\left(
     m_1^* + K_1^{*^T}(K_1 + K_2)^{-1}\left[f_1 + f_2 - m_1 - m_2\right] \,,\,
     K_1^{**} - K_1^{*^T}(K_1 + K_2)^{-1}K_1^* \right) \,.

These equations show how to break down GP models into individual components to see how each
contributes to the data. For more information, check out `David Duvenaud's PhD
thesis <https://www.cs.toronto.edu/~duvenaud/thesis.pdf>`_.

The GP objects in PyMC keep track of these marginals automatically. The
following code sketch shows how to define the conditional distribution of
:math:`f_2^*`. We use `gp.Marginal` in the example, but the same works for
other implementations. The first block fits the GP prior. We denote
:math:`f_1 + f_2` as just :math:`f` for brevity::

    with pm.Model() as model:
        gp1 = pm.gp.Marginal(mean_func1, cov_func1)
        gp2 = pm.gp.Marginal(mean_func2, cov_func2)

        # gp represents f1 + f2.
        gp = gp1 + gp2

        f = gp.marginal_likelihood("f", X, y, sigma)

        idata = pm.sample(1000)

To construct the conditional distribution of :code:`gp1` or :code:`gp2`, we
also need to include the additional arguments, :code:`X`, :code:`y`, and
:code:`sigma`::

    with model:
        # conditional distributions of f1 and f2
        f1_star = gp1.conditional("f1_star", X_star,
                                  given={"X": X, "y": y, "sigma": sigma, "gp": gp})
        f2_star = gp2.conditional("f2_star", X_star,
                                  given={"X": X, "y": y, "sigma": sigma, "gp": gp})

        # conditional of f1 + f2, `given` not required
        f_star = gp.conditional("f_star", X_star)

This second block produces the conditional distributions. Notice that extra
arguments are required for the conditionals of :math:`f_1` and :math:`f_2`, but
not :math:`f`. This is because those arguments are cached when
:code:`.marginal_likelihood` is called on :code:`gp`.

.. note::
   When constructing conditionals, the additional arguments :code:`X`, :code:`y`,
   :code:`sigma` and :code:`gp` must be provided as a dict called :code:`given`!

Since the marginal likelihood methods of :code:`gp1` and :code:`gp2` weren't called,
their conditionals need to be provided with the required inputs. In the same
fashion as the prior, :code:`f_star`, :code:`f1_star` and :code:`f2_star` are random
variables that can now be used like any other random variable in PyMC.
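
For example, posterior samples of the individual components at the new inputs can
then be drawn with :code:`pm.sample_posterior_predictive`; the sketch below simply
continues the hypothetical model defined above::

    with model:
        # draw samples of each conditional using the posterior from pm.sample
        pred = pm.sample_posterior_predictive(
            idata, var_names=["f1_star", "f2_star", "f_star"]
        )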

Check the :ref:`notebooks <gaussian_processes>`
for detailed demonstrations of the usage of GP functionality in PyMC.