
SparseLatent GP #2951


Closed · wants to merge 20 commits

Conversation

smartass101
Contributor

This is an attempt based on this notebook by @bwengals to implement the DTC and FITC sparse approximations.

I tried to read through Quiñonero-Candela & Rasmussen (2005) for the covariance equations. Essentially I substituted sigma=0 in them and reduced them to simpler yet still computationally feasible forms. But it's quite likely my code is suboptimal or outright wrong.

  • the mean_func and mean_total usage might be wrong; need to check the literature
  • could probably use some matrix computation optimization
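
For orientation, here is a rough numpy sketch of what I mean by the sigma -> 0 forms (illustration only, not the PR code; the toy kernel, inputs and values of u below are made up, and mean functions are taken to be zero):

# DTC/FITC with sigma -> 0, toy numpy version (not the PR implementation)
import numpy as np

def expquad(A, B, ell=1.0):
    # toy ExpQuad kernel on 1-D inputs
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.linspace(0, 10, 200)     # full input locations
Xu = np.linspace(0, 10, 15)     # inducing point locations
u = np.sin(Xu)                  # some values of the latent GP at Xu

Kuu = expquad(Xu, Xu) + 1e-8 * np.eye(len(Xu))
Kfu = expquad(X, Xu)

# DTC: f is a purely deterministic projection of u onto X
f_dtc = Kfu @ np.linalg.solve(Kuu, u)

# FITC: add independent noise whose variance is the pointwise error of the
# Nystrom approximation Qff = Kfu Kuu^{-1} Kuf
Qff_diag = np.sum(Kfu * np.linalg.solve(Kuu, Kfu.T).T, axis=1)
var = np.clip(1.0 - Qff_diag, 0.0, None)   # diag(Kff) = 1 for this ExpQuad
f_fitc = f_dtc + np.random.default_rng(0).normal(scale=np.sqrt(var))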
@junpenglao junpenglao requested a review from bwengals April 25, 2018 16:01
@junpenglao junpenglao added the WIP label Apr 25, 2018
@junpenglao junpenglao changed the title [WIP] SparseLatent GP SparseLatent GP Apr 25, 2018
getting an -inf logp, the models are wrong for some reason
also use more descriptive names for vars
@smartass101
Contributor Author

The DTC approximation passes the test suite as of 768fa70. FITC gives me a very different logp for the prior and -inf with the conditional; it needs some more work.

@smartass101
Contributor Author

Splitting the test suite in ab6e01d shows that the DTC prior approximation is good to 1% relative error, but the DTC prior+conditional only to 5%. Not sure whether this suggests a problem or is realistic.

@ColCarroll
Member

This is really cool! Could you put a link to the paper in the docstring for the implementation, as a reference?

pymc3/gp/gp.py Outdated
----------
cov_func : None, 2D array, or instance of Covariance
The covariance function. Defaults to zero.
Must implement the `diag` method for the FITC approximation
Contributor

all of them must implement the diag method. I think there's no need to specify this.

Contributor Author

Perhaps I should reformulate this to say that the cov_func has to implement a computationally efficient diag method (i.e. not fall back to tt.diagonal(full(X, X))) for the approximation to actually be sparse?

pymc3/gp/gp.py Outdated
Kff_diag = self.cov_func.diag(X)
# MvNormal with diagonal cov is Normal with sd=cov**0.5
var = tt.clip(Kff_diag - Qff_diag, 0, np.inf)
f = pm.Normal(name, mu=f_, tau=tt.inv(var), shape=shape)
Contributor

Why use tau and not sd? If there is a reason this is preferred, would you mind changing this in MarginalSparse while you're at it?

Contributor

also, shape should be the shape of f here, not the shape of u. Sorry if you already know and just aren't quite done with this part yet.

Contributor Author

  • I was thinking that sd=tt.sqrt(var) might be computationally more expensive than tau=tt.inv(var). But I could be wrong (especially in terms of numerical stability); please correct me on this.
  • good catch with the shape parameters, maybe that's why FITC isn't working. I suggest slightly changing the implementation of infer_shape to
def infer_shape(X, kwargs=None, shape_kw='shape'):
    if kwargs is None:
        try:
            shape = np.int(X.shape[0])
        except TypeError:
            raise TypeError("Cannot infer '%s', provide as a keyword argument" % shape_kw)
    else:
        shape = kwargs.pop(shape_kw)
    return shape

and in the docstring mention that if X and/or Xu are tensors, the keyword arguments shape and/or shape_u must be provided, respectively. Also, should I provide **kwargs to the second Normal for FITC as well?

Contributor

Not sure if it's more expensive computationally. Should profile it.

I suggest slightly changing the implementation

Good call

Also, should I provide **kwargs to the second Normal for FITC as well?

I'm thinking no... any reason you can think of at all for someone to pass kwargs there?
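
For the profiling question, something like this would do as a quick check (plain Theano profiling, nothing specific to this PR; the sizes and repetition count are arbitrary):

# Compare tt.sqrt(var) vs tt.inv(var) with standard Theano profiling
import numpy as np
import theano
import theano.tensor as tt

var = tt.vector('var')
f_sd = theano.function([var], tt.sqrt(var), profile=True)
f_tau = theano.function([var], tt.inv(var), profile=True)

x = (np.random.rand(100000) + 0.1).astype(theano.config.floatX)
for _ in range(1000):
    f_sd(x)
    f_tau(x)
f_sd.profile.summary()    # per-op timing for sqrt
f_tau.profile.summary()   # per-op timing for inv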

pymc3/gp/gp.py Outdated
f = pm.Deterministic(name, f_)
elif self.approx == 'FITC':
Qff_diag = project_inverse(Kfu, Luu, diag=True)
Kff_diag = self.cov_func.diag(X)
Contributor

self.cov_func(X, diag=True)

pymc3/gp/gp.py Outdated
shape = infer_shape(Xu, kwargs.pop("shape", None))
v = pm.Normal(name + "_u_rotated_", mu=0.0, sd=1.0, shape=shape, **kwargs)
u_ = self.mean_func(Xu) + tt.dot(Luu, v) # mean + chol method of MvGaussian
u = pm.Deterministic(name+'_u', u_) # (m,) prior at inducing points
Contributor

just a style thing: name + '_u' (spaces around the operator)

pymc3/gp/gp.py Outdated
shape = infer_shape(Xu, kwargs.pop("shape", None))
v = pm.Normal(name + "_u_rotated_", mu=0.0, sd=1.0, shape=shape, **kwargs)
u_ = self.mean_func(Xu) + tt.dot(Luu, v) # mean + chol method of MvGaussian
u = pm.Deterministic(name+'_u', u_) # (m,) prior at inducing points
Contributor

small thing, would prefer:

u = pm.Deterministic(name+'_u', self.mean_func(Xu) + tt.dot(Luu, v))

I think this is common when writing pymc3 programs, e.g. y_ = Normal('y', ..., observed=y), so as to not overwrite the data y with a symbolic variable. I don't know if it's good to have this naming convention in source code.

@@ -9,6 +9,24 @@
solve_upper = tt.slinalg.Solve(A_structure='upper_triangular')
solve = tt.slinalg.Solve(A_structure='general')

def invert_dot(L, X):
Contributor

These are great! For consistency though, it would be good to use these functions throughout the gp source code (I know, a pain) or just write the theano ops as before.
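
(For context, since the diff is collapsed here: invert_dot presumably boils down to two triangular solves against the Cholesky factor, roughly like the hypothetical sketch below, not necessarily the PR's exact code.)

# Hypothetical sketch of invert_dot: given the lower Cholesky factor L of K,
# compute K^{-1} x with two triangular solves instead of an explicit inverse.
import theano.tensor as tt

solve_lower = tt.slinalg.Solve(A_structure='lower_triangular')
solve_upper = tt.slinalg.Solve(A_structure='upper_triangular')

def invert_dot(L, x):
    # K^{-1} x = L^{-T} (L^{-1} x)   when K = L L^T
    return solve_upper(L.T, solve_lower(L, x))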

@bwengals
Contributor

@ColCarroll The paper, strictly speaking, only covers the case where the likelihood is also Gaussian. The main thing I can find on FITC for non-Gaussian likelihoods is this paper, but the inference method there was not MCMC.

@smartass101 This is great work, thank you for submitting! We should take a look at FITC with MCMC on some simulated data to get a feel for how it does before including it in the code base. DTC is simple to implement and seems to work alright, based on this gist. Some variational sparse approaches (see GPflow) also occupy this niche, and don't exhibit some of the pathologies of DTC or FITC.

@smartass101
Contributor Author

Regarding the paper(s), I must confess that I'm no statistician, just a physicist trying to get reasonable error bars for large datasets (hence the need for sparsity) of noisy, heteroscedastic measurements (plasma turbulence). I wish to use the LatentSparse approximation for the noise parameters (var and cov).
So my understanding of the paper(s) is rather shallow. I was overwhelmed by the number of papers on the sparse GP topic. After skimming through a few of them, my intuition tells me that any more advanced implementation (especially a variational one) will be easier to code as a logp, as is done in MarginalSparse. I was also intrigued by this recent paper on a new unifying sparse GP framework, which goes even further, to power expectation propagation.

The variational approaches in GPflow that you mention, @bwengals, don't seem to be easily usable as latent variables in a larger model, which is what I need.

@philastrophist
Contributor

I also have a use-case for LatentSparse (specifically .prior): incorporation of a time-varying parameter into an otherwise static model. I wish to use all 13,000 data points, which I cannot do without a sparse implementation.

Just letting you know that other use-cases exist and that this is a really worthwhile endeavour!

@ColCarroll
Member

Thank you both for helping me out with references (or why there isn't one!)

I have been reading about GPs and sparse latent GPs for the past two weeks, and am finally getting to a discussion of using MCMC vs an approximation to fit them. This looks like a super interesting PR!

@smartass101
Contributor Author

After some digging through the math I came to the conclusion that for X == Xu the FITC approximation degenerates to DTC, because Lambda = diag(Kff) - diag(Qff) goes to 0. However, due to rounding errors and Cholesky stabilization it is not exactly 0 but some small residual value, so there is some offset in the FITC prior logp. The conditional logp (which is at least finite now) is even worse, because there are divisions by Lambda. Therefore, it might be better to use a different approach for the FITC test. Any ideas?

I also had a look at VFE by Titsias, but I don't think that it's applicable, because the regularizer term trace(Kff) / sigma^2 goes to 0 for sigma=0.

Thank you for the support and interest @philastrophist and @ColCarroll.

@smartass101
Contributor Author

Bad news: the DTC conditional was wrong, and after fixing it, it fails as well. Previously I was wrongly clipping the whole covariance matrix, not just the diagonal variance. So DTC seems to suffer from the same problem: Kss - Qss degenerates to some ugly, unsolvable K. In the case of FITC it is even worse.
We need a different testing strategy. I vote for placing the Xu points in between the X points. The logp isn't that different from Latent in my tests.

@bwengals
Contributor

bwengals commented May 3, 2018

Right, Kss - Qss should be zero when Kuu = Kff, so you're right, will have to use different Xu for the test.

So I got your FITC implementation to work and tried it on a toy problem. I found that sampling is very slow, much slower than DTC. One of the difficulties of sampling from GPs is that there are strong and complex correlations between f and the covariance function hyperparameters. This makes sense, since doing something like changing a lengthscale parameter will have a big impact across all of f. With the DTC approximation, the only random variable added to the graph is u, (f is deterministic) while with FITC, we have both u and f = pm.Normal(name, mu=f_, tau=tt.inv(var), shape=shape), which compounds the problem.

From what I understand, the variational approximations depend on the likelihood, meaning you can't really do something like what @smartass101 wants to do by modeling the variance with a GP. I'm leaning toward including just DTC in this PR. What do you think? I don't understand EP very well, so can't say much about the paper you linked.

@ColCarroll From what I gather the main reason the variational approaches are generally superior to FITC or DTC and such is that the inducing point locations are themselves variational parameters, so there's no danger of overfitting when optimizing their locations, which happens for things like FITC and DTC. Also, working with a lower bound of the posterior is a bit better theoretically than just an approximation, since it's hard to know how good the approximation is.

@smartass101
Contributor Author

smartass101 commented May 4, 2018

Thank you for the insightful comments @bwengals. So it seems that advanced approaches such as VFE or EP are derived for a specific (usually iid Gaussian) likelihood. My aim is to have a rather general sparse latent GP implementation which can be used as some unknown function to be inferred with rather arbitrary likelihoods.

will have to use different Xu for the test.

Yes, just not sure what to compare/test then.
Is there some other testing approach than comparing the logp? I couldn't find anything obvious during a quick look at the test suite.

So I got your FITC implementation to work

Do you mean it didn't work and you had to fix something, or just that the sampling was very slow and only worked on a toy problem?

One of the difficulties of sampling from GPs is that there are strong and complex correlations

I thought the reparametrization u = m + Lv was supposed to help with that as mentioned in the Latent.prior docs. Do you have any literature on why the reparametrization helps? Maybe we could gain some insight from there.
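
For reference, my understanding of the rotated parametrization as a rough self-contained sketch (not the actual Latent.prior source; the toy model below is made up):

# Rotated (non-centered) parametrization: v gets an isotropic N(0, 1) prior and
# the coupling to the hyperparameters moves into the deterministic map f = L(ell) v.
import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.linspace(0, 10, 50)[:, None]

with pm.Model():
    ell = pm.Gamma("ell", alpha=2.0, beta=1.0)
    cov = pm.gp.cov.ExpQuad(1, ls=ell)
    L = tt.slinalg.cholesky(pm.gp.util.stabilize(cov(X)))
    # the centered alternative would be f ~ MvNormal(0, K), with f and ell strongly coupled
    v = pm.Normal("v", mu=0.0, sd=1.0, shape=50)
    f = pm.Deterministic("f", tt.dot(L, v))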

while with FITC, we have both u and f = Normal(name, mu=f_, tau=tt.inv(var), shape=shape)

For a stationary kernel (and in many non-stationary cases as well) the diagonal var is effectively independent (or should be, according to my understanding) of the lengthscale. So the correlations between f and the lengthscales should still stem mainly from the deterministic mean f_, as is the case with DTC. In fact, I would expect the extra stochasticity about the mean due to var to reduce correlations between f and the lengthscales.

I'm leaning toward including just DTC in this PR. What do you think?

I'd put some more effort into checking whether the FITC implementation is at least correct and then include it as well, but make the default approx='DTC' and warn in the docs that FITC can be much slower.

@smartass101
Contributor Author

smartass101 commented May 4, 2018

A possible reparametrization could be to use

v2 = pm.Normal(name + '_FITC_noise_', mu=0.0, tau=tt.inv(var), shape=shape)
return pm.Deterministic('f', f_ + v2)

I don't see it as very different from the current approach, but the logp added to the model by pm.Normal would be different, i.e. would not contain (value - mu)**2 which could perhaps be a source of further correlations.

Perhaps reparametrizing the .conditional to u = m + L v could also help? Or is there some guidance on when to use MvNormal and when to use the rotated reparametrization?

mu = self.mean_func(X) # (n,)
Luu = cholesky(stabilize(self.cov_func(Xu))) # (m, m) \sqrt{K_u}
shape_u = infer_shape(Xu, kwargs.pop("shape_u", None))
shape = infer_shape(X, kwargs.pop("shape", None))
Contributor

what about shape_f?

@bwengals
Contributor

bwengals commented May 7, 2018

How do FITC and DTC compare on a more realistic problem? Things look great so far. Will do a more thorough review soon.

@smartass101
Contributor Author

Tried regressions with 2 LatentSparse priors (one for the mean, the other for the variance) on 22,673 points with 20 inducing points. The DTC approximation took ~50 min for 1000 samples (average ~3 s/it) and the results look rather nice, but there is a lot of correlation. Will try with FITC now and report back. So far FITC looks slower; we'll see by how much.

@smartass101
Contributor Author

FITC failed with

ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.

after sampling at 11.87s/it.

But I think this is more a problem with my model (misspecified covariance); it may not reflect the general performance of FITC.

Perhaps a more relevant benchmark would be to use one of the Latent example notebooks and scale it up to some larger number of points. What do you think @bwengals ?

@bwengals
Contributor

Wow, that's pretty fast. How much RAM do you have, btw? Yeah, I think that's a good idea. These results are on your problem, not a toy problem?

@smartass101
Contributor Author

It's just my notebook with 16 GB RAM and an i7-5500U, nothing super fancy, but decent for experiments. Yes, this is on my (still somewhat simplified) problem.
I tried to re-implement your original DTC notebook @bwengals, but I get very strange ppc samples. See here. And sometimes that FITC ValueError as well. I also enhanced the plot_gp_dist function to make labeling simpler.

@@ -71,7 +89,7 @@ def setter(self, val):
return gp_wrapper


def plot_gp_dist(ax, samples, x, plot_samples=True, palette="Reds"):
def plot_gp_dist(ax, samples, x, plot_samples=True, palette="Reds", label=None):
Contributor

nice

@bwengals
Contributor

I have 16gb too and something that big definitely doesn't fit for me... Been trying to figure out what's up with that since I don't have that problem with GPflow or GPy.

Oh yeah, those don't look right, like some numerical problem is happening. The overall shape looks ok but the individual draws are messed up. I'll think about what that could be and get back soon.

@smartass101
Contributor Author

I have 16gb too and something that big definitely doesn't fit for me... Been trying to figure out what's up with that

pymc3 3.4 was supposed to fix some memory inefficiencies; not sure which version you have. To be clear, a full GP doesn't fit into my RAM, not even into 150 GB (I tried on our cluster).

but the individual draws are messed up.

Yes. For some reason the distribution of points and its median isn't completely wrong. Maybe some random sub-selections of data? The posterior does not have these issues though.

@bwengals
Contributor

I found the issue, at least with the DTC version. I still need to think carefully about translating FITC to the latent implementation. Mainly, I implemented in terms of u, instead of f. So I changed the _get_given_vals function and _build_conditional,

def _get_given_vals(self, given):
    if given is None:
        given = {}
    if 'gp' in given:
        # additive case: use the total covariance/mean of the parent gp
        cov_total = given['gp'].cov_func
        mean_total = given['gp'].mean_func
    else:
        cov_total = self.cov_func
        mean_total = self.mean_func
    if all(val in given for val in ['X', 'u', 'Xu']):
        X, Xu, u = given['X'], given['Xu'], given['u']
    else:
        X, Xu, u = self.X, self.Xu, self.u
    return X, Xu, u, cov_total, mean_total


def _build_conditional(self, Xnew, X, Xu, u, cov_total, mean_total):
    Kuu = cov_total(Xu)
    Luu = cholesky(stabilize(Kuu))
    r = u - mean_total(Xu)
    Kuuiu = invert_dot(Luu, r)          # Kuu^{-1} (u - m(Xu))
    Ksu = self.cov_func(Xnew, Xu)
    mus = self.mean_func(Xnew) + tt.dot(Ksu, Kuuiu)
    tmp = tt.slinalg.solve_lower_triangular(Luu, Ksu.T)
    Qss = tt.dot(tmp.T, tmp)            # Ksu Kuu^{-1} Kus
    Kss = self.cov_func(Xnew)
    return mus, pm.gp.util.stabilize(Kss - Qss)

The key spot being

r = u - mean_total(Xu)
Kuuiu = invert_dot(Luu, r)

before, things were in terms of f instead of u. It could be that the version in terms of f is correct, but has numerical issues. I haven't checked that.

@smartass101
Contributor Author

Thank you for the feedback @bwengals.
Your implementation is conceptually quite different, because it conditions on u instead of f. That means that any additive LatentSparse models or .conditional(given=...) calls would have to have u provided. This should also be reflected in the decorator line at the top of the class; in your case you should also modify

@conditioned_vars(["X", "Xu", "u"])
class LatentSparse(Latent):
...

Of course, conditioning only on u makes everything much simpler and reflects the actual conditional structure of this model much more closely. However, it deviates from the MarginalSparse model which conditions on y, i.e. f with additional noise.
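
Concretely, with the u-based version a conditional call would have to hand over u explicitly, something like this (hypothetical usage; the keys follow your modified _get_given_vals, and the variable names are placeholders):

# Hypothetical usage under the u-based conditional
fnew = gp.conditional("fnew", Xnew, given={"gp": gp_total, "X": X, "Xu": Xu, "u": u})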

The whole question then is what is the desired use case and how convenient it will be to work with the class. If we decide to condition on u, we should probably also make it easily accessible, perhaps even return it.

Perhaps we could also ditch the whole LatentSparse class altogether and just provide one function which constructs f from u according to the requested approximation. Perhaps that could be a Latent.conditional_approx method.

@bwengals
Contributor

bwengals commented May 22, 2018

You're right. So I didn't paste it, but I did also modify .prior to return both f and u. A bit of a hack that I think would work well is attaching u to f like, f._u = u before .prior returns it.

Perhaps we could also ditch the whole LatentSparse class altogether and just provide one function which constructs f from u according to the requested approximation. Perhaps that could be a Latent.conditional_approx method.

But you still get a big memory / speedup with the .prior method! I'm still totally on board with the LatentSparse class. This PR is challenging, so thanks for what you've done! I see the fastest route to completion being just doing DTC based on u. u could be attached as a hidden attribute of f to keep the interface clean. FITC could be done as an additional separate PR.

I also looked around, and DTC is also known as the Projected Process approximation. It's talked about in Ch 8 in the Rasmussen & Williams book.

@smartass101
Contributor Author

A bit of a hack that I think would work well is attaching u to f like, f._u = u before .prior returns it.

I suppose that would work (it would also require a modification of _get_givens), but I'm not in favor of non-transparent hacks. Perhaps @junpenglao, @ColCarroll and/or someone else from the core dev team should comment on this dilemma.

But you still get a big memory / speedup with the .prior method!

Indeed, but that would still be possible with a Latent.conditional_approx method. Such a method would simply return the DTC or FITC approximation that LatentSparse.prior currently returns. Effectively it would serve as an "interpolant" of the real Latent GP over a large number of points, which is probably the most common use case anyway.

The Latent.conditional method would still be usable for predictive samples without the apparent numerical issues and would transparently represent the actual conditional structure of the model, i.e. the DTC and FITC use an approximate training conditional, but a full predictive conditional is used. Both are conditioned on the same full GP.

I'm still totally on board with the LatentSparse class. This PR is challenging, so thanks for what you've done!

I'm slowly getting off-board. I don't feel the convenience of a class is worth the hacks and non-transparency.
Personally, I don't mind that I spent time writing up the LatentSparse class even if it is scrubbed, it was a worthwhile learning experience.

I also looked around, and DTC is also known as the Projected Process approximation. It's talked about in Ch 8 in the Rasmussen & Williams book.

Indeed. My feeling is that the DTC and FITC nomenclature is now the standard in newer articles. But we could add a.k.a. Projected Process in the docstring. There is also the argument of being consistent with the MarginalSparse class.

@junpenglao
Member

Thank you both for the detailed write-up. While I like the separation of the latent and marginal classes, @smartass101 is right that things should be more transparent. For example, I would support these two points:

If we decide to condition on u, we should probably also make it easily accessible, perhaps even return it.

Perhaps that could be a Latent.conditional_approx method.

@bwengals
Contributor

I guess we're not on the same wavelength here. Just gonna go carefully through our recent convo, and err on the side of being overly verbose:

Your implementation is conceptually quite different, because it conditions on u instead of f. That means that any additive LatentSparse models or .conditional(given=...) calls would have to have u provided. This should also be reflected in the decorator line at the top of the class; in your case you should also modify

Yes, that's what I did. I checked out your PR and modified it this way. I saw correct results.

Of course, conditioning only on u makes everything much simpler and reflects the actual conditional structure of this model much more closely.

Strongly agree.

However, it deviates from the MarginalSparse model which conditions on y, i.e. f with additional noise.

Yes, because there is no y in this model, so it has to be different. f is a latent, unobserved, random variable.

The whole question then is what is the desired use case and how convenient it will be to work with the class.

When f is a latent unobserved random variable, especially when it is not being plugged directly into a non-Gaussian likelihood. When there is a y with non-Gaussian noise driven by a GP, the method in this paper is an improvement on both DTC and FITC as implemented in LatentSparse (though LatentSparse can also be used in this case). But LatentSparse is uniquely useful if, for instance, you wanted to model a heteroscedastic process and use a GP to directly model input-dependent noise. You could also use LatentSparse to model things like the lengthscale in a Gibbs covariance function, as is done here. The variational version in the first paper linked cannot do this, because it is based on minimizing the KL divergence between the approximate GP and the posterior, which necessarily includes some likelihood function.

If we decide to condition on u, we should probably also make it easily accessible, perhaps even return it.

Yes, it would have to be, since it needs to get into .conditional somehow. You could either return both f and u from .prior, or hide u by attaching it to f. Either seems like a fine option to me. In the additive case, passing just u, or f with u attached, to conditional is exactly as convenient as it is for MarginalSparse; it's just u (or f) instead of y, and there's no sigma.

But say you only give f to conditional, and you throw u away. Just looking at DTC, the conditional mean is

mu = Qsf (Qff)^{-1} f

and the conditional covariance is

Sigma = Kss - (Qsf)(Qff)^{-1}(Qfs)

You can get this result starting from Eq. 20 in the Quinonero-Candela paper (u has been marginalized out), then using the conditioning formula for a multivariate normal. I haven't tried implementing this, so I don't know if it's numerically any better or worse than the u version. Checking your conditional implementation, I find

mu = (Ksu (Kuf Kfu)^{-1} Kuf)  f
Sigma = Kss - Ksu (Kuu)^{-1} Kus

which is definitely different. Did I transcribe that right? How'd you derive this?

Perhaps we could also ditch the whole LatentSparse class altogether and just provide one function which constructs f from u according to the requested approximation.

I'm not following this, why ditch the LatentSparse class?

Indeed, but that would still be possible with a Latent.conditional_approx method. Such a method would simply return the DTC or FITC approximation that LatentSparse.prior currently returns. Effectively it would serve as an "interpolant" of the real Latent GP over a large number of points, which is probably the most common use case anyway.

The Latent.conditional method would still be usable for predictive samples without the apparent numerical issues and would transparently represent the actual conditional structure of the model, i.e. the DTC and FITC use an approximate training conditional, but a full predictive conditional is used. Both are conditioned on the same full GP.

I don't think I'm following you guys' reasoning here regarding conditional_approx. LatentSparse.conditional has the same memory / computation requirements as Latent.conditional, thanks to k(x*, x*) appearing in both. First, there's no reason to take f fitted with Latent, and then use an approx_conditional. Why fit the full model, then make suboptimal predictions with no computational gain? Second, taking f fitted with LatentSparse, then applying that f to Latent.conditional also doesn't make sense. If you do this, you aren't including the known error that is due to the use of u as an inducing variable, and the memory / computation requirements are the same.

I don't mind that I spent time writing up the LatentSparse class even if it is scrubbed, it was a worthwhile learning experience.

For me also, this really got into the theory weeds, which I hadn't done before for DTC/FITC. I don't think it should be scrubbed though!

Indeed. My feeling is that the DTC and FITC nomenclature is now the standard in newer articles. But we could add a.k.a. Projected Process in the docstring. There is also the argument of being consistent with the MarginalSparse class.

I'm OK with the DTC/FITC name, was mentioning it as an option just in case you were unaware. Wouldn't hurt to be in the docstring.

@smartass101
Contributor Author

I guess we're not on the same wavelength here.

I'll try to tune to yours and improve the transmission efficiency and hopefully reduce interference 😉 Just curious, is that a common saying in your field (I'd guess something to do with spectroscopy)? In my first language (Czech) we say "on the same wave", but we don't specify how our frequency and/or wavelength spectra differ 😄

... f is a latent, unobserved, random variable.
When f is a latent unobserved random variable

This is not completely true for the DTC approximation and only partially true for the FITC approximation. u is the actual latent unobserved random variable, f is just a (partially-)deterministic transformation of u towards a larger number of points. I think this point is what could be causing the misunderstanding. So the question then is whether we want to

  • either hide the fact that f is not an actual latent unobserved random variable within the LatentSparse class which will internally handle the true unobserved random variable u
  • or explicitly and transparently let the user construct u as a full Latent GP over Xu and then provide the Latent.conditional_approx(X) method, which would return what LatentSparse.prior(X, Xu) does now

The second option reflects the true structure of the probabilistic model (the approximations are essentially conditionals on the inducing prior u), yet still provides the same numerical gains if the user then takes f = Latent(...).conditional_approx(X) as the "approximate prior" for f. I will illustrate this below.

Why fit the full model, then make suboptimal predictions with no computational gain? Second, taking f fitted with LatentSparse, then applying that f to Latent.conditional also doesn't make sense. If you do this, you aren't including the known error that is due to the use of u as an inducing variable, and the memory / computation requirements are the same.

I'll try to illustrate how I thought Latent.conditional_approx would be used, which will hopefully clear up this misunderstanding. The user explicitly defines the inducing u over Xu and then creates the "approximate prior" f to be used further on.

with pm.Model() as model:
    gp = pm.gp.Latent(cov_func=cov)  # some cov function, irrelevant here
    u = gp.prior("u", Xu)   # the inducing prior, a true unobserved latent random variable
    f = gp.conditional_approx("f", X, approx='DTC')  # the approximate prior for f
    obs = pm.Normal('obs', sd=tt.exp(f), observed=data)  # large heteroscedastic dataset

After inference on this model (which gains from the numerical savings in f) we obtain a posterior for u (and possibly f_noise in case of FITC)

Now we can create the predictive conditional distribution which transparently conditions on u (and perhaps on f_noise), but this time using the full conditional

with model:
    fnew = gp.conditional("fnew", Xnew)
    trace_pp = pm.sample_ppc(trace, vars=[fnew])

The LatentSparse class would hide the line u = gp.prior("u", Xu) and obscure the true probabilistic structure of the model.

which is definitely different. Did I transcribe that right? How'd you derive this?

In the Quinonero-Candela paper, in eqs. (21) for DTC and (24) for FITC, there is always an equation (a) giving a mathematically well-interpretable form and an equation (b) giving a numerically efficient form. I derived the equations used in my code as the limit of (b) for sigma going towards 0. But it's possible I made a mistake. The logp test suite suggests it's not (completely) wrong.
The main issue with equations (a) is that they involve Q_ff^-1, which would require O(n^3) operations since that matrix is (n x n).

@bwengals
Contributor

bwengals commented May 24, 2018

I dunno, I think 'not on your wavelength' is just a saying in English generally, not just in science. That's funny it's 'wave' there, it makes me think of surfing haha. I looked it up, and whatever I read said the phrase started because back in the day you had to adjust the knob on your radio just right to get a signal.

This is not completely true for the DTC approximation and only partially true for the FITC approximation. u is the actual latent unobserved random variable, f is just a (partially-)deterministic transformation of u towards a larger number of points.

I'd say they're both latent (or hidden) because neither are observable. Like in your example pm.Normal('obs', sd=tt.exp(f), observed=data), there's normally no way to directly observe f, or the log variance; you observe y, via data. Or in a mixture model, the cluster assignment probability is latent in the same way. Maybe it's just semantics though and it doesn't really matter.

Thanks for the illustration! Yes that helps me a ton, I totally see what you're saying now with approx_conditional. I think it's a really interesting idea, and it makes a lot of sense. It's a nice level of abstraction. I think it's worth thinking through the implications though, since it is a bit different than how the other GP classes are implemented. I'd want to spend a bit of time thinking about it, and maybe write it and play with it a bit before coming to a conclusion.

So a sketch of version A would be

with model:
    ...
    gp = pm.gp.LatentSparse(cov_func=cov, approx='DTC')
    f, u = gp.prior("f", X, Xu)
    # or f = gp.prior("f", X, Xu), with u attached
    ...
with model:
    fnew = gp.conditional(Xnew)

And version B,

with model:
    ...
    gp = pm.gp.Latent(cov_func=cov)
    u = gp.prior(Xu)
    f = gp.conditional_approx(X, approx='DTC')
    ...
with model:
    fnew = gp.conditional(Xnew)

I'm intending to implement the variational approx with MCMC paper I linked soon, BTW. It needs the likelihood passed in (which is a bit of a bummer) so I've been thinking about that too. Its interface will have to be different also.

I derived the equations used in my code as the limit of (b) for sigma going towards 0.

I'm not sure this works... if sigma^2 goes to zero, then the precision 1 / sigma^2 goes to infinity. How they got from Eqs. (a) to (b) was with the Woodbury matrix inversion identity. You can't apply this identity to

mu = Qsf (Qff)^{-1} f
Sigma = Kss - (Qsf)(Qff)^{-1}(Qfs)

because you'd need (Qff + A)^{-1}, and there's no matrix A (which is the sigma^2 I). (Too bad LaTeX in GitHub comments doesn't work.) Yes, that is why they go from (a) to (b) though, to avoid inverting the n x n matrix.
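
To spell out the identity I mean, here is a quick numerical sanity check of the Woodbury step (my own toy check, not anything from the paper):

# Woodbury: (sigma^2 I + Kfu Kuu^{-1} Kuf)^{-1}
#           = sigma^{-2} I - sigma^{-4} Kfu (Kuu + sigma^{-2} Kuf Kfu)^{-1} Kuf
# The identity needs the sigma^2 I term; with sigma^2 = 0 and n > m the
# left-hand side would be the inverse of a rank-deficient matrix.
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma2 = 6, 3, 0.5
Kfu = rng.normal(size=(n, m))
A = rng.normal(size=(m, m))
Kuu = A @ A.T + np.eye(m)                  # some SPD Kuu
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)    # rank m <= n

lhs = np.linalg.inv(sigma2 * np.eye(n) + Qff)
rhs = (np.eye(n) / sigma2
       - Kfu @ np.linalg.solve(Kuu + Kfu.T @ Kfu / sigma2, Kfu.T) / sigma2 ** 2)
assert np.allclose(lhs, rhs)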

So yeah, the conditional_approx idea is really interesting. Would the .conditional that gets sampled from with sample_ppc be aware of the approximation? I think it may need to be.

@smartass101
Contributor Author

you had to adjust the knob on your radio just right to get a signal.

But that's usually the frequency, that's why wavelength surprised me. But perhaps earlier radios were tuned by wavelength (short, long, etc.).

That's funny it's 'wave' there, it makes me think of surfing haha.

The concept of "surfing on waves" is actually quite important in my field, see Landau damping

I'd say they're both latent (or hidden) because neither are observable

Latent yes, but not random. The DTC approximation is completely deterministically conditioned on a truly random u. I think this difference has profound implications for the probabilistic model as discussed below.

Would the .conditional that gets sampled from with sample_ppc be aware of the approximation? I think it may need to be.

My understanding of the DTC and FITC constructs is that any conditional predictive samples f_* are conditioned only on u and know nothing of f, i.e. they are conditionally independent. See Figure 3 in the Quinonero-Candela paper and the related discussion.

Therefore, for the DTC approximation (which is only a deterministic transformation of u) the posterior of u is completely sufficient. In the case of FITC the f_FITC_noise posterior needs to be specified as well, but sample_ppc should take care of that, assuming the provided trace contains that posterior, which would likely be the case in common usage.

It needs the likelihood passed in (which is a bit of a bummer) so I've been thinking about that too.

In theory you could (I can imagine it at least for simple models) extract a general likelihood by following the Theano tree from the latent variable towards observed RVs.

I'm not sure this works... if sigma^2 goes to zero, then the precision 1 / sigma^2 goes to infinity.

My reasoning was that in expressions of the form sigma^-2 (... sigma^-2 + ...)^-1, only those terms where sigma^-2 also appears inside the inverse survive, because the diverging precision cancels out. But I admit it's not very rigorous.

@bwengals
Contributor

Hi @smartass101, what's the status of this PR for you?

@smartass101
Contributor Author

Well, I was mostly waiting for you @bwengals to reach a conclusion since you wrote this above.

I'd want to spend a bit of time thinking about it, and maybe write it and play with it a bit before coming to a conclusion.

Personally, I'm still for implementing .conditional_approx.

@bwengals
Contributor

After thinking about it, I think it would be best to simplify things in the interest of getting this done. How about doing only DTC for now and saving FITC for a separate PR, perhaps right after this one is done? I'm not sold on the need for FITC here. I think it's worth it to study the performance and quirks of FITC outside the context of additive Gaussian noise first, especially since Latent FITC hasn't really appeared anywhere in the literature that I know of. My main concern is

var = tt.clip(Kff_diag - Qff_diag, 0.0, np.inf)
v2 = pm.Normal(name + '_FITC_noise_', tau=tt.inv(var), shape=shape)

where if the approximation is good, then the variance goes to zero, which will challenge the sampler. Difficulty sampling when the model is good would be pretty difficult to debug (remove inducing points?).
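
A tiny toy check of what I mean (my own illustration, not from the PR): as the inducing points cover X more densely, diag(Kff - Qff) collapses toward zero, so tau = 1/var blows up.

# Toy illustration with a unit-variance ExpQuad kernel on 1-D inputs
import numpy as np

def expquad(A, B, ell=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

X = np.linspace(0, 10, 200)
for m in (5, 20, 100):
    Xu = np.linspace(0, 10, m)
    Kuu = expquad(Xu, Xu) + 1e-8 * np.eye(m)
    Kfu = expquad(X, Xu)
    Qff_diag = np.sum(Kfu * np.linalg.solve(Kuu, Kfu.T).T, axis=1)
    var = np.clip(1.0 - Qff_diag, 0.0, None)   # diag(Kff) = 1 here
    print(m, var.max())                        # shrinks rapidly with m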

An issue with conditional_approx is that conditional needs to know whether DTC or FITC was used. Will conditional_approx change the state of the gp when it's called, to record the approximation used? Also, conditional_approx will need to be called in each model, so it's sort of a large API change, considering all other GP implementations follow the same API. I think this question could be returned to if additive GPs were considered.

Then, I think it would be OK to not support additive modeling (not just additive kernels) for LatentSparse, at least in the current PR. The Kronecker implementations already don't. The hypothetical pool of users who require additive modeling with GPs shrinks at each step from Marginal to MarginalSparse, to Latent, then to LatentSparse. If there is demand, then I think it can be addressed in another PR. Not requiring support for additive GPs simplifies the demands on the API. u is known to the gp instance (so it can be used by conditional) and appears in the trace (so it can be examined by MCMC diagnostics), and the user calls f = gp.prior("f", X, Xu).

So in summary, DTC without additive support will make for a much simpler PR, which I think we'd be able to get in here pretty fast. The rest of the stuff you've been looking into could potentially make for some good improvements later.

@fonnesbeck
Member

I agree with @bwengals that we should be conservative with respect to adding methods that have not been validated in the literature, and we certainly should not drastically change the API to accommodate them. Can we go with a simpler PR right now, and leave the FITC method until we know more about its performance characteristics?

@twiecki
Member

twiecki commented Dec 22, 2018

What's the status of this? Can it be closed?

@smartass101
Contributor Author

I'd like to contribute it, but there hasn't been consensus on whether it should be a separate class or a method.

@twiecki
Member

twiecki commented Dec 22, 2018

@smartass101 Is that not covered in @bwengals response above?

@twiecki
Member

twiecki commented Jun 14, 2019

Closing due to inactivity.

@twiecki twiecki closed this Jun 14, 2019