SparseLatent GP #2951
Conversation
the mean_func and mean_total usage might be wrong, need to check literature
could probably use some matrix computation optimization
getting an -inf logp, the models are wrong for some reason
also use more descriptive names for vars
(force-pushed from cbcc15a to 768fa70)
The DTC approximation passes the test suite as of 768fa70. FITC gives me a very different logp for the prior and -inf with the conditional; it needs some more work.
…is faster at all?
… to use relative 5% error for DTC conditional
Splitting the test suite in ab6e01d shows that the DTC prior approximation is good to 1% relative error, but the DTC prior+conditional only to 5%. Not sure if this suggests a problem or is realistic.
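For reference, a relative-error check of this kind usually boils down to an `assert_allclose` with `rtol` (a minimal sketch with placeholder values, not the PR's actual test code):

import numpy as np

# placeholder values standing in for the logp of the sparse approximation
# and of the exact (dense) GP on the same data
approx_logp = -1052.3
exact_logp = -1018.9

# rtol=0.01 for the prior-only check, rtol=0.05 for prior+conditional
np.testing.assert_allclose(approx_logp, exact_logp, rtol=0.05)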
This is really cool! Could you put a link to the paper in the docstring for the implementation, as a reference?
pymc3/gp/gp.py
Outdated
----------
cov_func : None, 2D array, or instance of Covariance
    The covariance function. Defaults to zero.
    Must implement the `diag` method for the FITC approximation
all of them must implement the `diag` method. I think there's no need to specify this.
Perhaps I should reformulate this so that the `cov_func` has to implement a computationally efficient `diag` method (i.e. not use `tt.diagonal(full(X, X))`) for the approximation to be sparse?
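To make the point concrete, here is a minimal sketch (my own illustration, not code from the PR) of the difference for a kernel whose diagonal is constant:

import theano.tensor as tt

def diag_cheap(X, variance=1.0):
    # stationary kernels have k(x, x) = variance, so the diagonal costs O(n)
    return tt.alloc(variance, X.shape[0])

def diag_naive(cov_func, X):
    # builds the full (n, n) covariance only to discard the off-diagonal:
    # O(n^2) time and memory, which defeats the purpose of a sparse approximation
    return tt.diagonal(cov_func(X, X))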
pymc3/gp/gp.py
Outdated
Kff_diag = self.cov_func.diag(X)
# MvNormal with diagonal cov is Normal with sd=cov**0.5
var = tt.clip(Kff_diag - Qff_diag, 0, np.inf)
f = pm.Normal(name, mu=f_, tau=tt.inv(var), shape=shape)
Why use `tau` and not `sd`? If there is a reason this is preferred, would you mind changing this in MarginalSparse while you're at it?
Also, `shape` should be the shape of `f` here, not the shape of `u`. Sorry if you know and you're not quite done with this part yet.
- I was thinking that `sd=tt.sqrt(var)` might be computationally more expensive than `tau=tt.inv(var)`. But I could be wrong (especially in terms of numerical stability), please correct me on this.
- Good catch with the `shape` parameters, maybe that's why FITC isn't working. I suggest slightly changing the implementation of `infer_shape` to:
def infer_shape(X, kwargs=None, shape_kw='shape'):
    if kwargs is None:
        try:
            shape = np.int(X.shape[0])
        except TypeError:
            raise TypeError("Cannot infer '%s', provide as a keyword argument" % shape_kw)
    else:
        shape = kwargs.pop(shape_kw)
    return shape
and in the docstring mention that if `X` and/or `Xu` are tensors, the keyword arguments `shape` and/or `shape_u` must be provided, respectively. Also, should I provide `**kwargs` to the second `Normal` for FITC as well?
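A quick usage sketch of the proposed helper (my own illustration, assuming the `infer_shape` definition above; the variable names are made up):

import numpy as np
import theano.tensor as tt

X_np = np.random.randn(100, 1)   # concrete array: the shape can be inferred
X_tt = tt.matrix('X')            # symbolic tensor: the shape cannot be inferred

infer_shape(X_np)                           # -> 100
infer_shape(X_tt, {'shape': 100}, 'shape')  # -> 100, popped from the kwargs dict
# infer_shape(X_tt)                         # -> TypeError asking for 'shape'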
Not sure if it's more expensive computationally. Should profile it.

> I suggest to slightly change the implementation

Good call.

> Also, should I provide **kwargs to the second Normal for FITC as well?

I'm thinking no... any reason you can think of at all for someone to pass kwargs there?
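If someone does want to profile the two parametrizations, a rough sketch along these lines would do (my own snippet, not part of the PR; it just times the two elementwise logp graphs):

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

x = tt.vector("x")
var = tt.vector("var")

# the two parametrizations under discussion
logp_sd = pm.Normal.dist(mu=0.0, sd=tt.sqrt(var)).logp(x)
logp_tau = pm.Normal.dist(mu=0.0, tau=tt.inv(var)).logp(x)

f_sd = theano.function([x, var], logp_sd, profile=True)
f_tau = theano.function([x, var], logp_tau, profile=True)

xv = np.random.randn(100000)
vv = np.random.rand(100000) + 0.1
for _ in range(200):
    f_sd(xv, vv)
    f_tau(xv, vv)

f_sd.profile.summary()   # compare op-level timings of the two graphs
f_tau.profile.summary()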
pymc3/gp/gp.py
Outdated
    f = pm.Deterministic(name, f_)
elif self.approx == 'FITC':
    Qff_diag = project_inverse(Kfu, Luu, diag=True)
    Kff_diag = self.cov_func.diag(X)
self.cov_func(X, diag=True)
pymc3/gp/gp.py
Outdated
shape = infer_shape(Xu, kwargs.pop("shape", None))
v = pm.Normal(name + "_u_rotated_", mu=0.0, sd=1.0, shape=shape, **kwargs)
u_ = self.mean_func(Xu) + tt.dot(Luu, v)  # mean + chol method of MvGaussian
u = pm.Deterministic(name + '_u', u_)  # (m,) prior at inducing points
just a style thing, name + '_u'
small thing, would prefer:
u = pm.Deterministic(name + '_u', self.mean_func(Xu) + tt.dot(Luu, v))

I think this is common when writing pymc3 programs, e.g. `y_ = Normal('y', ..., observed=y)`, so as to not overwrite the data `y` with a symbolic variable. I don't know if it's good to have this naming convention in source code.
@@ -9,6 +9,24 @@
solve_upper = tt.slinalg.Solve(A_structure='upper_triangular')
solve = tt.slinalg.Solve(A_structure='general')


def invert_dot(L, X):
These are great! For consistency though, it would be good to use these functions throughout the gp source code (I know, a pain) or just write the theano ops as before.
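As a reading aid, here is a sketch of what these helpers plausibly compute (my own guess at their semantics based on how they are called above, with `invert_dot(L, X)` as a triangular solve and `project_inverse` as the Nystrom projection; the real definitions live in the PR's diff):

import theano.tensor as tt

solve_lower = tt.slinalg.Solve(A_structure='lower_triangular')

def invert_dot(L, X):
    # assumed semantics: L^{-1} X for lower-triangular L, via a triangular solve
    return solve_lower(L, X)

def project_inverse(Kfu, Luu, diag=False):
    # assumed semantics: the projection Qff = Kfu Kuu^{-1} Kuf, computed
    # through the Cholesky factor Luu of Kuu as A^T A with A = Luu^{-1} Kuf
    A = invert_dot(Luu, Kfu.T)                 # (m, n)
    if diag:
        return tt.sum(tt.square(A), axis=0)    # diag(Qff), shape (n,)
    return tt.dot(A.T, A)                      # full (n, n) Qff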
@ColCarroll The paper is strictly only about the case where the likelihood function is also Gaussian. The main thing I can find on FITC for non-Gaussian likelihoods is this paper, but the inference method was not MCMC.

@smartass101 This is great work, thank you for submitting! We should take a look at FITC with MCMC on some simulated data to get a feel for how it does before including it in the code base. DTC is simple to implement and seems to work alright, based off this gist. Some variational sparse approaches (see GPflow) also occupy this niche, and don't exhibit some of the pathologies of DTC or FITC.
Regarding the paper(s), I must confess that I'm no statistician, just a physicist trying to get reasonable errorbars for large datasets (hence the need for sparsity) of noisy, heteroscedastic measurements (plasma turbulence). I wish to use the … The variational approaches in GPflow you mention, @bwengals, don't seem to be easily usable as latent variables in a larger model, which is what I need.
I also have a use-case for … Just letting you know that other use-cases exist and that this is a really worthwhile endeavour!
Thank you both for helping me out with references (or why there isn't one!). I have been reading about GPs and sparse latent GPs the past two weeks, and am finally getting to a discussion of using MCMC vs. an approximation to fit them. This looks like a super interesting PR!
After some digging through the math I came to the conclusion that for … I also had a look at VFE by Titsias, but I don't think that it's applicable, because the regularizer term … Thank you for the support and interest @philastrophist and @ColCarroll.
Bad news, the DTC conditional was wrong and after the fix it fails as well. Previously I was wrongly clipping the whole …
Right. So I got your FITC implementation to work and tried it on a toy problem. I found that sampling is very slow, much slower than DTC. One of the difficulties of sampling from GPs is that there are strong and complex correlations between …

From what I understand, the variational approximations depend on the likelihood, meaning you can't really do something like what @smartass101 wants to do by modeling the variance with a GP. I'm leaning toward including just DTC in this PR. What do you think? I don't understand EP very well, so can't say much about the paper you linked.

@ColCarroll From what I gather, the main reason the variational approaches are generally superior to FITC or DTC and such is that the inducing point locations are themselves variational parameters, so there's no danger of overfitting when optimizing their locations, which happens for things like FITC and DTC. Also, working with a lower bound of the posterior is a bit better theoretically than just an approximation, since it's hard to know how good the approximation is.
Thank you for the insightful comments @bwengals. So it seems that advanced approaches such as VFE or EP are derived for a specific (usually iid Gaussian) likelihood. My aim is to have a rather general sparse latent GP implementation which can be used as some unknown function to be inferred with rather arbitrary likelihoods (see the illustrative example after this comment).
Yes, just not sure what to compare/test then.
You mean it didn't work and you had to fix something, or just the sampling was very slow and worked only for a toy problem?
I thought the reparametrization …

For a stationary (and many non-stationary cases as well) kernel the diagonal …

I'd put some more effort into checking whether the FITC implementation is at least correct and then include it as well, but make the default …
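As an illustration of what "rather arbitrary likelihoods" means above (my own example, written against the existing dense `pm.gp.Latent` API rather than this PR's sparse class), the latent `f` can feed any downstream likelihood:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.linspace(0, 10, 200)[:, None]
counts = np.random.poisson(5, size=200)   # made-up observed data

with pm.Model() as model:
    ls = pm.Gamma("ls", alpha=2, beta=1)
    cov = pm.gp.cov.ExpQuad(1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov)
    f = gp.prior("f", X=X)                               # latent GP values
    y = pm.Poisson("y", mu=tt.exp(f), observed=counts)   # non-Gaussian likelihood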
A possible reparametrization could be to use …

I don't see it as very different from the current approach, but the logp added to the model by … Perhaps reparametrizing the …
mu = self.mean_func(X)  # (n,)
Luu = cholesky(stabilize(self.cov_func(Xu)))  # (m, m) \sqrt{K_u}
shape_u = infer_shape(Xu, kwargs.pop("shape_u", None))
shape = infer_shape(X, kwargs.pop("shape", None))
what about `shape_f`?
How does FITC vs. DTC compare on a more realistic problem? Things look great so far. Will do a more thorough review soon.
Tried regressions with 2 …
FITC failed with …

after sampling at 11.87 s/it. But I think this is more a problem of my model (misspecified covariance); it may not reflect the general performance of FITC. Perhaps a more relevant benchmark would be to use one of the …
Wow, that's pretty fast. How much RAM do you have, btw? Yeah, I think that's a good idea. These results are on your problem, not a toy problem?
It's just my notebook with 16 GB RAM and an i7-5500U, nothing super fancy, but decent for experiments. Yes, this is on my (yet still simplified) problem.
@@ -71,7 +89,7 @@ def setter(self, val):
    return gp_wrapper


-def plot_gp_dist(ax, samples, x, plot_samples=True, palette="Reds"):
+def plot_gp_dist(ax, samples, x, plot_samples=True, palette="Reds", label=None):
nice
I have 16 GB too and something that big definitely doesn't fit for me... Been trying to figure out what's up with that, since I don't have that problem with GPflow or GPy.

Oh yeah, those don't look right, like some numerical problem is happening. The overall shape looks OK but the individual draws are messed up. I'll think about what that could be and get back soon.
pymc 3.4 was supposed to fix some memory inefficiencies, not sure which version you have. To be clear, a full GP doesn't fit into my RAM, not even into 150 GB (tried on our cluster).
Yes. For some reason the distribution of points and its median isn't completely wrong. Maybe some random sub-selections of data? The posterior does not have these issues though.
I found the issue, at least with the DTC version. I still need to think carefully about translating FITC to the latent implementation. Mainly, I implemented it in terms of …

The key spot being …

before things were in terms of …
Thank you for the feedback @bwengals.

@conditioned_vars(["X", "Xu", "u"])
class LatentSparse(Latent):
    ...

Of course, conditioning only on … The whole question then is what is the desired use case and how convenient it will be to work with the class. If we decide to condition on … Perhaps we could also ditch the whole …
You're right. So I didn't paste it, but I did also modify …

But you still get a big memory / speedup with the … I also looked around, and DTC is also known as the Projected Process approximation. It's talked about in Ch. 8 of the Rasmussen & Williams book.
I suppose that would work (it would also require a modification of …

Indeed, but that would still be possible with a … The …
I'm slowly getting off-board. I don't feel the convenience of a class is worth the hacks and non-transparency.
Indeed. My feeling is that the DTC and FITC nomenclature is now the standard in newer articles. But we could add …
Thank you both for the detailed write-up. While I like the separation of the latent and marginal classes, @smartass101 is right that things should be more transparent. For example, I would support these two points: …
I guess we're not on the same wavelength here. Just gonna go carefully through our recent convo, and err on the side of being overly verbose:
Yes, that's what I did. I checked out your PR and modified it this way. I saw correct results.
Strongly agree.
Yes, because there is no y in this model, so it has to be different. f is a latent, unobserved, random variable.
When …
Yes, it would have to be, since it needs to get into … But say you only give f to conditional, and you throw u away. Just looking at DTC, the conditional mean is … and the conditional covariance is … You can get this result starting from Eq. 20 in the Quiñonero-Candela paper (u has been marginalized out), then using the conditioning formula for a multivariate normal (restated after this comment for reference). I haven't tried implementing this, so I don't know if it's numerically any better or worse than the … which is definitely different. Did I transcribe that right? How'd you derive this?
I'm not following this, why ditch the LatentSparse class?
I don't think I'm following you guys' reasoning here regarding …
For me also, this really got into the theory weeds, which I hadn't done before for DTC/FITC. I don't think it should be scrubbed though!
I'm OK with the DTC/FITC name, was mentioning it as an option just in case you were unaware. Wouldn't hurt to be in the docstring.
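For reference, the conditioning formula mentioned above is the standard multivariate-normal identity (restated here for readability, not content from the thread): if

$$\begin{pmatrix} a \\ b \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right),$$

then

$$a \mid b \sim \mathcal{N}\!\left( \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \right).$$

Applied to the DTC joint over $(f_*, f)$, it yields the conditional mean and covariance referred to above.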
I'll try to tune to yours and improve the transmission efficiency and hopefully reduce interference 😉 Just curious, is that a common saying in your field (I'd guess something to do with spectroscopy)? In my first language (Czech) we say "on the same wave", but we don't specify how our frequency and/or wavelength spectra differ 😄
This is not completely true for the DTC approximation and only partially true for the FITC approximation.
The second option reflects the true structure of the probabilistic model (the approximations are essentially conditionals on the inducing prior `u`).
I'll try to illustrate how I thought about it:

with pm.Model() as model:
    gp = pm.gp.Latent(cov_func=cov)  # some cov function, irrelevant here
    u = gp.prior(Xu)  # the inducing prior, a true unobserved latent random variable
    f = gp.conditional_approx(X, approx='DTC')  # the approximate prior for f
    obs = pm.Normal('obs', sd=tt.exp(f), observed=data)  # large heteroscedastic dataset

After inference on this model (which gains from the numerical savings in f) we obtain a posterior for … Now we can create the predictive conditional distribution, which transparently conditions on …

with model:
    fnew = gp.conditional(Xnew)
    trace_pp = pm.sample_ppc(vars=[fnew])

The …

In the Quiñonero-Candela paper, in eqs. (21) for DTC and (24) for FITC, there is always an equation (a) as a mathematically well-interpretable form, while equation (b) gives a numerically efficient form. I derived the equations used in my code as the limit of (b) for σ → 0.
I dunno, I think 'not on your wavelength' is just a saying in English generally, not just in science. That's funny it's 'wave' there, it makes me think of surfing haha. I looked it up, and whatever I read said the phrase started because back in the day you had to adjust the knob on your radio just right to get a signal.
I'd say they're both latent (or hidden) because neither is observable, like in your example … Thanks for the illustration! Yes, that helps me a ton, I totally see what you're saying now with … So a sketch of version A would be …

And version B, …

I'm intending to implement the variational-approx-with-MCMC paper I linked soon, BTW. It needs the likelihood passed in (which is a bit of a bummer), so I've been thinking about that too. Its interface will have to be different also.
I'm not sure this works... if sigma^2 goes to zero, then the precision 1/sigma^2 goes to infinity. How they got from Eqs. (a) to (b) was with the Woodbury matrix inversion identity. You can't apply this identity to … because you'd need (Qff + A)^{-1}, and there's no matrix A (which is the …). So yeah, the …
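For readability, the Woodbury matrix inversion identity referred to here is the standard result

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1},$$

which is what turns the large inverses of form (a) into the cheaper $m \times m$ inverses of form (b); it requires an invertible $A$ (the noise term), which is the difficulty when $\sigma^{2} \to 0$.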
But that's usually the frequency, that's why wavelength surprised me. But perhaps earlier radios used wavelength (short, long, etc.).

The concept of "surfing on waves" is actually quite important in my field, see Landau damping.

Latent yes, but not random. The DTC approximation is completely deterministically conditioned on a truly random `u`.
My understanding of the DTC and FITC constructs is that any conditional predictive samples … Therefore, for the DTC approximation (which is only a deterministic transformation of …
In theory you could (I can imagine it at least for simple models) extract a general likelihood by following the Theano tree from the latent variable towards observed RVs.
My reasoning was that in the expression with …
Hi @smartass101, what's the status of this PR for you?
Well, I was mostly waiting for you @bwengals to reach a conclusion since you wrote this above.
Personally, I'm still for implementing …
After thinking about it, I think it would be best to simplify things in the interest of getting this done. How about doing only DTC for now, and saving FITC for a separate PR, perhaps right after this one is done? I'm not sold on the need for FITC here. I think it's worth it to study the performance and quirks of FITC outside the context of additive Gaussian noise first, especially since Latent FITC hasn't really appeared anywhere in the literature that I know of. My main concern is … where, if the approximation is good, then the variance goes to zero, which will challenge the sampler. Difficulty sampling when the model is good would be pretty difficult to debug (remove inducing points?). An issue with …

Then, I think it would be OK to not support additive modeling (not just additive kernels) for LatentSparse (at least in the current PR). The Kronecker implementations already don't. The hypothetical pool of users who require additive modeling with GPs shrinks at each step from Marginal, to MarginalSparse, to Latent, then to LatentSparse. If there is demand, then I think it can be addressed in another PR. Not requiring support for additive GPs simplifies the demands on the API.

So in summary, DTC without additive support will make for a much simpler PR, which I think we'd be able to get in here pretty fast. The rest of the stuff you've been looking into could potentially make for some good improvements later.
I agree with @bwengals in that we should be conservative with respect to adding methods that have not been validated in the literature, and certainly not drastically changing the API to accommodate them. Can we go with a simpler PR right now, and leave the FITC method until we know more about its performance characteristics?
What's the status of this? Can it be closed?
I'd like to contribute it, but there hasn't been consensus on whether it should be a separate class or a method.
@smartass101 Is that not covered in @bwengals' response above?
Closing due to inactivity. |
This is an attempt based on this notebook by @bwengals to implement the DTC and FITC sparse approximations.
I tried to read through Quinonero-Candela & Rasmussen, 2006 for the covariance equations. Essentially I substituted `sigma=0` in them and reduced them to simpler and yet computationally feasible forms. But it's quite likely my code is sub-optimal and wrong.
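For readers, here is my summary (a standard form from the Quinonero-Candela & Rasmussen framework, not copied from the PR's code) of the resulting training-set priors in the $\sigma \to 0$ limit, writing $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$:

$$\text{DTC:}\quad f \sim \mathcal{N}\!\left(m(X),\; Q_{ff}\right), \qquad \text{FITC:}\quad f \sim \mathcal{N}\!\left(m(X),\; Q_{ff} + \mathrm{diag}\!\left[K_{ff} - Q_{ff}\right]\right).$$

The $\mathrm{diag}[K_{ff} - Q_{ff}]$ correction is the clipped `var` term that appears in the FITC branch of the diff discussed above.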