Reintroduce Bernoulli logitp parametrization #4620


Closed

Conversation

@ricardoV94 (Member) commented Apr 7, 2021

This is a proposal for how we might reintroduce the logit_p parametrization to the Bernoulli distribution.

I parametrized everything (sampling, logp, logcdf) in terms of logit_p, as this should be numerically more stable. This required changing the Bernoulli Op from Aesara. I didn't run any benchmarks to check whether this incurs a significant loss of speed relative to the p parametrization.

Also, we get a divide-by-zero RuntimeWarning when creating a Bernoulli with p=1.

I am happy to listen if you think there might be a better approach.
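
For context, a minimal sketch (hypothetical code, not the PR's actual implementation; the function name is made up) of a Bernoulli logp written directly on the logit scale, using the identities log(p) = -softplus(-logit_p) and log(1 - p) = -softplus(logit_p):

import aesara.tensor as at

def bernoulli_logp_from_logit(value, logit_p):
    # softplus keeps both branches finite for large |logit_p|, where p
    # itself would round to exactly 0.0 or 1.0 and log(p) or log(1 - p)
    # would become -inf
    return at.switch(
        value,
        -at.nnet.softplus(-logit_p),  # log(p) for value == 1
        -at.nnet.softplus(logit_p),   # log(1 - p) for value == 0
    )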



@@ -332,6 +339,19 @@ def logcdf(self, value):

class BernoulliLogitRV(BernoulliRV):

@ricardoV94 (Member, Author) commented on the diff:

Should it inherit from BernoulliRV or RandomVariable directly?

Are the name and _print_name changes just confusing and unnecessary?

def test_bernoulli(self):
    pymc3_random_discrete(
-       pm.Bernoulli, {"p": Unit}, ref_rand=lambda size, p=None: st.bernoulli.rvs(p, size=size)
+       pm.Bernoulli, {"p": Unit}, ref_rand=lambda size, p: st.bernoulli.rvs(p, size=size)

@ricardoV94 (Member, Author):

This will be replaced following #4608

    value <= 1,
-   p >= 0,
-   p <= 1,
+   ~at.isnan(logit_p),

@ricardoV94 (Member, Author):

Invalid p gets converted to nan, so this is how we can identify it in bound.
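
For illustration (NumPy here, but Aesara behaves the same way): logit maps any p outside [0, 1] to nan, so a single ~at.isnan(logit_p) condition covers both of the old p >= 0 and p <= 1 bounds:

import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    for p in np.array([-0.1, 0.0, 0.5, 1.0, 1.1]):
        print(p, np.log(p / (1 - p)))

# -0.1 nan
# 0.0 -inf
# 0.5 0.0
# 1.0 inf
# 1.1 nan

Note that the boundary values map to -inf/inf rather than nan, which is also where the divide-by-zero RuntimeWarning at p=1 mentioned in the description comes from.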

@brandonwillard (Contributor) left a comment:

In general, we don't want to create new versions of the basic RandomVariables unless we absolutely must; otherwise, putting the parameterization inside the RandomVariable Op removes the possibility of optimization and/or stabilization, and necessarily incurs the cost of transforming—even when it's unnecessary.

The real question is: what's the advantage of doing it this way, instead of applying the parameterization as an input to the existing RandomVariable?

@ricardoV94 (Member, Author):

The issue is that if we use p instead of logitp as the input to the logp and logcdf, we necessarily lose numerical precision. This was not really the case for the other alternative parametrizations that we already refactored, but it is relevant here.

@ricardoV94 (Member, Author) commented Apr 8, 2021

Alternatively, if we could rely on Aesara always simplifying the graph logit(invlogit(x)) = x, we could do p = invlogit(logit_p) during initialization and logit_p = logit(p) inside the logp and logcdf calls, achieving the same end result without having to alter the Bernoulli Op.

I have to check whether Aesara can actually do this. If it can (or could), would this be a viable strategy?

Taking one step back, is there an alternative where we could simply provide different parametrizations to the random Op and the logp / logcdf methods directly? It feels inefficient to have to compute x -> f(x) -> f^-1(f(x)) so often just because the random Op is (logically) parametrized differently than the logp method.

@ricardoV94 (Member, Author):

I wrote a small gist that enables rewrites of logit(invlogit(x)) = x and invlogit(logit(x)) = x.
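
For illustration, a rough sketch of what such rewrites might look like (hypothetical code, not the actual gist; the names are made up, Aesara's 2021-era PatternSub API is assumed, logit is spelled out as log(p / (1 - p)), and registration into the rewrite database is omitted):

import aesara.tensor as at
from aesara.graph.opt import PatternSub
from aesara.tensor.nnet import sigmoid

# logit(invlogit(x)) -> x, with logit written out as log(s / (1 - s))
local_logit_invlogit = PatternSub(
    (at.log, (at.true_div, (sigmoid, "x"), (at.sub, 1.0, (sigmoid, "x")))),
    "x",
)

# invlogit(logit(p)) -> p
local_invlogit_logit = PatternSub(
    (sigmoid, (at.log, (at.true_div, "p", (at.sub, 1.0, "p")))),
    "p",
)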

Whether or not we make use of this for this PR, it could be a nice feature going forward. Users would benefit from increased precision in logistic regression models without having to be aware of the alternative logit_p parametrization (assuming we decide to stick with Bernoulli logp / logcdf methods written on the logit scale).

We probably could also get rid of the "hackish" invlogit function in pymc3.math and revert to using the Aesara sigmoid:
https://github.com/pymc-devs/pymc3/blob/2dee9dafb3541043ae23b66fe9262e70fbd2769a/pymc3/math.py#L212-L214

@brandonwillard (Contributor):

> I wrote a small gist that enables rewrites of logit(invlogit(x)) = x and invlogit(logit(x)) = x.

Sounds good; create a PR in Aesara for that.

> We probably could also get rid of the "hackish" invlogit function in pymc3.math and revert to using the Aesara sigmoid

Yeah, we should make this change, too.

@ricardoV94 (Member, Author) commented Apr 12, 2021

@brandonwillard I will check the sigmoid in more detail and open a PR on the Aesara side.

Besides that, is there no way we could specify a mapping of which parameters should go into a RandomOp vs. the logp or logcdf methods when creating a PyMC distribution?

It feels rather restrictive to force everything into the RandomOp parametrization when different methods have more natural / stable parametrizations. For example:

class Bernoulli(Discrete):
    def __init__(self, p=None, logitp=None, *args, **kwargs):
        p, logitp = self.get_p_logitp(p=p, logitp=logitp)
        param_map = {
            "random": (p,),
            "logp": (logitp,),
            "logcdf": (logitp,),
        }
        super().__init__(param_map=param_map, **kwargs)

By default this would do nothing and use the same arguments for all three methods.

@brandonwillard (Contributor):

> It feels rather restrictive to force everything into the RandomOp parametrization, when different methods have more natural / stable parametrizations.

If you're talking about developer "restrictions", then it's considerably more restrictive to have to deal with duplicate classes that vary only in how they parameterize something. The downstream complexity introduced by such an outgrowth is considerably larger than you realize.

If you're talking about the parameterizations used during sampling, we can always consider changing the internal parameterizations used in the Aesara NumPy RandomVariables when there's a demonstrable improvement to be had. Even so, we definitely can't remove the NumPy parameterizations from the interface.

Aside from those, I'm not sure what you mean, because any transforms one may want to apply in a log-likelihood or CDF graph can be applied quite easily.

The change you proposed in this PR doesn't seem to line up with your example, though, because the code you posted implies that the sampling (i.e. random) should be performed in the existing parameterization (i.e. p) and not the logit-based parameterization change in this PR. Again, it's very straightforward to make an arbitrary parameterization change within a logp implementation, and there is absolutely no reason to change the parameterization used by the underlying RandomVariable—unless it provides a substantial improvement to random sampling, of course.

@ricardoV94 (Member, Author) commented Apr 12, 2021

> If you're talking about developer "restrictions", then it's considerably more restrictive to have to deal with duplicate classes that vary only in how they parameterize something. The downstream complexity introduced by such an outgrowth is considerably larger than you realize.

Definitely, I wouldn't want to do that. To be clear, I don't like the current PR proposal.

> If you're talking about the parameterizations used during sampling, we can always consider changing the internal parameterizations used in the Aesara NumPy RandomVariables when there's a demonstrable improvement to be had. Even so, we definitely can't remove the NumPy parameterizations from the interface.

Again, that sounds suboptimal. Ideally we could keep the RandomVariable parametrization aligned with NumPy's / SciPy's, while allowing different parametrizations for the logp / logcdf, transforming between them only where needed. If a user defines p, we would use it directly in the RandomOp and convert to logitp for the logp / logcdf; if the user defines logitp, we would do the reverse: use it directly in logp / logcdf and only convert to p for the RandomOp.
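
For concreteness, a rough sketch of that dispatch (a hypothetical standalone helper, not existing PyMC3 code):

import aesara.tensor as at

def get_p_and_logit_p(p=None, logit_p=None):
    """Convert only in the direction that is actually needed."""
    if (p is None) == (logit_p is None):
        raise ValueError("Specify exactly one of p or logit_p")
    if p is None:
        # user supplied logit_p: use it as-is in logp/logcdf,
        # convert once to p for the RandomOp
        p = at.nnet.sigmoid(logit_p)
    else:
        # user supplied p: use it as-is in the RandomOp,
        # convert once to logit_p for logp/logcdf
        logit_p = at.log(p / (1 - p))
    return p, logit_p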

> Aside from those, I'm not sure what you mean, because any transforms one may want to apply in a log-likelihood or CDF graph can be applied quite easily.

> The change you proposed in this PR doesn't seem to line up with your example, though, because the code you posted implies that the sampling (i.e. random) should be performed in the existing parameterization (i.e. p) and not the logit-based parameterization change in this PR. Again, it's very straightforward to make an arbitrary parameterization change within a logp implementation, and there is absolutely no reason to change the parameterization used by the underlying RandomVariable—unless it provides a substantial improvement to random sampling, of course.

The problem is one of numerical stability / efficiency. If the user computes logitp upstream of the Bernoulli, as is usual in a logistic regression, we are forcing it to go through logit -> invlogit -> logit to arrive at the logp part, which feels wasteful both in terms of computation and numerical precision. It's nice that we can add automatic rewrites in Aesara to optimize away this kind of inverse transformation, but ideally we would not have to do this for every alternative parametrization where there might be a loss of precision in f^-1(f(x)).

To make it clear: if we converted to p during initialization and then back to logit_p inside the logp, this is what would happen now:

import aesara
import aesara.tensor as at

invlogit = at.nnet.sigmoid
logit = lambda p: at.log(p / (1 - p))

x = at.vector("x")
f = aesara.function([x], logit(invlogit(x)))
print(f([-750, -50, -5, 5, 50, 750]))

# [-inf -50.  -5.   5.  inf  inf]

@brandonwillard (Contributor) commented Apr 15, 2021

> It's nice that we can add automatic rewrites in Aesara to optimize away this kind of inverse transformation...

That's one of the important reasons for having the kind of rewrites that we do; it focuses such concerns into a place and time that's more universally manageable. It reduces code complexity by removing the need for a complicated patchwork of numerical concerns implemented in random places throughout the codebase.

Your approach mixes the design of our distribution/random variable interfaces with numerical concerns that arise in different parts of the codebase (e.g. in only the log-likelihood calculations), and this is the exact kind of coupling and complexity I was talking about before.

The rewrites have very little overhead, especially since they do nothing other than check a few conditions/types and create new Apply nodes, both of which are dirt cheap.

Also, by not having those rewrites in place, we end up suffering the same numerical instabilities anyway (just in other places), so why would we not implement those and use them to simplify our classes at the same time?
