
Updates to Simpson's Paradox notebook #697


Merged · 32 commits · Sep 14, 2024

Conversation

@drbenvincent (Contributor) commented Aug 31, 2024

This PR updates the Simpson's Paradox notebook in a number of ways:

  • MutableData -> Data (see the sketch after this list)
  • Updated watermark based on the current style guide
  • Added mathematical descriptions of the models
  • Improved the hierarchical model. Switched from the non-centred to the regular (centred) parameterisation, which gives a simpler graphviz output that aligns more closely with the model maths
  • Strengthened the causal language and included causal DAGs
  • Added rng to sampling functions for increased reproducibility
  • Significantly simplified plotting code
  • Added Wilkinson notation (and explanation) for each model. Hopefully this is correct, but I am not 100% sure.
  • Added a missing reference to the LKJ notebook, since I reference it from this one
  • Added some more data visualisation
  • All models now have a fully pooled observation noise parameter for simplicity.
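
As a minimal, hypothetical sketch of the first change (not the notebook's actual code), the update is essentially a one-line swap, since `pm.Data` now covers what the deprecated `pm.MutableData` used to do:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
x_obs = rng.normal(size=100)

with pm.Model() as model:
    # previously: x = pm.MutableData("x", x_obs)
    x = pm.Data("x", x_obs)  # pm.Data is the current API; pm.MutableData is deprecated
    beta = pm.Normal("beta", 0, 1)
    mu = beta * x
```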

Note: This PR is not related to an issue, I just decided to make these updates in the absence of an issue.


📚 Documentation preview 📚: https://pymc-examples--697.org.readthedocs.build/en/697/



review-notebook-app bot commented Sep 1, 2024


juanitorduz commented on 2024-09-01T11:57:25Z
----------------------------------------------------------------

add "directed acycylic graph (DSG)" for people who do not know what a DAG is.


drbenvincent commented on 2024-09-02T14:14:16Z
----------------------------------------------------------------

Good idea. Done in an upcoming commit.

tomicapretto commented on 2024-09-09T18:51:37Z
----------------------------------------------------------------

We then show 2 "wayes" to resolve... typo?

drbenvincent commented on 2024-09-13T09:03:22Z
----------------------------------------------------------------

Well spotted. Fixed


review-notebook-app bot commented Sep 1, 2024


juanitorduz commented on 2024-09-01T11:57:26Z
----------------------------------------------------------------

Line #1.    with pm.Model() as model1:

where are the coords?


drbenvincent commented on 2024-09-02T14:20:15Z
----------------------------------------------------------------

So you can actually provide dims in the model without declaring them in coords. For example, in Model 2 I provide a coords dict with only the group labels, yet I still have a dimension in the model called obs_id. Because obs_id is not pinned to fixed coordinate values, out-of-sample prediction is easier when shapes change.

I'm happy to explicitly provide all coords in the kwarg, but that will probably require supplying new coords for the out-of-sample prediction. That might actually be better practice, but let me know if you'd prefer that before I put in the work :)
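
A minimal sketch of this pattern (hypothetical variable names, not the notebook's exact code): "obs_id" is never declared in coords, so its length is inferred from the data and can change at prediction time:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x_train = rng.normal(size=100)
g_train = rng.integers(0, 2, size=100)

with pm.Model(coords={"group": ["A", "B"]}) as model2:
    x = pm.Data("x", x_train, dims="obs_id")  # "obs_id" is not in coords
    g = pm.Data("g", g_train, dims="obs_id")
    beta = pm.Normal("beta", 0, 1, dims="group")  # "group" comes from coords
    mu = beta[g] * x
    # at prediction time, pm.set_data can swap in arrays of a different length
```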


review-notebook-app bot commented Sep 1, 2024


juanitorduz commented on 2024-09-01T11:57:27Z
----------------------------------------------------------------

Line #7.        sigma = pm.Gamma("sigma", 2, 2)

in the previous version there was one sigma per group, why did you change it?


drbenvincent commented on 2024-09-02T14:23:39Z
----------------------------------------------------------------

In Model 3 I was getting divergences and thought it might help if I went from unpooled sigmas to a pooled one. I thought that would actually be OK because the data generating process has identical sigmas for each group.

I just thought I'd keep a single fully pooled parameter in all models to demonstrate you can mix and match.

But it could potentially be simpler if Model 2 had unpooled sigmas and Model 3 a hierarchical sigma. Do you think it would be a big improvement to do that?
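
For reference, a minimal sketch (assumed group labels and priors) of the two options being discussed:

```python
import pymc as pm

group_labels = ["A", "B", "C"]  # assumed

# Fully pooled: one observation-noise parameter shared by all groups
with pm.Model(coords={"group": group_labels}) as pooled:
    sigma = pm.Gamma("sigma", 2, 2)

# Unpooled: one observation-noise parameter per group;
# per-observation noise would then be sigma[group_idx]
with pm.Model(coords={"group": group_labels}) as unpooled:
    sigma = pm.Gamma("sigma", 2, 2, dims="group")
```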


review-notebook-app bot commented Sep 1, 2024


juanitorduz commented on 2024-09-01T11:57:27Z
----------------------------------------------------------------

  • we have some divergences
  • do you still want one global sigma? one per group or even build a hierarchy on sigma?

drbenvincent commented on 2024-09-02T14:31:07Z
----------------------------------------------------------------

In terms of divergences: the original notebook used the non-centred parameterisation. The problem was that this makes the DAG and model maths deviate from the regular parameterisation, which made it more difficult to explain the connections between the Wilkinson notation, the model maths, and the PyMC model code.

In terms of sigma: see my reply to another of your comments above. Do you think a hierarchy on sigma would reduce divergences?
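
A minimal sketch of the two parameterisations being contrasted (assumed priors and names, shown for a per-group slope):

```python
import pymc as pm

group_labels = ["A", "B", "C"]  # assumed

# Centred (regular): matches the written maths and the DAG directly
with pm.Model(coords={"group": group_labels}) as centred:
    mu_slope = pm.Normal("mu_slope", 0, 1)
    sigma_slope = pm.Gamma("sigma_slope", 2, 2)
    slope = pm.Normal("slope", mu_slope, sigma_slope, dims="group")

# Non-centred: often samples better, but adds an offset node to the graph
with pm.Model(coords={"group": group_labels}) as non_centred:
    mu_slope = pm.Normal("mu_slope", 0, 1)
    sigma_slope = pm.Gamma("sigma_slope", 2, 2)
    offset = pm.Normal("offset", 0, 1, dims="group")
    slope = pm.Deterministic("slope", mu_slope + sigma_slope * offset, dims="group")
```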


review-notebook-app bot commented Sep 1, 2024


tomicapretto commented on 2024-09-01T14:17:08Z
----------------------------------------------------------------

I think it would be more consistent to always use \sim when we want to say that something follows a given distribution. I'm not sure why you're using = in the first two rows.


drbenvincent commented on 2024-09-02T14:28:05Z
----------------------------------------------------------------

Thanks, this was an error that I didn't catch. Fixed in an upcoming commit


review-notebook-app bot commented Sep 1, 2024


tomicapretto commented on 2024-09-01T14:17:09Z
----------------------------------------------------------------

The formula is correct and the model it builds is equivalent to the one you wrote with distributions above. However, there's not a one-to-one mapping between the distributions above and the formula. The models are equivalent but not exactly the same. With the formula, we're writing the following model

$$
\begin{aligned}
p_{0\sigma}, p_{1\sigma} &\sim \text{Gamma}(2, 2) \\
\vec{u_0} &\sim \text{Normal}(0, p_{0\sigma}^2) \\
\vec{u_1} &\sim \text{Normal}(0, p_{1\sigma}^2) \\
\beta_0 &\sim \text{Normal} \\
\beta_1 &\sim \text{Normal} \\
\sigma &\sim \text{Gamma}(2, 2) \\
\mu_i &= \beta_0 + \vec{u_0}[g_i] + \beta_1 \cdot x_i + \vec{u_1}[g_i] \cdot x_i \\
y_i &\sim \text{Normal}(\mu_i, \sigma)
\end{aligned}
$$

and then 1 maps to $\beta_0$, x maps to $\beta_1$, (1 | g) maps to $\vec{u_0}$ and (x | g) maps to $\vec{u_1}$

This comment in a Bambi discussion can also help

bambinos/bambi#808 (comment)


drbenvincent commented on 2024-09-02T14:52:36Z
----------------------------------------------------------------

Ah, so I think I understand 100%, let me check. The formula block I wrote is equivalent to your formula block. Yours encodes a population mean and group deflections, but mine just encodes the absolute group values (not as deflections from the population mean).

However, when it comes to comparing to the Wilkinson notation, the population mean + deflections is a truer interpretation of what that model notation really means? I buy that.

When I was writing this update I was actually thinking about adding an admonition block saying that you can get an equivalent model with the population + deflection approach. But now I am swayed to go with your population + deflection approach AND update the pymc model accordingly. I might still add an admonition block describing this initial model, and link to the Bambi discussion.

tomicapretto commented on 2024-09-03T09:39:02Z
----------------------------------------------------------------

Yes, that's correct! Wilkinson's model formula was adapted in a non-Bayesian setting where you have fixed and random parameters. The population mean is the fixed part, the group deflections are the random part. In a Bayesian context they're not different (both are random, and that's fine).
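
A minimal PyMC sketch of the population-mean-plus-deflections parameterisation discussed in this thread (hypothetical names and data; note PyMC's Normal takes a standard deviation where the maths above uses a variance):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x_train = rng.normal(size=100)          # assumed data
g_train = rng.integers(0, 3, size=100)  # assumed group indices

with pm.Model(coords={"group": ["A", "B", "C"]}) as model3:
    x = pm.Data("x", x_train, dims="obs_id")
    g = pm.Data("g", g_train, dims="obs_id")
    beta0 = pm.Normal("beta0", 0, 1)                 # "1": population intercept
    beta1 = pm.Normal("beta1", 0, 1)                 # "x": population slope
    p0_sigma = pm.Gamma("p0_sigma", 2, 2)
    p1_sigma = pm.Gamma("p1_sigma", 2, 2)
    u0 = pm.Normal("u0", 0, p0_sigma, dims="group")  # "(1 | g)": intercept deflections
    u1 = pm.Normal("u1", 0, p1_sigma, dims="group")  # "(x | g)": slope deflections
    sigma = pm.Gamma("sigma", 2, 2)
    mu = beta0 + u0[g] + (beta1 + u1[g]) * x
    # y = pm.Normal("y", mu, sigma, observed=y_train, dims="obs_id")  # y_train assumed
```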


@drbenvincent (Contributor, Author) commented Sep 4, 2024

@tomicapretto So this is all very interesting and I feel like I'm learning a lot.

I've written Model 3's 1 + x + (1 + x | g) as:

[Screenshot: mathematical specification of Model 3]

Question 1: If you want to describe the slope of an as-yet-unobserved group, you would take samples from $\text{Normal}(\beta_1, p_{1\sigma})$? And you'd do that on a sample-by-sample basis, i.e. the $\beta_1$ and $p_{1\sigma}$ represent samples drawn from the posterior. It feels a bit clumsy, so I just wanted to check my thinking with you.

Question 2: What would determine if you should specify the distributions of $u$ as $\text{ZeroSumNormal}$, rather than $\text{Normal}$?

@tomicapretto

@drbenvincent

Question 1: If you want to describe the slope of an as-yet-unobserved group, you would take samples from Normal(β1, p_1)? And you'd do that on a sample-by-sample basis, i.e. the β1 and p_1 represent samples drawn from the posterior. It feels a bit clumsy, so I just wanted to check my thinking with you.

It's actually a bit more complicated than that. I think I should write a comprehensive guide for this in the future (which would help me to understand it better too). I always point people to these two links, which I find useful.

https://www.andrewheiss.com/blog/2021/11/10/ame-bayes-re-guide/
https://x.com/tjmahr/status/1458443667748818955
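
For reference, a sketch of the simple per-draw approach described in Question 1 (assumed variable names, and assuming `idata` is the InferenceData returned by pm.sample() on Model 3; as noted above, the full story is more subtle):

```python
import numpy as np

# posterior draws of the population-level slope mean and sd (assumed names)
beta1 = idata.posterior["beta1"].to_numpy().ravel()
p1_sigma = idata.posterior["p1_sigma"].to_numpy().ravel()

# one hypothetical new-group slope per posterior draw
rng = np.random.default_rng(42)
new_group_slopes = rng.normal(loc=beta1, scale=p1_sigma)
```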

Question 2: What would determine if you should specify the distributions of u as ZeroSumNormal, rather than Normal?

I'm not sure if you're asking how to predict for a new group with ZSN, or when one should use ZSN instead of a simple Normal. As usual, it depends.

Hopefully the simpler example below will help.
Imagine you have a continuous outcome in the reals and A groups.
We use a normal observational model like the following

$$
\begin{aligned}
Y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\
\mu_i &= \alpha + \beta_{j[i]} \\
\alpha &\sim \text{Normal} \\
\beta_j &\sim \text{Normal} \quad \forall j = 1, \dots, A \\
\sigma &\sim \text{Something}
\end{aligned}
$$

This is a model where we attempt to have a global mean $\alpha$ and a group-specific parameter $\beta_j$.
The posterior means of $\alpha$ and $\beta_j$ are non-identifiable as they're linearly dependent.

There are several things one can do to resolve that issue:

  1. Constrain $\alpha$ to zero, which one can think of as removing $\alpha$ from the model. In this case, the $\beta_j$ parameters represent the group means.
  2. Constrain one $\beta_j$ to zero, which is like removing one $\beta_j$. $\alpha$ becomes the mean of the group for which we removed/constrained the $\beta$ parameter. All the other $\beta_j$ are deflections between the reference group mean and the $j$-th group mean.
  3. Constrain the sum of the $\beta_j$ params to zero. $\alpha$ then represents the global mean and each $\beta_j$ a deflection around that mean (see the sketch below).

With any of the choices above we get equivalent models, but using different parametrizations. All of them are "no pooling" models.
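
A minimal PyMC sketch of option 3 (assumed priors and names), where pm.ZeroSumNormal enforces the sum-to-zero constraint so that $\alpha$ is identified as the global mean:

```python
import pymc as pm

group_labels = ["A", "B", "C"]  # assumed

with pm.Model(coords={"group": group_labels}) as sum_to_zero:
    alpha = pm.Normal("alpha", 0, 1)                        # global mean
    beta = pm.ZeroSumNormal("beta", sigma=1, dims="group")  # deflections, sum to zero
    sigma = pm.Gamma("sigma", 2, 2)
    # mu for observation i would be alpha + beta[group_idx[i]]
```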

Another way to "resolve" that issue is to use a hierarchical approach. One could write the model this way

$$
\begin{aligned}
Y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\
\mu_i &= \alpha + \beta_{j[i]} \\
\alpha &\sim \text{Normal} \\
\beta_j &\sim \text{Normal}(0, s^2) \quad \forall j = 1, \dots, A \\
s &\sim \text{Something} \\
\sigma &\sim \text{Something}
\end{aligned}
$$

or this other way

$$
\begin{aligned}
Y_i &\sim \text{Normal}(\mu_i, \sigma^2) \\
\mu_i &= \beta_{j[i]} \\
\beta_j &\sim \text{Normal}(m, s^2) \quad \forall j = 1, \dots, A \\
m &\sim \text{Normal} \\
s &\sim \text{Something} \\
\sigma &\sim \text{Something}
\end{aligned}
$$

where $m$ is the $\alpha$ from above.

With this hierarchical approach we are not removing linear dependencies like above. What happens here is a bit more sophisticated. The $\beta_j$ parameters are no longer independent because they're assumed to be draws from a normal population with a common mean ($m$) and standard deviation ($s$). But there's no "hard" constraint like when we set some parameter, or their sum, to zero. Think of it as a "soft" constraint.

However, as the sample size for each group $j$ increases, the posterior will converge to the posterior one would get from the first model above, without restrictions, which has non-identifiable means.

What do I want to say with all of this? That, theoretically, the hierarchy will make your parameters identifiable because they're no longer independent, and thus Normal should work. However, as the sample sizes for the groups grow, it will become harder and harder to get draws from the posterior because of the increasing correlation.
If you constrain the partially pooled $\beta_j$ parameters with a zero-sum constraint, that problem should disappear. However, that can create other issues (for example, how to get predictions for new groups). Adrian and Luciano solved that for some cases, but I'm not sure if they solved it for all cases.

See, for example, the debate here about zero-sum constraints https://discourse.mc-stan.org/t/new-stan-data-type-zero-sum-vector/26215/3.

It's a very interesting and challenging topic.

Finally, there's also this book https://www.amazon.com/Richly-Parameterized-Chapman-Statistical-Science/dp/0367533731 which covers some of these topics (identifiability) using the "degrees of freedom" concept. Long story short, when you have partially pooled parameters the degrees of freedom are a non-integer value. For the model above, it should be between those of the identifiable no-pooling model and the non-identifiable partial-pooling model.

Btw, I'm fairly familiar with this but I'm certainly not a true expert, so I recommend reading the book.

@drbenvincent

Superb @tomicapretto !

Because of time constraints, and considering the focus of the notebook, I think I will not tackle the population mean (or similar) issue in this notebook. I will take it as a win that this notebook started with just the pymc model code, but now makes a reasonable attempt to discuss the mapping onto the mathematical and Wilkinson descriptions.

Nevertheless I'll continue to learn more about this and either return to give more depth to this notebook or come up with an alternative. But in the meantime it sounds like you may well be coming up with your own definitive Capretto resource blog post or docs :)

@tomicapretto

Superb @tomicapretto !

Because of time constraints, and considering the focus of the notebook, I think I will not tackle the population mean (or similar) issue in this notebook. I will take it as a win that this notebook started with just the pymc model code, but now makes a reasonable attempt to discuss the mapping onto the mathematical and Wilkinson descriptions.

Nevertheless I'll continue to learn more about this and either return to give more depth to this notebook or come up with an alternative. But in the meantime it sounds like you may well be coming up with your own definitive Capretto resource blog post or docs :)

Haha thanks :D



review-notebook-app bot commented Sep 9, 2024


tomicapretto commented on 2024-09-09T19:05:47Z
----------------------------------------------------------------

For extra clarity, in the second paragraph, you could also say that it allows you to see the _conditional_ posterior predictive distribution.


drbenvincent commented on 2024-09-13T09:03:05Z
----------------------------------------------------------------

Done :)



@drbenvincent

Thanks @tomicapretto. I've made those changes.

@OriolAbril merged commit c6e35b9 into pymc-devs:main on Sep 14, 2024
2 checks passed
@drbenvincent deleted the simpsons branch on September 14, 2024 20:14