Moderation analysis example #142

Merged: 16 commits merged into pymc-devs:main from moderation_analysis on Jun 6, 2021

Conversation

@drbenvincent (Contributor)

Hi all
Here's a pull request adding an example notebook covering moderation analysis. I wasn't quite sure on the policy of where to store an embedded image, so I've just included one in the directory that the notebook is in. Just let me know if you need any changes made.


review-notebook-app bot commented Apr 6, 2021

ricardoV94 commented on 2021-04-06T12:34:05Z
----------------------------------------------------------------

Perhaps add a note at the end summarizing how mediation differs from moderation, instead of mentioning it here without further context?


drbenvincent commented on 2021-04-06T18:27:14Z
----------------------------------------------------------------

I've added some more info about mediation analysis now

review-notebook-app bot commented Apr 6, 2021

ricardoV94 commented on 2021-04-06T12:34:06Z
----------------------------------------------------------------

Is this a real dataset? If so, maybe just mention its source in 1 or 2 lines. If it's fake, it's useful to mention that as well.


drbenvincent commented on 2021-04-06T18:27:49Z
----------------------------------------------------------------

I've clarified the source of the data and that it is unclear if it was from a real research study or if it was simulated.

review-notebook-app bot commented Apr 6, 2021

ricardoV94 commented on 2021-04-06T12:34:06Z
----------------------------------------------------------------

If you happen to know of any, it could be useful to point to examples of more advanced moderation analysis in a Bayesian setting.


drbenvincent commented on 2021-04-06T18:28:13Z
----------------------------------------------------------------

Done. Added a bit more to the discussion part, as well as added a Further Reading section

@ricardoV94 (Member)

This is really neat! The visualizations with the moderator variable really speak for themselves.

I left some minor comments if you want to address them :)

@drbenvincent (Contributor, Author)

Thanks @ricardoV94, I should be able to deal with these points soon.

@fonnesbeck (Member)

This is great, thank you. The data import should probably use pm.get_data, rather than a relative path string.

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:45Z
----------------------------------------------------------------

This should be split in two, with the random seed and style in the 2nd cell. It also should not use the numpy global random seed but the new [Generator](https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator) instead. We want to follow the recommendations in the [numpy docs](https://numpy.org/doc/stable/reference/random/index.html#quick-start) about random number generation.
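For reference, the numpy-recommended pattern looks roughly like this (a minimal sketch; the seed value is arbitrary):

import numpy as np

# create a local Generator instead of seeding numpy's global state
rng = np.random.default_rng(42)
samples = rng.normal(size=100)  # draws come from this generator, not the global state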


drbenvincent commented on 2021-04-06T18:50:45Z
----------------------------------------------------------------

If I'm not actually generating random numbers in the notebook, and just want to use a seed to get consistent inference results, then I guess I can remove this and just pass in random_seed=42 as a kwarg to pm.sample()?

(Thanks for all the comments - I'll work through them)

OriolAbril commented on 2021-04-06T21:57:31Z
----------------------------------------------------------------

Yes, passing the kwarg to sample is the ideal practice for that.
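For concreteness, a minimal sketch of that (assuming a model context named model):

with model:
    result = pm.sample(random_seed=42, return_inferencedata=True)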

drbenvincent commented on 2021-04-07T15:44:43Z
----------------------------------------------------------------

Done

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:46Z
----------------------------------------------------------------

As Chris was saying, data loading should use a try/except: first try to load the local file if present (which it will be if the user cloned the pymc-examples repo), otherwise download it from GitHub (pm.get_data does that). There is one example of that in https://nbviewer.jupyter.org/github/pymc-devs/pymc-examples/blob/main/examples/case_studies/rugby_analytics.ipynb
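A sketch of that pattern (the CSV file name here is hypothetical; substitute the notebook's actual data file):

import pandas as pd
import pymc3 as pm

try:
    data = pd.read_csv("../data/moderation_analysis.csv")  # local file from a repo clone
except FileNotFoundError:
    data = pd.read_csv(pm.get_data("moderation_analysis.csv"))  # fall back to downloading from GitHub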


drbenvincent commented on 2021-04-07T15:45:20Z
----------------------------------------------------------------

Done. I've followed the example, so I believe I'm doing this correctly. Let me know if not.

OriolAbril commented on 2021-04-07T17:35:29Z
----------------------------------------------------------------

Looks good, thanks!

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:46Z
----------------------------------------------------------------

As a general comment, I personally prefer having this code together with the output instead of writing functions that are only called once. I don't know how other people feel about that. I probably have a higher tendency to read (and like) plotting code than the average and therefore like having both code and output as close as possible and not having to worry about local and global variables.

Regarding the model itself, I would use pm.Data for x and m in order to have them as variables in the resulting InferenceData, converted to xarray automatically. It will make plotting and postprocessing much easier. Option 1:

with pm.Model(coords={"obs_id": np.arange(len(x))}) as model:
    x_ = pm.Data("x", x, dims="obs_id")
    m_ = pm.Data("m", m, dims="obs_id")
    # priors
    β0 = pm.Normal("β0", mu=0, sd=10)
    β1 = pm.Normal("β1", mu=0, sd=10)
    β2 = pm.Normal("β2", mu=0, sd=10)
    β3 = pm.Normal("β3", mu=0, sd=10)
    σ = pm.HalfCauchy("σ", 1)
    # likelihood (named y_ so the data array y isn't shadowed)
    y_ = pm.Normal("y", mu=β0 + (β1 * x_) + (β2 * x_ * m_) + (β3 * m_), sd=σ, observed=y, dims="obs_id")

Option 2: leave the model as is and modify:

result = pm.sample(
    1000,
    tune=2000,
    target_accept=0.9,
    return_inferencedata=True,
    idata_kwargs={"dims": {"x": ["obs_id"], "m": ["obs_id"], "y": ["obs_id"]}},
)

drbenvincent commented on 2021-04-07T14:48:40Z
----------------------------------------------------------------

Not having it in a function results in y getting over-written as a pymc3.model.ObservedRV variable which causes a problem when you try to use y in posterior prediction plots later on. So you have to name it y_ and it starts to get unclear. Personally I prefer keeping all the PyMC3 variables out of the global scope - but I'll follow this if it's the house style.

drbenvincent commented on 2021-04-07T15:46:31Z
----------------------------------------------------------------

Partially completed. Working through the coords={"obs_id": np.arange(len(x))} part to make sure I understand it properly.

OriolAbril commented on 2021-04-07T17:48:52Z
----------------------------------------------------------------

> Not having it in a function results in y getting over-written as a pymc3.model.ObservedRV variable which causes a problem when you try to use y in posterior prediction plots later on. So you have to name it y_ and it starts to get unclear

Makes sense; the model creation can be kept inside a function, moving only the call to sample outside. We call that approach the model_factory one, in that the function takes data as input and returns the model. Something like:

def model_factory(x, m, y):
    # model
    ...

model = model_factory(x, m, y)
with model:
    result = pm.sample()

This has some interesting advantages when it comes to posterior predictive sampling, generating predictions or doing cross-validation, and should keep the variables out of the global scope. As a general comment, for the observed variable, if you are not planning on using it anywhere (same for deterministics that are not used afterwards and only used to track a variable) you can use _ = pm.RandomVar(...) to avoid having a variable storing it.

Regarding coordinates, these are used at two specific moments only. When building the model and using dims instead of shape, pymc3 will look into the coords dict and replace the dims by the corresponding shape, so for that we don't really care about the coordinate values, only about their length. When converting to InferenceData, both dims and coords are passed to xarray so they persist and are stored in the resulting InferenceData object. You can then use xarray to take care of broadcasting and alignment in all subsequent computation, similar to pandas, but in n-d, not only for tabular data.
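As an illustration of the payoff (a sketch, assuming the model stores x via pm.Data as in option 1 above):

post = result.posterior  # xarray Dataset with named dims (chain, draw, ...)
x_data = result.constant_data["x"]  # dims: obs_id
# xarray aligns on dimension names, so this broadcasts to (chain, draw, obs_id)
# with no np.newaxis bookkeeping
mu = post["β0"] + post["β1"] * x_data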

drbenvincent commented on 2021-04-13T15:22:18Z
----------------------------------------------------------------

I've opted for the factory pattern, which I quite like.

So far I've held off adding the obs_id stuff until the comment down below about xarray is decided on. Without that, it seems like an addition that might confuse a beginner.

drbenvincent commented on 2021-05-02T12:34:11Z
----------------------------------------------------------------

I've now implemented option 2

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:47Z
----------------------------------------------------------------

I think that

param_names = ["β1", "β2", "β3"]
az.plot_posterior(result, var_names=param_names, figsize=(14, 4));

will give the same result. If it doesn't, this will:

param_names = ["β1", "β2", "β3"]
fig, ax = plt.subplots(1, len(param_names), figsize=(14, 4))
az.plot_posterior(result, var_names=param_names, ax=ax);

drbenvincent commented on 2021-05-05T14:53:53Z
----------------------------------------------------------------

This was done a while back.

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:47Z
----------------------------------------------------------------

I think this function should either be removed completely or reduced to subset the data to use only some random draws.


drbenvincent commented on 2021-05-05T14:53:38Z
----------------------------------------------------------------

This was removed in the last commit

review-notebook-app bot commented Apr 6, 2021

OriolAbril commented on 2021-04-06T17:18:48Z
----------------------------------------------------------------

Provided the model is modified as indicated above:

post = result.posterior
xi = xr.DataArray(np.linspace(np.min(x), np.max(x), 20), dims=["x_plot"])
m_levels = result.constant_data["m"].quantile(quantile_list).rename({"quantile": "m_level"})

_y = post.β0 + post.β1 * xi + post.β2 * xi * m_levels + post.β3 * m_levels
region = _y.quantile([0.025, 0.5, 0.975], dim=("chain", "draw"))
for q in quantile_list:
    region_q = region.sel(m_level=q)
    ax.fill_between(
        xi,
        region_q.sel(quantile=0.025),  # or directly region.sel(m_level=q, quantile=0.025),
        ...

xarray will take care of all the broadcasting and alignment of arrays, and we can forget about np.newaxis and the like. A similar thing is probably possible below; I'll try to get there with more specific guidance whenever I have more time. Also, I have not tried the code above, so there may be some minor changes needed, but I'd say the probability it runs as is is high.

Side note: even if it's in a function defined inside a notebook that is only called once, I don't think we should use mutable objects like lists as default arguments.
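To illustrate the pitfall (a generic sketch; the function name is hypothetical):

# risky: the default list is created once at definition time and shared across calls
def plot_regions(ax, quantile_list=[0.025, 0.5, 0.975]):
    ...

# safer: use None as a sentinel and build the list inside the function
def plot_regions(ax, quantile_list=None):
    if quantile_list is None:
        quantile_list = [0.025, 0.5, 0.975]
    ...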



drbenvincent commented on 2021-04-13T15:19:19Z
----------------------------------------------------------------

So I had a go with this. I don't think it's as simple as it might seem.

After updating the model in the way suggested above, the broadcasting in _y = post.β0 + post.β1 * xi + post.β2 * xi * m_levels + post.β3 * m_levels doesn't work the way you want it to. The parameters result.posterior.β0 etc. are shape (n_chains, n_samples) and need to be squeezed. So the apparently unsightly code in extract_samples might not be escapable. Then again, I am new to xarray, so maybe there's a nice way to do it. But I don't think extract_samples is that horrific, to be honest. Let me know what you think.

Currently I've left things as they are, but I can persist with this approach if you think it's a deal breaker.

OriolAbril commented on 2021-04-13T17:57:23Z
----------------------------------------------------------------

If the chain and draw dimensions need to be flattened, you can use post = result.posterior.stack(sample=("chain", "draw")), which you can then subsample with random indexes with post.isel(sample=rng....), independently of the number of variables in the post dataset, the dimensions and order of the dimensions in each variable in post, or whatever dataarray/dataset is generated in postprocessing. Note that you can also get the length of the dimension by name with len(post.sample).

I think that precisely for beginners, converting extract_samples to two lines of code that explicitly use the dimension names and don't need np.newaxis nor : will be extremely beneficial. I love np.newaxis now and can go to great lengths with advanced indexing, einsums and whatever is needed to avoid loops, but I don't think this is beginner friendly at all; it is only friendly to people comfortable with numpy/scipy/pandas who don't yet know pymc3. I even have difficulties understanding my own code years or months later in some of these loop-avoiding cases.
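Concretely, the stack-and-subsample pattern described above might look like this (a sketch; seed and subset size are arbitrary):

post = result.posterior.stack(sample=("chain", "draw"))  # flatten chain and draw
rng = np.random.default_rng(42)
idx = rng.choice(len(post.sample), size=50, replace=False)  # 50 random draws
subset = post.isel(sample=idx)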


review-notebook-app bot commented Apr 7, 2021

ricardoV94 commented on 2021-04-07T14:13:17Z
----------------------------------------------------------------

In the first paragraph, shouldn't it be "(the moderator)" and "(the moderating variable)"?


drbenvincent commented on 2021-04-07T15:44:27Z
----------------------------------------------------------------

Whoops! Fixed.

review-notebook-app bot commented Apr 7, 2021

ricardoV94 commented on 2021-04-07T14:13:18Z
----------------------------------------------------------------

Perhaps you can refer here to the floodlight graph by name since it appears in the discussion.

drbenvincent commented on 2021-04-07T15:48:25Z
----------------------------------------------------------------

Turns out there are internal inconsistencies in McClelland et al (2016). Looks like it is actually a spotlight graph, but I have implemented the suggestion. Note that this refers to the plot below, not the main x, y plot, so the edits are below this point.


@drbenvincent (Contributor, Author)

Excuse the slight delay... I had to get a couple of papers submitted. I believe I've now implemented all the changes asked for. The last changes were to provide idata_kwargs when sampling, which enabled use of xarray, which meant I could kill a utility function to extract samples and has led to smoother plotting code. Let me know if there's anything else :)

review-notebook-app bot commented May 4, 2021

OriolAbril commented on 2021-05-04T11:37:50Z
----------------------------------------------------------------

The name (the string passed to the distribution) of y_ is "y_". I believe it should be "y", to match the coords.


drbenvincent commented on 2021-05-05T15:23:34Z
----------------------------------------------------------------

Done :)

review-notebook-app bot commented May 4, 2021

OriolAbril commented on 2021-05-04T11:37:51Z
----------------------------------------------------------------

Can you also add theano to the watermark? We recently added it to https://github.com/pymc-devs/pymc3/wiki/PyMC3-Jupyter-Notebook-Style-Guide as it is troublesome to rerun the notebook without knowing it in some cases.
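For reference, a watermark cell along those lines might look like this (a sketch; the style guide's exact flag set may differ):

%load_ext watermark
%watermark -n -u -v -iv -w -p theano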


drbenvincent commented on 2021-05-05T15:23:41Z
----------------------------------------------------------------

Done :)


@drbenvincent (Contributor, Author)

Thanks for those points @OriolAbril. Unless anyone has any further things, I think that might be it. 🤞🏼

@OriolAbril (Member) left a comment

Thanks for the PR! I'll happily add this notebook directly to the "Best Practices" column

@OriolAbril OriolAbril merged commit 39a8954 into pymc-devs:main Jun 6, 2021
@drbenvincent drbenvincent deleted the moderation_analysis branch June 7, 2021 09:43