From fce3bd84d294e3d3b9ec9d1f28b0ce7db65bc71b Mon Sep 17 00:00:00 2001 From: Thomas Wiecki Date: Sat, 24 Dec 2022 10:19:20 +0100 Subject: [PATCH] Remove jupytext and myst_nbs dir. --- .pre-commit-config.yaml | 12 - .../case_studies/BART_introduction.myst.md | 210 --- myst_nbs/case_studies/BEST.myst.md | 223 --- myst_nbs/case_studies/GEV.myst.md | 241 ---- myst_nbs/case_studies/LKJ.myst.md | 331 ----- .../bayesian_ab_testing_introduction.myst.md | 695 --------- myst_nbs/case_studies/binning.myst.md | 952 ------------- .../blackbox_external_likelihood.myst.md | 667 --------- ...blackbox_external_likelihood_numpy.myst.md | 473 ------ .../conditional-autoregressive-model.myst.md | 772 ---------- myst_nbs/case_studies/factor_analysis.myst.md | 348 ----- .../hierarchical_partial_pooling.myst.md | 182 --- .../case_studies/item_response_nba.myst.md | 529 ------- .../case_studies/mediation_analysis.myst.md | 241 ---- .../case_studies/moderation_analysis.myst.md | 383 ----- .../case_studies/multilevel_modeling.myst.md | 1199 ---------------- ...probabilistic_matrix_factorization.myst.md | 783 ---------- .../case_studies/putting_workflow.myst.md | 871 ----------- .../reinforcement_learning.myst.md | 548 ------- myst_nbs/case_studies/rugby_analytics.myst.md | 523 ------- myst_nbs/case_studies/spline.myst.md | 297 ---- .../stochastic_volatility.myst.md | 220 --- .../wrapping_jax_function.myst.md | 767 ---------- .../difference_in_differences.myst.md | 452 ------ .../causal_inference/excess_deaths.myst.md | 503 ------- .../interrupted_time_series.myst.md | 363 ----- .../regression_discontinuity.myst.md | 244 ---- .../Bayes_factor.myst.md | 359 ----- ..._biased_Inference_with_Divergences.myst.md | 579 -------- .../model_averaging.myst.md | 393 ----- .../sampler-stats.myst.md | 203 --- .../gaussian_processes/GP-Circular.myst.md | 212 --- .../GP-Heteroskedastic.myst.md | 538 ------- myst_nbs/gaussian_processes/GP-Kron.myst.md | 306 ---- myst_nbs/gaussian_processes/GP-Latent.myst.md | 444 ------ .../gaussian_processes/GP-Marginal.myst.md | 302 ---- .../gaussian_processes/GP-MaunaLoa.myst.md | 604 -------- .../gaussian_processes/GP-MaunaLoa2.myst.md | 947 ------------ .../GP-MeansAndCovs.myst.md | 1268 ----------------- .../GP-SparseApprox.myst.md | 185 --- .../gaussian_processes/GP-TProcess.myst.md | 156 -- .../gaussian_processes/GP-smoothing.myst.md | 177 --- .../MOGP-Coregion-Hadamard.myst.md | 359 ----- .../gaussian_process.myst.md | 190 --- .../log-gaussian-cox-process.myst.md | 295 ---- .../GLM-binomial-regression.myst.md | 260 ---- .../GLM-hierarchical-binomial-model.myst.md | 283 ---- .../GLM-model-selection.myst.md | 621 -------- .../GLM-negative-binomial-regression.myst.md | 240 ---- .../GLM-out-of-sample-predictions.myst.md | 297 ---- .../GLM-poisson-regression.myst.md | 535 ------- .../GLM-robust-with-outlier-detection.myst.md | 943 ------------ .../GLM-robust.myst.md | 234 --- .../GLM-rolling-regression.myst.md | 265 ---- .../GLM-simpsons-paradox.myst.md | 559 -------- .../GLM-truncated-censored-regression.myst.md | 384 ----- myst_nbs/howto/api_quickstart.myst.md | 457 ------ myst_nbs/howto/custom_distribution.myst.md | 335 ----- myst_nbs/howto/data_container.myst.md | 329 ----- myst_nbs/howto/howto_debugging.myst.md | 198 --- myst_nbs/howto/lasso_block_update.myst.md | 124 -- myst_nbs/howto/profiling.myst.md | 62 - myst_nbs/howto/sampling_callback.myst.md | 100 -- myst_nbs/howto/sampling_compound_step.myst.md | 197 --- .../howto/sampling_conjugate_step.myst.md | 251 ---- 
myst_nbs/howto/updating_priors.myst.md | 171 --- .../dependent_density_regression.myst.md | 291 ---- .../dirichlet_mixture_of_multinomials.myst.md | 568 -------- myst_nbs/mixture_models/dp_mix.myst.md | 618 -------- .../gaussian_mixture_model.myst.md | 124 -- ...arginalized_gaussian_mixture_model.myst.md | 239 ---- .../ode_models/ODE_API_introduction.myst.md | 295 ---- .../ODE_API_shapes_and_benchmarking.myst.md | 198 --- .../ODE_with_manual_gradients.myst.md | 634 --------- ...DEMetropolisZ_EfficiencyComparison.myst.md | 250 ---- .../DEMetropolisZ_tune_drop_fraction.myst.md | 235 --- .../samplers/GLM-hierarchical-jax.myst.md | 130 -- .../samplers/MLDA_gravity_surveying.myst.md | 863 ----------- myst_nbs/samplers/MLDA_introduction.myst.md | 70 - .../MLDA_simple_linear_regression.myst.md | 211 --- ...riance_reduction_linear_regression.myst.md | 443 ------ .../SMC-ABC_Lotka-Volterra_example.myst.md | 226 --- myst_nbs/samplers/SMC2_gaussians.myst.md | 213 --- .../bayes_param_survival_pymc3.myst.md | 420 ------ .../survival_analysis/censored_data.myst.md | 244 ---- .../survival_analysis.myst.md | 512 ------- .../survival_analysis/weibull_aft.myst.md | 182 --- myst_nbs/time_series/AR.myst.md | 187 --- ...ers-Prophet_with_Bayesian_workflow.myst.md | 376 ----- .../Euler-Maruyama_and_SDEs.myst.md | 363 ----- ...casting_with_structural_timeseries.myst.md | 763 ---------- .../MvGaussianRandomWalk_demo.myst.md | 241 ---- .../time_series/bayesian_var_model.myst.md | 769 ---------- .../GLM-hierarchical-advi-minibatch.myst.md | 176 --- .../bayesian_neural_network_advi.myst.md | 353 ----- .../convolutional_vae_keras_advi.myst.md | 409 ------ .../empirical-approx-overview.myst.md | 171 --- .../gaussian-mixture-model-advi.myst.md | 312 ---- .../lda-advi-aevb.myst.md | 423 ------ .../normalizing_flows_overview.myst.md | 468 ------ .../variational_inference/pathfinder.myst.md | 97 -- .../variational_api_quickstart.myst.md | 539 ------- 102 files changed, 40004 deletions(-) delete mode 100644 myst_nbs/case_studies/BART_introduction.myst.md delete mode 100644 myst_nbs/case_studies/BEST.myst.md delete mode 100644 myst_nbs/case_studies/GEV.myst.md delete mode 100644 myst_nbs/case_studies/LKJ.myst.md delete mode 100644 myst_nbs/case_studies/bayesian_ab_testing_introduction.myst.md delete mode 100644 myst_nbs/case_studies/binning.myst.md delete mode 100644 myst_nbs/case_studies/blackbox_external_likelihood.myst.md delete mode 100644 myst_nbs/case_studies/blackbox_external_likelihood_numpy.myst.md delete mode 100644 myst_nbs/case_studies/conditional-autoregressive-model.myst.md delete mode 100644 myst_nbs/case_studies/factor_analysis.myst.md delete mode 100644 myst_nbs/case_studies/hierarchical_partial_pooling.myst.md delete mode 100644 myst_nbs/case_studies/item_response_nba.myst.md delete mode 100644 myst_nbs/case_studies/mediation_analysis.myst.md delete mode 100644 myst_nbs/case_studies/moderation_analysis.myst.md delete mode 100644 myst_nbs/case_studies/multilevel_modeling.myst.md delete mode 100644 myst_nbs/case_studies/probabilistic_matrix_factorization.myst.md delete mode 100644 myst_nbs/case_studies/putting_workflow.myst.md delete mode 100644 myst_nbs/case_studies/reinforcement_learning.myst.md delete mode 100644 myst_nbs/case_studies/rugby_analytics.myst.md delete mode 100644 myst_nbs/case_studies/spline.myst.md delete mode 100644 myst_nbs/case_studies/stochastic_volatility.myst.md delete mode 100644 myst_nbs/case_studies/wrapping_jax_function.myst.md delete mode 100644 
myst_nbs/causal_inference/difference_in_differences.myst.md delete mode 100644 myst_nbs/causal_inference/excess_deaths.myst.md delete mode 100644 myst_nbs/causal_inference/interrupted_time_series.myst.md delete mode 100644 myst_nbs/causal_inference/regression_discontinuity.myst.md delete mode 100644 myst_nbs/diagnostics_and_criticism/Bayes_factor.myst.md delete mode 100644 myst_nbs/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md delete mode 100644 myst_nbs/diagnostics_and_criticism/model_averaging.myst.md delete mode 100644 myst_nbs/diagnostics_and_criticism/sampler-stats.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-Circular.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-Heteroskedastic.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-Kron.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-Latent.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-Marginal.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-MaunaLoa.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-MaunaLoa2.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-MeansAndCovs.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-SparseApprox.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-TProcess.myst.md delete mode 100644 myst_nbs/gaussian_processes/GP-smoothing.myst.md delete mode 100644 myst_nbs/gaussian_processes/MOGP-Coregion-Hadamard.myst.md delete mode 100644 myst_nbs/gaussian_processes/gaussian_process.myst.md delete mode 100644 myst_nbs/gaussian_processes/log-gaussian-cox-process.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-binomial-regression.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-hierarchical-binomial-model.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-model-selection.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-negative-binomial-regression.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-poisson-regression.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-robust-with-outlier-detection.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-robust.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-rolling-regression.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-simpsons-paradox.myst.md delete mode 100644 myst_nbs/generalized_linear_models/GLM-truncated-censored-regression.myst.md delete mode 100644 myst_nbs/howto/api_quickstart.myst.md delete mode 100644 myst_nbs/howto/custom_distribution.myst.md delete mode 100644 myst_nbs/howto/data_container.myst.md delete mode 100644 myst_nbs/howto/howto_debugging.myst.md delete mode 100644 myst_nbs/howto/lasso_block_update.myst.md delete mode 100644 myst_nbs/howto/profiling.myst.md delete mode 100644 myst_nbs/howto/sampling_callback.myst.md delete mode 100644 myst_nbs/howto/sampling_compound_step.myst.md delete mode 100644 myst_nbs/howto/sampling_conjugate_step.myst.md delete mode 100644 myst_nbs/howto/updating_priors.myst.md delete mode 100644 myst_nbs/mixture_models/dependent_density_regression.myst.md delete mode 100644 myst_nbs/mixture_models/dirichlet_mixture_of_multinomials.myst.md delete mode 100644 myst_nbs/mixture_models/dp_mix.myst.md delete mode 100644 myst_nbs/mixture_models/gaussian_mixture_model.myst.md delete mode 100644 myst_nbs/mixture_models/marginalized_gaussian_mixture_model.myst.md delete mode 
100644 myst_nbs/ode_models/ODE_API_introduction.myst.md delete mode 100644 myst_nbs/ode_models/ODE_API_shapes_and_benchmarking.myst.md delete mode 100644 myst_nbs/ode_models/ODE_with_manual_gradients.myst.md delete mode 100644 myst_nbs/samplers/DEMetropolisZ_EfficiencyComparison.myst.md delete mode 100644 myst_nbs/samplers/DEMetropolisZ_tune_drop_fraction.myst.md delete mode 100644 myst_nbs/samplers/GLM-hierarchical-jax.myst.md delete mode 100644 myst_nbs/samplers/MLDA_gravity_surveying.myst.md delete mode 100644 myst_nbs/samplers/MLDA_introduction.myst.md delete mode 100644 myst_nbs/samplers/MLDA_simple_linear_regression.myst.md delete mode 100644 myst_nbs/samplers/MLDA_variance_reduction_linear_regression.myst.md delete mode 100644 myst_nbs/samplers/SMC-ABC_Lotka-Volterra_example.myst.md delete mode 100644 myst_nbs/samplers/SMC2_gaussians.myst.md delete mode 100644 myst_nbs/survival_analysis/bayes_param_survival_pymc3.myst.md delete mode 100644 myst_nbs/survival_analysis/censored_data.myst.md delete mode 100644 myst_nbs/survival_analysis/survival_analysis.myst.md delete mode 100644 myst_nbs/survival_analysis/weibull_aft.myst.md delete mode 100644 myst_nbs/time_series/AR.myst.md delete mode 100644 myst_nbs/time_series/Air_passengers-Prophet_with_Bayesian_workflow.myst.md delete mode 100644 myst_nbs/time_series/Euler-Maruyama_and_SDEs.myst.md delete mode 100644 myst_nbs/time_series/Forecasting_with_structural_timeseries.myst.md delete mode 100644 myst_nbs/time_series/MvGaussianRandomWalk_demo.myst.md delete mode 100644 myst_nbs/time_series/bayesian_var_model.myst.md delete mode 100644 myst_nbs/variational_inference/GLM-hierarchical-advi-minibatch.myst.md delete mode 100644 myst_nbs/variational_inference/bayesian_neural_network_advi.myst.md delete mode 100644 myst_nbs/variational_inference/convolutional_vae_keras_advi.myst.md delete mode 100644 myst_nbs/variational_inference/empirical-approx-overview.myst.md delete mode 100644 myst_nbs/variational_inference/gaussian-mixture-model-advi.myst.md delete mode 100644 myst_nbs/variational_inference/lda-advi-aevb.myst.md delete mode 100644 myst_nbs/variational_inference/normalizing_flows_overview.myst.md delete mode 100644 myst_nbs/variational_inference/pathfinder.myst.md delete mode 100644 myst_nbs/variational_inference/variational_api_quickstart.myst.md diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 38e80e682..82ab70610 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -1,10 +1,4 @@ repos: -- repo: https://github.com/mwouts/jupytext - rev: v1.13.7 - hooks: - - id: jupytext - files: ^examples/.+\.ipynb$ - args: ["--sync"] - repo: https://github.com/psf/black rev: 22.3.0 hooks: @@ -103,12 +97,6 @@ repos: docs.scipy.org/doc) language: pygrep types_or: [markdown, rst, jupyter] -- repo: https://github.com/mwouts/jupytext - rev: v1.13.7 - hooks: - - id: jupytext - files: ^examples/.+\.ipynb$ - args: ["--sync"] - repo: https://github.com/codespell-project/codespell rev: v2.1.0 hooks: diff --git a/myst_nbs/case_studies/BART_introduction.myst.md b/myst_nbs/case_studies/BART_introduction.myst.md deleted file mode 100644 index 8ac2d8eb5..000000000 --- a/myst_nbs/case_studies/BART_introduction.myst.md +++ /dev/null @@ -1,210 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(BART_introduction)= -# 
Bayesian Additive Regression Trees: Introduction -:::{post} Dec 21, 2021 -:tags: BART, non-parametric, regression -:category: intermediate, explanation -:author: Osvaldo Martin -::: - -```{code-cell} ipython3 -from pathlib import Path - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import pymc_bart as pmb - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 5781 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## BART overview - -+++ - -Bayesian additive regression trees (BART) is a non-parametric regression approach. If we have some covariates $X$ and we want to use them to model $Y$, a BART model (omitting the priors) can be represented as: - -$$Y = f(X) + \epsilon$$ - -where we use a sum of $m$ [regression trees](https://en.wikipedia.org/wiki/Decision_tree_learning) to model $f$, and $\epsilon$ is some noise. In the most typical examples $\epsilon$ is normally distributed, $\mathcal{N}(0, \sigma)$. So we can also write: - -$$Y \sim \mathcal{N}(\mu=BART(X), \sigma)$$ - -In principle nothing restricts us to use a sum of trees to model other relationship. For example we may have: - -$$Y \sim \text{Poisson}(\mu=BART(X))$$ - -One of the reason BART is Bayesian is the use of priors over the regression trees. The priors are defined in such a way that they favor shallow trees with leaf values close to zero. A key idea is that a single BART-tree is not very good at fitting the data but when we sum many of these trees we get a good and flexible approximation. - -+++ - -## Coal mining with BART - -To better understand BART in practice we are going to use the oldie but goldie coal mining disaster dataset. One of the classic examples in PyMC. Instead of thinking this problem as a switch-point model with two Poisson distribution, as in the original PyMC example. We are going to think this problem as a non-parametric regression with a Poisson response (this is usually discussed in terms of [Poisson processes](https://en.wikipedia.org/wiki/Poisson_point_process) or [Cox processes](https://en.wikipedia.org/wiki/Cox_process), but we are OK without going into those technicalities). For a similar example but with Gaussian processes see [1](https://github.com/aloctavodia/BAP/blob/master/code/Chp7/07_Gaussian%20process.ipynb) or [2](https://research.cs.aalto.fi/pml/software/gpstuff/demo_lgcp.shtml). Because our data is just a single column with dates, we need to do some pre-processing. We are going to discretize the data, just as if we were building a histogram. We are going to use the centers of the bins as the variable $X$ and the counts per bin as the variable $Y$ - -```{code-cell} ipython3 -try: - coal = np.loadtxt(Path("..", "data", "coal.csv")) -except FileNotFoundError: - coal = np.loadtxt(pm.get_data("coal.csv")) -``` - -```{code-cell} ipython3 -# discretize data -years = int(coal.max() - coal.min()) -bins = years // 4 -hist, x_edges = np.histogram(coal, bins=bins) -# compute the location of the centers of the discretized data -x_centers = x_edges[:-1] + (x_edges[1] - x_edges[0]) / 2 -# xdata needs to be 2D for BART -x_data = x_centers[:, None] -# express data as the rate number of disaster per year -y_data = hist / 4 -``` - -In PyMC a BART variable can be defined very similar to other random variables. One important difference is that we have to pass ours Xs and Ys to the BART variable. 
Here we are also making explicit that we are going to use a sum over 20 trees (`m=20`). Low number of trees like 20 could be good enough for simple models like this and could also work very good as a quick approximation for more complex models in particular during the iterative or explorative phase of modeling. In those cases once we have more certainty about the model we really like we can improve the approximation by increasing `m`, in the literature is common to find reports of good results with numbers like 50, 100 or 200. - -```{code-cell} ipython3 -with pm.Model() as model_coal: - μ_ = pmb.BART("μ_", X=x_data, Y=y_data, m=20) - μ = pm.Deterministic("μ", np.abs(μ_)) - y_pred = pm.Poisson("y_pred", mu=μ, observed=y_data) - idata_coal = pm.sample(random_seed=RANDOM_SEED) -``` - -The white line in the following plot shows the median rate of accidents. The darker orange band represent the HDI 50% and the lighter one the 94%. We can see a rapid decrease of coal accidents between 1880 and 1900. Feel free to compare these results with those in the original {ref}`pymc:pymc_overview` example. - -```{code-cell} ipython3 -_, ax = plt.subplots(figsize=(10, 6)) - -rates = idata_coal.posterior["μ"] -rate_mean = idata_coal.posterior["μ"].mean(dim=["draw", "chain"]) -ax.plot(x_centers, rate_mean, "w", lw=3) -az.plot_hdi(x_centers, rates, smooth=False) -az.plot_hdi(x_centers, rates, hdi_prob=0.5, smooth=False, plot_kwargs={"alpha": 0}) -ax.plot(coal, np.zeros_like(coal) - 0.5, "k|") -ax.set_xlabel("years") -ax.set_ylabel("rate"); -``` - -In the previous plot the white line is the median over 4000 posterior draws, and each one of those posterior draws is a sum over `m=20` trees. - - -The following figure shows two samples from the posterior of $\mu$. We can see that these functions are not smooth. This is fine and is a direct consequence of using regression trees. Trees can be seen as a way to represent stepwise functions, and a sum of stepwise functions is just another stepwise function. Thus, when using BART we just need to know that we are assuming that a stepwise function is a good enough approximation for our problem. In practice this is often the case because we sum over many trees, usually values like 50, 100 or 200. Additionally, we often average over the posterior distribution. All this makes the "steps smoother", even when we never really have an smooth function as for example with Gaussian processes (splines). A nice theoretical result, tells us that in the limit of $m \to \infty$ the BART prior converges to a [nowheredifferentiable](https://en.wikipedia.org/wiki/Weierstrass_function) Gaussian process. - -The following figure shows two samples of $\mu$ from the posterior. - -```{code-cell} ipython3 -plt.step(x_data, idata_coal.posterior["μ"].sel(chain=0, draw=[3, 10]).T); -``` - -The next figure shows 3 trees. As we can see these are very simple function and definitely not very good approximators by themselves. Inspecting individuals trees is generally not necessary when working with BART, we are showing them just so we can gain further intuition on the inner workins of BART. - -```{code-cell} ipython3 -bart_trees = μ_.owner.op.all_trees -for i in [0, 1, 2]: - plt.step(x_data[:, 0], [bart_trees[0][i].predict(x) for x in x_data]) -``` - -## Biking with BART - -+++ - -To explore other features offered by BART in PyMC. We are now going to move on to a different example. 
In this example we have data about the number of bike rentals in a city, and we have chosen four covariates: the hour of the day, the temperature, the humidity, and whether it is a working day or a weekend. This dataset is a subset of the [bike_sharing_dataset](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset). - -```{code-cell} ipython3 -try: - bikes = pd.read_csv(Path("..", "data", "bikes.csv")) -except FileNotFoundError: - bikes = pd.read_csv(pm.get_data("bikes.csv")) - -X = bikes[["hour", "temperature", "humidity", "workingday"]] -Y = bikes["count"] -``` - -```{code-cell} ipython3 -with pm.Model() as model_bikes: - α = pm.Exponential("α", 1 / 10) - μ = pmb.BART("μ", X, Y) - y = pm.NegativeBinomial("y", mu=np.abs(μ), alpha=α, observed=Y) - idata_bikes = pm.sample(random_seed=RANDOM_SEED) -``` - -### Partial dependence plots - -+++ - -To help us interpret the results of our model we are going to use partial dependence plots. This type of plot shows the marginal effect that one covariate has on the predicted variable. That is, the effect that a covariate $X_i$ has on $Y$ while we average over all the other covariates ($X_j, \forall j \not = i$). Partial dependence plots are not exclusive to BART, but they are often used in the BART literature. PyMC-BART provides a utility function to make this plot from the inference data. - -```{code-cell} ipython3 -pmb.plot_dependence(μ, X=X, Y=Y, grid=(2, 2), var_discrete=[3]); -``` - -From this plot we can see the main effect of each covariate on the predicted value. This is very useful because we can recover complex relationships beyond monotonic increasing or decreasing effects. For example, for the `hour` covariate we can see two peaks, around 8 and 17 hours, and a minimum at midnight. - -When interpreting partial dependence plots we should be careful about their underlying assumptions. First, we are assuming the variables are independent. For example, when computing the effect of `hour` we have to marginalize the effect of `temperature`, which means that to compute the partial dependence value at `hour=0` we include all observed values of temperature, including temperatures that are never actually observed at midnight, given that lower temperatures are more likely than higher ones. Second, we are seeing only averages, so if for a covariate half the values are positively associated with the predicted variable and the other half negatively associated, the partial dependence plot will be flat, as their contributions cancel each other out. This problem can be addressed by using individual conditional expectation plots, `pmb.plot_dependence(..., kind="ice")`. Notice that all these assumptions are assumptions of the partial dependence plot, not of our model! In fact, BART can easily accommodate interactions between variables (although the prior in BART regularizes high-order interactions). For more on interpreting machine learning models you could check the "Interpretable Machine Learning" book {cite:p}`molnar2019`. - -Finally, like with other regression methods, we should be careful that the effects we are seeing on individual variables are conditional on the inclusion of the other variables. For example, `humidity` seems to be mostly flat, meaning that this covariate has a small effect on the number of used bikes. This could be the case because `humidity` and `temperature` are correlated to some extent, and once we include `temperature` in our model, `humidity` does not provide much extra information.
Try, for example, fitting the model again, but this time with `humidity` as the single covariate, and then fitting the model again with `hour` as the single covariate. You should see that the results for these single-covariate models are very similar to the previous figure for the `hour` covariate, but less similar for the `humidity` covariate. - -+++ - -### Variable importance - -As we saw in the previous section, a partial dependence plot can give us an idea of how much each covariate contributes to the predicted outcome. BART itself also leads to a simple heuristic for estimating variable importance: simply count how many times a variable is included across all the regression trees. The intuition is that if a variable is important it should appear more often in the fitted trees than less important variables. While this heuristic seems to provide reasonable results in practice, there is not much theory justifying the procedure, at least not yet. - -The following plot shows the relative importance on a scale from 0 to 1 (less to more important), where the individual importances sum to 1. Note that, at least in this case, the relative importance qualitatively agrees with the partial dependence plot. - -Additionally, PyMC-BART provides a novel method to assess variable importance. You can see an example in the bottom panel. On the x-axis we have the number of covariates and on the y-axis the square of the Pearson correlation coefficient between the predictions made by the full model (all variables included) and the restricted models, those with only a subset of the variables. The components are included following the relative variable importance order, as shown in the top panel. Thus, in this example one component means `hour`, two components mean `hour` and `temperature`, three components mean `hour`, `temperature` and `humidity`, and four components mean `hour`, `temperature`, `humidity` and `workingday`, i.e., the full model. Hence, from the next figure we can see that even a model with a single component, `hour`, is very close to the full model. Even more, the model with two components, `hour` and `temperature`, is on average indistinguishable from the full model. The error bars represent the 94% HDI from the posterior predictive distribution. It is important to notice that to compute these correlations we do not resample the models; instead, the predictions of the restricted models are approximated by *pruning* variables from the full model.
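To make the metric in the bottom panel concrete, here is a minimal sketch of the quantity it is built on, the squared Pearson correlation between full-model and restricted-model predictions; `pred_full` and `pred_restricted` are made-up arrays, not part of the PyMC-BART API (the `pmb.plot_variable_importance` call below computes this properly across all variable subsets).

```{code-cell} ipython3
# Sketch only: squared Pearson correlation between the predictions of the full model
# and of a hypothetical restricted (pruned) model.
pred_full = np.array([310.0, 120.0, 95.0, 480.0, 210.0])
pred_restricted = np.array([295.0, 130.0, 110.0, 455.0, 230.0])

r_squared = np.corrcoef(pred_full, pred_restricted)[0, 1] ** 2
r_squared
```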
- -```{code-cell} ipython3 -pmb.plot_variable_importance(idata_bikes, μ, X, samples=100); -``` - -## Authors -* Authored by Osvaldo Martin in Dec, 2021 ([pymc-examples#259](https://github.com/pymc-devs/pymc-examples/pull/259)) -* Updated by Osvaldo Martin in May, 2022 ([pymc-examples#323](https://github.com/pymc-devs/pymc-examples/pull/323)) -* Updated by Osvaldo Martin in Sep, 2022 -* Updated by Osvaldo Martin in Nov, 2022 - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames - -martin2021bayesian -quiroga2022bart -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/BEST.myst.md b/myst_nbs/case_studies/BEST.myst.md deleted file mode 100644 index a2876a5c1..000000000 --- a/myst_nbs/case_studies/BEST.myst.md +++ /dev/null @@ -1,223 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: bayes_toolbox - language: python - name: bayes_toolbox ---- - -(BEST)= -# Bayesian Estimation Supersedes the T-Test - -:::{post} Jan 07, 2022 -:tags: hypothesis testing, model comparison, -:category: beginner -:author: Andrew Straw, Thomas Wiecki, Chris Fonnesbeck, Andrés suárez -::: - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -rng = np.random.default_rng(seed=42) -``` - -### The Problem - -This model replicates the example used in **Bayesian estimation supersedes the t-test** {cite:p}`kruschke2013`. - -Several statistical inference procedures involve the comparison of two groups. We may be interested in whether one group is larger than another, or simply different from the other. We require a statistical model for this because true differences are usually accompanied by measurement or stochastic noise that prevent us from drawing conclusions simply from differences calculated from the observed data. - -The *de facto* standard for statistically comparing two (or more) samples is to use a statistical test. This involves expressing a null hypothesis, which typically claims that there is no difference between the groups, and using a chosen test statistic to determine whether the distribution of the observed data is plausible under the hypothesis. This rejection occurs when the calculated test statistic is higher than some pre-specified threshold value. - -Unfortunately, it is not easy to conduct hypothesis tests correctly, and their results are very easy to misinterpret. Setting up a statistical test involves several subjective choices (*e.g.* statistical test to use, null hypothesis to test, significance level) by the user that are rarely justified based on the problem or decision at hand, but rather, are usually based on traditional choices that are entirely arbitrary {cite:p}`johnson1999`. The evidence that it provides to the user is indirect, incomplete, and typically overstates the evidence against the null hypothesis {cite:p}`goodman1999`. - -A more informative and effective approach for comparing groups is one based on **estimation** rather than **testing**, and is driven by Bayesian probability rather than frequentist. 
That is, rather than testing whether two groups are different, we instead pursue an estimate of how different they are, which is fundamentally more informative. Moreover, we include an estimate of uncertainty associated with that difference which includes uncertainty due to our lack of knowledge of the model parameters (epistemic uncertainty) and uncertainty due to the inherent stochasticity of the system (aleatory uncertainty). - -+++ - -## Example: Drug trial evaluation - -To illustrate how this Bayesian estimation approach works in practice, we will use a fictitious example from {cite:t}`kruschke2013` concerning the evaluation of a clinical trial for drug evaluation. The trial aims to evaluate the efficacy of a "smart drug" that is supposed to increase intelligence by comparing IQ scores of individuals in a treatment arm (those receiving the drug) to those in a control arm (those receiving a placebo). There are 47 individuals and 42 individuals in the treatment and control arms, respectively. - -```{code-cell} ipython3 -# fmt: off -iq_drug = np.array([ - 101, 100, 102, 104, 102, 97, 105, 105, 98, 101, 100, 123, 105, 103, - 100, 95, 102, 106, 109, 102, 82, 102, 100, 102, 102, 101, 102, 102, - 103, 103, 97, 97, 103, 101, 97, 104, 96, 103, 124, 101, 101, 100, - 101, 101, 104, 100, 101 -]) - -iq_placebo = np.array([ - 99, 101, 100, 101, 102, 100, 97, 101, 104, 101, 102, 102, 100, 105, - 88, 101, 100, 104, 100, 100, 100, 101, 102, 103, 97, 101, 101, 100, - 101, 99, 101, 100, 100, 101, 100, 99, 101, 100, 102, 99, 100, 99 -]) -# fmt: on - -df1 = pd.DataFrame({"iq": iq_drug, "group": "drug"}) -df2 = pd.DataFrame({"iq": iq_placebo, "group": "placebo"}) -indv = pd.concat([df1, df2]).reset_index() - -sns.histplot(data=indv, x="iq", hue="group"); -``` - -The first step in a Bayesian approach to inference is to specify the full probability model that corresponds to the problem. For this example, Kruschke chooses a Student-t distribution to describe the distributions of the scores in each group. This choice adds robustness to the analysis, as a T distribution is less sensitive to outlier observations, relative to a normal distribution. The three-parameter Student-t distribution allows for the specification of a mean $\mu$, a precision (inverse-variance) $\lambda$ and a degrees-of-freedom parameter $\nu$: - -$$f(x|\mu,\lambda,\nu) = \frac{\Gamma(\frac{\nu + 1}{2})}{\Gamma(\frac{\nu}{2})} \left(\frac{\lambda}{\pi\nu}\right)^{\frac{1}{2}} \left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}$$ - -The degrees-of-freedom parameter essentially specifies the "normality" of the data, since larger values of $\nu$ make the distribution converge to a normal distribution, while small values (close to zero) result in heavier tails. Thus, the likelihood functions of our model are specified as follows: - -$$y^{(treat)}_i \sim T(\nu, \mu_1, \sigma_1)$$ - -$$y^{(placebo)}_i \sim T(\nu, \mu_2, \sigma_2)$$ - -As a simplifying assumption, we will assume that the degree of normality $\nu$ is the same for both groups. We will, of course, have separate parameters for the means $\mu_k, k=1,2$ and standard deviations $\sigma_k$. Since the means are real-valued, we will apply normal priors on them, and arbitrarily set the hyperparameters to the pooled empirical mean of the data and twice the pooled empirical standard deviation, which applies very diffuse information to these quantities (and importantly, does not favor one or the other *a priori*). 
- -$$\mu_k \sim N(\bar{x}, 2s)$$ - -```{code-cell} ipython3 -mu_m = indv.iq.mean() -mu_s = indv.iq.std() * 2 - -with pm.Model() as model: - group1_mean = pm.Normal("group1_mean", mu=mu_m, sigma=mu_s) - group2_mean = pm.Normal("group2_mean", mu=mu_m, sigma=mu_s) -``` - -The group standard deviations will be given a uniform prior over a plausible range of values for the variability of the outcome variable, IQ. - -In Kruschke's original model, he uses a very wide uniform prior for the group standard deviations, from the pooled empirical standard deviation divided by 1000 to the pooled standard deviation multiplied by 1000. This is a poor choice of prior, because very basic prior knowledge about measures of human coginition dictate that the variation cannot ever be as high as this upper bound. IQ is a standardized measure, and hence this constrains how variable a given population's IQ values can be. When you place such a wide uniform prior on these values, you are essentially giving a lot of prior weight on inadmissable values. In this example, there is little practical difference, but in general it is best to apply as much prior information that you have available to the parameterization of prior distributions. - -We will instead set the group standard deviations to have a $\text{Uniform}(1,10)$ prior: - -```{code-cell} ipython3 -sigma_low = 10**-1 -sigma_high = 10 - -with model: - group1_std = pm.Uniform("group1_std", lower=sigma_low, upper=sigma_high) - group2_std = pm.Uniform("group2_std", lower=sigma_low, upper=sigma_high) -``` - -We follow Kruschke by making the prior for $\nu$ exponentially distributed with a mean of 30; this allocates high prior probability over the regions of the parameter that describe the range from normal to heavy-tailed data under the Student-T distribution. - -```{code-cell} ipython3 -with model: - nu_minus_one = pm.Exponential("nu_minus_one", 1 / 29.0) - nu = pm.Deterministic("nu", nu_minus_one + 1) - nu_log10 = pm.Deterministic("nu_log10", np.log10(nu)) - -az.plot_kde(rng.exponential(scale=29, size=10000) + 1, fill_kwargs={"alpha": 0.5}); -``` - -Since PyMC parametrizes the Student-T in terms of precision, rather than standard deviation, we must transform the standard deviations before specifying our likelihoods. - -```{code-cell} ipython3 -with model: - lambda_1 = group1_std**-2 - lambda_2 = group2_std**-2 - group1 = pm.StudentT("drug", nu=nu, mu=group1_mean, lam=lambda_1, observed=iq_drug) - group2 = pm.StudentT("placebo", nu=nu, mu=group2_mean, lam=lambda_2, observed=iq_placebo) -``` - -Having fully specified our probabilistic model, we can turn our attention to calculating the comparisons of interest in order to evaluate the effect of the drug. To this end, we can specify deterministic nodes in our model for the difference between the group means and the difference between the group standard deviations. Wrapping them in named `Deterministic` objects signals to PyMC that we wish to record the sampled values as part of the output. As a joint measure of the groups, we will also estimate the "effect size", which is the difference in means scaled by the pooled estimates of standard deviation. This quantity can be harder to interpret, since it is no longer in the same units as our data, but the quantity is a function of all four estimated parameters. 
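In symbols, the effect size defined in the next cell is the difference in means scaled by the pooled standard deviation,

$$\text{effect size} = \frac{\mu_1 - \mu_2}{\sqrt{\left(\sigma_1^2 + \sigma_2^2\right)/2}},$$

which is analogous to Cohen's $d$, except that here it is computed for every posterior draw, so we obtain a full posterior distribution for the effect size rather than a single point estimate.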
- -```{code-cell} ipython3 -with model: - diff_of_means = pm.Deterministic("difference of means", group1_mean - group2_mean) - diff_of_stds = pm.Deterministic("difference of stds", group1_std - group2_std) - effect_size = pm.Deterministic( - "effect size", diff_of_means / np.sqrt((group1_std**2 + group2_std**2) / 2) - ) -``` - -Now, we can fit the model and evaluate its output. - -```{code-cell} ipython3 -with model: - idata = pm.sample() -``` - -We can plot the stochastic parameters of the model. Arviz's `plot_posterior` function replicates the informative histograms portrayed in {cite:p}`kruschke2013`. These summarize the posterior distributions of the parameters, and present a 95% credible interval and the posterior mean. The plots below are constructed with the final 1000 samples from each of the 2 chains, pooled together. - -```{code-cell} ipython3 -az.plot_posterior( - idata, - var_names=["group1_mean", "group2_mean", "group1_std", "group2_std", "nu", "nu_log10"], - color="#87ceeb", -); -``` - -Looking at the group differences below, we can conclude that there are meaningful differences between the two groups for all three measures. For these comparisons, it is useful to use zero as a reference value (`ref_val`); providing this reference value yields cumulative probabilities for the posterior distribution on either side of the value. Thus, for the difference of means, at least 97% of the posterior probability are greater than zero, which suggests the group means are credibly different. The effect size and differences in standard deviation are similarly positive. - -These estimates suggest that the "smart drug" increased both the expected scores, but also the variability in scores across the sample. So, this does not rule out the possibility that some recipients may be adversely affected by the drug at the same time others benefit. - -```{code-cell} ipython3 -az.plot_posterior( - idata, - var_names=["difference of means", "difference of stds", "effect size"], - ref_val=0, - color="#87ceeb", -); -``` - -When `plot_forest` is called on a trace with more than one chain, it also plots the potential scale reduction parameter, which is used to reveal evidence for lack of convergence; values near one, as we have here, suggest that the model has converged. 
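If you prefer a numerical convergence check to reading the statistic off the forest plots, ArviZ can compute it directly; a quick sketch using the variable names from the model above:

```{code-cell} ipython3
# Rank-normalized split R-hat for the main parameters; values close to 1 indicate convergence.
az.rhat(idata, var_names=["group1_mean", "group2_mean", "group1_std", "group2_std", "nu"])
```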
- -```{code-cell} ipython3 -az.plot_forest(idata, var_names=["group1_mean", "group2_mean"]); -``` - -```{code-cell} ipython3 -az.plot_forest(idata, var_names=["group1_std", "group2_std", "nu"]); -``` - -```{code-cell} ipython3 -az.summary(idata, var_names=["difference of means", "difference of stds", "effect size"]) -``` - -## Authorship - -+++ - -* Authored by Andrew Straw in Dec, 2012 ([best](https://github.com/strawlab/best)) -* Ported to PyMC3 by Thomas Wiecki in 2015 -* Updated by Chris Fonnesbeck in Dec, 2020 -* Ported to PyMC4 by Andrés Suárez in Jan, 2022 ([pymc-examples#52](https://github.com/pymc-devs/pymc-examples/issues/52)) - -+++ - -## References - -+++ - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/GEV.myst.md b/myst_nbs/case_studies/GEV.myst.md deleted file mode 100644 index 3cbef9887..000000000 --- a/myst_nbs/case_studies/GEV.myst.md +++ /dev/null @@ -1,241 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc4-dev - language: python - name: pymc4-dev ---- - -+++ {"tags": []} - -# Generalized Extreme Value Distribution - -:::{post} Sept 27, 2022 -:tags: extreme, inference, posterior predictive -:category: beginner -:author: Colin Caprani -::: - -+++ {"tags": []} - -## Introduction - -The Generalized Extreme Value (GEV) distribution is a meta-distribution containing the Weibull, Gumbel, and Frechet families of extreme value distributions. It is used for modelling the distribution of extremes (maxima or minima) of stationary processes, such as the annual maximum wind speed, annual maximum truck weight on a bridge, and so on, without needing an *a priori* decision on the tail behaviour. - -Following the parametrization used in {cite:t}`coles2001gev`, the GEV distribution for maxima is given by: - -$$G(x) = \exp \left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-\frac{1}{\xi}} \right\}$$ - -when: -- $\xi < 0$ we get the Weibull distribution with a bounded upper tail; -- $\xi = 0$, in the limit, we get the Gumbel distribution, unbounded in both tails; -- $\xi > 0$, we get the Frechet distribution, which is bounded in the lower tail. - -Note that this parametrization of the shape parameter $\xi$ is opposite in sign to that used in SciPy (where it is denoted `c`). Further, the distribution for minima is readily examined by studying the distribution of maxima of the negative of the data. - -We will use the example of the Port Pirie annual maximum sea-level data used in {cite:t}`coles2001gev`, and compare with the frequentist results presented there. 
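As a quick aside on the sign convention (a sketch, not part of the original analysis), the GEV CDF can be evaluated in SciPy by passing `c = -ξ`; here we plug in the frequentist estimates quoted later in this notebook:

```{code-cell} ipython3
# Sketch: SciPy's genextreme parametrizes the shape as c = -ξ under the convention used here.
from scipy.stats import genextreme

xi = -0.050  # Coles' MLE of the shape parameter for the Port Pirie data
genextreme.cdf(4.5, -xi, loc=3.87, scale=0.198)  # P(annual maximum sea level <= 4.5 m)
```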
- -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import pymc_experimental.distributions as pmx - -from arviz.plots import plot_utils as azpu -``` - -## Data -The Port Pirie data is provided by {cite:t}`coles2001gev`, and repeated here: - -```{code-cell} ipython3 -# fmt: off -data = np.array([4.03, 3.83, 3.65, 3.88, 4.01, 4.08, 4.18, 3.80, - 4.36, 3.96, 3.98, 4.69, 3.85, 3.96, 3.85, 3.93, - 3.75, 3.63, 3.57, 4.25, 3.97, 4.05, 4.24, 4.22, - 3.73, 4.37, 4.06, 3.71, 3.96, 4.06, 4.55, 3.79, - 3.89, 4.11, 3.85, 3.86, 3.86, 4.21, 4.01, 4.11, - 4.24, 3.96, 4.21, 3.74, 3.85, 3.88, 3.66, 4.11, - 3.71, 4.18, 3.90, 3.78, 3.91, 3.72, 4.00, 3.66, - 3.62, 4.33, 4.55, 3.75, 4.08, 3.90, 3.88, 3.94, - 4.33]) -# fmt: on -plt.hist(data) -plt.show() -``` - -## Modelling & Prediction -In the modelling we wish to do two things: - -- parameter inference on the GEV parameters, based on some fairly non-informative priors, and; -- prediction of the 10-year return level. - -Predictions of extreme values considering parameter uncertainty are easily accomplished in the Bayesian setting. It is interesting to compare this ease to the difficulties encountered by {cite:t}`caprani2010gev` in doing this in a frequentist setting. In any case, the predicted value at a probability of exceedance $p$ is given by: - -$$ x_p = \mu - \frac{\sigma}{\xi} \left\{1 - \left[-\log\left(1-p\right)\right]^{-\xi} \right\} $$ - -This is a deterministic function of the parameter values, and so is accomplished using `pm.Deterministic` within the model context. - -Consider then, the 10-year return period, for which $p = 1/10$: - -```{code-cell} ipython3 -p = 1 / 10 -``` - -And now set up the model using priors estimated from a quick review of the histogram above: - -- $\mu$: there is no real basis for considering anything other than a `Normal` distribution with a standard deviation limiting negative outcomes; -- $\sigma$: this must be positive, and has a small value, so use `HalfNormal` with a unit standard deviation; -- $\xi$: we are agnostic to the tail behaviour so centre this at zero, but limit to physically reasonable bounds of $\pm 0.6$, and keep it somewhat tight near zero. 
- -```{code-cell} ipython3 -:tags: [] - -# Optionally centre the data, depending on fitting and divergences -# cdata = (data - data.mean())/data.std() - -with pm.Model() as model: - # Priors - μ = pm.Normal("μ", mu=3.8, sigma=0.2) - σ = pm.HalfNormal("σ", sigma=0.3) - ξ = pm.TruncatedNormal("ξ", mu=0, sigma=0.2, lower=-0.6, upper=0.6) - - # Estimation - gev = pmx.GenExtreme("gev", mu=μ, sigma=σ, xi=ξ, observed=data) - # Return level - z_p = pm.Deterministic("z_p", μ - σ / ξ * (1 - (-np.log(1 - p)) ** (-ξ))) -``` - -## Prior Predictive Checks -Let's get a feel for how well our selected priors cover the range of the data: - -```{code-cell} ipython3 -idata = pm.sample_prior_predictive(samples=1000, model=model) -az.plot_ppc(idata, group="prior", figsize=(12, 6)) -ax = plt.gca() -ax.set_xlim([2, 6]) -ax.set_ylim([0, 2]); -``` - -And we can look at the sampled values of the parameters, using the `plot_posterior` function, but passing in the `idata` object and specifying the `group` to be `"prior"`: - -```{code-cell} ipython3 -az.plot_posterior( - idata, group="prior", var_names=["μ", "σ", "ξ"], hdi_prob="hide", point_estimate=None -); -``` - -## Inference -Press the magic Inference Button$^{\mathrm{TM}}$: - -```{code-cell} ipython3 -with model: - trace = pm.sample( - 5000, - cores=4, - chains=4, - tune=2000, - initvals={"μ": -0.5, "σ": 1.0, "ξ": -0.1}, - target_accept=0.98, - ) -# add trace to existing idata object -idata.extend(trace) -``` - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["μ", "σ", "ξ"], figsize=(12, 12)); -``` - -### Divergences -The trace exhibits divergences (usually). The HMC/NUTS sampler can have problems when the bounds of support for parameters are approached. And since the bounds of the GEV change with the sign of $\xi$, it is difficult to offer a transformation that resolves this problem. One possible transformation - the Box-Cox - has been proposed by {cite:t}`bali2003gev`, but {cite:t}`caprani2009gev` find it numerically unstable, even for just maximum likelihood estimation. In any case, recommendations to alleviate divergence problems are: - -- Increase the target acceptance ratio; -- Use more informative priors, especially limit the shape parameter to physically reasonable values, typically $\xi \in [-0.5,0.5]$; -- Decide upon the domain of attraction of the tail (i.e. Weibull, Gumbel, or Frechet), and use that distribution directly. - - -### Inferences -The 95% credible interval range of the parameter estimates is: - -```{code-cell} ipython3 -az.hdi(idata, hdi_prob=0.95) -``` - -And examine the prediction distribution, considering parameter variability (and without needing to assume normality): - -```{code-cell} ipython3 -az.plot_posterior(idata, hdi_prob=0.95, var_names=["z_p"], round_to=4); -``` - -And let's compare the prior and posterior predictions of $z_p$ to see how the data has influenced things: - -```{code-cell} ipython3 -az.plot_dist_comparison(idata, var_names=["z_p"]); -``` - -## Comparison -To compare with the results given in {cite:t}`coles2001gev`, we approximate the maximum likelihood estimates (MLE) using the mode of the posterior distributions (the *maximum a posteriori* or MAP estimate). These are close when the prior is reasonably flat around the posterior estimate. 
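As an additional cross-check (again a sketch, not in the original notebook), the MLEs can also be obtained directly with SciPy; remember that `genextreme` reports its shape `c`, which is $-\xi$ in the parametrization used here:

```{code-cell} ipython3
# Direct maximum likelihood fit of the GEV to the Port Pirie data with SciPy.
from scipy.stats import genextreme

c_hat, loc_hat, scale_hat = genextreme.fit(data)
print(f"mu = {loc_hat:.3f}, sigma = {scale_hat:.3f}, xi = {-c_hat:.3f}")
```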
- -The MLE results given in {cite:t}`coles2001gev` are: - -$$\left(\hat{\mu}, \hat{\sigma}, \hat{\xi} \right) = \left( 3.87, 0.198, -0.050 \right) $$ - - -And the variance-covariance matrix of the estimates is: - -$$ V = \left[ \begin{array} 0.000780 & 0.000197 & -0.00107 \\ - 0.000197 & 0.000410 & -0.000778 \\ - -0.00107 & -0.000778 & 0.00965 - \end{array} \right] $$ - - -Note that extracting the MLE estimates from our inference involves accessing some of the Arviz back end functions to bash the xarray into something examinable: - -```{code-cell} ipython3 -_, vals = az.sel_utils.xarray_to_ndarray(idata["posterior"], var_names=["μ", "σ", "ξ"]) -mle = [azpu.calculate_point_estimate("mode", val) for val in vals] -mle -``` - -```{code-cell} ipython3 -idata["posterior"].drop_vars("z_p").to_dataframe().cov().round(6) -``` - -The results are a good match, but the benefit of doing this in a Bayesian setting is we get the full posterior joint distribution of the parameters and the return level, essentially for free. Compare this to the loose normality assumption and computational effort to get even the variance-covarince matrix, as done in {cite:t}`coles2001gev`. - -Finally, we examine the pairs plots and see where any difficulties in inference lie using the divergences - -```{code-cell} ipython3 -az.plot_pair(idata, var_names=["μ", "σ", "ξ"], kind="kde", marginals=True, divergences=True); -``` - -## Authors - -* Authored by [Colin Caprani](https://github.com/ccaprani), October 2021 - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,arviz -``` - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/LKJ.myst.md b/myst_nbs/case_studies/LKJ.myst.md deleted file mode 100644 index 6361174ac..000000000 --- a/myst_nbs/case_studies/LKJ.myst.md +++ /dev/null @@ -1,331 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -+++ {"id": "XShKDkNir2PX"} - -# LKJ Cholesky Covariance Priors for Multivariate Normal Models - -+++ {"id": "QxSKBbjKr2PZ"} - -While the [inverse-Wishart distribution](https://en.wikipedia.org/wiki/Inverse-Wishart_distribution) is the conjugate prior for the covariance matrix of a multivariate normal distribution, it is [not very well-suited](https://github.com/pymc-devs/pymc3/issues/538#issuecomment-94153586) to modern Bayesian computational methods. For this reason, the [LKJ prior](http://www.sciencedirect.com/science/article/pii/S0047259X09000876) is recommended when modeling the covariance matrix of a multivariate normal distribution. - -To illustrate modelling covariance with the LKJ distribution, we first generate a two-dimensional normally-distributed sample data set. 
- -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: 17Thh2DHr2Pa -outputId: 90631275-86c9-4f4a-f81a-22465d0c8b8c ---- -%env MKL_THREADING_LAYER=GNU -import arviz as az -import numpy as np -import pymc as pm -import seaborn as sns - -from matplotlib import pyplot as plt -from matplotlib.lines import Line2D -from matplotlib.patches import Ellipse - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -:id: Sq6K4Ie4r2Pc - -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: eA491vJMr2Pc -outputId: 30ea38db-0767-4e89-eb09-68927878018e ---- -N = 10000 - -mu_actual = np.array([1.0, -2.0]) -sigmas_actual = np.array([0.7, 1.5]) -Rho_actual = np.matrix([[1.0, -0.4], [-0.4, 1.0]]) - -Sigma_actual = np.diag(sigmas_actual) * Rho_actual * np.diag(sigmas_actual) - -x = rng.multivariate_normal(mu_actual, Sigma_actual, size=N) -Sigma_actual -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 628 -id: ZmFDGQ8Jr2Pd -outputId: 03ba3248-370c-4ff9-8626-ba601423b9c1 ---- -var, U = np.linalg.eig(Sigma_actual) -angle = 180.0 / np.pi * np.arccos(np.abs(U[0, 0])) - -fig, ax = plt.subplots(figsize=(8, 6)) - -e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * var[0]), 2 * np.sqrt(5.991 * var[1]), angle=angle) -e.set_alpha(0.5) -e.set_facecolor("C0") -e.set_zorder(10) -ax.add_artist(e) - -ax.scatter(x[:, 0], x[:, 1], c="k", alpha=0.05, zorder=11) -ax.set_xlabel("y") -ax.set_ylabel("z") - -rect = plt.Rectangle((0, 0), 1, 1, fc="C0", alpha=0.5) -ax.legend([rect], ["95% density region"], loc=2); -``` - -+++ {"id": "d6320GCir2Pd"} - -The sampling distribution for the multivariate normal model is $\mathbf{x} \sim N(\mu, \Sigma)$, where $\Sigma$ is the covariance matrix of the sampling distribution, with $\Sigma_{ij} = \textrm{Cov}(x_i, x_j)$. The density of this distribution is - -$$f(\mathbf{x}\ |\ \mu, \Sigma^{-1}) = (2 \pi)^{-\frac{k}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)\right).$$ - -The LKJ distribution provides a prior on the correlation matrix, $\mathbf{C} = \textrm{Corr}(x_i, x_j)$, which, combined with priors on the standard deviations of each component, [induces](http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf) a prior on the covariance matrix, $\Sigma$. Since inverting $\Sigma$ is numerically unstable and inefficient, it is computationally advantageous to use the [Cholesky decompositon](https://en.wikipedia.org/wiki/Cholesky_decomposition) of $\Sigma$, $\Sigma = \mathbf{L} \mathbf{L}^{\top}$, where $\mathbf{L}$ is a lower-triangular matrix. This decompositon allows computation of the term $(\mathbf{x} - \mu)^{\top} \Sigma^{-1} (\mathbf{x} - \mu)$ using back-substitution, which is more numerically stable and efficient than direct matrix inversion. - -PyMC supports LKJ priors for the Cholesky decomposition of the covariance matrix via the [LKJCholeskyCov](https://docs.pymc.io/en/latest/api/distributions/generated/pymc.LKJCholeskyCov.html) distribution. This distribution has parameters `n` and `sd_dist`, which are the dimension of the observations, $\mathbf{x}$, and the PyMC distribution of the component standard deviations, respectively. It also has a hyperparamter `eta`, which controls the amount of correlation between components of $\mathbf{x}$. 
The LKJ distribution has the density $f(\mathbf{C}\ |\ \eta) \propto |\mathbf{C}|^{\eta - 1}$, so $\eta = 1$ leads to a uniform distribution on correlation matrices, while the magnitude of correlations between components decreases as $\eta \to \infty$. - -In this example, we model the standard deviations with $\textrm{Exponential}(1.0)$ priors, and the correlation matrix as $\mathbf{C} \sim \textrm{LKJ}(\eta = 2)$. - -```{code-cell} ipython3 -:id: 7GcM6oENr2Pe - -with pm.Model() as m: - packed_L = pm.LKJCholeskyCov( - "packed_L", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0, shape=2), compute_corr=False - ) -``` - -+++ {"id": "6Cscu-CRr2Pe"} - -Since the Cholesky decompositon of $\Sigma$ is lower triangular, `LKJCholeskyCov` only stores the diagonal and sub-diagonal entries, for efficiency: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: JMWeTjDjr2Pe -outputId: e4f767a3-c1d7-4016-a3cf-91089c925bdb ---- -packed_L.eval() -``` - -+++ {"id": "59FtijDir2Pe"} - -We use [expand_packed_triangular](../api/math.rst) to transform this vector into the lower triangular matrix $\mathbf{L}$, which appears in the Cholesky decomposition $\Sigma = \mathbf{L} \mathbf{L}^{\top}$. - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: YxBbFyUxr2Pf -outputId: bd37c630-98dd-437b-bb13-89281aeccc44 ---- -with m: - L = pm.expand_packed_triangular(2, packed_L) - Sigma = L.dot(L.T) - -L.eval().shape -``` - -+++ {"id": "SwdNd_0Jr2Pf"} - -Often however, you'll be interested in the posterior distribution of the correlations matrix and of the standard deviations, not in the posterior Cholesky covariance matrix *per se*. Why? Because the correlations and standard deviations are easier to interpret and often have a scientific meaning in the model. As of PyMC v4, the `compute_corr` argument is set to `True` by default, which returns a tuple consisting of the Cholesky decomposition, the correlations matrix, and the standard deviations. - -```{code-cell} ipython3 -:id: ac3eQeMJr2Pf - -coords = {"axis": ["y", "z"], "axis_bis": ["y", "z"], "obs_id": np.arange(N)} -with pm.Model(coords=coords, rng_seeder=RANDOM_SEED) as model: - chol, corr, stds = pm.LKJCholeskyCov( - "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0, shape=2) - ) - cov = pm.Deterministic("cov", chol.dot(chol.T), dims=("axis", "axis_bis")) -``` - -+++ {"id": "cpEupNzWr2Pg"} - -To complete our model, we place independent, weakly regularizing priors, $N(0, 1.5),$ on $\mu$: - -```{code-cell} ipython3 -:id: iTI4uiBdr2Pg - -with model: - mu = pm.Normal("mu", 0.0, sigma=1.5, dims="axis") - obs = pm.MvNormal("obs", mu, chol=chol, observed=x, dims=("obs_id", "axis")) -``` - -+++ {"id": "QOCi1RKvr2Ph"} - -We sample from this model using NUTS and give the trace to [ArviZ](https://arviz-devs.github.io/arviz/) for summarization: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 608 -id: vBPIQDWrr2Ph -outputId: f039bfb8-1acf-42cb-b054-bc2c97697f96 ---- -with model: - trace = pm.sample( - idata_kwargs={"dims": {"chol_stds": ["axis"], "chol_corr": ["axis", "axis_bis"]}}, - ) -az.summary(trace, var_names="~chol", round_to=2) -``` - -+++ {"id": "X8ucBpcRr2Ph"} - -Sampling went smoothly: no divergences and good r-hats (except for the diagonal elements of the correlation matrix - however, these are not a concern, because, they should be equal to 1 for each sample for each chain and the variance of a constant value isn't defined. 
If one of the diagonal elements has `r_hat` defined, it's likely due to tiny numerical errors). - -You can also see that the sampler recovered the true means, correlations and standard deviations. As often, that will be clearer in a graph: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 228 -id: dgOKiSLdr2Pi -outputId: a29bde4b-c4dc-49f4-e65d-c3365c1933e1 ---- -az.plot_trace( - trace, - var_names="chol_corr", - coords={"axis": "y", "axis_bis": "z"}, - lines=[("chol_corr", {}, Rho_actual[0, 1])], -); -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 628 -id: dtBWyd5Jr2Pi -outputId: 94ee6945-a564-487a-e447-3c447057f0bf ---- -az.plot_trace( - trace, - var_names=["~chol", "~chol_corr"], - compact=True, - lines=[ - ("mu", {}, mu_actual), - ("cov", {}, Sigma_actual), - ("chol_stds", {}, sigmas_actual), - ], -); -``` - -+++ {"id": "NnLWJyCMr2Pi"} - -The posterior expected values are very close to the true value of each component! How close exactly? Let's compute the percentage of closeness for $\mu$ and $\Sigma$: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: yDlyVSizr2Pj -outputId: 69c22c57-db27-4f43-ab94-7b88480a21f9 ---- -mu_post = trace.posterior["mu"].mean(("chain", "draw")).values -(1 - mu_post / mu_actual).round(2) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: rFF947Grr2Pj -outputId: 398332a0-a142-4ad0-dadf-bde13ef2b00b ---- -Sigma_post = trace.posterior["cov"].mean(("chain", "draw")).values -(1 - Sigma_post / Sigma_actual).round(2) -``` - -+++ {"id": "DMDwKtp0r2Pj"} - -So the posterior means are within 3% of the true values of $\mu$ and $\Sigma$. - -Now let's replicate the plot we did at the beginning, but let's overlay the posterior distribution on top of the true distribution -- you'll see there is excellent visual agreement between both: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 628 -id: _dwHYuj1r2Pj -outputId: 9b53b875-af25-4f79-876f-a02e72bba5a9 ---- -var_post, U_post = np.linalg.eig(Sigma_post) -angle_post = 180.0 / np.pi * np.arccos(np.abs(U_post[0, 0])) - -fig, ax = plt.subplots(figsize=(8, 6)) - -e = Ellipse( - mu_actual, - 2 * np.sqrt(5.991 * var[0]), - 2 * np.sqrt(5.991 * var[1]), - angle=angle, - linewidth=3, - linestyle="dashed", -) -e.set_edgecolor("C0") -e.set_zorder(11) -e.set_fill(False) -ax.add_artist(e) - -e_post = Ellipse( - mu_post, - 2 * np.sqrt(5.991 * var_post[0]), - 2 * np.sqrt(5.991 * var_post[1]), - angle=angle_post, - linewidth=3, -) -e_post.set_edgecolor("C1") -e_post.set_zorder(10) -e_post.set_fill(False) -ax.add_artist(e_post) - -ax.scatter(x[:, 0], x[:, 1], c="k", alpha=0.05, zorder=11) -ax.set_xlabel("y") -ax.set_ylabel("z") - -line = Line2D([], [], color="C0", linestyle="dashed", label="True 95% density region") -line_post = Line2D([], [], color="C1", label="Estimated 95% density region") -ax.legend( - handles=[line, line_post], - loc=2, -); -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ -id: kJCfuzGtr2Pq -outputId: da547b05-d812-4959-aff6-cf4a12faca15 ---- -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,xarray -``` diff --git a/myst_nbs/case_studies/bayesian_ab_testing_introduction.myst.md b/myst_nbs/case_studies/bayesian_ab_testing_introduction.myst.md deleted file mode 100644 index 53c08e528..000000000 --- a/myst_nbs/case_studies/bayesian_ab_testing_introduction.myst.md +++ /dev/null @@ -1,695 +0,0 @@ ---- 
-jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.8.10 ('pymc-examples-ipRlw-UN') - language: python - name: python3 ---- - -(bayesian_ab_testing_intro)= -# Introduction to Bayesian A/B Testing - -:::{post} May 23, 2021 -:tags: case study, ab test -:category: beginner, tutorial -:author: Cuong Duong -::: - -```{code-cell} ipython3 -from dataclasses import dataclass -from typing import Dict, List, Union - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from scipy.stats import bernoulli, expon -``` - -```{code-cell} ipython3 -RANDOM_SEED = 4000 -rng = np.random.default_rng(RANDOM_SEED) - -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") - -plotting_defaults = dict( - bins=50, - kind="hist", - textsize=10, -) -``` - -This notebook demonstrates how to implement a Bayesian analysis of an A/B test. We implement the models discussed in VWO's Bayesian A/B Testing Whitepaper {cite:p}`stucchio2015bayesian`, and discuss the effect of different prior choices for these models. This notebook does _not_ discuss other related topics like how to choose a prior, early stopping, and power analysis. - -#### What is A/B testing? - -From https://vwo.com/ab-testing/: - -> A/B testing (also known as split testing) is a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions. - -Specifically, A/B tests are often used in the software industry to determine whether a new feature or changes to an existing feature should be released to users, and the impact of the change on core product metrics ("conversions"). Furthermore: - -* We can test more than two variants at the same time. We'll be dealing with how to analyse these tests in this notebook as well. -* Exactly what "conversions" means can vary between tests, but two classes of conversions we'll focus on are: - * Bernoulli conversions - a flag for whether the visitor did the target action or not (e.g. completed at least one purchase). - * Value conversions - a real value per visitor (e.g. the dollar revenue, which could also be 0). - -If you've studied [controlled experiments](https://www.khanacademy.org/science/high-school-biology/hs-biology-foundations/hs-biology-and-the-scientific-method/a/experiments-and-observations) in the context of biology, psychology, and other sciences before, A/B testing will sound a lot like a controlled experiment - and that's because it is! The concept of a control group and treatment groups, and the principles of experimental design, are the building blocks of A/B testing. The main difference is the context in which the experiment is run: A/B tests are typically run by online software companies, where the subjects are visitors to the website / app, the outcomes of interest are behaviours that can be tracked like signing up, purchasing a product, and returning to the website. - -A/B tests are typically analysed with traditional hypothesis tests (see [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), but another method is to use Bayesian statistics. This allows us to incorporate prior distributions and produce a range of outcomes to the questions "is there a winning variant?" and "by how much?". 
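As a point of contrast with the Bayesian approach used in the rest of this notebook, here is a minimal sketch of one common "traditional" analysis for conversion rates, a two-proportion z-test, with made-up visitor and conversion counts (the numbers and variable names are purely illustrative):

```python
# Illustrative only: a classical two-proportion z-test on made-up counts.
# It answers "is the difference statistically significant?" with a single
# p-value, whereas the Bayesian models below return a full posterior over
# "how much better" one variant is.
import numpy as np
from scipy import stats

n_a, y_a = 1000, 230  # hypothetical visitors / conversions for variant A
n_b, y_b = 1000, 260  # hypothetical visitors / conversions for variant B

p_a, p_b = y_a / n_a, y_b / n_b
p_pool = (y_a + y_b) / (n_a + n_b)  # pooled conversion rate under the null
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```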
- -+++ - -### Bernoulli Conversions - -+++ - -Let's first deal with a simple two-variant A/B test, where the metric of interest is the proportion of users performing an action (e.g. purchase at least one item), a bernoulli conversion. Our variants are called A and B, where A refers to the existing landing page and B refers to the new page we want to test. The outcome that we want to perform statistical inference on is whether B is "better" than A, which depends on the underlying "true" conversion rates for each variant. We can formulate this as follows: - -Let $\theta_A, \theta_B$ be the true conversion rates for variants A and B respectively. Then the outcome of whether a visitor converts in variant A is the random variable $\mathrm{Bernoulli}(\theta_A)$, and $\mathrm{Bernoulli}(\theta_B)$ for variant B. If we assume that visitors' behaviour on the landing page is independent of other visitors (a fair assumption), then the total number of conversions $y$ for a variant follows the Binomial distribution: - -$$y \sim \sum^N\mathrm{Bernoulli}(\theta) = \mathrm{Binomial}(N, \theta)$$ - -Under a Bayesian framework, we assume the true conversion rates $\theta_A, \theta_B$ cannot be known, and instead they each follow a Beta distribution. The underlying rates are assumed to be independent (we would split traffic between each variant randomly, so one variant would not affect the other): - -$$\theta_A \sim \theta_B \sim \mathrm{Beta}(\alpha, \beta)$$ - -The observed data for the duration of the A/B test (the likelihood distribution) is: the number of visitors landing on the page `N`, and the number of visitors purchasing at least one item `y`: - -$$y_A \sim \mathrm{Binomial}(n = N_A, p = \theta_A), y_B \sim \mathrm{Binomial}(n = N_B, p = \theta_B)$$ - -With this, we can sample from the joint posterior of $\theta_A, \theta_B$. - -You may have noticed that the Beta distribution is the conjugate prior for the Binomial, so we don't need MCMC sampling to estimate the posterior (the exact solution can be found in the VWO paper). We'll still demonstrate how sampling can be done with PyMC though, and doing this makes it easier to extend the model with different priors, dependency assumptions, etc. - -Finally, remember that our outcome of interest is whether B is better than A. A common measure in practice for whether B is better than A is the _relative uplift in conversion rates_, i.e. the percentage difference of $\theta_B$ over $\theta_A$: - -$$\mathrm{reluplift}_B = \theta_B / \theta_A - 1$$ - -We'll implement this model setup in PyMC below. - -```{code-cell} ipython3 -@dataclass -class BetaPrior: - alpha: float - beta: float -``` - -```{code-cell} ipython3 -@dataclass -class BinomialData: - trials: int - successes: int -``` - -```{code-cell} ipython3 -class ConversionModelTwoVariant: - def __init__(self, priors: BetaPrior): - self.priors = priors - - def create_model(self, data: List[BinomialData]) -> pm.Model: - trials = [d.trials for d in data] - successes = [d.successes for d in data] - with pm.Model() as model: - p = pm.Beta("p", alpha=self.priors.alpha, beta=self.priors.beta, shape=2) - obs = pm.Binomial("y", n=trials, p=p, shape=2, observed=successes) - reluplift = pm.Deterministic("reluplift_b", p[1] / p[0] - 1) - return model -``` - -Now that we've defined a class that can take a prior and our synthetic data as inputs, our first step is to choose an appropriate prior.
There are a few things to consider when doing this in practice, but for the purpose of this notebook we'll focus on the following: - -* We assume that the same Beta prior is set for each variant. -* An _uninformative_ or _weakly informative_ prior occurs when we set low values for `alpha` and `beta`. For example, `alpha = 1, beta = 1` leads to a uniform distribution as a prior. If we were considering one distribution in isolation, setting this prior is a statement that we don't know anything about the value of the parameter, nor our confidence around it. In the context of A/B testing however, we're interested in comparing the _relative uplift_ of one variant over another. With a weakly informative Beta prior, this relative uplift distribution is very wide, so we're implicitly saying that the variants could be very different to each other. -* A _strong_ prior occurs when we set high values for `alpha` and `beta`. Contrary to the above, a strong prior would imply that the relative uplift distribution is thin, i.e. our prior belief is that the variants are not very different from each other. - -We illustrate these points with prior predictive checks. - -+++ - -#### Prior predictive checks - -+++ - -Note that we can pass in arbitrary values for the observed data in these prior predictive checks. PyMC will not use that data when sampling from the prior predictive distribution. - -```{code-cell} ipython3 -weak_prior = ConversionModelTwoVariant(BetaPrior(alpha=100, beta=100)) -``` - -```{code-cell} ipython3 -strong_prior = ConversionModelTwoVariant(BetaPrior(alpha=10000, beta=10000)) -``` - -```{code-cell} ipython3 -with weak_prior.create_model(data=[BinomialData(1, 1), BinomialData(1, 1)]): - weak_prior_predictive = pm.sample_prior_predictive(samples=10000, return_inferencedata=False) -``` - -```{code-cell} ipython3 -with strong_prior.create_model(data=[BinomialData(1, 1), BinomialData(1, 1)]): - strong_prior_predictive = pm.sample_prior_predictive(samples=10000, return_inferencedata=False) -``` - -```{code-cell} ipython3 -fig, axs = plt.subplots(2, 1, figsize=(7, 7), sharex=True) -az.plot_posterior(weak_prior_predictive["reluplift_b"], ax=axs[0], **plotting_defaults) -axs[0].set_title(f"B vs. A Rel Uplift Prior Predictive, {weak_prior.priors}", fontsize=10) -axs[0].axvline(x=0, color="red") -az.plot_posterior(strong_prior_predictive["reluplift_b"], ax=axs[1], **plotting_defaults) -axs[1].set_title(f"B vs. A Rel Uplift Prior Predictive, {strong_prior.priors}", fontsize=10) -axs[1].axvline(x=0, color="red"); -``` - -With the weak prior our 94% HDI for the relative uplift for B over A is roughly [-20%, +20%], whereas it is roughly [-2%, +2%] with the strong prior. This is effectively the "starting point" for the relative uplift distribution, and will affect how the observed conversions translate to the posterior distribution. - -How we choose these priors in practice depends on broader context of the company running the A/B tests. A strong prior can help guard against false discoveries, but may require more data to detect winning variants when they exist (and more data = more time required running the test). A weak prior gives more weight to the observed data, but could also lead to more false discoveries as a result of early stopping issues. - -Below we'll walk through the inference results from two different prior choices. 
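As an aside, because the Beta prior is conjugate to the Binomial likelihood (as noted earlier), each variant's posterior is available in closed form. A minimal SciPy sketch, with made-up trial and success counts, shows how the MCMC results below could be cross-checked against that closed form:

```python
# Closed-form conjugate posterior check (illustrative only; counts are made up).
import numpy as np
from scipy import stats

alpha, beta = 100, 100  # the "weak" Beta prior used above
n_a, y_a = 1000, 230  # hypothetical visitors / conversions for variant A
n_b, y_b = 1000, 250  # hypothetical visitors / conversions for variant B

# Beta-Binomial conjugacy: posterior is Beta(alpha + successes, beta + failures)
post_a = stats.beta(alpha + y_a, beta + n_a - y_a)
post_b = stats.beta(alpha + y_b, beta + n_b - y_b)

# Monte Carlo estimate of the relative uplift distribution, theta_B / theta_A - 1
rng = np.random.default_rng(2021)
reluplift = post_b.rvs(10_000, random_state=rng) / post_a.rvs(10_000, random_state=rng) - 1
print(f"P(B better than A) approx {(reluplift > 0).mean():.2f}")
```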
- -+++ - -#### Data - -+++ - -We generate two datasets: one where the "true" conversion rate of each variant is the same, and one where variant B has a higher true conversion rate. - -```{code-cell} ipython3 -def generate_binomial_data( - variants: List[str], true_rates: List[float], samples_per_variant: int = 100000 -) -> pd.DataFrame: - data = {} - for variant, p in zip(variants, true_rates): - data[variant] = bernoulli.rvs(p, size=samples_per_variant) - agg = ( - pd.DataFrame(data) - .aggregate(["count", "sum"]) - .rename(index={"count": "trials", "sum": "successes"}) - ) - return agg -``` - -```{code-cell} ipython3 -# Example generated data -generate_binomial_data(["A", "B"], [0.23, 0.23]) -``` - -We'll also write a function to wrap the data generation, sampling, and posterior plots so that we can easily compare the results of both models (strong and weak prior) under both scenarios (same true rate vs. different true rate). - -```{code-cell} ipython3 -def run_scenario_twovariant( - variants: List[str], - true_rates: List[float], - samples_per_variant: int, - weak_prior: BetaPrior, - strong_prior: BetaPrior, -) -> tuple: - generated = generate_binomial_data(variants, true_rates, samples_per_variant) - data = [BinomialData(**generated[v].to_dict()) for v in variants] - with ConversionModelTwoVariant(priors=weak_prior).create_model(data): - trace_weak = pm.sample(draws=5000) - with ConversionModelTwoVariant(priors=strong_prior).create_model(data): - trace_strong = pm.sample(draws=5000) - - true_rel_uplift = true_rates[1] / true_rates[0] - 1 - - fig, axs = plt.subplots(2, 1, figsize=(7, 7), sharex=True) - az.plot_posterior(trace_weak.posterior["reluplift_b"], ax=axs[0], **plotting_defaults) - axs[0].set_title(f"True Rel Uplift = {true_rel_uplift:.1%}, {weak_prior}", fontsize=10) - axs[0].axvline(x=0, color="red") - az.plot_posterior(trace_strong.posterior["reluplift_b"], ax=axs[1], **plotting_defaults) - axs[1].set_title(f"True Rel Uplift = {true_rel_uplift:.1%}, {strong_prior}", fontsize=10) - axs[1].axvline(x=0, color="red") - fig.suptitle("B vs. A Rel Uplift") - return trace_weak, trace_strong -``` - -#### Scenario 1 - same underlying conversion rates - -```{code-cell} ipython3 -trace_weak, trace_strong = run_scenario_twovariant( - variants=["A", "B"], - true_rates=[0.23, 0.23], - samples_per_variant=100000, - weak_prior=BetaPrior(alpha=100, beta=100), - strong_prior=BetaPrior(alpha=10000, beta=10000), -) -``` - -* In both cases, the true uplift of 0% lies within the 94% HDI. -* We can then use this relative uplift distribution to make a decision about whether to apply the new landing page / features in Variant B as the default. For example, we can decide that if the 94% HDI is above 0, we would roll out Variant B. In this case, 0 is in the HDI, so the decision would be to _not_ roll out Variant B. - -+++ - -#### Scenario 2 - different underlying rates - -```{code-cell} ipython3 -run_scenario_twovariant( - variants=["A", "B"], - true_rates=[0.21, 0.23], - samples_per_variant=100000, - weak_prior=BetaPrior(alpha=100, beta=100), - strong_prior=BetaPrior(alpha=10000, beta=10000), -) -``` - -* In both cases, the posterior relative uplift distribution suggests that B has a higher conversion rate than A, as the 94% HDI is well above 0. The decision in this case would be to roll out Variant B to all users, and this outcome would be a "true discovery". -* That said, in practice we are usually also interested in _how much better_ Variant B is.
For the model with the strong prior, the prior is effectively pulling the relative uplift distribution closer to 0, so our central estimate of the relative uplift is **conservative (i.e. understated)**. We would need much more data for our inference to get closer to the true relative uplift of 9.5%. - -The above examples demonstrate how to perform an A/B testing analysis for a two-variant test with the simple Beta-Binomial model, and the benefits and disadvantages of choosing a weak vs. strong prior. In the next section we provide a guide for handling a multi-variant ("A/B/n") test. - -+++ - -### Generalising to multi-variant tests - -+++ - -We'll continue using Bernoulli conversions and the Beta-Binomial model in this section for simplicity. The focus is on how to analyse tests with 3 or more variants - e.g. instead of just having one different landing page to test, we have multiple ideas we want to test at once. How can we tell if there's a winner amongst all of them? - -There are two main approaches we can take here: - -1. Take A as the 'control'. Compare the other variants (B, C, etc.) against A, one at a time. -2. For each variant, compare against the `max()` of the other variants. - -Approach 1 is intuitive to most people, and is easily explained. But what if there are two variants that both beat the control, and we want to know which one is better? We can't make that inference with the individual uplift distributions. Approach 2 does handle this case - it effectively tries to find whether there is a clear winner or clear loser(s) amongst all the variants. - -We'll implement the model setup for both approaches below, cleaning up our code from before so that it generalises to the `n` variant case. Note that we can also re-use this model for the 2-variant case.
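Before the full PyMC implementation below, here is a small NumPy sketch (with made-up posterior draws, not output from the model) of what the two comparison methods compute, given an array of posterior samples of shape `(n_draws, n_variants)`:

```python
# Illustrative only: the two comparison methods applied to fake posterior draws.
import numpy as np

rng = np.random.default_rng(0)
# fake posterior draws for variants A, B, C (columns), 10,000 draws (rows)
p = rng.beta(a=[210, 230, 228], b=[790, 770, 772], size=(10_000, 3))

# Approach 1 ("compare_to_control"): uplift of B and C relative to A (column 0)
reluplift_vs_control = p[:, 1:] / p[:, :1] - 1

# Approach 2 ("best_of_rest"): each variant vs. the best of the other variants
best_of_rest = np.stack(
    [np.delete(p, i, axis=1).max(axis=1) for i in range(p.shape[1])], axis=1
)
reluplift_best_of_rest = p / best_of_rest - 1
```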
- -```{code-cell} ipython3 -class ConversionModel: - def __init__(self, priors: BetaPrior): - self.priors = priors - - def create_model(self, data: List[BinomialData], comparison_method) -> pm.Model: - num_variants = len(data) - trials = [d.trials for d in data] - successes = [d.successes for d in data] - with pm.Model() as model: - p = pm.Beta("p", alpha=self.priors.alpha, beta=self.priors.beta, shape=num_variants) - y = pm.Binomial("y", n=trials, p=p, observed=successes, shape=num_variants) - reluplift = [] - for i in range(num_variants): - if comparison_method == "compare_to_control": - comparison = p[0] - elif comparison_method == "best_of_rest": - others = [p[j] for j in range(num_variants) if j != i] - if len(others) > 1: - comparison = pm.math.maximum(*others) - else: - comparison = others[0] - else: - raise ValueError(f"comparison method {comparison_method} not recognised.") - reluplift.append(pm.Deterministic(f"reluplift_{i}", p[i] / comparison - 1)) - return model -``` - -```{code-cell} ipython3 -def run_scenario_bernoulli( - variants: List[str], - true_rates: List[float], - samples_per_variant: int, - priors: BetaPrior, - comparison_method: str, -) -> az.InferenceData: - generated = generate_binomial_data(variants, true_rates, samples_per_variant) - data = [BinomialData(**generated[v].to_dict()) for v in variants] - with ConversionModel(priors).create_model(data=data, comparison_method=comparison_method): - trace = pm.sample(draws=5000) - - n_plots = len(variants) - fig, axs = plt.subplots(nrows=n_plots, ncols=1, figsize=(3 * n_plots, 7), sharex=True) - for i, variant in enumerate(variants): - if i == 0 and comparison_method == "compare_to_control": - axs[i].set_yticks([]) - else: - az.plot_posterior(trace.posterior[f"reluplift_{i}"], ax=axs[i], **plotting_defaults) - axs[i].set_title(f"Rel Uplift {variant}, True Rate = {true_rates[i]:.2%}", fontsize=10) - axs[i].axvline(x=0, color="red") - fig.suptitle(f"Method {comparison_method}, {priors}") - - return trace -``` - -We generate data where variants B and C are well above A, but quite close to each other: - -```{code-cell} ipython3 -_ = run_scenario_bernoulli( - variants=["A", "B", "C"], - true_rates=[0.21, 0.23, 0.228], - samples_per_variant=100000, - priors=BetaPrior(alpha=5000, beta=5000), - comparison_method="compare_to_control", -) -``` - -* The relative uplift posteriors for both B and C show that they are clearly better than A (94% HDI well above 0), by roughly 7-8% relative. -* However, we can't infer whether there is a winner between B and C. - -```{code-cell} ipython3 -_ = run_scenario_bernoulli( - variants=["A", "B", "C"], - true_rates=[0.21, 0.23, 0.228], - samples_per_variant=100000, - priors=BetaPrior(alpha=5000, beta=5000), - comparison_method="best_of_rest", -) -``` - -* The uplift plot for A tells us that it's a clear loser compared to variants B and C (94% HDI for A's relative uplift is well below 0). -* Note that the relative uplift calculations for B and C are effectively ignoring variant A. This is because, say, when we are calculating `reluplift` for B, the maximum of the other variants will likely be variant C. Similarly when we are calculating `reluplift` for C, it is likely being compared to B. -* The uplift plots for B and C tell us that we can't yet call a clear winner between the two variants, as the 94% HDI still overlaps with 0. We'd need a larger sample size to detect the 23% vs 22.8% conversion rate difference. 
-* One disadvantage of this approach is that we can't directly say what the uplift of these variants is over variant A (the control). This number is often important in practice, as it allows us to estimate the overall impact if the A/B test changes were rolled out to all visitors. We _can_ get this number approximately though, by reframing the question to be "how much worse is A compared to the other two variants" (which is shown in Variant A's relative uplift distribution). - -+++ - -### Value Conversions - -+++ - -Now what if we wanted to compare A/B test variants in terms of how much revenue they generate, and/or estimate how much additional revenue a winning variant brings? We can't use a Beta-Binomial model for this, as the possible values for each visitor are now in the range `[0, Inf)`. The model proposed in the VWO paper is as follows: - -The revenue generated by an individual visitor is `revenue = probability of paying at all * mean amount spent when paying`: - -$$\mathrm{Revenue}_i = \mathrm{Bernoulli}(\theta)_i * \mathrm{Exponential}(\lambda)_i I(\mathrm{Bernoulli}(\theta)_i = 1)$$ - -We assume that the probability of paying at all is independent to the mean amount spent when paying. This is a typical assumption in practice, unless we have reason to believe that the two parameters have dependencies. With this, we can create separate models for the total number of visitors paying, and the total amount spent amongst the purchasing visitors (assuming independence between the behaviour of each visitor): - -$$c \sim \sum^N\mathrm{Bernoulli}(\theta) = \mathrm{Binomial}(N, \theta)$$ - -$$r \sim \sum^K\mathrm{Exponential}(\lambda) = \mathrm{Gamma}(K, \lambda)$$ - -where $N$ is the total number of visitors, $K$ is the total number of visitors with at least one purchase. - -We can re-use our Beta-Binomial model from before to model the Bernoulli conversions. For the mean purchase amount, we use a Gamma prior (which is also a conjugate prior to the Gamma likelihood). So in a two-variant test, the setup is: - -$$\theta_A \sim \theta_B \sim \mathrm{Beta}(\alpha_1, \beta_1)$$ -$$\lambda_A \sim \lambda_B \sim \mathrm{Gamma}(\alpha_2, \beta_2)$$ -$$c_A \sim \mathrm{Binomial}(N_A, \theta_A), c_B \sim \mathrm{Binomial}(N_B, \theta_B)$$ -$$r_A \sim \mathrm{Gamma}(c_A, \lambda_A), r_B \sim \mathrm{Gamma}(c_B, \lambda_B)$$ -$$\mu_A = \theta_A * \dfrac{1}{\lambda_A}, \mu_B = \theta_B * \dfrac{1}{\lambda_B}$$ -$$\mathrm{reluplift}_B = \mu_B / \mu_A - 1$$ - -$\mu$ here represents the average revenue per visitor, including those who don't make a purchase. This is the best way to capture the overall revenue effect - some variants may increase the average sales value, but reduce the proportion of visitors that pay at all (e.g. if we promoted more expensive items on the landing page). - -Below we put the model setup into code and perform prior predictive checks. 
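As a quick aside, the sum-of-exponentials claim above can be checked by simulation before we build the model; the following sketch (illustrative only, with made-up parameter values) compares the simulated total spend of $K$ purchasers against the corresponding Gamma distribution:

```python
# Simulation check (made-up values): the sum of K Exponential(lam) purchase
# amounts should be distributed as Gamma(K, lam).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, K = 0.1, 500  # rate lam = 0.1 corresponds to a mean purchase of 10
totals = rng.exponential(scale=1 / lam, size=(10_000, K)).sum(axis=1)

# Kolmogorov-Smirnov test against Gamma(K, lam); a large p-value is consistent
# with the simulated totals following that Gamma distribution
print(stats.kstest(totals, stats.gamma(a=K, scale=1 / lam).cdf))
```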
- -```{code-cell} ipython3 -@dataclass -class GammaPrior: - alpha: float - beta: float -``` - -```{code-cell} ipython3 -@dataclass -class RevenueData: - visitors: int - purchased: int - total_revenue: float -``` - -```{code-cell} ipython3 -class RevenueModel: - def __init__(self, conversion_rate_prior: BetaPrior, mean_purchase_prior: GammaPrior): - self.conversion_rate_prior = conversion_rate_prior - self.mean_purchase_prior = mean_purchase_prior - - def create_model(self, data: List[RevenueData], comparison_method: str) -> pm.Model: - num_variants = len(data) - visitors = [d.visitors for d in data] - purchased = [d.purchased for d in data] - total_revenue = [d.total_revenue for d in data] - - with pm.Model() as model: - theta = pm.Beta( - "theta", - alpha=self.conversion_rate_prior.alpha, - beta=self.conversion_rate_prior.beta, - shape=num_variants, - ) - lam = pm.Gamma( - "lam", - alpha=self.mean_purchase_prior.alpha, - beta=self.mean_purchase_prior.beta, - shape=num_variants, - ) - converted = pm.Binomial( - "converted", n=visitors, p=theta, observed=purchased, shape=num_variants - ) - revenue = pm.Gamma( - "revenue", alpha=purchased, beta=lam, observed=total_revenue, shape=num_variants - ) - revenue_per_visitor = pm.Deterministic("revenue_per_visitor", theta * (1 / lam)) - theta_reluplift = [] - reciprocal_lam_reluplift = [] - reluplift = [] - for i in range(num_variants): - if comparison_method == "compare_to_control": - comparison_theta = theta[0] - comparison_lam = 1 / lam[0] - comparison_rpv = revenue_per_visitor[0] - elif comparison_method == "best_of_rest": - others_theta = [theta[j] for j in range(num_variants) if j != i] - others_lam = [1 / lam[j] for j in range(num_variants) if j != i] - others_rpv = [revenue_per_visitor[j] for j in range(num_variants) if j != i] - if len(others_rpv) > 1: - comparison_theta = pm.math.maximum(*others_theta) - comparison_lam = pm.math.maximum(*others_lam) - comparison_rpv = pm.math.maximum(*others_rpv) - else: - comparison_theta = others_theta[0] - comparison_lam = others_lam[0] - comparison_rpv = others_rpv[0] - else: - raise ValueError(f"comparison method {comparison_method} not recognised.") - theta_reluplift.append( - pm.Deterministic(f"theta_reluplift_{i}", theta[i] / comparison_theta - 1) - ) - reciprocal_lam_reluplift.append( - pm.Deterministic( - f"reciprocal_lam_reluplift_{i}", (1 / lam[i]) / comparison_lam - 1 - ) - ) - reluplift.append( - pm.Deterministic(f"reluplift_{i}", revenue_per_visitor[i] / comparison_rpv - 1) - ) - return model -``` - -For the Beta prior, we can set a similar prior to before - centered around 0.5, with the magnitude of `alpha` and `beta` determining how "thin" the distribution is. - -We need to be a bit more careful about the Gamma prior. The mean of the Gamma prior is $\dfrac{\alpha_G}{\beta_G}$, and needs to be set to a reasonable value given existing mean purchase values. For example, if `alpha` and `beta` were set such that the mean was 1 dollar, but the average revenue per visitor for a website is much higher at 100 dollars, his could affect our inference. 
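One way to do this is to translate a desired prior mean and standard deviation directly into `alpha` and `beta` using the moments of the Gamma distribution. A small helper sketch (hypothetical, not part of the original notebook):

```python
# Illustrative helper: Gamma(alpha, beta) has mean alpha / beta and
# variance alpha / beta**2, so we can solve for the hyperparameters.
import numpy as np


def gamma_hyperparams(mean: float, sd: float) -> tuple:
    alpha = (mean / sd) ** 2
    beta = mean / sd**2
    return alpha, beta


# e.g. a prior centred on 10 with a small standard deviation is roughly the
# GammaPrior(alpha=9000, beta=900) used below
print(gamma_hyperparams(10, 10 / np.sqrt(9000)))
```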
- -```{code-cell} ipython3 -c_prior = BetaPrior(alpha=5000, beta=5000) -mp_prior = GammaPrior(alpha=9000, beta=900) -``` - -```{code-cell} ipython3 -data = [ - RevenueData(visitors=1, purchased=1, total_revenue=1), - RevenueData(visitors=1, purchased=1, total_revenue=1), -] -``` - -```{code-cell} ipython3 -with RevenueModel(c_prior, mp_prior).create_model(data, "best_of_rest"): - revenue_prior_predictive = pm.sample_prior_predictive(samples=10000, return_inferencedata=False) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -az.plot_posterior(revenue_prior_predictive["reluplift_1"], ax=ax, **plotting_defaults) -ax.set_title(f"Revenue Rel Uplift Prior Predictive, {c_prior}, {mp_prior}", fontsize=10) -ax.axvline(x=0, color="red"); -``` - -Similar to the model for Bernoulli conversions, the width of the prior predictive uplift distribution will depend on the strength of our priors. See the Bernoulli conversions section for a discussion of the benefits and disadvantages of using a weak vs. strong prior. - -Next we generate synthetic data for the model. We'll generate the following scenarios: - -* Same propensity to purchase and same mean purchase value. -* Lower propensity to purchase and higher mean purchase value, but overall same revenue per visitor. -* Higher propensity to purchase and higher mean purchase value, and overall higher revenue per visitor. - -```{code-cell} ipython3 -def generate_revenue_data( - variants: List[str], - true_conversion_rates: List[float], - true_mean_purchase: List[float], - samples_per_variant: int, -) -> pd.DataFrame: - converted = {} - mean_purchase = {} - for variant, p, mp in zip(variants, true_conversion_rates, true_mean_purchase): - converted[variant] = bernoulli.rvs(p, size=samples_per_variant) - mean_purchase[variant] = expon.rvs(scale=mp, size=samples_per_variant) - converted = pd.DataFrame(converted) - mean_purchase = pd.DataFrame(mean_purchase) - revenue = converted * mean_purchase - agg = pd.concat( - [ - converted.aggregate(["count", "sum"]).rename( - index={"count": "visitors", "sum": "purchased"} - ), - revenue.aggregate(["sum"]).rename(index={"sum": "total_revenue"}), - ] - ) - return agg -``` - -```{code-cell} ipython3 -def run_scenario_value( - variants: List[str], - true_conversion_rates: List[float], - true_mean_purchase: List[float], - samples_per_variant: int, - conversion_rate_prior: BetaPrior, - mean_purchase_prior: GammaPrior, - comparison_method: str, -) -> az.InferenceData: - generated = generate_revenue_data( - variants, true_conversion_rates, true_mean_purchase, samples_per_variant - ) - data = [RevenueData(**generated[v].to_dict()) for v in variants] - with RevenueModel(conversion_rate_prior, mean_purchase_prior).create_model( - data, comparison_method - ): - trace = pm.sample(draws=5000, chains=2, cores=1) - - n_plots = len(variants) - fig, axs = plt.subplots(nrows=n_plots, ncols=1, figsize=(3 * n_plots, 7), sharex=True) - for i, variant in enumerate(variants): - if i == 0 and comparison_method == "compare_to_control": - axs[i].set_yticks([]) - else: - az.plot_posterior(trace.posterior[f"reluplift_{i}"], ax=axs[i], **plotting_defaults) - true_rpv = true_conversion_rates[i] * true_mean_purchase[i] - axs[i].set_title(f"Rel Uplift {variant}, True RPV = {true_rpv:.2f}", fontsize=10) - axs[i].axvline(x=0, color="red") - fig.suptitle(f"Method {comparison_method}, {conversion_rate_prior}, {mean_purchase_prior}") - - return trace -``` - -#### Scenario 1 - same underlying purchase rate and mean purchase value - -```{code-cell} 
ipython3 -_ = run_scenario_value( - variants=["A", "B"], - true_conversion_rates=[0.1, 0.1], - true_mean_purchase=[10, 10], - samples_per_variant=100000, - conversion_rate_prior=BetaPrior(alpha=5000, beta=5000), - mean_purchase_prior=GammaPrior(alpha=9000, beta=900), - comparison_method="best_of_rest", -) -``` - -* The 94% HDI contains 0 as expected. - -+++ - -#### Scenario 2 - lower purchase rate, higher mean purchase, same overall revenue per visitor - -```{code-cell} ipython3 -scenario_value_2 = run_scenario_value( - variants=["A", "B"], - true_conversion_rates=[0.1, 0.08], - true_mean_purchase=[10, 12.5], - samples_per_variant=100000, - conversion_rate_prior=BetaPrior(alpha=5000, beta=5000), - mean_purchase_prior=GammaPrior(alpha=9000, beta=900), - comparison_method="best_of_rest", -) -``` - -* The 94% HDI for the average revenue per visitor (RPV) contains 0 as expected. -* In these cases, it's also useful to plot the relative uplift distributions for `theta` (the purchase-anything rate) and `1 / lam` (the mean purchase value) to understand how the A/B test has affected visitor behaviour. We show this below: - -```{code-cell} ipython3 -axs = az.plot_posterior( - scenario_value_2, - var_names=["theta_reluplift_1", "reciprocal_lam_reluplift_1"], - **plotting_defaults, -) -axs[0].set_title(f"Conversion Rate Uplift B, True Uplift = {(0.04 / 0.05 - 1):.2%}", fontsize=10) -axs[0].axvline(x=0, color="red") -axs[1].set_title( - f"Revenue per Converting Visitor Uplift B, True Uplift = {(25 / 20 - 1):.2%}", fontsize=10 -) -axs[1].axvline(x=0, color="red"); -``` - -* Variant B's conversion rate uplift has a HDI well below 0, while the revenue per converting visitor has a HDI well above 0. So the model is able to capture the reduction in purchasing visitors as well as the increase in mean purchase amount. - -+++ - -#### Scenario 3 - Higher propensity to purchase and mean purchase value - -```{code-cell} ipython3 -_ = run_scenario_value( - variants=["A", "B"], - true_conversion_rates=[0.1, 0.11], - true_mean_purchase=[10, 10.5], - samples_per_variant=100000, - conversion_rate_prior=BetaPrior(alpha=5000, beta=5000), - mean_purchase_prior=GammaPrior(alpha=9000, beta=900), - comparison_method="best_of_rest", -) -``` - -* The 94% HDI is above 0 for variant B as expected. - -Note that one concern with using value conversions in practice (that doesn't show up when we're just simulating synthetic data) is the existence of outliers. For example, a visitor in one variant could spend thousands of dollars, and the observed revenue data no longer follows a 'nice' distribution like Gamma. It's common to impute these outliers prior to running a statistical analysis (we have to be careful with removing them altogether, as this could bias the inference), or fall back to bernoulli conversions for decision making. - -+++ - -### Further Reading - -There are many other considerations to implementing a Bayesian framework to analyse A/B tests in practice. Some include: - -* How do we choose our prior distributions? -* In practice, people look at A/B test results every day, not only once at the end of the test. How do we balance finding true differences faster vs. minizing false discoveries (the 'early stopping' problem)? -* How do we plan the length and size of A/B tests using power analysis, if we're using Bayesian models to analyse the results? 
-* Outside of the conversion rates (bernoulli random variables for each visitor), many value distributions in online software cannot be fit with nice densities like Normal, Gamma, etc. How do we model these? - -Various textbooks and online resources dive into these areas in more detail. Doing Bayesian Data Analysis {cite:p}`kruschke2014doing` by John Kruschke is a great resource, and has been translated to PyMC [here](https://github.com/JWarmenhoven/DBDA-python). - -We also plan to create more PyMC tutorials on these topics, so stay tuned! - -+++ - -## Authors - -* Authored by [Cuong Duong](https://github.com/tcuongd) in May, 2021 ([pymc-examples#164](https://github.com/pymc-devs/pymc-examples/pull/164)) -* Re-executed by [percevalve](https://github.com/percevalve) in May, 2022 ([pymc-examples#351](https://github.com/pymc-devs/pymc-examples/pull/351)) - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/binning.myst.md b/myst_nbs/case_studies/binning.myst.md deleted file mode 100644 index dadb01c7a..000000000 --- a/myst_nbs/case_studies/binning.myst.md +++ /dev/null @@ -1,952 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python [conda env:pymc_env] - language: python - name: conda-env-pymc_env-py ---- - -(awkward_binning)= -# Estimating parameters of a distribution from awkwardly binned data -:::{post} Oct 23, 2021 -:tags: binned data, case study, parameter estimation -:category: intermediate -:author: Eric Ma, Benjamin T. Vincent -::: - -+++ - -## The problem -Let us say that we are interested in inferring the properties of a population. This could be anything from the distribution of age, or income, or body mass index, or a whole range of different possible measures. In completing this task, we might often come across the situation where we have multiple datasets, each of which can inform our beliefs about the overall population. - -Very often this data can be in a form that we, as data scientists, would not consider ideal. For example, this data may have been binned into categories. One reason why this is not ideal is that this binning process actually discards information - we lose any knowledge about where in a certain bin an individual datum lies. A second reason why this is not ideal is that different studies may use alternative binning methods - for example one study might record age in terms of decades (e.g. is someone in their 20's, 30's, 40's and so on) but another study may record age (indirectly) by assigning generational labels (Gen Z, Millennial, Gen X, Boomer II, Boomer I, Post War) for example. - -So we are faced with a problem: we have datasets with counts of our measure of interest (whether that be age, income, BMI, or whatever), but they are binned, and they have been binned _differently_. This notebook presents a solution to this problem that [PyMC Labs](https://www.pymc-labs.io) worked on, supported by the [Gates Foundation](https://www.gatesfoundation.org/). We _can_ make inferences about the parameters of a population level distribution. 
- -![](gates.png) - -+++ - -## The solution - -More formally, we describe the problem as: if we have the bin edges (aka cut points) used for data binning, and bin counts, how can we estimate the parameters of the underlying distribution? We will present a solution and various illustrative examples of this solution, which makes the following assumptions: - -1. that the bins are order-able (e.g. underweight, normal, overweight, obese), -2. the underlying distribution is specified in a parametric form, and -3. the cut points that delineate the bins are known and can be pinpointed on the support of the distribution (also known as the valid values that the probability distribution can return). - -The approach used is heavily based upon the logic behind [ordinal regression](https://en.wikipedia.org/wiki/Ordinal_regression). This approach proposes that observed bin counts $Y = {1, 2, \ldots, K}$ are generated from a set of bin edges (aka cutpoints) $\theta_1, \ldots, \theta _{K-1}$ operating upon a latent probability distribution which we could call $y^*$. We can describe the probability of observing data in bin 1 as: - -$$P(Y=1) = \Phi(\theta_1) - \Phi(-\infty) = \Phi(\theta_1) - 0$$ - -bin 2 as: - -$$P(Y=2) = \Phi(\theta_2) - \Phi(\theta_1)$$ - -bin 3 as: - -$$P(Y=3) = \Phi(\theta_3) - \Phi(\theta_2)$$ - -and bin 4 as: - -$$P(Y=4) = \Phi(\infty) - \Phi(\theta_3) = 1 - \Phi(\theta_3)$$ - -where $\Phi$ is the standard cumulative normal. - -![](ordinal.png) - -In ordinal regression, the cutpoints are treated as latent variables and the parameters of the normal distribution may be treated as observed (or derived from other predictor variables). This problem differs in that: - -- the parameters of the Gaussian are _unknown_, -- we do not want to be confined to the Gaussian distribution, -- we have observed an array of `cutpoints`, -- we have observed bin `counts`, - -We are now in a position to sketch out a generative PyMC model: - -```python -import aesara.tensor as at - -with pm.Model() as model: - # priors - mu = pm.Normal("mu") - sigma = pm.HalfNormal("sigma") - # generative process - probs = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), cutpoints)) - probs = pm.math.concatenate([[0], probs, [1]]) - probs = at.extra_ops.diff(probs) - # likelihood - pm.Multinomial("counts", p=probs, n=sum(counts), observed=counts) -``` - -The exact way we implement the models below differs only very slightly from this, but let's decompose how this works. -Firstly we define priors over the `mu` and `sigma` parameters of the latent distribution. Then we have 3 lines which calculate the probability that any observed datum falls in a given bin. The first line of this -```python -probs = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), cutpoints)) -``` -calculates the cumulative density at each of the cutpoints. The second line -```python -probs = pm.math.concatenate([[0], probs, [1]]) -``` -simply concatenates the cumulative density at $-\infty$ (which is zero) and at $\infty$ (which is 1). -The third line -```python -probs = at.extra_ops.diff(probs) -``` -calculates the difference between consecutive cumulative densities to give the actual probability of a datum falling in any given bin. - -Finally, we end with the Multinomial likelihood which tells us the likelihood of observing the `counts` given the set of bin `probs` and the total number of observations `sum(counts)`. - -Hypothetically we could have used base python, or numpy, to describe the generative process. 
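For instance, a minimal NumPy/SciPy version of the bin-probability calculation might look like the following (example parameter values and bin edges only; it mirrors the three `probs` lines above):

```python
# Gradient-free version of the bin-probability calculation (illustrative only).
import numpy as np
from scipy.stats import norm

mu, sigma = -2.0, 2.0  # example parameter values
cutpoints = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # example bin edges

cdf = norm(mu, sigma).cdf(cutpoints)  # cumulative density at each cutpoint
cdf = np.concatenate([[0.0], cdf, [1.0]])  # prepend CDF at -inf, append CDF at +inf
probs = np.diff(cdf)  # probability mass in each bin
print(probs, probs.sum())  # the bin probabilities sum to 1
```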
The problem with this, however, is that gradient information would be lost. Implementing these operations with a numerical library that retains gradient information means the model can be used with gradient-based MCMC sampling algorithms. - -The approach was illustrated with a Gaussian distribution, and below we show a number of worked examples using Gaussian distributions. However, the approach is general, and at the end of the notebook we provide a demonstration that the approach does indeed extend to non-Gaussian distributions. - -```{code-cell} ipython3 -:tags: [] - -import warnings - -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns - -warnings.filterwarnings(action="ignore", category=UserWarning) -``` - -```{code-cell} ipython3 -%matplotlib inline -%config InlineBackend.figure_format = 'retina' -plt.rcParams.update({"font.size": 14}) -az.style.use("arviz-darkgrid") -rng = np.random.default_rng(1234) -``` - -## Simulated data with a Gaussian distribution - -The first few examples will be based on 2 hypothetical studies which measure a Gaussian distributed variable. Each study will have its own sample size, and our task is to learn the parameters of the population level Gaussian distribution. Frustration 1 is that the data have been binned. Frustration 2 is that each study has used different categories, that is, different cutpoints in this data binning process. - -In this simulation approach, we will define the true population level parameters as: -- `true_mu`: -2.0 -- `true_sigma`: 2.0 - -Our goal will be to recover the `mu` and `sigma` values given only the bin counts and cutpoints. - -```{code-cell} ipython3 -# Generate two different sets of random samples from the same Gaussian. -true_mu, true_sigma = -2, 2 -x1 = rng.normal(loc=true_mu, scale=true_sigma, size=1500) -x2 = rng.normal(loc=true_mu, scale=true_sigma, size=2000) -``` - -The studies used the following, different, cutpoints for their data binning process. - -```{code-cell} ipython3 -# First discretization (cutpoints) -d1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0]) -# Second discretization (cutpoints) -d2 = np.array([-5.0, -3.5, -2.0, -0.5, 1.0, 2.5]) -``` - -```{code-cell} ipython3 -def data_to_bincounts(data, cutpoints): - # categorise each datum into correct bin - bins = np.digitize(data, bins=cutpoints) - # bin counts - counts = pd.DataFrame({"bins": bins}).groupby(by="bins")["bins"].agg("count") - return counts - - -c1 = data_to_bincounts(x1, d1) -c2 = data_to_bincounts(x2, d2) -``` - -Let's visualise this in one convenient figure. The left hand column shows the underlying data and the cutpoints for both studies. The right hand column shows the resulting bin counts.
- -```{code-cell} ipython3 -fig, ax = plt.subplots(2, 2, figsize=(12, 8)) - -# First set of measurements -ax[0, 0].hist(x1, 50, alpha=0.5) - -for cut in d1: - ax[0, 0].axvline(cut, color="k", ls=":") - -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0, 1], alpha=0.5) -ax[0, 1].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0, 1].set(title="Study 1", xlabel="c1", ylabel="bin count") - -# Second set of measuremsnts -ax[1, 0].hist(x2, 50, alpha=0.5) - -for cut in d2: - ax[1, 0].axvline(cut, color="k", ls=":") - -# Plot observed bin counts -c2.plot(kind="bar", ax=ax[1, 1], alpha=0.5) -ax[1, 1].set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax[1, 1].set(title="Study 2", xlabel="c2", ylabel="bin count") - -# format axes -ax[0, 0].set(xlim=(-11, 5), xlabel="$x$", ylabel="observed frequency", title="Sample 1") -ax[1, 0].set(xlim=(-11, 5), xlabel="$x$", ylabel="observed frequency", title="Sample 2"); -``` - -Each bin is paired with counts. -As you'll see above, -`c1` and `c2` are binned differently. -One has 6 bins, the other has 7. -`c1` omits basically half of the Gaussian distribution. - -To recap, in a real situation we might have access to the cutpoints and to the bin counts, but _not_ the underlying data `x1` or `x2`. - -+++ - -## Example 1: Gaussian parameter estimation with one set of bins - -We will start by investigating what happens when we use only one set of bins to estimate the `mu` and `sigma` parameter. - -+++ - -### Model specification - -```{code-cell} ipython3 -:tags: [] - -with pm.Model() as model1: - sigma = pm.HalfNormal("sigma") - mu = pm.Normal("mu") - - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([[0], probs1, [1]])) - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model1) -``` - -```{code-cell} ipython3 -:tags: [] - -with model1: - trace1 = pm.sample() -``` - -### Checks on model - -We first start with posterior predictive checks. -Given the posterior values, -we should be able to generate observations that look close to what we observed. - -```{code-cell} ipython3 -:tags: [] - -with model1: - ppc = pm.sample_posterior_predictive(trace1) -``` - -We can do this graphically. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(12, 4)) -# Plot observed bin counts -c1.plot(kind="bar", ax=ax, alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter(x="counts1_dim_0", y="counts1", color="k", alpha=0.2) -# Formatting -ax.set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax.set_title("Six bin discretization of N(-2, 2)") -``` - -It looks like the numbers are in the right ballpark. -With the numbers ordered correctly, we also have the correct proportions identified. - -+++ - -We can also get programmatic access to our posterior predictions in a number of ways: - -```{code-cell} ipython3 -ppc.posterior_predictive.counts1.values -``` - -Let's take the mean and compare it against the observed counts: - -```{code-cell} ipython3 -ppc.posterior_predictive.counts1.mean(dim=["chain", "draw"]).values -``` - -```{code-cell} ipython3 -c1.values -``` - -### Recovering parameters - -The more important question is whether we have recovered the parameters of the distribution or not. -Recall that we used `mu = -2` and `sigma = 2` to generate the data. 
- -```{code-cell} ipython3 -:tags: [] - -az.plot_posterior(trace1, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma]); -``` - -Pretty good! And we can access the posterior mean estimates (stored as [xarray](http://xarray.pydata.org/en/stable/index.html) types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (`chain`), and one for the sample number (`draw`). We can calculate the overal posterior average with `.mean(dim=["draw", "chain"])`. - -```{code-cell} ipython3 -:tags: [] - -trace1.posterior["mu"].mean(dim=["draw", "chain"]).values -``` - -```{code-cell} ipython3 -:tags: [] - -trace1.posterior["sigma"].mean(dim=["draw", "chain"]).values -``` - -## Example 2: Parameter estimation with the other set of bins - -Above, we used one set of binned data. Let's see what happens when we swap out for the other set of data. - -+++ - -### Model specification - -As with the above, here's the model specification. - -```{code-cell} ipython3 -:tags: [] - -with pm.Model() as model2: - sigma = pm.HalfNormal("sigma") - mu = pm.Normal("mu") - - probs2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([[0], probs2, [1]])) - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values) -``` - -```{code-cell} ipython3 -:tags: [] - -with model2: - trace2 = pm.sample() -``` - -```{code-cell} ipython3 -:tags: [] - -az.plot_trace(trace2); -``` - -### Posterior predictive checks - -Let's run a PPC check to ensure we are generating data that are similar to what we observed. - -```{code-cell} ipython3 -:tags: [] - -with model2: - ppc = pm.sample_posterior_predictive(trace2) -``` - -We calculate the mean bin posterior predictive bin counts, averaged over samples. - -```{code-cell} ipython3 -:tags: [] - -ppc.posterior_predictive.counts2.mean(dim=["chain", "draw"]).values -``` - -```{code-cell} ipython3 -:tags: [] - -c2.values -``` - -Looks like a good match. But as always it is wise to visualise things. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(12, 4)) -# Plot observed bin counts -c2.plot(kind="bar", ax=ax, alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter(x="counts2_dim_0", y="counts2", color="k", alpha=0.2) -# Formatting -ax.set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax.set_title("Seven bin discretization of N(-2, 2)") -``` - -Not bad! - -+++ - -### Recovering parameters - -And did we recover the parameters? - -```{code-cell} ipython3 -az.plot_posterior(trace2, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma]); -``` - -```{code-cell} ipython3 -:tags: [] - -trace2.posterior["mu"].mean(dim=["draw", "chain"]).values -``` - -```{code-cell} ipython3 -:tags: [] - -trace2.posterior["sigma"].mean(dim=["draw", "chain"]).values -``` - -## Example 3: Parameter estimation with two bins together - -Now we need to see what happens if we add in both ways of binning. 
- -+++ - -### Model Specification - -```{code-cell} ipython3 -:tags: [] - -with pm.Model() as model3: - sigma = pm.HalfNormal("sigma") - mu = pm.Normal("mu") - - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("normal1_cdf", probs1) - - probs2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs2, np.array([1])])) - probs2 = pm.Deterministic("normal2_cdf", probs2) - - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values) - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model3) -``` - -```{code-cell} ipython3 -:tags: [] - -with model3: - trace3 = pm.sample() -``` - -```{code-cell} ipython3 -az.plot_pair(trace3, var_names=["mu", "sigma"], divergences=True); -``` - -### Posterior predictive checks - -```{code-cell} ipython3 -with model3: - ppc = pm.sample_posterior_predictive(trace3) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True) - -# Study 1 ---------------------------------------------------------------- -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter( - x="counts1_dim_0", y="counts1", color="k", alpha=0.2, ax=ax[0] -) -# Formatting -ax[0].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0].set_title("Six bin discretization of N(-2, 2)") - -# Study 1 ---------------------------------------------------------------- -# Plot observed bin counts -c2.plot(kind="bar", ax=ax[1], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter( - x="counts2_dim_0", y="counts2", color="k", alpha=0.2, ax=ax[1] -) -# Formatting -ax[1].set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax[1].set_title("Seven bin discretization of N(-2, 2)") -``` - -### Recovering parameters - -```{code-cell} ipython3 -:tags: [] - -trace3.posterior["mu"].mean(dim=["draw", "chain"]).values -``` - -```{code-cell} ipython3 -:tags: [] - -trace3.posterior["sigma"].mean(dim=["draw", "chain"]).values -``` - -```{code-cell} ipython3 -:tags: [] - -az.plot_posterior(trace3, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma]); -``` - -## Example 4: Parameter estimation with continuous and binned measures - -For the sake of completeness, let's see how we can estimate population parameters based one one set of continuous measures, and one set of binned measures. We will use the simulated data we have already generated. 
- -+++ - -### Model Specification - -```{code-cell} ipython3 -with pm.Model() as model4: - sigma = pm.HalfNormal("sigma") - mu = pm.Normal("mu") - # study 1 - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("normal1_cdf", probs1) - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values) - # study 2 - pm.Normal("y", mu=mu, sigma=sigma, observed=x2) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model4) -``` - -```{code-cell} ipython3 -with model4: - trace4 = pm.sample() -``` - -### Posterior predictive checks - -```{code-cell} ipython3 -with model4: - ppc = pm.sample_posterior_predictive(trace4) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(12, 4)) - -# Study 1 ---------------------------------------------------------------- -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter( - x="counts1_dim_0", y="counts1", color="k", alpha=0.2, ax=ax[0] -) -# Formatting -ax[0].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0].set_title("Posterior predictive: Study 1") - -# Study 2 ---------------------------------------------------------------- -ax[1].hist(ppc.posterior_predictive.y.values.flatten(), 50, density=True, alpha=0.5) -ax[1].set(title="Posterior predictive: Study 2", xlabel="$x$", ylabel="density"); -``` - -We can calculate the mean and standard deviation of the posterior predictive distribution for study 2 and see that they are close to our true parameters. - -```{code-cell} ipython3 -np.mean(ppc.posterior_predictive.y.values.flatten()), np.std( - ppc.posterior_predictive.y.values.flatten() -) -``` - -### Recovering parameters -Finally, we can check the posterior estimates of the parameters and see that the estimates here are spot on. - -```{code-cell} ipython3 -az.plot_posterior(trace4, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma]); -``` - -## Example 5: Hierarchical estimation -The previous examples all assumed that study 1 and study 2 data were sampled from the same population. While this was in fact true for our simulated data, when we are working with real data, we are not in a position to know this. So it could be useful to be able to ask the question, "does it look like data from study 1 and study 2 are drawn from the same population?" - -We can do this using the same basic approach - we can estimate population level parameters like before, but now we can add in _study level_ parameter estimates. This will be a new hierarchical layer in our model between the population level parameters and the observations. - -+++ - -### Model specification - -This time, because we are getting into a more complicated model, we will use `coords` to tell PyMC about the dimensionality of the variables. This feeds in to the posterior samples which are outputted in xarray format, which makes life easier when processing posterior samples for statistical or visualization purposes later. 
- -```{code-cell} ipython3 -coords = { - "study": np.array([0, 1]), - "bin1": np.arange(len(c1)), - "bin2": np.arange(len(c2)), -} -``` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model5: - # Population level priors - mu_pop_mean = pm.Normal("mu_pop_mean", 0.0, 1.0) - mu_pop_variance = pm.HalfNormal("mu_pop_variance", sigma=1) - - sigma_pop_mean = pm.HalfNormal("sigma_pop_mean", sigma=1) - sigma_pop_sigma = pm.HalfNormal("sigma_pop_sigma", sigma=1) - - # Study level priors - mu = pm.Normal("mu", mu=mu_pop_mean, sigma=mu_pop_variance, dims="study") - sigma = pm.TruncatedNormal( - "sigma", mu=sigma_pop_mean, sigma=sigma_pop_sigma, lower=0, dims="study" - ) - - # Study 1 - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[0], sigma=sigma[0]), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("normal1_cdf", probs1, dims="bin1") - - # Study 2 - probs2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[1], sigma=sigma[1]), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs2, np.array([1])])) - probs2 = pm.Deterministic("normal2_cdf", probs2, dims="bin2") - - # Likelihood - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values, dims="bin1") - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values, dims="bin2") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model5) -``` - -The model above is fine _but_ running this model as it is results in hundreds of divergences in the sampling process (you can find out more about divergences from the {ref}`diagnosing_with_divergences` notebook). While we won't go deep into the reasons here, the long story cut short is that Gaussian centering introduces pathologies into our log likelihood space that make it difficult for MCMC samplers to work. Firstly, we removed the population level estimates on `sigma` and just stick with study level priors. We used the Gamma distribution to avoid any zero values. Secondly use a non-centered reparameterization to specify `mu`. This does not completely solve the problem, but it does drastically reduce the number of divergences. - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model5: - # Population level priors - mu_pop_mean = pm.Normal("mu_pop_mean", 0.0, 1.0) - mu_pop_variance = pm.HalfNormal("mu_pop_variance", sigma=1) - - # Study level priors - x = pm.Normal("x", dims="study") - mu = pm.Deterministic("mu", x * mu_pop_variance + mu_pop_mean, dims="study") - - sigma = pm.Gamma("sigma", alpha=2, beta=1, dims="study") - - # Study 1 - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[0], sigma=sigma[0]), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("normal1_cdf", probs1, dims="bin1") - - # Study 2 - probs2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[1], sigma=sigma[1]), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs2, np.array([1])])) - probs2 = pm.Deterministic("normal2_cdf", probs2, dims="bin2") - - # Likelihood - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values, dims="bin1") - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values, dims="bin2") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model5) -``` - -```{code-cell} ipython3 -with model5: - trace5 = pm.sample(tune=2000, target_accept=0.99) -``` - -We can see that despite our efforts, we still get some divergences. 
Plotting the samples and highlighting the divergences suggests (from the top left subplot) that our model is suffering from the funnel problem. - -```{code-cell} ipython3 -az.plot_pair( - trace5, var_names=["mu_pop_mean", "mu_pop_variance", "sigma"], coords=coords, divergences=True -); -``` - -### Posterior predictive checks - -```{code-cell} ipython3 -with model5: - ppc = pm.sample_posterior_predictive(trace5) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True) - -# Study 1 ---------------------------------------------------------------- -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter(x="bin1", y="counts1", color="k", alpha=0.2, ax=ax[0]) -# Formatting -ax[0].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0].set_title("Six bin discretization of N(-2, 2)") - -# Study 2 ---------------------------------------------------------------- -# Plot observed bin counts -c2.plot(kind="bar", ax=ax[1], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter(x="bin2", y="counts2", color="k", alpha=0.2, ax=ax[1]) -# Formatting -ax[1].set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax[1].set_title("Seven bin discretization of N(-2, 2)") -``` - -### Inspect posterior - -+++ - -Any evidence for differences in study-level means or standard deviations? - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(10, 3)) - -diff = trace5.posterior.mu.sel(study=0) - trace5.posterior.mu.sel(study=1) -az.plot_posterior(diff, ref_val=0, ax=ax[0]) -ax[0].set(title="difference in study level mean estimates") - -diff = trace5.posterior.sigma.sel(study=0) - trace5.posterior.sigma.sel(study=1) -az.plot_posterior(diff, ref_val=0, ax=ax[1]) -ax[1].set(title="difference in study level std estimate"); -``` - -No compelling evidence for differences between the study level means and standard deviations. - -+++ - -Population level estimate of the mean parameter. There is no population level estimate of sigma in this reparameterised model. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 3)) - -pop_mean = rng.normal( - trace5.posterior.mu_pop_mean.values.flatten(), trace5.posterior.mu_pop_variance.values.flatten() -) -az.plot_posterior(pop_mean, ax=ax, ref_val=true_mu) -ax.set(title="population level mean estimate"); -``` - -Another possible solution would be to make independent inferences about the study level parameters from group 1 and group 2, and then look for any evidence that these differ. Taking this approach works just fine, no divergences in sight, although this approach drifts away from our core goal of making population level inferences. Rather than fully work through this example, we included the code in case it is useful to anyone's use case. 
- -```python -with pm.Model(coords=coords) as model5: - # Study level priors - mu = pm.Normal("mu", dims='study') - sigma = pm.HalfNormal("sigma", dims='study') - - # Study 1 - probs1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[0], sigma=sigma[0]), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("normal1_cdf", probs1, dims='bin1') - - # Study 2 - probs2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[1], sigma=sigma[1]), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs2, np.array([1])])) - probs2 = pm.Deterministic("normal2_cdf", probs2, dims='bin2') - - # Likelihood - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values, dims='bin1') - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values, dims='bin2') -``` - -+++ - -## Example 6: A non-normal distribution - -In theory, the method we're using is quite general. All it requires is: - -- A parametric distribution -- Known cut points on the support of that distribution to bin our data -- Counts (and hence proportions) of each bin - -We will now empirically verify that the parameters of other distributions are recoverable using the same methods. We will approximate the distribution of [Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index) (BMI) from 2 hypothetical (simulated) studies. - -In the first study, the fictional researchers used a fine-grained set of thresholds, which gives them many categories: -- Underweight (Severe thinness): $< 16$ -- Underweight (Moderate thinness): $16 - 17$ -- Underweight (Mild thinness): $17 - 18.5$ -- Normal range: $18.5 - 25$ -- Overweight (Pre-obese): $25 - 30$ -- Obese (Class I): $30 - 35$ -- Obese (Class II): $35 - 40$ -- Obese (Class III): $\ge 40$ - -The second set of researchers used a categorisation scheme recommended by the Hospital Authority of Hong Kong: -- Underweight (Unhealthy): $< 18.5$ -- Normal range (Healthy): $18.5 - 23$ -- Overweight I (At risk): $23 - 25$ -- Overweight II (Moderately obese): $25 - 30$ -- Overweight III (Severely obese): $\ge 30$ - -```{code-cell} ipython3 -# First discretization -d1 = np.array([16, 17, 18.5, 25, 30, 35, 40]) -# Second discretization -d2 = np.array([18.5, 23, 30]) -``` - -We assume the true underlying BMI distribution is Gumbel distributed with mu and beta parameters of 20 and 4, respectively. - -```{code-cell} ipython3 -# True underlying BMI distribution -true_mu, true_beta = 20, 4 -BMI = pm.Gumbel.dist(mu=true_mu, beta=true_beta) - -# Generate two different sets of random samples from the same Gumbel distribution.
-x1 = pm.draw(BMI, 800) -x2 = pm.draw(BMI, 1200) - -# Calculate bin counts -c1 = data_to_bincounts(x1, d1) -c2 = data_to_bincounts(x2, d2) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(2, 2, figsize=(12, 8)) - -# First set of measurements ---------------------------------------------- -ax[0, 0].hist(x1, 50, alpha=0.5) - -for cut in d1: - ax[0, 0].axvline(cut, color="k", ls=":") - -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0, 1], alpha=0.5) -ax[0, 1].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0, 1].set(title="Sample 1 bin counts", xlabel="c1", ylabel="bin count") - -# Second set of measurements --------------------------------------------- -ax[1, 0].hist(x2, 50, alpha=0.5) - -for cut in d2: - ax[1, 0].axvline(cut, color="k", ls=":") - -# Plot observed bin counts -c2.plot(kind="bar", ax=ax[1, 1], alpha=0.5) -ax[1, 1].set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax[1, 1].set(title="Sample 2 bin counts", xlabel="c2", ylabel="bin count") - -# format axes ------------------------------------------------------------ -ax[0, 0].set(xlim=(0, 50), xlabel="BMI", ylabel="observed frequency", title="Sample 1") -ax[1, 0].set(xlim=(0, 50), xlabel="BMI", ylabel="observed frequency", title="Sample 2"); -``` - -### Model specification - -This is a variation of Example 3 above. The only changes are: -- update the probability distribution to match our target (the Gumbel distribution) -- ensure we specify priors for our target distribution, appropriate given our domain knowledge. - -```{code-cell} ipython3 -with pm.Model() as model6: - mu = pm.Normal("mu", 20, 5) - beta = pm.HalfNormal("beta", 10) - - probs1 = pm.math.exp(pm.logcdf(pm.Gumbel.dist(mu=mu, beta=beta), d1)) - probs1 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs1, np.array([1])])) - probs1 = pm.Deterministic("gumbel_cdf1", probs1) - - probs2 = pm.math.exp(pm.logcdf(pm.Gumbel.dist(mu=mu, beta=beta), d2)) - probs2 = at.extra_ops.diff(pm.math.concatenate([np.array([0]), probs2, np.array([1])])) - probs2 = pm.Deterministic("gumbel_cdf2", probs2) - - pm.Multinomial("counts1", p=probs1, n=c1.sum(), observed=c1.values) - pm.Multinomial("counts2", p=probs2, n=c2.sum(), observed=c2.values) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model6) -``` - -```{code-cell} ipython3 -with model6: - trace6 = pm.sample() -``` - -### Posterior predictive checks - -```{code-cell} ipython3 -with model6: - ppc = pm.sample_posterior_predictive(trace6) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True) - -# Study 1 ---------------------------------------------------------------- -# Plot observed bin counts -c1.plot(kind="bar", ax=ax[0], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter( - x="counts1_dim_0", y="counts1", color="k", alpha=0.2, ax=ax[0] -) -# Formatting -ax[0].set_xticklabels([f"bin {n}" for n in range(len(c1))]) -ax[0].set_title("Study 1") - -# Study 2 ---------------------------------------------------------------- -# Plot observed bin counts -c2.plot(kind="bar", ax=ax[1], alpha=0.5) -# Plot posterior predictive -ppc.posterior_predictive.plot.scatter( - x="counts2_dim_0", y="counts2", color="k", alpha=0.2, ax=ax[1] -) -# Formatting -ax[1].set_xticklabels([f"bin {n}" for n in range(len(c2))]) -ax[1].set_title("Study 2") -``` - -### Recovering parameters - -```{code-cell} ipython3 -az.plot_posterior(trace6, var_names=["mu", "beta"], ref_val=[true_mu, true_beta]); -``` - -We can see that we were able to do a good job of
recovering the known parameters of the underlying BMI population. - -If we were interested in testing whether there were any differences in the BMI distributions in Study 1 and Study 2, then we could simply take the hierarchical model in Example 5 and adapt it to operate with our desired target distribution, just like we did in this example. - -+++ - -## Conclusions - -As you can see, this method for estimating known parameters of Gaussian and non-Gaussian distributions works pretty well. -While these examples have been applied to synthetic data, doing these kinds of parameter recovery studies is crucial. If we tried to recover population level parameters from counts and could _not_ do it when we know the ground truth, then this would indicate the approach is not trustworthy. But the various parameter recovery examples demonstrate that we _can_ in fact accurately recover population level parameters from binned, and _differently_ binned data. - -A key technical point to note here is that when we pass in the observed counts, -they ought to be in the exact CDF order. -Not shown here are experiments where we scrambled the counts' order; -there, the estimation of the underlying distribution parameters was incorrect. - -We have presented a range of different examples here which makes clear that the general approach can be adapted easily to the particular situation or research questions being faced. These approaches should easily be adaptable to novel but related data science situations. - -+++ - -## Authors -* Authored by [Eric Ma](https://github.com/ericmjl) and [Benjamin T. Vincent](https://github.com/drbenvincent) in September, 2021 ([pymc-examples#229](https://github.com/pymc-devs/pymc-examples/pull/229)) -* Updated to run in PyMC v4 by Fernando Irarrazaval in June 2022 ([pymc-examples#366](https://github.com/pymc-devs/pymc-examples/pull/366)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/blackbox_external_likelihood.myst.md b/myst_nbs/case_studies/blackbox_external_likelihood.myst.md deleted file mode 100644 index bea2183f7..000000000 --- a/myst_nbs/case_studies/blackbox_external_likelihood.myst.md +++ /dev/null @@ -1,667 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(blackbox_external_likelihood)= -# Using a "black box" likelihood function (Cython) - -:::{note} -This notebook is part of a set of two twin notebooks that perform the exact same task, this one -uses Cython whereas {ref}`this other one ` uses NumPy -::: - -```{code-cell} ipython3 -%load_ext Cython - -import os -import platform - -import arviz as az -import corner -import cython -import emcee -import IPython -import matplotlib -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano -import theano.tensor as tt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -[PyMC3](https://docs.pymc.io/index.html) is a great tool for doing Bayesian inference and parameter estimation. 
It has a load of [in-built probability distributions](https://docs.pymc.io/api/distributions.html) that you can use to set up priors and likelihood functions for your particular model. You can even create your own [custom distributions](https://docs.pymc.io/prob_dists.html#custom-distributions). - -However, this is not necessarily that simple if you have a model function, or probability distribution, that, for example, relies on an external code that you have little/no control over (and may even be, for example, wrapped `C` code rather than Python). This can be problematic when you need to pass parameters set as PyMC3 distributions to these external functions; your external function probably wants you to pass it floating point numbers rather than PyMC3 distributions! - -```python -import pymc3 as pm -from external_module import my_external_func # your external function! - -# set up your model -with pm.Model(): - # your external function takes two parameters, a and b, with Uniform priors - a = pm.Uniform('a', lower=0., upper=1.) - b = pm.Uniform('b', lower=0., upper=1.) - - m = my_external_func(a, b) # <--- this is not going to work! -``` - -Another issue is that if you want to be able to use the [gradient-based step samplers](https://docs.pymc.io/notebooks/getting_started.html#Gradient-based-sampling-methods) like [NUTS](https://docs.pymc.io/api/inference.html#module-pymc3.step_methods.hmc.nuts) and [Hamiltonian Monte Carlo (HMC)](https://docs.pymc.io/api/inference.html#hamiltonian-monte-carlo), then your model/likelihood needs a gradient to be defined. If you have a model that is defined as a set of Theano operators then this is no problem - internally it will be able to do automatic differentiation - but if your model is essentially a "black box" then you won't necessarily know what the gradients are. - -Defining a model/likelihood that PyMC3 can use and that calls your "black box" function is possible, but it relies on creating a [custom Theano Op](https://docs.pymc.io/advanced_theano.html#writing-custom-theano-ops). This is, hopefully, a clear description of how to do this, including one way of writing a gradient function that could be generally applicable. - -In the examples below, we create a very simple model and log-likelihood function in [Cython](http://cython.org/). Cython is used just as an example to show what you might need to do if calling external `C` codes, but you could in fact be using pure Python codes. The log-likelihood function used is actually just a [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), so defining this yourself is obviously overkill (and I'll compare it to doing the same thing purely with the pre-defined PyMC3 [Normal](https://docs.pymc.io/api/distributions/continuous.html#pymc3.distributions.continuous.Normal) distribution), but it should provide a simple-to-follow demonstration. - -First, let's define a _super-complicated_™ model (a straight line!), which is parameterised by two variables (a gradient `m` and a y-intercept `c`) and calculated at a vector of points `x`. Here the model is defined in [Cython](http://cython.org/) and calls [GSL](https://www.gnu.org/software/gsl/) functions. This is just to show that you could be calling some other `C` library that you need. In this example, the model parameters are all packed into a list/array/tuple called `theta`. 
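-For orientation, stripped of the Cython and GSL machinery the model below is nothing more than a straight line; a plain-Python sketch of what the Cython `my_model` computes would simply be:
-
-```python
-# Plain numpy equivalent of the Cython model defined below (illustrative only)
-def my_model_plain(theta, x):
-    m, c = theta  # unpack gradient and y-intercept
-    return m * x + c
-```
-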
- -Let's also define a _really-complicated_™ log-likelihood function (a Normal log-likelihood that ignores the normalisation), which takes in the list/array/tuple of model parameter values `theta`, the points at which to calculate the model `x`, the vector of "observed" data points `data`, and the standard deviation of the noise in the data `sigma`. This log-likelihood function calls the _super-complicated_™ model function. - -```{code-cell} ipython3 -%%cython -I/usr/include -L/usr/lib/x86_64-linux-gnu -lgsl -lgslcblas -lm - -import cython - -cimport cython - -import numpy as np - -cimport numpy as np - -### STUFF FOR USING GSL (FEEL FREE TO IGNORE!) ### - -# declare GSL vector structure and functions -cdef extern from "gsl/gsl_block.h": - cdef struct gsl_block: - size_t size - double * data - -cdef extern from "gsl/gsl_vector.h": - cdef struct gsl_vector: - size_t size - size_t stride - double * data - gsl_block * block - int owner - - ctypedef struct gsl_vector_view: - gsl_vector vector - - int gsl_vector_scale (gsl_vector * a, const double x) nogil - int gsl_vector_add_constant (gsl_vector * a, const double x) nogil - gsl_vector_view gsl_vector_view_array (double * base, size_t n) nogil - -################################################### - - -# define your super-complicated model that uses loads of external codes -cpdef my_model(theta, np.ndarray[np.float64_t, ndim=1] x): - """ - A straight line! - - Note: - This function could simply be: - - m, c = theta - return m*x + c - - but I've made it more complicated for demonstration purposes - """ - m, c = theta # unpack line gradient and y-intercept - - cdef size_t length = len(x) # length of x - - cdef np.ndarray line = np.copy(x) # make copy of x vector - cdef gsl_vector_view lineview # create a view of the vector - lineview = gsl_vector_view_array(line.data, length) - - # multiply x by m - gsl_vector_scale(&lineview.vector, m) - - # add c - gsl_vector_add_constant(&lineview.vector, c) - - # return the numpy array - return line - - -# define your really-complicated likelihood function that uses loads of external codes -cpdef my_loglike(theta, np.ndarray[np.float64_t, ndim=1] x, - np.ndarray[np.float64_t, ndim=1] data, sigma): - """ - A Gaussian log-likelihood function for a model with parameters given in theta - """ - - model = my_model(theta, x) - - return -(0.5/sigma**2)*np.sum((data - model)**2) -``` - -Now, as things are, if we wanted to sample from this log-likelihood function, using certain prior distributions for the model parameters (gradient and y-intercept) using PyMC3, we might try something like this (using a [PyMC3 DensityDist](https://docs.pymc.io/prob_dists.html#custom-distributions)): - -```python -import pymc3 as pm - -# create/read in our "data" (I'll show this in the real example below) -x = ... -sigma = ... -data = ... - -with pm.Model(): - # set priors on model gradient and y-intercept - m = pm.Uniform('m', lower=-10., upper=10.) - c = pm.Uniform('c', lower=-10., upper=10.) - - # create custom distribution - pm.DensityDist('likelihood', my_loglike, - observed={'theta': (m, c), 'x': x, 'data': data, 'sigma': sigma}) - - # sample from the distribution - trace = pm.sample(1000) -``` - -But, this will give an error like: - -``` -ValueError: setting an array element with a sequence. -``` - -This is because `m` and `c` are Theano tensor-type objects. - -So, what we actually need to do is create a [Theano Op](http://deeplearning.net/software/theano/extending/extending_theano.html). 
This will be a new class that wraps our log-likelihood function (or just our model function, if that is all that is required) into something that can take in Theano tensor objects, but internally can cast them as floating point values that can be passed to our log-likelihood function. We will do this below, initially without defining a [grad() method](http://deeplearning.net/software/theano/extending/op.html#grad) for the Op. - -```{code-cell} ipython3 -# define a theano Op for our likelihood function -class LogLike(tt.Op): - - """ - Specify what type of object will be passed and returned to the Op when it is - called. In our case we will be passing it a vector of values (the parameters - that define our model) and returning a single "scalar" value (the - log-likelihood) - """ - - itypes = [tt.dvector] # expects a vector of parameter values when called - otypes = [tt.dscalar] # outputs a single scalar value (the log likelihood) - - def __init__(self, loglike, data, x, sigma): - """ - Initialise the Op with various things that our log-likelihood function - requires. Below are the things that are needed in this particular - example. - - Parameters - ---------- - loglike: - The log-likelihood (or whatever) function we've defined - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that our function requires. - """ - - # add inputs as class attributes - self.likelihood = loglike - self.data = data - self.x = x - self.sigma = sigma - - def perform(self, node, inputs, outputs): - # the method that is used when calling the Op - (theta,) = inputs # this will contain my variables - - # call the log-likelihood function - logl = self.likelihood(theta, self.x, self.data, self.sigma) - - outputs[0][0] = np.array(logl) # output the log-likelihood -``` - -Now, let's use this Op to repeat the example shown above. To do this let's create some data containing a straight line with additive Gaussian noise (with a mean of zero and a standard deviation of `sigma`). For simplicity we set [uniform](https://docs.pymc.io/api/distributions/continuous.html#pymc3.distributions.continuous.Uniform) prior distributions on the gradient and y-intercept. As we've not set the `grad()` method of the Op PyMC3 will not be able to use the gradient-based samplers, so will fall back to using the [Slice](https://docs.pymc.io/api/inference.html#module-pymc3.step_methods.slicer) sampler. 
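-Before wiring the Op into a model, it can be useful to sanity-check it by wrapping it in a `theano.function` and calling it on a test parameter vector; the same pattern is used to test the gradient Op towards the end of this notebook. A minimal sketch (with made-up test values, not the data generated below):
-
-```python
-# Illustrative check that the Op evaluates (made-up inputs; not part of the analysis)
-x_test = np.linspace(0.0, 9.0, 10)
-data_test = 0.4 * x_test + 3.0  # a noise-free straight line
-logl_test = LogLike(my_loglike, data_test, x_test, 1.0)
-
-v = tt.dvector()
-f = theano.function([v], logl_test(v))
-print(f([0.4, 3.0]))  # should be (close to) zero for noise-free data
-```
-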
- -```{code-cell} ipython3 -# set up our data -N = 10 # number of data points -sigma = 1.0 # standard deviation of noise -x = np.linspace(0.0, 9.0, N) - -mtrue = 0.4 # true gradient -ctrue = 3.0 # true y-intercept - -truemodel = my_model([mtrue, ctrue], x) - -# make data -np.random.seed(716742) # set random seed, so the data is reproducible each time -data = sigma * np.random.randn(N) + truemodel - -ndraws = 3000 # number of draws from the distribution -nburn = 1000 # number of "burn-in points" (which we'll discard) - -# create our Op -logl = LogLike(my_loglike, data, x, sigma) - -# use PyMC3 to sample from the log-likelihood -with pm.Model(): - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = tt.as_tensor_variable([m, c]) - - # use a DensityDist (use a lambda function to "call" the Op) - pm.DensityDist("likelihood", lambda v: logl(v), observed={"v": theta}) - - trace = pm.sample(ndraws, tune=nburn, discard_tuned_samples=True) - -# plot the traces -_ = az.plot_trace(trace, lines={"m": mtrue, "c": ctrue}) - -# put the chains in an array (for later!) -samples_pymc3 = np.vstack((trace["m"], trace["c"])).T -``` - -What if we wanted to use NUTS or HMC? If we knew the analytical derivatives of the model/likelihood function then we could add a [grad() method](http://deeplearning.net/software/theano/extending/op.html#grad) to the Op using that analytical form. - -But, what if we don't know the analytical form? If our model/likelihood is purely Python and made up of standard maths operators and Numpy functions, then the [autograd](https://github.com/HIPS/autograd) module could potentially be used to find gradients (also, see [here](https://github.com/ActiveState/code/blob/master/recipes/Python/580610_Auto_differentiation/recipe-580610.py) for a nice Python example of automatic differentiation). But, if our model/likelihood truly is a "black box" then we can just use the good-old-fashioned [finite difference](https://en.wikipedia.org/wiki/Finite_difference) to find the gradients - this can be slow, especially if there are a large number of variables, or the model takes a long time to evaluate. Below, a function to find gradients has been defined that uses the finite difference (the central difference) - it uses an iterative method with successively smaller interval sizes to check that the gradient converges. But, you could do something far simpler and just use, for example, the SciPy [approx_fprime](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.approx_fprime.html) function. Here, the gradient function is defined in Cython for speed, but if the function it evaluates to find the gradients is the performance bottleneck then having this as a pure Python function may not make a significant speed difference. - -```{code-cell} ipython3 -%%cython - -import cython - -cimport cython - -import numpy as np - -cimport numpy as np - -import warnings - - -def gradients(vals, func, releps=1e-3, abseps=None, mineps=1e-9, reltol=1e-3, - epsscale=0.5): - """ - Calculate the partial derivatives of a function at a set of values. The - derivatives are calculated using the central difference, using an iterative - method to check that the values converge as step size decreases. 
- - Parameters - ---------- - vals: array_like - A set of values, that are passed to a function, at which to calculate - the gradient of that function - func: - A function that takes in an array of values. - releps: float, array_like, 1e-3 - The initial relative step size for calculating the derivative. - abseps: float, array_like, None - The initial absolute step size for calculating the derivative. - This overrides `releps` if set. - `releps` is set then that is used. - mineps: float, 1e-9 - The minimum relative step size at which to stop iterations if no - convergence is achieved. - epsscale: float, 0.5 - The factor by which releps if scaled in each iteration. - - Returns - ------- - grads: array_like - An array of gradients for each non-fixed value. - """ - - grads = np.zeros(len(vals)) - - # maximum number of times the gradient can change sign - flipflopmax = 10. - - # set steps - if abseps is None: - if isinstance(releps, float): - eps = np.abs(vals)*releps - eps[eps == 0.] = releps # if any values are zero set eps to releps - teps = releps*np.ones(len(vals)) - elif isinstance(releps, (list, np.ndarray)): - if len(releps) != len(vals): - raise ValueError("Problem with input relative step sizes") - eps = np.multiply(np.abs(vals), releps) - eps[eps == 0.] = np.array(releps)[eps == 0.] - teps = releps - else: - raise RuntimeError("Relative step sizes are not a recognised type!") - else: - if isinstance(abseps, float): - eps = abseps*np.ones(len(vals)) - elif isinstance(abseps, (list, np.ndarray)): - if len(abseps) != len(vals): - raise ValueError("Problem with input absolute step sizes") - eps = np.array(abseps) - else: - raise RuntimeError("Absolute step sizes are not a recognised type!") - teps = eps - - # for each value in vals calculate the gradient - count = 0 - for i in range(len(vals)): - # initial parameter diffs - leps = eps[i] - cureps = teps[i] - - flipflop = 0 - - # get central finite difference - fvals = np.copy(vals) - bvals = np.copy(vals) - - # central difference - fvals[i] += 0.5*leps # change forwards distance to half eps - bvals[i] -= 0.5*leps # change backwards distance to half eps - cdiff = (func(fvals)-func(bvals))/leps - - while 1: - fvals[i] -= 0.5*leps # remove old step - bvals[i] += 0.5*leps - - # change the difference by a factor of two - cureps *= epsscale - if cureps < mineps or flipflop > flipflopmax: - # if no convergence set flat derivative (TODO: check if there is a better thing to do instead) - warnings.warn("Derivative calculation did not converge: setting flat derivative.") - grads[count] = 0. - break - leps *= epsscale - - # central difference - fvals[i] += 0.5*leps # change forwards distance to half eps - bvals[i] -= 0.5*leps # change backwards distance to half eps - cdiffnew = (func(fvals)-func(bvals))/leps - - if cdiffnew == cdiff: - grads[count] = cdiff - break - - # check whether previous diff and current diff are the same within reltol - rat = (cdiff/cdiffnew) - if np.isfinite(rat) and rat > 0.: - # gradient has not changed sign - if np.abs(1.-rat) < reltol: - grads[count] = cdiffnew - break - else: - cdiff = cdiffnew - continue - else: - cdiff = cdiffnew - flipflop += 1 - continue - - count += 1 - - return grads -``` - -So, now we can just redefine our Op with a `grad()` method, right? - -It's not quite so simple! The `grad()` method itself requires that its inputs are Theano tensor variables, whereas our `gradients` function above, like our `my_loglike` function, wants a list of floating point values. 
So, we need to define another Op that calculates the gradients. Below, I define a new version of the `LogLike` Op, called `LogLikeWithGrad` this time, that has a `grad()` method. This is followed by anothor Op called `LogLikeGrad` that, when called with a vector of Theano tensor variables, returns another vector of values that are the gradients (i.e., the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)) of our log-likelihood function at those values. Note that the `grad()` method itself does not return the gradients directly, but instead returns the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)-vector product (you can hopefully just copy what I've done and not worry about what this means too much!). - -```{code-cell} ipython3 -# define a theano Op for our likelihood function -class LogLikeWithGrad(tt.Op): - - itypes = [tt.dvector] # expects a vector of parameter values when called - otypes = [tt.dscalar] # outputs a single scalar value (the log likelihood) - - def __init__(self, loglike, data, x, sigma): - """ - Initialise with various things that the function requires. Below - are the things that are needed in this particular example. - - Parameters - ---------- - loglike: - The log-likelihood (or whatever) function we've defined - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that out function requires. - """ - - # add inputs as class attributes - self.likelihood = loglike - self.data = data - self.x = x - self.sigma = sigma - - # initialise the gradient Op (below) - self.logpgrad = LogLikeGrad(self.likelihood, self.data, self.x, self.sigma) - - def perform(self, node, inputs, outputs): - # the method that is used when calling the Op - (theta,) = inputs # this will contain my variables - - # call the log-likelihood function - logl = self.likelihood(theta, self.x, self.data, self.sigma) - - outputs[0][0] = np.array(logl) # output the log-likelihood - - def grad(self, inputs, g): - # the method that calculates the gradients - it actually returns the - # vector-Jacobian product - g[0] is a vector of parameter values - (theta,) = inputs # our parameters - return [g[0] * self.logpgrad(theta)] - - -class LogLikeGrad(tt.Op): - - """ - This Op will be called with a vector of values and also return a vector of - values - the gradients in each dimension. - """ - - itypes = [tt.dvector] - otypes = [tt.dvector] - - def __init__(self, loglike, data, x, sigma): - """ - Initialise with various things that the function requires. Below - are the things that are needed in this particular example. - - Parameters - ---------- - loglike: - The log-likelihood (or whatever) function we've defined - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that out function requires. - """ - - # add inputs as class attributes - self.likelihood = loglike - self.data = data - self.x = x - self.sigma = sigma - - def perform(self, node, inputs, outputs): - (theta,) = inputs - - # define version of likelihood function to pass to derivative function - def lnlike(values): - return self.likelihood(values, self.x, self.data, self.sigma) - - # calculate gradients - grads = gradients(theta, lnlike) - - outputs[0][0] = grads -``` - -Now, let's re-run PyMC3 with our new "grad"-ed Op. 
This time it will be able to automatically use NUTS. - -```{code-cell} ipython3 -# create our Op -logl = LogLikeWithGrad(my_loglike, data, x, sigma) - -# use PyMC3 to sample from the log-likelihood -with pm.Model() as opmodel: - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = tt.as_tensor_variable([m, c]) - - # use a DensityDist - pm.DensityDist("likelihood", lambda v: logl(v), observed={"v": theta}) - - trace = pm.sample(ndraws, tune=nburn, discard_tuned_samples=True) - -# plot the traces -_ = az.plot_trace(trace, lines={"m": mtrue, "c": ctrue}) - -# put the chains in an array (for later!) -samples_pymc3_2 = np.vstack((trace["m"], trace["c"])).T -``` - -Now, finally, just to check things actually worked as we might expect, let's do the same thing purely using PyMC3 distributions (because in this simple example we can!) - -```{code-cell} ipython3 -with pm.Model() as pymodel: - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = tt.as_tensor_variable([m, c]) - - # use a Normal distribution - pm.Normal("likelihood", mu=(m * x + c), sigma=sigma, observed=data) - - trace = pm.sample(ndraws, tune=nburn, discard_tuned_samples=True) - -# plot the traces -_ = az.plot_trace(trace, lines={"m": mtrue, "c": ctrue}) - -# put the chains in an array (for later!) -samples_pymc3_3 = np.vstack((trace["m"], trace["c"])).T -``` - -To check that they match let's plot all the examples together and also find the autocorrelation lengths. - -```{code-cell} ipython3 -import warnings - -warnings.simplefilter( - action="ignore", category=FutureWarning -) # suppress emcee autocorr FutureWarning - -matplotlib.rcParams["font.size"] = 18 - -hist2dkwargs = { - "plot_datapoints": False, - "plot_density": False, - "levels": 1.0 - np.exp(-0.5 * np.arange(1.5, 2.1, 0.5) ** 2), -} # roughly 1 and 2 sigma - -colors = ["r", "g", "b"] -labels = ["Theano Op (no grad)", "Theano Op (with grad)", "Pure PyMC3"] - -for i, samples in enumerate([samples_pymc3, samples_pymc3_2, samples_pymc3_3]): - # get maximum chain autocorrelation length - autocorrlen = int(np.max(emcee.autocorr.integrated_time(samples, c=3))) - print("Auto-correlation length ({}): {}".format(labels[i], autocorrlen)) - - if i == 0: - fig = corner.corner( - samples, - labels=[r"$m$", r"$c$"], - color=colors[i], - hist_kwargs={"density": True}, - **hist2dkwargs, - truths=[mtrue, ctrue], - ) - else: - corner.corner( - samples, color=colors[i], hist_kwargs={"density": True}, fig=fig, **hist2dkwargs - ) - -fig.set_size_inches(8, 8) -``` - -We can now check that the gradient Op works as we expect it to. First, just create and call the `LogLikeGrad` class, which should return the gradient directly (note that we have to create a [Theano function](http://deeplearning.net/software/theano/library/compile/function.html) to convert the output of the Op to an array). Secondly, we call the gradient from `LogLikeWithGrad` by using the [Theano tensor gradient](http://deeplearning.net/software/theano/library/gradient.html#theano.gradient.grad) function. Finally, we will check the gradient returned by the PyMC3 model for a Normal distribution, which should be the same as the log-likelihood function we defined. In all cases we evaluate the gradients at the true values of the model function (the straight line) that was created. 
- -```{code-cell} ipython3 -# test the gradient Op by direct call -theano.config.compute_test_value = "ignore" -theano.config.exception_verbosity = "high" - -var = tt.dvector() -test_grad_op = LogLikeGrad(my_loglike, data, x, sigma) -test_grad_op_func = theano.function([var], test_grad_op(var)) -grad_vals = test_grad_op_func([mtrue, ctrue]) - -print(f'Gradient returned by "LogLikeGrad": {grad_vals}') - -# test the gradient called through LogLikeWithGrad -test_gradded_op = LogLikeWithGrad(my_loglike, data, x, sigma) -test_gradded_op_grad = tt.grad(test_gradded_op(var), var) -test_gradded_op_grad_func = theano.function([var], test_gradded_op_grad) -grad_vals_2 = test_gradded_op_grad_func([mtrue, ctrue]) - -print(f'Gradient returned by "LogLikeWithGrad": {grad_vals_2}') - -# test the gradient that PyMC3 uses for the Normal log likelihood -test_model = pm.Model() -with test_model: - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - pm.Normal("likelihood", mu=(m * x + c), sigma=sigma, observed=data) - - gradfunc = test_model.logp_dlogp_function([m, c], dtype=None) - gradfunc.set_extra_values({"m_interval__": mtrue, "c_interval__": ctrue}) - grad_vals_pymc3 = gradfunc(np.array([mtrue, ctrue]))[1] # get dlogp values - -print(f'Gradient returned by PyMC3 "Normal" distribution: {grad_vals_pymc3}') -``` - -We can also do some [profiling](http://docs.pymc.io/notebooks/profiling.html) of the Op, as used within a PyMC3 Model, to check performance. First, we'll profile using the `LogLikeWithGrad` Op, and then doing the same thing purely using PyMC3 distributions. - -```{code-cell} ipython3 -# profile logpt using our Op -opmodel.profile(opmodel.logpt).summary() -``` - -```{code-cell} ipython3 -# profile using our PyMC3 distribution -pymodel.profile(pymodel.logpt).summary() -``` - -## Authors - -* Adapted from a blog post by [Matt Pitkin](http://mattpitkin.github.io/samplers-demo/pages/pymc3-blackbox-likelihood/) on August 27, 2018. That post was based on an example provided by [Jørgen Midtbø](https://github.com/jorgenem/). 
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/case_studies/blackbox_external_likelihood_numpy.myst.md b/myst_nbs/case_studies/blackbox_external_likelihood_numpy.myst.md deleted file mode 100644 index 2ee5ccb0f..000000000 --- a/myst_nbs/case_studies/blackbox_external_likelihood_numpy.myst.md +++ /dev/null @@ -1,473 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(blackbox_external_likelihood_numpy)= -# Using a "black box" likelihood function (numpy) - -:::{post} Dec 16, 2021 -:tags: case study, external likelihood, -:category: beginner -:author: Matt Pitkin, Jørgen Midtbø, Oriol Abril -::: - -:::{note} -This notebook in part of a set of two twin notebooks that perform the exact same task, this one -uses numpy whereas {ref}`this other one ` uses Cython -::: - -```{code-cell} ipython3 -import aesara -import aesara.tensor as at -import arviz as az -import IPython -import matplotlib -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -## Introduction -[PyMC](https://docs.pymc.io) is a great tool for doing Bayesian inference and parameter estimation. It has a load of {doc}`in-built probability distributions ` that you can use to set up priors and likelihood functions for your particular model. You can even create your own {ref}`custom distributions `. - -However, this is not necessarily that simple if you have a model function, or probability distribution, that, for example, relies on an external code that you have little/no control over (and may even be, for example, wrapped `C` code rather than Python). This can be problematic when you need to pass parameters as PyMC distributions to these external functions; your external function probably wants you to pass it floating point numbers rather than PyMC distributions! - -```python -import pymc as pm -from external_module import my_external_func # your external function! - -# set up your model -with pm.Model(): - # your external function takes two parameters, a and b, with Uniform priors - a = pm.Uniform('a', lower=0., upper=1.) - b = pm.Uniform('b', lower=0., upper=1.) - - m = my_external_func(a, b) # <--- this is not going to work! -``` - -Another issue is that if you want to be able to use the gradient-based step samplers like {class}`pymc.NUTS` and {class}`Hamiltonian Monte Carlo (HMC) `, then your model/likelihood needs a gradient to be defined. If you have a model that is defined as a set of Aesara operators then this is no problem - internally it will be able to do automatic differentiation - but if your model is essentially a "black box" then you won't necessarily know what the gradients are. - -Defining a model/likelihood that PyMC can use and that calls your "black box" function is possible, but it relies on creating a [custom Aesara Op](https://docs.pymc.io/advanced_aesara.html#writing-custom-aesara-ops). This is, hopefully, a clear description of how to do this, including one way of writing a gradient function that could be generally applicable. - -In the examples below, we create a very simple model and log-likelihood function in numpy. 
- -```{code-cell} ipython3 -def my_model(theta, x): - m, c = theta - return m * x + c - - -def my_loglike(theta, x, data, sigma): - model = my_model(theta, x) - return -(0.5 / sigma**2) * np.sum((data - model) ** 2) -``` - -Now, as things are, if we wanted to sample from this log-likelihood function, using certain prior distributions for the model parameters (gradient and y-intercept) using PyMC, we might try something like this (using a {class}`pymc.DensityDist` or {class}`pymc.Potential`): - -```python -import pymc as pm - -# create/read in our "data" (I'll show this in the real example below) -x = ... -sigma = ... -data = ... - -with pm.Model(): - # set priors on model gradient and y-intercept - m = pm.Uniform('m', lower=-10., upper=10.) - c = pm.Uniform('c', lower=-10., upper=10.) - - # create custom distribution - pm.DensityDist('likelihood', my_loglike, - observed={'theta': (m, c), 'x': x, 'data': data, 'sigma': sigma}) - - # sample from the distribution - trace = pm.sample(1000) -``` - -But, this will give an error like: - -``` -ValueError: setting an array element with a sequence. -``` - -This is because `m` and `c` are Aesara tensor-type objects. - -So, what we actually need to do is create a [Aesara Op](http://deeplearning.net/software/aesara/extending/extending_aesara.html). This will be a new class that wraps our log-likelihood function (or just our model function, if that is all that is required) into something that can take in Aesara tensor objects, but internally can cast them as floating point values that can be passed to our log-likelihood function. We will do this below, initially without defining a [grad() method](http://deeplearning.net/software/aesara/extending/op.html#grad) for the Op. - -+++ - -## Aesara Op without grad - -```{code-cell} ipython3 -# define a aesara Op for our likelihood function -class LogLike(at.Op): - - """ - Specify what type of object will be passed and returned to the Op when it is - called. In our case we will be passing it a vector of values (the parameters - that define our model) and returning a single "scalar" value (the - log-likelihood) - """ - - itypes = [at.dvector] # expects a vector of parameter values when called - otypes = [at.dscalar] # outputs a single scalar value (the log likelihood) - - def __init__(self, loglike, data, x, sigma): - """ - Initialise the Op with various things that our log-likelihood function - requires. Below are the things that are needed in this particular - example. - - Parameters - ---------- - loglike: - The log-likelihood (or whatever) function we've defined - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that our function requires. - """ - - # add inputs as class attributes - self.likelihood = loglike - self.data = data - self.x = x - self.sigma = sigma - - def perform(self, node, inputs, outputs): - # the method that is used when calling the Op - (theta,) = inputs # this will contain my variables - - # call the log-likelihood function - logl = self.likelihood(theta, self.x, self.data, self.sigma) - - outputs[0][0] = np.array(logl) # output the log-likelihood -``` - -Now, let's use this Op to repeat the example shown above. To do this let's create some data containing a straight line with additive Gaussian noise (with a mean of zero and a standard deviation of `sigma`). For simplicity we set {class}`~pymc.Uniform` prior distributions on the gradient and y-intercept. 
As we've not set the `grad()` method of the Op PyMC will not be able to use the gradient-based samplers, so will fall back to using the {class}`pymc.Slice` sampler. - -```{code-cell} ipython3 -# set up our data -N = 10 # number of data points -sigma = 1.0 # standard deviation of noise -x = np.linspace(0.0, 9.0, N) - -mtrue = 0.4 # true gradient -ctrue = 3.0 # true y-intercept - -truemodel = my_model([mtrue, ctrue], x) - -# make data -rng = np.random.default_rng(716743) -data = sigma * rng.normal(size=N) + truemodel - -# create our Op -logl = LogLike(my_loglike, data, x, sigma) - -# use PyMC to sample from the log-likelihood -with pm.Model(): - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = at.as_tensor_variable([m, c]) - - # use a Potential to "call" the Op and include it in the logp computation - pm.Potential("likelihood", logl(theta)) - - # Use custom number of draws to replace the HMC based defaults - idata_mh = pm.sample(3000, tune=1000) - -# plot the traces -az.plot_trace(idata_mh, lines=[("m", {}, mtrue), ("c", {}, ctrue)]); -``` - -## Aesara Op with grad - -What if we wanted to use NUTS or HMC? If we knew the analytical derivatives of the model/likelihood function then we could add a [grad() method](http://deeplearning.net/software/aesara/extending/op.html#grad) to the Op using that analytical form. - -But, what if we don't know the analytical form? If our model/likelihood is purely Python and made up of standard maths operators and Numpy functions, then the [autograd](https://github.com/HIPS/autograd) module could potentially be used to find gradients (also, see [here](https://github.com/ActiveState/code/blob/master/recipes/Python/580610_Auto_differentiation/recipe-580610.py) for a nice Python example of automatic differentiation). But, if our model/likelihood truly is a "black box" then we can just use the good-old-fashioned [finite difference](https://en.wikipedia.org/wiki/Finite_difference) to find the gradients - this can be slow, especially if there are a large number of variables, or the model takes a long time to evaluate; a convenient helper for this is the SciPy [approx_fprime](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.approx_fprime.html) function. In this notebook, however, the log-likelihood is simple enough that its partial derivatives can be written down analytically, which is what the `normal_gradients` function below does. - -Note that since PyMC 3.11.0, normalization constants are dropped from the computation, thus, we will do the same to ensure both gradients return exactly the same value (which will be checked below). As `sigma=1` in this case the dropped factor is only a factor 2, but for completeness, the term is shown as a comment. Try to see what happens if you uncomment this term and rerun the whole notebook. - -```{code-cell} ipython3 -def normal_gradients(theta, x, data, sigma): - """ - Calculate the analytical partial derivatives of our Normal log-likelihood - with respect to the parameters in theta, dropping the constant factor as - noted above. 
- - Parameters - ---------- - theta: array_like - A set of values, that are passed to a function, at which to calculate - the gradient of that function - x, data, sigma: - Observed variables as we have been using so far - - - Returns - ------- - grads: array_like - An array of gradients for each non-fixed value. - """ - - grads = np.empty(2) - aux_vect = data - my_model(theta, x) # /(2*sigma**2) - grads[0] = np.sum(aux_vect * x) - grads[1] = np.sum(aux_vect) - - return grads -``` - -So, now we can just redefine our Op with a `grad()` method, right? - -It's not quite so simple! The `grad()` method itself requires that its inputs are Aesara tensor variables, whereas our `gradients` function above, like our `my_loglike` function, wants a list of floating point values. So, we need to define another Op that calculates the gradients. Below, I define a new version of the `LogLike` Op, called `LogLikeWithGrad` this time, that has a `grad()` method. This is followed by another Op called `LogLikeGrad` that, when called with a vector of Aesara tensor variables, returns another vector of values that are the gradients (i.e., the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)) of our log-likelihood function at those values. Note that the `grad()` method itself does not return the gradients directly, but instead returns the [Jacobian](https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant)-vector product (you can hopefully just copy what I've done and not worry about what this means too much!). - -```{code-cell} ipython3 -# define a aesara Op for our likelihood function -class LogLikeWithGrad(at.Op): - - itypes = [at.dvector] # expects a vector of parameter values when called - otypes = [at.dscalar] # outputs a single scalar value (the log likelihood) - - def __init__(self, loglike, data, x, sigma): - """ - Initialise with various things that the function requires. Below - are the things that are needed in this particular example. - - Parameters - ---------- - loglike: - The log-likelihood (or whatever) function we've defined - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that our function requires. - """ - - # add inputs as class attributes - self.likelihood = loglike - self.data = data - self.x = x - self.sigma = sigma - - # initialise the gradient Op (below) - self.logpgrad = LogLikeGrad(self.data, self.x, self.sigma) - - def perform(self, node, inputs, outputs): - # the method that is used when calling the Op - (theta,) = inputs # this will contain my variables - - # call the log-likelihood function - logl = self.likelihood(theta, self.x, self.data, self.sigma) - - outputs[0][0] = np.array(logl) # output the log-likelihood - - def grad(self, inputs, g): - # the method that calculates the gradients - it actually returns the - # vector-Jacobian product - g[0] is a vector of parameter values - (theta,) = inputs # our parameters - return [g[0] * self.logpgrad(theta)] - - -class LogLikeGrad(at.Op): - - """ - This Op will be called with a vector of values and also return a vector of - values - the gradients in each dimension. - """ - - itypes = [at.dvector] - otypes = [at.dvector] - - def __init__(self, data, x, sigma): - """ - Initialise with various things that the function requires. Below - are the things that are needed in this particular example. 
- - Parameters - ---------- - data: - The "observed" data that our log-likelihood function takes in - x: - The dependent variable (aka 'x') that our model requires - sigma: - The noise standard deviation that our function requires. - """ - - # add inputs as class attributes - self.data = data - self.x = x - self.sigma = sigma - - def perform(self, node, inputs, outputs): - (theta,) = inputs - - # calculate gradients - grads = normal_gradients(theta, self.x, self.data, self.sigma) - - outputs[0][0] = grads -``` - -Now, let's re-run PyMC with our new "grad"-ed Op. This time it will be able to automatically use NUTS. - -```{code-cell} ipython3 -# create our Op -logl = LogLikeWithGrad(my_loglike, data, x, sigma) - -# use PyMC to sample from the log-likelihood -with pm.Model() as opmodel: - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = at.as_tensor_variable([m, c]) - - # use a Potential - pm.Potential("likelihood", logl(theta)) - - idata_grad = pm.sample() - -# plot the traces -_ = az.plot_trace(idata_grad, lines=[("m", {}, mtrue), ("c", {}, ctrue)]) -``` - -## Comparison to equivalent PyMC distributions -Now, finally, just to check things actually worked as we might expect, let's do the same thing purely using PyMC distributions (because in this simple example we can!) - -```{code-cell} ipython3 -with pm.Model() as pymodel: - # uniform priors on m and c - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - # convert m and c to a tensor vector - theta = at.as_tensor_variable([m, c]) - - # use a Normal distribution - pm.Normal("likelihood", mu=(m * x + c), sd=sigma, observed=data) - - idata = pm.sample() - -# plot the traces -az.plot_trace(idata, lines=[("m", {}, mtrue), ("c", {}, ctrue)]); -``` - -To check that they match let's plot all the examples together and also find the autocorrelation lengths. - -```{code-cell} ipython3 -_, axes = plt.subplots(3, 2, sharex=True, sharey=True) -az.plot_autocorr(idata_mh, combined=True, ax=axes[0, :]) -az.plot_autocorr(idata_grad, combined=True, ax=axes[1, :]) -az.plot_autocorr(idata, combined=True, ax=axes[2, :]) -axes[2, 0].set_xlim(right=40); -``` - -```{code-cell} ipython3 -# Plot MH result (blue) -pair_kwargs = dict( - kind="kde", - marginals=True, - reference_values={"m": mtrue, "c": ctrue}, - kde_kwargs={"contourf_kwargs": {"alpha": 0}, "contour_kwargs": {"colors": "C0"}}, - reference_values_kwargs={"color": "k", "ms": 15, "marker": "d"}, - marginal_kwargs={"color": "C0"}, -) -ax = az.plot_pair(idata_mh, **pair_kwargs) - -# Plot nuts+blackbox fit (orange) -pair_kwargs["kde_kwargs"]["contour_kwargs"]["colors"] = "C1" -pair_kwargs["marginal_kwargs"]["color"] = "C1" -az.plot_pair(idata_grad, **pair_kwargs, ax=ax) - -# Plot pure pymc+nuts fit (green) -pair_kwargs["kde_kwargs"]["contour_kwargs"]["colors"] = "C2" -pair_kwargs["marginal_kwargs"]["color"] = "C2" -az.plot_pair(idata, **pair_kwargs, ax=ax); -``` - -We can now check that the gradient Op works as expected. First, just create and call the `LogLikeGrad` class, which should return the gradient directly (note that we have to create an [Aesara function](http://deeplearning.net/software/aesara/library/compile/function.html) to convert the output of the Op to an array). 
Secondly, we call the gradient from `LogLikeWithGrad` by using the [Aesara tensor gradient](http://deeplearning.net/software/aesara/library/gradient.html#aesara.gradient.grad) function. Finally, we will check the gradient returned by the PyMC model for a Normal distribution, which should be the same as the log-likelihood function we defined. In all cases we evaluate the gradients at the true values of the model function (the straight line) that was created. - -```{code-cell} ipython3 -# test the gradient Op by direct call -aesara.config.compute_test_value = "ignore" -aesara.config.exception_verbosity = "high" - -var = at.dvector() -test_grad_op = LogLikeGrad(data, x, sigma) -test_grad_op_func = aesara.function([var], test_grad_op(var)) -grad_vals = test_grad_op_func([mtrue, ctrue]) - -print(f'Gradient returned by "LogLikeGrad": {grad_vals}') - -# test the gradient called through LogLikeWithGrad -test_gradded_op = LogLikeWithGrad(my_loglike, data, x, sigma) -test_gradded_op_grad = at.grad(test_gradded_op(var), var) -test_gradded_op_grad_func = aesara.function([var], test_gradded_op_grad) -grad_vals_2 = test_gradded_op_grad_func([mtrue, ctrue]) - -print(f'Gradient returned by "LogLikeWithGrad": {grad_vals_2}') - -# test the gradient that PyMC uses for the Normal log likelihood -test_model = pm.Model() -with test_model: - m = pm.Uniform("m", lower=-10.0, upper=10.0) - c = pm.Uniform("c", lower=-10.0, upper=10.0) - - pm.Normal("likelihood", mu=(m * x + c), sigma=sigma, observed=data) - - gradfunc = test_model.logp_dlogp_function([m, c], dtype=None) - gradfunc.set_extra_values({"m_interval__": mtrue, "c_interval__": ctrue}) - grad_vals_pymc = gradfunc(np.array([mtrue, ctrue]))[1] # get dlogp values - -print(f'Gradient returned by PyMC "Normal" distribution: {grad_vals_pymc}') -``` - -We could also do some profiling to compare performance between implementations. The {ref}`profiling` notebook shows how to do it. 
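-
-As a rough sketch of what that comparison could look like (assuming a recent PyMC version where `Model.profile` and `Model.logp()` are available, and that `opmodel` and `pymodel` from above are still in scope), one could profile the compiled log-probability graph of each model and compare the timing summaries:
-
-```{code-cell} ipython3
-# rough profiling sketch; see the profiling notebook for a full walkthrough.
-# Model.profile compiles the given graph with profiling enabled and evaluates it
-# repeatedly, so the summaries report per-call timings for each implementation.
-profile_blackbox = opmodel.profile(opmodel.logp())
-profile_pymc = pymodel.profile(pymodel.logp())
-
-profile_blackbox.summary()
-profile_pymc.summary()
-```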
- -+++ - -## Authors - -* Adapted from [Jørgen Midtbø](https://github.com/jorgenem/)'s [example](https://discourse.pymc.io/t/connecting-pymc-to-external-code-help-with-understanding-aesara-custom-ops/670) by Matt Pitkin both as a [blogpost](http://mattpitkin.github.io/samplers-demo/pages/pymc-blackbox-likelihood/) and as an example notebook to this gallery in August, 2018 ([pymc#3169](https://github.com/pymc-devs/pymc/pull/3169) and [pymc#3177](https://github.com/pymc-devs/pymc/pull/3177)) -* Updated by [Oriol Abril](https://github.com/OriolAbril) on December 2021 to drop the Cython dependency from the original notebook and use numpy instead ([pymc-examples#28](https://github.com/pymc-devs/pymc-examples/pull/28)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/conditional-autoregressive-model.myst.md b/myst_nbs/case_studies/conditional-autoregressive-model.myst.md deleted file mode 100644 index 07af629e4..000000000 --- a/myst_nbs/case_studies/conditional-autoregressive-model.myst.md +++ /dev/null @@ -1,772 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc3 as pm -import scipy.stats as stats -import theano -import theano.tensor as tt - -from pymc3.distributions import continuous, distribution -from theano import scan, shared -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -floatX = "float32" -``` - -# Conditional Autoregressive (CAR) model -A walkthrough of implementing a Conditional Autoregressive (CAR) model in `PyMC3`, with `WinBUGS`/`PyMC2` and `Stan` code as references. - -+++ - -As a probabilistic language, there are some fundamental differences between `PyMC3` and other alternatives such as `WinBUGS`, `JAGS`, and `Stan`. In this notebook, I will summarise some heuristics and intuition I got over the past two years using `PyMC3`. I will outline some thinking in how I approach a modelling problem using `PyMC3`, and how thinking in linear algebra solves most of the programming problems. I hope this notebook will shed some light onto the design and features of `PyMC3`, and similar languages that are built on linear algebra packages with a static world view (e.g., Edward, which is based on Tensorflow). 
- - -For more resources comparing between PyMC3 codes and other probabilistic languages: -* [PyMC3 port of "Doing Bayesian Data Analysis" - PyMC3 vs WinBUGS/JAGS/Stan](https://github.com/aloctavodia/Doing_bayesian_data_analysis) -* [PyMC3 port of "Bayesian Cognitive Modeling" - PyMC3 vs WinBUGS/JAGS/Stan](https://github.com/junpenglao/Bayesian-Cognitive-Modeling-in-Pymc3) -* [PyMC3 port of "Statistical Rethinking" - PyMC3 vs Stan](https://github.com/aloctavodia/Statistical-Rethinking-with-Python-and-PyMC3) - -+++ - -## Background information -Suppose we want to implement a [Conditional Autoregressive (CAR) model](http://www.statsref.com/HTML/index.html?car_models.html) with examples in [WinBUGS/PyMC2](http://glau.ca/?p=340) and [Stan](http://mc-stan.org/documentation/case-studies/mbjoseph-CARStan.html). -For the sake of brevity, I will not go into the details of the CAR model. The essential idea is autocorrelation, which is informally "correlation with itself". In a CAR model, the probability of values estimated at any given location $y_i$ are conditional on some neighboring values $y_j, _{j \neq i}$ (in another word, correlated/covariated with these values): - -$$y_i \mid y_j, j \neq i \sim \mathcal{N}(\alpha \sum_{j = 1}^n b_{ij} y_j, \sigma_i^{2})$$ - -where $\sigma_i^{2}$ is a spatially varying covariance parameter, and $b_{ii} = 0$. - -Here we will demonstrate the implementation of a CAR model using a canonical example: the lip cancer risk data in Scotland between 1975 and 1980. The original data is from (Kemp et al. 1985). This dataset includes observed lip cancer case counts at 56 spatial units in Scotland, with the expected number of cases as intercept, and an area-specific continuous variable coded for the proportion of the population employed in agriculture, fishing, or forestry (AFF). We want to model how lip cancer rates (`O` below) relate to AFF (`aff` below), as exposure to sunlight is a risk factor. - -$$O_i \sim \mathcal{Poisson}(\text{exp}(\beta_0 + \beta_1*aff + \phi_i + \log(\text{E}_i)))$$ -$$\phi_i \mid \phi_j, j \neq i \sim \mathcal{N}(\alpha \sum_{j = 1}^n b_{ij} \phi_j, \sigma_i^{2})$$ - -Setting up the data: - -```{code-cell} ipython3 -# Read the data from file containing columns: NAME, CANCER, CEXP, AFF, ADJ, WEIGHTS -df_scot_cancer = pd.read_csv(pm.get_data("scotland_lips_cancer.csv")) - -# name of the counties -county = df_scot_cancer["NAME"].values - -# observed -O = df_scot_cancer["CANCER"].values -N = len(O) - -# expected (E) rates, based on the age of the local population -E = df_scot_cancer["CEXP"].values -logE = np.log(E) - -# proportion of the population engaged in agriculture, forestry, or fishing (AFF) -aff = df_scot_cancer["AFF"].values / 10.0 - -# Spatial adjacency information: column (ADJ) contains list entries which are preprocessed to obtain adj as list of lists -adj = ( - df_scot_cancer["ADJ"].apply(lambda x: [int(val) for val in x.strip("][").split(",")]).to_list() -) - -# Change to Python indexing (i.e. 
-1) -for i in range(len(adj)): - for j in range(len(adj[i])): - adj[i][j] = adj[i][j] - 1 - -# spatial weight: column (WEIGHTS) contains list entries which are preprocessed to obtain weights as list of lists -weights = ( - df_scot_cancer["WEIGHTS"] - .apply(lambda x: [int(val) for val in x.strip("][").split(",")]) - .to_list() -) -Wplus = np.asarray([sum(w) for w in weights]) -``` - -## A WinBUGS/PyMC2 implementation - -The classical `WinBUGS` implementation (more information [here](http://glau.ca/?p=340)): - -```stan -model -{ - for (i in 1 : regions) { - O[i] ~ dpois(mu[i]) - log(mu[i]) <- log(E[i]) + beta0 + beta1*aff[i]/10 + phi[i] + theta[i] - theta[i] ~ dnorm(0.0,tau.h) - } - phi[1:regions] ~ car.normal(adj[], weights[], Wplus[], tau.c) - - beta0 ~ dnorm(0.0, 1.0E-5) # vague prior on grand intercept - beta1 ~ dnorm(0.0, 1.0E-5) # vague prior on covariate effect - - tau.h ~ dgamma(3.2761, 1.81) - tau.c ~ dgamma(1.0, 1.0) - - sd.h <- sd(theta[]) # marginal SD of heterogeneity effects - sd.c <- sd(phi[]) # marginal SD of clustering (spatial) effects - - alpha <- sd.c / (sd.h + sd.c) -} -``` - -The main challenge to porting this model to `PyMC3` is the `car.normal` function in `WinBUGS`. It is a likelihood function that conditions each realization on some neighbour realization (a smoothed property). In `PyMC2`, it could be implemented as a [custom likelihood function (a `@stochastic` node) `mu_phi`](http://glau.ca/?p=340): - -```python -@stochastic -def mu_phi(tau=tau_c, value=np.zeros(N)): - # Calculate mu based on average of neighbours - mu = np.array([ sum(weights[i]*value[adj[i]])/Wplus[i] for i in xrange(N)]) - # Scale precision to the number of neighbours - taux = tau*Wplus - return normal_like(value,mu,taux) -``` - -We can just define `mu_phi` similarly and wrap it in a `pymc3.DensityDist`, however, doing so usually results in a very slow model (both in compiling and sampling). In general, porting pymc2 code into pymc3 (or even generally porting `WinBUGS`, `JAGS`, or `Stan` code into `PyMC3`) that use a `for` loops tend to perform poorly in `theano`, the backend of `PyMC3`. - -The underlying mechanism in `PyMC3` is very different compared to `PyMC2`, using `for` loops to generate RV or stacking multiple RV with arguments such as `[pm.Binomial('obs%'%i, p[i], n) for i in range(K)]` generate unnecessary large number of nodes in `theano` graph, which then slows down compilation appreciably. - -The easiest way is to move the loop out of `pm.Model`. And usually is not difficult to do. For example, in `Stan` you can have a `transformed data{}` block; in `PyMC3` you just need to compute it before defining your Model. - -If it is absolutely necessary to use a `for` loop, you can use a theano loop (i.e., `theano.scan`), which you can find some introduction on the [theano website](http://deeplearning.net/software/theano/tutorial/loop.html) and see a usecase in PyMC3 [timeseries distribution](https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/timeseries.py#L125-L130). - -+++ - -## PyMC3 implementation using `theano.scan` - -So lets try to implement the CAR model using `theano.scan`. First we create a `theano` function with `theano.scan` and check if it really works by comparing its result to the for-loop. 
- -```{code-cell} ipython3 -value = np.asarray( - np.random.randn( - N, - ), - dtype=theano.config.floatX, -) - -maxwz = max([sum(w) for w in weights]) -N = len(weights) -wmat = np.zeros((N, maxwz)) -amat = np.zeros((N, maxwz), dtype="int32") -for i, w in enumerate(weights): - wmat[i, np.arange(len(w))] = w - amat[i, np.arange(len(w))] = adj[i] - -# defining the tensor variables -x = tt.vector("x") -x.tag.test_value = value -w = tt.matrix("w") -# provide Theano with a default test-value -w.tag.test_value = wmat -a = tt.matrix("a", dtype="int32") -a.tag.test_value = amat - - -def get_mu(w, a): - a1 = tt.cast(a, "int32") - return tt.sum(w * x[a1]) / tt.sum(w) - - -results, _ = theano.scan(fn=get_mu, sequences=[w, a]) -compute_elementwise = theano.function(inputs=[x, w, a], outputs=results) - -print(compute_elementwise(value, wmat, amat)) - - -def mu_phi(value): - N = len(weights) - # Calculate mu based on average of neighbours - mu = np.array([np.sum(weights[i] * value[adj[i]]) / Wplus[i] for i in range(N)]) - return mu - - -print(mu_phi(value)) -``` - -Since it produces the same result as the original for-loop, we will wrap it as a new distribution with a log-likelihood function in `PyMC3`. - -```{code-cell} ipython3 -class CAR(distribution.Continuous): - """ - Conditional Autoregressive (CAR) distribution - - Parameters - ---------- - a : list of adjacency information - w : list of weight information - tau : precision at each location - """ - - def __init__(self, w, a, tau, *args, **kwargs): - super().__init__(*args, **kwargs) - self.a = a = tt.as_tensor_variable(a) - self.w = w = tt.as_tensor_variable(w) - self.tau = tau * tt.sum(w, axis=1) - self.mode = 0.0 - - def get_mu(self, x): - def weigth_mu(w, a): - a1 = tt.cast(a, "int32") - return tt.sum(w * x[a1]) / tt.sum(w) - - mu_w, _ = scan(fn=weigth_mu, sequences=[self.w, self.a]) - - return mu_w - - def logp(self, x): - mu_w = self.get_mu(x) - tau = self.tau - return tt.sum(continuous.Normal.dist(mu=mu_w, tau=tau).logp(x)) -``` - -We then use it in our `PyMC3` version of the CAR model: - -```{code-cell} ipython3 -with pm.Model() as model1: - # Vague prior on intercept - beta0 = pm.Normal("beta0", mu=0.0, tau=1.0e-5) - # Vague prior on covariate effect - beta1 = pm.Normal("beta1", mu=0.0, tau=1.0e-5) - - # Random effects (hierarchial) prior - tau_h = pm.Gamma("tau_h", alpha=3.2761, beta=1.81) - # Spatial clustering prior - tau_c = pm.Gamma("tau_c", alpha=1.0, beta=1.0) - - # Regional random effects - theta = pm.Normal("theta", mu=0.0, tau=tau_h, shape=N) - mu_phi = CAR("mu_phi", w=wmat, a=amat, tau=tau_c, shape=N) - - # Zero-centre phi - phi = pm.Deterministic("phi", mu_phi - tt.mean(mu_phi)) - - # Mean model - mu = pm.Deterministic("mu", tt.exp(logE + beta0 + beta1 * aff + theta + phi)) - - # Likelihood - Yi = pm.Poisson("Yi", mu=mu, observed=O) - - # Marginal SD of heterogeniety effects - sd_h = pm.Deterministic("sd_h", tt.std(theta)) - # Marginal SD of clustering (spatial) effects - sd_c = pm.Deterministic("sd_c", tt.std(phi)) - # Proportion sptial variance - alpha = pm.Deterministic("alpha", sd_c / (sd_h + sd_c)) - - infdata1 = pm.sample( - 1000, - tune=500, - cores=4, - init="advi", - target_accept=0.9, - max_treedepth=15, - return_inferencedata=True, - ) -``` - -Note: there are some hidden problems with the model, some regions of the parameter space are quite difficult to sample from. Here I am using ADVI as initialization, which gives a smaller variance of the mass matrix. It keeps the sampler around the mode. 
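-
-A quick way to see this in the sampler output (a rough check only, the exact counts will vary between runs) is to count the divergent transitions in each chain:
-
-```{code-cell} ipython3
-# count divergent transitions per chain; non-zero counts point to the
-# difficult regions of the posterior mentioned above
-infdata1.sample_stats["diverging"].sum(dim="draw").values
-```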
- -```{code-cell} ipython3 -az.plot_trace(infdata1, var_names=["alpha", "sd_h", "sd_c"]); -``` - -We also got a lot of Rhat warning, that's because the Zero-centre phi introduce unidentification to the model: - -```{code-cell} ipython3 -summary1 = az.summary(infdata1) -summary1[summary1["r_hat"] > 1.05] -``` - -```{code-cell} ipython3 -az.plot_forest( - infdata1, - kind="ridgeplot", - var_names=["phi"], - combined=False, - ridgeplot_overlap=3, - ridgeplot_alpha=0.25, - colors="white", - figsize=(9, 7), -); -``` - -```{code-cell} ipython3 -az.plot_posterior(infdata1, var_names=["alpha"]); -``` - -`theano.scan` is much faster than using a python for loop, but it is still quite slow. One approach for improving it is to use linear algebra. That is, we should try to find a way to use matrix multiplication instead of looping (if you have experience in using MATLAB, it is the same philosophy). In our case, we can totally do that. - -For a similar problem, you can also have a look of [my port of Lee and Wagenmakers' book](https://github.com/junpenglao/Bayesian-Cognitive-Modeling-in-Pymc3). For example, in Chapter 19, the Stan code use [a for loop to generate the likelihood function](https://github.com/stan-dev/example-models/blob/master/Bayesian_Cognitive_Modeling/CaseStudies/NumberConcepts/NumberConcept_1_Stan.R#L28-L59), and I [generate the matrix outside and use matrix multiplication etc](http://nbviewer.jupyter.org/github/junpenglao/Bayesian-Cognitive-Modeling-in-Pymc3/blob/master/CaseStudies/NumberConceptDevelopment.ipynb#19.1-Knower-level-model-for-Give-N) to archive the same purpose. - -+++ - -## PyMC3 implementation using matrix "trick" - -Again, we try on some simulated data to make sure the implementation is correct. - -```{code-cell} ipython3 -maxwz = max([sum(w) for w in weights]) -N = len(weights) -wmat2 = np.zeros((N, N)) -amat2 = np.zeros((N, N), dtype="int32") -for i, a in enumerate(adj): - amat2[i, a] = 1 - wmat2[i, a] = weights[i] - -value = np.asarray( - np.random.randn( - N, - ), - dtype=theano.config.floatX, -) - -print(np.sum(value * amat2, axis=1) / np.sum(wmat2, axis=1)) - - -def mu_phi(value): - N = len(weights) - # Calculate mu based on average of neighbours - mu = np.array([np.sum(weights[i] * value[adj[i]]) / Wplus[i] for i in range(N)]) - return mu - - -print(mu_phi(value)) -``` - -Now create a new CAR distribution with the matrix multiplication instead of `theano.scan` to get the `mu` - -```{code-cell} ipython3 -class CAR2(distribution.Continuous): - """ - Conditional Autoregressive (CAR) distribution - - Parameters - ---------- - a : adjacency matrix - w : weight matrix - tau : precision at each location - """ - - def __init__(self, w, a, tau, *args, **kwargs): - super().__init__(*args, **kwargs) - self.a = a = tt.as_tensor_variable(a) - self.w = w = tt.as_tensor_variable(w) - self.tau = tau * tt.sum(w, axis=1) - self.mode = 0.0 - - def logp(self, x): - tau = self.tau - w = self.w - a = self.a - - mu_w = tt.sum(x * a, axis=1) / tt.sum(w, axis=1) - return tt.sum(continuous.Normal.dist(mu=mu_w, tau=tau).logp(x)) -``` - -```{code-cell} ipython3 -with pm.Model() as model2: - # Vague prior on intercept - beta0 = pm.Normal("beta0", mu=0.0, tau=1.0e-5) - # Vague prior on covariate effect - beta1 = pm.Normal("beta1", mu=0.0, tau=1.0e-5) - - # Random effects (hierarchial) prior - tau_h = pm.Gamma("tau_h", alpha=3.2761, beta=1.81) - # Spatial clustering prior - tau_c = pm.Gamma("tau_c", alpha=1.0, beta=1.0) - - # Regional random effects - theta = 
pm.Normal("theta", mu=0.0, tau=tau_h, shape=N) - mu_phi = CAR2("mu_phi", w=wmat2, a=amat2, tau=tau_c, shape=N) - - # Zero-centre phi - phi = pm.Deterministic("phi", mu_phi - tt.mean(mu_phi)) - - # Mean model - mu = pm.Deterministic("mu", tt.exp(logE + beta0 + beta1 * aff + theta + phi)) - - # Likelihood - Yi = pm.Poisson("Yi", mu=mu, observed=O) - - # Marginal SD of heterogeniety effects - sd_h = pm.Deterministic("sd_h", tt.std(theta)) - # Marginal SD of clustering (spatial) effects - sd_c = pm.Deterministic("sd_c", tt.std(phi)) - # Proportion sptial variance - alpha = pm.Deterministic("alpha", sd_c / (sd_h + sd_c)) - - infdata2 = pm.sample( - 1000, - tune=500, - cores=4, - init="advi", - target_accept=0.9, - max_treedepth=15, - return_inferencedata=True, - ) -``` - -**As you can see, it is appreciably faster using matrix multiplication.** - -```{code-cell} ipython3 -summary2 = az.summary(infdata2) -summary2[summary2["r_hat"] > 1.05] -``` - -```{code-cell} ipython3 -az.plot_forest( - infdata2, - kind="ridgeplot", - var_names=["phi"], - combined=False, - ridgeplot_overlap=3, - ridgeplot_alpha=0.25, - colors="white", - figsize=(9, 7), -); -``` - -```{code-cell} ipython3 -az.plot_trace(infdata2, var_names=["alpha", "sd_h", "sd_c"]); -``` - -```{code-cell} ipython3 -az.plot_posterior(infdata2, var_names=["alpha"]); -``` - -## PyMC3 implementation using Matrix multiplication - -There are almost always multiple ways to formulate a particular model. Some approaches work better than the others under different contexts (size of your dataset, properties of the sampler, etc). - -In this case, we can express the CAR prior as: - -$$\phi \sim \mathcal{N}(0, [D_\tau (I - \alpha B)]^{-1}).$$ - -You can find more details in the original [Stan case study](http://mc-stan.org/documentation/case-studies/mbjoseph-CARStan.html). You might come across similar constructs in Gaussian Process, which result in a zero-mean Gaussian distribution conditioned on a covariance function. - -In the `Stan` Code, matrix D is generated in the model using a `transformed data{}` block: -``` -transformed data{ - vector[n] zeros; - matrix[n, n] D; - { - vector[n] W_rowsums; - for (i in 1:n) { - W_rowsums[i] = sum(W[i, ]); - } - D = diag_matrix(W_rowsums); - } - zeros = rep_vector(0, n); -} -``` -We can generate the same matrix quite easily: - -```{code-cell} ipython3 -X = np.hstack((np.ones((N, 1)), stats.zscore(aff, ddof=1)[:, None])) -W = wmat2 -D = np.diag(W.sum(axis=1)) -log_offset = logE[:, None] -``` - -Then in the `Stan` model: -```stan -model { - phi ~ multi_normal_prec(zeros, tau * (D - alpha * W)); - ... 
-} -``` -since the precision matrix just generated by some matrix multiplication, we can do just that in `PyMC3`: - -```{code-cell} ipython3 -with pm.Model() as model3: - # Vague prior on intercept and effect - beta = pm.Normal("beta", mu=0.0, tau=1.0, shape=(2, 1)) - - # Priors for spatial random effects - tau = pm.Gamma("tau", alpha=2.0, beta=2.0) - alpha = pm.Uniform("alpha", lower=0, upper=1) - phi = pm.MvNormal("phi", mu=0, tau=tau * (D - alpha * W), shape=(1, N)) - - # Mean model - mu = pm.Deterministic("mu", tt.exp(tt.dot(X, beta) + phi.T + log_offset)) - - # Likelihood - Yi = pm.Poisson("Yi", mu=mu.ravel(), observed=O) - - infdata3 = pm.sample(1000, tune=2000, cores=4, target_accept=0.85, return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(infdata3, var_names=["alpha", "beta", "tau"]); -``` - -```{code-cell} ipython3 -az.plot_posterior(infdata3, var_names=["alpha"]); -``` - -Notice that since the model parameterization is different than in the `WinBUGS` model, the `alpha` can't be interpreted in the same way. - -+++ - -## PyMC3 implementation using Sparse Matrix - -Note that in the node $\phi \sim \mathcal{N}(0, [D_\tau (I - \alpha B)]^{-1})$, we are computing the log-likelihood for a multivariate Gaussian distribution, which might not scale well in high-dimensions. We can take advantage of the fact that the covariance matrix here $[D_\tau (I - \alpha B)]^{-1}$ is **sparse**, and there are faster ways to compute its log-likelihood. - -For example, a more efficient sparse representation of the CAR in `Stan`: -```stan -functions { - /** - * Return the log probability of a proper conditional autoregressive (CAR) prior - * with a sparse representation for the adjacency matrix - * - * @param phi Vector containing the parameters with a CAR prior - * @param tau Precision parameter for the CAR prior (real) - * @param alpha Dependence (usually spatial) parameter for the CAR prior (real) - * @param W_sparse Sparse representation of adjacency matrix (int array) - * @param n Length of phi (int) - * @param W_n Number of adjacent pairs (int) - * @param D_sparse Number of neighbors for each location (vector) - * @param lambda Eigenvalues of D^{-1/2}*W*D^{-1/2} (vector) - * - * @return Log probability density of CAR prior up to additive constant - */ - real sparse_car_lpdf(vector phi, real tau, real alpha, - int[,] W_sparse, vector D_sparse, vector lambda, int n, int W_n) { - row_vector[n] phit_D; // phi' * D - row_vector[n] phit_W; // phi' * W - vector[n] ldet_terms; - - phit_D = (phi .* D_sparse)'; - phit_W = rep_row_vector(0, n); - for (i in 1:W_n) { - phit_W[W_sparse[i, 1]] = phit_W[W_sparse[i, 1]] + phi[W_sparse[i, 2]]; - phit_W[W_sparse[i, 2]] = phit_W[W_sparse[i, 2]] + phi[W_sparse[i, 1]]; - } - - for (i in 1:n) ldet_terms[i] = log1m(alpha * lambda[i]); - return 0.5 * (n * log(tau) - + sum(ldet_terms) - - tau * (phit_D * phi - alpha * (phit_W * phi))); - } -} -``` -with the data transformed in the model: -```stan -transformed data { - int W_sparse[W_n, 2]; // adjacency pairs - vector[n] D_sparse; // diagonal of D (number of neighbors for each site) - vector[n] lambda; // eigenvalues of invsqrtD * W * invsqrtD - - { // generate sparse representation for W - int counter; - counter = 1; - // loop over upper triangular part of W to identify neighbor pairs - for (i in 1:(n - 1)) { - for (j in (i + 1):n) { - if (W[i, j] == 1) { - W_sparse[counter, 1] = i; - W_sparse[counter, 2] = j; - counter = counter + 1; - } - } - } - } - for (i in 1:n) D_sparse[i] = sum(W[i]); - { - 
vector[n] invsqrtD; - for (i in 1:n) { - invsqrtD[i] = 1 / sqrt(D_sparse[i]); - } - lambda = eigenvalues_sym(quad_form(W, diag_matrix(invsqrtD))); - } -} -``` -and the likelihood: -```stan -model { - phi ~ sparse_car(tau, alpha, W_sparse, D_sparse, lambda, n, W_n); -} -``` - -+++ - -This is quite a lot of code to digest, so my general approach is to compare the intermediate steps (whenever possible) with `Stan`. In this case, I will try to compute `tau, alpha, W_sparse, D_sparse, lambda, n, W_n` outside of the `Stan` model in `R` and compare with my own implementation. - -Below is a Sparse CAR implementation in `PyMC3` ([see also here](https://github.com/pymc-devs/pymc3/issues/2066#issuecomment-296397012)). Again, we try to avoid using any looping, as in `Stan`. - -```{code-cell} ipython3 -import scipy - - -class Sparse_CAR(distribution.Continuous): - """ - Sparse Conditional Autoregressive (CAR) distribution - - Parameters - ---------- - alpha : spatial smoothing term - W : adjacency matrix - tau : precision at each location - """ - - def __init__(self, alpha, W, tau, *args, **kwargs): - self.alpha = alpha = tt.as_tensor_variable(alpha) - self.tau = tau = tt.as_tensor_variable(tau) - D = W.sum(axis=0) - n, m = W.shape - self.n = n - self.median = self.mode = self.mean = 0 - super().__init__(*args, **kwargs) - - # eigenvalues of D^−1/2 * W * D^−1/2 - Dinv_sqrt = np.diag(1 / np.sqrt(D)) - DWD = np.matmul(np.matmul(Dinv_sqrt, W), Dinv_sqrt) - self.lam = scipy.linalg.eigvalsh(DWD) - - # sparse representation of W - w_sparse = scipy.sparse.csr_matrix(W) - self.W = theano.sparse.as_sparse_variable(w_sparse) - self.D = tt.as_tensor_variable(D) - - # Precision Matrix (inverse of Covariance matrix) - # d_sparse = scipy.sparse.csr_matrix(np.diag(D)) - # self.D = theano.sparse.as_sparse_variable(d_sparse) - # self.Phi = self.tau * (self.D - self.alpha*self.W) - - def logp(self, x): - logtau = self.n * tt.log(tau) - logdet = tt.log(1 - self.alpha * self.lam).sum() - - # tau * ((phi .* D_sparse)' * phi - alpha * (phit_W * phi)) - Wx = theano.sparse.dot(self.W, x) - tau_dot_x = self.D * x.T - self.alpha * Wx.ravel() - logquad = self.tau * tt.dot(x.ravel(), tau_dot_x.ravel()) - - # logquad = tt.dot(x.T, theano.sparse.dot(self.Phi, x)).sum() - return 0.5 * (logtau + logdet - logquad) -``` - -```{code-cell} ipython3 -with pm.Model() as model4: - # Vague prior on intercept and effect - beta = pm.Normal("beta", mu=0.0, tau=1.0, shape=(2, 1)) - - # Priors for spatial random effects - tau = pm.Gamma("tau", alpha=2.0, beta=2.0) - alpha = pm.Uniform("alpha", lower=0, upper=1) - phi = Sparse_CAR("phi", alpha, W, tau, shape=(N, 1)) - - # Mean model - mu = pm.Deterministic("mu", tt.exp(tt.dot(X, beta) + phi + log_offset)) - - # Likelihood - Yi = pm.Poisson("Yi", mu=mu.ravel(), observed=O) - - infdata4 = pm.sample(1000, tune=2000, cores=4, target_accept=0.85, return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(infdata4, var_names=["alpha", "beta", "tau"]); -``` - -```{code-cell} ipython3 -az.plot_posterior(infdata4, var_names=["alpha"]) -``` - -As you can see above, the sparse representation returns the same estimates, while being much faster than any other implementation. - -+++ - -## A few other warnings -In `Stan`, there is an option to write a `generated quantities` block for sample generation. Doing the similar in pymc3, however, is not recommended. 
-
-Consider the following simple example:
-
-```python
-# Data
-x = np.array([1.1, 1.9, 2.3, 1.8])
-n = len(x)
-
-with pm.Model() as model1:
-    # prior
-    mu = pm.Normal('mu', mu=0, tau=.001)
-    sigma = pm.Uniform('sigma', lower=0, upper=10)
-    # observed
-    xi = pm.Normal('xi', mu=mu, tau=1/(sigma**2), observed=x)
-    # generation
-    p = pm.Deterministic('p', pm.math.sigmoid(mu))
-    count = pm.Binomial('count', n=10, p=p, shape=10)
-```
-
-where we intended to use
-
-```python
-count = pm.Binomial('count', n=10, p=p, shape=10)
-```
-to generate posterior predictions. However, if the new RV added to the model is a discrete variable, it can disturb sampling of the other variables and produce erratic traces. You can see [issue #1990](https://github.com/pymc-devs/pymc3/issues/1990) for related discussion.
-
-+++
-
-## Final remarks
-
-In this notebook, most of the parameter conventions (e.g., using `tau` when defining a Normal distribution) and choices of priors are strictly matched with the original code in `WinBUGS` or `Stan`. However, it is important to note that merely porting code from one probabilistic programming language to another is not necessarily best practice. The aim is not just to run the code in `PyMC3`, but to make sure the model is appropriate, so that it returns correct estimates and samples efficiently.
-
-For example, as [@aseyboldt](https://github.com/aseyboldt) pointed out [here](https://github.com/pymc-devs/pymc3/pull/2080#issuecomment-297456574) and [here](https://github.com/pymc-devs/pymc3/issues/1924#issue-215496293), non-centered parametrizations are often a better choice than centered parametrizations. In our case, `phi` follows a zero-mean Normal distribution, so it can be sampled on a unit scale first and rescaled afterwards. Doing this often removes correlations in the posterior (although it can be slower in some cases).
-
-Another thing to keep in mind is that models can be sensitive to the choice of prior distributions; for example, sampling can struggle when Normal priors with a very large standard deviation are used. Gelman often recommends Cauchy or StudentT priors (*i.e.*, weakly-informative priors). More information on prior choice can be found on the [Stan wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations).
-
-There are always ways to improve code. Since our computational graph within `pm.Model()` consists of `theano` objects, we can always call `print(VAR_TO_CHECK.tag.test_value)` right after a declaration or computation to check the shape.
-For example, in our last model, as suggested by [@aseyboldt](https://github.com/pymc-devs/pymc3/pull/2080#issuecomment-297456574), there seems to be a lot of correlation in the posterior. That probably slows down NUTS quite a bit. As a debugging tool and guide for reparametrization you can look at the singular value decomposition of the standardized samples from the trace – basically the eigenvalues of the correlation matrix.
If the problem is high dimensional you can use stuff from `scipy.sparse.linalg` to only compute the largest singular value: - -```python -from scipy import linalg, sparse - -vals = np.array([model.dict_to_array(v) for v in trace[1000:]]).T -vals[:] -= vals.mean(axis=1)[:, None] -vals[:] /= vals.std(axis=1)[:, None] - -U, S, Vh = sparse.linalg.svds(vals, k=20) -``` - -Then look at `plt.plot(S)` to see if any principal components are obvious, and check which variables are contributing by looking at the singular vectors: `plt.plot(U[:, -1] ** 2)`. You can get the indices by looking at `model.bijection.ordering.vmap`. - -Another great way to check the correlations in the posterior is to do a pairplot of the posterior (if your model doesn't contain too many parameters). You can see quite clearly if and where the the posterior parameters are correlated. - -```{code-cell} ipython3 -az.plot_pair(infdata1, var_names=["beta0", "beta1", "tau_h", "tau_c"], divergences=True); -``` - -```{code-cell} ipython3 -az.plot_pair(infdata2, var_names=["beta0", "beta1", "tau_h", "tau_c"], divergences=True); -``` - -```{code-cell} ipython3 -az.plot_pair(infdata3, var_names=["beta", "tau", "alpha"], divergences=True); -``` - -```{code-cell} ipython3 -az.plot_pair(infdata4, var_names=["beta", "tau", "alpha"], divergences=True); -``` - -* Notebook Written by [Junpeng Lao](https://www.github.com/junpenglao/), inspired by `PyMC3` [issue#2022](https://github.com/pymc-devs/pymc3/issues/2022), [issue#2066](https://github.com/pymc-devs/pymc3/issues/2066) and [comments](https://github.com/pymc-devs/pymc3/issues/2066#issuecomment-296397012). I would like to thank [@denadai2](https://github.com/denadai2), [@aseyboldt](https://github.com/aseyboldt), and [@twiecki](https://github.com/twiecki) for the helpful discussion. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/case_studies/factor_analysis.myst.md b/myst_nbs/case_studies/factor_analysis.myst.md deleted file mode 100644 index 33ff60c6e..000000000 --- a/myst_nbs/case_studies/factor_analysis.myst.md +++ /dev/null @@ -1,348 +0,0 @@ ---- -jupytext: - notebook_metadata_filter: substitutions - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 -substitutions: - extra_dependencies: seaborn xarray-einstats ---- - -(factor_analysis)= -# Factor analysis - -:::{post} 19 Mar, 2022 -:tags: factor analysis, matrix factorization, PCA -:category: advanced, how-to -:author: Chris Hartl, Christopher Krapu, Oriol Abril-Pla -::: - -+++ - -Factor analysis is a widely used probabilistic model for identifying low-rank structure in multivariate data as encoded in latent variables. It is very closely related to principal components analysis, and differs only in the prior distributions assumed for these latent variables. It is also a good example of a linear Gaussian model as it can be described entirely as a linear transformation of underlying Gaussian variates. For a high-level view of how factor analysis relates to other models, you can check out [this diagram](https://www.cs.ubc.ca/~murphyk/Bayes/Figures/gmka.gif) originally published by Ghahramani and Roweis. 
- -+++ - -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib -import numpy as np -import pymc as pm -import scipy as sp -import seaborn as sns -import xarray as xr - -from matplotlib import pyplot as plt -from matplotlib.lines import Line2D -from numpy.random import default_rng -from xarray_einstats import linalg -from xarray_einstats.stats import XrContinuousRV - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") - -np.set_printoptions(precision=3, suppress=True) -RANDOM_SEED = 31415 -rng = default_rng(RANDOM_SEED) -``` - -## Simulated data generation - -+++ - -To work through a few examples, we'll first generate some data. The data will not follow the exact generative process assumed by the factor analysis model, as the latent variates will not be Gaussian. We'll assume that we have an observed data set with $N$ rows and $d$ columns which are actually a noisy linear function of $k_{true}$ latent variables. - -```{code-cell} ipython3 -n = 250 -k_true = 4 -d = 10 -``` - -The next code cell generates the data via creating latent variable arrays `M` and linear transformation `Q`. Then, the matrix product $QM$ is perturbed with additive Gaussian noise controlled by the variance parameter `err_sd`. - -```{code-cell} ipython3 -err_sd = 2 -M = rng.binomial(1, 0.25, size=(k_true, n)) -Q = np.hstack([rng.exponential(2 * k_true - k, size=(d, 1)) for k in range(k_true)]) * rng.binomial( - 1, 0.75, size=(d, k_true) -) -Y = np.round(1000 * np.dot(Q, M) + rng.standard_normal(size=(d, n)) * err_sd) / 1000 -``` - -Because of the way we have generated the data, the covariance matrix expressing correlations between columns of $Y$ will be equal to $QQ^T$. The fundamental assumption of PCA and factor analysis is that $QQ^T$ is not full rank. We can see hints of this if we plot the covariance matrix: - -```{code-cell} ipython3 -plt.figure(figsize=(4, 3)) -sns.heatmap(np.corrcoef(Y)); -``` - -If you squint long enough, you may be able to glimpse a few places where distinct columns are likely linear functions of each other. - -+++ - -## Model -Probabilistic PCA (PPCA) and factor analysis (FA) are a common source of topics on [PyMC Discourse](https://discourse.pymc.io/). The posts linked below handle different aspects of the problem including: -* [Minibatched FA for large datasets](https://discourse.pymc.io/t/large-scale-factor-analysis-with-minibatch-advi/246) -* [Handling missing data in FA](https://discourse.pymc.io/t/dealing-with-missing-data/252) -* [Identifiability in FA / PPCA](https://discourse.pymc.io/t/unique-solution-for-probabilistic-pca/1324/14) - -+++ - -### Direct implementation - -+++ - -The model for factor analysis is the probabilistic matrix factorization - -$X_{(d,n)}|W_{(d,k)}, F_{(k,n)} \sim N(WF, \Psi)$ - -with $\Psi$ a diagonal matrix. Subscripts denote the dimensionality of the matrices. Probabilistic PCA is a variant that sets $\Psi = \sigma^2I$. A basic implementation (taken from [this gist](https://gist.github.com/twiecki/c95578a6539d2098be2d83575e3d15fe)) is shown in the next cell. Unfortunately, it has undesirable properties for model fitting. 
- -```{code-cell} ipython3 -k = 2 - -coords = {"latent_columns": np.arange(k), "rows": np.arange(n), "observed_columns": np.arange(d)} - -with pm.Model(coords=coords) as PPCA: - W = pm.Normal("W", dims=("observed_columns", "latent_columns")) - F = pm.Normal("F", dims=("latent_columns", "rows")) - psi = pm.HalfNormal("psi", 1.0) - X = pm.Normal("X", mu=at.dot(W, F), sigma=psi, observed=Y, dims=("observed_columns", "rows")) - - trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) # target_accept=0.9 -``` - -At this point, there are already several warnings regarding diverging samples and failure of convergence checks. We can see further problems in the trace plot below. This plot shows the path taken by each sampler chain for a single entry in the matrix $W$ as well as the average evaluated over samples for each chain. - -```{code-cell} ipython3 -for i in trace.posterior.chain.values: - samples = trace.posterior["W"].sel(chain=i, observed_columns=3, latent_columns=1) - plt.plot(samples, label="Chain {}".format(i + 1)) - plt.axhline(samples.mean(), color=f"C{i}") -plt.legend(ncol=4, loc="upper center", fontsize=12, frameon=True), plt.xlabel("Sample"); -``` - -Each chain appears to have a different sample mean and we can also see that there is a great deal of autocorrelation across chains, manifest as long-range trends over sampling iterations. Some of the chains may have divergences as well, lending further evidence to the claim that using MCMC for this model as shown is suboptimal. - -One of the primary drawbacks for this model formulation is its lack of identifiability. With this model representation, only the product $WF$ matters for the likelihood of $X$, so $P(X|W, F) = P(X|W\Omega, \Omega^{-1}F)$ for any invertible matrix $\Omega$. While the priors on $W$ and $F$ constrain $|\Omega|$ to be neither too large or too small, factors and loadings can still be rotated, reflected, and/or permuted *without changing the model likelihood*. Expect it to happen between runs of the sampler, or even for the parametrization to "drift" within run, and to produce the highly autocorrelated $W$ traceplot above. - -### Alternative parametrization - -This can be fixed by constraining the form of W to be: - + Lower triangular - + Positive with an increasing diagonal - -We can adapt `expand_block_triangular` to fill out a non-square matrix. This function mimics `pm.expand_packed_triangular`, but while the latter only works on packed versions of square matrices (i.e. $d=k$ in our model, the former can also be used with nonsquare matrices. - -```{code-cell} ipython3 -def expand_packed_block_triangular(d, k, packed, diag=None, mtype="aesara"): - # like expand_packed_triangular, but with d > k. - assert mtype in {"aesara", "numpy"} - assert d >= k - - def set_(M, i_, v_): - if mtype == "aesara": - return at.set_subtensor(M[i_], v_) - M[i_] = v_ - return M - - out = at.zeros((d, k), dtype=float) if mtype == "aesara" else np.zeros((d, k), dtype=float) - if diag is None: - idxs = np.tril_indices(d, m=k) - out = set_(out, idxs, packed) - else: - idxs = np.tril_indices(d, k=-1, m=k) - out = set_(out, idxs, packed) - idxs = (np.arange(k), np.arange(k)) - out = set_(out, idxs, diag) - return out -``` - -We'll also define another function which helps create a diagonal positive matrix with increasing entries along the main diagonal. 
- -```{code-cell} ipython3 -def makeW(d, k, dim_names): - # make a W matrix adapted to the data shape - n_od = int(k * d - k * (k - 1) / 2 - k) - - # trick: the cumulative sum of z will be positive increasing - z = pm.HalfNormal("W_z", 1.0, dims="latent_columns") - b = pm.HalfNormal("W_b", 1.0, shape=(n_od,), dims="packed_dim") - L = expand_packed_block_triangular(d, k, b, at.ones(k)) - W = pm.Deterministic("W", at.dot(L, at.diag(at.extra_ops.cumsum(z))), dims=dim_names) - return W -``` - -With these modifications, we remake the model and run the MCMC sampler again. - -```{code-cell} ipython3 -with pm.Model(coords=coords) as PPCA_identified: - W = makeW(d, k, ("observed_columns", "latent_columns")) - F = pm.Normal("F", dims=("latent_columns", "rows")) - psi = pm.HalfNormal("psi", 1.0) - X = pm.Normal("X", mu=at.dot(W, F), sigma=psi, observed=Y, dims=("observed_columns", "rows")) - trace = pm.sample(tune=2000) # target_accept=0.9 - -for i in range(4): - samples = trace.posterior["W"].sel(chain=i, observed_columns=3, latent_columns=1) - plt.plot(samples, label="Chain {}".format(i + 1)) - -plt.legend(ncol=4, loc="lower center", fontsize=8), plt.xlabel("Sample"); -``` - -$W$ (and $F$!) now have entries with identical posterior distributions as compared between sampler chains. - -Because the $k \times n$ parameters in F all need to be sampled, sampling can become quite expensive for very large `n`. In addition, the link between an observed data point $X_i$ and an associated latent value $F_i$ means that streaming inference with mini-batching cannot be performed. - -This scalability problem can be addressed analytically by integrating $F$ out of the model. By doing so, we postpone any calculation for individual values of $F_i$ until later. Hence, this approach is often described as *amortized inference*. However, this fixes the prior on $F$, allowing for no modeling flexibility. In keeping with $F_{ij} \sim N(0, 1)$ we have: - -$X|WF \sim \mathrm{MatrixNormal}(WF, \Psi, I), \;\; F_{ij} \sim N(0, 1)$ - -$X|W \sim \mathrm{MatrixNormal}(0, \Psi + WW^T, I)$ - -If you are unfamiliar with the matrix normal distribution, you can consider it to be an extension of the multivariate Gaussian to matrix-valued random variates. Then, the between-row correlations and the between-column correlations are handled by two separate covariance matrices specified as parameters to the matrix normal. Here, it simplifies our notation for a model formulation that has marginalized out $F_i$. The explicit integration of $F_i$ also enables batching the observations for faster computation of `ADVI` and `FullRankADVI` approximations. - -```{code-cell} ipython3 -coords["observed_columns2"] = coords["observed_columns"] -with pm.Model(coords=coords) as PPCA_scaling: - W = makeW(d, k, ("observed_columns", "latent_columns")) - Y_mb = pm.Minibatch(Y.T, 50) # MvNormal parametrizes covariance of columns, so transpose Y - psi = pm.HalfNormal("psi", 1.0) - E = pm.Deterministic( - "cov", - at.dot(W, at.transpose(W)) + psi * at.diag(at.ones(d)), - dims=("observed_columns", "observed_columns2"), - ) - X = pm.MvNormal("X", 0.0, cov=E, observed=Y_mb) - trace_vi = pm.fit(n=50000, method="fullrank_advi", obj_n_mc=1).sample() -``` - -## Results -When we compare the posteriors calculated using MCMC and VI, we find that (for at least this specific parameter we are looking at) the two distributions are close, but they do differ in their mean. The MCMC chains all agree with each other and the ADVI estimate is not far off. 
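-
-As a rough numerical check (a sketch only; the exact values depend on the run), we can compare the posterior means of a single entry of $W$ under the two approximations before looking at the full marginal densities below:
-
-```{code-cell} ipython3
-# compare posterior means of one entry of W under MCMC and full-rank ADVI;
-# the KDE plot in the next cell shows the corresponding marginal densities
-w_entry = dict(observed_columns=3, latent_columns=1)
-print("MCMC mean:   ", float(trace.posterior["W"].sel(**w_entry).mean()))
-print("FR-ADVI mean:", float(trace_vi.posterior["W"].sel(**w_entry).mean()))
-```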
- -```{code-cell} ipython3 -col_selection = dict(observed_columns=3, latent_columns=1) -ax = az.plot_kde( - trace_vi.posterior["W"].sel(**col_selection).squeeze().values, - label="FR-ADVI posterior", - plot_kwargs={"alpha": 0}, - fill_kwargs={"alpha": 0.5, "color": "red"}, -) -for i in trace.posterior.chain.values: - mcmc_samples = trace.posterior["W"].sel(chain=i, **col_selection).values - az.plot_kde( - mcmc_samples, - label="MCMC posterior for chain {}".format(i + 1), - plot_kwargs={"color": f"C{i}"}, - ) - -ax.set_title(rf"PDFs of $W$ estimate at {col_selection}") -ax.legend(loc="upper right"); -``` - -### Post-hoc identification of F - -The matrix $F$ is typically of interest for factor analysis, and is often used as a feature matrix for dimensionality reduction. However, $F$ has been -marginalized away in order to make fitting the model easier; and now we need it back. This is, in effect, an exercise in least-squares as: - -$X|WF \sim N(WF, \Psi)$ - -$(W^TW)^{-1}W^T\Psi^{-1/2}X|W,F \sim N(F, (W^TW)^{-1})$ - -+++ - -Here, we draw many random variates from a standard normal distribution, transforming them appropriate to represent the posterior of $F$ which is multivariate normal under our model. - -```{code-cell} ipython3 -# configure xarray-einstats -def get_default_dims(dims, dims2): - proposed_dims = [dim for dim in dims if dim not in {"chain", "draw"}] - assert len(proposed_dims) == 2 - if dims2 is None: - return proposed_dims - - -linalg.get_default_dims = get_default_dims -``` - -```{code-cell} ipython3 -post = trace_vi.posterior -obs = trace.observed_data - -WW = linalg.inv( - linalg.matmul( - post["W"], post["W"], dims=("latent_columns", "observed_columns", "latent_columns") - ) -) -WW_W = linalg.matmul( - WW, - post["W"], - dims=(("latent_columns", "latent_columns2"), ("latent_columns", "observed_columns")), -) -F_mu = xr.dot(1 / np.sqrt(post["psi"]) * WW_W, obs["X"], dims="observed_columns") -WW_chol = linalg.cholesky(WW) -norm_dist = XrContinuousRV(sp.stats.norm, xr.zeros_like(F_mu)) # the zeros_like defines the shape -F_sampled = F_mu + linalg.matmul( - WW_chol, - norm_dist.rvs(), - dims=(("latent_columns", "latent_columns2"), ("latent_columns", "rows")), -) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -ls = ["-", "--"] -for i in range(2): - for j in range(5): - az.plot_kde( - F_sampled.sel(latent_columns=i, rows=j).squeeze().values, - plot_kwargs={"color": f"C{j}", "ls": ls[i]}, - ax=ax, - ) -legend = ax.legend( - handles=[Line2D([], [], color="k", ls=ls[i], label=f"{i}") for i in range(2)], - title="latent column", - loc="upper left", -) -ax.add_artist(legend) -ax.legend( - handles=[Line2D([], [], color=f"C{i}", label=f"{i}") for i in range(5)], - title="row", - loc="upper right", -); -``` - -## Authors -* Authored by [chartl](https://github.com/chartl) on May 6, 2019 -* Updated by [Christopher Krapu](https://github.com/ckrapu) on April 4, 2021 -* Updated by Oriol Abril-Pla to use PyMC v4 and xarray-einstats on March, 2022 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aeppl -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/hierarchical_partial_pooling.myst.md b/myst_nbs/case_studies/hierarchical_partial_pooling.myst.md deleted file mode 100644 index fe9b637b7..000000000 --- a/myst_nbs/case_studies/hierarchical_partial_pooling.myst.md +++ /dev/null @@ -1,182 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - 
format_name: myst
-    format_version: 0.13
-    jupytext_version: 1.13.7
-kernelspec:
-  display_name: Python 3.10.5 ('pymc-dev')
-  language: python
-  name: python3
----
-
-(hierarchical_partial_pooling)=
-# Hierarchical Partial Pooling
-:::{post} Oct 07, 2021
-:tags: hierarchical model,
-:category: intermediate,
-:author: Vladislavs Dovgalecs, Adrian Seybolt, Christian Luhmann
-:::
-
-+++
-
-Suppose you are tasked with estimating baseball batting skills for several players. One such performance metric is batting average. Since players play a different number of games and bat in different positions in the batting order, each player has a different number of at-bats. However, you want to estimate the skill of all players, including those with a relatively small number of batting opportunities.
-
-So, suppose a player came to bat only 4 times and never hit the ball. Are they a bad player?
-
-As a disclaimer, the author of this notebook has little to non-existent knowledge about baseball and its rules. The number of times at bat in his entire life is around "4".
-
-
-## Data
-
-We will use the baseball [data](http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt) {cite:p}`efron1975data`.
-
-
-## Approach
-
-We will use PyMC to estimate the batting average for each player. Having estimated the averages across all players in the dataset, we can use this information to inform the estimate for an additional player, for whom there is very little data (*i.e.* 4 at-bats).
-
-In the absence of a Bayesian hierarchical model, there are two approaches for this problem:
-
-1. independently compute the batting average for each player (no pooling)
-2. compute an overall average, under the assumption that everyone has the same underlying average (complete pooling)
-
-Of course, neither approach is realistic. Clearly, all players aren't equally skilled hitters, so the global average is implausible. At the same time, professional baseball players are similar in many ways, so their averages aren't entirely independent either.
-
-It may be possible to cluster groups of "similar" players, and estimate group averages, but using a hierarchical modeling approach is a natural way of sharing information that does not involve identifying *ad hoc* clusters.
-
-The idea of hierarchical partial pooling is to model the global performance, and use that estimate to parameterize a population of players that accounts for differences among the players' performances. This tradeoff between global and individual performance will be automatically tuned by the model. Also, the uncertainty due to the different number of at-bats for each player (*i.e.* the amount of information available on each player) will be automatically accounted for, by shrinking those estimates closer to the global mean.
-
-For a far more in-depth discussion please refer to the Stan [tutorial](http://mc-stan.org/documentation/case-studies/pool-binary-trials.html) {cite:p}`carpenter2016hierarchical` on the subject. The model and parameter values were taken from that example.
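-
-Concretely, the model we will build below has the following structure (a sketch of the model specified in code later in this notebook; the reasoning behind the priors and the exact parametrization of $\kappa$ are discussed before the model definition):
-
-$$\phi \sim \mathrm{Uniform}(0, 1), \qquad \log \kappa \sim \mathrm{Exponential}(1.5)$$
-
-$$\theta_i \sim \mathrm{Beta}\big(\phi \kappa,\ (1 - \phi) \kappa\big), \qquad y_i \sim \mathrm{Binomial}(n_i, \theta_i)$$
-
-where $n_i$ is the number of at-bats and $y_i$ the number of hits for player $i$, $\phi$ is the population mean batting average, and $\kappa$ controls how strongly individual estimates are shrunk towards it.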
-
-```{code-cell} ipython3
-import aesara.tensor as at
-import arviz as az
-import matplotlib.pyplot as plt
-import numpy as np
-import pandas as pd
-import pymc as pm
-
-%matplotlib inline
-```
-
-```{code-cell} ipython3
-RANDOM_SEED = 8927
-rng = np.random.default_rng(RANDOM_SEED)
-az.style.use("arviz-darkgrid")
-```
-
-Now we can load the dataset using pandas:
-
-```{code-cell} ipython3
-data = pd.read_csv(pm.get_data("efron-morris-75-data.tsv"), sep="\t")
-at_bats, hits = data[["At-Bats", "Hits"]].to_numpy().T
-```
-
-Now let's develop a generative model for these data.
-
-We will assume that there exists a hidden factor (`phi`) related to the expected performance for all players (not limited to our 18). Since the population mean is an unknown value between 0 and 1, it must be bounded from below and above. Also, we assume that nothing is known about the global average. Hence, a natural choice for a prior distribution is the uniform distribution.
-
-Next, we introduce a hyperparameter `kappa` to account for the variance in the population batting averages, for which we will use a bounded Pareto distribution. This will ensure that the estimated value falls within reasonable bounds. These hyperparameters will, in turn, be used to parameterize a beta distribution, which is ideal for modeling quantities on the unit interval. The beta distribution is typically parameterized via two shape parameters $\alpha$ and $\beta$, but it may also be parametrized in terms of its mean $\mu \in [0,1]$ and sample size (a proxy for variance) $\nu = \alpha + \beta$ (with $\nu > 0$).
-
-The final step is to specify a sampling distribution for the data (hit or miss) for every player, using a Binomial distribution. This is where the data are brought to bear on the model.
-
-+++
-
-We could use `pm.Pareto('kappa', m=1.5)` to define our prior on `kappa`, but the Pareto
-distribution has very long tails. Exploring these properly
-is difficult for the sampler, so we use an equivalent
-but faster parametrization using the exponential distribution.
-We use the fact that the log of a Pareto distributed
-random variable follows an exponential distribution.
-
-```{code-cell} ipython3
-N = len(hits)
-player_names = data["FirstName"] + " " + data["LastName"]
-coords = {"player_names": player_names.tolist()}
-
-with pm.Model(coords=coords) as baseball_model:
-
-    phi = pm.Uniform("phi", lower=0.0, upper=1.0)
-
-    kappa_log = pm.Exponential("kappa_log", lam=1.5)
-    kappa = pm.Deterministic("kappa", at.exp(kappa_log))
-
-    theta = pm.Beta("theta", alpha=phi * kappa, beta=(1.0 - phi) * kappa, dims="player_names")
-    y = pm.Binomial("y", n=at_bats, p=theta, dims="player_names", observed=hits)
-```
-
-Recall that our original question was about the true batting average for a player with only 4 at-bats and no hits. We can add this as an additional variable in the model:
-
-```{code-cell} ipython3
-with baseball_model:
-
-    theta_new = pm.Beta("theta_new", alpha=phi * kappa, beta=(1.0 - phi) * kappa)
-    y_new = pm.Binomial("y_new", n=4, p=theta_new, observed=0)
-```
-
-The model can be visualized like this:
-
-```{code-cell} ipython3
-pm.model_to_graphviz(baseball_model)
-```
-
-We can now fit the model using MCMC:
-
-```{code-cell} ipython3
-with baseball_model:
-    idata = pm.sample(2000, tune=2000, chains=2, target_accept=0.95)
-
-    # check convergence diagnostics (every r_hat value should be close to 1)
-    rhat = az.rhat(idata)
-    assert all(bool(rhat[var].max() < 1.03) for var in rhat.data_vars)
-```
-
-Now we can plot the posterior distributions of the parameters. First, the population hyperparameters:
-
-```{code-cell} ipython3
-az.plot_trace(idata, var_names=["phi", "kappa"]);
-```
-
-Hence, the population mean batting average is in the 0.22-0.31 range, with an expected value of around 0.26.
-
-Next, the estimates for all 18 players in the dataset:
-
-```{code-cell} ipython3
-az.plot_forest(idata, var_names="theta");
-```
-
-Finally, let's get the estimate for our 0-for-4 player:
-
-```{code-cell} ipython3
-az.plot_trace(idata, var_names=["theta_new"]);
-```
-
-Notice that, despite the fact that our additional player did not get any hits, the estimate of his average is not zero -- zero is not even a highly probable value. This is because we are assuming that the player is drawn from a *population* of players with a distribution specified by our estimated hyperparameters. However, the estimated mean for this player is toward the low end of the means for the players in our dataset, indicating that the 4 at-bats contributed some information toward the estimate.
-
-+++
-
-## Authors
-* authored by Vladislavs Dovgalecs in November, 2016 ([pymc#1546](https://github.com/pymc-devs/pymc/pull/1546))
-* updated by Adrian Seybolt in June, 2017 ([pymc#2288](https://github.com/pymc-devs/pymc/pull/2288))
-* updated by Christian Luhmann in August, 2020 ([pymc#4068](https://github.com/pymc-devs/pymc/pull/4068))
-
-+++
-
-## References
-
-:::{bibliography}
-:filter: docname in docnames
-:::
-
-+++
-
-## Watermark
-
-```{code-cell} ipython3
-%load_ext watermark
-%watermark -n -u -v -iv -w -p aesara,aeppl,xarray
-```
-
-:::{include} ../page_footer.md
-:::
diff --git a/myst_nbs/case_studies/item_response_nba.myst.md b/myst_nbs/case_studies/item_response_nba.myst.md
deleted file mode 100644
index 402b8f25e..000000000
--- a/myst_nbs/case_studies/item_response_nba.myst.md
+++ /dev/null
@@ -1,529 +0,0 @@
----
-jupytext:
-  text_representation:
-    extension: .md
-    format_name: myst
-    format_version: 0.13
-    jupytext_version: 1.13.7
-kernelspec:
-  display_name: Python 3 (ipykernel)
-  language: python
-  name: python3
----
-
-(item_response_nba)=
-# NBA Foul Analysis with Item Response Theory
-
-:::{post} Apr 17, 2022
-:tags: hierarchical model, case study, generalized linear model
-:category: intermediate, tutorial
-:author: Austin Rochford, Lorenzo Toniazzi
-:::
-
-```{code-cell} ipython3
-import os
-
-import arviz as az
-import matplotlib.pyplot as plt
-import numpy as np
-import pandas as pd
-import pymc as pm
-
-%matplotlib inline
-print(f"Running on PyMC v{pm.__version__}")
-```
-
-```{code-cell} ipython3
-RANDOM_SEED = 8927
-rng = np.random.default_rng(RANDOM_SEED)
-az.style.use("arviz-darkgrid")
-```
-
-## Introduction
-This tutorial shows an application of Bayesian Item Response Theory {cite:p}`fox2010bayesian` to NBA basketball foul call data using PyMC. It is based on Austin Rochford's blogpost [NBA Foul Calls and Bayesian Item Response Theory](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html).
-
-### Motivation
-Our scenario is that we observe a binary outcome (a foul being called or not) from an interaction (a basketball play) of two agents with two different roles (the player committing the alleged foul and the player disadvantaged in the play). Moreover, each committing or disadvantaged agent is an individual who might be observed several times (say, LeBron James observed committing a foul in more than one play). It might then be that not only the agent's role, but also the abilities of the individual player, contribute to the observed outcome. And so we'd like to __estimate the contribution to the observed outcome of each individual's (latent) ability as a committing or disadvantaged agent.__ This would allow us, for example, to rank players from more to less effective, quantify uncertainty in this ranking and discover extra hierarchical structures involved in foul calls. All pretty useful stuff!
-
-
-So how can we study this common and complex __multi-agent interaction__ scenario, with __hierarchical__ structures between more than a thousand individuals?
-
-Despite the scenario's overwhelming complexity, Bayesian Item Response Theory combined with modern powerful statistical software allows for quite elegant and effective modeling options. One of these options employs a {term}`Generalized Linear Model` called the [Rasch model](https://en.wikipedia.org/wiki/Rasch_model), which we now discuss in more detail.
-
-
-### Rasch Model
-We sourced our data from the official [NBA Last Two Minutes Reports](https://official.nba.com/2020-21-nba-officiating-last-two-minute-reports/), with game data from 2015 to 2021. In this dataset, each row `k` is one play involving two players (the committing and the disadvantaged) where a foul has been either called or not. So we model the probability `p_k` that a referee calls a foul in play `k` as a function of the players involved. Hence we define two latent variables for each player, namely:
-- `theta`: which estimates the player's ability to have a foul called when disadvantaged, and
-- `b`: which estimates the player's ability to have a foul not called when committing.
-
-Note that the higher a player's parameters, the better the outcome for the player's team. These two parameters are then estimated using a standard Rasch model, by assuming the log-odds of `p_k` equals `theta - b` for the corresponding players involved in play `k`. Also, we place hierarchical hyperpriors on all `theta`'s and all `b`'s to account for shared abilities between players and largely different numbers of observations for different players.
-
-
-### Discussion
-Our analysis gives an estimate of the latent skills `theta` and `b` for each player in terms of posterior distributions. We analyze this outcome in three ways.
-
-We first display the role of shared hyperpriors, by showing how posteriors of players with few observations are drawn to the league average.
-
-Secondly, we rank the posteriors by their mean to view the best and worst committing and disadvantaged players, and observe that several players still rank in the top 10 of the same model estimated in [Austin Rochford's blogpost](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html) on different data.
-
-Thirdly, we show how we can spot that grouping players by their position is likely to provide an informative extra hierarchical layer to introduce in our model, and leave this as an exercise for the interested reader. Let us conclude by mentioning that this opportunity of easily adding informed hierarchical structure to a model is one of the features that makes Bayesian modelling very flexible and powerful for quantifying uncertainty in scenarios where introducing (or discovering) problem-specific knowledge is crucial.
-
-
-The analysis in this notebook is performed in four main steps:
-
-1. Data collection and processing.
-2. Definition and instantiation of the Rasch model.
-3. Posterior sampling and convergence checks.
-4. Analysis of the posterior results.
-
-## Data collection and processing
-We first import data from the original data set, which can be found at [this URL](https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/32f1c43dfa06c2e7652cc51ea65758007f2a1a01/output/all_games.csv). Each row corresponds to a play between the NBA seasons 2015-16 and 2020-21. We imported only five columns, namely
-- `committing`: the name of the committing player in the play.
-- `disadvantaged`: the name of the disadvantaged player in the play.
-- `decision`: the reviewed decision of the play, which can take four values, namely:
-  - `CNC`: correct noncall, `INC`: incorrect noncall, `IC`: incorrect call, `CC`: correct call.
-- `committing_position`: the position of the committing player, which can take the values
-  - `G`: guard, `F`: forward, `C`: center, `G-F`, `F-G`, `F-C`, `C-F`.
-- `disadvantaged_position`: the position of the disadvantaged player, with possible values as above.
-
-We note that we already removed from the original dataset the plays where fewer than two players are involved (for example travel calls or clock violations). Also, the original dataset does not contain information on the players' position, which we added ourselves.
-
-```{code-cell} ipython3
-:tags: []
-
-try:
-    df_orig = pd.read_csv(os.path.join("..", "data", "item_response_nba.csv"), index_col=0)
-except FileNotFoundError:
-    df_orig = pd.read_csv(pm.get_data("item_response_nba.csv"), index_col=0)
-df_orig.head()
-```
-
-We now process our data in three steps:
- 1. We create a dataframe `df` by removing the position information from `df_orig`, and we create a dataframe `df_position` collecting all players with their respective position. (This last dataframe will not be used until the very end of the notebook.)
- 2. We add a column to `df`, called `foul_called`, that assigns 1 to a play if a foul was called, and 0 otherwise.
- 3. We assign IDs to committing and disadvantaged players and use this indexing to identify the respective players in each observed play.
-
-Finally, we display the head of our main dataframe `df` along with some basic statistics.
-
-```{code-cell} ipython3
-:tags: []
-
-# 1. Construct df and df_position
-df = df_orig[["committing", "disadvantaged", "decision"]]
-
-df_position = pd.concat(
-    [
-        df_orig.groupby("committing").committing_position.first(),
-        df_orig.groupby("disadvantaged").disadvantaged_position.first(),
-    ]
-).to_frame()
-df_position = df_position[~df_position.index.duplicated(keep="first")]
-df_position.index.name = "player"
-df_position.columns = ["position"]
-
-# 2. Create the binary foul_called variable
-def foul_called(decision):
-    """Correct and incorrect noncalls (CNC and INC) take value 0.
-    Correct and incorrect calls (CC and IC) take value 1.
- """ - out = 0 - if (decision == "CC") | (decision == "IC"): - out = 1 - return out - - -df = df.assign(foul_called=lambda df: df["decision"].apply(foul_called)) - -# 3 We index observed calls by committing and disadvantaged players -committing_observed, committing = pd.factorize(df.committing, sort=True) -disadvantaged_observed, disadvantaged = pd.factorize(df.disadvantaged, sort=True) -df.index.name = "play_id" - -# Display of main dataframe with some statistics -print(f"Number of observed plays: {len(df)}") -print(f"Number of disadvanteged players: {len(disadvantaged)}") -print(f"Number of committing players: {len(committing)}") -print(f"Global probability of a foul being called: " f"{100*round(df.foul_called.mean(),3)}%\n\n") -df.head() -``` - -+++ {"tags": []} - -## Item Response Model - -### Model definition - -We denote by: -- $N_d$ and $N_c$ the number of disadvantaged and committing players, respectively, -- $K$ the number of plays, -- $k$ a play, -- $y_k$ the observed call/noncall in play $k$, -- $p_k$ the probability of a foul being called in play $k$, -- $i(k)$ the disadvantaged player in play $k$, and by -- $j(k)$ the committing player in play $k$. - -We assume that each disadvantaged player is described by the latent variable: -- $\theta_i$ for $i=1,2,...,N_d$, - -and each committing player is described by the latent variable: -- $b_j$ for $j=1,2,...,N_c$. - -Then we model each observation $y_k$ as the result of an independent Bernoulli trial with probability $p_k$, where - -$$ -p_k =\text{sigmoid}(\eta_k)=\left(1+e^{-\eta_k}\right)^{-1},\quad\text{with}\quad \eta_k=\theta_{i(k)}-b_{j(k)}, -$$ - -for $k=1,2,...,K$, by defining (via a [non-centered parametrisation](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/)) - -\begin{align*} -\theta_{i}&= \sigma_\theta\Delta_{\theta,i}+\mu_\theta\sim \text{Normal}(\mu_\theta,\sigma_\theta^2), &i=1,2,...,N_d,\\ -b_{j}&= \sigma_b\Delta_{b,j}\sim \text{Normal}(0,\sigma_b^2), &j=1,2,...,N_c, -\end{align*} - -with priors/hyperpriors - -\begin{align*} -\Delta_{\theta,i}&\sim \text{Normal}(0,1), &i=1,2,...,N_d,\\ -\Delta_{b,j}&\sim \text{Normal}(0,1), &j=1,2,...,N_c,\\ -\mu_\theta&\sim \text{Normal}(0,100),\\ -\sigma_\theta &\sim \text{HalfCauchy}(2.5),\\ -\sigma_b &\sim \text{HalfCauchy}(2.5). -\end{align*} - -Note that $p_k$ is always dependent on $\mu_\theta,\,\sigma_\theta$ and $\sigma_b$ ("pooled priors") and also depends on the actual players involved in the play due to $\Delta_{\theta,i}$ and $\Delta_{b,j}$ ("unpooled priors"). This means our model features partial pooling. Morover, note that we do not pool $\theta$'s with $b$'s, hence assuming these skills are independent even for the same player. Also, note that we normalised the mean of $b_{j}$ to zero. - -Finally, notice how we worked backwards from our data to construct this model. This is a very natural way to construct a model, allowing us to quickly see how each variable connects to others and their intuition. Meanwhile, when instantiating the model below, the construction goes in the opposite direction, i.e. starting from priors and moving up to the observations. - -### PyMC implementation -We now implement the model above in PyMC. Note that, to easily keep track of the players (as we have hundreds of them being both committing and disadvantaged), we make use of the `coords` argument for {class}`pymc.Model`. 
(For tutorials on this functionality see the notebook {ref}`data_container` or [this blogpost](https://oriolabrilpla.cat/python/arviz/pymc3/xarray/2020/09/22/pymc3-arviz.html).) We choose our priors to be the same as in [Austin Rochford's post](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html), to make the comparison consistent. - -```{code-cell} ipython3 -coords = {"disadvantaged": disadvantaged, "committing": committing} - -with pm.Model(coords=coords) as model: - - # Data - foul_called_observed = pm.Data("foul_called_observed", df.foul_called, mutable=False) - - # Hyperpriors - mu_theta = pm.Normal("mu_theta", 0.0, 100.0) - sigma_theta = pm.HalfCauchy("sigma_theta", 2.5) - sigma_b = pm.HalfCauchy("sigma_b", 2.5) - - # Priors - Delta_theta = pm.Normal("Delta_theta", 0.0, 1.0, dims="disadvantaged") - Delta_b = pm.Normal("Delta_b", 0.0, 1.0, dims="committing") - - # Deterministic - theta = pm.Deterministic("theta", Delta_theta * sigma_theta + mu_theta, dims="disadvantaged") - b = pm.Deterministic("b", Delta_b * sigma_b, dims="committing") - eta = pm.Deterministic("eta", theta[disadvantaged_observed] - b[committing_observed]) - - # Likelihood - y = pm.Bernoulli("y", logit_p=eta, observed=foul_called_observed) -``` - -We now plot our model to show the hierarchical structure (and the non-centered parametrisation) on the variables `theta` and `b`. - -```{code-cell} ipython3 -:tags: [] - -pm.model_to_graphviz(model) -``` - -## Sampling and convergence - -We now sample from our Rasch model. - -```{code-cell} ipython3 -with model: - trace = pm.sample(1000, tune=1500, random_seed=RANDOM_SEED) -``` - -We plot below the energy difference of the obtained trace. Also, we assume our sampler has converged as it passed all automatic PyMC convergence checks. - -```{code-cell} ipython3 -az.plot_energy(trace); -``` - -## Posterior analysis -### Visualisation of partial pooling -Our first check is to plot -- y: the difference between the raw mean probability (from the data) and the posterior mean probability for each disadvantaged and committing player -- x: as a function of the number of observations per disadvantaged and committing player. - -These plots show, as expected, that the hierarchical structure of our model tends to estimate posteriors towards the global mean for players with a low amount of observations. 
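-
-The next cell compares each player's raw foul-call rate with the corresponding posterior mean. Note that `theta` and `b` live on the log-odds scale, so the cell maps them back to probabilities with the inverse-logit (sigmoid) function. The small helper below is our own addition (not part of the original notebook) and just makes that transformation explicit; the plotting cell that follows inlines the same expression:
-
-```{code-cell} ipython3
-# Helper (our addition): map a log-odds value to a probability.
-def inv_logit(eta):
-    """Inverse-logit (sigmoid): converts log-odds to a probability."""
-    return 1 / (1 + np.exp(-eta))
-
-
-# Example: the league-average foul-call probability implied by the posterior mean of mu_theta
-print(inv_logit(trace.posterior["mu_theta"].mean().item()))
-```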
- -```{code-cell} ipython3 -:tags: [] - -# Global posterior means of μ_theta and μ_b -mu_theta_mean, mu_b_mean = trace.posterior["mu_theta"].mean(), 0 -# Raw mean from data of each disadvantaged player -disadvantaged_raw_mean = df.groupby("disadvantaged")["foul_called"].mean() -# Raw mean from data of each committing player -committing_raw_mean = df.groupby("committing")["foul_called"].mean() -# Posterior mean of each disadvantaged player -disadvantaged_posterior_mean = ( - 1 / (1 + np.exp(-trace.posterior["theta"].mean(dim=["chain", "draw"]))).to_pandas() -) -# Posterior mean of each committing player -committing_posterior_mean = ( - 1 - / (1 + np.exp(-(mu_theta_mean - trace.posterior["b"].mean(dim=["chain", "draw"])))).to_pandas() -) - -# Compute difference of raw and posterior mean for each -# disadvantaged and committing player -def diff(a, b): - return a - b - - -df_disadvantaged = pd.DataFrame( - disadvantaged_raw_mean.combine(disadvantaged_posterior_mean, diff), - columns=["Raw - posterior mean"], -) -df_committing = pd.DataFrame( - committing_raw_mean.combine(committing_posterior_mean, diff), columns=["Raw - posterior mean"] -) -# Add the number of observations for each disadvantaged and committing player -df_disadvantaged = df_disadvantaged.assign(obs_disadvantaged=df["disadvantaged"].value_counts()) -df_committing = df_committing.assign(obs_committing=df["committing"].value_counts()) - -# Plot the difference between raw and posterior means as a function of -# the number of observations -fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True) -fig.suptitle( - "Difference of raw and posterior mean of player's foul call probability as " - "\na function of the player's number of observations\n", - fontsize=15, -) -ax1.scatter(data=df_disadvantaged, x="obs_disadvantaged", y="Raw - posterior mean", s=7, marker="o") -ax1.set_title("theta") -ax1.set_ylabel("Raw mean - posterior mean") -ax1.set_xlabel("obs_disadvantaged") -ax2.scatter(data=df_committing, x="obs_committing", y="Raw - posterior mean", s=7) -ax2.set_title("b") -ax2.set_xlabel("obs_committing") -plt.show() -``` - -### Top and bottom committing and disadvantaged players -As we successfully estimated the skills of disadvantaged (`theta`) and committing (`b`) players, we can finally check which players perform better and worse in our model. -So we now plot our posteriors using forest plots. We plot the 10 top and bottom players ranked with respect to the latent skill `theta` and `b`, respectively. 
- -```{code-cell} ipython3 -def order_posterior(inferencedata, var, bottom_bool): - xarray_ = inferencedata.posterior[var].mean(dim=["chain", "draw"]) - return xarray_.sortby(xarray_, ascending=bottom_bool) - - -top_theta, bottom_theta = ( - order_posterior(trace, "theta", False), - order_posterior(trace, "theta", True), -) -top_b, bottom_b = (order_posterior(trace, "b", False), order_posterior(trace, "b", True)) - -amount = 10 # How many top players we want to display in each cathegory - -fig = plt.figure(figsize=(17, 14)) -fig.suptitle( - "\nPosterior estimates for top and bottom disadvantaged (theta) and " - "committing (b) players \n(94% HDI)\n", - fontsize=25, -) -theta_top_ax = fig.add_subplot(221) -b_top_ax = fig.add_subplot(222) -theta_bottom_ax = fig.add_subplot(223, sharex=theta_top_ax) -b_bottom_ax = fig.add_subplot(224, sharex=b_top_ax) - -# theta: plot top -az.plot_forest( - trace, - var_names=["theta"], - combined=True, - coords={"disadvantaged": top_theta["disadvantaged"][:amount]}, - ax=theta_top_ax, - labeller=az.labels.NoVarLabeller(), -) -theta_top_ax.set_title(f"theta: top {amount}") -theta_top_ax.set_xlabel("theta\n") -theta_top_ax.set_xlim(xmin=-2.5, xmax=0.1) -theta_top_ax.vlines(mu_theta_mean, -1, amount, "k", "--", label=("League average")) -theta_top_ax.legend(loc=2) - - -# theta: plot bottom -az.plot_forest( - trace, - var_names=["theta"], - colors="blue", - combined=True, - coords={"disadvantaged": bottom_theta["disadvantaged"][:amount]}, - ax=theta_bottom_ax, - labeller=az.labels.NoVarLabeller(), -) -theta_bottom_ax.set_title(f"theta: bottom {amount}") -theta_bottom_ax.set_xlabel("theta") -theta_bottom_ax.vlines(mu_theta_mean, -1, amount, "k", "--", label=("League average")) -theta_bottom_ax.legend(loc=2) - -# b: plot top -az.plot_forest( - trace, - var_names=["b"], - colors="blue", - combined=True, - coords={"committing": top_b["committing"][:amount]}, - ax=b_top_ax, - labeller=az.labels.NoVarLabeller(), -) -b_top_ax.set_title(f"b: top {amount}") -b_top_ax.set_xlabel("b\n") -b_top_ax.set_xlim(xmin=-1.5, xmax=1.5) -b_top_ax.vlines(0, -1, amount, "k", "--", label="League average") -b_top_ax.legend(loc=2) - -# b: plot bottom -az.plot_forest( - trace, - var_names=["b"], - colors="blue", - combined=True, - coords={"committing": bottom_b["committing"][:amount]}, - ax=b_bottom_ax, - labeller=az.labels.NoVarLabeller(), -) -b_bottom_ax.set_title(f"b: bottom {amount}") -b_bottom_ax.set_xlabel("b") -b_bottom_ax.vlines(0, -1, amount, "k", "--", label="League average") -b_bottom_ax.legend(loc=2) -plt.show(); -``` - -By visiting [Austin Rochford post](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html) and checking the analogous table for the Rasch model there (which uses data from the 2016-17 season), the reader can see that several top players in both skills are still in the top 10 with our larger data set (covering seasons 2015-16 to 2020-21). - -+++ {"tags": []} - -### Discovering extra hierarchical structure - -A natural question to ask is whether players skilled as disadvantaged players (i.e. players with high `theta`) are also likely to be skilled as committing players (i.e. with high `b`), and the other way around. So, the next two plots show the `theta` (resp. `b`) score for the top players with respect to `b` ( resp.`theta`). 
- -```{code-cell} ipython3 -amount = 20 # How many top players we want to display -top_theta_players = top_theta["disadvantaged"][:amount].values -top_b_players = top_b["committing"][:amount].values - -top_theta_in_committing = set(committing).intersection(set(top_theta_players)) -top_b_in_disadvantaged = set(disadvantaged).intersection(set(top_b_players)) -if (len(top_theta_in_committing) < amount) | (len(top_b_in_disadvantaged) < amount): - print( - f"Some players in the top {amount} for theta (or b) do not have observations for b (or theta).\n", - "Plot not shown", - ) -else: - fig = plt.figure(figsize=(17, 14)) - fig.suptitle( - "\nScores as committing (b) for best disadvantaged (theta) players" - " and vice versa" - "\n(94% HDI)\n", - fontsize=25, - ) - b_top_theta = fig.add_subplot(121) - theta_top_b = fig.add_subplot(122) - - az.plot_forest( - trace, - var_names=["b"], - colors="blue", - combined=True, - coords={"committing": top_theta_players}, - figsize=(7, 7), - ax=b_top_theta, - labeller=az.labels.NoVarLabeller(), - ) - b_top_theta.set_title(f"\nb score for top {amount} in theta\n (94% HDI)\n\n", fontsize=17) - b_top_theta.set_xlabel("b") - b_top_theta.vlines(mu_b_mean, -1, amount, color="k", ls="--", label="League average") - b_top_theta.legend(loc="upper right", bbox_to_anchor=(0.46, 1.05)) - - az.plot_forest( - trace, - var_names=["theta"], - colors="blue", - combined=True, - coords={"disadvantaged": top_b_players}, - figsize=(7, 7), - ax=theta_top_b, - labeller=az.labels.NoVarLabeller(), - ) - theta_top_b.set_title(f"\ntheta score for top {amount} in b\n (94% HDI)\n\n", fontsize=17) - theta_top_b.set_xlabel("theta") - theta_top_b.vlines(mu_theta_mean, -1, amount, color="k", ls="--", label="League average") - theta_top_b.legend(loc="upper right", bbox_to_anchor=(0.46, 1.05)); -``` - -These plots suggest that scoring high in `theta` does not correlate with high or low scores in `b`. Moreover, with a little knowledge of NBA basketball, one can visually note that a higher score in `b` is expected from players playing center or forward rather than guards or point guards. -Given the last observation, we decide to plot a histogram for the occurrence of different positions for top disadvantaged (`theta`) and committing (`b`) players. Interestingly, we see below that the largest share of best disadvantaged players are guards, meanwhile, the largest share of best committing players are centers (and at the same time a very small share of guards). 
- -```{code-cell} ipython3 -:tags: [] - -amount = 50 # How many top players we want to display -top_theta_players = top_theta["disadvantaged"][:amount].values -top_b_players = top_b["committing"][:amount].values - -positions = ["C", "C-F", "F-C", "F", "G-F", "G"] - -# Histogram of positions of top disadvantaged players -fig = plt.figure(figsize=(8, 6)) -top_theta_position = fig.add_subplot(121) -df_position.loc[df_position.index.isin(top_theta_players)].position.value_counts().loc[ - positions -].plot.bar(ax=top_theta_position, color="orange", label="theta") -top_theta_position.set_title(f"Positions of top {amount} disadvantaged (theta)\n", fontsize=12) -top_theta_position.legend(loc="upper left") - -# Histogram of positions of top committing players -top_b_position = fig.add_subplot(122, sharey=top_theta_position) -df_position.loc[df_position.index.isin(top_b_players)].position.value_counts().loc[ - positions -].plot.bar(ax=top_b_position, label="b") -top_b_position.set_title(f"Positions of top {amount} committing (b)\n", fontsize=12) -top_b_position.legend(loc="upper right"); -``` - -The histograms above suggest that it might be appropriate to add a hierarchical layer to our model. Namely, group disadvantaged and committing players by the respective positions to account for the role of position in evaluating the latent skills `theta` and `b`. This can be done in our Rasch model by imposing mean and variance hyperpriors for each player grouped by the positions, which is left as an exercise for the reader. To this end, notice that the dataframe `df_orig` is set up precisely to add this hierarchical structure. Have fun! - -A warm thank you goes to [Eric Ma](https://github.com/ericmjl) for many useful comments that improved this notebook. - -+++ {"tags": []} - -## Authors - -* Adapted from Austin Rochford's [blogpost on NBA Foul Calls and Bayesian Item Response Theory](https://www.austinrochford.com/posts/2017-04-04-nba-irt.html) by [Lorenzo Toniazzi](https://github.com/ltoniazzi) on 3 Jul 2021 ([PR181](https://github.com/pymc-devs/pymc-examples/pull/181)) -* Re-executed by [Michael Osthege](https://github.com/michaelosthege) on 10 Jan 2022 ([PR266](https://github.com/pymc-devs/pymc-examples/pull/266)) -* Updated by [Lorenzo Toniazzi](https://github.com/ltoniazzi) on 25 Apr 2022 ([PR309](https://github.com/pymc-devs/pymc-examples/pull/309)) - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -:tags: [] - -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/mediation_analysis.myst.md b/myst_nbs/case_studies/mediation_analysis.myst.md deleted file mode 100644 index e64528a42..000000000 --- a/myst_nbs/case_studies/mediation_analysis.myst.md +++ /dev/null @@ -1,241 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(mediation_analysis)= -# Bayesian mediation analysis - -:::{post} February, 2022 -:tags: mediation, path analysis, regression -:category: beginner -:author: Benjamin T. Vincent -::: - -This notebook covers Bayesian [mediation analysis](https://en.wikipedia.org/wiki/Mediation_(statistics) ). 
This is useful when we want to explore possible mediating pathways between a predictor and an outcome variable. - -It is important to note that the approach to mediation analysis has evolved over time. This notebook was heavily influenced by the approach of {cite:t}`hayes2017introduction` as set out in his textbook "Introduction to Mediation, Moderation and Conditional Process Analysis." - -Readers should be aware that mediation analysis is commonly confused with moderation analysis for which we have a separate example ({ref}`moderation_analysis`). - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import seaborn as sns - -from pandas import DataFrame -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -plt.rcParams.update({"font.size": 14}) -seed = 42 -rng = np.random.default_rng(seed); -``` - -## The mediation model - -The simple mediation model is very simple where $m$ is a linear function of $x$, and $y$ is a linear function of $x$ and $m$: - -$$ -m_i \sim \mathrm{Normal}(i_M + a \cdot x_i, \sigma_M) -$$ - -$$ -y_i \sim \mathrm{Normal}(i_Y + c' \cdot x_i + b \cdot m_i, \sigma_Y) -$$ - -where $i$ indexes each observation (row in the dataset), and $i_M$ and $i_Y$ are intercept parameters. Note that $x_i$, $m_i$, and $y_i$ are observed data. - -![](mediation.png) - -Using definitions from {cite:t}`hayes2017introduction`, we can define a few effects of interest: -- **Direct effect:** is given by $c'$. Two cases that differ by one unit on $x$ but are equal on $m$ are estimated to differ by $c'$ units on $y$. -- **Indirect effect:** is given by $a \cdot b$. Two cases which differ by one unit of $x$ are estimated to differ by $a \cdot b$ units on $y$ as a result of the effect of $x \rightarrow m$ and $m \rightarrow y$. -- **Total effect:** is $c = c' + a \cdot b$ which is simply the sum of the direct and indirect effects. This could be understood as: two cases that differ by one unit on $x$ are estimated to differ by $a \cdot b$ units on $y$ due to the indirect pathway $x \rightarrow m \rightarrow y$, and by $c'$ units due to the direct pathway $x \rightarrow y$. The total effect could also be estimated by evaluating the alternative model $y_i \sim \mathrm{Normal}(i_{Y*} + c \cdot x_i, \sigma_{Y*})$. 
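-
-As a quick check of the algebra (our addition, not part of the original text), substituting the expected value of $m_i$ into the equation for $y_i$ shows where the total effect comes from:
-
-$$
-\mathbb{E}[y_i \mid x_i] = i_Y + c' \cdot x_i + b \cdot (i_M + a \cdot x_i) = (i_Y + b \cdot i_M) + (c' + a \cdot b) \cdot x_i,
-$$
-
-so a regression of $y$ on $x$ alone recovers a slope of $c = c' + a \cdot b$.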
- -+++ - -## Generate simulated data - -```{code-cell} ipython3 -def make_data(): - N = 75 - a, b, cprime = 0.5, 0.6, 0.3 - im, iy, σm, σy = 2.0, 0.0, 0.5, 0.5 - x = rng.normal(loc=0, scale=1, size=N) - m = im + rng.normal(loc=a * x, scale=σm, size=N) - y = iy + (cprime * x) + rng.normal(loc=b * m, scale=σy, size=N) - print(f"True direct effect = {cprime}") - print(f"True indirect effect = {a*b}") - print(f"True total effect = {cprime+a*b}") - return x, m, y - - -x, m, y = make_data() - -sns.pairplot(DataFrame({"x": x, "m": m, "y": y})); -``` - -## Define the PyMC3 model and conduct inference - -```{code-cell} ipython3 -def mediation_model(x, m, y): - with pm.Model() as model: - x = pm.ConstantData("x", x, dims="obs_id") - y = pm.ConstantData("y", y, dims="obs_id") - m = pm.ConstantData("m", m, dims="obs_id") - - # intercept priors - im = pm.Normal("im", mu=0, sigma=10) - iy = pm.Normal("iy", mu=0, sigma=10) - # slope priors - a = pm.Normal("a", mu=0, sigma=10) - b = pm.Normal("b", mu=0, sigma=10) - cprime = pm.Normal("cprime", mu=0, sigma=10) - # noise priors - σm = pm.HalfCauchy("σm", 1) - σy = pm.HalfCauchy("σy", 1) - - # likelihood - pm.Normal("m_likelihood", mu=im + a * x, sigma=σm, observed=m, dims="obs_id") - pm.Normal("y_likelihood", mu=iy + b * m + cprime * x, sigma=σy, observed=y, dims="obs_id") - - # calculate quantities of interest - indirect_effect = pm.Deterministic("indirect effect", a * b) - total_effect = pm.Deterministic("total effect", a * b + cprime) - - return model - - -model = mediation_model(x, m, y) -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -with model: - result = pm.sample(tune=4000, target_accept=0.9, random_seed=42) -``` - -Visualise the trace to check for convergence. - -```{code-cell} ipython3 -az.plot_trace(result) -plt.tight_layout() -``` - -We have good chain mixing and the posteriors for each chain look very similar, so no problems in that regard. - -+++ - -## Visualise the important parameters - -First we will use a pair plot to look at joint posterior distributions. - -```{code-cell} ipython3 -az.plot_pair( - result, - marginals=True, - point_estimate="median", - figsize=(12, 12), - scatter_kwargs={"alpha": 0.05}, - var_names=["a", "b", "cprime", "indirect effect", "total effect"], -); -``` - -## Interpreting the results -We can take a closer look at the indirect, total, and direct effects: - -```{code-cell} ipython3 -ax = az.plot_posterior( - result, - var_names=["cprime", "indirect effect", "total effect"], - ref_val=0, - hdi_prob=0.95, - figsize=(14, 4), -) -ax[0].set(title="direct effect"); -``` - -- The posterior mean **direct effect** is 0.29, meaning that for every 1 unit of increase in $x$, $y$ increases by 0.29 due to the direct effect $x \rightarrow y$. -- The posterior mean **indirect effect** is 0.49, meaning that for every 1 unit of increase in $x$, $y$ increases by 0.49 through the pathway $x \rightarrow m \rightarrow y$. The probability that the indirect effect is zero is infinitesimal. -- The posterior mean **total effect** is 0.77, meaning that for every 1 unit of increase in $x$, $y$ increases by 0.77 through both the direct and indirect pathways. - -+++ - -## Double check with total effect only model -Above, we stated that the total effect could also be estimated by evaluating the alternative model $y_i \sim \mathrm{Normal}(i_{Y*} + c \cdot x_i, \sigma_{Y*})$. 
-Here we will check this by comparing the posterior distribution of the total effect ($c' + a \cdot b$) in the mediation model with the posterior distribution for $c$ in this alternative model.
-
-```{code-cell} ipython3
-with pm.Model() as total_effect_model:
-    _x = pm.ConstantData("_x", x, dims="obs_id")
-    iy = pm.Normal("iy", mu=0, sigma=1)
-    c = pm.Normal("c", mu=0, sigma=1)
-    σy = pm.HalfCauchy("σy", 1)
-    μy = iy + c * _x
-    pm.Normal("yy", mu=μy, sigma=σy, observed=y, dims="obs_id")
-```
-
-```{code-cell} ipython3
-with total_effect_model:
-    total_effect_result = pm.sample(tune=4000, target_accept=0.9, random_seed=42)
-```
-
-```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(14, 4))
-az.plot_posterior(
-    total_effect_result, var_names=["c"], point_estimate=None, hdi_prob="hide", c="r", lw=4, ax=ax
-)
-az.plot_posterior(
-    result, var_names=["total effect"], point_estimate=None, hdi_prob="hide", c="k", lw=4, ax=ax
-);
-```
-
-As we can see, the posterior distribution over the total effect from the mediation model (black curve) and the posterior distribution over $c$ from the total-effect-only model (red curve) are near-identical.
-
-+++
-
-## Parameter estimation versus hypothesis testing
-This notebook has focused on the approach of Bayesian parameter estimation. For many situations this is entirely sufficient, and more information can be found in {cite:t}`yuan2009bayesian`. It will tell us, amongst other things, what our posterior beliefs are in the direct effects, indirect effects, and total effects. And we can use those posterior beliefs to conduct posterior predictive checks to visually check how well the model accounts for the data.
-
-However, depending upon the use case, it may be preferable to test hypotheses about the presence or absence of an indirect effect ($x \rightarrow m \rightarrow y$), for example. In this case, it may be more appropriate to take a more explicit hypothesis testing approach to examine the relative credibility of the mediation model as compared to a simple direct effect model (i.e. $y_i \sim \mathrm{Normal}(i_{Y*} + c \cdot x_i, \sigma_{Y*})$). Readers are referred to {cite:t}`nuijten2015default` for a hypothesis testing approach to Bayesian mediation models and to {cite:t}`kruschke2011bayesian` for more information on parameter estimation versus hypothesis testing.
-
-+++
-
-## Summary
-As stated at the outset, the procedures used in mediation analysis have evolved over time. So there are plenty of people who are not necessarily up to speed with modern best practice. The approach in this notebook sticks to that outlined by {cite:t}`hayes2017introduction`, but it is relevant to be aware of some of this history to avoid confusion - which is particularly important if defending your approach in peer review.
-
-+++
-
-## Authors
-- Authored by Benjamin T. Vincent in August 2021
-- Updated by Benjamin T.
Vincent in February 2022 - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/moderation_analysis.myst.md b/myst_nbs/case_studies/moderation_analysis.myst.md deleted file mode 100644 index a8b78233c..000000000 --- a/myst_nbs/case_studies/moderation_analysis.myst.md +++ /dev/null @@ -1,383 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev-py39 - language: python - name: pymc-dev-py39 ---- - -(moderation_analysis)= -# Bayesian moderation analysis - -:::{post} March, 2022 -:tags: moderation, path analysis, -:category: beginner -:author: Benjamin T. Vincent -::: - -This notebook covers Bayesian [moderation analysis](https://en.wikipedia.org/wiki/Moderation_(statistics)). This is appropriate when we believe that one predictor variable (the moderator) may influence the linear relationship between another predictor variable and an outcome. Here we look at an example where we look at the relationship between hours of training and muscle mass, where it may be that age (the moderating variable) affects this relationship. - -This is not intended as a one-stop solution to a wide variety of data analysis problems, rather, it is intended as an educational exposition to show how moderation analysis works and how to conduct Bayesian parameter estimation in PyMC. - -Note that this is sometimes mixed up with [mediation analysis](https://en.wikipedia.org/wiki/Mediation_(statistics)). Mediation analysis is appropriate when we believe the effect of a predictor variable upon an outcome variable is (partially, or fully) mediated through a 3rd mediating variable. Readers are referred to the textbook by {cite:t}`hayes2017introduction` as a comprehensive (albeit Frequentist) guide to moderation and related models as well as the PyMC example {ref}`mediation_analysis`. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import xarray as xr - -from matplotlib.cm import ScalarMappable -from matplotlib.colors import Normalize -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -%config InlineBackend.figure_format = 'retina' -``` - -First in the (hidden) code cell below, we define some helper functions for plotting that we will use later. - -```{code-cell} ipython3 -:tags: [hide-input] - -def make_scalarMap(m): - """Create a Matplotlib `ScalarMappable` so we can use a consistent colormap across both data points and posterior predictive lines. 
We can use `scalarMap.cmap` to use as a colormap, and `scalarMap.to_rgba(moderator_value)` to grab a colour for a given moderator value.""" - return ScalarMappable(norm=Normalize(vmin=np.min(m), vmax=np.max(m)), cmap="viridis") - - -def plot_data(x, moderator, y, ax=None): - if ax is None: - fig, ax = plt.subplots(1, 1) - else: - fig = plt.gcf() - - h = ax.scatter(x, y, c=moderator, cmap=scalarMap.cmap) - ax.set(xlabel="x", ylabel="y") - # colourbar for moderator - cbar = fig.colorbar(h) - cbar.ax.set_ylabel("moderator") - return ax - - -def posterior_prediction_plot(result, x, moderator, m_quantiles, ax=None): - """Plot posterior predicted `y`""" - if ax is None: - fig, ax = plt.subplots(1, 1) - - post = result.posterior.stack(sample=("chain", "draw")) - xi = xr.DataArray(np.linspace(np.min(x), np.max(x), 20), dims=["x_plot"]) - m_levels = result.constant_data["m"].quantile(m_quantiles).rename({"quantile": "m_level"}) - - for p, m in zip(m_quantiles, m_levels): - y = post.β0 + post.β1 * xi + post.β2 * xi * m + post.β3 * m - region = y.quantile([0.025, 0.5, 0.975], dim="sample") - ax.fill_between( - xi, - region.sel(quantile=0.025), - region.sel(quantile=0.975), - alpha=0.2, - color=scalarMap.to_rgba(m), - edgecolor="w", - ) - ax.plot( - xi, - region.sel(quantile=0.5), - color=scalarMap.to_rgba(m), - linewidth=2, - label=f"{p*100}th percentile of moderator", - ) - - ax.legend(fontsize=9) - ax.set(xlabel="x", ylabel="y") - return ax - - -def plot_moderation_effect(m, m_quantiles, ax=None): - """Spotlight graph""" - - if ax is None: - fig, ax = plt.subplots(1, 1) - - post = result.posterior.stack(sample=("chain", "draw")) - - # calculate 95% CI region and median - xi = xr.DataArray(np.linspace(np.min(m), np.max(m), 20), dims=["x_plot"]) - rate = post.β1 + post.β2 * xi - region = rate.quantile([0.025, 0.5, 0.975], dim="sample") - - ax.fill_between( - xi, - region.sel(quantile=0.025), - region.sel(quantile=0.975), - alpha=0.2, - color="k", - edgecolor="w", - ) - - ax.plot(xi, region.sel(quantile=0.5), color="k", linewidth=2) - - # plot points at each percentile of m - percentile_list = np.array(m_quantiles) * 100 - m_levels = np.percentile(m, percentile_list) - for p, m in zip(percentile_list, m_levels): - ax.plot( - m, - np.mean(post.β1) + np.mean(post.β2) * m, - "o", - c=scalarMap.to_rgba(m), - markersize=10, - label=f"{p}th percentile of moderator", - ) - - ax.legend(fontsize=9) - - ax.set( - title="Spotlight graph", - xlabel="$moderator$", - ylabel=r"$\beta_1 + \beta_2 \cdot moderator$", - ) -``` - -# Does the effect of training upon muscularity decrease with age? - -I've taken inspiration from a blog post {cite:t}`vandenbergSPSS` which examines whether age influences (moderates) the effect of training on muscle percentage. We might speculate that more training results in higher muscle mass, at least for younger people. But it might be the case that the relationship between training and muscle mass changes with age - perhaps training is less effective at increasing muscle mass in older age? - -The schematic box and arrow notation often used to represent moderation is shown by an arrow from the moderating variable to the line between a predictor and an outcome variable. - -![](moderation_figure.png) - -It can be useful to use consistent notation, so we will define: -- $x$ as the main predictor variable. In this example it is training. -- $y$ as the outcome variable. In this example it is muscle percentage. -- $m$ as the moderator. In this example it is age. 
- -## The moderation model - -While the visual schematic (above) is a useful shorthand to understand complex models when you already know what moderation is, you can't derive it from the diagram alone. So let us formally specify the moderation model - it defines an outcome variable $y$ as: - -$$ -y \sim \mathrm{Normal}(\beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m, \sigma^2) -$$ - -where $y$, $x$, and $m$ are your observed data, and the following are the model parameters: -- $\beta_0$ is the intercept, its value does not have that much importance in the interpretation of this model. -- $\beta_1$ is the rate at which $y$ (muscle percentage) increases per unit of $x$ (training hours). -- $\beta_2$ is the coefficient for the interaction term $x \cdot m$. -- $\beta_3$ is the rate at which $y$ (muscle percentage) increases per unit of $m$ (age). -- $\sigma$ is the standard deviation of the observation noise. - -We can see that the mean $y$ is simply a multiple linear regression with an interaction term between the two predictors, $x$ and $m$. - -We can get some insight into why this is the case by thinking about this as a multiple linear regression with $x$ and $m$ as predictor variables, but where the value of $m$ influences the relationship between $x$ and $y$. This is achieved by making the regression coefficient for $x$ is a function of $m$: - -$$ -y \sim \mathrm{Normal}(\beta_0 + f(m) \cdot x + \beta_3 \cdot m, \sigma^2) -$$ - -and if we define that as a linear function, $f(m) = \beta_1 + \beta_2 \cdot m$, we get - -$$ -y \sim \mathrm{Normal}(\beta_0 + (\beta_1 + \beta_2 \cdot m) \cdot x + \beta_3 \cdot m, \sigma^2) -$$ - -We can use $f(m) = \beta_1 + \beta_2 \cdot m$ later to visualise the moderation effect. - -+++ - -## Import data -First, we will load up our example data and do some basic data visualisation. The dataset is taken from {cite:t}`vandenbergSPSS` but it is unclear if this corresponds to real life research data or if it was simulated. - -```{code-cell} ipython3 -def load_data(): - try: - df = pd.read_csv("../data/muscle-percent-males-interaction.csv") - except: - df = pd.read_csv(pm.get_data("muscle-percent-males-interaction.csv")) - - x = df["thours"].values - m = df["age"].values - y = df["mperc"].values - return (x, y, m) - - -x, y, m = load_data() - -# Make a scalar color map for this dataset (Just for plotting, nothing to do with inference) -scalarMap = make_scalarMap(m) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 3, figsize=(14, 3)) - -ax[0].hist(x, alpha=0.5) -ax[0].set(xlabel="training, $x$") - -ax[1].hist(m, alpha=0.5) -ax[1].set(xlabel="age, $m$") - -ax[2].hist(y, alpha=0.5) -ax[2].set(xlabel="muscle percentage, $y$"); -``` - -## Define the PyMC model and conduct inference - -```{code-cell} ipython3 -def model_factory(x, m, y): - with pm.Model() as model: - x = pm.ConstantData("x", x) - m = pm.ConstantData("m", m) - # priors - β0 = pm.Normal("β0", mu=0, sd=10) - β1 = pm.Normal("β1", mu=0, sd=10) - β2 = pm.Normal("β2", mu=0, sd=10) - β3 = pm.Normal("β3", mu=0, sd=10) - σ = pm.HalfCauchy("σ", 1) - # likelihood - y = pm.Normal("y", mu=β0 + (β1 * x) + (β2 * x * m) + (β3 * m), sd=σ, observed=y, dims="obs") - - return model -``` - -```{code-cell} ipython3 -model = model_factory(x, m, y) -``` - -Plot the model graph to confirm it is as intended. 
- -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -with model: - result = pm.sample(draws=1000, tune=1000, random_seed=42, nuts={"target_accept": 0.9}) -``` - -Visualise the trace to check for convergence. - -```{code-cell} ipython3 -az.plot_trace(result); -``` - -We have good chain mixing and the posteriors for each chain look very similar, so no problems in that regard. - -+++ - -## Visualise the important parameters - -First we will use a pair plot to look at joint posterior distributions. This might help us identify any estimation issues with the interaction term (see the discussion below about multicollinearity). - -```{code-cell} ipython3 -az.plot_pair( - result, - marginals=True, - point_estimate="median", - figsize=(12, 12), - scatter_kwargs={"alpha": 0.01}, -); -``` - -And just for the sake of completeness, we can plot the posterior distributions for each of the $\beta$ parameters and use this to arrive at research conclusions. - -```{code-cell} ipython3 -az.plot_posterior(result, var_names=["β1", "β2", "β3"], figsize=(14, 4)); -``` - -For example, from an estimation (in contrast to a hypothesis testing) perspective, we could look at the posterior over $\beta_2$ and claim a credibly less than zero moderation effect. - -+++ - -## Posterior predictive checks -Define a set of quantiles of $m$ that we are interested in visualising. - -```{code-cell} ipython3 -m_quantiles = [0.025, 0.25, 0.5, 0.75, 0.975] -``` - -### Visualisation in data space -Here we will plot the data alongside model posterior predictive checks. This can be a useful visual method of comparing the model predictions against the data. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 6)) -ax = plot_data(x, m, y, ax=ax) -posterior_prediction_plot(result, x, m, m_quantiles, ax=ax) -ax.set_title("Data and posterior prediction"); -``` - -### Spotlight graph -We can also visualise the moderation effect by plotting $\beta_1 + \beta_2 \cdot m$ as a function of the $m$. This was named a spotlight graph, see {cite:t}`spiller2013spotlights` and {cite:t}`mcclelland2017multicollinearity`. - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(10, 5)) -plot_moderation_effect(m, m_quantiles, ax[0]) -az.plot_posterior(result, var_names="β2", ax=ax[1]); -``` - -The expression $\beta_1 + \beta_2 \cdot \text{moderator}$ defines the rate of change of the outcome (muscle percentage) per unit of $x$ (training hours/week). We can see that as age (the moderator) increases, this effect of training hours/week on muscle percentage decreases. - -+++ - -## Related issues: mean centering and multicollinearity - -Readers should be aware that there are issues around mean-centering and multicollinearity. The original [SPSS Moderation Regression Tutorial](https://www.spss-tutorials.com/spss-regression-with-moderation-interaction-effect/) did mean-centre the predictor variables $x$ and $m$. This will have a downstream effect upon the interaction term $x \cdot m$. - -One effect of mean centering is to change the interpretation of the parameter estimates. In this notebook, we did not mean center the variables which will affect the parameter estimates and their interpretation. It is not that one is correct or incorrect, but one must be cognisant of how mean-centering (or not) affects the interpretation of parameter estimates. Readers are again directed to {cite:t}`hayes2017introduction` for a more in-depth consideration of mean-centering in moderation analyses. 
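-
-If you want to see how this plays out in practice, a minimal sketch (our addition, not part of the original analysis) is to re-fit the same model on mean-centred copies of the predictors, reusing the `model_factory` defined above:
-
-```{code-cell} ipython3
-# Sketch only: re-fit the moderation model with mean-centred predictors.
-# `x`, `m`, `y` and `model_factory` are all defined earlier in this notebook.
-x_c = x - x.mean()
-m_c = m - m.mean()
-
-with model_factory(x_c, m_c, y):
-    result_centred = pm.sample(draws=1000, tune=1000, random_seed=42)
-
-# β1 now describes the effect of training at the average age rather than at age zero.
-az.summary(result_centred, var_names=["β1", "β2", "β3"], round_to=2)
-```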
- -Another issue, particularly relevant to moderation analysis is [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity), where one predictor variable is well-described as a linear combination of other predictors. This is clearly the case in moderation analysis as the interaction term $m \cdot x$ is by definition a linear combination of $x$ and $m$. - -{cite:t}`iacobucci2016mean` explored the issues of mean-centering and multicollinearity and conclude: - > When all is said and done, should a researcher mean center the X1 and X2 variables before computing a product term X1X2 to include in a moderated multiple regression? It depends. Mean centering is advisable when: (1) the predictor variables are measured on scales with arbitrary zeros and the researcher seeks to enhance the interpretation of the regression results vis-à-vis the variables’ means rather than the arbitrary zero points, or (2) the research questions involve testing the main effect terms in addition to the interaction term and the researcher seeks to obtain these statistical tests without the interference of the so-called nonessential multicollinearity. On the other hand, mean centering may be bypassed when: (1) the research question involves primarily the test of the interaction term, with no regard for the lower order main effect terms, or (2) the research question involves primarily the assessment of the overall fit of the model, the R2, with no interest in apportioning the explained variability across the predictors, main effects or interaction alike. - -This was critiqued however by {cite:t}`mcclelland2017multicollinearity` who claimed that {cite:t}`iacobucci2016mean` made a number of errors, and that multicollinearity is a red herring: - -> Multicollinearity is irrelevant to the search for moderator variables, contrary to the implications of Iacobucci, Schneider, Popovich, and Bakamitsos (Behavior Research Methods, 2016, this issue). Multicollinearity is like the red herring in a mystery novel that distracts the statistical detective from the pursuit of a true moderator relationship. - -They state: - -> Researchers using MMR [moderated multiple regression] need not compute any multicollinearity diagnostics nor worry about it at all. They need not use mean-centering or the orthogonal transformation or do anything else to avoid the purported problems of multicollinearity. The only purpose of those transformations is to facilitate understanding of MMR models. - -Bearing in mind {cite:t}`mcclelland2017multicollinearity` took a frequentist hypothesis testing (not a Bayesian approach) their take-home points can be paraphrased as: -1. Fit the regression model, $y \sim \mathrm{Normal}(\beta_0 + \beta_1 \cdot x + \beta_2 \cdot x \cdot m + \beta_3 \cdot m, \sigma^2)$, with original (not mean-centred) data. -2. If the main interest is on the moderation effect, then focus upon $\beta_2$. -3. Transformations are useful if conditional relationships are to be highlighted. -4. "... researchers who wish to examine all possible conditional relationships or to help their readers who might want to consider other conditional relationships, should construct the [spotlight] graph..." - -But readers are strongly encouraged to read {cite:t}`mcclelland2017multicollinearity` for more details, as well as the reply from {cite:t}`iacobucci2017mean`. 
Readers should also be aware that there are conflicting opinions and recommendations about mean centering etc in textbooks (see Further Reading below), some of which are published before 2017. None of these textbooks explicitly cite {cite:t}`mcclelland2017multicollinearity`, so it is unclear if the textbook authors are unaware of, agree with, or disagree with {cite:t}`mcclelland2017multicollinearity`. - -## Further reading -- Further information about the 'moderation effect', or what {cite:t}`mcclelland2017multicollinearity` called a spotlight graphs, can be found in {cite:t}`bauer2005probing` and {cite:t}`spiller2013spotlights`. Although these papers take a frequentist (not Bayesian) perspective. -- {cite:t}`zhang2017moderation` compare maximum likelihood and Bayesian methods for moderation analysis with missing predictor variables. -- Multicollinearity, data centering, and linear models with interaction terms are also discussed in a number of prominent Bayesian text books {cite:p}`gelman2013bayesian, gelman2020regression,kruschke2014doing,mcelreath2018statistical`. - -+++ - -## Authors -- Authored by Benjamin T. Vincent in June 2021 -- Updated by Benjamin T. Vincent in March 2022 - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/multilevel_modeling.myst.md b/myst_nbs/case_studies/multilevel_modeling.myst.md deleted file mode 100644 index 61bb342da..000000000 --- a/myst_nbs/case_studies/multilevel_modeling.myst.md +++ /dev/null @@ -1,1199 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GLM-hierarchical)= -(multilevel_modeling)= -# A Primer on Bayesian Methods for Multilevel Modeling - -:::{post} 24 October, 2022 -:tags: hierarchical model, case study, generalized linear model -:category: intermediate -:author: Chris Fonnesbeck, Colin Carroll, Alex Andorra, Oriol Abril, Farhan Reynaldo -::: - -+++ - -Hierarchical or multilevel modeling is a generalization of regression modeling. - -*Multilevel models* are regression models in which the constituent model parameters are given **probability models**. This implies that model parameters are allowed to **vary by group**. - -Observational units are often naturally **clustered**. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters. - -A *hierarchical model* is a particular multilevel model where parameters are nested within one another. - -Some multilevel structures are not hierarchical. - -* e.g. "country" and "year" are not nested, but may represent separate, but overlapping, clusters of parameters - -We will motivate this topic using an environmental epidemiology example. - -+++ - -## Example: Radon contamination {cite:t}`gelman2006data` - -Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household. - -![radon](https://www.cgenarchive.org/uploads/2/5/2/6/25269392/7758459_orig.jpg) - -The EPA did a study of radon levels in 80,000 houses. 
There are two important predictors: - -* measurement in basement or first floor (radon higher in basements) -* county uranium level (positive correlation with radon levels) - -We will focus on modeling radon levels in Minnesota. - -The hierarchy in this example is households within county. - -+++ - -### Data organization - -+++ - -First, we import the data from a local file, and extract Minnesota's data. - -```{code-cell} ipython3 -import os -import warnings - -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns -import xarray as xr - -warnings.filterwarnings("ignore", module="scipy") - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8924 -az.style.use("arviz-darkgrid") -``` - -The original data exists as several independent datasets, which we will import, merge, and process here. First is the data on measurements from individual homes from across the United States. We will extract just the subset from Minnesota. - -```{code-cell} ipython3 -try: - srrs2 = pd.read_csv(os.path.join("..", "data", "srrs2.dat")) -except FileNotFoundError: - srrs2 = pd.read_csv(pm.get_data("srrs2.dat")) - -srrs2.columns = srrs2.columns.map(str.strip) -srrs_mn = srrs2[srrs2.state == "MN"].copy() -``` - -Next, obtain the county-level predictor, uranium, by combining two variables. - -```{code-cell} ipython3 -try: - cty = pd.read_csv(os.path.join("..", "data", "cty.dat")) -except FileNotFoundError: - cty = pd.read_csv(pm.get_data("cty.dat")) - -srrs_mn["fips"] = srrs_mn.stfips * 1000 + srrs_mn.cntyfips -cty_mn = cty[cty.st == "MN"].copy() -cty_mn["fips"] = 1000 * cty_mn.stfips + cty_mn.ctfips -``` - -Use the `merge` method to combine home- and county-level information in a single DataFrame. - -```{code-cell} ipython3 -srrs_mn = srrs_mn.merge(cty_mn[["fips", "Uppm"]], on="fips") -srrs_mn = srrs_mn.drop_duplicates(subset="idnum") -u = np.log(srrs_mn.Uppm).unique() - -n = len(srrs_mn) -``` - -Let's encode the county names and make local copies of the variables we will use. -We also need a lookup table (`dict`) for each unique county, for indexing. - -```{code-cell} ipython3 -srrs_mn.county = srrs_mn.county.map(str.strip) -county, mn_counties = srrs_mn.county.factorize() -srrs_mn["county_code"] = county -radon = srrs_mn.activity -srrs_mn["log_radon"] = log_radon = np.log(radon + 0.1).values -floor_measure = srrs_mn.floor.values -``` - -Distribution of radon levels in MN (log scale): - -```{code-cell} ipython3 -srrs_mn.log_radon.hist(bins=25, grid=False) -plt.xlabel("log(radon)") -plt.ylabel("frequency"); -``` - -## Conventional approaches - -The two conventional alternatives to modeling radon exposure represent the two extremes of the bias-variance tradeoff: - -***Complete pooling***: - -Treat all counties the same, and estimate a single radon level. - -$$y_i = \alpha + \beta x_i + \epsilon_i$$ - -***No pooling***: - -Model radon in each county independently. - -$$y_i = \alpha_{j[i]} + \beta x_i + \epsilon_i$$ - -where $j = 1,\ldots,85$ - -The errors $\epsilon_i$ may represent measurement error, temporal within-house variation, or variation among houses. 
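-
-To make the county index notation concrete, here is a toy illustration (made-up numbers, our addition, not the radon data) of how $\alpha_{j[i]}$ translates into array indexing; the unpooled model later in this notebook uses the same trick:
-
-```{code-cell} ipython3
-# Toy example: one intercept per county, plus a county index for each observation.
-# alpha_demo[county_demo] picks out the intercept alpha_{j[i]} for every observation i.
-alpha_demo = np.array([1.0, 2.0, 3.0])   # intercepts for 3 hypothetical counties
-county_demo = np.array([0, 0, 2, 1, 2])  # county index j[i] for 5 hypothetical observations
-alpha_demo[county_demo]                  # array([1., 1., 3., 2., 3.])
-```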
- -+++ - -Here are the point estimates of the slope and intercept for the complete pooling model: - -```{code-cell} ipython3 -with pm.Model() as pooled_model: - floor_ind = pm.MutableData("floor_ind", floor_measure, dims="obs_id") - - alpha = pm.Normal("alpha", 0, sigma=10) - beta = pm.Normal("beta", mu=0, sigma=10) - sigma = pm.Exponential("sigma", 5) - - theta = alpha + beta * floor_ind - - y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(pooled_model) -``` - -You may be wondering why we are using the `pm.Data` container above even though the variable `floor_ind` is not an observed variable nor a parameter of the model. As you'll see, this will make our lives much easier when we'll plot and diagnose our model.ArviZ will thus include `floor_ind` as a variable in the `constant_data` group of the resulting {ref}`InferenceData ` object. Moreover, including `floor_ind` in the `InferenceData` object makes sharing and reproducing analysis much easier, all the data needed to analyze or rerun the model is stored there. - -+++ - -Before running the model let's do some **prior predictive checks**. - -Indeed, having sensible priors is not only a way to incorporate scientific knowledge into the model, it can also help and make the MCMC machinery faster -- here we are dealing with a simple linear regression, so no link function comes and distorts the outcome space; but one day this will happen to you and you'll need to think hard about your priors to help your MCMC sampler. So, better to train ourselves when it's quite easy than having to learn when it's very hard. - -There is a convenient function for prior predictive sampling in PyMC: - -```{code-cell} ipython3 -with pooled_model: - prior_checks = pm.sample_prior_predictive(random_seed=RANDOM_SEED) -``` - -ArviZ `InferenceData` uses `xarray.Dataset`s under the hood, which give access to several common plotting functions with `.plot`. In this case, we want scatter plot of the mean log radon level (which is stored in variable `a`) for each of the two levels we are considering. If our desired plot is supported by xarray plotting capabilities, we can take advantage of xarray to automatically generate both plot and labels for us. Notice how everything is directly plotted and annotated, the only change we need to do is renaming the y axis label from `a` to `Mean log radon level`. - -```{code-cell} ipython3 -prior = prior_checks.prior.squeeze(drop=True) - -xr.concat((prior["alpha"], prior["alpha"] + prior["beta"]), dim="location").rename( - "log_radon" -).assign_coords(location=["basement", "floor"]).plot.scatter( - x="location", y="log_radon", edgecolors="none" -); -``` - -I'm no radon expert, but before seeing the data, these priors seem to allow for quite a wide range of the mean log radon level, both as measured either in a basement or on a floor. But don't worry, we can always change these priors if sampling gives us hints that they might not be appropriate -- after all, priors are assumptions, not oaths; and as with most assumptions, they can be tested. - -However, we can already think of an improvement: Remember that we stated radon levels tend to be higher in basements, so we could incorporate this prior scientific knowledge into our model by forcing the floor effect (`beta`) to be negative. For now, we will leave the model as is, and trust that the information in the data will be sufficient. - -Speaking of sampling, let's fire up the Bayesian machinery! 
- -```{code-cell} ipython3 -with pooled_model: - pooled_trace = pm.sample(random_seed=RANDOM_SEED) -``` - -No divergences and a sampling that only took seconds! Here the chains look very good (good R hat, good effective sample size, small sd). The model also estimated a negative floor effect, as we expected. - -```{code-cell} ipython3 -az.summary(pooled_trace, round_to=2) -``` - -Let's plot the expected radon levels in basements (`alpha`) and on floors (`alpha + beta`) in relation to the data used to fit the model: - -```{code-cell} ipython3 -post_mean = pooled_trace.posterior.mean(dim=("chain", "draw")) - -plt.scatter(srrs_mn.floor, np.log(srrs_mn.activity + 0.1)) -xvals = xr.DataArray(np.linspace(-0.2, 1.2)) -plt.plot(xvals, post_mean["beta"] * xvals + post_mean["alpha"], "r--"); -``` - -This looks reasonable, though notice that there is a great deal of residual variability in the data. - -Let's now turn our attention to the unpooled model, and see how it fares in comparison. - -```{code-cell} ipython3 -coords = {"county": mn_counties} - -with pm.Model(coords=coords) as unpooled_model: - floor_idx = pm.MutableData("floor_ind", floor_measure, dims="obs_id") - - alpha = pm.Normal("alpha", 0, sigma=10, dims="county") - beta = pm.Normal("beta", 0, sigma=10) - sigma = pm.Exponential("sigma", 1) - - theta = alpha[county] + beta * floor_idx - - y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(unpooled_model) -``` - -```{code-cell} ipython3 -with unpooled_model: - unpooled_trace = pm.sample(random_seed=RANDOM_SEED) -``` - -The sampling was clean here too; let's look at the expected log radon level (the intercept `alpha`) for each county: - -```{code-cell} ipython3 -ax = az.plot_forest( - unpooled_trace, - var_names=["alpha"], - r_hat=True, - combined=True, - figsize=(6, 18), - labeller=az.labels.NoVarLabeller(), -) -ax[0].set_ylabel("alpha"); -``` - -To identify counties with high radon levels, we can plot the ordered mean estimates, as well as their 94% HPD: - -```{code-cell} ipython3 -unpooled_means = unpooled_trace.posterior.mean(dim=("chain", "draw")) -unpooled_hdi = az.hdi(unpooled_trace) - -unpooled_means_iter = unpooled_means.sortby("alpha") -unpooled_hdi_iter = unpooled_hdi.sortby(unpooled_means_iter.alpha) - -_, ax = plt.subplots(figsize=(12, 5)) -xticks = np.arange(0, 86, 6) -unpooled_means_iter.plot.scatter(x="county", y="alpha", ax=ax, alpha=0.8) -ax.vlines( - np.arange(mn_counties.size), - unpooled_hdi_iter.alpha.sel(hdi="lower"), - unpooled_hdi_iter.alpha.sel(hdi="higher"), - color="orange", - alpha=0.6, -) -ax.set(ylabel="Radon estimate", ylim=(-2, 4.5)) -ax.set_xticks(xticks) -ax.set_xticklabels(unpooled_means_iter.county.values[xticks]) -ax.tick_params(rotation=90); -``` - -Now that we have fit both conventional (*i.e.* non-hierarchical) models, let's see how their inferences differ. Here are visual comparisons between the pooled and unpooled estimates for a subset of counties representing a range of sample sizes.
- -```{code-cell} ipython3 -sample_counties = ( - "LAC QUI PARLE", - "AITKIN", - "KOOCHICHING", - "DOUGLAS", - "CLAY", - "STEARNS", - "RAMSEY", - "ST LOUIS", -) - -fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True) -axes = axes.ravel() -m = unpooled_means["beta"] -for i, c in enumerate(sample_counties): - y = srrs_mn.log_radon[srrs_mn.county == c] - x = srrs_mn.floor[srrs_mn.county == c] - axes[i].scatter(x + np.random.randn(len(x)) * 0.01, y, alpha=0.4) - - # No pooling model - b = unpooled_means["alpha"].sel(county=c) - - # Plot both models and data - xvals = xr.DataArray(np.linspace(0, 1)) - axes[i].plot(xvals, m * xvals + b) - axes[i].plot(xvals, post_mean["beta"] * xvals + post_mean["alpha"], "r--") - axes[i].set_xticks([0, 1]) - axes[i].set_xticklabels(["basement", "floor"]) - axes[i].set_ylim(-1, 3) - axes[i].set_title(c) - if not i % 2: - axes[i].set_ylabel("log radon level") -``` - -Neither of these models is satisfactory: - -* If we are trying to identify high-radon counties, pooling is useless -- because, by definition, the pooled model estimates radon at the state-level. In other words, pooling leads to maximal *underfitting*: the variation across counties is not taken into account and only the overall population is estimated. -* We do not trust extreme unpooled estimates produced by models using few observations. This leads to maximal *overfitting*: only the within-county variations are taken into account and the overall population (i.e. the state level, which tells us about similarities across counties) is not estimated. - -This issue is acute for small sample sizes, as seen above: in counties where we have few floor measurements, if radon levels are higher for those data points than for basement ones (Aitkin, Koochiching, Ramsey), the model will estimate that radon levels are higher in floors than basements for these counties. But we shouldn't trust this conclusion, because both scientific knowledge and the situation in other counties tell us that it is usually the reverse (basement radon > floor radon). So unless we have a lot of observations telling us otherwise for a given county, we should be skeptical and shrink our county-estimates to the state-estimates -- in other words, we should balance between cluster-level and population-level information, and the amount of shrinkage will depend on how extreme and how numerous the data in each cluster are. - -Here is where hierarchical models come into play. - -+++ - -## Multilevel and hierarchical models - -When we pool our data, we imply that they are sampled from the same model. This ignores any variation among sampling units (other than sampling variance) -- we assume that counties are all the same: - -![pooled](pooled_model.png) - -When we analyze data unpooled, we imply that they are sampled independently from separate models. At the opposite extreme from the pooled case, this approach claims that differences between sampling units are too large to combine them -- we assume that counties have no similarity whatsoever: - -![unpooled](unpooled_model.png) - -In a hierarchical model, parameters are viewed as a sample from a population distribution of parameters. Thus, we view them as being neither entirely different nor exactly the same. This is ***partial pooling***: - -![hierarchical](partial_pooled_model.png) - -We can use PyMC to easily specify multilevel models, and fit them using Markov chain Monte Carlo.
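Before doing so, it is worth a quick look at how unbalanced the county sample sizes are, since the amount of pooling we will see below depends directly on how much data each county has:

```{code-cell} ipython3
# Distribution of the number of observations per county
srrs_mn.groupby("county")["idnum"].count().describe()
```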
- -+++ - -## Partial pooling model - -The simplest partial pooling model for the household radon dataset is one which simply estimates radon levels, without any predictors at any level. A partial pooling model represents a compromise between the pooled and unpooled extremes, approximately a weighted average (based on sample size) of the unpooled county estimates and the pooled estimates. - -$$\hat{\alpha} \approx \frac{(n_j/\sigma_y^2)\bar{y}_j + (1/\sigma_{\alpha}^2)\bar{y}}{(n_j/\sigma_y^2) + (1/\sigma_{\alpha}^2)}$$ - -Estimates for counties with smaller sample sizes will shrink towards the state-wide average, while those for counties with larger sample sizes will be closer to the unpooled county estimates. - -+++ - -Let's start with the simplest model, which ignores the effect of floor vs. basement measurement. - -```{code-cell} ipython3 -with pm.Model(coords=coords) as partial_pooling: - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # Priors - mu_a = pm.Normal("mu_a", mu=0.0, sigma=10) - sigma_a = pm.Exponential("sigma_a", 1) - - # Random intercepts - alpha = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, dims="county") - - # Model error - sigma_y = pm.Exponential("sigma_y", 1) - - # Expected value - y_hat = alpha[county_idx] - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(partial_pooling) -``` - -```{code-cell} ipython3 -with partial_pooling: - partial_pooling_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) -``` - -```{code-cell} ipython3 -N_county = srrs_mn.groupby("county")["idnum"].count().values - -fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True) -for ax, trace, level in zip( - axes, - (unpooled_trace, partial_pooling_trace), - ("no pooling", "partial pooling"), -): - - # add variable with x values to xarray dataset - trace.posterior = trace.posterior.assign_coords({"N_county": ("county", N_county)}) - # plot means - trace.posterior.mean(dim=("chain", "draw")).plot.scatter( - x="N_county", y="alpha", ax=ax, alpha=0.9 - ) - ax.hlines( - partial_pooling_trace.posterior.alpha.mean(), - 0.9, - max(N_county) + 1, - alpha=0.4, - ls="--", - label="Est. population mean", - ) - - # plot hdi - hdi = az.hdi(trace).alpha - ax.vlines(N_county, hdi.sel(hdi="lower"), hdi.sel(hdi="higher"), color="orange", alpha=0.5) - - ax.set( - title=f"{level.title()} Estimates", - xlabel="Nbr obs in county (log scale)", - xscale="log", - ylabel="Log radon", - ) - ax.legend(fontsize=10) -``` - -Notice the difference between the unpooled and partially-pooled estimates, particularly at smaller sample sizes: As expected, the former are both more extreme and more imprecise. Indeed, in the partially-pooled model, estimates in small-sample-size counties are informed by the population parameters -- hence more precise estimates. Moreover, the smaller the sample size, the more regression towards the overall mean (the dashed gray line) -- hence less extreme estimates. In other words, the model is skeptical of extreme deviations from the population mean in counties where data is sparse. This is known as **shrinkage**. - -+++ - -Now let's go back and integrate the `floor` predictor, but allowing the intercept to vary by county. - -## Varying intercept model - -This model allows intercepts to vary across county, according to a random effect. 
- -$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$ - -where - -$$\epsilon_i \sim N(0, \sigma_y^2)$$ - -and the intercept random effect: - -$$\alpha_{j[i]} \sim N(\mu_{\alpha}, \sigma_{\alpha}^2)$$ - -As with the the “no-pooling” model, we set a separate intercept for each county, but rather than fitting separate least squares regression models for each county, multilevel modeling **shares strength** among counties, allowing for more reasonable inference in counties with little data. - -```{code-cell} ipython3 -with pm.Model(coords=coords) as varying_intercept: - floor_idx = pm.MutableData("floor_idx", floor_measure, dims="obs_id") - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # Priors - mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0) - sigma_a = pm.Exponential("sigma_a", 1) - - # Random intercepts - alpha = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, dims="county") - # Common slope - beta = pm.Normal("beta", mu=0.0, sigma=10.0) - - # Model error - sd_y = pm.Exponential("sd_y", 1) - - # Expected value - y_hat = alpha[county_idx] + beta * floor_idx - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sd_y, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(varying_intercept) -``` - -```{code-cell} ipython3 -with varying_intercept: - varying_intercept_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) -``` - -```{code-cell} ipython3 -ax = pm.plot_forest( - varying_intercept_trace, - var_names=["alpha"], - figsize=(6, 18), - combined=True, - r_hat=True, - labeller=az.labels.NoVarLabeller(), -) -ax[0].set_ylabel("alpha") -``` - -```{code-cell} ipython3 -pm.plot_posterior(varying_intercept_trace, var_names=["sigma_a", "beta"]); -``` - -The estimate for the `floor` coefficient is approximately -0.66, which can be interpreted as houses without basements having about half ($\exp(-0.66) = 0.52$) the radon levels of those with basements, after accounting for county. - -```{code-cell} ipython3 -az.summary(varying_intercept_trace, var_names=["beta"]) -``` - -```{code-cell} ipython3 -xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]}) -post = varying_intercept_trace.posterior # alias for readability -theta = ( - (post.alpha + post.beta * xvals).mean(dim=("chain", "draw")).to_dataset(name="Mean log radon") -) - -_, ax = plt.subplots() -theta.plot.scatter(x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax) # scatter -ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2) -# add lines too -ax.set_title("MEAN LOG RADON BY COUNTY"); -``` - -It is easy to show that the partial pooling model provides more objectively reasonable estimates than either the pooled or unpooled models, at least for counties with small sample sizes. 
- -```{code-cell} ipython3 -sample_counties = ( - "LAC QUI PARLE", - "AITKIN", - "KOOCHICHING", - "DOUGLAS", - "CLAY", - "STEARNS", - "RAMSEY", - "ST LOUIS", -) - -fig, axes = plt.subplots(2, 4, figsize=(12, 6), sharey=True, sharex=True) -axes = axes.ravel() -m = unpooled_means["beta"] -for i, c in enumerate(sample_counties): - y = srrs_mn.log_radon[srrs_mn.county == c] - x = srrs_mn.floor[srrs_mn.county == c] - axes[i].scatter(x + np.random.randn(len(x)) * 0.01, y, alpha=0.4) - - # No pooling model - b = unpooled_means["alpha"].sel(county=c) - - # Plot both models and data - xvals = xr.DataArray(np.linspace(0, 1)) - axes[i].plot(xvals, m.values * xvals + b.values) - axes[i].plot(xvals, post_mean["beta"] * xvals + post_mean["alpha"], "r--") - - post = varying_intercept_trace.posterior.sel(county=c).mean(dim=("chain", "draw")) - theta = post.alpha.values + post.beta.values * xvals - axes[i].plot(xvals, theta, "k:") - axes[i].set_xticks([0, 1]) - axes[i].set_xticklabels(["basement", "floor"]) - axes[i].set_ylim(-1, 3) - axes[i].set_title(c) - if not i % 2: - axes[i].set_ylabel("log radon level") -``` - -## Varying intercept and slope model - -The most general model allows both the intercept and slope to vary by county: - -$$y_i = \alpha_{j[i]} + \beta_{j[i]} x_{i} + \epsilon_i$$ - -```{code-cell} ipython3 -with pm.Model(coords=coords) as varying_intercept_slope: - floor_idx = pm.MutableData("floor_idx", floor_measure, dims="obs_id") - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # Priors - mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0) - sigma_a = pm.Exponential("sigma_a", 1) - - mu_b = pm.Normal("mu_b", mu=0.0, sigma=10.0) - sigma_b = pm.Exponential("sigma_b", 1) - - # Random intercepts - alpha = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, dims="county") - # Random slopes - beta = pm.Normal("beta", mu=mu_b, sigma=sigma_b, dims="county") - - # Model error - sigma_y = pm.Exponential("sigma_y", 1) - - # Expected value - y_hat = alpha[county_idx] + beta[county_idx] * floor_idx - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(varying_intercept_slope) -``` - -```{code-cell} ipython3 -with varying_intercept_slope: - varying_intercept_slope_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) -``` - -Notice that the trace of this model includes divergences, which can be problematic depending on where and how frequently they occur. These can occur in some hierarchical models, and they can be avoided by using the **non-centered parametrization**. - -+++ - -## Non-centered Parameterization - -The partial pooling model specified above uses a **centered** parameterization of the slope random effect. That is, the individual county effects are distributed around a group-level mean, with a spread controlled by the hierarchical standard deviation parameter. As the preceding plot reveals, this constraint serves to **shrink** county estimates toward the overall mean, and the shrinkage is stronger for counties with smaller sample sizes. This is exactly what we want, and the model appears to fit well -- the Gelman-Rubin statistics are essentially 1. - -But, on closer inspection, there are signs of trouble.
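A first, very quick check is simply to count the divergent transitions the sampler recorded:

```{code-cell} ipython3
# Total number of divergent transitions across all chains
int(varying_intercept_slope_trace.sample_stats["diverging"].sum())
```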
Specifically, let's look at the trace of the random effects, and their corresponding standard deviation: - -```{code-cell} ipython3 -fig, axs = plt.subplots(nrows=2) -axs[0].plot(varying_intercept_slope_trace.posterior.sel(chain=0)["sigma_b"], alpha=0.5) -axs[0].set(ylabel="sigma_b") -axs[1].plot(varying_intercept_slope_trace.posterior.sel(chain=0)["beta"], alpha=0.05) -axs[1].set(ylabel="beta"); -``` - -Notice that when the chain reaches the lower end of the parameter space for $\sigma_b$, it appears to get "stuck" and the entire sampler, including the random slopes `b`, mixes poorly. - -Jointly plotting the random effect variance and one of the individual random slopes demonstrates what is going on. - -```{code-cell} ipython3 -ax = az.plot_pair( - varying_intercept_slope_trace, - var_names=["beta", "sigma_b"], - coords=dict(county="AITKIN"), - marginals=True, - # marginal_kwargs={"kind": "hist"}, -) -ax[1, 0].set_ylim(0, 0.7); -``` - -When the group variance is small, this implies that the individual random slopes are themselves close to the group mean. This results in a *funnel*-shaped relationship between the samples of group variance and any of the slopes (particularly those with a smaller sample size). - -In itself, this is not a problem, since this is the behavior we expect. However, if the sampler is tuned for the wider (unconstrained) part of the parameter space, it has trouble in the areas of higher curvature. The consequence of this is that the neighborhood close to the lower bound of $\sigma_b$ is sampled poorly; indeed, in our chain it is not sampled at all below 0.1. The result of this will be biased inference. - -Now that we've spotted the problem, what can we do about it? The best way to deal with this issue is to reparameterize our model. Notice the random slopes in this version: - -```{code-cell} ipython3 -with pm.Model(coords=coords) as varying_intercept_slope_noncentered: - floor_idx = pm.MutableData("floor_idx", floor_measure, dims="obs_id") - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # Priors - mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0) - sigma_a = pm.Exponential("sigma_a", 5) - - # Non-centered random intercepts - # Centered: a = pm.Normal('a', mu_a, sigma=sigma_a, shape=counties) - z_a = pm.Normal("z_a", mu=0, sigma=1, dims="county") - alpha = pm.Deterministic("alpha", mu_a + z_a * sigma_a, dims="county") - - mu_b = pm.Normal("mu_b", mu=0.0, sigma=10.0) - sigma_b = pm.Exponential("sigma_b", 5) - - # Non-centered random slopes - z_b = pm.Normal("z_b", mu=0, sigma=1, dims="county") - beta = pm.Deterministic("beta", mu_b + z_b * sigma_b, dims="county") - - # Model error - sigma_y = pm.Exponential("sigma_y", 5) - - # Expected value - y_hat = alpha[county_idx] + beta[county_idx] * floor_idx - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(varying_intercept_slope_noncentered) -``` - -This is a [**non-centered** parameterization](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/). By this, we mean that the random deviates are no longer explicitly modeled as being centered on $\mu_b$. Instead, they are independent standard normals $\upsilon$, which are then scaled by the appropriate value of $\sigma_b$, before being location-transformed by the mean. - -This model samples much better. 
- -```{code-cell} ipython3 -with varying_intercept_slope_noncentered: - noncentered_trace = pm.sample(tune=3000, target_accept=0.95, random_seed=RANDOM_SEED) -``` - -Notice that the bottlenecks in the traces are gone. - -```{code-cell} ipython3 -fig, axs = plt.subplots(nrows=2) -axs[0].plot(noncentered_trace.posterior.sel(chain=0)["sigma_b"], alpha=0.5) -axs[0].set(ylabel="sigma_b") -axs[1].plot(noncentered_trace.posterior.sel(chain=0)["beta"], alpha=0.05) -axs[1].set(ylabel="beta"); -``` - -And correspondingly, the low end of the posterior distribution of the slope random effect variance can now be sampled efficiently. - -```{code-cell} ipython3 -ax = az.plot_pair( - noncentered_trace, - var_names=["beta", "sigma_b"], - coords=dict(county="AITKIN"), - marginals=True, - # marginal_kwargs={"kind": "hist"}, -) -ax[1, 0].set_ylim(0, 0.7); -``` - -As a result, we are now fully exploring the support of the posterior. This results in less bias in these parameters. - -```{code-cell} ipython3 -fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, constrained_layout=True) -az.plot_posterior(varying_intercept_slope_trace, var_names=["sigma_b"], ax=ax1) -az.plot_posterior(noncentered_trace, var_names=["sigma_b"], ax=ax2) -ax1.set_title("Centered (top) and non-centered (bottom)"); -``` - -Notice that `sigma_b` now has a lot of density near zero, which would indicate that counties don't vary that much in their answer to the `floor` "treatment". - -This was the problem with the original parameterization: the sampler has difficulty with the geometry of the posterior distribution when the values of the slope random effects are so different for standard deviations very close to zero compared to when they are positive. However, even with the non-centered model the sampler is not that comfortable with `sigma_b`: in fact, if you look at the estimates with `az.summary` you'll see that the number of effective samples is quite low for `sigma_b`. - -Also note that `sigma_a` is not that big either -- i.e. counties do differ in their baseline radon levels, but not by a lot. However, we don't have much of a problem sampling from this distribution because it's much narrower than `sigma_b` and doesn't get dangerously close to 0. - -```{code-cell} ipython3 -az.summary(noncentered_trace, var_names=["sigma_a", "sigma_b"]) -``` - -To wrap up this model, let's plot the relationship between radon and floor for each county: - -```{code-cell} ipython3 -xvals = xr.DataArray([0, 1], dims="Level", coords={"Level": ["Basement", "Floor"]}) -post = noncentered_trace.posterior # alias for readability -theta = ( - (post.alpha + post.beta * xvals).mean(dim=("chain", "draw")).to_dataset(name="Mean log radon") -) - -_, ax = plt.subplots() -theta.plot.scatter(x="Level", y="Mean log radon", alpha=0.2, color="k", ax=ax) # scatter -ax.plot(xvals, theta["Mean log radon"].T, "k-", alpha=0.2) -# add lines too -ax.set_title("MEAN LOG RADON BY COUNTY"); -``` - -Thus, while both the intercept and the slope vary by county, there is far less variation in the slope. - -But wait, there is more! We can (and maybe should) take into account the covariation between intercepts and slopes: when baseline radon is low in a given county, maybe that means the difference between floor and basement measurements will decrease -- because there isn't that much radon anyway. That would translate into a positive correlation between `alpha` and `beta`, and adding that into our model would make even more efficient use of the available data.
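Before building that model, we can get a rough, purely descriptive sense of whether such a correlation is plausible by correlating the posterior-mean county intercepts and slopes from the non-centered model. This ignores the uncertainty around each estimate, so treat it as a heuristic only:

```{code-cell} ipython3
# Heuristic check: correlation between posterior-mean county intercepts and slopes
a_means = noncentered_trace.posterior["alpha"].mean(dim=("chain", "draw"))
b_means = noncentered_trace.posterior["beta"].mean(dim=("chain", "draw"))
np.corrcoef(a_means, b_means)[0, 1]
```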
- -To model this correlation, we'll use a multivariate Normal distribution instead of two different Normals for `alpha` and `beta`. This simply means that each county's parameters come from a common distribution with one mean for intercepts and one for slopes (stored together in `mu_alpha_beta` below), and slopes and intercepts co-vary according to the covariance matrix $\Sigma$. In mathematical form: - -$$y \sim Normal(\theta, \sigma)$$ - -$$\theta = \alpha + \beta \times floor$$ - -$$\begin{bmatrix} \alpha \\ \beta \end{bmatrix} \sim MvNormal(\begin{bmatrix} \mu_{\alpha} \\ \mu_{\beta} \end{bmatrix}, \Sigma)$$ - -$$\Sigma = \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} P \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix}$$ - -where $\alpha$ and $\beta$ are the county intercepts and slopes, $\mu_{\alpha}$ and $\mu_{\beta}$ their respective means, $\sigma_{\alpha}$ and $\sigma_{\beta}$ the variation in intercepts and slopes across counties, and $P$ the correlation matrix of intercepts and slopes. In this case, as there is only one slope, $P$ contains only one relevant figure: the correlation between $\alpha$ and $\beta$. - -This translates quite easily into PyMC: - -```{code-cell} ipython3 -coords["param"] = ["alpha", "beta"] -coords["param_bis"] = ["alpha", "beta"] -with pm.Model(coords=coords) as covariation_intercept_slope: - - floor_idx = pm.MutableData("floor_idx", floor_measure, dims="obs_id") - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # prior stddev in intercepts & slopes (variation across counties): - sd_dist = pm.Exponential.dist(0.5, shape=(2,)) - - # get back standard deviations and rho: - chol, corr, stds = pm.LKJCholeskyCov("chol", n=2, eta=2.0, sd_dist=sd_dist) - - # priors for average intercept and slope: - mu_alpha_beta = pm.Normal("mu_alpha_beta", mu=0.0, sigma=5.0, shape=2) - - # population of varying effects: - alpha_beta_county = pm.MvNormal( - "alpha_beta_county", mu=mu_alpha_beta, chol=chol, dims=("county", "param") - ) - - # Expected value per county: - theta = alpha_beta_county[county_idx, 0] + alpha_beta_county[county_idx, 1] * floor_idx - # Model error: - sigma = pm.Exponential("sigma", 1.0) - - y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id") -``` - -This is by far the most complex model we've done so far, so the model code is correspondingly complex. The main complication is the use of the `LKJCholeskyCov` distribution, which places a prior on the Cholesky factor of the covariance matrix and makes sampling much easier than working with the covariance matrix directly. - -As you may expect, we also want to non-center the random effects here. This again results in a `Deterministic` operation, which here multiplies the Cholesky factor of the covariance matrix with independent standard normals.
- -```{code-cell} ipython3 -coords["param"] = ["alpha", "beta"] -coords["param_bis"] = ["alpha", "beta"] -with pm.Model(coords=coords) as covariation_intercept_slope: - - floor_idx = pm.MutableData("floor_idx", floor_measure, dims="obs_id") - county_idx = pm.MutableData("county_idx", county, dims="obs_id") - - # prior stddev in intercepts & slopes (variation across counties): - sd_dist = pm.Exponential.dist(0.5, shape=(2,)) - - # get back standard deviations and rho: - chol, corr, stds = pm.LKJCholeskyCov("chol", n=2, eta=2.0, sd_dist=sd_dist) - - # priors for average intercept and slope: - mu_alpha_beta = pm.Normal("mu_alpha_beta", mu=0.0, sigma=5.0, shape=2) - - # population of varying effects: - z = pm.Normal("z", 0.0, 1.0, dims=("param", "county")) - alpha_beta_county = pm.Deterministic( - "alpha_beta_county", at.dot(chol, z).T, dims=("county", "param") - ) - - # Expected value per county: - theta = ( - mu_alpha_beta[0] - + alpha_beta_county[county_idx, 0] - + (mu_alpha_beta[1] + alpha_beta_county[county_idx, 1]) * floor_idx - ) - - # Model error: - sigma = pm.Exponential("sigma", 1.0) - - y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id") - - covariation_intercept_slope_trace = pm.sample( - 1000, - tune=3000, - target_accept=0.95, - idata_kwargs={"dims": {"chol_stds": ["param"], "chol_corr": ["param", "param_bis"]}}, - ) -``` - -```{code-cell} ipython3 -az.plot_trace( - covariation_intercept_slope_trace, - var_names=["~z", "~chol", "~chol_corr"], - compact=True, - chain_prop={"ls": "-"}, -); -``` - -```{code-cell} ipython3 -az.plot_trace( - covariation_intercept_slope_trace, - var_names="chol_corr", - lines=[("chol_corr", {}, 0.0)], - compact=True, - chain_prop={"ls": "-"}, - coords={ - "param": xr.DataArray(["alpha"], dims=["pointwise_sel"]), - "param_bis": xr.DataArray(["beta"], dims=["pointwise_sel"]), - }, -); -``` - -So the correlation between slopes and intercepts seems to be negative: when the county intercept increases, the county slope tends to decrease. In other words, when basement radon in a county gets bigger, the difference with floor radon tends to get bigger too (because floor readings get smaller while basement readings get bigger). But again, the uncertainty is wide that it's possible the correlation goes the other way around or is simply close to zero. - -And how much variation is there across counties? It's not easy to read `sigma_ab` above, so let's do a forest plot and compare the estimates with the model that doesn't include the covariation between slopes and intercepts: - -```{code-cell} ipython3 -az.plot_forest( - [varying_intercept_slope_trace, covariation_intercept_slope_trace], - model_names=["No covariation", "With covariation"], - var_names=["mu_a", "mu_b", "mu_alpha_beta", "sigma_a", "sigma_b", "chol_stds", "chol_corr"], - combined=True, - figsize=(8, 6), -); -``` - -The estimates are very close to each other, both for the means and the standard deviations. But remember, the information given by the correlation is only seen at the county level: in theory it uses even more information from the data to get an even more informed pooling of information for all county parameters. 
So let's visually compare estimates of both models at the county level: - -```{code-cell} ipython3 -# posterior means of covariation model: -a_county_cov = ( - covariation_intercept_slope_trace.posterior["mu_alpha_beta"][..., 0] - + covariation_intercept_slope_trace.posterior["alpha_beta_county"].sel(param="alpha") -).mean(dim=("chain", "draw")) -b_county_cov = ( - covariation_intercept_slope_trace.posterior["mu_alpha_beta"][..., 1] - + covariation_intercept_slope_trace.posterior["alpha_beta_county"].sel(param="beta") -).mean(dim=("chain", "draw")) - -# plot both and connect with lines -avg_a_county = noncentered_trace.posterior["alpha"].mean(dim=("chain", "draw")) -avg_b_county = noncentered_trace.posterior["beta"].mean(dim=("chain", "draw")) -plt.scatter(avg_a_county, avg_b_county, label="No cov estimates", alpha=0.6) -plt.scatter( - a_county_cov, - b_county_cov, - facecolors="none", - edgecolors="k", - lw=1, - label="With cov estimates", - alpha=0.8, -) -plt.plot([avg_a_county, a_county_cov], [avg_b_county, b_county_cov], "k-", alpha=0.5) -plt.xlabel("Intercept") -plt.ylabel("Slope") -plt.legend(); -``` - -The negative correlation is somewhat clear here: when the intercept increases, the slope decreases. So we understand why the model put most of the posterior weight into negative territory for the correlation term. Nevertheless, the model gives a non-trivial posterior probability to the possibility that the correlation could in fact be zero or positive. - -Interestingly, the differences between both models occur at extreme slope and intercept values. This is because the second model used the slightly negative correlation between intercepts and slopes to adjust their estimates: when intercepts are *larger* (smaller) than average, the model pushes *down* (up) the associated slopes. - -Globally, there is a lot of agreement here: modeling the correlation didn’t change inference that much. We already saw that radon levels tended to be lower in floors than basements, and when we checked the posterior distributions of the average effects (`alpha` and `beta`) and standard deviations, we noticed that they were almost identical. But on average the model with covariation will be more accurate -- because it squeezes additional information from the data, to shrink estimates in both dimensions. - -+++ - -## Adding group-level predictors - -A primary strength of multilevel models is the ability to handle predictors on multiple levels simultaneously. If we consider the varying-intercepts model above: - -$$y_i = \alpha_{j[i]} + \beta x_{i} + \epsilon_i$$ - -we may, instead of a simple random effect to describe variation in the expected radon value, specify another regression model with a county-level covariate. Here, we use the county uranium reading $u_j$, which is thought to be related to radon levels: - -$$\alpha_j = \gamma_0 + \gamma_1 u_j + \zeta_j$$ - -$$\zeta_j \sim N(0, \sigma_{\alpha}^2)$$ - -Thus, we are now incorporating a house-level predictor (floor or basement) as well as a county-level predictor (uranium). - -Note that the model has both indicator variables for each county, plus a county-level covariate. In classical regression, this would result in collinearity. In a multilevel model, the partial pooling of the intercepts towards the expected value of the group-level linear model avoids this. - -Group-level predictors also serve to reduce group-level variation, $\sigma_{\alpha}$ (here it would be the variation across counties, `sigma_a`). 
An important implication of this is that adding a group-level predictor induces stronger pooling -- by definition, a smaller $\sigma_{\alpha}$ means stronger shrinkage of the county parameters towards the overall state mean. - -This is fairly straightforward to implement in PyMC -- we just add another level: - -```{code-cell} ipython3 -with pm.Model(coords=coords) as hierarchical_intercept: - - # Priors - sigma_a = pm.HalfCauchy("sigma_a", 5) - - # County uranium model - gamma_0 = pm.Normal("gamma_0", mu=0.0, sigma=10.0) - gamma_1 = pm.Normal("gamma_1", mu=0.0, sigma=10.0) - - # Uranium model for intercept - mu_a = pm.Deterministic("mu_a", gamma_0 + gamma_1 * u) - # County variation not explained by uranium - epsilon_a = pm.Normal("epsilon_a", mu=0, sigma=1, dims="county") - alpha = pm.Deterministic("alpha", mu_a + sigma_a * epsilon_a, dims="county") - - # Common slope - beta = pm.Normal("beta", mu=0.0, sigma=10.0) - - # Model error - sigma_y = pm.Uniform("sigma_y", lower=0, upper=100) - - # Expected value - y_hat = alpha[county] + beta * floor_measure - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=log_radon) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(hierarchical_intercept) -``` - -Do you see the new level, with `gamma_0`, `gamma_1` and `sigma_a`, which feeds the linear model for the county intercepts `alpha`? - -```{code-cell} ipython3 -with hierarchical_intercept: - hierarchical_intercept_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) -``` - -```{code-cell} ipython3 -uranium = u -post = hierarchical_intercept_trace.posterior.assign_coords(uranium=uranium) -avg_a = post["mu_a"].mean(dim=("chain", "draw")).values[np.argsort(uranium)] -avg_a_county = post["alpha"].mean(dim=("chain", "draw")) -avg_a_county_hdi = az.hdi(post, var_names="alpha")["alpha"] - -_, ax = plt.subplots() -ax.plot(uranium[np.argsort(uranium)], avg_a, "k--", alpha=0.6, label="Mean intercept") -az.plot_hdi( - uranium, - post["alpha"], - fill_kwargs={"alpha": 0.1, "color": "k", "label": "Mean intercept HPD"}, - ax=ax, -) -ax.scatter(uranium, avg_a_county, alpha=0.8, label="Mean county-intercept") -ax.vlines( - uranium, - avg_a_county_hdi.sel(hdi="lower"), - avg_a_county_hdi.sel(hdi="higher"), - alpha=0.5, - color="orange", -) -plt.xlabel("County-level uranium") -plt.ylabel("Intercept estimate") -plt.legend(fontsize=9); -``` - -Uranium is indeed strongly associated with baseline radon levels in each county. The graph above shows the average relationship and its uncertainty: the baseline radon level in an average county as a function of uranium, as well as the 94% HPD of this radon level (dashed line and envelope). The blue points and orange bars represent the relationship between baseline radon and uranium, but now for each county. As you see, the uncertainty is bigger now, because it adds on top of the average uncertainty -- each county has its idiosyncrasies after all. - -Let's compare the county intercepts from this model with those of the varying-intercept model that did not include the county-level covariate: - -```{code-cell} ipython3 -labeller = az.labels.mix_labellers((az.labels.NoVarLabeller, az.labels.NoModelLabeller)) -ax = az.plot_forest( - [varying_intercept_trace, hierarchical_intercept_trace], - model_names=["W/o.
county pred.", "With county pred."], - var_names=["alpha"], - combined=True, - figsize=(6, 40), - textsize=9, - labeller=labeller(), -) -ax[0].set_ylabel("alpha"); -``` - -We see that the compatibility intervals are narrower for the model including the county-level covariate. This is expected, as the effect of a covariate is to reduce the variation in the outcome variable -- provided the covariate is of predictive value. More importantly, with this model we were able to squeeze even more information out of the data. - -+++ - -### Correlations among levels - -In some instances, having predictors at multiple levels can reveal correlation between individual-level variables and group residuals. We can account for this by including the average of the individual predictors as a covariate in the model for the group intercept. - -$$\alpha_j = \gamma_0 + \gamma_1 u_j + \gamma_2 \bar{x} + \zeta_j$$ - -These are broadly referred to as ***contextual effects***. - -To add these effects to our model, let's create a new variable containing the mean of `floor` in each county and add that to our previous model: - -```{code-cell} ipython3 -# Create new variable for mean of floor across counties -avg_floor_data = srrs_mn.groupby("county")["floor"].mean().values -``` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as contextual_effect: - floor_idx = pm.Data("floor_idx", floor_measure, mutable=True) - county_idx = pm.Data("county_idx", county, mutable=True) - y = pm.Data("y", log_radon, mutable=True) - - # Priors - sigma_a = pm.HalfCauchy("sigma_a", 5) - - # County uranium model for slope - gamma = pm.Normal("gamma", mu=0.0, sigma=10, shape=3) - - # Uranium model for intercept - mu_a = pm.Deterministic("mu_a", gamma[0] + gamma[1] * u + gamma[2] * avg_floor_data) - - # County variation not explained by uranium - epsilon_a = pm.Normal("epsilon_a", mu=0, sigma=1, dims="county") - alpha = pm.Deterministic("alpha", mu_a + sigma_a * epsilon_a) - - # Common slope - beta = pm.Normal("beta", mu=0.0, sigma=10) - - # Model error - sigma_y = pm.Uniform("sigma_y", lower=0, upper=100) - - # Expected value - y_hat = alpha[county_idx] + beta * floor_idx - - # Data likelihood - y_like = pm.Normal("y_like", mu=y_hat, sigma=sigma_y, observed=y) -``` - -```{code-cell} ipython3 -with contextual_effect: - contextual_effect_trace = pm.sample(tune=2000, random_seed=RANDOM_SEED) -``` - -```{code-cell} ipython3 -az.summary(contextual_effect_trace, var_names="gamma", round_to=2) -``` - -So we might infer from this that counties with higher proportions of houses without basements tend to have higher baseline levels of radon. This seems to be new, as up to this point we saw that `floor` was *negatively* associated with radon levels. But remember this was at the household-level: radon tends to be higher in houses with basements. But at the county-level it seems that the less basements on average in the county, the more radon. So it's not that contradictory. What's more, the estimate for $\gamma_2$ is quite uncertain and overlaps with zero, so it's possible that the relationship is not that strong. And finally, let's note that $\gamma_2$ estimates something else than uranium's effect, as this is already taken into account by $\gamma_1$ -- it answers the question "once we know uranium level in the county, is there any value in learning about the proportion of houses without basements?". 
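To put a number on that uncertainty, we can compute the posterior probability that $\gamma_2$ (the last entry of the `gamma` vector) is positive directly from the trace:

```{code-cell} ipython3
# Posterior probability that the county-level floor effect (gamma_2) is positive
gamma_post = contextual_effect_trace.posterior["gamma"].values  # shape: (chain, draw, 3)
print(f"P(gamma_2 > 0) = {(gamma_post[..., 2] > 0).mean():.2f}")
```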
- -All of this is to say that we shouldn't interpret this causally: there is no credible mechanism by which a basement (or absence thereof) *causes* radon emissions. More probably, our causal graph is missing something: a confounding variable, one that influences both basement construction and radon levels, is lurking somewhere in the dark... Perhaps is it the type of soil, which might influence what type of structures are built *and* the level of radon? Maybe adding this to our model would help with causal inference. - -+++ - -### Prediction - -{cite:t}`gelman2006multilevel` used cross-validation tests to check the prediction error of the unpooled, pooled, and partially-pooled models - -**root mean squared cross-validation prediction errors**: - -* unpooled = 0.86 -* pooled = 0.84 -* multilevel = 0.79 - -There are two types of prediction that can be made in a multilevel model: - -1. a new individual within an existing group -2. a new individual within a new group - -For example, if we wanted to make a prediction for a new house with no basement in St. Louis and Kanabec counties, we just need to sample from the radon model with the appropriate intercept. - -+++ - -That is, - -$$\tilde{y}_i \sim N(\alpha_{69} + \beta (x_i=1), \sigma_y^2)$$ - -Because we judiciously set the county index and floor values as shared variables earlier, we can modify them directly to the desired values (69 and 1 respectively) and sample corresponding posterior predictions, without having to redefine and recompile our model. Using the model just above: - -```{code-cell} ipython3 -prediction_coords = {"obs_id": ["ST LOUIS", "KANABEC"]} -with contextual_effect: - pm.set_data({"county_idx": np.array([69, 31]), "floor_idx": np.array([1, 1]), "y": np.ones(2)}) - stl_pred = pm.sample_posterior_predictive(contextual_effect_trace.posterior) - -contextual_effect_trace.extend(stl_pred) -``` - -```{code-cell} ipython3 -az.plot_posterior(contextual_effect_trace, group="posterior_predictive"); -``` - -## Benefits of Multilevel Models - -- Accounting for natural hierarchical structure of observational data. - -- Estimation of coefficients for (under-represented) groups. - -- Incorporating individual- and group-level information when estimating group-level coefficients. - -- Allowing for variation among individual-level coefficients across groups. - -As an alternative approach to hierarchical modeling for this problem, check out a [geospatial approach](https://www.pymc-labs.io/blog-posts/spatial-gaussian-process-01/) to modeling radon levels. 
- -## References - -:::{bibliography} -:filter: docname in docnames - -mcelreath2018statistical -::: - -+++ - -## Authors - -* Authored by Chris Fonnesbeck in May, 2017 ([pymc#2124](https://github.com/pymc-devs/pymc/pull/2124)) -* Updated by Colin Carroll in June, 2018 ([pymc#3049](https://github.com/pymc-devs/pymc/pull/3049)) -* Updated by Alex Andorra in January, 2020 ([pymc#3765](https://github.com/pymc-devs/pymc/pull/3765)) -* Updated by Oriol Abril in June, 2020 ([pymc#3963](https://github.com/pymc-devs/pymc/pull/3963)) -* Updated by Farhan Reynaldo in November 2021 ([pymc-examples#246](https://github.com/pymc-devs/pymc-examples/pull/246)) -* Updated by Chris Fonnesbeck in Februry 2022 ([pymc-examples#285](https://github.com/pymc-devs/pymc-examples/pull/285)) -* Updated by Chris Fonnesbeck in November 2022 ([pymc-examples#468](https://github.com/pymc-devs/pymc-examples/pull/468)) -* Updated by Oriol Abril in November 2022 ([pymc-examples#473](https://github.com/pymc-devs/pymc-examples/pull/473)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/probabilistic_matrix_factorization.myst.md b/myst_nbs/case_studies/probabilistic_matrix_factorization.myst.md deleted file mode 100644 index cd6dde4b2..000000000 --- a/myst_nbs/case_studies/probabilistic_matrix_factorization.myst.md +++ /dev/null @@ -1,783 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(probabilistic_matrix_factorization)= -# Probabilistic Matrix Factorization for Making Personalized Recommendations - -:::{post} June 3, 2022 -:tags: case study, product recommendation, matrix factorization -:category: intermediate -:author: Ruslan Salakhutdinov, Andriy Mnih, Mack Sweeney, Colin Carroll, Rob Zinkov -::: - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import xarray as xr -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## Motivation - -So you are browsing for something to watch on Netflix and just not liking the suggestions. You just know you can do better. All you need to do is collect some ratings data from yourself and friends and build a recommendation algorithm. This notebook will guide you in doing just that! - -We'll start out by getting some intuition for how our model will work. Then we'll formalize our intuition. Afterwards, we'll examine the dataset we are going to use. Once we have some notion of what our data looks like, we'll define some baseline methods for predicting preferences for movies. Following that, we'll look at Probabilistic Matrix Factorization (PMF), which is a more sophisticated Bayesian method for predicting preferences. Having detailed the PMF model, we'll use PyMC for MAP estimation and MCMC inference. Finally, we'll compare the results obtained with PMF to those obtained from our baseline methods and discuss the outcome. - -## Intuition - -Normally if we want recommendations for something, we try to find people who are similar to us and ask their opinions. 
If Bob, Alice, and Monty are all similar to me, and they all like crime dramas, I'll probably like crime dramas. Now this isn't always true. It depends on what we consider to be "similar". In order to get the best bang for our buck, we really want to look for people who have the most similar taste. Taste being a complex beast, we'd probably like to break it down into something more understandable. We might try to characterize each movie in terms of various factors. Perhaps films can be moody, light-hearted, cinematic, dialogue-heavy, big-budget, etc. Now imagine we go through IMDB and assign each movie a rating in each of the categories. How moody is it? How much dialogue does it have? What's its budget? Perhaps we use numbers between 0 and 1 for each category. Intuitively, we might call this the film's profile. - -Now let's suppose we go back to those 5 movies we rated. At this point, we can get a richer picture of our own preferences by looking at the film profiles of each of the movies we liked and didn't like. Perhaps we take the averages across the 5 film profiles and call this our ideal type of film. In other words, we have computed some notion of our inherent _preferences_ for various types of movies. Suppose Bob, Alice, and Monty all do the same. Now we can compare our preferences and determine how similar each of us really are. I might find that Bob is the most similar and the other two are still more similar than other people, but not as much as Bob. So I want recommendations from all three people, but when I make my final decision, I'm going to put more weight on Bob's recommendation than those I get from Alice and Monty. - -While the above procedure sounds fairly effective as is, it also reveals an unexpected additional source of information. If we rated a particular movie highly, and we know its film profile, we can compare with the profiles of other movies. If we find one with very close numbers, it is probable we'll also enjoy this movie. Both this approach and the one above are commonly known as _neighborhood approaches_. Techniques that leverage both of these approaches simultaneously are often called _collaborative filtering_ {cite:p}`koren2009matrixfactorization`. The first approach we talked about uses user-user similarity, while the second uses item-item similarity. Ideally, we'd like to use both sources of information. The idea is we have a lot of items available to us, and we'd like to work together with others to filter the list of items down to those we'll each like best. My list should have the items I'll like best at the top and those I'll like least at the bottom. Everyone else wants the same. If I get together with a bunch of other people, we all watch 5 movies, and we have some efficient computational process to determine similarity, we can very quickly order the movies to our liking. - -## Formalization - -Let's take some time to make the intuitive notions we've been discussing more concrete. We have a set of $M$ movies, or _items_ ($M = 100$ in our example above). We also have $N$ people, whom we'll call _users_ of our recommender system. For each item, we'd like to find a $D$ dimensional factor composition (film profile above) to describe the item. Ideally, we'd like to do this without actually going through and manually labeling all of the movies. Manual labeling would be both slow and error-prone, as different people will likely label movies differently. So we model each movie as a $D$ dimensional vector, which is its latent factor composition. 
Furthermore, we expect each user to have some preferences, but without our manual labeling and averaging procedure, we have to rely on the latent factor compositions to learn $D$ dimensional latent preference vectors for each user. The only thing we get to observe is the $N \times M$ ratings matrix $R$ provided by the users. Entry $R_{ij}$ is the rating user $i$ gave to item $j$. Many of these entries may be missing, since most users will not have rated all 100 movies. Our goal is to fill in the missing values with predicted ratings based on the latent variables $U$ and $V$. We denote the predicted ratings by $R_{ij}^*$. We also define an indicator matrix $I$, with entry $I_{ij} = 0$ if $R_{ij}$ is missing and $I_{ij} = 1$ otherwise. - -So we have an $N \times D$ matrix of user preferences which we'll call $U$ and an $M \times D$ factor composition matrix we'll call $V$. We also have a $N \times M$ rating matrix we'll call $R$. We can think of each row $U_i$ as indications of how much each user prefers each of the $D$ latent factors. Each row $V_j$ can be thought of as how much each item can be described by each of the latent factors. In order to make a recommendation, we need a suitable prediction function which maps a user preference vector $U_i$ and an item latent factor vector $V_j$ to a predicted ranking. The choice of this prediction function is an important modeling decision, and a variety of prediction functions have been used. Perhaps the most common is the dot product of the two vectors, $U_i \cdot V_j$ {cite:p}`koren2009matrixfactorization`. - -To better understand CF techniques, let us explore a particular example. Imagine we are seeking to recommend movies using a model which infers five latent factors, $V_j$, for $j = 1,2,3,4,5$. In reality, the latent factors are often unexplainable in a straightforward manner, and most models make no attempt to understand what information is being captured by each factor. However, for the purposes of explanation, let us assume the five latent factors might end up capturing the film profile we were discussing above. So our five latent factors are: moody, light-hearted, cinematic, dialogue, and budget. Then for a particular user $i$, imagine we infer a preference vector $U_i = <0.5, 0.1, 1.5, 1.1, 0.3>$. Also, for a particular item $j$, we infer these values for the latent factors: $V_j = <0.5, 1.5, 1.25, 0.8, 0.9>$. Using the dot product as the prediction function, we would calculate 3.425 as the ranking for that item, which is more or less a neutral preference given our 1 to 5 rating scale. - -$$ 0.5 \times 0.5 + 0.1 \times 1.5 + 1.5 \times 1.25 + 1.1 \times 0.8 + 0.3 \times 0.9 = 3.425 $$ - -+++ - -## Data - -The MovieLens 100k dataset {cite:p}`harper2015movielens` was collected by the GroupLens Research Project at the University of Minnesota. This data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies. Each user rated at least 20 movies, and be have basic information on the users (age, gender, occupation, zip). Each movie includes basic information like title, release date, video release date, and genre. We will implement a model that is suitable for collaborative filtering on this data and evaluate it in terms of root mean squared error (RMSE) to validate the results. - -The data was collected through the [MovieLens website](https://movielens.org/) during the seven-month period from September 19th, -1997 through April 22nd, 1998. 
This data has been cleaned up - users -who had less than 20 ratings or did not have complete demographic -information were removed from this data set. - - -Let's begin by exploring our data. We want to get a general feel for what it looks like and a sense for what sort of patterns it might contain. Here are the user rating data: - -```{code-cell} ipython3 -data_kwargs = dict(sep="\t", names=["userid", "itemid", "rating", "timestamp"]) -try: - data = pd.read_csv("../data/ml_100k_u.data", **data_kwargs) -except FileNotFoundError: - data = pd.read_csv(pm.get_data("ml_100k_u.data"), **data_kwargs) - -data.head() -``` - -And here is the movie detail data: - -```{code-cell} ipython3 -# fmt: off -movie_columns = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', - 'unknown','Action','Adventure', 'Animation',"Children's", 'Comedy', 'Crime', - 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', - 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'] -# fmt: on - -item_kwargs = dict(sep="|", names=movie_columns, index_col="movie id", parse_dates=["release date"]) -try: - movies = pd.read_csv("../data/ml_100k_u.item", **item_kwargs) -except FileNotFoundError: - movies = pd.read_csv(pm.get_data("ml_100k_u.item"), **item_kwargs) - -movies.head() -``` - -```{code-cell} ipython3 -# Plot histogram of ratings -data.groupby("rating").size().plot(kind="bar"); -``` - -```{code-cell} ipython3 -data.rating.describe() -``` - -This must be a decent batch of movies. From our exploration above, we know most ratings are in the range 3 to 5, and positive ratings are more likely than negative ratings. Let's look at the means for each movie to see if we have any particularly good (or bad) movie here. - -```{code-cell} ipython3 -movie_means = data.join(movies["movie title"], on="itemid").groupby("movie title").rating.mean() -movie_means[:50].plot(kind="bar", grid=False, figsize=(16, 6), title="Mean ratings for 50 movies"); -``` - -While the majority of the movies generally get positive feedback from users, there are definitely a few that stand out as bad. Let's take a look at the worst and best movies, just for fun: - -```{code-cell} ipython3 -fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 4), sharey=True) -movie_means.nlargest(30).plot(kind="bar", ax=ax1, title="Top 30 movies in data set") -movie_means.nsmallest(30).plot(kind="bar", ax=ax2, title="Bottom 30 movies in data set"); -``` - -Make sense to me. We now know there are definite popularity differences between the movies. Some of them are simply better than others, and some are downright lousy. Looking at the movie means allowed us to discover these general trends. Perhaps there are similar trends across users. It might be the case that some users are simply more easily entertained than others. Let's take a look. - -```{code-cell} ipython3 -user_means = data.groupby("userid").rating.mean().sort_values() -_, ax = plt.subplots(figsize=(16, 6)) -ax.plot(np.arange(len(user_means)), user_means.values, "k-") - -ax.fill_between(np.arange(len(user_means)), user_means.values, alpha=0.3) -ax.set_xticklabels("") -# 1000 labels is nonsensical -ax.set_ylabel("Rating") -ax.set_xlabel(f"{len(user_means)} average ratings per user") -ax.set_ylim(0, 5) -ax.set_xlim(0, len(user_means)); -``` - -We see even more significant trends here. Some users rate nearly everything highly, and some (though not as many) rate nearly everything negatively. 
These observations will come in handy when considering models to use for predicting user preferences on unseen movies. - -+++ - -## Methods - -Having explored the data, we're now ready to dig in and start addressing the problem. We want to predict how much each user is going to like all of the movies he or she has not yet seen. - - -### Baselines - -Every good analysis needs some kind of baseline methods to compare against. It's difficult to claim we've produced good results if we have no reference point for what defines "good". We'll define three very simple baseline methods and find the RMSE using these methods. Our goal will be to obtain lower RMSE scores with whatever model we produce. - -#### Uniform Random Baseline - -Our first baseline is about as dead stupid as you can get. Every place we see a missing value in $R$, we'll simply fill it with a number drawn uniformly at random in the range [1, 5]. We expect this method to do the worst by far. - -$$R_{ij}^* \sim \text{Uniform}(1, 5)$$ - -#### Global Mean Baseline - -This method is only slightly better than the last. Wherever we have a missing value, we'll fill it in with the mean of all observed ratings. - -$$\text{global_mean} = \frac{\sum_{i=1}^N \sum_{j=1}^M I_{ij} R_{ij}}{\sum_{i=1}^N \sum_{j=1}^M I_{ij}}$$ - -$$R_{ij}^* = \text{global_mean}$$ - -#### Mean of Means Baseline - -Now we're going to start getting a bit smarter. We imagine some users might be easily amused, and inclined to rate all movies more highly. Other users might be the opposite. Additionally, some movies might simply be more witty than others, so all users might rate some movies more highly than others in general. We can clearly see this in our graph of the movie means above. We'll attempt to capture these general trends through per-user and per-movie rating means. We'll also incorporate the global mean to smooth things out a bit. So if we see a missing value in cell $R_{ij}$, we'll average the global mean with the mean of user $i$'s observed ratings and the mean of movie $j$'s observed ratings and use that value to fill it in. - -$$\text{user_means}_i = \frac{\sum_{j=1}^M I_{ij} R_{ij}}{\sum_{j=1}^M I_{ij}}$$ - -$$\text{movie_means}_j = \frac{\sum_{i=1}^N I_{ij} R_{ij}}{\sum_{i=1}^N I_{ij}}$$ - -$$R_{ij}^* = \frac{1}{3} \left(\text{user_means}_i + \text{movie_means}_j + \text{global_mean} \right)$$ - -```{code-cell} ipython3 -# Create a base class with scaffolding for our 3 baselines. - - -def split_title(title): - """Change "BaselineMethod" to "Baseline Method".""" - words = [] - tmp = [title[0]] - for c in title[1:]: - if c.isupper(): - words.append("".join(tmp)) - tmp = [c] - else: - tmp.append(c) - words.append("".join(tmp)) - return " ".join(words) - - -class Baseline: - """Calculate baseline predictions.""" - - def __init__(self, train_data): - """Simple heuristic-based transductive learning to fill in missing - values in data matrix.""" - self.predict(train_data.copy()) - - def predict(self, train_data): - raise NotImplementedError("baseline prediction not implemented for base class") - - def rmse(self, test_data): - """Calculate root mean squared error for predictions on test data.""" - return rmse(test_data, self.predicted) - - def __str__(self): - return split_title(self.__class__.__name__) - - -# Implement the 3 baselines.
- - -class UniformRandomBaseline(Baseline): - """Fill missing values with uniform random values.""" - - def predict(self, train_data): - nan_mask = np.isnan(train_data) - masked_train = np.ma.masked_array(train_data, nan_mask) - pmin, pmax = masked_train.min(), masked_train.max() - N = nan_mask.sum() - train_data[nan_mask] = rng.uniform(pmin, pmax, N) - self.predicted = train_data - - -class GlobalMeanBaseline(Baseline): - """Fill in missing values using the global mean.""" - - def predict(self, train_data): - nan_mask = np.isnan(train_data) - train_data[nan_mask] = train_data[~nan_mask].mean() - self.predicted = train_data - - -class MeanOfMeansBaseline(Baseline): - """Fill in missing values using mean of user/item/global means.""" - - def predict(self, train_data): - nan_mask = np.isnan(train_data) - masked_train = np.ma.masked_array(train_data, nan_mask) - global_mean = masked_train.mean() - user_means = masked_train.mean(axis=1) - item_means = masked_train.mean(axis=0) - self.predicted = train_data.copy() - n, m = train_data.shape - for i in range(n): - for j in range(m): - if np.ma.isMA(item_means[j]): - self.predicted[i, j] = np.mean((global_mean, user_means[i])) - else: - self.predicted[i, j] = np.mean((global_mean, user_means[i], item_means[j])) - - -baseline_methods = {} -baseline_methods["ur"] = UniformRandomBaseline -baseline_methods["gm"] = GlobalMeanBaseline -baseline_methods["mom"] = MeanOfMeansBaseline -``` - -```{code-cell} ipython3 -num_users = data.userid.unique().shape[0] -num_items = data.itemid.unique().shape[0] -sparsity = 1 - len(data) / (num_users * num_items) -print(f"Users: {num_users}\nMovies: {num_items}\nSparsity: {sparsity}") - -dense_data = data.pivot(index="userid", columns="itemid", values="rating").values -``` - -## Probabilistic Matrix Factorization - -Probabilistic Matrix Factorization {cite:p}`mnih2008advances` is a probabilistic approach to the collaborative filtering problem that takes a Bayesian perspective. The ratings $R$ are modeled as draws from a Gaussian distribution. The mean for $R_{ij}$ is $U_i V_j^T$. The precision $\alpha$ is a fixed parameter that reflects the uncertainty of the estimations; the normal distribution is commonly reparameterized in terms of precision, which is the inverse of the variance. Complexity is controlled by placing zero-mean spherical Gaussian priors on $U$ and $V$. In other words, each row of $U$ is drawn from a multivariate Gaussian with mean $\mu = 0$ and precision which is some multiple of the identity matrix $I$. Those multiples are $\alpha_U$ for $U$ and $\alpha_V$ for $V$. So our model is defined by: - -$\newcommand\given[1][]{\:#1\vert\:}$ - -$$ -P(R \given U, V, \alpha^2) = - \prod_{i=1}^N \prod_{j=1}^M - \left[ \mathcal{N}(R_{ij} \given U_i V_j^T, \alpha^{-1}) \right]^{I_{ij}} -$$ - -$$ -P(U \given \alpha_U^2) = - \prod_{i=1}^N \mathcal{N}(U_i \given 0, \alpha_U^{-1} \boldsymbol{I}) -$$ - -$$ -P(V \given \alpha_V^2) = - \prod_{j=1}^M \mathcal{N}(V_j \given 0, \alpha_V^{-1} \boldsymbol{I}) -$$ - -Given sufficiently large precision parameters (that is, small prior variances), the priors on $U$ and $V$ ensure our latent variables do not grow too far from 0. This prevents overly strong user preferences and item factor compositions from being learned. This is commonly known as complexity control, where the complexity of the model here is measured by the magnitude of the latent variables. Controlling complexity like this helps prevent overfitting, which allows the model to generalize better for unseen data.
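To build some intuition for this complexity control, here is a small illustrative sketch (not part of the original analysis; the precision values and number of draws are arbitrary) showing that a larger prior precision $\alpha_U$ shrinks the typical magnitude of the rows of $U$:

```python
import numpy as np

rng_demo = np.random.default_rng(42)
dim = 10  # illustrative number of latent factors

for alpha_u in (0.1, 10.0):
    # Each row U_i ~ N(0, alpha_u^{-1} I); the precision is the inverse variance,
    # so a larger alpha_u pulls the rows of U more tightly toward 0.
    U_rows = rng_demo.normal(0.0, np.sqrt(1.0 / alpha_u), size=(1_000, dim))
    print(f"alpha_U = {alpha_u:>4}: mean row norm = {np.linalg.norm(U_rows, axis=1).mean():.2f}")
```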
We must also choose an appropriate $\alpha$ value for the normal distribution for $R$. So the challenge becomes choosing appropriate values for $\alpha_U$, $\alpha_V$, and $\alpha$. This challenge can be tackled with the soft weight-sharing methods discussed by {cite:t}`nowlan1992simplifying`. However, for the purposes of this analysis, we will stick to using point estimates obtained from our data. - -```{code-cell} ipython3 -import logging -import time - -import aesara -import scipy as sp - -# Enable on-the-fly graph computations, but ignore -# absence of intermediate test values. -aesara.config.compute_test_value = "ignore" - -# Set up logging. -logger = logging.getLogger() -logger.setLevel(logging.INFO) - - -class PMF: - """Probabilistic Matrix Factorization model using pymc.""" - - def __init__(self, train, dim, alpha=2, std=0.01, bounds=(1, 5)): - """Build the Probabilistic Matrix Factorization model using pymc. - - :param np.ndarray train: The training data to use for learning the model. - :param int dim: Dimensionality of the model; number of latent factors. - :param int alpha: Fixed precision for the likelihood function. - :param float std: Amount of noise to use for model initialization. - :param (tuple of int) bounds: (lower, upper) bound of ratings. - These bounds will simply be used to cap the estimates produced for R. - - """ - self.dim = dim - self.alpha = alpha - self.std = np.sqrt(1.0 / alpha) - self.bounds = bounds - self.data = train.copy() - n, m = self.data.shape - - # Perform mean value imputation - nan_mask = np.isnan(self.data) - self.data[nan_mask] = self.data[~nan_mask].mean() - - # Low precision reflects uncertainty; prevents overfitting. - # Set to the mean variance across users and items. - self.alpha_u = 1 / self.data.var(axis=1).mean() - self.alpha_v = 1 / self.data.var(axis=0).mean() - - # Specify the model. - logging.info("building the PMF model") - with pm.Model( - coords={ - "users": np.arange(n), - "movies": np.arange(m), - "latent_factors": np.arange(dim), - "obs_id": np.arange(self.data[~nan_mask].shape[0]), - } - ) as pmf: - U = pm.MvNormal( - "U", - mu=0, - tau=self.alpha_u * np.eye(dim), - dims=("users", "latent_factors"), - initval=rng.standard_normal(size=(n, dim)) * std, - ) - V = pm.MvNormal( - "V", - mu=0, - tau=self.alpha_v * np.eye(dim), - dims=("movies", "latent_factors"), - initval=rng.standard_normal(size=(m, dim)) * std, - ) - R = pm.Normal( - "R", - mu=(U @ V.T)[~nan_mask], - tau=self.alpha, - dims="obs_id", - observed=self.data[~nan_mask], - ) - - logging.info("done building the PMF model") - self.model = pmf - - def __str__(self): - return self.name -``` - -We'll also need functions for calculating the MAP and performing sampling on our PMF model. When the observation noise variance $\alpha$ and the prior variances $\alpha_U$ and $\alpha_V$ are all kept fixed, maximizing the log posterior is equivalent to minimizing the sum-of-squared-errors objective function with quadratic regularization terms. - -$$ E = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^M I_{ij} (R_{ij} - U_i V_j^T)^2 + \frac{\lambda_U}{2} \sum_{i=1}^N \|U\|_{Fro}^2 + \frac{\lambda_V}{2} \sum_{j=1}^M \|V\|_{Fro}^2, $$ - -where $\lambda_U = \alpha_U / \alpha$, $\lambda_V = \alpha_V / \alpha$, and $\|\cdot\|_{Fro}^2$ denotes the Frobenius norm {cite:p}`mnih2008advances`. Minimizing this objective function gives a local minimum, which is essentially a maximum a posteriori (MAP) estimate. 
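As a rough sketch (not part of the original analysis), this objective could be evaluated in NumPy as follows, assuming hypothetical arrays: `R` holds the ratings with `np.nan` in the unobserved cells, and `U` and `V` are candidate factor matrices:

```python
import numpy as np


def map_objective(R, U, V, lambda_u, lambda_v):
    """Regularized sum-of-squared-errors; its minimizer is the MAP estimate."""
    observed = ~np.isnan(R)  # the indicator I_ij of observed ratings
    resid = np.where(observed, R - U @ V.T, 0.0)  # zero out unobserved cells
    sse = 0.5 * np.sum(resid**2)
    penalty = 0.5 * lambda_u * np.sum(U**2) + 0.5 * lambda_v * np.sum(V**2)
    return sse + penalty
```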
While it is possible to use a fast Stochastic Gradient Descent procedure to find this MAP, we'll be finding it using the utilities built into `pymc`. In particular, we'll use `find_MAP` with Powell optimization (`scipy.optimize.fmin_powell`). Having found this MAP estimate, we can use it as our starting point for MCMC sampling. - -Since it is a reasonably complex model, we expect the MAP estimation to take some time. So let's save it after we've found it. Note that we define a function for finding the MAP below, assuming it will receive a namespace with some variables in it. Then we attach that function to the PMF class, where it will have such a namespace after initialization. The PMF class is defined in pieces this way so I can say a few things between each piece to make it clearer. - -```{code-cell} ipython3 -def _find_map(self): - """Find mode of posterior using L-BFGS-B optimization.""" - tstart = time.time() - with self.model: - logging.info("finding PMF MAP using L-BFGS-B optimization...") - self._map = pm.find_MAP(method="L-BFGS-B") - - elapsed = int(time.time() - tstart) - logging.info("found PMF MAP in %d seconds" % elapsed) - return self._map - - -def _map(self): - try: - return self._map - except: - return self.find_map() - - -# Update our class with the new MAP infrastructure. -PMF.find_map = _find_map -PMF.map = property(_map) -``` - -So now our PMF class has a `map` `property` which will either be found using Powell optimization or loaded from a previous optimization. Once we have the MAP, we can use it as a starting point for our MCMC sampler. We'll need a sampling function in order to draw MCMC samples to approximate the posterior distribution of the PMF model. - -```{code-cell} ipython3 -# Draw MCMC samples. -def _draw_samples(self, **kwargs): - # kwargs.setdefault("chains", 1) - with self.model: - self.trace = pm.sample(**kwargs) - - -# Update our class with the sampling infrastructure. -PMF.draw_samples = _draw_samples -``` - -We could define some kind of default trace property like we did for the MAP, but that would mean using possibly nonsensical values for `nsamples` and `cores`. Better to leave it as a non-optional call to `draw_samples`. Finally, we'll need a function to make predictions using our inferred values for $U$ and $V$. For user $i$ and movie $j$, a prediction is generated by drawing from $\mathcal{N}(U_i V_j^T, \alpha)$. To generate predictions from the sampler, we generate an $R$ matrix for each $U$ and $V$ sampled, then we combine these by averaging over the $K$ samples. - -$$ -P(R_{ij}^* \given R, \alpha, \alpha_U, \alpha_V) \approx - \frac{1}{K} \sum_{k=1}^K \mathcal{N}(U_i V_j^T, \alpha) -$$ - -We'll want to inspect the individual $R$ matrices before averaging them for diagnostic purposes. So we'll write code for the averaging piece during evaluation. The function below simply draws an $R$ matrix given a $U$ and $V$ and the fixed $\alpha$ stored in the PMF object. - -```{code-cell} ipython3 -def _predict(self, U, V): - """Estimate R from the given values of U and V.""" - R = np.dot(U, V.T) - sample_R = rng.normal(R, self.std) - # bound ratings - low, high = self.bounds - sample_R[sample_R < low] = low - sample_R[sample_R > high] = high - return sample_R - - -PMF.predict = _predict -``` - -One final thing to note: the dot products in this model are often constrained using a logistic function $g(x) = 1/(1 + exp(-x))$, that bounds the predictions to the range [0, 1]. 
To facilitate this bounding, the ratings are also mapped to the range [0, 1] using $t(x) = (x + min) / range$. The authors of PMF also introduced a constrained version which performs better on users with less ratings {cite:p}`salakhutdinov2008bayesian`. Both models are generally improvements upon the basic model presented here. However, in the interest of time and space, these will not be implemented here. - -+++ - -## Evaluation - -### Metrics - -In order to understand how effective our models are, we'll need to be able to evaluate them. We'll be evaluating in terms of root mean squared error (RMSE), which looks like this: - -$$ -RMSE = \sqrt{ \frac{ \sum_{i=1}^N \sum_{j=1}^M I_{ij} (R_{ij} - R_{ij}^*)^2 } - { \sum_{i=1}^N \sum_{j=1}^M I_{ij} } } -$$ - -In this case, the RMSE can be thought of as the standard deviation of our predictions from the actual user preferences. - -```{code-cell} ipython3 -# Define our evaluation function. -def rmse(test_data, predicted): - """Calculate root mean squared error. - Ignoring missing values in the test data. - """ - I = ~np.isnan(test_data) # indicator for missing values - N = I.sum() # number of non-missing values - sqerror = abs(test_data - predicted) ** 2 # squared error array - mse = sqerror[I].sum() / N # mean squared error - return np.sqrt(mse) # RMSE -``` - -### Training Data vs. Test Data - -The next thing we need to do is split our data into a training set and a test set. Matrix factorization techniques use [transductive learning](http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29) rather than inductive learning. So we produce a test set by taking a random sample of the cells in the full $N \times M$ data matrix. The values selected as test samples are replaced with `nan` values in a copy of the original data matrix to produce the training set. Since we'll be producing random splits, let's also write out the train/test sets generated. This will allow us to replicate our results. We'd like to be able to idenfity which split is which, so we'll take a hash of the indices selected for testing and use that to save the data. - -```{code-cell} ipython3 -# Define a function for splitting train/test data. -def split_train_test(data, percent_test=0.1): - """Split the data into train/test sets. - :param int percent_test: Percentage of data to use for testing. Default 10. - """ - n, m = data.shape # # users, # movies - N = n * m # # cells in matrix - - # Prepare train/test ndarrays. - train = data.copy() - test = np.ones(data.shape) * np.nan - - # Draw random sample of training data to use for testing. - tosample = np.where(~np.isnan(train)) # ignore nan values in data - idx_pairs = list(zip(tosample[0], tosample[1])) # tuples of row/col index pairs - - test_size = int(len(idx_pairs) * percent_test) # use 10% of data as test set - train_size = len(idx_pairs) - test_size # and remainder for training - - indices = np.arange(len(idx_pairs)) # indices of index pairs - sample = rng.choice(indices, replace=False, size=test_size) - - # Transfer random sample from train set to test set. 
- for idx in sample: - idx_pair = idx_pairs[idx] - test[idx_pair] = train[idx_pair] # transfer to test set - train[idx_pair] = np.nan # remove from train set - - # Verify everything worked properly - assert train_size == N - np.isnan(train).sum() - assert test_size == N - np.isnan(test).sum() - - # Return train set and test set - return train, test - - -train, test = split_train_test(dense_data) -``` - -## Results - -```{code-cell} ipython3 -# Let's see the results: -baselines = {} -for name in baseline_methods: - Method = baseline_methods[name] - method = Method(train) - baselines[name] = method.rmse(test) - print("{} RMSE:\t{:.5f}".format(method, baselines[name])) -``` - -As expected: the uniform random baseline is the worst by far, the global mean baseline is next best, and the mean of means method is our best baseline. Now let's see how PMF stacks up. - -```{code-cell} ipython3 -:tags: [hide-output] - -# We use a fixed precision for the likelihood. -# This reflects uncertainty in the dot product. -# We choose 2 in the footsteps Salakhutdinov -# Mnihof. -ALPHA = 2 - -# The dimensionality D; the number of latent factors. -# We can adjust this higher to try to capture more subtle -# characteristics of each movie. However, the higher it is, -# the more expensive our inference procedures will be. -# Specifically, we have D(N + M) latent variables. For our -# Movielens dataset, this means we have D(2625), so for 5 -# dimensions, we are sampling 13125 latent variables. -DIM = 10 - - -pmf = PMF(train, DIM, ALPHA, std=0.05) -``` - -### Predictions Using MAP - -```{code-cell} ipython3 -:tags: [hide-output] - -# Find MAP for PMF. -pmf.find_map(); -``` - -Excellent. The first thing we want to do is make sure the MAP estimate we obtained is reasonable. We can do this by computing RMSE on the predicted ratings obtained from the MAP values of $U$ and $V$. First we define a function for generating the predicted ratings $R$ from $U$ and $V$. We ensure the actual rating bounds are enforced by setting all values below 1 to 1 and all values above 5 to 5. Finally, we compute RMSE for both the training set and the test set. We expect the test RMSE to be higher. The difference between the two gives some idea of how much we have overfit. Some difference is always expected, but a very low RMSE on the training set with a high RMSE on the test set is a definite sign of overfitting. - -```{code-cell} ipython3 -def eval_map(pmf_model, train, test): - U = pmf_model.map["U"] - V = pmf_model.map["V"] - # Make predictions and calculate RMSE on train & test sets. - predictions = pmf_model.predict(U, V) - train_rmse = rmse(train, predictions) - test_rmse = rmse(test, predictions) - overfit = test_rmse - train_rmse - - # Print report. - print("PMF MAP training RMSE: %.5f" % train_rmse) - print("PMF MAP testing RMSE: %.5f" % test_rmse) - print("Train/test difference: %.5f" % overfit) - - return test_rmse - - -# Add eval function to PMF class. -PMF.eval_map = eval_map -``` - -```{code-cell} ipython3 -# Evaluate PMF MAP estimates. -pmf_map_rmse = pmf.eval_map(train, test) -pmf_improvement = baselines["mom"] - pmf_map_rmse -print("PMF MAP Improvement: %.5f" % pmf_improvement) -``` - -We actually see a decrease in performance between the MAP estimate and the mean of means performance. We also have a fairly large difference in the RMSE values between the train and the test sets. 
This indicates that the point estimates for $\alpha_U$ and $\alpha_V$ that we calculated from our data are not doing a great job of controlling model complexity. - -Let's see if we can improve our estimates by approximating our posterior distribution with MCMC sampling. We'll draw 500 samples, with 500 tuning samples. - -+++ - -### Predictions using MCMC - -```{code-cell} ipython3 -:tags: [hide-output] - -# Draw MCMC samples. -pmf.draw_samples(draws=500, tune=500) -``` - -### Diagnostics and Posterior Predictive Check - -The next step is to check how many samples we should discard as burn-in. Normally, we'd do this using a traceplot to get some idea of where the sampled variables start to converge. In this case, we have high-dimensional samples, so we need to find a way to approximate them. One way was proposed by {cite:t}`salakhutdinov2008bayesian`. We can calculate the Frobenius norms of $U$ and $V$ at each step and monitor those for convergence. This essentially gives us some idea when the average magnitude of the latent variables is stabilizing. The equations for the Frobenius norms of $U$ and $V$ are shown below. We will use `numpy`'s `linalg` package to calculate these. - -$$ \|U\|_{Fro}^2 = \sqrt{\sum_{i=1}^N \sum_{d=1}^D |U_{id}|^2}, \hspace{40pt} \|V\|_{Fro}^2 = \sqrt{\sum_{j=1}^M \sum_{d=1}^D |V_{jd}|^2} $$ - -```{code-cell} ipython3 -def _norms(pmf_model): - """Return norms of latent variables at each step in the - sample trace. These can be used to monitor convergence - of the sampler. - """ - - norms = dict() - norms["U"] = xr.apply_ufunc( - np.linalg.norm, - pmf_model.trace.posterior["U"], - input_core_dims=[["users", "latent_factors"]], - kwargs={"ord": "fro", "axis": (-2, -1)}, - ) - norms["V"] = xr.apply_ufunc( - np.linalg.norm, - pmf_model.trace.posterior["V"], - input_core_dims=[["movies", "latent_factors"]], - kwargs={"ord": "fro", "axis": (-2, -1)}, - ) - - return xr.Dataset(norms) - - -def _traceplot(pmf_model): - """Plot Frobenius norms of U and V as a function of sample #.""" - fig, ax = plt.subplots(2, 2, figsize=(12, 7)) - az.plot_trace(pmf_model.norms(), axes=ax) - ax[0][1].set_title(label=r"$\|U\|_{Fro}^2$ at Each Sample", fontsize=10) - ax[1][1].set_title(label=r"$\|V\|_{Fro}^2$ at Each Sample", fontsize=10) - ax[1][1].set_xlabel("Sample Number", fontsize=10) - - -PMF.norms = _norms -PMF.traceplot = _traceplot -``` - -```{code-cell} ipython3 -pmf.traceplot() -``` - -It appears we get convergence of $U$ and $V$ after about the default tuning. When testing for convergence, we also want to see convergence of the particular statistics we are looking for, since different characteristics of the posterior may converge at different rates. Let's also do a traceplot of the RSME. We'll compute RMSE for both the train and the test set, even though the convergence is indicated by RMSE on the training set alone. In addition, let's compute a running RMSE on the train/test sets to see how aggregate performance improves or decreases as we continue to sample. - -Notice here that we are sampling from 1 chain only, which makes the convergence statisitcs like $\hat{R}$ impossible (we can still compute the split-rhat but the purpose is different). The reason of not sampling multiple chain is that PMF might not have unique solution. Thus without constraints, the solutions are at best symmetrical, at worse identical under any rotation, in any case subject to label switching. 
In fact if we sample from multiple chains we will see large $\hat{R}$ indicating the sampler is exploring different solutions in different part of parameter space. - -```{code-cell} ipython3 -def _running_rmse(pmf_model, test_data, train_data, plot=True): - """Calculate RMSE for each step of the trace to monitor convergence.""" - results = {"per-step-train": [], "running-train": [], "per-step-test": [], "running-test": []} - R = np.zeros(test_data.shape) - for cnt in pmf.trace.posterior.draw.values: - U = pmf_model.trace.posterior["U"].sel(chain=0, draw=cnt) - V = pmf_model.trace.posterior["V"].sel(chain=0, draw=cnt) - sample_R = pmf_model.predict(U, V) - R += sample_R - running_R = R / (cnt + 1) - results["per-step-train"].append(rmse(train_data, sample_R)) - results["running-train"].append(rmse(train_data, running_R)) - results["per-step-test"].append(rmse(test_data, sample_R)) - results["running-test"].append(rmse(test_data, running_R)) - - results = pd.DataFrame(results) - - if plot: - results.plot( - kind="line", - grid=False, - figsize=(15, 7), - title="Per-step and Running RMSE From Posterior Predictive", - ) - - # Return the final predictions, and the RMSE calculations - return running_R, results - - -PMF.running_rmse = _running_rmse -``` - -```{code-cell} ipython3 -predicted, results = pmf.running_rmse(test, train) -``` - -```{code-cell} ipython3 -# And our final RMSE? -final_test_rmse = results["running-test"].values[-1] -final_train_rmse = results["running-train"].values[-1] -print("Posterior predictive train RMSE: %.5f" % final_train_rmse) -print("Posterior predictive test RMSE: %.5f" % final_test_rmse) -print("Train/test difference: %.5f" % (final_test_rmse - final_train_rmse)) -print("Improvement from MAP: %.5f" % (pmf_map_rmse - final_test_rmse)) -print("Improvement from Mean of Means: %.5f" % (baselines["mom"] - final_test_rmse)) -``` - -We have some interesting results here. As expected, our MCMC sampler provides lower error on the training set. However, it seems it does so at the cost of overfitting the data. This results in a decrease in test RMSE as compared to the MAP, even though it is still much better than our best baseline. So why might this be the case? Recall that we used point estimates for our precision parameters $\alpha_U$ and $\alpha_V$ and we chose a fixed precision $\alpha$. It is quite likely that by doing this, we constrained our posterior in a way that biased it towards the training data. In reality, the variance in the user ratings and the movie ratings is unlikely to be equal to the means of sample variances we used. Also, the most reasonable observation precision $\alpha$ is likely different as well. - -+++ - -### Summary of Results - -Let's summarize our results. - -```{code-cell} ipython3 -size = 100 # RMSE doesn't really change after 100th sample anyway. -all_results = pd.DataFrame( - { - "uniform random": np.repeat(baselines["ur"], size), - "global means": np.repeat(baselines["gm"], size), - "mean of means": np.repeat(baselines["mom"], size), - "PMF MAP": np.repeat(pmf_map_rmse, size), - "PMF MCMC": results["running-test"][:size], - } -) -fig, ax = plt.subplots(figsize=(10, 5)) -all_results.plot(kind="line", grid=False, ax=ax, title="RMSE for all methods") -ax.set_xlabel("Number of Samples") -ax.set_ylabel("RMSE"); -``` - -## Summary - -We set out to predict user preferences for unseen movies. First we discussed the intuitive notion behind the user-user and item-item neighborhood approaches to collaborative filtering. 
Then we formalized our intuitions. With a firm understanding of our problem context, we moved on to exploring our subset of the Movielens data. After discovering some general patterns, we defined three baseline methods: uniform random, global mean, and mean of means. With the goal of besting our baseline methods, we implemented the basic version of Probabilistic Matrix Factorization (PMF) using `pymc`. - -Our results demonstrate that the mean of means method is our best baseline on our prediction task. As expected, we are able to obtain a significant decrease in RMSE using the PMF MAP estimate obtained via Powell optimization. We illustrated one way to monitor convergence of an MCMC sampler with a high-dimensionality sampling space using the Frobenius norms of the sampled variables. The traceplots using this method seem to indicate that our sampler converged to the posterior. Results using this posterior showed that attempting to improve the MAP estimation using MCMC sampling actually overfit the training data and increased test RMSE. This was likely caused by the constraining of the posterior via fixed precision parameters $\alpha$, $\alpha_U$, and $\alpha_V$. - -As a followup to this analysis, it would be interesting to also implement the logistic and constrained versions of PMF. We expect both models to outperform the basic PMF model. We could also implement the fully Bayesian version of PMF (BPMF) {cite:p}`salakhutdinov2008bayesian`, which places hyperpriors on the model parameters to automatically learn ideal mean and precision parameters for $U$ and $V$. This would likely resolve the issue we faced in this analysis. We would expect BPMF to improve upon the MAP estimation produced here by learning more suitable hyperparameters and parameters. For a basic (but working!) implementation of BPMF in `pymc`, see [this gist](https://gist.github.com/macks22/00a17b1d374dfc267a9a). - -If you made it this far, then congratulations! You now have some idea of how to build a basic recommender system. These same ideas and methods can be used on many different recommendation tasks. Items can be movies, products, advertisements, courses, or even other people. Any time you can build yourself a user-item matrix with user preferences in the cells, you can use these types of collaborative filtering algorithms to predict the missing values. If you want to learn more about recommender systems, the first reference is a good place to start. - -+++ - -## Authors - -The model discussed in this analysis was developed by Ruslan Salakhutdinov and Andriy Mnih. Code and supporting text are the original work of [Mack Sweeney](https://www.linkedin.com/in/macksweeney) with changes made to adapt the code and text for the MovieLens dataset by Colin Carroll and Rob Zinkov. 
- -+++ - -## References - -:::{bibliography} -:filter: docname in docnames - -goldberg2001eigentaste -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/putting_workflow.myst.md b/myst_nbs/case_studies/putting_workflow.myst.md deleted file mode 100644 index ffe45c962..000000000 --- a/myst_nbs/case_studies/putting_workflow.myst.md +++ /dev/null @@ -1,871 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.10.6 ('pymc_env') - language: python - name: python3 -substitutions: - conda_dependencies: '!!xarray-einstats not available!!' - pip_dependencies: xarray-einstats ---- - -(putting_workflow)= -# Model building and expansion for golf putting - -:::{post} Apr 2, 2022 -:tags: Bayesian workflow, model expansion, sports -:category: intermediate, how-to -:author: Colin Carroll, Marco Gorelli, Oriol Abril-Pla -::: - -**This uses and closely follows [the case study from Andrew Gelman](https://mc-stan.org/users/documentation/case-studies/golf.html), written in Stan. There are some new visualizations and we steered away from using improper priors, but much credit to him and to the Stan group for the wonderful case study and software.** - -We use a data set from "Statistics: A Bayesian Perspective" {cite:p}`berry1996statistics`. The dataset describes the outcome of professional golfers putting from a number of distances, and is small enough that we can just print and load it inline, instead of doing any special `csv` reading. - -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -import io - -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import scipy -import scipy.stats as st -import xarray as xr - -from xarray_einstats.stats import XrContinuousRV - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -# golf putting data from berry (1996) -golf_data = """distance tries successes -2 1443 1346 -3 694 577 -4 455 337 -5 353 208 -6 272 149 -7 256 136 -8 240 111 -9 217 69 -10 200 67 -11 237 75 -12 202 52 -13 192 46 -14 174 54 -15 167 28 -16 201 27 -17 195 31 -18 191 33 -19 147 20 -20 152 24""" - - -golf_data = pd.read_csv(io.StringIO(golf_data), sep=" ", dtype={"distance": "float"}) - -BALL_RADIUS = (1.68 / 2) / 12 -CUP_RADIUS = (4.25 / 2) / 12 -``` - -We start plotting the data to get a better idea of how it looks. 
The hidden cell contains the plotting code - -```{code-cell} ipython3 -:tags: [hide-input] - -def plot_golf_data(golf_data, ax=None, color="C0"): - """Utility function to standardize a pretty plotting of the golf data.""" - if ax is None: - _, ax = plt.subplots() - bg_color = ax.get_facecolor() - rv = st.beta(golf_data.successes, golf_data.tries - golf_data.successes) - ax.vlines(golf_data.distance, *rv.interval(0.68), label=None, color=color) - ax.plot( - golf_data.distance, - golf_data.successes / golf_data.tries, - "o", - mec=color, - mfc=bg_color, - label=None, - ) - - ax.set_xlabel("Distance from hole") - ax.set_ylabel("Percent of putts made") - ax.set_ylim(bottom=0, top=1) - - ax.set_xlim(left=0) - ax.grid(True, axis="y", alpha=0.7) - return ax -``` - -```{code-cell} ipython3 -ax = plot_golf_data(golf_data) -ax.set_title("Overview of data from Berry (1996)"); -``` - -After plotting, we see that generally golfers are less accurate from further away. Note that this data is pre-aggregated: we may be able to do more interesting work with granular putt-by-putt data. This data set appears to have been binned to the nearest foot. - -We might think about doing prediction with this data: fitting a curve to this data would allow us to make reasonable guesses at intermediate distances, as well as perhaps to extrapolate to longer distances. - -+++ - -## Logit model - -First we will fit a traditional logit-binomial model. We model the number of successes directly, with - -$$ -a, b \sim \mathcal{N}(0, 1) \\ -p(\text{success}) = \operatorname{logit}^{-1}(a \cdot \text{distance} + b) \\ -\text{num. successes} \sim \operatorname{Binomial}(\text{tries}, p(\text{success})) -$$ - -Here is how to write that model in PyMC. We use underscore appendices in our model variables to avoid polluting the namespace. We also use {class}`pymc.MutableData` to let us swap out the data later, when we will work with a newer data set. - -```{code-cell} ipython3 -with pm.Model() as logit_model: - distance_ = pm.MutableData("distance", golf_data["distance"], dims="obs_id") - tries_ = pm.MutableData("tries", golf_data["tries"], dims="obs_id") - successes_ = pm.MutableData("successes", golf_data["successes"], dims="obs_id") - - a_ = pm.Normal("a") - b_ = pm.Normal("b") - - pm.Binomial( - "success", - n=tries_, - p=pm.math.invlogit(a_ * distance_ + b_), - observed=successes_, - dims="obs_id", - ) - - -pm.model_to_graphviz(logit_model) -``` - -We have some intuition that $a$ should be negative, and also that $b$ should be positive (since when $\text{distance} = 0$, we expect to make nearly 100% of putts). We are not putting that into the model, though. We are using this as a baseline, and we may as well wait and see if we need to add stronger priors. - -```{code-cell} ipython3 -with logit_model: - logit_trace = pm.sample(1000, tune=1000, target_accept=0.9) - -az.summary(logit_trace) -``` - -We see $a$ and $b$ have the signs we expected. There were no bad warnings emitted from the sampler. Looking at the summary, the number of effective samples is reasonable, and the rhat is close to 1. This is a small model, so we are not being too careful about inspecting the fit. - -We plot 50 posterior draws of $p(\text{success})$ along with the expected value. 
Also, we draw 500 points from the posterior predictive to plot: - -```{code-cell} ipython3 -# Draw posterior predictive samples -with logit_model: - # hard to plot more than 400 sensibly - # we generate a posterior predictive sample for only 1 in every 10 draws - logit_trace.extend(pm.sample_posterior_predictive(logit_trace.sel(draw=slice(None, None, 10)))) -logit_post = logit_trace.posterior -logit_ppc = logit_trace.posterior_predictive -const_data = logit_trace.constant_data -logit_ppc_success = (logit_ppc["success"] / const_data["tries"]).stack(sample=("chain", "draw")) - -# Plotting -ax = plot_golf_data(golf_data) -t_ary = np.linspace(CUP_RADIUS - BALL_RADIUS, golf_data.distance.max(), 200) -t = xr.DataArray(t_ary, coords=[("distance", t_ary)]) -logit_post["expit"] = scipy.special.expit(logit_post["a"] * t + logit_post["b"]) -logit_post_subset = az.extract_dataset(logit_post, num_samples=50, rng=RANDOM_SEED) - -ax.plot( - t, - logit_post_subset["expit"], - lw=1, - color="C1", - alpha=0.5, -) - -ax.plot(t, logit_post["expit"].mean(("chain", "draw")), color="C2") - -ax.plot(golf_data.distance, logit_ppc_success, "k.", alpha=0.01) -ax.set_title("Logit mean and posterior predictive"); -``` - -The fit is ok, but not great! It is a good start for a baseline, and lets us answer curve-fitting type questions. We may not trust much extrapolation beyond the end of the data, especially given how the curve does not fit the last four values very well. For example, putts from 50 feet are expected to be made with probability: - -```{code-cell} ipython3 -prob_at_50 = ( - scipy.special.expit(logit_post["a"] * 50 + logit_post["b"]).mean(("chain", "draw")).item() -) -print(f"{100 * prob_at_50:.5f}%") -``` - -The lesson from this is that - -$$ -\mathbb{E}[f(\theta)] \ne f(\mathbb{E}[\theta]). -$$ - -this appeared here in using - -```python -# Right! -scipy.special.expit(logit_trace.posterior["a"] * 50 + logit_trace.posterior["b"]).mean(('chain', 'draw')) -``` -rather than - -```python -# Wrong! -scipy.special.expit(logit_trace.posterior["a"].mean(('chain', 'draw')) * 50 + logit_trace.posterior["b"].mean(('chain', 'draw'))) -``` - -to calculate our expectation at 50 feet. - -+++ - -## Geometry-based model - -As a second pass at modelling this data, both to improve fit and to increase confidence in extrapolation, we think about the geometry of the situation. We suppose professional golfers can hit the ball in a certain direction, with some small(?) error. Specifically, the angle the ball actually travels is normally distributed around 0, with some variance that we will try to learn. - -Then the ball goes in whenever the error in angle is small enough that the ball still hits the cup. This is intuitively nice! A longer putt will admit a smaller error in angle, and so a lower success rate than for shorter putts. - -I am skipping a derivation of the probability of making a putt given the accuracy variance and distance to the hole, but it is a fun exercise in geometry, and turns out to be - -$$ -p(\text{success} | \sigma_{\text{angle}}, \text{distance}) = 2 \Phi\left( \frac{ \arcsin \left((R - r) / \text{distance}\right)}{\sigma_{\text{angle}}}\right), -$$ - -where $\Phi$ is the normal cumulative density function, $R$ is the radius of the cup (turns out 2.125 inches), and $r$ is the radius of the golf ball (around 0.84 inches). - -To get a feeling for this model, let's look at a few manually plotted values for $\sigma_{\text{angle}}$. 
- -```{code-cell} ipython3 -def forward_angle_model(variances_of_shot, t): - norm_dist = XrContinuousRV(st.norm, 0, variances_of_shot) - return 2 * norm_dist.cdf(np.arcsin((CUP_RADIUS - BALL_RADIUS) / t)) - 1 -``` - -```{code-cell} ipython3 -_, ax = plt.subplots() -var_shot_ary = [0.01, 0.02, 0.05, 0.1, 0.2, 1] -var_shot_plot = xr.DataArray(var_shot_ary, coords=[("variance", var_shot_ary)]) - -forward_angle_model(var_shot_plot, t).plot.line(hue="variance") - -plot_golf_data(golf_data, ax=ax) -ax.set_title("Model prediction for selected amounts of variance"); -``` - -This looks like a promising approach! A variance of 0.02 radians looks like it will be close to the right answer. The model also predicted that putts from 0 feet all go in, which is a nice side effect. We might think about whether a golfer misses putts symmetrically. It is plausible that a right handed putter and a left handed putter might have a different bias to their shots. -### Fitting the model - -PyMC has $\Phi$ implemented, but it is pretty hidden (`pm.distributions.dist_math.normal_lcdf`), and it is worthwhile to implement it ourselves anyways, using an identity with the [error function](https://en.wikipedia.org/wiki/Error_function). - -```{code-cell} ipython3 -def phi(x): - """Calculates the standard normal cumulative distribution function.""" - return 0.5 + 0.5 * at.erf(x / at.sqrt(2.0)) - - -with pm.Model() as angle_model: - distance_ = pm.MutableData("distance", golf_data["distance"], dims="obs_id") - tries_ = pm.MutableData("tries", golf_data["tries"], dims="obs_id") - successes_ = pm.MutableData("successes", golf_data["successes"], dims="obs_id") - - variance_of_shot = pm.HalfNormal("variance_of_shot") - p_goes_in = pm.Deterministic( - "p_goes_in", - 2 * phi(at.arcsin((CUP_RADIUS - BALL_RADIUS) / distance_) / variance_of_shot) - 1, - dims="obs_id", - ) - success = pm.Binomial("success", n=tries_, p=p_goes_in, observed=successes_, dims="obs_id") - - -pm.model_to_graphviz(angle_model) -``` - -### Prior Predictive Checks - -We often wish to sample from the prior, especially if we have some idea of what the observations would look like, but not a lot of intuition for the prior parameters. We have an angle-based model here, but it might not be intuitive if the *variance* of the angle is given, how that effects the accuracy of a shot. Let's check! - -Sometimes a custom visualization or dashboard is useful for a prior predictive check. Here, we plot our prior distribution of putts from 20 feet away. - -```{code-cell} ipython3 -with angle_model: - angle_trace = pm.sample_prior_predictive(500) - -angle_prior = angle_trace.prior.squeeze() - -angle_of_shot = XrContinuousRV(st.norm, 0, angle_prior["variance_of_shot"]).rvs( - random_state=RANDOM_SEED -) # radians -distance = 20 # feet - -end_positions = xr.Dataset( - {"endx": distance * np.cos(angle_of_shot), "endy": distance * np.sin(angle_of_shot)} -) - -fig, ax = plt.subplots() -for draw in end_positions["draw"]: - end = end_positions.sel(draw=draw) - ax.plot([0, end["endx"]], [0, end["endy"]], "k-o", lw=1, mfc="w", alpha=0.5) -ax.plot(0, 0, "go", label="Start", mfc="g", ms=20) -ax.plot(distance, 0, "ro", label="Goal", mfc="r", ms=20) - -ax.set_title(f"Prior distribution of putts from {distance}ft away") - -ax.legend(); -``` - -This is a little funny! Most obviously, it should probably be not this common to putt the ball *backwards*. This also leads us to worry that we are using a normal distribution to model an angle. 
The [von Mises](https://en.wikipedia.org/wiki/Von_Mises_distribution) distribution may be appropriate here. Also, the golfer needs to stand somewhere, so perhaps adding some bounds to the von Mises would be appropriate. We will find that this model learns from the data quite well, though, and these additions are not necessary. - -```{code-cell} ipython3 -with angle_model: - angle_trace.extend(pm.sample(1000, tune=1000, target_accept=0.85)) - -angle_post = angle_trace.posterior -``` - -```{code-cell} ipython3 -ax = plot_golf_data(golf_data) - -angle_post["expit"] = forward_angle_model(angle_post["variance_of_shot"], t) - -ax.plot( - t, - az.extract_dataset(angle_post, num_samples=50)["expit"], - lw=1, - color="C1", - alpha=0.1, -) - -ax.plot( - t, - angle_post["expit"].mean(("chain", "draw")), - label="Geometry-based model", -) - -ax.plot( - t, - logit_post["expit"].mean(("chain", "draw")), - label="Logit-binomial model", -) -ax.set_title("Comparing the fit of geometry-based and logit-binomial model") -ax.legend(); -``` - -This new model appears to fit much better, and by modelling the geometry of the situation, we may have a bit more confidence in extrapolating the data. This model suggests that a 50 foot putt has much higher chance of going in: - -```{code-cell} ipython3 -angle_prob_at_50 = forward_angle_model(angle_post["variance_of_shot"], np.array([50])) -print(f"{100 * angle_prob_at_50.mean().item():.2f}% vs {100 * prob_at_50:.5f}%") -``` - -We can also recreate our prior predictive plot, giving us some confidence that the prior was not leading to unreasonable situations in the posterior distribution: the variance in angle is quite small! - -```{code-cell} ipython3 -angle_of_shot = XrContinuousRV( - st.norm, 0, az.extract_dataset(angle_post, num_samples=500)["variance_of_shot"] -).rvs( - random_state=RANDOM_SEED -) # radians -distance = 20 # feet - -end_positions = xr.Dataset( - {"endx": distance * np.cos(angle_of_shot), "endy": distance * np.sin(angle_of_shot)} -) - -fig, ax = plt.subplots() -for sample in range(end_positions.dims["sample"]): - end = end_positions.isel(sample=sample) - ax.plot([0, end["endx"]], [0, end["endy"]], "k-o", lw=1, mfc="w", alpha=0.5) -ax.plot(0, 0, "go", label="Start", mfc="g", ms=20) -ax.plot(distance, 0, "ro", label="Goal", mfc="r", ms=20) - -ax.set_title(f"Prior distribution of putts from {distance}ft away") -ax.set_xlim(-21, 21) -ax.set_ylim(-21, 21) -ax.legend(); -``` - -## New Data! - -Mark Broadie used new summary data on putting to fit a new model. 
We will use this new data to refine our model: - -```{code-cell} ipython3 -# golf putting data from Broadie (2018) -new_golf_data = """distance tries successes -0.28 45198 45183 -0.97 183020 182899 -1.93 169503 168594 -2.92 113094 108953 -3.93 73855 64740 -4.94 53659 41106 -5.94 42991 28205 -6.95 37050 21334 -7.95 33275 16615 -8.95 30836 13503 -9.95 28637 11060 -10.95 26239 9032 -11.95 24636 7687 -12.95 22876 6432 -14.43 41267 9813 -16.43 35712 7196 -18.44 31573 5290 -20.44 28280 4086 -21.95 13238 1642 -24.39 46570 4767 -28.40 38422 2980 -32.39 31641 1996 -36.39 25604 1327 -40.37 20366 834 -44.38 15977 559 -48.37 11770 311 -52.36 8708 231 -57.25 8878 204 -63.23 5492 103 -69.18 3087 35 -75.19 1742 24""" - -new_golf_data = pd.read_csv(io.StringIO(new_golf_data), sep=" ") -``` - -```{code-cell} ipython3 -ax = plot_golf_data(new_golf_data) -plot_golf_data(golf_data, ax=ax, color="C1") - -t_ary = np.linspace(CUP_RADIUS - BALL_RADIUS, new_golf_data.distance.max(), 200) -t = xr.DataArray(t_ary, coords=[("distance", t_ary)]) - -ax.plot( - t, forward_angle_model(angle_trace.posterior["variance_of_shot"], t).mean(("chain", "draw")) -) -ax.set_title("Comparing the new data set to the old data set, and\nconsidering the old model fit"); -``` - -This new data set represents ~200 times the number of putt attempts as the old data, and includes putts up to 75ft, compared to 20ft for the old data set. It also seems that the new data represents a different population from the old data: while the two have different bins, the new data suggests higher success for most data. This may be from a different method of collecting the data, or golfers improving in the intervening years. - -+++ - -## Fitting the model on the new data - -Since we think these may be two different populations, the easiest solution would be to refit our model. This goes worse than earlier: there are divergences, and it takes much longer to run. This may indicate a problem with the model: Andrew Gelman calls this the "folk theorem of statistical computing". - -```{code-cell} ipython3 -with angle_model: - pm.set_data( - { - "distance": new_golf_data["distance"], - "tries": new_golf_data["tries"], - "successes": new_golf_data["successes"], - } - ) - new_angle_trace = pm.sample(1000, tune=1500) -``` - -:::{note} -As you will see in the plot below, this model fits the new data quite badly. In this case, all the divergences -and convergence warnings have no other solution than using a different model that can actually explain the data. -::: - -```{code-cell} ipython3 -ax = plot_golf_data(new_golf_data) -plot_golf_data(golf_data, ax=ax, color="C1") - -new_angle_post = new_angle_trace.posterior - -ax.plot( - t, - forward_angle_model(angle_post["variance_of_shot"], t).mean(("chain", "draw")), - label="Trained on original data", -) -ax.plot( - t, - forward_angle_model(new_angle_post["variance_of_shot"], t).mean(("chain", "draw")), - label="Trained on new data", -) -ax.set_title("Retraining the model on new data") -ax.legend(); -``` - -## A model incorporating distance to hole - -We might assume that, in addition to putting in the right direction, a golfer may need to hit the ball the right distance. Specifically, we assume: - -1. If a put goes short *or* more than 3 feet past the hole, it will not go in. -2. Golfers aim for 1 foot past the hole -3. 
The distance the ball goes, $u$, is distributed according to -$$ -u \sim \mathcal{N}\left(1 + \text{distance}, \sigma_{\text{distance}} (1 + \text{distance})\right), -$$ -where we will learn $\sigma_{\text{distance}}$. - -Again, this is a geometry and algebra problem to work the probability that the ball goes in from any given distance: -$$ -P(\text{good distance}) = P(\text{distance} < u < \text{distance} + 3) -$$ - -it uses `phi`, the cumulative normal density function we implemented earlier. - -```{code-cell} ipython3 -OVERSHOT = 1.0 -DISTANCE_TOLERANCE = 3.0 - - -with pm.Model() as distance_angle_model: - distance_ = pm.MutableData("distance", new_golf_data["distance"], dims="obs_id") - tries_ = pm.MutableData("tries", new_golf_data["tries"], dims="obs_id") - successes_ = pm.MutableData("successes", new_golf_data["successes"], dims="obs_id") - - variance_of_shot = pm.HalfNormal("variance_of_shot") - variance_of_distance = pm.HalfNormal("variance_of_distance") - p_good_angle = pm.Deterministic( - "p_good_angle", - 2 * phi(at.arcsin((CUP_RADIUS - BALL_RADIUS) / distance_) / variance_of_shot) - 1, - dims="obs_id", - ) - p_good_distance = pm.Deterministic( - "p_good_distance", - phi((DISTANCE_TOLERANCE - OVERSHOT) / ((distance_ + OVERSHOT) * variance_of_distance)) - - phi(-OVERSHOT / ((distance_ + OVERSHOT) * variance_of_distance)), - dims="obs_id", - ) - - success = pm.Binomial( - "success", n=tries_, p=p_good_angle * p_good_distance, observed=successes_, dims="obs_id" - ) - - -pm.model_to_graphviz(distance_angle_model) -``` - -This model still has only 2 dimensions to fit. We might think about checking on `OVERSHOT` and `DISTANCE_TOLERANCE`. Checking the first might involve a call to a local golf course, and the second might require a trip to a green and some time experimenting. We might also think about adding some explicit correlations: it is plausible that less control over angle would correspond to less control over distance, or that longer putts lead to more variance in the angle. - -+++ - -## Fitting the distance angle model - -```{code-cell} ipython3 -with distance_angle_model: - distance_angle_trace = pm.sample(1000, tune=1000, target_accept=0.85) -``` - -```{code-cell} ipython3 -def forward_distance_angle_model(variance_of_shot, variance_of_distance, t): - rv = XrContinuousRV(st.norm, 0, 1) - angle_prob = 2 * rv.cdf(np.arcsin((CUP_RADIUS - BALL_RADIUS) / t) / variance_of_shot) - 1 - - distance_prob_one = rv.cdf( - (DISTANCE_TOLERANCE - OVERSHOT) / ((t + OVERSHOT) * variance_of_distance) - ) - distance_prob_two = rv.cdf(-OVERSHOT / ((t + OVERSHOT) * variance_of_distance)) - distance_prob = distance_prob_one - distance_prob_two - - return angle_prob * distance_prob - - -ax = plot_golf_data(new_golf_data) - -distance_angle_post = distance_angle_trace.posterior - -ax.plot( - t, - forward_angle_model(new_angle_post["variance_of_shot"], t).mean(("chain", "draw")), - label="Just angle", -) -ax.plot( - t, - forward_distance_angle_model( - distance_angle_post["variance_of_shot"], - distance_angle_post["variance_of_distance"], - t, - ).mean(("chain", "draw")), - label="Distance and angle", -) - -ax.set_title("Comparing fits of models on new data") -ax.legend(); -``` - -This new model looks better, and fit much more quickly with fewer sampling problems compared to the old model.There is some mismatch between 10 and 40 feet, but it seems generally good. We can come to this same conclusion by taking posterior predictive samples, and looking at the residuals. 
Here, we see that the fit is being driven by the first 4 bins, which contain ~40% of the data. - -```{code-cell} ipython3 -with distance_angle_model: - pm.sample_posterior_predictive(distance_angle_trace, extend_inferencedata=True) - -const_data = distance_angle_trace.constant_data -pp = distance_angle_trace.posterior_predictive -residuals = 100 * ((const_data["successes"] - pp["success"]) / const_data["tries"]).mean( - ("chain", "draw") -) - -fig, ax = plt.subplots() - -ax.plot(new_golf_data.distance, residuals, "o-") -ax.axhline(y=0, linestyle="dashed", linewidth=1) -ax.set_xlabel("Distance from hole") -ax.set_ylabel("Absolute error in expected\npercent of success") - -ax.set_title("Residuals of new model"); -``` - -## A new model - -It is reasonable to stop at this point, but if we want to improve the fit everywhere, we may want to choose a different likelihood from the `Binomial`, which cares deeply about those points with many observations. One thing we could do is add some independent extra error to each data point. We could do this in a few ways: -1. The `Binomial` distribution is usually parametrized by $n$, the number of observations, and $p$, the probability of an individual success. We could instead parametrize it by mean ($np$) and variance ($np(1-p)$), and add error independent of $n$ to the likelihood. -2. Use a `BetaBinomial` distribution, though the error there would still be (roughly) proportional to the number of observations. -3. Approximate the Binomial with a Normal distribution of the probability of success. This is actually equivalent to the first approach, but does not require a custom distribution. Note that we will use $p$ as the mean, and $p(1-p) / n$ as the variance. Once we add some dispersion $\epsilon$, the variance becomes $p(1-p)/n + \epsilon^2$. - -We follow approach 3, as in the Stan case study, and leave 1 as an exercise.
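To make the effect of approach 3 concrete, here is a tiny sketch with made-up numbers (illustration only): for a heavily observed bin the binomial standard deviation $\sqrt{p(1-p)/n}$ is minuscule, so even a small dispersion term dominates it and loosens the likelihood exactly where the `Binomial` was most rigid:

```python
import numpy as np

# Made-up values for illustration only
p = 0.9          # modelled probability of success for a short putt
n = 150_000      # number of attempts recorded in that bin
epsilon = 0.003  # extra dispersion, on the standard deviation scale

binomial_sd = np.sqrt(p * (1 - p) / n)
dispersed_sd = np.sqrt(p * (1 - p) / n + epsilon**2)
print(f"binomial sd: {binomial_sd:.5f}, with dispersion: {dispersed_sd:.5f}")
```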
- -```{code-cell} ipython3 -with pm.Model() as disp_distance_angle_model: - distance_ = pm.MutableData("distance", new_golf_data["distance"], dims="obs_id") - tries_ = pm.MutableData("tries", new_golf_data["tries"], dims="obs_id") - successes_ = pm.MutableData("successes", new_golf_data["successes"], dims="obs_id") - obs_prop_ = pm.MutableData( - "obs_prop", new_golf_data["successes"] / new_golf_data["tries"], dims="obs_id" - ) - - variance_of_shot = pm.HalfNormal("variance_of_shot") - variance_of_distance = pm.HalfNormal("variance_of_distance") - dispersion = pm.HalfNormal("dispersion") - - p_good_angle = pm.Deterministic( - "p_good_angle", - 2 * phi(at.arcsin((CUP_RADIUS - BALL_RADIUS) / distance_) / variance_of_shot) - 1, - dims="obs_id", - ) - p_good_distance = pm.Deterministic( - "p_good_distance", - phi((DISTANCE_TOLERANCE - OVERSHOT) / ((distance_ + OVERSHOT) * variance_of_distance)) - - phi(-OVERSHOT / ((distance_ + OVERSHOT) * variance_of_distance)), - dims="obs_id", - ) - - p = p_good_angle * p_good_distance - p_success = pm.Normal( - "p_success", - mu=p, - sigma=at.sqrt(((p * (1 - p)) / tries_) + dispersion**2), - observed=obs_prop_, # successes_ / tries_ - dims="obs_id", - ) - - -pm.model_to_graphviz(disp_distance_angle_model) -``` - -```{code-cell} ipython3 -with disp_distance_angle_model: - disp_distance_angle_trace = pm.sample(1000, tune=1000) - pm.sample_posterior_predictive(disp_distance_angle_trace, extend_inferencedata=True) -``` - -```{code-cell} ipython3 -ax = plot_golf_data(new_golf_data, ax=None) - -disp_distance_angle_post = disp_distance_angle_trace.posterior - -ax.plot( - t, - forward_distance_angle_model( - distance_angle_post["variance_of_shot"], - distance_angle_post["variance_of_distance"], - t, - ).mean(("chain", "draw")), - label="Distance and angle", -) -ax.plot( - t, - forward_distance_angle_model( - disp_distance_angle_post["variance_of_shot"], - disp_distance_angle_post["variance_of_distance"], - t, - ).mean(("chain", "draw")), - label="Dispersed model", -) -ax.set_title("Comparing dispersed model with binomial distance/angle model") -ax.legend(); -``` - -This new model does better between 10 and 30 feet, as we can also see using the residuals plot - note that this model does marginally worse for very short putts: - -```{code-cell} ipython3 -const_data = distance_angle_trace.constant_data -old_pp = distance_angle_trace.posterior_predictive -old_residuals = 100 * ((const_data["successes"] - old_pp["success"]) / const_data["tries"]).mean( - ("chain", "draw") -) - -pp = disp_distance_angle_trace.posterior_predictive -residuals = 100 * (const_data["successes"] / const_data["tries"] - pp["p_success"]).mean( - ("chain", "draw") -) - -fig, ax = plt.subplots() - -ax.plot(new_golf_data.distance, residuals, label="Dispersed model") -ax.plot(new_golf_data.distance, old_residuals, label="Distance and angle model") -ax.legend() -ax.axhline(y=0, linestyle="dashed", linewidth=1) -ax.set_xlabel("Distance from hole") -ax.set_ylabel("Absolute error in expected\npercent of success") -ax.set_title("Residuals of dispersed model vs distance/angle model"); -``` - -## Beyond prediction - -We want to use Bayesian analysis because we care about quantifying uncertainty in our parameters. We have a beautiful geometric model that not only gives us predictions, but gives us posterior distributions over our parameters. We can use this to back out how where our putts may end up, if not in the hole! 
- -First, we can try to visualize how 20,000 putts from a professional golfer might look. We: - -1. Set the number of trials to 5 -2. For each *joint* posterior sample of `variance_of_shot` and `variance_of_distance`, - draw an angle and a distance from normal distribution 5 times. -3. Plot the point, unless it would have gone in the hole - -```{code-cell} ipython3 -def simulate_from_distance(trace, distance_to_hole, trials=5): - variance_of_shot = trace.posterior["variance_of_shot"] - variance_of_distance = trace.posterior["variance_of_distance"] - - theta = XrContinuousRV(st.norm, 0, variance_of_shot).rvs(size=trials, dims="trials") - distance = XrContinuousRV( - st.norm, distance_to_hole + OVERSHOT, (distance_to_hole + OVERSHOT) * variance_of_distance - ).rvs(size=trials, dims="trials") - - final_position = xr.concat( - (distance * np.cos(theta), distance * np.sin(theta)), dim="axis" - ).assign_coords(axis=["x", "y"]) - - made_it = np.abs(theta) < np.arcsin((CUP_RADIUS - BALL_RADIUS) / distance_to_hole) - made_it = ( - made_it - * (final_position.sel(axis="x") > distance_to_hole) - * (final_position.sel(axis="x") < distance_to_hole + DISTANCE_TOLERANCE) - ) - - dims = [dim for dim in final_position.dims if dim != "axis"] - final_position = final_position.where(~made_it).stack(idx=dims).dropna(dim="idx") - total_simulations = made_it.size - - _, ax = plt.subplots() - - ax.plot(0, 0, "k.", lw=1, mfc="black", ms=250 / distance_to_hole) - ax.plot(*final_position, ".", alpha=0.1, mfc="r", ms=250 / distance_to_hole, mew=0.5) - ax.plot(distance_to_hole, 0, "ko", lw=1, mfc="black", ms=350 / distance_to_hole) - - ax.set_facecolor("#e6ffdb") - ax.set_title( - f"Final position of {total_simulations:,d} putts from {distance_to_hole}ft.\n" - f"({100 * made_it.mean().item():.1f}% made)" - ) - return ax - - -simulate_from_distance(distance_angle_trace, distance_to_hole=50); -``` - -```{code-cell} ipython3 -simulate_from_distance(distance_angle_trace, distance_to_hole=7); -``` - -We can then use this to work out how many putts a player may need to take from a given distance. This can influence strategic decisions like trying to reach the green in fewer shots, which may lead to a longer first putt, vs. a more conservative approach. We do this by simulating putts until they have all gone in. - -Note that this is again something we might check experimentally. In particular, a highly unscientific search around the internet finds claims that professionals only 3-putt from 20-25ft around 3% of the time. Our model puts the chance of 3 or more putts from 22.5 feet at 2.8%, which seems suspiciously good. 
- -```{code-cell} ipython3 -def expected_num_putts(trace, distance_to_hole, trials=100_000): - distance_to_hole = distance_to_hole * np.ones(trials) - - combined_trace = trace.posterior.stack(sample=("chain", "draw")) - - n_samples = combined_trace.dims["sample"] - - idxs = np.random.randint(0, n_samples, trials) - variance_of_shot = combined_trace["variance_of_shot"].isel(sample=idxs) - variance_of_distance = combined_trace["variance_of_distance"].isel(sample=idxs) - n_shots = [] - while distance_to_hole.size > 0: - theta = np.random.normal(0, variance_of_shot) - distance = np.random.normal( - distance_to_hole + OVERSHOT, (distance_to_hole + OVERSHOT) * variance_of_distance - ) - - final_position = np.array([distance * np.cos(theta), distance * np.sin(theta)]) - - made_it = np.abs(theta) < np.arcsin( - (CUP_RADIUS - BALL_RADIUS) / distance_to_hole.clip(min=CUP_RADIUS - BALL_RADIUS) - ) - made_it = ( - made_it - * (final_position[0] > distance_to_hole) - * (final_position[0] < distance_to_hole + DISTANCE_TOLERANCE) - ) - - distance_to_hole = np.sqrt( - (final_position[0] - distance_to_hole) ** 2 + final_position[1] ** 2 - )[~made_it].copy() - variance_of_shot = variance_of_shot[~made_it] - variance_of_distance = variance_of_distance[~made_it] - n_shots.append(made_it.sum()) - return np.array(n_shots) / trials -``` - -```{code-cell} ipython3 -distances = (10, 20, 40, 80) -fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(10, 10)) - -for distance, ax in zip(distances, axes.ravel()): - made = 100 * expected_num_putts(disp_distance_angle_trace, distance) - x = np.arange(1, 1 + len(made), dtype=int) - ax.vlines(np.arange(1, 1 + len(made)), 0, made, linewidths=50) - ax.set_title(f"{distance} feet") - ax.set_ylabel("Percent of attempts") - ax.set_xlabel("Number of putts") -ax.set_xticks(x) -ax.set_ylim(0, 100) -ax.set_xlim(0, 5.6) -fig.suptitle("Simulated number of putts from\na few distances"); -``` - -## Authors -* Adapted by Colin Carroll from the [Model building and expansion for golf putting] case study in the Stan documentation ([pymc#3666](https://github.com/pymc-devs/pymc/pull/3666)) -* Updated by Marco Gorelli ([pymc-examples#39](https://github.com/pymc-devs/pymc-examples/pull/39)) -* Updated by Oriol Abril-Pla to use PyMC v4 and xarray-einstats - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aeppl,xarray_einstats -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/reinforcement_learning.myst.md b/myst_nbs/case_studies/reinforcement_learning.myst.md deleted file mode 100644 index 9638f4303..000000000 --- a/myst_nbs/case_studies/reinforcement_learning.myst.md +++ /dev/null @@ -1,548 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -+++ {"id": "Pq7u0kdRwDje"} - -(reinforcement_learning)= -# Fitting a Reinforcement Learning Model to Behavioral Data with PyMC - -:::{post} Aug 5, 2022 -:tags: Aesara, Reinforcement Learning -:category: advanced, how-to -:author: Ricardo Vieira -::: - - -Reinforcement Learning models are commonly used in behavioral research to model how animals and humans learn, in situtions where they get to make repeated choices that are followed by some form of 
feedback, such as a reward or a punishment. - -In this notebook we will consider the simplest learning scenario, where there are only two possible actions. When an action is taken, it is always followed by an immediate reward. Finally, the outcome of each action is independent from the previous actions taken. This scenario is sometimes referred to as the [multi-armed bandit problem](https://en.wikipedia.org/wiki/Multi-armed_bandit). - - -Let's say that the two actions (e.g., left and right buttons) are associated with a unit reward 40% and 60% of the time, respectively. At the beginning the learning agent does not know which action $a$ is better, so they may start by assuming both actions have a mean value of 50%. We can store these values in a table, which is usually referred to as a $Q$ table: - -$$ Q = \begin{cases} - .5, a = \text{left}\\ - .5, a = \text{right} - \end{cases} -$$ - -When an action is chosen and a reward $r = \{0,1\}$ is observed, the estimated value of that action is updated as follows: - -$$Q_{a} = Q_{a} + \alpha (r - Q_{a})$$ - -where $\alpha \in [0, 1]$ is a learning parameter that influences how much the value of an action is shifted towards the observed reward in each trial. Finally, the $Q$ table values are converted into action probabilities via the softmax transformation: - -$$ P(a = \text{right}) = \frac{\exp(\beta Q_{\text{right}})}{\exp(\beta Q_{\text{right}}) + \exp(\beta Q_{\text{left}})}$$ - -where the $\beta \in (0, +\infty)$ parameter determines the level of noise in the agent choices. Larger values will be associated with more deterministic choices and smaller values with increasingly random choices. - -```{code-cell} ipython3 -:id: QTq-0HMw7dBK - -import aesara -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import scipy - -from matplotlib.lines import Line2D -``` - -```{code-cell} ipython3 -seed = sum(map(ord, "RL_PyMC")) -rng = np.random.default_rng(seed) -az.style.use("arviz-darkgrid") -%config InlineBackend.figure_format = "retina" -``` - -+++ {"id": "aG_Nxvr5wC4B"} - -## Generating fake data - -```{code-cell} ipython3 -:id: hcPVL7kZ8Zs2 - -def generate_data(rng, alpha, beta, n=100, p_r=None): - if p_r is None: - p_r = [0.4, 0.6] - actions = np.zeros(n, dtype="int") - rewards = np.zeros(n, dtype="int") - Qs = np.zeros((n, 2)) - - # Initialize Q table - Q = np.array([0.5, 0.5]) - for i in range(n): - # Apply the Softmax transformation - exp_Q = np.exp(beta * Q) - prob_a = exp_Q / np.sum(exp_Q) - - # Simulate choice and reward - a = rng.choice([0, 1], p=prob_a) - r = rng.random() < p_r[a] - - # Update Q table - Q[a] = Q[a] + alpha * (r - Q[a]) - - # Store values - actions[i] = a - rewards[i] = r - Qs[i] = Q.copy() - - return actions, rewards, Qs -``` - -```{code-cell} ipython3 -:id: ceNagbmsZXW6 - -true_alpha = 0.5 -true_beta = 5 -n = 150 -actions, rewards, Qs = generate_data(rng, true_alpha, true_beta, n) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 208 -id: MDhJI8vOXZeU -outputId: 60f7ee37-2d1f-44ad-afff-b9ba7d82a8d8 -tags: [hide-input] ---- -_, ax = plt.subplots(figsize=(12, 5)) -x = np.arange(len(actions)) - -ax.plot(x, Qs[:, 0] - 0.5 + 0, c="C0", lw=3, alpha=0.3) -ax.plot(x, Qs[:, 1] - 0.5 + 1, c="C1", lw=3, alpha=0.3) - -s = 7 -lw = 2 - -cond = (actions == 0) & (rewards == 0) -ax.plot(x[cond], actions[cond], "o", ms=s, mfc="None", mec="C0", mew=lw) - -cond = (actions == 0) & (rewards == 1) -ax.plot(x[cond], actions[cond], "o", 
ms=s, mfc="C0", mec="C0", mew=lw) - -cond = (actions == 1) & (rewards == 0) -ax.plot(x[cond], actions[cond], "o", ms=s, mfc="None", mec="C1", mew=lw) - -cond = (actions == 1) & (rewards == 1) -ax.plot(x[cond], actions[cond], "o", ms=s, mfc="C1", mec="C1", mew=lw) - -ax.set_yticks([0, 1], ["left", "right"]) -ax.set_ylim(-1, 2) -ax.set_ylabel("action") -ax.set_xlabel("trial") - -reward_artist = Line2D([], [], c="k", ls="none", marker="o", ms=s, mew=lw, label="Reward") -no_reward_artist = Line2D( - [], [], ls="none", marker="o", mfc="w", mec="k", ms=s, mew=lw, label="No reward" -) -Qvalue_artist = Line2D([], [], c="k", ls="-", lw=3, alpha=0.3, label="Qvalue (centered)") - -ax.legend(handles=[no_reward_artist, Qvalue_artist, reward_artist], fontsize=12, loc=(1.01, 0.27)); -``` - -+++ {"id": "6RNLAtqDXgG_"} - -The plot above shows a simulated run of 150 trials, with parameters $\alpha = .5$ and $\beta = 5$, and constant reward probabilities of $.4$ and $.6$ for the left (blue) and right (orange) actions, respectively. - -Solid and empty dots indicate actions followed by rewards and no-rewards, respectively. The solid line shows the estimated $Q$ value for each action centered around the respective colored dots (the line is above its dots when the respective $Q$ value is above $.5$, and below otherwise). It can be seen that this value increases with rewards (solid dots) and decreases with non-rewards (empty dots). - -The change in line height following each outcome is directly related to the $\alpha$ parameter. The influence of the $\beta$ parameter is more difficult to grasp, but one way to think about it is that the higher its value, the more an agent will stick to the action that has the highest estimated value, even if the difference between the two is quite small. Conversely, as this value approaches zero, the agent will start picking randomly between the two actions, regardless of their estimated values. - -+++ {"id": "LUTfha8Hc1ap"} - -## Estimating the learning parameters via Maximum Likelihood - -Having generated the data, the goal is to now 'invert the model' to estimate the learning parameters $\alpha$ and $\beta$. I start by doing it via Maximum Likelihood Estimation (MLE). This requires writing a custom function that computes the likelihood of the data given a potential $\alpha$ and $\beta$ and the fixed observed actions and rewards (actually the function computes the negative log likelihood, in order to avoid underflow issues). - -I employ the handy scipy.optimize.minimize function, to quickly retrieve the values of $\alpha$ and $\beta$ that maximize the likelihood of the data (or actually, minimize the negative log likelihood). - -This was also helpful when I later wrote the Aesara function that computed the choice probabilities in PyMC. First, the underlying logic is the same, the only thing that changes is the syntax. Second, it provides a way to be confident that I did not mess up, and what I was actually computing was what I intended to. 
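Concretely, the quantity minimized below is the negative log likelihood of the observed choices under the softmax model described above,

$$-\log \mathcal{L}(\alpha, \beta \mid a, r) = -\sum_{t} \log P(a_t \mid Q_t(\alpha), \beta),$$

where $Q_t(\alpha)$ is the value table obtained by applying the update rule to the first $t-1$ trials.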
- -```{code-cell} ipython3 -:id: lWGlRE3BjR0E - -def llik_td(x, *args): - # Extract the arguments as they are passed by scipy.optimize.minimize - alpha, beta = x - actions, rewards = args - - # Initialize values - Q = np.array([0.5, 0.5]) - logp_actions = np.zeros(len(actions)) - - for t, (a, r) in enumerate(zip(actions, rewards)): - # Apply the softmax transformation - Q_ = Q * beta - logp_action = Q_ - scipy.special.logsumexp(Q_) - - # Store the log probability of the observed action - logp_actions[t] = logp_action[a] - - # Update the Q values for the next trial - Q[a] = Q[a] + alpha * (r - Q[a]) - - # Return the negative log likelihood of all observed actions - return -np.sum(logp_actions[1:]) -``` - -+++ {"id": "xXZgywFIgz6J"} - -The function `llik_td` is strikingly similar to the `generate_data` one, except that instead of simulating an action and reward in each trial, it stores the log-probability of the observed action. - -The function `scipy.special.logsumexp` is used to compute the term $\log(\exp(\beta Q_{\text{right}}) + \exp(\beta Q_{\text{left}}))$ in a way that is more numerically stable. - -In the end, the function returns the negative sum of all the log probabilities, which is equivalent to multiplying the probabilities in their original scale. - -(The first action is ignored just to make the output comparable to the later Aesara function. It doesn't actually change any estimation, as the initial probabilities are fixed and do not depend on either the $\alpha$ or $\beta$ parameters.) - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 34 -id: -E8B-rrBgy0j -outputId: 7c18b426-8d50-4706-f940-45ec716877f4 ---- -llik_td([true_alpha, true_beta], *(actions, rewards)) -``` - -+++ {"id": "WT2UwuKWvRCq"} - -Above, I computed the negative log likelihood of the data given the true $\alpha$ and $\beta$ parameters. - -Below, I let scipy find the MLE values for the two parameters: - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 260 -id: W1MOBxvw4Zl9 -outputId: 39a73f7a-2362-4ef7-cc03-1e9aeda35ecf ---- -x0 = [true_alpha, true_beta] -result = scipy.optimize.minimize(llik_td, x0, args=(actions, rewards), method="BFGS") -print(result) -print("") -print(f"MLE: alpha = {result.x[0]:.2f} (true value = {true_alpha})") -print(f"MLE: beta = {result.x[1]:.2f} (true value = {true_beta})") -``` - -+++ {"id": "y_cXP93QeVVM"} - -The estimated MLE values are relatively close to the true ones. However, this procedure does not give any idea of the plausible uncertainty around these parameter values. To get that, I'll turn to PyMC for a bayesian posterior estimation. - -But before that, I will implement a simple vectorization optimization to the log-likelihood function that will be more similar to the Aesara counterpart. The reason for this is to speed up the slow bayesian inference engine down the road. 
- -```{code-cell} ipython3 -:id: 4knb5sKW9V66 - -def llik_td_vectorized(x, *args): - # Extract the arguments as they are passed by scipy.optimize.minimize - alpha, beta = x - actions, rewards = args - - # Create a list with the Q values of each trial - Qs = np.ones((n, 2), dtype="float64") - Qs[0] = 0.5 - for t, (a, r) in enumerate( - zip(actions[:-1], rewards[:-1]) - ): # The last Q values were never used, so there is no need to compute them - Qs[t + 1, a] = Qs[t, a] + alpha * (r - Qs[t, a]) - Qs[t + 1, 1 - a] = Qs[t, 1 - a] - - # Apply the softmax transformation in a vectorized way - Qs_ = Qs * beta - logp_actions = Qs_ - scipy.special.logsumexp(Qs_, axis=1)[:, None] - - # Return the logp_actions for the observed actions - logp_actions = logp_actions[np.arange(len(actions)), actions] - return -np.sum(logp_actions[1:]) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 34 -id: w9Z_Ik7AlBQC -outputId: 445a7838-29d0-4f21-bfd8-5b65606af286 ---- -llik_td_vectorized([true_alpha, true_beta], *(actions, rewards)) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 34 -id: bDPZJe7RqCZX -outputId: a90fbb47-ee9b-4390-87ff-f4b39ece8fca ---- -%timeit llik_td([true_alpha, true_beta], *(actions, rewards)) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 34 -id: Dvrqf878swBX -outputId: 94bf3268-0eab-4ce9-deb9-5d1527b3c19d ---- -%timeit llik_td_vectorized([true_alpha, true_beta], *(actions, rewards)) -``` - -+++ {"id": "YAs_zpPZyopT"} - -The vectorized function gives the same results, but runs almost one order of magnitude faster. - -When implemented as an Aesara function, the difference between the vectorized and standard versions was not this drastic. Still, it ran twice as fast, which meant the model also sampled at twice the speed it would otherwise have! - -+++ {"id": "tC7xbCCIL7K4"} - -## Estimating the learning parameters via PyMC - -The most challenging part was to create an Aesara function/loop to estimate the Q values when sampling our parameters with PyMC. - -```{code-cell} ipython3 -:id: u8L_FAB4hle1 - -def update_Q(action, reward, Qs, alpha): - """ - This function updates the Q table according to the RL update rule. 
- It will be called by aesara.scan to do so recursevely, given the observed data and the alpha parameter - This could have been replaced be the following lamba expression in the aesara.scan fn argument: - fn=lamba action, reward, Qs, alpha: at.set_subtensor(Qs[action], Qs[action] + alpha * (reward - Qs[action])) - """ - - Qs = at.set_subtensor(Qs[action], Qs[action] + alpha * (reward - Qs[action])) - return Qs -``` - -```{code-cell} ipython3 -:id: dHzhTy20g4vh - -# Transform the variables into appropriate Aesara objects -rewards_ = at.as_tensor_variable(rewards, dtype="int32") -actions_ = at.as_tensor_variable(actions, dtype="int32") - -alpha = at.scalar("alpha") -beta = at.scalar("beta") - -# Initialize the Q table -Qs = 0.5 * at.ones((2,), dtype="float64") - -# Compute the Q values for each trial -Qs, _ = aesara.scan( - fn=update_Q, sequences=[actions_, rewards_], outputs_info=[Qs], non_sequences=[alpha] -) - -# Apply the softmax transformation -Qs = Qs * beta -logp_actions = Qs - at.logsumexp(Qs, axis=1, keepdims=True) - -# Calculate the negative log likelihod of the observed actions -logp_actions = logp_actions[at.arange(actions_.shape[0] - 1), actions_[1:]] -neg_loglike = -at.sum(logp_actions) -``` - -+++ {"id": "C9Ayn6-kzhPN"} - -Let's wrap it up in a function to test out if it's working as expected. - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 89 -id: g1hkTd75xxwo -outputId: a2310fd3-cac2-48c6-9d22-3c3b72410427 ---- -aesara_llik_td = aesara.function( - inputs=[alpha, beta], outputs=neg_loglike, on_unused_input="ignore" -) -result = aesara_llik_td(true_alpha, true_beta) -float(result) -``` - -+++ {"id": "AmcoU1CF5ix-"} - -The same result is obtained, so we can be confident that the Aesara loop is working as expected. We are now ready to implement the PyMC model. - -```{code-cell} ipython3 -:id: c70L4ZBT7QLr - -def aesara_llik_td(alpha, beta, actions, rewards): - rewards = at.as_tensor_variable(rewards, dtype="int32") - actions = at.as_tensor_variable(actions, dtype="int32") - - # Compute the Qs values - Qs = 0.5 * at.ones((2,), dtype="float64") - Qs, updates = aesara.scan( - fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha] - ) - - # Apply the sotfmax transformation - Qs = Qs[:-1] * beta - logp_actions = Qs - at.logsumexp(Qs, axis=1, keepdims=True) - - # Calculate the log likelihood of the observed actions - logp_actions = logp_actions[at.arange(actions.shape[0] - 1), actions[1:]] - return at.sum(logp_actions) # PyMC expects the standard log-likelihood -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 245 -id: XQNBZLMvAdbo -outputId: 65d7a861-476c-4598-985c-e0b0fcd744c4 ---- -with pm.Model() as m: - alpha = pm.Beta(name="alpha", alpha=1, beta=1) - beta = pm.HalfNormal(name="beta", sigma=10) - - like = pm.Potential(name="like", var=aesara_llik_td(alpha, beta, actions, rewards)) - - tr = pm.sample(random_seed=rng) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 539 -id: vgSumt-oATfN -outputId: eb3348a4-3092-48c8-d8b4-678af0173079 ---- -az.plot_trace(data=tr); -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 408 -id: BL84iT_RAzEL -outputId: dcd4174b-4148-45cb-f72d-973f1487d8c2 ---- -az.plot_posterior(data=tr, ref_val=[true_alpha, true_beta]); -``` - -+++ {"id": "1FtAp76PBLCr"} - -In this example, the obtained posteriors are nicely centered around the MLE values. 
What we have gained is an idea of the plausible uncertainty around these values. - -### Alternative model using Bernoulli for the likelihood - -In this last section I provide an alternative implementation of the model using a Bernoulli likelihood. - -+++ - -:::{Note} -One reason why it's useful to use the Bernoulli likelihood is that one can then do prior and posterior predictive sampling as well as model comparison. With `pm.Potential` you cannot do it, because PyMC does not know what is likelihood and what is prior nor how to generate random draws. Neither of this is a problem when using a `pm.Bernoulli` likelihood. -::: - -```{code-cell} ipython3 -:id: pQdszDk_qYCX - -def right_action_probs(alpha, beta, actions, rewards): - rewards = at.as_tensor_variable(rewards, dtype="int32") - actions = at.as_tensor_variable(actions, dtype="int32") - - # Compute the Qs values - Qs = 0.5 * at.ones((2,), dtype="float64") - Qs, updates = aesara.scan( - fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha] - ) - - # Apply the sotfmax transformation - Qs = Qs[:-1] * beta - logp_actions = Qs - at.logsumexp(Qs, axis=1, keepdims=True) - - # Return the probabilities for the right action, in the original scale - return at.exp(logp_actions[:, 1]) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 121 -id: S55HgqZiTfpa -outputId: a2db2d68-8bf3-4773-8368-5b6dff310e4b ---- -with pm.Model() as m_alt: - alpha = pm.Beta(name="alpha", alpha=1, beta=1) - beta = pm.HalfNormal(name="beta", sigma=10) - - action_probs = right_action_probs(alpha, beta, actions, rewards) - like = pm.Bernoulli(name="like", p=action_probs, observed=actions[1:]) - - tr_alt = pm.sample(random_seed=rng) -``` - -```{code-cell} ipython3 ---- -colab: - base_uri: https://localhost:8080/ - height: 452 -id: zjXW103JiDRQ -outputId: aafc1b1e-082e-414b-cac7-0ad805097057 ---- -az.plot_trace(data=tr_alt); -``` - -```{code-cell} ipython3 -:id: SDJN2w117eox - -az.plot_posterior(data=tr_alt, ref_val=[true_alpha, true_beta]); -``` - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aeppl,xarray -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Credits - -* Authored by [Ricardo Vieira](https://github.com/ricardov94) in June 2022 - - * Adapted PyMC code from Maria Eckstein ([GitHub](https://github.com/MariaEckstein/SLCN), [PyMC Discourse](https://discourse.pymc.io/t/modeling-reinforcement-learning-of-human-participant-using-pymc3/1735)) - - * Adapted MLE code from Robert Wilson and Anne Collins {cite:p}`collinswilson2019` ([GitHub](https://github.com/AnneCollins/TenSimpleRulesModeling)) - -* Re-executed by [Juan Orduz](https://juanitorduz.github.io/) in August 2022 ([pymc-examples#410](https://github.com/pymc-devs/pymc-examples/pull/410)) - -+++ - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/rugby_analytics.myst.md b/myst_nbs/case_studies/rugby_analytics.myst.md deleted file mode 100644 index 211d1212d..000000000 --- a/myst_nbs/case_studies/rugby_analytics.myst.md +++ /dev/null @@ -1,523 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 -substitutions: - extra_dependencies: seaborn numba xarray-einstats ---- - -# A Hierarchical model for Rugby prediction - -:::{post} 19 Mar, 2022 
-:tags: hierarchical model, sports -:category: intermediate, how-to -:author: Peadar Coyle, Meenal Jhajharia, Oriol Abril-Pla -::: - -+++ - -In this example, we're going to reproduce the first model described in {cite:t}`baio2010bayesian` using PyMC. Then show how to sample from the posterior predictive to simulate championship outcomes from the scored goals which are the modeled quantities. - -We apply the results of the paper to the Six Nations Championship, which is a competition between Italy, Ireland, Scotland, England, France and Wales. - -+++ - -## Motivation -Your estimate of the strength of a team depends on your estimates of the other strengths - -Ireland are a stronger team than Italy for example - but by how much? - -Source for Results 2014 are Wikipedia. I've added the subsequent years, 2015, 2016, 2017. Manually pulled from Wikipedia. - -* We want to infer a latent parameter - that is the 'strength' of a team based only on their **scoring intensity**, and all we have are their scores and results, we can't accurately measure the 'strength' of a team. -* Probabilistic Programming is a brilliant paradigm for modeling these **latent** parameters -* Aim is to build a model for the upcoming Six Nations in 2018. - -+++ - -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -!date - -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns - -from matplotlib.ticker import StrMethodFormatter - -%matplotlib inline -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -plt.rcParams["figure.constrained_layout.use"] = False -``` - -This is a Rugby prediction exercise. So we'll input some data. We've taken this from Wikipedia and BBC sports. - -```{code-cell} ipython3 -try: - df_all = pd.read_csv("../data/rugby.csv", index_col=0) -except: - df_all = pd.read_csv(pm.get_data("rugby.csv"), index_col=0) -``` - -## What do we want to infer? - -* We want to infer the latent parameters (every team's strength) that are generating the data we observe (the scorelines). -* Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths. -* Often we don't know what the Bayesian Model is explicitly, so we have to 'estimate' the Bayesian Model' -* If we can't solve something, approximate it. -* Markov-Chain Monte Carlo (MCMC) instead draws samples from the posterior. -* Fortunately, this algorithm can be applied to almost any model. - -## What do we want? - -* We want to quantify our uncertainty -* We want to also use this to generate a model -* We want the answers as distributions not point estimates - -+++ - -### Visualization/EDA -We should do some some exploratory data analysis of this dataset. - -The plots should be fairly self-explantory, we'll look at things like difference between teams in terms of their scores. - -```{code-cell} ipython3 -df_all.describe() -``` - -```{code-cell} ipython3 -# Let's look at the tail end of this dataframe -df_all.tail() -``` - -There are a few things here that we don't need. We don't need the year for our model. -But that is something that could improve a future model. - -Firstly let us look at differences in scores by year. 
- -```{code-cell} ipython3 -df_all["difference"] = np.abs(df_all["home_score"] - df_all["away_score"]) -``` - -```{code-cell} ipython3 -( - df_all.groupby("year")["difference"] - .mean() - .plot( - kind="bar", - title="Average magnitude of scores difference Six Nations", - yerr=df_all.groupby("year")["difference"].std(), - ) - .set_ylabel("Average (abs) point difference") -); -``` - -We can see that the standard error is large. So we can't say anything about the differences. -Let's look country by country. - -```{code-cell} ipython3 -df_all["difference_non_abs"] = df_all["home_score"] - df_all["away_score"] -``` - -Let us first loook at a Pivot table with a sum of this, broken down by year. - -```{code-cell} ipython3 -df_all.pivot_table("difference_non_abs", "home_team", "year") -``` - -Now let's first plot this by home team without year. - -```{code-cell} ipython3 -( - df_all.pivot_table("difference_non_abs", "home_team") - .rename_axis("Home_Team") - .plot(kind="bar", rot=0, legend=False) - .set_ylabel("Score difference Home team and away team") -); -``` - -You can see that Italy and Scotland have negative scores on average. You can also see that England, Ireland and Wales have been the strongest teams lately at home. - -```{code-cell} ipython3 -( - df_all.pivot_table("difference_non_abs", "away_team") - .rename_axis("Away_Team") - .plot(kind="bar", rot=0, legend=False) - .set_ylabel("Score difference Home team and away team") -); -``` - -This indicates that Italy, Scotland and France all have poor away from home form. -England suffers the least when playing away from home. This aggregate view doesn't take into account the strength of the teams. - -+++ - -Let us look a bit more at a timeseries plot of the average of the score difference over the year. - -We see some changes in team behaviour, and we also see that Italy is a poor team. - -```{code-cell} ipython3 -g = sns.FacetGrid(df_all, col="home_team", col_wrap=2, height=5) -g.map(sns.scatterplot, "year", "difference_non_abs") -g.fig.autofmt_xdate() -``` - -```{code-cell} ipython3 -g = sns.FacetGrid(df_all, col="away_team", col_wrap=2, height=5) -g = g.map(plt.scatter, "year", "difference_non_abs").set_axis_labels("Year", "Score Difference") -g.fig.autofmt_xdate() -``` - -You can see some interesting things here like Wales were good away from home in 2015. -In that year they won three games away from home and won by 40 points or so away from home to Italy. - -So now we've got a feel for the data, we can proceed on with describing the model. - -+++ - -### What assumptions do we know for our 'generative story'? - -* We know that the Six Nations in Rugby only has 6 teams - they each play each other once -* We have data from the last few years -* We also know that in sports scoring is modelled as a Poisson distribution -* We consider home advantage to be a strong effect in sports - -+++ - -## The model. - -The league is made up by a total of T= 6 teams, playing each other once -in a season. We indicate the number of points scored by the home and the away team in the g-th game of the season (15 games) as $y_{g1}$ and $y_{g2}$ respectively.

-The vector of observed counts $\mathbb{y} = (y_{g1}, y_{g2})$ is modelled as independent Poisson: -$y_{gi} \mid \theta_{gj} \sim \text{Poisson}(\theta_{gj})$ -where the $\theta$ parameters represent the scoring intensity in the g-th game for the team playing at home ($j=1$) and away ($j=2$), respectively.

- -+++ - -We model these parameters according to a formulation that has been used widely in the statistical literature, assuming a log-linear random effect model: -$$log \theta_{g1} = home + att_{h(g)} + def_{a(g)} $$ -$$log \theta_{g2} = att_{a(g)} + def_{h(g)}$$ - - -* The parameter home represents the advantage for the team hosting the game and we assume that this effect is constant for all the teams and throughout the season -* The scoring intensity is determined jointly by the attack and defense ability of the two teams involved, represented by the parameters att and def, respectively - -* Conversely, for each t = 1, ..., T, the team-specific effects are modelled as exchangeable from a common distribution: - -* $att_{t} \; \tilde\;\; Normal(\mu_{att},\tau_{att})$ and $def_{t} \; \tilde\;\;Normal(\mu_{def},\tau_{def})$ - -* We did some munging above and adjustments of the data to make it **tidier** for our model. -* The log function to away scores and home scores is a standard trick in the sports analytics literature - -+++ - -## Building of the model -We now build the model in PyMC, specifying the global parameters, the team-specific parameters and the likelihood function - -```{code-cell} ipython3 -plt.rcParams["figure.constrained_layout.use"] = True -home_idx, teams = pd.factorize(df_all["home_team"], sort=True) -away_idx, _ = pd.factorize(df_all["away_team"], sort=True) -coords = {"team": teams} -``` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model: - # constant data - home_team = pm.ConstantData("home_team", home_idx, dims="match") - away_team = pm.ConstantData("away_team", away_idx, dims="match") - - # global model parameters - home = pm.Normal("home", mu=0, sigma=1) - sd_att = pm.HalfNormal("sd_att", sigma=2) - sd_def = pm.HalfNormal("sd_def", sigma=2) - intercept = pm.Normal("intercept", mu=3, sigma=1) - - # team-specific model parameters - atts_star = pm.Normal("atts_star", mu=0, sigma=sd_att, dims="team") - defs_star = pm.Normal("defs_star", mu=0, sigma=sd_def, dims="team") - - atts = pm.Deterministic("atts", atts_star - at.mean(atts_star), dims="team") - defs = pm.Deterministic("defs", defs_star - at.mean(defs_star), dims="team") - home_theta = at.exp(intercept + home + atts[home_idx] + defs[away_idx]) - away_theta = at.exp(intercept + atts[away_idx] + defs[home_idx]) - - # likelihood of observed data - home_points = pm.Poisson( - "home_points", - mu=home_theta, - observed=df_all["home_score"], - dims=("match"), - ) - away_points = pm.Poisson( - "away_points", - mu=away_theta, - observed=df_all["away_score"], - dims=("match"), - ) - trace = pm.sample(1000, tune=1500, cores=4) -``` - -* We specified the model and the likelihood function - -* All this runs on an Aesara graph under the hood - -```{code-cell} ipython3 -az.plot_trace(trace, var_names=["intercept", "home", "sd_att", "sd_def"], compact=False); -``` - -Let us apply good *statistical workflow* practices and look at the various evaluation metrics to see if our NUTS sampler converged. - -```{code-cell} ipython3 -az.plot_energy(trace, figsize=(6, 4)); -``` - -```{code-cell} ipython3 -az.summary(trace, kind="diagnostics") -``` - -Our model has converged well and $\hat{R}$ looks good. - -+++ - -Let us look at some of the stats, just to verify that our model has returned the correct attributes. We can see that some teams are stronger than others. 
This is what we would expect with attack - -```{code-cell} ipython3 -trace_hdi = az.hdi(trace) -trace_hdi["atts"] -``` - -```{code-cell} ipython3 -trace.posterior["atts"].median(("chain", "draw")) -``` - -## Results -From the above we can start to understand the different distributions of attacking strength and defensive strength. -These are probabilistic estimates and help us better understand the uncertainty in sports analytics - -```{code-cell} ipython3 -_, ax = plt.subplots(figsize=(12, 6)) - -ax.scatter(teams, trace.posterior["atts"].median(dim=("chain", "draw")), color="C0", alpha=1, s=100) -ax.vlines( - teams, - trace_hdi["atts"].sel({"hdi": "lower"}), - trace_hdi["atts"].sel({"hdi": "higher"}), - alpha=0.6, - lw=5, - color="C0", -) -ax.set_xlabel("Teams") -ax.set_ylabel("Posterior Attack Strength") -ax.set_title("HDI of Team-wise Attack Strength"); -``` - -This is one of the powerful things about Bayesian modelling, we can have *uncertainty quantification* of some of our estimates. -We've got a Bayesian credible interval for the attack strength of different countries. - -We can see an overlap between Ireland, Wales and England which is what you'd expect since these teams have won in recent years. - -Italy is well behind everyone else - which is what we'd expect and there's an overlap between Scotland and France which seems about right. - -There are probably some effects we'd like to add in here, like weighting more recent results more strongly. -However that'd be a much more complicated model. - -```{code-cell} ipython3 -# subclass arviz labeller to omit the variable name -class TeamLabeller(az.labels.BaseLabeller): - def make_label_flat(self, var_name, sel, isel): - sel_str = self.sel_to_str(sel, isel) - return sel_str -``` - -```{code-cell} ipython3 -ax = az.plot_forest(trace, var_names=["atts"], labeller=TeamLabeller()) -ax[0].set_title("Team Offense"); -``` - -```{code-cell} ipython3 -ax = az.plot_forest(trace, var_names=["defs"], labeller=TeamLabeller()) -ax[0].set_title("Team Defense"); -``` - -Good teams like Ireland and England have a strong negative effect defense. Which is what we expect. We expect our strong teams to have strong positive effects in attack and strong negative effects in defense. - -+++ - -This approach that we're using of looking at parameters and examining them is part of a good statistical workflow. -We also think that perhaps our priors could be better specified. However this is beyond the scope of this article. -We recommend for a good discussion of 'statistical workflow' you visit [Robust Statistical Workflow with RStan](http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html) - -+++ - -Let's do some other plots. So we can see our range for our defensive effect. -I'll print the teams below too just for reference - -```{code-cell} ipython3 -az.plot_posterior(trace, var_names=["defs"]); -``` - -We can see that Ireland's mean is -0.39 which means we expect Ireland to have a strong defense. -Which is what we'd expect, Ireland generally even in games it loses doesn't lose by say 50 points. -And we can see that the 94% HDI is between -0.491, and -0.28 - -In comparison with Italy, we see a strong positive effect 0.58 mean and a HDI of 0.51 and 0.65. This means that we'd expect Italy to concede a lot of points, compared to what it scores. -Given that Italy often loses by 30 - 60 points, this seems correct. - -We see here also that this informs what other priors we could bring into this. We could bring some sort of world ranking as a prior. 
- -As of December 2017 the [rugby rankings](https://www.worldrugby.org/rankings/mru) indicate that England is 2nd in the world, Ireland 3rd, Scotland 5th, Wales 7th, France 9th and Italy 14th. We could bring that into a model and it can explain some of the fact that Italy is apart from a lot of the other teams. - -+++ - -Now let's simulate who wins over a total of 4000 simulations, one per sample in the posterior. - -```{code-cell} ipython3 -with model: - pm.sample_posterior_predictive(trace, extend_inferencedata=True) -pp = trace.posterior_predictive -const = trace.constant_data -team_da = trace.posterior.team -``` - -The posterior predictive samples contain the goals scored by each team in each match. We modeled and therefore simulated according to scoring and devensive powers using goals as observed variable. - -Our goal now is to see who wins the competition, so we can estimate the probability each team has of winning the whole competition. From that we need to convert the scored goals to points: - -```{code-cell} ipython3 -# fmt: off -pp["home_win"] = ( - (pp["home_points"] > pp["away_points"]) * 3 # home team wins and gets 3 points - + (pp["home_points"] == pp["away_points"]) * 2 # tie -> home team gets 2 points -) -pp["away_win"] = ( - (pp["home_points"] < pp["away_points"]) * 3 - + (pp["home_points"] == pp["away_points"]) * 2 -) -# fmt: on -``` - -Then add the points each team has collected throughout all matches: - -```{code-cell} ipython3 -groupby_sum_home = pp.home_win.groupby(team_da[const.home_team]).sum() -groupby_sum_away = pp.away_win.groupby(team_da[const.away_team]).sum() - -pp["teamscores"] = groupby_sum_home + groupby_sum_away -``` - -And eventually generate the ranks of all teams for each of the 4000 simulations. As our data is stored in xarray objects inside the InferenceData class, we will use {doc}`einstats:index`: - -```{code-cell} ipython3 -from xarray_einstats.stats import rankdata - -pp["rank"] = rankdata(-pp["teamscores"], dims="team", method="min") -pp[["rank"]].sel(team="England") -``` - -As you can see, we now have a collection of 4000 integers between 1 and 6 for each team, 1 meaning they win the competition. We can use a histogram with bin edges at half integers to count and normalize how many times each team -finishes in each position: - -```{code-cell} ipython3 -from xarray_einstats.numba import histogram - -bin_edges = np.arange(7) + 0.5 -data_sim = ( - histogram(pp["rank"], dims=("chain", "draw"), bins=bin_edges, density=True) - .rename({"bin": "rank"}) - .assign_coords(rank=np.arange(6) + 1) -) -``` - -Now that we have reduced the data to a 2 dimensional array, we will convert it to a pandas DataFrame -which is now a more adequate choice to work with our data: - -```{code-cell} ipython3 -idx_dim, col_dim = data_sim.dims -sim_table = pd.DataFrame(data_sim, index=data_sim[idx_dim], columns=data_sim[col_dim]) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 4)) -ax = sim_table.T.plot(kind="barh", ax=ax) -ax.xaxis.set_major_formatter(StrMethodFormatter("{x:.1%}")) -ax.set_xlabel("Rank-wise Probability of results for all six teams") -ax.set_yticklabels(np.arange(1, 7)) -ax.set_ylabel("Ranks") -ax.invert_yaxis() -ax.legend(loc="best", fontsize="medium"); -``` - -We see according to this model that Ireland finishes with the most points about 60% of the time, and England finishes with the most points 45% of the time and Wales finishes with the most points about 10% of the time. 
(Note that these probabilities do not sum to 100% since there is a non-zero chance of a tie atop the table.) - -> As an Irish rugby fan - I like this model. However it indicates some problems with shrinkage, and bias. Since recent form suggests England will win. - -Nevertheless the point of this model was to illustrate how a Hierarchical model could be applied to a sports analytics problem, and illustrate the power of PyMC. - -+++ - -## Covariates -We should do some exploration of the variables - -```{code-cell} ipython3 -az.plot_pair( - trace, - var_names=["atts"], - kind="scatter", - divergences=True, - textsize=25, - marginals=True, -), -figsize = (10, 10) -``` - -We observe that there isn't a lot of correlation between these covariates, other than the weaker teams like Italy have a more negative distribution of these variables. -Nevertheless this is a good method to get some insight into how the variables are behaving. - -+++ - -## Authors - -* Adapted [Daniel Weitzenfeld's](http://danielweitzenfeld.github.io/passtheroc/blog/2014/10/28/bayes-premier-league/) blog post by [Peadar Coyle](). The original blog post was based on the work of {cite:p}`baio2010bayesian` -* Updated by Meenal Jhajharia to use ArviZ and xarray -* Updated by Oriol Abril-Pla to use PyMC v4 and xarray-einstats - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray,aeppl,numba,xarray_einstats -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/case_studies/spline.myst.md b/myst_nbs/case_studies/spline.myst.md deleted file mode 100644 index a3259874d..000000000 --- a/myst_nbs/case_studies/spline.myst.md +++ /dev/null @@ -1,297 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(spline)= -# Splines - -:::{post} June 4, 2022 -:tags: patsy, regression, spline -:category: beginner -:author: Joshua Cook -::: - -+++ {"tags": []} - -## Introduction - -Often, the model we want to fit is not a perfect line between some $x$ and $y$. -Instead, the parameters of the model are expected to vary over $x$. -There are multiple ways to handle this situation, one of which is to fit a *spline*. -Spline fit is effectively a sum of multiple individual curves (piecewise polynomials), each fit to a different section of $x$, that are tied together at their boundaries, often called *knots*. - -The spline is effectively multiple individual lines, each fit to a different section of $x$, that are tied together at their boundaries, often called *knots*. - -Below is a full working example of how to fit a spline using PyMC. The data and model are taken from [*Statistical Rethinking* 2e](https://xcelab.net/rm/statistical-rethinking/) by [Richard McElreath's](https://xcelab.net/rm/) {cite:p}`mcelreath2018statistical`. - -For more information on this method of non-linear modeling, I suggesting beginning with [chapter 5 of Bayesian Modeling and Computation in Python](https://bayesiancomputationbook.com/markdown/chp_05.html) {cite:p}`martin2021bayesian`. 
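Before loading the real data, it may help to see a toy version of the design matrix that `patsy` will produce. This is purely an illustrative sketch with made-up `x` values and knots, not the cherry blossom data used below:

```python
import numpy as np
from patsy import dmatrix

# Toy example: 3 interior knots on x in [0, 10] give a small cubic B-spline basis
x_toy = np.linspace(0, 10, 7)
toy_knots = [2.5, 5.0, 7.5]
B_toy = dmatrix(
    "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
    {"x": x_toy, "knots": toy_knots},
)
# Each column is one local basis function ("bump") covering part of the x range;
# the regression below estimates one weight per column.
print(np.asarray(B_toy).round(2))
```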
- -```{code-cell} ipython3 -from pathlib import Path - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from patsy import dmatrix -``` - -```{code-cell} ipython3 -%matplotlib inline -%config InlineBackend.figure_format = "retina" - -RANDOM_SEED = 8927 -az.style.use("arviz-darkgrid") -``` - -## Cherry blossom data - -The data for this example is the number of days (`doy` for "days of year") that the cherry trees were in bloom in each year (`year`). -For convenience, years missing a `doy` were dropped (which is a bad idea to deal with missing data in general!). - -```{code-cell} ipython3 -try: - blossom_data = pd.read_csv(Path("..", "data", "cherry_blossoms.csv"), sep=";") -except FileNotFoundError: - blossom_data = pd.read_csv(pm.get_data("cherry_blossoms.csv"), sep=";") - - -blossom_data.dropna().describe() -``` - -```{code-cell} ipython3 -blossom_data = blossom_data.dropna(subset=["doy"]).reset_index(drop=True) -blossom_data.head(n=10) -``` - -After dropping rows with missing data, there are 827 years with the numbers of days in which the trees were in bloom. - -```{code-cell} ipython3 -blossom_data.shape -``` - -If we visualize the data, it is clear that there a lot of annual variation, but some evidence for a non-linear trend in bloom days over time. - -```{code-cell} ipython3 -blossom_data.plot.scatter( - "year", "doy", color="cornflowerblue", s=10, title="Cherry Blossom Data", ylabel="Days in bloom" -); -``` - -+++ {"tags": []} - -## The model - -We will fit the following model. - -$D \sim \mathcal{N}(\mu, \sigma)$ -$\quad \mu = a + Bw$ -$\qquad a \sim \mathcal{N}(100, 10)$ -$\qquad w \sim \mathcal{N}(0, 10)$ -$\quad \sigma \sim \text{Exp}(1)$ - -The number of days of bloom $D$ will be modeled as a normal distribution with mean $\mu$ and standard deviation $\sigma$. In turn, the mean will be a linear model composed of a y-intercept $a$ and spline defined by the basis $B$ multiplied by the model parameter $w$ with a variable for each region of the basis. Both have relatively weak normal priors. - -### Prepare the spline - -The spline will have 15 *knots*, splitting the year into 16 sections (including the regions covering the years before and after those in which we have data). The knots are the boundaries of the spline, the name owing to how the individual lines will be tied together at these boundaries to make a continuous and smooth curve. The knots will be unevenly spaced over the years such that each region will have the same proportion of data. - -```{code-cell} ipython3 -num_knots = 15 -knot_list = np.quantile(blossom_data.year, np.linspace(0, 1, num_knots)) -knot_list -``` - -Below is a plot of the locations of the knots over the data. - -```{code-cell} ipython3 -blossom_data.plot.scatter( - "year", "doy", color="cornflowerblue", s=10, title="Cherry Blossom Data", ylabel="Day of Year" -) -for knot in knot_list: - plt.gca().axvline(knot, color="grey", alpha=0.4); -``` - -We can use `patsy` to create the matrix $B$ that will be the b-spline basis for the regression. -The degree is set to 3 to create a cubic b-spline. - -```{code-cell} ipython3 -:tags: [hide-output] - -B = dmatrix( - "bs(year, knots=knots, degree=3, include_intercept=True) - 1", - {"year": blossom_data.year.values, "knots": knot_list[1:-1]}, -) -B -``` - -The b-spline basis is plotted below, showing the *domain* of each piece of the spline. 
The height of each curve indicates how influential the corresponding model covariate (one per spline region) will be on model's inference of that region. The overlapping regions represent the knots, showing how the smooth transition from one region to the next is formed. - -```{code-cell} ipython3 -spline_df = ( - pd.DataFrame(B) - .assign(year=blossom_data.year.values) - .melt("year", var_name="spline_i", value_name="value") -) - -color = plt.cm.magma(np.linspace(0, 0.80, len(spline_df.spline_i.unique()))) - -fig = plt.figure() -for i, c in enumerate(color): - subset = spline_df.query(f"spline_i == {i}") - subset.plot("year", "value", c=c, ax=plt.gca(), label=i) -plt.legend(title="Spline Index", loc="upper center", fontsize=8, ncol=6); -``` - -### Fit the model - -Finally, the model can be built using PyMC. A graphical diagram shows the organization of the model parameters (note that this requires the installation of `python-graphviz`, which I recommend doing in a `conda` virtual environment). - -```{code-cell} ipython3 -COORDS = {"splines": np.arange(B.shape[1])} -with pm.Model(coords=COORDS) as spline_model: - a = pm.Normal("a", 100, 5) - w = pm.Normal("w", mu=0, sigma=3, size=B.shape[1], dims="splines") - mu = pm.Deterministic("mu", a + pm.math.dot(np.asarray(B, order="F"), w.T)) - sigma = pm.Exponential("sigma", 1) - D = pm.Normal("D", mu=mu, sigma=sigma, observed=blossom_data.doy, dims="obs") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(spline_model) -``` - -```{code-cell} ipython3 -with spline_model: - idata = pm.sample_prior_predictive() - idata.extend(pm.sample(draws=1000, tune=1000, random_seed=RANDOM_SEED, chains=4)) - pm.sample_posterior_predictive(idata, extend_inferencedata=True) -``` - -## Analysis - -Now we can analyze the draws from the posterior of the model. - -### Parameter Estimates - -Below is a table summarizing the posterior distributions of the model parameters. -The posteriors of $a$ and $\sigma$ are quite narrow while those for $w$ are wider. -This is likely because all of the data points are used to estimate $a$ and $\sigma$ whereas only a subset are used for each value of $w$. -(It could be interesting to model these hierarchically allowing for the sharing of information and adding regularization across the spline.) -The effective sample size and $\widehat{R}$ values all look good, indicating that the model has converged and sampled well from the posterior distribution. - -```{code-cell} ipython3 -az.summary(idata, var_names=["a", "w", "sigma"]) -``` - -The trace plots of the model parameters look good (homogeneous and no sign of trend), further indicating that the chains converged and mixed. - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["a", "w", "sigma"]); -``` - -```{code-cell} ipython3 -az.plot_forest(idata, var_names=["w"], combined=False, r_hat=True); -``` - -Another visualization of the fit spline values is to plot them multiplied against the basis matrix. -The knot boundaries are shown as vertical lines again, but now the spline basis is multiplied against the values of $w$ (represented as the rainbow-colored curves). The dot product of $B$ and $w$ – the actual computation in the linear model – is shown in black. 
- -```{code-cell} ipython3 -wp = idata.posterior["w"].mean(("chain", "draw")).values - -spline_df = ( - pd.DataFrame(B * wp.T) - .assign(year=blossom_data.year.values) - .melt("year", var_name="spline_i", value_name="value") -) - -spline_df_merged = ( - pd.DataFrame(np.dot(B, wp.T)) - .assign(year=blossom_data.year.values) - .melt("year", var_name="spline_i", value_name="value") -) - - -color = plt.cm.rainbow(np.linspace(0, 1, len(spline_df.spline_i.unique()))) -fig = plt.figure() -for i, c in enumerate(color): - subset = spline_df.query(f"spline_i == {i}") - subset.plot("year", "value", c=c, ax=plt.gca(), label=i) -spline_df_merged.plot("year", "value", c="black", lw=2, ax=plt.gca()) -plt.legend(title="Spline Index", loc="lower center", fontsize=8, ncol=6) - -for knot in knot_list: - plt.gca().axvline(knot, color="grey", alpha=0.4); -``` - -### Model predictions - -Lastly, we can visualize the predictions of the model using the posterior predictive check. - -```{code-cell} ipython3 -post_pred = az.summary(idata, var_names=["mu"]).reset_index(drop=True) -blossom_data_post = blossom_data.copy().reset_index(drop=True) -blossom_data_post["pred_mean"] = post_pred["mean"] -blossom_data_post["pred_hdi_lower"] = post_pred["hdi_3%"] -blossom_data_post["pred_hdi_upper"] = post_pred["hdi_97%"] -``` - -```{code-cell} ipython3 -blossom_data.plot.scatter( - "year", - "doy", - color="cornflowerblue", - s=10, - title="Cherry blossom data with posterior predictions", - ylabel="Days in bloom", -) -for knot in knot_list: - plt.gca().axvline(knot, color="grey", alpha=0.4) - -blossom_data_post.plot("year", "pred_mean", ax=plt.gca(), lw=3, color="firebrick") -plt.fill_between( - blossom_data_post.year, - blossom_data_post.pred_hdi_lower, - blossom_data_post.pred_hdi_upper, - color="firebrick", - alpha=0.4, -); -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors - -- Created by Joshua Cook -- Updated by Tyler James Burch -- Updated by Chris Fonnesbeck - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,xarray,patsy -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/stochastic_volatility.myst.md b/myst_nbs/case_studies/stochastic_volatility.myst.md deleted file mode 100644 index 1409c8f17..000000000 --- a/myst_nbs/case_studies/stochastic_volatility.myst.md +++ /dev/null @@ -1,220 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc_env - language: python - name: pymc_env ---- - -(stochastic_volatility)= -# Stochastic Volatility model - -:::{post} June 17, 2022 -:tags: time series, case study -:category: beginner -:author: John Salvatier -::: - -```{code-cell} ipython3 -import os - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -rng = np.random.RandomState(1234) -az.style.use("arviz-darkgrid") -``` - -Asset prices have time-varying volatility (variance of day over day `returns`). In some periods, returns are highly variable, while in others very stable. Stochastic volatility models model this with a latent volatility variable, modeled as a stochastic process. The following model is similar to the one described in the No-U-Turn Sampler paper, {cite:p}`hoffman2014nuts`. 
- -$$ \sigma \sim Exponential(50) $$ - -$$ \nu \sim Exponential(.1) $$ - -$$ s_i \sim Normal(s_{i-1}, \sigma^{-2}) $$ - -$$ \log(r_i) \sim t(\nu, 0, \exp(-2 s_i)) $$ - -Here, $r$ is the daily return series and $s$ is the latent log volatility process. - -+++ - -## Build Model - -+++ - -First we load daily returns of the S&P 500, and calculate the daily log returns. This data is from May 2008 to November 2019. - -```{code-cell} ipython3 -try: - returns = pd.read_csv(os.path.join("..", "data", "SP500.csv"), index_col="Date") -except FileNotFoundError: - returns = pd.read_csv(pm.get_data("SP500.csv"), index_col="Date") - -returns["change"] = np.log(returns["Close"]).diff() -returns = returns.dropna() - -returns.head() -``` - -As you can see, the volatility seems to change over time quite a bit but cluster around certain time-periods. For example, the 2008 financial crisis is easy to pick out. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(14, 4)) -returns.plot(y="change", label="S&P 500", ax=ax) -ax.set(xlabel="time", ylabel="returns") -ax.legend(); -``` - -Specifying the model in `PyMC` mirrors its statistical specification. - -```{code-cell} ipython3 -def make_stochastic_volatility_model(data): - with pm.Model(coords={"time": data.index.values}) as model: - step_size = pm.Exponential("step_size", 10) - volatility = pm.GaussianRandomWalk("volatility", sigma=step_size, dims="time") - nu = pm.Exponential("nu", 0.1) - returns = pm.StudentT( - "returns", nu=nu, lam=np.exp(-2 * volatility), observed=data["change"], dims="time" - ) - return model - - -stochastic_vol_model = make_stochastic_volatility_model(returns) -``` - -## Checking the model - -Two good things to do to make sure our model is what we expect is to -1. Take a look at the model structure. This lets us know we specified the priors we wanted and the connections we wanted. It is also handy to remind ourselves of the size of the random variables. -2. Take a look at the prior predictive samples. This helps us interpret what our priors imply about the data. - -```{code-cell} ipython3 -pm.model_to_graphviz(stochastic_vol_model) -``` - -```{code-cell} ipython3 -with stochastic_vol_model: - idata = pm.sample_prior_predictive(500, random_seed=rng) - -prior_predictive = idata.prior_predictive.stack(pooled_chain=("chain", "draw")) -``` - -We plot and inspect the prior predictive. This is *many* orders of magnitude larger than the actual returns we observed. In fact, I cherry-picked a few draws to keep the plot from looking silly. This may suggest changing our priors: a return that our model considers plausible would violate all sorts of constraints by a huge margin: the total value of all goods and services the world produces is ~$\$10^9$, so we might reasonably *not* expect any returns above that magnitude. - -That said, we get somewhat reasonable results fitting this model anyways, and it is standard, so we leave it as is. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(14, 4)) -returns["change"].plot(ax=ax, lw=1, color="black") -ax.plot( - prior_predictive["returns"].isel(pooled_chain=slice(4, 6, None)), - "g", - alpha=0.5, - lw=1, - zorder=-10, -) - -max_observed, max_simulated = np.max(np.abs(returns["change"])), np.max( - np.abs(prior_predictive["returns"].values) -) -ax.set_title(f"Maximum observed: {max_observed:.2g}\nMaximum simulated: {max_simulated:.2g}(!)"); -``` - -## Fit Model - -+++ - -Once we are happy with our model, we can sample from the posterior. 
This is a somewhat tricky model to fit even with NUTS, so we sample and tune a little longer than default. - -```{code-cell} ipython3 -with stochastic_vol_model: - idata.extend(pm.sample(2000, tune=2000, random_seed=rng)) - -posterior = idata.posterior.stack(pooled_chain=("chain", "draw")) -posterior["exp_volatility"] = np.exp(posterior["volatility"]) -``` - -```{code-cell} ipython3 -with stochastic_vol_model: - idata.extend(pm.sample_posterior_predictive(idata, random_seed=rng)) - -posterior_predictive = idata.posterior_predictive.stack(pooled_chain=("chain", "draw")) -``` - -Note that the `step_size` parameter does not look perfect: the different chains look somewhat different. This again indicates some weakness in our model: it may make sense to allow the step_size to change over time, especially over this 11 year time span. - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["step_size", "nu"]); -``` - -Now we can look at our posterior estimates of the volatility in S&P 500 returns over time. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(14, 4)) - -y_vals = posterior["exp_volatility"].isel(pooled_chain=slice(None, None, 5)) -x_vals = y_vals.time.astype(np.datetime64) - -plt.plot(x_vals, y_vals, "k", alpha=0.002) -ax.set_xlim(x_vals.min(), x_vals.max()) -ax.set_ylim(bottom=0) -ax.set(title="Estimated volatility over time", xlabel="Date", ylabel="Volatility"); -``` - -Finally, we can use the posterior predictive distribution to see the how the learned volatility could have effected returns. - -```{code-cell} ipython3 -fig, axes = plt.subplots(nrows=2, figsize=(14, 7), sharex=True) -returns["change"].plot(ax=axes[0], color="black") - -axes[1].plot(posterior["exp_volatility"].isel(pooled_chain=slice(None, None, 100)), "r", alpha=0.5) -axes[0].plot( - posterior_predictive["returns"].isel(pooled_chain=slice(None, None, 100)), - "g", - alpha=0.5, - zorder=-10, -) -axes[0].set_title("True log returns (black) and posterior predictive log returns (green)") -axes[1].set_title("Posterior volatility"); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors - -+++ - -* Written by John Salvatier -* Updated by Kyle Meyer -* Updated by Thomas Wiecki -* Updated by Chris Fonnesbeck -* Updated by Aaron Maxwell on May 18, 2018 ([pymc#2978](https://github.com/pymc-devs/pymc/pull/2978)) -* Updated by Colin Carroll on November 16, 2019 ([pymc#3682](https://github.com/pymc-devs/pymc/pull/3682)) -* Updated by Abhipsha Das on July 24, 2021 ([pymc-examples#155](https://github.com/pymc-devs/pymc-examples/pull/155)) -* Updated by Michael Osthege on June 1, 2022 ([pymc-examples#343](https://github.com/pymc-devs/pymc-examples/pull/343)) -* Updated by Christopher Krapu on June 17, 2022 ([pymc-examples#378](https://github.com/pymc-devs/pymc-examples/pull/378)) - -+++ - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/case_studies/wrapping_jax_function.myst.md b/myst_nbs/case_studies/wrapping_jax_function.myst.md deleted file mode 100644 index add11f1c0..000000000 --- a/myst_nbs/case_studies/wrapping_jax_function.myst.md +++ /dev/null @@ -1,767 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-examples - language: python - name: pymc-examples -substitutions: - 
extra_dependencies: jax numpyro ---- - -(wrapping_jax_function)= -# How to wrap a JAX function for use in PyMC - -:::{post} Mar 24, 2022 -:tags: Aesara, hidden markov model, JAX -:category: advanced, how-to -:author: Ricardo Vieira -::: - -```{code-cell} ipython3 -import aesara -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm - -from aesara.graph import Apply, Op -``` - -```{code-cell} ipython3 -RANDOM_SEED = 104109109 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -import jax -import jax.numpy as jnp -import jax.scipy as jsp -import pymc.sampling_jax - -from aesara.link.jax.dispatch import jax_funcify -``` - -## Intro: Aesara and its backends - -PyMC uses the {doc}`Aesara ` library to create and manipulate probabilistic graphs. Aesara is backend-agnostic, meaning it can make use of functions written in different languages or frameworks, including pure Python, NumPy, C, Cython, Numba, and [JAX](https://jax.readthedocs.io/en/latest/index.html). - -All that is needed is to encapsulate such function in a Aesara {class}`~aesara.graph.op.Op`, which enforces a specific API regarding how inputs and outputs of pure "operations" should be handled. It also implements methods for optional extra functionality like symbolic shape inference and automatic differentiation. This is well covered in the Aesara {ref}`Op documentation ` and in our {ref}`blackbox_external_likelihood_numpy` pymc-example. - -More recently, Aesara became capable of compiling directly to some of these languages/frameworks, meaning that we can convert a complete Aesara graph into a JAX or NUMBA jitted function, whereas traditionally they could only be converted to Python or C. - -This has some interesting uses, such as sampling models defined in PyMC with pure JAX samplers, like those implemented in [NumPyro](https://num.pyro.ai/en/latest/index.html) or [BlackJax](https://github.com/blackjax-devs/blackjax). - -This notebook illustrates how we can implement a new Aesara {class}`~aesara.graph.op.Op` that wraps a JAX function. - -### Outline - -1. We start in a similar path as that taken in the {ref}`blackbox_external_likelihood_numpy`, which wraps a NumPy function in a Aesara {class}`~aesara.graph.op.Op`, this time wrapping a JAX jitted function instead. -2. We then enable Aesara to "unwrap" the just wrapped JAX function, so that the whole graph can be compiled to JAX. We make use of this to sample our PyMC model via the JAX NumPyro NUTS sampler. - -+++ - -## A motivating example: marginal HMM - -+++ - -For illustration purposes, we will simulate data following a simple [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (HMM), with 3 possible latent states $S \in \{0, 1, 2\}$ and normal emission likelihood. - -$$Y \sim \text{Normal}((S + 1) \cdot \text{signal}, \text{noise})$$ - -Our HMM will have a fixed Categorical probability $P$ of switching across states, which depends only on the last state - -$$S_{t+1} \sim \text{Categorical}(P_{S_t})$$ - -To complete our model, we assume a fixed probability $P_{t0}$ for each possible initial state $S_{t0}$, - -$$S_{t0} \sim \text{Categorical}(P_{t0})$$ - - -### Simulating data -Let's generate data according to this model! 
The first step is to set some values for the parameters in our model - -```{code-cell} ipython3 -# Emission signal and noise parameters -emission_signal_true = 1.15 -emission_noise_true = 0.15 - -p_initial_state_true = np.array([0.9, 0.09, 0.01]) - -# Probability of switching from state_t to state_t+1 -p_transition_true = np.array( - [ - # 0, 1, 2 - [0.9, 0.09, 0.01], # 0 - [0.1, 0.8, 0.1], # 1 - [0.2, 0.1, 0.7], # 2 - ] -) - -# Confirm that we have defined valid probabilities -assert np.isclose(np.sum(p_initial_state_true), 1) -assert np.allclose(np.sum(p_transition_true, axis=-1), 1) -``` - -```{code-cell} ipython3 -# Let's compute the log of the probalitiy transition matrix for later use -with np.errstate(divide="ignore"): - logp_initial_state_true = np.log(p_initial_state_true) - logp_transition_true = np.log(p_transition_true) - -logp_initial_state_true, logp_transition_true -``` - -```{code-cell} ipython3 -# We will observe 70 HMM processes, each with a total of 50 steps -n_obs = 70 -n_steps = 50 -``` - -We write a helper function to generate a single HMM process and create our simulated data - -```{code-cell} ipython3 -def simulate_hmm(p_initial_state, p_transition, emission_signal, emission_noise, n_steps, rng): - """Generate hidden state and emission from our HMM model.""" - - possible_states = np.array([0, 1, 2]) - - hidden_states = [] - initial_state = rng.choice(possible_states, p=p_initial_state) - hidden_states.append(initial_state) - for step in range(n_steps): - new_hidden_state = rng.choice(possible_states, p=p_transition[hidden_states[-1]]) - hidden_states.append(new_hidden_state) - hidden_states = np.array(hidden_states) - - emissions = rng.normal( - (hidden_states + 1) * emission_signal, - emission_noise, - ) - - return hidden_states, emissions -``` - -```{code-cell} ipython3 -single_hmm_hidden_state, single_hmm_emission = simulate_hmm( - p_initial_state_true, - p_transition_true, - emission_signal_true, - emission_noise_true, - n_steps, - rng, -) -print(single_hmm_hidden_state) -print(np.round(single_hmm_emission, 2)) -``` - -```{code-cell} ipython3 -hidden_state_true = [] -emission_observed = [] - -for i in range(n_obs): - hidden_state, emission = simulate_hmm( - p_initial_state_true, - p_transition_true, - emission_signal_true, - emission_noise_true, - n_steps, - rng, - ) - hidden_state_true.append(hidden_state) - emission_observed.append(emission) - -hidden_state = np.array(hidden_state_true) -emission_observed = np.array(emission_observed) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True) -# Plot first five hmm processes -for i in range(4): - ax[0].plot(hidden_state_true[i] + i * 0.02, color=f"C{i}", lw=2, alpha=0.4) - ax[1].plot(emission_observed[i], color=f"C{i}", lw=2, alpha=0.4) -ax[0].set_yticks([0, 1, 2]) -ax[0].set_ylabel("hidden state") -ax[1].set_ylabel("observed emmission") -ax[1].set_xlabel("step") -fig.suptitle("Simulated data"); -``` - -The figure above shows the hidden state and respective observed emission of our simulated data. Later, we will use this data to perform posterior inferences about the true model parameters. - -+++ - -## Computing the marginal HMM likelihood using JAX - -+++ - -We will write a JAX function to compute the likelihood of our HMM model, marginalizing over the hidden states. This allows for more efficient sampling of the remaining model parameters. 
To achieve this, we will use the well known [forward algorithm](https://en.wikipedia.org/wiki/Forward_algorithm), working on the log scale for numerical stability. - -We will take advantage of JAX [scan](https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.scan.html) to obtain an efficient and differentiable log-likelihood, and the handy [vmap](https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap) to automatically vectorize this log-likelihood across multiple observed processes. - -+++ - -Our core JAX function computes the marginal log-likelihood of a single HMM process - -```{code-cell} ipython3 -def hmm_logp( - emission_observed, - emission_signal, - emission_noise, - logp_initial_state, - logp_transition, -): - """Compute the marginal log-likelihood of a single HMM process.""" - - hidden_states = np.array([0, 1, 2]) - - # Compute log-likelihood of observed emissions for each (step x possible hidden state) - logp_emission = jsp.stats.norm.logpdf( - emission_observed[:, None], - (hidden_states + 1) * emission_signal, - emission_noise, - ) - - # We use the forward_algorithm to compute log_alpha(x_t) = logp(x_t, y_1:t) - log_alpha = logp_initial_state + logp_emission[0] - log_alpha, _ = jax.lax.scan( - f=lambda log_alpha_prev, logp_emission: ( - jsp.special.logsumexp(log_alpha_prev + logp_transition.T, axis=-1) + logp_emission, - None, - ), - init=log_alpha, - xs=logp_emission[1:], - ) - - return jsp.special.logsumexp(log_alpha) -``` - -Let's test it with the true parameters and the first simulated HMM process - -```{code-cell} ipython3 -hmm_logp( - emission_observed[0], - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -``` - -We now use vmap to vectorize the core function across multiple observations. - -```{code-cell} ipython3 -def vec_hmm_logp(*args): - vmap = jax.vmap( - hmm_logp, - # Only the first argument, needs to be vectorized - in_axes=(0, None, None, None, None), - ) - # For simplicity we sum across all the HMM processes - return jnp.sum(vmap(*args)) - - -# We jit it for better performance! -jitted_vec_hmm_logp = jax.jit(vec_hmm_logp) -``` - -Passing a row matrix with only the first simulated HMM process should return the same result - -```{code-cell} ipython3 -jitted_vec_hmm_logp( - emission_observed[0][None, :], - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -``` - -Our goal is, however, to compute the joint log-likelihood for all the simulated data - -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -jitted_vec_hmm_logp( - emission_observed, - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -``` - -We will also ask JAX to give us the function of the gradients with respect to each input. This will come in handy later. - -```{code-cell} ipython3 -jitted_vec_hmm_logp_grad = jax.jit(jax.grad(vec_hmm_logp, argnums=list(range(5)))) -``` - -Let's print out the gradient with respect to `emission_signal`. We will check this value is unchanged after we wrap our function in Aesara. - -```{code-cell} ipython3 -jitted_vec_hmm_logp_grad( - emission_observed, - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -)[1] -``` - -## Wrapping the JAX function in Aesara - -+++ - -Now we are ready to wrap our JAX jitted function in a Aesara {class}`~aesara.graph.op.Op`, that we can then use in our PyMC models. 
We recommend you check Aesara's official {ref}`Op documentation ` if you want to understand it in more detail. - -In brief, we will inherit from {class}`~aesara.graph.op.Op` and define the following methods: -1. `make_node`: Creates an {class}`~aesara.graph.basic.Apply` node that holds together the symbolic inputs and outputs of our operation -2. `perform`: Python code that returns the evaluation of our operation, given concrete input values -3. `grad`: Returns a Aesara symbolic graph that represents the gradient expression of an output cost wrt to its inputs - -For the `grad` we will create a second {class}`~aesara.graph.op.Op` that wraps our jitted grad version from above - -```{code-cell} ipython3 -class HMMLogpOp(Op): - def make_node( - self, - emission_observed, - emission_signal, - emission_noise, - logp_initial_state, - logp_transition, - ): - # Convert our inputs to symbolic variables - inputs = [ - at.as_tensor_variable(emission_observed), - at.as_tensor_variable(emission_signal), - at.as_tensor_variable(emission_noise), - at.as_tensor_variable(logp_initial_state), - at.as_tensor_variable(logp_transition), - ] - # Define the type of the output returned by the wrapped JAX function - outputs = [at.dscalar()] - return Apply(self, inputs, outputs) - - def perform(self, node, inputs, outputs): - result = jitted_vec_hmm_logp(*inputs) - # Aesara raises an error if the dtype of the returned output is not - # exactly the one expected from the Apply node (in this case - # `dscalar`, which stands for float64 scalar), so we make sure - # to convert to the expected dtype. To avoid unnecessary conversions - # you should make sure the expected output defined in `make_node` - # is already of the correct dtype - outputs[0][0] = np.asarray(result, dtype=node.outputs[0].dtype) - - def grad(self, inputs, output_gradients): - ( - grad_wrt_emission_obsered, - grad_wrt_emission_signal, - grad_wrt_emission_noise, - grad_wrt_logp_initial_state, - grad_wrt_logp_transition, - ) = hmm_logp_grad_op(*inputs) - # If there are inputs for which the gradients will never be needed or cannot - # be computed, `aesara.gradient.grad_not_implemented` should be used as the - # output gradient for that input. - output_gradient = output_gradients[0] - return [ - output_gradient * grad_wrt_emission_obsered, - output_gradient * grad_wrt_emission_signal, - output_gradient * grad_wrt_emission_noise, - output_gradient * grad_wrt_logp_initial_state, - output_gradient * grad_wrt_logp_transition, - ] - - -class HMMLogpGradOp(Op): - def make_node( - self, - emission_observed, - emission_signal, - emission_noise, - logp_initial_state, - logp_transition, - ): - inputs = [ - at.as_tensor_variable(emission_observed), - at.as_tensor_variable(emission_signal), - at.as_tensor_variable(emission_noise), - at.as_tensor_variable(logp_initial_state), - at.as_tensor_variable(logp_transition), - ] - # This `Op` will return one gradient per input. For simplicity, we assume - # each output is of the same type as the input. 
In practice, you should use - # the exact dtype to avoid overhead when saving the results of the computation - # in `perform` - outputs = [inp.type() for inp in inputs] - return Apply(self, inputs, outputs) - - def perform(self, node, inputs, outputs): - ( - grad_wrt_emission_obsered_result, - grad_wrt_emission_signal_result, - grad_wrt_emission_noise_result, - grad_wrt_logp_initial_state_result, - grad_wrt_logp_transition_result, - ) = jitted_vec_hmm_logp_grad(*inputs) - outputs[0][0] = np.asarray(grad_wrt_emission_obsered_result, dtype=node.outputs[0].dtype) - outputs[1][0] = np.asarray(grad_wrt_emission_signal_result, dtype=node.outputs[1].dtype) - outputs[2][0] = np.asarray(grad_wrt_emission_noise_result, dtype=node.outputs[2].dtype) - outputs[3][0] = np.asarray(grad_wrt_logp_initial_state_result, dtype=node.outputs[3].dtype) - outputs[4][0] = np.asarray(grad_wrt_logp_transition_result, dtype=node.outputs[4].dtype) - - -# Initialize our `Op`s -hmm_logp_op = HMMLogpOp() -hmm_logp_grad_op = HMMLogpGradOp() -``` - -We recommend using the debug helper `eval` method to confirm we specified everything correctly. We should get the same outputs as before: - -```{code-cell} ipython3 -hmm_logp_op( - emission_observed, - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -).eval() -``` - -```{code-cell} ipython3 -hmm_logp_grad_op( - emission_observed, - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -)[1].eval() -``` - -+++ {"pycharm": {"name": "#%% md\n"}} - -It's also useful to check the gradient of our {class}`~aesara.graph.op.Op` can be requested via the Aesara `grad` interface: - -```{code-cell} ipython3 -# We define the symbolic `emission_signal` variable outside of the `Op` -# so that we can request the gradient wrt to it -emission_signal_variable = at.as_tensor_variable(emission_signal_true) -x = hmm_logp_op( - emission_observed, - emission_signal_variable, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -x_grad_wrt_emission_signal = at.grad(x, wrt=emission_signal_variable) -x_grad_wrt_emission_signal.eval() -``` - -### Sampling with PyMC - -+++ - -We are now ready to make inferences about our HMM model with PyMC. We will define priors for each model parameter and use {class}`~pymc.Potential` to add the joint log-likelihood term to our model. - -```{code-cell} ipython3 -with pm.Model(rng_seeder=int(rng.integers(2**30))) as model: - emission_signal = pm.Normal("emission_signal", 0, 1) - emission_noise = pm.HalfNormal("emission_noise", 1) - - p_initial_state = pm.Dirichlet("p_initial_state", np.ones(3)) - logp_initial_state = at.log(p_initial_state) - - p_transition = pm.Dirichlet("p_transition", np.ones(3), size=3) - logp_transition = at.log(p_transition) - - loglike = pm.Potential( - "hmm_loglike", - hmm_logp_op( - emission_observed, - emission_signal, - emission_noise, - logp_initial_state, - logp_transition, - ), - ) -``` - -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -pm.model_to_graphviz(model) -``` - -Before we start sampling, we check the logp of each variable at the model initial point. Bugs tend to manifest themselves in the form of `nan` or `-inf` for the initial probabilities. - -```{code-cell} ipython3 -initial_point = model.compute_initial_point() -initial_point -``` - -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -model.point_logps(initial_point) -``` - -We are now ready to sample! 
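-
-As one last optional check before sampling (an addition to the original example, assuming the {meth}`~pymc.Model.compile_dlogp` helper available in the PyMC version used here), we can also confirm that the gradient of the joint log-probability is finite at the initial point. This is worth doing here because we implemented the `grad` of our wrapped JAX function ourselves.
-
-```{code-cell} ipython3
-# Optional sanity check: compile the gradient of the model's joint log-probability
-# and evaluate it at the initial point. Finite values suggest the custom `grad`
-# of our wrapped JAX `Op` is wired up correctly.
-dlogp_fn = model.compile_dlogp()
-np.all(np.isfinite(dlogp_fn(initial_point)))
-```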
- -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -with model: - idata = pm.sample(chains=2, cores=1) -``` - -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -az.plot_trace(idata); -``` - -```{code-cell} ipython3 -true_values = [ - emission_signal_true, - emission_noise_true, - *p_initial_state_true, - *p_transition_true.ravel(), -] - -az.plot_posterior(idata, ref_val=true_values, grid=(3, 5)); -``` - -The posteriors look reasonably centered around the true values used to generate our data. - -+++ - -## Unwrapping the wrapped JAX function - -+++ - -As mentioned in the beginning, Aesara can compile an entire graph to JAX. To do this, it needs to know how each {class}`~aesara.graph.op.Op` in the graph can be converted to a JAX function. This can be done by {term}`dispatch ` with {func}`aesara.link.jax.dispatch.jax_funcify`. Most of the default Aesara {class}`~aesara.graph.op.Op`s already have such a dispatch function, but we will need to add a new one for our custom `HMMLogpOp`, as Aesara has never seen that before. - -For that we need a function which returns (another) JAX function, that performs the same computation as in our `perform` method. Fortunately, we started exactly with such function, so this amounts to 3 short lines of code. - -```{code-cell} ipython3 -@jax_funcify.register(HMMLogpOp) -def hmm_logp_dispatch(op, **kwargs): - return vec_hmm_logp -``` - -:::{note} -We do not return the jitted function, so that the entire Aesara graph can be jitted together after being converted to JAX. -::: - -For a better understanding of {class}`~aesara.graph.op.Op` JAX conversions, we recommend reading Aesara's {doc}`Adding JAX and Numba support for Ops guide `. - -We can test that our conversion function is working properly by compiling a {func}`aesara.function` with `mode="JAX"`: - -```{code-cell} ipython3 -out = hmm_logp_op( - emission_observed, - emission_signal_true, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -jax_fn = aesara.function(inputs=[], outputs=out, mode="JAX") -jax_fn() -``` - -We can also compile a JAX function that computes the log probability of each variable in our PyMC model, similar to {meth}`~pymc.Model.point_logps`. We will use the helper method {meth}`~pymc.Model.compile_fn`. - -```{code-cell} ipython3 -model_logp_jax_fn = model.compile_fn(model.logpt(sum=False), mode="JAX") -model_logp_jax_fn(initial_point) -``` - -Note that we could have added an equally simple function to convert our `HMMLogpGradOp`, in case we wanted to convert Aesara gradient graphs to JAX. In our case, we don't need to do this because we will rely on JAX `grad` function (or more precisely, NumPyro will rely on it) to obtain these again from our compiled JAX function. - -We include a {ref}`short discussion ` at the end of this document, to help you better understand the trade-offs between working with Aesara graphs vs JAX functions, and when you might want to use one or the other. - -+++ - -### Sampling with NumPyro - -+++ - -Now that we know our model logp can be entirely compiled to JAX, we can use the handy {func}`pymc.sampling_jax.sample_numpyro_nuts` to sample our model using the pure JAX sampler implemented in NumPyro. 
-
-```{code-cell} ipython3
-with model:
-    idata_numpyro = pm.sampling_jax.sample_numpyro_nuts(chains=2, progress_bar=False)
-```
-
-```{code-cell} ipython3
-az.plot_trace(idata_numpyro);
-```
-
-```{code-cell} ipython3
-az.plot_posterior(idata_numpyro, ref_val=true_values, grid=(3, 5));
-```
-
-As expected, sampling results look pretty similar!
-
-Depending on the model and computer architecture you are using, a pure JAX sampler can provide considerable speedups.
-
-+++
-
-(aesara_vs_jax)=
-## Some brief notes on using Aesara vs JAX
-
-+++
-
-### When should you use JAX?
-
-+++
-
-As we have seen, it is pretty straightforward to interface between Aesara graphs and JAX functions.
-
-This can be very handy when you want to combine previously implemented JAX functions with PyMC models. We used a marginalized HMM log-likelihood in this example, but the same strategy could be used to do Bayesian inference with Deep Neural Networks or Differential Equations, or pretty much any other function implemented in JAX that can be used in the context of a Bayesian model.
-
-It can also be worth it if you need to make use of JAX's unique features, like vectorization, support for tree structures, fine-grained parallelization, and GPU and TPU capabilities.
-
-+++
-
-### When should you not use JAX?
-
-+++
-
-Like JAX, Aesara has the goal of mimicking the NumPy and Scipy APIs, so that writing code in Aesara should feel very similar to how code is written in those libraries.
-
-There are, however, some advantages to working with Aesara:
-
-1. Aesara graphs are considerably easier to {ref}`inspect and debug ` than JAX functions
-2. Aesara has clever {ref}`optimization and stabilization routines ` that are not possible or implemented in JAX
-3. Aesara graphs can be easily {ref}`manipulated after creation `
-
-Point 2 means your graphs are likely to perform better if written in Aesara. In general you don't have to worry about using specialized functions like `log1p` or `logsumexp`, as Aesara will be able to detect the equivalent naive expressions and replace them with their specialized counterparts. Importantly, you still benefit from these optimizations when your graph is later compiled to JAX.
-
-The catch is that Aesara cannot reason about JAX functions, and by association {class}`~aesara.graph.op.Op`s that wrap them. This means that the larger the portion of the graph that is "hidden" inside a JAX function, the less a user will benefit from Aesara's rewrite and debugging abilities.
-
-Point 3 is more important for library developers. It is the main reason why PyMC developers opted to use Aesara (and before that, its predecessor Theano) as PyMC's backend. Many of the user-facing utilities provided by PyMC rely on the ability to easily parse and manipulate Aesara graphs.
-
-+++
-
-## Bonus: Using a single Op that can compute its own gradients
-
-+++
-
-We had to create two {class}`~aesara.graph.op.Op`s, one for the function we cared about and a separate one for its gradients. However, JAX provides a `value_and_grad` utility that can return both the value of a function and its gradients. We can do something similar and get away with a single {class}`~aesara.graph.op.Op` if we are clever about it.
-
-By doing this we can (potentially) save memory and reuse computation that is shared between the function and its gradients. This may be relevant when working with very large JAX functions.
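-
-To make the `value_and_grad` behaviour concrete, here is a small toy check (an addition to this example; `toy_logp` is just an illustrative function, and we reuse the `jax` and `jnp` imports from above). A single call returns both the function value and the gradients with respect to the requested arguments, so shared computation does not need to be repeated.
-
-```{code-cell} ipython3
-def toy_logp(a, b):
-    # a deliberately simple scalar function of two array inputs
-    return jnp.sum(-((a - b) ** 2))
-
-
-toy_value_and_grad = jax.jit(jax.value_and_grad(toy_logp, argnums=(0, 1)))
-value, (grad_a, grad_b) = toy_value_and_grad(jnp.array([1.0, 2.0]), jnp.array([0.5, 0.5]))
-value, grad_a, grad_b
-```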
- -Note that this is only useful if you are interested in taking gradients with respect to your {class}`~aesara.graph.op.Op` using Aesara. If your endgoal is to compile your graph to JAX, and only then take the gradients (as NumPyro does), then it's better to use the first approach. You don't even need to implement the `grad` method and associated {class}`~aesara.graph.op.Op` in that case. - -```{code-cell} ipython3 ---- -pycharm: - name: '#%% - - ' ---- -jitted_hmm_logp_value_and_grad = jax.jit(jax.value_and_grad(vec_hmm_logp, argnums=list(range(5)))) -``` - -```{code-cell} ipython3 -class HmmLogpValueGradOp(Op): - # By default only show the first output, and "hide" the other ones - default_output = 0 - - def make_node(self, *inputs): - inputs = [at.as_tensor_variable(inp) for inp in inputs] - # We now have one output for the function value, and one output for each gradient - outputs = [at.dscalar()] + [inp.type() for inp in inputs] - return Apply(self, inputs, outputs) - - def perform(self, node, inputs, outputs): - result, grad_results = jitted_hmm_logp_value_and_grad(*inputs) - outputs[0][0] = np.asarray(result, dtype=node.outputs[0].dtype) - for i, grad_result in enumerate(grad_results, start=1): - outputs[i][0] = np.asarray(grad_result, dtype=node.outputs[i].dtype) - - def grad(self, inputs, output_gradients): - # The `Op` computes its own gradients, so we call it again. - value = self(*inputs) - # We hid the gradient outputs by setting `default_update=0`, but we - # can retrieve them anytime by accessing the `Apply` node via `value.owner` - gradients = value.owner.outputs[1:] - - # Make sure the user is not trying to take the gradient with respect to - # the gradient outputs! That would require computing the second order - # gradients - assert all( - isinstance(g.type, aesara.gradient.DisconnectedType) for g in output_gradients[1:] - ) - - return [output_gradients[0] * grad for grad in gradients] - - -hmm_logp_value_grad_op = HmmLogpValueGradOp() -``` - -We check again that we can take the gradient using Aesara `grad` interface - -```{code-cell} ipython3 -emission_signal_variable = at.as_tensor_variable(emission_signal_true) -# Only the first output is assigned to the variable `x`, due to `default_output=0` -x = hmm_logp_value_grad_op( - emission_observed, - emission_signal_variable, - emission_noise_true, - logp_initial_state_true, - logp_transition_true, -) -at.grad(x, emission_signal_variable).eval() -``` - -## Authors - -+++ - -Authored by [Ricardo Vieira](https://github.com/ricardoV94/) in March 24, 2022 ([pymc-examples#302](https://github.com/pymc-devs/pymc-examples/pull/302)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/causal_inference/difference_in_differences.myst.md b/myst_nbs/causal_inference/difference_in_differences.myst.md deleted file mode 100644 index 0bceba1a1..000000000 --- a/myst_nbs/causal_inference/difference_in_differences.myst.md +++ /dev/null @@ -1,452 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc_env - language: python - name: pymc_env ---- - -(difference_in_differences)= -# Difference in differences - -:::{post} Sept, 2022 -:tags: counterfactuals, causal inference, time series, regression, posterior predictive, difference in differences, quasi experiments, panel data 
-:category: intermediate
-:author: Benjamin T. Vincent
-:::
-
-```{code-cell} ipython3
-import arviz as az
-import matplotlib.pyplot as plt
-import numpy as np
-import pandas as pd
-import pymc as pm
-import seaborn as sns
-```
-
-```{code-cell} ipython3
-%config InlineBackend.figure_format = 'retina'
-RANDOM_SEED = 8927
-rng = np.random.default_rng(RANDOM_SEED)
-az.style.use("arviz-darkgrid")
-```
-
-## Introduction
-
-This notebook provides a brief overview of the difference in differences approach to causal inference, and shows a working example of how to conduct this type of analysis under the Bayesian framework, using PyMC. While this notebook provides a high-level overview of the approach, I recommend consulting two excellent textbooks on causal inference. Both [The Effect](https://theeffectbook.net/) {cite:p}`huntington2021effect` and [Causal Inference: The Mixtape](https://mixtape.scunning.com) {cite:p}`cunningham2021causal` have chapters devoted to difference in differences.
-
-[Difference in differences](https://en.wikipedia.org/wiki/Difference_in_differences) would be a good approach to take for causal inference if:
-* you want to know the causal impact of a treatment/intervention
-* you have pre and post treatment measures
-* you have both a treatment and a control group
-* the treatment was _not_ allocated by randomisation, that is, you are in a [quasi-experimental](https://en.wikipedia.org/wiki/Quasi-experiment) setting.
-
-Otherwise, there are likely better-suited approaches you could use.
-
-Note that our desire to estimate the causal impact of a treatment involves [counterfactual thinking](https://en.wikipedia.org/wiki/Counterfactual_thinking). This is because we are asking "What would the post-treatment outcome of the treatment group be _if_ treatment had not been administered?" but we can never observe this.
-
-+++
-
-### Example
-
-A classic example is given by a study by {cite:t}`card1993minimum`. This study examined the effects of increasing the minimum wage upon employment in the fast food sector. This is a quasi-experimental setting because the intervention (increase in minimum wages) was not applied to different geographical units (e.g. states) randomly. The intervention was applied to New Jersey in April 1992. If they had measured pre- and post-intervention employment rates in New Jersey only, then they would have failed to control for omitted variables changing over time (e.g. seasonal effects) which could provide alternative causal explanations for changes in employment rates. But selecting a control state (Pennsylvania) allows one to infer that changes in employment in Pennsylvania would match the counterfactual - what _would have happened if_ New Jersey had not received the intervention?
-
-+++
-
-### Causal DAG
-
-The causal DAG for difference in differences is given below. It says:
-* Treatment status of an observation is causally influenced by group and time. Note that treatment and group are different things. Group is either experimental or control, but the experimental group is only 'treated' after the intervention time, hence treatment status depends on both group and time.
-* The outcome measured is causally influenced by time, group, and treatment.
-* No additional causal influences are considered.
-
-We are primarily interested in the effect of the treatment upon the outcome and how this changes over time (pre to post treatment). If we only focused on treatment, time and outcome on the treatment group (i.e. did 
not have a control group), then we would be unable to attribute changes in the outcome to the treatment rather than any number of other factors occurring over time to the treatment group. Another way of saying this is that treatment would be fully determined by time, so there is no way to disambiguate the changes in the pre and post outcome measures as being caused by treatment or time. - -![](DAG_difference_in_differences.png) - -But by adding a control group, we are able to compare the changes in time of the control group and the changes in time of the treatment group. One of the key assumptions in the difference in differences approach is the _parallel trends assumption_ - that both groups change in similar ways over time. Another way of saying this is that _if_ the control and treatment groups change in similar ways over time, then we can be fairly convinced that difference in differences in groups over time is due to the treatment. - -+++ - -### Define the difference in differences model - -**Note:** I'm defining this model slightly differently compared to what you might find in other sources. This is to facilitate counterfactual inference later on in the notebook, and to emphasise the assumptions about trends over continuous time. - -First, let's define a Python function to calculate the expected value of the outcome: - -```{code-cell} ipython3 -def outcome(t, control_intercept, treat_intercept_delta, trend, Δ, group, treated): - return control_intercept + (treat_intercept_delta * group) + (t * trend) + (Δ * treated * group) -``` - -But we should take a closer look at this with mathematical notation. The expected value of the $i^{th}$ observation is $\mu_i$ and is defined by: - -$$ -\mu_i = \beta_{c} - + (\beta_{\Delta} \cdot \mathrm{group}_i) - + (\mathrm{trend} \cdot t_i) - + (\Delta \cdot \mathrm{treated}_i \cdot \mathrm{group}_i) -$$ - -where there are the following parameters: -* $\beta_c$ is the intercept for the control group -* $\beta_{\Delta}$ is a deflection of the treatment group intercept from the control group intercept -* $\Delta$ is the causal impact of the treatment -* $\mathrm{trend}$ is the slope, and a core assumption of the model is that the slopes are identical for both groups - -and the following observed data: -* $t_i$ is time, scaled conveniently so that the pre-intervention measurement time is at $t=0$ and the post-intervention measurement time is $t=1$ -* $\mathrm{group}_i$ is a dummy variable for control ($g=0$) or treatment ($g=1$) group -* $\mathrm{treated}_i$ is a binary indicator variable for untreated or treated. And this is function of both time and group: $\mathrm{treated}_i = f(t_i, \mathrm{group}_i)$. - -We can underline this latter point that treatment is causally influenced by time and group by looking at the DAG above, and by writing a Python function to define this function. - -```{code-cell} ipython3 -def is_treated(t, intervention_time, group): - return (t > intervention_time) * group -``` - -### Visualise the difference in differences model -Very often a picture is worth a thousand words, so if the description above was confusing, then I'd recommend re-reading it after getting some more visual intuition from the plot below. 
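-
-Before plotting, we can also run a quick numerical sanity check (an addition to the original notebook, using arbitrary made-up parameter values rather than the "true" values defined below): evaluating the `outcome` function at the four design cells shows that the difference in differences recovers the treatment effect $\Delta$ exactly.
-
-```{code-cell} ipython3
-# arbitrary, made-up parameter values used only for this arithmetic check
-beta_c, beta_delta, trend_check, delta_check = 2.0, 0.5, 1.5, 0.7
-t_pre, t_post, t_intervention = 0.0, 1.0, 0.5
-
-# expected outcome in each of the four design cells: (group, time)
-cells = {
-    (g, t): outcome(
-        t,
-        beta_c,
-        beta_delta,
-        trend_check,
-        delta_check,
-        group=g,
-        treated=is_treated(t, t_intervention, g),
-    )
-    for g in (0, 1)
-    for t in (t_pre, t_post)
-}
-
-did = (cells[(1, t_post)] - cells[(1, t_pre)]) - (cells[(0, t_post)] - cells[(0, t_pre)])
-did  # equals delta_check
-```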
- -```{code-cell} ipython3 -# true parameters -control_intercept = 1 -treat_intercept_delta = 0.25 -trend = 1 -Δ = 0.5 -intervention_time = 0.5 -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots() -ti = np.linspace(-0.5, 1.5, 1000) -ax.plot( - ti, - outcome( - ti, - control_intercept, - treat_intercept_delta, - trend, - Δ=0, - group=1, - treated=is_treated(ti, intervention_time, group=1), - ), - color="blue", - label="counterfactual", - ls=":", -) -ax.plot( - ti, - outcome( - ti, - control_intercept, - treat_intercept_delta, - trend, - Δ, - group=1, - treated=is_treated(ti, intervention_time, group=1), - ), - color="blue", - label="treatment group", -) -ax.plot( - ti, - outcome( - ti, - control_intercept, - treat_intercept_delta, - trend, - Δ, - group=0, - treated=is_treated(ti, intervention_time, group=0), - ), - color="C1", - label="control group", -) -ax.axvline(x=intervention_time, ls="-", color="r", label="treatment time", lw=3) -t = np.array([0, 1]) -ax.plot( - t, - outcome( - t, - control_intercept, - treat_intercept_delta, - trend, - Δ, - group=1, - treated=is_treated(t, intervention_time, group=1), - ), - "o", - color="blue", -) -ax.plot( - t, - outcome( - t, - control_intercept, - treat_intercept_delta, - trend, - Δ=0, - group=0, - treated=is_treated(t, intervention_time, group=0), - ), - "o", - color="C1", -) -ax.set( - xlabel="time", - ylabel="metric", - xticks=t, - xticklabels=["pre", "post"], - title="Difference in Differences", -) -ax.legend(); -``` - -So we can summarise the intuition of difference in differences by looking at this plot: -* We assume that the treatment and control groups are evolving over time in a similar manner. -* We can easily estimate the slope of the control group from pre to post treatment. -* We can engage in counterfactual thinking and can ask: "What would the post-treatment outcome of the treatment group be _if_ they had not been treated?" - -If we can answer that question and estimate this counterfactual quantity, then we can ask: "What is the causal impact of the treatment?" And we can answer this question by comparing the observed post treatment outcome of the treatment group against the counterfactual quantity. - -We can think about this visually and state another way... By looking at the pre/post difference in the control group, we can attribute any differences in the pre/post differences of the control and treatment groups to the causal effect of the treatment. And that is why the method is called difference in differences. - -+++ - -## Generate a synthetic dataset - -```{code-cell} ipython3 -df = pd.DataFrame( - { - "group": [0, 0, 1, 1] * 10, - "t": [0.0, 1.0, 0.0, 1.0] * 10, - "unit": np.concatenate([[i] * 2 for i in range(20)]), - } -) - -df["treated"] = is_treated(df["t"], intervention_time, df["group"]) - -df["y"] = outcome( - df["t"], - control_intercept, - treat_intercept_delta, - trend, - Δ, - df["group"], - df["treated"], -) -df["y"] += rng.normal(0, 0.1, df.shape[0]) -df.head() -``` - -So we see that we have [panel data](https://en.wikipedia.org/wiki/Panel_data) with just two points in time: the pre ($t=0$) and post ($t=1$) intervention measurement times. - -```{code-cell} ipython3 -sns.lineplot(df, x="t", y="y", hue="group", units="unit", estimator=None) -sns.scatterplot(df, x="t", y="y", hue="group"); -``` - -If we wanted, we could calculate a point estimate of the difference in differences (in a non-regression approach) like this. 
- -```{code-cell} ipython3 -diff_control = ( - df.loc[(df["t"] == 1) & (df["group"] == 0)]["y"].mean() - - df.loc[(df["t"] == 0) & (df["group"] == 0)]["y"].mean() -) -print(f"Pre/post difference in control group = {diff_control:.2f}") - -diff_treat = ( - df.loc[(df["t"] == 1) & (df["group"] == 1)]["y"].mean() - - df.loc[(df["t"] == 0) & (df["group"] == 1)]["y"].mean() -) - -print(f"Pre/post difference in treatment group = {diff_treat:.2f}") - -diff_in_diff = diff_treat - diff_control -print(f"Difference in differences = {diff_in_diff:.2f}") -``` - -But hang on, we are Bayesians! Let's Bayes... - -+++ - -## Bayesian difference in differences - -+++ - -### PyMC model -For those already well-versed in PyMC, you can see that this model is pretty simple. We just have a few components: -* Define data nodes. This is optional, but useful later when we run posterior predictive checks and counterfactual inference -* Define priors -* Evaluate the model expectation using the `outcome` function that we already defined above -* Define a normal likelihood distribution. - -```{code-cell} ipython3 -with pm.Model() as model: - # data - t = pm.MutableData("t", df["t"].values, dims="obs_idx") - treated = pm.MutableData("treated", df["treated"].values, dims="obs_idx") - group = pm.MutableData("group", df["group"].values, dims="obs_idx") - # priors - _control_intercept = pm.Normal("control_intercept", 0, 5) - _treat_intercept_delta = pm.Normal("treat_intercept_delta", 0, 1) - _trend = pm.Normal("trend", 0, 5) - _Δ = pm.Normal("Δ", 0, 1) - sigma = pm.HalfNormal("sigma", 1) - # expectation - mu = pm.Deterministic( - "mu", - outcome(t, _control_intercept, _treat_intercept_delta, _trend, _Δ, group, treated), - dims="obs_idx", - ) - # likelihood - pm.Normal("obs", mu, sigma, observed=df["y"].values, dims="obs_idx") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -### Inference - -```{code-cell} ipython3 -with model: - idata = pm.sample() -``` - -```{code-cell} ipython3 -az.plot_trace(idata, var_names="~mu"); -``` - -### Posterior prediction -NOTE: Technically we are doing 'pushforward prediction' for $\mu$ as this is a deterministic function of it's inputs. Posterior prediction would be a more appropriate label if we generated predicted observations - these would be stochastic based on the normal likelihood we've specified for our data. Nevertheless, this section is called 'posterior prediction' to emphasise the fact that we are following the Bayesian workflow. - -```{code-cell} ipython3 -# pushforward predictions for control group -with model: - group_control = [0] * len(ti) # must be integers - treated = [0] * len(ti) # must be integers - pm.set_data({"t": ti, "group": group_control, "treated": treated}) - ppc_control = pm.sample_posterior_predictive(idata, var_names=["mu"]) - -# pushforward predictions for treatment group -with model: - group = [1] * len(ti) # must be integers - pm.set_data( - { - "t": ti, - "group": group, - "treated": is_treated(ti, intervention_time, group), - } - ) - ppc_treatment = pm.sample_posterior_predictive(idata, var_names=["mu"]) - -# counterfactual: what do we predict of the treatment group (after the intervention) if -# they had _not_ been treated? 
-t_counterfactual = np.linspace(intervention_time, 1.5, 100) -with model: - group = [1] * len(t_counterfactual) # must be integers - pm.set_data( - { - "t": t_counterfactual, - "group": group, - "treated": [0] * len(t_counterfactual), # THIS IS OUR COUNTERFACTUAL - } - ) - ppc_counterfactual = pm.sample_posterior_predictive(idata, var_names=["mu"]) -``` - -## Wrapping up -We can plot what we've learnt below: - -```{code-cell} ipython3 -:tags: [hide-input] - -ax = sns.scatterplot(df, x="t", y="y", hue="group") - -az.plot_hdi( - ti, - ppc_control.posterior_predictive["mu"], - smooth=False, - ax=ax, - color="blue", - fill_kwargs={"label": "control HDI"}, -) -az.plot_hdi( - ti, - ppc_treatment.posterior_predictive["mu"], - smooth=False, - ax=ax, - color="C1", - fill_kwargs={"label": "treatment HDI"}, -) -az.plot_hdi( - t_counterfactual, - ppc_counterfactual.posterior_predictive["mu"], - smooth=False, - ax=ax, - color="C2", - fill_kwargs={"label": "counterfactual"}, -) -ax.axvline(x=intervention_time, ls="-", color="r", label="treatment time", lw=3) -ax.set( - xlabel="time", - ylabel="metric", - xticks=[0, 1], - xticklabels=["pre", "post"], - title="Difference in Differences", -) -ax.legend(); -``` - -This is an awesome plot, but there are quite a few things going on here, so let's go through it: -* Blue shaded region represents credible regions for the expected value of the control group -* Orange shaded region represents similar regions for the treatment group. We can see how the outcome jumps immediately after the intervention. -* The green shaded region is something pretty novel, and nice. This represents our counterfactual inference of _what we would expect if_ the treatment group were never given the treatment. By definition, we never made any observations of items in the treatment group that were not treated after the intervention time. Nevertheless, with the model described at the top of the notebook and the Bayesian inference methods outlined, we can reason about such _what if_ questions. -* The difference between this counterfactual expectation and the observed values (post treatment in the treatment condition) represents our inferred causal impact of the treatment. Let's take a look at that posterior distribution in more detail: - -```{code-cell} ipython3 -ax = az.plot_posterior(idata.posterior["Δ"], ref_val=Δ, figsize=(10, 3)) -ax.set(title=r"Posterior distribution of causal impact of treatment, $\Delta$"); -``` - -So there we have it, we have a full posterior distribution over our estimated causal impact using the difference in differences approach. - -+++ - -## Summary -Of course, when using the difference in differences approach for real applications, there is a lot more due diligence that's needed. Readers are encouraged to check out the textbooks listed above in the introduction as well as a useful review paper {cite:p}`wing2018designing` which covers the important contextual issues in more detail. Additionally, {cite:t}`bertrand2004much` takes a skeptical look at the approach as well as proposing solutions to some of the problems they highlight. - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -- Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in Sept 2022 ([#424](https://github.com/pymc-devs/pymc-examples/pull/424)). 
- -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/causal_inference/excess_deaths.myst.md b/myst_nbs/causal_inference/excess_deaths.myst.md deleted file mode 100644 index 00767c924..000000000 --- a/myst_nbs/causal_inference/excess_deaths.myst.md +++ /dev/null @@ -1,503 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc_env - language: python - name: pymc_env ---- - -(excess_deaths)= -# Counterfactual inference: calculating excess deaths due to COVID-19 - -:::{post} July, 2022 -:tags: counterfactuals, causal inference, time series, case study, Bayesian workflow, forecasting, causal impact, regression, posterior predictive -:category: intermediate -:author: Benjamin T. Vincent -::: - -Causal reasoning and counterfactual thinking are really interesting but complex topics! Nevertheless, we can make headway into understanding the ideas through relatively simple examples. This notebook focuses on the concepts and the practical implementation of Bayesian causal reasoning using PyMC. - -We do this using the sobering but important example of calculating excess deaths due to COVID-19. As such, the ideas in this notebook strongly overlap with Google's [CausalImpact](https://google.github.io/CausalImpact/CausalImpact.html) (see {cite:t}`google_causal_impact2015`). Practically, we will try to estimate the number of 'excess deaths' since the onset of COVID-19, using data from England and Wales. Excess deaths are defined as: - -$$ -\text{Excess deaths} = - \underbrace{\text{Reported Deaths}}_{\text{noisy measure of actual deaths}} - - \underbrace{\text{Expected Deaths}}_{\text{unmeasurable counterfactual}} -$$ - -Making a claim about excess deaths requires causal/counterfactual reasoning. While the reported number of deaths is nothing but a (maybe noisy and/or lagged) measure of a real observable fact in the world, _expected deaths_ is unmeasurable because these are never realised in our timeline. That is, the expected deaths is a counterfactual thought experiment where we can ask "What would/will happen if?" - -+++ - -## Overall strategy -How do we go about this, practically? We will follow this strategy: -1. Import data on reported number of deaths from all causes (our outcome variable), as well as a few reasonable predictor variables: - - average monthly temperature - - month of the year, which we use to model seasonal effects - - and time which is used to model any underlying linear trend. -2. Split into `pre` and `post` covid datasets. This is an important step. We want to come up with a model based upon what we know _before_ COVID-19 so that we can construct our counterfactual predictions based on data before COVID-19 had any impact. -3. Estimate model parameters based on the `pre` dataset. -4. [Retrodict](https://en.wikipedia.org/wiki/Retrodiction) the number of deaths expected by the model in the pre COVID-19 period. This is not a counterfactual, but acts to tell us how capable the model is at accounting for the already observed data. -5. Counterfactual inference - we use our model to construct a counterfactual forecast. What would we expect to see in the future if there was no COVID-19? This can be achieved by using the famous do-operator Practically, we do this with posterior prediction on out-of-sample data. -6. 
Calculate the excess deaths by comparing the reported deaths with our counterfactual (expected number of deaths). - -+++ - -## Modelling strategy -We could take many different approaches to the modelling. Because we are dealing with time series data, then it would be very sensible to use a time series modelling approach. For example, Google's [CausalImpact](https://google.github.io/CausalImpact/CausalImpact.html) uses a [Bayesian structural time-series](https://en.wikipedia.org/wiki/Bayesian_structural_time_series) model, but there are many alternative time series models we could choose. - -But because the focus of this case study is on the counterfactual reasoning rather than the specifics of time-series modelling, I chose the simpler approach of linear regression for time-series model (see {cite:t}`martin2021bayesian` for more on this). - -+++ {"tags": []} - -## Causal inference disclaimer - -Readers should be aware that there are of course limits to the causal claims we can make here. If we were dealing with a marketing example where we ran a promotion for a period of time and wanted to make inferences about _excess sales_, then we could only make strong causal claims if we had done our due diligence in accounting for other factors which may have also taken place during our promotion period. - -Similarly, there are [many other things that changed in the UK since January 2020](https://en.wikipedia.org/wiki/2020_in_the_United_Kingdom#Events) (the well documented time of the first COVID-19 cases) in England and Wales. So if we wanted to be rock solid then we should account for other feasibly relevant factors. - -Finally, we are _not_ claiming that $x$ people died directly from the COVID-19 virus. The beauty of the concept of excess deaths is that it captures deaths from all causes that are in excess of what we would expect. As such, it covers not only those who died directly from the COVID-19 virus, but also from all downstream effects of the virus and availability of care, for example. 
- -```{code-cell} ipython3 -import calendar -import os - -import aesara.tensor as at -import arviz as az -import matplotlib.dates as mdates -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns -import xarray as xr -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -Now let's define some helper functions - -```{code-cell} ipython3 -:tags: [hide-cell] - -def ZeroSumNormal(name, *, sigma=None, active_dims=None, dims, model=None): - model = pm.modelcontext(model=model) - - if isinstance(dims, str): - dims = [dims] - - if isinstance(active_dims, str): - active_dims = [active_dims] - - if active_dims is None: - active_dims = dims[-1] - - def extend_axis(value, axis): - n_out = value.shape[axis] + 1 - sum_vals = value.sum(axis, keepdims=True) - norm = sum_vals / (at.sqrt(n_out) + n_out) - fill_val = norm - sum_vals / at.sqrt(n_out) - out = at.concatenate([value, fill_val], axis=axis) - return out - norm - - dims_reduced = [] - active_axes = [] - for i, dim in enumerate(dims): - if dim in active_dims: - active_axes.append(i) - dim_name = f"{dim}_reduced" - if name not in model.coords: - model.add_coord(dim_name, length=len(model.coords[dim]) - 1, mutable=False) - dims_reduced.append(dim_name) - else: - dims_reduced.append(dim) - - raw = pm.Normal(f"{name}_raw", sigma=sigma, dims=dims_reduced) - for axis in active_axes: - raw = extend_axis(raw, axis) - return pm.Deterministic(name, raw, dims=dims) - - -def format_x_axis(ax, minor=False): - # major ticks - ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y %b")) - ax.xaxis.set_major_locator(mdates.YearLocator()) - ax.grid(which="major", linestyle="-", axis="x") - # minor ticks - if minor: - ax.xaxis.set_minor_formatter(mdates.DateFormatter("%Y %b")) - ax.xaxis.set_minor_locator(mdates.MonthLocator()) - ax.grid(which="minor", linestyle=":", axis="x") - # rotate labels - for label in ax.get_xticklabels(which="both"): - label.set(rotation=70, horizontalalignment="right") - - -def plot_xY(x, Y, ax): - quantiles = Y.quantile((0.025, 0.25, 0.5, 0.75, 0.975), dim=("chain", "draw")).transpose() - - az.plot_hdi( - x, - hdi_data=quantiles.sel(quantile=[0.025, 0.975]), - fill_kwargs={"alpha": 0.25}, - smooth=False, - ax=ax, - ) - az.plot_hdi( - x, - hdi_data=quantiles.sel(quantile=[0.25, 0.75]), - fill_kwargs={"alpha": 0.5}, - smooth=False, - ax=ax, - ) - ax.plot(x, quantiles.sel(quantile=0.5), color="C1", lw=3) - - -# default figure sizes -figsize = (10, 5) - -# create a list of month strings, for plotting purposes -month_strings = calendar.month_name[1:] -``` - -## Import data -For our purposes we will obtain number of deaths (per month) reported in England and Wales. This data is available from the Office of National Statistics dataset [Deaths registered monthly in England and Wales](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence). I manually downloaded this data for the years 2006-2022 and aggregated it into a single `.csv` file. I also added the average UK monthly temperature data as a predictor, obtained from the [average UK temperature from the Met Office](https://www.metoffice.gov.uk/research/climate/maps-and-data/uk-and-regional-series) dataset. 
-
-```{code-cell} ipython3
-try:
-    df = pd.read_csv(os.path.join("..", "data", "deaths_and_temps_england_wales.csv"))
-except FileNotFoundError:
-    df = pd.read_csv(pm.get_data("deaths_and_temps_england_wales.csv"))
-
-df["date"] = pd.to_datetime(df["date"])
-df = df.set_index("date")
-
-# split into separate dataframes for pre and post onset of COVID-19
-pre = df[df.index < "2020"]
-post = df[df.index >= "2020"]
-```
-
-## Visualise data
-
-+++
-
-### Reported deaths over time
-Plotting the time series shows that there is clear seasonality in the number of deaths, and we can also take a guess that there may be an increase in the average number of deaths per year.
-
-```{code-cell} ipython3
-ax = sns.lineplot(data=df, x="date", y="deaths", hue="pre")
-format_x_axis(ax)
-```
-
-### Seasonality
-
-Let's take a closer look at the seasonal pattern (just of the pre-covid data) by plotting deaths as a function of month, and we will colour code the year. This confirms our suspicion of a seasonal trend in the number of deaths, with more deaths in the winter season than in the summer. We can also see a large number of deaths in January, followed by a slight dip in February which bounces back in March. This could be due to a combination of:
-- `push-back` of deaths that actually occurred in December being registered in January
-- or `pull-forward` where many of the vulnerable people who would have died in February ended up dying in January, potentially due to the cold conditions.
-
-The colour coding supports our suspicion that there is a positive main effect of year - that the baseline number of deaths per year is increasing.
-
-```{code-cell} ipython3
-ax = sns.lineplot(data=pre, x="month", y="deaths", hue="year", lw=3)
-ax.set(title="Pre COVID-19 data");
-```
-
-### Linear trend
-
-Let's look at that more closely by plotting the total deaths over time, pre COVID-19. While there is some variability here, it seems like adding a linear trend as a predictor will capture some of the variance in reported deaths, and therefore make for a better model of reported deaths.
-
-```{code-cell} ipython3
-annual_deaths = pd.DataFrame(pre.groupby("year")["deaths"].sum()).reset_index()
-sns.regplot(x="year", y="deaths", data=annual_deaths);
-```
-
-### Effects of temperature on deaths
-
-Looking at the `pre` data alone, there is a clear negative relationship between monthly average temperature and the number of deaths. Over a wider range of temperatures, it is clear that deaths would have a U-shaped relationship with temperature. But given the climate of England and Wales, we only see the lower side of this curve. Despite that, the relationship could plausibly be approximately quadratic, but for our purposes a linear relationship seems like a reasonable place to start.
-
-```{code-cell} ipython3
-fig, ax = plt.subplots(1, 2, figsize=figsize)
-sns.regplot(x="temp", y="deaths", data=pre, scatter_kws={"s": 40}, order=1, ax=ax[0])
-ax[0].set(title="Linear fit (pre COVID-19 data)")
-sns.regplot(x="temp", y="deaths", data=pre, scatter_kws={"s": 40}, order=2, ax=ax[1])
-ax[1].set(title="Quadratic fit (pre COVID-19 data)");
-```
-
-Let's examine the slope of this relationship, which will be useful in defining a prior for a temperature coefficient in our model. 
- -```{code-cell} ipython3 -# NOTE: results are returned from higher to lower polynomial powers -slope, intercept = np.polyfit(pre["temp"], pre["deaths"], 1) -print(f"{slope:.0f} deaths/degree") -``` - -Based on this, if we focus only on the relationship between temperature and deaths, we expect there to be 764 _fewer_ deaths for every $1^\circ C$ increase in average monthly temperature. So we can use this figure when it comes to defining a prior over the coefficient for the temperature effect. - -+++ - -## Modelling - -We are going to estimate reported deaths over time with an intercept, a linear trend, seasonal deflections (for each month), and average monthly temperature. So this is a pretty straightforward linear model. The only thing of note is that we transform the normally distributed monthly deflections to have a mean of zero in order to reduce the degrees of freedom of the model by one, which should help with parameter identifiability. - -```{code-cell} ipython3 -with pm.Model(coords={"month": month_strings}) as model: - - # observed predictors and outcome - month = pm.MutableData("month", pre["month"].to_numpy(), dims="t") - time = pm.MutableData("time", pre["t"].to_numpy(), dims="t") - temp = pm.MutableData("temp", pre["temp"].to_numpy(), dims="t") - deaths = pm.MutableData("deaths", pre["deaths"].to_numpy(), dims="t") - - # priors - intercept = pm.Normal("intercept", 40_000, 10_000) - month_mu = ZeroSumNormal("month mu", sigma=3000, dims="month") - linear_trend = pm.TruncatedNormal("linear trend", 0, 50, lower=0) - temp_coeff = pm.Normal("temp coeff", 0, 200) - - # the actual linear model - mu = pm.Deterministic( - "mu", - intercept + (linear_trend * time) + month_mu[month - 1] + (temp_coeff * temp), - dims="t", - ) - sigma = pm.HalfNormal("sigma", 2_000) - # likelihood - pm.TruncatedNormal("obs", mu=mu, sigma=sigma, lower=0, observed=deaths, dims="t") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -## Prior predictive check - -As part of the Bayesian workflow, we will plot our prior predictions to see what outcomes the model finds before having observed any data. - -```{code-cell} ipython3 -with model: - idata = pm.sample_prior_predictive(random_seed=RANDOM_SEED) - - -fig, ax = plt.subplots(figsize=figsize) - -plot_xY(pre.index, idata.prior_predictive["obs"], ax) -format_x_axis(ax) -ax.plot(pre.index, pre["deaths"], label="observed") -ax.set(title="Prior predictive distribution in the pre COVID-19 era") -plt.legend(); -``` - -This seems reasonable: -- The _a priori_ number of deaths looks centred on the observed numbers. -- Given the priors, the predicted range of deaths is quite broad, and so is unlikely to over-constrain the model. -- The model does not predict negative numbers of deaths per month. - -We can look at this in more detail with the Arviz prior predictive check (ppc) plot. Again we see that the distribution of the observations is centered on the actual observations but has more spread. This is useful as we know the priors are not too restrictive and are unlikely to systematically influence our posterior predictions upwards or downwards. - -```{code-cell} ipython3 -az.plot_ppc(idata, group="prior"); -``` - -## Inference -Draw samples for the posterior distribution, and remember we are doing this for the pre COVID-19 data only. 
- -```{code-cell} ipython3 -with model: - idata.extend(pm.sample(random_seed=RANDOM_SEED)) -``` - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["~mu", "~month mu_raw"]); -``` - -Let's also look at the posterior estimates of the monthly deflections, in a different way to focus on the seasonal effect. - -```{code-cell} ipython3 -az.plot_forest(idata.posterior, var_names="month mu", figsize=figsize); -``` - -## Posterior predictive check - -Another important aspect of the Bayesian workflow is to plot the model's posterior predictions, allowing us to see how well the model can retrodict the already observed data. It is at this point that we can decide whether the model is too simple (then we'd build more complexity into the model) or if it's fine. - -```{code-cell} ipython3 -with model: - idata.extend(pm.sample_posterior_predictive(idata, random_seed=RANDOM_SEED)) - - -fig, ax = plt.subplots(figsize=figsize) - -az.plot_hdi(pre.index, idata.posterior_predictive["obs"], hdi_prob=0.5, smooth=False) -az.plot_hdi(pre.index, idata.posterior_predictive["obs"], hdi_prob=0.95, smooth=False) -ax.plot(pre.index, pre["deaths"], label="observed") -format_x_axis(ax) -ax.set(title="Posterior predictive distribution in the pre COVID-19 era") -plt.legend(); -``` - -Let's do another check now, but focussing on the seasonal effect. We will replicate the plot that we had above of deaths as a function of month of the year. And in order to keep the plot from being a complete mess, we will just plot the posterior mean. As such this is not a posterior _predictive_ check, but a check of the posterior. - -```{code-cell} ipython3 -temp = idata.posterior["mu"].mean(dim=["chain", "draw"]).to_dataframe() -pre = pre.assign(deaths_predicted=temp["mu"].values) - -fig, ax = plt.subplots(1, 2, figsize=figsize, sharey=True) -sns.lineplot(data=pre, x="month", y="deaths", hue="year", ax=ax[0], lw=3) -ax[0].set(title="Observed") -sns.lineplot(data=pre, x="month", y="deaths_predicted", hue="year", ax=ax[1], lw=3) -ax[1].set(title="Model predicted mean"); -``` - -The model is doing a pretty good job of capturing the properties of the data. On the right, we can clearly see the main effects of `month` and `year`. However, we can see that there is something interesting happening in the data (left) in January which the model is not capturing. This might be able to be captured in the model by adding an interaction between `month` and `year`, but this is left as an exercise for the reader. - -+++ - -## Excess deaths: Pre-Covid - -This step is not strictly necessary, but we can apply the excess deaths formula to the models' retrodictions for the `pre` period. This is useful because we can examine how good the model is. - -```{code-cell} ipython3 -:tags: [hide-input] - -# convert deaths into an XArray object with a labelled dimension to help in the next step -deaths = xr.DataArray(pre["deaths"].to_numpy(), dims=["t"]) - -# do the calculation by taking the difference -excess_deaths = deaths - idata.posterior_predictive["obs"] - -fig, ax = plt.subplots(figsize=figsize) -# the transpose is to keep arviz happy, ordering the dimensions as (chain, draw, t) -az.plot_hdi(pre.index, excess_deaths.transpose(..., "t"), hdi_prob=0.5, smooth=False) -az.plot_hdi(pre.index, excess_deaths.transpose(..., "t"), hdi_prob=0.95, smooth=False) -format_x_axis(ax) -ax.axhline(y=0, color="k") -ax.set(title="Excess deaths, pre COVID-19"); -``` - -We can see that we have a few spikes here where the number of excess deaths is plausibly greater than zero. 
Such occasions are above and beyond what we could expect from: a) seasonal effects, b) the linearly increasing trend, c) the effect of cold winters. - -If we were interested, then we could start to generate hypotheses about what additional predictors may account for this. Some ideas could include the prevalence of the common cold, or minimum monthly temperatures, which could add extra predictive information not captured by the mean. - -We can also see that there is some additional temporal trend that the model is not quite capturing. There is some systematic low-frequency drift of the posterior mean away from zero. That is, there is additional variance in the data that our predictors are not quite capturing, which could potentially be caused by changes in the size of vulnerable cohorts over time. - -But we are close to our objective of calculating excess deaths during the COVID-19 period, so we will move on, as the primary focus here is on counterfactual thinking, not on building the most comprehensive model of reported deaths ever. - -+++ - -## Counterfactual inference -Now we will use our model to predict the reported deaths in the 'what if?' scenario of business as usual. - -So we update the model with the `month`, time (`t`), and `temp` data from the `post` dataframe and run posterior predictive sampling to predict the number of reported deaths we would observe in this counterfactual scenario. We could also call this 'forecasting'. - -```{code-cell} ipython3 -with model: - pm.set_data( - { - "month": post["month"].to_numpy(), - "time": post["t"].to_numpy(), - "temp": post["temp"].to_numpy(), - } - ) - counterfactual = pm.sample_posterior_predictive( - idata, var_names=["obs"], random_seed=RANDOM_SEED - ) -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(figsize=figsize) - -plot_xY(post.index, counterfactual.posterior_predictive["obs"], ax) -format_x_axis(ax, minor=True) -ax.plot(post.index, post["deaths"], label="reported deaths") -ax.set(title="Counterfactual: Posterior predictive forecast of deaths if COVID-19 had not appeared") -plt.legend(); -``` - -We now have the ingredients needed to calculate excess deaths. Namely, the reported number of deaths, and the Bayesian counterfactual prediction of how many would have died if nothing had changed from the pre to post COVID-19 era. - -+++ - -## Excess deaths: since Covid onset - -+++ - -Now we'll use the predicted number of deaths under the counterfactual scenario and compare that to the reported number of deaths to come up with our counterfactual estimate of excess deaths.
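The cells below do exactly this subtraction and then accumulate it over time. As a complementary headline summary, one could also report the posterior of the cumulative excess deaths at the final time point. The following is a minimal illustrative sketch (not part of the original analysis) that reuses the `cumsum` object computed in the cells that follow.

```python
# Illustrative sketch only: summarise cumulative excess deaths at the last month,
# using the `cumsum` DataArray computed in the cells below (dims: chain, draw, t).
final = cumsum.isel(t=-1)
print(f"Posterior mean cumulative excess deaths: {final.mean().item():.0f}")
# a simple 95% interval from posterior quantiles
print(final.quantile([0.025, 0.975], dim=["chain", "draw"]).values)
```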
- -```{code-cell} ipython3 -# convert deaths into an XArray object with a labelled dimension to help in the next step -deaths = xr.DataArray(post["deaths"].to_numpy(), dims=["t"]) - -# do the calculation by taking the difference -excess_deaths = deaths - counterfactual.posterior_predictive["obs"] -``` - -And we can easily compute the cumulative excess deaths - -```{code-cell} ipython3 -# calculate the cumulative excess deaths -cumsum = excess_deaths.cumsum(dim="t") -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(2, 1, figsize=(figsize[0], 9), sharex=True) - -# Plot the excess deaths -# The transpose is to keep arviz happy, ordering the dimensions as (chain, draw, t) -plot_xY(post.index, excess_deaths.transpose(..., "t"), ax[0]) -format_x_axis(ax[0], minor=True) -ax[0].axhline(y=0, color="k") -ax[0].set(title="Excess deaths, since COVID-19 onset") - -# Plot the cumulative excess deaths -plot_xY(post.index, cumsum.transpose(..., "t"), ax[1]) -format_x_axis(ax[1], minor=True) -ax[1].axhline(y=0, color="k") -ax[1].set(title="Cumulative excess deaths, since COVID-19 onset"); -``` - -And there we have it - we've done some Bayesian counterfactual inference in PyMC! In just a few steps we've: -- Built a simple linear regression model. -- Inferred the model parameters based on pre COVID-19 data, running prior and posterior predictive checks. We note that the model is pretty good, but as always there might be ways to improve the model in the future. -- Used the model to create counterfactual predictions of what would happen in the future (COVID-19 era) if nothing had changed. -- Calculated the excess deaths (and cumulative excess deaths) by comparing the reported deaths to our counterfactual expected number of deaths. - -The bad news of course, is that as of the last data point (May 2022) the number of excess deaths in England and Wales has started to rise again. - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -- Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in July 2022. - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/causal_inference/interrupted_time_series.myst.md b/myst_nbs/causal_inference/interrupted_time_series.myst.md deleted file mode 100644 index 683ce956a..000000000 --- a/myst_nbs/causal_inference/interrupted_time_series.myst.md +++ /dev/null @@ -1,363 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc_env - language: python - name: pymc_env ---- - -(interrupted_time_series)= -# Interrupted time series analysis - -:::{post} Oct, 2022 -:tags: counterfactuals, causal inference, time series, forecasting, causal impact, quasi experiments -:category: intermediate -:author: Benjamin T. Vincent -::: - -+++ - -This notebook focuses on how to conduct a simple Bayesian [interrupted time series analysis](https://en.wikipedia.org/wiki/Interrupted_time_series). This is useful in [quasi-experimental settings](https://en.wikipedia.org/wiki/Quasi-experiment) where an intervention was applied to all treatment units. 
- -For example, suppose a change to a website was made and you want to know its causal impact. _If_ the change was applied selectively and randomly to a test group of website users, then you may be able to make causal claims using the [A/B testing approach](https://en.wikipedia.org/wiki/A/B_testing). - -However, if the website change was rolled out to _all_ users of the website then you do not have a control group. In this case you do not have a direct measurement of the counterfactual: what _would have happened if_ the website change had not been made. However, if you have data over a 'good' number of time points, then you may be able to make use of the interrupted time series approach. - -Interested readers are directed to the excellent textbook [The Effect](https://theeffectbook.net/) {cite:p}`huntington2021effect`. Chapter 17 covers 'event studies', which is the term the author prefers to the interrupted time series terminology. - -+++ - -## Causal DAG - -A simple causal DAG for the interrupted time series is given below, but see {cite:p}`huntington2021effect` for a more general DAG. In short it says: - -* The outcome is causally influenced by time (e.g. other factors that change over time) and by the treatment. -* The treatment is causally influenced by time. - -![](DAG_interrupted_time_series.png) - -Intuitively, we could describe the logic of the approach as: -* We know that the outcome varies over time. -* If we build a model of how the outcome varies over time _before_ the treatment, then we can predict the counterfactual of what we would expect to happen _if_ the treatment had not occurred. -* We can compare this counterfactual with the observations from the time of the intervention onwards. If there is a meaningful discrepancy then we can attribute it to the causal impact of the intervention. - -This is reasonable if we have ruled out other plausible causes occurring at the same point in time as (or after) the intervention. This becomes more tricky to justify the more time has passed since the intervention because it is more likely that other relevant events may have occurred that could provide alternative causal explanations. - -If this does not make sense immediately, I recommend checking the example data figure below and then revisiting this section.
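To make that three-step logic concrete, here is a minimal, non-Bayesian sketch on simulated data, using an ordinary least-squares trend as the "model of the pre-intervention outcome". This is purely illustrative; the Bayesian version used in the rest of this notebook follows below.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
y = 0.1 * t + rng.normal(0, 0.5, size=100)
y[70:] += 2  # a step change at t = 70 plays the role of the intervention

b1, b0 = np.polyfit(t[:70], y[:70], 1)  # 1. model the pre-intervention data
counterfactual = b0 + b1 * t[70:]       # 2. forecast what "no intervention" would look like
impact = y[70:] - counterfactual        # 3. compare with the observed outcome
print(f"Estimated average impact: {impact.mean():.2f}")
```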
- -```{code-cell} ipython3 -import arviz as az -import matplotlib.dates as mdates -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import xarray as xr - -from scipy.stats import norm -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -Now let's define some helper functions. - -```{code-cell} ipython3 -:tags: [hide-cell] - -def format_x_axis(ax, minor=False): - # major ticks - ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y %b")) - ax.xaxis.set_major_locator(mdates.YearLocator()) - ax.grid(which="major", linestyle="-", axis="x") - # minor ticks - if minor: - ax.xaxis.set_minor_formatter(mdates.DateFormatter("%Y %b")) - ax.xaxis.set_minor_locator(mdates.MonthLocator()) - ax.grid(which="minor", linestyle=":", axis="x") - # rotate labels - for label in ax.get_xticklabels(which="both"): - label.set(rotation=70, horizontalalignment="right") - - -def plot_xY(x, Y, ax): - quantiles = Y.quantile((0.025, 0.25, 0.5, 0.75, 0.975), dim=("chain", "draw")).transpose() - - az.plot_hdi( - x, - hdi_data=quantiles.sel(quantile=[0.025, 0.975]), - fill_kwargs={"alpha": 0.25}, - smooth=False, - ax=ax, - ) - az.plot_hdi( - x, - hdi_data=quantiles.sel(quantile=[0.25, 0.75]), - fill_kwargs={"alpha": 0.5}, - smooth=False, - ax=ax, - ) - ax.plot(x, quantiles.sel(quantile=0.5), color="C1", lw=3) - - -# default figure sizes -figsize = (10, 5) -``` - -## Generate data - -The focus of this example is on making causal claims using the interrupted time series approach. Therefore we will work with some relatively simple synthetic data which only requires a very simple model. - -```{code-cell} ipython3 -:tags: [] - -treatment_time = "2017-01-01" -β0 = 0 -β1 = 0.1 -dates = pd.date_range( - start=pd.to_datetime("2010-01-01"), end=pd.to_datetime("2020-01-01"), freq="M" -) -N = len(dates) - - -def causal_effect(df): - return (df.index > treatment_time) * 2 - - -df = ( - pd.DataFrame() - .assign(time=np.arange(N), date=dates) - .set_index("date", drop=True) - .assign(y=lambda x: β0 + β1 * x.time + causal_effect(x) + norm(0, 0.5).rvs(N)) -) -df -``` - -```{code-cell} ipython3 -# Split into pre and post intervention dataframes -pre = df[df.index < treatment_time] -post = df[df.index >= treatment_time] -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -ax = pre["y"].plot(label="pre") -post["y"].plot(ax=ax, label="post") -ax.axvline(treatment_time, c="k", ls=":") -plt.legend(); -``` - -In this simple dataset, we have a noisy linear trend upwards, and because this data is synthetic we know that we have a step increase in the outcome at the intervention time, and this effect is persistent over time. - -+++ - -## Modelling -Here we build a simple linear model. Remember that we are building a model of the pre-intervention data with the goal that it does a reasonable job of forecasting what would have happened if the intervention had not been applied. Put another way, we are _not_ modelling any aspect of the post-intervention observations such as a change in intercept, slope or whether the effect is transient or permanent.
- -```{code-cell} ipython3 -with pm.Model() as model: - # observed predictors and outcome - time = pm.MutableData("time", pre["time"].to_numpy(), dims="obs_id") - # priors - beta0 = pm.Normal("beta0", 0, 1) - beta1 = pm.Normal("beta1", 0, 0.2) - # the actual linear model - mu = pm.Deterministic("mu", beta0 + (beta1 * time), dims="obs_id") - sigma = pm.HalfNormal("sigma", 2) - # likelihood - pm.Normal("obs", mu=mu, sigma=sigma, observed=pre["y"].to_numpy(), dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -## Prior predictive check - -As part of the Bayesian workflow, we will plot our prior predictions to see what outcomes the model finds before having observed any data. - -```{code-cell} ipython3 -with model: - idata = pm.sample_prior_predictive(random_seed=RANDOM_SEED) - -fig, ax = plt.subplots(figsize=figsize) - -plot_xY(pre.index, idata.prior_predictive["obs"], ax) -format_x_axis(ax) -ax.plot(pre.index, pre["y"], label="observed") -ax.set(title="Prior predictive distribution in the pre intervention era") -plt.legend(); -``` - -This seems reasonable in that the priors over the intercept and slope are broad enough to lead to predicted observations which easily contain the actual data. This means that the particular priors chosen will not unduly constrain the posterior parameter estimates. - -+++ - -## Inference -Draw samples for the posterior distribution, and remember we are doing this for the pre intervention data only. - -```{code-cell} ipython3 -with model: - idata.extend(pm.sample(random_seed=RANDOM_SEED)) -``` - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["~mu"]); -``` - -## Posterior predictive check - -Another important aspect of the Bayesian workflow is to plot the model's posterior predictions, allowing us to see how well the model can retrodict the already observed data. It is at this point that we can decide whether the model is too simple (then we'd build more complexity into the model) or if it's fine. - -```{code-cell} ipython3 -with model: - idata.extend(pm.sample_posterior_predictive(idata, random_seed=RANDOM_SEED)) - -fig, ax = plt.subplots(figsize=figsize) - -az.plot_hdi(pre.index, idata.posterior_predictive["obs"], hdi_prob=0.5, smooth=False) -az.plot_hdi(pre.index, idata.posterior_predictive["obs"], hdi_prob=0.95, smooth=False) -ax.plot(pre.index, pre["y"], label="observed") -format_x_axis(ax) -ax.set(title="Posterior predictive distribution in the pre intervention era") -plt.legend(); -``` - -The next step is not strictly necessary, but we can calculate the difference between the model retrodictions and the data to look at the errors. This can be useful to identify any unexpected inability to retrodict pre-intervention data. 
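Alongside the residual plot below, a quick numerical summary can also be useful. The following is an illustrative sketch (not part of the original analysis) that computes the mean absolute error of the posterior-mean retrodictions, reusing the `idata` and `pre` objects from the cells above.

```python
import numpy as np

# Sketch: mean absolute error of the posterior-mean retrodictions
post_mean = idata.posterior_predictive["obs"].mean(dim=["chain", "draw"]).values
mae = np.abs(pre["y"].to_numpy() - post_mean).mean()
print(f"Mean absolute error of retrodictions: {mae:.2f}")
```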
- -```{code-cell} ipython3 -:tags: [hide-input] - -# convert outcome into an XArray object with a labelled dimension to help in the next step -y = xr.DataArray(pre["y"].to_numpy(), dims=["obs_id"]) - -# do the calculation by taking the difference -excess = y - idata.posterior_predictive["obs"] -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=figsize) -# the transpose is to keep arviz happy, ordering the dimensions as (chain, draw, time) -az.plot_hdi(pre.index, excess.transpose(..., "obs_id"), hdi_prob=0.5, smooth=False) -az.plot_hdi(pre.index, excess.transpose(..., "obs_id"), hdi_prob=0.95, smooth=False) -format_x_axis(ax) -ax.axhline(y=0, color="k") -ax.set(title="Residuals, pre intervention"); -``` - -## Counterfactual inference -Now we will use our model to predict the observed outcome in the 'what if?' scenario of no intervention. - -So we update the model with the `time` data from the `post` intervention dataframe and run posterior predictive sampling to predict the observations we would observe in this counterfactual scenario. We could also call this 'forecasting'. - -```{code-cell} ipython3 -with model: - pm.set_data( - { - "time": post["time"].to_numpy(), - } - ) - counterfactual = pm.sample_posterior_predictive( - idata, var_names=["obs"], random_seed=RANDOM_SEED - ) -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(figsize=figsize) - -plot_xY(post.index, counterfactual.posterior_predictive["obs"], ax) -format_x_axis(ax, minor=False) -ax.plot(post.index, post["y"], label="observed") -ax.set( - title="Counterfactual: Posterior predictive forecast of outcome if intervention not taken place" -) -plt.legend(); -``` - -We now have the ingredients needed to calculate the causal impact. This is simply the difference between the Bayesian counterfactual predictions and the observations. - -+++ - -## Causal impact: since the intervention - -+++ - -Now we'll use the predicted outcome under the counterfactual scenario and compare that to the observed outcome to come up with our counterfactual estimate. - -```{code-cell} ipython3 -# convert outcome into an XArray object with a labelled dimension to help in the next step -outcome = xr.DataArray(post["y"].to_numpy(), dims=["obs_id"]) - -# do the calculation by taking the difference -excess = outcome - counterfactual.posterior_predictive["obs"] -``` - -And we can easily compute the cumulative causal impact - -```{code-cell} ipython3 -# calculate the cumulative causal impact -cumsum = excess.cumsum(dim="obs_id") -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(2, 1, figsize=(figsize[0], 9), sharex=True) - -# Plot the excess -# The transpose is to keep arviz happy, ordering the dimensions as (chain, draw, t) -plot_xY(post.index, excess.transpose(..., "obs_id"), ax[0]) -format_x_axis(ax[0], minor=True) -ax[0].axhline(y=0, color="k") -ax[0].set(title="Causal impact, since intervention") - -# Plot the cumulative excess -plot_xY(post.index, cumsum.transpose(..., "obs_id"), ax[1]) -format_x_axis(ax[1], minor=False) -ax[1].axhline(y=0, color="k") -ax[1].set(title="Cumulative causal impact, since intervention"); -``` - -And there we have it - we've done some Bayesian counterfactual inference in PyMC using the interrupted time series approach! In just a few steps we've: -- Built a simple model to predict a time series. -- Inferred the model parameters based on pre intervention data, running prior and posterior predictive checks. We note that the model is pretty good. 
-- Used the model to create counterfactual predictions of what would happen after the intervention time if the intervention had not occurred. -- Calculated the causal impact (and cumulative causal impact) by comparing the observed outcome to our counterfactual expected outcome in the case of no intervention. - -There are of course many ways that the interrupted time series approach could be more involved in real world settings. For example there could be more temporal structure, such as seasonality. If so then we might want to use a specific time series model, not just a linear regression model. There could also be additional informative predictor variables to incorporate into the model. Additionally some designs do not just consist of pre and post intervention periods (also known as A/B designs), but could also involve a period where the intervention is inactive, active, and then inactive (also known as an ABA design). - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -- Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in October 2022. - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/causal_inference/regression_discontinuity.myst.md b/myst_nbs/causal_inference/regression_discontinuity.myst.md deleted file mode 100644 index 7aefa892e..000000000 --- a/myst_nbs/causal_inference/regression_discontinuity.myst.md +++ /dev/null @@ -1,244 +0,0 @@ ---- -jupytext: - notebook_metadata_filter: substitutions - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev-py39 - language: python - name: pymc-dev-py39 ---- - -(regression_discontinuity)= -# Regression discontinuity design analysis - -:::{post} April, 2022 -:tags: regression, causal inference, quasi experiments, counterfactuals -:category: beginner, explanation -:author: Benjamin T. Vincent -::: - -[Quasi experiments](https://en.wikipedia.org/wiki/Quasi-experiment) involve experimental interventions and quantitative measures. However, quasi-experiments do _not_ involve random assignment of units (e.g. cells, people, companies, schools, states) to test or control groups. This inability to conduct random assignment poses problems when making causal claims as it makes it harder to argue that any difference between a control and test group are because of an intervention and not because of a confounding factor. - -The [regression discontinuity design](https://en.wikipedia.org/wiki/Regression_discontinuity_design) is a particular form of quasi experimental design. It consists of a control and test group, but assignment of units to conditions is chosen based upon a threshold criteria, not randomly. - -:::{figure-md} fig-target - -![regression discontinuity design schematic](regression_discontinuity.png) - -A schematic diagram of the regression discontinuity design. The dashed green line shows where we would have expected the post test scores of the test group to be if they had not received the treatment. Image taken from [https://conjointly.com/kb/regression-discontinuity-design/](https://conjointly.com/kb/regression-discontinuity-design/). -::: - -Units with very low scores are likely to differ systematically along some dimensions than units with very high scores. 
For example, if we look at students who achieve the highest, and students who achieve the lowest, in all likelihood there are confounding variables. Students with high scores are likely to have come from more privileged backgrounds than those with the lowest scores. - -If we gave extra tuition (our experimental intervention) to students scoring in the lowest half of scores then we can easily imagine that we have large differences in some measure of privilege between test and control groups. At a first glance, this would seem to make the regression discontinuity design useless - the whole point of random assignment is to reduce or eliminate systematic biases between control and test groups. But use of a threshold would seem to maximise the differences in confounding variables between groups. Isn't this an odd thing to do? - -The key point however is that it is much less likely that students scoring just below and just above the threshold systematically differ in their degree of privilege. And so _if_ we find evidence of a meaningful discontinuity in a post-test score in those just above and just below the threshold, then it is much more plausible that the intervention (applied according to the threshold criteria) was causally responsible. - -## Sharp v.s. fuzzy regression discontinuity designs -Note that regression discontinuity designs fall into two categories. This notebook focuses on _sharp_ regression discontinuity designs, but it is important to understand both sharp and fuzzy variants: - -- **Sharp:** Here, the assignment to control or treatment groups is purely dictated by the threshold. There is no uncertainty in which units are in which group. -- **Fuzzy:** In some situations there may not be a sharp boundary between control and treatment based upon the threshold. This could happen for example if experimenters are not strict in assigning units to groups based on the threshold. Alternatively, there could be non-compliance on the side of the actual units being studied. For example, patients may not always be fully compliant in taking medication, so some unknown proportion of patients assigned to the test group may actually be in the control group because of non compliance. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -``` - -```{code-cell} ipython3 -RANDOM_SEED = 123 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -%config InlineBackend.figure_format = 'retina' -``` - -## Generate simulated data -Note that here we assume that there is negligible/zero measurement noise, but that there is some variability in the true values from pre- to post-test. It is possible to take into account measurement noise on the pre- and post-test results, but we do not engage with that in this notebook. 
- -```{code-cell} ipython3 -:tags: [hide-input] - -# define true parameters -threshold = 0.0 -treatment_effect = 0.7 -N = 1000 -sd = 0.3 # represents change between pre and post test with zero measurement error - -# No measurement error, but random change from pre to post -df = ( - pd.DataFrame.from_dict({"x": rng.normal(size=N)}) - .assign(treated=lambda x: x.x < threshold) - .assign(y=lambda x: x.x + rng.normal(loc=0, scale=sd, size=N) + treatment_effect * x.treated) -) - -df -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -def plot_data(df): - fig, ax = plt.subplots(figsize=(12, 7)) - ax.plot(df.x[~df.treated], df.y[~df.treated], "o", alpha=0.3, label="untreated") - ax.plot(df.x[df.treated], df.y[df.treated], "o", alpha=0.3, label="treated") - ax.axvline(x=threshold, color="k", ls="--", lw=3, label="treatment threshold") - ax.set(xlabel=r"observed $x$ (pre-test)", ylabel=r"observed $y$ (post-test)") - ax.legend() - return ax - - -plot_data(df); -``` - -+++ {"tags": []} - -## Sharp regression discontinuity model - -We can define our Bayesian regression discontinuity model as: - -$$ -\begin{aligned} -\Delta & \sim \text{Cauchy}(0, 1) \\ -\sigma & \sim \text{HalfNormal}(0, 1) \\ -\mu & = x_i + \underbrace{\Delta \cdot treated_i}_{\text{treatment effect}} \\ -y_i & \sim \text{Normal}(\mu, \sigma) -\end{aligned} -$$ - -where: -- $\Delta$ is the size of the discontinuity, -- $\sigma$ is the standard deviation of change in the pre- to post-test scores, -- $x_i$ and $y_i$ are observed pre- and post-test measures for unit $i$, and -- $treated_i$ is an observed indicator variable (0 for control group, 1 for test group). - -Notes: -- We make the simplifying assumption of no measurement error. -- Here, we confine ourselves to the situation where we use the same measure (e.g. heart rate, educational attainment, upper arm circumference) for both the pre-test ($x$) and post-test ($y$). So the _untreated_ post-test measure can be modelled as $y \sim \text{Normal}(\mu=x, \sigma)$. -- In the case that the pre- and post-test measuring instruments where not identical, then we would want to build slope and intercept parameters into $\mu$ to capture the 'exchange rate' between the pre- and post-test measures. -- We assume we have accurately observed whether a unit has been treated or not. That is, this model assumes a sharp discontinuity with no uncertainty. - -```{code-cell} ipython3 -with pm.Model() as model: - x = pm.MutableData("x", df.x, dims="obs_id") - treated = pm.MutableData("treated", df.treated, dims="obs_id") - sigma = pm.HalfNormal("sigma", 1) - delta = pm.Cauchy("effect", alpha=0, beta=1) - mu = pm.Deterministic("mu", x + (delta * treated), dims="obs_id") - pm.Normal("y", mu=mu, sigma=sigma, observed=df.y, dims="obs_id") - -pm.model_to_graphviz(model) -``` - -## Inference - -```{code-cell} ipython3 -with model: - idata = pm.sample(random_seed=RANDOM_SEED) -``` - -We can see that we get no sampling warnings, and that plotting the MCMC traces shows no issues. - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["effect", "sigma"]); -``` - -We can also see that we are able to accurately recover the true discontinuity magnitude (left) and the standard deviation of the change in units between pre- and post-test (right). - -```{code-cell} ipython3 -az.plot_posterior( - idata, var_names=["effect", "sigma"], ref_val=[treatment_effect, sd], hdi_prob=0.95 -); -``` - -The most important thing is the posterior over the `effect` parameter. 
We can use that to base a judgement about the strength of the effect (e.g. through the 95% credible interval) or the presence/absence of an effect (e.g. through a Bayes Factor or ROPE). - -+++ {"tags": []} - -## Counterfactual questions - -We can use posterior prediction to ask what would we expect to see if: -- no units were exposed to the treatment (blue shaded region, which is very narrow) -- all units were exposed to the treatment (orange shaded region). - -_Technical note:_ Formally we are doing posterior prediction of `y`. Running `pm.sample_posterior_predictive` multiple times with different random seeds will result in new and different samples of `y` each time. However, this is not the case (we are not formally doing posterior prediction) for `mu`. This is because `mu` is a deterministic function (`mu = x + delta*treated`), so for our already obtained posterior samples of `delta`, the values of `mu` will be entirely determined by the values of `x` and `treated` data). - -```{code-cell} ipython3 -:tags: [] - -# MODEL EXPECTATION WITHOUT TREATMENT ------------------------------------ -# probe data -_x = np.linspace(np.min(df.x), np.max(df.x), 500) -_treated = np.zeros(_x.shape) - -# posterior prediction (see technical note above) -with model: - pm.set_data({"x": _x, "treated": _treated}) - ppc = pm.sample_posterior_predictive(idata, var_names=["mu", "y"]) - -# plotting -ax = plot_data(df) -az.plot_hdi( - _x, - ppc.posterior_predictive["mu"], - color="C0", - hdi_prob=0.95, - ax=ax, - fill_kwargs={"label": r"$\mu$ untreated"}, -) - -# MODEL EXPECTATION WITH TREATMENT --------------------------------------- -# probe data -_x = np.linspace(np.min(df.x), np.max(df.x), 500) -_treated = np.ones(_x.shape) - -# posterior prediction (see technical note above) -with model: - pm.set_data({"x": _x, "treated": _treated}) - ppc = pm.sample_posterior_predictive(idata, var_names=["mu", "y"]) - -# plotting -az.plot_hdi( - _x, - ppc.posterior_predictive["mu"], - color="C1", - hdi_prob=0.95, - ax=ax, - fill_kwargs={"label": r"$\mu$ treated"}, -) -ax.legend() -``` - -The blue shaded region shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures, in the case of no treatment. This is actually infinitely narrow because this particular model assumes $\mu=x$ (see above). - -The orange shaded region shows the 95% credible region of the expected value of the post-test measurement for a range of possible pre-test measures in the case of treatment. - -Both are actually very interesting as examples of counterfactual inference. We did not observe any units that were untreated below the threshold, nor any treated units above the threshold. But assuming our model is a good description of reality, we can ask the counterfactual questions "What if a unit above the threshold was treated?" and "What if a unit below the threshold was treated?" - -+++ - -## Summary -In this notebook we have merely touched the surface of how to analyse data from regression discontinuity designs. Arguably, we have restricted our focus to almost the simplest possible case so that we can focus upon the core properties of the approach which allows causal claims to be made. - -+++ - -## Authors -- Authored by [Benjamin T. 
Vincent](https://github.com/drbenvincent) in April 2022 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/diagnostics_and_criticism/Bayes_factor.myst.md b/myst_nbs/diagnostics_and_criticism/Bayes_factor.myst.md deleted file mode 100644 index 50c9afb0d..000000000 --- a/myst_nbs/diagnostics_and_criticism/Bayes_factor.myst.md +++ /dev/null @@ -1,359 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -(Bayes_factor)= -# Bayes Factors and Marginal Likelihood -:::{post} Jun 1, 2022 -:tags: Bayes Factors, model comparison -:category: beginner, explanation -:author: Osvaldo Martin -::: - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pymc as pm - -from matplotlib import pyplot as plt -from matplotlib.ticker import FormatStrFormatter -from scipy.special import betaln -from scipy.stats import beta - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -``` - -The "Bayesian way" to compare models is to compute the _marginal likelihood_ of each model $p(y \mid M_k)$, _i.e._ the probability of the observed data $y$ given the $M_k$ model. This quantity, the marginal likelihood, is just the normalizing constant of Bayes' theorem. We can see this if we write Bayes' theorem and make explicit the fact that all inferences are model-dependant. - -$$p (\theta \mid y, M_k ) = \frac{p(y \mid \theta, M_k) p(\theta \mid M_k)}{p( y \mid M_k)}$$ - -where: - -* $y$ is the data -* $\theta$ the parameters -* $M_k$ one model out of K competing models - - -Usually when doing inference we do not need to compute this normalizing constant, so in practice we often compute the posterior up to a constant factor, that is: - -$$p (\theta \mid y, M_k ) \propto p(y \mid \theta, M_k) p(\theta \mid M_k)$$ - -However, for model comparison and model averaging the marginal likelihood is an important quantity. Although, it's not the only way to perform these tasks, you can read about model averaging and model selection using alternative methods [here](model_comparison.ipynb), [there](model_averaging.ipynb) and [elsewhere](GLM-model-selection.ipynb). Actually, these alternative methods are most often than not a better choice compared with using the marginal likelihood. - -+++ - -## Bayesian model selection - -If our main objective is to choose only one model, the _best_ one, from a set of models we can just choose the one with the largest $p(y \mid M_k)$. This is totally fine if **all models** are assumed to have the same _a priori_ probability. Otherwise, we have to take into account that not all models are equally likely _a priori_ and compute: - -$$p(M_k \mid y) \propto p(y \mid M_k) p(M_k)$$ - -Sometimes the main objective is not to just keep a single model but instead to compare models to determine which ones are more likely and by how much. This can be achieved using Bayes factors: - -$$BF_{01} = \frac{p(y \mid M_0)}{p(y \mid M_1)}$$ - -that is, the ratio between the marginal likelihood of two models. The larger the BF the _better_ the model in the numerator ($M_0$ in this example). To ease the interpretation of BFs Harold Jeffreys proposed a scale for interpretation of Bayes Factors with levels of *support* or *strength*. 
This is just a way to put numbers into words. - -* 1-3: anecdotal -* 3-10: moderate -* 10-30: strong -* 30-100: very strong -* $>$ 100: extreme - -Notice that if you get numbers below 1 then the support is for the model in the denominator, tables for those cases are also available. Of course, you can also just take the inverse of the values in the above table or take the inverse of the BF value and you will be OK. - -It is very important to remember that these rules are just conventions, simple guides at best. Results should always be put into context of our problems and should be accompanied with enough details so others could evaluate by themselves if they agree with our conclusions. The evidence necessary to make a claim is not the same in particle physics, or a court, or to evacuate a town to prevent hundreds of deaths. - -+++ - -## Bayesian model averaging - -Instead of choosing one single model from a set of candidate models, model averaging is about getting one meta-model by averaging the candidate models. The Bayesian version of this weights each model by its marginal posterior probability. - -$$p(\theta \mid y) = \sum_{k=1}^K p(\theta \mid y, M_k) \; p(M_k \mid y)$$ - -This is the optimal way to average models if the prior is _correct_ and the _correct_ model is one of the $M_k$ models in our set. Otherwise, _bayesian model averaging_ will asymptotically select the one single model in the set of compared models that is closest in [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence). - -Check this [example](model_averaging.ipynb) as an alternative way to perform model averaging. - -+++ - -## Some remarks - -Now we will briefly discuss some key facts about the _marginal likelihood_ - -* The good - * **Occam Razor included**: Models with more parameters have a larger penalization than models with fewer parameters. The intuitive reason is that the larger the number of parameters the more _spread_ the _prior_ with respect to the likelihood. - - -* The bad - * Computing the marginal likelihood is, generally, a hard task because it’s an integral of a highly variable function over a high dimensional parameter space. In general this integral needs to be solved numerically using more or less sophisticated methods. - -$$p(y \mid M_k) = \int_{\theta_k} p(y \mid \theta_k, M_k) \; p(\theta_k | M_k) \; d\theta_k$$ - -* The ugly - * The marginal likelihood depends **sensitively** on the specified prior for the parameters in each model $p(\theta_k \mid M_k)$. - -Notice that *the good* and *the ugly* are related. Using the marginal likelihood to compare models is a good idea because a penalization for complex models is already included (thus preventing us from overfitting) and, at the same time, a change in the prior will affect the computations of the marginal likelihood. At first this sounds a little bit silly; we already know that priors affect computations (otherwise we could simply avoid them), but the point here is the word **sensitively**. We are talking about changes in the prior that will keep inference of $\theta$ more or less the same, but could have a big impact in the value of the marginal likelihood. - -+++ - -## Computing Bayes factors - -The marginal likelihood is generally not available in closed-form except for some restricted models. 
For this reason many methods have been devised to compute the marginal likelihood and the derived Bayes factors; some of these methods are so simple and [naive](https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/) that they work very badly in practice. Most of the useful methods were originally proposed in the field of Statistical Mechanics. This connection exists because the marginal likelihood is analogous to a central quantity in statistical physics known as the _partition function_, which in turn is closely related to another very important quantity, the _free energy_. Many of the connections between Statistical Mechanics and Bayesian inference are summarized [here](https://arxiv.org/abs/1706.01428). - -+++ - -### Using a hierarchical model - -Computation of Bayes factors can be framed as a hierarchical model, where the high-level parameter is an index assigned to each model and sampled from a categorical distribution. In other words, we perform inference for two (or more) competing models at the same time and we use a discrete _dummy_ variable that _jumps_ between models. How much time we spend sampling each model is proportional to $p(M_k \mid y)$. - -One common problem when computing Bayes factors this way is that if one model is better than the other, by definition, we will spend more time sampling from it than from the other model. And this could lead to inaccuracies because we will be undersampling the less likely model. Another problem is that the values of the parameters get updated even when the parameters are not used to fit that model. That is, when model 0 is chosen, parameters in model 1 are updated but since they are not used to explain the data, they only get restricted by the prior. If the prior is too vague, it is possible that when we choose model 1, the parameter values are too far away from the previously accepted values and hence the step is rejected. Therefore we end up having a problem with sampling. - -In case we find these problems, we can try to improve sampling by implementing two modifications to our model: - -* Ideally, we can get a better sampling of both models if they are visited equally, so we can adjust the prior for each model in such a way as to favour the less favoured model and disfavour the more favoured one. This will not affect the computation of the Bayes factor because we have to include the priors in the computation. - -* Use pseudo priors, as suggested by Kruschke and others. The idea is simple: if the problem is that the parameters drift away unrestricted, when the model they belong to is not selected, then one solution is to try to restrict them artificially, but only when not used! You can find an example of using pseudo priors in a model used by Kruschke in his book and [ported](https://github.com/aloctavodia/Doing_bayesian_data_analysis) to Python/PyMC3. - -If you want to learn more about this approach to the computation of the marginal likelihood see [Chapter 12 of Doing Bayesian Data Analysis](http://www.sciencedirect.com/science/book/9780124058880). This chapter also discusses how to use Bayes Factors as a Bayesian alternative to classical hypothesis testing. - -+++ - -### Analytically - -For some models, like the beta-binomial model (AKA the _coin-flipping_ model) we can compute the marginal likelihood analytically.
If we write this model as: - -$$\theta \sim Beta(\alpha, \beta)$$ -$$y \sim Bin(n=1, p=\theta)$$ - -the _marginal likelihood_ will be: - -$$p(y) = \binom {n}{h} \frac{B(\alpha + h,\ \beta + n - h)} {B(\alpha, \beta)}$$ - -where: - -* $B$ is the [beta function](https://en.wikipedia.org/wiki/Beta_function), not to be confused with the $Beta$ distribution -* $n$ is the number of trials -* $h$ is the number of successes - -Since we only care about the relative value of the _marginal likelihood_ under two different models (for the same data), we can omit the binomial coefficient $\binom {n}{h}$, thus we can write: - -$$p(y) \propto \frac{B(\alpha + h,\ \beta + n - h)} {B(\alpha, \beta)}$$ - -This expression has been coded in the following cell, but with a twist. We will be using the `betaln` function instead of the `beta` function, this is done to prevent underflow. - -```{code-cell} ipython3 -def beta_binom(prior, y): - """ - Compute the marginal likelihood, analytically, for a beta-binomial model. - - prior : tuple - tuple of alpha and beta parameters for the prior (beta distribution) - y : array - array with "1" and "0" corresponding to successes and failures respectively - """ - alpha, beta = prior - h = np.sum(y) - n = len(y) - p_y = np.exp(betaln(alpha + h, beta + n - h) - betaln(alpha, beta)) - return p_y -``` - -Our data for this example consist of 100 "flips of a coin" with the same number of observed "heads" and "tails". We will compare two models: one with a uniform prior and one with a _more concentrated_ prior around $\theta = 0.5$. - -```{code-cell} ipython3 -y = np.repeat([1, 0], [50, 50]) # 50 "heads" and 50 "tails" -priors = ((1, 1), (30, 30)) -``` - -```{code-cell} ipython3 -for a, b in priors: - distri = beta(a, b) - x = np.linspace(0, 1, 300) - x_pdf = distri.pdf(x) - plt.plot(x, x_pdf, label=rf"$\alpha$ = {a:d}, $\beta$ = {b:d}") - plt.yticks([]) - plt.xlabel("$\\theta$") - plt.legend() -``` - -The following cell returns the Bayes factor. - -```{code-cell} ipython3 -BF = beta_binom(priors[1], y) / beta_binom(priors[0], y) -print(round(BF)) -``` - -We see that the model with the more concentrated prior $\text{beta}(30, 30)$ has $\approx 5$ times more support than the model with the more extended prior $\text{beta}(1, 1)$. Beyond the exact numerical value this should not be surprising since the prior for the most favoured model is concentrated around $\theta = 0.5$ and the data $y$ has an equal number of heads and tails, consistent with a value of $\theta$ around 0.5. - -+++ - -### Sequential Monte Carlo - -The [Sequential Monte Carlo](SMC2_gaussians.ipynb) sampler is a method that basically progresses by a series of successive *annealed* sequences from the prior to the posterior. A nice by-product of this process is that we get an estimation of the marginal likelihood. Actually for numerical reasons the returned value is the log marginal likelihood (this helps to avoid underflow). - -```{code-cell} ipython3 -models = [] -idatas = [] -for alpha, beta in priors: - with pm.Model() as model: - a = pm.Beta("a", alpha, beta) - yl = pm.Bernoulli("yl", a, observed=y) - idata = pm.sample_smc(random_seed=42) - models.append(model) - idatas.append(idata) -``` - -```{code-cell} ipython3 -BF_smc = np.exp( - idatas[1].sample_stats["log_marginal_likelihood"].mean() - - idatas[0].sample_stats["log_marginal_likelihood"].mean() -) -np.round(BF_smc).item() -``` - -As we can see from the previous cell, SMC gives essentially the same answer as the analytical calculation!
- -Note: In the cell above we compute a difference (instead of a division) because we are on the log-scale, for the same reason we take the exponential before returning the result. Finally, the reason we compute the mean, is because we get one value log marginal likelihood value per chain. - -The advantage of using SMC to compute the (log) marginal likelihood is that we can use it for a wider range of models as a closed-form expression is no longer needed. The cost we pay for this flexibility is a more expensive computation. Notice that SMC (with an independent Metropolis kernel as implemented in PyMC) is not as efficient or robust as gradient-based samplers like NUTS. As the dimensionality of the problem increases a more accurate estimation of the posterior and the _marginal likelihood_ will require a larger number of `draws`, rank-plots can be of help to diagnose sampling problems with SMC. - -+++ - -## Bayes factors and inference - -So far we have used Bayes factors to judge which model seems to be better at explaining the data, and we get that one of the models is $\approx 5$ _better_ than the other. - -But what about the posterior we get from these models? How different they are? - -```{code-cell} ipython3 -az.summary(idatas[0], var_names="a", kind="stats").round(2) -``` - -```{code-cell} ipython3 -az.summary(idatas[1], var_names="a", kind="stats").round(2) -``` - -We may argue that the results are pretty similar, we have the same mean value for $\theta$, and a slightly wider posterior for `model_0`, as expected since this model has a wider prior. We can also check the posterior predictive distribution to see how similar they are. - -```{code-cell} ipython3 -ppc_0 = pm.sample_posterior_predictive(idatas[0], model=models[0]).posterior_predictive -ppc_1 = pm.sample_posterior_predictive(idatas[1], model=models[1]).posterior_predictive -``` - -```{code-cell} ipython3 -_, ax = plt.subplots(figsize=(9, 6)) - -bins = np.linspace(0.2, 0.8, 8) -ax = az.plot_dist( - ppc_0["yl"].mean("yl_dim_2"), - label="model_0", - kind="hist", - hist_kwargs={"alpha": 0.5, "bins": bins}, -) -ax = az.plot_dist( - ppc_1["yl"].mean("yl_dim_2"), - label="model_1", - color="C1", - kind="hist", - hist_kwargs={"alpha": 0.5, "bins": bins}, - ax=ax, -) -ax.legend() -ax.set_xlabel("$\\theta$") -ax.xaxis.set_major_formatter(FormatStrFormatter("%0.1f")) -ax.set_yticks([]); -``` - -In this example the observed data $y$ is more consistent with `model_1` (because the prior is concentrated around the correct value of $\theta$) than `model_0` (which assigns equal probability to every possible value of $\theta$), and this difference is captured by the Bayes factor. We could say Bayes factors are measuring which model, as a whole, is better, including details of the prior that may be irrelevant for parameter inference. In fact in this example we can also see that it is possible to have two different models, with different Bayes factors, but nevertheless get very similar predictions. The reason is that the data is informative enough to reduce the effect of the prior up to the point of inducing a very similar posterior. As predictions are computed from the posterior we also get very similar predictions. In most scenarios when comparing models what we really care is the predictive accuracy of the models, if two models have similar predictive accuracy we consider both models as similar. To estimate the predictive accuracy we can use tools like PSIS-LOO-CV (`az.loo`), WAIC (`az.waic`), or cross-validation. 
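As a minimal sketch of that last point: if the two models are refit with `pm.sample` so that pointwise log-likelihood values are stored in the resulting `InferenceData` objects (here called `idata_0` and `idata_1`, hypothetical names, since the SMC traces above were used for the marginal likelihood instead), their predictive accuracy could be compared along these lines.

```python
import arviz as az

# Illustrative sketch, assuming idata_0 and idata_1 contain pointwise
# log-likelihoods (e.g. obtained from pm.sample rather than pm.sample_smc).
comparison = az.compare({"model_0": idata_0, "model_1": idata_1}, ic="loo")
print(comparison)
```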
- -+++ - -## Savage-Dickey Density Ratio - -For the previous examples we have compared two beta-binomial models, but sometimes what we want to do is to compare a null hypothesis H_0 (or null model) against an alternative one H_1. For example, to answer the question _is this coin biased?_, we could compare the value $\theta = 0.5$ (representing no bias) against the result from a model where we let $\theta$ vary. For this kind of comparison the null-model is nested within the alternative, meaning the null is a particular value of the model we are building. In those cases computing the Bayes Factor is very easy and it does not require any special method, because the math works out conveniently so we just need to compare the prior and posterior evaluated at the null-value (for example $\theta = 0.5$), under the alternative model. We can see that this is true from the following expression: - - -$$ -BF_{01} = \frac{p(y \mid H_0)}{p(y \mid H_1)} = \frac{p(\theta=0.5 \mid y, H_1)}{p(\theta=0.5 \mid H_1)} -$$ - -This only [holds](https://statproofbook.github.io/P/bf-sddr) when H_0 is a particular case of H_1. - -Let's do it with PyMC and ArviZ. We just need to get posterior and prior samples for a model. Let's try it with the beta-binomial model with the uniform prior we saw previously. - -```{code-cell} ipython3 -with pm.Model() as model_uni: - a = pm.Beta("a", 1, 1) - yl = pm.Bernoulli("yl", a, observed=y) - idata_uni = pm.sample(2000, random_seed=42) - idata_uni.extend(pm.sample_prior_predictive(8000)) -``` - -And now we call ArviZ's `az.plot_bf` function. - -```{code-cell} ipython3 -az.plot_bf(idata_uni, var_name="a", ref_val=0.5); -``` - -The plot shows one KDE for the prior (blue) and one for the posterior (orange). The two black dots show where we evaluate both distributions, at 0.5. We can see that the Bayes factor in favor of the null BF_01 is $\approx 8$, which we can interpret as _moderate evidence_ in favor of the null (see the Jeffreys' scale we discussed before). - -As we already discussed, Bayes factors measure which model, as a whole, is better at explaining the data. And this includes the prior, even if the prior has a relatively low impact on the posterior computation. We can also see this effect of the prior when comparing a second model against the null. - -If instead our model were a beta-binomial with a beta(30, 30) prior, the BF_01 would be lower (_anecdotal_ on the Jeffreys' scale). This is because under this model the value of $\theta=0.5$ is much more likely a priori than for a uniform prior, and hence the posterior and prior will be much more similar. Namely, there is not much surprise about seeing the posterior concentrated around 0.5 after collecting data. - -Let's compute it to see for ourselves.
- -```{code-cell} ipython3 -with pm.Model() as model_conc: - a = pm.Beta("a", 30, 30) - yl = pm.Bernoulli("yl", a, observed=y) - idata_conc = pm.sample(2000, random_seed=42) - idata_conc.extend(pm.sample_prior_predictive(8000)) -``` - -```{code-cell} ipython3 -az.plot_bf(idata_conc, var_name="a", ref_val=0.5); -``` - -* Authored by Osvaldo Martin in September, 2017 ([pymc#2563](https://github.com/pymc-devs/pymc/pull/2563)) -* Updated by Osvaldo Martin in August, 2018 ([pymc#3124](https://github.com/pymc-devs/pymc/pull/3124)) -* Updated by Osvaldo Martin in May, 2022 ([pymc-examples#342](https://github.com/pymc-devs/pymc-examples/pull/342)) -* Updated by Osvaldo Martin in Nov, 2022 - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames - -Dickey1970 -Wagenmakers2010 -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md b/myst_nbs/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md deleted file mode 100644 index c28d4c42d..000000000 --- a/myst_nbs/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.myst.md +++ /dev/null @@ -1,579 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(diagnosing_with_divergences)= -# Diagnosing Biased Inference with Divergences - -:::{post} Feb, 2018 -:tags: hierarchical model, diagnostics -:category: intermediate -:author: Agustina Arroyuelo -::: - -```{code-cell} ipython3 -from collections import defaultdict - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -SEED = [20100420, 20134234] -``` - -This notebook is a PyMC3 port of [Michael Betancourt's post on mc-stan](http://mc-stan.org/documentation/case-studies/divergences_and_bias.html). For detailed explanation of the underlying mechanism please check the original post, [Diagnosing Biased Inference with Divergences](http://mc-stan.org/documentation/case-studies/divergences_and_bias.html) and Betancourt's excellent paper, [A Conceptual Introduction to Hamiltonian Monte Carlo](https://arxiv.org/abs/1701.02434). - -+++ - -Bayesian statistics is all about building a model and estimating the parameters in that model. However, a naive or direct parameterization of our probability model can sometimes be ineffective, you can check out Thomas Wiecki's blog post, [Why hierarchical models are awesome, tricky, and Bayesian](http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/) on the same issue in PyMC3. Suboptimal parameterization often leads to slow sampling, and more problematic, biased MCMC estimators. 
- -More formally, as explained in the original post, [Diagnosing Biased Inference with Divergences](http://mc-stan.org/documentation/case-studies/divergences_and_bias.html): - -Markov chain Monte Carlo (MCMC) approximates expectations with respect to a given target distribution, - -$$ \mathbb{E}_{\pi} [ f ] = \int \mathrm{d}q \, \pi (q) \, f(q)$$ - -using the states of a Markov chain, $\{ q_{0}, \ldots, q_{N} \}$, - -$$ \mathbb{E}_{\pi} [ f ] \approx \hat{f}_{N} = \frac{1}{N + 1} \sum_{n = 0}^{N} f(q_{n}) $$ - -These estimators, however, are guaranteed to be accurate only asymptotically as the chain grows to be infinitely long, - -$$ \lim_{N \rightarrow \infty} \hat{f}_{N} = \mathbb{E}_{\pi} [ f ]$$ - -To be useful in applied analyses, we need MCMC estimators to converge to the true expectation values sufficiently quickly that they are reasonably accurate before we exhaust our finite computational resources. This fast convergence requires strong ergodicity conditions to hold, in particular geometric ergodicity between a Markov transition and a target distribution. Geometric ergodicity is usually the necessary condition for MCMC estimators to follow a central limit theorem, which ensures not only that they are unbiased even after only a finite number of iterations but also that we can empirically quantify their precision using the MCMC standard error. - -Unfortunately, proving geometric ergodicity is infeasible for any nontrivial problem. Instead we must rely on empirical diagnostics that identify obstructions to geometric ergodicity, and hence well-behaved MCMC estimators. For a general Markov transition and target distribution, the best known diagnostic is the split $\hat{R}$ statistic over an ensemble of Markov chains initialized from diffuse points in parameter space; to do any better we need to exploit the particular structure of a given transition or target distribution. - -Hamiltonian Monte Carlo, for example, is especially powerful in this regard as its failures to be geometrically ergodic with respect to any target distribution manifest in distinct behaviors that have been developed into sensitive diagnostics. One of these behaviors is the appearance of divergences that indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore. - -In this notebook we aim to identify divergences and the underlying pathologies in `PyMC3`. - -+++ - -## The Eight Schools Model - -The hierarchical model of the Eight Schools dataset (Rubin 1981) as seen in `Stan`: - -$$\mu \sim \mathcal{N}(0, 5)$$ -$$\tau \sim \text{Half-Cauchy}(0, 5)$$ -$$\theta_{n} \sim \mathcal{N}(\mu, \tau)$$ -$$y_{n} \sim \mathcal{N}(\theta_{n}, \sigma_{n}),$$ - -where $n \in \{1, \ldots, 8 \}$ and the $\{ y_{n}, \sigma_{n} \}$ are given as data. - -Inferring the hierarchical hyperparameters, $\mu$ and $\tau$, together with the group-level parameters, $\theta_{1}, \ldots, \theta_{8}$, allows the model to pool data across the groups and reduce their posterior variance. Unfortunately, the direct *centered* parameterization also squeezes the posterior distribution into a particularly challenging geometry that obstructs geometric ergodicity and hence biases MCMC estimation. - -```{code-cell} ipython3 -# Data of the Eight Schools Model -J = 8 -y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]) -sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]) -# tau = 25.
-``` - -## A Centered Eight Schools Implementation - -`Stan` model: - -```C -data { - int J; - real y[J]; - real sigma[J]; -} - -parameters { - real mu; - real tau; - real theta[J]; -} - -model { - mu ~ normal(0, 5); - tau ~ cauchy(0, 5); - theta ~ normal(mu, tau); - y ~ normal(theta, sigma); -} -``` -Similarly, we can easily implement it in `PyMC3` - -```{code-cell} ipython3 -with pm.Model() as Centered_eight: - mu = pm.Normal("mu", mu=0, sigma=5) - tau = pm.HalfCauchy("tau", beta=5) - theta = pm.Normal("theta", mu=mu, sigma=tau, shape=J) - obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y) -``` - -Unfortunately, this direct implementation of the model exhibits a pathological geometry that frustrates geometric ergodicity. Even more worrisome, the resulting bias is subtle and may not be obvious upon inspection of the Markov chain alone. To understand this bias, let's consider first a short Markov chain, commonly used when computational expediency is a motivating factor, and only afterwards a longer Markov chain. - -+++ - -### A Dangerously-Short Markov Chain - -```{code-cell} ipython3 -with Centered_eight: - short_trace = pm.sample(600, chains=2, random_seed=SEED) -``` - -In the [original post](http://mc-stan.org/documentation/case-studies/divergences_and_bias.html) a single chain of 1200 sample is applied. However, since split $\hat{R}$ is not implemented in `PyMC3` we fit 2 chains with 600 sample each instead. - -The Gelman-Rubin diagnostic $\hat{R}$ doesn’t indicate any problem (values are all close to 1). You could try re-running the model with a different seed and see if this still holds. - -```{code-cell} ipython3 -az.summary(short_trace).round(2) -``` - -Moreover, the trace plots all look fine. Let's consider, for example, the hierarchical standard deviation $\tau$, or more specifically, its logarithm, $log(\tau)$. Because $\tau$ is constrained to be positive, its logarithm will allow us to better resolve behavior for small values. Indeed the chains seems to be exploring both small and large values reasonably well. - -```{code-cell} ipython3 ---- -mystnb: - figure: - caption: Trace plot of log(tau) - name: nb-divergence-traceplot - image: - alt: log-tau ---- -# plot the trace of log(tau) -ax = az.plot_trace( - {"log(tau)": short_trace.get_values(varname="tau_log__", combine=False)}, legend=True -) -ax[0, 1].set_xlabel("Draw") -ax[0, 1].set_ylabel("log(tau)") -ax[0, 1].set_title("") - -ax[0, 0].set_xlabel("log(tau)") -ax[0, 0].set_title("Probability density function of log(tau)"); -``` - -Unfortunately, the resulting estimate for the mean of $log(\tau)$ is strongly biased away from the true value, here shown in grey. - -```{code-cell} ipython3 -# plot the estimate for the mean of log(τ) cumulating mean -logtau = np.log(short_trace["tau"]) -mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))] -plt.figure(figsize=(15, 4)) -plt.axhline(0.7657852, lw=2.5, color="gray") -plt.plot(mlogtau, lw=2.5) -plt.ylim(0, 2) -plt.xlabel("Iteration") -plt.ylabel("MCMC mean of log(tau)") -plt.title("MCMC estimation of log(tau)"); -``` - -Hamiltonian Monte Carlo, however, is not so oblivious to these issues as $\approx$ 3% of the iterations in our lone Markov chain ended with a divergence. 
 - -```{code-cell} ipython3 -# display the total number and percentage of divergent -divergent = short_trace["diverging"] -print("Number of Divergent %d" % divergent.nonzero()[0].size) -divperc = divergent.nonzero()[0].size / len(short_trace) * 100 -print("Percentage of Divergent %.1f" % divperc) -``` - -Even with a single short chain these divergences are able to identify the bias and advise skepticism of any resulting MCMC estimators. - -Additionally, because the divergent transitions, here shown in green, tend to be located near the pathologies, we can use them to identify the location of the problematic neighborhoods in parameter space. - -```{code-cell} ipython3 -def pairplot_divergence(trace, ax=None, divergence=True, color="C3", divergence_color="C2"): - theta = trace.get_values(varname="theta", combine=True)[:, 0] - logtau = trace.get_values(varname="tau_log__", combine=True) - if not ax: - _, ax = plt.subplots(1, 1, figsize=(10, 5)) - ax.plot(theta, logtau, "o", color=color, alpha=0.5) - if divergence: - divergent = trace["diverging"] - ax.plot(theta[divergent], logtau[divergent], "o", color=divergence_color) - ax.set_xlabel("theta[0]") - ax.set_ylabel("log(tau)") - ax.set_title("scatter plot between log(tau) and theta[0]") - return ax - - -pairplot_divergence(short_trace); -``` - -It is important to point out that the pathological samples from the trace are not necessarily concentrated at the funnel: when a divergence is encountered, the subtree being constructed is rejected and the transition samples uniformly from the existing discrete trajectory. Consequently, divergent samples will not be located exactly in the region of high curvature. - -In `pymc3`, we recently implemented a warning system that also saves information about _where_ the divergence occurs, so you can visualize them directly. To be more precise, what we include as the divergence point in the warning is the point where the problematic leapfrog step started. In some cases the divergence happens within one of the leapfrog steps (which, strictly speaking, is not a single point), but nonetheless, visualizing these points should give a close approximation of where the funnel is. - -Note that only the first 100 divergences are stored, so that we don't use up all the memory.
- -```{code-cell} ipython3 -divergent_point = defaultdict(list) - -chain_warn = short_trace.report._chain_warnings -for i in range(len(chain_warn)): - for warning_ in chain_warn[i]: - if warning_.step is not None and warning_.extra is not None: - for RV in Centered_eight.free_RVs: - para_name = RV.name - divergent_point[para_name].append(warning_.extra[para_name]) - -for RV in Centered_eight.free_RVs: - para_name = RV.name - divergent_point[para_name] = np.asarray(divergent_point[para_name]) - -tau_log_d = divergent_point["tau_log__"] -theta0_d = divergent_point["theta"] -Ndiv_recorded = len(tau_log_d) -``` - -```{code-cell} ipython3 -_, ax = plt.subplots(1, 2, figsize=(15, 6), sharex=True, sharey=True) - -pairplot_divergence(short_trace, ax=ax[0], color="C7", divergence_color="C2") - -plt.title("scatter plot between log(tau) and theta[0]") - -pairplot_divergence(short_trace, ax=ax[1], color="C7", divergence_color="C2") - -theta_trace = short_trace["theta"] -theta0 = theta_trace[:, 0] - -ax[1].plot( - [theta0[divergent == 1][:Ndiv_recorded], theta0_d], - [logtau[divergent == 1][:Ndiv_recorded], tau_log_d], - "k-", - alpha=0.5, -) - -ax[1].scatter( - theta0_d, tau_log_d, color="C3", label="Location of Energy error (start location of leapfrog)" -) - -plt.title("scatter plot between log(tau) and theta[0]") -plt.legend(); -``` - -There are many other ways to explore and visualize the pathological region in the parameter space. For example, we can reproduce Figure 5b in [Visualization in Bayesian workflow](https://arxiv.org/pdf/1709.01449.pdf) - -```{code-cell} ipython3 -tracedf = pm.trace_to_dataframe(short_trace) -plotorder = [ - "mu", - "tau", - "theta__0", - "theta__1", - "theta__2", - "theta__3", - "theta__4", - "theta__5", - "theta__6", - "theta__7", -] -tracedf = tracedf[plotorder] - -_, ax = plt.subplots(1, 2, figsize=(15, 4), sharex=True, sharey=True) -ax[0].plot(tracedf.values[divergent == 0].T, color="k", alpha=0.025) -ax[0].plot(tracedf.values[divergent == 1].T, color="C2", lw=0.5) - -ax[1].plot(tracedf.values[divergent == 0].T, color="k", alpha=0.025) -ax[1].plot(tracedf.values[divergent == 1].T, color="C2", lw=0.5) -divsp = np.hstack( - [ - divergent_point["mu"], - np.exp(divergent_point["tau_log__"]), - divergent_point["theta"], - ] -) -ax[1].plot(divsp.T, "C3", lw=0.5) -plt.ylim([-20, 40]) -plt.xticks(range(10), plotorder) -plt.tight_layout() -``` - -```{code-cell} ipython3 -# A small wrapper function for displaying the MCMC sampler diagnostics as above -def report_trace(trace): - # plot the trace of log(tau) - az.plot_trace({"log(tau)": trace.get_values(varname="tau_log__", combine=False)}) - - # plot the estimate for the mean of log(τ) cumulating mean - logtau = np.log(trace["tau"]) - mlogtau = [np.mean(logtau[:i]) for i in np.arange(1, len(logtau))] - plt.figure(figsize=(15, 4)) - plt.axhline(0.7657852, lw=2.5, color="gray") - plt.plot(mlogtau, lw=2.5) - plt.ylim(0, 2) - plt.xlabel("Iteration") - plt.ylabel("MCMC mean of log(tau)") - plt.title("MCMC estimation of log(tau)") - plt.show() - - # display the total number and percentage of divergent - divergent = trace["diverging"] - print("Number of Divergent %d" % divergent.nonzero()[0].size) - divperc = divergent.nonzero()[0].size / len(trace) * 100 - print("Percentage of Divergent %.1f" % divperc) - - # scatter plot between log(tau) and theta[0] - # for the identification of the problematic neighborhoods in parameter space - pairplot_divergence(trace); -``` - -### A Safer, Longer Markov Chain - -Given the potential 
insensitivity of split $\hat{R}$ on single short chains, `Stan` recommends always running multiple chains as long as possible to have the best chance of observing any obstructions to geometric ergodicity. Because it is not always possible to run long chains for complex models, however, divergences are an incredibly powerful diagnostic for biased MCMC estimation. - -```{code-cell} ipython3 -with Centered_eight: - longer_trace = pm.sample(4000, chains=2, tune=1000, random_seed=SEED) -``` - -```{code-cell} ipython3 -report_trace(longer_trace) -``` - -```{code-cell} ipython3 -az.summary(longer_trace).round(2) -``` - -Similar to the result in `Stan`, $\hat{R}$ does not indicate any serious issues. However, the effective sample size per iteration has drastically fallen, indicating that we are exploring less efficiently the longer we run. This odd behavior is a clear sign that something problematic is afoot. As shown in the trace plot, the chain occasionally "sticks" as it approaches small values of $\tau$, exactly where we saw the divergences concentrating. This is a clear indication of the underlying pathologies. These sticky intervals induce severe oscillations in the MCMC estimators early on, until they seem to finally settle into biased values. - -In fact, the sticky intervals are the Markov chain trying to correct the biased exploration. If we ran the chain even longer then it would eventually get stuck again and drag the MCMC estimator down towards the true value. Given an infinite number of iterations, this delicate balance asymptotes to the true expectation as we'd expect given the consistency guarantee of MCMC. Stopping after any finite number of iterations, however, destroys this balance and leaves us with a significant bias. - -More details can be found in Betancourt's [recent paper](https://arxiv.org/abs/1701.02434). - -+++ - -## Mitigating Divergences by Adjusting PyMC3's Adaptation Routine - -Divergences in Hamiltonian Monte Carlo arise when the Hamiltonian transition encounters regions of extremely large curvature, such as the opening of the hierarchical funnel. Unable to accurately resolve these regions, the transition malfunctions and flies off towards infinity. With the transitions unable to completely explore these regions of extreme curvature, we lose geometric ergodicity and our MCMC estimators become biased. - -The algorithm implemented in `Stan` uses a heuristic to quickly identify these misbehaving trajectories, and hence label divergences, without having to wait for them to run all the way to infinity. This heuristic can be a bit aggressive, however, and sometimes labels transitions as divergent even when we have not lost geometric ergodicity. - -To resolve this potential ambiguity we can adjust the step size, $\epsilon$, of the Hamiltonian transition. The smaller the step size the more accurate the trajectory and the less likely it will be mislabeled as a divergence. In other words, if we have geometric ergodicity between the Hamiltonian transition and the target distribution then decreasing the step size will reduce and then ultimately remove the divergences entirely. If we do not have geometric ergodicity, however, then decreasing the step size will not completely remove the divergences. - -Like `Stan`, the step size in `PyMC3` is tuned automatically during warm up, but we can coerce smaller step sizes by tweaking the configuration of `PyMC3`'s adaptation routine.
In particular, we can increase the `target_accept` parameter from its default value of 0.8 closer to its maximum value of 1. - -+++ - -### Adjusting Adaptation Routine - -```{code-cell} ipython3 -with Centered_eight: - fit_cp85 = pm.sample(5000, chains=2, tune=2000, target_accept=0.85) -``` - -```{code-cell} ipython3 -with Centered_eight: - fit_cp90 = pm.sample(5000, chains=2, tune=2000, target_accept=0.90) -``` - -```{code-cell} ipython3 -with Centered_eight: - fit_cp95 = pm.sample(5000, chains=2, tune=2000, target_accept=0.95) -``` - -```{code-cell} ipython3 -with Centered_eight: - fit_cp99 = pm.sample(5000, chains=2, tune=2000, target_accept=0.99) -``` - -```{code-cell} ipython3 -df = pd.DataFrame( - [ - longer_trace["step_size"].mean(), - fit_cp85["step_size"].mean(), - fit_cp90["step_size"].mean(), - fit_cp95["step_size"].mean(), - fit_cp99["step_size"].mean(), - ], - columns=["Step_size"], -) -df["Divergent"] = pd.Series( - [ - longer_trace["diverging"].sum(), - fit_cp85["diverging"].sum(), - fit_cp90["diverging"].sum(), - fit_cp95["diverging"].sum(), - fit_cp99["diverging"].sum(), - ] -) -df["delta_target"] = pd.Series([".80", ".85", ".90", ".95", ".99"]) -df -``` - -Here, the number of divergent transitions dropped dramatically when delta was increased to 0.99. - -This behavior also has a nice geometric intuition. The more we decrease the step size the more the Hamiltonian Markov chain can explore the neck of the funnel. Consequently, the marginal posterior distribution for $log (\tau)$ stretches further and further towards negative values with the decreasing step size. - -Since in `PyMC3` after tuning we have a smaller step size than `Stan`, the geometery is better explored. - -However, the Hamiltonian transition is still not geometrically ergodic with respect to the centered implementation of the Eight Schools model. Indeed, this is expected given the observed bias. - -```{code-cell} ipython3 -_, ax = plt.subplots(1, 1, figsize=(10, 6)) - -pairplot_divergence(fit_cp99, ax=ax, color="C3", divergence=False) - -pairplot_divergence(longer_trace, ax=ax, color="C1", divergence=False) - -ax.legend(["Centered, delta=0.99", "Centered, delta=0.85"]); -``` - -```{code-cell} ipython3 -logtau0 = longer_trace["tau_log__"] -logtau2 = np.log(fit_cp90["tau"]) -logtau1 = fit_cp99["tau_log__"] - -plt.figure(figsize=(15, 4)) -plt.axhline(0.7657852, lw=2.5, color="gray") -mlogtau0 = [np.mean(logtau0[:i]) for i in np.arange(1, len(logtau0))] -plt.plot(mlogtau0, label="Centered, delta=0.85", lw=2.5) -mlogtau2 = [np.mean(logtau2[:i]) for i in np.arange(1, len(logtau2))] -plt.plot(mlogtau2, label="Centered, delta=0.90", lw=2.5) -mlogtau1 = [np.mean(logtau1[:i]) for i in np.arange(1, len(logtau1))] -plt.plot(mlogtau1, label="Centered, delta=0.99", lw=2.5) -plt.ylim(0, 2) -plt.xlabel("Iteration") -plt.ylabel("MCMC mean of log(tau)") -plt.title("MCMC estimation of log(tau)") -plt.legend(); -``` - -## A Non-Centered Eight Schools Implementation - -Although reducing the step size improves exploration, ultimately it only reveals the true extent the pathology in the centered implementation. Fortunately, there is another way to implement hierarchical models that does not suffer from the same pathologies. - -In a non-centered parameterization we do not try to fit the group-level parameters directly, rather we fit a latent Gaussian variable from which we can recover the group-level parameters with a scaling and a translation. 
- -$$\mu \sim \mathcal{N}(0, 5)$$ -$$\tau \sim \text{Half-Cauchy}(0, 5)$$ -$$\tilde{\theta}_{n} \sim \mathcal{N}(0, 1)$$ -$$\theta_{n} = \mu + \tau \cdot \tilde{\theta}_{n}.$$ - -Stan model: - -```C -data { - int J; - real y[J]; - real sigma[J]; -} - -parameters { - real mu; - real tau; - real theta_tilde[J]; -} - -transformed parameters { - real theta[J]; - for (j in 1:J) - theta[j] = mu + tau * theta_tilde[j]; -} - -model { - mu ~ normal(0, 5); - tau ~ cauchy(0, 5); - theta_tilde ~ normal(0, 1); - y ~ normal(theta, sigma); -} -``` - -```{code-cell} ipython3 -with pm.Model() as NonCentered_eight: - mu = pm.Normal("mu", mu=0, sigma=5) - tau = pm.HalfCauchy("tau", beta=5) - theta_tilde = pm.Normal("theta_t", mu=0, sigma=1, shape=J) - theta = pm.Deterministic("theta", mu + tau * theta_tilde) - obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y) -``` - -```{code-cell} ipython3 -with NonCentered_eight: - fit_ncp80 = pm.sample(5000, chains=2, tune=1000, random_seed=SEED, target_accept=0.80) -``` - -```{code-cell} ipython3 -az.summary(fit_ncp80).round(2) -``` - -As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any "stickyness". However, we do still see the rare divergence. These infrequent divergences do not seem concentrate anywhere in parameter space, which is indicative of the divergences being false positives. - -```{code-cell} ipython3 -report_trace(fit_ncp80) -``` - -As expected of false positives, we can remove the divergences entirely by decreasing the step size. - -```{code-cell} ipython3 -with NonCentered_eight: - fit_ncp90 = pm.sample(5000, chains=2, tune=1000, random_seed=SEED, target_accept=0.90) - -# display the total number and percentage of divergent -divergent = fit_ncp90["diverging"] -print("Number of Divergent %d" % divergent.nonzero()[0].size) -``` - -The more agreeable geometry of the non-centered implementation allows the Markov chain to explore deep into the neck of the funnel, capturing even the smallest values of `tau` ($\tau$) that are consistent with the measurements. Consequently, MCMC estimators from the non-centered chain rapidly converge towards their true expectation values. 
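
As a quick numerical check of that claim (an illustrative addition that only reuses the traces fitted above and the reference value for $\log(\tau)$ used throughout this notebook), we can compare the posterior mean of $\log(\tau)$ from a centered fit and from the non-centered fit:

```{code-cell} ipython3
# Illustrative check: compare the posterior mean of log(tau) from a centered fit
# and the non-centered fit against the reference value plotted as the grey line above.
true_logtau = 0.7657852
for label, logtau_samples in [
    ("Centered, delta=0.90", np.log(fit_cp90["tau"])),
    ("Non-Centered, delta=0.80", fit_ncp80["tau_log__"]),
]:
    estimate = logtau_samples.mean()
    print(f"{label}: mean log(tau) = {estimate:.3f} (difference from reference = {estimate - true_logtau:+.3f})")
```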
- -```{code-cell} ipython3 -_, ax = plt.subplots(1, 1, figsize=(10, 6)) - -pairplot_divergence(fit_ncp80, ax=ax, color="C0", divergence=False) -pairplot_divergence(fit_cp99, ax=ax, color="C3", divergence=False) -pairplot_divergence(fit_cp90, ax=ax, color="C1", divergence=False) - -ax.legend(["Non-Centered, delta=0.80", "Centered, delta=0.99", "Centered, delta=0.90"]); -``` - -```{code-cell} ipython3 -logtaun = fit_ncp80["tau_log__"] - -plt.figure(figsize=(15, 4)) -plt.axhline(0.7657852, lw=2.5, color="gray") -mlogtaun = [np.mean(logtaun[:i]) for i in np.arange(1, len(logtaun))] -plt.plot(mlogtaun, color="C0", lw=2.5, label="Non-Centered, delta=0.80") - -mlogtau1 = [np.mean(logtau1[:i]) for i in np.arange(1, len(logtau1))] -plt.plot(mlogtau1, color="C3", lw=2.5, label="Centered, delta=0.99") - -mlogtau0 = [np.mean(logtau0[:i]) for i in np.arange(1, len(logtau0))] -plt.plot(mlogtau0, color="C1", lw=2.5, label="Centered, delta=0.90") -plt.ylim(0, 2) -plt.xlabel("Iteration") -plt.ylabel("MCMC mean of log(tau)") -plt.title("MCMC estimation of log(tau)") -plt.legend(); -``` - -## Authors -* Adapted from Michael Betancourt's post January 2017, [Diagnosing Biased Inference with Divergences](https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html) -* Updated by Agustina Arroyuelo in February 2018, ([pymc#2861](https://github.com/pymc-devs/pymc/pull/2861)) -* Updated by [@CloudChaoszero](https://github.com/CloudChaoszero) in January 2021, ([pymc-examples#25](https://github.com/pymc-devs/pymc-examples/pull/25)) -* Updated Markdown and styling by @reshamas in August 2022, ([pymc-examples#402](https://github.com/pymc-devs/pymc-examples/pull/402)) - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/diagnostics_and_criticism/model_averaging.myst.md b/myst_nbs/diagnostics_and_criticism/model_averaging.myst.md deleted file mode 100644 index 54c4ce251..000000000 --- a/myst_nbs/diagnostics_and_criticism/model_averaging.myst.md +++ /dev/null @@ -1,393 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(model_averaging)= -# Model Averaging - -:::{post} Aug 2022 -:tags: model comparison, model averaging -:category: intermediate -:author: Osvaldo Martin -::: - -```{code-cell} ipython3 ---- -papermill: - duration: 4.910288 - end_time: '2020-11-29T12:13:07.788552' - exception: false - start_time: '2020-11-29T12:13:02.878264' - status: completed -tags: [] ---- -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.058811 - end_time: '2020-11-29T12:13:07.895012' - exception: false - start_time: '2020-11-29T12:13:07.836201' - status: completed -tags: [] ---- -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -+++ {"papermill": {"duration": 0.068882, "end_time": "2020-11-29T12:13:08.020372", "exception": false, "start_time": "2020-11-29T12:13:07.951490", "status": "completed"}, "tags": []} - -When confronted with more than one model we have several options. 
One of them is to perform model selection, using, for example, a given Information Criterion, as exemplified in the PyMC examples {ref}`pymc:model_comparison` and {ref}`GLM-model-selection`. Model selection is appealing for its simplicity, but we are discarding information about the uncertainty in our models. This is somewhat similar to computing the full posterior and then just keeping a point estimate like the posterior mean; we may become overconfident about what we really know. You can also browse the {doc}`blog/tag/model-comparison` tag to find related posts. - -One alternative is to perform model selection but discuss all the different models together with the computed values of a given Information Criterion. It is important to put all these numbers and tests in the context of our problem so that we and our audience can have a better sense of the possible limitations and shortcomings of our methods. If you are in the academic world you can use this approach to add elements to the discussion section of a paper, presentation, thesis, and so on. - -Yet another approach is to perform model averaging. The idea now is to generate a meta-model (and meta-predictions) using a weighted average of the models. There are several ways to do this, and PyMC includes 3 of them that we are going to briefly discuss; you will find a more thorough explanation in the work by {cite:t}`Yao_2018`. PyMC integrates with ArviZ for model comparison. - - -## Pseudo Bayesian model averaging - -Bayesian models can be weighted by their marginal likelihood; this is known as Bayesian Model Averaging. While this is theoretically appealing, it is problematic in practice: on the one hand the marginal likelihood is highly sensitive to the specification of the prior, in a way that parameter estimation is not, and on the other, computing the marginal likelihood is usually a challenging task. An alternative route is to use the values of WAIC (Widely Applicable Information Criterion) or LOO (Pareto-smoothed importance sampling Leave-One-Out cross-validation), which we will call generically IC, to estimate weights. We can do this by using the following formula: - -$$w_i = \frac {e^{ - \frac{1}{2} dIC_i }} {\sum_j^M e^{ - \frac{1}{2} dIC_j }}$$ - -where $dIC_i$ is the difference between the $i$-th information criterion value and the lowest one. Remember that the lower the value of the IC, the better. We can use any information criterion we want to compute a set of weights, but, of course, we cannot mix them. - -This approach is called pseudo Bayesian model averaging, or Akaike-like weighting, and is a heuristic way to compute the relative probability of each model (given a fixed set of models) from the information criteria values. Note that the denominator is just a normalization term to ensure that the weights sum up to one. - -## Pseudo Bayesian model averaging with Bayesian Bootstrapping - -The above formula for computing weights is a very nice and simple approach, but with one major caveat: it does not take into account the uncertainty in the computation of the IC. We could compute the standard error of the IC (assuming a Gaussian approximation) and modify the above formula accordingly. Or we can do something more robust, like using a [Bayesian Bootstrap](http://www.sumsar.net/blog/2015/04/the-non-parametric-bootstrap-as-a-bayesian-model/) to estimate, and incorporate, this uncertainty. - -## Stacking - -The third approach implemented in PyMC is known as _stacking of predictive distributions_ by {cite:t}`Yao_2018`.
 - -We want to combine several models in a metamodel in order to minimize the divergence between the meta-model and the _true_ generating model; when using a logarithmic scoring rule this is equivalent to: - -$$\max_{w} \frac{1}{n} \sum_{i=1}^{n}\log\sum_{k=1}^{K} w_k p(y_i|y_{-i}, M_k)$$ - -where $n$ is the number of data points and $K$ the number of models. To enforce a solution we constrain $w$ to be $w_k \ge 0$ and $\sum_{k=1}^{K} w_k = 1$. - -The quantity $p(y_i|y_{-i}, M_k)$ is the leave-one-out predictive distribution for the $M_k$ model. Computing it requires fitting each model $n$ times, each time leaving out one data point. Fortunately we can approximate the exact leave-one-out predictive distribution using LOO (or even WAIC), and that is what we do in practice. - -## Weighted posterior predictive samples - -Once we have computed the weights, using any of the above 3 methods, we can use them to get weighted posterior predictive samples. PyMC offers functions to perform these steps in a simple way, so let's see them in action using an example. - -The following example is taken from the superb book {cite:t}`mcelreath2018statistical` by Richard McElreath. You will find more PyMC examples from this book in the repository [Statistical-Rethinking-with-Python-and-PyMC](https://github.com/pymc-devs/pymc-resources/tree/main/Rethinking_2). We are going to explore a simplified version of it. Check the book for the whole example and a more thorough discussion of both the biological motivation for this problem and a theoretical/practical discussion of using Information Criteria to compare, select and average models. - -Briefly, our problem is as follows: We want to explore the composition of milk across several primate species. It is hypothesized that females from species of primates with larger brains produce more _nutritious_ milk (loosely speaking, this is done _in order to_ support the development of such big brains). This is an important question for evolutionary biologists, and to try to give an answer we will use 3 variables: two predictor variables, the proportion of neocortex compared to the total mass of the brain and the logarithm of the body mass of the mothers; and as the predicted variable, the kilocalories per gram of milk. With these variables we are going to build 3 different linear models: - -1. A model using only the neocortex variable -2. A model using only the logarithm of the mass variable -3. A model using both variables - -Let's start by loading the data and centering the `neocortex` and `log_mass` variables, for better sampling. - -```{code-cell} ipython3 ---- -papermill: - duration: 1.114901 - end_time: '2020-11-29T12:13:09.196103' - exception: false - start_time: '2020-11-29T12:13:08.081202' - status: completed -tags: [] ---- -d = pd.read_csv( - "https://raw.githubusercontent.com/pymc-devs/resources/master/Rethinking_2/Data/milk.csv", - sep=";", -) -d = d[["kcal.per.g", "neocortex.perc", "mass"]].rename({"neocortex.perc": "neocortex"}, axis=1) -d["log_mass"] = np.log(d["mass"]) -d = d[~d.isna().any(axis=1)].drop("mass", axis=1) -d.iloc[:, 1:] = d.iloc[:, 1:] - d.iloc[:, 1:].mean() -d.head() -``` - -+++ {"papermill": {"duration": 0.048113, "end_time": "2020-11-29T12:13:09.292526", "exception": false, "start_time": "2020-11-29T12:13:09.244413", "status": "completed"}, "tags": []} - -Now that we have the data we are going to build our first model using only the `neocortex`.
- -```{code-cell} ipython3 ---- -papermill: - duration: 75.962348 - end_time: '2020-11-29T12:14:25.303027' - exception: false - start_time: '2020-11-29T12:13:09.340679' - status: completed -tags: [] ---- -with pm.Model() as model_0: - alpha = pm.Normal("alpha", mu=0, sigma=10) - beta = pm.Normal("beta", mu=0, sigma=10) - sigma = pm.HalfNormal("sigma", 10) - - mu = alpha + beta * d["neocortex"] - - kcal = pm.Normal("kcal", mu=mu, sigma=sigma, observed=d["kcal.per.g"]) - trace_0 = pm.sample(2000, return_inferencedata=True) -``` - -+++ {"papermill": {"duration": 0.049578, "end_time": "2020-11-29T12:14:25.401979", "exception": false, "start_time": "2020-11-29T12:14:25.352401", "status": "completed"}, "tags": []} - -The second model is exactly the same as the first one, except we now use the logarithm of the mass - -```{code-cell} ipython3 ---- -papermill: - duration: 8.996265 - end_time: '2020-11-29T12:14:34.447153' - exception: false - start_time: '2020-11-29T12:14:25.450888' - status: completed -tags: [] ---- -with pm.Model() as model_1: - alpha = pm.Normal("alpha", mu=0, sigma=10) - beta = pm.Normal("beta", mu=0, sigma=1) - sigma = pm.HalfNormal("sigma", 10) - - mu = alpha + beta * d["log_mass"] - - kcal = pm.Normal("kcal", mu=mu, sigma=sigma, observed=d["kcal.per.g"]) - - trace_1 = pm.sample(2000, return_inferencedata=True) -``` - -+++ {"papermill": {"duration": 0.049839, "end_time": "2020-11-29T12:14:34.547268", "exception": false, "start_time": "2020-11-29T12:14:34.497429", "status": "completed"}, "tags": []} - -And finally the third model using the `neocortex` and `log_mass` variables - -```{code-cell} ipython3 ---- -papermill: - duration: 19.373847 - end_time: '2020-11-29T12:14:53.971081' - exception: false - start_time: '2020-11-29T12:14:34.597234' - status: completed -tags: [] ---- -with pm.Model() as model_2: - alpha = pm.Normal("alpha", mu=0, sigma=10) - beta = pm.Normal("beta", mu=0, sigma=1, shape=2) - sigma = pm.HalfNormal("sigma", 10) - - mu = alpha + pm.math.dot(beta, d[["neocortex", "log_mass"]].T) - - kcal = pm.Normal("kcal", mu=mu, sigma=sigma, observed=d["kcal.per.g"]) - - trace_2 = pm.sample(2000, return_inferencedata=True) -``` - -+++ {"papermill": {"duration": 0.050236, "end_time": "2020-11-29T12:14:54.072799", "exception": false, "start_time": "2020-11-29T12:14:54.022563", "status": "completed"}, "tags": []} - -Now that we have sampled the posterior for the 3 models, we are going to compare them visually. One option is to use the `forestplot` function that supports plotting more than one trace. - -```{code-cell} ipython3 ---- -papermill: - duration: 0.967337 - end_time: '2020-11-29T12:14:55.090748' - exception: false - start_time: '2020-11-29T12:14:54.123411' - status: completed -tags: [] ---- -traces = [trace_0, trace_1, trace_2] -az.plot_forest(traces, figsize=(10, 5)); -``` - -+++ {"papermill": {"duration": 0.052958, "end_time": "2020-11-29T12:14:55.196722", "exception": false, "start_time": "2020-11-29T12:14:55.143764", "status": "completed"}, "tags": []} - -Another option is to plot several traces in a same plot is to use `plot_density`. This plot is somehow similar to a forestplot, but we get truncated KDE (kernel density estimation) plots (by default 95% credible intervals) grouped by variable names together with a point estimate (by default the mean). 
- -```{code-cell} ipython3 ---- -papermill: - duration: 2.61715 - end_time: '2020-11-29T12:14:57.866426' - exception: false - start_time: '2020-11-29T12:14:55.249276' - status: completed -tags: [] ---- -ax = az.plot_density( - traces, - var_names=["alpha", "sigma"], - shade=0.1, - data_labels=["Model 0 (neocortex)", "Model 1 (log_mass)", "Model 2 (neocortex+log_mass)"], -) - -ax[0, 0].set_xlabel("Density") -ax[0, 0].set_ylabel("") -ax[0, 0].set_title("95% Credible Intervals: alpha") - -ax[0, 1].set_xlabel("Density") -ax[0, 1].set_ylabel("") -ax[0, 1].set_title("95% Credible Intervals: sigma") -``` - -+++ {"papermill": {"duration": 0.055089, "end_time": "2020-11-29T12:14:57.977616", "exception": false, "start_time": "2020-11-29T12:14:57.922527", "status": "completed"}, "tags": []} - -Now that we have sampled the posterior for the 3 models, we are going to use WAIC (Widely applicable information criterion) to compare the 3 models. We can do this using the `compare` function included with ArviZ. - -```{code-cell} ipython3 ---- -papermill: - duration: 0.239084 - end_time: '2020-11-29T12:14:58.272998' - exception: false - start_time: '2020-11-29T12:14:58.033914' - status: completed -tags: [] ---- -model_dict = dict(zip(["model_0", "model_1", "model_2"], traces)) -comp = az.compare(model_dict) -comp -``` - -+++ {"papermill": {"duration": 0.056609, "end_time": "2020-11-29T12:14:58.387481", "exception": false, "start_time": "2020-11-29T12:14:58.330872", "status": "completed"}, "tags": []} - -We can see that the best model is `model_2`, the one with both predictor variables. Notice the DataFrame is ordered from lowest to highest WAIC (_i.e_ from _better_ to _worst_ model). Check the {ref}`pymc:model_comparison` for a more detailed discussion on model comparison. - -We can also see that we get a column with the relative `weight` for each model (according to the first equation at the beginning of this notebook). This weights can be _vaguely_ interpreted as the probability that each model will make the correct predictions on future data. Of course this interpretation is conditional on the models used to compute the weights, if we add or remove models the weights will change. And also is dependent on the assumptions behind WAIC (or any other Information Criterion used). So try to not overinterpret these `weights`. - -Now we are going to use computed `weights` to generate predictions based not on a single model, but on the weighted set of models. This is one way to perform model averaging. Using PyMC we can call the `sample_posterior_predictive_w` function as follows: - -```{code-cell} ipython3 ---- -papermill: - duration: 31.463179 - end_time: '2020-11-29T12:15:29.907492' - exception: false - start_time: '2020-11-29T12:14:58.444313' - status: completed -tags: [] ---- -ppc_w = pm.sample_posterior_predictive_w( - traces=traces, - models=[model_0, model_1, model_2], - weights=comp.weight.sort_index(ascending=True), - progressbar=True, -) -``` - -+++ {"papermill": {"duration": 0.058454, "end_time": "2020-11-29T12:15:30.024455", "exception": false, "start_time": "2020-11-29T12:15:29.966001", "status": "completed"}, "tags": []} - -Notice that we are passing the weights ordered by their index. We are doing this because we pass `traces` and `models` ordered from model 0 to 2, but the computed weights are ordered from lowest to highest WAIC (or equivalently from larger to lowest weight). In summary, we must be sure that we are correctly pairing the weights and models. 
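
As an aside, it can help intuition to see how the Akaike-like formula from the *Pseudo Bayesian model averaging* section turns IC values into weights. The sketch below is an illustrative addition that uses made-up WAIC values rather than the ones computed above; also note that `az.compare` defaults to the stacking method, so the `weight` column shown earlier is not necessarily computed this way.

```{code-cell} ipython3
# Illustrative sketch of the pseudo-BMA (Akaike-like) weighting formula.
# The IC values below are made up for illustration; remember that lower is better.
ic_values = np.array([220.3, 215.1, 210.4])  # hypothetical WAIC values for 3 models
d_ic = ic_values - ic_values.min()  # differences relative to the best (lowest) IC
raw_weights = np.exp(-0.5 * d_ic)
pseudo_bma_weights = raw_weights / raw_weights.sum()  # normalize so the weights sum to one
pseudo_bma_weights.round(3)
```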
- -We are also going to compute PPCs for the lowest-WAIC model. - -```{code-cell} ipython3 ---- -papermill: - duration: 25.204481 - end_time: '2020-11-29T12:15:55.287049' - exception: false - start_time: '2020-11-29T12:15:30.082568' - status: completed -tags: [] ---- -ppc_2 = pm.sample_posterior_predictive(trace=trace_2, model=model_2, progressbar=False) -``` - -+++ {"papermill": {"duration": 0.058214, "end_time": "2020-11-29T12:15:55.404271", "exception": false, "start_time": "2020-11-29T12:15:55.346057", "status": "completed"}, "tags": []} - -A simple way to compare both kind of predictions is to plot their mean and hpd interval. - -```{code-cell} ipython3 ---- -papermill: - duration: 0.301319 - end_time: '2020-11-29T12:15:55.764128' - exception: false - start_time: '2020-11-29T12:15:55.462809' - status: completed -tags: [] ---- -mean_w = ppc_w["kcal"].mean() -hpd_w = az.hdi(ppc_w["kcal"].flatten()) - -mean = ppc_2["kcal"].mean() -hpd = az.hdi(ppc_2["kcal"].flatten()) - -plt.plot(mean_w, 1, "C0o", label="weighted models") -plt.hlines(1, *hpd_w, "C0") -plt.plot(mean, 0, "C1o", label="model 2") -plt.hlines(0, *hpd, "C1") - -plt.yticks([]) -plt.ylim(-1, 2) -plt.xlabel("kcal per g") -plt.legend(); -``` - -+++ {"papermill": {"duration": 0.05969, "end_time": "2020-11-29T12:15:55.884685", "exception": false, "start_time": "2020-11-29T12:15:55.824995", "status": "completed"}, "tags": []} - -As we can see the mean value is almost the same for both predictions but the uncertainty in the weighted model is larger. We have effectively propagated the uncertainty about which model we should select to the posterior predictive samples. You can now try with the other two methods for computing weights `stacking` (the default and recommended method) and `pseudo-BMA`. - -**Final notes:** - -There are other ways to average models such as, for example, explicitly building a meta-model that includes all the models we have. We then perform parameter inference while jumping between the models. One problem with this approach is that jumping between models could hamper the proper sampling of the posterior. - -Besides averaging discrete models we can sometimes think of continuous versions of them. A toy example is to imagine that we have a coin and we want to estimated its degree of bias, a number between 0 and 1 having a 0.5 equal chance of head and tails (fair coin). We could think of two separate models one with a prior biased towards heads and one towards tails. We could fit both separate models and then average them using, for example, IC-derived weights. An alternative, is to build a hierarchical model to estimate the prior distribution, instead of contemplating two discrete models we will be computing a continuous model that includes these the discrete ones as particular cases. Which approach is better? That depends on our concrete problem. Do we have good reasons to think about two discrete models, or is our problem better represented with a continuous bigger model? 
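
To make that last point more concrete, here is a minimal sketch (an illustrative addition, not part of the original example) of what such a continuous, hierarchical alternative could look like for the coin toy problem; the coin-flip data and the hyperprior choices are made up for illustration.

```{code-cell} ipython3
# Illustrative sketch: instead of two discrete prior models (one biased towards heads,
# one towards tails), place a hyperprior on the prior mean of the bias and let the
# data inform it. The data and prior choices below are made up for illustration.
coin_flips = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

with pm.Model() as hierarchical_coin:
    prior_mean = pm.Beta("prior_mean", alpha=2, beta=2)  # hyperprior on the prior mean of theta
    concentration = 20  # fixed prior concentration, an arbitrary choice for this sketch
    theta = pm.Beta(
        "theta",
        alpha=prior_mean * concentration,
        beta=(1 - prior_mean) * concentration,
    )
    y_coin = pm.Bernoulli("y_coin", p=theta, observed=coin_flips)
    idata_coin = pm.sample(2000, return_inferencedata=True)
```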
- -+++ - -## Authors - -* Authored by Osvaldo Martin in June 2017 ([pymc#2273](https://github.com/pymc-devs/pymc/pull/2273)) -* Updated by Osvaldo Martin in December 2017 ([pymc#2741](https://github.com/pymc-devs/pymc/pull/2741)) -* Updated by Marco Gorelli in November 2020 ([pymc#4271](https://github.com/pymc-devs/pymc/pull/4271)) -* Moved from pymc to pymc-examples repo in December 2020 ([pymc-examples#8](https://github.com/pymc-devs/pymc-examples/pull/8)) -* Updated by Raul Maldonado in February 2021 ([pymc#25](https://github.com/pymc-devs/pymc-examples/pull/25)) -* Updated Markdown and styling by @reshamas in August 2022, ([pymc-examples#414](https://github.com/pymc-devs/pymc-examples/pull/414)) - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 ---- -papermill: - duration: 0.127595 - end_time: '2020-11-29T12:16:06.392237' - exception: false - start_time: '2020-11-29T12:16:06.264642' - status: completed -tags: [] ---- -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/diagnostics_and_criticism/sampler-stats.myst.md b/myst_nbs/diagnostics_and_criticism/sampler-stats.myst.md deleted file mode 100644 index 8a3467ad6..000000000 --- a/myst_nbs/diagnostics_and_criticism/sampler-stats.myst.md +++ /dev/null @@ -1,203 +0,0 @@ ---- -jupytext: - notebook_metadata_filter: substitutions - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(sampler_stats)= -# Sampler Statistics - -:::{post} May 31, 2022 -:tags: diagnostics -:category: beginner -:author: Meenal Jhajharia, Christian Luhmann -::: - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -%matplotlib inline - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -plt.rcParams["figure.constrained_layout.use"] = False -``` - -When checking for convergence or when debugging a badly behaving sampler, it is often helpful to take a closer look at what the sampler is doing. For this purpose some samplers export statistics for each generated sample. - -As a minimal example we sample from a standard normal distribution: - -```{code-cell} ipython3 -model = pm.Model() -with model: - mu1 = pm.Normal("mu1", mu=0, sigma=1, shape=10) -``` - -```{code-cell} ipython3 -with model: - step = pm.NUTS() - idata = pm.sample(2000, tune=1000, init=None, step=step, chains=4) -``` - -- `Note`: NUTS provides the following statistics (these are internal statistics that the sampler uses, you don't need to do anything with them when using PyMC, to learn more about them, {class}`pymc.NUTS`. - -```{code-cell} ipython3 -idata.sample_stats -``` - -The sample statistics variables are defined as follows: - -- `process_time_diff`: The time it took to draw the sample, as defined by the python standard library time.process_time. This counts all the CPU time, including worker processes in BLAS and OpenMP. - -- `step_size`: The current integration step size. - -- `diverging`: (boolean) Indicates the presence of leapfrog transitions with large energy deviation from starting and subsequent termination of the trajectory. “large” is defined as `max_energy_error` going over a threshold. 
- -- `lp`: The joint log posterior density for the model (up to an additive constant). - -- `energy`: The value of the Hamiltonian energy for the accepted proposal (up to an additive constant). - -- `energy_error`: The difference in the Hamiltonian energy between the initial point and the accepted proposal. - -- `perf_counter_diff`: The time it took to draw the sample, as defined by the python standard library time.perf_counter (wall time). - -- `perf_counter_start`: The value of time.perf_counter at the beginning of the computation of the draw. - -- `n_steps`: The number of leapfrog steps computed. It is related to `tree_depth` with `n_steps <= 2^tree_dept`. - -- `max_energy_error`: The maximum absolute difference in Hamiltonian energy between the initial point and all possible samples in the proposed tree. - -- `acceptance_rate`: The average acceptance probabilities of all possible samples in the proposed tree. - -- `step_size_bar`: The current best known step-size. After the tuning samples, the step size is set to this value. This should converge during tuning. - -- `tree_depth`: The number of tree doublings in the balanced binary tree. - -+++ - -Some points to `Note`: -- Some of the sample statistics used by NUTS are renamed when converting to `InferenceData` to follow {ref}`ArviZ's naming convention `, while some are specific to PyMC3 and keep their internal PyMC3 name in the resulting InferenceData object. -- `InferenceData` also stores additional info like the date, versions used, sampling time and tuning steps as attributes. - -```{code-cell} ipython3 -idata.sample_stats["tree_depth"].plot(col="chain", ls="none", marker=".", alpha=0.3); -``` - -```{code-cell} ipython3 -az.plot_posterior( - idata, group="sample_stats", var_names="acceptance_rate", hdi_prob="hide", kind="hist" -); -``` - -We check if there are any divergences, if yes, how many? - -```{code-cell} ipython3 -idata.sample_stats["diverging"].sum() -``` - -In this case no divergences are found. If there are any, check [this notebook](https://github.com/pymc-devs/pymc-examples/blob/main/examples/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.ipynb) for information on handling divergences. - -+++ - -It is often useful to compare the overall distribution of the -energy levels with the change of energy between successive samples. -Ideally, they should be very similar: - -```{code-cell} ipython3 -az.plot_energy(idata, figsize=(6, 4)); -``` - -If the overall distribution of energy levels has longer tails, the efficiency of the sampler will deteriorate quickly. - -+++ - -## Multiple samplers - -If multiple samplers are used for the same model (e.g. for continuous and discrete variables), the exported values are merged or stacked along a new axis. 
- -```{code-cell} ipython3 -coords = {"step": ["BinaryMetropolis", "Metropolis"], "obs": ["mu1"]} -dims = {"accept": ["step"]} - -with pm.Model(coords=coords) as model: - mu1 = pm.Bernoulli("mu1", p=0.8) - mu2 = pm.Normal("mu2", mu=0, sigma=1, dims="obs") -``` - -```{code-cell} ipython3 -with model: - step1 = pm.BinaryMetropolis([mu1]) - step2 = pm.Metropolis([mu2]) - idata = pm.sample( - 10000, - init=None, - step=[step1, step2], - chains=4, - tune=1000, - idata_kwargs={"dims": dims, "coords": coords}, - ) -``` - -```{code-cell} ipython3 -list(idata.sample_stats.data_vars) -``` - -Both samplers export `accept`, so we get one acceptance probability for each sampler: - -```{code-cell} ipython3 -az.plot_posterior( - idata, - group="sample_stats", - var_names="accept", - hdi_prob="hide", - kind="hist", -); -``` - -We notice that `accept` sometimes takes really high values (jumps from regions of low probability to regions of much higher probability). - -```{code-cell} ipython3 -# Range of accept values -idata.sample_stats["accept"].max("draw") - idata.sample_stats["accept"].min("draw") -``` - -```{code-cell} ipython3 -# We can try plotting the density and view the high density intervals to understand the variable better -az.plot_density( - idata, - group="sample_stats", - var_names="accept", - point_estimate="mean", -); -``` - -## Authors -* Updated by Meenal Jhajharia in April 2021 ([pymc-examples#95](https://github.com/pymc-devs/pymc-examples/pull/95)) -* Updated to v4 by Christian Luhmann in May 2022 ([pymc-examples#338](https://github.com/pymc-devs/pymc-examples/pull/338)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/GP-Circular.myst.md b/myst_nbs/gaussian_processes/GP-Circular.myst.md deleted file mode 100644 index 330da3c84..000000000 --- a/myst_nbs/gaussian_processes/GP-Circular.myst.md +++ /dev/null @@ -1,212 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc3 - language: python - name: pymc3 ---- - -# GP-Circular - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 42 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -Circular domains are a challenge for Gaussian Processes. - -* Periodic patterns are assumed, but they are hard to capture with primitives -* For circular domain $[0, \pi)$ how to model correlation between $\pi-\varepsilon$ and $\varepsilon$, real distance is $2\varepsilon$, but computes differently if you just treat it non circular $(\pi-\varepsilon) - \varepsilon$ -* For correctly computed distances we need to verify kernel we obtain is positive definite - -**An alternative approach is required.** - - -In the following [paper](https://hal.archives-ouvertes.fr/hal-01119942v1/document), the Weinland function is used to solve the problem and ensures positive definite kernel on the circular domain (and not only). 
- -$$ -W_c(t) = \left(1 + \tau \frac{t}{c}\right)\left(1-\frac{t}{c}\right)_+^\tau -$$ -where $c$ is maximum value for $t$ and $\tau\ge 4$ is some positive number - -+++ - -The kernel itself for geodesic distance (arc length) on a circle looks like - -$$ -k_g(x, y) = W_\pi(\text{dist}_{\mathit{geo}}(x, y)) -$$ - -+++ - -Briefly, you can think - -* $t$ is time, it runs from $0$ to $24$ and then goes back to $0$ -* $c$ is maximum distance between any timestamps, here it would be $12$ -* $\tau$ is proportional to the correleation strength. Let's see how much! - -+++ - -In python the Weinland function is implemented like this - -```{code-cell} ipython3 -def weinland(t, c, tau=4): - return (1 + tau * t / c) * np.clip(1 - t / c, 0, np.inf) ** tau -``` - -We also need implementation for the distance on a circular domain - -```{code-cell} ipython3 -def angular_distance(x, y, c): - # https://stackoverflow.com/questions/1878907/the-smallest-difference-between-2-angles - return (x - y + c) % (c * 2) - c -``` - -```{code-cell} ipython3 -C = np.pi -x = np.linspace(0, C) -``` - -Let's visualize what the Weinland function is, and how it affects the kernel: - -```{code-cell} ipython3 -plt.figure(figsize=(16, 9)) -for tau in range(4, 10): - plt.plot(x, weinland(x, C, tau), label=f"tau={tau}") -plt.legend() -plt.ylabel("K(x, y)") -plt.xlabel("dist"); -``` - -As we see, the higher $\tau$ is, the less correlated the samples - -Also, let's validate our circular distance function is working as expected - -```{code-cell} ipython3 -plt.plot( - np.linspace(0, 10 * np.pi, 1000), - abs(angular_distance(np.linspace(0, 10 * np.pi, 1000), 1.5, C)), -) -plt.ylabel(r"$\operatorname{dist}_{geo}(1.5, x)$") -plt.xlabel("$x$"); -``` - -In pymc3 we will use `pm.gp.cov.Circular` to model circular functions - -```{code-cell} ipython3 -angles = np.linspace(0, 2 * np.pi) -observed = dict(x=np.random.uniform(0, np.pi * 2, size=5), y=np.random.randn(5) + 4) - - -def plot_kernel_results(Kernel): - """ - To check for many kernels we leave it as a parameter - """ - with pm.Model() as model: - cov = Kernel() - gp = pm.gp.Marginal(pm.gp.mean.Constant(4), cov_func=cov) - lik = gp.marginal_likelihood("x_obs", X=observed["x"][:, None], y=observed["y"], noise=0.2) - mp = pm.find_MAP() - # actual functions - y_sampled = gp.conditional("y", angles[:, None]) - # GP predictions (mu, cov) - y_pred = gp.predict(angles[:, None], point=mp) - trace = pm.sample_posterior_predictive([mp], var_names=["y"], samples=100) - plt.figure(figsize=(9, 9)) - paths = plt.polar(angles, trace["y"].T, color="b", alpha=0.05) - plt.scatter(observed["x"], observed["y"], color="r", alpha=1, label="observations") - plt.polar(angles, y_pred[0], color="black") - plt.fill_between( - angles, - y_pred[0] - np.diag(y_pred[1]) ** 0.5, - y_pred[0] + np.diag(y_pred[1]) ** 0.5, - color="gray", - alpha=0.5, - label=r"$\mu\pm\sigma$", - ) - plt.fill_between( - angles, - y_pred[0] - np.diag(y_pred[1]) ** 0.5 * 3, - y_pred[0] + np.diag(y_pred[1]) ** 0.5 * 3, - color="gray", - alpha=0.25, - label=r"$\mu\pm3\sigma$", - ) - plt.legend() -``` - -```{code-cell} ipython3 -def circular(): - tau = pm.Deterministic("τ", pm.Gamma("_τ", alpha=2, beta=1) + 4) - cov = pm.gp.cov.Circular(1, period=2 * np.pi, tau=tau) - return cov -``` - -```{code-cell} ipython3 -plot_kernel_results(circular) -``` - -An alternative solution is Periodic kernel. 
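
Before comparing against other kernels, a quick sanity check of the wrap-around behaviour can be useful. The sketch below is an illustrative addition with `tau` fixed at 4 (an arbitrary choice): it evaluates the Circular covariance on a small grid of angles and shows that the two endpoints of the grid, which are a full period apart, are treated as essentially the same point.

```{code-cell} ipython3
# Illustrative check of the wrap-around property of the Circular kernel (tau fixed at 4).
grid = np.linspace(0, 2 * np.pi, 8)[:, None]
cov_check = pm.gp.cov.Circular(1, period=2 * np.pi, tau=4)
K = cov_check(grid).eval()
# grid[0] and grid[-1] differ by a full period, so their correlation is ~1,
# while grid[0] and grid[1] are closer in Euclidean terms but less correlated.
K[0, -1], K[0, 1]
```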
- -**Note**: - -* In Periodic kernel, the key parameter to control for correlation between points is `ls` -* In Circular kernel it is `tau`, adding `ls` parameter did not make sense since it cancels out - -Basically there is little difference between these kernels, only the way to model correlations. - -```{code-cell} ipython3 -def periodic(): - ls = pm.Gamma("ℓ", alpha=2, beta=1) - return pm.gp.cov.Periodic(1, 2 * np.pi, ls=ls) -``` - -```{code-cell} ipython3 -plot_kernel_results(periodic) -``` - -From the simulation, we see that **Circular kernel leads to a more uncertain posterior.** - -+++ - -Let's see how Exponential kernel fails - -```{code-cell} ipython3 -def rbf(): - ls = pm.Gamma("ℓ", alpha=2, beta=1) - return pm.gp.cov.Exponential(1, ls=ls) -``` - -```{code-cell} ipython3 -plot_kernel_results(rbf) -``` - -The results look similar to what we had with Circular kernel, but the change point $0^\circ$ is not taken in account - -+++ - -## Conclusions - -* Use circular/periodic kernel once you strongly believe function should smoothly go through the boundary of the cycle -* Periodic kernel is as fine as Circular except that the latter allows more uncertainty -* RBF kernel is not the right choice - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/gaussian_processes/GP-Heteroskedastic.myst.md b/myst_nbs/gaussian_processes/GP-Heteroskedastic.myst.md deleted file mode 100644 index d36e73df0..000000000 --- a/myst_nbs/gaussian_processes/GP-Heteroskedastic.myst.md +++ /dev/null @@ -1,538 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python [conda env:pymc3] - language: python - name: conda-env-pymc3-py ---- - -# Heteroskedastic Gaussian Processes - -+++ - -We can typically divide the sources of uncertainty in our models into two categories. "Aleatoric" uncertainty (from the Latin word for dice or randomness) arises from the intrinsic variability of our system. "Epistemic" uncertainty (from the Greek word for knowledge) arises from how our observations are placed throughout the domain of interest. - -Gaussian Process (GP) models are a powerful tool to capture both of these sources of uncertainty. By considering the distribution of all functions that satisfy the conditions specified by the covariance kernel and the data, these models express low epistemic uncertainty near the observations and high epistemic uncertainty farther away. To incorporate aleatoric uncertainty, the standard GP model assumes additive white noise with constant magnitude throughout the domain. However, this "homoskedastic" model can do a poor job of representing your system if some regions have higher variance than others. Among other fields, this is particularly common in the experimental sciences, where varying experimental parameters can affect both the magnitude and the variability of the response. Explicitly incorporating the dependence (and inter-dependence) of noise on the inputs and outputs can lead to a better understanding of the mean behavior as well as a more informative landscape for optimization, for example. - -This notebook will work through several approaches to heteroskedastic modeling with GPs. We'll use toy data that represents (independent) repeated measurements at a range of input values on a system where the magnitude of the noise increases with the response variable. 
We'll start with simplistic modeling approaches such as fitting a GP to the mean at each point weighted by the variance at each point (which may be useful if individual measurements are taken via a method with known uncertainty), contrasting this with a typical homoskedastic GP. We'll then construct a model that uses one latent GP to model the response mean and a second (independent) latent GP to model the response variance. To improve the efficiency and scalability of this model, we'll re-formulate it in a sparse framework. Finally, we'll use a coregionalization kernel to allow correlation between the noise and the mean response. - -+++ - -## Data - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt - -from scipy.spatial.distance import pdist - -%config InlineBackend.figure_format ='retina' -%load_ext watermark -``` - -```{code-cell} ipython3 -SEED = 2020 -rng = np.random.default_rng(SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -def signal(x): - return x / 2 + np.sin(2 * np.pi * x) / 5 - - -def noise(y): - return np.exp(y) / 20 - - -X = np.linspace(0.1, 1, 20)[:, None] -X = np.vstack([X, X + 2]) -X_ = X.flatten() -y = signal(X_) -σ_fun = noise(y) - -y_err = rng.lognormal(np.log(σ_fun), 0.1) -y_obs = rng.normal(y, y_err, size=(5, len(y))) -y_obs_ = y_obs.T.flatten() -X_obs = np.tile(X.T, (5, 1)).T.reshape(-1, 1) -X_obs_ = X_obs.flatten() -idx = np.tile(np.array([i for i, _ in enumerate(X_)]), (5, 1)).T.flatten() - -Xnew = np.linspace(-0.15, 3.25, 100)[:, None] -Xnew_ = Xnew.flatten() -ynew = signal(Xnew) - -plt.plot(X, y, "C0o") -plt.errorbar(X_, y, y_err, color="C0") -``` - -## Helper and plotting functions - -```{code-cell} ipython3 -def get_ℓ_prior(points): - """Calculates mean and sd for InverseGamma prior on lengthscale""" - distances = pdist(points[:, None]) - distinct = distances != 0 - ℓ_l = distances[distinct].min() if sum(distinct) > 0 else 0.1 - ℓ_u = distances[distinct].max() if sum(distinct) > 0 else 1 - ℓ_σ = max(0.1, (ℓ_u - ℓ_l) / 6) - ℓ_μ = ℓ_l + 3 * ℓ_σ - return ℓ_μ, ℓ_σ - - -ℓ_μ, ℓ_σ = [stat for stat in get_ℓ_prior(X_)] -``` - -```{code-cell} ipython3 -def plot_inducing_points(ax): - yl = ax.get_ylim() - yu = -np.subtract(*yl) * 0.025 + yl[0] - ax.plot(Xu, np.full(Xu.shape, yu), "xk", label="Inducing Points") - ax.legend(loc="upper left") - - -def get_quantiles(samples, quantiles=[2.5, 50, 97.5]): - return [np.percentile(samples, p, axis=0) for p in quantiles] - - -def plot_mean(ax, mean_samples): - """Plots the median and 95% CI from samples of the mean - - Note that, although each individual GP exhibits a normal distribution at each point - (by definition), we are sampling from a mixture of GPs defined by the posteriors of - our hyperparameters. As such, we use percentiles rather than mean +/- stdev to - represent the spread of predictions from our models. 
- """ - l, m, u = get_quantiles(mean_samples) - ax.plot(Xnew, m, "C0", label="Median") - ax.fill_between(Xnew_, l, u, facecolor="C0", alpha=0.5, label="95% CI") - - ax.plot(Xnew, ynew, "--k", label="Mean Function") - ax.plot(X, y, "C1.", label="Observed Means") - ax.set_title("Mean Behavior") - ax.legend(loc="upper left") - - -def plot_var(ax, var_samples): - """Plots the median and 95% CI from samples of the variance""" - if var_samples.squeeze().ndim == 1: - ax.plot(Xnew, var_samples, "C0", label="Median") - else: - l, m, u = get_quantiles(var_samples) - ax.plot(Xnew, m, "C0", label="Median") - ax.fill_between(Xnew.flatten(), l, u, facecolor="C0", alpha=0.5, label="95% CI") - ax.plot(Xnew, noise(signal(Xnew_)) ** 2, "--k", label="Noise Function") - ax.plot(X, y_err**2, "C1.", label="Observed Variance") - ax.set_title("Variance Behavior") - ax.legend(loc="upper left") - - -def plot_total(ax, mean_samples, var_samples=None, bootstrap=True, n_boots=100): - """Plots the overall mean and variance of the aggregate system - - We can represent the overall uncertainty via explicitly sampling the underlying normal - distributrions (with `bootstrap=True`) or as the mean +/- the standard deviation from - the Law of Total Variance. For systems with many observations, there will likely be - little difference, but in cases with few observations and informative priors, plotting - the percentiles will likely give a more accurate representation. - """ - - if (var_samples is None) or (var_samples.squeeze().ndim == 1): - samples = mean_samples - l, m, u = get_quantiles(samples) - ax.plot(Xnew, m, "C0", label="Median") - elif bootstrap: - # Estimate the aggregate behavior using samples from each normal distribution in the posterior - samples = ( - rng.normal( - mean_samples.T[:, :, None], - np.sqrt(var_samples).T[:, :, None], - (*mean_samples.T.shape, n_boots), - ) - .reshape(len(Xnew_), -1) - .T - ) - l, m, u = get_quantiles(samples) - ax.plot(Xnew, m, "C0", label="Median") - else: - m = mean_samples.mean(axis=0) - ax.plot(Xnew, m, "C0", label="Mean") - sd = np.sqrt(mean_samples.var(axis=0) + var_samples.mean(axis=0)) - l, u = m - 2 * sd, m + 2 * sd - - ax.fill_between(Xnew.flatten(), l, u, facecolor="C0", alpha=0.5, label="Total 95% CI") - - ax.plot(Xnew, ynew, "--k", label="Mean Function") - ax.plot(X_obs, y_obs_, "C1.", label="Observations") - ax.set_title("Aggregate Behavior") - ax.legend(loc="upper left") -``` - -## Homoskedastic GP - -+++ - -First let's fit a standard homoskedastic GP using PyMC3's `Marginal Likelihood` implementation. Here and throughout this notebook we'll use an informative prior for length scale as suggested by [Michael Betancourt](https://betanalpha.github.io/assets/case_studies/gp_part3/part3.html#4_adding_an_informative_prior_for_the_length_scale). We could use `pm.find_MAP()` and `.predict`for even faster inference and prediction, with similar results, but for direct comparison to the other models we'll use NUTS and `.conditional` instead, which run fast enough. 
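For reference, here is a minimal sketch (not executed in this notebook, and not part of the original) of what that faster MAP-based workflow could look like. It reuses the variables defined in the cells above (`X_obs`, `y_obs_`, `Xnew`, `ℓ_μ`, `ℓ_σ`) and mirrors the priors of the model specified below; the names `model_hm_map`, `gp_hm_map`, and `ml_hm_map` are illustrative.

```python
# Sketch only: MAP estimation plus point prediction with gp.Marginal.predict,
# as an alternative to the NUTS + .conditional workflow used below.
with pm.Model() as model_hm_map:
    ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ)
    η = pm.Gamma("η", alpha=2, beta=1)
    cov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ℓ)

    gp_hm_map = pm.gp.Marginal(cov_func=cov)
    σ = pm.Exponential("σ", lam=1)
    gp_hm_map.marginal_likelihood("ml_hm_map", X=X_obs, y=y_obs_, noise=σ)

    mp = pm.find_MAP()

# conditional mean and variance at the prediction points, evaluated at the MAP point
mu, var = gp_hm_map.predict(Xnew, point=mp, diag=True, pred_noise=True)
```

We stick with NUTS and `.conditional` below so the results are directly comparable across all of the models in this notebook.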
- -```{code-cell} ipython3 -with pm.Model() as model_hm: - ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ) - η = pm.Gamma("η", alpha=2, beta=1) - cov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ℓ) - - gp_hm = pm.gp.Marginal(cov_func=cov) - - σ = pm.Exponential("σ", lam=1) - - ml_hm = gp_hm.marginal_likelihood("ml_hm", X=X_obs, y=y_obs_, noise=σ) - - trace_hm = pm.sample(return_inferencedata=True, random_seed=SEED) - -with model_hm: - mu_pred_hm = gp_hm.conditional("mu_pred_hm", Xnew=Xnew) - noisy_pred_hm = gp_hm.conditional("noisy_pred_hm", Xnew=Xnew, pred_noise=True) - samples_hm = pm.sample_posterior_predictive(trace_hm, var_names=["mu_pred_hm", "noisy_pred_hm"]) -``` - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 3, figsize=(18, 4)) -mu_samples = samples_hm["mu_pred_hm"] -noisy_samples = samples_hm["noisy_pred_hm"] -plot_mean(axs[0], mu_samples) -plot_var(axs[1], noisy_samples.var(axis=0)) -plot_total(axs[2], noisy_samples) -``` - -Here we've plotted our understanding of the mean behavior with the corresponding epistemic uncertainty on the left, our understanding of the variance or aleatoric uncertainty in the middle, and integrate all sources of uncertainty on the right. This model captures the mean behavior well, but we can see that it overestimates the noise in the lower regime while underestimating the noise in the upper regime, as expected. - -+++ - -## Variance-weighted GP - -+++ - -The simplest approach to modeling a heteroskedastic system is to fit a GP on the mean at each point along the domain and supply the standard deviation as weights. - -```{code-cell} ipython3 -with pm.Model() as model_wt: - ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ) - η = pm.Gamma("η", alpha=2, beta=1) - cov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ℓ) - - gp_wt = pm.gp.Marginal(cov_func=cov) - - ml_wt = gp_wt.marginal_likelihood("ml_wt", X=X, y=y, noise=y_err) - - trace_wt = pm.sample(return_inferencedata=True, random_seed=SEED) - -with model_wt: - mu_pred_wt = gp_wt.conditional("mu_pred_wt", Xnew=Xnew) - samples_wt = pm.sample_posterior_predictive(trace_wt, var_names=["mu_pred_wt"]) -``` - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 3, figsize=(18, 4)) -mu_samples = samples_wt["mu_pred_wt"] -plot_mean(axs[0], mu_samples) -axs[0].errorbar(X_, y, y_err, ls="none", color="C1", label="STDEV") -plot_var(axs[1], mu_samples.var(axis=0)) -plot_total(axs[2], mu_samples) -``` - -This approach captured slightly more nuance in the overall uncertainty than the homoskedastic GP, but still underestimated the variance within both the observed regimes. Note that the variance displayed by this model is purely epistemic: our understanding of the mean behavior is weighted by the uncertainty in our observations, but we didn't include a component to account for aleatoric noise. - -+++ - -## Heteroskedastic GP: latent variance model - -+++ - -Now let's model the mean and the log of the variance as separate GPs through PyMC3's `Latent` implementation, feeding both into a `Normal` likelihood. Note that we add a small amount of diagonal noise to the individual covariances in order to stabilize them for inversion. 

```{code-cell} ipython3
with pm.Model() as model_ht:
    ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ)
    η = pm.Gamma("η", alpha=2, beta=1)
    cov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ℓ) + pm.gp.cov.WhiteNoise(sigma=1e-6)

    gp_ht = pm.gp.Latent(cov_func=cov)
    μ_f = gp_ht.prior("μ_f", X=X_obs)

    σ_ℓ = pm.InverseGamma("σ_ℓ", mu=ℓ_μ, sigma=ℓ_σ)
    σ_η = pm.Gamma("σ_η", alpha=2, beta=1)
    σ_cov = σ_η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=σ_ℓ) + pm.gp.cov.WhiteNoise(sigma=1e-6)

    σ_gp = pm.gp.Latent(cov_func=σ_cov)
    lg_σ_f = σ_gp.prior("lg_σ_f", X=X_obs)
    σ_f = pm.Deterministic("σ_f", pm.math.exp(lg_σ_f))

    lik_ht = pm.Normal("lik_ht", mu=μ_f, sd=σ_f, observed=y_obs_)

    trace_ht = pm.sample(target_accept=0.95, chains=2, return_inferencedata=True, random_seed=SEED)

with model_ht:
    μ_pred_ht = gp_ht.conditional("μ_pred_ht", Xnew=Xnew)
    lg_σ_pred_ht = σ_gp.conditional("lg_σ_pred_ht", Xnew=Xnew)
    samples_ht = pm.sample_posterior_predictive(trace_ht, var_names=["μ_pred_ht", "lg_σ_pred_ht"])
```

```{code-cell} ipython3
_, axs = plt.subplots(1, 3, figsize=(18, 4))
μ_samples = samples_ht["μ_pred_ht"]
σ_samples = np.exp(samples_ht["lg_σ_pred_ht"])
plot_mean(axs[0], μ_samples)
plot_var(axs[1], σ_samples**2)
plot_total(axs[2], μ_samples, σ_samples**2)
```

That looks much better! We've accurately captured the mean behavior of our system along with an understanding of the underlying trend in the variance, with appropriate uncertainty. Crucially, the aggregate behavior of the model integrates both epistemic *and* aleatoric uncertainty, and the ~5% of our observations that fall outside the 2σ band are more or less evenly distributed across the domain. However, that took *over two hours* to sample only 4k NUTS iterations. Due to the expense of the requisite matrix inversions, GPs are notoriously inefficient for large data sets. Let's reformulate this model using a sparse approximation.

+++

### Sparse Heteroskedastic GP

+++

Sparse approximations to GPs use a small set of *inducing points* to condition the model, vastly improving the speed of inference and somewhat reducing memory consumption. PyMC3 doesn't have an implementation for sparse latent GPs ([yet](https://github.com/pymc-devs/pymc3/pull/2951)), but we can quickly put together our own using Bill Engels' [DTC latent GP example](https://gist.github.com/bwengals/a0357d75d2083657a2eac85947381a44). These inducing points can be specified in a variety of ways, such as via the popular k-means initialization (see the short sketch below) or even optimized as part of the model, but since our observations are evenly distributed we can make do with simply a subset of our unique input values.
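As an aside, the k-means initialization mentioned above is available as a PyMC3 utility; the snippet below is a sketch only (the count of 10 inducing points is an arbitrary choice for illustration, and `Xu_kmeans` is a hypothetical name).

```python
# Sketch: choose inducing point locations by running k-means on the observed inputs,
# instead of simply downsampling X as done in the cells that follow.
Xu_kmeans = pm.gp.util.kmeans_inducing_points(10, X_obs)
```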
- -```{code-cell} ipython3 -class SparseLatent: - def __init__(self, cov_func): - self.cov = cov_func - - def prior(self, name, X, Xu): - Kuu = self.cov(Xu) - self.L = pm.gp.util.cholesky(pm.gp.util.stabilize(Kuu)) - - self.v = pm.Normal(f"u_rotated_{name}", mu=0.0, sd=1.0, shape=len(Xu)) - self.u = pm.Deterministic(f"u_{name}", tt.dot(self.L, self.v)) - - Kfu = self.cov(X, Xu) - self.Kuiu = tt.slinalg.solve_upper_triangular( - self.L.T, tt.slinalg.solve_lower_triangular(self.L, self.u) - ) - self.mu = pm.Deterministic(f"mu_{name}", tt.dot(Kfu, self.Kuiu)) - return self.mu - - def conditional(self, name, Xnew, Xu): - Ksu = self.cov(Xnew, Xu) - mus = tt.dot(Ksu, self.Kuiu) - tmp = tt.slinalg.solve_lower_triangular(self.L, Ksu.T) - Qss = tt.dot(tmp.T, tmp) # Qss = tt.dot(tt.dot(Ksu, tt.nlinalg.pinv(Kuu)), Ksu.T) - Kss = self.cov(Xnew) - Lss = pm.gp.util.cholesky(pm.gp.util.stabilize(Kss - Qss)) - mu_pred = pm.MvNormal(name, mu=mus, chol=Lss, shape=len(Xnew)) - return mu_pred -``` - -```{code-cell} ipython3 -# Explicitly specify inducing points by downsampling our input vector -Xu = X[1::2] - -with pm.Model() as model_hts: - ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ) - η = pm.Gamma("η", alpha=2, beta=1) - cov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=ℓ) - - μ_gp = SparseLatent(cov) - μ_f = μ_gp.prior("μ", X_obs, Xu) - - σ_ℓ = pm.InverseGamma("σ_ℓ", mu=ℓ_μ, sigma=ℓ_σ) - σ_η = pm.Gamma("σ_η", alpha=2, beta=1) - σ_cov = σ_η**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=σ_ℓ) - - lg_σ_gp = SparseLatent(σ_cov) - lg_σ_f = lg_σ_gp.prior("lg_σ_f", X_obs, Xu) - σ_f = pm.Deterministic("σ_f", pm.math.exp(lg_σ_f)) - - lik_hts = pm.Normal("lik_hts", mu=μ_f, sd=σ_f, observed=y_obs_) - trace_hts = pm.sample(target_accept=0.95, return_inferencedata=True, random_seed=SEED) - -with model_hts: - μ_pred = μ_gp.conditional("μ_pred", Xnew, Xu) - lg_σ_pred = lg_σ_gp.conditional("lg_σ_pred", Xnew, Xu) - samples_hts = pm.sample_posterior_predictive(trace_hts, var_names=["μ_pred", "lg_σ_pred"]) -``` - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 3, figsize=(18, 4)) -μ_samples = samples_hts["μ_pred"] -σ_samples = np.exp(samples_hts["lg_σ_pred"]) -plot_mean(axs[0], μ_samples) -plot_inducing_points(axs[0]) -plot_var(axs[1], σ_samples**2) -plot_inducing_points(axs[1]) -plot_total(axs[2], μ_samples, σ_samples**2) -plot_inducing_points(axs[2]) -``` - -That was ~8x faster with nearly indistinguishable results, and fewer divergences as well. - -+++ - -## Heteroskedastic GP with correlated noise and mean response: Linear Model of Coregionalization - -+++ - -So far, we've modeled the mean and noise of our system as independent. However, there may be scenarios where we expect them to be correlated, for example if higher measurement values are expected to have greater noise. Here, we'll explicitly model this correlation through a covariance function that is a Kronecker product of the spatial kernel we've used previously and a `Coregion` kernel, as suggested by Bill Engels [here](https://discourse.pymc.io/t/coregionalization-model-for-two-separable-multidimensional-gaussian-process/2550/4). This is an implementation of the Linear Model of Coregionalization, which treats each correlated GP as a linear combination of a small number of independent basis functions, which are themselves GPs. We first add a categorical dimension to the domain of our observations to indicate whether the mean or variance is being considered, then unpack the respective components before feeding them into a `Normal` likelihood as above. 
- -```{code-cell} ipython3 -def add_coreg_idx(x): - return np.hstack([np.tile(x, (2, 1)), np.vstack([np.zeros(x.shape), np.ones(x.shape)])]) - - -Xu_c, X_obs_c, Xnew_c = [add_coreg_idx(x) for x in [Xu, X_obs, Xnew]] - -with pm.Model() as model_htsc: - ℓ = pm.InverseGamma("ℓ", mu=ℓ_μ, sigma=ℓ_σ) - η = pm.Gamma("η", alpha=2, beta=1) - EQcov = η**2 * pm.gp.cov.ExpQuad(input_dim=1, active_dims=[0], ls=ℓ) - - D_out = 2 # two output dimensions, mean and variance - rank = 2 # two basis GPs - W = pm.Normal("W", mu=0, sd=3, shape=(D_out, rank), testval=np.full([D_out, rank], 0.1)) - kappa = pm.Gamma("kappa", alpha=1.5, beta=1, shape=D_out) - coreg = pm.gp.cov.Coregion(input_dim=1, active_dims=[0], kappa=kappa, W=W) - - cov = pm.gp.cov.Kron([EQcov, coreg]) - - gp_LMC = SparseLatent(cov) - LMC_f = gp_LMC.prior("LMC", X_obs_c, Xu_c) - - μ_f = LMC_f[: len(y_obs_)] - lg_σ_f = LMC_f[len(y_obs_) :] - σ_f = pm.Deterministic("σ_f", pm.math.exp(lg_σ_f)) - - lik_htsc = pm.Normal("lik_htsc", mu=μ_f, sd=σ_f, observed=y_obs_) - trace_htsc = pm.sample(target_accept=0.95, return_inferencedata=True, random_seed=SEED) - -with model_htsc: - c_mu_pred = gp_LMC.conditional("c_mu_pred", Xnew_c, Xu_c) - samples_htsc = pm.sample_posterior_predictive(trace_htsc, var_names=["c_mu_pred"]) -``` - -```{code-cell} ipython3 -μ_samples = samples_htsc["c_mu_pred"][:, : len(Xnew)] -σ_samples = np.exp(samples_htsc["c_mu_pred"][:, len(Xnew) :]) - -_, axs = plt.subplots(1, 3, figsize=(18, 4)) -plot_mean(axs[0], μ_samples) -plot_inducing_points(axs[0]) -plot_var(axs[1], σ_samples**2) -axs[1].set_ylim(-0.01, 0.2) -axs[1].legend(loc="upper left") -plot_inducing_points(axs[1]) -plot_total(axs[2], μ_samples, σ_samples**2) -plot_inducing_points(axs[2]) -``` - -We can look at the learned correlation between the mean and variance by inspecting the covariance matrix $\bf{B}$ constructed via $\mathbf{B} \equiv \mathbf{WW}^T+diag(\kappa)$: - -```{code-cell} ipython3 -with model_htsc: - B_samples = pm.sample_posterior_predictive(trace_htsc, var_names=["W", "kappa"]) -``` - -```{code-cell} ipython3 -# Keep in mind that the first dimension in all arrays is the sampling dimension -W = B_samples["W"] -W_T = np.swapaxes(W, 1, 2) -WW_T = np.matmul(W, W_T) - -kappa = B_samples["kappa"] -I = np.tile(np.identity(2), [kappa.shape[0], 1, 1]) -# einsum is just a concise way of doing multiplication and summation over arbitrary axes -diag_kappa = np.einsum("ij,ijk->ijk", kappa, I) - -B = WW_T + diag_kappa -B.mean(axis=0) -``` - -```{code-cell} ipython3 -sd = np.sqrt(np.diagonal(B, axis1=1, axis2=2)) -outer_sd = np.einsum("ij,ik->ijk", sd, sd) -correlation = B / outer_sd -print(f"2.5%ile correlation: {np.percentile(correlation,2.5,axis=0)[0,1]:0.3f}") -print(f"Median correlation: {np.percentile(correlation,50,axis=0)[0,1]:0.3f}") -print(f"97.5%ile correlation: {np.percentile(correlation,97.5,axis=0)[0,1]:0.3f}") -``` - -The model has inadvertently learned that the mean and noise are slightly negatively correlated, albeit with a wide credible interval. - -+++ - -## Comparison - -+++ - -The three latent approaches shown here varied in their complexity and efficiency, but ultimately produced very similar regression surfaces, as shown below. All three displayed a nuanced understanding of both aleatoric and epistemic uncertainties. It's worth noting that we had to increase `target_accept` from the default 0.8 to 0.95 to avoid an excessive number of divergences, but this has the downside of slowing down NUTS evaluations. 
Sampling times could be decreased by reducing `target_accept`, at the expense of potentially biased inference due to divergences, or by further reducing the number of inducing points used in the sparse approximations. Inspecting the convergence statistics for each method, all had low r_hat values of 1.01 or below but the LMC model showed low effective sample sizes for some parameters, in particular the `ess_tail` for the η and ℓ parameters. To have confidence in the 95% CI bounds for this model, we should run the sampling for more iterations, ideally at least until the smallest `ess_tail` is above 200 but the higher the better. - -+++ - -### Regression surfaces - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 3, figsize=(18, 4)) - -μ_samples = samples_ht["μ_pred_ht"] -σ_samples = np.exp(samples_ht["lg_σ_pred_ht"]) -plot_total(axs[0], μ_samples, σ_samples**2) -axs[0].set_title("Latent") - -μ_samples = samples_hts["μ_pred"] -σ_samples = np.exp(samples_hts["lg_σ_pred"]) -plot_total(axs[1], μ_samples, σ_samples**2) -axs[1].set_title("Sparse Latent") - -μ_samples = samples_htsc["c_mu_pred"][:, : len(Xnew)] -σ_samples = np.exp(samples_htsc["c_mu_pred"][:, len(Xnew) :]) -plot_total(axs[2], μ_samples, σ_samples**2) -axs[2].set_title("Correlated Sparse Latent") - -yls = [ax.get_ylim() for ax in axs] -yl = [np.min([l[0] for l in yls]), np.max([l[1] for l in yls])] -for ax in axs: - ax.set_ylim(yl) - -plot_inducing_points(axs[1]) -plot_inducing_points(axs[2]) - -axs[0].legend().remove() -axs[1].legend().remove() -``` - -### Latent model convergence - -```{code-cell} ipython3 -display(az.summary(trace_ht).sort_values("ess_bulk").iloc[:5]) -``` - -### Sparse Latent model convergence - -```{code-cell} ipython3 -display(az.summary(trace_hts).sort_values("ess_bulk").iloc[:5]) -``` - -### Correlated Sparse Latent model convergence - -```{code-cell} ipython3 -display(az.summary(trace_htsc).sort_values("ess_bulk").iloc[:5]) -``` - -* This notebook was written by John Goertz on 5 May, 2021. - -```{code-cell} ipython3 -%watermark -n -u -v -iv -w -p xarray -``` - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/gaussian_processes/GP-Kron.myst.md b/myst_nbs/gaussian_processes/GP-Kron.myst.md deleted file mode 100644 index 3b7a30305..000000000 --- a/myst_nbs/gaussian_processes/GP-Kron.myst.md +++ /dev/null @@ -1,306 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GP-Kron)= -# Kronecker Structured Covariances - -:::{post} October, 2022 -:tags: gaussian process -:category: intermediate -:author: Bill Engels, Raul-ing Average, Christopher Krapu, Danh Phan -::: - -+++ - -PyMC contains implementations for models that have Kronecker structured covariances. This patterned structure enables Gaussian process models to work on much larger datasets. 
Kronecker structure can be exploited when
- The dimension of the input data is two or greater ($\mathbf{x} \in \mathbb{R}^{d}\,, d \ge 2$)
- The influence of the process across each dimension or set of dimensions is *separable*
- The kernel can be written as a product over dimensions, without cross terms:

$$k(\mathbf{x}, \mathbf{x'}) = \prod_{i = 1}^{d} k(\mathbf{x}_{i}, \mathbf{x'}_i) \,.$$

The covariance matrix that corresponds to the covariance function above can be written as a *Kronecker product*

$$
\mathbf{K} = \mathbf{K}_1 \otimes \mathbf{K}_2 \otimes \cdots \otimes \mathbf{K}_d \,.
$$

These implementations exploit the following property of Kronecker products to speed up calculations, $(\mathbf{K}_1 \otimes \mathbf{K}_2)^{-1} = \mathbf{K}_{1}^{-1} \otimes \mathbf{K}_{2}^{-1}$: the inverse of the Kronecker product is the Kronecker product of the inverses. If $\mathbf{K}_1$ is $n \times n$ and $\mathbf{K}_2$ is $m \times m$, then $\mathbf{K}_1 \otimes \mathbf{K}_2$ is $mn \times mn$. For $m$ and $n$ of even modest size, computing this inverse directly becomes prohibitively expensive. Inverting two matrices, one $n \times n$ and another $m \times m$, is much easier.

This structure is common in spatiotemporal data. Given that there is Kronecker structure in the covariance matrix, this implementation is exact -- not an approximation to the full Gaussian process. PyMC contains two implementations that follow the same pattern as {class}`gp.Marginal` and {class}`gp.Latent`. For Kronecker structured covariances where the data likelihood is Gaussian, use {class}`gp.MarginalKron`. For Kronecker structured covariances where the data likelihood is non-Gaussian, use {class}`gp.LatentKron`.

Our implementations follow [Saatchi's Thesis](http://mlg.eng.cam.ac.uk/pub/authors/#Saatci). `gp.MarginalKron` follows "Algorithm 16" using the eigendecomposition, and `gp.LatentKron` follows "Algorithm 14" using the Cholesky decomposition.

+++

## Using `MarginalKron` for a 2D spatial problem

The following is a canonical example of the usage of `gp.MarginalKron`. Like `gp.Marginal`, this model assumes that the underlying GP is unobserved, but the sum of the GP and normally distributed noise is observed.

For the simulated data set, we draw one sample from a Gaussian process with inputs in two dimensions whose covariance is Kronecker structured. Then we use `gp.MarginalKron` to recover the unknown Gaussian process hyperparameters $\theta$ that were used to simulate the data.

+++

### Example

We'll simulate a two dimensional data set and display it as a scatter plot whose points are colored by magnitude. The two dimensions are labeled `x1` and `x2`. This could be a spatial dataset, for instance. The covariance will have a Kronecker structure since the points lie on a two dimensional grid.
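Before setting up the simulation, here is a small NumPy aside (an illustration, not part of the original example) that checks the inverse identity quoted above on a pair of random positive-definite matrices:

```python
import numpy as np

rng_check = np.random.default_rng(0)
A1 = rng_check.normal(size=(4, 4))
A2 = rng_check.normal(size=(3, 3))
K1 = A1 @ A1.T + 4 * np.eye(4)  # small positive-definite factor, 4 x 4
K2 = A2 @ A2.T + 3 * np.eye(3)  # small positive-definite factor, 3 x 3

# invert the full 12 x 12 Kronecker product directly ...
lhs = np.linalg.inv(np.kron(K1, K2))
# ... versus inverting only the small factors and taking their Kronecker product
rhs = np.kron(np.linalg.inv(K1), np.linalg.inv(K2))

print(np.allclose(lhs, rhs))  # True
```

Only the small factors ever need to be inverted, which is what makes the Kronecker implementations scale to large grids.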
- -```{code-cell} ipython3 -import arviz as az -import matplotlib as mpl -import numpy as np -import pymc as pm - -plt = mpl.pyplot -%matplotlib inline -%config InlineBackend.figure_format = 'retina' -``` - -```{code-cell} ipython3 -RANDOM_SEED = 12345 -rng = np.random.default_rng(RANDOM_SEED) - -# One dimensional column vectors of inputs -n1, n2 = (50, 30) -x1 = np.linspace(0, 5, n1) -x2 = np.linspace(0, 3, n2) - -# make cartesian grid out of each dimension x1 and x2 -X = pm.math.cartesian(x1[:, None], x2[:, None]) - -l1_true = 0.8 -l2_true = 1.0 -eta_true = 1.0 - -# Although we could, we don't exploit kronecker structure to draw the sample -cov = ( - eta_true**2 - * pm.gp.cov.Matern52(2, l1_true, active_dims=[0]) - * pm.gp.cov.Cosine(2, ls=l2_true, active_dims=[1]) -) - -K = cov(X).eval() -f_true = rng.multivariate_normal(np.zeros(X.shape[0]), K, 1).flatten() - -sigma_true = 0.25 -y = f_true + sigma_true * rng.standard_normal(X.shape[0]) -``` - -The lengthscale along the `x2` dimension is longer than the lengthscale along the `x1` direction (`l1_true` < `l2_true`). - -```{code-cell} ipython3 -fig = plt.figure(figsize=(12, 6)) -cmap = "terrain" -norm = mpl.colors.Normalize(vmin=-3, vmax=3) -plt.scatter(X[:, 0], X[:, 1], s=35, c=y, marker="o", norm=norm, cmap=cmap) -plt.colorbar() -plt.xlabel("x1"), plt.ylabel("x2") -plt.title("Simulated dataset"); -``` - -There are 1500 data points in this data set. Without using the Kronecker factorization, finding the MAP estimate would be much slower. - -+++ - -Since the two covariances are a product, we only require one scale parameter `eta` to model the product covariance function. - -```{code-cell} ipython3 -# this implementation takes a list of inputs for each dimension as input -Xs = [x1[:, None], x2[:, None]] - -with pm.Model() as model: - # Set priors on the hyperparameters of the covariance - ls1 = pm.Gamma("ls1", alpha=2, beta=2) - ls2 = pm.Gamma("ls2", alpha=2, beta=2) - eta = pm.HalfNormal("eta", sigma=2) - - # Specify the covariance functions for each Xi - # Since the covariance is a product, only scale one of them by eta. - # Scaling both overparameterizes the covariance function. - cov_x1 = pm.gp.cov.Matern52(1, ls=ls1) # cov_x1 must accept X1 without error - cov_x2 = eta**2 * pm.gp.cov.Cosine(1, ls=ls2) # cov_x2 must accept X2 without error - - # Specify the GP. The default mean function is `Zero`. - gp = pm.gp.MarginalKron(cov_funcs=[cov_x1, cov_x2]) - - # Set the prior on the variance for the Gaussian noise - sigma = pm.HalfNormal("sigma", sigma=2) - - # Place a GP prior over the function f. - y_ = gp.marginal_likelihood("y", Xs=Xs, y=y, sigma=sigma) -``` - -```{code-cell} ipython3 -with model: - mp = pm.find_MAP(method="BFGS") -``` - -```{code-cell} ipython3 -mp -``` - -Next we use the map point `mp` to extrapolate in a region outside the original grid. We can also interpolate. There is no grid restriction on the new inputs where predictions are desired. It's important to note that under the current implementation, having a grid structure in these points doesn't produce any efficiency gains. The plot with the extrapolations is shown below. The original data is marked with circles as before, but the extrapolated posterior mean is marked with squares. 
- -```{code-cell} ipython3 -x1new = np.linspace(5.1, 7.1, 20) -x2new = np.linspace(-0.5, 3.5, 40) -Xnew = pm.math.cartesian(x1new[:, None], x2new[:, None]) - -with model: - mu, var = gp.predict(Xnew, point=mp, diag=True) -``` - -```{code-cell} ipython3 -fig = plt.figure(figsize=(12, 6)) -cmap = "terrain" -norm = mpl.colors.Normalize(vmin=-3, vmax=3) -m = plt.scatter(X[:, 0], X[:, 1], s=30, c=y, marker="o", norm=norm, cmap=cmap) -plt.colorbar(m) -plt.scatter(Xnew[:, 0], Xnew[:, 1], s=30, c=mu, marker="s", norm=norm, cmap=cmap) -plt.ylabel("x2"), plt.xlabel("x1") -plt.title("observed data 'y' (circles) with predicted mean (squares)"); -``` - -## `LatentKron` - -Like the `gp.Latent` implementation, the `gp.LatentKron` implementation specifies a Kronecker structured GP regardless of context. **It can be used with any likelihood function, or can be used to model a variance or some other unobserved processes**. The syntax follows that of `gp.Latent` exactly. - -### Example 1 - -To compare with `MarginalLikelihood`, we use same example as before where the noise is normal, but the GP itself is not marginalized out. Instead, it is sampled directly using NUTS. It is very important to note that `gp.LatentKron` does not require a Gaussian likelihood like `gp.MarginalKron`; rather, any likelihood is admissible. - -```{code-cell} ipython3 -with pm.Model() as model: - # Set priors on the hyperparameters of the covariance - ls1 = pm.Gamma("ls1", alpha=2, beta=2) - ls2 = pm.Gamma("ls2", alpha=2, beta=2) - eta = pm.HalfNormal("eta", sigma=2) - - # Specify the covariance functions for each Xi - cov_x1 = pm.gp.cov.Matern52(1, ls=ls1) - cov_x2 = eta**2 * pm.gp.cov.Cosine(1, ls=ls2) - - # Set the prior on the variance for the Gaussian noise - sigma = pm.HalfNormal("sigma", sigma=2) - - # Specify the GP. The default mean function is `Zero`. - gp = pm.gp.LatentKron(cov_funcs=[cov_x1, cov_x2]) - - # Place a GP prior over the function f. - f = gp.prior("f", Xs=Xs) - - y_ = pm.Normal("y_", mu=f, sigma=sigma, observed=y) -``` - -```{code-cell} ipython3 -with model: - tr = pm.sample(500, chains=1, return_inferencedata=True, target_accept=0.90) -``` - -The posterior distribution of the unknown lengthscale parameters, covariance scaling `eta`, and white noise `sigma` are shown below. The vertical lines are the true values that were used to generate the original data set. - -```{code-cell} ipython3 -az.plot_trace( - tr, - var_names=["ls1", "ls2", "eta", "sigma"], - lines={"ls1": l1_true, "ls2": l2_true, "eta": eta_true, "sigma": sigma_true}, -) -plt.tight_layout() -``` - -```{code-cell} ipython3 -x1new = np.linspace(5.1, 7.1, 20) -x2new = np.linspace(-0.5, 3.5, 40) -Xnew = pm.math.cartesian(x1new[:, None], x2new[:, None]) - -with model: - fnew = gp.conditional("fnew3", Xnew, jitter=1e-6) - -with model: - ppc = pm.sample_posterior_predictive(tr, var_names=["fnew3"]) -``` - -```{code-cell} ipython3 -x1new = np.linspace(5.1, 7.1, 20)[:, None] -x2new = np.linspace(-0.5, 3.5, 40)[:, None] -Xnew = pm.math.cartesian(x1new, x2new) -x1new.shape, x2new.shape, Xnew.shape -``` - -```{code-cell} ipython3 -with model: - fnew = gp.conditional("fnew", Xnew, jitter=1e-6) -``` - -```{code-cell} ipython3 -with model: - ppc = pm.sample_posterior_predictive(tr, var_names=["fnew"]) -``` - -Below we show the original data set as colored circles, and the mean of the conditional samples as colored squares. The results closely follow those given by the `gp.MarginalKron` implementation. 
- -```{code-cell} ipython3 -fig = plt.figure(figsize=(14, 7)) -m = plt.scatter(X[:, 0], X[:, 1], s=30, c=y, marker="o", norm=norm, cmap=cmap) -plt.colorbar(m) -plt.scatter( - Xnew[:, 0], - Xnew[:, 1], - s=30, - c=np.mean(ppc.posterior_predictive["fnew"].sel(chain=0), axis=0), - marker="s", - norm=norm, - cmap=cmap, -) -plt.ylabel("x2"), plt.xlabel("x1") -plt.title("observed data 'y' (circles) with mean of conditional, or predicted, samples (squares)"); -``` - -Next we plot the original data set indicated with circles markers, along with four samples from the conditional distribution over `fnew` indicated with square markers. As we can see, the level of variation in the predictive distribution leads to distinctly different patterns in the values of `fnew`. However, these samples display the correct correlation structure - we see distinct sinusoidal patterns in the y-axis and proximal correlation structure in the x-axis. The patterns displayed in the observed data seamlessly blend into the conditional distribution. - -```{code-cell} ipython3 -fig, axs = plt.subplots(2, 2, figsize=(24, 16)) -axs = axs.ravel() - -for i, ax in enumerate(axs): - ax.axis("off") - ax.scatter(X[:, 0], X[:, 1], s=20, c=y, marker="o", norm=norm, cmap=cmap) - ax.scatter( - Xnew[:, 0], - Xnew[:, 1], - s=20, - c=ppc.posterior_predictive["fnew"].sel(chain=0)[i], - marker="s", - norm=norm, - cmap=cmap, - ) - ax.set_title(f"Sample {i+1}", fontsize=24) -``` - -## Authors -* Authored by [Bill Engels](https://github.com/bwengals), 2018 -* Updated by [Raul-ing Average](https://github.com/CloudChaoszero), March 2021 -* Updated by [Christopher Krapu](https://github.com/ckrapu), July 2021 -* Updated to PyMC 4.x by [Danh Phan](https://github.com/danhphan), November 2022 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/GP-Latent.myst.md b/myst_nbs/gaussian_processes/GP-Latent.myst.md deleted file mode 100644 index 278e0b44f..000000000 --- a/myst_nbs/gaussian_processes/GP-Latent.myst.md +++ /dev/null @@ -1,444 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev - language: python - name: pymc-dev ---- - -(gp_latent)= -# Gaussian Processes: Latent Variable Implementation - -:::{post} Sept 28, 2022 -:tags: gaussian processes, time series -:category: reference, intermediate -:author: Bill Engels -::: - -+++ - -The {class}`gp.Latent ` class is a direct implementation of a Gaussian process without approximation. Given a mean and covariance function, we can place a prior on the function $f(x)$, - -$$ -f(x) \sim \mathcal{GP}(m(x),\, k(x, x')) \,. -$$ - -It is called "Latent" because the GP itself is included in the model as a latent variable, it is not marginalized out as is the case with {class}`gp.Marginal `. Unlike `gp.Latent`, you won't find samples from the GP posterior in the trace with `gp.Marginal`. This is the most direct implementation of a GP because it doesn't assume a particular likelihood function or structure in the data or in the covariance matrix. 
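As a preview of the API, here is a minimal sketch (not part of the original notebook; the toy inputs, observations, and priors are placeholders) of how a latent GP prior is typically declared. The model name `latent_gp_model` is the one assumed by the `.conditional` snippet further down.

```python
import numpy as np
import pymc as pm

# toy inputs and observations, for illustration only
X = np.linspace(0, 2, 50)[:, None]  # inputs must be arranged as a column vector
y = np.random.default_rng(0).normal(size=50)

with pm.Model() as latent_gp_model:
    ell = pm.Gamma("ell", alpha=2, beta=1)
    cov_func = pm.gp.cov.ExpQuad(1, ls=ell)

    gp = pm.gp.Latent(cov_func=cov_func)
    f = gp.prior("f", X=X)  # the latent GP function values enter the model directly

    # any likelihood can be attached to f; a Normal is used here only as an example
    pm.Normal("y_obs", mu=f, sigma=0.5, observed=y)
```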
- -+++ - -## The `.prior` method - -The `prior` method adds a multivariate normal prior distribution to the PyMC model over the vector of GP function values, $\mathbf{f}$, - -$$ -\mathbf{f} \sim \text{MvNormal}(\mathbf{m}_{x},\, \mathbf{K}_{xx}) \,, -$$ - -where the vector $\mathbf{m}_x$ and the matrix $\mathbf{K}_{xx}$ are the mean vector and covariance matrix evaluated over the inputs $x$. By default, PyMC reparameterizes the prior on `f` under the hood by rotating it with the Cholesky factor of its covariance matrix. This improves sampling by reducing covariances in the posterior of the transformed random variable, `v`. The reparameterized model is, - -$$ -\begin{aligned} - \mathbf{v} \sim \text{N}(0, 1)& \\ - \mathbf{L} = \text{Cholesky}(\mathbf{K}_{xx})& \\ - \mathbf{f} = \mathbf{m}_{x} + \mathbf{Lv} \\ -\end{aligned} -$$ - -For more information on this reparameterization, see the section on [drawing values from a multivariate distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Drawing_values_from_the_distribution). - -+++ - -## The `.conditional` method - -The conditional method implements the predictive distribution for function values that were not part of the original data set. This distribution is, - -$$ -\mathbf{f}_* \mid \mathbf{f} \sim \text{MvNormal} \left( - \mathbf{m}_* + \mathbf{K}_{*x}\mathbf{K}_{xx}^{-1} \mathbf{f} ,\, - \mathbf{K}_{**} - \mathbf{K}_{*x}\mathbf{K}_{xx}^{-1}\mathbf{K}_{x*} \right) -$$ - -Using the same `gp` object we defined above, we can construct a random variable with this -distribution by, - -```python -# vector of new X points we want to predict the function at -X_star = np.linspace(0, 2, 100)[:, None] - -with latent_gp_model: - f_star = gp.conditional("f_star", X_star) -``` - -+++ - -## Example 1: Regression with Student-T distributed noise - -The following is an example showing how to specify a simple model with a GP prior using the {class}`gp.Latent` class. We use a GP to generate the data so we can verify that the inference we perform is correct. Note that the likelihood is not normal, but IID Student-T. For a more efficient implementation when the likelihood is Gaussian, use {class}`gp.Marginal`. 
- -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' - -RANDOM_SEED = 8998 -rng = np.random.default_rng(RANDOM_SEED) - -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -n = 50 # The number of data points -X = np.linspace(0, 10, n)[:, None] # The inputs to the GP must be arranged as a column vector - -# Define the true covariance function and its parameters -ell_true = 1.0 -eta_true = 4.0 -cov_func = eta_true**2 * pm.gp.cov.ExpQuad(1, ell_true) - -# A mean function that is zero everywhere -mean_func = pm.gp.mean.Zero() - -# The latent function values are one sample from a multivariate normal -# Note that we have to call `eval()` because PyMC built on top of Theano -f_true = pm.draw(pm.MvNormal.dist(mu=mean_func(X), cov=cov_func(X)), 1, random_seed=rng) - -# The observed data is the latent function plus a small amount of T distributed noise -# The standard deviation of the noise is `sigma`, and the degrees of freedom is `nu` -sigma_true = 1.0 -nu_true = 5.0 -y = f_true + sigma_true * rng.normal(size=n) - -## Plot the data and the unobserved latent function -fig = plt.figure(figsize=(10, 4)) -ax = fig.gca() -ax.plot(X, f_true, "dodgerblue", lw=3, label="True generating function 'f'") -ax.plot(X, y, "ok", ms=3, label="Observed data") -ax.set_xlabel("X") -ax.set_ylabel("y") -plt.legend(frameon=True); -``` - -The data above shows the observations, marked with black dots, of the unknown function $f(x)$ that has been corrupted by noise. The true function is in blue. - -### Coding the model in PyMC - -Here's the model in PyMC. We use an informative {class}`pm.Gamma(alpha=2, beta=1)` prior over the lengthscale parameter, and weakly informative {class}`pm.HalfNormal(sigma=5)` priors over the covariance function scale, and noise scale. A `pm.Gamma(2, 0.5)` prior is assigned to the degrees of freedom parameter of the noise. Finally, a GP prior is placed on the unknown function. For more information on choosing priors in Gaussian process models, check out some of [recommendations by the Stan folks](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#priors-for-gaussian-processes). - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -with pm.Model() as model: - ell = pm.Gamma("ell", alpha=2, beta=1) - eta = pm.HalfNormal("eta", sigma=5) - - cov = eta**2 * pm.gp.cov.ExpQuad(1, ell) - gp = pm.gp.Latent(cov_func=cov) - - f = gp.prior("f", X=X) - - sigma = pm.HalfNormal("sigma", sigma=2.0) - nu = 1 + pm.Gamma( - "nu", alpha=2, beta=0.1 - ) # add one because student t is undefined for degrees of freedom less than one - y_ = pm.StudentT("y", mu=f, lam=1.0 / sigma, nu=nu, observed=y) - - idata = pm.sample(1000, tune=1000, chains=2, cores=1) -``` - -```{code-cell} ipython3 -# check Rhat, values above 1 may indicate convergence issues -n_nonconverged = int( - np.sum(az.rhat(idata)[["eta", "ell", "sigma", "f_rotated_"]].to_array() > 1.03).values -) -if n_nonconverged == 0: - print("No Rhat values above 1.03, \N{check mark}") -else: - print(f"The MCMC chains for {n_nonconverged} RVs appear not to have converged.") -``` - -### Results - -The joint posterior of the two covariance function hyperparameters is plotted below in the left panel. 
In the right panel is the joint posterior of the standard deviation of the noise, and the degrees of freedom parameter of the likelihood. The light blue lines show the true values that were used to draw the function from the GP. - -```{code-cell} ipython3 -fig, axs = plt.subplots(1, 2, figsize=(10, 4)) -axs = axs.flatten() - -# plot eta vs ell -az.plot_pair( - idata, - var_names=["eta", "ell"], - kind=["hexbin"], - ax=axs[0], - gridsize=25, - divergences=True, -) -axs[0].axvline(x=eta_true, color="dodgerblue") -axs[0].axhline(y=ell_true, color="dodgerblue") - -# plot nu vs sigma -az.plot_pair( - idata, - var_names=["nu", "sigma"], - kind=["hexbin"], - ax=axs[1], - gridsize=25, - divergences=True, -) - -axs[1].axvline(x=nu_true, color="dodgerblue") -axs[1].axhline(y=sigma_true, color="dodgerblue"); -``` - -```{code-cell} ipython3 -f_post = az.extract(idata, var_names="f").transpose("sample", ...) -f_post -``` - -Below is the posterior of the GP, - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -# plot the results -fig = plt.figure(figsize=(10, 4)) -ax = fig.gca() - -# plot the samples from the gp posterior with samples and shading -from pymc.gp.util import plot_gp_dist - -f_post = az.extract(idata, var_names="f").transpose("sample", ...) -plot_gp_dist(ax, f_post, X) - -# plot the data and the true latent function -ax.plot(X, f_true, "dodgerblue", lw=3, label="True generating function 'f'") -ax.plot(X, y, "ok", ms=3, label="Observed data") - -# axis labels and title -plt.xlabel("X") -plt.ylabel("True f(x)") -plt.title("Posterior distribution over $f(x)$ at the observed values") -plt.legend(); -``` - -As you can see by the red shading, the posterior of the GP prior over the function does a great job of representing both the fit, and the uncertainty caused by the additive noise. The result also doesn't over fit due to outliers from the Student-T noise model. - -### Prediction using `.conditional` - -Next, we extend the model by adding the conditional distribution so we can predict at new $x$ locations. Lets see how the extrapolation looks out to higher $x$. To do this, we extend our `model` with the `conditional` distribution of the GP. Then, we can sample from it using the `trace` and the `sample_posterior_predictive` function. This is similar to how Stan uses its `generated quantities {...}` block. We could have included `gp.conditional` in the model *before* we did the NUTS sampling, but it is more efficient to separate these steps. - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -n_new = 200 -X_new = np.linspace(-4, 14, n_new)[:, None] - -# add the GP conditional to the model, given the new X values -with model: - f_pred = gp.conditional("f_pred", X_new, jitter=1e-4) - -# Sample from the GP conditional distribution -with model: - ppc = pm.sample_posterior_predictive(idata.posterior, var_names=["f_pred"]) - idata.extend(ppc) -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -fig = plt.figure(figsize=(10, 4)) -ax = fig.gca() - -f_pred = az.extract(idata.posterior_predictive, var_names="f_pred").transpose("sample", ...) 
-plot_gp_dist(ax, f_pred, X_new) - -ax.plot(X, f_true, "dodgerblue", lw=3, label="True generating function 'f'") -ax.plot(X, y, "ok", ms=3, label="Observed data") - -ax.set_xlabel("X") -ax.set_ylabel("True f(x)") -ax.set_title("Conditional distribution of f_*, given f") -plt.legend(); -``` - -## Example 2: Classification - -First we use a GP to generate some data that follows a Bernoulli distribution, where $p$, the probability of a one instead of a zero is a function of $x$. I reset the seed and added more fake data points, because it can be difficult for the model to discern variations around 0.5 with few observations. - -```{code-cell} ipython3 -# reset the random seed for the new example -RANDOM_SEED = 8888 -rng = np.random.default_rng(RANDOM_SEED) - -# number of data points -n = 300 - -# x locations -x = np.linspace(0, 10, n) - -# true covariance -ell_true = 0.5 -eta_true = 1.0 -cov_func = eta_true**2 * pm.gp.cov.ExpQuad(1, ell_true) -K = cov_func(x[:, None]).eval() - -# zero mean function -mean = np.zeros(n) - -# sample from the gp prior -f_true = pm.draw(pm.MvNormal.dist(mu=mean, cov=K), 1, random_seed=rng) - -# Sample the GP through the likelihood -y = pm.Bernoulli.dist(p=pm.math.invlogit(f_true)).eval() -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -fig = plt.figure(figsize=(10, 4)) -ax = fig.gca() - -ax.plot(x, pm.math.invlogit(f_true).eval(), "dodgerblue", lw=3, label="True rate") -# add some noise to y to make the points in the plot more visible -ax.plot(x, y + np.random.randn(n) * 0.01, "kx", ms=6, label="Observed data") - -ax.set_xlabel("X") -ax.set_ylabel("y") -ax.set_xlim([0, 11]) -plt.legend(loc=(0.35, 0.65), frameon=True); -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -with pm.Model() as model: - ell = pm.InverseGamma("ell", mu=1.0, sigma=0.5) - eta = pm.Exponential("eta", lam=1.0) - cov = eta**2 * pm.gp.cov.ExpQuad(1, ell) - - gp = pm.gp.Latent(cov_func=cov) - f = gp.prior("f", X=x[:, None]) - - # logit link and Bernoulli likelihood - p = pm.Deterministic("p", pm.math.invlogit(f)) - y_ = pm.Bernoulli("y", p=p, observed=y) - - idata = pm.sample(1000, chains=2, cores=1) -``` - -```{code-cell} ipython3 -# check Rhat, values above 1 may indicate convergence issues -n_nonconverged = int(np.sum(az.rhat(idata)[["eta", "ell", "f_rotated_"]].to_array() > 1.03).values) -if n_nonconverged == 0: - print("No Rhat values above 1.03, \N{check mark}") -else: - print(f"The MCMC chains for {n_nonconverged} RVs appear not to have converged.") -``` - -```{code-cell} ipython3 -ax = az.plot_pair( - idata, - var_names=["eta", "ell"], - kind=["kde", "scatter"], - scatter_kwargs={"color": "darkslategray", "alpha": 0.4}, - gridsize=25, - divergences=True, -) - -ax.axvline(x=eta_true, color="dodgerblue") -ax.axhline(y=ell_true, color="dodgerblue"); -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -n_pred = 200 -X_new = np.linspace(0, 12, n_pred)[:, None] - -with model: - f_pred = gp.conditional("f_pred", X_new, jitter=1e-4) - p_pred = pm.Deterministic("p_pred", pm.math.invlogit(f_pred)) - -with model: - ppc = pm.sample_posterior_predictive(idata.posterior, var_names=["f_pred", "p_pred"]) - idata.extend(ppc) -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: false ---- -# plot the results -fig = plt.figure(figsize=(10, 4)) -ax = fig.gca() - -# plot the samples from the gp posterior with samples and shading -p_pred = az.extract(idata.posterior_predictive, var_names="p_pred").transpose("sample", ...) 
-plot_gp_dist(ax, p_pred, X_new) - -# plot the data (with some jitter) and the true latent function -plt.plot(x, pm.math.invlogit(f_true).eval(), "dodgerblue", lw=3, label="True f") -plt.plot( - x, - y + np.random.randn(y.shape[0]) * 0.01, - "kx", - ms=6, - alpha=0.5, - label="Observed data", -) - -# axis labels and title -plt.xlabel("X") -plt.ylabel("True f(x)") -plt.xlim([0, 12]) -plt.title("Posterior distribution over $f(x)$ at the observed values") -plt.legend(loc=(0.32, 0.65), frameon=True); -``` - -## Authors - -* Created by [Bill Engels](https://github.com/bwengals) in 2017 ([pymc#1674](https://github.com/pymc-devs/pymc/pull/1674)) -* Reexecuted by [Colin Caroll](https://github.com/ColCarroll) in 2019 ([pymc#3397](https://github.com/pymc-devs/pymc/pull/3397)) -* Updated for V4 by Bill Engels in September 2022 ([pymc-examples#237](https://github.com/pymc-devs/pymc-examples/pull/237)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/GP-Marginal.myst.md b/myst_nbs/gaussian_processes/GP-Marginal.myst.md deleted file mode 100644 index 567d39835..000000000 --- a/myst_nbs/gaussian_processes/GP-Marginal.myst.md +++ /dev/null @@ -1,302 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Marginal Likelihood Implementation - -The `gp.Marginal` class implements the more common case of GP regression: the observed data are the sum of a GP and Gaussian noise. `gp.Marginal` has a `marginal_likelihood` method, a `conditional` method, and a `predict` method. Given a mean and covariance function, the function $f(x)$ is modeled as, - -$$ -f(x) \sim \mathcal{GP}(m(x),\, k(x, x')) \,. -$$ - -The observations $y$ are the unknown function plus noise - -$$ -\begin{aligned} - \epsilon &\sim N(0, \Sigma) \\ - y &= f(x) + \epsilon \\ -\end{aligned} -$$ - -+++ - -## The `.marginal_likelihood` method - -The unknown latent function can be analytically integrated out of the product of the GP prior probability with a normal likelihood. This quantity is called the marginal likelihood. - -$$ -p(y \mid x) = \int p(y \mid f, x) \, p(f \mid x) \, df -$$ - -The log of the marginal likelihood, $p(y \mid x)$, is - -$$ -\log p(y \mid x) = - -\frac{1}{2} (\mathbf{y} - \mathbf{m}_x)^{T} - (\mathbf{K}_{xx} + \boldsymbol\Sigma)^{-1} - (\mathbf{y} - \mathbf{m}_x) - - \frac{1}{2}\log(\mathbf{K}_{xx} + \boldsymbol\Sigma) - - \frac{n}{2}\log (2 \pi) -$$ - -$\boldsymbol\Sigma$ is the covariance matrix of the Gaussian noise. Since the Gaussian noise doesn't need to be white to be conjugate, the `marginal_likelihood` method supports either using a white noise term when a scalar is provided, or a noise covariance function when a covariance function is provided. - -The `gp.marginal_likelihood` method implements the quantity given above. Some sample code would be, - -```python -import numpy as np -import pymc3 as pm - -# A one dimensional column vector of inputs. -X = np.linspace(0, 1, 10)[:,None] - -with pm.Model() as marginal_gp_model: - # Specify the covariance function. - cov_func = pm.gp.cov.ExpQuad(1, ls=0.1) - - # Specify the GP. The default mean function is `Zero`. 
- gp = pm.gp.Marginal(cov_func=cov_func) - - # The scale of the white noise term can be provided, - sigma = pm.HalfCauchy("sigma", beta=5) - y_ = gp.marginal_likelihood("y", X=X, y=y, noise=sigma) - - # OR a covariance function for the noise can be given - # noise_l = pm.Gamma("noise_l", alpha=2, beta=2) - # cov_func_noise = pm.gp.cov.Exponential(1, noise_l) + pm.gp.cov.WhiteNoise(sigma=0.1) - # y_ = gp.marginal_likelihood("y", X=X, y=y, noise=cov_func_noise) -``` - -+++ - -## The `.conditional` distribution - -The `.conditional` has an optional flag for `pred_noise`, which defaults to `False`. When `pred_noise=False`, the `conditional` method produces the predictive distribution for the underlying function represented by the GP. When `pred_noise=True`, the `conditional` method produces the predictive distribution for the GP plus noise. Using the same `gp` object defined above, - -```python -# vector of new X points we want to predict the function at -Xnew = np.linspace(0, 2, 100)[:, None] - -with marginal_gp_model: - f_star = gp.conditional("f_star", Xnew=Xnew) - - # or to predict the GP plus noise - y_star = gp.conditional("y_star", Xnew=Xnew, pred_noise=True) -``` -If using an additive GP model, the conditional distribution for individual components can be constructed by setting the optional argument `given`. For more information on building additive GPs, see the main documentation page. For an example, see the Mauna Loa CO$_2$ notebook. - -+++ - -## Making predictions - -The `.predict` method returns the conditional mean and variance of the `gp` given a `point` as NumPy arrays. The `point` can be the result of `find_MAP` or a sample from the trace. The `.predict` method can be used outside of a `Model` block. Like `.conditional`, `.predict` accepts `given` so it can produce predictions from components of additive GPs. 
- -```python -# The mean and full covariance -mu, cov = gp.predict(Xnew, point=trace[-1]) - -# The mean and variance (diagonal of the covariance) -mu, var = gp.predict(Xnew, point=trace[-1], diag=True) - -# With noise included -mu, var = gp.predict(Xnew, point=trace[-1], diag=True, pred_noise=True) -``` - -+++ - -## Example: Regression with white, Gaussian noise - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import scipy as sp - -%matplotlib inline -``` - -```{code-cell} ipython3 -# set the seed -np.random.seed(1) - -n = 100 # The number of data points -X = np.linspace(0, 10, n)[:, None] # The inputs to the GP, they must be arranged as a column vector - -# Define the true covariance function and its parameters -ℓ_true = 1.0 -η_true = 3.0 -cov_func = η_true**2 * pm.gp.cov.Matern52(1, ℓ_true) - -# A mean function that is zero everywhere -mean_func = pm.gp.mean.Zero() - -# The latent function values are one sample from a multivariate normal -# Note that we have to call `eval()` because PyMC3 built on top of Theano -f_true = np.random.multivariate_normal( - mean_func(X).eval(), cov_func(X).eval() + 1e-8 * np.eye(n), 1 -).flatten() - -# The observed data is the latent function plus a small amount of IID Gaussian noise -# The standard deviation of the noise is `sigma` -σ_true = 2.0 -y = f_true + σ_true * np.random.randn(n) - -## Plot the data and the unobserved latent function -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -ax.plot(X, f_true, "dodgerblue", lw=3, label="True f") -ax.plot(X, y, "ok", ms=3, alpha=0.5, label="Data") -ax.set_xlabel("X") -ax.set_ylabel("The true f(x)") -plt.legend(); -``` - -```{code-cell} ipython3 -with pm.Model() as model: - ℓ = pm.Gamma("ℓ", alpha=2, beta=1) - η = pm.HalfCauchy("η", beta=5) - - cov = η**2 * pm.gp.cov.Matern52(1, ℓ) - gp = pm.gp.Marginal(cov_func=cov) - - σ = pm.HalfCauchy("σ", beta=5) - y_ = gp.marginal_likelihood("y", X=X, y=y, noise=σ) - - mp = pm.find_MAP() -``` - -```{code-cell} ipython3 -# collect the results into a pandas dataframe to display -# "mp" stands for marginal posterior -pd.DataFrame( - { - "Parameter": ["ℓ", "η", "σ"], - "Value at MAP": [float(mp["ℓ"]), float(mp["η"]), float(mp["σ"])], - "True value": [ℓ_true, η_true, σ_true], - } -) -``` - -The MAP values are close to their true values. - -+++ - -### Using `.conditional` - -```{code-cell} ipython3 -# new values from x=0 to x=20 -X_new = np.linspace(0, 20, 600)[:, None] - -# add the GP conditional to the model, given the new X values -with model: - f_pred = gp.conditional("f_pred", X_new) - -# To use the MAP values, you can just replace the trace with a length-1 list with `mp` -with model: - pred_samples = pm.sample_posterior_predictive([mp], vars=[f_pred], samples=2000) -``` - -```{code-cell} ipython3 -# plot the results -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() - -# plot the samples from the gp posterior with samples and shading -from pymc3.gp.util import plot_gp_dist - -plot_gp_dist(ax, pred_samples["f_pred"], X_new) - -# plot the data and the true latent function -plt.plot(X, f_true, "dodgerblue", lw=3, label="True f") -plt.plot(X, y, "ok", ms=3, alpha=0.5, label="Observed data") - -# axis labels and title -plt.xlabel("X") -plt.ylim([-13, 13]) -plt.title("Posterior distribution over $f(x)$ at the observed values") -plt.legend(); -``` - -The prediction also matches the results from `gp.Latent` very closely. What about predicting new data points? 
Here we only predicted $f_*$, not $f_*$ + noise, which is what we actually observe. - -The `conditional` method of `gp.Marginal` contains the flag `pred_noise` whose default value is `False`. To draw from the *posterior predictive* distribution, we simply set this flag to `True`. - -```{code-cell} ipython3 -with model: - y_pred = gp.conditional("y_pred", X_new, pred_noise=True) - y_samples = pm.sample_posterior_predictive([mp], vars=[y_pred], samples=2000) -``` - -```{code-cell} ipython3 -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() - -# posterior predictive distribution -plot_gp_dist(ax, y_samples["y_pred"], X_new, plot_samples=False, palette="bone_r") - -# overlay a scatter of one draw of random points from the -# posterior predictive distribution -plt.plot(X_new, y_samples["y_pred"][800, :].T, "co", ms=2, label="Predicted data") - -# plot original data and true function -plt.plot(X, y, "ok", ms=3, alpha=1.0, label="observed data") -plt.plot(X, f_true, "dodgerblue", lw=3, label="true f") - -plt.xlabel("x") -plt.ylim([-13, 13]) -plt.title("posterior predictive distribution, y_*") -plt.legend(); -``` - -Notice that the posterior predictive density is wider than the conditional distribution of the noiseless function, and reflects the predictive distribution of the noisy data, which is marked as black dots. The light colored dots don't follow the spread of the predictive density exactly because they are a single draw from the posterior of the GP plus noise. - -+++ - -### Using `.predict` - -We can use the `.predict` method to return the mean and variance given a particular `point`. Since we used `find_MAP` in this example, `predict` returns the same mean and covariance that the distribution of `.conditional` has. - -```{code-cell} ipython3 -# predict -mu, var = gp.predict(X_new, point=mp, diag=True) -sd = np.sqrt(var) - -# draw plot -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() - -# plot mean and 2σ intervals -plt.plot(X_new, mu, "r", lw=2, label="mean and 2σ region") -plt.plot(X_new, mu + 2 * sd, "r", lw=1) -plt.plot(X_new, mu - 2 * sd, "r", lw=1) -plt.fill_between(X_new.flatten(), mu - 2 * sd, mu + 2 * sd, color="r", alpha=0.5) - -# plot original data and true function -plt.plot(X, y, "ok", ms=3, alpha=1.0, label="observed data") -plt.plot(X, f_true, "dodgerblue", lw=3, label="true f") - -plt.xlabel("x") -plt.ylim([-13, 13]) -plt.title("predictive mean and 2σ interval") -plt.legend(); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/gaussian_processes/GP-MaunaLoa.myst.md b/myst_nbs/gaussian_processes/GP-MaunaLoa.myst.md deleted file mode 100644 index 0e8f2a827..000000000 --- a/myst_nbs/gaussian_processes/GP-MaunaLoa.myst.md +++ /dev/null @@ -1,604 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 -substitutions: - extra_dependencies: bokeh ---- - -(GP-MaunaLoa)= -# Gaussian Process for CO2 at Mauna Loa - -:::{post} April, 2022 -:tags: gaussian process, CO2 -:category: intermediate -:author: Bill Engels, Chris Fonnesbeck -::: - -+++ - -This Gaussian Process (GP) example shows how to: - -- Design combinations of covariance functions -- Use additive GPs whose individual components can be used for prediction -- Perform maximum a-posteriori (MAP) estimation - -+++ - -Since the late 1950's, the Mauna Loa observatory has been taking regular measurements of atmospheric 
CO$_2$. In the late 1950's Charles Keeling invented a accurate way to measure atmospheric CO$_2$ concentration. -Since then, CO$_2$ measurements have been recorded nearly continuously at the Mauna Loa observatory. Check out last hours measurement result [here](https://www.co2.earth/daily-co2). - -![](http://sites.gsu.edu/geog1112/files/2014/07/MaunaLoaObservatory_small-2g29jvt.png) - -Not much was known about how fossil fuel burning influences the climate in the late 1950s. The first couple years of data collection showed that CO$_2$ levels rose and fell following summer and winter, tracking the growth and decay of vegetation in the northern hemisphere. As multiple years passed, the steady upward trend increasingly grew into focus. With over 70 years of collected data, the Keeling curve is one of the most important climate indicators. - -The history behind these measurements and their influence on climatology today and other interesting reading: - -- http://scrippsco2.ucsd.edu/history_legacy/early_keeling_curve# -- https://scripps.ucsd.edu/programs/keelingcurve/2016/05/23/why-has-a-drop-in-global-co2-emissions-not-caused-co2-levels-in-the-atmosphere-to-stabilize/#more-1412 - -Let's load in the data, tidy it up, and have a look. The [raw data set is located here](http://scrippsco2.ucsd.edu/data/atmospheric_co2/mlo). This notebook uses the [Bokeh package](http://bokeh.pydata.org/en/latest/) for plots that benefit from interactivity. - -+++ - -## Preparing the data - -+++ - -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -import numpy as np -import pandas as pd -import pymc3 as pm - -from bokeh.io import output_notebook -from bokeh.models import BoxAnnotation, Label, Legend, Span -from bokeh.palettes import brewer -from bokeh.plotting import figure, show - -output_notebook() -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -``` - -```{code-cell} ipython3 -# get data -try: - data_monthly = pd.read_csv("../data/monthly_in_situ_co2_mlo.csv", header=56) -except FileNotFoundError: - data_monthly = pd.read_csv(pm.get_data("monthly_in_situ_co2_mlo.csv"), header=56) - -# replace -99.99 with NaN -data_monthly.replace(to_replace=-99.99, value=np.nan, inplace=True) - -# fix column names -cols = [ - "year", - "month", - "--", - "--", - "CO2", - "seasonaly_adjusted", - "fit", - "seasonally_adjusted_fit", - "CO2_filled", - "seasonally_adjusted_filled", -] -data_monthly.columns = cols -cols.remove("--") -cols.remove("--") -data_monthly = data_monthly[cols] - -# drop rows with nan -data_monthly.dropna(inplace=True) - -# fix time index -data_monthly["day"] = 15 -data_monthly.index = pd.to_datetime(data_monthly[["year", "month", "day"]]) -cols.remove("year") -cols.remove("month") -data_monthly = data_monthly[cols] - -data_monthly.head(5) -``` - -```{code-cell} ipython3 -# function to convert datetimes to indexed numbers that are useful for later prediction -def dates_to_idx(timelist): - reference_time = pd.to_datetime("1958-03-15") - t = (timelist - reference_time) / pd.Timedelta(365, "D") - return np.asarray(t) - - -t = dates_to_idx(data_monthly.index) - -# normalize CO2 levels -y = data_monthly["CO2"].values -first_co2 = y[0] -std_co2 = np.std(y) -y_n = (y - first_co2) / std_co2 - -data_monthly = data_monthly.assign(t=t) -data_monthly = data_monthly.assign(y_n=y_n) -``` - -This data might be familiar to you, since it was used as an example in the [Gaussian Processes for Machine 
Learning](http://www.gaussianprocess.org/gpml/) book by {cite:t}`rasmussen2003gaussian`. The version of the data set they use starts in the late 1950's, but stops at the end of 2003. So that our PyMC3 example is somewhat comparable to their example, we use the stretch of data from before 2004 as the "training" set. The data from 2004 to 2022 we'll use to test our predictions. - -```{code-cell} ipython3 -# split into training and test set -sep_idx = data_monthly.index.searchsorted(pd.to_datetime("2003-12-15")) -data_early = data_monthly.iloc[: sep_idx + 1, :] -data_later = data_monthly.iloc[sep_idx:, :] -``` - -```{code-cell} ipython3 -# plot training and test data -p = figure( - x_axis_type="datetime", - title="Monthly CO2 Readings from Mauna Loa", - plot_width=550, - plot_height=350, -) -p.yaxis.axis_label = "CO2 [ppm]" -p.xaxis.axis_label = "Date" -predict_region = BoxAnnotation( - left=pd.to_datetime("2003-12-15"), fill_alpha=0.1, fill_color="firebrick" -) -p.add_layout(predict_region) -ppm400 = Span(location=400, dimension="width", line_color="red", line_dash="dashed", line_width=2) -p.add_layout(ppm400) - -p.line(data_monthly.index, data_monthly["CO2"], line_width=2, line_color="black", alpha=0.5) -p.circle(data_monthly.index, data_monthly["CO2"], line_color="black", alpha=0.1, size=2) - -train_label = Label( - x=100, - y=165, - x_units="screen", - y_units="screen", - text="Training Set", - render_mode="css", - border_line_alpha=0.0, - background_fill_alpha=0.0, -) -test_label = Label( - x=585, - y=80, - x_units="screen", - y_units="screen", - text="Test Set", - render_mode="css", - border_line_alpha=0.0, - background_fill_alpha=0.0, -) - -p.add_layout(train_label) -p.add_layout(test_label) -show(p) -``` - -Bokeh plots are interactive, so panning and zooming can be done with the sidebar on the right hand side. The seasonal rise and fall is plainly apparent, as is the upward trend. Here is a link to an plots of [this curve at different time scales, and in the context of historical ice core data](https://scripps.ucsd.edu/programs/keelingcurve/). - -The 400 ppm level is highlighted with a dashed line. In addition to fitting a descriptive model, our goal will be to predict the first month the 400 ppm threshold is crossed, which was [May, 2013](https://scripps.ucsd.edu/programs/keelingcurve/2013/05/20/now-what/#more-741). In the data set above, the CO$_2$ average reading for May, 2013 was about 399.98, close enough to be our correct target date. - -+++ - -## Modeling the Keeling Curve using GPs - -As a starting point, we use the GP model described in {cite:t}`rasmussen2003gaussian`. Instead of using flat priors on covariance function hyperparameters and then maximizing the marginal likelihood like is done in the textbook, we place somewhat informative priors on the hyperparameters and use optimization to find the MAP point. We use the `gp.Marginal` since Gaussian noise is assumed. - -The R&W {cite:p}`rasmussen2003gaussian` model is a sum of three GPs for the signal, and one GP for the noise. - -1. A long term smooth rising trend represented by an exponentiated quadratic kernel. -2. A periodic term that decays away from exact periodicity. This is represented by the product of a `Periodic` and a `Matern52` covariance functions. -3. Small and medium term irregularities with a rational quadratic kernel. -4. The noise is modeled as the sum of a `Matern32` and a white noise kernel. 
- -The prior on CO$_2$ as a function of time is, - -$$ -f(t) \sim \mathcal{GP}_{\text{slow}}(0,\, k_1(t, t')) + - \mathcal{GP}_{\text{med}}(0,\, k_2(t, t')) + - \mathcal{GP}_{\text{per}}(0,\, k_3(t, t')) + - \mathcal{GP}_{\text{noise}}(0,\, k_4(t, t')) -$$ - -## Hyperparameter priors -We use fairly uninformative priors for the scale hyperparameters of the covariance functions, and informative Gamma parameters for lengthscales. The PDFs used for the lengthscale priors is shown below: - -```{code-cell} ipython3 -x = np.linspace(0, 150, 5000) -priors = [ - ("ℓ_pdecay", pm.Gamma.dist(alpha=10, beta=0.075)), - ("ℓ_psmooth", pm.Gamma.dist(alpha=4, beta=3)), - ("period", pm.Normal.dist(mu=1.0, sigma=0.05)), - ("ℓ_med", pm.Gamma.dist(alpha=2, beta=0.75)), - ("α", pm.Gamma.dist(alpha=5, beta=2)), - ("ℓ_trend", pm.Gamma.dist(alpha=4, beta=0.1)), - ("ℓ_noise", pm.Gamma.dist(alpha=2, beta=4)), -] - -colors = brewer["Paired"][7] - -p = figure( - title="Lengthscale and period priors", - plot_width=550, - plot_height=350, - x_range=(-1, 8), - y_range=(0, 2), -) -p.yaxis.axis_label = "Probability" -p.xaxis.axis_label = "Years" - -for i, prior in enumerate(priors): - p.line( - x, - np.exp(prior[1].logp(x).eval()), - legend_label=prior[0], - line_width=3, - line_color=colors[i], - ) -show(p) -``` - -- `ℓ_pdecay`: The periodic decay. The smaller this parameter is, the faster the periodicity goes away. I doubt that the seasonality of the CO$_2$ will be going away any time soon (hopefully), and there's no evidence for that in the data. Most of the prior mass is from 60 to >140 years. - -- `ℓ_psmooth`: The smoothness of the periodic component. It controls how "sinusoidal" the periodicity is. The plot of the data shows that seasonality is not an exact sine wave, but its not terribly different from one. We use a Gamma whose mode is at one, and doesn't have too large of a variance, with most of the prior mass from around 0.5 and 2. - -- `period`: The period. We put a very strong prior on $p$, the period that is centered at one. R&W fix $p=1$, since the period is annual. - -- `ℓ_med`: This is the lengthscale for the short to medium long variations. This prior has most of its mass below 6 years. - -- `α`: This is the shape parameter. This prior is centered at 3, since we're expecting there to be some more variation than could be explained by an exponentiated quadratic. - -- `ℓ_trend`: The lengthscale of the long term trend. It has a wide prior with mass on a decade scale. Most of the mass is between 10 to 60 years. - -- `ℓ_noise`: The lengthscale of the noise covariance. This noise should be very rapid, in the scale of several months to at most a year or two. - -+++ - -We know beforehand which GP components should have a larger magnitude, so we include this information in the scale parameters. 
- -```{code-cell} ipython3 -x = np.linspace(0, 4, 5000) -priors = [ - ("η_per", pm.HalfCauchy.dist(beta=2)), - ("η_med", pm.HalfCauchy.dist(beta=1.0)), - ( - "η_trend", - pm.HalfCauchy.dist(beta=3), - ), # will use beta=2, but beta=3 is visible on plot - ("σ", pm.HalfNormal.dist(sigma=0.25)), - ("η_noise", pm.HalfNormal.dist(sigma=0.5)), -] - -colors = brewer["Paired"][5] - -p = figure(title="Scale priors", plot_width=550, plot_height=350) -p.yaxis.axis_label = "Probability" -p.xaxis.axis_label = "Years" - -for i, prior in enumerate(priors): - p.line( - x, - np.exp(prior[1].logp(x).eval()), - legend_label=prior[0], - line_width=3, - line_color=colors[i], - ) -show(p) -``` - -For all of the scale priors we use distributions that shrink the scale towards zero. The seasonal component and the long term trend have the least mass near zero, since they are the largest influences in the data. - -- `η_per`: Scale of the periodic or seasonal component. -- `η_med`: Scale of the short to medium term component. -- `η_trend`: Scale of the long term trend. -- `σ`: Scale of the white noise. -- `η_noise`: Scale of correlated, short term noise. - -+++ - -## The model in PyMC3 - -Below is the actual model. Each of the three component GPs is constructed separately. Since we are doing MAP, we use `Marginal` GPs and lastly call the `.marginal_likelihood` method to specify the marginal posterior. - -```{code-cell} ipython3 -# pull out normalized data -t = data_early["t"].values[:, None] -y = data_early["y_n"].values -``` - -```{code-cell} ipython3 -with pm.Model() as model: - # yearly periodic component x long term trend - η_per = pm.HalfCauchy("η_per", beta=2, testval=1.0) - ℓ_pdecay = pm.Gamma("ℓ_pdecay", alpha=10, beta=0.075) - period = pm.Normal("period", mu=1, sigma=0.05) - ℓ_psmooth = pm.Gamma("ℓ_psmooth ", alpha=4, beta=3) - cov_seasonal = ( - η_per**2 * pm.gp.cov.Periodic(1, period, ℓ_psmooth) * pm.gp.cov.Matern52(1, ℓ_pdecay) - ) - gp_seasonal = pm.gp.Marginal(cov_func=cov_seasonal) - - # small/medium term irregularities - η_med = pm.HalfCauchy("η_med", beta=0.5, testval=0.1) - ℓ_med = pm.Gamma("ℓ_med", alpha=2, beta=0.75) - α = pm.Gamma("α", alpha=5, beta=2) - cov_medium = η_med**2 * pm.gp.cov.RatQuad(1, ℓ_med, α) - gp_medium = pm.gp.Marginal(cov_func=cov_medium) - - # long term trend - η_trend = pm.HalfCauchy("η_trend", beta=2, testval=2.0) - ℓ_trend = pm.Gamma("ℓ_trend", alpha=4, beta=0.1) - cov_trend = η_trend**2 * pm.gp.cov.ExpQuad(1, ℓ_trend) - gp_trend = pm.gp.Marginal(cov_func=cov_trend) - - # noise model - η_noise = pm.HalfNormal("η_noise", sigma=0.5, testval=0.05) - ℓ_noise = pm.Gamma("ℓ_noise", alpha=2, beta=4) - σ = pm.HalfNormal("σ", sigma=0.25, testval=0.05) - cov_noise = η_noise**2 * pm.gp.cov.Matern32(1, ℓ_noise) + pm.gp.cov.WhiteNoise(σ) - - # The Gaussian process is a sum of these three components - gp = gp_seasonal + gp_medium + gp_trend - - # Since the normal noise model and the GP are conjugates, we use `Marginal` with the `.marginal_likelihood` method - y_ = gp.marginal_likelihood("y", X=t, y=y, noise=cov_noise) - - # this line calls an optimizer to find the MAP - mp = pm.find_MAP(include_transformed=True) -``` - -```{code-cell} ipython3 -# display the results, dont show transformed parameter values -sorted([name + ":" + str(mp[name]) for name in mp.keys() if not name.endswith("_")]) -``` - -At first glance the results look reasonable. The lengthscale that determines how fast the seasonality varies is about 126 years. 
This means that given the data, we wouldn't expect such strong periodicity to vanish until centuries have passed. The trend lengthscale is also long, about 50 years. - -+++ - -## Examining the fit of each of the additive GP components - -The code below looks at the fit of the total GP, and each component individually. The total fit and its $2\sigma$ uncertainty are shown in red. - -```{code-cell} ipython3 -# predict at a 15 day granularity -dates = pd.date_range(start="3/15/1958", end="12/15/2003", freq="15D") -tnew = dates_to_idx(dates)[:, None] - -print("Predicting with gp ...") -mu, var = gp.predict(tnew, point=mp, diag=True) -mean_pred = mu * std_co2 + first_co2 -var_pred = var * std_co2**2 - -# make dataframe to store fit results -fit = pd.DataFrame( - {"t": tnew.flatten(), "mu_total": mean_pred, "sd_total": np.sqrt(var_pred)}, - index=dates, -) - -print("Predicting with gp_trend ...") -mu, var = gp_trend.predict( - tnew, point=mp, given={"gp": gp, "X": t, "y": y, "noise": cov_noise}, diag=True -) -fit = fit.assign(mu_trend=mu * std_co2 + first_co2, sd_trend=np.sqrt(var * std_co2**2)) - -print("Predicting with gp_medium ...") -mu, var = gp_medium.predict( - tnew, point=mp, given={"gp": gp, "X": t, "y": y, "noise": cov_noise}, diag=True -) -fit = fit.assign(mu_medium=mu * std_co2 + first_co2, sd_medium=np.sqrt(var * std_co2**2)) - -print("Predicting with gp_seasonal ...") -mu, var = gp_seasonal.predict( - tnew, point=mp, given={"gp": gp, "X": t, "y": y, "noise": cov_noise}, diag=True -) -fit = fit.assign(mu_seasonal=mu * std_co2 + first_co2, sd_seasonal=np.sqrt(var * std_co2**2)) -print("Done") -``` - -```{code-cell} ipython3 -## plot the components -p = figure( - title="Decomposition of the Mauna Loa Data", - x_axis_type="datetime", - plot_width=550, - plot_height=350, -) -p.yaxis.axis_label = "CO2 [ppm]" -p.xaxis.axis_label = "Date" - -# plot mean and 2σ region of total prediction -upper = fit.mu_total + 2 * fit.sd_total -lower = fit.mu_total - 2 * fit.sd_total -band_x = np.append(fit.index.values, fit.index.values[::-1]) -band_y = np.append(lower, upper[::-1]) - -# total fit -p.line( - fit.index, - fit.mu_total, - line_width=1, - line_color="firebrick", - legend_label="Total fit", -) -p.patch(band_x, band_y, color="firebrick", alpha=0.6, line_color="white") - -# trend -p.line( - fit.index, - fit.mu_trend, - line_width=1, - line_color="blue", - legend_label="Long term trend", -) - -# medium -p.line( - fit.index, - fit.mu_medium, - line_width=1, - line_color="green", - legend_label="Medium range variation", -) - -# seasonal -p.line( - fit.index, - fit.mu_seasonal, - line_width=1, - line_color="orange", - legend_label="Seasonal process", -) - -# true value -p.circle(data_early.index, data_early["CO2"], color="black", legend_label="Observed data") -p.legend.location = "top_left" -show(p) -``` - -The fit matches the observed data very well. The trend, seasonality, and short/medium term effects also are cleanly separated out. If you zoom so the seasonal process fills the plot window (from x equals 1955 to 2004, from y equals 310 to 320), it appears to be widening as time goes on. 
Let's plot the first year of each decade:
-
-```{code-cell} ipython3
-# plot several years
-
-p = figure(title="Several years of the seasonal component", plot_width=550, plot_height=350)
-p.yaxis.axis_label = "Δ CO2 [ppm]"
-p.xaxis.axis_label = "Month"
-
-colors = brewer["Paired"][5]
-years = ["1960", "1970", "1980", "1990", "2000"]
-
-for i, year in enumerate(years):
-    dates = pd.date_range(start="1/1/" + year, end="12/31/" + year, freq="10D")
-    tnew = dates_to_idx(dates)[:, None]
-
-    print("Predicting year", year)
-    mu, var = gp_seasonal.predict(
-        tnew, point=mp, diag=True, given={"gp": gp, "X": t, "y": y, "noise": cov_noise}
-    )
-    mu_pred = mu * std_co2
-
-    # plot mean
-    x = np.asarray((dates - dates[0]) / pd.Timedelta(30, "D")) + 1
-    p.line(x, mu_pred, line_width=1, line_color=colors[i], legend_label=year)
-
-p.legend.location = "bottom_left"
-show(p)
-```
-
-This plot makes it clear that there is a broadening over time. So it would seem that as there is more CO$_2$ in the atmosphere, [the absorption/release cycle due to the growth and decay of vegetation in the northern hemisphere](https://scripps.ucsd.edu/programs/keelingcurve/2013/06/04/why-does-atmospheric-co2-peak-in-may/) becomes slightly more pronounced.
-
-+++
-
-## What day will the CO2 level break 400 ppm?
-
-How well do our forecasts look? Clearly the observed data trends up and the seasonal effect is very pronounced. Does our GP model capture this well enough to make reasonable extrapolations? Our "training" set went up until the end of 2003, so we are going to predict from January 2004 out to the end of 2022.
-
-Although there isn't any particular significance to this event other than it being a nice round number, our side goal was to see how well we could predict the date when the 400 ppm mark is first crossed. [This event first occurred during May, 2013](https://scripps.ucsd.edu/programs/keelingcurve/2013/05/20/now-what/#more-741) and there were a few [news articles about other significant milestones](https://www.usatoday.com/story/tech/sciencefair/2016/09/29/carbon-dioxide-levels-400-ppm-scripps-mauna-loa-global-warming/91279952/).
- -```{code-cell} ipython3 -dates = pd.date_range(start="11/15/2003", end="12/15/2022", freq="10D") -tnew = dates_to_idx(dates)[:, None] - -print("Sampling gp predictions ...") -mu_pred, cov_pred = gp.predict(tnew, point=mp) - -# draw samples, and rescale -n_samples = 2000 -samples = pm.MvNormal.dist(mu=mu_pred, cov=cov_pred, shape=(n_samples, len(tnew))).random() -samples = samples * std_co2 + first_co2 -``` - -```{code-cell} ipython3 -# make plot -p = figure(x_axis_type="datetime", plot_width=700, plot_height=300) -p.yaxis.axis_label = "CO2 [ppm]" -p.xaxis.axis_label = "Date" - -# plot mean and 2σ region of total prediction -# scale mean and var -mu_pred_sc = mu_pred * std_co2 + first_co2 -sd_pred_sc = np.sqrt(np.diag(cov_pred) * std_co2**2) - -upper = mu_pred_sc + 2 * sd_pred_sc -lower = mu_pred_sc - 2 * sd_pred_sc -band_x = np.append(dates, dates[::-1]) -band_y = np.append(lower, upper[::-1]) - -p.line(dates, mu_pred_sc, line_width=2, line_color="firebrick", legend_label="Total fit") -p.patch(band_x, band_y, color="firebrick", alpha=0.6, line_color="white") - -# some predictions -idx = np.random.randint(0, samples.shape[0], 10) -p.multi_line( - [dates] * len(idx), - [samples[i, :] for i in idx], - color="firebrick", - alpha=0.5, - line_width=0.5, -) - -# true value -p.circle(data_later.index, data_later["CO2"], color="black", legend_label="Observed data") - -ppm400 = Span( - location=400, - dimension="width", - line_color="black", - line_dash="dashed", - line_width=1, -) -p.add_layout(ppm400) -p.legend.location = "bottom_right" -show(p) -``` - -The mean prediction and the $2\sigma$ uncertainty is in red. A couple samples from the marginal posterior are also shown on there. It looks like our model was a little optimistic about how much CO2 is being released. The first time the $2\sigma$ uncertainty crosses the 400 ppm threshold is in May 2015, two years late. - -One reason this is occurring is because our GP prior had zero mean. This means we encoded prior information that says that the function should go to zero as we move away from our observed data. This assumption probably isn't justified. It's also possible that the CO$_2$ trend is increasing faster than linearly -- important knowledge for accurate predictions. Another possibility is the MAP estimate. Without looking at the full posterior, the uncertainty in our estimates is underestimated. How badly is unknown. - -+++ - -Having a zero mean GP prior is causing the prediction to be pretty far off. Some possibilities for fixing this is to use a constant mean function, whose value could maybe be assigned the historical, or pre-industrial revolution, CO$_2$ average. This may not be the best indicator for future CO$_2$ levels though. - -Also, using only historical CO$_2$ data may not be the best predictor. In addition to looking at the underlying behavior of what determines CO$_2$ levels using a GP fit, we could also incorporate other information, such as the amount of CO$_2$ that is released by fossil fuel burning. - -Next, we'll see about using PyMC3's GP functionality to improve the model, look at full posteriors, and incorporate other sources of data on drivers of CO$_2$ levels. 
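-
-One of the suggestions above was a constant mean function set to a pre-industrial baseline. A minimal sketch of what that could look like is shown below; it is an addition to the original notebook, and the 280 ppm value is only an assumed pre-industrial average, rescaled the same way as the observed data.
-
-```{code-cell} ipython3
-# Added, hypothetical sketch (not part of the original analysis): a constant
-# mean function at an assumed ~280 ppm pre-industrial baseline, expressed on
-# the same normalized scale as `y`, attached to the long term trend covariance.
-mean_pre_industrial = pm.gp.mean.Constant((280.0 - first_co2) / std_co2)
-gp_trend_const_mean = pm.gp.Marginal(mean_func=mean_pre_industrial, cov_func=cov_trend)
-```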
- -+++ - -## Authors -* Authored by Bill Engels in September, 2017 ([pymc#2444](https://github.com/pymc-devs/pymc/pull/2444)) -* Updated by Chris Fonnesbeck in December, 2020 -* Re-executed by Danh Phan in May, 2022 ([pymc-examples#316](https://github.com/pymc-devs/pymc-examples/pull/316)) - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p bokeh -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/GP-MaunaLoa2.myst.md b/myst_nbs/gaussian_processes/GP-MaunaLoa2.myst.md deleted file mode 100644 index 6fbd05b34..000000000 --- a/myst_nbs/gaussian_processes/GP-MaunaLoa2.myst.md +++ /dev/null @@ -1,947 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: dev-pymc - language: python - name: dev-pymc ---- - -# Example: Mauna Loa CO$_2$ continued - -This GP example shows how to - -- Fit fully Bayesian GPs with NUTS -- Model inputs whose exact locations are uncertain (uncertainty in 'x') -- Design a semiparametric Gaussian process model -- Build a changepoint covariance function / kernel -- Definine a custom mean and a custom covariance function - -![Earth Systems Research Laboratory](https://www.esrl.noaa.gov/gmd/obop/mlo/pictures/sunsetmaunaloa1.jpg) - -+++ - -# Ice Core Data - -+++ - -The first data set we'll look at is CO2 measurements from ice core data. This data goes back to the year 13 AD. The data after the year 1958 is an average of ice core measurements and more accurate data taken from Mauna Loa. **I'm very grateful to Tobias Erhardt from the University of Bërn for his generous insight on the science of how some of the processes touched on actually work.** Any mistakes are my own of course. - -This data is less accurate than the Mauna Loa atmospheric CO2 measurements. Snow that falls on Antarctica accumulates gradually and hardens into ice over time, which is referred to as *firn*. CO2 measured in the Law Dome ice cores come from air bubbles trapped in the ice. If this ice were flash frozen, the amount of CO2 contained in the air bubbles would reflect the amount of CO2 in the atmosphere at the exact date and time of the freeze. Instead, the process happens gradually, so the trapped air has time to diffuse throughout the solidifying ice. The process of the layering, freezing and solidifying of the firn happens over the scale of years. For the Law Dome data used here, the CO2 measurements listed in the data represent an average CO2 across about 2-4 years in total. - -Also, the ordering of the data points is fixed. There is no way for older ice layers to end up on top of newer ice layers. This enforces that we place a prior on the measurement locations whose order is restricted. - -The dates of the ice core measurements have some uncertainty. They may be accurate on a yearly level due to how the ice layers on it self every year, but the date isn't likely to be reliable as to the season when the measurement was taken. Also, the CO2 level observed may be some sort of average of the overall yearly level. - -As we saw in the previous example, there is a strong seasonal component in CO2 levels that won't be observable in this data set. In PyMC3, we can easily include both errors in $y$ and errors in $x$. 
To demonstrate this, we remove the latter part of the data (which are averaged with Mauna Loa readings) so we have only the ice core measurements. We fit the Gaussian process model using the No-U-Turn MCMC sampler. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import theano -import theano.tensor as tt -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -ice = pd.read_csv(pm.get_data("merged_ice_core_yearly.csv"), header=26) -ice.columns = ["year", "CO2"] -ice["CO2"] = ice["CO2"].astype(np.float) - -#### DATA AFTER 1958 is an average of ice core and mauna loa data, so remove it -ice = ice[ice["year"] <= 1958] -print("Number of data points:", len(ice)) -``` - -```{code-cell} ipython3 -fig = plt.figure(figsize=(9, 4)) -ax = plt.gca() - -ax.plot(ice.year.values, ice.CO2.values, ".k") -ax.set_xlabel("Year") -ax.set_ylabel("CO2 (ppm)"); -``` - -The industrial revolution era occurred around the years 1760 to 1840. This point is clearly visible in the graph, where CO2 levels rise dramatically after being fairly stationary at around 280 ppm for over a thousand years. - -+++ - -## Uncertainty in 'x' - -To model uncertainty in $x$, or time, we place a prior distribution over each of the observation dates. So that the prior is standardized, we specifically use a PyMC3 random variable to model the difference between the date given in the data set, and it's error. We assume that these differences are normal with mean zero, and standard deviation of two years. We also enforce that the observations have a strict ordering in time using the `ordered` transform. - -For just the ice core data, the uncertainty in $x$ is not very important. In the last example, we'll see how it plays a more influential role in the model. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(8, 5)) -ax = plt.gca() -ax.hist(100 * pm.Normal.dist(mu=0.0, sigma=0.02).random(size=10000), 100) -ax.set_xlabel(r"$\Delta$ time (years)") -ax.set_title("time offset prior"); -``` - -```{code-cell} ipython3 -t = ice.year.values -y = ice.CO2.values - -# normalize the CO2 readings prior to fitting the model -y_mu, y_sd = np.mean(y[0:50]), np.std(y) -y_n = (y - y_mu) / y_sd - -# scale t to have units of centuries -t_n = t / 100 -``` - -We use an informative prior on the lengthscale that places most of the mass between a few and 20 centuries. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(8, 5)) -ax = plt.gca() -ax.hist(pm.Gamma.dist(alpha=2, beta=0.25).random(size=10000), 100) -ax.set_xlabel("Time (centuries)") -ax.set_title("Lengthscale prior"); -``` - -```{code-cell} ipython3 -with pm.Model() as model: - η = pm.HalfNormal("η", sigma=5) - ℓ = pm.Gamma("ℓ", alpha=4, beta=2) - α = pm.Gamma("α", alpha=3, beta=1) - cov = η**2 * pm.gp.cov.RatQuad(1, α, ℓ) - - gp = pm.gp.Marginal(cov_func=cov) - - # x location uncertainty - # - sd = 0.02 says the uncertainty on the point is about two years - t_diff = pm.Normal("t_diff", mu=0.0, sigma=0.02, shape=len(t)) - t_uncert = t_n - t_diff - - # white noise variance - σ = pm.HalfNormal("σ", sigma=5, testval=1) - y_ = gp.marginal_likelihood("y", X=t_uncert[:, None], y=y_n, noise=σ) -``` - -Next we can sample with the NUTS MCMC algorithm. We run two chains but set the number of cores to one, since the linear algebra libraries used internally by Theano are multicore. 
- -```{code-cell} ipython3 -with model: - tr = pm.sample(target_accept=0.95, return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(tr, var_names=["t_diff"], compact=True); -``` - -In the traceplot for `t_diff`, we can see that the posterior peaks for the different inputs haven't moved much, but the uncertainty in location is accounted for by the sampling. - -The posterior distributions for the other unknown hyperparameters is below. - -```{code-cell} ipython3 -az.plot_trace(tr, var_names=["η", "ℓ", "α", "σ"]); -``` - -### Predictions - -```{code-cell} ipython3 -tnew = np.linspace(-100, 2150, 2000) * 0.01 -with model: - fnew = gp.conditional("fnew", Xnew=tnew[:, None]) - -with model: - ppc = pm.sample_posterior_predictive(tr, samples=100, var_names=["fnew"]) -``` - -```{code-cell} ipython3 -samples = y_sd * ppc["fnew"] + y_mu - -fig = plt.figure(figsize=(12, 5)) -ax = plt.gca() -pm.gp.util.plot_gp_dist(ax, samples, tnew * 100, plot_samples=True, palette="Blues") -ax.plot(t, y, "k.") -ax.set_xlim([-100, 2200]) -ax.set_ylabel("CO2") -ax.set_xlabel("Year"); -``` - -Two features are apparent in this plot. One is the [little ice age](https://www.nature.com/articles/ngeo2769), whose effects on CO2 occurs from around 1600 to 1800. The next is the industrial revolution, when people began releasing large amounts of CO2 into the atmosphere. - - -## Semiparametric Gaussian process - -The forecast past the latest data point in 1958 rises, then flattens, then dips back downwards. Should we trust this forecast? We know it hasn't been born out (see the previous notebook) as CO2 levels have continued to rise. - -We didn't specify a mean function in our model, so we've assumed that our GP has a mean of zero. This means that -as we forecast into the future, the function will eventually return to zero. Is this reasonable in this case? There have been no global events that suggest that atmospheric CO2 will not continue on its current trend. - -+++ - -### A linear model for changepoints - -We adopt the formulation used by [Facebook's prophet](https://peerj.com/preprints/3190.pdf) time series model. This is a linear piecewise function, where each segments endpoints are restricted to be connect to one another. Some example functions are plotted below. - -```{code-cell} ipython3 -def dm_changepoints(t, changepoints_t): - A = np.zeros((len(t), len(changepoints_t))) - for i, t_i in enumerate(changepoints_t): - A[t >= t_i, i] = 1 - return A -``` - -For later use, we reprogram this function using symbolic theano variables. The code is a bit inscrutible, but it returns the same thing as `dm_changepoitns` while avoiding the use of a loop. - -```{code-cell} ipython3 -def dm_changepoints_theano(X, changepoints_t): - return 0.5 * (1.0 + tt.sgn(tt.tile(X, (1, len(changepoints_t))) - changepoints_t)) -``` - -From looking at the graph, some possible locations for changepoints are at 1600, 1800 and maybe 1900. These bookend the little ice age, the start of the industrial revolution, and the start of more modern industrial practices. 
- -```{code-cell} ipython3 -changepoints_t = np.array([16, 18, 19]) - -A = dm_changepoints(t_n, changepoints_t) -``` - -There are several parameters (which we will estimate), some test values and a plot of the resulting function is shown below - -```{code-cell} ipython3 -# base growth rate, or initial slope -k = 0.0 - -# offset -m = 0.1 - -# slope parameters -delta = np.array([0.05, -0.1, 0.3]) -``` - -```{code-cell} ipython3 -x = (k + np.dot(A, delta)) * t_n + (m + np.dot(A, -changepoints_t * delta)) -plt.plot(t, x); -``` - -### A custom changepoint mean function - -We could encode this mean function directly, but if we wrap it inside of a `Mean` object, then it easier to use other Gaussian process functionality, like the `.conditional` and `.predict` methods. Look here [for more information on custom mean and covariance functions](https://docs.pymc.io/notebooks/GP-MeansAndCovs.html#Defining-a-custom-mean-function). We only need to define `__init__` and `__call__` functions. - -```{code-cell} ipython3 -class PiecewiseLinear(pm.gp.mean.Mean): - def __init__(self, changepoints, k, m, delta): - self.changepoints = changepoints - self.k = k - self.m = m - self.delta = delta - - def __call__(self, X): - # X are the x locations, or time points - A = dm_changepoints_theano(X, self.changepoints) - return (self.k + tt.dot(A, self.delta)) * X.flatten() + ( - self.m + tt.dot(A, -self.changepoints * self.delta) - ) -``` - -It is inefficient to recreate `A` every time the mean function is evaluated, but we'll need to do this when the number of inputs changes when making predictions. - -### Semiparametric changepoint model - -Next is the updated model with the changepoint mean function. - -```{code-cell} ipython3 -with pm.Model() as model: - η = pm.HalfNormal("η", sigma=2) - ℓ = pm.Gamma("ℓ", alpha=4, beta=2) - α = pm.Gamma("α", alpha=3, beta=1) - cov = η**2 * pm.gp.cov.RatQuad(1, α, ℓ) - - # piecewise linear mean function - k = pm.Normal("k", mu=0, sigma=1) - m = pm.Normal("m", mu=0, sigma=1) - delta = pm.Normal("delta", mu=0, sigma=5, shape=len(changepoints_t)) - mean = PiecewiseLinear(changepoints_t, k, m, delta) - - # include mean function in GP constructor - gp = pm.gp.Marginal(cov_func=cov, mean_func=mean) - - # x location uncertainty - # - sd = 0.02 says the uncertainty on the point is about two years - t_diff = pm.Normal("t_diff", mu=0.0, sigma=0.02, shape=len(t)) - t_uncert = t_n - t_diff - - # white noise variance - σ = pm.HalfNormal("σ", sigma=5) - y_ = gp.marginal_likelihood("y", X=t_uncert[:, None], y=y_n, noise=σ) -``` - -```{code-cell} ipython3 -with model: - tr = pm.sample(chains=2, target_accept=0.95) -``` - -```{code-cell} ipython3 -az.plot_trace(tr, var_names=["η", "ℓ", "α", "σ", "k", "m", "delta"]); -``` - -### Predictions - -```{code-cell} ipython3 -tnew = np.linspace(-100, 2200, 2000) * 0.01 - -with model: - fnew = gp.conditional("fnew", Xnew=tnew[:, None]) - -with model: - ppc = pm.sample_posterior_predictive(tr, samples=100, var_names=["fnew"]) -``` - -```{code-cell} ipython3 -samples = y_sd * ppc["fnew"] + y_mu - -fig = plt.figure(figsize=(12, 5)) -ax = plt.gca() -pm.gp.util.plot_gp_dist(ax, samples, tnew * 100, plot_samples=True, palette="Blues") -ax.plot(t, y, "k.") -ax.set_xlim([-100, 2200]) -ax.set_ylabel("CO2 (ppm)") -ax.set_xlabel("year"); -``` - -These results look better, but we had to choose exactly where the changepoints were. Instead of using a changepoint in the mean function, we could also specify this same changepoint behavior in the form of a covariance function. 
One benefit of the latter formulation is that the changepoint can be a more realistic smooth transition, instead of a discrete breakpoint. In the next section, we'll look at how to do this.
-
-+++
-
-# A custom changepoint covariance function
-
-+++
-
-More complex covariance functions can be constructed by composing base covariance
-functions in several ways. For instance, two of the most commonly used operations are
-
-- The sum of two covariance functions is a covariance function
-- The product of two covariance functions is a covariance function
-
-We can also construct a covariance function by scaling a base covariance function ($k_b$) by any arbitrary function,
-$$ k(x, x') = s(x) k_{\mathrm{b}}(x, x') s(x') \,. $$
-The scaling function can be parameterized by known parameters, or unknown parameters can be inferred.
-
-### Heaviside step function
-
-To specifically construct a covariance function that describes a changepoint,
-we could propose a scaling function $s(x)$ that specifies the region where the base covariance is active. The simplest option is the step function,
-
-$$ s(x;\, x_0) =
-\begin{cases}
-  0 & x \leq x_0 \\
-  1 & x_0 < x
-\end{cases}
-$$
-
-which is parameterized by the changepoint $x_0$. The covariance function $s(x; x_0) k_b(x, x') s(x'; x_0)$ is only active in the region $x > x_0$.
-
-PyMC3 provides the `ScaledCov` covariance function. As arguments, it takes a base
-covariance, a scaling function, and a tuple of arguments for the scaling function. To construct this in PyMC3, we first define the scaling function:
-
-```{code-cell} ipython3
-def step_function(x, x0, greater=True):
-    if greater:
-        # s = 1 for x > x_0
-        return 0.5 * (tt.sgn(x - x0) + 1.0)
-    else:
-        return 0.5 * (tt.sgn(x0 - x) + 1.0)
-```
-
-```{code-cell} ipython3
-step_function(np.linspace(0, 10, 10), x0=5, greater=True).eval()
-```
-
-```{code-cell} ipython3
-step_function(np.linspace(0, 10, 10), x0=5, greater=False).eval()
-```
-
-Then we can define the following covariance function, which we compute over $x \in (0, 100)$. The base covariance has a lengthscale of 10, and $x_0 = 40$. Since we are using a step function, it is "active" for $x \leq 40$ when `greater=False`, and for $x > 40$ when `greater=True`.
-
-```{code-cell} ipython3
-cov = pm.gp.cov.ExpQuad(1, 10)
-sc_cov = pm.gp.cov.ScaledCov(1, cov, step_function, (40, False))
-```
-
-```{code-cell} ipython3
-x = np.linspace(0, 100, 100)
-K = sc_cov(x[:, None]).eval()
-m = plt.imshow(K, cmap="magma")
-plt.colorbar(m);
-```
-
-But this isn't a changepoint covariance function yet. We can add two of these together. For $x > 40$, let's use a `Matern32` base covariance with a lengthscale of 5, scaled by a factor of 0.25:
-
-```{code-cell} ipython3
-cov1 = pm.gp.cov.ExpQuad(1, 10)
-sc_cov1 = pm.gp.cov.ScaledCov(1, cov1, step_function, (40, False))
-
-cov2 = 0.25 * pm.gp.cov.Matern32(1, 5)
-sc_cov2 = pm.gp.cov.ScaledCov(1, cov2, step_function, (40, True))
-
-sc_cov = sc_cov1 + sc_cov2
-
-# plot over 0 < x < 100
-x = np.linspace(0, 100, 100)
-K = sc_cov(x[:, None]).eval()
-m = plt.imshow(K, cmap="magma")
-plt.colorbar(m);
-```
-
-What do samples from the Gaussian process prior with this covariance look like?
-
-```{code-cell} ipython3
-prior_samples = np.random.multivariate_normal(np.zeros(100), K, 3).T
-plt.plot(x, prior_samples)
-plt.axvline(x=40, color="k", alpha=0.5);
-```
-
-Before $x = 40$, the function is smooth and slowly changing. After $x = 40$, the samples are less smooth and change quickly.
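-
-As a quick numerical check (an addition to the original notebook), we can verify the block structure of `K` directly: an entry that pairs a point before the changepoint with a point after it should be exactly zero, because only one of the two scaled base covariances is active on each side of $x_0 = 40$.
-
-```{code-cell} ipython3
-# Added sanity check (not in the original notebook). K was evaluated above on
-# x = np.linspace(0, 100, 100) with a hard step changepoint at x0 = 40.
-i_left = np.argmin(np.abs(x - 20))   # a point well before the changepoint
-i_right = np.argmin(np.abs(x - 60))  # a point well after the changepoint
-
-# cross-block covariance is zero with a hard step function
-print("K[left, right]  =", K[i_left, i_right])
-
-# the two within-block variances come from the two different base covariances
-print("K[left, left]   =", K[i_left, i_left])
-print("K[right, right] =", K[i_right, i_right])
-```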
- -### A gradual change with a sigmoid function - -Instead of a sharp cutoff, It is usually more realistic to have a smooth transition. For this we can use the logistic function, shown below: - -```{code-cell} ipython3 -# b is the slope, a is the location - -b = -0.2 -a = 40 -plt.plot(x, pm.math.invlogit(b * (x - a)).eval(), label="scaling left cov") - -b = 0.2 -a = 40 -plt.plot(x, pm.math.invlogit(b * (x - a)).eval(), label="scaling right cov") -plt.legend(); -``` - -```{code-cell} ipython3 -def logistic(x, b, x0): - # b is the slope, x0 is the location - return pm.math.invlogit(b * (x - x0)) -``` - -The same covariance function as before, but with a gradual changepoint is shown below: - -```{code-cell} ipython3 -cov1 = pm.gp.cov.ExpQuad(1, 10) -sc_cov1 = pm.gp.cov.ScaledCov(1, cov1, logistic, (-0.1, 40)) - -cov2 = 0.25 * pm.gp.cov.Matern32(1, 5) -sc_cov2 = pm.gp.cov.ScaledCov(1, cov2, logistic, (0.1, 40)) - -sc_cov = sc_cov1 + sc_cov2 - -# plot over 0 < x < 100 -x = np.linspace(0, 100, 100) -K = sc_cov(x[:, None]).eval() -m = plt.imshow(K, cmap="magma") -plt.colorbar(m); -``` - -Below, you can see that the transition of the prior functions from one region to the next is more gradual: - -```{code-cell} ipython3 -prior_samples = np.random.multivariate_normal(np.zeros(100), K, 3).T -plt.plot(x, prior_samples) -plt.axvline(x=40, color="k", alpha=0.5); -``` - -Lets try this model out instead of the semiparametric changepoint version. - -+++ - -### Changepoint covariance model - -The features of this model are: - -- One covariance for short term variation across all time points -- The parameter `x0` is the location of the industrial revolution. It is given a prior that has most of its support between years 1760 and 1840, centered at 1800. -- We can easily use this `x0` parameter as the `shift` parameter in the 2nd degree `Polynomial` (quadratic) covariance, and as the location of the changepoint in the changepoint covariance. -- A changepoint covariance that is `ExpQuad` prior to the industrial revolution, and `ExpQuad + Polynomial(degree=2)` afterwards. -- We use the same scaling and lengthscale parameters for each of the two base covariances in the changepoint covariance. -- Still modeling uncertainty in `x` as before. 
- -```{code-cell} ipython3 -with pm.Model() as model: - η = pm.HalfNormal("η", sigma=5) - ℓ = pm.Gamma("ℓ", alpha=2, beta=0.1) - - # changepoint occurs near the year 1800, sometime between 1760, 1840 - x0 = pm.Normal("x0", mu=18, sigma=0.1) - # the change happens gradually - a = pm.HalfNormal("a", sigma=2) - # a constant for the - c = pm.HalfNormal("c", sigma=3) - # quadratic polynomial scale - ηq = pm.HalfNormal("ηq", sigma=5) - - cov1 = η**2 * pm.gp.cov.ExpQuad(1, ℓ) - cov2 = η**2 * pm.gp.cov.ExpQuad(1, ℓ) + ηq**2 * pm.gp.cov.Polynomial(1, x0, 2, c) - - # construct changepoint cov - sc_cov1 = pm.gp.cov.ScaledCov(1, cov1, logistic, (-a, x0)) - sc_cov2 = pm.gp.cov.ScaledCov(1, cov2, logistic, (a, x0)) - cov_c = sc_cov1 + sc_cov2 - - # short term variation - ηs = pm.HalfNormal("ηs", sigma=5) - ℓs = pm.Gamma("ℓs", alpha=2, beta=1) - cov_s = ηs**2 * pm.gp.cov.Matern52(1, ℓs) - - gp = pm.gp.Marginal(cov_func=cov_s + cov_c) - - t_diff = pm.Normal("t_diff", mu=0.0, sigma=0.02, shape=len(t)) - t_uncert = t_n - t_diff - - # white noise variance - σ = pm.HalfNormal("σ", sigma=5, testval=1) - y_ = gp.marginal_likelihood("y", X=t_uncert[:, None], y=y_n, noise=σ) -``` - -```{code-cell} ipython3 -with model: - tr = pm.sample(500, chains=2, target_accept=0.95) -``` - -```{code-cell} ipython3 -az.plot_trace(tr, var_names=["η", "ηs", "ℓ", "ℓs", "c", "a", "x0", "σ"]); -``` - -### Predictions - -```{code-cell} ipython3 -tnew = np.linspace(-100, 2300, 2200) * 0.01 - -with model: - fnew = gp.conditional("fnew", Xnew=tnew[:, None]) - -with model: - ppc = pm.sample_posterior_predictive(tr, samples=100, var_names=["fnew"]) -``` - -```{code-cell} ipython3 -samples = y_sd * ppc["fnew"] + y_mu - -fig = plt.figure(figsize=(12, 5)) -ax = plt.gca() -pm.gp.util.plot_gp_dist(ax, samples, tnew, plot_samples=True, palette="Blues") -ax.plot(t / 100, y, "k.") -ax.set_xticks(np.arange(0, 23)) -ax.set_xlim([-1, 23]) -ax.set_ylim([250, 450]) -ax.set_xlabel("time (in centuries)") -ax.set_ylabel("CO2 (ppm)"); -``` - -The predictions for this model look much more realistic. The sum of a 2nd degree polynomial with an `ExpQuad` looks like a good model to forecast with. It allows for -the amount of CO2 to increase in a not-exactly-linear fashion. We can see from the predictions that: - -- The amount of CO2 could increase at a faster rate -- The amount of CO2 should increase more or less linearly -- It is possible for the CO2 to start to decrease - -+++ - -## Incorporating Atmospheric CO2 measurements - -Next, we incorporate the CO2 measurements from the Mauna Loa observatory. These data points were taken monthly from atmospheric levels. Unlike the ice core data, there is no uncertainty in these measurements. While modeling both of these data sets together, the value of including the uncertainty in the ice core measurement time will be more apparent. Hintcasting the Mauna Loa seasonality using ice core data doesn't make too much sense, since the seasonality pattern at the south pole is different than that in the northern hemisphere in Hawaii. We'll show it anyways though since it's possible, and may be useful in other contexts. - -First let's load in the data, and then plot it alongside the ice core data. 
- -```{code-cell} ipython3 -import time - -from datetime import datetime as dt - - -def toYearFraction(date): - date = pd.to_datetime(date) - - def sinceEpoch(date): # returns seconds since epoch - return time.mktime(date.timetuple()) - - s = sinceEpoch - - year = date.year - startOfThisYear = dt(year=year, month=1, day=1) - startOfNextYear = dt(year=year + 1, month=1, day=1) - - yearElapsed = s(date) - s(startOfThisYear) - yearDuration = s(startOfNextYear) - s(startOfThisYear) - fraction = yearElapsed / yearDuration - - return date.year + fraction -``` - -```{code-cell} ipython3 -airdata = pd.read_csv(pm.get_data("monthly_in_situ_co2_mlo.csv"), header=56) - -# - replace -99.99 with NaN -airdata.replace(to_replace=-99.99, value=np.nan, inplace=True) - -# fix column names -cols = [ - "year", - "month", - "--", - "--", - "CO2", - "seasonaly_adjusted", - "fit", - "seasonally_adjusted_fit", - "CO2_filled", - "seasonally_adjusted_filled", -] -airdata.columns = cols -cols.remove("--") -cols.remove("--") -airdata = airdata[cols] - -# drop rows with nan -airdata.dropna(inplace=True) - -# fix time index -airdata["day"] = 15 -airdata.index = pd.to_datetime(airdata[["year", "month", "day"]]) -airdata["year"] = [toYearFraction(date) for date in airdata.index.values] -cols.remove("month") -airdata = airdata[cols] - -air = airdata[["year", "CO2"]] -air.head(5) -``` - -Like was done in the first notebook, we reserve the data from 2004 onwards as the test set. - -```{code-cell} ipython3 -sep_idx = air.index.searchsorted(pd.to_datetime("2003-12-15")) -air_test = air.iloc[sep_idx:, :] -air = air.iloc[: sep_idx + 1, :] -``` - -```{code-cell} ipython3 -plt.plot(air.year.values, air.CO2.values, ".b", label="atmospheric CO2") -plt.plot(ice.year.values, ice.CO2.values, ".", color="c", label="ice core CO2") -plt.legend() -plt.xlabel("year") -plt.ylabel("CO2 (ppm)"); -``` - -If we zoom in on the late 1950's, we can see that the atmospheric data has a seasonal component, while the ice core data does not. - -```{code-cell} ipython3 -plt.plot(air.year.values, air.CO2.values, ".b", label="atmospheric CO2") -plt.plot(ice.year.values, ice.CO2.values, ".", color="c", label="ice core CO2") -plt.xlim([1949, 1965]) -plt.ylim([305, 325]) -plt.legend() -plt.xlabel("year") -plt.ylabel("CO2 (ppm)"); -``` - -Since the ice core data isn't measured accurately, it won't be possible to backcast the seasonal component *unless we model uncertainty in x*. - -+++ - -To model both the data together, we will combine the model we've built up using the ice core data, and combine it with elements from the previous notebook on the Mauna Loa data. From the previous notebook we will additionally include the: - -- The `Periodic`, seasonal component -- The `RatQuad` covariance for short range, annual scale variations - -Also, since we are using two different data sets, there should be two different `y`-direction uncertainties, one for the ice core data, and one for the atmospheric data. To accomplish this, we make a custom `WhiteNoise` covariance function that has two `σ` parameters. - -All custom covariance functions need to have the same three methods defined, `__init__`, `diag`, and `full`. `full` returns the full covariance, given either `X` or `X` and a different `Xs`. `diag` returns only the diagonal, and `__init__` saves the input parameters. 
- -```{code-cell} ipython3 -class CustomWhiteNoise(pm.gp.cov.Covariance): - """Custom White Noise covariance - - sigma1 is applied to the first n1 points in the data - - sigma2 is applied to the next n2 points in the data - - The total number of data points n = n1 + n2 - """ - - def __init__(self, sigma1, sigma2, n1, n2): - super().__init__(1, None) - self.sigma1 = sigma1 - self.sigma2 = sigma2 - self.n1 = n1 - self.n2 = n2 - - def diag(self, X): - d1 = tt.alloc(tt.square(self.sigma1), self.n1) - d2 = tt.alloc(tt.square(self.sigma2), self.n2) - return tt.concatenate((d1, d2), 0) - - def full(self, X, Xs=None): - if Xs is None: - return tt.diag(self.diag(X)) - else: - return tt.alloc(0.0, X.shape[0], Xs.shape[0]) -``` - -Next we need to organize and combine the two data sets. Remember that the unit on the x-axis is centuries, not years. - -```{code-cell} ipython3 -# form dataset, stack t and co2 measurements -t = np.concatenate((ice.year.values, air.year.values), 0) -y = np.concatenate((ice.CO2.values, air.CO2.values), 0) - -y_mu, y_sd = np.mean(ice.CO2.values[0:50]), np.std(y) -y_n = (y - y_mu) / y_sd -t_n = t * 0.01 -``` - -The specification of the model is below. The dataset is larger now, so MCMC will take much longer now. But you will see that estimating the whole posterior is clearly worth the wait! - -We also choose our priors for the hyperparameters more carefully. For the changepoint covariance, we model the post-industrial revolution data with an `ExpQuad` covariance that has the same longer lengthscale as before the industrial revolution. The idea is that whatever process was at work before, is still there after. But then we add the product of a `Polynomial(degree=2)` and a `Matern52`. We fix the lengthscale of the `Matern52` to two. Since it has only been about two centuries since the industrial revolution, we force the Polynomial component to decay at that time scale. This forces the uncertainty to rise at this time scale. - -The 2nd degree polynomial and `Matern52` product expresses our prior belief that the CO2 levels may increase semi-quadratically, or decrease semi-quadratically, since the scaling parameter for this may also end up being zero. 
- -```{code-cell} ipython3 -with pm.Model() as model: - ηc = pm.Gamma("ηc", alpha=3, beta=2) - ℓc = pm.Gamma("ℓc", alpha=10, beta=1) - - # changepoint occurs near the year 1800, sometime between 1760, 1840 - x0 = pm.Normal("x0", mu=18, sigma=0.1) - # the change happens gradually - a = pm.Gamma("a", alpha=3, beta=1) - # constant offset - c = pm.HalfNormal("c", sigma=2) - - # quadratic polynomial scale - ηq = pm.HalfNormal("ηq", sigma=1) - ℓq = 2.0 # 2 century impact, since we only have 2 C of post IR data - - cov1 = ηc**2 * pm.gp.cov.ExpQuad(1, ℓc) - cov2 = ηc**2 * pm.gp.cov.ExpQuad(1, ℓc) + ηq**2 * pm.gp.cov.Polynomial( - 1, x0, 2, c - ) * pm.gp.cov.Matern52( - 1, ℓq - ) # ~2 century impact - - # construct changepoint cov - sc_cov1 = pm.gp.cov.ScaledCov(1, cov1, logistic, (-a, x0)) - sc_cov2 = pm.gp.cov.ScaledCov(1, cov2, logistic, (a, x0)) - gp_c = pm.gp.Marginal(cov_func=sc_cov1 + sc_cov2) - - # short term variation - ηs = pm.HalfNormal("ηs", sigma=3) - ℓs = pm.Gamma("ℓs", alpha=5, beta=100) - α = pm.Gamma("α", alpha=4, beta=1) - cov_s = ηs**2 * pm.gp.cov.RatQuad(1, α, ℓs) - gp_s = pm.gp.Marginal(cov_func=cov_s) - - # medium term variation - ηm = pm.HalfNormal("ηm", sigma=5) - ℓm = pm.Gamma("ℓm", alpha=2, beta=3) - cov_m = ηm**2 * pm.gp.cov.ExpQuad(1, ℓm) - gp_m = pm.gp.Marginal(cov_func=cov_m) - - ## periodic - ηp = pm.HalfNormal("ηp", sigma=2) - ℓp_decay = pm.Gamma("ℓp_decay", alpha=40, beta=0.1) - ℓp_smooth = pm.Normal("ℓp_smooth ", mu=1.0, sigma=0.05) - period = 1 * 0.01 # we know the period is annual - cov_p = ηp**2 * pm.gp.cov.Periodic(1, period, ℓp_smooth) * pm.gp.cov.ExpQuad(1, ℓp_decay) - gp_p = pm.gp.Marginal(cov_func=cov_p) - - gp = gp_c + gp_m + gp_s + gp_p - - # - x location uncertainty (sd = 0.01 is a standard deviation of one year) - # - only the first 111 points are the ice core data - t_mu = t_n[:111] - t_diff = pm.Normal("t_diff", mu=0.0, sigma=0.02, shape=len(t_mu)) - t_uncert = t_mu - t_diff - t_combined = tt.concatenate((t_uncert, t_n[111:]), 0) - - # Noise covariance, using boundary avoiding priors for MAP estimation - σ1 = pm.Gamma("σ1", alpha=3, beta=50) - σ2 = pm.Gamma("σ2", alpha=3, beta=50) - η_noise = pm.HalfNormal("η_noise", sigma=1) - ℓ_noise = pm.Gamma("ℓ_noise", alpha=2, beta=200) - cov_noise = η_noise**2 * pm.gp.cov.Matern32(1, ℓ_noise) + CustomWhiteNoise(σ1, σ2, 111, 545) - - y_ = gp.marginal_likelihood("y", X=t_combined[:, None], y=y_n, noise=cov_noise) -``` - -```{code-cell} ipython3 -with model: - tr = pm.sample(500, tune=1000, chains=2, cores=16, return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(tr, compact=True); -``` - -```{code-cell} ipython3 -tnew = np.linspace(1700, 2040, 3000) * 0.01 -with model: - fnew = gp.conditional("fnew", Xnew=tnew[:, None]) -``` - -```{code-cell} ipython3 -with model: - ppc = pm.sample_posterior_predictive(tr, samples=200, var_names=["fnew"]) -``` - -Below is a plot of the data since the 18th century (Mauna Loa and Law Dome ice core data) used to fit the model. The light blue lines are a bit hard to make out at this level of zoom, but they are samples from the posterior of the Gaussian process. They both interpolate the observed data, and represent plausible trajectories of the future forecast. These samples can alternatively be used to define credible intervals. 
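-
-(As a short aside that is not part of the original analysis, the same posterior samples can also be summarized as a pointwise credible band instead of being plotted individually; a minimal sketch is below.)
-
-```{code-cell} ipython3
-# Added sketch (not part of the original analysis): summarize the posterior
-# samples of the latent function as a pointwise 95% credible band in ppm.
-fnew_ppm = y_sd * ppc["fnew"] + y_mu
-lower, upper = np.percentile(fnew_ppm, [2.5, 97.5], axis=0)
-print(lower.shape, upper.shape)  # one (lower, upper) pair per prediction point
-```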
- -```{code-cell} ipython3 -plt.figure(figsize=(12, 5)) -plt.plot(tnew * 100, y_sd * ppc["fnew"][0:200:5, :].T + y_mu, color="lightblue", alpha=0.8) -plt.plot( - [-1000, -1001], - [-1000, -1001], - color="lightblue", - alpha=0.8, - label="samples from the posterior", -) -plt.plot(t, y, "k.", label="observed data") -plt.plot( - air_test.year.values, - air_test.CO2.values, - ".", - color="orange", - label="test set data", -) -plt.axhline(y=400, color="k", alpha=0.7, linestyle=":") -plt.ylabel("CO2 [ppm]") -plt.xlabel("year") -plt.title("fit and possible forecasts") -plt.legend() -plt.xlim([1700, 2040]) -plt.ylim([260, 460]); -``` - -Let's zoom in for a closer look at the uncertainty intervals at the area around when the CO2 levels first crossed 400 ppm. We can see that the posterior samples give a range of plausible future trajectories. Note that the data plotted in orange **were not** used in fitting the model. - -```{code-cell} ipython3 -plt.figure(figsize=(12, 5)) -plt.plot(tnew * 100, y_sd * ppc["fnew"][0:200:5, :].T + y_mu, color="lightblue", alpha=0.8) -plt.plot( - [-1000, -1001], - [-1000, -1001], - color="lightblue", - alpha=0.8, - label="samples from the posterior", -) -plt.plot( - air_test.year.values, - air_test.CO2.values, - ".", - color="orange", - label="test set data", -) -plt.axhline(y=400, color="k", alpha=0.7, linestyle=":") -plt.ylabel("CO2 [ppm]") -plt.xlabel("year") -plt.title("fit and possible forecasts") -plt.legend() -plt.xlim([2004, 2040]) -plt.ylim([360, 460]); -``` - -If you compare this to the first Mauna Loa example notebook, the predictions are much better. The date when the CO2 level first hits 400 is predicted much more accurately. This improvement in the bias is due to including the `Polynomial * Matern52` term, and the changepoint model. - -We can also look at what the model says about CO2 levels back in time. Since we allowed the `x` measurements to have uncertainty, we are able to fit the seasonal component back in time. To be sure, backcasting Mauna Loa CO2 measurements using ice core data doesn't really make sense from a scientific point of view, because CO2 levels due to seasonal variation are different depending on your location on the planet. Mauna Loa will have a much more pronounced cyclical pattern because the northern hemisphere has much more vegetation. The amount of vegetation largely drives the seasonality due to the growth and die-off of plants in summers and winters. 
But just because it's cool, lets look at the fit of the model here anyway: - -```{code-cell} ipython3 -tnew = np.linspace(11, 32, 500) * 0.01 -with model: - fnew2 = gp.conditional("fnew2", Xnew=tnew[:, None]) -``` - -```{code-cell} ipython3 -with model: - ppc = pm.sample_posterior_predictive(tr, samples=200, var_names=["fnew2"]) -``` - -```{code-cell} ipython3 -plt.figure(figsize=(12, 5)) - -plt.plot(tnew * 100, y_sd * ppc["fnew2"][0:200:10, :].T + y_mu, color="lightblue", alpha=0.8) -plt.plot( - [-1000, -1001], - [-1000, -1001], - color="lightblue", - alpha=0.8, - label="samples from the GP posterior", -) -plt.plot(100 * (t_n[:111][:, None] - tr["t_diff"].T), y[:111], "oy", alpha=0.01) -plt.plot( - [100, 200], - [100, 200], - "oy", - alpha=0.3, - label="data location posterior samples reflecting ice core time measurement uncertainty", -) -plt.plot(t, y, "k.", label="observed data") -plt.plot(air_test.year.values, air_test.CO2.values, ".", color="orange") -plt.legend(loc="upper left") -plt.ylabel("CO2 [ppm]") -plt.xlabel("year") -plt.xlim([12, 31]) -plt.xticks(np.arange(12, 32)) -plt.ylim([272, 283]); -``` - -We can see that far back in time, we can backcast even the seasonal behavior to some degree. The ~two year of uncertainty in the `x` locations allows them to be shifted onto the nearest part of the seasonal oscillation for that year. The magnitude of the oscillation is the same as it is now in modern times. While the cycle in each of the posterior samples still has an annual period, its exact morphology is less certain since we are far in time from the dates when the Mauna Loa data was collected. - -```{code-cell} ipython3 -tnew = np.linspace(-20, 0, 300) * 0.01 -with model: - fnew3 = gp.conditional("fnew3", Xnew=tnew[:, None]) -``` - -```{code-cell} ipython3 -with model: - ppc = pm.sample_posterior_predictive(tr, samples=200, var_names=["fnew3"]) -``` - -```{code-cell} ipython3 -plt.figure(figsize=(12, 5)) - -plt.plot(tnew * 100, y_sd * ppc["fnew3"][0:200:10, :].T + y_mu, color="lightblue", alpha=0.8) -plt.plot( - [-1000, -1001], - [-1000, -1001], - color="lightblue", - alpha=0.8, - label="samples from the GP posterior", -) -plt.legend(loc="upper left") -plt.ylabel("CO2 [ppm]") -plt.xlabel("year") -plt.xlim([-20, 0]) -plt.ylim([272, 283]); -``` - -Even as we go back before the year zero BCE, the general backcasted seasonality pattern remains intact, though it does begin to vary more wildly. - -+++ - -### Conclusion - -The goal of this notebook is to help provide some ideas of ways to take advantage of the flexibility of PyMC3's GP modeling capabilities. Data rarely comes in neat, evenly sampled intervals from a single source, which is no problem for GP models in general. To enable modeling interesting behavior, it is easy to define custom covariance and mean functions. There is no need to worry about figuring out the gradients, since this is taken care of by Theano's autodiff capabilities. Being able to use the extremely high quality NUTS sampler in PyMC3 with GP models means that it's possible to use samples from the posterior distribution as possible forecasts, which take into account uncertainty in the mean and covariance function hyperparameters. 
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/gaussian_processes/GP-MeansAndCovs.myst.md b/myst_nbs/gaussian_processes/GP-MeansAndCovs.myst.md deleted file mode 100644 index 9ff935c68..000000000 --- a/myst_nbs/gaussian_processes/GP-MeansAndCovs.myst.md +++ /dev/null @@ -1,1268 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GP-MeansAndCovs)= -# Mean and Covariance Functions - -:::{post} Mar 22, 2022 -:tags: gaussian process -:category: intermediate, reference -:author: Bill Engels, Oriol Abril Pla -::: - -```{code-cell} ipython3 -%matplotlib inline -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 5.306978 - end_time: '2020-12-22T18:36:31.587812' - exception: false - start_time: '2020-12-22T18:36:26.280834' - status: completed -tags: [] ---- -import aesara -import aesara.tensor as at -import arviz as az -import matplotlib.cm as cmap -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import scipy.stats as stats -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.047175 - end_time: '2020-12-22T18:36:31.674100' - exception: false - start_time: '2020-12-22T18:36:31.626925' - status: completed -tags: [] ---- -RANDOM_SEED = 8927 - -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -plt.rcParams["figure.figsize"] = (10, 4) -``` - -+++ {"papermill": {"duration": 0.037844, "end_time": "2020-12-22T18:36:31.751886", "exception": false, "start_time": "2020-12-22T18:36:31.714042", "status": "completed"}, "tags": []} - -A large set of mean and covariance functions are available in PyMC. It is relatively easy to define custom mean and covariance functions. Since PyMC uses Aesara, their gradients do not need to be defined by the user. - -## Mean functions - -The following mean functions are available in PyMC. - -- {class}`pymc.gp.mean.Zero` -- {class}`pymc.gp.mean.Constant` -- {class}`pymc.gp.mean.Linear` - -All follow a similar usage pattern. First, the mean function is specified. Then it can be evaluated over some inputs. The first two mean functions are very simple. Regardless of the inputs, `gp.mean.Zero` returns a vector of zeros with the same length as the number of input values. - -### Zero - -```{code-cell} ipython3 ---- -papermill: - duration: 1.075408 - end_time: '2020-12-22T18:36:32.865469' - exception: false - start_time: '2020-12-22T18:36:31.790061' - status: completed -tags: [] ---- -zero_func = pm.gp.mean.Zero() - -X = np.linspace(0, 1, 5)[:, None] -print(zero_func(X).eval()) -``` - -+++ {"papermill": {"duration": 0.040891, "end_time": "2020-12-22T18:36:32.947028", "exception": false, "start_time": "2020-12-22T18:36:32.906137", "status": "completed"}, "tags": []} - -The default mean functions for all GP implementations in PyMC is `Zero`. - -### Constant - -`gp.mean.Constant` returns a vector whose value is provided. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 2.12553 - end_time: '2020-12-22T18:36:35.113789' - exception: false - start_time: '2020-12-22T18:36:32.988259' - status: completed -tags: [] ---- -const_func = pm.gp.mean.Constant(25.2) - -print(const_func(X).eval()) -``` - -+++ {"papermill": {"duration": 0.039627, "end_time": "2020-12-22T18:36:35.195057", "exception": false, "start_time": "2020-12-22T18:36:35.155430", "status": "completed"}, "tags": []} - -As long as the shape matches the input it will receive, `gp.mean.Constant` can also accept a Aesara tensor or vector of PyMC random variables. - -```{code-cell} ipython3 ---- -papermill: - duration: 1.408839 - end_time: '2020-12-22T18:36:36.644770' - exception: false - start_time: '2020-12-22T18:36:35.235931' - status: completed -tags: [] ---- -const_func_vec = pm.gp.mean.Constant(at.ones(5)) - -print(const_func_vec(X).eval()) -``` - -+++ {"papermill": {"duration": 0.04127, "end_time": "2020-12-22T18:36:36.726017", "exception": false, "start_time": "2020-12-22T18:36:36.684747", "status": "completed"}, "tags": []} - -### Linear - -`gp.mean.Linear` is a takes as input a matrix of coefficients and a vector of intercepts (or a slope and scalar intercept in one dimension). - -```{code-cell} ipython3 ---- -papermill: - duration: 0.073879 - end_time: '2020-12-22T18:36:36.839351' - exception: false - start_time: '2020-12-22T18:36:36.765472' - status: completed -tags: [] ---- -beta = rng.normal(size=3) -b = 0.0 - -lin_func = pm.gp.mean.Linear(coeffs=beta, intercept=b) - -X = rng.normal(size=(5, 3)) -print(lin_func(X).eval()) -``` - -+++ {"papermill": {"duration": 0.03931, "end_time": "2020-12-22T18:36:36.918672", "exception": false, "start_time": "2020-12-22T18:36:36.879362", "status": "completed"}, "tags": []} - -## Defining a custom mean function - -To define a custom mean function, subclass `gp.mean.Mean`, and provide `__call__` and `__init__` methods. For example, the code for the `Constant` mean function is - -```python -import theano.tensor as tt - -class Constant(pm.gp.mean.Mean): - - def __init__(self, c=0): - Mean.__init__(self) - self.c = c - - def __call__(self, X): - return tt.alloc(1.0, X.shape[0]) * self.c - -``` - -Remember that Aesara must be used instead of NumPy. - -+++ {"papermill": {"duration": 0.039306, "end_time": "2020-12-22T18:36:36.998649", "exception": false, "start_time": "2020-12-22T18:36:36.959343", "status": "completed"}, "tags": []} - -## Covariance functions - -PyMC contains a much larger suite of {mod}`built-in covariance functions `. The following shows functions drawn from a GP prior with a given covariance function, and demonstrates how composite covariance functions can be constructed with Python operators in a straightforward manner. Our goal was for our API to follow kernel algebra (see Ch.4 of {cite:t}`rasmussen2003gaussian`) as closely as possible. See the main documentation page for an overview on their usage in PyMC. 
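Before going through the individual kernels, here is a minimal sketch of what composing covariance functions with Python operators looks like in practice. It assumes only the `pm` import from the setup cell above; the specific kernels and hyperparameter values are arbitrary illustrations, not recommendations.

```python
# A small sketch of kernel algebra with ordinary Python operators.
# Sums, products, scalar scalings, and integer powers of covariance
# functions are themselves valid covariance functions.
k_smooth = 2.0**2 * pm.gp.cov.ExpQuad(1, ls=1.0)        # scaled smooth component
k_seasonal = pm.gp.cov.Periodic(1, period=0.5, ls=0.3)  # periodic component
k_jitter = pm.gp.cov.WhiteNoise(1e-6)                   # small diagonal term for stability

k_sum = k_smooth + k_seasonal + k_jitter  # addition
k_prod = k_smooth * k_seasonal            # multiplication
k_pow = k_smooth**2                       # exponentiation by a scalar
```

Each composite object behaves exactly like a single built-in covariance function: it can be evaluated on inputs with `.eval()` or passed to any GP implementation. The sections below look at the building blocks one by one.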
- -+++ {"papermill": {"duration": 0.039789, "end_time": "2020-12-22T18:36:37.078199", "exception": false, "start_time": "2020-12-22T18:36:37.038410", "status": "completed"}, "tags": []} - -### Exponentiated Quadratic - -$$ -k(x, x') = \mathrm{exp}\left[ -\frac{(x - x')^2}{2 \ell^2} \right] -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 7.505078 - end_time: '2020-12-22T18:36:44.626679' - exception: false - start_time: '2020-12-22T18:36:37.121601' - status: completed -tags: [] ---- -lengthscale = 0.2 -eta = 2.0 -cov = eta**2 * pm.gp.cov.ExpQuad(1, lengthscale) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw( - pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=K.shape[0]), draws=3, random_seed=rng - ).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.042546, "end_time": "2020-12-22T18:36:44.712169", "exception": false, "start_time": "2020-12-22T18:36:44.669623", "status": "completed"}, "tags": []} - -### Two (and higher) Dimensional Inputs - -#### Both dimensions active - -It is easy to define kernels with higher dimensional inputs. Notice that the ```ls``` (lengthscale) parameter is an array of length 2. Lists of PyMC random variables can be used for automatic relevance determination (ARD). - -```{code-cell} ipython3 ---- -papermill: - duration: 3.19044 - end_time: '2020-12-22T18:36:47.946218' - exception: false - start_time: '2020-12-22T18:36:44.755778' - status: completed -tags: [] ---- -x1 = np.linspace(0, 1, 10) -x2 = np.arange(1, 4) -# Cartesian product -X2 = np.dstack(np.meshgrid(x1, x2)).reshape(-1, 2) - -ls = np.array([0.2, 1.0]) -cov = pm.gp.cov.ExpQuad(input_dim=2, ls=ls) - -m = plt.imshow(cov(X2).eval(), cmap="inferno", interpolation="none") -plt.colorbar(m); -``` - -+++ {"papermill": {"duration": 0.043142, "end_time": "2020-12-22T18:36:48.032797", "exception": false, "start_time": "2020-12-22T18:36:47.989655", "status": "completed"}, "tags": []} - -#### One dimension active - -```{code-cell} ipython3 ---- -papermill: - duration: 0.673374 - end_time: '2020-12-22T18:36:48.749451' - exception: false - start_time: '2020-12-22T18:36:48.076077' - status: completed -tags: [] ---- -ls = 0.2 -cov = pm.gp.cov.ExpQuad(input_dim=2, ls=ls, active_dims=[0]) - -m = plt.imshow(cov(X2).eval(), cmap="inferno", interpolation="none") -plt.colorbar(m); -``` - -+++ {"papermill": {"duration": 0.045376, "end_time": "2020-12-22T18:36:48.840086", "exception": false, "start_time": "2020-12-22T18:36:48.794710", "status": "completed"}, "tags": []} - -#### Product of covariances over different dimensions - -Note that this is equivalent to using a two dimensional `ExpQuad` with separate lengthscale parameters for each dimension. 
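The next cell visualizes this product kernel. As a quick numerical check of the equivalence claim, one could also compare the product directly against a single two-dimensional `ExpQuad`; the sketch below reuses the `X2` grid defined above:

```python
# Sketch: the product of two ExpQuad kernels, each active on one dimension,
# should match a single 2-D ExpQuad with per-dimension lengthscales.
prod_cov = pm.gp.cov.ExpQuad(2, ls=0.2, active_dims=[0]) * pm.gp.cov.ExpQuad(
    2, ls=1.0, active_dims=[1]
)
joint_cov = pm.gp.cov.ExpQuad(input_dim=2, ls=np.array([0.2, 1.0]))

assert np.allclose(prod_cov(X2).eval(), joint_cov(X2).eval())
```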
- -```{code-cell} ipython3 ---- -papermill: - duration: 1.600894 - end_time: '2020-12-22T18:36:50.486049' - exception: false - start_time: '2020-12-22T18:36:48.885155' - status: completed -tags: [] ---- -ls1 = 0.2 -ls2 = 1.0 -cov1 = pm.gp.cov.ExpQuad(2, ls1, active_dims=[0]) -cov2 = pm.gp.cov.ExpQuad(2, ls2, active_dims=[1]) -cov = cov1 * cov2 - -m = plt.imshow(cov(X2).eval(), cmap="inferno", interpolation="none") -plt.colorbar(m); -``` - -+++ {"papermill": {"duration": 0.046821, "end_time": "2020-12-22T18:36:50.579012", "exception": false, "start_time": "2020-12-22T18:36:50.532191", "status": "completed"}, "tags": []} - -### White Noise - -$$ -k(x, x') = \sigma^2 \mathrm{I}_{xx} -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 0.99526 - end_time: '2020-12-22T18:36:51.620630' - exception: false - start_time: '2020-12-22T18:36:50.625370' - status: completed -tags: [] ---- -sigma = 2.0 -cov = pm.gp.cov.WhiteNoise(sigma) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.05125, "end_time": "2020-12-22T18:36:51.723154", "exception": false, "start_time": "2020-12-22T18:36:51.671904", "status": "completed"}, "tags": []} - -### Constant - -$$ -k(x, x') = c -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 1.931356 - end_time: '2020-12-22T18:36:53.705539' - exception: false - start_time: '2020-12-22T18:36:51.774183' - status: completed -tags: [] ---- -c = 2.0 -cov = pm.gp.cov.Constant(c) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.051694, "end_time": "2020-12-22T18:36:53.810105", "exception": false, "start_time": "2020-12-22T18:36:53.758411", "status": "completed"}, "tags": []} - -### Rational Quadratic - -$$ -k(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha\ell^2} \right)^{-\alpha} -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 2.381363 - end_time: '2020-12-22T18:36:56.245016' - exception: false - start_time: '2020-12-22T18:36:53.863653' - status: completed -tags: [] ---- -alpha = 0.1 -ls = 0.2 -tau = 2.0 -cov = tau * pm.gp.cov.RatQuad(1, ls, alpha) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.055808, "end_time": "2020-12-22T18:36:56.357806", "exception": false, "start_time": "2020-12-22T18:36:56.301998", "status": "completed"}, "tags": []} - -### Exponential - -$$ -k(x, x') = \mathrm{exp}\left[ -\frac{||x - x'||}{2\ell^2} \right] -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 1.343198 - end_time: '2020-12-22T18:36:57.756310' - exception: false - start_time: '2020-12-22T18:36:56.413112' - status: completed -tags: [] ---- -inverse_lengthscale = 5 -cov = pm.gp.cov.Exponential(1, ls_inv=inverse_lengthscale) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, 
random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.058891, "end_time": "2020-12-22T18:36:57.874371", "exception": false, "start_time": "2020-12-22T18:36:57.815480", "status": "completed"}, "tags": []} - -### Matern 5/2 - -$$ -k(x, x') = \left(1 + \frac{\sqrt{5(x - x')^2}}{\ell} + - \frac{5(x-x')^2}{3\ell^2}\right) - \mathrm{exp}\left[ - \frac{\sqrt{5(x - x')^2}}{\ell} \right] -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 2.417182 - end_time: '2020-12-22T18:37:00.350538' - exception: false - start_time: '2020-12-22T18:36:57.933356' - status: completed -tags: [] ---- -ls = 0.2 -tau = 2.0 -cov = tau * pm.gp.cov.Matern52(1, ls) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.061663, "end_time": "2020-12-22T18:37:00.473343", "exception": false, "start_time": "2020-12-22T18:37:00.411680", "status": "completed"}, "tags": []} - -### Matern 3/2 - -$$ -k(x, x') = \left(1 + \frac{\sqrt{3(x - x')^2}}{\ell}\right) - \mathrm{exp}\left[ - \frac{\sqrt{3(x - x')^2}}{\ell} \right] -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 0.494084 - end_time: '2020-12-22T18:37:01.028428' - exception: false - start_time: '2020-12-22T18:37:00.534344' - status: completed -tags: [] ---- -ls = 0.2 -tau = 2.0 -cov = tau * pm.gp.cov.Matern32(1, ls) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.064186, "end_time": "2020-12-22T18:37:01.159126", "exception": false, "start_time": "2020-12-22T18:37:01.094940", "status": "completed"}, "tags": []} - -### Matern 1/2 - -$$k(x, x') = \mathrm{exp}\left[ -\frac{(x - x')^2}{\ell} \right]$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 0.477568 - end_time: '2020-12-22T18:37:01.701402' - exception: false - start_time: '2020-12-22T18:37:01.223834' - status: completed -tags: [] ---- -ls = 0.2 -tau = 2.0 -cov = tau * pm.gp.cov.Matern12(1, ls) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.068504, "end_time": "2020-12-22T18:37:01.837835", "exception": false, "start_time": "2020-12-22T18:37:01.769331", "status": "completed"}, "tags": []} - -### Cosine - -$$ -k(x, x') = \mathrm{cos}\left( 2 \pi \frac{||x - x'||}{ \ell^2} \right) -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 1.457975 - end_time: '2020-12-22T18:37:03.365039' - exception: false - start_time: '2020-12-22T18:37:01.907064' - status: completed -tags: [] ---- -period = 0.5 -cov = pm.gp.cov.Cosine(1, period) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-4) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.077444, 
"end_time": "2020-12-22T18:37:03.548722", "exception": false, "start_time": "2020-12-22T18:37:03.471278", "status": "completed"}, "tags": []} - -### Linear - -$$ -k(x, x') = (x - c)(x' - c) -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 1.524742 - end_time: '2020-12-22T18:37:05.145867' - exception: false - start_time: '2020-12-22T18:37:03.621125' - status: completed -tags: [] ---- -c = 1.0 -tau = 2.0 -cov = tau * pm.gp.cov.Linear(1, c) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.073236, "end_time": "2020-12-22T18:37:05.293217", "exception": false, "start_time": "2020-12-22T18:37:05.219981", "status": "completed"}, "tags": []} - -### Polynomial - -$$ -k(x, x') = [(x - c)(x' - c) + \mathrm{offset}]^{d} -$$ - -```{code-cell} ipython3 ---- -papermill: - duration: 1.371418 - end_time: '2020-12-22T18:37:06.738888' - exception: false - start_time: '2020-12-22T18:37:05.367470' - status: completed -tags: [] ---- -c = 1.0 -d = 3 -offset = 1.0 -tau = 0.1 -cov = tau * pm.gp.cov.Polynomial(1, c=c, d=d, offset=offset) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.07702, "end_time": "2020-12-22T18:37:06.892733", "exception": false, "start_time": "2020-12-22T18:37:06.815713", "status": "completed"}, "tags": []} - -### Multiplication with a precomputed covariance matrix - -A covariance function ```cov``` can be multiplied with numpy matrix, ```K_cos```, as long as the shapes are appropriate. - -```{code-cell} ipython3 ---- -papermill: - duration: 1.546032 - end_time: '2020-12-22T18:37:08.514887' - exception: false - start_time: '2020-12-22T18:37:06.968855' - status: completed -tags: [] ---- -# first evaluate a covariance function into a matrix -period = 0.2 -cov_cos = pm.gp.cov.Cosine(1, period) -K_cos = cov_cos(X).eval() - -# now multiply it with a covariance *function* -cov = pm.gp.cov.Matern32(1, 0.5) * K_cos - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.078461, "end_time": "2020-12-22T18:37:08.672218", "exception": false, "start_time": "2020-12-22T18:37:08.593757", "status": "completed"}, "tags": []} - -### Applying an arbitrary warping function on the inputs - -If $k(x, x')$ is a valid covariance function, then so is $k(w(x), w(x'))$. - -The first argument of the warping function must be the input ```X```. The remaining arguments can be anything else, including random variables. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 6.061177 - end_time: '2020-12-22T18:37:14.812998' - exception: false - start_time: '2020-12-22T18:37:08.751821' - status: completed -tags: [] ---- -def warp_func(x, a, b, c): - return 1.0 + x + (a * at.tanh(b * (x - c))) - - -a = 1.0 -b = 5.0 -c = 1.0 - -cov_exp = pm.gp.cov.ExpQuad(1, 0.2) -cov = pm.gp.cov.WarpedInput(1, warp_func=warp_func, args=(a, b, c), cov_func=cov_exp) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 400)[:, None] -wf = warp_func(X.flatten(), a, b, c).eval() - -plt.plot(X, wf) -plt.xlabel("X") -plt.ylabel("warp_func(X)") -plt.title("The warping function used") - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.085228, "end_time": "2020-12-22T18:37:14.983640", "exception": false, "start_time": "2020-12-22T18:37:14.898412", "status": "completed"}, "tags": []} - -### Constructing `Periodic` using `WarpedInput` - -The `WarpedInput` kernel can be used to create the `Periodic` covariance. This covariance models functions that are periodic, but are not an exact sine wave (like the `Cosine` kernel is). - -The periodic kernel is given by - -$$ -k(x, x') = \exp\left( -\frac{2 \sin^{2}(\pi |x - x'|\frac{1}{T})}{\ell^2} \right) -$$ - -Where T is the period, and $\ell$ is the lengthscale. It can be derived by warping the input of an `ExpQuad` kernel with the function $\mathbf{u}(x) = (\sin(2\pi x \frac{1}{T})\,, \cos(2 \pi x \frac{1}{T}))$. Here we use the `WarpedInput` kernel to construct it. - -The input `X`, which is defined at the top of this page, is 2 "seconds" long. We use a period of $0.5$, which means that functions -drawn from this GP prior will repeat 4 times over 2 seconds. - -```{code-cell} ipython3 ---- -papermill: - duration: 3.628528 - end_time: '2020-12-22T18:37:18.698932' - exception: false - start_time: '2020-12-22T18:37:15.070404' - status: completed -tags: [] ---- -def mapping(x, T): - c = 2.0 * np.pi * (1.0 / T) - u = at.concatenate((at.sin(c * x), at.cos(c * x)), 1) - return u - - -T = 0.6 -ls = 0.4 -# note that the input of the covariance function taking -# the inputs is 2 dimensional -cov_exp = pm.gp.cov.ExpQuad(2, ls) -cov = pm.gp.cov.WarpedInput(1, cov_func=cov_exp, warp_func=mapping, args=(T,)) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.089186, "end_time": "2020-12-22T18:37:18.877629", "exception": false, "start_time": "2020-12-22T18:37:18.788443", "status": "completed"}, "tags": []} - -### Periodic - -There is no need to construct the periodic covariance this way every time. A more efficient implementation of this covariance function is built in. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 2.454314 - end_time: '2020-12-22T18:37:21.420790' - exception: false - start_time: '2020-12-22T18:37:18.966476' - status: completed -tags: [] ---- -period = 0.6 -ls = 0.4 -cov = pm.gp.cov.Periodic(1, period=period, ls=ls) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -for p in np.arange(0, 2, period): - plt.axvline(p, color="black") -plt.axhline(0, color="black") -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.090578, "end_time": "2020-12-22T18:37:21.604122", "exception": false, "start_time": "2020-12-22T18:37:21.513544", "status": "completed"}, "tags": []} - -### Circular - -Circular kernel is similar to Periodic one but has an additional nuisance parameter $\tau$ - -In {cite:t}`padonou2015polar`, the Weinland function is used to solve the problem and ensures positive definite kernel on the circular domain (and not only). - -$$ -W_c(t) = \left(1 + \tau \frac{t}{c}\right)\left(1-\frac{t}{c}\right)_+^\tau -$$ -where $c$ is maximum value for $t$ and $\tau\ge 4$ is some positive number - -The kernel itself for geodesic distance (arc length) on a circle looks like - -$$ -k_g(x, y) = W_\pi(\text{dist}_{\mathit{geo}}(x, y)) -$$ - -Briefly, you can think - -* $t$ is time, it runs from $0$ to $24$ and then goes back to $0$ -* $c$ is maximum distance between any timestamps, here it would be $12$ -* $\tau$ controls for correlation strength, larger $\tau$ leads to less smooth functions - -```{code-cell} ipython3 ---- -papermill: - duration: 4.35163 - end_time: '2020-12-22T18:37:26.047326' - exception: false - start_time: '2020-12-22T18:37:21.695696' - status: completed -tags: [] ---- -period = 0.6 -tau = 4 -cov = pm.gp.cov.Circular(1, period=period, tau=tau) - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -for p in np.arange(0, 2, period): - plt.axvline(p, color="black") -plt.axhline(0, color="black") -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.094257, "end_time": "2020-12-22T18:37:26.237410", "exception": false, "start_time": "2020-12-22T18:37:26.143153", "status": "completed"}, "tags": []} - -We can see the effect of $\tau$, it adds more non-smooth patterns - -```{code-cell} ipython3 ---- -papermill: - duration: 0.613972 - end_time: '2020-12-22T18:37:26.946669' - exception: false - start_time: '2020-12-22T18:37:26.332697' - status: completed -tags: [] ---- -period = 0.6 -tau = 40 -cov = pm.gp.cov.Circular(1, period=period, tau=tau) - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -for p in np.arange(0, 2, period): - plt.axvline(p, color="black") -plt.axhline(0, color="black") -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.099739, "end_time": "2020-12-22T18:37:27.146953", "exception": false, "start_time": "2020-12-22T18:37:27.047214", "status": "completed"}, "tags": []} - -### Gibbs - -The Gibbs covariance function applies a positive definite warping function to the lengthscale. 
Similarly to ```WarpedInput```, the lengthscale warping function can be specified with parameters that are either fixed or random variables. - -```{code-cell} ipython3 ---- -papermill: - duration: 4.779819 - end_time: '2020-12-22T18:37:32.026714' - exception: false - start_time: '2020-12-22T18:37:27.246895' - status: completed -tags: [] ---- -def tanh_func(x, ls1, ls2, w, x0): - """ - ls1: left saturation value - ls2: right saturation value - w: transition width - x0: transition location. - """ - return (ls1 + ls2) / 2.0 - (ls1 - ls2) / 2.0 * at.tanh((x - x0) / w) - - -ls1 = 0.05 -ls2 = 0.6 -w = 0.3 -x0 = 1.0 -cov = pm.gp.cov.Gibbs(1, tanh_func, args=(ls1, ls2, w, x0)) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -wf = tanh_func(X, ls1, ls2, w, x0).eval() -plt.plot(X, wf) -plt.ylabel("lengthscale") -plt.xlabel("X") -plt.title("Lengthscale as a function of X") - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.106362, "end_time": "2020-12-22T18:37:32.238582", "exception": false, "start_time": "2020-12-22T18:37:32.132220", "status": "completed"}, "tags": []} - -### Scaled Covariance - -One can construct a new kernel or covariance function by multiplying some base kernel by a nonnegative function $\phi(x)$, - -$$ -k_{\mathrm{scaled}}(x, x') = \phi(x) k_{\mathrm{base}}(x, x') \phi(x') \,. -$$ - -This is useful for specifying covariance functions whose amplitude changes across the domain. - -```{code-cell} ipython3 ---- -papermill: - duration: 6.455011 - end_time: '2020-12-22T18:37:38.798884' - exception: false - start_time: '2020-12-22T18:37:32.343873' - status: completed -tags: [] ---- -def logistic(x, a, x0, c, d): - # a is the slope, x0 is the location - return d * pm.math.invlogit(a * (x - x0)) + c - - -a = 2.0 -x0 = 5.0 -c = 0.1 -d = 2.0 - -cov_base = pm.gp.cov.ExpQuad(1, 0.2) -cov = pm.gp.cov.ScaledCov(1, scaling_func=logistic, args=(a, x0, c, d), cov_func=cov_base) -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-5) - -X = np.linspace(0, 10, 400)[:, None] -lfunc = logistic(X.flatten(), a, b, c, d).eval() - -plt.plot(X, lfunc) -plt.xlabel("X") -plt.ylabel(r"$\phi(x)$") -plt.title("The scaling function") - -K = cov(X).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.109017, "end_time": "2020-12-22T18:37:39.017681", "exception": false, "start_time": "2020-12-22T18:37:38.908664", "status": "completed"}, "tags": []} - -### Constructing a Changepoint kernel using `ScaledCov` - -The `ScaledCov` kernel can be used to create the `Changepoint` covariance. This covariance models -a process that gradually transitions from one type of behavior to another. - -The changepoint kernel is given by - -$$ -k(x, x') = \phi(x)k_{1}(x, x')\phi(x) + (1 - \phi(x))k_{2}(x, x')(1 - \phi(x')) -$$ - -where $\phi(x)$ is the logistic function. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 2.436655 - end_time: '2020-12-22T18:37:41.563496' - exception: false - start_time: '2020-12-22T18:37:39.126841' - status: completed -tags: [] ---- -def logistic(x, a, x0): - # a is the slope, x0 is the location - return pm.math.invlogit(a * (x - x0)) - - -a = 2.0 -x0 = 5.0 - -cov1 = pm.gp.cov.ScaledCov( - 1, scaling_func=logistic, args=(-a, x0), cov_func=pm.gp.cov.ExpQuad(1, 0.2) -) -cov2 = pm.gp.cov.ScaledCov( - 1, scaling_func=logistic, args=(a, x0), cov_func=pm.gp.cov.Cosine(1, 0.5) -) -cov = cov1 + cov2 -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-5) - -X = np.linspace(0, 10, 400) -plt.fill_between( - X, - np.zeros(400), - logistic(X, -a, x0).eval(), - label="ExpQuad region", - color="slateblue", - alpha=0.4, -) -plt.fill_between( - X, np.zeros(400), logistic(X, a, x0).eval(), label="Cosine region", color="firebrick", alpha=0.4 -) -plt.legend() -plt.xlabel("X") -plt.ylabel(r"$\phi(x)$") -plt.title("The two scaling functions") - -K = cov(X[:, None]).eval() -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.123091, "end_time": "2020-12-22T18:37:41.801550", "exception": false, "start_time": "2020-12-22T18:37:41.678459", "status": "completed"}, "tags": []} - -### Combination of two or more Covariance functions - -You can combine different covariance functions to model complex data. - -In particular, you can perform the following operations on any covaraince functions: - -- Add other covariance function with equal or broadcastable dimensions with first covariance function -- Multiply with a scalar or a covariance function with equal or broadcastable dimensions with first covariance function -- Exponentiate with a scalar. 
- -+++ {"papermill": {"duration": 0.114783, "end_time": "2020-12-22T18:37:42.043753", "exception": false, "start_time": "2020-12-22T18:37:41.928970", "status": "completed"}, "tags": []} - -#### Addition - -```{code-cell} ipython3 ---- -papermill: - duration: 0.565388 - end_time: '2020-12-22T18:37:42.722540' - exception: false - start_time: '2020-12-22T18:37:42.157152' - status: completed -tags: [] ---- -ls_1 = 0.1 -tau_1 = 2.0 -ls_2 = 0.5 -tau_2 = 1.0 -cov_1 = tau_1 * pm.gp.cov.ExpQuad(1, ls=ls_1) -cov_2 = tau_2 * pm.gp.cov.ExpQuad(1, ls=ls_2) - -cov = cov_1 + cov_2 -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.11646, "end_time": "2020-12-22T18:37:42.956319", "exception": false, "start_time": "2020-12-22T18:37:42.839859", "status": "completed"}, "tags": []} - -#### Multiplication - -```{code-cell} ipython3 ---- -papermill: - duration: 0.554047 - end_time: '2020-12-22T18:37:43.627013' - exception: false - start_time: '2020-12-22T18:37:43.072966' - status: completed -tags: [] ---- -ls_1 = 0.1 -tau_1 = 2.0 -ls_2 = 0.5 -tau_2 = 1.0 -cov_1 = tau_1 * pm.gp.cov.ExpQuad(1, ls=ls_1) -cov_2 = tau_2 * pm.gp.cov.ExpQuad(1, ls=ls_2) - -cov = cov_1 * cov_2 -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.125568, "end_time": "2020-12-22T18:37:43.873379", "exception": false, "start_time": "2020-12-22T18:37:43.747811", "status": "completed"}, "tags": []} - -#### Exponentiation - -```{code-cell} ipython3 ---- -papermill: - duration: 0.525416 - end_time: '2020-12-22T18:37:44.521691' - exception: false - start_time: '2020-12-22T18:37:43.996275' - status: completed -tags: [] ---- -ls_1 = 0.1 -tau_1 = 2.0 -power = 2 -cov_1 = tau_1 * pm.gp.cov.ExpQuad(1, ls=ls_1) - -cov = cov_1**power -# Add white noise to stabilise -cov += pm.gp.cov.WhiteNoise(1e-6) - -X = np.linspace(0, 2, 200)[:, None] -K = cov(X).eval() - -plt.plot( - X, - pm.draw(pm.MvNormal.dist(mu=np.zeros(len(K)), cov=K, shape=len(K)), draws=3, random_seed=rng).T, -) -plt.title("Samples from the GP prior") -plt.ylabel("y") -plt.xlabel("X"); -``` - -+++ {"papermill": {"duration": 0.124028, "end_time": "2020-12-22T18:37:44.770709", "exception": false, "start_time": "2020-12-22T18:37:44.646681", "status": "completed"}, "tags": []} - -### Defining a custom covariance function - -Covariance function objects in PyMC need to implement the `__init__`, `diag`, and `full` methods, and subclass `gp.cov.Covariance`. `diag` returns only the diagonal of the covariance matrix, and `full` returns the full covariance matrix. The `full` method has two inputs `X` and `Xs`. `full(X)` returns the square covariance matrix, and `full(X, Xs)` returns the cross-covariances between the two sets of inputs. 
- -For example, here is the implementation of the `WhiteNoise` covariance function: - -```python -class WhiteNoise(pm.gp.cov.Covariance): - def __init__(self, sigma): - super(WhiteNoise, self).__init__(1, None) - self.sigma = sigma - - def diag(self, X): - return tt.alloc(tt.square(self.sigma), X.shape[0]) - - def full(self, X, Xs=None): - if Xs is None: - return tt.diag(self.diag(X)) - else: - return tt.alloc(0.0, X.shape[0], Xs.shape[0]) -``` - -If we have forgotten an important covariance or mean function, please feel free to submit a pull request! - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -* Authored by Bill Engels -* Updated to v4 by Oriol Abril Pla in Nov 2022 ([pymc-examples#301](https://github.com/pymc-devs/pymc-examples/pull/301)) - -+++ - -## Watermark - -```{code-cell} ipython3 ---- -papermill: - duration: 0.212109 - end_time: '2020-12-22T18:37:55.023502' - exception: false - start_time: '2020-12-22T18:37:54.811393' - status: completed -tags: [] ---- -%load_ext watermark -%watermark -n -u -v -iv -w -p aeppl,xarray -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/gaussian_processes/GP-SparseApprox.myst.md b/myst_nbs/gaussian_processes/GP-SparseApprox.myst.md deleted file mode 100644 index 7574bc5ad..000000000 --- a/myst_nbs/gaussian_processes/GP-SparseApprox.myst.md +++ /dev/null @@ -1,185 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Sparse Approximations - - -The `gp.MarginalSparse` class implements sparse, or inducing point, GP approximations. It works identically to `gp.Marginal`, except it additionally requires the locations of the inducing points (denoted `Xu`), and it accepts the argument `sigma` instead of `noise` because these sparse approximations assume white IID noise. - -Three approximations are currently implemented, FITC, DTC and VFE. For most problems, they produce fairly similar results. These GP approximations don't form the full covariance matrix over all $n$ training inputs. Instead they rely on $m < n$ *inducing points*, which are "strategically" placed throughout the domain. Both of these approximations reduce the $\mathcal{O(n^3)}$ complexity of GPs down to $\mathcal{O(nm^2)}$ --- a significant speed up. The memory requirements scale down a bit too, but not as much. They are commonly referred to as *sparse* approximations, in the sense of being data sparse. The downside of sparse approximations is that they reduce the expressiveness of the GP. Reducing the dimension of the covariance matrix effectively reduces the number of covariance matrix eigenvectors that can be used to fit the data. - -A choice that needs to be made is where to place the inducing points. One option is to use a subset of the inputs. Another possibility is to use K-means. The location of the inducing points can also be an unknown and optimized as part of the model. These sparse approximations are useful for speeding up calculations when the density of data points is high and the lengthscales is larger than the separations between inducing points. - -For more information on these approximations, see [Quinonero-Candela+Rasmussen, 2006](http://www.jmlr.org/papers/v6/quinonero-candela05a.html) and [Titsias 2009](https://pdfs.semanticscholar.org/9c13/b87b5efb4bb011acc89d90b15f637fa48593.pdf). 
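As a concrete illustration of the first option mentioned above, using a subset of the inputs as inducing points could look like the following sketch. Here `X` stands for the training inputs defined in the examples below, and 20 inducing points is an arbitrary choice:

```python
# Sketch: initialize inducing point locations as a random subset of the training inputs.
rng = np.random.RandomState(1)
idx = rng.choice(X.shape[0], size=20, replace=False)
Xu_subset = X[idx]  # shape (20, 1); usable as the `Xu` argument of marginal_likelihood
```

The examples that follow instead demonstrate K-means initialization and treating the inducing point locations as unknowns optimized together with the model.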
- -+++ - -## Examples - -For the following examples, we use the same data set as was used in the `gp.Marginal` example, but with more data points. - -```{code-cell} ipython3 -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano -import theano.tensor as tt - -%matplotlib inline -``` - -```{code-cell} ipython3 -# set the seed -np.random.seed(1) - -n = 2000 # The number of data points -X = 10 * np.sort(np.random.rand(n))[:, None] - -# Define the true covariance function and its parameters -ℓ_true = 1.0 -η_true = 3.0 -cov_func = η_true**2 * pm.gp.cov.Matern52(1, ℓ_true) - -# A mean function that is zero everywhere -mean_func = pm.gp.mean.Zero() - -# The latent function values are one sample from a multivariate normal -# Note that we have to call `eval()` because PyMC3 built on top of Theano -f_true = np.random.multivariate_normal( - mean_func(X).eval(), cov_func(X).eval() + 1e-8 * np.eye(n), 1 -).flatten() - -# The observed data is the latent function plus a small amount of IID Gaussian noise -# The standard deviation of the noise is `sigma` -σ_true = 2.0 -y = f_true + σ_true * np.random.randn(n) - -## Plot the data and the unobserved latent function -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -ax.plot(X, f_true, "dodgerblue", lw=3, label="True f") -ax.plot(X, y, "ok", ms=3, alpha=0.5, label="Data") -ax.set_xlabel("X") -ax.set_ylabel("The true f(x)") -plt.legend(); -``` - -### Initializing the inducing points with K-means - -We use the NUTS sampler and the `FITC` approximation. - -```{code-cell} ipython3 -with pm.Model() as model: - ℓ = pm.Gamma("ℓ", alpha=2, beta=1) - η = pm.HalfCauchy("η", beta=5) - - cov = η**2 * pm.gp.cov.Matern52(1, ℓ) - gp = pm.gp.MarginalSparse(cov_func=cov, approx="FITC") - - # initialize 20 inducing points with K-means - # gp.util - Xu = pm.gp.util.kmeans_inducing_points(20, X) - - σ = pm.HalfCauchy("σ", beta=5) - y_ = gp.marginal_likelihood("y", X=X, Xu=Xu, y=y, noise=σ) - - trace = pm.sample(1000) -``` - -```{code-cell} ipython3 -X_new = np.linspace(-1, 11, 200)[:, None] - -# add the GP conditional to the model, given the new X values -with model: - f_pred = gp.conditional("f_pred", X_new) - -# To use the MAP values, you can just replace the trace with a length-1 list with `mp` -with model: - pred_samples = pm.sample_posterior_predictive(trace, vars=[f_pred], samples=1000) -``` - -```{code-cell} ipython3 -# plot the results -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() - -# plot the samples from the gp posterior with samples and shading -from pymc3.gp.util import plot_gp_dist - -plot_gp_dist(ax, pred_samples["f_pred"], X_new) - -# plot the data and the true latent function -plt.plot(X, y, "ok", ms=3, alpha=0.5, label="Observed data") -plt.plot(X, f_true, "dodgerblue", lw=3, label="True f") -plt.plot(Xu, 10 * np.ones(Xu.shape[0]), "cx", ms=10, label="Inducing point locations") - -# axis labels and title -plt.xlabel("X") -plt.ylim([-13, 13]) -plt.title("Posterior distribution over $f(x)$ at the observed values") -plt.legend(); -``` - -### Optimizing inducing point locations as part of the model - -For demonstration purposes, we set `approx="VFE"`. Any inducing point initialization can be done with any approximation. 
- -```{code-cell} ipython3 -Xu_init = 10 * np.random.rand(20) - -with pm.Model() as model: - ℓ = pm.Gamma("ℓ", alpha=2, beta=1) - η = pm.HalfCauchy("η", beta=5) - - cov = η**2 * pm.gp.cov.Matern52(1, ℓ) - gp = pm.gp.MarginalSparse(cov_func=cov, approx="VFE") - - # set flat prior for Xu - Xu = pm.Flat("Xu", shape=20, testval=Xu_init) - - σ = pm.HalfCauchy("σ", beta=5) - y_ = gp.marginal_likelihood("y", X=X, Xu=Xu[:, None], y=y, noise=σ) - - mp = pm.find_MAP() -``` - -```{code-cell} ipython3 -mu, var = gp.predict(X_new, point=mp, diag=True) -sd = np.sqrt(var) - -# draw plot -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() - -# plot mean and 2σ intervals -plt.plot(X_new, mu, "r", lw=2, label="mean and 2σ region") -plt.plot(X_new, mu + 2 * sd, "r", lw=1) -plt.plot(X_new, mu - 2 * sd, "r", lw=1) -plt.fill_between(X_new.flatten(), mu - 2 * sd, mu + 2 * sd, color="r", alpha=0.5) - -# plot original data and true function -plt.plot(X, y, "ok", ms=3, alpha=1.0, label="observed data") -plt.plot(X, f_true, "dodgerblue", lw=3, label="true f") -Xu = mp["Xu"] -plt.plot(Xu, 10 * np.ones(Xu.shape[0]), "cx", ms=10, label="Inducing point locations") - -plt.xlabel("x") -plt.ylim([-13, 13]) -plt.title("predictive mean and 2σ interval") -plt.legend(); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/gaussian_processes/GP-TProcess.myst.md b/myst_nbs/gaussian_processes/GP-TProcess.myst.md deleted file mode 100644 index 272695728..000000000 --- a/myst_nbs/gaussian_processes/GP-TProcess.myst.md +++ /dev/null @@ -1,156 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Student-t Process - -PyMC3 also includes T-process priors. They are a generalization of a Gaussian process prior to the multivariate Student's T distribution. The usage is identical to that of `gp.Latent`, except they require a degrees of freedom parameter when they are specified in the model. For more information, see chapter 9 of [Rasmussen+Williams](http://www.gaussianprocess.org/gpml/), and [Shah et al.](https://arxiv.org/abs/1402.4306). - -Note that T processes aren't additive in the same way as GPs, so addition of `TP` objects are not supported. - -+++ - -## Samples from a TP prior - -The following code draws samples from a T process prior with 3 degrees of freedom and a Gaussian process, both with the same covariance matrix. 
- -```{code-cell} ipython3 -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt - -%matplotlib inline -``` - -```{code-cell} ipython3 -# set the seed -np.random.seed(1) - -n = 100 # The number of data points -X = np.linspace(0, 10, n)[:, None] # The inputs to the GP, they must be arranged as a column vector - -# Define the true covariance function and its parameters -ℓ_true = 1.0 -η_true = 3.0 -cov_func = η_true**2 * pm.gp.cov.Matern52(1, ℓ_true) - -# A mean function that is zero everywhere -mean_func = pm.gp.mean.Zero() - -# The latent function values are one sample from a multivariate normal -# Note that we have to call `eval()` because PyMC3 built on top of Theano -tp_samples = pm.MvStudentT.dist(mu=mean_func(X).eval(), cov=cov_func(X).eval(), nu=3).random(size=8) - -## Plot samples from TP prior -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -ax.plot(X.flatten(), tp_samples.T, lw=3, alpha=0.6) -ax.set_xlabel("X") -ax.set_ylabel("y") -ax.set_title("Samples from TP with DoF=3") - - -gp_samples = pm.MvNormal.dist(mu=mean_func(X).eval(), cov=cov_func(X).eval()).random(size=8) -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -ax.plot(X.flatten(), gp_samples.T, lw=3, alpha=0.6) -ax.set_xlabel("X") -ax.set_ylabel("y") -ax.set_title("Samples from GP"); -``` - -## Poisson data generated by a T process - -For the Poisson rate, we take the square of the function represented by the T process prior. - -```{code-cell} ipython3 -np.random.seed(7) - -n = 150 # The number of data points -X = np.linspace(0, 10, n)[:, None] # The inputs to the GP, they must be arranged as a column vector - -# Define the true covariance function and its parameters -ℓ_true = 1.0 -η_true = 3.0 -cov_func = η_true**2 * pm.gp.cov.ExpQuad(1, ℓ_true) - -# A mean function that is zero everywhere -mean_func = pm.gp.mean.Zero() - -# The latent function values are one sample from a multivariate normal -# Note that we have to call `eval()` because PyMC3 built on top of Theano -f_true = pm.MvStudentT.dist(mu=mean_func(X).eval(), cov=cov_func(X).eval(), nu=3).random(size=1) -y = np.random.poisson(f_true**2) - -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -ax.plot(X, f_true**2, "dodgerblue", lw=3, label="True f") -ax.plot(X, y, "ok", ms=3, label="Data") -ax.set_xlabel("X") -ax.set_ylabel("y") -plt.legend(); -``` - -```{code-cell} ipython3 -with pm.Model() as model: - ℓ = pm.Gamma("ℓ", alpha=2, beta=2) - η = pm.HalfCauchy("η", beta=3) - cov = η**2 * pm.gp.cov.ExpQuad(1, ℓ) - - # informative prior on degrees of freedom < 5 - ν = pm.Gamma("ν", alpha=2, beta=1) - tp = pm.gp.TP(cov_func=cov, nu=ν) - f = tp.prior("f", X=X) - - # adding a small constant seems to help with numerical stability here - y_ = pm.Poisson("y", mu=tt.square(f) + 1e-6, observed=y) - - tr = pm.sample(1000) -``` - -```{code-cell} ipython3 -pm.traceplot(tr, var_names=["ℓ", "ν", "η"]); -``` - -```{code-cell} ipython3 -n_new = 200 -X_new = np.linspace(0, 15, n_new)[:, None] - -# add the GP conditional to the model, given the new X values -with model: - f_pred = tp.conditional("f_pred", X_new) - -# Sample from the GP conditional distribution -with model: - pred_samples = pm.sample_posterior_predictive(tr, vars=[f_pred], samples=1000) -``` - -```{code-cell} ipython3 -fig = plt.figure(figsize=(12, 5)) -ax = fig.gca() -from pymc3.gp.util import plot_gp_dist - -plot_gp_dist(ax, np.square(pred_samples["f_pred"]), X_new) -plt.plot(X, np.square(f_true), "dodgerblue", lw=3, label="True f") -plt.plot(X, y, "ok", ms=3, 
alpha=0.5, label="Observed data") -plt.xlabel("X") -plt.ylabel("True f(x)") -plt.ylim([-2, 20]) -plt.title("Conditional distribution of f_*, given f") -plt.legend(); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/gaussian_processes/GP-smoothing.myst.md b/myst_nbs/gaussian_processes/GP-smoothing.myst.md deleted file mode 100644 index e277d43f1..000000000 --- a/myst_nbs/gaussian_processes/GP-smoothing.myst.md +++ /dev/null @@ -1,177 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Gaussian Process (GP) smoothing - -This example deals with the case when we want to **smooth** the observed data points $(x_i, y_i)$ of some 1-dimensional function $y=f(x)$, by finding the new values $(x_i, y'_i)$ such that the new data is more "smooth" (see more on the definition of smoothness through allocation of variance in the model description below) when moving along the $x$ axis. - -It is important to note that we are **not** dealing with the problem of interpolating the function $y=f(x)$ at the unknown values of $x$. Such problem would be called "regression" not "smoothing", and will be considered in other examples. - -If we assume the functional dependency between $x$ and $y$ is **linear** then, by making the independence and normality assumptions about the noise, we can infer a straight line that approximates the dependency between the variables, i.e. perform a linear regression. We can also fit more complex functional dependencies (like quadratic, cubic, etc), if we know the functional form of the dependency in advance. - -However, the **functional form** of $y=f(x)$ is **not always known in advance**, and it might be hard to choose which one to fit, given the data. For example, you wouldn't necessarily know which function to use, given the following observed data. Assume you haven't seen the formula that generated it: - -```{code-cell} ipython3 -%pylab inline -figsize(12, 6); -``` - -```{code-cell} ipython3 -import numpy as np -import scipy.stats as stats - -x = np.linspace(0, 50, 100) -y = np.exp(1.0 + np.power(x, 0.5) - np.exp(x / 15.0)) + np.random.normal(scale=1.0, size=x.shape) - -plot(x, y) -xlabel("x") -ylabel("y") -title("Observed Data"); -``` - -### Let's try a linear regression first - -As humans, we see that there is a non-linear dependency with some noise, and we would like to capture that dependency. If we perform a linear regression, we see that the "smoothed" data is less than satisfactory: - -```{code-cell} ipython3 -plot(x, y) -xlabel("x") -ylabel("y") - -lin = stats.linregress(x, y) -plot(x, lin.intercept + lin.slope * x) -title("Linear Smoothing"); -``` - -### Linear regression model recap - -The linear regression assumes there is a linear dependency between the input $x$ and output $y$, sprinkled with some noise around it so that for each observed data point we have: - -$$ y_i = a + b\, x_i + \epsilon_i $$ - -where the observation errors at each data point satisfy: - -$$ \epsilon_i \sim N(0, \sigma^2) $$ - -with the same $\sigma$, and the errors are independent: - -$$ cov(\epsilon_i, \epsilon_j) = 0 \: \text{ for } i \neq j $$ - -The parameters of this model are $a$, $b$, and $\sigma$. It turns out that, under these assumptions, the maximum likelihood estimates of $a$ and $b$ don't depend on $\sigma$. 
Then $\sigma$ can be estimated separately, after finding the most likely values for $a$ and $b$. - -+++ - -### Gaussian Process smoothing model - -This model allows departure from the linear dependency by assuming that the dependency between $x$ and $y$ is a Brownian motion over the domain of $x$. This doesn't go as far as assuming a particular functional dependency between the variables. Instead, by **controlling the standard deviation of the unobserved Brownian motion** we can achieve different levels of smoothness of the recovered functional dependency at the original data points. - -The particular model we are going to discuss assumes that the observed data points are **evenly spaced** across the domain of $x$, and therefore can be indexed by $i=1,\dots,N$ without the loss of generality. The model is described as follows: - -\begin{equation} -\begin{aligned} -z_i & \sim \mathcal{N}(z_{i-1} + \mu, (1 - \alpha)\cdot\sigma^2) \: \text{ for } i=2,\dots,N \\ -z_1 & \sim ImproperFlat(-\infty,\infty) \\ -y_i & \sim \mathcal{N}(z_i, \alpha\cdot\sigma^2) -\end{aligned} -\end{equation} - -where $z$ is the hidden Brownian motion, $y$ is the observed data, and the total variance $\sigma^2$ of each observation is split between the hidden Brownian motion and the noise in proportions of $1 - \alpha$ and $\alpha$ respectively, with parameter $0 < \alpha < 1$ specifying the degree of smoothing. - -When we estimate the maximum likelihood values of the hidden process $z_i$ at each of the data points, $i=1,\dots,N$, these values provide an approximation of the functional dependency $y=f(x)$ as $\mathrm{E}\,[f(x_i)] = z_i$ at the original data points $x_i$ only. Therefore, again, the method is called smoothing and not regression. - -+++ - -### Let's describe the above GP-smoothing model in PyMC3 - -```{code-cell} ipython3 -import pymc3 as pm - -from pymc3.distributions.timeseries import GaussianRandomWalk -from scipy import optimize -from theano import shared -``` - -Let's create a model with a shared parameter for specifying different levels of smoothing. We use very wide priors for the "mu" and "tau" parameters of the hidden Brownian motion, which you can adjust according to your application. - -```{code-cell} ipython3 -LARGE_NUMBER = 1e5 - -model = pm.Model() -with model: - smoothing_param = shared(0.9) - mu = pm.Normal("mu", sigma=LARGE_NUMBER) - tau = pm.Exponential("tau", 1.0 / LARGE_NUMBER) - z = GaussianRandomWalk("z", mu=mu, tau=tau / (1.0 - smoothing_param), shape=y.shape) - obs = pm.Normal("obs", mu=z, tau=tau / smoothing_param, observed=y) -``` - -Let's also make a helper function for inferring the most likely values of $z$: - -```{code-cell} ipython3 -def infer_z(smoothing): - with model: - smoothing_param.set_value(smoothing) - res = pm.find_MAP(vars=[z], method="L-BFGS-B") - return res["z"] -``` - -Please note that in this example, we are only looking at the MAP estimate of the unobserved variables. We are not really interested in inferring the posterior distributions. Instead, we have a control parameter $\alpha$ which lets us allocate the variance between the hidden Brownian motion and the noise. Other goals and/or different models may require sampling to obtain the posterior distributions, but for our goal a MAP estimate will suffice. - -### Exploring different levels of smoothing - -Let's try to allocate 50% variance to the noise, and see if the result matches our expectations. 
- -```{code-cell} ipython3 -smoothing = 0.5 -z_val = infer_z(smoothing) - -plot(x, y) -plot(x, z_val) -title(f"Smoothing={smoothing}"); -``` - -It appears that the variance is split evenly between the noise and the hidden process, as expected. - -Let's try gradually increasing the smoothness parameter to see if we can obtain smoother data: - -```{code-cell} ipython3 -smoothing = 0.9 -z_val = infer_z(smoothing) - -plot(x, y) -plot(x, z_val) -title(f"Smoothing={smoothing}"); -``` - -### Smoothing "to the limits" - -By increasing the smoothing parameter, we can gradually make the inferred values of the hidden Brownian motion approach the average value of the data. This is because as we increase the smoothing parameter, we allow less and less of the variance to be allocated to the Brownian motion, so eventually it approaches the process which almost doesn't change over the domain of $x$: - -```{code-cell} ipython3 -fig, axes = subplots(2, 2) - -for ax, smoothing in zip(axes.ravel(), [0.95, 0.99, 0.999, 0.9999]): - - z_val = infer_z(smoothing) - - ax.plot(x, y) - ax.plot(x, z_val) - ax.set_title(f"Smoothing={smoothing:05.4f}") -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -This example originally contributed by: Andrey Kuzmenko, http://github.com/akuz diff --git a/myst_nbs/gaussian_processes/MOGP-Coregion-Hadamard.myst.md b/myst_nbs/gaussian_processes/MOGP-Coregion-Hadamard.myst.md deleted file mode 100644 index 7bf7fde2b..000000000 --- a/myst_nbs/gaussian_processes/MOGP-Coregion-Hadamard.myst.md +++ /dev/null @@ -1,359 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(Multi-output-GPs_Coregion)= -# Multi-output Gaussian Processes: Coregionalization models using Hamadard product - -:::{post} October, 2022 -:tags: gaussian process, multi-output -:category: intermediate -:author: Danh Phan, Bill Engels, Chris Fonnesbeck -::: - -+++ - -This notebook shows how to implement the **Intrinsic Coregionalization Model** (ICM) and the **Linear Coregionalization Model** (LCM) using a Hamadard product between the Coregion kernel and input kernels. Multi-output Gaussian Process is discussed in [this paper](https://papers.nips.cc/paper/2007/hash/66368270ffd51418ec58bd793f2d9b1b-Abstract.html) by {cite:t}`bonilla2007multioutput`. For further information about ICM and LCM, please check out the [talk](https://www.youtube.com/watch?v=ttgUJtVJthA&list=PLpTp0l_CVmgwyAthrUmmdIFiunV1VvicM) on Multi-output Gaussian Processes by Mauricio Alvarez, and [his slides](http://gpss.cc/gpss17/slides/multipleOutputGPs.pdf) with more references at the last page. - -The advantage of Multi-output Gaussian Processes is their capacity to simultaneously learn and infer many outputs which have the same source of uncertainty from inputs. In this example, we model the average spin rates of several pitchers in different games from a baseball dataset. 
- -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from pymc.gp.util import plot_gp_dist -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -``` - -## Preparing the data -The baseball dataset contains the average spin rate of several pitchers on different game dates. - -```{code-cell} ipython3 -# get data -try: - df = pd.read_csv("../data/fastball_spin_rates.csv") -except FileNotFoundError: - df = pd.read_csv(pm.get_data("fastball_spin_rates.csv")) - -print(df.shape) -df.head() -``` - -```{code-cell} ipython3 -print( - f"There are {df['pitcher_name'].nunique()} pichers, in {df['game_date'].nunique()} game dates" -) -``` - -```{code-cell} ipython3 -# Standardise average spin rate -df["avg_spin_rate"] = (df["avg_spin_rate"] - df["avg_spin_rate"].mean()) / df["avg_spin_rate"].std() -df["avg_spin_rate"].describe() -``` - -#### Top N popular pitchers - -```{code-cell} ipython3 -# Get top N popular pitchers by who attended most games -n_outputs = 5 # Top 5 popular pitchers -top_pitchers = df.groupby("pitcher_name")["game_date"].count().nlargest(n_outputs).reset_index() -top_pitchers = top_pitchers.reset_index().rename(columns={"index": "output_idx"}) -top_pitchers -``` - -```{code-cell} ipython3 -# Filter the data with only top N pitchers -adf = df.loc[df["pitcher_name"].isin(top_pitchers["pitcher_name"])].copy() -print(adf.shape) -adf.head() -``` - -```{code-cell} ipython3 -adf["avg_spin_rate"].describe() -``` - -#### Create a game date index - -```{code-cell} ipython3 -# There are 142 game dates from 01 Apr 2021 to 03 Oct 2021. -adf.loc[:, "game_date"] = pd.to_datetime(adf.loc[:, "game_date"]) -game_dates = adf.loc[:, "game_date"] -game_dates.min(), game_dates.max(), game_dates.nunique(), (game_dates.max() - game_dates.min()) -``` - -```{code-cell} ipython3 -# Create a game date index -dates_idx = pd.DataFrame( - {"game_date": pd.date_range(game_dates.min(), game_dates.max())} -).reset_index() -dates_idx = dates_idx.rename(columns={"index": "x"}) -dates_idx.head() -``` - -#### Create training data - -```{code-cell} ipython3 -adf = adf.merge(dates_idx, how="left", on="game_date") -adf = adf.merge(top_pitchers[["pitcher_name", "output_idx"]], how="left", on="pitcher_name") -adf.head() -``` - -```{code-cell} ipython3 -adf = adf.sort_values(["output_idx", "x"]) -X = adf[ - ["x", "output_idx"] -].values # Input data includes the index of game dates, and the index of picthers -Y = adf["avg_spin_rate"].values # Output data includes the average spin rate of pitchers -X.shape, Y.shape -``` - -#### Visualise training data - -```{code-cell} ipython3 -# Plot average spin rates of top pitchers -fig, ax = plt.subplots(1, 1, figsize=(14, 6)) -legends = [] -for pitcher in top_pitchers["pitcher_name"]: - cond = adf["pitcher_name"] == pitcher - ax.plot(adf.loc[cond, "x"], adf.loc[cond, "avg_spin_rate"], "-o") - legends.append(pitcher) -plt.title("Average spin rates of top 5 popular pitchers") -plt.xlabel("The index of game dates") -plt.ylim([-1.5, 4.0]) -plt.legend(legends, loc="upper center"); -``` - -## Intrinsic Coregionalization Model (ICM) - -The Intrinsic Coregionalization Model (ICM) is a particular case of the Linear Coregionalization Model (LCM) with one input kernel, for example: - -$$ K_{ICM} = B \otimes K_{ExpQuad} $$ - -Where $B(o,o')$ is the output kernel, and $K_{ExpQuad}(x,x')$ is an 
input kernel.

$$ B = WW^T + \text{diag}(\kappa) $$

```{code-cell} ipython3
def get_icm(input_dim, kernel, W=None, kappa=None, B=None, active_dims=None):
    """
    This function generates an ICM kernel from an input kernel and a Coregion kernel.
    """
    coreg = pm.gp.cov.Coregion(input_dim=input_dim, W=W, kappa=kappa, B=B, active_dims=active_dims)
    icm_cov = kernel * coreg  # Use the Hadamard product for separate inputs
    return icm_cov
```

```{code-cell} ipython3
with pm.Model() as model:
    # Priors
    ell = pm.Gamma("ell", alpha=2, beta=0.5)
    eta = pm.Gamma("eta", alpha=3, beta=1)
    kernel = eta**2 * pm.gp.cov.ExpQuad(input_dim=2, ls=ell, active_dims=[0])
    sigma = pm.HalfNormal("sigma", sigma=3)

    # Get the ICM kernel
    W = pm.Normal("W", mu=0, sigma=3, shape=(n_outputs, 2), initval=np.random.randn(n_outputs, 2))
    kappa = pm.Gamma("kappa", alpha=1.5, beta=1, shape=n_outputs)
    B = pm.Deterministic("B", at.dot(W, W.T) + at.diag(kappa))
    cov_icm = get_icm(input_dim=2, kernel=kernel, B=B, active_dims=[1])

    # Define a Multi-output GP
    mogp = pm.gp.Marginal(cov_func=cov_icm)
    y_ = mogp.marginal_likelihood("f", X, Y, sigma=sigma)
```

```{code-cell} ipython3
pm.model_to_graphviz(model)
```

```{code-cell} ipython3
%%time
with model:
    gp_trace = pm.sample(2000, chains=1)
```

#### Prediction

```{code-cell} ipython3
# Prepare test data
M = 200  # number of data points
x_new = np.linspace(0, 200, M)[
    :, None
]  # Select 200 days (185 previous days, and add 15 days into the future).
X_new = np.vstack([x_new for idx in range(n_outputs)])
output_idx = np.vstack([np.repeat(idx, M)[:, None] for idx in range(n_outputs)])
X_new = np.hstack([X_new, output_idx])
```

```{code-cell} ipython3
:tags: []

%%time
with model:
    preds = mogp.conditional("preds", X_new)
    gp_samples = pm.sample_posterior_predictive(gp_trace, var_names=["preds"], random_seed=42)
```

```{code-cell} ipython3
f_pred = gp_samples.posterior_predictive["preds"].sel(chain=0)


def plot_predictive_posteriors(f_pred, top_pitchers, M, X_new):
    fig, axes = plt.subplots(n_outputs, 1, figsize=(12, 15))

    for idx, pitcher in enumerate(top_pitchers["pitcher_name"]):
        # Prediction
        plot_gp_dist(
            axes[idx],
            f_pred[:, M * idx : M * (idx + 1)],
            X_new[M * idx : M * (idx + 1), 0],
            palette="Blues",
            fill_alpha=0.1,
            samples_alpha=0.1,
        )
        # Training data points
        cond = adf["pitcher_name"] == pitcher
        axes[idx].scatter(adf.loc[cond, "x"], adf.loc[cond, "avg_spin_rate"], color="r")
        axes[idx].set_title(pitcher)
    plt.tight_layout()


plot_predictive_posteriors(f_pred, top_pitchers, M, X_new)
```

It can be seen that the average spin rate of Rodriguez Richard decreases significantly from around the 75th game date. Kopech Michael's performance improves after a break of several weeks in the middle, while Hearn Taylor has performed better recently.

```{code-cell} ipython3
az.plot_trace(gp_trace)
plt.tight_layout()
```

## Linear Coregionalization Model (LCM)

The LCM is a generalization of the ICM with two or more input kernels, so the LCM kernel is basically a sum of several ICM kernels. The LCM allows several independent samples from GPs with different covariances (kernels).

In this example, in addition to an `ExpQuad` kernel, we add a `Matern32` kernel for input data.
- -$$ K_{LCM} = B \otimes K_{ExpQuad} + B \otimes K_{Matern32} $$ - -```{code-cell} ipython3 -def get_lcm(input_dim, active_dims, num_outputs, kernels, W=None, B=None, name="ICM"): - """ - This function generates a LCM kernel from a list of input `kernels` and a Coregion kernel. - """ - if B is None: - kappa = pm.Gamma(f"{name}_kappa", alpha=5, beta=1, shape=num_outputs) - if W is None: - W = pm.Normal( - f"{name}_W", - mu=0, - sigma=5, - shape=(num_outputs, 1), - initval=np.random.randn(num_outputs, 1), - ) - else: - kappa = None - - cov_func = 0 - for idx, kernel in enumerate(kernels): - icm = get_icm(input_dim, kernel, W, kappa, B, active_dims) - cov_func += icm - return cov_func -``` - -```{code-cell} ipython3 -with pm.Model() as model: - # Priors - ell = pm.Gamma("ell", alpha=2, beta=0.5, shape=2) - eta = pm.Gamma("eta", alpha=3, beta=1, shape=2) - kernels = [pm.gp.cov.ExpQuad, pm.gp.cov.Matern32] - sigma = pm.HalfNormal("sigma", sigma=3) - - # Define a list of covariance functions - cov_list = [ - eta[idx] ** 2 * kernel(input_dim=2, ls=ell[idx], active_dims=[0]) - for idx, kernel in enumerate(kernels) - ] - - # Get the LCM kernel - cov_lcm = get_lcm(input_dim=2, active_dims=[1], num_outputs=n_outputs, kernels=cov_list) - - # Define a Multi-output GP - mogp = pm.gp.Marginal(cov_func=cov_lcm) - y_ = mogp.marginal_likelihood("f", X, Y, sigma=sigma) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -%%time -with model: - gp_trace = pm.sample(2000, chains=1) -``` - -### Prediction - -```{code-cell} ipython3 -:tags: [] - -%%time -with model: - preds = mogp.conditional("preds", X_new) - gp_samples = pm.sample_posterior_predictive(gp_trace, var_names=["preds"], random_seed=42) -``` - -```{code-cell} ipython3 -plot_predictive_posteriors(f_pred, top_pitchers, M, X_new) -``` - -```{code-cell} ipython3 -az.plot_trace(gp_trace) -plt.tight_layout() -``` - -## Acknowledgement -This work is supported by 2022 [Google Summer of Codes](https://summerofcode.withgoogle.com/) and [NUMFOCUS](https://numfocus.org/). - -+++ - -## Authors -* Authored by [Danh Phan](https://github.com/danhphan), [Bill Engels](https://github.com/bwengals), [Chris Fonnesbeck](https://github.com/fonnesbeck) in November, 2022 ([pymc-examples#454](https://github.com/pymc-devs/pymc-examples/pull/454)) - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/gaussian_process.myst.md b/myst_nbs/gaussian_processes/gaussian_process.myst.md deleted file mode 100644 index 9d1a8b69c..000000000 --- a/myst_nbs/gaussian_processes/gaussian_process.myst.md +++ /dev/null @@ -1,190 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.10.5 ('pymc-dev') - language: python - name: python3 ---- - -(gaussian_process)= -# Gaussian Processes using numpy kernel - -:::{post} Jul 31, 2022 -:tags: gaussian process, -:category: advanced -:author: Chris Fonnesbeck, Ana Rita Santos and Sandra Meneses -::: - -+++ - -Example of simple Gaussian Process fit, adapted from Stan's [example-models repository](https://github.com/stan-dev/example-models/blob/master/misc/gaussian-process/gp-fit.stan). 
- -For illustrative and divulgative purposes, this example builds a Gaussian process from scratch. However, PyMC includes a {mod}`module dedicated to Gaussian Processes ` which is recommended instead of coding everything from scratch. - -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import seaborn as sns - -from xarray_einstats.stats import multivariate_normal - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -# fmt: off -x = np.array([-5, -4.9, -4.8, -4.7, -4.6, -4.5, -4.4, -4.3, -4.2, -4.1, -4, --3.9, -3.8, -3.7, -3.6, -3.5, -3.4, -3.3, -3.2, -3.1, -3, -2.9, --2.8, -2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2, -1.9, -1.8, --1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1, -0.9, -0.8, -0.7, --0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, -0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, -1.9, 2, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3, 3.1, -3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4, 4.1, 4.2, 4.3, 4.4, -4.5, 4.6, 4.7, 4.8, 4.9, 5]) - -y = np.array([1.04442478194401, 0.948306088493654, 0.357037759697332, 0.492336514646604, -0.520651364364746, 0.112629866592809, 0.470995468454158, -0.168442254267804, -0.0720344402575861, -0.188108980535916, -0.0160163306512027, --0.0388792158617705, -0.0600673630622568, 0.113568725264636, -0.447160403837629, 0.664421188556779, -0.139510743820276, 0.458823971660986, -0.141214654640904, -0.286957663528091, -0.466537724021695, -0.308185884317105, --1.57664872694079, -1.44463024170082, -1.51206214603847, -1.49393593601901, --2.02292464164487, -1.57047488853653, -1.22973445533419, -1.51502367058357, --1.41493587255224, -1.10140254663611, -0.591866485375275, -1.08781838696462, --0.800375653733931, -1.00764767602679, -0.0471028950122742, -0.536820626879737, --0.151688056391446, -0.176771681318393, -0.240094952335518, -1.16827876746502, --0.493597351974992, -0.831683011472805, -0.152347043914137, 0.0190364158178343, --1.09355955218051, -0.328157917911376, -0.585575679802941, -0.472837120425201, --0.503633622750049, -0.0124446353828312, -0.465529814250314, --0.101621725887347, -0.26988462590405, 0.398726664193302, 0.113805181040188, -0.331353802465398, 0.383592361618461, 0.431647298655434, 0.580036473774238, -0.830404669466897, 1.17919105883462, 0.871037583886711, 1.12290553424174, -0.752564860804382, 0.76897960270623, 1.14738839410786, 0.773151715269892, -0.700611498974798, 0.0412951045437818, 0.303526087747629, -0.139399513324585, --0.862987735433697, -1.23399179134008, -1.58924289116396, -1.35105117911049, --0.990144529089174, -1.91175364127672, -1.31836236129543, -1.65955735224704, --1.83516148300526, -2.03817062501248, -1.66764011409214, -0.552154350554687, --0.547807883952654, -0.905389222477036, -0.737156477425302, -0.40211249920415, -0.129669958952991, 0.271142753510592, 0.176311762529962, 0.283580281859344, -0.635808289696458, 1.69976647982837, 1.10748978734239, 0.365412229181044, -0.788821368082444, 0.879731888124867, 1.02180766619069, 0.551526067300283]) -# fmt: on -N = len(y) -``` - -We will use a squared exponential covariance function, which relies on the squared distances between observed points in the data. 
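For reference, the covariance that the next code cell implements is the squared exponential kernel with additive observation noise on the diagonal, with `eta_sq`, `rho_sq` and `sigma_sq` playing the roles of $\eta^2$, $\rho^2$ and $\sigma^2$:

$$
k(x_i, x_j) = \eta^2 \exp\left(-\rho^2 (x_i - x_j)^2\right) + \sigma^2 \delta_{ij}
$$

so every pair of points is correlated according to its squared distance, and each observation picks up an extra $\sigma^2$ noise term on the diagonal.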
- -```{code-cell} ipython3 -squared_distance = lambda x, y: (x[None, :] - y[:, None]) ** 2 -``` - -```{code-cell} ipython3 -with pm.Model() as gp_fit: - - mu = np.zeros(N) - - eta_sq = pm.HalfCauchy("eta_sq", 5) - rho_sq = pm.HalfCauchy("rho_sq", 5) - sigma_sq = pm.HalfCauchy("sigma_sq", 5) - - D = squared_distance(x, x) - - # Squared exponential - sigma = at.fill_diagonal(eta_sq * at.exp(-rho_sq * D), eta_sq + sigma_sq) - - obs = pm.MvNormal("obs", mu, sigma, observed=y) -``` - -This is what our initial covariance matrix looks like. Intuitively, every data point's Y-value correlates with points according to their squared distances. - -```{code-cell} ipython3 -sns.heatmap(sigma.eval(), xticklabels=False, yticklabels=False); -``` - -The following generates predictions from the Gaussian Process model in a grid of values: - -```{code-cell} ipython3 -# Prediction over grid -xgrid = np.linspace(-6, 6) -D_pred = squared_distance(xgrid, xgrid) -D_off_diag = squared_distance(x, xgrid) - -gp_fit.add_coords({"pred_id": xgrid, "pred_id2": xgrid}) - -with gp_fit as gp: - # Covariance matrices for prediction - sigma_pred = eta_sq * at.exp(-rho_sq * D_pred) - sigma_off_diag = eta_sq * at.exp(-rho_sq * D_off_diag) - - # Posterior mean - mu_post = pm.Deterministic( - "mu_post", at.dot(at.dot(sigma_off_diag, pm.math.matrix_inverse(sigma)), y), dims="pred_id" - ) - # Posterior covariance - sigma_post = pm.Deterministic( - "sigma_post", - sigma_pred - - at.dot(at.dot(sigma_off_diag, pm.math.matrix_inverse(sigma)), sigma_off_diag.T), - dims=("pred_id", "pred_id2"), - ) -``` - -```{code-cell} ipython3 -with gp_fit: - svgd_approx = pm.fit(400, method="svgd", inf_kwargs=dict(n_particles=100)) -``` - -```{code-cell} ipython3 -gp_trace = svgd_approx.sample(1000) -``` - -```{code-cell} ipython3 -az.plot_trace(gp_trace, var_names=["eta_sq", "rho_sq", "sigma_sq"]); -``` - -Sample from the posterior Gaussian Process - -```{code-cell} ipython3 -post = az.extract(gp_trace, num_samples=200) - -y_pred = multivariate_normal( - post["mu_post"], post["sigma_post"], dims=("pred_id", "pred_id2") -).rvs() -``` - -```{code-cell} ipython3 -_, ax = plt.subplots(figsize=(12, 8)) -ax.plot(xgrid, y_pred.transpose(..., "sample"), "c-", alpha=0.1) -ax.plot(x, y, "r."); -``` - -## Authors -* Adapted from Stan's [example-models repository](https://github.com/stan-dev/example-models/blob/master/misc/gaussian-process/gp-fit.stan) by Chris Fonnesbeck in 2016 -* Updated by Ana Rita Santos and Sandra Meneses in July, 2022 ([pymc#404](https://github.com/pymc-devs/pymc/pull/404)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray,xarray_einstats -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/gaussian_processes/log-gaussian-cox-process.myst.md b/myst_nbs/gaussian_processes/log-gaussian-cox-process.myst.md deleted file mode 100644 index f60586107..000000000 --- a/myst_nbs/gaussian_processes/log-gaussian-cox-process.myst.md +++ /dev/null @@ -1,295 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(log-gaussian-cox-process)= -# Modeling spatial point patterns with a marked log-Gaussian Cox process - -:::{post} May 31, 2022 -:tags: cox process, latent gaussian process, nonparametric, spatial, count data -:category: intermediate -:author: Chrisopher Krapu, Chris 
Fonnesbeck -::: - -+++ - -## Introduction - -+++ - -The log-Gaussian Cox process (LGCP) is a probabilistic model of point patterns typically observed in space or time. It has two main components. First, an underlying *intensity* field $\lambda(s)$ of positive real values is modeled over the entire domain $X$ using an exponentially-transformed Gaussian process which constrains $\lambda$ to be positive. Then, this intensity field is used to parameterize a [Poisson point process](https://en.wikipedia.org/wiki/Poisson_point_process) which represents a stochastic mechanism for placing points in space. Some phenomena amenable to this representation include the incidence of cancer cases across a county, or the spatiotemporal locations of crime events in a city. Both spatial and temporal dimensions can be handled equivalently within this framework, though this tutorial only addresses data in two spatial dimensions. - -In more formal terms, if we have a space $X$ and $A\subseteq X$, the distribution over the number of points $Y_A$ occurring within subset $A$ is given by -$$Y_A \sim Poisson\left(\int_A \lambda(s) ds\right)$$ -and the intensity field is defined as -$$\log \lambda(s) \sim GP(\mu(s), K(s,s'))$$ -where $GP(\mu(s), K(s,s'))$ denotes a Gaussian process with mean function $\mu(s)$ and covariance kernel $K(s,s')$ for a location $s \in X$. This is one of the simplest models of point patterns of $n$ events recorded as locations $s_1,...,s_n$ in an arbitrary metric space. In conjunction with a Bayesian analysis, this model can be used to answering questions of interest such as: -* Does an observed point pattern imply a statistically significant shift in spatial intensities? -* What would randomly sampled patterns with the same statistical properties look like? -* Is there a statistical correlation between the *frequency* and *magnitude* of point events? - -In this notebook, we'll use a grid-based approximation to the full LGCP with PyMC to fit a model and analyze its posterior summaries. We will also explore the usage of a marked Poisson process, an extension of this model to account for the distribution of *marks* associated with each data point. - -+++ - -## Data - -+++ - -Our observational data concerns 231 sea anemones whose sizes and locations on the French coast were recorded. This data was taken from the [`spatstat` spatial modeling package in R](https://github.com/spatstat/spatstat) which is designed to address models like the LGCP and its subsequent refinements. The original source of this data is the textbook *Spatial data analysis by example* by Upton and Fingleton (1985) and a longer description of the data can be found there. - -```{code-cell} ipython3 -import warnings - -from itertools import product - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from matplotlib import MatplotlibDeprecationWarning -from numpy.random import default_rng - -warnings.filterwarnings(action="ignore", category=MatplotlibDeprecationWarning) -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -data = pd.read_csv(pm.get_data("anemones.csv")) -n = data.shape[0] -``` - -This dataset has coordinates and discrete mark values for each anemone. While these marks are integers, for the sake of simplicity we will model these values as continuous in a later step. 
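Before turning to the anemone data, here is a minimal simulation sketch of the generative process described above, on a hypothetical one-dimensional grid with made-up kernel parameters: a multivariate normal draw gives the log-intensity, and exponentiating it yields the Poisson rate for each cell.

```{code-cell} ipython3
# Illustrative sketch only: a hypothetical 1D grid and arbitrary kernel parameters.
toy_grid = np.linspace(0, 10, 50)[:, None]
toy_cov = pm.gp.cov.Matern52(1, ls=2.0)
toy_K = toy_cov(toy_grid).eval() + 1e-6 * np.eye(len(toy_grid))  # jitter for stability

toy_rng = np.random.default_rng(123)
toy_log_intensity = toy_rng.multivariate_normal(np.zeros(len(toy_grid)), toy_K)
toy_counts = toy_rng.poisson(np.exp(toy_log_intensity))  # one Poisson count per cell
toy_counts[:10]
```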
- -```{code-cell} ipython3 -data.head(3) -``` - -Let's take a look at this data in 2D space: - -```{code-cell} ipython3 -plt.scatter(data["x"], data["y"], c=data["marks"]) -plt.colorbar(label="Anemone size") -plt.axis("equal"); -``` - -The 'marks' column indicates the size of each anemone. If we were to model both the marks and the spatial distribution of points, we would be modeling a *marked Poisson point process*. Extending the basic point pattern model to include this feature is the second portion of this notebook. - -+++ - -While there are multiple ways to conduct inference, perhaps the simplest way is to slice up our domain $X$ into many small pieces $A_1, A_2,...,A_M$ and fix the intensity field to be constant within each subset. Then, we will treat the number of points within each $A_j$ as a Poisson random variable such that $Y_j \sim Poisson(\lambda_j)$. and we also consider the $\log{\lambda_1}...,\log{\lambda_M}$ variables as a single draw from a Gaussian process. - -+++ - -The code below splits up the domain into grid cells, counts the number of points within each cell and also identifies its centroid. - -```{code-cell} ipython3 -xy = data[["x", "y"]].values - -# Jitter the data slightly so that none of the points fall exactly -# on cell boundaries -eps = 1e-3 -rng = default_rng() -xy = xy.astype("float") + rng.standard_normal(xy.shape) * eps - -resolution = 20 - -# Rescaling the unit of area so that our parameter estimates -# are easier to read -area_per_cell = resolution**2 / 100 - -cells_x = int(280 / resolution) -cells_y = int(180 / resolution) - -# Creating bin edges for a 2D histogram -quadrat_x = np.linspace(0, 280, cells_x + 1) -quadrat_y = np.linspace(0, 180, cells_y + 1) - -# Identifying the midpoints of each grid cell -centroids = np.asarray(list(product(quadrat_x[:-1] + 10, quadrat_y[:-1] + 10))) - -cell_counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], [quadrat_x, quadrat_y]) -cell_counts = cell_counts.ravel().astype(int) -``` - -With the points split into different cells and the cell centroids computed, we can plot our new gridded dataset as shown below. - -```{code-cell} ipython3 -line_kwargs = {"color": "k", "linewidth": 1, "alpha": 0.5} - -plt.figure(figsize=(6, 4.5)) -[plt.axhline(y, **line_kwargs) for y in quadrat_y] -[plt.axvline(x, **line_kwargs) for x in quadrat_x] -plt.scatter(data["x"], data["y"], c=data["marks"], s=6) - -for i, row in enumerate(centroids): - shifted_row = row - 2 - plt.annotate(cell_counts[i], shifted_row, alpha=0.75) - -plt.title("Anemone counts per grid cell"), plt.colorbar(label="Anemone size"); -``` - -We can see that all of the counts are fairly low and range from zero to five. With all of our data prepared, we can go ahead and start writing out our probabilistic model in PyMC. We are going to treat each of the per-cell counts $Y_1,...Y_M$ above as a Poisson random variable. - -+++ - -# Inference - -+++ - -Our first step is to place prior distributions over the high-level parameters for the Gaussian process. This includes the length scale $\rho$ for the covariance function and a constant mean $\mu$ for the GP. 
- -```{code-cell} ipython3 -with pm.Model() as lgcp_model: - mu = pm.Normal("mu", sigma=3) - rho = pm.Uniform("rho", lower=25, upper=300) - variance = pm.InverseGamma("variance", alpha=1, beta=1) - cov_func = variance * pm.gp.cov.Matern52(2, ls=rho) - mean_func = pm.gp.mean.Constant(mu) -``` - -Next, we transform the Gaussian process into a positive-valued process via `pm.math.exp` and use the area per cell to transform the intensity function $\lambda(s)$ into rates $\lambda_i$ parameterizing the Poisson likelihood for the counts within cell $i$. - -```{code-cell} ipython3 -with lgcp_model: - gp = pm.gp.Latent(mean_func=mean_func, cov_func=cov_func) - - log_intensity = gp.prior("log_intensity", X=centroids) - intensity = pm.math.exp(log_intensity) - - rates = intensity * area_per_cell - counts = pm.Poisson("counts", mu=rates, observed=cell_counts) -``` - -With the model fully specified, we can start sampling from the posterior using the default NUTS sampler. I'll also tweak the target acceptance rate to reduce the number of divergences. - -```{code-cell} ipython3 -with lgcp_model: - trace = pm.sample(1000, tune=2000, target_accept=0.95) -``` - -# Interpreting the results - -+++ - -Posterior inference on the length_scale parameter is useful for understanding whether or not there are long-range correlations in the data. We can also examine the mean of the log-intensity field, but since it is on the log scale it is hard to directly interpret. - -```{code-cell} ipython3 -az.summary(trace, var_names=["mu", "rho"]) -``` - -We are also interested in looking at the value of the intensity field at a large number of new points in space. We can accommodate this within our model by including a new random variable for the latent Gaussian process evaluated at a denser set of points. Using `sample_posterior_predictive`, we generate posterior predictions on new data points contained in the variable `intensity_new`. - -```{code-cell} ipython3 -x_new = np.linspace(5, 275, 20) -y_new = np.linspace(5, 175, 20) -xs, ys = np.asarray(np.meshgrid(x_new, y_new)) -xy_new = np.asarray([xs.ravel(), ys.ravel()]).T - -with lgcp_model: - intensity_new = gp.conditional("log_intensity_new", Xnew=xy_new) - - spp_trace = pm.sample_posterior_predictive( - trace, var_names=["log_intensity_new"], keep_size=True - ) - -trace.extend(spp_trace) -intensity_samples = np.exp(trace.posterior_predictive["log_intensity_new"]) -``` - -Let's take a look at a few realizations of $\lambda(s)$. Since the samples are on the log scale, we'll need to exponentiate them to obtain the spatial intensity field of our 2D Poisson process. In the plot below, the observed point pattern is overlaid. 
- -```{code-cell} ipython3 -fig, axes = plt.subplots(2, 3, figsize=(8, 5), constrained_layout=True) -axes = axes.ravel() - -field_kwargs = {"marker": "o", "edgecolor": "None", "alpha": 0.5, "s": 80} - -for i in range(6): - field_handle = axes[i].scatter( - xy_new[:, 0], xy_new[:, 1], c=intensity_samples.sel(chain=0, draw=i), **field_kwargs - ) - - obs_handle = axes[i].scatter(data["x"], data["y"], s=10, color="k") - axes[i].axis("off") - axes[i].set_title(f"Sample {i}") - -plt.figlegend( - (obs_handle, field_handle), - ("Observed data", r"Posterior draws of $\lambda(s)$"), - ncol=2, - loc=(0.2, -0.01), - fontsize=14, - frameon=False, -); -``` - -While there is some heterogeneity in the patterns these surfaces show, we obtain a posterior mean surface with a very clearly defined spatial surface with higher intensity in the upper right and lower intensity in the lower left. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(5, 4)) - -plt.scatter( - xy_new[:, 0], - xy_new[:, 1], - c=intensity_samples.mean(("chain", "draw")), - marker="o", - alpha=0.75, - s=100, - edgecolor=None, -) - -plt.title("$E[\\lambda(s) \\vert Y]$") -plt.colorbar(label="Posterior mean"); -``` - -The spatial variation in our estimates of the intensity field may not be very meaningful if there is a lot of uncertainty. We can make a similar plot of the posterior variance (or standard deviation) in this case: - -```{code-cell} ipython3 -fig = plt.figure(figsize=(5, 4)) - -plt.scatter( - xy_new[:, 0], - xy_new[:, 1], - c=intensity_samples.var(("chain", "draw")), - marker="o", - alpha=0.75, - s=100, - edgecolor=None, -) -plt.title("$Var[\\lambda(s) \\vert Y]$"), plt.colorbar(); -``` - -The posterior variance is lowest in the middle of the domain and largest in the corners and edges. This makes sense - in locations where there is more data, we have more accurate estimates for what the values of the intensity field may be. - -+++ - -## Authors - -- This notebook was written by [Christopher Krapu](https://github.com/ckrapu) on September 6, 2020 and updated on April 1, 2021. -- Updated by Chris Fonnesbeck on May 31, 2022 for v4 compatibility. - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-binomial-regression.myst.md b/myst_nbs/generalized_linear_models/GLM-binomial-regression.myst.md deleted file mode 100644 index a8bdf8bd5..000000000 --- a/myst_nbs/generalized_linear_models/GLM-binomial-regression.myst.md +++ /dev/null @@ -1,260 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev-py39 - language: python - name: pymc-dev-py39 ---- - -(GLM-binomial-regression)= -# Binomial regression - -:::{post} February, 2022 -:tags: binomial regression, generalized linear model, -:category: beginner -:author: Benjamin T. Vincent -::: - -+++ - -This notebook covers the logic behind [Binomial regression](https://en.wikipedia.org/wiki/Binomial_regression), a specific instance of Generalized Linear Modelling. The example is kept very simple, with a single predictor variable. - -It helps to recap logistic regression to understand when binomial regression is applicable. Logistic regression is useful when your outcome variable is a set of successes or fails, that is, a series of `0`, `1` observations. An example of this kind of outcome variable is "Did you go for a run today?" 
Binomial regression (aka aggregated binomial regression) is useful when you have a certain number of successes out of $n$ trials. So the example would be, "How many days did you go for a run in the last 7 days?"

The observed data are a set of _counts_ of the number of successes out of $n$ total trials. Many people might be tempted to reduce this data to a proportion, but this is not necessarily a good idea. For example, proportions are not directly measured; they are often best treated as latent variables to be estimated. Also, a proportion loses information: a proportion of 0.5 could correspond to 1 run out of 2 days, or to 4 runs in the last 4 weeks, or many other things, but you have lost that information by paying attention to the proportion alone.

The appropriate likelihood for binomial regression is the Binomial distribution:

$$
y_i \sim \text{Binomial}(n, p_i)
$$

where $y_i$ is a count of the number of successes out of $n$ trials, and $p_i$ is the (latent) probability of success. What we want to achieve with Binomial regression is to use a linear model to accurately estimate $p_i$ (i.e. $p_i = \beta_0 + \beta_1 \cdot x_i$). So we could try to do this with a likelihood term like:

$$
y_i \sim \text{Binomial}(n, \beta_0 + \beta_1 \cdot x_i)
$$

If we did this, we would quickly run into problems when the linear model generates values of $p$ outside the range of $0-1$. This is where the link function comes in:

$$
g(p_i) = \beta_0 + \beta_1 \cdot x_i
$$

where $g()$ is a link function. This can be thought of as a transformation that maps proportions in the range $(0, 1)$ to the domain $(-\infty, +\infty)$. There are a number of potential functions that could be used, but a common one to use is the [Logit function](https://en.wikipedia.org/wiki/Logit).

However, what we actually want to do is rearrange this equation for $p_i$ so that we can enter it into the likelihood function. This results in:

$$
p_i = g^{-1}(\beta_0 + \beta_1 \cdot x_i)
$$

where $g^{-1}()$ is the inverse of the link function, in this case the inverse of the Logit function (i.e. the [logistic sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) function, also known as the expit function). So if we enter this into our likelihood function we end up with:

$$
y_i \sim \text{Binomial}(n, \text{InverseLogit}(\beta_0 + \beta_1 \cdot x_i))
$$

This defines our likelihood function. All you need now to get some Bayesian Binomial regression done is priors over the $\beta$ parameters. The observed data are $y_i$, $n$, and $x_i$.

```{code-cell} ipython3
:tags: []

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

from scipy.special import expit
```

```{code-cell} ipython3
:tags: []

%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")
rng = np.random.default_rng(1234)
```

## Generate data

```{code-cell} ipython3
# true params
beta0_true = 0.7
beta1_true = 0.4
# number of yes/no questions
n = 20

sample_size = 30
x = np.linspace(-10, 20, sample_size)
# Linear model
mu_true = beta0_true + beta1_true * x
# transformation (inverse logit function = expit)
p_true = expit(mu_true)
# Generate data
y = rng.binomial(n, p_true)
# bundle data into dataframe
data = pd.DataFrame({"x": x, "y": y})
```

We can see that the underlying data $y$ is count data, out of $n$ total trials.
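As a quick numeric check of the link-function logic above: the linear predictor can be any real number, but the inverse logit (`expit`) always maps it back into the $(0, 1)$ interval required for a probability.

```{code-cell} ipython3
# Small check of the inverse link: real-valued predictors map into (0, 1).
for linear_predictor in [-3.0, 0.0, 3.0]:
    print(f"{linear_predictor:+.1f} -> p = {expit(linear_predictor):.3f}")
```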
- -```{code-cell} ipython3 -data.head() -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -# Plot underlying linear model -fig, ax = plt.subplots(2, 1, figsize=(9, 6), sharex=True) -ax[0].plot(x, mu_true, label=r"$β_0 + β_1 \cdot x_i$") -ax[0].set(xlabel="$x$", ylabel=r"$β_0 + β_1 \cdot x_i$", title="Underlying linear model") -ax[0].legend() - -# Plot GLM -freq = ax[1].twinx() # instantiate a second axes that shares the same x-axis -freq.set_ylabel("number of successes") -freq.scatter(x, y, color="k") -# plot proportion related stuff on ax[1] -ax[1].plot(x, p_true, label=r"$g^{-1}(β_0 + β_1 \cdot x_i)$") -ax[1].set_ylabel("proportion successes", color="b") -ax[1].tick_params(axis="y", labelcolor="b") -ax[1].set(xlabel="$x$", title="Binomial regression") -ax[1].legend() -# get y-axes to line up -y_buffer = 1 -freq.set(ylim=[-y_buffer, n + y_buffer]) -ax[1].set(ylim=[-(y_buffer / n), 1 + (y_buffer / n)]) -freq.grid(None) -``` - -The top panel shows the (untransformed) linear model. We can see that the linear model is generating values outside the range $0-1$, making clear the need for an inverse link function, $g^{-1}()$ which converts from the domain of $(-\infty, +\infty) \rightarrow (0, 1)$. As we've seen, this is done by the inverse logistic function (aka logistic sigmoid). - -+++ - -## Binomial regression model - -Technically, we don't need to supply `coords`, but providing this (a list of observation values) helps when reshaping arrays of data later on. The information in `coords` is used by the `dims` kwarg in the model. - -```{code-cell} ipython3 -coords = {"observation": data.index.values} - -with pm.Model(coords=coords) as binomial_regression_model: - x = pm.ConstantData("x", data["x"], dims="observation") - # priors - beta0 = pm.Normal("beta0", mu=0, sigma=1) - beta1 = pm.Normal("beta1", mu=0, sigma=1) - # linear model - mu = beta0 + beta1 * x - p = pm.Deterministic("p", pm.math.invlogit(mu), dims="observation") - # likelihood - pm.Binomial("y", n=n, p=p, observed=data["y"], dims="observation") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(binomial_regression_model) -``` - -```{code-cell} ipython3 -with binomial_regression_model: - idata = pm.sample(1000, tune=2000) -``` - -Confirm no inference issues by visual inspection of chain. We've got no warnings about divergences, $\hat{R}$, or effective sample size. Everything looks good. - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["beta0", "beta1"]); -``` - -## Examine results -The code below plots out model predictions in data space, and our posterior beliefs in parameter space. 
- -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(1, 2, figsize=(9, 4), gridspec_kw={"width_ratios": [2, 1]}) - -# Data space plot ======================================================== -az.plot_hdi( - data["x"], - idata.posterior.p, - hdi_prob=0.95, - fill_kwargs={"alpha": 0.25, "linewidth": 0}, - ax=ax[0], - color="C1", -) -# posterior mean -post_mean = idata.posterior.p.mean(("chain", "draw")) -ax[0].plot(data["x"], post_mean, label="posterior mean", color="C1") -# plot truth -ax[0].plot(data["x"], p_true, "--", label="true", color="C2") -# formatting -ax[0].set(xlabel="x", title="Data space") -ax[0].set_ylabel("proportion successes", color="C1") -ax[0].tick_params(axis="y", labelcolor="C1") -ax[0].legend() -# instantiate a second axes that shares the same x-axis -freq = ax[0].twinx() -freq.set_ylabel("number of successes") -freq.scatter(data["x"], data["y"], color="k", label="data") -# get y-axes to line up -y_buffer = 1 -freq.set(ylim=[-y_buffer, n + y_buffer]) -ax[0].set(ylim=[-(y_buffer / n), 1 + (y_buffer / n)]) -freq.grid(None) -# set both y-axis to have 5 ticks -ax[0].set(yticks=np.linspace(0, 20, 5) / n) -freq.set(yticks=np.linspace(0, 20, 5)) - -# Parameter space plot =================================================== -az.plot_kde( - idata.posterior.stack(sample=("chain", "draw")).beta0.values, - idata.posterior.stack(sample=("chain", "draw")).beta1.values, - contourf_kwargs={"cmap": "Blues"}, - ax=ax[1], -) -ax[1].plot(beta0_true, beta1_true, "C2o", label="true") -ax[1].set(xlabel=r"$\beta_0$", ylabel=r"$\beta_1$", title="Parameter space") -ax[1].legend(facecolor="white", frameon=True); -``` - -The left panel shows the posterior mean (solid line) and 95% credible intervals (shaded region). Because we are working with simulated data, we know what the true model is, so we can see that the posterior mean compares favourably with the true data generating model. - -This is also shown by the posterior distribution over parameter space (right panel), which does well when comparing to the true data generating parameters. - -Using binomial regression in real data analysis situations would probably involve more predictor variables, and correspondingly more model parameters, but hopefully this example has demonstrated the logic behind binomial regression. - -A good introduction to generalized linear models is provided by {cite:t}`roback2021beyond` which is available in hardcopy and [free online](https://bookdown.org/roback/bookdown-BeyondMLR/). - -+++ - -## Authors -- Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in July 2021 -- Updated by [Benjamin T. 
Vincent](https://github.com/drbenvincent) in February 2022 - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -:tags: [] - -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl -``` - -:::{include} ../page_footer.md ::: diff --git a/myst_nbs/generalized_linear_models/GLM-hierarchical-binomial-model.myst.md b/myst_nbs/generalized_linear_models/GLM-hierarchical-binomial-model.myst.md deleted file mode 100644 index 73c9bdba0..000000000 --- a/myst_nbs/generalized_linear_models/GLM-hierarchical-binomial-model.myst.md +++ /dev/null @@ -1,283 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.7 ('base') - language: python - name: python3 ---- - -# Hierarchical Binomial Model: Rat Tumor Example -:::{post} Nov 11, 2021 -:tags: generalized linear model, hierarchical model -:category: intermediate -:author: Demetri Pananos, Junpeng Lao, Raúl Maldonado, Farhan Reynaldo -::: - -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from scipy.special import gammaln -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -This short tutorial demonstrates how to use PyMC to do inference for the rat tumour example found in chapter 5 of *Bayesian Data Analysis 3rd Edition* {cite:p}`gelman2013bayesian`. Readers should already be familiar with the PyMC API. - -Suppose we are interested in the probability that a lab rat develops endometrial stromal polyps. We have data from 71 previously performed trials and would like to use this data to perform inference. - -The authors of BDA3 choose to model this problem hierarchically. Let $y_i$ be the number of lab rats which develop endometrial stromal polyps out of a possible $n_i$. We model the number rodents which develop endometrial stromal polyps as binomial - -$$ y_i \sim \operatorname{Bin}(\theta_i;n_i)$$ - -allowing the probability of developing an endometrial stromal polyp (i.e. $\theta_i$) to be drawn from some population distribution. For analytical tractability, we assume that $\theta_i$ has Beta distribution - -$$ \theta_i \sim \operatorname{Beta}(\alpha, \beta)$$ - -We are free to specify a prior distribution for $\alpha, \beta$. We choose a weakly informative prior distribution to reflect our ignorance about the true values of $\alpha, \beta$. The authors of BDA3 choose the joint hyperprior for $\alpha, \beta$ to be - -$$ p(\alpha, \beta) \propto (\alpha + \beta) ^{-5/2}$$ - -For more information, please see *Bayesian Data Analysis 3rd Edition* pg. 110. - -+++ - -## A Directly Computed Solution - -Our joint posterior distribution is - -$$p(\alpha,\beta,\theta \lvert y) -\propto -p(\alpha, \beta) -p(\theta \lvert \alpha,\beta) -p(y \lvert \theta)$$ - -which can be rewritten in such a way so as to obtain the marginal posterior distribution for $\alpha$ and $\beta$, namely - -$$ p(\alpha, \beta, \lvert y) = -p(\alpha, \beta) -\prod_{i = 1}^{N} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} -\dfrac{\Gamma(\alpha+y_i)\Gamma(\beta+n_i - y_i)}{\Gamma(\alpha+\beta+n_i)}$$ - - -See BDA3 pg. 110 for a more information on the deriving the marginal posterior distribution. 
With a little determination, we can plot the marginal posterior and estimate the means of $\alpha$ and $\beta$ without having to resort to MCMC. We will see, however, that this requires considerable effort. - -The authors of BDA3 choose to plot the surface under the parameterization $(\log(\alpha/\beta), \log(\alpha+\beta))$. We do so as well. Through the remainder of the example let $x = \log(\alpha/\beta)$ and $z = \log(\alpha+\beta)$. - -```{code-cell} ipython3 -# rat data (BDA3, p. 102) -# fmt: off -y = np.array([ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, - 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 5, 2, - 5, 3, 2, 7, 7, 3, 3, 2, 9, 10, 4, 4, 4, 4, 4, 4, 4, - 10, 4, 4, 4, 5, 11, 12, 5, 5, 6, 5, 6, 6, 6, 6, 16, 15, - 15, 9, 4 -]) -n = np.array([ - 20, 20, 20, 20, 20, 20, 20, 19, 19, 19, 19, 18, 18, 17, 20, 20, 20, - 20, 19, 19, 18, 18, 25, 24, 23, 20, 20, 20, 20, 20, 20, 10, 49, 19, - 46, 27, 17, 49, 47, 20, 20, 13, 48, 50, 20, 20, 20, 20, 20, 20, 20, - 48, 19, 19, 19, 22, 46, 49, 20, 20, 23, 19, 22, 20, 20, 20, 52, 46, - 47, 24, 14 -]) -# fmt: on - -N = len(n) -``` - -```{code-cell} ipython3 -# Compute on log scale because products turn to sums -def log_likelihood(alpha, beta, y, n): - LL = 0 - - # Summing over data - for Y, N in zip(y, n): - LL += ( - gammaln(alpha + beta) - - gammaln(alpha) - - gammaln(beta) - + gammaln(alpha + Y) - + gammaln(beta + N - Y) - - gammaln(alpha + beta + N) - ) - - return LL - - -def log_prior(A, B): - - return -5 / 2 * np.log(A + B) - - -def trans_to_beta(x, y): - - return np.exp(y) / (np.exp(x) + 1) - - -def trans_to_alpha(x, y): - - return np.exp(x) * trans_to_beta(x, y) - - -# Create space for the parameterization in which we wish to plot -X, Z = np.meshgrid(np.arange(-2.3, -1.3, 0.01), np.arange(1, 5, 0.01)) -param_space = np.c_[X.ravel(), Z.ravel()] -df = pd.DataFrame(param_space, columns=["X", "Z"]) - -# Transform the space back to alpha beta to compute the log-posterior -df["alpha"] = trans_to_alpha(df.X, df.Z) -df["beta"] = trans_to_beta(df.X, df.Z) - -df["log_posterior"] = log_prior(df.alpha, df.beta) + log_likelihood(df.alpha, df.beta, y, n) -df["log_jacobian"] = np.log(df.alpha) + np.log(df.beta) - -df["transformed"] = df.log_posterior + df.log_jacobian -df["exp_trans"] = np.exp(df.transformed - df.transformed.max()) - -# This will ensure the density is normalized -df["normed_exp_trans"] = df.exp_trans / df.exp_trans.sum() - - -surface = df.set_index(["X", "Z"]).exp_trans.unstack().values.T -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 8)) -ax.contourf(X, Z, surface) -ax.set_xlabel(r"$\log(\alpha/\beta)$", fontsize=16) -ax.set_ylabel(r"$\log(\alpha+\beta)$", fontsize=16) - -ix_z, ix_x = np.unravel_index(np.argmax(surface, axis=None), surface.shape) -ax.scatter([X[0, ix_x]], [Z[ix_z, 0]], color="red") - -text = r"$({a},{b})$".format(a=np.round(X[0, ix_x], 2), b=np.round(Z[ix_z, 0], 2)) - -ax.annotate( - text, - xy=(X[0, ix_x], Z[ix_z, 0]), - xytext=(-1.6, 3.5), - ha="center", - fontsize=16, - color="white", - arrowprops={"facecolor": "white"}, -); -``` - -The plot shows that the posterior is roughly symmetric about the mode (-1.79, 2.74). This corresponds to $\alpha = 2.21$ and $\beta = 13.27$. 
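Two remarks on the code above. First, because the surface is evaluated over the transformed coordinates $x$ and $z$ rather than over $(\alpha, \beta)$ directly, the density needs a Jacobian correction; for this transformation the Jacobian determinant works out to

$$
\left| \frac{\partial(\alpha, \beta)}{\partial(x, z)} \right| = \alpha\beta,
$$

which is exactly the `log_jacobian = np.log(df.alpha) + np.log(df.beta)` term added to the log-posterior. Second, the inverse-transform helpers defined above let us confirm that the plotted mode maps back to the quoted $(\alpha, \beta)$ values:

```{code-cell} ipython3
# Map the mode in (x, z) coordinates back to (alpha, beta) using the helpers above.
mode_x, mode_z = -1.79, 2.74
trans_to_alpha(mode_x, mode_z), trans_to_beta(mode_x, mode_z)
```

This returns approximately $(2.2, 13.3)$, matching the values stated above.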
We can compute the marginal means as the authors of BDA3 do, using - -$$ \operatorname{E}(\alpha \lvert y) \text{ is estimated by } -\sum_{x,z} \alpha p(x,z\lvert y) $$ - -$$ \operatorname{E}(\beta \lvert y) \text{ is estimated by } -\sum_{x,z} \beta p(x,z\lvert y) $$ - -```{code-cell} ipython3 -# Estimated mean of alpha -(df.alpha * df.normed_exp_trans).sum().round(3) -``` - -```{code-cell} ipython3 -# Estimated mean of beta -(df.beta * df.normed_exp_trans).sum().round(3) -``` - -## Computing the Posterior using PyMC - -Computing the marginal posterior directly is a lot of work, and is not always possible for sufficiently complex models. - -On the other hand, creating hierarchical models in PyMC is simple. We can use the samples obtained from the posterior to estimate the means of $\alpha$ and $\beta$. - -```{code-cell} ipython3 -coords = { - "obs_id": np.arange(N), - "param": ["alpha", "beta"], -} -``` - -```{code-cell} ipython3 -def logp_ab(value): - """prior density""" - return at.log(at.pow(at.sum(value), -5 / 2)) - - -with pm.Model(coords=coords) as model: - # Uninformative prior for alpha and beta - n_val = pm.ConstantData("n_val", n) - ab = pm.HalfNormal("ab", sigma=10, dims="param") - pm.Potential("p(a, b)", logp_ab(ab)) - - X = pm.Deterministic("X", at.log(ab[0] / ab[1])) - Z = pm.Deterministic("Z", at.log(at.sum(ab))) - - theta = pm.Beta("theta", alpha=ab[0], beta=ab[1], dims="obs_id") - - p = pm.Binomial("y", p=theta, observed=y, n=n_val) - trace = pm.sample(draws=2000, tune=2000, target_accept=0.95) -``` - -```{code-cell} ipython3 -# Check the trace. Looks good! -az.plot_trace(trace, var_names=["ab", "X", "Z"], compact=False); -``` - -We can plot a kernel density estimate for $x$ and $y$. It looks rather similar to our contour plot made from the analytic marginal posterior density. That's a good sign, and required far less effort. - -```{code-cell} ipython3 -az.plot_pair(trace, var_names=["X", "Z"], kind="kde"); -``` - -From here, we could use the trace to compute the mean of the distribution. - -```{code-cell} ipython3 -az.plot_posterior(trace, var_names=["ab"]); -``` - -```{code-cell} ipython3 -# estimate the means from the samples -trace.posterior["ab"].mean(("chain", "draw")) -``` - -## Conclusion - -Analytically calculating statistics for posterior distributions is difficult if not impossible for some models. PyMC provides an easy way drawing samples from your model's posterior with only a few lines of code. Here, we used PyMC to obtain estimates of the posterior mean for the rat tumor example in chapter 5 of BDA3. The estimates obtained from PyMC are encouragingly close to the estimates obtained from the analytical posterior density. 
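One more optional check, not part of the original analysis: the hierarchical structure partially pools the individual experiments, so the posterior means of $\theta_i$ are shrunk from the raw rates $y_i/n_i$ toward the population mean. Comparing their ranges makes this visible.

```{code-cell} ipython3
# Compare raw tumour rates with the partially pooled posterior means of theta.
raw_rate = y / n
post_theta = trace.posterior["theta"].mean(("chain", "draw")).values

print(f"raw rates:            min={raw_rate.min():.3f}, max={raw_rate.max():.3f}")
print(f"posterior mean theta: min={post_theta.min():.3f}, max={post_theta.max():.3f}")
```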
- -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors - -* Adapted from chapter 5 of Bayesian Data Analysis 3rd Edition {cite:p}`gelman2013bayesian` by Demetri Pananos and Junpeng Lao ([pymc#3054](https://github.com/pymc-devs/pymc/pull/3054)) -* Updated and reexecuted by Raúl Maldonado ([pymc-examples#24](https://github.com/pymc-devs/pymc-examples/pull/24), [pymc-examples#45](https://github.com/pymc-devs/pymc-examples/pull/45) and [pymc-examples#147](https://github.com/pymc-devs/pymc-examples/pull/147)) -* Updated and reexecuted by Farhan Reynaldo in November 2021 ([pymc-examples#248](https://github.com/pymc-devs/pymc-examples/pull/248)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-model-selection.myst.md b/myst_nbs/generalized_linear_models/GLM-model-selection.myst.md deleted file mode 100644 index 50f2da23a..000000000 --- a/myst_nbs/generalized_linear_models/GLM-model-selection.myst.md +++ /dev/null @@ -1,621 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GLM-model-selection)= -# GLM: Model Selection - -:::{post} Jan 8, 2022 -:tags: cross validation, generalized linear model, loo, model comparison, waic -:category: intermediate -:author: Jon Sedar, Junpeng Lao, Abhipsha Das, Oriol Abril-Pla -::: - -```{code-cell} ipython3 -import arviz as az -import bambi as bmb -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import seaborn as sns -import xarray as xr - -from ipywidgets import fixed, interactive - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) - -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -plt.rcParams["figure.constrained_layout.use"] = False -``` - -## Introduction -A fairly minimal reproducible example of Model Selection using WAIC, and LOO as currently implemented in PyMC3. - -This example creates two toy datasets under linear and quadratic models, and then tests the fit of a range of polynomial linear models upon those datasets by using Widely Applicable Information Criterion (WAIC), and leave-one-out (LOO) cross-validation using Pareto-smoothed importance sampling (PSIS). - -The example was inspired by Jake Vanderplas' [blogpost](https://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/) on model selection, although Cross-Validation and Bayes Factor comparison are not implemented. The datasets are tiny and generated within this Notebook. They contain errors in the measured value (y) only. - -+++ - -## Local Functions - -We start writing some functions to help with the rest of the notebook. Only the some functions are key to understanding the notebook, the rest are convenience functions to make plotting more concise when needed and are hidden inside a toggle-able section; it is still available but you need to click to see it. - -```{code-cell} ipython3 -def generate_data(n=20, p=0, a=1, b=1, c=0, latent_sigma_y=20, seed=5): - """ - Create a toy dataset based on a very simple model that we might - imagine is a noisy physical process: - 1. 
random x values within a range - 2. latent error aka inherent noise in y - 3. optionally create labelled outliers with larger noise - - Model form: y ~ a + bx + cx^2 + e - - NOTE: latent_sigma_y is used to create a normally distributed, - 'latent error' aka 'inherent noise' in the 'physical' generating - process, rather than experimental measurement error. - Please don't use the returned `latent_error` values in inferential - models, it's returned in the dataframe for interest only. - """ - rng = np.random.default_rng(seed) - df = pd.DataFrame({"x": rng.choice(np.arange(100), n, replace=False)}) - - # create linear or quadratic model - df["y"] = a + b * (df["x"]) + c * (df["x"]) ** 2 - - # create latent noise and marked outliers - df["latent_error"] = rng.normal(0, latent_sigma_y, n) - df["outlier_error"] = rng.normal(0, latent_sigma_y * 10, n) - df["outlier"] = rng.binomial(1, p, n) - - # add noise, with extreme noise for marked outliers - df["y"] += (1 - df["outlier"]) * df["latent_error"] - df["y"] += df["outlier"] * df["outlier_error"] - - # round - for col in ["y", "latent_error", "outlier_error", "x"]: - df[col] = np.round(df[col], 3) - - # add label - df["source"] = "linear" if c == 0 else "quadratic" - - # create simple linspace for plotting true model - plotx = np.linspace( - df["x"].min() - np.ptp(df["x"].values) * 0.1, - df["x"].max() + np.ptp(df["x"].values) * 0.1, - 100, - ) - - ploty = a + b * plotx + c * plotx**2 - dfp = pd.DataFrame({"x": plotx, "y": ploty}) - - return df, dfp -``` - -```{code-cell} ipython3 -:tags: [hide-cell] - -def interact_dataset(n=20, p=0, a=-30, b=5, c=0, latent_sigma_y=20): - """ - Convenience function: - Interactively generate dataset and plot - """ - - df, dfp = generate_data(n, p, a, b, c, latent_sigma_y) - - g = sns.FacetGrid( - df, - height=8, - hue="outlier", - hue_order=[True, False], - palette=sns.color_palette("bone"), - legend_out=False, - ) - - g.map( - plt.errorbar, - "x", - "y", - "latent_error", - marker="o", - ms=10, - mec="w", - mew=2, - ls="", - elinewidth=0.7, - ).add_legend() - - plt.plot(dfp["x"], dfp["y"], "--", alpha=0.8) - - plt.subplots_adjust(top=0.92) - g.fig.suptitle("Sketch of Data Generation ({})".format(df["source"][0]), fontsize=16) - - -def plot_datasets(df_lin, df_quad, dfp_lin, dfp_quad): - """ - Convenience function: - Plot the two generated datasets in facets with generative model - """ - - df = pd.concat((df_lin, df_quad), axis=0) - - g = sns.FacetGrid(col="source", hue="source", data=df, height=6, sharey=False, legend_out=False) - - g.map(plt.scatter, "x", "y", alpha=0.7, s=100, lw=2, edgecolor="w") - - g.axes[0][0].plot(dfp_lin["x"], dfp_lin["y"], "--", alpha=0.6, color="C0") - g.axes[0][1].plot(dfp_quad["x"], dfp_quad["y"], "--", alpha=0.6, color="C0") - - -def plot_annotated_trace(traces): - """ - Convenience function: - Plot traces with overlaid means and values - """ - - summary = az.summary(traces, stat_funcs={"mean": np.mean}, extend=False) - ax = az.plot_trace( - traces, - lines=tuple([(k, {}, v["mean"]) for k, v in summary.iterrows()]), - ) - - for i, mn in enumerate(summary["mean"].values): - ax[i, 0].annotate( - f"{mn:.2f}", - xy=(mn, 0), - xycoords="data", - xytext=(5, 10), - textcoords="offset points", - rotation=90, - va="bottom", - fontsize="large", - color="C0", - ) - - -def plot_posterior_cr(models, idatas, rawdata, xlims, datamodelnm="linear", modelnm="k1"): - """ - Convenience function: - Plot posterior predictions with credible regions shown as filled areas. 
- """ - - # Get traces and calc posterior prediction for npoints in x - npoints = 100 - mdl = models[modelnm] - trc = idatas[modelnm].posterior.copy().drop_vars("y_sigma") - da = xr.concat([var for var in trc.values()], dim="order") - - ordr = int(modelnm[-1:]) - x = xr.DataArray(np.linspace(xlims[0], xlims[1], npoints), dims=["x_plot"]) - pwrs = xr.DataArray(np.arange(ordr + 1), dims=["order"]) - X = x**pwrs - cr = xr.dot(X, da, dims="order") - - # Calculate credible regions and plot over the datapoints - qs = cr.quantile([0.025, 0.25, 0.5, 0.75, 0.975], dim=("chain", "draw")) - - f, ax1d = plt.subplots(1, 1, figsize=(7, 7)) - f.suptitle( - f"Posterior Predictive Fit -- Data: {datamodelnm} -- Model: {modelnm}", - fontsize=16, - ) - - ax1d.fill_between( - x, qs.sel(quantile=0.025), qs.sel(quantile=0.975), alpha=0.5, color="C0", label="CR 95%" - ) - ax1d.fill_between( - x, qs.sel(quantile=0.25), qs.sel(quantile=0.75), alpha=0.5, color="C3", label="CR 50%" - ) - ax1d.plot(x, qs.sel(quantile=0.5), alpha=0.6, color="C4", label="Median") - ax1d.scatter(rawdata["x"], rawdata["y"], alpha=0.7, s=100, lw=2, edgecolor="w") - - ax1d.legend() - ax1d.set_xlim(xlims) -``` - -## Generate toy datasets - -### Interactively draft data - -Throughout the rest of the Notebook, we'll use two toy datasets created by a linear and a quadratic model respectively, so that we can better evaluate the fit of the model selection. - -Right now, lets use an interactive session to play around with the data generation function in this Notebook, and get a feel for the possibilities of data we could generate. - - -$$y_{i} = a + bx_{i} + cx_{i}^{2} + \epsilon_{i}$$ - -where: -$i \in n$ datapoints - -$$\epsilon \sim \mathcal{N}(0,latent\_sigma\_y)$$ - -:::{admonition} Note on outliers -+ We can use value `p` to set the (approximate) proportion of 'outliers' under a bernoulli distribution. -+ These outliers have a 10x larger `latent_sigma_y` -+ These outliers are labelled in the returned datasets and may be useful for other modelling, see another example Notebook: {ref}`GLM-robust-with-outlier-detection` -::: - -```{code-cell} ipython3 -interactive( - interact_dataset, - n=[5, 50, 5], - p=[0, 0.5, 0.05], - a=[-50, 50], - b=[-10, 10], - c=[-3, 3], - latent_sigma_y=[0, 1000, 50], -) -``` - -**Observe:** - -+ I've shown the `latent_error` in errorbars, but this is for interest only, since this shows the _inherent noise_ in whatever 'physical process' we imagine created the data. -+ There is no _measurement error_. -+ Datapoints created as outliers are shown in **red**, again for interest only. - -+++ - -### Create datasets for modelling - -+++ - -We can use the above interactive plot to get a feel for the effect of the params. Now we'll create 2 fixed datasets to use for the remainder of the Notebook. - -1. For a start, we'll create a linear model with small noise. Keep it simple. -2. Secondly, a quadratic model with small noise - -```{code-cell} ipython3 -n = 30 -df_lin, dfp_lin = generate_data(n=n, p=0, a=-30, b=5, c=0, latent_sigma_y=40, seed=RANDOM_SEED) -df_quad, dfp_quad = generate_data(n=n, p=0, a=-200, b=2, c=3, latent_sigma_y=500, seed=RANDOM_SEED) -``` - -Scatterplot against model line - -```{code-cell} ipython3 -plot_datasets(df_lin, df_quad, dfp_lin, dfp_quad) -``` - -**Observe:** - -+ We now have two datasets `df_lin` and `df_quad` created by a linear model and quadratic model respectively. 
-+ You can see this raw data, the ideal model fit and the effect of the latent noise in the scatterplots above -+ In the following plots in this Notebook, the linear-generated data will be shown in Blue and the quadratic in Green. - -+++ - -### Standardize - -```{code-cell} ipython3 -dfs_lin = df_lin.copy() -dfs_lin["x"] = (df_lin["x"] - df_lin["x"].mean()) / df_lin["x"].std() - -dfs_quad = df_quad.copy() -dfs_quad["x"] = (df_quad["x"] - df_quad["x"].mean()) / df_quad["x"].std() -``` - -Create ranges for later ylim xim - -```{code-cell} ipython3 -dfs_lin_xlims = ( - dfs_lin["x"].min() - np.ptp(dfs_lin["x"].values) / 10, - dfs_lin["x"].max() + np.ptp(dfs_lin["x"].values) / 10, -) - -dfs_lin_ylims = ( - dfs_lin["y"].min() - np.ptp(dfs_lin["y"].values) / 10, - dfs_lin["y"].max() + np.ptp(dfs_lin["y"].values) / 10, -) - -dfs_quad_ylims = ( - dfs_quad["y"].min() - np.ptp(dfs_quad["y"].values) / 10, - dfs_quad["y"].max() + np.ptp(dfs_quad["y"].values) / 10, -) -``` - -## Demonstrate simple linear model - -This *linear model* is really simple and conventional, an OLS with L2 constraints (Ridge Regression): - -$$y = a + bx + \epsilon$$ - -+++ - -### Define model using explicit PyMC3 method - -```{code-cell} ipython3 -with pm.Model() as mdl_ols: - ## define Normal priors to give Ridge regression - b0 = pm.Normal("Intercept", mu=0, sigma=100) - b1 = pm.Normal("x", mu=0, sigma=100) - - ## define Linear model - yest = b0 + b1 * df_lin["x"] - - ## define Normal likelihood with HalfCauchy noise (fat tails, equiv to HalfT 1DoF) - y_sigma = pm.HalfCauchy("y_sigma", beta=10) - likelihood = pm.Normal("likelihood", mu=yest, sigma=y_sigma, observed=df_lin["y"]) - - idata_ols = pm.sample(2000, return_inferencedata=True) -``` - -```{code-cell} ipython3 -plt.rcParams["figure.constrained_layout.use"] = True -plot_annotated_trace(idata_ols) -``` - -**Observe:** - -+ This simple OLS manages to make fairly good guesses on the model parameters - the data has been generated fairly simply after all - but it does appear to have been fooled slightly by the inherent noise. - -+++ - -### Define model using Bambi - -Bambi can be used for defining models using a `formulae`-style formula syntax. This seems really useful, especially for defining simple regression models in fewer lines of code. - -Here's the same OLS model as above, defined using `bambi`. - -```{code-cell} ipython3 -# Define priors for intercept and regression coefficients. -priors = { - "Intercept": bmb.Prior("Normal", mu=0, sigma=100), - "x": bmb.Prior("Normal", mu=0, sigma=100), -} - -model = bmb.Model("y ~ 1 + x", df_lin, priors=priors, family="gaussian") - -idata_ols_glm = model.fit(draws=2000, tune=2000) -``` - -```{code-cell} ipython3 -plot_annotated_trace(idata_ols_glm) -``` - -**Observe:** - -+ This `bambi`-defined model appears to behave in a very similar way, and finds the same parameter values as the conventionally-defined model - any differences are due to the random nature of the sampling. -+ We can quite happily use the `bambi` syntax for further models below, since it allows us to create a small model factory very easily. - -+++ - -## Create higher-order linear models - -Back to the real purpose of this Notebook, to demonstrate model selection. - -First, let's create and run a set of polynomial models on each of our toy datasets. By default this is for models of order 1 to 5. 
- -### Create and run polynomial models - -We're creating 5 polynomial models and fitting each to the chosen dataset using the functions `create_poly_modelspec` and `run_models` below. - -```{code-cell} ipython3 -def create_poly_modelspec(k=1): - """ - Convenience function: - Create a polynomial modelspec string for bambi - """ - return ("y ~ 1 + x " + " ".join([f"+ np.power(x,{j})" for j in range(2, k + 1)])).strip() - - -def run_models(df, upper_order=5): - """ - Convenience function: - Fit a range of pymc3 models of increasing polynomial complexity. - Suggest limit to max order 5 since calculation time is exponential. - """ - - models, results = dict(), dict() - - for k in range(1, upper_order + 1): - - nm = f"k{k}" - fml = create_poly_modelspec(k) - - print(f"\nRunning: {nm}") - - models[nm] = bmb.Model( - fml, df, priors={"intercept": bmb.Prior("Normal", mu=0, sigma=100)}, family="gaussian" - ) - results[nm] = models[nm].fit(draws=2000, tune=1000, init="advi+adapt_diag") - - return models, results -``` - -```{code-cell} ipython3 -:tags: [hide-output] - -models_lin, idatas_lin = run_models(dfs_lin, 5) -``` - -```{code-cell} ipython3 -:tags: [hide-output] - -models_quad, idatas_quad = run_models(dfs_quad, 5) -``` - -## View posterior predictive fit - -Just for the linear, generated data, lets take an interactive look at the posterior predictive fit for the models k1 through k5. - -As indicated by the likelhood plots above, the higher-order polynomial models exhibit some quite wild swings in the function in order to (over)fit the data - -```{code-cell} ipython3 -interactive( - plot_posterior_cr, - models=fixed(models_lin), - idatas=fixed(idatas_lin), - rawdata=fixed(dfs_lin), - xlims=fixed(dfs_lin_xlims), - datamodelnm=fixed("linear"), - modelnm=["k1", "k2", "k3", "k4", "k5"], -) -``` - -## Compare models using WAIC - -The Widely Applicable Information Criterion (WAIC) can be used to calculate the goodness-of-fit of a model using numerical techniques. See {cite:t}`watanabe2010asymptotic` for details. - -+++ - -**Observe:** - -We get three different measurements: -- waic: widely applicable information criterion (or "Watanabe–Akaike information criterion") -- waic_se: standard error of waic -- p_waic: effective number parameters - -In this case we are interested in the WAIC score. We also plot error bars for the standard error of the estimated scores. This gives us a more accurate view of how much they might differ. - -```{code-cell} ipython3 -dfwaic_lin = az.compare(idatas_lin, ic="WAIC") -dfwaic_quad = az.compare(idatas_quad, ic="WAIC") -``` - -```{code-cell} ipython3 -dfwaic_lin -``` - -```{code-cell} ipython3 -dfwaic_quad -``` - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 2) - -ax = axs[0] -az.plot_compare(dfwaic_lin, ax=ax) -ax.set_title("Linear data") - -ax = axs[1] -az.plot_compare(dfwaic_quad, ax=ax) -ax.set_title("Quadratic data"); -``` - -**Observe** - -+ We should prefer the model(s) with higher WAIC - - -+ Linear-generated data (lhs): - + The WAIC seems quite flat across models - + The WAIC seems best (highest) for simpler models. - - -+ Quadratic-generated data (rhs): - + The WAIC is also quite flat across the models - + The worst WAIC is for **k1**, it is not flexible enough to properly fit the data. - + WAIC is quite flat for the rest, but the highest is for **k2** as should be and it decreases as the order increases. The higher the order the higher the complexity of the model, but the goodness of fit is basically the same. 
As models with higher complexity are penalized we can see how we land at the sweet spot of choosing the simplest model that can fit the data. - -+++ - -## Compare leave-one-out Cross-Validation [LOO] - -Leave-One-Out Cross-Validation or K-fold Cross-Validation is another quite universal approach for model selection. However, to implement K-fold cross-validation we need to partition the data repeatedly and fit the model on every partition. It can be very time consumming (computation time increase roughly as a factor of K). Here we are applying the numerical approach using the posterior trace as suggested in {cite:t}`vehtari2017practical` - -```{code-cell} ipython3 -dfloo_lin = az.compare(idatas_lin, ic="LOO") -dfloo_quad = az.compare(idatas_quad, ic="LOO") -``` - -```{code-cell} ipython3 -dfloo_lin -``` - -```{code-cell} ipython3 -dfloo_quad -``` - -```{code-cell} ipython3 -_, axs = plt.subplots(1, 2) - -ax = axs[0] -az.plot_compare(dfloo_lin, ax=ax) -ax.set_title("Linear data") - -ax = axs[1] -az.plot_compare(dfloo_quad, ax=ax) -ax.set_title("Quadratic data"); -``` - -**Observe** - -+ We should prefer the model(s) with higher LOO. You can see that LOO is nearly identical with WAIC. That's because WAIC is asymptotically equal to LOO. However, PSIS-LOO is supposedly more robust than WAIC in the finite case (under weak priors or influential observation). - - -+ Linear-generated data (lhs): - + The LOO is also quite flat across models - + The LOO is also seems best (highest) for simpler models. - - -+ Quadratic-generated data (rhs): - + The same pattern as the WAIC - -+++ - -## Final remarks and tips - -It is important to keep in mind that, with more data points, the real underlying model (one that we used to generate the data) should outperform other models. - -There is some agreement that PSIS-LOO offers the best indication of a model's quality. To quote from [avehtari's comment](https://github.com/pymc-devs/pymc3/issues/938#issuecomment-313425552): "I also recommend using PSIS-LOO instead of WAIC, because it's more reliable and has better diagnostics as discussed in {cite:t}`vehtari2017practical`, but if you insist to have one information criterion then leave WAIC". - -Alternatively, Watanabe [says](http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/index.html) "WAIC is a better approximator of the generalization error than the pareto smoothing importance sampling cross validation. The Pareto smoothing cross validation may be the better approximator of the cross validation than WAIC, however, it is not of the generalization error". 
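-
-A minimal sketch of the PSIS-LOO diagnostics mentioned above (assuming the `idatas_lin` dictionary of fitted models from earlier): the standard ArviZ calls `az.loo` and `az.plot_khat` expose the Pareto $\hat{k}$ values underlying PSIS-LOO, and values above roughly 0.7 flag observations for which the approximation is unreliable.
-
-```{code-cell} ipython3
-# Pointwise PSIS-LOO for one of the fitted models, keeping the Pareto k diagnostic
-loo_k1 = az.loo(idatas_lin["k1"], pointwise=True)
-print(loo_k1)
-
-# One point per observation; high k values indicate influential / problematic datapoints
-az.plot_khat(loo_k1);
-```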
- -+++ - -## References - -:::{bibliography} -:filter: docname in docnames - -ando2007bayesian -spiegelhalter2002bayesian -::: - -:::{seealso} -+ Thomas Wiecki's [detailed response](https://stats.stackexchange.com/questions/161082/bayesian-model-selection-in-pymc3/166383#166383) to a question on Cross Validated -+ [Cross-validation FAQs](https://avehtari.github.io/modelselection/CV-FAQ.html) by Aki Vehtari -::: - -+++ - -## Authors -* Authored by [Jon Sedar](https://github.com/jonsedar) on January, 2016 ([pymc#930](https://github.com/pymc-devs/pymc/pull/930)) -* Updated by [Junpeng Lao](https://github.com/junpenglao) on July, 2017 ([pymc#2398](https://github.com/pymc-devs/pymc/pull/2398)) -* Re-executed by Ravin Kumar on May, 2019 ([pymc#3397](https://github.com/pymc-devs/pymc/pull/3397)) -* Re-executed by Alex Andorra and Michael Osthege on June, 2020 ([pymc#3955](https://github.com/pymc-devs/pymc/pull/3955)) -* Updated by Raul Maldonado on March, 2021 ([pymc-examples#24](https://github.com/pymc-devs/pymc-examples/pull/24)) -* Updated by Abhipsha Das and Oriol Abril on June, 2021 ([pymc-examples#173](https://github.com/pymc-devs/pymc-examples/pull/173)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/generalized_linear_models/GLM-negative-binomial-regression.myst.md b/myst_nbs/generalized_linear_models/GLM-negative-binomial-regression.myst.md deleted file mode 100644 index 9bec08432..000000000 --- a/myst_nbs/generalized_linear_models/GLM-negative-binomial-regression.myst.md +++ /dev/null @@ -1,240 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.12 ('pymc-dev-py39') - language: python - name: python3 ---- - -(GLM-negative-binomial-regression)= -# GLM: Negative Binomial Regression - -:::{post} June, 2022 -:tags: negative binomial regression, generalized linear model, -:category: beginner -:author: Ian Ozsvald, Abhipsha Das, Benjamin Vincent -::: - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns - -from scipy import stats -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) - -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -This notebook closely follows the GLM Poisson regression example by [Jonathan Sedar](https://github.com/jonsedar) (which is in turn inspired by [a project by Ian Osvald](http://ianozsvald.com/2016/05/07/statistically-solving-sneezes-and-sniffles-a-work-in-progress-report-at-pydatalondon-2016/)) except the data here is negative binomially distributed instead of Poisson distributed. - -Negative binomial regression is used to model count data for which the variance is higher than the mean. The [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) can be thought of as a Poisson distribution whose rate parameter is gamma distributed, so that rate parameter can be adjusted to account for the increased variance. 
- -+++ - -### Generate Data - -As in the Poisson regression example, we assume that sneezing occurs at some baseline rate, and that consuming alcohol, not taking antihistamines, or doing both, increase its frequency. - -#### Poisson Data - -First, let's look at some Poisson distributed data from the Poisson regression example. - -```{code-cell} ipython3 -# Mean Poisson values -theta_noalcohol_meds = 1 # no alcohol, took an antihist -theta_alcohol_meds = 3 # alcohol, took an antihist -theta_noalcohol_nomeds = 6 # no alcohol, no antihist -theta_alcohol_nomeds = 36 # alcohol, no antihist - -# Create samples -q = 1000 -df_pois = pd.DataFrame( - { - "nsneeze": np.concatenate( - ( - rng.poisson(theta_noalcohol_meds, q), - rng.poisson(theta_alcohol_meds, q), - rng.poisson(theta_noalcohol_nomeds, q), - rng.poisson(theta_alcohol_nomeds, q), - ) - ), - "alcohol": np.concatenate( - ( - np.repeat(False, q), - np.repeat(True, q), - np.repeat(False, q), - np.repeat(True, q), - ) - ), - "nomeds": np.concatenate( - ( - np.repeat(False, q), - np.repeat(False, q), - np.repeat(True, q), - np.repeat(True, q), - ) - ), - } -) -``` - -```{code-cell} ipython3 -df_pois.groupby(["nomeds", "alcohol"])["nsneeze"].agg(["mean", "var"]) -``` - -Since the mean and variance of a Poisson distributed random variable are equal, the sample means and variances are very close. - -#### Negative Binomial Data - -Now, suppose every subject in the dataset had the flu, increasing the variance of their sneezing (and causing an unfortunate few to sneeze over 70 times a day). If the mean number of sneezes stays the same but variance increases, the data might follow a negative binomial distribution. - -```{code-cell} ipython3 -# Gamma shape parameter -alpha = 10 - - -def get_nb_vals(mu, alpha, size): - """Generate negative binomially distributed samples by - drawing a sample from a gamma distribution with mean `mu` and - shape parameter `alpha', then drawing from a Poisson - distribution whose rate parameter is given by the sampled - gamma variable. - - """ - - g = stats.gamma.rvs(alpha, scale=mu / alpha, size=size) - return stats.poisson.rvs(g) - - -# Create samples -n = 1000 -df = pd.DataFrame( - { - "nsneeze": np.concatenate( - ( - get_nb_vals(theta_noalcohol_meds, alpha, n), - get_nb_vals(theta_alcohol_meds, alpha, n), - get_nb_vals(theta_noalcohol_nomeds, alpha, n), - get_nb_vals(theta_alcohol_nomeds, alpha, n), - ) - ), - "alcohol": np.concatenate( - ( - np.repeat(False, n), - np.repeat(True, n), - np.repeat(False, n), - np.repeat(True, n), - ) - ), - "nomeds": np.concatenate( - ( - np.repeat(False, n), - np.repeat(False, n), - np.repeat(True, n), - np.repeat(True, n), - ) - ), - } -) -df -``` - -```{code-cell} ipython3 -df.groupby(["nomeds", "alcohol"])["nsneeze"].agg(["mean", "var"]) -``` - -As in the Poisson regression example, we see that drinking alcohol and/or not taking antihistamines increase the sneezing rate to varying degrees. Unlike in that example, for each combination of `alcohol` and `nomeds`, the variance of `nsneeze` is higher than the mean. This suggests that a Poisson distribution would be a poor fit for the data since the mean and variance of a Poisson distribution are equal. 
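-
-As a quick sanity check (a sketch using the `alpha` and `theta_*` values defined above): a gamma-Poisson mixture with mean $\mu$ and gamma shape $\alpha$ has marginal variance $\mu + \mu^{2}/\alpha$, so the variance always exceeds the mean. These theoretical values can be compared against the sample variances in the table above.
-
-```{code-cell} ipython3
-# Theoretical negative binomial variance for each group: var = mu + mu**2 / alpha
-for mu in (theta_noalcohol_meds, theta_alcohol_meds, theta_noalcohol_nomeds, theta_alcohol_nomeds):
-    print(f"mu = {mu:>2}: theoretical variance = {mu + mu**2 / alpha:6.1f}")
-```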
- -+++ - -### Visualize the Data - -```{code-cell} ipython3 -g = sns.catplot(x="nsneeze", row="nomeds", col="alcohol", data=df, kind="count", aspect=1.5) - -# Make x-axis ticklabels less crowded -ax = g.axes[1, 0] -labels = range(len(ax.get_xticklabels(which="both"))) -ax.set_xticks(labels[::5]) -ax.set_xticklabels(labels[::5]); -``` - -## Negative Binomial Regression - -+++ - -### Create GLM Model - -```{code-cell} ipython3 -COORDS = {"regressor": ["nomeds", "alcohol", "nomeds:alcohol"], "obs_idx": df.index} - -with pm.Model(coords=COORDS) as m_sneeze_inter: - a = pm.Normal("intercept", mu=0, sigma=5) - b = pm.Normal("slopes", mu=0, sigma=1, dims="regressor") - alpha = pm.Exponential("alpha", 0.5) - - M = pm.ConstantData("nomeds", df.nomeds.to_numpy(), dims="obs_idx") - A = pm.ConstantData("alcohol", df.alcohol.to_numpy(), dims="obs_idx") - S = pm.ConstantData("nsneeze", df.nsneeze.to_numpy(), dims="obs_idx") - - λ = pm.math.exp(a + b[0] * M + b[1] * A + b[2] * M * A) - - y = pm.NegativeBinomial("y", mu=λ, alpha=alpha, observed=S, dims="obs_idx") - - idata = pm.sample() -``` - -### View Results - -```{code-cell} ipython3 -az.plot_trace(idata, compact=False); -``` - -```{code-cell} ipython3 -# Transform coefficients to recover parameter values -az.summary(np.exp(idata.posterior), kind="stats", var_names=["intercept", "slopes"]) -``` - -```{code-cell} ipython3 -az.summary(idata.posterior, kind="stats", var_names="alpha") -``` - -The mean values are close to the values we specified when generating the data: -- The base rate is a constant 1. -- Drinking alcohol triples the base rate. -- Not taking antihistamines increases the base rate by 6 times. -- Drinking alcohol and not taking antihistamines doubles the rate that would be expected if their rates were independent. If they were independent, then doing both would increase the base rate by 3\*6=18 times, but instead the base rate is increased by 3\*6\*2=36 times. - -Finally, the mean of `nsneeze_alpha` is also quite close to its actual value of 10. - -+++ - -See also, [`bambi's` negative binomial example](https://bambinos.github.io/bambi/master/notebooks/negative_binomial.html) for further reference. - -+++ - -## Authors -- Created by [Ian Ozsvald](https://github.com/ianozsvald) -- Updated by [Abhipsha Das](https://github.com/chiral-carbon) in August 2021 -- Updated by [Benjamin Vincent](https://github.com/drbenvincent) to PyMC v4 in June 2022 - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md b/myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md deleted file mode 100644 index 8a9bcce31..000000000 --- a/myst_nbs/generalized_linear_models/GLM-out-of-sample-predictions.myst.md +++ /dev/null @@ -1,297 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: 'Python 3.7.6 64-bit (''website_projects'': conda)' - metadata: - interpreter: - hash: fbddea5140024843998ae64bf59a7579a9660d103062604797ea5984366c686c - name: python3 ---- - -# GLM in PyMC3: Out-Of-Sample Predictions - -In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). 
I am particularly interested in the model definition using [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input! - -**Resources** - - -- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html) - - - In particular check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html) - -- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models. - -- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb) -- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/) - -+++ - -## Prepare Notebook - -```{code-cell} ipython3 -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import seaborn as sns - -sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"}) -sns.set_palette(palette="deep") -sns_c = sns.color_palette(palette="deep") - -import arviz as az -import patsy -import pymc3 as pm - -from pymc3 import glm - -plt.rcParams["figure.figsize"] = [7, 6] -plt.rcParams["figure.dpi"] = 100 -``` - -## Generate Sample Data - -We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features. - -```{code-cell} ipython3 -SEED = 42 -np.random.seed(SEED) - -# Number of data points. -n = 250 -# Create features. -x1 = np.random.normal(loc=0.0, scale=2.0, size=n) -x2 = np.random.normal(loc=0.0, scale=2.0, size=n) -epsilon = np.random.normal(loc=0.0, scale=0.5, size=n) -# Define target variable. -intercept = -0.5 -beta_x1 = 1 -beta_x2 = -1 -beta_interaction = 2 -z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2 -p = 1 / (1 + np.exp(-z)) -y = np.random.binomial(n=1, p=p, size=n) - -df = pd.DataFrame(dict(x1=x1, x2=x2, y=y)) - -df.head() -``` - -Let us do some exploration of the data: - -```{code-cell} ipython3 -sns.pairplot( - data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]} -); -``` - -- $x_1$ and $x_2$ are not correlated. -- $x_1$ and $x_2$ do not seem to separate the $y$-classes independently. -- The distribution of $y$ is not highly unbalanced. - -```{code-cell} ipython3 -fig, ax = plt.subplots() -sns_c_div = sns.diverging_palette(240, 10, n=2) -sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]]) -ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5)) -ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9)); -``` - -## Prepare Data for Modeling - -I wanted to use the *`classmethod`* `from_formula` (see [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). 
As a workaround, I created the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use *`class`* `pymc3.glm.linear.GLM` (this was motivated by going into the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)). - -```{code-cell} ipython3 -# Define model formula. -formula = "y ~ x1 * x2" -# Create features. -y, x = patsy.dmatrices(formula_like=formula, data=df) -y = np.asarray(y).flatten() -labels = x.design_info.column_names -x = np.asarray(x) -``` - -As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix. - -```{code-cell} ipython3 -print(f"labels = {labels}") -``` - -Now we do a train-test split. - -```{code-cell} ipython3 -from sklearn.model_selection import train_test_split - -x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED) -``` - -## Define and Fit the Model - -We now specify the model in PyMC3. - -```{code-cell} ipython3 -with pm.Model() as model: - # Set data container. - data = pm.Data("data", x_train) - # Define GLM family. - family = pm.glm.families.Binomial() - # Set priors. - priors = { - "Intercept": pm.Normal.dist(mu=0, sd=10), - "x1": pm.Normal.dist(mu=0, sd=10), - "x2": pm.Normal.dist(mu=0, sd=10), - "x1:x2": pm.Normal.dist(mu=0, sd=10), - } - # Specify model. - glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors) - # Configure sampler. - trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED) -``` - -```{code-cell} ipython3 -# Plot chains. -az.plot_trace(data=trace); -``` - -```{code-cell} ipython3 -az.summary(trace) -``` - -The chains look good. - -+++ - -## Generate Out-Of-Sample Predictions - -Now we generate predictions on the test set. - -```{code-cell} ipython3 -# Update data reference. -pm.set_data({"data": x_test}, model=model) -# Generate posterior samples. -ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000) -``` - -```{code-cell} ipython3 -# Compute the point prediction by taking the mean -# and defining the category via a threshold. -p_test_pred = ppc_test["y"].mean(axis=0) -y_test_pred = (p_test_pred >= 0.5).astype("int") -``` - -## Evaluate Model - -First let us compute the accuracy on the test set. - -```{code-cell} ipython3 -from sklearn.metrics import accuracy_score - -print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}") -``` - -Next, we plot the [roc curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [auc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve). - -```{code-cell} ipython3 -from sklearn.metrics import RocCurveDisplay, auc, roc_curve - -fpr, tpr, thresholds = roc_curve( - y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False -) -roc_auc = auc(fpr, tpr) - -fig, ax = plt.subplots() -roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc) -roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4) -ax.set(title="ROC"); -``` - -The model is performing as expected (we of course know the data generating process, which is almost never the case in practical applications). 
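-
-A small sketch that takes this one step further (using the `ppc_test` draws generated above, which have shape `(samples, n_test)`): instead of a single accuracy value computed from the mean prediction, we can look at how accuracy varies across individual posterior predictive draws.
-
-```{code-cell} ipython3
-# Each row of ppc_test["y"] is a vector of 0/1 predictions for the test set, so an
-# elementwise comparison with y_test averaged over datapoints gives one accuracy per draw.
-acc_samples = (ppc_test["y"] == y_test).mean(axis=1)
-
-print(
-    f"accuracy: mean = {acc_samples.mean():0.3f}, "
-    f"94% interval = {np.percentile(acc_samples, [3, 97]).round(3)}"
-)
-```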
- -+++ - -## Model Decision Boundary - -Finally we will describe and plot the model decision boundary, which is the space defined as - -$$\mathcal{B} = \{(x_1, x_2) \in \mathbb{R}^2 \: | \: p(x_1, x_2) = 0.5\}$$ - -where $p$ denotes the probability of belonging to the class $y=1$ output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization: - -$$0.5 = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2))}$$ - -which implies - -$$0 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2$$ - -Solving for $x_2$ we get the formula - -$$x_2 = - \frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12}x_1}$$ - -Observe that this curve is a hyperbola centered at the singularity point $x_1 = - \beta_2 / \beta_{12}$. - -+++ - -Let us now plot the model decision boundary using a grid: - -```{code-cell} ipython3 -# Construct grid. -x1_grid = np.linspace(start=-9, stop=9, num=300) -x2_grid = x1_grid - -x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid) - -x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1) - -# Create features on the grid. -x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1])) - -x_grid_ext = np.asarray(x_grid_ext) - -# Generate model predictions on the grid. -pm.set_data({"data": x_grid_ext}, model=model) -ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000) -``` - -Now we compute the model decision boundary on the grid for visualization purposes. - -```{code-cell} ipython3 -numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid) -denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid -bd_grid = numerator / denominator - -grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"]) -grid_df["p"] = ppc_grid["y"].mean(axis=0) -grid_df.sort_values("p", inplace=True) - -p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy() -``` - -We finally get the plot and the predictions on the test set: - -```{code-cell} ipython3 -fig, ax = plt.subplots() -cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True) -sns.scatterplot( - x=x_test[:, 1].flatten(), - y=x_test[:, 2].flatten(), - hue=y_test, - palette=[sns_c_div[0], sns_c_div[-1]], - ax=ax, -) -sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax) -ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3) -ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5)) -ax.lines[0].set_linestyle("dotted") -ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2"); -``` - -**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a `Deterministic` variable as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb). 
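-
-As a rough illustration of that remark (a sketch reusing `trace`, `x1_grid` and the test data from above, rather than storing a `Deterministic` inside the model), we can overlay the decision boundary implied by a subsample of individual posterior draws:
-
-```{code-cell} ipython3
-# Decision boundary x2 = -(b0 + b1 * x1) / (b2 + b12 * x1), one curve per posterior draw
-fig, ax = plt.subplots()
-n_draws = 200  # arbitrary number of posterior draws to overlay
-idx = np.random.choice(len(trace["Intercept"]), size=n_draws, replace=False)
-for i in idx:
-    num_i = -(trace["Intercept"][i] + trace["x1"][i] * x1_grid)
-    den_i = trace["x2"][i] + trace["x1:x2"][i] * x1_grid
-    ax.plot(x1_grid, num_i / den_i, color="black", alpha=0.03)
-
-sns.scatterplot(
-    x=x_test[:, 1].flatten(),
-    y=x_test[:, 2].flatten(),
-    hue=y_test,
-    palette=[sns_c_div[0], sns_c_div[-1]],
-    ax=ax,
-)
-ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
-ax.set(
-    title="Decision Boundary (posterior draws)",
-    xlim=(-9, 9),
-    ylim=(-9, 9),
-    xlabel="x1",
-    ylabel="x2",
-);
-```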
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/generalized_linear_models/GLM-poisson-regression.myst.md b/myst_nbs/generalized_linear_models/GLM-poisson-regression.myst.md deleted file mode 100644 index 864a11722..000000000 --- a/myst_nbs/generalized_linear_models/GLM-poisson-regression.myst.md +++ /dev/null @@ -1,535 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-ex - language: python - name: pymc-ex ---- - -+++ {"papermill": {"duration": 0.043172, "end_time": "2021-02-23T11:26:55.064791", "exception": false, "start_time": "2021-02-23T11:26:55.021619", "status": "completed"}, "tags": []} - -(GLM-poisson-regression)= -# GLM: Poisson Regression - -:::{post} November 30, 2022 -:tags: regression, poisson -:category: intermediate -:author: Jonathan Sedar, Benjamin Vincent -::: - -+++ {"papermill": {"duration": 0.069202, "end_time": "2021-02-23T11:27:01.489628", "exception": false, "start_time": "2021-02-23T11:27:01.420426", "status": "completed"}, "tags": []} - -This is a minimal reproducible example of Poisson regression to predict counts using dummy data. - -This Notebook is basically an excuse to demo Poisson regression using PyMC, both manually and using `bambi` to demo interactions using the `formulae` library. We will create some dummy data, Poisson distributed according to a linear model, and try to recover the coefficients of that linear model through inference. - -For more statistical detail see: - -+ Basic info on [Wikipedia](https://en.wikipedia.org/wiki/Poisson_regression) -+ GLMs: Poisson regression, exposure, and overdispersion in Chapter 6.2 of [ARM, Gelmann & Hill 2006](http://www.stat.columbia.edu/%7Egelman/arm/) -+ This worked example from ARM 6.2 by [Clay Ford](http://www.clayford.net/statistics/poisson-regression-ch-6-of-gelman-and-hill/) - -This very basic model is inspired by [a project by Ian Osvald](http://ianozsvald.com/2016/05/07/statistically-solving-sneezes-and-sniffles-a-work-in-progress-report-at-pydatalondon-2016/), which is concerned with understanding the various effects of external environmental factors upon the allergic sneezing of a test subject. 
- -```{code-cell} ipython3 -#!pip install seaborn -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 6.051698 - end_time: '2021-02-23T11:27:01.160546' - exception: false - start_time: '2021-02-23T11:26:55.108848' - status: completed -tags: [] ---- -import arviz as az -import bambi as bmb -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import seaborn as sns - -from formulae import design_matrices -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.111837 - end_time: '2021-02-23T11:27:01.349763' - exception: false - start_time: '2021-02-23T11:27:01.237926' - status: completed -tags: [] ---- -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) - -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -+++ {"papermill": {"duration": 0.06268, "end_time": "2021-02-23T11:27:01.615645", "exception": false, "start_time": "2021-02-23T11:27:01.552965", "status": "completed"}, "tags": []} - -## Local Functions - -+++ {"papermill": {"duration": 0.073451, "end_time": "2021-02-23T11:27:01.763249", "exception": false, "start_time": "2021-02-23T11:27:01.689798", "status": "completed"}, "tags": []} - -## Generate Data - -+++ {"papermill": {"duration": 0.060542, "end_time": "2021-02-23T11:27:01.884617", "exception": false, "start_time": "2021-02-23T11:27:01.824075", "status": "completed"}, "tags": []} - -This dummy dataset is created to emulate some data created as part of a study into quantified self, and the real data is more complicated than this. Ask Ian Osvald if you'd like to know more [@ianozvald](https://twitter.com/ianozsvald). - - -### Assumptions: - -+ The subject sneezes N times per day, recorded as `nsneeze (int)` -+ The subject may or may not drink alcohol during that day, recorded as `alcohol (boolean)` -+ The subject may or may not take an antihistamine medication during that day, recorded as the negative action `nomeds (boolean)` -+ We postulate (probably incorrectly) that sneezing occurs at some baseline rate, which increases if an antihistamine is not taken, and further increased after alcohol is consumed. -+ The data is aggregated per day, to yield a total count of sneezes on that day, with a boolean flag for alcohol and antihistamine usage, with the big assumption that nsneezes have a direct causal relationship. 
- - -Create 4000 days of data: daily counts of sneezes which are Poisson distributed w.r.t alcohol consumption and antihistamine usage - -```{code-cell} ipython3 ---- -papermill: - duration: 0.07367 - end_time: '2021-02-23T11:27:02.023323' - exception: false - start_time: '2021-02-23T11:27:01.949653' - status: completed -tags: [] ---- -# decide poisson theta values -theta_noalcohol_meds = 1 # no alcohol, took an antihist -theta_alcohol_meds = 3 # alcohol, took an antihist -theta_noalcohol_nomeds = 6 # no alcohol, no antihist -theta_alcohol_nomeds = 36 # alcohol, no antihist - -# create samples -q = 1000 -df = pd.DataFrame( - { - "nsneeze": np.concatenate( - ( - rng.poisson(theta_noalcohol_meds, q), - rng.poisson(theta_alcohol_meds, q), - rng.poisson(theta_noalcohol_nomeds, q), - rng.poisson(theta_alcohol_nomeds, q), - ) - ), - "alcohol": np.concatenate( - ( - np.repeat(False, q), - np.repeat(True, q), - np.repeat(False, q), - np.repeat(True, q), - ) - ), - "nomeds": np.concatenate( - ( - np.repeat(False, q), - np.repeat(False, q), - np.repeat(True, q), - np.repeat(True, q), - ) - ), - } -) -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.093062 - end_time: '2021-02-23T11:27:02.176348' - exception: false - start_time: '2021-02-23T11:27:02.083286' - status: completed -tags: [] ---- -df.tail() -``` - -+++ {"papermill": {"duration": 0.071086, "end_time": "2021-02-23T11:27:02.312429", "exception": false, "start_time": "2021-02-23T11:27:02.241343", "status": "completed"}, "tags": []} - -##### View means of the various combinations (Poisson mean values) - -```{code-cell} ipython3 ---- -papermill: - duration: 0.082117 - end_time: '2021-02-23T11:27:02.449759' - exception: false - start_time: '2021-02-23T11:27:02.367642' - status: completed -tags: [] ---- -df.groupby(["alcohol", "nomeds"]).mean().unstack() -``` - -+++ {"papermill": {"duration": 0.054583, "end_time": "2021-02-23T11:27:02.561633", "exception": false, "start_time": "2021-02-23T11:27:02.507050", "status": "completed"}, "tags": []} - -### Briefly Describe Dataset - -```{code-cell} ipython3 ---- -papermill: - duration: 2.510687 - end_time: '2021-02-23T11:27:05.124151' - exception: false - start_time: '2021-02-23T11:27:02.613464' - status: completed -tags: [] ---- -g = sns.catplot( - x="nsneeze", - row="nomeds", - col="alcohol", - data=df, - kind="count", - height=4, - aspect=1.5, -) -for ax in (g.axes[1, 0], g.axes[1, 1]): - for n, label in enumerate(ax.xaxis.get_ticklabels()): - label.set_visible(n % 5 == 0) -``` - -+++ {"papermill": {"duration": 0.049808, "end_time": "2021-02-23T11:27:05.231176", "exception": false, "start_time": "2021-02-23T11:27:05.181368", "status": "completed"}, "tags": []} - -**Observe:** - -+ This looks a lot like poisson-distributed count data (because it is) -+ With `nomeds == False` and `alcohol == False` (top-left, akak antihistamines WERE used, alcohol was NOT drunk) the mean of the poisson distribution of sneeze counts is low. -+ Changing `alcohol == True` (top-right) increases the sneeze count `nsneeze` slightly -+ Changing `nomeds == True` (lower-left) increases the sneeze count `nsneeze` further -+ Changing both `alcohol == True and nomeds == True` (lower-right) increases the sneeze count `nsneeze` a lot, increasing both the mean and variance. 
- -+++ {"papermill": {"duration": 0.049476, "end_time": "2021-02-23T11:27:05.330914", "exception": false, "start_time": "2021-02-23T11:27:05.281438", "status": "completed"}, "tags": []} - ---- - -+++ {"papermill": {"duration": 0.054536, "end_time": "2021-02-23T11:27:05.438038", "exception": false, "start_time": "2021-02-23T11:27:05.383502", "status": "completed"}, "tags": []} - -## Poisson Regression - -+++ {"papermill": {"duration": 0.048945, "end_time": "2021-02-23T11:27:05.540630", "exception": false, "start_time": "2021-02-23T11:27:05.491685", "status": "completed"}, "tags": []} - -Our model here is a very simple Poisson regression, allowing for interaction of terms: - -$$ \theta = exp(\beta X)$$ - -$$ Y_{sneeze\_count} \sim Poisson(\theta)$$ - -+++ {"papermill": {"duration": 0.04972, "end_time": "2021-02-23T11:27:05.641588", "exception": false, "start_time": "2021-02-23T11:27:05.591868", "status": "completed"}, "tags": []} - -**Create linear model for interaction of terms** - -```{code-cell} ipython3 ---- -papermill: - duration: 0.056994 - end_time: '2021-02-23T11:27:05.748431' - exception: false - start_time: '2021-02-23T11:27:05.691437' - status: completed -tags: [] ---- -fml = "nsneeze ~ alcohol + nomeds + alcohol:nomeds" # full formulae formulation -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.058609 - end_time: '2021-02-23T11:27:05.859414' - exception: false - start_time: '2021-02-23T11:27:05.800805' - status: completed -tags: [] ---- -fml = "nsneeze ~ alcohol * nomeds" # lazy, alternative formulae formulation -``` - -+++ {"papermill": {"duration": 0.048682, "end_time": "2021-02-23T11:27:05.958802", "exception": false, "start_time": "2021-02-23T11:27:05.910120", "status": "completed"}, "tags": []} - -### 1. Manual method, create design matrices and manually specify model - -+++ {"papermill": {"duration": 0.049076, "end_time": "2021-02-23T11:27:06.059305", "exception": false, "start_time": "2021-02-23T11:27:06.010229", "status": "completed"}, "tags": []} - -**Create Design Matrices** - -```{code-cell} ipython3 -dm = design_matrices(fml, df, na_action="error") -``` - -```{code-cell} ipython3 -mx_ex = dm.common.as_dataframe() -mx_en = dm.response.as_dataframe() -``` - -```{code-cell} ipython3 -mx_ex -``` - -+++ {"papermill": {"duration": 0.062897, "end_time": "2021-02-23T11:27:06.420853", "exception": false, "start_time": "2021-02-23T11:27:06.357956", "status": "completed"}, "tags": []} - -**Create Model** - -```{code-cell} ipython3 ---- -papermill: - duration: 29.137887 - end_time: '2021-02-23T11:27:35.621305' - exception: false - start_time: '2021-02-23T11:27:06.483418' - status: completed -tags: [] ---- -with pm.Model() as mdl_fish: - - # define priors, weakly informative Normal - b0 = pm.Normal("Intercept", mu=0, sigma=10) - b1 = pm.Normal("alcohol", mu=0, sigma=10) - b2 = pm.Normal("nomeds", mu=0, sigma=10) - b3 = pm.Normal("alcohol:nomeds", mu=0, sigma=10) - - # define linear model and exp link function - theta = ( - b0 - + b1 * mx_ex["alcohol"].values - + b2 * mx_ex["nomeds"].values - + b3 * mx_ex["alcohol:nomeds"].values - ) - - ## Define Poisson likelihood - y = pm.Poisson("y", mu=pm.math.exp(theta), observed=mx_en["nsneeze"].values) -``` - -+++ {"papermill": {"duration": 0.049445, "end_time": "2021-02-23T11:27:35.720870", "exception": false, "start_time": "2021-02-23T11:27:35.671425", "status": "completed"}, "tags": []} - -**Sample Model** - -```{code-cell} ipython3 ---- -papermill: - duration: 108.169723 - end_time: '2021-02-23T11:29:23.939578' - 
exception: false - start_time: '2021-02-23T11:27:35.769855' - status: completed -tags: [] ---- -with mdl_fish: - inf_fish = pm.sample() - # inf_fish.extend(pm.sample_posterior_predictive(inf_fish)) -``` - -+++ {"papermill": {"duration": 0.118023, "end_time": "2021-02-23T11:29:24.142987", "exception": false, "start_time": "2021-02-23T11:29:24.024964", "status": "completed"}, "tags": []} - -**View Diagnostics** - -```{code-cell} ipython3 ---- -papermill: - duration: 4.374731 - end_time: '2021-02-23T11:29:28.617406' - exception: false - start_time: '2021-02-23T11:29:24.242675' - status: completed -tags: [] ---- -az.plot_trace(inf_fish); -``` - -+++ {"papermill": {"duration": 0.076462, "end_time": "2021-02-23T11:29:28.790410", "exception": false, "start_time": "2021-02-23T11:29:28.713948", "status": "completed"}, "tags": []} - -**Observe:** - -+ The model converges quickly and traceplots looks pretty well mixed - -+++ {"papermill": {"duration": 0.07685, "end_time": "2021-02-23T11:29:28.943674", "exception": false, "start_time": "2021-02-23T11:29:28.866824", "status": "completed"}, "tags": []} - -### Transform coeffs and recover theta values - -```{code-cell} ipython3 -az.summary(np.exp(inf_fish.posterior), kind="stats") -``` - -+++ {"papermill": {"duration": 0.075014, "end_time": "2021-02-23T11:29:29.324266", "exception": false, "start_time": "2021-02-23T11:29:29.249252", "status": "completed"}, "tags": []} - -**Observe:** - -+ The contributions from each feature as a multiplier of the baseline sneezecount appear to be as per the data generation: - - - 1. exp(Intercept): mean=1.05 cr=[0.98, 1.10] - - Roughly linear baseline count when no alcohol and meds, as per the generated data: - - theta_noalcohol_meds = 1 (as set above) - theta_noalcohol_meds = exp(Intercept) - = 1 - - - 2. exp(alcohol): mean=2.86 cr=[2.67, 3.07] - - non-zero positive effect of adding alcohol, a ~3x multiplier of - baseline sneeze count, as per the generated data: - - theta_alcohol_meds = 3 (as set above) - theta_alcohol_meds = exp(Intercept + alcohol) - = exp(Intercept) * exp(alcohol) - = 1 * 3 = 3 - - - 3. exp(nomeds): mean=5.73 cr=[5.34, 6.08] - - larger, non-zero positive effect of adding nomeds, a ~6x multiplier of - baseline sneeze count, as per the generated data: - - theta_noalcohol_nomeds = 6 (as set above) - theta_noalcohol_nomeds = exp(Intercept + nomeds) - = exp(Intercept) * exp(nomeds) - = 1 * 6 = 6 - - - 4. exp(alcohol:nomeds): mean=2.10 cr=[1.96, 2.28] - - small, positive interaction effect of alcohol and meds, a ~2x multiplier of - baseline sneeze count, as per the generated data: - - theta_alcohol_nomeds = 36 (as set above) - theta_alcohol_nomeds = exp(Intercept + alcohol + nomeds + alcohol:nomeds) - = exp(Intercept) * exp(alcohol) * exp(nomeds * alcohol:nomeds) - = 1 * 3 * 6 * 2 = 36 - -+++ {"papermill": {"duration": 0.076829, "end_time": "2021-02-23T11:29:29.477240", "exception": false, "start_time": "2021-02-23T11:29:29.400411", "status": "completed"}, "tags": []} - -### 2. 
Alternative method, using `bambi` - -+++ {"papermill": {"duration": 0.074408, "end_time": "2021-02-23T11:29:29.628052", "exception": false, "start_time": "2021-02-23T11:29:29.553644", "status": "completed"}, "tags": []} - -**Create Model** - -+++ {"papermill": {"duration": 0.07467, "end_time": "2021-02-23T11:29:29.778406", "exception": false, "start_time": "2021-02-23T11:29:29.703736", "status": "completed"}, "tags": []} - -**Alternative automatic formulation using `bambi`** - -```{code-cell} ipython3 ---- -papermill: - duration: 4.699873 - end_time: '2021-02-23T11:29:34.554521' - exception: false - start_time: '2021-02-23T11:29:29.854648' - status: completed -tags: [] ---- -model = bmb.Model(fml, df, family="poisson") -``` - -+++ {"papermill": {"duration": 0.077285, "end_time": "2021-02-23T11:29:34.719403", "exception": false, "start_time": "2021-02-23T11:29:34.642118", "status": "completed"}, "tags": []} - -**Fit Model** - -```{code-cell} ipython3 ---- -papermill: - duration: 115.426671 - end_time: '2021-02-23T11:31:30.222773' - exception: false - start_time: '2021-02-23T11:29:34.796102' - status: completed -tags: [] ---- -inf_fish_alt = model.fit() -``` - -+++ {"papermill": {"duration": 0.075564, "end_time": "2021-02-23T11:31:30.375433", "exception": false, "start_time": "2021-02-23T11:31:30.299869", "status": "completed"}, "tags": []} - -**View Traces** - -```{code-cell} ipython3 ---- -papermill: - duration: 2.970961 - end_time: '2021-02-23T11:31:33.424138' - exception: false - start_time: '2021-02-23T11:31:30.453177' - status: completed -tags: [] ---- -az.plot_trace(inf_fish_alt); -``` - -+++ {"papermill": {"duration": 0.10274, "end_time": "2021-02-23T11:31:33.628707", "exception": false, "start_time": "2021-02-23T11:31:33.525967", "status": "completed"}, "tags": []} - -### Transform coeffs - -```{code-cell} ipython3 -az.summary(np.exp(inf_fish_alt.posterior), kind="stats") -``` - -+++ {"papermill": {"duration": 0.10059, "end_time": "2021-02-23T11:31:34.095731", "exception": false, "start_time": "2021-02-23T11:31:33.995141", "status": "completed"}, "tags": []} - -**Observe:** - -+ The traceplots look well mixed -+ The transformed model coeffs look moreorless the same as those generated by the manual model -+ Note that the posterior predictive samples have an extreme skew - -```{code-cell} ipython3 -:tags: [] - -posterior_predictive = model.predict(inf_fish_alt, kind="pps") -``` - -We can use `az.plot_ppc()` to check that the posterior predictive samples are similar to the observed data. - -For more information on posterior predictive checks, we can refer to {ref}`pymc:posterior_predictive`. - -```{code-cell} ipython3 -az.plot_ppc(inf_fish_alt); -``` - -+++ {"papermill": {"duration": 0.106366, "end_time": "2021-02-23T11:31:34.956844", "exception": false, "start_time": "2021-02-23T11:31:34.850478", "status": "completed"}, "tags": []} - -## Authors -- Example originally contributed by [Jonathan Sedar](https://github.com/jonsedar) 2016-05-15. -- Updated to PyMC v4 by [Benjamin Vincent](https://github.com/drbenvincent) May 2022. -- Notebook header and footer updated November 2022. 
- -+++ - -## Watermark - -```{code-cell} ipython3 ---- -papermill: - duration: 0.16014 - end_time: '2021-02-23T11:31:43.372227' - exception: false - start_time: '2021-02-23T11:31:43.212087' - status: completed -tags: [] ---- -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-robust-with-outlier-detection.myst.md b/myst_nbs/generalized_linear_models/GLM-robust-with-outlier-detection.myst.md deleted file mode 100644 index 8e422c8d5..000000000 --- a/myst_nbs/generalized_linear_models/GLM-robust-with-outlier-detection.myst.md +++ /dev/null @@ -1,943 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 -substitutions: - extra_dependencies: seaborn ---- - -(GLM-robust-with-outlier-detection)= -# GLM: Robust Regression using Custom Likelihood for Outlier Classification - -:::{post} 17 Nov, 2021 -:tags: regression, robust analysis -:category: intermediate -:author: Jon Sedar, Thomas Wiecki, Raul Maldonado, Oriol Abril -::: - -Using PyMC3 for Robust Regression with Outlier Detection using the Hogg 2010 Signal vs Noise method. - -**Modelling concept:** -+ This model uses a custom likelihood function as a mixture of two likelihoods, one for the main data-generating function (a linear model that we care about), and one for outliers. -+ The model does not marginalize and thus gives us a classification of outlier-hood for each datapoint -+ The dataset is tiny and hardcoded into this Notebook. It contains errors in both the x and y, but we will deal here with only errors in y. - -**Complementary approaches:** -+ This is a complementary approach to the Student-T robust regression as illustrated in the example {doc}`generalized_linear_models/GLM-robust`, and that approach is also compared -+ See also a [gist by Dan FM](https://gist.github.com/dfm/5250dd2f17daf60cbe582ceeeb2fd12f) that he published after a quick twitter conversation - his "Hogg improved" model uses this same model structure and cleverly marginalizes over the outlier class but also observes it during sampling using a `pm.Deterministic` <- this is really nice -+ The likelihood evaluation is essentially a copy of eqn 17 in "Data analysis recipes: Fitting a model to data" - {cite:t}`hogg2010data` -+ The model is adapted specifically from Jake Vanderplas' and Brigitta Sipocz' [implementation](http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html) in the AstroML book {cite:p}`ivezić2014astroMLtext,vanderplas2012astroML` - -+++ - -## Setup - -+++ - -### Installation Notes - -+++ - -See the original project [README](https://github.com/jonsedar/pymc3_examples/blob/master/README.md) for full details on dependencies and about the environment where the notebook was written in. A summary on the environment where this notebook was executed is available in the ["Watermark"](#watermark) section. 
- -:::{include} ../extra_installs.md -::: - -```{code-cell} ipython3 -%matplotlib inline -%config InlineBackend.figure_format = 'retina' -``` - -### Imports - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import seaborn as sns - -from matplotlib.lines import Line2D -from scipy import stats -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -``` - -### Load Data - -+++ - -We'll use the Hogg 2010 data available at https://github.com/astroML/astroML/blob/master/astroML/datasets/hogg2010test.py - -It's a very small dataset so for convenience, it's hardcoded below - -```{code-cell} ipython3 -# cut & pasted directly from the fetch_hogg2010test() function -# identical to the original dataset as hardcoded in the Hogg 2010 paper - -dfhogg = pd.DataFrame( - np.array( - [ - [1, 201, 592, 61, 9, -0.84], - [2, 244, 401, 25, 4, 0.31], - [3, 47, 583, 38, 11, 0.64], - [4, 287, 402, 15, 7, -0.27], - [5, 203, 495, 21, 5, -0.33], - [6, 58, 173, 15, 9, 0.67], - [7, 210, 479, 27, 4, -0.02], - [8, 202, 504, 14, 4, -0.05], - [9, 198, 510, 30, 11, -0.84], - [10, 158, 416, 16, 7, -0.69], - [11, 165, 393, 14, 5, 0.30], - [12, 201, 442, 25, 5, -0.46], - [13, 157, 317, 52, 5, -0.03], - [14, 131, 311, 16, 6, 0.50], - [15, 166, 400, 34, 6, 0.73], - [16, 160, 337, 31, 5, -0.52], - [17, 186, 423, 42, 9, 0.90], - [18, 125, 334, 26, 8, 0.40], - [19, 218, 533, 16, 6, -0.78], - [20, 146, 344, 22, 5, -0.56], - ] - ), - columns=["id", "x", "y", "sigma_y", "sigma_x", "rho_xy"], -) - -dfhogg["id"] = dfhogg["id"].apply(lambda x: "p{}".format(int(x))) -dfhogg.set_index("id", inplace=True) -dfhogg.head() -``` - ---- - -+++ - -## 1. Basic EDA - -+++ - -Exploratory Data Analysis - -+++ - -Note: -+ this is very rudimentary so we can quickly get to the `pymc3` part -+ the dataset contains errors in both the x and y, but we will deal here with only errors in y. -+ see the {cite:t}`hogg2010data` for more detail - -```{code-cell} ipython3 -with plt.rc_context({"figure.constrained_layout.use": False}): - gd = sns.jointplot( - x="x", - y="y", - data=dfhogg, - kind="scatter", - height=6, - marginal_kws={"bins": 12, "kde": True, "kde_kws": {"cut": 1}}, - joint_kws={"edgecolor": "w", "linewidth": 1.2, "s": 80}, - ) - -_ = gd.ax_joint.errorbar( - "x", "y", "sigma_y", "sigma_x", fmt="none", ecolor="#4878d0", data=dfhogg, zorder=10 -) - -for idx, r in dfhogg.iterrows(): - _ = gd.ax_joint.annotate( - text=idx, - xy=(r["x"], r["y"]), - xycoords="data", - xytext=(10, 10), - textcoords="offset points", - color="#999999", - zorder=1, - ) - - -_ = gd.fig.suptitle( - ( - "Original datapoints in Hogg 2010 dataset\n" - + "showing marginal distributions and errors sigma_x, sigma_y" - ), - y=1.05, -); -``` - -**Observe**: - -+ Even judging just by eye, you can see these observations mostly fall on / around a straight line with positive gradient -+ It looks like a few of the datapoints may be outliers from such a line -+ Measurement error (independently on x and y) varies across the observations - -+++ - -## 2. Basic Feature Engineering - -+++ - -### 2.1 Transform and standardize dataset - -+++ - -It's common practice to standardize the input values to a linear model, because this leads to coefficients -sitting in the same range and being more directly comparable. e.g. this is noted in {cite:t}`gelman2008scaling` - -So, following Gelman's paper above, we'll divide by 2 s.d. 
here - -+ since this model is very simple, we just standardize directly, -rather than using e.g. a `scikit-learn` `FunctionTransformer` -+ ignoring `rho_xy` for now - -**Additional note** on scaling the output feature `y` and measurement error `sigma_y`: -+ This is unconventional - typically you wouldn't scale an output feature -+ However, in the Hogg model we fit a custom two-part likelihood function of Normals which encourages -a globally minimised log-loss by allowing outliers to fit to their own Normal distribution. This -outlier distribution is specified using a stdev stated as an offset `sigma_y_out` from `sigma_y` -+ This offset value has the effect of requiring `sigma_y` to be restated in the same scale as the stdev of `y` - -+++ - -Standardize (mean center and divide by 2 sd): - -```{code-cell} ipython3 -dfhoggs = (dfhogg[["x", "y"]] - dfhogg[["x", "y"]].mean(0)) / (2 * dfhogg[["x", "y"]].std(0)) -dfhoggs["sigma_x"] = dfhogg["sigma_x"] / (2 * dfhogg["x"].std()) -dfhoggs["sigma_y"] = dfhogg["sigma_y"] / (2 * dfhogg["y"].std()) -``` - -```{code-cell} ipython3 -with plt.rc_context({"figure.constrained_layout.use": False}): - gd = sns.jointplot( - x="x", - y="y", - data=dfhoggs, - kind="scatter", - height=6, - marginal_kws={"bins": 12, "kde": True, "kde_kws": {"cut": 1}}, - joint_kws={"edgecolor": "w", "linewidth": 1, "s": 80}, - ) -gd.ax_joint.errorbar("x", "y", "sigma_y", "sigma_x", fmt="none", ecolor="#4878d0", data=dfhoggs) -gd.fig.suptitle( - ( - "Quick view to confirm action of\n" - + "standardizing datapoints in Hogg 2010 dataset\n" - + "showing marginal distributions and errors sigma_x, sigma_y" - ), - y=1.08, -); -``` - ---- - -+++ - -## 3. Simple Linear Model with no Outlier Correction - -+++ - -### 3.1 Specify Model - -+++ - -Before we get more advanced, I want to demo the fit of a simple linear model with Normal likelihood function. The priors are also Normally distributed, so this behaves like an OLS with Ridge Regression (L2 norm). - -Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, We've chosen to only use `sigma_y` - -$$\hat{y} \sim \mathcal{N}(\beta^{T} \vec{x}_{i}, \sigma_{i})$$ - -where: - -+ $\beta$ = $\{1, \beta_{j \in X_{j}}\}$ <--- linear coefs in $X_{j}$, in this case `1 + x` -+ $\sigma$ = error term <--- in this case we set this to an _unpooled_ $\sigma_{i}$: the measured error `sigma_y` for each datapoint - -```{code-cell} ipython3 -coords = {"coefs": ["intercept", "slope"], "datapoint_id": dfhoggs.index} -with pm.Model(coords=coords) as mdl_ols: - - ## Define weakly informative Normal priors to give Ridge regression - beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs") - - ## Define linear model - y_est = beta[0] + beta[1] * dfhoggs["x"] - - ## Define Normal likelihood - pm.Normal("y", mu=y_est, sigma=dfhoggs["sigma_y"], observed=dfhoggs["y"], dims="datapoint_id") - -pm.model_to_graphviz(mdl_ols) -``` - -### 3.2 Fit Model - -+++ - -Note we are purposefully missing a step here for prior predictive checks. 
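-
-For reference, a minimal sketch of what that skipped step could look like (assuming `pm.sample_prior_predictive`, available in recent PyMC3 releases, and the `mdl_ols` model defined above):
-
-```{code-cell} ipython3
-with mdl_ols:
-    prior_pred = pm.sample_prior_predictive(samples=500, random_seed=42)
-
-# Compare the spread of prior predictive draws of y against the observed (standardized) data;
-# wildly implausible draws would suggest tightening the Normal(0, 10) priors above.
-print("prior predictive y: min", prior_pred["y"].min().round(2), "max", prior_pred["y"].max().round(2))
-print("observed y:         min", dfhoggs["y"].min().round(2), "max", dfhoggs["y"].max().round(2))
-```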
- -+++ - -#### 3.2.1 Sample Posterior - -```{code-cell} ipython3 -with mdl_ols: - trc_ols = pm.sample( - tune=5000, - draws=500, - chains=4, - cores=4, - init="advi+adapt_diag", - n_init=50000, - progressbar=True, - return_inferencedata=True, - ) -``` - -#### 3.2.2 View Diagnostics - -+++ - -NOTE: We will illustrate this OLS fit and compare to the datapoints in the final comparison plot - -+++ - -Traceplot - -```{code-cell} ipython3 -_ = az.plot_trace(trc_ols, compact=False) -``` - -Plot posterior joint distribution (since the model has only 2 coeffs, we can easily view this as a 2D joint distribution) - -```{code-cell} ipython3 -marginal_kwargs = {"kind": "kde", "rug": True, "color": "C0"} -ax = az.plot_pair( - trc_ols, - var_names="beta", - marginals=True, - kind=["scatter", "kde"], - scatter_kwargs={"color": "C0", "alpha": 0.4}, - marginal_kwargs=marginal_kwargs, -) -fig = ax[0, 0].get_figure() -fig.suptitle("Posterior joint distribution (mdl_ols)", y=1.02); -``` - -## 4. Simple Linear Model with Robust Student-T Likelihood - -+++ - -I've added this brief section in order to directly compare the Student-T based method exampled in Thomas Wiecki's notebook in the [PyMC3 documentation](http://pymc-devs.github.io/pymc3/GLM-robust/) - -Instead of using a Normal distribution for the likelihood, we use a Student-T which has fatter tails. In theory this allows outliers to have a smaller influence in the likelihood estimation. This method does not produce inlier / outlier flags (it marginalizes over such a classification) but it's simpler and faster to run than the Signal Vs Noise model below, so a comparison seems worthwhile. - -+++ - -### 4.1 Specify Model - -+++ - -In this modification, we allow the likelihood to be more robust to outliers (have fatter tails) - -$$\hat{y} \sim \text{StudentT}(\beta^{T} \vec{x}_{i}, \sigma_{i}, \nu)$$ - -where: - -+ $\beta$ = $\{1, \beta_{j \in X_{j}}\}$ <--- linear coefs in $X_{j}$, in this case `1 + x` -+ $\sigma$ = error term <--- in this case we set this to an _unpooled_ $\sigma_{i}$: the measured error `sigma_y` for each datapoint -+ $\nu$ = degrees of freedom <--- allowing a pdf with fat tails and thus less influence from outlier datapoints - -Note: the dataset also has `sigma_x` and `rho_xy` available, but for this exercise, I've chosen to only use `sigma_y` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as mdl_studentt: - - # define weakly informative Normal priors to give Ridge regression - beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs") - - # define linear model - y_est = beta[0] + beta[1] * dfhoggs["x"] - - # define prior for StudentT degrees of freedom - # InverseGamma has nice properties: - # it's continuous and has support x ∈ (0, inf) - nu = pm.InverseGamma("nu", alpha=1, beta=1) - - # define Student T likelihood - pm.StudentT( - "y", mu=y_est, sigma=dfhoggs["sigma_y"], nu=nu, observed=dfhoggs["y"], dims="datapoint_id" - ) - -pm.model_to_graphviz(mdl_studentt) -``` - -### 4.2 Fit Model - -+++ - -#### 4.2.1 Sample Posterior - -```{code-cell} ipython3 -with mdl_studentt: - trc_studentt = pm.sample( - tune=5000, - draws=500, - chains=4, - cores=4, - init="advi+adapt_diag", - n_init=50000, - progressbar=True, - return_inferencedata=True, - ) -``` - -#### 4.2.2 View Diagnostics - -+++ - -NOTE: We will illustrate this StudentT fit and compare to the datapoints in the final comparison plot - -+++ - -Traceplot - -```{code-cell} ipython3 -_ = az.plot_trace(trc_studentt, compact=False); -``` - -Plot posterior joint distribution - 
-```{code-cell} ipython3 -marginal_kwargs["color"] = "C1" -ax = az.plot_pair( - trc_studentt, - var_names="beta", - kind=["scatter", "kde"], - divergences=True, - marginals=True, - marginal_kwargs=marginal_kwargs, - scatter_kwargs={"color": "C1", "alpha": 0.4}, -) -ax[0, 0].get_figure().suptitle("Posterior joint distribution (mdl_studentt)"); -``` - -#### 4.2.3 View the shift in posterior joint distributions from OLS to StudentT - -```{code-cell} ipython3 -marginal_kwargs["rug"] = False -marginal_kwargs["color"] = "C0" -ax = az.plot_pair( - trc_ols, - var_names="beta", - kind=["scatter", "kde"], - divergences=True, - figsize=[12, 12], - marginals=True, - marginal_kwargs=marginal_kwargs, - scatter_kwargs={"color": "C0", "alpha": 0.4}, - kde_kwargs={"contour_kwargs": {"colors": "C0"}}, -) - -marginal_kwargs["color"] = "C1" -az.plot_pair( - trc_studentt, - var_names="beta", - kind=["scatter", "kde"], - divergences=True, - marginals=True, - marginal_kwargs=marginal_kwargs, - scatter_kwargs={"color": "C1", "alpha": 0.4}, - kde_kwargs={"contour_kwargs": {"colors": "C1"}}, - ax=ax, -) - -ax[0, 0].get_figure().suptitle( - "Posterior joint distributions\n(showing general movement from OLS to StudentT)" -); -``` - -**Observe:** - -+ Both `beta` parameters appear to have greater variance than in the OLS regression -+ This is due to $\nu$ appearing to converge to a value `nu ~ 1`, indicating - that a fat-tailed likelihood has a better fit than a thin-tailed one -+ The parameter `beta[intercept]` has moved much closer to $0$, which is - interesting: if the theoretical relationship `y ~ f(x)` has no offset, - then for this mean-centered dataset, the intercept should indeed be $0$: it - might easily be getting pushed off-course by outliers in the OLS model. -+ The parameter `beta[slope]` has accordingly become greater: perhaps moving - closer to the theoretical function `f(x)` - -+++ - ---- - -+++ - -## 5. Linear Model with Custom Likelihood to Distinguish Outliers: Hogg Method - -+++ - -Please read the paper (Hogg 2010) and Jake Vanderplas' code for more complete information about the modelling technique. - -The general idea is to create a 'mixture' model whereby datapoints can be described by either: - -1. the proposed (linear) model (thus a datapoint is an inlier), or -2. a second model, which for convenience we also propose to be linear, but allow it to have a different mean and variance (thus a datapoint is an outlier) - -+++ - -### 5.1 Specify Model - -+++ - -The likelihood is evaluated over a mixture of two likelihoods, one for 'inliers', one for 'outliers'. 
A Bernoulli distribution is used to randomly assign each of the $N$ datapoints to either the inlier or outlier group, and we sample the model as usual to infer robust model parameters and inlier / outlier flags:
-
-$$
-\log \mathcal{L} = \sum_{i=1}^{N} \log \left[ \frac{(1 - B_{i})}{\sqrt{2 \pi \sigma_{in}^{2}}} \exp \left( - \frac{(x_{i} - \mu_{in})^{2}}{2\sigma_{in}^{2}} \right) \right] + \sum_{i=1}^{N} \log \left[ \frac{B_{i}}{\sqrt{2 \pi (\sigma_{in}^{2} + \sigma_{out}^{2})}} \exp \left( - \frac{(x_{i} - \mu_{out})^{2}}{2(\sigma_{in}^{2} + \sigma_{out}^{2})} \right) \right]
-$$
-
-where:
-+ $B_{i}$ is Bernoulli-distributed, $B_{i} \in \{0_{(inlier)}, 1_{(outlier)}\}$
-+ $\mu_{in} = \beta^{T} \vec{x}_{i}$ as before for inliers, where $\beta$ = $\{1, \beta_{j \in X_{j}}\}$ <--- linear coefs in
-$X_{j}$, in this case `1 + x`
-+ $\sigma_{in}$ = noise term <--- in this case we set this to an _unpooled_ $\sigma_{i}$: the measured error `sigma_y` for each datapoint
-+ $\mu_{out}$ <--- is a random _pooled_ variable for outliers
-+ $\sigma_{out}$ = additional noise term <--- is a random _pooled_ variable for outliers
-
-+++
-
-This notebook uses the {func}`~pymc3.model.Potential` class, combined with `logp`, to create a likelihood and to build this model in which one feature is not observed: the Bernoulli switching variable.
-
-Usage of `Potential` is not discussed in detail here; the following resources are worth referring to for details
-on `Potential` usage:
-
-+ [Junpenglao's presentation on likelihoods](https://github.com/junpenglao/All-that-likelihood-with-PyMC3) at PyData Berlin July 2018
-+ worked examples on [Discourse](https://discourse.pymc.io/t/pm-potential-much-needed-explanation-for-newbie/2341) and [Cross Validated](https://stats.stackexchange.com/a/252607/10625).
-+ and the pymc3 port of CamDP's Probabilistic Programming and Bayesian Methods for Hackers, Chapter 5 Loss Functions, [Example: Optimizing for the Showcase on The Price is Right](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter5_LossFunctions/Ch5_LossFunctions_PyMC3.ipynb) -+ Other examples using it, search for the `pymc3.Potential` tag on the left sidebar - -```{code-cell} ipython3 -with pm.Model(coords=coords) as mdl_hogg: - - # state input data as Theano shared vars - tsv_x = pm.Data("tsv_x", dfhoggs["x"], dims="datapoint_id") - tsv_y = pm.Data("tsv_y", dfhoggs["y"], dims="datapoint_id") - tsv_sigma_y = pm.Data("tsv_sigma_y", dfhoggs["sigma_y"], dims="datapoint_id") - - # weakly informative Normal priors (L2 ridge reg) for inliers - beta = pm.Normal("beta", mu=0, sigma=10, dims="coefs") - - # linear model for mean for inliers - y_est_in = beta[0] + beta[1] * tsv_x # dims="obs_id" - - # very weakly informative mean for all outliers - y_est_out = pm.Normal("y_est_out", mu=0, sigma=10, testval=pm.floatX(0.0)) - - # very weakly informative prior for additional variance for outliers - sigma_y_out = pm.HalfNormal("sigma_y_out", sigma=10, testval=pm.floatX(1.0)) - - # create in/outlier distributions to get a logp evaluated on the observed y - # this is not strictly a pymc3 likelihood, but behaves like one when we - # evaluate it within a Potential (which is minimised) - inlier_logp = pm.Normal.dist(mu=y_est_in, sigma=tsv_sigma_y).logp(tsv_y) - - outlier_logp = pm.Normal.dist(mu=y_est_out, sigma=tsv_sigma_y + sigma_y_out).logp(tsv_y) - - # frac_outliers only needs to span [0, .5] - # testval for is_outlier initialised in order to create class asymmetry - frac_outliers = pm.Uniform("frac_outliers", lower=0.0, upper=0.5) - is_outlier = pm.Bernoulli( - "is_outlier", - p=frac_outliers, - testval=(np.random.rand(tsv_x.eval().shape[0]) < 0.4) * 1, - dims="datapoint_id", - ) - - # non-sampled Potential evaluates the Normal.dist.logp's - potential = pm.Potential( - "obs", - ((1 - is_outlier) * inlier_logp).sum() + (is_outlier * outlier_logp).sum(), - ) -``` - -### 5.2 Fit Model - -+++ - -#### 5.2.1 Sample Posterior - -+++ - -Note that `pm.sample` conveniently and automatically creates the compound sampling process to: -1. sample a Bernoulli variable (the class `is_outlier`) using a discrete sampler -2. sample the continuous variables using a continuous sampler - -Further note: -+ This also means we can't initialise using ADVI, so will init using `jitter+adapt_diag` -+ In order to pass `kwargs` to a particular stepper, wrap them in a dict addressed to the lowercased [name of the stepper](https://github.com/pymc-devs/pymc3/blob/master/pymc3/sampling.py) e.g. 
`nuts={'target_accept': 0.85}`
-
-```{code-cell} ipython3
-with mdl_hogg:
-    trc_hogg = pm.sample(
-        tune=10000,
-        draws=500,
-        chains=4,
-        cores=4,
-        init="jitter+adapt_diag",
-        nuts={"target_accept": 0.99},
-        return_inferencedata=True,
-    )
-```
-
-#### 5.2.2 View Diagnostics
-
-+++
-
-We will illustrate this model fit and compare it to the datapoints in the final comparison plot.
-
-```{code-cell} ipython3
-rvs = ["beta", "y_est_out", "sigma_y_out", "frac_outliers"]
-_ = az.plot_trace(trc_hogg, var_names=rvs, compact=False);
-```
-
-**Observe:**
-
-+ At the default `target_accept = 0.8` there are lots of divergences, indicating this is not a particularly stable model
-+ However, with a higher `target_accept` (0.99 here) and `tune` increased from 5000 to 10000, the traces exhibit fewer divergences and appear slightly better behaved
-+ The traces for the inlier model `beta` parameters, and for the outlier model parameter `y_est_out` (the mean), look reasonably converged
-+ The traces for the outlier model parameter `sigma_y_out` (the additional pooled variance) occasionally go a bit wild
-+ It's interesting that `frac_outliers` is so dispersed: that's quite a flat distribution, suggesting that there are a few datapoints whose inlier/outlier status is subjective
-+ Indeed, as Thomas noted in his v2.0 Notebook, because we're explicitly modeling the latent label (inlier/outlier) as a binary choice, the sampler could have problems - rewriting this model into a marginal mixture model would be better.
-
-+++
-
-Simple trace summary, including `r_hat`:
-
-```{code-cell} ipython3
-az.summary(trc_hogg, var_names=rvs)
-```
-
-Plot posterior joint distribution
-
-(This is a particularly useful diagnostic in this case, where we see a lot of divergences in the traces: maybe the model specification leads to weird behaviours)
-
-```{code-cell} ipython3
-marginal_kwargs["color"] = "C2"
-marginal_kwargs["rug"] = True
-ax = az.plot_pair(
-    data=trc_hogg,
-    var_names="beta",
-    kind=["kde", "scatter"],
-    divergences=True,
-    marginals=True,
-    marginal_kwargs=marginal_kwargs,
-    scatter_kwargs={"color": "C2"},
-)
-ax[0, 0].get_figure().suptitle("Posterior joint distribution (mdl_hogg)");
-```
-
-#### 5.2.3 View the shift in posterior joint distributions from OLS to StudentT to Hogg
-
-```{code-cell} ipython3
-kde_kwargs = {"contour_kwargs": {"colors": "C0", "zorder": 4}, "contourf_kwargs": {"alpha": 0}}
-marginal_kwargs["rug"] = False
-marginal_kwargs["color"] = "C0"
-ax = az.plot_pair(
-    trc_ols,
-    var_names="beta",
-    kind="kde",
-    divergences=True,
-    marginals=True,
-    marginal_kwargs={"color": "C0"},
-    kde_kwargs=kde_kwargs,
-    figsize=(8, 8),
-)
-
-marginal_kwargs["color"] = "C1"
-kde_kwargs["contour_kwargs"]["colors"] = "C1"
-az.plot_pair(
-    trc_studentt,
-    var_names="beta",
-    kind="kde",
-    divergences=True,
-    marginals=True,
-    marginal_kwargs=marginal_kwargs,
-    kde_kwargs=kde_kwargs,
-    ax=ax,
-)
-
-marginal_kwargs["color"] = "C2"
-kde_kwargs["contour_kwargs"]["colors"] = "C2"
-az.plot_pair(
-    data=trc_hogg,
-    var_names="beta",
-    kind="kde",
-    divergences=True,
-    marginals=True,
-    marginal_kwargs=marginal_kwargs,
-    kde_kwargs=kde_kwargs,
-    ax=ax,
-    show=True,
-)
-ax[0, 0].get_figure().suptitle(
-    "Posterior joint distributions" + "\nOLS, StudentT, and Hogg (inliers)"
-);
-```
-
-**Observe:**
-
-+ The `hogg_inlier` and `studentt` models converge to similar ranges for
-`beta[intercept]` and `beta[slope]`, indicating that the (unshown) `hogg_outlier`
-model might perform a similar job to the fat tails of the `studentt` model:
-allowing greater log probability
away from the main linear distribution in the datapoints -+ As expected, (since it's a Normal) the `hogg_inlier` posterior has thinner - tails and more probability mass concentrated about the central values -+ The `hogg_inlier` model also appears to have moved farther away from both the -`ols` and `studentt` models on the `b0_intercept`, suggesting that the outliers -really distort that particular dimension - -+++ - -### 5.3 Declare Outliers - -+++ - -#### 5.3.1 View ranges for inliers / outlier predictions - -+++ - -At each step of the traces, each datapoint may be either an inlier or outlier. We hope that the datapoints spend an unequal time being one state or the other, so let's take a look at the simple count of states for each of the 20 datapoints. - -```{code-cell} ipython3 -dfm_outlier_results = trc_hogg.posterior.is_outlier.to_dataframe().reset_index() - -with plt.rc_context({"figure.constrained_layout.use": False}): - gd = sns.catplot( - y="datapoint_id", - x="is_outlier", - data=dfm_outlier_results, - kind="point", - join=False, - height=6, - aspect=2, - ) -_ = gd.fig.axes[0].set(xlim=(-0.05, 1.05), xticks=np.arange(0, 1.1, 0.1)) -_ = gd.fig.axes[0].axvline(x=0, color="b", linestyle=":") -_ = gd.fig.axes[0].axvline(x=1, color="r", linestyle=":") -_ = gd.fig.axes[0].yaxis.grid(True, linestyle="-", which="major", color="w", alpha=0.4) -_ = gd.fig.suptitle( - ("For each datapoint, distribution of outlier classification " + "over the trace"), - y=1.04, - fontsize=16, -) -``` - -**Observe**: - -+ The plot above shows the proportion of samples in the traces in which each datapoint is marked as an outlier, expressed as a percentage. -+ 3 points [p2, p3, p4] spend >=95% of their time as outliers -+ Note the mean posterior value of `frac_outliers ~ 0.27`, corresponding to approx 5 or 6 of the 20 datapoints: we might investigate datapoints `[p1, p12, p16]` to see if they lean towards being outliers - -The 95% cutoff we choose is subjective and arbitrary, but I prefer it for now, so let's declare these 3 to be outliers and see how it looks compared to Jake Vanderplas' outliers, which were declared in a slightly different way as points with means above 0.68. - -+++ - -#### 5.3.2 Declare outliers - -**Note:** -+ I will declare outliers to be datapoints that have value == 1 at the 5-percentile cutoff, i.e. in the percentiles from 5 up to 100, their values are 1. -+ Try for yourself altering cutoff to larger values, which leads to an objective ranking of outlier-hood. - -```{code-cell} ipython3 -cutoff = 0.05 -dfhoggs["classed_as_outlier"] = ( - trc_hogg.posterior["is_outlier"].quantile(cutoff, dim=("chain", "draw")) == 1 -) -dfhoggs["classed_as_outlier"].value_counts() -``` - -Also add flag for points to be investigated. 
We will use this to annotate the final plot.
-
-```{code-cell} ipython3
-dfhoggs["annotate_for_investigation"] = (
-    trc_hogg.posterior["is_outlier"].quantile(0.75, dim=("chain", "draw")) == 1
-)
-dfhoggs["annotate_for_investigation"].value_counts()
-```
-
-### 5.4 Posterior Prediction Plots for OLS vs StudentT vs Hogg "Signal vs Noise"
-
-```{code-cell} ipython3
-import xarray as xr
-
-x = xr.DataArray(np.linspace(-3, 3, 10), dims="plot_dim")
-
-# evaluate outlier posterior distribution for plotting
-trc_hogg.posterior["outlier_mean"] = trc_hogg.posterior["y_est_out"].broadcast_like(x)
-
-# evaluate model (inlier) posterior distributions for all 3 models
-lm = lambda beta, x: beta.sel(coefs="intercept") + beta.sel(coefs="slope") * x
-
-trc_ols.posterior["y_mean"] = lm(trc_ols.posterior["beta"], x)
-trc_studentt.posterior["y_mean"] = lm(trc_studentt.posterior["beta"], x)
-trc_hogg.posterior["y_mean"] = lm(trc_hogg.posterior["beta"], x)
-```
-
-```{code-cell} ipython3
-def subsample_helper(da, samples=100, seed=None):
-    da = da.stack(sample=("chain", "draw"))
-    rng = np.random.default_rng(seed)
-    n = len(da.sample)
-    return da.isel(sample=rng.choice(n, samples, replace=False))
-```
-
-```{code-cell} ipython3
-with plt.rc_context({"figure.constrained_layout.use": False}):
-    gd = sns.FacetGrid(
-        dfhoggs,
-        height=7,
-        hue="classed_as_outlier",
-        hue_order=[True, False],
-        palette="Set1",
-        legend_out=False,
-    )
-
-# plot hogg outlier posterior distribution
-outlier_mean = subsample_helper(trc_hogg.posterior["outlier_mean"], 400)
-gd.ax.plot(x, outlier_mean, color="C3", linewidth=0.5, alpha=0.2, zorder=1)
-
-# plot the 3 model (inlier) posterior distributions
-y_mean = subsample_helper(trc_ols.posterior["y_mean"], 200)
-gd.ax.plot(x, y_mean, color="C0", linewidth=0.5, alpha=0.2, zorder=2)
-
-y_mean = subsample_helper(trc_studentt.posterior["y_mean"], 200)
-gd.ax.plot(x, y_mean, color="C1", linewidth=0.5, alpha=0.2, zorder=3)
-
-y_mean = subsample_helper(trc_hogg.posterior["y_mean"], 200)
-gd.ax.plot(x, y_mean, color="C2", linewidth=0.5, alpha=0.2, zorder=4)
-
-# add legend for regression lines plotted above
-# note: each label matches the color used for that model's lines above
-line_legend = plt.legend(
-    [
-        Line2D([0], [0], color="C3"),
-        Line2D([0], [0], color="C2"),
-        Line2D([0], [0], color="C1"),
-        Line2D([0], [0], color="C0"),
-    ],
-    ["Hogg Outlier", "Hogg Inlier", "Student-T", "OLS"],
-    loc="lower right",
-    title="Posterior Predictive",
-)
-gd.ax.add_artist(line_legend)
-
-# plot points
-_ = gd.map(
-    plt.errorbar,
-    "x",
-    "y",
-    "sigma_y",
-    "sigma_x",
-    marker="o",
-    ls="",
-    markeredgecolor="w",
-    markersize=10,
-    zorder=5,
-).add_legend()
-gd.ax.legend(loc="upper left", title="Outlier Classification")
-
-# annotate the potential outliers
-for idx, r in dfhoggs.loc[dfhoggs["annotate_for_investigation"]].iterrows():
-    _ = gd.ax.annotate(
-        text=idx,
-        xy=(r["x"], r["y"]),
-        xycoords="data",
-        xytext=(7, 7),
-        textcoords="offset points",
-        color="k",
-        zorder=4,
-    )
-
-## create xlims ylims for plotting
-x_ptp = np.ptp(dfhoggs["x"].values) / 3.3
-y_ptp = np.ptp(dfhoggs["y"].values) / 3.3
-xlims = (dfhoggs["x"].min() - x_ptp, dfhoggs["x"].max() + x_ptp)
-ylims = (dfhoggs["y"].min() - y_ptp, dfhoggs["y"].max() + y_ptp)
-gd.ax.set(ylim=ylims, xlim=xlims)
-gd.fig.suptitle(
-    (
-        "Standardized datapoints in Hogg 2010 dataset, showing "
-        "posterior predictive fit for 3 models:\nOLS, StudentT and Hogg "
-        '"Signal vs Noise" (inlier vs outlier, custom likelihood)'
-    ),
-    y=1.04,
-    fontsize=14,
-);
-```
-
-**Observe**:
-
-The posterior predictive fit for:
-+ the **OLS
model** is plotted in `C0` and, as expected, it doesn't appear to fit the majority of our datapoints very well, being skewed by the outliers
-+ the **Student-T model** is plotted in `C1` and does appear to fit the 'main axis' of datapoints quite well, ignoring outliers
-+ the **Hogg Signal vs Noise model** is shown in two parts:
-    + the inlier component, plotted in `C2`, fits the 'main axis' of datapoints well, ignoring outliers
-    + the outlier component, plotted in `C3`, has a very large variance and assigns the 'outlier' points more log likelihood than the inlier component
-
-
-We see that the **Hogg Signal vs Noise model** also yields specific estimates of _which_ datapoints are outliers:
-+ 17 'inlier' datapoints, shown in **Blue**, and
-+ 3 'outlier' datapoints, shown in **Red**.
-+ From a simple visual inspection, the classification seems fair, and agrees with Jake Vanderplas' findings.
-+ I've annotated these outliers, and the most outlying inliers, to aid visual investigation
-
-
-Overall:
-+ the **Hogg Signal vs Noise model** behaves as promised, yielding a robust regression estimate and explicit labelling of inliers / outliers, but
-+ the **Hogg Signal vs Noise model** is quite complex, and whilst the regression seems robust, the traceplot shows many divergences, and the model is potentially unstable
-+ if you simply want a robust regression without inlier / outlier labelling, the **Student-T model** may be a good compromise, offering a simple model, quick sampling, and a very similar estimate.
-
-+++
-
-## References
-
-:::{bibliography}
-:filter: docname in docnames
-:::
-
-+++
-
-## Authors
-
-+++
-
-* Authored and adapted for this collection by Jon Sedar ([jonsedar](https://github.com/jonsedar)) on December, 2015. It was originally posted in [jonsedar/pymc3_examples](https://github.com/jonsedar/pymc3_examples)
-* Updated by Thomas Wiecki ([twiecki](https://github.com/twiecki)) on July, 2018
-  * Restate outlier model using `pm.Normal.dist().logp()` and `pm.Potential()`
-* Updated by Jon Sedar on November, 2019
-  * Restate `nu` in StudentT model to be more efficient, drop explicit use of theano shared vars, generally improve plotting / explanations / layout
-* Updated by Jon Sedar on May, 2020
-  * Tidy up language, formatting, plots and warnings, and rerun with pymc=3.8, arviz=0.7
-* Updated by Raul Maldonado ([CloudChaoszero](https://github.com/CloudChaoszero)) on April, 2021
-  * Tidy up language, formatting, set MultiTrace objects to `arviz.InferenceData` objects, running on pymc=3.11, arviz=0.11.0
-* Updated by Raul Maldonado on May, 2021
-  * Update visualizations from explicit Matplotlib calls to ArviZ visualizations, keeping the `arviz.InferenceData`
objects, running on pymc=3.11, arviz=0.11.0 -* Updated by Oriol Abril on November, 2021 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-robust.myst.md b/myst_nbs/generalized_linear_models/GLM-robust.myst.md deleted file mode 100644 index 3f3949b41..000000000 --- a/myst_nbs/generalized_linear_models/GLM-robust.myst.md +++ /dev/null @@ -1,234 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GLM-robust)= -# GLM: Robust Linear Regression - -:::{post} August, 2013 -:tags: regression, linear model, robust -:category: beginner -:author: Thomas Wiecki, Chris Fonnesbeck, Abhipsha Das, Conor Hassan, Igor Kuvychko, Reshama Shaikh, Oriol Abril Pla -::: - -+++ - -# GLM: Robust Linear Regression - -The tutorial is the second of a three-part series on Bayesian *generalized linear models (GLMs)*, that first appeared on [Thomas Wiecki's blog](https://twiecki.io/): - - 1. {ref}`Linear Regression ` - 2. {ref}`Robust Linear Regression ` - 3. {ref}`Hierarchical Linear Regression ` - -In this blog post I will write about: - - - How a few outliers can largely affect the fit of linear regression models. - - How replacing the normal likelihood with Student T distribution produces robust regression. - -In the {ref}`linear regression tutorial ` I described how minimizing the squared distance of the regression line is the same as maximizing the likelihood of a Normal distribution with the mean coming from the regression line. This latter probabilistic expression allows us to easily formulate a Bayesian linear regression model. - -This worked splendidly on simulated data. The problem with simulated data though is that it's, well, simulated. In the real world things tend to get more messy and assumptions like normality are easily violated by a few outliers. - -Lets see what happens if we add some outliers to our simulated data from the last post. - -+++ - -First, let's import our modules. - -```{code-cell} ipython3 -%matplotlib inline - -import aesara -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import xarray as xr -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -Create some toy data but also add some outliers. - -```{code-cell} ipython3 -size = 100 -true_intercept = 1 -true_slope = 2 - -x = np.linspace(0, 1, size) -# y = a + b*x -true_regression_line = true_intercept + true_slope * x -# add noise -y = true_regression_line + rng.normal(scale=0.5, size=size) - -# Add outliers -x_out = np.append(x, [0.1, 0.15, 0.2]) -y_out = np.append(y, [8, 6, 9]) - -data = pd.DataFrame(dict(x=x_out, y=y_out)) -``` - -Plot the data together with the true regression line (the three points in the upper left corner are the outliers we added). 
-
-```{code-cell} ipython3
-fig = plt.figure(figsize=(7, 5))
-ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
-ax.plot(x_out, y_out, "x", label="sampled data")
-ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
-plt.legend(loc=0);
-```
-
-## Robust Regression
-
-### Normal Likelihood
-
-Let's see what happens if we estimate our Bayesian linear regression model using a Normal likelihood.
-Note that the `bambi` library provides an easy-to-use interface such that an equivalent model can be built using one line of code.
-A version of this same notebook using Bambi is available at {doc}`bambi's docs `
-
-```{code-cell} ipython3
-with pm.Model() as model:
-    xdata = pm.ConstantData("x", x_out, dims="obs_id")
-
-    # define priors
-    intercept = pm.Normal("intercept", mu=0, sigma=1)
-    slope = pm.Normal("slope", mu=0, sigma=1)
-    sigma = pm.HalfCauchy("sigma", beta=10)
-
-    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
-
-    # define likelihood
-    likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y_out, dims="obs_id")
-
-    # inference
-    trace = pm.sample(tune=2000)
-```
-
-To evaluate the fit, the code below calculates the posterior predictive regression lines by taking regression parameters from the posterior distribution and plots a regression line for 20 of them.
-
-```{code-cell} ipython3
-post = az.extract(trace, num_samples=20)
-x_plot = xr.DataArray(np.linspace(x_out.min(), x_out.max(), 100), dims="plot_id")
-lines = post["intercept"] + post["slope"] * x_plot
-
-plt.scatter(x_out, y_out, label="data")
-plt.plot(x_plot, lines.transpose(), alpha=0.4, color="C1")
-plt.plot(x, true_regression_line, label="True regression line", lw=3.0, c="C2")
-plt.legend(loc=0)
-plt.title("Posterior predictive for normal likelihood");
-```
-
-As you can see, the fit is quite skewed and we have a fair amount of uncertainty in our estimate, as indicated by the wide range of different posterior predictive regression lines. Why is this? The reason is that the Normal distribution does not have a lot of mass in the tails and consequently, an outlier will affect the fit strongly.
-
-A Frequentist would estimate a [Robust Regression](http://en.wikipedia.org/wiki/Robust_regression) and use a non-quadratic distance measure to evaluate the fit.
-
-But what's a Bayesian to do? Since the problem is the light tails of the Normal distribution, we can instead assume that our data is not normally distributed but instead distributed according to the [Student T distribution](http://en.wikipedia.org/wiki/Student%27s_t-distribution), which has heavier tails, as shown next {cite:p}`gelman2013bayesian,kruschke2014doing`.
-
-Let's look at those two distributions to get a feel for them.
-
-```{code-cell} ipython3
-normal_dist = pm.Normal.dist(mu=0, sigma=1)
-t_dist = pm.StudentT.dist(mu=0, lam=1, nu=1)
-x_eval = np.linspace(-8, 8, 300)
-plt.plot(x_eval, pm.math.exp(pm.logp(normal_dist, x_eval)).eval(), label="Normal", lw=2.0)
-plt.plot(x_eval, pm.math.exp(pm.logp(t_dist, x_eval)).eval(), label="Student T", lw=2.0)
-plt.xlabel("x")
-plt.ylabel("Probability density")
-plt.legend();
-```
-
-As you can see, values far away from the mean (0 in this case) are much more likely under the `T` distribution than under the Normal distribution.
-
-Below is a PyMC model, with the `likelihood` term following a `StudentT` distribution with $\nu=3$ degrees of freedom, as opposed to the `Normal` distribution.
-
-```{code-cell} ipython3
-with pm.Model() as robust_model:
-    xdata = pm.ConstantData("x", x_out, dims="obs_id")
-
-    # define priors
-    intercept = pm.Normal("intercept", mu=0, sigma=1)
-    slope = pm.Normal("slope", mu=0, sigma=1)
-    sigma = pm.HalfCauchy("sigma", beta=10)
-
-    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
-
-    # define likelihood
-    likelihood = pm.StudentT("y", mu=mu, sigma=sigma, nu=3, observed=y_out, dims="obs_id")
-
-    # inference
-    robust_trace = pm.sample(tune=4000)
-```
-
-```{code-cell} ipython3
-robust_post = az.extract(robust_trace, num_samples=20)
-x_plot = xr.DataArray(np.linspace(x_out.min(), x_out.max(), 100), dims="plot_id")
-robust_lines = robust_post["intercept"] + robust_post["slope"] * x_plot
-
-plt.scatter(x_out, y_out, label="data")
-plt.plot(x_plot, robust_lines.transpose(), alpha=0.4, color="C1")
-plt.plot(x, true_regression_line, label="True regression line", lw=3.0, c="C2")
-plt.legend(loc=0)
-plt.title("Posterior predictive for Student-T likelihood");
-```
-
-There, much better! The outliers are barely influencing our estimation at all because our likelihood function assumes that outliers are much more probable than under the Normal distribution.
-
-+++
-
-## Summary
-
- - By changing the likelihood from a Normal distribution to a Student T distribution -- which has more mass in the tails -- we can perform *Robust Regression*.
-
-*Extensions*:
-
- - The Student-T distribution has, besides the mean and variance, a third parameter called *degrees of freedom* that describes how much mass should be put into the tails. Here it is set to 3, which keeps a lot of mass in the tails (the smaller the value, the heavier the tails; setting it to infinity recovers the Normal distribution!). One could easily place a prior on this rather than fixing it; a minimal sketch of that idea follows after this list.
- - T distributions can be used as priors as well. See {ref}`GLM-hierarchical`
- - How do we test if our data is normal or violates that assumption in an important way? Check out this [great blog post](http://allendowney.blogspot.com/2013/08/are-my-data-normal.html) by Allen Downey.
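-
-As a rough illustration of the first extension above, here is a minimal sketch of the same robust model with a prior placed on $\nu$ (the `Exponential` prior and its rate are arbitrary illustrative choices, not a recommendation from the original post):
-
-```{code-cell} ipython3
-with pm.Model() as robust_model_nu:
-    xdata = pm.ConstantData("x", x_out, dims="obs_id")
-
-    # same priors as before
-    intercept = pm.Normal("intercept", mu=0, sigma=1)
-    slope = pm.Normal("slope", mu=0, sigma=1)
-    sigma = pm.HalfCauchy("sigma", beta=10)
-
-    # let the data inform the degrees of freedom instead of fixing nu=3;
-    # an Exponential prior with mean 30 is a weakly informative, arbitrary choice
-    nu = pm.Exponential("nu", 1 / 30)
-
-    mu = pm.Deterministic("mu", intercept + slope * xdata, dims="obs_id")
-
-    # Student-T likelihood with the inferred degrees of freedom
-    pm.StudentT("y", mu=mu, sigma=sigma, nu=nu, observed=y_out, dims="obs_id")
-
-    robust_nu_trace = pm.sample(tune=4000)
-```
-
-Low posterior values of `nu` would indicate that heavy tails are needed to accommodate the outliers, while large values would suggest the Normal likelihood is adequate.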
- -+++ - -## Authors - -* Adapted from [Thomas Wiecki's](https://twitter.com/twiecki) blog -* Updated by @fonnesbeck in September 2016 (pymc#1378) -* Updated by @chiral-carbon in August 2021 (pymc-examples#205) -* Updated by Conor Hassan, Igor Kuvychko, Reshama Shaikh and [Oriol Abril Pla](https://oriolabrilpla.cat/en/) in 2022 - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` - -:::{include} ../page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/generalized_linear_models/GLM-rolling-regression.myst.md b/myst_nbs/generalized_linear_models/GLM-rolling-regression.myst.md deleted file mode 100644 index 83545beff..000000000 --- a/myst_nbs/generalized_linear_models/GLM-rolling-regression.myst.md +++ /dev/null @@ -1,265 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(GLM-rolling-regression)= -# Rolling Regression - -:::{post} June, 2022 -:tags: generalized linear model, regression -:category: intermediate -:author: Thomas Wiecki -::: - -+++ - -* [Pairs trading](https://en.wikipedia.org/wiki/Pairs_trade?oldformat=true) is a famous technique in algorithmic trading that plays two stocks against each other. -* For this to work, stocks must be correlated (cointegrated). -* One common example is the price of gold (GLD) and the price of gold mining operations (GFI). - -```{code-cell} ipython3 -import os -import warnings - -import arviz as az -import matplotlib.pyplot as plt -import matplotlib.ticker as mticker -import numpy as np -import pandas as pd -import pymc as pm -import xarray as xr - -from matplotlib import MatplotlibDeprecationWarning - -warnings.filterwarnings(action="ignore", category=MatplotlibDeprecationWarning) -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -Lets load the prices of GFI and GLD. - -```{code-cell} ipython3 -# from pandas_datareader import data -# prices = data.GoogleDailyReader(symbols=['GLD', 'GFI'], end='2014-8-1').read().loc['Open', :, :] -try: - prices = pd.read_csv(os.path.join("..", "data", "stock_prices.csv")).dropna() -except FileNotFoundError: - prices = pd.read_csv(pm.get_data("stock_prices.csv")).dropna() - -prices["Date"] = pd.DatetimeIndex(prices["Date"]) -prices = prices.set_index("Date") -prices_zscored = (prices - prices.mean()) / prices.std() -prices.head() -``` - -Plotting the prices over time suggests a strong correlation. However, the correlation seems to change over time. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(9, 6)) -ax = fig.add_subplot(111, xlabel=r"Price GFI in \$", ylabel=r"Price GLD in \$") -colors = np.linspace(0, 1, len(prices)) -mymap = plt.get_cmap("viridis") -sc = ax.scatter(prices.GFI, prices.GLD, c=colors, cmap=mymap, lw=0) -ticks = colors[:: len(prices) // 10] -ticklabels = [str(p.date()) for p in prices[:: len(prices) // 10].index] -cb = plt.colorbar(sc, ticks=ticks) -cb.ax.set_yticklabels(ticklabels); -``` - -A naive approach would be to estimate a linear model and ignore the time domain. 
- -```{code-cell} ipython3 -with pm.Model() as model: # model specifications in PyMC are wrapped in a with-statement - # Define priors - sigma = pm.HalfCauchy("sigma", beta=10) - alpha = pm.Normal("alpha", mu=0, sigma=20) - beta = pm.Normal("beta", mu=0, sigma=20) - - mu = pm.Deterministic("mu", alpha + beta * prices_zscored.GFI.to_numpy()) - - # Define likelihood - likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=prices_zscored.GLD.to_numpy()) - - # Inference - trace_reg = pm.sample(tune=2000) -``` - -The posterior predictive plot shows how bad the fit is. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(9, 6)) -ax = fig.add_subplot( - 111, - xlabel=r"Price GFI in \$", - ylabel=r"Price GLD in \$", - title="Posterior predictive regression lines", -) -sc = ax.scatter(prices_zscored.GFI, prices_zscored.GLD, c=colors, cmap=mymap, lw=0) - -xi = xr.DataArray(prices_zscored.GFI.values) -az.plot_hdi( - xi, - trace_reg.posterior.mu, - color="k", - hdi_prob=0.95, - ax=ax, - fill_kwargs={"alpha": 0.25}, - smooth=False, -) -az.plot_hdi( - xi, - trace_reg.posterior.mu, - color="k", - hdi_prob=0.5, - ax=ax, - fill_kwargs={"alpha": 0.25}, - smooth=False, -) - -cb = plt.colorbar(sc, ticks=ticks) -cb.ax.set_yticklabels(ticklabels); -``` - -## Rolling regression - -Next, we will build an improved model that will allow for changes in the regression coefficients over time. Specifically, we will assume that intercept and slope follow a random-walk through time. That idea is similar to the {doc}`case_studies/stochastic_volatility`. - -$$ \alpha_t \sim \mathcal{N}(\alpha_{t-1}, \sigma_\alpha^2) $$ -$$ \beta_t \sim \mathcal{N}(\beta_{t-1}, \sigma_\beta^2) $$ - -+++ - -First, lets define the hyper-priors for $\sigma_\alpha^2$ and $\sigma_\beta^2$. This parameter can be interpreted as the volatility in the regression coefficients. - -```{code-cell} ipython3 -with pm.Model(coords={"time": prices.index.values}) as model_randomwalk: - # std of random walk - sigma_alpha = pm.Exponential("sigma_alpha", 50.0) - sigma_beta = pm.Exponential("sigma_beta", 50.0) - - alpha = pm.GaussianRandomWalk( - "alpha", sigma=sigma_alpha, init_dist=pm.Normal.dist(0, 10), dims="time" - ) - beta = pm.GaussianRandomWalk( - "beta", sigma=sigma_beta, init_dist=pm.Normal.dist(0, 10), dims="time" - ) -``` - -Perform the regression given coefficients and data and link to the data via the likelihood. - -```{code-cell} ipython3 -with model_randomwalk: - # Define regression - regression = alpha + beta * prices_zscored.GFI.values - - # Assume prices are Normally distributed, the mean comes from the regression. - sd = pm.HalfNormal("sd", sigma=0.1) - likelihood = pm.Normal("y", mu=regression, sigma=sd, observed=prices_zscored.GLD.to_numpy()) -``` - -Inference. Despite this being quite a complex model, NUTS handles it wells. - -```{code-cell} ipython3 -with model_randomwalk: - trace_rw = pm.sample(tune=2000, target_accept=0.9) -``` - -Increasing the tree-depth does indeed help but it makes sampling very slow. The results look identical with this run, however. - -+++ - -## Analysis of results - -+++ - -As can be seen below, $\alpha$, the intercept, changes over time. 
- -```{code-cell} ipython3 -fig = plt.figure(figsize=(8, 6), constrained_layout=False) -ax = plt.subplot(111, xlabel="time", ylabel="alpha", title="Change of alpha over time.") -ax.plot(trace_rw.posterior.stack(pooled_chain=("chain", "draw"))["alpha"], "r", alpha=0.05) - -ticks_changes = mticker.FixedLocator(ax.get_xticks().tolist()) -ticklabels_changes = [str(p.date()) for p in prices[:: len(prices) // 7].index] -ax.xaxis.set_major_locator(ticks_changes) -ax.set_xticklabels(ticklabels_changes) - -fig.autofmt_xdate() -``` - -As does the slope. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(8, 6), constrained_layout=False) -ax = fig.add_subplot(111, xlabel="time", ylabel="beta", title="Change of beta over time") -ax.plot(trace_rw.posterior.stack(pooled_chain=("chain", "draw"))["beta"], "b", alpha=0.05) - -ax.xaxis.set_major_locator(ticks_changes) -ax.set_xticklabels(ticklabels_changes) - -fig.autofmt_xdate() -``` - -The posterior predictive plot shows that we capture the change in regression over time much better. Note that we should have used returns instead of prices. The model would still work the same, but the visualisations would not be quite as clear. - -```{code-cell} ipython3 -fig = plt.figure(figsize=(8, 6)) -ax = fig.add_subplot( - 111, - xlabel=r"Price GFI in \$", - ylabel=r"Price GLD in \$", - title="Posterior predictive regression lines", -) - -colors = np.linspace(0, 1, len(prices)) -colors_sc = np.linspace(0, 1, len(prices.index.values[::50])) - -xi = xr.DataArray(np.linspace(prices_zscored.GFI.min(), prices_zscored.GFI.max(), 50)) - -for i, time in enumerate(prices.index.values[::50]): - sel_trace = trace_rw.posterior.sel(time=time) - regression_line = ( - (sel_trace["alpha"] + sel_trace["beta"] * xi) - .stack(pooled_chain=("chain", "draw")) - .isel(pooled_chain=slice(None, None, 200)) - ) - ax.plot(xi, regression_line, color=mymap(colors_sc[i]), alpha=0.1, zorder=10, linewidth=3) - -sc = ax.scatter( - prices_zscored.GFI, prices_zscored.GLD, label="data", cmap=mymap, c=colors, zorder=11 -) - -cb = plt.colorbar(sc, ticks=ticks) -cb.ax.set_yticklabels(ticklabels); -``` - -## Authors - -- Created by [Thomas Wiecki](https://github.com/twiecki/) -- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) June 2022 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-simpsons-paradox.myst.md b/myst_nbs/generalized_linear_models/GLM-simpsons-paradox.myst.md deleted file mode 100644 index 84e14c8cf..000000000 --- a/myst_nbs/generalized_linear_models/GLM-simpsons-paradox.myst.md +++ /dev/null @@ -1,559 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev-py39 - language: python - name: pymc-dev-py39 ---- - -(GLM-simpsons-paradox)= -# Simpson's paradox and mixed models - -:::{post} March, 2022 -:tags: regression, hierarchical model, linear model, posterior predictive, Simpson's paradox -:category: beginner -:author: Benjamin T. Vincent -::: - -+++ - -This notebook covers: -- [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) and its resolution through mixed or hierarchical models. 
This is a situation where there might be a negative relationship between two variables within a group, but when data from multiple groups are combined, that relationship may disappear or even reverse sign. The gif below (from the [Simpson's Paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) Wikipedia page) demonstrates this very nicely.
-- How to build linear regression models, starting with simple linear regression and moving up to hierarchical linear regression. Simpson's paradox is a nice motivation for why we might want to do this - but of course we should aim to build models which incorporate as much of our knowledge about the structure of the data (e.g. its nested nature) as possible.
-- Use of `pm.Data` containers to facilitate posterior prediction at different $x$ values with the same model.
-- Providing array dimensions (see `coords`) to models to help with shape problems. This involves the use of [xarray](http://xarray.pydata.org/) and is quite helpful in multi-level / hierarchical models.
-- Differences between posteriors and posterior predictive distributions.
-- How to visualise models in data space and parameter space, using a mixture of [ArviZ](https://arviz-devs.github.io/arviz/) and [matplotlib](https://matplotlib.org/).
-
-+++
-
-![](https://upload.wikimedia.org/wikipedia/commons/f/fb/Simpsons_paradox_-_animation.gif)
-
-```{code-cell} ipython3
-import arviz as az
-import matplotlib.pyplot as plt
-import numpy as np
-import pandas as pd
-import pymc as pm
-import xarray as xr
-```
-
-```{code-cell} ipython3
-%config InlineBackend.figure_format = 'retina'
-az.style.use("arviz-darkgrid")
-rng = np.random.default_rng(1234)
-```
-
-## Generate data
-
-This data generation was influenced by this [stackexchange](https://stats.stackexchange.com/questions/479201/understanding-simpsons-paradox-with-random-effects) question.
-
-```{code-cell} ipython3
-:tags: [hide-input]
-
-def generate():
-    group_list = ["one", "two", "three", "four", "five"]
-    trials_per_group = 20
-    group_intercepts = rng.normal(0, 1, len(group_list))
-    group_slopes = np.ones(len(group_list)) * -0.5
-    group_mx = group_intercepts * 2
-    group = np.repeat(group_list, trials_per_group)
-    subject = np.concatenate(
-        [np.ones(trials_per_group) * i for i in np.arange(len(group_list))]
-    ).astype(int)
-    intercept = np.repeat(group_intercepts, trials_per_group)
-    slope = np.repeat(group_slopes, trials_per_group)
-    mx = np.repeat(group_mx, trials_per_group)
-    x = rng.normal(mx, 1)
-    y = rng.normal(intercept + (x - mx) * slope, 1)
-    data = pd.DataFrame({"group": group, "group_idx": subject, "x": x, "y": y})
-    return data, group_list
-```
-
-```{code-cell} ipython3
-data, group_list = generate()
-```
-
-To follow along, it is useful to clearly understand the form of the data. This is [long form](https://en.wikipedia.org/wiki/Wide_and_narrow_data) data (also known as narrow data) in that each row represents one observation. We have a `group` column which has the group label, and an accompanying numerical `group_idx` column. This is very useful when it comes to modelling as we can use it as an index to look up group-level parameter estimates. Finally, we have our core observations of the predictor variable `x` and the outcome `y`.
-
-```{code-cell} ipython3
-display(data)
-```
-
-And we can visualise this as below.
- -```{code-cell} ipython3 -:tags: [hide-input] - -for i, group in enumerate(group_list): - plt.scatter( - data.query(f"group_idx=={i}").x, - data.query(f"group_idx=={i}").y, - color=f"C{i}", - label=f"{group}", - ) -plt.legend(title="group"); -``` - -The rest of the notebook will cover different ways that we can analyse this data using linear models. - -+++ - -## Model 1: Basic linear regression - -First we examine the simplest model - plain linear regression which pools all the data and has no knowledge of the group/multi-level structure of the data. - -+++ - -### Build model - -```{code-cell} ipython3 -with pm.Model() as linear_regression: - sigma = pm.HalfCauchy("sigma", beta=2) - β0 = pm.Normal("β0", 0, sigma=5) - β1 = pm.Normal("β1", 0, sigma=5) - x = pm.MutableData("x", data.x, dims="obs_id") - μ = pm.Deterministic("μ", β0 + β1 * x, dims="obs_id") - pm.Normal("y", mu=μ, sigma=sigma, observed=data.y, dims="obs_id") -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(linear_regression) -``` - -### Do inference - -```{code-cell} ipython3 -with linear_regression: - idata = pm.sample() -``` - -```{code-cell} ipython3 -az.plot_trace(idata, filter_vars="regex", var_names=["~μ"]); -``` - -### Visualisation - -```{code-cell} ipython3 -# posterior prediction for these x values -xi = np.linspace(data.x.min(), data.x.max(), 20) - -# do posterior predictive inference -with linear_regression: - pm.set_data({"x": xi}) - idata.extend(pm.sample_posterior_predictive(idata, var_names=["y", "μ"])) -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(1, 3, figsize=(12, 4)) - -# conditional mean plot --------------------------------------------- -# data -ax[0].scatter(data.x, data.y, color="k") -# conditional mean credible intervals -post = idata.posterior.stack(sample=("chain", "draw")) -xi = xr.DataArray(np.linspace(np.min(data.x), np.max(data.x), 20), dims=["x_plot"]) -y = post.β0 + post.β1 * xi -region = y.quantile([0.025, 0.15, 0.5, 0.85, 0.975], dim="sample") -ax[0].fill_between( - xi, region.sel(quantile=0.025), region.sel(quantile=0.975), alpha=0.2, color="k", edgecolor="w" -) -ax[0].fill_between( - xi, region.sel(quantile=0.15), region.sel(quantile=0.85), alpha=0.2, color="k", edgecolor="w" -) -# conditional mean -ax[0].plot(xi, region.sel(quantile=0.5), "k", linewidth=2) -# formatting -ax[0].set(xlabel="x", ylabel="y", title="Conditional mean") - -# posterior prediction ---------------------------------------------- -# data -ax[1].scatter(data.x, data.y, color="k") -# posterior mean and HDI's - -ax[1].plot(xi, idata.posterior_predictive.y.mean(["chain", "draw"]), "k") - -az.plot_hdi( - xi, - idata.posterior_predictive.y, - hdi_prob=0.6, - color="k", - fill_kwargs={"alpha": 0.2, "linewidth": 0}, - ax=ax[1], -) -az.plot_hdi( - xi, - idata.posterior_predictive.y, - hdi_prob=0.95, - color="k", - fill_kwargs={"alpha": 0.2, "linewidth": 0}, - ax=ax[1], -) -# formatting -ax[1].set(xlabel="x", ylabel="y", title="Posterior predictive distribution") - -# parameter space --------------------------------------------------- -ax[2].scatter( - idata.posterior.β1.stack(sample=("chain", "draw")), - idata.posterior.β0.stack(sample=("chain", "draw")), - color="k", - alpha=0.01, - rasterized=True, -) - -# formatting -ax[2].set(xlabel="slope", ylabel="intercept", title="Parameter space") -ax[2].axhline(y=0, c="k") -ax[2].axvline(x=0, c="k"); -``` - -The plot on the left shows the data and the posterior of the **conditional mean**. 
For a given $x$, we get a posterior distribution of the model (i.e. of $\mu$). - -The plot in the middle shows the **posterior predictive distribution**, which gives a statement about the data we expect to see. Intuitively, this can be understood as not only incorporating what we know of the model (left plot) but also what we know about the distribution of error. - -The plot on the right shows out posterior beliefs in **parameter space**. - -+++ - -One of the clear things about this analysis is that we have credible evidence that $x$ and $y$ are _positively_ correlated. We can see this from the posterior over the slope (see right hand panel in the figure above). - -+++ - -## Model 2: Independent slopes and intercepts model - -We will use the same data in this analysis, but this time we will use our knowledge that data come from groups. More specifically we will essentially fit independent regressions to data within each group. - -```{code-cell} ipython3 -coords = {"group": group_list} - -with pm.Model(coords=coords) as ind_slope_intercept: - # Define priors - sigma = pm.HalfCauchy("sigma", beta=2, dims="group") - β0 = pm.Normal("β0", 0, sigma=5, dims="group") - β1 = pm.Normal("β1", 0, sigma=5, dims="group") - # Data - x = pm.MutableData("x", data.x, dims="obs_id") - g = pm.MutableData("g", data.group_idx, dims="obs_id") - # Linear model - μ = pm.Deterministic("μ", β0[g] + β1[g] * x, dims="obs_id") - # Define likelihood - pm.Normal("y", mu=μ, sigma=sigma[g], observed=data.y, dims="obs_id") -``` - -By plotting the DAG for this model it is clear to see that we now have individual intercept, slope, and variance parameters for each of the groups. - -```{code-cell} ipython3 -pm.model_to_graphviz(ind_slope_intercept) -``` - -```{code-cell} ipython3 -with ind_slope_intercept: - idata = pm.sample() - -az.plot_trace(idata, filter_vars="regex", var_names=["~μ"]); -``` - -### Visualisation - -```{code-cell} ipython3 -# Create values of x and g to use for posterior prediction -xi = [ - np.linspace(data.query(f"group_idx=={i}").x.min(), data.query(f"group_idx=={i}").x.max(), 10) - for i, _ in enumerate(group_list) -] -g = [np.ones(10) * i for i, _ in enumerate(group_list)] -xi, g = np.concatenate(xi), np.concatenate(g) - -# Do the posterior prediction -with ind_slope_intercept: - pm.set_data({"x": xi, "g": g.astype(int)}) - idata.extend(pm.sample_posterior_predictive(idata, var_names=["μ", "y"])) -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -def get_ppy_for_group(group_list, group): - """Get posterior predictive outcomes for observations from a given group""" - return idata.posterior_predictive.y.data[:, :, group_list == group] - - -fig, ax = plt.subplots(1, 3, figsize=(12, 4)) - -# conditional mean plot --------------------------------------------- -for i, groupname in enumerate(group_list): - # data - ax[0].scatter(data.x[data.group_idx == i], data.y[data.group_idx == i], color=f"C{i}") - # conditional mean credible intervals - post = idata.posterior.stack(sample=("chain", "draw")) - _xi = xr.DataArray( - np.linspace(np.min(data.x[data.group_idx == i]), np.max(data.x[data.group_idx == i]), 20), - dims=["x_plot"], - ) - y = post.β0.sel(group=groupname) + post.β1.sel(group=groupname) * _xi - region = y.quantile([0.025, 0.15, 0.5, 0.85, 0.975], dim="sample") - ax[0].fill_between( - _xi, - region.sel(quantile=0.025), - region.sel(quantile=0.975), - alpha=0.2, - color=f"C{i}", - edgecolor="w", - ) - ax[0].fill_between( - _xi, - region.sel(quantile=0.15), - region.sel(quantile=0.85), - alpha=0.2, 
- color=f"C{i}", - edgecolor="w", - ) - # conditional mean - ax[0].plot(_xi, region.sel(quantile=0.5), color=f"C{i}", linewidth=2) - # formatting - ax[0].set(xlabel="x", ylabel="y", title="Conditional mean") - -# posterior prediction ---------------------------------------------- -for i, groupname in enumerate(group_list): - # data - ax[1].scatter(data.x[data.group_idx == i], data.y[data.group_idx == i], color=f"C{i}") - # posterior mean and HDI's - ax[1].plot(xi[g == i], np.mean(get_ppy_for_group(g, i), axis=(0, 1)), label=groupname) - az.plot_hdi( - xi[g == i], - get_ppy_for_group(g, i), # pp_y[:, :, g == i], - hdi_prob=0.6, - color=f"C{i}", - fill_kwargs={"alpha": 0.4, "linewidth": 0}, - ax=ax[1], - ) - az.plot_hdi( - xi[g == i], - get_ppy_for_group(g, i), - hdi_prob=0.95, - color=f"C{i}", - fill_kwargs={"alpha": 0.2, "linewidth": 0}, - ax=ax[1], - ) - -ax[1].set(xlabel="x", ylabel="y", title="Posterior predictive distribution") - - -# parameter space --------------------------------------------------- -for i, _ in enumerate(group_list): - ax[2].scatter( - idata.posterior.β1.stack(sample=("chain", "draw"))[i, :], - idata.posterior.β0.stack(sample=("chain", "draw"))[i, :], - color=f"C{i}", - alpha=0.01, - rasterized=True, - ) - -ax[2].set(xlabel="slope", ylabel="intercept", title="Parameter space") -ax[2].axhline(y=0, c="k") -ax[2].axvline(x=0, c="k"); -``` - -In contrast to plain regression model (Model 1), when we model on the group level we can see that now the evidence points toward _negative_ relationships between $x$ and $y$. - -+++ - -## Model 3: Hierarchical regression -We can go beyond Model 2 and incorporate even more knowledge about the structure of our data. Rather than treating each group as entirely independent, we can use our knowledge that these groups are drawn from a population-level distribution. These are sometimes called hyper-parameters. - -In one sense this move from Model 2 to Model 3 can be seen as adding parameters, and therefore increasing model complexity. However, in another sense, adding this knowledge about the nested structure of the data actually provides a constraint over parameter space. - -Note: This model was producing divergent samples, so a reparameterisation trick is used. See the blog post [Why hierarchical models are awesome, tricky, and Bayesian](https://twiecki.io/blog/2017/02/08/bayesian-hierchical-non-centered/) by Thomas Wiecki for more information on this. 
- -```{code-cell} ipython3 -non_centered = True - -with pm.Model(coords=coords) as hierarchical: - # Hyperpriors - intercept_mu = pm.Normal("intercept_mu", 0, sigma=1) - intercept_sigma = pm.HalfNormal("intercept_sigma", sigma=2) - slope_mu = pm.Normal("slope_mu", 0, sigma=1) - slope_sigma = pm.HalfNormal("slope_sigma", sigma=2) - sigma_hyperprior = pm.HalfNormal("sigma_hyperprior", sigma=0.5) - - # Define priors - sigma = pm.HalfNormal("sigma", sigma=sigma_hyperprior, dims="group") - - if non_centered: - β0_offset = pm.Normal("β0_offset", 0, sigma=1, dims="group") - β0 = pm.Deterministic("β0", intercept_mu + β0_offset * intercept_sigma, dims="group") - β1_offset = pm.Normal("β1_offset", 0, sigma=1, dims="group") - β1 = pm.Deterministic("β1", slope_mu + β1_offset * slope_sigma, dims="group") - else: - β0 = pm.Normal("β0", intercept_mu, sigma=intercept_sigma, dims="group") - β1 = pm.Normal("β1", slope_mu, sigma=slope_sigma, dims="group") - - # Data - x = pm.MutableData("x", data.x, dims="obs_id") - g = pm.MutableData("g", data.group_idx, dims="obs_id") - # Linear model - μ = pm.Deterministic("μ", β0[g] + β1[g] * x, dims="obs_id") - # Define likelihood - pm.Normal("y", mu=μ, sigma=sigma[g], observed=data.y, dims="obs_id") -``` - -Plotting the DAG now makes it clear that the group-level intercept and slope parameters are drawn from a population level distributions. That is, we have hyper-priors for the slopes and intercept parameters. This particular model does not have a hyper-prior for the measurement error - this is just left as one parameter per group, as in the previous model. - -```{code-cell} ipython3 -pm.model_to_graphviz(hierarchical) -``` - -```{code-cell} ipython3 -with hierarchical: - idata = pm.sample(tune=2000, target_accept=0.99) - -az.plot_trace(idata, filter_vars="regex", var_names=["~μ"]); -``` - -### Visualise - -```{code-cell} ipython3 -# Create values of x and g to use for posterior prediction -xi = [ - np.linspace(data.query(f"group_idx=={i}").x.min(), data.query(f"group_idx=={i}").x.max(), 10) - for i, _ in enumerate(group_list) -] -g = [np.ones(10) * i for i, _ in enumerate(group_list)] -xi, g = np.concatenate(xi), np.concatenate(g) - -# Do the posterior prediction -with hierarchical: - pm.set_data({"x": xi, "g": g.astype(int)}) - idata.extend(pm.sample_posterior_predictive(idata, var_names=["μ", "y"])) -``` - -```{code-cell} ipython3 -:tags: [hide-input] - -fig, ax = plt.subplots(1, 3, figsize=(12, 4)) - -# conditional mean plot --------------------------------------------- -for i, groupname in enumerate(group_list): - # data - ax[0].scatter(data.x[data.group_idx == i], data.y[data.group_idx == i], color=f"C{i}") - # conditional mean credible intervals - post = idata.posterior.stack(sample=("chain", "draw")) - _xi = xr.DataArray( - np.linspace(np.min(data.x[data.group_idx == i]), np.max(data.x[data.group_idx == i]), 20), - dims=["x_plot"], - ) - y = post.β0.sel(group=groupname) + post.β1.sel(group=groupname) * _xi - region = y.quantile([0.025, 0.15, 0.5, 0.85, 0.975], dim="sample") - ax[0].fill_between( - _xi, - region.sel(quantile=0.025), - region.sel(quantile=0.975), - alpha=0.2, - color=f"C{i}", - edgecolor="w", - ) - ax[0].fill_between( - _xi, - region.sel(quantile=0.15), - region.sel(quantile=0.85), - alpha=0.2, - color=f"C{i}", - edgecolor="w", - ) - # conditional mean - ax[0].plot(_xi, region.sel(quantile=0.5), color=f"C{i}", linewidth=2) - # formatting - ax[0].set(xlabel="x", ylabel="y", title="Conditional mean") - -# posterior prediction 
---------------------------------------------- -for i, groupname in enumerate(group_list): - # data - ax[1].scatter(data.x[data.group_idx == i], data.y[data.group_idx == i], color=f"C{i}") - # posterior mean and HDI's - ax[1].plot(xi[g == i], np.mean(get_ppy_for_group(g, i), axis=(0, 1)), label=groupname) - az.plot_hdi( - xi[g == i], - get_ppy_for_group(g, i), - hdi_prob=0.6, - color=f"C{i}", - fill_kwargs={"alpha": 0.4, "linewidth": 0}, - ax=ax[1], - ) - az.plot_hdi( - xi[g == i], - get_ppy_for_group(g, i), - hdi_prob=0.95, - color=f"C{i}", - fill_kwargs={"alpha": 0.2, "linewidth": 0}, - ax=ax[1], - ) - -ax[1].set(xlabel="x", ylabel="y", title="Posterior Predictive") - -# parameter space --------------------------------------------------- -# plot posterior for population level slope and intercept -slope = rng.normal( - idata.posterior.slope_mu.stack(sample=("chain", "draw")).values, - idata.posterior.slope_sigma.stack(sample=("chain", "draw")).values, -) -intercept = rng.normal( - idata.posterior.intercept_mu.stack(sample=("chain", "draw")).values, - idata.posterior.intercept_sigma.stack(sample=("chain", "draw")).values, -) -ax[2].scatter(slope, intercept, color="k", alpha=0.05) -# plot posterior for group level slope and intercept -for i, _ in enumerate(group_list): - ax[2].scatter( - idata.posterior.β1.stack(sample=("chain", "draw"))[i, :], - idata.posterior.β0.stack(sample=("chain", "draw"))[i, :], - color=f"C{i}", - alpha=0.01, - ) - -ax[2].set(xlabel="slope", ylabel="intercept", title="Parameter space", xlim=[-2, 1], ylim=[-5, 5]) -ax[2].axhline(y=0, c="k") -ax[2].axvline(x=0, c="k"); -``` - -The panel on the right shows the posterior group level posterior of the slope and intercept parameters in black. This particular visualisation is a little unclear however, so we can just plot the marginal distribution below to see how much belief we have in the slope being less than zero. - -```{code-cell} ipython3 -:tags: [hide-input] - -az.plot_posterior(slope, ref_val=0) -plt.title("Population level slope parameter"); -``` - -## Summary -Using Simpson's paradox, we've walked through 3 different models. The first is a simple linear regression which treats all the data as coming from one group. We saw that this lead us to believe the regression slope was positive. - -While that is not necessarily wrong, it is paradoxical when we see that the regression slopes for the data _within_ a group is negative. We saw how to apply separate regressions for data in each group in the second model. - -The third and final model added a layer to the hierarchy, which captures our knowledge that each of these groups are sampled from an overall population. This added the ability to make inferences not only about the regression parameters at the group level, but also at the population level. The final plot shows our posterior over this population level slope parameter from which we believe the groups are sampled from. - -If you are interested in learning more, there are a number of other [PyMC examples](http://docs.pymc.io/nb_examples/index.html) covering hierarchical modelling and regression topics. - -+++ - -## Authors -* Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in July 2021 -* Updated by [Benjamin T. 
Vincent](https://github.com/drbenvincent) in April 2022 - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/generalized_linear_models/GLM-truncated-censored-regression.myst.md b/myst_nbs/generalized_linear_models/GLM-truncated-censored-regression.myst.md deleted file mode 100644 index 24c04838b..000000000 --- a/myst_nbs/generalized_linear_models/GLM-truncated-censored-regression.myst.md +++ /dev/null @@ -1,384 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.10.6 ('pymc_env') - language: python - name: python3 ---- - -(GLM-truncated-censored-regression)= -# Bayesian regression with truncated or censored data - -:::{post} September, 2022 -:tags: censored, generalized linear model, regression, truncated -:category: beginner -:author: Benjamin T. Vincent -::: - -The notebook provides an example of how to conduct linear regression when your outcome variable is either censored or truncated. - -```{code-cell} ipython3 -:tags: [] - -from copy import copy - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import xarray as xr - -from numpy.random import default_rng -from scipy.stats import norm, truncnorm -``` - -```{code-cell} ipython3 -:tags: [] - -%config InlineBackend.figure_format = 'retina' -rng = default_rng(12345) -az.style.use("arviz-darkgrid") -``` - -## Truncation and censoring - -Truncation and censoring are examples of missing data problems. It can sometimes be easy to muddle up truncation and censoring, so let's look at some definitions. - -- **Truncation** is a type of missing data problem where you are simply unaware of any data where the outcome variable falls outside of a certain set of bounds. -- **Censoring** occurs when a measurement has a sensitivity with a certain set of bounds. But rather than discard data outside these bounds, you would record a measurement at the bound which it exceeded. - -Let's further explore this with some code and plots. First we will generate some true `(x, y)` scatter data, where `y` is our outcome measure and `x` is some predictor variable. - -```{code-cell} ipython3 -:tags: [] - -slope, intercept, σ, N = 1, 0, 2, 200 -x = rng.uniform(-10, 10, N) -y = rng.normal(loc=slope * x + intercept, scale=σ) -``` - -For this example of `(x, y)` scatter data, we can describe the truncation process as simply filtering out any data for which our outcome variable `y` falls outside of a set of bounds. - -```{code-cell} ipython3 -:tags: [] - -def truncate_y(x, y, bounds): - keep = (y >= bounds[0]) & (y <= bounds[1]) - return (x[keep], y[keep]) -``` - -With censoring however, we are setting the `y` value equal to the bounds that they exceed. - -```{code-cell} ipython3 -:tags: [] - -def censor_y(x, y, bounds): - cy = copy(y) - cy[y <= bounds[0]] = bounds[0] - cy[y >= bounds[1]] = bounds[1] - return (x, cy) -``` - -Based on our generated `(x, y)` data (which an experimenter would never see in real life), we can generate our actual observed datasets for truncated data `(xt, yt)` and censored data `(xc, yc)`. - -```{code-cell} ipython3 -:tags: [] - -bounds = [-5, 5] -xt, yt = truncate_y(x, y, bounds) -xc, yc = censor_y(x, y, bounds) -``` - -We can visualise this latent data (in grey) and the remaining truncated or censored data (black) as below. 
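Before plotting, a quick count makes the difference concrete (this is an illustrative check using the arrays defined above, not part of the original analysis): truncation discards the out-of-bounds points entirely, while censoring keeps them but pins them to a bound.

```{code-cell} ipython3
# Illustrative check: both processes act on the same out-of-bounds points,
# but truncation drops them while censoring clips them to the nearest bound.
n_outside = int(np.sum((y < bounds[0]) | (y > bounds[1])))
print(f"Truncation discards {N - len(yt)} of {N} points")
print(f"Censoring clips {n_outside} of {N} points to a bound")
```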
- -```{code-cell} ipython3 -:tags: [] - -fig, axes = plt.subplots(1, 2, figsize=(10, 5)) - -for ax in axes: - ax.plot(x, y, ".", c=[0.7, 0.7, 0.7]) - ax.axhline(bounds[0], c="r", ls="--") - ax.axhline(bounds[1], c="r", ls="--") - ax.set(xlabel="x", ylabel="y") - -axes[0].plot(xt, yt, ".", c=[0, 0, 0]) -axes[0].set(title="Truncated data") - -axes[1].plot(xc, yc, ".", c=[0, 0, 0]) -axes[1].set(title="Censored data"); -``` - -## The problem that truncated or censored regression solves -If we were to run regular linear regression on either the truncated or censored data, it should be fairly intuitive to see that we will likely underestimate the slope. Truncated regression and censored regress (aka Tobit regression) were designed to address these missing data problems and hopefully result in regression slopes which are free from the bias introduced by truncation or censoring. - -In this section we will run Bayesian linear regression on these datasets to see the extent of the problem. We start by defining a function which defines a PyMC model, conducts MCMC sampling, and returns the model and the MCMC sampling data. - -```{code-cell} ipython3 -:tags: [] - -def linear_regression(x, y): - with pm.Model() as model: - slope = pm.Normal("slope", mu=0, sigma=1) - intercept = pm.Normal("intercept", mu=0, sigma=1) - σ = pm.HalfNormal("σ", sigma=1) - pm.Normal("obs", mu=slope * x + intercept, sigma=σ, observed=y) - - return model -``` - -So we can run this on our truncated and our censored data, separately. - -```{code-cell} ipython3 -:tags: [] - -trunc_linear_model = linear_regression(xt, yt) - -with trunc_linear_model: - trunc_linear_fit = pm.sample() -``` - -```{code-cell} ipython3 -:tags: [] - -cens_linear_model = linear_regression(xc, yc) - -with cens_linear_model: - cens_linear_fit = pm.sample() -``` - -By plotting the posterior distribution over the slope parameters we can see that the estimates for the slope are pretty far off, so we are indeed underestimating the regression slope. - -```{code-cell} ipython3 -:tags: [] - -fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True) - -az.plot_posterior(trunc_linear_fit, var_names=["slope"], ref_val=slope, ax=ax[0]) -ax[0].set(title="Linear regression\n(truncated data)", xlabel="slope") - -az.plot_posterior(cens_linear_fit, var_names=["slope"], ref_val=slope, ax=ax[1]) -ax[1].set(title="Linear regression\n(censored data)", xlabel="slope"); -``` - -To appreciate the extent of the problem (for this dataset) we can visualise the posterior predictive fits alongside the data. - -```{code-cell} ipython3 -:tags: [] - -def pp_plot(x, y, fit, ax): - # plot data - ax.plot(x, y, "k.") - # plot posterior predicted... samples from posterior - xi = xr.DataArray(np.array([np.min(x), np.max(x)]), dims=["obs_id"]) - post = fit.posterior - y_ppc = xi * post["slope"] + post["intercept"] - ax.plot(xi, y_ppc.stack(sample=("chain", "draw")), c="steelblue", alpha=0.01, rasterized=True) - # plot true - ax.plot(xi, slope * xi + intercept, "k", lw=3, label="True") - # plot bounds - ax.axhline(bounds[0], c="r", ls="--") - ax.axhline(bounds[1], c="r", ls="--") - ax.legend() - ax.set(xlabel="x", ylabel="y") - - -fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True) - -pp_plot(xt, yt, trunc_linear_fit, ax[0]) -ax[0].set(title="Truncated data") - -pp_plot(xc, yc, cens_linear_fit, ax[1]) -ax[1].set(title="Censored data"); -``` - -By looking at these plots we can intuitively predict what factors will influence the extent of the underestimation bias. 
Firstly, if the truncation or censoring bounds are very broad such that they only affect a few data points, then the underestimation bias would be smaller. Secondly, if the measurement error `σ` is low, we might expect the underestimation bias to decrease. In the limit of zero measurement noise then it should be possible to fully recover the true slope for truncated data but there will always be some bias in the censored case. Regardless, it would be prudent to use truncated or censored regression models unless the measurement error is near zero, or the bounds are so broad as to be practically irrelevant. - -+++ - -## Implementing truncated and censored regression models -Now we have seen the problem of conducting regression on truncated or censored data, in terms of underestimating the regression slopes. This is what truncated or censored regression models were designed to solve. The general approach taken by both truncated and censored regression is to encode our prior knowledge of the truncation or censoring steps in the data generating process. This is done by modifying the likelihood function in various ways. - -+++ - -### Truncated regression model -Truncated regression models are quite simple to implement. The normal likelihood is centered on the regression slope as normal, but now we just specify a normal distribution which is truncated at the bounds. - -```{code-cell} ipython3 -:tags: [] - -def truncated_regression(x, y, bounds): - with pm.Model() as model: - slope = pm.Normal("slope", mu=0, sigma=1) - intercept = pm.Normal("intercept", mu=0, sigma=1) - σ = pm.HalfNormal("σ", sigma=1) - normal_dist = pm.Normal.dist(mu=slope * x + intercept, sigma=σ) - pm.Truncated("obs", normal_dist, lower=bounds[0], upper=bounds[1], observed=y) - return model -``` - -Truncated regression solves the bias problem by updating the likelihood to reflect our knowledge about the process generating the observations. Namely, we have zero chance of observing any data outside of the truncation bounds, and so the likelihood should reflect this. We can visualise this in the plot below, where compared to a normal distribution, the probability density of a truncated normal is zero outside of the truncation bounds $(y<-1)$ in this case. - -```{code-cell} ipython3 -:tags: [] - -fig, ax = plt.subplots(figsize=(10, 3)) -y = np.linspace(-4, 4, 1000) -ax.fill_between(y, norm.pdf(y, loc=0, scale=1), 0, alpha=0.2, ec="b", fc="b", label="Normal") -ax.fill_between( - y, - truncnorm.pdf(y, -1, float("inf"), loc=0, scale=1), - 0, - alpha=0.2, - ec="r", - fc="r", - label="Truncated Normal", -) -ax.set(xlabel="$y$", ylabel="probability") -ax.axvline(-1, c="k", ls="--") -ax.legend(); -``` - -### Censored regression model - -```{code-cell} ipython3 -:tags: [] - -def censored_regression(x, y, bounds): - with pm.Model() as model: - slope = pm.Normal("slope", mu=0, sigma=1) - intercept = pm.Normal("intercept", mu=0, sigma=1) - σ = pm.HalfNormal("σ", sigma=1) - y_latent = pm.Normal.dist(mu=slope * x + intercept, sigma=σ) - obs = pm.Censored("obs", y_latent, lower=bounds[0], upper=bounds[1], observed=y) - - return model -``` - -Thanks to the new {class}`pm.Censored` distribution it is really straightforward to write models with censored data. The only thing to remember is that the latent variable being censored must be called with the `.dist` method, as in `pm.Normal.dist` in the model above. 
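As a quick sanity check (a minimal sketch, not part of the original workflow, assuming `pm.draw` and `pm.Censored.dist` behave as in recent PyMC releases), we can draw from a censored distribution directly and confirm that samples are clipped to the bounds rather than discarded:

```{code-cell} ipython3
# Minimal sketch: draws from a censored normal pile up exactly at the bounds.
censored_normal = pm.Censored.dist(pm.Normal.dist(mu=0, sigma=2), lower=-1, upper=1)
draws = pm.draw(censored_normal, draws=1000, random_seed=12345)
print(draws.min(), draws.max())  # expected to be exactly -1.0 and 1.0
print((draws == -1).mean(), (draws == 1).mean())  # mass accumulated at each bound
```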
- -Behind the scenes, `pm.Censored` adjusts the likelihood function to take into account that: -- the probability at the lower bound is equal to the cumulative distribution function from $-\infty$ to the lower bound, -- the probability at the upper bound is equal to the the cumulative distribution function from the upper bound to $\infty$. - -This is demonstrated visually in the plot below. Technically the _probability density_ at the bound is infinite because the bin width exactly at the bound is zero. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 3)) - -with pm.Model() as m: - pm.Normal("y", 0, 2) - -with pm.Model() as m_censored: - pm.Censored("y", pm.Normal.dist(0, 2), lower=-1.0, upper=None) - -logp_fn = m.compile_logp() -logp_censored_fn = m_censored.compile_logp() - -xi = np.hstack((np.linspace(-6, -1.01), [-1.0], np.linspace(-0.99, 6))) - -ax.plot(xi, [np.exp(logp_fn({"y": x})) for x in xi], label="uncensored") -ax.plot(xi, [np.exp(logp_censored_fn({"y": x})) for x in xi], label="censored", lw=8, alpha=0.6) -ax.axvline(-1, c="k", ls="--") -ax.legend() -ax.set(xlabel="$y$", ylabel="probability density", ylim=(-0.02, 0.4)); -``` - -## Run the truncated and censored regressions -Now we can conduct our parameter estimation with the truncated regression model on the truncated data... - -```{code-cell} ipython3 -:tags: [] - -truncated_model = truncated_regression(xt, yt, bounds) - -with truncated_model: - truncated_fit = pm.sample() -``` - -and with the censored regression model on the censored data. - -```{code-cell} ipython3 -:tags: [] - -censored_model = censored_regression(xc, yc, bounds) - -with censored_model: - censored_fit = pm.sample(init="adapt_diag") -``` - -We can do the same as before and visualise our posterior estimates on the slope. - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True) - -az.plot_posterior(truncated_fit, var_names=["slope"], ref_val=slope, ax=ax[0]) -ax[0].set(title="Truncated regression\n(truncated data)", xlabel="slope") - -az.plot_posterior(censored_fit, var_names=["slope"], ref_val=slope, ax=ax[1]) -ax[1].set(title="Censored regression\n(censored data)", xlabel="slope"); -``` - -These are _much_ better estimates. Interestingly, we can see that the estimate for censored regression is more precise than for truncated data. This will not necessarily always be the case, but the intuition here is that the `x` and `y` data is entirely discarded with truncation, but only the `y` data becomes partially unknown in censoring. - -We could speculate then, that if an experimenter had the choice of truncating or censoring data, it might be better to opt for censoring over truncation. - -Correspondingly, we can confirm the models are good through visual inspection of the posterior predictive plots. - -```{code-cell} ipython3 -:tags: [] - -fig, ax = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True) - -pp_plot(xt, yt, truncated_fit, ax[0]) -ax[0].set(title="Truncated data") - -pp_plot(xc, yc, censored_fit, ax[1]) -ax[1].set(title="Censored data"); -``` - -This brings an end to our guide on truncated and censored data and truncated and censored regression models in PyMC. While the extent of the regression slope estimation bias will vary with a number of factors discussed above, hopefully these examples have convinced you of the importance of encoding your knowledge of the data generating process into regression analyses. 
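As one last numeric footnote before moving on (an illustrative aside using the fits above), the claim that the censored-regression estimate is more precise can be quantified by comparing the posterior standard deviation of the slope under the two models:

```{code-cell} ipython3
# Illustrative comparison of posterior spread for the slope under both models.
for name, fit in [("truncated", truncated_fit), ("censored", censored_fit)]:
    sd = az.summary(fit, var_names=["slope"], kind="stats")["sd"].item()
    print(f"{name} regression: posterior sd of slope = {sd:.3f}")
```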
- -+++ - -## Further topics -It is also possible to treat the bounds as unknown latent parameters. If these are not known exactly and it is possible to fomulate a prior over these bounds, then it would be possible to infer what the bounds are. This could be argued as overkill however - depending on your data analysis context it may be entirely sufficient to extract 'good enough' point estimates of the bounds in order to get reasonable regression estimates. - -The censored regression model presented above takes one particular approach, and there are others. For example, it did not attempt to infer posterior beliefs over the true latent `y` values of the censored data. It is possible to build censored regression models which do impute these censored `y` values, but we did not address that here as the topic of [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)) deserves its own focused treatment. The PyMC {ref}`censored_data` example also covers this topic, with a particular {ref}`example model to impute censored data `. - -+++ - -## Further reading -When looking into this topic, I found that most of the material out there focuses on maximum likelihood estimation approaches, with focus on mathematical derivation rather than practical implementation. One good concise mathematical 80 page booklet by {cite:t}`breen1996regression` covers truncated and censored as well as other missing data scenarios. That said, a few pages are given over to this topic in Bayesian Data Analysis by {cite:t}`gelman2013bayesian`, and {cite:t}`gelman2020regression`. - -+++ - -## Authors -* Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in May 2021 -* Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) in January 2022 -* Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) in September 2022 - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/howto/api_quickstart.myst.md b/myst_nbs/howto/api_quickstart.myst.md deleted file mode 100644 index fc616694e..000000000 --- a/myst_nbs/howto/api_quickstart.myst.md +++ /dev/null @@ -1,457 +0,0 @@ ---- -jupytext: - notebook_metadata_filter: substitutions - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(api_quickstart)= -# General API quickstart - -:::{post} May 31, 2022 -:tags: -:category: beginner -:author: Christian Luhmann -::: - -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## 1. Model creation - -Models in PyMC are centered around the `Model` class. It has references to all random variables (RVs) and computes the model logp and its gradients. Usually, you would instantiate it as part of a `with` context: - -```{code-cell} ipython3 -with pm.Model() as model: - # Model definition - pass -``` - -We discuss RVs further below but let's create a simple model to explore the `Model` class. 
- -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=1, observed=rng.standard_normal(100)) -``` - -```{code-cell} ipython3 -model.basic_RVs -``` - -```{code-cell} ipython3 -model.free_RVs -``` - -```{code-cell} ipython3 -model.observed_RVs -``` - -```{code-cell} ipython3 -model.compile_logp()({"mu": 0}) -``` - -It's worth highlighting the design choice we made with `logp`. As you can see above, `logp` is being called with arguments, so it's a method of the model instance. More precisely, it puts together a function based on the current state of the model -- or on the state given as argument to `logp` (see example below). - -For diverse reasons, we assume that a `Model` instance isn't static. If you need to use `logp` in an inner loop and it needs to be static, simply use something like `logp = model.logp`. Here is an example below -- note the caching effect and the speed up: - -```{code-cell} ipython3 -%timeit model.compile_logp()({"mu": 0.1}) -logp = model.compile_logp() -%timeit logp({"mu": 0.1}) -``` - -## 2. Probability Distributions - -Every probabilistic program consists of observed and unobserved Random Variables (RVs). Observed RVs are defined via likelihood distributions, while unobserved RVs are defined via prior distributions. In the PyMC module, the structure for probability distributions looks like this: - -{ref}`pymc:api_distributions` -- {ref}`pymc:api_distributions_continuous` -- {ref}`pymc:api_distributions_discrete` -- {ref}`pymc:api_distributions_multivariate` -- {ref}`pymc:api_distributions_mixture` -- {ref}`pymc:api_distributions_rimeseries` -- {ref}`pymc:api_distributions_censored` -- {ref}`pymc:api_distributions_simulator` - -+++ - -### Unobserved Random Variables - -+++ - -Every unobserved RV has the following calling signature: name (str), parameter keyword arguments. Thus, a normal prior can be defined in a model context like this: - -```{code-cell} ipython3 -with pm.Model(): - x = pm.Normal("x", mu=0, sigma=1) -``` - -As with the model, we can evaluate its logp: - -```{code-cell} ipython3 -pm.logp(x, 0).eval() -``` - -### Observed Random Variables - -+++ - -Observed RVs are defined just like unobserved RVs but require data to be passed into the `observed` keyword argument: - -```{code-cell} ipython3 -with pm.Model(): - obs = pm.Normal("x", mu=0, sigma=1, observed=rng.standard_normal(100)) -``` - -`observed` supports lists, `numpy.ndarray` and `aesara` data structures. - -+++ - -### Deterministic transforms - -+++ - -PyMC allows you to freely do algebra with RVs in all kinds of ways: - -```{code-cell} ipython3 -with pm.Model(): - x = pm.Normal("x", mu=0, sigma=1) - y = pm.Gamma("y", alpha=1, beta=1) - plus_2 = x + 2 - summed = x + y - squared = x**2 - sined = pm.math.sin(x) -``` - -Though these transformations work seamlessly, their results are not stored automatically. Thus, if you want to keep track of a transformed variable, you have to use `pm.Deterministic`: - -```{code-cell} ipython3 -with pm.Model(): - x = pm.Normal("x", mu=0, sigma=1) - plus_2 = pm.Deterministic("x plus 2", x + 2) -``` - -Note that `plus_2` can be used in the identical way to above, we only tell PyMC to keep track of this RV for us. - -+++ - -### Lists of RVs / higher-dimensional RVs - -Above we have seen how to create scalar RVs. In many models, we want multiple RVs. 
Users will sometimes try to create lists of RVs, like this: - -```{code-cell} ipython3 -with pm.Model(): - # bad: - x = [pm.Normal(f"x_{i}", mu=0, sigma=1) for i in range(10)] -``` - -This works, but it is slow and not recommended. Instead, we can use {ref}`coordinates `: - -```{code-cell} ipython3 -coords = {"cities": ["Santiago", "Mumbai", "Tokyo"]} -with pm.Model(coords=coords) as model: - # good: - x = pm.Normal("x", mu=0, sigma=1, dims="cities") -``` - -`x` is now a array of length 3 and each of the 3 variables within this array is associated with a label. This will make it very easy to distinguish the 3 different variables when we go to look at results. We can index into this array or do linear algebra operations on it: - -```{code-cell} ipython3 -with model: - y = x[0] * x[1] # indexing is supported - x.dot(x.T) # linear algebra is supported -``` - -### Initialize Random Variables - -Though PyMC automatically initializes models, it is sometimes helpful to define initial values for RVs. This can be done via the `initval` kwarg: - -```{code-cell} ipython3 -with pm.Model(coords={"idx": np.arange(5)}) as model: - x = pm.Normal("x", mu=0, sigma=1, dims="idx") - -model.initial_point() -``` - -```{code-cell} ipython3 -with pm.Model(coords={"idx": np.arange(5)}) as model: - x = pm.Normal("x", mu=0, sigma=1, dims="idx", initval=rng.standard_normal(5)) - -model.initial_point() -``` - -This technique is sometimes useful when trying to identify problems with model specification or initialization. - -+++ - -## 3. Inference - -Once we have defined our model, we have to perform inference to approximate the posterior distribution. PyMC supports two broad classes of inference: sampling and variational inference. - -### 3.1 Sampling - -The main entry point to MCMC sampling algorithms is via the `pm.sample()` function. By default, this function tries to auto-assign the right sampler(s). `pm.sample()` returns an `arviz.InferenceData` object. `InferenceData` objects can easily be saved/loaded from a file and can carry additional (meta)data such as date/version and posterior predictive samples. Take a look at the {ref}`ArviZ Quickstart ` to learn more. - -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=1, observed=rng.standard_normal(100)) - - idata = pm.sample(2000) -``` - -As you can see, with model that exclusively contains continuous variables, PyMC assigns the NUTS sampler, which is very efficient even for complex models. PyMC also runs initial tuning to find good starting parameters for the sampler. Here we draw 2000 samples from the posterior in each chain and allow the sampler to adjust its parameters in an additional 1500 iterations. - -If not set via the `chains` kwarg, the number of chains is determined from the number of available CPU cores. - -```{code-cell} ipython3 -idata.posterior.dims -``` - -The tuning samples are discarded by default. With `discard_tuned_samples=False` they can be kept and end up in a separate group within the `InferenceData` object (i.e., `idata.warmup_posterior`). 
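For example, here is a minimal sketch (reusing the toy model pattern from above) of keeping the warmup draws around:

```{code-cell} ipython3
with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=1)
    obs = pm.Normal("obs", mu=mu, sigma=1, observed=rng.standard_normal(100))

    idata = pm.sample(tune=1500, draws=1000, discard_tuned_samples=False)

# The tuning draws now live in their own group, next to the posterior group.
idata.warmup_posterior["mu"].shape
```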
- -You can control how the chains are run in parallel using the `chains` and `cores` kwargs: - -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=1, observed=rng.standard_normal(100)) - - idata = pm.sample(cores=4, chains=6) -``` - -```{code-cell} ipython3 -idata.posterior["mu"].shape -``` - -```{code-cell} ipython3 -# get values of a single chain -idata.posterior["mu"].sel(chain=2).shape -``` - -### 3.2 Analyze sampling results - -The most common used plot to analyze sampling results is the so-called trace-plot: - -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - sd = pm.HalfNormal("sd", sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=sd, observed=rng.standard_normal(100)) - - idata = pm.sample() -``` - -```{code-cell} ipython3 -az.plot_trace(idata); -``` - -Another common metric to look at is the Gelman-Rubin statistic, or R-hat: - -```{code-cell} ipython3 -az.summary(idata) -``` - -R-hat is also presented as part of the `az.plot_forest`: - -```{code-cell} ipython3 -az.plot_forest(idata, r_hat=True); -``` - -Finally, for a plot of the posterior that is inspired by {cite:p}`kruschke2014doing`, you can use the: - -```{code-cell} ipython3 -az.plot_posterior(idata); -``` - -For high-dimensional models it becomes cumbersome to look at the traces for all parameters. When using `NUTS` we can look at the energy plot to assess problems of convergence: - -```{code-cell} ipython3 -with pm.Model(coords={"idx": np.arange(100)}) as model: - x = pm.Normal("x", mu=0, sigma=1, dims="idx") - idata = pm.sample() - -az.plot_energy(idata); -``` - -For more information on sampler stats and the energy plot, see {ref}`sampler_stats`. For more information on identifying sampling problems and what to do about them, see {ref}`diagnosing_with_divergences`. - -+++ - -### 3.3 Variational inference - -PyMC supports various Variational Inference techniques. While these methods are much faster, they are often also less accurate and can lead to biased inference. The main entry point is `pymc.fit()`. - -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - sd = pm.HalfNormal("sd", sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=sd, observed=rng.standard_normal(100)) - - approx = pm.fit() -``` - -The returned `Approximation` object has various capabilities, like drawing samples from the approximated posterior, which we can analyse like a regular sampling run: - -```{code-cell} ipython3 -idata = approx.sample(1000) -az.summary(idata) -``` - -The `variational` submodule offers a lot of flexibility in which VI to use and follows an object oriented design. 
For example, full-rank ADVI estimates a full covariance matrix: - -```{code-cell} ipython3 -mu = pm.floatX([0.0, 0.0]) -cov = pm.floatX([[1, 0.5], [0.5, 1.0]]) -with pm.Model(coords={"idx": np.arange(2)}) as model: - pm.MvNormal("x", mu=mu, cov=cov, dims="idx") - approx = pm.fit(method="fullrank_advi") -``` - -An equivalent expression using the object-oriented interface is: - -```{code-cell} ipython3 -with pm.Model(coords={"idx": np.arange(2)}) as model: - pm.MvNormal("x", mu=mu, cov=cov, dims="idx") - approx = pm.FullRankADVI().fit() -``` - -```{code-cell} ipython3 -with pm.Model(coords={"idx": np.arange(2)}) as model: - pm.MvNormal("x", mu=mu, cov=cov, dims="idx") - approx = pm.FullRankADVI().fit() -``` - -```{code-cell} ipython3 -plt.figure() -idata = approx.sample(10000) -az.plot_pair(idata, var_names="x", coords={"idx": [0, 1]}); -``` - -Stein Variational Gradient Descent (SVGD) uses particles to estimate the posterior: - -```{code-cell} ipython3 -w = pm.floatX([0.2, 0.8]) -mu = pm.floatX([-0.3, 0.5]) -sd = pm.floatX([0.1, 0.1]) -with pm.Model() as model: - pm.NormalMixture("x", w=w, mu=mu, sigma=sd) - approx = pm.fit(method=pm.SVGD(n_particles=200, jitter=1.0)) -``` - -```{code-cell} ipython3 -with pm.Model() as model: - pm.NormalMixture("x", w=[0.2, 0.8], mu=[-0.3, 0.5], sigma=[0.1, 0.1]) -``` - -```{code-cell} ipython3 -plt.figure() -idata = approx.sample(10000) -az.plot_dist(idata.posterior["x"]); -``` - -For more information on variational inference, see {ref}`variational_inference`. - -+++ - -## 4. Posterior Predictive Sampling - -The `sample_posterior_predictive()` function performs prediction on hold-out data and posterior predictive checks. - -```{code-cell} ipython3 -data = rng.standard_normal(100) -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=1) - sd = pm.HalfNormal("sd", sigma=1) - obs = pm.Normal("obs", mu=mu, sigma=sd, observed=data) - - idata = pm.sample() -``` - -```{code-cell} ipython3 -with model: - idata.extend(pm.sample_posterior_predictive(idata)) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -az.plot_ppc(idata, ax=ax) -ax.axvline(data.mean(), ls="--", color="r", label="True mean") -ax.legend(fontsize=10); -``` - -## 4.1 Predicting on hold-out data - -In many cases you want to predict on unseen / hold-out data. This is especially relevant in Probabilistic Machine Learning and Bayesian Deep Learning. PyMC includes a `pm.MutableData` container to help with such uses. It is a wrapper around a `aesara.shared` variable and allows the values of the data to be changed later. Otherwise, `pm.MutableData` objects can be used just like any other numpy array or tensor. - -This distinction is significant since internally all models in PyMC are giant symbolic expressions. When you pass raw data directly into a model, you are giving Aesara permission to treat this data as a constant and optimize it away if doing so makes sense. If you need to change this data later you may not have any way to point at it within the larger symbolic expression. Using `pm.MutableData` offers a way to point to a specific place in the symbolic expression and change what is there. 
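As a small aside before the worked example below (a sketch, assuming the companion `pm.ConstantData` container available in current PyMC versions): if a value will never change, it can be registered as constant data instead, and only `pm.MutableData` values can later be swapped out with `pm.set_data`. The model and variable names here are purely illustrative.

```{code-cell} ipython3
with pm.Model() as data_containers:
    fixed = pm.ConstantData("fixed", np.arange(3))  # treated as a fixed constant
    flexible = pm.MutableData("flexible", np.arange(3))  # can be replaced via pm.set_data
```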
- -```{code-cell} ipython3 -x = rng.standard_normal(100) -y = x > 0 - -coords = {"idx": np.arange(100)} -with pm.Model() as model: - # create shared variables that can be changed later on - x_obs = pm.MutableData("x_obs", x, dims="idx") - y_obs = pm.MutableData("y_obs", y, dims="idx") - - coeff = pm.Normal("x", mu=0, sigma=1) - logistic = pm.math.sigmoid(coeff * x_obs) - pm.Bernoulli("obs", p=logistic, observed=y_obs, dims="idx") - idata = pm.sample() -``` - -Now assume we want to predict on unseen data. For this we have to change the values of `x_obs` and `y_obs`. Theoretically we don't need to set `y_obs` as we want to predict it but it has to match the shape of `x_obs`. - -```{code-cell} ipython3 -with model: - # change the value and shape of the data - pm.set_data( - { - "x_obs": [-1, 0, 1.0], - # use dummy values with the same shape: - "y_obs": [0, 0, 0], - }, - coords={"idx": [1001, 1002, 1003]}, - ) - - idata.extend(pm.sample_posterior_predictive(idata)) -``` - -```{code-cell} ipython3 -idata.posterior_predictive["obs"].mean(dim=["draw", "chain"]) -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/howto/custom_distribution.myst.md b/myst_nbs/howto/custom_distribution.myst.md deleted file mode 100644 index 9aec8e1b1..000000000 --- a/myst_nbs/howto/custom_distribution.myst.md +++ /dev/null @@ -1,335 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(custom_distribution)= -# Defining a Custom Distribution in PyMC3 - -+++ - -In this notebook, we are going to walk through how to create a custom distribution for the Generalized Poisson distribution. - -+++ - -There are 3 main steps required to define a custom distribution in PyMC3: -1. Define the log probability function -2. Define the random generator function -3. Define the class for the distribution - -+++ - -## Background on the Generalized Poisson - -The **Poisson** distribution models equidispersed count data where the mean $\mu$ is equal to the variance. -$$p(Y = y | \mu ) = \frac{e^{-\mu} \mu^y}{y!}$$ - -The **Negative Binomial** distribution allows us to model overdispersed count data. It has 2 parameters: -- Mean $\mu > 0$ -- Overdisperson parameter $\alpha > 0$ - - As $\alpha \rightarrow \infty$, the Negative Binonimal converges to the Poisson. - -The **Generalized Poisson** distribution is flexible enough to handle both overdispersion and underdispersion. It has the following PMF: - -$$p(Y = y | \theta, \lambda) = \frac{\theta (\theta + \lambda y)^{y-1} e^{-\theta - \lambda y}}{y!}, y = 0,1,2,...$$ - -where $\theta > 0$ and $\max(-1, -\frac{\theta}{4}) \leq \lambda \leq 1$ - -The mean and variance are given by -$$\mathbb{E}[Y] = \frac{\theta}{1 - \lambda}, \quad -\text{Var}[Y] = \frac{\theta}{(1 - \lambda)^3}$$ - -- When $\lambda = 0$, the Generalized Poisson reduces to the standard Poisson with $\mu = \theta$. -- When $\lambda < 0$, the model has underdispersion (mean $>$ variance). -- When $\lambda > 0$, the model has overdispersion (mean $<$ variance). 
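A quick numeric illustration of those formulas (plain arithmetic, not part of the original derivation) shows how $\lambda$ moves the variance relative to the mean for a fixed $\theta$:

```{code-cell} ipython3
# E[Y] = theta / (1 - lam), Var[Y] = theta / (1 - lam)**3
theta = 5.0
for lam in (-0.5, 0.0, 0.3):
    mean = theta / (1 - lam)
    var = theta / (1 - lam) ** 3
    print(f"lam = {lam:+.1f}: mean = {mean:.2f}, variance = {var:.2f}")
```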
- -```{code-cell} ipython3 -import os - -from datetime import date, timedelta - -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import theano.tensor as tt - -from pymc3.distributions.dist_math import bound, factln, logpow -from pymc3.distributions.distribution import draw_values, generate_samples -from pymc3.theanof import intX -``` - -## 1. Log Probability Function - -The $\log$ of the PMF above is as follows: - -$$\log f(y | \theta, \lambda) = \log\theta + \log\left((\theta + \lambda y)^{y-1}\right) - (\theta + \lambda y) - \log(y!),\,y = 0,1,2,...$$ - -where $\theta > 0$ and $\max(-1, -\frac{\theta}{4}) \leq \lambda \leq 1$ - -We now define the log probability function, which is an implementation of the above formula using just Aesara operations. - -Parameters: -- `theta`: $\theta$ -- `lam`: $\lambda$ -- `value`: $y$ - -Returns: -- The log probability of the Generalized Poisson with the given parameters, evaluated at the specified value. - -```{code-cell} ipython3 -def genpoisson_logp(theta, lam, value): - theta_lam_value = theta + lam * value - log_prob = np.log(theta) + logpow(theta_lam_value, value - 1) - theta_lam_value - factln(value) - - # Probability is 0 when value > m, where m is the largest positive integer for which - # theta + m * lam > 0 (when lam < 0). - log_prob = tt.switch(theta_lam_value <= 0, -np.inf, log_prob) - - return bound(log_prob, value >= 0, theta > 0, abs(lam) <= 1, -theta / 4 <= lam) -``` - -## 2. Generator Function - -If your distribution exists in `scipy.stats` (https://docs.scipy.org/doc/scipy/reference/stats.html), then you can use the Random Variates method `scipy.stats.{dist_name}.rvs` to generate random samples. - -Since `scipy` does not include the Generalized Poisson, we will define our own generator function using the Inversion Algorithm presented in [Famoye (1997)](https://www.tandfonline.com/doi/abs/10.1080/01966324.1997.10737439?journalCode=umms20): - -Initialize $\omega \leftarrow e^{-\lambda}$ -1. $X \leftarrow 0$ -2. $S \leftarrow e^{-\theta}$ and $P \leftarrow S$ -3. Generate $U$ from uniform distribution on $(0,1)$. -4. While $U > S$, do - 1. $X \leftarrow X + 1$ - 2. $C \leftarrow \theta - \lambda + \lambda X$ - 3. $P \leftarrow \omega \cdot C (1 + \frac{\lambda}{C})^{X-1} P X^{-1}$ - 4. $S \leftarrow S + P$ -5. Deliver $X$ - -We now define a function that generates a set of random samples from the Generalized Poisson with the given parameters. It is meant to be analogous to `scipy.stats.{dist_name}.rvs`. - -Parameters: -- `theta`: An array of values for $\theta$ -- `lam`: A single value for $\lambda$ -- `size`: The number of samples to generate - -Returns: -- One random sample for the Generalized Poisson defined by each of the given $\theta$ values and the given $\lambda$ value. - -```{code-cell} ipython3 -def genpoisson_rvs(theta, lam, size=None): - if size is not None: - assert size == theta.shape - else: - size = theta.shape - lam = lam[0] - omega = np.exp(-lam) - X = np.full(size, 0) - S = np.exp(-theta) - P = np.copy(S) - for i in range(size[0]): - U = np.random.uniform() - while U > S[i]: - X[i] += 1 - C = theta[i] - lam + lam * X[i] - P[i] = omega * C * (1 + lam / C) ** (X[i] - 1) * P[i] / X[i] - S[i] += P[i] - return X -``` - -## 3. Class Definition - -Every PyMC3 distribution requires the following basic format. A few things to keep in mind: -- Your class should have the parent class `pm.Discrete` if your distribution is discrete, or `pm.Continuous` if your distriution is continuous. 
-- For continuous distributions you also have to define the default transform, or inherit from a more specific class like `PositiveContinuous` which specifies what the default transform should be. -- You'll need specify at least one "default value" for the distribution during `init` such as `self.mode`, `self.median`, or `self.mean` (the latter only for continuous distributions). This is used by some samplers or other compound distributions. - -```{code-cell} ipython3 -class GenPoisson(pm.Discrete): - def __init__(self, theta, lam, *args, **kwargs): - super().__init__(*args, **kwargs) - self.theta = theta - self.lam = lam - self.mode = intX(tt.floor(theta / (1 - lam))) - - def logp(self, value): - theta = self.theta - lam = self.lam - return genpoisson_logp(theta, lam, value) - - def random(self, point=None, size=None): - theta, lam = draw_values([self.theta, self.lam], point=point, size=size) - return generate_samples(genpoisson_rvs, theta=theta, lam=lam, size=size) -``` - -## Sanity Check - -Let's sample from our new distribution to make sure it's working as expected. We'll take 5000 samples each from the standard Poisson, the Generalized Poisson (GP) with $\lambda=0$, the GP with $\lambda<0$, and the GP with $\lambda>0$. You can see that the GP with $\lambda=0$ is equivalent to the standard Poisson (mean $=$ variance), while the GP with $\lambda<0$ is underdispered (mean $>$ variance), and the GP with $\lambda>0$ is overdispersed (mean $<$ variance). - -```{code-cell} ipython3 -std = pm.Poisson.dist(mu=5).random(size=5000) -equi = GenPoisson.dist(theta=np.full(5000, 5), lam=0).random() -under = GenPoisson.dist(theta=np.full(5000, 5), lam=-0.5).random() -over = GenPoisson.dist(theta=np.full(5000, 5), lam=0.3).random() -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(10, 8)) -plt.setp(ax, xlim=(0, 20)) - -ax[0][0].hist(std, bins=np.arange(21)) -ax[0][0].set_title("Standard Poisson\n($\\mu=5$)") - -ax[0][1].hist(equi, bins=np.arange(21)) -ax[0][1].set_title("Generalized Poisson with Equidispersion\n($\\theta=5, \\lambda=0$)") - -ax[1][0].hist(under, bins=np.arange(21)) -ax[1][0].set_title("Generalized Poisson with Underdispersion\n($\\theta=5, \\lambda=-0.5$)") - -ax[1][1].hist(over, bins=np.arange(21)) -ax[1][1].set_title("Generalized Poisson with Overdispersion\n($\\theta=5, \\lambda=0.3$)"); -``` - -## Using our custom distribution in our model - -Now that we have defined our custom distribution, we can use it in our PyMC3 model as we would use any other pre-defined distribution. - -+++ - -### Model - -Our goal is to predict the next 2 weeks of COVID-19 occupancy counts at a specific hospital. We are given a series of daily counts $y_t$ indexed by day $t$ for the past $T$ days, and we would like to make forecasts for the next $F=14$ days. 
In other words, we are building a probabilisitic model for - -$$p( y_{(T+1):(T+F)} \mid y_{1:T} )$$ - -We suppose that $y$ is GenPoisson-distributed over the exponential of a latent time series $f$, where $f$ is an autoregressive process with 1 lag, i.e., for each day $t$, - -$$y_t \sim \text{GenPoisson}( \theta = \exp(f_t), \lambda )$$ - -$$f_t \sim N(\beta_0 + \beta_1 * f_{t-1}, \sigma^2)$$ - -### Priors -- Bias weight: $$\beta_0 \sim N(0,0.1)$$ -- Weight on most recent timestep: $$\beta_1 \sim N(1,0.1)$$ -- Standard deviation: $$\sigma \sim \text{HalfNormal}(0.1)$$ -- Dispersion parameter: $$\lambda \sim \text{TruncatedNormal}(0, 0.1, \text{lower}=-1, \text{upper}=1)$$ - -```{code-cell} ipython3 -try: - df = pd.read_csv( - os.path.join("..", "data", "tufts_medical_center_2020-04-29_to_2020-07-06.csv") - ) -except FileNotFoundError: - df = pd.read_csv(pm.get_data("tufts_medical_center_2020-04-29_to_2020-07-06.csv")) - -dates = df["date"].values -y = df["hospitalized_total_covid_patients_suspected_and_confirmed_including_icu"].astype(float) -``` - -We'll divide our dataset into training and validation sets, holding out the last $F=14$ days, and treating the remaining $T$ days as the past. - -```{code-cell} ipython3 -F = 14 -T = len(y) - F -y_tr = y[:T] -y_va = y[-F:] -``` - -```{code-cell} ipython3 -with pm.Model() as model: - bias = pm.Normal("beta[0]", mu=0, sigma=0.1) - beta_recent = pm.Normal("beta[1]", mu=1, sigma=0.1) - rho = [bias, beta_recent] - sigma = pm.HalfNormal("sigma", sigma=0.1) - f = pm.AR("f", rho, sigma=sigma, constant=True, shape=T + F) - - lam = pm.TruncatedNormal("lam", mu=0, sigma=0.1, lower=-1, upper=1) - - y_past = GenPoisson("y_past", theta=tt.exp(f[:T]), lam=lam, observed=y_tr) -``` - -```{code-cell} ipython3 -with model: - trace = pm.sample( - 5000, - tune=2000, - target_accept=0.99, - max_treedepth=15, - chains=2, - cores=1, - init="adapt_diag", - random_seed=42, - ) -``` - -```{code-cell} ipython3 -pm.traceplot(trace); -``` - -```{code-cell} ipython3 -with model: - y_future = GenPoisson("y_future", theta=tt.exp(f[-F:]), lam=lam, shape=F) - forecasts = pm.sample_posterior_predictive(trace, vars=[y_future], random_seed=42) -samples = forecasts["y_future"] -``` - -```{code-cell} ipython3 -start = date.fromisoformat(dates[-1]) - timedelta(F - 1) # start date of forecasts - -low = np.zeros(F) -high = np.zeros(F) -median = np.zeros(F) - -for i in range(F): - low[i] = np.percentile(samples[:, i], 2.5) - high[i] = np.percentile(samples[:, i], 97.5) - median[i] = np.percentile(samples[:, i], 50) - -x_future = np.arange(F) -plt.errorbar( - x_future, - median, - yerr=[median - low, high - median], - capsize=2, - fmt="x", - linewidth=1, - label="2.5, 50, 97.5 percentiles", -) -x_past = np.arange(-30, 0) - -plt.plot( - np.concatenate((x_past, x_future)), np.concatenate((y_tr[-30:], y_va)), ".", label="observed" -) - -plt.xticks([-30, 0, F - 1], [start + timedelta(-30), start, start + timedelta(F - 1)]) - -plt.legend() -plt.title("Predicted Counts of COVID-19 Patients at Tufts Medical Center") -plt.ylabel("Count") -plt.show() -``` - -## References - -Contributed by Alexandra Hope Lee. This example is adapted from the modeling work presented in this paper: - -A. H. Lee, P. Lymperopoulos, J. T. Cohen, J. B. Wong, and M. C. Hughes. Forecasting COVID-19 Counts at a Single Hospital: A Hierarchical Bayesian Approach. In ICLR 2021 Workshop on Machine Learning for Preventing and Combating Pandemics, 2021. https://arxiv.org/abs/2104.09327. 
- -Resources on the Generalized Poisson distribution: -- https://www.tandfonline.com/doi/pdf/10.1080/03610929208830766 -- https://journals.sagepub.com/doi/pdf/10.1177/1536867X1201200412 -- https://towardsdatascience.com/generalized-poisson-regression-for-real-world-datasets-d1ff32607d79 -- https://www.tandfonline.com/doi/abs/10.1080/01966324.1997.10737439?journalCode=umms20 - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` diff --git a/myst_nbs/howto/data_container.myst.md b/myst_nbs/howto/data_container.myst.md deleted file mode 100644 index 321b00d92..000000000 --- a/myst_nbs/howto/data_container.myst.md +++ /dev/null @@ -1,329 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(data_container)= -# Using shared variables (`Data` container adaptation) - -:::{post} Dec 16, 2021 -:tags: posterior predictive, shared data -:category: beginner -:author: Juan Martin Loyola, Kavya Jaiswal, Oriol Abril -::: - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm - -from numpy.random import default_rng - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%matplotlib inline -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## The Data class - -The {class}`pymc.Data` container class wraps the theano shared variable class and lets the model be aware of its inputs and outputs. This allows one to change the value of an observed variable to predict or refit on new data. All variables of this class must be declared inside a model context and specify a name for them. - -In the following example, this is demonstrated with fictional temperature observations. - -```{code-cell} ipython3 -df_data = pd.DataFrame(columns=["date"]).set_index("date") -dates = pd.date_range(start="2020-05-01", end="2020-05-20") - -for city, mu in {"Berlin": 15, "San Marino": 18, "Paris": 16}.items(): - df_data[city] = rng.normal(loc=mu, size=len(dates)) - -df_data.index = dates -df_data.index.name = "date" -df_data.head() -``` - -PyMC3 can also keep track of the dimensions (like dates or cities) and coordinates (such as the actual date times or city names) of multi-dimensional data. That way, when wrapping your data around the `Data` container when building your model, you can specify the dimension names and coordinates of random variables, instead of specifying the shapes of those random variables as numbers. - -More generally, there are two ways to specify new dimensions and their coordinates: -- Entering the dimensions in the `dims` kwarg of a `pm.Data` variable with a pandas Series or DataFrame. The name of the index and columns will be remembered as the dimensions, and PyMC3 will infer that the values of the given columns must be the coordinates. -- Using the new `coords` argument to {class}`pymc.Model` to set the coordinates explicitly. - -For more explanation about dimensions, coordinates and their big benefits, we encourage you to take a look at the {ref}`ArviZ documentation `. - -This is a lot of explanation -- let's see how it's done! 
We will use a hierarchical model: it assumes a mean temperature for the European continent and models each city relative to the continent mean: - -```{code-cell} ipython3 -# The data has two dimensions: date and city -coords = {"date": df_data.index, "city": df_data.columns} -``` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model: - europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0) - city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city") - city_temperature = pm.Deterministic("city_temperature", europe_mean + city_offset, dims="city") - - data = pm.Data("data", df_data, dims=("date", "city")) - pm.Normal("likelihood", mu=city_temperature, sigma=0.5, observed=data) - - idata = pm.sample( - 2000, - tune=2000, - target_accept=0.85, - return_inferencedata=True, - random_seed=RANDOM_SEED, - ) -``` - -We can plot the digraph for our model using: - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -And we see that the model did remember the coords we gave it: - -```{code-cell} ipython3 -model.coords -``` - -Coordinates are automatically stored into the {class}`arviz.InferenceData` object: - -```{code-cell} ipython3 -idata.posterior.coords -``` - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["europe_mean_temp", "city_temperature"]); -``` - -We can get the data container variable from the model using: - -```{code-cell} ipython3 -model["data"].get_value() -``` - -Note that we used a theano method {meth}`theano.compile.sharedvalue.SharedVariable.get_value` of class {class}`theano.compile.sharedvalue.SharedVariable` to get the value of the variable. This is because our variable is actually a `SharedVariable`. - -```{code-cell} ipython3 -type(data) -``` - -The methods and functions related to the Data container class are: - -- `data_container.get_value` (method inherited from the theano SharedVariable): gets the value associated with the `data_container`. -- `data_container.set_value` (method inherited from the theano SharedVariable): sets the value associated with the `data_container`. -- {func}`pymc.set_data`: PyMC3 function that sets the value associated with each Data container variable indicated in the dictionary `new_data` with it corresponding new value. - -+++ - -## Using Data container variables to fit the same model to several datasets - -This and the next sections are an adaptation of the notebook ["Advanced usage of Theano in PyMC3"](../Advanced_usage_of_Theano_in_PyMC3.html#using-shared-variables) using `pm.Data`. - -We can use `Data` container variables in PyMC3 to fit the same model to several datasets without the need to recreate the model each time (which can be time consuming if the number of datasets is large): - -```{code-cell} ipython3 -:tags: [hide-output] - -# We generate 10 datasets -true_mu = [rng.random() for _ in range(10)] -observed_data = [mu + rng.random(20) for mu in true_mu] - -with pm.Model() as model: - data = pm.Data("data", observed_data[0]) - mu = pm.Normal("mu", 0, 10) - pm.Normal("y", mu=mu, sigma=1, observed=data) - -# Generate one trace for each dataset -traces = [] -for data_vals in observed_data: - with model: - # Switch out the observed dataset - pm.set_data({"data": data_vals}) - traces.append(pm.sample(return_inferencedata=True)) -``` - -## Using Data container variables to predict on new data - -We can also sometimes use `Data` container variables to work around limitations in the current PyMC3 API. 
A common task in machine learning is to predict values for unseen data, and one way to achieve this is to use a `Data` container variable for our observations: - -```{code-cell} ipython3 -x = rng.random(100) -y = x > 0 - -with pm.Model() as model: - x_shared = pm.Data("x_shared", x) - coeff = pm.Normal("x", mu=0, sigma=1) - - logistic = pm.math.sigmoid(coeff * x_shared) - pm.Bernoulli("obs", p=logistic, observed=y) - - # fit the model - trace = pm.sample(return_inferencedata=True, tune=2000) -``` - -```{code-cell} ipython3 -new_values = [-1, 0, 1.0] -with model: - # Switch out the observations and use `sample_posterior_predictive` to predict - pm.set_data({"x_shared": new_values}) - post_pred = pm.sample_posterior_predictive(trace) -``` - -The same concept applied to a more complex model can be seen in the notebook {ref}`bayesian_neural_network_advi`. - -+++ - -## Applied example: height of toddlers as a function of age - -+++ - -This example is taken from Osvaldo Martin's book: [Bayesian Analysis with Python: Introduction to statistical modeling and probabilistic programming using PyMC3 and ArviZ, 2nd Edition](https://www.amazon.com/Bayesian-Analysis-Python-Introduction-probabilistic-ebook/dp/B07HHBCR9G) {cite:p}`martin2018bayesian`. - -+++ - -The World Health Organization and other health institutions around the world collect data -for newborns and toddlers and design [growth charts standards](http://www.who.int/childgrowth/en/). These charts are an essential component of the paediatric toolkit and also as a measure of the general well-being of -populations in order to formulate health policies, and plan interventions and -monitor their effectiveness. - -An example of such data is the lengths (heights) of newborn / toddler girls as a function of age (in months): - -```{code-cell} ipython3 -try: - data = pd.read_csv("../data/babies.csv") -except FileNotFoundError: - data = pd.read_csv(pm.get_data("babies.csv")) -data.plot.scatter("Month", "Length", alpha=0.4); -``` - -To model this data we are going to use this model: - -```{code-cell} ipython3 -with pm.Model(coords={"time_idx": np.arange(len(data))}) as model_babies: - α = pm.Normal("α", sigma=10) - β = pm.Normal("β", sigma=10) - γ = pm.HalfNormal("γ", sigma=10) - δ = pm.HalfNormal("δ", sigma=10) - - month = pm.Data("month", data.Month.values.astype(float), dims="time_idx") - - μ = pm.Deterministic("μ", α + β * month**0.5, dims="time_idx") - ε = pm.Deterministic("ε", γ + δ * month, dims="time_idx") - - length = pm.Normal("length", mu=μ, sigma=ε, observed=data.Length, dims="time_idx") - - idata_babies = pm.sample(tune=2000, return_inferencedata=True) -``` - -The following figure shows the result of our model. 
The expected length, $\mu$, is represented with a blue curve, and two semi-transparent orange bands represent the 60% and 94% highest posterior density intervals of posterior predictive length measurements: - -```{code-cell} ipython3 -with model_babies: - idata_babies.extend( - az.from_pymc3(posterior_predictive=pm.sample_posterior_predictive(idata_babies)) - ) -``` - -```{code-cell} ipython3 -ax = az.plot_hdi( - data.Month, - idata_babies.posterior_predictive["length"], - hdi_prob=0.6, - fill_kwargs={"alpha": 0.8}, -) -ax.plot( - data.Month, - idata_babies.posterior["μ"].mean(("chain", "draw")), - label="Posterior predictive mean", -) -ax = az.plot_lm( - idata=idata_babies, - y="length", - x="month", - kind_pp="hdi", - y_kwargs={"color": "k", "ms": 6, "alpha": 0.15}, - y_hat_fill_kwargs=dict(fill_kwargs={"alpha": 0.4}), - axes=ax, -) -``` - -At the moment of writing Osvaldo's daughter is two weeks ($\approx 0.5$ months) old, and thus he wonders how her length compares to the growth chart we have just created. One way to answer this question is to ask the model for the distribution of the variable length for babies of 0.5 months. Using PyMC3 we can ask this questions with the function `sample_posterior_predictive` , as this will return samples of _Length_ conditioned on the obseved data and the estimated distribution of parameters, that is including uncertainties. - -The only problem is that by default this function will return predictions for _Length_ for the observed values of _Month_, and $0.5$ months (the value Osvaldo cares about) has not been observed, -- all measures are reported for integer months. The easier way to get predictions for non-observed values of _Month_ is to pass new values to the `Data` container we defined above in our model. To do that, we need to use `pm.set_data` and then we just have to sample from the posterior predictve distribution: - -```{code-cell} ipython3 -ages_to_check = [0.5, 0.75] -with model_babies: - pm.set_data({"month": ages_to_check}) - # we use two values instead of only 0.5 months to avoid triggering - # https://github.com/pymc-devs/pymc3/issues/3640 - predictions = pm.sample_posterior_predictive(idata_babies) - - # add the generation predictions also to the inferencedata object - # this is not necessary but allows for example storing data, posterior and predictions in the same file - az.from_pymc3_predictions( - predictions, - idata_orig=idata_babies, - inplace=True, - # we update the dimensions and coordinates, we no longer have use for "time_idx" - # as unique id. We'll now use the age in months as coordinate for better labeling and indexing - # We duplicate the constant_data as coords though - coords={"age (months)": ages_to_check}, - dims={"length": ["age (months)"], "month": ["age (months)"]}, - ) -``` - -Now we can plot the expected distribution of lengths for 2-week old babies and compute additional quantities -- for example the percentile of a child given her length. 
Here, let's imagine that the child we're interested in has a length of 51.5: - -```{code-cell} ipython3 -ref_length = 51.5 - -az.plot_posterior( - idata_babies, - group="predictions", - ref_val={"length": [{"age (months)": 0.5, "ref_val": ref_length}]}, - labeller=az.labels.DimCoordLabeller(), -); -``` - -## Authors -* Authored by [Juan Martin Loyola](https://github.com/jmloyola) in March, 2019 ([pymc#3389](https://github.com/pymc-devs/pymc/pull/3389)) -* Updated by [Kavya Jaiswal](https://github.com/KavyaJaiswal) and [Oriol Abril](https://github.com/OriolAbril) in December, 2021 ([pymc-examples#151](https://github.com/pymc-devs/pymc-examples/pull/151)) - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/howto/howto_debugging.myst.md b/myst_nbs/howto/howto_debugging.myst.md deleted file mode 100644 index d4fb0630a..000000000 --- a/myst_nbs/howto/howto_debugging.myst.md +++ /dev/null @@ -1,198 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(howto_debugging)= -# How to debug a model - -:::{post} August 2, 2022 -:tags: debugging, Aesara -:category: beginner -:author: Thomas Wiecki, Igor Kuvychko -::: - -+++ - -## Introduction -There are various levels on which to debug a model. One of the simplest is to just print out the values that different variables are taking on. - -Because `PyMC` uses `Aesara` expressions to build the model, and not functions, there is no way to place a `print` statement into a likelihood function. Instead, you can use the `aesara.printing.Print` class to print intermediate values. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -``` - -```{code-cell} ipython3 -%matplotlib inline -%config InlineBackend.figure_format = "retina" - -RANDOM_SEED = 8927 -``` - -### How to print intermediate values of `Aesara` functions -Since `Aesara` functions are compiled to C, you have to use `aesara.printing.Print` class to print intermediate values (imported below as `Print`). Python `print` function will not work. Below is a simple example of using `Print`. For more information, see {ref}`Debugging Aesara `. - -```{code-cell} ipython3 -import aesara.tensor as at - -from aesara import function -from aesara.printing import Print -``` - -```{code-cell} ipython3 -x = at.dvector("x") -y = at.dvector("y") -func = function([x, y], 1 / (x - y)) -func([1, 2, 3], [1, 0, -1]) -``` - -To see what causes the `inf` value in the output, we can print intermediate values of $(x-y)$ using `Print`. `Print` class simply passes along its caller but prints out its value along a user-define message: - -```{code-cell} ipython3 -z_with_print = Print("x - y = ")(x - y) -func_with_print = function([x, y], 1 / z_with_print) -func_with_print([1, 2, 3], [1, 0, -1]) -``` - -`Print` reveals the root cause: $(x-y)$ takes a zero value when $x=1, y=1$, causing the `inf` output. 
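`Print` can also report attributes of the intermediate value instead of the value itself, which helps when tensors are large. The sketch below assumes `Print`'s `attrs` argument works as it did in Theano (printing `getattr(value, attr)` for each listed attribute); here we print only the shape:

```{code-cell} ipython3
# Print the shape of the intermediate result rather than its values.
z_with_shape = Print("shape of x - y = ", attrs=["shape"])(x - y)
func_with_shape = function([x, y], 1 / z_with_shape)
func_with_shape([1, 2, 3], [1, 0, -1]);
```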
- -+++ - -### How to capture `Print` output for further analysis - -When we expect many rows of output from `Print`, it can be desirable to redirect the output to a string buffer and access the values later on (thanks to **Lindley Lentati** for inspiring this example). Here is a toy example using the Python `print` function: - -```{code-cell} ipython3 -import sys - -from io import StringIO - -old_stdout = sys.stdout -mystdout = sys.stdout = StringIO() - -for i in range(5): - print(f"Test values: {i}") - -output = mystdout.getvalue().split("\n") -sys.stdout = old_stdout # setting sys.stdout back -output -``` - -### Troubleshooting a toy PyMC model - -```{code-cell} ipython3 -rng = np.random.default_rng(RANDOM_SEED) -x = rng.normal(size=100) - -with pm.Model() as model: - # priors - mu = pm.Normal("mu", mu=0, sigma=1) - sd = pm.Normal("sd", mu=0, sigma=1) - - # setting up printing for mu and sd - mu_print = Print("mu")(mu) - sd_print = Print("sd")(sd) - - # likelihood - obs = pm.Normal("obs", mu=mu_print, sigma=sd_print, observed=x) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -with model: - step = pm.Metropolis() - trace = pm.sample(5, step, tune=0, chains=1, progressbar=False, random_seed=RANDOM_SEED) -``` - -Exception handling in PyMC v4 has improved, so the `SamplingError` exception now prints out the intermediate values of `mu` and `sd` that led to a likelihood of `-inf`. However, this technique of printing intermediate values with `aesara.printing.Print` can be valuable in more complicated cases. - -+++ - -### Bringing it all together - -```{code-cell} ipython3 -rng = np.random.default_rng(RANDOM_SEED) -y = rng.normal(loc=5, size=20) - -old_stdout = sys.stdout -mystdout = sys.stdout = StringIO() - -with pm.Model() as model: - mu = pm.Normal("mu", mu=0, sigma=10) - a = pm.Normal("a", mu=0, sigma=10, initval=0.1) - b = pm.Normal("b", mu=0, sigma=10, initval=0.1) - sd_print = Print("Delta")(a / b) - obs = pm.Normal("obs", mu=mu, sigma=sd_print, observed=y) - - # limiting number of samples and chains to simplify output - trace = pm.sample(draws=10, tune=0, chains=1, progressbar=False, random_seed=RANDOM_SEED) - -output = mystdout.getvalue() -sys.stdout = old_stdout # setting sys.stdout back -``` - -```{code-cell} ipython3 -output -``` - -Raw output is a bit messy and requires some cleanup and formatting to convert to `numpy.ndarray`. In the example below, regex is used to clean up the output, and then it is evaluated with `eval` to give a list of floats. The code below also works with higher-dimensional outputs (in case you want to experiment with different models). - -```{code-cell} ipython3 -import re - -# output cleanup and conversion to numpy array -# this code also accepts more complicated inputs -pattern = re.compile("Delta __str__ = ") -output = re.sub(pattern, " ", output) -pattern = re.compile("\\s+") -output = re.sub(pattern, ",", output) -pattern = re.compile(r"\[,") -output = re.sub(pattern, "[", output) -output += "]" -output = "[" + output[1:] -output = eval(output) -output = np.array(output) -``` - -```{code-cell} ipython3 -output -``` - -Notice that we requested 5 draws, but got 34 sets of $a/b$ values. The reason is that for each iteration, all proposed values are printed (not just the accepted values). Negative values are clearly problematic.
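To make that last observation concrete, one could count how many of the captured proposals were negative (a quick sketch reusing the `output` array built above):

```python
n_negative = int((output < 0).sum())
print(f"{n_negative} of {output.size} captured Delta values are negative")
```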
- -```{code-cell} ipython3 -output.shape -``` - -## Authors - -* Authored by Thomas Wiecki in July, 2016 -* Updated by Igor Kuvychko in August, 2022 ([pymc#406] (https://github.com/pymc-devs/pymc-examples/pull/406)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/howto/lasso_block_update.myst.md b/myst_nbs/howto/lasso_block_update.myst.md deleted file mode 100644 index 2bbd83671..000000000 --- a/myst_nbs/howto/lasso_block_update.myst.md +++ /dev/null @@ -1,124 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(lasso_block_update)= -# Lasso regression with block updating - -:::{post} Feb 10, 2022 -:tags: regression -:category: beginner -:author: Chris Fonnesbeck, Raul Maldonado, Michael Osthege, Thomas Wiecki, Lorenzo Toniazzi -::: - -```{code-cell} ipython3 -:tags: [] - -%matplotlib inline -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -Sometimes, it is very useful to update a set of parameters together. For example, variables that are highly correlated are often good to update together. In PyMC block updating is simple. This will be demonstrated using the parameter `step` of {class}`pymc.sample`. - -Here we have a [LASSO regression model](https://en.wikipedia.org/wiki/Lasso_(statistics)#Bayesian_interpretation) where the two coefficients are strongly correlated. Normally, we would define the coefficient parameters as a single random variable, but here we define them separately to show how to do block updates. - -First we generate some fake data. - -```{code-cell} ipython3 -x = rng.standard_normal(size=(3, 30)) -x1 = x[0] + 4 -x2 = x[1] + 4 -noise = x[2] -y_obs = x1 * 0.2 + x2 * 0.3 + noise -``` - -Then define the random variables. - -```{code-cell} ipython3 -:tags: [] - -lam = 3000 - -with pm.Model() as model: - sigma = pm.Exponential("sigma", 1) - tau = pm.Uniform("tau", 0, 1) - b = lam * tau - beta1 = pm.Laplace("beta1", 0, b) - beta2 = pm.Laplace("beta2", 0, b) - - mu = x1 * beta1 + x2 * beta2 - - y = pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs) -``` - -For most samplers, including {class}`pymc.Metropolis` and {class}`pymc.HamiltonianMC`, simply pass a list of variables to sample as a block. This works with both scalar and array parameters. - -```{code-cell} ipython3 -with model: - step1 = pm.Metropolis([beta1, beta2]) - - step2 = pm.Slice([sigma, tau]) - - idata = pm.sample(draws=10000, step=[step1, step2]) -``` - -We conclude by plotting the sampled marginals and the joint distribution of `beta1` and `beta2`. 
- -```{code-cell} ipython3 -:tags: [] - -az.plot_trace(idata); -``` - -```{code-cell} ipython3 -az.plot_pair( - idata, - var_names=["beta1", "beta2"], - kind="hexbin", - marginals=True, - figsize=(10, 10), - gridsize=50, -) -``` - -## Authors - -* Authored by [Chris Fonnesbeck](https://github.com/fonnesbeck) in Dec, 2020 -* Updated by [Raul Maldonado](https://github.com/CloudChaoszero) in Jan, 2021 -* Updated by Raul Maldonado in Mar, 2021 -* Reexecuted by [Thomas Wiecki](https://github.com/twiecki) and [Michael Osthege](https://github.com/michaelosthege) with PyMC v4 in Jan, 2022 ([pymc-examples#264](https://github.com/pymc-devs/pymc-examples/pull/264)) -* Updated by [Lorenzo Toniazzi](https://github.com/ltoniazzi) in Feb, 2022 ([pymc-examples#279](https://github.com/pymc-devs/pymc-examples/pull/279)) - -+++ - -## Watermark - -```{code-cell} ipython3 -:tags: [] - -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/howto/profiling.myst.md b/myst_nbs/howto/profiling.myst.md deleted file mode 100644 index f7f12b890..000000000 --- a/myst_nbs/howto/profiling.myst.md +++ /dev/null @@ -1,62 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(profiling)= -# Profiling -Sometimes computing the likelihood is not as fast as we would like. Theano provides handy profiling tools which are wrapped in PyMC3 by `model.profile`. This function returns a `ProfileStats` object conveying information about the underlying Theano operations. Here we'll profile the likelihood and gradient for the stochastic volatility example. - -First we build the model. - -```{code-cell} ipython3 -import numpy as np -import pandas as pd -import pymc3 as pm - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -``` - -```{code-cell} ipython3 -# Load the data -returns = pd.read_csv(pm.get_data("SP500.csv"), index_col=0, parse_dates=True) -``` - -```{code-cell} ipython3 -# Stochastic volatility example -with pm.Model() as model: - sigma = pm.Exponential("sigma", 1.0 / 0.02, testval=0.1) - nu = pm.Exponential("nu", 1.0 / 10) - s = pm.GaussianRandomWalk("s", sigma**-2, shape=returns.shape[0]) - r = pm.StudentT("r", nu, lam=np.exp(-2 * s), observed=returns["change"]) -``` - -Then we call the `profile` function and summarize its return values. - -```{code-cell} ipython3 -# Profiling of the logp call -model.profile(model.logpt).summary() -``` - -```{code-cell} ipython3 -# Profiling of the gradient call dlogp/dx -model.profile(pm.gradient(model.logpt, model.vars)).summary() -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/howto/sampling_callback.myst.md b/myst_nbs/howto/sampling_callback.myst.md deleted file mode 100644 index 239fee479..000000000 --- a/myst_nbs/howto/sampling_callback.myst.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.7 ('pymc3_stable') - language: python - name: python3 ---- - -# Sample callback - -This notebook demonstrates the usage of the callback attribute in `pm.sample`. 
A callback is a function which gets called for every sample from the trace of a chain. The function is called with the trace and the current draw as arguments and will contain all samples for a single trace. - -The sampling process can be interrupted by throwing a `KeyboardInterrupt` from inside the callback. - -use-cases for this callback include: - - - Stopping sampling when a number of effective samples is reached - - Stopping sampling when there are too many divergences - - Logging metrics to external tools (such as TensorBoard) - -We'll start with defining a simple model - -```{code-cell} ipython3 -import numpy as np -import pymc3 as pm - -X = np.array([1, 2, 3, 4, 5]) -y = X * 2 + np.random.randn(len(X)) -with pm.Model() as model: - - intercept = pm.Normal("intercept", 0, 10) - slope = pm.Normal("slope", 0, 10) - - mean = intercept + slope * X - error = pm.HalfCauchy("error", 1) - obs = pm.Normal("obs", mean, error, observed=y) -``` - -We can then for example add a callback that stops sampling whenever 100 samples are made, regardless of the number of draws set in the `pm.sample` - -```{code-cell} ipython3 -def my_callback(trace, draw): - if len(trace) >= 100: - raise KeyboardInterrupt() - - -with model: - trace = pm.sample(tune=0, draws=500, callback=my_callback, chains=1) - -print(len(trace)) -``` - -Something to note though, is that the trace we get passed in the callback only correspond to a single chain. That means that if we want to do calculations over multiple chains at once, we'll need a bit of machinery to make this possible. - -```{code-cell} ipython3 -def my_callback(trace, draw): - if len(trace) % 100 == 0: - print(len(trace)) - - -with model: - trace = pm.sample(tune=0, draws=500, callback=my_callback, chains=2, cores=2) -``` - -We can use the `draw.chain` attribute to figure out which chain the current draw and trace belong to. Combined with some kind of convergence statistic like r_hat we can stop when we have converged, regardless of the amount of specified draws. - -```{code-cell} ipython3 -import arviz as az - - -class MyCallback: - def __init__(self, every=1000, max_rhat=1.05): - self.every = every - self.max_rhat = max_rhat - self.traces = {} - - def __call__(self, trace, draw): - if draw.tuning: - return - - self.traces[draw.chain] = trace - if len(trace) % self.every == 0: - multitrace = pm.backends.base.MultiTrace(list(self.traces.values())) - if pm.stats.rhat(multitrace).to_array().max() < self.max_rhat: - raise KeyboardInterrupt - - -with model: - trace = pm.sample(tune=1000, draws=100000, callback=MyCallback(), chains=2, cores=2) -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/howto/sampling_compound_step.myst.md b/myst_nbs/howto/sampling_compound_step.myst.md deleted file mode 100644 index f00445bbc..000000000 --- a/myst_nbs/howto/sampling_compound_step.myst.md +++ /dev/null @@ -1,197 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -# Compound Steps in Sampling -This notebook explains how the compound steps work in `pymc.sample` function when sampling multiple random variables. We are going to answer the following questions associated with compound steps: - -- How do compound steps work? -- What happens when PyMC assigns step methods by default? -- How to specify the step methods? 
What is the order to apply the step methods at each iteration? Is there a way to specify the order of the step methods? -- What are the issues with mixing discrete and continuous samplers, especially with HMC/NUTS? -- What happens to sample statistics that occur in multiple step methods? - -```{code-cell} ipython3 -import aesara -import arviz as az -import numpy as np -import pymc as pm -import xarray -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -``` - -## Compound steps - -+++ - -When sampling a model with multiple free random variables, compound steps are needed in the `pm.sample` function. When compound steps are involved, the function takes a list of `step` to generate a list of `methods` for different random variables. For example in the following code: -```python -with pm.Model() as m: - rv1 = ... # random variable 1 (continuous) - rv2 = ... # random variable 2 (continuous) - rv3 = ... # random variable 3 (categorical) - #... - step1 = pm.Metropolis([rv1, rv2]) - step2 = pm.CategoricalGibbsMetropolis([rv3]) - trace = pm.sample(..., step=[step1, step2]) -``` -The compound step now contains a list of `methods`. At each sampling step, it iterates over these methods, taking a `point` as input. In each step a new `point` is proposed as an output, if rejected by the Metropolis-Hastings criteria the original input `point` sticks around as the output. - -+++ - -## Compound steps by default -To conduct Markov chain Monte Carlo (MCMC) sampling to generate posterior samples in PyMC, we specify a step method object that corresponds to a particular MCMC algorithm, such as Metropolis, Slice sampling, or the No-U-Turn Sampler (NUTS). PyMC’s step_methods can be assigned manually, or assigned automatically by PyMC. Auto-assignment is based on the attributes of each variable in the model. In general: - -- Binary variables will be assigned to BinaryMetropolis -- Discrete variables will be assigned to Metropolis -- Continuous variables will be assigned to NUTS - -When we call `pm.sample(return_inferencedata=False)`, `PyMC` assigns the best step method to each of the free random variables. Take the following example - -```{code-cell} ipython3 -n_ = aesara.shared(np.asarray([10, 15])) -with pm.Model() as m: - p = pm.Beta("p", 1.0, 1.0) - ni = pm.Bernoulli("ni", 0.5) - k = pm.Binomial("k", p=p, n=n_[ni], observed=4) - trace = pm.sample(10000) -``` - -There are two free parameters in the model we would like to sample from, a continuous variable `p_logodds__` and a binary variable `ni`. - -```{code-cell} ipython3 -m.free_RVs -``` - -When we call `pm.sample(return_inferencedata=False)`, `PyMC` assigns the best step method to each of them. For example, `NUTS` was assigned to `p_logodds__` and `BinaryGibbsMetropolis` was assigned to `ni`. - -+++ - -## Specify compound steps -Auto-assignment can be overridden for any subset of variables by specifying them manually prior to sampling: - -```{code-cell} ipython3 -with m: - step1 = pm.Metropolis([p]) - step2 = pm.BinaryMetropolis([ni]) - trace = pm.sample( - 10000, - step=[step1, step2], - idata_kwargs={ - "dims": {"accept": ["step"]}, - "coords": {"step": ["Metropolis", "BinaryMetropolis"]}, - }, - ) -``` - -```{code-cell} ipython3 -point = m.test_point -point -``` - -Then pass the `point` to the first step method `pm.Metropolis` for random variable `p`. - -```{code-cell} ipython3 -point, state = step1.step(point=point) -point, state -``` - -As you can see, the value of `ni` does not change, but `p_logodds__` is updated. 
- -And similarly, you can pass the updated `point` to `step2` and get a sample for `ni`: - -```{code-cell} ipython3 -point = step2.step(point=point) -point -``` - -Compound step works exactly like this by iterating all the steps within the list. In effect, it is a metropolis hastings within gibbs sampling. - -Moreover, `pm.CompoundStep` is called internally by `pm.sample(return_inferencedata=False)`. We can make them explicit as below: - -```{code-cell} ipython3 -with m: - comp_step1 = pm.CompoundStep([step1, step2]) - trace1 = pm.sample(10000, comp_step1) -comp_step1.methods -``` - -```{code-cell} ipython3 -# These are the Sample Stats for Compound Step based sampling -list(trace1.sample_stats.data_vars) -``` - -Note: In compound step method, a sample stats variable maybe present in both step methods, like `accept` in every chain. - -```{code-cell} ipython3 -trace1.sample_stats["accept"].sel(chain=1).values -``` - -## Order of step methods - -+++ - -When in the default setting, the parameter update order follows the same order of the random variables, and it is assigned automatically. But if you specify the steps, you can change the order of the methods in the list: - -```{code-cell} ipython3 -with m: - comp_step2 = pm.CompoundStep([step2, step1]) - trace2 = pm.sample( - 10000, - comp_step2, - ) -comp_step2.methods -``` - -In the sampling process, it always follows the same step order in each sample in the Gibbs-like fashion. More precisely, at each update, it iterates over the list of `methods` where the accept/reject is based on comparing the acceptance rate with $p \sim \text{Uniform}(0, 1)$ (by checking whether $\log p < \log p_{\text {updated}} - \log p_{\text {current}}$). - -+++ - -Each step method gets its own `accept`, notice how the plots are reversed in when step order is reverted. - -```{code-cell} ipython3 -az.plot_density( - trace1, - group="sample_stats", - var_names="accept", - point_estimate="mean", -); -``` - -```{code-cell} ipython3 -az.plot_density( - trace2, - group="sample_stats", - var_names="accept", - point_estimate="mean", -); -``` - -## Issues with mixing discrete and continuous sampling - -+++ - -A recurrent issue/concern is the validity of mixing discrete and continuous sampling, especially mixing other samplers with NUTS. While in the book [Bayesian Data Analysis 3rd edition](http://www.stat.columbia.edu/~gelman/book/) Chapter 12.4, there is a small paragraph on "Combining Hamiltonian Monte Carlo with Gibbs sampling", which suggests that this could be a valid way to do, the Stan developers are always skeptical about how practical it is. (Here are more discussions about this issue [1](http://discourse.mc-stan.org/t/mcmc-sampling-does-not-work-when-execute/1918/47), [2](http://discourse.mc-stan.org/t/constraining-latent-factor-model-baysian-probabalisic-matrix-factorization-to-remove-multimodality/2152/21)). - -The concern with mixing discrete and continuous sampling is that the change in discrete parameters will affect the continuous distribution's geometry so that the adaptation (i.e., the tuned mass matrix and step size) may be inappropriate for the Hamiltonian Monte Carlo sampling. HMC/NUTS is hypersensitive to its tuning parameters (mass matrix and step size). Another issue is that we also don't know how many iterations we have to run to get a decent sample when the discrete parameters change. 
Though it hasn't been fully evaluated, it seems that if the discrete parameter is in low dimensions (e.g., 2-class mixture models, outlier detection with explicit discrete labeling), the mixing of discrete sampling with HMC/NUTS works OK. However, it is much less efficient than marginalizing out the discrete parameters. And sometimes it can be observed that the Markov chains get stuck quite often. In order to evaluate this more properly, one can use a simulation-based method to look at the posterior coverage and establish the computational correctness, as explained in [Cook, Gelman, and Rubin 2006](https://amstat.tandfonline.com/doi/abs/10.1198/106186006x136976). - -+++ - -Updated by: Meenal Jhajharia - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/howto/sampling_conjugate_step.myst.md b/myst_nbs/howto/sampling_conjugate_step.myst.md deleted file mode 100644 index e14b9b064..000000000 --- a/myst_nbs/howto/sampling_conjugate_step.myst.md +++ /dev/null @@ -1,251 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -# Using a custom step method for sampling from locally conjugate posterior distributions - -+++ - -## Introduction - -+++ - -Sampling methods based on Monte Carlo are extremely widely used in Bayesian inference, and PyMC3 uses a powerful version of Hamiltonian Monte Carlo (HMC) to efficiently sample from posterior distributions over many hundreds or thousands of parameters. HMC is a generic inference algorithm in the sense that you do not need to assume specific prior distributions (like an inverse-Gamma prior on the conditional variance of a regression model) or likelihood functions. In general, the product of a prior and likelihood will not easily be integrated in closed form, so we can't derive the form of the posterior with pen and paper. HMC is widely regarded as a major improvement over previous Markov chain Monte Carlo (MCMC) algorithms because it uses gradients of the model's log posterior density to make informed proposals in parameter space. - -However, these gradient computations can often be expensive for models with especially complicated functional dependencies between variables and observed data. When this is the case, we may wish to find a faster sampling scheme by making use of additional structure in some portions of the model. When a number of variables within the model are *conjugate*, the conditional posterior--that is, the posterior distribution holding all other model variables fixed--can often be sampled from very easily. This suggests using a HMC-within-Gibbs step in which we alternate between using cheap conjugate sampling for variables when possible, and using more expensive HMC for the rest. - -Generally, it is not advisable to pick *any* alternative sampling method and use it to replace HMC. This combination often yields much worse performance in terms of *effective* sampling rates, even if the individual samples are drawn much more rapidly. In this notebook, we show how to implement a conjugate sampling scheme in PyMC3 and compare it against a full-HMC (or, in this case, NUTS) approach. For this case, we find that using conjugate sampling can dramatically speed up computations for a Dirichlet-multinomial model. 
- -+++ - -## Probabilistic model - -+++ - -To keep this notebook simple, we'll consider a relatively simple hierarchical model defined for $N$ observations of a vector of counts across $J$ outcomes:: - -$$\tau \sim Exp(\lambda)$$ -$$\mathbf{p}_i \sim Dir(\tau )$$ -$$\mathbf{x}_i \sim Multinomial(\mathbf{p}_i)$$ - -The index $i\in\{1,...,N\}$ represents the observation while $j\in \{1...,J\}$ indexes the outcome. The variable $\tau$ is a scalar concentration while $\mathbf{p}_i$ is a $J$-vector of probabilities drawn from a Dirichlet prior with entries $(\tau, \tau, ..., \tau)$. With fixed $\tau$ and observed data $x$, we know that $\mathbf{p}$ has a [closed-form posterior distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical/multinomial), meaning that we can easily sample from it. Our sampling scheme will alternate between using the No-U-Turn sampler (NUTS) on $\tau$ and drawing from this known conditional posterior distribution for $\mathbf{p}_i$. We will assume a fixed value for $\lambda$. - -+++ - -## Implementing a custom step method - -+++ - -Adding a conjugate sampler as part of our compound sampling approach is straightforward: we define a new step method that examines the current state of the Markov chain approximation and modifies it by adding samples drawn from the conjugate posterior. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm - -from pymc3.distributions.transforms import stick_breaking -from pymc3.model import modelcontext -from pymc3.step_methods.arraystep import BlockedStep -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -First, we need a method for sampling from a Dirichlet distribution. The built in `numpy.random.dirichlet` can only handle 2D input arrays, and we might like to generalize beyond this in the future. Thus, I have created a function for sampling from a Dirichlet distribution with parameter array `c` by representing it as a normalized sum of Gamma random variables. More detail about this is given [here](https://en.wikipedia.org/wiki/Dirichlet_distribution#Gamma_distribution). - -```{code-cell} ipython3 -def sample_dirichlet(c): - """ - Samples Dirichlet random variables which sum to 1 along their last axis. - """ - gamma = np.random.gamma(c) - p = gamma / gamma.sum(axis=-1, keepdims=True) - return p -``` - -Next, we define the step object used to replace NUTS for part of the computation. It must have a `step` method that receives a dict called `point` containing the current state of the Markov chain. We'll modify it in place. - -There is an extra complication here as PyMC3 does not track the state of the Dirichlet random variable in the form $\mathbf{p}=(p_1, p_2 ,..., p_J)$ with the constraint $\sum_j p_j = 1$. Rather, it uses an inverse stick breaking transformation of the variable which is easier to use with NUTS. This transformation removes the constraint that all entries must sum to 1 and are positive. 
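To make the role of this transformation concrete, here is a small sketch (an assumption-laden illustration, not part of the original notebook): applying `forward_val` to a length-$J$ probability vector should return an unconstrained vector of length $J-1$, which is the representation stored in `point`.

```python
p_example = np.array([0.5, 0.3, 0.2])              # a point on the simplex (sums to 1)
z_example = stick_breaking.forward_val(p_example)  # unconstrained internal representation
print(p_example.shape, z_example.shape)            # expected: (3,) and (2,)
```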
- -```{code-cell} ipython3 -class ConjugateStep(BlockedStep): - def __init__(self, var, counts: np.ndarray, concentration): - self.vars = [var] - self.counts = counts - self.name = var.name - self.conc_prior = concentration - - def step(self, point: dict): - # Since our concentration parameter is going to be log-transformed - # in point, we invert that transformation so that we - # can get conc_posterior = conc_prior + counts - conc_posterior = np.exp(point[self.conc_prior.transformed.name]) + self.counts - draw = sample_dirichlet(conc_posterior) - - # Since our new_p is not in the transformed / unconstrained space, - # we apply the transformation so that our new value - # is consistent with PyMC3's internal representation of p - point[self.name] = stick_breaking.forward_val(draw) - - return point -``` - -The usage of `point` and its indexing variables can be confusing here. The expression `point[self.conc_prior.transformed.name]` in particular is quite long. This expression is necessary because when `step` is called, it is passed a dictionary `point` with string variable names as keys. - -However, the prior parameter's name won't be stored directly in the keys for `point` because PyMC3 stores a transformed variable instead. Thus, we will need to query `point` using the *transformed name* and then undo that transformation. - -To identify the correct variable to query into `point`, we need to take an argument during initialization that tells the sampling step where to find the prior parameter. Thus, we pass `var` into `ConjugateStep` so that the sampler can find the name of the transformed variable (`var.transformed.name`) later. - -+++ - -## Simulated data - -+++ - -We'll try out the sampler on some simulated data. Fixing $\tau=0.5$, we'll draw 500 observations of a 10 dimensional Dirichlet distribution. - -```{code-cell} ipython3 -J = 10 -N = 500 - -ncounts = 20 -tau_true = 0.5 -alpha = tau_true * np.ones([N, J]) -p_true = sample_dirichlet(alpha) -counts = np.zeros([N, J]) - -for i in range(N): - counts[i] = np.random.multinomial(ncounts, p_true[i]) -print(counts.shape) -``` - -## Comparing partial conjugate with full NUTS sampling - -+++ - -We don't have any closed form expression for the posterior distribution of $\tau$ so we will use NUTS on it. In the code cell below, we fit the same model using 1) conjugate sampling on the probability vectors with NUTS on $\tau$, and 2) NUTS for everything. - -```{code-cell} ipython3 -traces = [] -models = [] -names = ["Partial conjugate sampling", "Full NUTS"] - -for use_conjugate in [True, False]: - with pm.Model() as model: - tau = pm.Exponential("tau", lam=1, testval=1.0) - alpha = pm.Deterministic("alpha", tau * np.ones([N, J])) - p = pm.Dirichlet("p", a=alpha) - - if use_conjugate: - # If we use the conjugate sampling, we don't need to define the likelihood - # as it's already taken into account in our custom step method - step = [ConjugateStep(p.transformed, counts, tau)] - - else: - x = pm.Multinomial("x", n=ncounts, p=p, observed=counts) - step = [] - - trace = pm.sample(step=step, chains=2, cores=1, return_inferencedata=True) - traces.append(trace) - - assert all(az.summary(trace)["r_hat"] < 1.1) - models.append(model) -``` - -We see that the runtimes for the partially conjugate sampling are much lower, though this can be misleading if the samples have high autocorrelation or the chains are mixing very slowly. We also see that there are a few divergences in the NUTS-only trace. 
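A quick way to put numbers on the runtime difference is to read the `sampling_time` attribute stored on each posterior group, the same quantity used later for the ESS rate (a short sketch reusing the `names` and `traces` lists defined above):

```python
for name, t in zip(names, traces):
    # wall-clock time spent sampling, recorded by PyMC3 on the posterior group
    print(f"{name}: sampled in {t.posterior.sampling_time:.1f} seconds")
```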
- -+++ - -We want to make sure that the two samplers are converging to the same estimates. The posterior histogram and trace plot below show that both essentially converge to $\tau$ within reasonable posterior uncertainty credible intervals. We can also see that the trace plots lack any obvious autocorrelation as they are mostly indistinguishable from white noise. - -```{code-cell} ipython3 -for name, trace in zip(names, traces): - ax = az.plot_trace(trace, var_names="tau") - ax[0, 0].axvline(0.5, label="True value", color="k") - ax[0, 0].legend() - plt.suptitle(name) -``` - -We want to avoid comparing sampler effectiveness in terms of raw samples per second. If a sampler works quickly per sample but generates highly correlated samples, the effective sample size (ESS) is diminished. Since our posterior analyses are critically dependent on the effective sample size, we should examine this latter quantity instead. - -This model includes $500\times 10=5000$ probability values for the 500 Dirichlet random variables. Let's calculate the effective sample size for each of these 5000 entries and generate a histogram for each sampling method: - -```{code-cell} ipython3 -summaries_p = [] -for trace, model in zip(traces, models): - with model: - summaries_p.append(az.summary(trace, var_names="p")) - -[plt.hist(s["ess_mean"], bins=50, alpha=0.4, label=names[i]) for i, s in enumerate(summaries_p)] -plt.legend(), plt.xlabel("Effective sample size"); -``` - -Interestingly, we see that while the mode of the ESS histogram is larger for the full NUTS run, the minimum ESS appears to be lower. Since our inferences are often constrained by the of the worst-performing part of the Markov chain, the minimum ESS is of interest. - -```{code-cell} ipython3 -print("Minimum effective sample sizes across all entries of p:") -print({names[i]: s["ess_mean"].min() for i, s in enumerate(summaries_p)}) -``` - -Here, we can see that the conjugate sampling scheme gets a similar number of effective samples in the worst case. However, there is an enormous disparity when we consider the effective sampling *rate*. - -```{code-cell} ipython3 -print("Minimum ESS/second across all entries of p:") -print( - { - names[i]: s["ess_mean"].min() / traces[i].posterior.sampling_time - for i, s in enumerate(summaries_p) - } -) -``` - -The partial conjugate sampling scheme is over 10X faster in terms of worst-case ESS rate! - -+++ - -As a final check, we also want to make sure that the probability estimates are the same for both samplers. In the plot below, we can see that estimates from both the partial conjugate sampling and the full NUTS sampling are very closely correlated with the true values. - -```{code-cell} ipython3 -fig, axes = plt.subplots(1, 2, figsize=(10, 5)) -axes[0].scatter( - summaries_p[0]["mean"], - p_true.ravel(), - s=2, - label="Partial conjugate sampling", - zorder=2, - alpha=0.3, - color="b", -) -axes[0].set_ylabel("Posterior estimates"), axes[0].set_xlabel("True values") - -axes[1].scatter( - summaries_p[1]["mean"], - p_true.ravel(), - s=2, - alpha=0.3, - color="orange", -) -axes[1].set_ylabel("Posterior estimates"), axes[1].set_xlabel("True values") - -[axes[i].set_title(n) for i, n in enumerate(names)]; -``` - -* This notebook was written by Christopher Krapu on November 17, 2020. 
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/howto/updating_priors.myst.md b/myst_nbs/howto/updating_priors.myst.md deleted file mode 100644 index 510ed5199..000000000 --- a/myst_nbs/howto/updating_priors.myst.md +++ /dev/null @@ -1,171 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -# Updating priors - -+++ - -In this notebook, I will show how it is possible to update the priors as new data becomes available. The example is a slightly modified version of the linear regression in the [Getting started with PyMC3](https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/getting_started.ipynb) notebook. - -```{code-cell} ipython3 -%matplotlib inline -import warnings - -import arviz as az -import matplotlib as mpl -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt - -from pymc3 import Model, Normal, Slice, sample -from pymc3.distributions import Interpolated -from scipy import stats -from theano import as_op - -plt.style.use("seaborn-darkgrid") -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -warnings.filterwarnings("ignore") -``` - -## Generating data - -```{code-cell} ipython3 -# Initialize random number generator -np.random.seed(93457) - -# True parameter values -alpha_true = 5 -beta0_true = 7 -beta1_true = 13 - -# Size of dataset -size = 100 - -# Predictor variable -X1 = np.random.randn(size) -X2 = np.random.randn(size) * 0.2 - -# Simulate outcome variable -Y = alpha_true + beta0_true * X1 + beta1_true * X2 + np.random.randn(size) -``` - -## Model specification - -+++ - -Our initial beliefs about the parameters are quite informative (sigma=1) and a bit off the true values. - -```{code-cell} ipython3 -basic_model = Model() - -with basic_model: - - # Priors for unknown model parameters - alpha = Normal("alpha", mu=0, sigma=1) - beta0 = Normal("beta0", mu=12, sigma=1) - beta1 = Normal("beta1", mu=18, sigma=1) - - # Expected value of outcome - mu = alpha + beta0 * X1 + beta1 * X2 - - # Likelihood (sampling distribution) of observations - Y_obs = Normal("Y_obs", mu=mu, sigma=1, observed=Y) - - # draw 1000 posterior samples - trace = sample(1000) -``` - -```{code-cell} ipython3 -az.plot_trace(trace); -``` - -In order to update our beliefs about the parameters, we use the posterior distributions, which will be used as the prior distributions for the next inference. The data used for each inference iteration has to be independent from the previous iterations, otherwise the same (possibly wrong) belief is injected over and over in the system, amplifying the errors and misleading the inference. By ensuring the data is independent, the system should converge to the true parameter values. - -Because we draw samples from the posterior distribution (shown on the right in the figure above), we need to estimate their probability density (shown on the left in the figure above). [Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE) is a way to achieve this, and we will use this technique here. In any case, it is an empirical distribution that cannot be expressed analytically. Fortunately PyMC3 provides a way to use custom distributions, via `Interpolated` class. 
- -```{code-cell} ipython3 -def from_posterior(param, samples): - smin, smax = np.min(samples), np.max(samples) - width = smax - smin - x = np.linspace(smin, smax, 100) - y = stats.gaussian_kde(samples)(x) - - # what was never sampled should have a small probability but not 0, - # so we'll extend the domain and use linear approximation of density on it - x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]]) - y = np.concatenate([[0], y, [0]]) - return Interpolated(param, x, y) -``` - -Now we just need to generate more data and build our Bayesian model so that the prior distributions for the current iteration are the posterior distributions from the previous iteration. It is still possible to continue using NUTS sampling method because `Interpolated` class implements calculation of gradients that are necessary for Hamiltonian Monte Carlo samplers. - -```{code-cell} ipython3 -traces = [trace] -``` - -```{code-cell} ipython3 -for _ in range(10): - - # generate more data - X1 = np.random.randn(size) - X2 = np.random.randn(size) * 0.2 - Y = alpha_true + beta0_true * X1 + beta1_true * X2 + np.random.randn(size) - - model = Model() - with model: - # Priors are posteriors from previous iteration - alpha = from_posterior("alpha", trace["alpha"]) - beta0 = from_posterior("beta0", trace["beta0"]) - beta1 = from_posterior("beta1", trace["beta1"]) - - # Expected value of outcome - mu = alpha + beta0 * X1 + beta1 * X2 - - # Likelihood (sampling distribution) of observations - Y_obs = Normal("Y_obs", mu=mu, sigma=1, observed=Y) - - # draw 10000 posterior samples - trace = sample(1000) - traces.append(trace) -``` - -```{code-cell} ipython3 -print("Posterior distributions after " + str(len(traces)) + " iterations.") -cmap = mpl.cm.autumn -for param in ["alpha", "beta0", "beta1"]: - plt.figure(figsize=(8, 2)) - for update_i, trace in enumerate(traces): - samples = trace[param] - smin, smax = np.min(samples), np.max(samples) - x = np.linspace(smin, smax, 100) - y = stats.gaussian_kde(samples)(x) - plt.plot(x, y, color=cmap(1 - update_i / len(traces))) - plt.axvline({"alpha": alpha_true, "beta0": beta0_true, "beta1": beta1_true}[param], c="k") - plt.ylabel("Frequency") - plt.title(param) - -plt.tight_layout(); -``` - -You can re-execute the last two cells to generate more updates. - -What is interesting to note is that the posterior distributions for our parameters tend to get centered on their true value (vertical lines), and the distribution gets thiner and thiner. This means that we get more confident each time, and the (false) belief we had at the beginning gets flushed away by the new data we incorporate. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/mixture_models/dependent_density_regression.myst.md b/myst_nbs/mixture_models/dependent_density_regression.myst.md deleted file mode 100644 index d387ba8dc..000000000 --- a/myst_nbs/mixture_models/dependent_density_regression.myst.md +++ /dev/null @@ -1,291 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Dependent density regression -In another [example](dp_mix.ipynb), we showed how to use Dirichlet processes to perform Bayesian nonparametric density estimation. This example expands on the previous one, illustrating dependent density regression. 
- -Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as infinite [mixtures of experts](https://en.wikipedia.org/wiki/Committee_machine) that select the active experts as part of inference. Their flexibility and modularity make them powerful tools for performing nonparametric Bayesian data analysis. - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc3 as pm -import seaborn as sns - -from IPython.display import HTML -from matplotlib import animation as ani -from matplotlib import pyplot as plt -from theano import tensor as tt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -plt.rc("animation", writer="ffmpeg") -blue, *_ = sns.color_palette() -az.style.use("arviz-darkgrid") -SEED = 972915 # from random.org; for reproducibility -np.random.seed(SEED) -``` - -We will use the LIDAR data set from Larry Wasserman's excellent book, [_All of Nonparametric Statistics_](http://www.stat.cmu.edu/~larry/all-of-nonpar/). We standardize the data set to improve the rate of convergence of our samples. - -```{code-cell} ipython3 -DATA_URI = "http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat" - - -def standardize(x): - return (x - x.mean()) / x.std() - - -df = pd.read_csv(DATA_URI, sep=r"\s{1,3}", engine="python").assign( - std_range=lambda df: standardize(df.range), std_logratio=lambda df: standardize(df.logratio) -) -``` - -```{code-cell} ipython3 -df.head() -``` - -We plot the LIDAR data below. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.scatter(df.std_range, df.std_logratio, color=blue) - -ax.set_xticklabels([]) -ax.set_xlabel("Standardized range") - -ax.set_yticklabels([]) -ax.set_ylabel("Standardized log ratio"); -``` - -This data set has two interesting properties that make it useful for illustrating dependent density regression. - -1. The relationship between range and log ratio is nonlinear, but has locally linear components. -2. The observation noise is [heteroskedastic](https://en.wikipedia.org/wiki/Heteroscedasticity); that is, the magnitude of the variance varies with the range. - -The intuitive idea behind dependent density regression is to reduce the problem to many (related) density estimates, conditioned on fixed values of the predictors. The following animation illustrates this intuition.
- -```{code-cell} ipython3 -fig, (scatter_ax, hist_ax) = plt.subplots(ncols=2, figsize=(16, 6)) - -scatter_ax.scatter(df.std_range, df.std_logratio, color=blue, zorder=2) - -scatter_ax.set_xticklabels([]) -scatter_ax.set_xlabel("Standardized range") - -scatter_ax.set_yticklabels([]) -scatter_ax.set_ylabel("Standardized log ratio") - -bins = np.linspace(df.std_range.min(), df.std_range.max(), 25) - -hist_ax.hist(df.std_logratio, bins=bins, color="k", lw=0, alpha=0.25, label="All data") - -hist_ax.set_xticklabels([]) -hist_ax.set_xlabel("Standardized log ratio") - -hist_ax.set_yticklabels([]) -hist_ax.set_ylabel("Frequency") - -hist_ax.legend(loc=2) - -endpoints = np.linspace(1.05 * df.std_range.min(), 1.05 * df.std_range.max(), 15) - -frame_artists = [] - -for low, high in zip(endpoints[:-1], endpoints[2:]): - interval = scatter_ax.axvspan(low, high, color="k", alpha=0.5, lw=0, zorder=1) - *_, bars = hist_ax.hist( - df[df.std_range.between(low, high)].std_logratio, bins=bins, color="k", lw=0, alpha=0.5 - ) - - frame_artists.append((interval,) + tuple(bars)) - -animation = ani.ArtistAnimation(fig, frame_artists, interval=500, repeat_delay=3000, blit=True) -plt.close() -# prevent the intermediate figure from showing -``` - -```{code-cell} ipython3 -HTML(animation.to_html5_video()) -``` - -As we slice the data with a window sliding along the x-axis in the left plot, the empirical distribution of the y-values of the points in the window varies in the right plot. An important aspect of this approach is that the density estimates that correspond to close values of the predictor are similar. - -In the previous example, we saw that a Dirichlet process estimates a probability density as a mixture model with infinitely many components. In the case of normal component distributions, - -$$y \sim \sum_{i = 1}^{\infty} w_i \cdot N(\mu_i, \tau_i^{-1}),$$ - -where the mixture weights, $w_1, w_2, \ldots$, are generated by a [stick-breaking process](https://en.wikipedia.org/wiki/Dirichlet_process#The_stick-breaking_process). - -Dependent density regression generalizes this representation of the Dirichlet process mixture model by allowing the mixture weights and component means to vary conditioned on the value of the predictor, $x$. That is, - -$$y\ |\ x \sim \sum_{i = 1}^{\infty} w_i\ |\ x \cdot N(\mu_i\ |\ x, \tau_i^{-1}).$$ - -In this example, we will follow Chapter 23 of [_Bayesian Data Analysis_](http://www.stat.columbia.edu/~gelman/book/) and use a probit stick-breaking process to determine the conditional mixture weights, $w_i\ |\ x$. The probit stick-breaking process starts by defining - -$$v_i\ |\ x = \Phi(\alpha_i + \beta_i x),$$ - -where $\Phi$ is the cumulative distribution function of the standard normal distribution. We then obtain $w_i\ |\ x$ by applying the stick breaking process to $v_i\ |\ x$. That is, - -$$w_i\ |\ x = v_i\ |\ x \cdot \prod_{j = 1}^{i - 1} (1 - v_j\ |\ x).$$ - -For the LIDAR data set, we use independent normal priors $\alpha_i \sim N(0, 5^2)$ and $\beta_i \sim N(0, 5^2)$. We now express this this model for the conditional mixture weights using `PyMC3`. 
- -```{code-cell} ipython3 -def norm_cdf(z): - return 0.5 * (1 + tt.erf(z / np.sqrt(2))) - - -def stick_breaking(v): - return v * tt.concatenate( - [tt.ones_like(v[:, :1]), tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]], axis=1 - ) -``` - -```{code-cell} ipython3 -N = len(df) -K = 20 - -std_range = df.std_range.values[:, np.newaxis] -std_logratio = df.std_logratio.values - -with pm.Model(coords={"N": np.arange(N), "K": np.arange(K) + 1, "one": [1]}) as model: - alpha = pm.Normal("alpha", 0.0, 5.0, dims="K") - beta = pm.Normal("beta", 0.0, 5.0, dims=("one", "K")) - x = pm.Data("x", std_range) - v = norm_cdf(alpha + pm.math.dot(x, beta)) - w = pm.Deterministic("w", stick_breaking(v), dims=["N", "K"]) -``` - -We have defined `x` as a `pm.Data` container in order to use `PyMC3`'s posterior prediction capabilities later. - -While the dependent density regression model theoretically has infinitely many components, we must truncate the model to finitely many components (in this case, twenty) in order to express it using `PyMC3`. After sampling from the model, we will verify that truncation did not unduly influence our results. - -Since the LIDAR data seems to have several linear components, we use the linear models - -$$ -\begin{align*} -\mu_i\ |\ x - & \sim \gamma_i + \delta_i x \\ -\gamma_i - & \sim N(0, 10^2) \\ -\delta_i - & \sim N(0, 10^2) -\end{align*} -$$ - -for the conditional component means. - -```{code-cell} ipython3 -with model: - gamma = pm.Normal("gamma", 0.0, 10.0, dims="K") - delta = pm.Normal("delta", 0.0, 10.0, dims=("one", "K")) - mu = pm.Deterministic("mu", gamma + pm.math.dot(x, delta)) -``` - -Finally, we place the prior $\tau_i \sim \textrm{Gamma}(1, 1)$ on the component precisions. - -```{code-cell} ipython3 -with model: - tau = pm.Gamma("tau", 1.0, 1.0, dims="K") - y = pm.Data("y", std_logratio) - obs = pm.NormalMixture("obs", w, mu, tau=tau, observed=y) - -pm.model_to_graphviz(model) -``` - -We now sample from the dependent density regression model. - -```{code-cell} ipython3 -SAMPLES = 20000 -BURN = 10000 - -with model: - step = pm.Metropolis() - trace = pm.sample(SAMPLES, tune=BURN, step=step, random_seed=SEED, return_inferencedata=True) -``` - -To verify that truncation did not unduly influence our results, we plot the largest posterior expected mixture weight for each component. (In this model, each point has a mixture weight for each component, so we plot the maximum mixture weight for each component across all data points in order to judge if the component exerts any influence on the posterior.) - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -max_mixture_weights = trace.posterior["w"].mean(("chain", "draw")).max("N") -ax.bar(max_mixture_weights.coords.to_index(), max_mixture_weights) - -ax.set_xlim(1 - 0.5, K + 0.5) -ax.set_xticks(np.arange(0, K, 2) + 1) -ax.set_xlabel("Mixture component") - -ax.set_ylabel("Largest posterior expected\nmixture weight"); -``` - -Since only three mixture components have appreciable posterior expected weight for any data point, we can be fairly certain that truncation did not unduly influence our results. (If most components had appreciable posterior expected weight, truncation may have influenced the results, and we would have increased the number of components and sampled again.) - -Visually, it is reasonable that the LIDAR data has three linear components, so these posterior expected weights seem to have identified the structure of the data well. 
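One way to make "appreciable" concrete is to count the components whose largest posterior expected weight exceeds a small threshold (a sketch reusing `max_mixture_weights` from the cell above; the 0.05 cutoff is arbitrary):

```python
threshold = 0.05  # arbitrary cutoff for an "appreciable" weight
n_active = int((max_mixture_weights > threshold).sum())
print(f"{n_active} of {K} components have largest expected weight above {threshold}")
```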
We now sample from the posterior predictive distribution to get a better understand the model's performance. - -```{code-cell} ipython3 -PP_SAMPLES = 5000 - -lidar_pp_x = np.linspace(std_range.min() - 0.05, std_range.max() + 0.05, 100) - -with model: - pm.set_data({"x": lidar_pp_x[:, np.newaxis]}) - pp_trace = pm.sample_posterior_predictive(trace, PP_SAMPLES, random_seed=SEED) -``` - -Below we plot the posterior expected value and the 95% posterior credible interval. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.scatter(df.std_range, df.std_logratio, color=blue, zorder=10, label=None) - -low, high = np.percentile(pp_trace["obs"], [2.5, 97.5], axis=0) -ax.fill_between( - lidar_pp_x, low, high, color="k", alpha=0.35, zorder=5, label="95% posterior credible interval" -) - -ax.plot(lidar_pp_x, pp_trace["obs"].mean(axis=0), c="k", zorder=6, label="Posterior expected value") - -ax.set_xticklabels([]) -ax.set_xlabel("Standardized range") - -ax.set_yticklabels([]) -ax.set_ylabel("Standardized log ratio") - -ax.legend(loc=1) -ax.set_title("LIDAR Data"); -``` - -The model has fit the linear components of the data well, and also accommodated its heteroskedasticity. This flexibility, along with the ability to modularly specify the conditional mixture weights and conditional component densities, makes dependent density regression an extremely useful nonparametric Bayesian model. - -To learn more about dependent density regression and related models, consult [_Bayesian Data Analysis_](http://www.stat.columbia.edu/~gelman/book/), [_Bayesian Nonparametric Data Analysis_](http://www.springer.com/us/book/9783319189673), or [_Bayesian Nonparametrics_](https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=bayesian+nonparametrics+book). - -This example first appeared [here](http://austinrochford.com/posts/2017-01-18-ddp-pymc3.html). - -Author: [Austin Rochford](https://github.com/AustinRochford/) - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/mixture_models/dirichlet_mixture_of_multinomials.myst.md b/myst_nbs/mixture_models/dirichlet_mixture_of_multinomials.myst.md deleted file mode 100644 index 51a15e64a..000000000 --- a/myst_nbs/mixture_models/dirichlet_mixture_of_multinomials.myst.md +++ /dev/null @@ -1,568 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(dirichlet_mixture_of_multinomials)= -# Dirichlet mixtures of multinomials - -:::{post} Jan 8, 2022 -:tags: mixture model, -:category: advanced -:author: Byron J. Smith, Abhipsha Das, Oriol Abril-Pla -::: - -+++ - -This example notebook demonstrates the use of a -Dirichlet mixture of multinomials -(a.k.a [Dirichlet-multinomial](https://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution) or DM) -to model categorical count data. -Models like this one are important in a variety of areas, including -natural language processing, ecology, bioinformatics, and more. - -The Dirichlet-multinomial can be understood as draws from a [Multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) -where each sample has a slightly different probability vector, which is itself drawn from a common [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution). 
-This contrasts with the Multinomial distribution, which assumes that all observations arise from a single fixed probability vector. -This enables the Dirichlet-multinomial to accommodate more variable (a.k.a, over-dispersed) count data than the Multinomial. - -Other examples of over-dispersed count distributions are the -[Beta-binomial](https://en.wikipedia.org/wiki/Beta-binomial_distribution) -(which can be thought of as a special case of the DM) or the -[Negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution) -distributions. - -The DM is also an example of marginalizing -a mixture distribution over its latent parameters. -This notebook will demonstrate the performance benefits that come from taking that approach. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import scipy as sp -import scipy.stats -import seaborn as sns - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## Simulation data - -+++ - -Let us simulate some over-dispersed, categorical count data -for this example. - -Here we are simulating from the DM distribution itself, -so it is perhaps tautological to fit that model, -but rest assured that data like these really do appear in -the counts of different: - -1. words in text corpuses {cite:p}`madsen2005modelingdirichlet`, -2. types of RNA molecules in a cell {cite:p}`nowicka2016drimseq`, -3. items purchased by shoppers {cite:p}`goodhardt1984thedirichlet`. - -Here we will discuss a community ecology example, pretending that we have observed counts of $k=5$ different -tree species in $n=10$ different forests. - -Our simulation will produce a two-dimensional matrix of integers (counts) -where each row, (zero-)indexed by $i \in (0...n-1)$, is an observation (different forest), and each -column $j \in (0...k-1)$ is a category (tree species). -We'll parameterize this distribution with three things: -- $\mathrm{frac}$ : the expected fraction of each species, - a $k$-dimensional vector on the simplex (i.e. sums-to-one) -- $\mathrm{total\_count}$ : the total number of items tallied in each observation, -- $\mathrm{conc}$ : the concentration, controlling the overdispersion of our data, - where larger values result in our distribution more closely approximating the multinomial. - -Here, and throughout this notebook, we've used a -[convenient reparameterization](https://mc-stan.org/docs/2_26/stan-users-guide/reparameterizations.html#dirichlet-priors) -of the Dirichlet distribution -from one to two parameters, -$\alpha=\mathrm{conc} \times \mathrm{frac}$, as this -fits our desired interpretation. - -Each observation from the DM is simulated by: -1. first obtaining a value on the $k$-simplex simulated as - $p_i \sim \mathrm{Dirichlet}(\alpha=\mathrm{conc} \times \mathrm{frac})$, -2. and then simulating $\mathrm{counts}_i \sim \mathrm{Multinomial}(\mathrm{total\_count}, p_i)$. - -Notice that each observation gets its _own_ -latent parameter $p_i$, simulated independently from -a common Dirichlet distribution. 
- -```{code-cell} ipython3 -true_conc = 6.0 -true_frac = np.array([0.45, 0.30, 0.15, 0.09, 0.01]) -trees = ["pine", "oak", "ebony", "rosewood", "mahogany"] # Tree species observed -# fmt: off -forests = [ # Forests observed - "sunderbans", "amazon", "arashiyama", "trossachs", "valdivian", - "bosc de poblet", "font groga", "monteverde", "primorye", "daintree", -] -# fmt: on -k = len(trees) -n = len(forests) -total_count = 50 - -true_p = sp.stats.dirichlet(true_conc * true_frac).rvs(size=n) -observed_counts = np.vstack([sp.stats.multinomial(n=total_count, p=p_i).rvs() for p_i in true_p]) - -observed_counts -``` - -## Multinomial model - -+++ - -The first model that we will fit to these data is a plain -multinomial model, where the only parameter is the -expected fraction of each category, $\mathrm{frac}$, which we will give a Dirichlet prior. -While the uniform prior ($\alpha_j=1$ for each $j$) works well, if we have independent beliefs about the fraction of each tree, -we could encode this into our prior, e.g. -increasing the value of $\alpha_j$ where we expect a higher fraction of species-$j$. - -```{code-cell} ipython3 -coords = {"tree": trees, "forest": forests} -with pm.Model(coords=coords) as model_multinomial: - frac = pm.Dirichlet("frac", a=np.ones(k), dims="tree") - counts = pm.Multinomial( - "counts", n=total_count, p=frac, observed=observed_counts, dims=("forest", "tree") - ) - -pm.model_to_graphviz(model_multinomial) -``` - -Interestingly, NUTS frequently runs into numerical problems on this model, perhaps an example of the -["Folk Theorem of Statistical Computing"](https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/). - -Because of a couple of identities of the multinomial distribution, -we could reparameterize this model in a number of ways—we -would obtain equivalent models by exploding our $n$ observations -of $\mathrm{total\_count}$ items into $(n \times \mathrm{total\_count})$ -independent categorical trials, or collapsing them down into -one Multinomial draw with $(n \times \mathrm{total\_count})$ items. -(Importantly, this is _not_ true for the DM distribution.) - -Rather than _actually_ fixing our problem through reparameterization, -here we'll instead switch to the Metropolis step method, -which ignores some of the geometric pathologies of our naïve model. - -**Important**: switching to Metropolis does not not _fix_ our model's issues, rather it _sweeps them under the rug_. -In fact, if you try running this model with NUTS (PyMC3's default step method), it will break loudly during sampling. -When that happens, this should be a **red alert** that there is something wrong in our model. - -You'll also notice below that we have to increase considerably the number of draws we take from the posterior; -this is because Metropolis is much less efficient at -exploring the posterior than NUTS. - -```{code-cell} ipython3 -with model_multinomial: - trace_multinomial = pm.sample( - draws=5000, chains=4, step=pm.Metropolis(), return_inferencedata=True - ) -``` - -Let's ignore the warning about inefficient sampling for now. - -```{code-cell} ipython3 -az.plot_trace(data=trace_multinomial, var_names=["frac"]); -``` - -The trace plots look fairly good; -visually, each parameter appears to be moving around the posterior well, -although some sharp parts of the KDE plot suggests that -sampling sometimes gets stuck in one place for a few steps. 
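Since Metropolis only moves when a proposal is accepted, "getting stuck" can be quantified as the fraction of draws that are identical to the previous one (a sketch using plain xarray operations on the posterior group):

```python
frac_draws = trace_multinomial.posterior["frac"]
# a draw is "stuck" if every component of frac is unchanged from the previous draw
stuck_fraction = (frac_draws.diff("draw") == 0).all("tree").mean()
print(f"fraction of draws identical to the previous draw: {float(stuck_fraction):.2f}")
```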
- -```{code-cell} ipython3 -summary_multinomial = az.summary(trace_multinomial, var_names=["frac"]) - -summary_multinomial = summary_multinomial.assign( - ess_bulk_per_sec=lambda x: x.ess_bulk / trace_multinomial.posterior.sampling_time, -) - -summary_multinomial -``` - -Likewise, diagnostics in the parameter summary table all look fine. -Here I've added a column estimating the effective sample size per -second of sampling. - -Nonetheless, the fact that we were unable to use NUTS is still a red flag, and we should be -very cautious in using these results. - -```{code-cell} ipython3 -az.plot_forest(trace_multinomial, var_names=["frac"]) -for j, (y_tick, frac_j) in enumerate(zip(plt.gca().get_yticks(), reversed(true_frac))): - plt.vlines(frac_j, ymin=y_tick - 0.45, ymax=y_tick + 0.45, color="black", linestyle="--") -``` - -Here we've drawn a forest-plot, showing the mean and 94% HDIs from our posterior approximation. -Interestingly, because we know what the underlying -frequencies are for each species (dashed lines), we can comment on the accuracy -of our inferences. -And now the issues with our model become apparent; -notice that the 94% HDIs _don't include the true values_ for -tree species 0, 2, 3. -We might have seen _one_ HDI miss, but _three_??? - -...what's going on? - -Let's troubleshoot this model using a posterior-predictive check, comparing our data to simulated data conditioned on our posterior estimates. - -```{code-cell} ipython3 -with model_multinomial: - pp_samples = az.from_pymc3( - posterior_predictive=pm.fast_sample_posterior_predictive(trace=trace_multinomial) - ) - -# Concatenate with InferenceData object -trace_multinomial.extend(pp_samples) -``` - -```{code-cell} ipython3 -cmap = plt.get_cmap("tab10") - -fig, axs = plt.subplots(k, 1, sharex=True, sharey=True, figsize=(6, 8)) -for j, ax in enumerate(axs): - c = cmap(j) - ax.hist( - trace_multinomial.posterior_predictive.counts.sel(tree=trees[j]).values.flatten(), - bins=np.arange(total_count), - histtype="step", - color=c, - density=True, - label="Post.Pred.", - ) - ax.hist( - (trace_multinomial.observed_data.counts.sel(tree=trees[j]).values.flatten()), - bins=np.arange(total_count), - color=c, - density=True, - alpha=0.25, - label="Observed", - ) - ax.axvline( - true_frac[j] * total_count, - color=c, - lw=1.0, - alpha=0.45, - label="True", - ) - ax.annotate( - f"{trees[j]}", - xy=(0.96, 0.9), - xycoords="axes fraction", - ha="right", - va="top", - color=c, - ) - -axs[-1].legend(loc="upper center", fontsize=10) -axs[-1].set_xlabel("Count") -axs[-1].set_yticks([0, 0.5, 1.0]) -axs[-1].set_ylim(0, 0.6); -``` - -Here we're plotting histograms of the predicted counts -against the observed counts for each species. - -_(Notice that the y-axis isn't full height and clips the distributions for species-4 in purple.)_ - -And now we can start to see why our posterior HDI deviates from the _true_ parameters for three of five species (vertical lines). -See that for all of the species the observed counts are frequently quite far from the predictions -conditioned on the posterior distribution. -This is particularly obvious for (e.g.) species-2 where we have one observation of more than 20 -trees of this species, despite the posterior predicitive mass being concentrated far below that. - -This is overdispersion at work, and a clear sign that we need to adjust our model to accommodate it. - -Posterior predictive checks are one of the best ways to diagnose model misspecification, -and this example is no different. 
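-
-If you prefer a number to go with the picture, a rough tail-area check works too (a sketch only; the 94% interval and the "fraction outside" summary are arbitrary choices, not part of the original analysis):
-
-```{code-cell} ipython3
-# Fraction of observed counts falling outside the central 94% posterior predictive interval.
-# For a well-calibrated model we would expect this to be roughly 6%.
-pp_counts = trace_multinomial.posterior_predictive["counts"]
-lower = pp_counts.quantile(0.03, dim=("chain", "draw"))
-upper = pp_counts.quantile(0.97, dim=("chain", "draw"))
-obs_counts = trace_multinomial.observed_data["counts"]
-outside = ((obs_counts < lower) | (obs_counts > upper)).mean().item()
-print(f"{outside:.0%} of observed counts fall outside the 94% predictive interval")
-```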
- -+++ - -## Dirichlet-Multinomial Model - Explicit Mixture - -+++ - -Let's go ahead and model our data using the DM distribution. - -For this model we'll keep the same prior on the expected frequencies of each -species, $\mathrm{frac}$. -We'll also add a strictly positive parameter, $\mathrm{conc}$, for the concentration. - -In this iteration of our model we'll explicitly include the latent multinomial -probability, $p_i$, modeling the $\mathrm{true\_p}_i$ from our simulations (which we would not -observe in the real world). - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model_dm_explicit: - frac = pm.Dirichlet("frac", a=np.ones(k), dims="tree") - conc = pm.Lognormal("conc", mu=1, sigma=1) - p = pm.Dirichlet("p", a=frac * conc, dims=("forest", "tree")) - counts = pm.Multinomial( - "counts", n=total_count, p=p, observed=observed_counts, dims=("forest", "tree") - ) - -pm.model_to_graphviz(model_dm_explicit) -``` - -Compare this diagram to the first. -Here the latent, Dirichlet distributed $p$ separates the multinomial from the expected frequencies, $\mathrm{frac}$, -accounting for overdispersion of counts relative to the simple multinomial model. - -```{code-cell} ipython3 -with model_dm_explicit: - trace_dm_explicit = pm.sample(chains=4, return_inferencedata=True) -``` - -We got a warning, although we'll ignore it for now. -More interesting is how much longer it took to sample this model than the -first. -This may be because our model has an additional ~$(n \times k)$ parameters, -but it seems like there are other geometric challenges for NUTS as well. - -We'll see if we can fix these in the next model, but for now let's take a look at the traces. - -```{code-cell} ipython3 -az.plot_trace(data=trace_dm_explicit, var_names=["frac", "conc"]); -``` - -Obviously some sampling issues, but it's hard to see where divergences are occurring. - -```{code-cell} ipython3 -az.plot_forest(trace_dm_explicit, var_names=["frac"]) -for j, (y_tick, frac_j) in enumerate(zip(plt.gca().get_yticks(), reversed(true_frac))): - plt.vlines(frac_j, ymin=y_tick - 0.45, ymax=y_tick + 0.45, color="black", linestyle="--") -``` - -On the other hand, since we know the ground-truth for $\mathrm{frac}$, -we can congratulate ourselves that -the HDIs include the true values for all of our species! - -Modeling this mixture has made our inferences robust to the overdispersion of counts, -while the plain multinomial is very sensitive. -Notice that the HDI is much wider than before for each $\mathrm{frac}_i$. -In this case that makes the difference between correct and incorrect inferences. - -```{code-cell} ipython3 -summary_dm_explicit = az.summary(trace_dm_explicit, var_names=["frac", "conc"]) -summary_dm_explicit = summary_dm_explicit.assign( - ess_bulk_per_sec=lambda x: x.ess_bulk / trace_dm_explicit.posterior.sampling_time, -) - -summary_dm_explicit -``` - -This is great, but _we can do better_. -The larger $\hat{R}$ value for $\mathrm{frac}_4$ is mildly concerning, and it's surprising -that our $\mathrm{ESS} \; \mathrm{sec}^{-1}$ is relatively small. - -+++ - -## Dirichlet-Multinomial Model - Marginalized - -+++ - -Happily, the Dirichlet distribution is conjugate to the multinomial -and therefore there's a convenient, closed-form for the marginalized -distribution, i.e. the Dirichlet-multinomial distribution, which was added to PyMC3 in [3.11.0](https://github.com/pymc-devs/pymc3/releases/tag/v3.11.0). 
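-
-For reference, integrating the latent $p$ out of the multinomial against its Dirichlet prior gives the DM likelihood in closed form (a standard result, written here in the $\alpha = \mathrm{conc} \times \mathrm{frac}$ parameterization used above):
-
-$$P(\mathrm{counts}_i \mid \alpha) = \frac{\mathrm{total\_count}!\ \Gamma\left(\sum_j \alpha_j\right)}{\Gamma\left(\mathrm{total\_count} + \sum_j \alpha_j\right)} \prod_j \frac{\Gamma(\mathrm{counts}_{ij} + \alpha_j)}{\mathrm{counts}_{ij}!\ \Gamma(\alpha_j)}.$$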
- -Let's take advantage of this, marginalizing out the explicit latent parameter, $p_i$, -replacing the combination of this node and the multinomial -with the DM to make an equivalent model. - -```{code-cell} ipython3 -with pm.Model(coords=coords) as model_dm_marginalized: - frac = pm.Dirichlet("frac", a=np.ones(k), dims="tree") - conc = pm.Lognormal("conc", mu=1, sigma=1) - counts = pm.DirichletMultinomial( - "counts", n=total_count, a=frac * conc, observed=observed_counts, dims=("forest", "tree") - ) - -pm.model_to_graphviz(model_dm_marginalized) -``` - -The plate diagram shows that we've collapsed what had been the latent Dirichlet and the multinomial -nodes together into a single DM node. - -```{code-cell} ipython3 -with model_dm_marginalized: - trace_dm_marginalized = pm.sample(chains=4, return_inferencedata=True) -``` - -It samples much more quickly and without any of the warnings from before! - -```{code-cell} ipython3 -az.plot_trace(data=trace_dm_marginalized, var_names=["frac", "conc"]); -``` - -Trace plots look fuzzy and KDEs are clean. - -```{code-cell} ipython3 -summary_dm_marginalized = az.summary(trace_dm_marginalized, var_names=["frac", "conc"]) -summary_dm_marginalized = summary_dm_marginalized.assign( - ess_mean_per_sec=lambda x: x.ess_bulk / trace_dm_marginalized.posterior.sampling_time, -) -assert all(summary_dm_marginalized.r_hat < 1.03) - -summary_dm_marginalized -``` - -We see that $\hat{R}$ is close to $1$ everywhere -and $\mathrm{ESS} \; \mathrm{sec}^{-1}$ is much higher. -Our reparameterization (marginalization) has greatly improved the sampling! -(And, thankfully, the HDIs look similar to the other model.) - -This all looks very good, but what if we didn't have the ground-truth? - -Posterior predictive checks to the rescue (again)! - -```{code-cell} ipython3 -with model_dm_marginalized: - pp_samples = az.from_pymc3( - posterior_predictive=pm.fast_sample_posterior_predictive(trace_dm_marginalized) - ) - -# Concatenate with InferenceData object -trace_dm_marginalized.extend(pp_samples) -``` - -```{code-cell} ipython3 -cmap = plt.get_cmap("tab10") - -fig, axs = plt.subplots(k, 2, sharex=True, sharey=True, figsize=(8, 8)) -for j, row in enumerate(axs): - c = cmap(j) - for _trace, ax in zip([trace_dm_marginalized, trace_multinomial], row): - ax.hist( - _trace.posterior_predictive.counts.sel(tree=trees[j]).values.flatten(), - bins=np.arange(total_count), - histtype="step", - color=c, - density=True, - label="Post.Pred.", - ) - ax.hist( - (_trace.observed_data.counts.sel(tree=trees[j]).values.flatten()), - bins=np.arange(total_count), - color=c, - density=True, - alpha=0.25, - label="Observed", - ) - ax.axvline( - true_frac[j] * total_count, - color=c, - lw=1.0, - alpha=0.45, - label="True", - ) - row[1].annotate( - f"{trees[j]}", - xy=(0.96, 0.9), - xycoords="axes fraction", - ha="right", - va="top", - color=c, - ) - -axs[-1, -1].legend(loc="upper center", fontsize=10) -axs[0, 1].set_title("Multinomial") -axs[0, 0].set_title("Dirichlet-multinomial") -axs[-1, 0].set_xlabel("Count") -axs[-1, 1].set_xlabel("Count") -axs[-1, 0].set_yticks([0, 0.5, 1.0]) -axs[-1, 0].set_ylim(0, 0.6) -ax.set_ylim(0, 0.6); -``` - -_(Notice, again, that the y-axis isn't full height, and clips the distributions for species-4 in purple.)_ - -Compared to the multinomial (plots on the right), PPCs for the DM (left) show that the observed data is -an entirely reasonable realization of our model. -This is great news! 
- -+++ - -## Model Comparison - -+++ - -Let's go a step further and try to put a number on how much better our DM model is -relative to the raw multinomial. -We'll use leave-one-out cross validation to compare the -out-of-sample predictive ability of the two. - -```{code-cell} ipython3 -az.compare( - {"multinomial": trace_multinomial, "dirichlet_multinomial": trace_dm_marginalized}, ic="loo" -) -``` - -Unsurprisingly, the DM outclasses the multinomial by a mile, assigning a weight of nearly -100% to the over-dispersed model. -We can conclude that between the two, the DM should be greatly favored for prediction, -parameter inference, etc. - -+++ - -## Conclusions - -Obviously the DM is not a perfect model in every case, but it is often a better choice than the multinomial, much more robust while taking on just one additional parameter. - -There are a number of shortcomings to the DM that we should keep in mind when selecting a model. -The biggest problem is that, while more flexible than the multinomial, the DM -still ignores the possibility of underlying correlations between categories. -If one of our tree species relies on another, for instance, the model we've used here -will not effectively account for this. -In that case, swapping the vanilla Dirichlet distribution for something fancier (e.g. the [Generalized Dirichlet](https://en.wikipedia.org/wiki/Generalized_Dirichlet_distribution) or [Logistic-Multivariate Normal](https://en.wikipedia.org/wiki/Logit-normal_distribution#Multivariate_generalization)) may be worth considering. - -+++ - -## References - - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -* Authored by [Byron J. Smith](https://github.com/bsmith89) on Jan, 2021 ([pymc-examples#18](https://github.com/pymc-devs/pymc-examples/pull/18)) -* Updated by Abhipsha Das and Oriol Abril-Pla on August, 2021 ([pymc-examples#212](https://github.com/pymc-devs/pymc-examples/pull/212)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` - -:::{include} page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/mixture_models/dp_mix.myst.md b/myst_nbs/mixture_models/dp_mix.myst.md deleted file mode 100644 index e97d5af6b..000000000 --- a/myst_nbs/mixture_models/dp_mix.myst.md +++ /dev/null @@ -1,618 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(dp_mix)= -# Dirichlet process mixtures for density estimation - -:::{post} Sept 16, 2021 -:tags: mixture model, -:category: advanced -:author: Austin Rochford, Abhipsha Das -::: - -+++ - -## Dirichlet processes - -The [Dirichlet process](https://en.wikipedia.org/wiki/Dirichlet_process) is a flexible probability distribution over the space of distributions. Most generally, a probability distribution, $P$, on a set $\Omega$ is a [measure](https://en.wikipedia.org/wiki/Measure_(mathematics%29) that assigns measure one to the entire space ($P(\Omega) = 1$). A Dirichlet process $P \sim \textrm{DP}(\alpha, P_0)$ is a measure that has the property that, for every finite [disjoint](https://en.wikipedia.org/wiki/Disjoint_sets) partition $S_1, \ldots, S_n$ of $\Omega$, - -$$(P(S_1), \ldots, P(S_n)) \sim \textrm{Dir}(\alpha P_0(S_1), \ldots, \alpha P_0(S_n)).$$ - -Here $P_0$ is the base probability measure on the space $\Omega$. 
The precision parameter $\alpha > 0$ controls how close samples from the Dirichlet process are to the base measure, $P_0$. As $\alpha \to \infty$, samples from the Dirichlet process approach the base measure $P_0$.
-
-Dirichlet processes have several properties that make them quite suitable to {term}`MCMC` simulation.
-
-1.  The posterior given [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) observations $\omega_1, \ldots, \omega_n$ from a Dirichlet process $P \sim \textrm{DP}(\alpha, P_0)$ is also a Dirichlet process with
-
-    $$P\ |\ \omega_1, \ldots, \omega_n \sim \textrm{DP}\left(\alpha + n, \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}\right),$$
-
-    where $\delta$ is the [Dirac delta measure](https://en.wikipedia.org/wiki/Dirac_delta_function)
-
-    $$\begin{align*}
-        \delta_{\omega}(S)
-            & = \begin{cases}
-                    1 & \textrm{if } \omega \in S \\
-                    0 & \textrm{if } \omega \not \in S
-                \end{cases}
-    \end{align*}.$$
-
-2.  The posterior predictive distribution of a new observation is a compromise between the base measure and the observations,
-
-    $$\omega\ |\ \omega_1, \ldots, \omega_n \sim \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}.$$
-
-    We see that the prior precision $\alpha$ can naturally be interpreted as a prior sample size. The form of this posterior predictive distribution also lends itself to Gibbs sampling.
-
-3.  Samples, $P \sim \textrm{DP}(\alpha, P_0)$, from a Dirichlet process are discrete with probability one. That is, there are elements $\omega_1, \omega_2, \ldots$ in $\Omega$ and weights $\mu_1, \mu_2, \ldots$ with $\sum_{i = 1}^{\infty} \mu_i = 1$ such that
-
-    $$P = \sum_{i = 1}^\infty \mu_i \delta_{\omega_i}.$$
-
-4.  The [stick-breaking process](https://en.wikipedia.org/wiki/Dirichlet_process#The_stick-breaking_process) gives an explicit construction of the weights $\mu_i$ and samples $\omega_i$ above that is straightforward to sample from. If $\beta_1, \beta_2, \ldots \sim \textrm{Beta}(1, \alpha)$, then $\mu_i = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j)$. The relationship between this representation and stick breaking may be illustrated as follows:
-    1.  Start with a stick of length one.
-    2.  Break the stick into two portions, the first of proportion $\mu_1 = \beta_1$ and the second of proportion $1 - \mu_1$.
-    3.  Further break the second portion into two portions, the first of proportion $\beta_2$ and the second of proportion $1 - \beta_2$. The length of the first portion of this stick is $\beta_2 (1 - \beta_1)$; the length of the second portion is $(1 - \beta_1) (1 - \beta_2)$.
-    4.  Continue breaking the second portion from the previous break in this manner forever. If $\omega_1, \omega_2, \ldots \sim P_0$, then
-
-        $$P = \sum_{i = 1}^\infty \mu_i \delta_{\omega_i} \sim \textrm{DP}(\alpha, P_0).$$
-
-See [this tutorial](http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf) and {cite:t}`teh2010dirichletprocess` for a brief introduction to other flavours of Dirichlet processes and their applications.
-
-We can use the stick-breaking process above to easily sample from a Dirichlet process in Python. For this example, $\alpha = 2$ and the base distribution is $N(0, 1)$.
- -```{code-cell} ipython3 -import os - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import scipy as sp -import seaborn as sns -import theano.tensor as tt -import xarray as xr - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -N = 20 -K = 30 - -alpha = 2.0 -P0 = sp.stats.norm -``` - -We draw and plot samples from the stick-breaking process. - -```{code-cell} ipython3 -beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) -w = np.empty_like(beta) -w[:, 0] = beta[:, 0] -w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1) - -omega = P0.rvs(size=(N, K)) - -x_plot = xr.DataArray(np.linspace(-3, 3, 200), dims=["plot"]) - -sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot.values)).sum(axis=1) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs") -ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75) -ax.plot(x_plot, P0.cdf(x_plot), c="k", label="Base CDF") - -ax.set_title(rf"$\alpha = {alpha}$") -ax.legend(loc=2); -``` - -As stated above, as $\alpha \to \infty$, samples from the Dirichlet process converge to the base distribution. - -```{code-cell} ipython3 -fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6)) - -K = 50 -alpha = 10.0 - -beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) -w = np.empty_like(beta) -w[:, 0] = beta[:, 0] -w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1) - -omega = P0.rvs(size=(N, K)) - -sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot.values)).sum(axis=1) - -l_ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs") -l_ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75) -l_ax.plot(x_plot, P0.cdf(x_plot), c="k", label="Base CDF") - -l_ax.set_title(rf"$\alpha = {alpha}$") -l_ax.legend(loc=2) - -K = 200 -alpha = 50.0 - -beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) -w = np.empty_like(beta) -w[:, 0] = beta[:, 0] -w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1) - -omega = P0.rvs(size=(N, K)) - -sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot.values)).sum(axis=1) - -r_ax.plot(x_plot, sample_cdfs[0], c="gray", alpha=0.75, label="DP sample CDFs") -r_ax.plot(x_plot, sample_cdfs[1:].T, c="gray", alpha=0.75) -r_ax.plot(x_plot, P0.cdf(x_plot), c="k", label="Base CDF") - -r_ax.set_title(rf"$\alpha = {alpha}$") -r_ax.legend(loc=2); -``` - -## Dirichlet process mixtures - -For the task of density estimation, the (almost sure) discreteness of samples from the Dirichlet process is a significant drawback. This problem can be solved with another level of indirection by using Dirichlet process mixtures for density estimation. A Dirichlet process mixture uses component densities from a parametric family $\mathcal{F} = \{f_{\theta}\ |\ \theta \in \Theta\}$ and represents the mixture weights as a Dirichlet process. If $P_0$ is a probability measure on the parameter space $\Theta$, a Dirichlet process mixture is the hierarchical model - -$$ -\begin{align*} - x_i\ |\ \theta_i - & \sim f_{\theta_i} \\ - \theta_1, \ldots, \theta_n - & \sim P \\ - P - & \sim \textrm{DP}(\alpha, P_0). 
-\end{align*} -$$ - -To illustrate this model, we simulate draws from a Dirichlet process mixture with $\alpha = 2$, $\theta \sim N(0, 1)$, $x\ |\ \theta \sim N(\theta, (0.3)^2)$. - -```{code-cell} ipython3 -N = 5 -K = 30 - -alpha = 2 -P0 = sp.stats.norm -f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3) -``` - -```{code-cell} ipython3 -beta = sp.stats.beta.rvs(1, alpha, size=(N, K)) -w = np.empty_like(beta) -w[:, 0] = beta[:, 0] -w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1) - -theta = P0.rvs(size=(N, K)) - -dpm_pdf_components = f(x_plot, theta[..., np.newaxis]) -dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.plot(x_plot, dpm_pdfs.T, c="gray") - -ax.set_yticklabels([]); -``` - -We now focus on a single mixture and decompose it into its individual (weighted) mixture components. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ix = 1 - -ax.plot(x_plot, dpm_pdfs[ix], c="k", label="Density") -ax.plot( - x_plot, - (w[..., np.newaxis] * dpm_pdf_components)[ix, 0], - "--", - c="k", - label="Mixture components (weighted)", -) -ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix].T, "--", c="k") - -ax.set_yticklabels([]) -ax.legend(loc=1); -``` - -Sampling from these stochastic processes is fun, but these ideas become truly useful when we fit them to data. The discreteness of samples and the stick-breaking representation of the Dirichlet process lend themselves nicely to Markov chain Monte Carlo simulation of posterior distributions. We will perform this sampling using `PyMC3`. - -Our first example uses a Dirichlet process mixture to estimate the density of waiting times between eruptions of the [Old Faithful](https://en.wikipedia.org/wiki/Old_Faithful) geyser in [Yellowstone National Park](https://en.wikipedia.org/wiki/Yellowstone_National_Park). - -```{code-cell} ipython3 -try: - old_faithful_df = pd.read_csv(os.path.join("..", "data", "old_faithful.csv")) -except FileNotFoundError: - old_faithful_df = pd.read_csv(pm.get_data("old_faithful.csv")) -``` - -For convenience in specifying the prior, we standardize the waiting time between eruptions. - -```{code-cell} ipython3 -old_faithful_df["std_waiting"] = ( - old_faithful_df.waiting - old_faithful_df.waiting.mean() -) / old_faithful_df.waiting.std() -``` - -```{code-cell} ipython3 -old_faithful_df.head() -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -n_bins = 20 -ax.hist(old_faithful_df.std_waiting, bins=n_bins, color="C0", lw=0, alpha=0.5) - -ax.set_xlabel("Standardized waiting time between eruptions") -ax.set_ylabel("Number of eruptions"); -``` - -Observant readers will have noted that we have not been continuing the stick-breaking process indefinitely as indicated by its definition, but rather have been truncating this process after a finite number of breaks. Obviously, when computing with Dirichlet processes, it is necessary to only store a finite number of its point masses and weights in memory. This restriction is not terribly onerous, since with a finite number of observations, it seems quite likely that the number of mixture components that contribute non-neglible mass to the mixture will grow slower than the number of samples. This intuition can be formalized to show that the (expected) number of components that contribute non-negligible mass to the mixture approaches $\alpha \log N$, where $N$ is the sample size. 
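-
-To get a rough feel for that growth rate (a quick back-of-the-envelope calculation, not part of the original analysis), note how slowly $\alpha \log N$ grows with the sample size:
-
-```{code-cell} ipython3
-# Expected number of "active" components under the alpha * log(N) heuristic.
-for a in (1.0, 2.0, 5.0):
-    for n_obs in (100, 1_000, 10_000):
-        print(f"alpha = {a:3.1f}, N = {n_obs:>6} -> ~{a * np.log(n_obs):4.1f} components")
-```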
-
-There are various clever [Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) techniques for Dirichlet processes that allow the number of components stored to grow as needed. Stochastic memoization {cite:p}`roy2008npbayes` is another powerful technique for simulating Dirichlet processes while only storing finitely many components in memory. In this introductory example, we take the much less sophisticated approach of simply truncating the Dirichlet process, storing only a fixed number, $K$, of components. {cite:t}`ishwaran2002approxdirichlet` and {cite:t}`ohlssen2007flexible` provide justification for truncation, showing that $K > 5 \alpha + 2$ is most likely sufficient to capture almost all of the mixture weight ($\sum_{i = 1}^{K} w_i > 0.99$). In practice, we can verify the suitability of our truncated approximation to the Dirichlet process by checking the number of components that contribute non-negligible mass to the mixture. If, in our simulations, all components contribute non-negligible mass to the mixture, we have truncated the Dirichlet process too early.
-
-Our (truncated) Dirichlet process mixture model for the standardized waiting times is
-
-$$
-\begin{align*}
-    \alpha
-        & \sim \textrm{Gamma}(1, 1) \\
-    \beta_1, \ldots, \beta_K
-        & \sim \textrm{Beta}(1, \alpha) \\
-    w_i
-        & = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j) \\
-        \\
-    \lambda_1, \ldots, \lambda_K
-        & \sim \textrm{Gamma}(10, 1) \\
-    \tau_1, \ldots, \tau_K
-        & \sim \textrm{Gamma}(1, 1) \\
-    \mu_i\ |\ \lambda_i, \tau_i
-        & \sim N\left(0, (\lambda_i \tau_i)^{-1}\right) \\
-        \\
-    x\ |\ w_i, \lambda_i, \tau_i, \mu_i
-        & \sim \sum_{i = 1}^K w_i\ N(\mu_i, (\lambda_i \tau_i)^{-1})
-\end{align*}
-$$
-
-Note that instead of fixing a value of $\alpha$, as in our previous simulations, we specify a prior on $\alpha$, so that we may learn its posterior distribution from the observations.
-
-We now construct this model using `pymc3`.
-
-```{code-cell} ipython3
-N = old_faithful_df.shape[0]
-
-K = 30
-```
-
-```{code-cell} ipython3
-def stick_breaking(beta):
-    portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
-
-    return beta * portion_remaining
-```
-
-```{code-cell} ipython3
-with pm.Model(coords={"component": np.arange(K), "obs_id": np.arange(N)}) as model:
-    alpha = pm.Gamma("alpha", 1.0, 1.0)
-    beta = pm.Beta("beta", 1.0, alpha, dims="component")
-    w = pm.Deterministic("w", stick_breaking(beta), dims="component")
-
-    tau = pm.Gamma("tau", 1.0, 1.0, dims="component")
-    lambda_ = pm.Gamma("lambda_", 10.0, 1.0, dims="component")
-    mu = pm.Normal("mu", 0, tau=lambda_ * tau, dims="component")
-    obs = pm.NormalMixture(
-        "obs", w, mu, tau=lambda_ * tau, observed=old_faithful_df.std_waiting.values, dims="obs_id"
-    )
-```
-
-We sample from the model 1,000 times using NUTS initialized with ADVI.
-
-```{code-cell} ipython3
-with model:
-    trace = pm.sample(
-        1000,
-        tune=2500,
-        chains=2,
-        init="advi",
-        target_accept=0.9,
-        random_seed=RANDOM_SEED,
-        return_inferencedata=True,
-    )
-```
-
-The posterior distribution of $\alpha$ is highly concentrated between 0.25 and 1.
-
-```{code-cell} ipython3
-az.plot_trace(trace, var_names=["alpha"]);
-```
-
-To verify that truncation is not biasing our results, we plot the posterior expected mixture weight of each component.
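-
-The same check can also be done numerically (a sketch; the 1% cutoff below is an arbitrary choice):
-
-```{code-cell} ipython3
-# Count the components whose posterior mean weight exceeds 1% of the total mass.
-w_mean = trace.posterior["w"].mean(("chain", "draw"))
-print("Components with posterior mean weight > 0.01:", int((w_mean > 0.01).sum()))
-```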
- -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -plot_w = np.arange(K) + 1 - -ax.bar(plot_w - 0.5, trace.posterior["w"].mean(("chain", "draw")), width=1.0, lw=0) - -ax.set_xlim(0.5, K) -ax.set_xlabel("Component") - -ax.set_ylabel("Posterior expected mixture weight"); -``` - -We see that only three mixture components have appreciable posterior expected weights, so we conclude that truncating the Dirichlet process to thirty components has not appreciably affected our estimates. - -We now compute and plot our posterior density estimate. - -```{code-cell} ipython3 -post_pdf_contribs = xr.apply_ufunc( - sp.stats.norm.pdf, - x_plot, - trace.posterior["mu"], - 1.0 / np.sqrt(trace.posterior["lambda_"] * trace.posterior["tau"]), -) - -post_pdfs = (trace.posterior["w"] * post_pdf_contribs).sum(dim=("component")) - -post_pdf_quantiles = post_pdfs.quantile([0.1, 0.9], dim=("chain", "draw")) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -n_bins = 20 -ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, density=True, color="C0", lw=0, alpha=0.5) - -ax.fill_between( - x_plot, - post_pdf_quantiles.sel(quantile=0.1), - post_pdf_quantiles.sel(quantile=0.9), - color="gray", - alpha=0.45, -) -ax.plot(x_plot, post_pdfs.sel(chain=0, draw=0), c="gray", label="Posterior sample densities") -ax.plot( - x_plot, - post_pdfs.stack(pooled_chain=("chain", "draw")).sel(pooled_chain=slice(None, None, 100)), - c="gray", -) -ax.plot(x_plot, post_pdfs.mean(dim=("chain", "draw")), c="k", label="Posterior expected density") - -ax.set_xlabel("Standardized waiting time between eruptions") - -ax.set_yticklabels([]) -ax.set_ylabel("Density") - -ax.legend(loc=2); -``` - -As above, we can decompose this density estimate into its (weighted) mixture components. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -n_bins = 20 -ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, density=True, color="C0", lw=0, alpha=0.5) - -ax.plot(x_plot, post_pdfs.mean(dim=("chain", "draw")), c="k", label="Posterior expected density") -ax.plot( - x_plot, - (trace.posterior["w"] * post_pdf_contribs).mean(dim=("chain", "draw")).sel(component=0), - "--", - c="k", - label="Posterior expected mixture\ncomponents\n(weighted)", -) -ax.plot( - x_plot, - (trace.posterior["w"] * post_pdf_contribs).mean(dim=("chain", "draw")).T, - "--", - c="k", -) - -ax.set_xlabel("Standardized waiting time between eruptions") - -ax.set_yticklabels([]) -ax.set_ylabel("Density") - -ax.legend(loc=2); -``` - -The Dirichlet process mixture model is incredibly flexible in terms of the family of parametric component distributions $\{f_{\theta}\ |\ f_{\theta} \in \Theta\}$. We illustrate this flexibility below by using Poisson component distributions to estimate the density of sunspots per year. This dataset was curated by {cite:t}`sidc2021sunspot` and can be downloaded. 
-
-```{code-cell} ipython3
-kwargs = dict(sep=";", names=["time", "sunspot.year"], usecols=[0, 1])
-try:
-    sunspot_df = pd.read_csv(os.path.join("..", "data", "sunspot.csv"), **kwargs)
-except FileNotFoundError:
-    sunspot_df = pd.read_csv(pm.get_data("sunspot.csv"), **kwargs)
-```
-
-```{code-cell} ipython3
-sunspot_df.head()
-```
-
-For this example, the model is
-
-$$
-\begin{align*}
-    \alpha
-        & \sim \textrm{Gamma}(1, 1) \\
-    \beta_1, \ldots, \beta_K
-        & \sim \textrm{Beta}(1, \alpha) \\
-    w_i
-        & = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j) \\
-        \\
-    \lambda_1, \ldots, \lambda_K
-        & \sim \textrm{Gamma}(300, 2)
-    \\
-    x\ |\ w_i, \lambda_i
-        & \sim \sum_{i = 1}^K w_i\ \textrm{Poisson}(\lambda_i).
-\end{align*}
-$$
-
-```{code-cell} ipython3
-K = 50
-N = sunspot_df.shape[0]
-```
-
-```{code-cell} ipython3
-with pm.Model(coords={"component": np.arange(K), "obs_id": np.arange(N)}) as model:
-    alpha = pm.Gamma("alpha", 1.0, 1.0)
-    beta = pm.Beta("beta", 1, alpha, dims="component")
-    w = pm.Deterministic("w", stick_breaking(beta), dims="component")
-    # Gamma is conjugate prior to Poisson
-    lambda_ = pm.Gamma("lambda_", 300.0, 2.0, dims="component")
-    obs = pm.Mixture(
-        "obs", w, pm.Poisson.dist(lambda_), observed=sunspot_df["sunspot.year"], dims="obs_id"
-    )
-```
-
-```{code-cell} ipython3
-with model:
-    trace = pm.sample(
-        2000,
-        tune=5000,
-        chains=4,
-        init="advi",
-        target_accept=0.8,
-        random_seed=RANDOM_SEED,
-        return_inferencedata=True,
-    )
-```
-
-For the sunspot model, the posterior distribution of $\alpha$ is concentrated between 0.6 and 1.2, indicating that we should expect more components to contribute non-negligible amounts to the mixture than for the Old Faithful waiting time model.
-
-```{code-cell} ipython3
-az.plot_trace(trace, var_names=["alpha"]);
-```
-
-Indeed, we see that between ten and fifteen mixture components have appreciable posterior expected weight.
-
-```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(8, 6))
-
-plot_w = np.arange(K) + 1
-
-ax.bar(plot_w - 0.5, trace.posterior["w"].mean(("chain", "draw")), width=1.0, lw=0)
-
-ax.set_xlim(0.5, K)
-ax.set_xlabel("Component")
-
-ax.set_ylabel("Posterior expected mixture weight");
-```
-
-We now calculate and plot the fitted density estimate.
-
-```{code-cell} ipython3
-x_plot = xr.DataArray(np.arange(250), dims=["plot"])
-
-post_pmf_contribs = xr.apply_ufunc(sp.stats.poisson.pmf, x_plot, trace.posterior["lambda_"])
-
-post_pmfs = (trace.posterior["w"] * post_pmf_contribs).sum(dim=("component"))
-
-post_pmf_quantiles = post_pmfs.quantile([0.025, 0.975], dim=("chain", "draw"))
-```
-
-```{code-cell} ipython3
-fig, ax = plt.subplots(figsize=(8, 6))
-
-ax.hist(sunspot_df["sunspot.year"].values, bins=40, density=True, lw=0, alpha=0.75)
-
-ax.fill_between(
-    x_plot,
-    post_pmf_quantiles.sel(quantile=0.025),
-    post_pmf_quantiles.sel(quantile=0.975),
-    color="gray",
-    alpha=0.45,
-)
-ax.plot(x_plot, post_pmfs.sel(chain=0, draw=0), c="gray", label="Posterior sample densities")
-ax.plot(
-    x_plot,
-    post_pmfs.stack(pooled_chain=("chain", "draw")).sel(pooled_chain=slice(None, None, 200)),
-    c="gray",
-)
-ax.plot(x_plot, post_pmfs.mean(dim=("chain", "draw")), c="k", label="Posterior expected density")
-
-ax.set_xlabel("Yearly sunspot count")
-ax.set_yticklabels([])
-ax.legend(loc=1);
-```
-
-Again, we can decompose the posterior expected density into weighted mixture densities.
- -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -n_bins = 40 -ax.hist(sunspot_df["sunspot.year"].values, bins=n_bins, density=True, color="C0", lw=0, alpha=0.75) - -ax.plot(x_plot, post_pmfs.mean(dim=("chain", "draw")), c="k", label="Posterior expected density") -ax.plot( - x_plot, - (trace.posterior["w"] * post_pmf_contribs).mean(dim=("chain", "draw")).sel(component=0), - "--", - c="k", - label="Posterior expected mixture\ncomponents\n(weighted)", -) -ax.plot( - x_plot, - (trace.posterior["w"] * post_pmf_contribs).mean(dim=("chain", "draw")).T, - "--", - c="k", -) - -ax.set_xlabel("Yearly sunspot count") -ax.set_yticklabels([]) -ax.legend(loc=1); -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -* Adapted by [Austin Rochford](https://github.com/AustinRochford/) from [his own blog post](http://austinrochford.com/posts/2016-02-25-density-estimation-dpm.html) -* Updated by Abhipsha Das on August, 2021 ([pymc-examples#212](https://github.com/pymc-devs/pymc-examples/pull/212)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` - -:::{include} page_footer.md -::: - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/mixture_models/gaussian_mixture_model.myst.md b/myst_nbs/mixture_models/gaussian_mixture_model.myst.md deleted file mode 100644 index 1546f2d5b..000000000 --- a/myst_nbs/mixture_models/gaussian_mixture_model.myst.md +++ /dev/null @@ -1,124 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev-py39 - language: python - name: pymc-dev-py39 ---- - -(gaussian_mixture_model)= -# Gaussian Mixture Model - -:::{post} April, 2022 -:tags: mixture model, classification -:category: beginner -:author: Abe Flaxman -::: - -A [mixture model](https://en.wikipedia.org/wiki/Mixture_model) allows us to make inferences about the component contributors to a distribution of data. More specifically, a Gaussian Mixture Model allows us to make inferences about the means and standard deviations of a specified number of underlying component Gaussian distributions. - -This could be useful in a number of ways. For example, we may be interested in simply describing a complex distribution parametrically (i.e. a [mixture distribution](https://en.wikipedia.org/wiki/Mixture_distribution)). Alternatively, we may be interested in [classification](https://en.wikipedia.org/wiki/Classification) where we seek to probabilistically classify which of a number of classes a particular observation is from. - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm - -from scipy.stats import norm -from xarray_einstats.stats import XrContinuousRV -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -First we generate some simulated observations. - -```{code-cell} ipython3 -:tags: [hide-input] - -k = 3 -ndata = 500 -centers = np.array([-5, 0, 5]) -sds = np.array([0.5, 2.0, 0.75]) -idx = rng.integers(0, k, ndata) -x = rng.normal(loc=centers[idx], scale=sds[idx], size=ndata) -plt.hist(x, 40); -``` - -In the PyMC model, we will estimate one $\mu$ and one $\sigma$ for each of the 3 clusters. 
Writing a Gaussian Mixture Model is very easy with the `pm.NormalMixture` distribution. - -```{code-cell} ipython3 -with pm.Model(coords={"cluster": range(k)}) as model: - μ = pm.Normal( - "μ", - mu=0, - sigma=5, - transform=pm.distributions.transforms.ordered, - initval=[-4, 0, 4], - dims="cluster", - ) - σ = pm.HalfNormal("σ", sigma=1, dims="cluster") - weights = pm.Dirichlet("w", np.ones(k), dims="cluster") - pm.NormalMixture("x", w=weights, mu=μ, sigma=σ, observed=x) - -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -with model: - idata = pm.sample() -``` - -We can also plot the trace to check the nature of the MCMC chains, and compare to the ground truth values. - -```{code-cell} ipython3 -az.plot_trace(idata, var_names=["μ", "σ"], lines=[("μ", {}, [centers]), ("σ", {}, [sds])]); -``` - -And if we wanted, we could calculate the probability density function and examine the estimated group membership probabilities, based on the posterior mean estimates. - -```{code-cell} ipython3 -xi = np.linspace(-7, 7, 500) -post = idata.posterior -pdf_components = XrContinuousRV(norm, post["μ"], post["σ"]).pdf(xi) * post["w"] -pdf = pdf_components.sum("cluster") - -fig, ax = plt.subplots(3, 1, figsize=(7, 8), sharex=True) -# empirical histogram -ax[0].hist(x, 50) -ax[0].set(title="Data", xlabel="x", ylabel="Frequency") -# pdf -pdf_components.mean(dim=["chain", "draw"]).sum("cluster").plot.line(ax=ax[1]) -ax[1].set(title="PDF", xlabel="x", ylabel="Probability\ndensity") -# plot group membership probabilities -(pdf_components / pdf).mean(dim=["chain", "draw"]).plot.line(hue="cluster", ax=ax[2]) -ax[2].set(title="Group membership", xlabel="x", ylabel="Probability"); -``` - -## Authors -- Authored by Abe Flaxman. -- Updated by Thomas Wiecki. -- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) in April 2022 ([#310](https://github.com/pymc-devs/pymc-examples/pull/310)) to use `pm.NormalMixture`. 
- -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray,xarray_einstats -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/mixture_models/marginalized_gaussian_mixture_model.myst.md b/myst_nbs/mixture_models/marginalized_gaussian_mixture_model.myst.md deleted file mode 100644 index 0ed1504df..000000000 --- a/myst_nbs/mixture_models/marginalized_gaussian_mixture_model.myst.md +++ /dev/null @@ -1,239 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -+++ {"papermill": {"duration": 0.012112, "end_time": "2020-12-20T20:45:32.375345", "exception": false, "start_time": "2020-12-20T20:45:32.363233", "status": "completed"}, "tags": []} - -# Marginalized Gaussian Mixture Model - -:::{post} Sept 18, 2021 -:tags: mixture model, -:category: intermediate -::: - -```{code-cell} ipython3 ---- -papermill: - duration: 5.906876 - end_time: '2020-12-20T20:45:38.293074' - exception: false - start_time: '2020-12-20T20:45:32.386198' - status: completed -tags: [] ---- -import arviz as az -import numpy as np -import pymc3 as pm -import seaborn as sns - -from matplotlib import pyplot as plt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.034525 - end_time: '2020-12-20T20:45:38.340340' - exception: false - start_time: '2020-12-20T20:45:38.305815' - status: completed -tags: [] ---- -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -+++ {"papermill": {"duration": 0.011094, "end_time": "2020-12-20T20:45:38.362640", "exception": false, "start_time": "2020-12-20T20:45:38.351546", "status": "completed"}, "tags": []} - -Gaussian mixtures are a flexible class of models for data that exhibits subpopulation heterogeneity. A toy example of such a data set is shown below. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 0.019101 - end_time: '2020-12-20T20:45:38.392974' - exception: false - start_time: '2020-12-20T20:45:38.373873' - status: completed -tags: [] ---- -N = 1000 - -W = np.array([0.35, 0.4, 0.25]) - -MU = np.array([0.0, 2.0, 5.0]) -SIGMA = np.array([0.5, 0.5, 1.0]) -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.018854 - end_time: '2020-12-20T20:45:38.422840' - exception: false - start_time: '2020-12-20T20:45:38.403986' - status: completed -tags: [] ---- -component = rng.choice(MU.size, size=N, p=W) -x = rng.normal(MU[component], SIGMA[component], size=N) -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 0.422847 - end_time: '2020-12-20T20:45:38.856513' - exception: false - start_time: '2020-12-20T20:45:38.433666' - status: completed -tags: [] ---- -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.hist(x, bins=30, density=True, lw=0); -``` - -+++ {"papermill": {"duration": 0.012072, "end_time": "2020-12-20T20:45:38.881581", "exception": false, "start_time": "2020-12-20T20:45:38.869509", "status": "completed"}, "tags": []} - -A natural parameterization of the Gaussian mixture model is as the [latent variable model](https://en.wikipedia.org/wiki/Latent_variable_model) - -$$ -\begin{align*} -\mu_1, \ldots, \mu_K - & \sim N(0, \sigma^2) \\ -\tau_1, \ldots, \tau_K - & \sim \textrm{Gamma}(a, b) \\ -\boldsymbol{w} - & \sim \textrm{Dir}(\boldsymbol{\alpha}) \\ -z\ |\ \boldsymbol{w} - & \sim \textrm{Cat}(\boldsymbol{w}) \\ -x\ |\ z - & \sim N(\mu_z, \tau^{-1}_z). -\end{align*} -$$ - -An implementation of this parameterization in PyMC3 is available at {doc}`gaussian_mixture_model`. A drawback of this parameterization is that is posterior relies on sampling the discrete latent variable $z$. This reliance can cause slow mixing and ineffective exploration of the tails of the distribution. - -An alternative, equivalent parameterization that addresses these problems is to marginalize over $z$. The marginalized model is - -$$ -\begin{align*} -\mu_1, \ldots, \mu_K - & \sim N(0, \sigma^2) \\ -\tau_1, \ldots, \tau_K - & \sim \textrm{Gamma}(a, b) \\ -\boldsymbol{w} - & \sim \textrm{Dir}(\boldsymbol{\alpha}) \\ -f(x\ |\ \boldsymbol{w}) - & = \sum_{i = 1}^K w_i\ N(x\ |\ \mu_i, \tau^{-1}_i), -\end{align*} -$$ - -where - -$$N(x\ |\ \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right)$$ - -is the probability density function of the normal distribution. - -Marginalizing $z$ out of the model generally leads to faster mixing and better exploration of the tails of the posterior distribution. Marginalization over discrete parameters is a common trick in the [Stan](http://mc-stan.org/) community, since Stan does not support sampling from discrete distributions. For further details on marginalization and several worked examples, see the [_Stan User's Guide and Reference Manual_](http://www.uvm.edu/~bbeckage/Teaching/DataAnalysis/Manuals/stan-reference-2.8.0.pdf). - -PyMC3 supports marginalized Gaussian mixture models through its `NormalMixture` class. (It also supports marginalized general mixture models through its `Mixture` class) Below we specify and fit a marginalized Gaussian mixture model to this data in PyMC3. 
- -```{code-cell} ipython3 ---- -papermill: - duration: 71.268227 - end_time: '2020-12-20T20:46:50.162293' - exception: false - start_time: '2020-12-20T20:45:38.894066' - status: completed -tags: [] ---- -with pm.Model(coords={"cluster": np.arange(len(W)), "obs_id": np.arange(N)}) as model: - w = pm.Dirichlet("w", np.ones_like(W)) - - mu = pm.Normal( - "mu", - np.zeros_like(W), - 1.0, - dims="cluster", - transform=pm.transforms.ordered, - testval=[1, 2, 3], - ) - tau = pm.Gamma("tau", 1.0, 1.0, dims="cluster") - - x_obs = pm.NormalMixture("x_obs", w, mu, tau=tau, observed=x, dims="obs_id") -``` - -```{code-cell} ipython3 ---- -papermill: - duration: 587.834129 - end_time: '2020-12-20T20:56:38.008902' - exception: false - start_time: '2020-12-20T20:46:50.174773' - status: completed -tags: [] ---- -with model: - trace = pm.sample(5000, n_init=10000, tune=1000, return_inferencedata=True) - - # sample posterior predictive samples - ppc_trace = pm.sample_posterior_predictive(trace, var_names=["x_obs"], keep_size=True) - -trace.add_groups(posterior_predictive=ppc_trace) -``` - -+++ {"papermill": {"duration": 0.013524, "end_time": "2020-12-20T20:56:38.036405", "exception": false, "start_time": "2020-12-20T20:56:38.022881", "status": "completed"}, "tags": []} - -We see in the following plot that the posterior distribution on the weights and the component means has captured the true value quite well. - -```{code-cell} ipython3 -az.plot_trace(trace, var_names=["w", "mu"], compact=False); -``` - -```{code-cell} ipython3 -az.plot_posterior(trace, var_names=["w", "mu"]); -``` - -+++ {"papermill": {"duration": 0.035988, "end_time": "2020-12-20T20:56:44.871074", "exception": false, "start_time": "2020-12-20T20:56:44.835086", "status": "completed"}, "tags": []} - -We see that the posterior predictive samples have a distribution quite close to that of the observed data. - -```{code-cell} ipython3 -az.plot_ppc(trace); -``` - -Author: [Austin Rochford](http://austinrochford.com) - -+++ - -## Watermark - -```{code-cell} ipython3 ---- -papermill: - duration: 0.108022 - end_time: '2020-12-20T20:58:55.011403' - exception: false - start_time: '2020-12-20T20:58:54.903381' - status: completed -tags: [] ---- -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` diff --git a/myst_nbs/ode_models/ODE_API_introduction.myst.md b/myst_nbs/ode_models/ODE_API_introduction.myst.md deleted file mode 100644 index 07a3cb10d..000000000 --- a/myst_nbs/ode_models/ODE_API_introduction.myst.md +++ /dev/null @@ -1,295 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc-dev - language: python - name: pymc-dev ---- - -```{code-cell} ipython3 -%matplotlib inline -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano - -from pymc3.ode import DifferentialEquation -from scipy.integrate import odeint - -plt.style.use("seaborn-darkgrid") -``` - -# GSoC 2019: Introduction of pymc3.ode API -by [Demetri Pananos](https://dpananos.github.io/posts/2019/08/blog-post-21/) - -Ordinary differential equations (ODEs) are a convenient mathematical framework for modelling the temporal dynamics of a system in disciplines from engineering to ecology. Though most analyses focus on bifurcations and stability of fixed points, parameter and uncertainty estimates are more salient in systems of practical interest, such as population pharmacokinetics and pharmacodynamics. 
-
-
-Both parameter estimation and uncertainty propagation are handled elegantly by the Bayesian framework. In this notebook, I showcase how PyMC3 can be used to do inference for differential equations using the `ode` submodule. While the current implementation is quite flexible and well integrated, more complex models can easily become slow to estimate. A new package that integrates the much faster `sundials` package into PyMC3 called `sunode` can be found [here](https://github.com/aseyboldt/sunode).
-
-
-## Catching Up On Differential Equations
-
-A differential equation is an equation relating an unknown function's derivative to itself. We usually write differential equations as
-
-$$ \mathbf{y}' = f(\mathbf{y},t,\mathbf{p}) \quad \mathbf{y}(t_0) = \mathbf{y}_0 $$
-
-Here, $\mathbf{y}$ is the unknown function, $t$ is time, and $\mathbf{p}$ is a vector of parameters. The function $f$ can be either scalar or vector valued.
-
-Only a small subset of differential equations has an analytical solution. For most differential equations of applied interest, numerical methods must be used to obtain approximate solutions.
-
-
-## Doing Bayesian Inference With Differential Equations
-
-PyMC3 uses Hamiltonian Monte Carlo (HMC) to obtain samples from the posterior distribution. HMC requires derivatives of the ODE's solution with respect to the parameters $p$. The `ode` submodule automatically computes appropriate derivatives so you don't have to. All you have to do is:
-
-* Write the differential equation as a Python function
-* Write the model in PyMC3
-* Hit the Inference Button $^{\text{TM}}$
-
-Let's see how this is done in practice with a small example.
-
-## A Differential Equation For Freefall
-
-An object of mass $m$ is brought to some height and allowed to fall freely until it reaches the ground. A differential equation describing the object's speed over time is
-
-$$ y' = mg - \gamma y $$
-
-The force the object experiences in the downwards direction is $mg$, while the force the object experiences in the opposite direction (due to air resistance) is proportional to how fast the object is presently moving. Let's assume the object starts from rest (that is, that the object's initial velocity is 0). This may or may not be the case. To showcase how to do inference on initial conditions, I will first assume the object starts from rest, and then relax that assumption later.
-
-Data on this object's speed as a function of time is shown below. The data may be noisy because of our measurement tools, or because the object is an irregular shape, thus leading to times during freefall when the object is more/less aerodynamic. Let's use this data to estimate the proportionality constant for air resistance.
-
-```{code-cell} ipython3
-# For reproducibility
-np.random.seed(20394)
-
-
-def freefall(y, t, p):
-    return 2.0 * p[1] - p[0] * y[0]
-
-
-# Times for observation
-times = np.arange(0, 10, 0.5)
-gamma, g, y0, sigma = 0.4, 9.8, -2, 2
-y = odeint(freefall, t=times, y0=y0, args=tuple([[gamma, g]]))
-yobs = np.random.normal(y, 2)
-
-fig, ax = plt.subplots(dpi=120)
-plt.plot(times, yobs, label="observed speed", linestyle="dashed", marker="o", color="red")
-plt.plot(times, y, label="True speed", color="k", alpha=0.5)
-plt.legend()
-plt.xlabel("Time (Seconds)")
-plt.ylabel(r"$y(t)$")
-plt.show()
-```
-
-To specify an ordinary differential equation with PyMC3, use the `DifferentialEquation` class. This class takes as arguments:
-
-* `func`: A function specifying the differential equation (i.e.
$f(\mathbf{y},t,\mathbf{p})$). -* `times`: An array of times at which data was observed. -* `n_states`: The dimension of $f(\mathbf{y},t,\mathbf{p})$. -* `n_theta`: The dimension of $\mathbf{p}$. -* `t0`: Optional time to which the initial condition belongs. - -The argument `func` needs to be written as if `y` and `p` are vectors. So even when your model has one state and/or one parameter, you should explicitly write `y[0]` and/or `p[0]`. - -Once the model is specified, we can use it in our pyMC3 model by passing parameters and initial conditions. `DifferentialEquation` returns a flattened solution, so you will need to reshape it to the same shape as your observed data in the model. - -Shown below is a model to estimate $\gamma$ in the ODE above. - -```{code-cell} ipython3 -ode_model = DifferentialEquation(func=freefall, times=times, n_states=1, n_theta=2, t0=0) - -with pm.Model() as model: - # Specify prior distributions for some of our model parameters - sigma = pm.HalfCauchy("sigma", 1) - gamma = pm.Lognormal("gamma", 0, 1) - - # If we know one of the parameter values, we can simply pass the value. - ode_solution = ode_model(y0=[0], theta=[gamma, 9.8]) - # The ode_solution has a shape of (n_times, n_states) - - Y = pm.Normal("Y", mu=ode_solution, sigma=sigma, observed=yobs) - - prior = pm.sample_prior_predictive() - trace = pm.sample(2000, tune=1000, cores=1) - posterior_predictive = pm.sample_posterior_predictive(trace) - - data = az.from_pymc3(trace=trace, prior=prior, posterior_predictive=posterior_predictive) -``` - -```{code-cell} ipython3 -az.plot_posterior(data); -``` - -Our estimates of the proportionality constant and noise in the system are incredibly close to their actual values! - -We can even estimate the acceleration due to gravity by specifying a prior for it. - -```{code-cell} ipython3 -with pm.Model() as model2: - sigma = pm.HalfCauchy("sigma", 1) - gamma = pm.Lognormal("gamma", 0, 1) - # A prior on the acceleration due to gravity - g = pm.Lognormal("g", pm.math.log(10), 2) - - # Notice now I have passed g to the odeparams argument - ode_solution = ode_model(y0=[0], theta=[gamma, g]) - - Y = pm.Normal("Y", mu=ode_solution, sigma=sigma, observed=yobs) - - prior = pm.sample_prior_predictive() - trace = pm.sample(2000, tune=1000, target_accept=0.9, cores=1) - posterior_predictive = pm.sample_posterior_predictive(trace) - - data = az.from_pymc3(trace=trace, prior=prior, posterior_predictive=posterior_predictive) -``` - -```{code-cell} ipython3 -az.plot_posterior(data); -``` - -The uncertainty in the acceleration due to gravity has increased our uncertainty in the proportionality constant. - -Finally, we can do inference on the initial condition. If this object was brought to its initial height by an airplane, then turbulent air might have made the airplane move up or down, thereby changing the initial velocity of the object. - -Doing inference on the initial condition is as easy as specifying a prior for the initial condition, and then passing the initial condition to `ode_model`. - -```{code-cell} ipython3 -with pm.Model() as model3: - sigma = pm.HalfCauchy("sigma", 1) - gamma = pm.Lognormal("gamma", 0, 1) - g = pm.Lognormal("g", pm.math.log(10), 2) - # Initial condition prior. We think it is at rest, but will allow for perturbations in initial velocity. 
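-    # Normal(0, 2): centred on rest, with enough spread for the data to move the initial velocity if needed.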
-    y0 = pm.Normal("y0", 0, 2)
-
-    ode_solution = ode_model(y0=[y0], theta=[gamma, g])
-
-    Y = pm.Normal("Y", mu=ode_solution, sigma=sigma, observed=yobs)
-
-    prior = pm.sample_prior_predictive()
-    trace = pm.sample(2000, tune=1000, target_accept=0.9, cores=1)
-    posterior_predictive = pm.sample_posterior_predictive(trace)
-
-    data = az.from_pymc3(trace=trace, prior=prior, posterior_predictive=posterior_predictive)
-```
-
-```{code-cell} ipython3
-az.plot_posterior(data, figsize=(13, 3));
-```
-
-Note that by explicitly modelling the initial condition, we obtain a much better estimate of the acceleration due to gravity than if we had insisted that the object started at rest.
-
-+++
-
-## Non-linear Differential Equations
-
-The example of an object in free fall might not be the most appropriate since that differential equation can be solved exactly. Thus, `DifferentialEquation` is not needed to solve that particular problem. There are, however, many examples of differential equations which cannot be solved exactly. Inference for these models is where `DifferentialEquation` truly shines.
-
-Consider the SIR model of infection. This model describes the temporal dynamics of a disease spreading through a homogeneously mixed closed population. Members of the population are placed into one of three categories: Susceptible, Infective, or Recovered. The differential equations are
-
-
-$$ \dfrac{dS}{dt} = - \beta SI \quad S(0) = S_0 $$
-$$ \dfrac{dI}{dt} = \beta SI - \lambda I \quad I(0) = I_0 $$
-$$ \dfrac{dR}{dt} = \lambda I \quad R(0) = R_0 $$
-
-with the constraint that $S(t) + I(t) + R(t) = 1 \, \forall t$. Here, $\beta$ is the rate of infection per susceptible and per infective, and $\lambda$ is the rate of recovery.
-
-If we knew $S(t)$ and $I(t)$, then we could determine $R(t)$, so we can peel off the differential equation for $R(t)$ and work only with the first two.
-
-
-In the SIR model, it is straightforward to see that $\beta, \lambda$ and $\beta/2, \lambda/2$ will produce the same qualitative dynamics but on much different time scales. To study the *quality* of the dynamics, regardless of time scale, applied mathematicians will *non-dimensionalize* differential equations. Non-dimensionalization is the process of introducing dimensionless variables into the differential equation to understand the system's dynamics under families of equivalent parameterizations.
-
-To non-dimensionalize this system, let's scale time by $1/\lambda$ (we do this because people stay infected for an average of $1/\lambda$ units of time; it is a straightforward argument to show this. For more, see [1]). Let $t = \tau/\lambda$, where $\tau$ is a unitless variable. Then
-
-
-$$ \dfrac{dS}{d\tau} = \dfrac{dt}{d\tau} \dfrac{dS}{dt} = \dfrac{1}{\lambda}\dfrac{dS}{dt} = -\dfrac{\beta}{\lambda}SI$$
-
-and
-
-$$ \dfrac{dI}{d\tau} = \dfrac{dt}{d\tau} \dfrac{dI}{dt} = \dfrac{1}{\lambda}\dfrac{dI}{dt} = \dfrac{\beta}{\lambda}SI - I$$
-
-The quantity $\beta/\lambda$ has a very special name. We call it *The R-Nought* ($\mathcal{R}_0$). Its interpretation is that if we were to drop a single infected person into a population of susceptible individuals, we would expect $\mathcal{R}_0$ new infections. If $\mathcal{R}_0>1$, then an epidemic will take place. If $\mathcal{R}_0\leq1$, then there will be no epidemic (note: we can show this more rigorously by studying the eigenvalues of the system's Jacobian. For more, see [2]).
-
-This non-dimensionalization is important because it gives us information about the parameters.
If we see an epidemic has occurred, then we know that $\mathcal{R}_0>1$ which means $\beta> \lambda$. Furthermore, it might be hard to place a prior on $\beta$ because of beta's interpretation. But since $1/\lambda$ has a simple interpretation, and since $\mathcal{R}_0>1$, we can obtain $\beta$ by computing $\mathcal{R}_0\lambda$. - -Side note: I'm going to choose a likelihood which certainly violates these constraints, just for exposition on how to use `DifferentialEquation`. In reality, a likelihood which respects these constraints should be chosen. - -```{code-cell} ipython3 -def SIR(y, t, p): - ds = -p[0] * y[0] * y[1] - di = p[0] * y[0] * y[1] - p[1] * y[1] - return [ds, di] - - -times = np.arange(0, 5, 0.25) - -beta, gamma = 4, 1.0 -# Create true curves -y = odeint(SIR, t=times, y0=[0.99, 0.01], args=((beta, gamma),), rtol=1e-8) -# Observational model. Lognormal likelihood isn't appropriate, but we'll do it anyway -yobs = np.random.lognormal(mean=np.log(y[1::]), sigma=[0.2, 0.3]) - -plt.plot(times[1::], yobs, marker="o", linestyle="none") -plt.plot(times, y[:, 0], color="C0", alpha=0.5, label=f"$S(t)$") -plt.plot(times, y[:, 1], color="C1", alpha=0.5, label=f"$I(t)$") -plt.legend() -plt.show() -``` - -```{code-cell} ipython3 -sir_model = DifferentialEquation( - func=SIR, - times=np.arange(0.25, 5, 0.25), - n_states=2, - n_theta=2, - t0=0, -) - -with pm.Model() as model4: - sigma = pm.HalfCauchy("sigma", 1, shape=2) - - # R0 is bounded below by 1 because we see an epidemic has occurred - R0 = pm.Bound(pm.Normal, lower=1)("R0", 2, 3) - lam = pm.Lognormal("lambda", pm.math.log(2), 2) - beta = pm.Deterministic("beta", lam * R0) - - sir_curves = sir_model(y0=[0.99, 0.01], theta=[beta, lam]) - - Y = pm.Lognormal("Y", mu=pm.math.log(sir_curves), sigma=sigma, observed=yobs) - - trace = pm.sample(2000, tune=1000, target_accept=0.9, cores=1) - data = az.from_pymc3(trace=trace) -``` - -```{code-cell} ipython3 -az.plot_posterior(data, round_to=2, credible_interval=0.95); -``` - -As can be seen from the posterior plots, $\beta$ is well estimated by leveraging knoweldege about the non-dimensional parameter $\mathcal{R}_0$ and $\lambda$. - -+++ - -## Conclusions & Final Thoughts - -ODEs are a really good model for continuous temporal evolution. With the addition of `DifferentialEquation` to PyMC3, we can now use bayesian methods to estimate the parameters of ODEs. - -`DifferentialEquation` is not as fast as compared to Stan's `integrate_ode_bdf`. However, the ease of use of `DifferentialEquation` will allow practitioners to get up and running much quicker with Bayesian estimation for ODEs than Stan (which has a steep learning curve). - -+++ - -## References - -1. Earn, D. J., et al. Mathematical epidemiology. Berlin: Springer, 2008. -2. Britton, Nicholas F. Essential mathematical biology. Springer Science & Business Media, 2012. 
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/ode_models/ODE_API_shapes_and_benchmarking.myst.md b/myst_nbs/ode_models/ODE_API_shapes_and_benchmarking.myst.md deleted file mode 100644 index ce2fab06e..000000000 --- a/myst_nbs/ode_models/ODE_API_shapes_and_benchmarking.myst.md +++ /dev/null @@ -1,198 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -```{code-cell} ipython3 -%load_ext autoreload -%autoreload 2 -``` - -```{code-cell} ipython3 -import os - -os.environ["THEANO_FLAGS"] = "floatX=float64" -``` - -```{code-cell} ipython3 -import logging - -import arviz -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano -import theano.tensor as tt - -from scipy.integrate import odeint - -# this notebook show DEBUG log messages -logging.getLogger("pymc3").setLevel(logging.DEBUG) - -import IPython.display -``` - -# pymc3.ode: Shapes and benchmarking - -+++ - -### Demo Scenario: Simple enzymatic reaction -The model has two ODEs with 3 parameters in total. - -In our generated data, we'll observe `S` and `P` at different times to demonstrate how to slice in such cases. - -```{code-cell} ipython3 -# For reproducibility -np.random.seed(23489) - - -class Chem: - @staticmethod - def reaction(y, t, p): - S, P = y[0], y[1] - vmax, K_S = p[0], p[1] - dPdt = vmax * (S / K_S + S) - dSdt = -dPdt - return [ - dSdt, - dPdt, - ] - - -# Times for observation -times = np.arange(0, 10, 0.5) -red = np.arange(5, len(times)) -blue = np.arange(12) -x_obs_1 = times[red] -x_obs_2 = times[blue] - -y0_true = (10, 2) -theta_true = vmax, K_S = (0.5, 2) -sigma = 0.2 - -y_obs = odeint(Chem.reaction, t=times, y0=y0_true, args=(theta_true,)) -y_obs_1 = np.random.normal(y_obs[red, 0], sigma) -y_obs_2 = np.random.normal(y_obs[blue, 1], sigma) - -fig, ax = plt.subplots(dpi=120) -plt.plot(x_obs_1, y_obs_1, label="S", linestyle="dashed", marker="o", color="red") -plt.plot(x_obs_2, y_obs_2, label="P", linestyle="dashed", marker="o", color="blue") -plt.legend() -plt.xlabel("Time (Seconds)") -plt.ylabel(r"$y(t)$") -plt.show() -``` - -```{code-cell} ipython3 -# To demonstrate that test-value computation works, but also for debugging -theano.config.compute_test_value = "raise" -theano.config.exception_verbosity = "high" -theano.config.traceback.limit = 100 -``` - -```{code-cell} ipython3 -def get_model(): - with pm.Model() as pmodel: - sigma = pm.HalfCauchy("sigma", 1) - vmax = pm.Lognormal("vmax", 0, 1) - K_S = pm.Lognormal("K_S", 0, 1) - s0 = pm.Normal("red_0", mu=10, sigma=2) - - y_hat = pm.ode.DifferentialEquation( - func=Chem.reaction, times=times, n_states=len(y0_true), n_theta=len(theta_true) - )(y0=[s0, y0_true[1]], theta=[vmax, K_S], return_sens=False) - - red_hat = y_hat.T[0][red] - blue_hat = y_hat.T[1][blue] - - Y_red = pm.Normal("Y_red", mu=red_hat, sigma=sigma, observed=y_obs_1) - Y_blue = pm.Normal("Y_blue", mu=blue_hat, sigma=sigma, observed=y_obs_2) - - return pmodel - - -def make_benchmark(): - pmodel = get_model() - - # select input variables & test values - t_inputs = pmodel.cont_vars - # apply transformations as required - test_inputs = (np.log(0.2), np.log(0.5), np.log(1.9), 10) - - # create a test function for evaluating the logp value - print("Compiling f_logpt") - f_logpt = theano.function( - inputs=t_inputs, - outputs=[pmodel.logpt], - # with float32, 
allow downcast because the forward integration is always float64 - allow_input_downcast=(theano.config.floatX == "float32"), - ) - print(f"Test logpt:") - print(f_logpt(*test_inputs)) - - # and another test function for evaluating the gradient - print("Compiling f_logpt") - f_grad = theano.function( - inputs=t_inputs, - outputs=tt.grad(pmodel.logpt, t_inputs), - # with float32, allow downcast because the forward integration is always float64 - allow_input_downcast=(theano.config.floatX == "float32"), - ) - print(f"Test gradient:") - print(f_grad(*test_inputs)) - - # make a benchmarking function that uses random inputs - # - to avoid cheating by caching - # - to get a more realistic distribution over simulation times - def bm(): - f_grad( - np.log(np.random.uniform(0.1, 0.2)), - np.log(np.random.uniform(0.4, 0.6)), - np.log(np.random.uniform(1.9, 2.1)), - np.random.uniform(9, 11), - ) - - return pmodel, bm -``` - -```{code-cell} ipython3 -model, benchmark = make_benchmark() - -print("\nPerformance:") -%timeit benchmark() -``` - -### Inspecting the computation graphs -If you zoom in to the large `DifferentialEquation` ellipse in the top, you can follow the blue arrows downwards to see that the gradient is directly passed from the original DifferentialEquation Op node. - -```{code-cell} ipython3 -theano.printing.pydotprint(tt.grad(model.logpt, model.vmax), "ODE_API_shapes_and_benchmarking.png") -IPython.display.Image("ODE_API_shapes_and_benchmarking.png") -``` - -With the cell below, you can visualize the computation graph interactively. (The HTML file is saved next to this notebook.) - -If you need to install `graphviz/pydot`, you can use these commands: -``` -conda install -c conda-forge python-graphviz -pip install pydot -``` - -```{code-cell} ipython3 -from theano import d3viz - -d3viz.d3viz(model.logpt, "ODE_API_shapes_and_benchmarking.html") -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/ode_models/ODE_with_manual_gradients.myst.md b/myst_nbs/ode_models/ODE_with_manual_gradients.myst.md deleted file mode 100644 index b46326df0..000000000 --- a/myst_nbs/ode_models/ODE_with_manual_gradients.myst.md +++ /dev/null @@ -1,634 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python PyMC3 (Dev) - language: python - name: pymc3-dev-py38 ---- - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano - -from scipy.integrate import odeint -from theano import * - -THEANO_FLAGS = "optimizer=fast_compile" - -print(f"PyMC3 Version: {pm.__version__}") -``` - -# Lotka-Volterra with manual gradients - -by [Sanmitra Ghosh](https://www.mrc-bsu.cam.ac.uk/people/in-alphabetical-order/a-to-g/sanmitra-ghosh/) - -+++ - -Mathematical models are used ubiquitously in a variety of science and engineering domains to model the time evolution of physical variables. These mathematical models are often described as ODEs that are characterised by model structure - the functions of the dynamical variables - and model parameters. However, for the vast majority of systems of practical interest it is necessary to infer both the model parameters and an appropriate model structure from experimental observations. This experimental data often appears to be scarce and incomplete. 
Furthermore, a large variety of models described as dynamical systems show traits of sloppiness (see [Gutenkunst et al., 2007](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030189)) and have unidentifiable parameter combinations. The task of inferring model parameters and structure from experimental data is of paramount importance to reliably analyse the behaviour of dynamical systems and draw faithful predictions in light of the difficulties posit by their complexities. Moreover, any future model prediction should encompass and propagate variability and uncertainty in model parameters and/or structure. Thus, it is also important that the inference methods are equipped to quantify and propagate the aforementioned uncertainties from the model descriptions to model predictions. As a natural choice to handle uncertainty, at least in the parameters, Bayesian inference is increasingly used to fit ODE models to experimental data ([Mark Girolami, 2008](https://www.sciencedirect.com/science/article/pii/S030439750800501X)). However, due to some of the difficulties that I pointed above, fitting an ODE model using Bayesian inference is a challenging task. In this tutorial I am going to take up that challenge and will show how PyMC3 could be potentially used for this purpose. - -I must point out that model fitting (inference of the unknown parameters) is just one of many crucial tasks that a modeller has to complete in order to gain a deeper understanding of a physical process. However, success in this task is crucial and this is where PyMC3, and probabilistic programming (ppl) in general, is extremely useful. The modeller can take full advantage of the variety of samplers and distributions provided by PyMC3 to automate inference. - -In this tutorial I will focus on the fitting exercise, that is estimating the posterior distribution of the parameters given some noisy experimental time series. - -+++ - -## Bayesian inference of the parameters of an ODE - -I begin by first introducing the Bayesian framework for inference in a coupled non-linear ODE defined as -$$ -\frac{d X(t)}{dt}=\boldsymbol{f}\big(X(t),\boldsymbol{\theta}\big), -$$ -where $X(t)\in\mathbb{R}^K$ is the solution, at each time point, of the system composed of $K$ coupled ODEs - the state vector - and $\boldsymbol{\theta}\in\mathbb{R}^D$ is the parameter vector that we wish to infer. $\boldsymbol{f}(\cdot)$ is a non-linear function that describes the governing dynamics. Also, in case of an initial value problem, let the matrix $\boldsymbol{X}(\boldsymbol{\theta}, \mathbf{x_0})$ denote the solution of the above system of equations at some specified time points for the parameters $\boldsymbol{\theta}$ and initial conditions $\mathbf{x_0}$. - -Consider a set of noisy experimental observations $\boldsymbol{Y} \in \mathbb{R}^{T\times K}$ observed at $T$ experimental time points for the $K$ states. We can obtain the likelihood $p(\boldsymbol{Y}|\boldsymbol{X})$, where I use the symbol $\boldsymbol{X}:=\boldsymbol{X}(\boldsymbol{\theta}, \mathbf{x_0})$, and combine that with a prior distribution $p(\boldsymbol{\theta})$ on the parameters, using the Bayes theorem, to obtain the posterior distribution as -$$ -p(\boldsymbol{\theta}|\boldsymbol{Y})=\frac{1}{Z}p(\boldsymbol{Y}|\boldsymbol{X})p(\boldsymbol{\theta}), -$$ -where $Z=\int p(\boldsymbol{Y}|\boldsymbol{X})p(\boldsymbol{\theta}) d\boldsymbol{\theta} $ is the intractable marginal likelihood. 
Due to this intractability we resort to approximate inference and apply MCMC. - -For this tutorial I have chosen two ODEs: -1. The [__Lotka-Volterra predator prey model__ ](http://www.scholarpedia.org/article/Predator-prey_model) -2. The [__Fitzhugh-Nagumo action potential model__](http://www.scholarpedia.org/article/FitzHugh-Nagumo_model) - -I will showcase two distinctive approaches (__NUTS__ and __SMC__ step methods), supported by PyMC3, for the estimation of unknown parameters in these models. - -+++ - -## Lotka-Volterra predator prey model - - The Lotka Volterra model depicts an ecological system that is used to describe the interaction between a predator and prey species. This ODE given by - $$ - \begin{aligned} - \frac{d x}{dt} &=\alpha x -\beta xy \\ - \frac{d y}{dt} &=-\gamma y + \delta xy, - \end{aligned} - $$ - shows limit cycle behaviour and has often been used for benchmarking Bayesian inference methods. $\boldsymbol{\theta}=(\alpha,\beta,\gamma,\delta, x(0),y(0))$ is the set of unknown parameters that we wish to infer from experimental observations of the state vector $X(t)=(x(t),y(t))$ comprising the concentrations of the prey and the predator species respectively. $x(0), y(0)$ are the initial values of the states needed to solve the ODE, which are also treated as unknown quantities. The predator prey model was recently used to demonstrate the applicability of the NUTS sampler, and the Stan ppl in general, for inference in ODE models. I will closely follow [this](https://mc-stan.org/users/documentation/case-studies/lotka-volterra-predator-prey.html) Stan tutorial and thus I will setup this model and associated inference problem (including the data) exactly as was done for the Stan tutorial. Let me first write down the code to solve this ODE using the SciPy's `odeint`. Note that the methods in this tutorial is not limited or tied to `odeint`. Here I have chosen `odeint` to simply stay within PyMC3's dependencies (SciPy in this case). - -```{code-cell} ipython3 -class LotkaVolterraModel: - def __init__(self, y0=None): - self._y0 = y0 - - def simulate(self, parameters, times): - alpha, beta, gamma, delta, Xt0, Yt0 = [x for x in parameters] - - def rhs(y, t, p): - X, Y = y - dX_dt = alpha * X - beta * X * Y - dY_dt = -gamma * Y + delta * X * Y - return dX_dt, dY_dt - - values = odeint(rhs, [Xt0, Yt0], times, (parameters,)) - return values - - -ode_model = LotkaVolterraModel() -``` - -## Handling ODE gradients - -NUTS requires the gradient of the log of the target density w.r.t. the unknown parameters, $\nabla_{\boldsymbol{\theta}}p(\boldsymbol{\theta}|\boldsymbol{Y})$, which can be evaluated using the chain rule of differentiation as -$$ \nabla_{\boldsymbol{\theta}}p(\boldsymbol{\theta}|\boldsymbol{Y}) = \frac{\partial p(\boldsymbol{\theta}|\boldsymbol{Y})}{\partial \boldsymbol{X}}^T \frac{\partial \boldsymbol{X}}{\partial \boldsymbol{\theta}}.$$ - -The gradient of an ODE w.r.t. its parameters, the term $\frac{\partial \boldsymbol{X}}{\partial \boldsymbol{\theta}}$, can be obtained using local sensitivity analysis, although this is not the only method to obtain gradients. However, just like solving an ODE (a non-linear one to be precise) evaluation of the gradients can only be carried out using some sort of numerical method, say for example the famous Runge-Kutta method for non-stiff ODEs. PyMC3 uses Theano as the automatic differentiation engine and thus all models are implemented by stitching together available primitive operations (Ops) supported by Theano. 
Even to extend PyMC3 we need to compose models that can be expressed as symbolic combinations of Theano's Ops. However, if we take a step back and think about Theano then it is apparent that neither the ODE solution nor its gradient w.r.t. the parameters can be expressed symbolically as combinations of Theano's primitive Ops. Hence, from Theano's perspective an ODE (and for that matter any other form of a non-linear differential equation) is a non-differentiable black-box function. However, one might argue that if a numerical method is coded up in Theano (using, say, the `scan` Op), then it is possible to symbolically express the numerical method that evaluates the ODE states, and then we can easily use Theano's automatic differentiation engine to obtain the gradients as well by differentiating through the numerical solver itself. I would like to point out that the former, obtaining the solution, is indeed possible this way, but the obtained gradient would be error-prone. Additionally, this amounts to completely 're-inventing the wheel', as one would have to implement decades-old sophisticated numerical algorithms again from scratch in Theano. - -Thus, in this tutorial I am going to present the alternative approach, which consists of defining new [custom Theano Ops](http://deeplearning.net/software/theano_versions/dev/extending/extending_theano.html), extending Theano, that will wrap both the numerical solution and the vector-matrix product, $ \frac{\partial p(\boldsymbol{\theta}|\boldsymbol{Y})}{\partial \boldsymbol{X}}^T \frac{\partial \boldsymbol{X}}{\partial \boldsymbol{\theta}}$, often known as the _**vector-Jacobian product**_ (VJP) in the automatic differentiation literature. I would like to point out here that in the context of non-linear ODEs the term Jacobian is used to denote the gradients of the ODE dynamics $\boldsymbol{f}$ w.r.t. the ODE states $X(t)$. Thus, to avoid confusion, from now on I will use the term _**vector-sensitivity product**_ (VSP) to denote the same quantity that the term VJP denotes. - -I will start by introducing forward sensitivity analysis. - -## ODE sensitivity analysis - -For a coupled ODE system $\frac{d X(t)}{dt} = \boldsymbol{f}(X(t),\boldsymbol{\theta})$, the local sensitivity of the solution to a parameter is defined by how much the solution would change in response to changes in that parameter, i.e. the sensitivity of the $k$-th state is, simply put, the time evolution of its gradient w.r.t. the $d$-th parameter. This quantity, denoted as $Z_{kd}(t)$, is given by -$$Z_{kd}(t)=\frac{d }{d t} \left\{\frac{\partial X_k (t)}{\partial \theta_d}\right\} = \sum_{i=1}^K \frac{\partial f_k}{\partial X_i (t)}\frac{\partial X_i (t)}{\partial \theta_d} + \frac{\partial f_k}{\partial \theta_d}.$$ - -Using forward sensitivity analysis we can obtain both the state $X(t)$ and its derivative w.r.t. the parameters, at each time point, as the solution to an initial value problem by augmenting the original ODE system with the sensitivity equations $Z_{kd}$. The augmented ODE system $\big(X(t), Z(t)\big)$ can then be solved together using a chosen numerical method. The augmented ODE system needs initial values for the sensitivity equations. All of these should be set to zero except the ones where the sensitivity of a state w.r.t. its own initial value is sought, that is $ \frac{\partial X_k(t)}{\partial X_k (0)} =1 $.
Note that in order to solve this augmented system we have to embark in the tedious process of deriving $ \frac{\partial f_k}{\partial X_i (t)}$, also known as the Jacobian of an ODE, and $\frac{\partial f_k}{\partial \theta_d}$ terms. Thankfully, many ODE solvers calculate these terms and solve the augmented system when asked for by the user. An example would be the [SUNDIAL CVODES solver suite](https://computation.llnl.gov/projects/sundials/cvodes). A Python wrapper for CVODES can be found [here](https://jmodelica.org/assimulo/). - -However, for this tutorial I would go ahead and derive the terms mentioned above, manually, and solve the Lotka-Volterra ODEs alongwith the sensitivites in the following code block. The functions `jac` and `dfdp` below calculate $ \frac{\partial f_k}{\partial X_i (t)}$ and $\frac{\partial f_k}{\partial \theta_d}$ respectively for the Lotka-Volterra model. For convenience I have transformed the sensitivity equation in a matrix form. Here I extended the solver code snippet above to include sensitivities when asked for. - -```{code-cell} ipython3 -n_states = 2 -n_odeparams = 4 -n_ivs = 2 - - -class LotkaVolterraModel: - def __init__(self, n_states, n_odeparams, n_ivs, y0=None): - self._n_states = n_states - self._n_odeparams = n_odeparams - self._n_ivs = n_ivs - self._y0 = y0 - - def simulate(self, parameters, times): - return self._simulate(parameters, times, False) - - def simulate_with_sensitivities(self, parameters, times): - return self._simulate(parameters, times, True) - - def _simulate(self, parameters, times, sensitivities): - alpha, beta, gamma, delta, Xt0, Yt0 = [x for x in parameters] - - def r(y, t, p): - X, Y = y - dX_dt = alpha * X - beta * X * Y - dY_dt = -gamma * Y + delta * X * Y - return dX_dt, dY_dt - - if sensitivities: - - def jac(y): - X, Y = y - ret = np.zeros((self._n_states, self._n_states)) - ret[0, 0] = alpha - beta * Y - ret[0, 1] = -beta * X - ret[1, 0] = delta * Y - ret[1, 1] = -gamma + delta * X - return ret - - def dfdp(y): - X, Y = y - ret = np.zeros( - (self._n_states, self._n_odeparams + self._n_ivs) - ) # except the following entries - ret[ - 0, 0 - ] = X # \frac{\partial [\alpha X - \beta XY]}{\partial \alpha}, and so on... - ret[0, 1] = -X * Y - ret[1, 2] = -Y - ret[1, 3] = X * Y - - return ret - - def rhs(y_and_dydp, t, p): - y = y_and_dydp[0 : self._n_states] - dydp = y_and_dydp[self._n_states :].reshape( - (self._n_states, self._n_odeparams + self._n_ivs) - ) - dydt = r(y, t, p) - d_dydp_dt = np.matmul(jac(y), dydp) + dfdp(y) - return np.concatenate((dydt, d_dydp_dt.reshape(-1))) - - y0 = np.zeros((2 * (n_odeparams + n_ivs)) + n_states) - y0[6] = 1.0 # \frac{\partial [X]}{\partial Xt0} at t==0, and same below for Y - y0[13] = 1.0 - y0[0:n_states] = [Xt0, Yt0] - result = odeint(rhs, y0, times, (parameters,), rtol=1e-6, atol=1e-5) - values = result[:, 0 : self._n_states] - dvalues_dp = result[:, self._n_states :].reshape( - (len(times), self._n_states, self._n_odeparams + self._n_ivs) - ) - return values, dvalues_dp - else: - values = odeint(r, [Xt0, Yt0], times, (parameters,), rtol=1e-6, atol=1e-5) - return values - - -ode_model = LotkaVolterraModel(n_states, n_odeparams, n_ivs) -``` - -For this model I have set the relative and absolute tolerances to $10^{-6}$ and $10^{-5}$ respectively, as was suggested in the Stan tutorial. This will produce sufficiently accurate solutions. Further reducing the tolerances will increase accuracy but at the cost of increasing the computational time. 
A thorough discussion on the choice and use of a numerical method for solving the ODE is out of the scope of this tutorial. However, I must point out that the inaccuracies of the ODE solver do affect the likelihood and as a result the inference. This is more so the case for stiff systems. I would recommend interested readers to this nice blog article where this effect is discussed thoroughly for a [cardiac ODE model](https://mirams.wordpress.com/2018/10/17/ode-errors-and-optimisation/). There is also an emerging area of uncertainty quantification that attacks the problem of noise arisng from impreciseness of numerical algorithms, [probabilistic numerics](http://probabilistic-numerics.org/). This is indeed an elegant framework to carry out inference while taking into account the errors coming from the numeric ODE solvers. - -## Custom ODE Op - -In order to define the custom `Op` I have written down two `theano.Op` classes `ODEGradop`, `ODEop`. `ODEop` essentially wraps the ODE solution and will be called by PyMC3. The `ODEGradop` wraps the numerical VSP and this op is then in turn used inside the `grad` method in the `ODEop` to return the VSP. Note that we pass in two functions: `state`, `numpy_vsp` as arguments to respective Ops. I will define these functions later. These functions act as shims using which we connect the python code for numerical solution of state and VSP to Theano and thus PyMC3. - -```{code-cell} ipython3 -class ODEGradop(theano.tensor.Op): - def __init__(self, numpy_vsp): - self._numpy_vsp = numpy_vsp - - def make_node(self, x, g): - x = theano.tensor.as_tensor_variable(x) - g = theano.tensor.as_tensor_variable(g) - node = theano.Apply(self, [x, g], [g.type()]) - return node - - def perform(self, node, inputs_storage, output_storage): - x = inputs_storage[0] - - g = inputs_storage[1] - out = output_storage[0] - out[0] = self._numpy_vsp(x, g) # get the numerical VSP - - -class ODEop(theano.tensor.Op): - def __init__(self, state, numpy_vsp): - self._state = state - self._numpy_vsp = numpy_vsp - - def make_node(self, x): - x = theano.tensor.as_tensor_variable(x) - - return theano.tensor.Apply(self, [x], [x.type()]) - - def perform(self, node, inputs_storage, output_storage): - x = inputs_storage[0] - out = output_storage[0] - - out[0] = self._state(x) # get the numerical solution of ODE states - - def grad(self, inputs, output_grads): - x = inputs[0] - g = output_grads[0] - - grad_op = ODEGradop(self._numpy_vsp) # pass the VSP when asked for gradient - grad_op_apply = grad_op(x, g) - - return [grad_op_apply] -``` - -I must point out that the way I have defined the custom ODE Ops above there is the possibility that the ODE is solved twice for the same parameter values, once for the states and another time for the VSP. To avoid this behaviour I have written a helper class which stops this double evaluation. 
- -```{code-cell} ipython3 -class solveCached: - def __init__(self, times, n_params, n_outputs): - - self._times = times - self._n_params = n_params - self._n_outputs = n_outputs - self._cachedParam = np.zeros(n_params) - self._cachedSens = np.zeros((len(times), n_outputs, n_params)) - self._cachedState = np.zeros((len(times), n_outputs)) - - def __call__(self, x): - - if np.all(x == self._cachedParam): - state, sens = self._cachedState, self._cachedSens - - else: - state, sens = ode_model.simulate_with_sensitivities(x, times) - - return state, sens - - -times = np.arange(0, 21) # number of measurement points (see below) -cached_solver = solveCached(times, n_odeparams + n_ivs, n_states) -``` - -### The ODE state & VSP evaluation - -Most ODE systems of practical interest will have multiple states and thus the output of the solver, which I have denoted so far as $\boldsymbol{X}$, for a system with $K$ states solved on $T$ time points, would be a $T \times K$-dimensional matrix. For the Lotka-Volterra model the columns of this matrix represent the time evolution of the individual species concentrations. I flatten this matrix to a $TK$-dimensional vector $vec(\boldsymbol{X})$, and also rearrange the sensitivities accordingly to obtain the desired vector-matrix product. It is beneficial at this point to test the custom Op as described [here](http://deeplearning.net/software/theano_versions/dev/extending/extending_theano.html#how-to-test-it). - -```{code-cell} ipython3 -def state(x): - State, Sens = cached_solver(np.array(x, dtype=np.float64)) - cached_solver._cachedState, cached_solver._cachedSens, cached_solver._cachedParam = ( - State, - Sens, - x, - ) - return State.reshape((2 * len(State),)) - - -def numpy_vsp(x, g): - numpy_sens = cached_solver(np.array(x, dtype=np.float64))[1].reshape( - (n_states * len(times), len(x)) - ) - return numpy_sens.T.dot(g) -``` - -## The Hudson's Bay Company data - -The Lotka-Volterra predator prey model has been used previously to successfully explain the dynamics of natural populations of predators and prey, such as the lynx and snowshoe hare data of the Hudson's Bay Company. This is the same data (that was shared [here](https://github.com/stan-dev/example-models/tree/master/knitr/lotka-volterra)) used in the Stan example and thus I will use this data-set as the experimental observations $\boldsymbol{Y}(t)$ to infer the parameters. - -```{code-cell} ipython3 -Year = np.arange(1900, 1921, 1) -# fmt: off -Lynx = np.array([4.0, 6.1, 9.8, 35.2, 59.4, 41.7, 19.0, 13.0, 8.3, 9.1, 7.4, - 8.0, 12.3, 19.5, 45.7, 51.1, 29.7, 15.8, 9.7, 10.1, 8.6]) -Hare = np.array([30.0, 47.2, 70.2, 77.4, 36.3, 20.6, 18.1, 21.4, 22.0, 25.4, - 27.1, 40.3, 57.0, 76.6, 52.3, 19.5, 11.2, 7.6, 14.6, 16.2, 24.7]) -# fmt: on -plt.figure(figsize=(15, 7.5)) -plt.plot(Year, Lynx, color="b", lw=4, label="Lynx") -plt.plot(Year, Hare, color="g", lw=4, label="Hare") -plt.legend(fontsize=15) -plt.xlim([1900, 1920]) -plt.xlabel("Year", fontsize=15) -plt.ylabel("Concentrations", fontsize=15) -plt.xticks(Year, rotation=45) -plt.title("Lynx (predator) - Hare (prey): oscillatory dynamics", fontsize=25); -``` - -## The probabilistic model - -I have now got all the ingredients needed in order to define the probabilistic model in PyMC3. As I have mentioned previously I will set up the probabilistic model with the exact same likelihood and priors used in the Stan example. 
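-
-Before writing down the model, it is also worth numerically checking the custom Op, as suggested earlier. The sketch below is not part of the original notebook: it assumes `theano.gradient.verify_grad` (which compares the Op's analytic gradient against finite differences) is acceptable here, uses a purely illustrative test point, and loosens the tolerances because the sensitivities are only as accurate as the ODE solver settings chosen above.
-
-```{code-cell} ipython3
-# Hypothetical sanity check of the custom Op (illustrative values only, not from the original notebook).
-# verify_grad compares the VSP returned by ODEGradop against a finite-difference approximation.
-test_op = ODEop(state, numpy_vsp)
-test_point = np.array([0.5, 0.025, 0.8, 0.025, 35.0, 6.0], dtype=np.float64)
-theano.gradient.verify_grad(
-    test_op,
-    [test_point],
-    rng=np.random.RandomState(1234),
-    abs_tol=1e-3,  # loosened: finite differences vs. solver-accuracy-limited sensitivities
-    rel_tol=1e-3,
-)
-```
-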
The observed data is defined as follows: - -$$\log (\boldsymbol{Y(t)}) = \log (\boldsymbol{X(t)}) + \eta(t),$$ - -where $\eta(t)$ is assumed to be zero mean i.i.d Gaussian noise with an unknown standard deviation $\sigma$, that needs to be estimated. The above multiplicative (on the natural scale) noise model encodes a lognormal distribution as the likelihood: - -$$\boldsymbol{Y(t)} \sim \mathcal{L}\mathcal{N}(\log (\boldsymbol{X(t)}), \sigma^2).$$ - -The following priors are then placed on the parameters: - -$$ -\begin{aligned} -x(0), y(0) &\sim \mathcal{L}\mathcal{N}(\log(10),1),\\ -\alpha, \gamma &\sim \mathcal{N}(1,0.5),\\ -\beta, \delta &\sim \mathcal{N}(0.05,0.05),\\ -\sigma &\sim \mathcal{L}\mathcal{N}(-1,1). -\end{aligned} -$$ - -For an intuitive explanation, which I am omitting for brevity, regarding the choice of priors as well as the likelihood model, I would recommend the Stan example mentioned above. The above probabilistic model is defined in PyMC3 below. Note that the flattened state vector is reshaped to match the data dimensionality. - -Finally, I use the `pm.sample` method to run NUTS by default and obtain $1500$ post warm-up samples from the posterior. - -```{code-cell} ipython3 -theano.config.exception_verbosity = "high" -theano.config.floatX = "float64" - - -# Define the data matrix -Y = np.vstack((Hare, Lynx)).T - -# Now instantiate the theano custom ODE op -my_ODEop = ODEop(state, numpy_vsp) - -# The probabilistic model -with pm.Model() as LV_model: - - # Priors for unknown model parameters - - alpha = pm.Normal("alpha", mu=1, sigma=0.5) - beta = pm.Normal("beta", mu=0.05, sigma=0.05) - gamma = pm.Normal("gamma", mu=1, sigma=0.5) - delta = pm.Normal("delta", mu=0.05, sigma=0.05) - - xt0 = pm.Lognormal("xto", mu=np.log(10), sigma=1) - yt0 = pm.Lognormal("yto", mu=np.log(10), sigma=1) - sigma = pm.Lognormal("sigma", mu=-1, sigma=1, shape=2) - - # Forward model - all_params = pm.math.stack([alpha, beta, gamma, delta, xt0, yt0], axis=0) - ode_sol = my_ODEop(all_params) - forward = ode_sol.reshape(Y.shape) - - # Likelihood - Y_obs = pm.Lognormal("Y_obs", mu=pm.math.log(forward), sigma=sigma, observed=Y) - - trace = pm.sample(1500, init="jitter+adapt_diag", cores=1) -trace["diverging"].sum() -``` - -```{code-cell} ipython3 -with LV_model: - az.plot_trace(trace); -``` - -```{code-cell} ipython3 -import pandas as pd - -summary = az.summary(trace) -STAN_mus = [0.549, 0.028, 0.797, 0.024, 33.960, 5.949, 0.248, 0.252] -STAN_sds = [0.065, 0.004, 0.091, 0.004, 2.909, 0.533, 0.045, 0.044] -summary["STAN_mus"] = pd.Series(np.array(STAN_mus), index=summary.index) -summary["STAN_sds"] = pd.Series(np.array(STAN_sds), index=summary.index) -summary -``` - -These estimates are almost identical to those obtained in the Stan tutorial (see the last two columns above), which is what we can expect. Posterior predictives can be drawn as below. 
- -```{code-cell} ipython3 -ppc_samples = pm.sample_posterior_predictive(trace, samples=1000, model=LV_model)["Y_obs"] -mean_ppc = ppc_samples.mean(axis=0) -CriL_ppc = np.percentile(ppc_samples, q=2.5, axis=0) -CriU_ppc = np.percentile(ppc_samples, q=97.5, axis=0) -``` - -```{code-cell} ipython3 -plt.figure(figsize=(15, 2 * (5))) -plt.subplot(2, 1, 1) -plt.plot(Year, Lynx, "o", color="b", lw=4, ms=10.5) -plt.plot(Year, mean_ppc[:, 1], color="b", lw=4) -plt.plot(Year, CriL_ppc[:, 1], "--", color="b", lw=2) -plt.plot(Year, CriU_ppc[:, 1], "--", color="b", lw=2) -plt.xlim([1900, 1920]) -plt.ylabel("Lynx conc", fontsize=15) -plt.xticks(Year, rotation=45) -plt.subplot(2, 1, 2) -plt.plot(Year, Hare, "o", color="g", lw=4, ms=10.5, label="Observed") -plt.plot(Year, mean_ppc[:, 0], color="g", lw=4, label="mean of ppc") -plt.plot(Year, CriL_ppc[:, 0], "--", color="g", lw=2, label="credible intervals") -plt.plot(Year, CriU_ppc[:, 0], "--", color="g", lw=2) -plt.legend(fontsize=15) -plt.xlim([1900, 1920]) -plt.xlabel("Year", fontsize=15) -plt.ylabel("Hare conc", fontsize=15) -plt.xticks(Year, rotation=45); -``` - -# Efficient exploration of the posterior landscape with SMC - -It has been pointed out in several papers that the complex non-linear dynamics of an ODE results in a posterior landscape that is extremely difficult to navigate efficiently by many MCMC samplers. Thus, recently the curvature information of the posterior surface has been used to construct powerful geometrically aware samplers ([Mark Girolami and Ben Calderhead, 2011](https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00765.x)) that perform extremely well in ODE inference problems. Another set of ideas suggest breaking down a complex inference task into a sequence of simpler tasks. In essence the idea is to use sequential-importance-sampling to sample from an artificial sequence of increasingly complex distributions where the first in the sequence is a distribution that is easy to sample from, the prior, and the last in the sequence is the actual complex target distribution. The associated importance distribution is constructed by moving the set of particles sampled at the previous step using a Markov kernel, say for example the MH kernel. - -A simple way of building the sequence of distributions is to use a temperature $\beta$, that is raised slowly from $0$ to $1$. Using this temperature variable $\beta$ we can write down the annealed intermediate distribution as - -$$p_{\beta}(\boldsymbol{\theta}|\boldsymbol{y})\propto p(\boldsymbol{y}|\boldsymbol{\theta})^{\beta} p(\boldsymbol{\theta}).$$ - -Samplers that carry out sequential-importance-sampling from these artificial sequence of distributions, to avoid the difficult task of sampling directly from $p(\boldsymbol{\theta}|\boldsymbol{y})$, are known as Sequential Monte Carlo (SMC) samplers ([P Del Moral et al., 2006](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2006.00553.x)). The performance of these samplers are sensitive to the choice of the temperature schedule, that is the set of user-defined increasing values of $\beta$ between $0$ and $1$. Fortunately, PyMC3 provides a version of the SMC sampler ([Jianye Ching and Yi-Chu Chen, 2007](https://ascelibrary.org/doi/10.1061/%28ASCE%290733-9399%282007%29133%3A7%28816%29)) that automatically figures out this temperature schedule. Moreover, the PyMC3's SMC sampler does not require the gradient of the log target density. 
As a result it is extremely easy to use this sampler for inference in ODE models. In the next example I will apply this SMC sampler to estimate the parameters of the Fitzhugh-Nagumo model. - -+++ - -## The Fitzhugh-Nagumo model - -The Fitzhugh-Nagumo model given by -$$ -\begin{aligned} -\frac{dV}{dt}&=(V - \frac{V^3}{3} + R)c\\ -\frac{dR}{dt}&=\frac{-(V-a+bR)}{c}, -\end{aligned} -$$ -consisting of a membrane voltage variable $V(t)$ and a recovery variable $R(t)$ is a two-dimensional simplification of the [Hodgkin-Huxley](http://www.scholarpedia.org/article/Conductance-based_models) model of spike (action potential) generation in squid giant axons and where $a$, $b$, $c$ are the model parameters. This model produces a rich dynamics and as a result a complex geometry of the posterior surface that often leads to poor performance of many MCMC samplers. As a result this model was used to test the efficacy of the discussed geometric MCMC scheme and since then has been used to benchmark other novel MCMC methods. Following [Mark Girolami and Ben Calderhead, 2011](https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00765.x) I will also use artificially generated data from this model to setup the inference task for estimating $\boldsymbol{\theta}=(a,b,c)$. - -```{code-cell} ipython3 -class FitzhughNagumoModel: - def __init__(self, times, y0=None): - self._y0 = np.array([-1, 1], dtype=np.float64) - self._times = times - - def _simulate(self, parameters, times): - a, b, c = [float(x) for x in parameters] - - def rhs(y, t, p): - V, R = y - dV_dt = (V - V**3 / 3 + R) * c - dR_dt = (V - a + b * R) / -c - return dV_dt, dR_dt - - values = odeint(rhs, self._y0, times, (parameters,), rtol=1e-6, atol=1e-6) - return values - - def simulate(self, x): - return self._simulate(x, self._times) -``` - -## Simulated Data - -For this example I am going to use simulated data that is I will generate noisy traces from the forward model defined above with parameters $\theta$ set to $(0.2,0.2,3)$ respectively and corrupted by i.i.d Gaussian noise with a standard deviation $\sigma=0.5$. The initial values are set to $V(0)=-1$ and $R(0)=1$ respectively. Again following [Mark Girolami and Ben Calderhead, 2011](https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00765.x) I will assume that the initial values are known. These parameter values pushes the model into the oscillatory regime. - -```{code-cell} ipython3 -n_states = 2 -n_times = 200 -true_params = [0.2, 0.2, 3.0] -noise_sigma = 0.5 -FN_solver_times = np.linspace(0, 20, n_times) -ode_model = FitzhughNagumoModel(FN_solver_times) -sim_data = ode_model.simulate(true_params) -np.random.seed(42) -Y_sim = sim_data + np.random.randn(n_times, n_states) * noise_sigma -plt.figure(figsize=(15, 7.5)) -plt.plot(FN_solver_times, sim_data[:, 0], color="darkblue", lw=4, label=r"$V(t)$") -plt.plot(FN_solver_times, sim_data[:, 1], color="darkgreen", lw=4, label=r"$R(t)$") -plt.plot(FN_solver_times, Y_sim[:, 0], "o", color="darkblue", ms=4.5, label="Noisy traces") -plt.plot(FN_solver_times, Y_sim[:, 1], "o", color="darkgreen", ms=4.5) -plt.legend(fontsize=15) -plt.xlabel("Time", fontsize=15) -plt.ylabel("Values", fontsize=15) -plt.title("Fitzhugh-Nagumo Action Potential Model", fontsize=25); -``` - -## Define a non-differentiable black-box op using Theano @as_op - -Remember that I told SMC sampler does not require gradients, this is by the way the case for other samplers such as the Metropolis-Hastings, Slice sampler that are also supported in PyMC3. 
For all these gradient-free samplers I will show a simple and quick way of wrapping the forward model i.e. the ODE solution in Theano. All we have to do is to simply to use the decorator `as_op` that converts a python function into a basic Theano Op. We also tell Theano using the `as_op` decorator that we have three parameters each being a Theano scalar. The output then is a Theano matrix whose columns are the state vectors. - -```{code-cell} ipython3 -import theano.tensor as tt - -from theano.compile.ops import as_op - - -@as_op(itypes=[tt.dscalar, tt.dscalar, tt.dscalar], otypes=[tt.dmatrix]) -def th_forward_model(param1, param2, param3): - param = [param1, param2, param3] - th_states = ode_model.simulate(param) - - return th_states -``` - -## Generative model - -Since I have corrupted the original traces with i.i.d Gaussian thus the likelihood is given by -$$\boldsymbol{Y} = \prod_{i=1}^T \mathcal{N}(\boldsymbol{X}(t_i)), \sigma^2\mathbb{I}),$$ -where $\mathbb{I}\in \mathbb{R}^{K \times K}$. We place a Gamma, Normal, Uniform prior on $(a,b,c)$ and a HalfNormal prior on $\sigma$ as follows: -$$ -\begin{aligned} - a & \sim \mathcal{Gamma}(2,1),\\ - b & \sim \mathcal{N}(0,1),\\ - c & \sim \mathcal{U}(0.1,1),\\ - \sigma & \sim \mathcal{H}(1). -\end{aligned} -$$ - -Notice how I have used the `start` argument for this example. Just like `pm.sample` `pm.sample_smc` has a number of settings, but I found the default ones good enough for simple models such as this one. - -```{code-cell} ipython3 -draws = 1000 -with pm.Model() as FN_model: - - a = pm.Gamma("a", alpha=2, beta=1) - b = pm.Normal("b", mu=0, sigma=1) - c = pm.Uniform("c", lower=0.1, upper=10) - - sigma = pm.HalfNormal("sigma", sigma=1) - - forward = th_forward_model(a, b, c) - - cov = np.eye(2) * sigma**2 - - Y_obs = pm.MvNormal("Y_obs", mu=forward, cov=cov, observed=Y_sim) - - startsmc = {v.name: np.random.uniform(1e-3, 2, size=draws) for v in FN_model.free_RVs} - - trace_FN = pm.sample_smc(draws, start=startsmc) -``` - -```{code-cell} ipython3 -az.plot_posterior(trace_FN, kind="hist", bins=30, color="seagreen"); -``` - -## Inference summary - -With `pm.SMC`, do I get similar performance to geometric MCMC samplers (see [Mark Girolami and Ben Calderhead, 2011](https://rss.onlinelibrary.wiley.com/doi/epdf/10.1111/j.1467-9868.2010.00765.x))? I think so ! - -```{code-cell} ipython3 -results = [ - az.summary(trace_FN, ["a"]), - az.summary(trace_FN, ["b"]), - az.summary(trace_FN, ["c"]), - az.summary(trace_FN, ["sigma"]), -] -results = pd.concat(results) -true_params.append(noise_sigma) -results["True values"] = pd.Series(np.array(true_params), index=results.index) -true_params.pop() -results -``` - -## Reconstruction of the phase portrait - -Its good to check that we can reconstruct the (famous) pahse portrait for this model based on the obtained samples. 
- -```{code-cell} ipython3 -params = np.array([trace_FN.get_values("a"), trace_FN.get_values("b"), trace_FN.get_values("c")]).T -params.shape -new_values = [] -for ind in range(len(params)): - ppc_sol = ode_model.simulate(params[ind]) - new_values.append(ppc_sol) -new_values = np.array(new_values) -mean_values = np.mean(new_values, axis=0) -plt.figure(figsize=(15, 7.5)) - -plt.plot( - mean_values[:, 0], - mean_values[:, 1], - color="black", - lw=4, - label="Inferred (mean of sampled) phase portrait", -) -plt.plot( - sim_data[:, 0], sim_data[:, 1], "--", color="#ff7f0e", lw=4, ms=6, label="True phase portrait" -) -plt.legend(fontsize=15) -plt.xlabel(r"$V(t)$", fontsize=15) -plt.ylabel(r"$R(t)$", fontsize=15); -``` - -# Perspectives - -### Using some other ODE models - -I have tried to keep everything as general as possible. So, my custom ODE Op, the state and VSP evaluator as well as the cached solver are not tied to a specific ODE model. Thus, to use any other ODE model one only needs to implement a `simulate_with_sensitivities` method according to their own specific ODE model. - -### Other forms of differential equation (DDE, DAE, PDE) - -I hope the two examples have elucidated the applicability of PyMC3 in regards to fitting ODE models. Although ODEs are the most fundamental constituent of a mathematical model, there are indeed other forms of dynamical systems such as a delay differential equation (DDE), a differential algebraic equation (DAE) and the partial differential equation (PDE) whose parameter estimation is equally important. The SMC and for that matter any other non-gradient sampler supported by PyMC3 can be used to fit all these forms of differential equation, of course using the `as_op`. However, just like an ODE we can solve augmented systems of DDE/DAE along with their sensitivity equations. The sensitivity equations for a DDE and a DAE can be found in this recent paper, [C Rackauckas et al., 2018](https://arxiv.org/abs/1812.01892) (Equation 9 and 10). Thus we can easily apply NUTS sampler to these models. - -### Stan already supports ODEs - -Well there are many problems where I believe SMC sampler would be more suitable than NUTS and thus its good to have that option. - -### Model selection - -Most ODE inference literature since [Vladislav Vyshemirsky and Mark Girolami, 2008](https://academic.oup.com/bioinformatics/article/24/6/833/192524) recommend the usage of Bayes factor for the purpose of model selection/comparison. This involves the calculation of the marginal likelihood which is a much more nuanced topic and I would refrain from any discussion about that. Fortunately, the SMC sampler calculates the marginal likelihood as a by product so this can be used for obtaining Bayes factors. Follow PyMC3's other tutorials for further information regarding how to obtain the marginal likelihood after running the SMC sampler. - -Since we generally frame the ODE inference as a regression problem (along with the i.i.d measurement noise assumption in most cases) we can straight away use any of the supported information criterion, such as the widely available information criterion (WAIC), irrespective of what sampler is used for inference. See the PyMC3's API for further information regarding WAIC. - -### Other AD packages - -Although this is a slight digression nonetheless I would still like to point out my observations on this issue. 
The approach that I have presented here for embedding an ODE (also extends to DDE/DAE) as a custom Op can be trivially carried forward to other AD packages such as TensorFlow and PyTorch. I had been able to use TensorFlow's [py_func](https://www.tensorflow.org/api_docs/python/tf/py_func) to build a custom TensorFlow ODE Op and then use that in the [Edward](http://edwardlib.org/) ppl. I would recommend [this](https://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html) tutorial, for writing PyTorch extensions, to those who are interested in using the [Pyro](http://pyro.ai/) ppl. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/DEMetropolisZ_EfficiencyComparison.myst.md b/myst_nbs/samplers/DEMetropolisZ_EfficiencyComparison.myst.md deleted file mode 100644 index 7ff40d5e0..000000000 --- a/myst_nbs/samplers/DEMetropolisZ_EfficiencyComparison.myst.md +++ /dev/null @@ -1,250 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# DEMetropolis(Z): Population vs. History efficiency comparison -The idea behind `DEMetropolis` is quite simple: Over time, a population of MCMC chains converges to the posterior, therefore the population can be used to inform joint proposals. -But just like the most recent positions of an entire population converges, so does the history of each individual chain. - -In [ter Braak & Vrugt, 2008](https://doi.org/10.1007/s11222-008-9104-9) this history of posterior samples is used in the "DE-MCMC-Z" variant to make proposals. - -The implementation in PyMC3 is based on `DE-MCMC-Z`, but a few details are different. Namely, each `DEMetropolisZ` chain only looks into its own history. Also we use a different tuning scheme. - -In this notebook, a D-dimenstional multivariate normal target densities are sampled with `DEMetropolis` and `DEMetropolisZ` at different $N_{chains}$ settings. - -```{code-cell} ipython3 -import pathlib -import time - -import arviz as az -import fastprogress -import ipywidgets -import numpy as np -import pandas as pd -import pymc3 as pm - -from matplotlib import cm -from matplotlib import pyplot as plt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -## Benchmarking with a D-dimensional MVNormal model -The function below constructs a fresh model for a given dimensionality and runs either `DEMetropolis` or `DEMetropolisZ` with the given settings. The resulting trace is saved with ArviZ. - -If the saved trace is already found, it is loaded from disk. - -Note that all traces are sampled with `cores=1`. This is because parallelization of `DEMetropolis` chains is slow at $O(N_{chains})$ and the comparison would be different depending on the number of available CPUs. 
- -```{code-cell} ipython3 -def get_mvnormal_model(D: int) -> pm.Model: - true_mu = np.zeros(D) - true_cov = np.eye(D) - true_cov[:5, :5] = np.array( - [ - [1, 0.5, 0, 0, 0], - [0.5, 2, 2, 0, 0], - [0, 2, 3, 0, 0], - [0, 0, 0, 4, 4], - [0, 0, 0, 4, 5], - ] - ) - - with pm.Model() as pmodel: - x = pm.MvNormal("x", mu=true_mu, cov=true_cov, shape=(D,)) - - true_samples = x.random(size=1000) - truth_id = az.data.convert_to_inference_data(true_samples[np.newaxis, :], group="random") - return pmodel, truth_id - - -def run_setting(D, N_tune, N_draws, N_chains, algorithm): - savename = f"{algorithm}_{D}_{N_tune}_{N_draws}_{N_chains}.nc" - print(f"Scenario filename: {savename}") - if not pathlib.Path(savename).exists(): - pmodel, truth_id = get_mvnormal_model(D) - with pmodel: - if algorithm == "DE-MCMC": - step = pm.DEMetropolis() - elif algorithm == "DE-MCMC-Z": - step = pm.DEMetropolisZ() - idata = pm.sample( - cores=1, - tune=N_tune, - draws=N_draws, - chains=N_chains, - step=step, - start={"x": [0] * D}, - discard_tuned_samples=False, - return_inferencedata=True, - ) - idata.to_netcdf(savename) - else: - idata = az.from_netcdf(savename) - return idata -``` - -## Running the Benchmark Scenarios -Here a variety of different scenarios is computed and the results are aggregated in a multi-indexed DataFrame. - -```{code-cell} ipython3 -df_results = pd.DataFrame(columns="algorithm,D,N_tune,N_draws,N_chains,t,idata".split(",")) -df_results = df_results.set_index("algorithm,D,N_tune,N_draws,N_chains".split(",")) - -for algorithm in {"DE-MCMC", "DE-MCMC-Z"}: - for D in (10, 20, 40): - N_tune = 10000 - N_draws = 10000 - for N_chains in (5, 10, 20, 30, 40, 80): - idata = run_setting(D, N_tune, N_draws, N_chains, algorithm) - t = idata.posterior.sampling_time - df_results.loc[(algorithm, D, N_tune, N_draws, N_chains)] = (t, idata) -``` - -```{code-cell} ipython3 -df_results[["t"]] -``` - -## Analyzing the traces -From the traces, we need to compute the absolute and relative $N_{eff}$ and the $\hat{R}$ to see if we can trust the posteriors. - -```{code-cell} ipython3 -df_temp = df_results.reset_index(["N_tune", "N_draws"]) -df_temp["N_samples"] = [row.N_draws * row.Index[2] for row in df_temp.itertuples()] -df_temp["ess"] = [ - float(az.ess(idata.posterior).x.mean()) for idata in fastprogress.progress_bar(df_temp.idata) -] -df_temp["rel_ess"] = [row.ess / (row.N_samples) for row in df_temp.itertuples()] -df_temp["r_hat"] = [ - float(az.rhat(idata.posterior).x.mean()) for idata in fastprogress.progress_bar(df_temp.idata) -] -df_temp = df_temp.sort_index(level=["algorithm", "D", "N_chains"]) -``` - -```{code-cell} ipython3 -df_temp -``` - -## Visualizing Effective Sample Size -In this diagram, we'll plot the relative effective sample size against the number of chains. - -Because our computation above ran everything with $N_{cores}=1$, we can't make a realistic comparison of effective sampling rates. 
- -```{code-cell} ipython3 -fig, right = plt.subplots(dpi=140, ncols=1, sharey="row", figsize=(12, 6)) - -for algorithm, linestyle in zip(["DE-MCMC", "DE-MCMC-Z"], ["-", "--"]): - dimensionalities = list(sorted(set(df_temp.reset_index().D)))[::-1] - N_dimensionalities = len(dimensionalities) - for d, dim in enumerate(dimensionalities): - color = cm.autumn(d / N_dimensionalities) - df = df_temp.loc[(algorithm, dim)].reset_index() - right.plot( - df.N_chains, - df.rel_ess * 100, - linestyle=linestyle, - color=color, - label=f"{algorithm}, {dim} dimensions", - ) - -right.legend() -right.set_ylabel("$S_{eff}$ [%]") -right.set_xlabel("$N_{chains}$ [-]") -right.set_ylim(0) -right.set_xlim(0) -plt.show() -``` - -## Visualizing Computation Time -As all traces were sampled with `cores=1`, we expect the computation time to grow linearly with the number of samples. - -```{code-cell} ipython3 -fig, ax = plt.subplots(dpi=140) - -for alg in ["DE-MCMC", "DE-MCMC-Z"]: - df = df_temp.sort_values("N_samples").loc[alg] - ax.scatter(df.N_samples / 1000, df.t, label=alg) -ax.legend() -ax.set_xlabel("$N_{samples} / 1000$ [-]") -ax.set_ylabel("$t_{sampling}$ [s]") -fig.tight_layout() -plt.show() -``` - -## Visualizing the Traces -By comparing DE-MCMC and DE-MCMC-Z for a setting such as D=10, $N_{chains}$=5, you can see how DE-MCMC-Z has a clear advantage over a DE-MCMC that is run with too few chains. - -```{code-cell} ipython3 -def plot_trace(algorithm, D, N_chains): - n_plot = min(10, N_chains) - fig, axs = plt.subplots(nrows=n_plot, figsize=(12, 2 * n_plot)) - idata = df_results.loc[(algorithm, D, 10000, 10000, N_chains), "idata"] - for c in range(n_plot): - samples = idata.posterior.x[c, :, 0] - axs[c].plot(samples, linewidth=0.5) - plt.show() - return - - -ipywidgets.interact_manual( - plot_trace, - algorithm=["DE-MCMC", "DE-MCMC-Z"], - D=sorted(set(df_results.reset_index().D)), - N_chains=sorted(set(df_results.reset_index().N_chains)), -); -``` - -## Inspecting the Sampler Stats -With the following widget, you can explore the sampler stats to better understand the tuning phase. - -The `tune=None` default setting of `DEMetropolisZ` is the most robust tuning strategy. However, setting `tune='lambda'` can improves the initial convergence by doing a swing-in that makes it diverge much faster than it would with a constant `lambda`. The downside of tuning `lambda` is that if the tuning is stopped too early, it can get stuck with a very inefficient `lambda`. - -Therefore, you should always inspect the `lambda` and rolling mean of `accepted` sampler stats when picking $N_{tune}$. 
- -```{code-cell} ipython3 -def plot_stat(*, sname: str = "accepted", rolling=True, algorithm, D, N_chains): - fig, ax = plt.subplots(ncols=1, figsize=(12, 7), sharey="row") - row = df_results.loc[(algorithm, D, 10000, 10000, N_chains)] - for c in df_results.idata[0].posterior.chain: - S = np.hstack( - [ - # idata.warmup_sample_stats[sname].sel(chain=c), - idata.sample_stats[sname].sel(chain=c) - ] - ) - y = pd.Series(S).rolling(window=500).mean().iloc[500 - 1 :].values if rolling else S - ax.plot(y, linewidth=0.5) - ax.set_xlabel("iteration") - ax.set_ylabel(sname) - plt.show() - return - - -ipywidgets.interact_manual( - plot_stat, - sname=set(df_results.idata[0].sample_stats.keys()), - rolling=True, - algorithm=["DE-MCMC-Z", "DE-MCMC"], - D=sorted(set(df_results.reset_index().D)), - N_chains=sorted(set(df_results.reset_index().N_chains)), -); -``` - -## Conclusion -When used with the recommended settings, `DEMetropolis` is on par with `DEMetropolisZ`. On high-dimensional problems however, `DEMetropolisZ` can achieve the same effective sample sizes with less chains. - -On problems where not enough CPUs are available to run $N_{chains}=2\cdot D$ `DEMetropolis` chains, the `DEMetropolisZ` should have much better scaling. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/DEMetropolisZ_tune_drop_fraction.myst.md b/myst_nbs/samplers/DEMetropolisZ_tune_drop_fraction.myst.md deleted file mode 100644 index fc4d227c5..000000000 --- a/myst_nbs/samplers/DEMetropolisZ_tune_drop_fraction.myst.md +++ /dev/null @@ -1,235 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# DEMetropolis(Z): tune_drop_fraction -The implementation of `DEMetropolisZ` in PyMC3 uses a different tuning scheme than described by [ter Braak & Vrugt, 2008](https://doi.org/10.1007/s11222-008-9104-9). -In our tuning scheme, the first `tune_drop_fraction * 100` % of the history from the tuning phase is dropped when the tune iterations end and sampling begins. - -In this notebook, a D-dimenstional multivariate normal target densities is sampled with `DEMetropolisZ` at different `tune_drop_fraction` settings to show why the setting was introduced. - -```{code-cell} ipython3 -import time - -import arviz as az -import ipywidgets -import numpy as np -import pandas as pd -import pymc3 as pm - -from matplotlib import cm, gridspec -from matplotlib import pyplot as plt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -## Setting up the Benchmark -We use a multivariate normal target density with some correlation in the first few dimensions. - -```{code-cell} ipython3 -def get_mvnormal_model(D: int) -> pm.Model: - true_mu = np.zeros(D) - true_cov = np.eye(D) - true_cov[:5, :5] = np.array( - [ - [1, 0.5, 0, 0, 0], - [0.5, 2, 2, 0, 0], - [0, 2, 3, 0, 0], - [0, 0, 0, 4, 4], - [0, 0, 0, 4, 5], - ] - ) - - with pm.Model() as pmodel: - x = pm.MvNormal("x", mu=true_mu, cov=true_cov, shape=(D,)) - - true_samples = x.random(size=1000) - truth_id = az.data.convert_to_inference_data(true_samples[np.newaxis, :], group="random") - return pmodel, truth_id -``` - -The problem will be 10-dimensional and we run 5 independent repetitions. 
- -```{code-cell} ipython3 -D = 10 -N_tune = 10000 -N_draws = 10000 -N_runs = 5 -pmodel, truth_id = get_mvnormal_model(D) -pmodel.logp(pmodel.test_point) -``` - -```{code-cell} ipython3 -df_results = pd.DataFrame(columns="drop_fraction,r,ess,t,idata".split(",")).set_index( - "drop_fraction,r".split(",") -) - -for drop_fraction in (0, 0.5, 0.9, 1): - for r in range(N_runs): - with pmodel: - t_start = time.time() - step = pm.DEMetropolisZ(tune="lambda", tune_drop_fraction=drop_fraction) - idata = pm.sample( - cores=6, - tune=N_tune, - draws=N_draws, - chains=1, - step=step, - start={"x": [7.0] * D}, - discard_tuned_samples=False, - return_inferencedata=True, - # the replicates (r) have different seeds, but they are comparable across - # the drop_fractions. The tuning will be identical, they'll divergen in sampling. - random_seed=2020 + r, - ) - t = time.time() - t_start - df_results.loc[(drop_fraction, r), "ess"] = float(az.ess(idata).x.mean()) - df_results.loc[(drop_fraction, r), "t"] = t - df_results.loc[(drop_fraction, r), "idata"] = idata -``` - -```{code-cell} ipython3 -df_results[["ess", "t"]] -``` - -## Visualizing the Effective Sample Sizes -Here, the mean effective sample size is plotted with standard errors. Next to it, the traces of all chains in one dimension are shown to better understand why the effective sample sizes are so different. - -```{code-cell} ipython3 -df_temp = df_results.ess.unstack("r").T - -fig = plt.figure(dpi=100, figsize=(12, 8)) -gs = gridspec.GridSpec(4, 2, width_ratios=[1, 2]) -ax_left = plt.subplot(gs[:, 0]) -ax_right_bottom = plt.subplot(gs[3, 1]) -axs_right = [ - plt.subplot(gs[0, 1], sharex=ax_right_bottom), - plt.subplot(gs[1, 1], sharex=ax_right_bottom), - plt.subplot(gs[2, 1], sharex=ax_right_bottom), - ax_right_bottom, -] -for ax in axs_right[:-1]: - plt.setp(ax.get_xticklabels(), visible=False) - -ax_left.bar( - x=df_temp.columns, - height=df_temp.mean() / N_draws * 100, - width=0.05, - yerr=df_temp.sem() / N_draws * 100, -) -ax_left.set_xlabel("tune_drop_fraction") -ax_left.set_ylabel("$S_{eff}$ [%]") - -# traceplots -for ax, drop_fraction in zip(axs_right, df_temp.columns): - ax.set_ylabel("$f_{drop}$=" + f"{drop_fraction}") - for r, idata in enumerate(df_results.loc[(drop_fraction)].idata): - # combine warmup and draw iterations into one array: - samples = np.vstack( - [idata.warmup_posterior.x.sel(chain=0).values, idata.posterior.x.sel(chain=0).values] - ) - ax.plot(samples, linewidth=0.25) - ax.axvline(N_tune, linestyle="--", linewidth=0.5, label="end of tuning") -axs_right[0].legend() - -axs_right[0].set_title(f"1-dim traces of {N_runs} independent runs") -ax_left.set_title("mean $S_{eff}$ on " + f"{D}-dimensional correlated MVNormal") -ax_right_bottom.set_xlabel("iteration") -plt.show() -``` - -## Autocorrelation -A diagnostic measure for the effect we can see above is the autocorrelation in the sampling phase. - -When the entire tuning history is dropped, the chain has to diverge from its current position back into the typical set, but without the lambda-swing-in trick, it takes much longer. 
- -```{code-cell} ipython3 -fig, axs = plt.subplots(ncols=4, figsize=(12, 3), sharey="row") -for ax, drop_fraction in zip(axs, (0, 0.5, 0.9, 1)): - az.plot_autocorr(df_results.loc[(drop_fraction, 0), "idata"].posterior.x.T, ax=ax) - ax.set_title("$f_{drop}=$" + f"{drop_fraction}") -ax.set_ylim(-0.1, 1) -ax.set_ylim() -plt.show() -``` - -## Acceptance Rate -The rolling mean over the `'accepted'` sampler stat shows that by dropping the tuning history, the acceptance rate shoots up to almost 100 %. High acceptance rates happen when the proposals are too narrow, as we can see up in the traceplot. - -```{code-cell} ipython3 -fig, ax = plt.subplots(ncols=1, figsize=(12, 7), sharey="row") - -for drop_fraction in df_temp.columns: - # combine warmup and draw iterations into one array: - idata = df_results.loc[(drop_fraction, 0), "idata"] - S = np.hstack( - [ - idata.warmup_sample_stats["accepted"].sel(chain=0), - idata.sample_stats["accepted"].sel(chain=0), - ] - ) - for c in range(idata.posterior.dims["chain"]): - ax.plot( - pd.Series(S).rolling(window=500).mean().iloc[500 - 1 :].values, - label="$f_{drop}$=" + f"{drop_fraction}", - ) -ax.set_xlabel("iteration") -ax.legend() -ax.set_ylabel("rolling mean acceptance rate (w=500)") -plt.ylim(0, 1) -plt.show() -``` - -## Inspecting the Sampler Stats -With the following widget, you can explore the sampler stats to better understand the tuning phase. - -Check out the `lambda` and rolling mean of `accepted` sampler stats to see how their interaction improves initial convergece. - -```{code-cell} ipython3 -def plot_stat(*, sname: str = "accepted", rolling=True): - fig, ax = plt.subplots(ncols=1, figsize=(12, 7), sharey="row") - f_drop_to_color = { - 1: "blue", - 0.9: "green", - 0.5: "orange", - 0: "red", - } - for row in df_results.reset_index().itertuples(): - idata = row.idata - S = np.hstack( - [idata.warmup_sample_stats[sname].sel(chain=0), idata.sample_stats[sname].sel(chain=0)] - ) - for c in range(row.idata.posterior.dims["chain"]): - y = pd.Series(S).rolling(window=500).mean().iloc[500 - 1 :].values if rolling else S - ax.plot(y, color=f_drop_to_color[row.drop_fraction], linewidth=0.5) - for f_drop, color in f_drop_to_color.items(): - ax.plot([], [], label="$f_{drop}=$" + f"{f_drop}", color=color) - ax.set_xlabel("iteration") - ax.legend() - ax.set_ylabel(sname) - return - - -ipywidgets.interact_manual( - plot_stat, sname=df_results.idata[0, 0].sample_stats.keys(), rolling=True -); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/GLM-hierarchical-jax.myst.md b/myst_nbs/samplers/GLM-hierarchical-jax.myst.md deleted file mode 100644 index edcea30cd..000000000 --- a/myst_nbs/samplers/GLM-hierarchical-jax.myst.md +++ /dev/null @@ -1,130 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -# Using JAX for faster sampling - -(c) Thomas Wiecki, 2020 - -*Note: These samplers are still experimental.* - -Using the new Theano JAX linker that Brandon Willard has developed, we can compile PyMC3 models to JAX without any change to the PyMC3 code base or any user-level code changes. The way this works is that we take our Theano graph built by PyMC3 and then translate it to JAX primitives. - -Using our Python samplers, this is still a bit slower than the C-code generated by default Theano. 
- -However, things get really interesting when we also express our samplers in JAX. Here we have used the JAX samplers by NumPyro or TFP. This combining of the samplers was done by [Junpeng Lao](https://twitter.com/junpenglao). - -The reason this is so much faster is that while before in PyMC3, only the logp evaluation was compiled while the samplers where still coded in Python, so for every loop we went back from C to Python. With this approach, the model *and* the sampler are JIT-compiled by JAX and there is no more Python overhead during the whole sampling run. This way we also get sampling on GPUs or TPUs for free. - -This NB requires the master of [Theano-PyMC](https://github.com/pymc-devs/Theano-PyMC), the [pymc3jax branch of PyMC3](https://github.com/pymc-devs/pymc3/tree/pymc3jax), as well as JAX, TFP-nightly and numpyro. - -This is all still highly experimental but extremely promising and just plain amazing. - -As an example we'll use the classic Radon hierarchical model. Note that this model is still very small, I would expect much more massive speed-ups with larger models. - -```{code-cell} ipython3 -import warnings - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import pymc3.sampling_jax -import theano - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -warnings.filterwarnings("ignore") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -data = pd.read_csv(pm.get_data("radon.csv")) -data["log_radon"] = data["log_radon"].astype(theano.config.floatX) -county_names = data.county.unique() -county_idx = data.county_code.values.astype("int32") - -n_counties = len(data.county.unique()) -``` - -Unchanged PyMC3 model specification: - -```{code-cell} ipython3 -with pm.Model() as hierarchical_model: - # Hyperpriors for group nodes - mu_a = pm.Normal("mu_a", mu=0.0, sigma=100.0) - sigma_a = pm.HalfNormal("sigma_a", 5.0) - mu_b = pm.Normal("mu_b", mu=0.0, sigma=100.0) - sigma_b = pm.HalfNormal("sigma_b", 5.0) - - # Intercept for each county, distributed around group mean mu_a - # Above we just set mu and sd to a fixed value while here we - # plug in a common group distribution for all a and b (which are - # vectors of length n_counties). - a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_counties) - # Intercept for each county, distributed around group mean mu_a - b = pm.Normal("b", mu=mu_b, sigma=sigma_b, shape=n_counties) - - # Model error - eps = pm.HalfCauchy("eps", 5.0) - - radon_est = a[county_idx] + b[county_idx] * data.floor.values - - # Data likelihood - radon_like = pm.Normal("radon_like", mu=radon_est, sigma=eps, observed=data.log_radon) -``` - -## Sampling using our old Python NUTS sampler - -```{code-cell} ipython3 -%%time -with hierarchical_model: - hierarchical_trace = pm.sample( - 2000, tune=2000, target_accept=0.9, compute_convergence_checks=False - ) -``` - -## Sampling using JAX TFP NUTS sampler - -```{code-cell} ipython3 -%%time -# Inference button (TM)! 
-with hierarchical_model: - hierarchical_trace_jax = pm.sampling_jax.sample_numpyro_nuts(2000, tune=2000, target_accept=0.9) -``` - -```{code-cell} ipython3 -print(f"Speed-up = {180 / 24}x") -``` - -```{code-cell} ipython3 -az.plot_trace( - hierarchical_trace_jax, - var_names=["mu_a", "mu_b", "sigma_a_log__", "sigma_b_log__", "eps_log__"], -); -``` - -```{code-cell} ipython3 -az.plot_trace(hierarchical_trace_jax, var_names=["a"], coords={"a_dim_0": range(5)}); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/MLDA_gravity_surveying.myst.md b/myst_nbs/samplers/MLDA_gravity_surveying.myst.md deleted file mode 100644 index 2d8f6e265..000000000 --- a/myst_nbs/samplers/MLDA_gravity_surveying.myst.md +++ /dev/null @@ -1,863 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python PyMC3 (Dev) - language: python - name: pymc3-dev-py38 ---- - -# Multilevel Gravity Survey with MLDA - -+++ - -### The MLDA sampler -This notebook is designed to demonstrate the Multi-Level Delayed Acceptance MCMC algorithm (MLDA) proposed in Dodwell (2019), as implemented within PyMC3. If you are using MLDA for the first time, we recommend first running the `MLDA_simple_linear_regression.ipynb` notebook in the same folder. - -The MLDA sampler can be more efficient than other MCMC samplers when dealing with computationally intensive problems where we have access not only to the desired (fine) posterior distribution but also to a set of approximate (coarse) posteriors of decreasing accuracy and decreasing computational cost. In simple terms, we can use multiple chains on different coarseness levels and coarser chains' samples are used as proposals for the finer chains. This has been shown to improve the effective sample size of the finest chain and this allows us to reduce the number of expensive fine-chain likelihood evaluations. - -The notebook initially defines the necessary classes that describe the model. These classes use scipy to do the numerical solve in the forward model. It then instantiates models in two levels (with different granularities) and generates data for inference. Finally, the model classes are passed to two pymc3 models using Theano Ops and inference is done using three different MCMC methods (including MLDA). Some summary results and comparison plots are shown at the end to demonstrate the results. The use of Theano Ops is common when users want to use external code to calculate their likelihood (e.g. some fast PDE solver) and this example is designed to serve as a starting point for users to employ MLDA in their own problems. - -Please note that the MLDA sampler is new in PyMC3. The user should be extra critical about the results and report any problems as issues in the pymc3's github repository. - -The notebook results shown below were generated on a MacBook Pro with a 2.6 GHz 6-core Intel Core i7, 32 GB DDR4 and macOS 10.15.4. - -### Gravity Surveying -In this notebook, we solve a 2-dimensional gravity surveying problem, adapted from the 1D problem presented in Hansen (2010). - -Our aim is to recover a two-dimensional mass distribution $f(\vec{t})$ at a known depth $d$ below the surface from measurements $g(\vec{s})$ of the vertical component of the gravitational field at the surface. 
The contribution to $g(\vec{s})$ from infinitesimally small areas of the subsurface mass distribution are given by: - -\begin{equation} - dg = \frac{\sin \theta}{r^2} f(\vec{t}) \: d\vec{t} -\end{equation} -where $\theta$ is the angle between the vertical plane and a straight line between two points $f(\vec{t})$ and $g(\vec{s})$, and $r = | \vec{s} - \vec{t} |$ is the Eucledian distance between the points. We exploit that $\sin \theta = \frac{d}{r}$, so that - -\begin{equation} - \frac{\sin \theta}{r^2} f(\vec{t}) \: d\vec{t} = \frac{d}{r^3} f(\vec{t}) \: d\vec{t} = \frac{d}{ | \vec{s} - \vec{t} |^3} f(\vec{t}) \: d\vec{t} -\end{equation} - -This yields the integral equation, - -\begin{equation} - g(\vec{s}) = \iint_T \frac{d}{ | \vec{s} - \vec{t} |^3} f(\vec{t}) \: d\vec{t} -\end{equation} - -where $T = [0,1]^2$ is the domain of the function $f(\vec{t})$. This constitutes our forward model. - -We solve this integral numerically using midpoint quadrature. For simplicity, we use the same number of quadrature points along each axis, so that in discrete form our forward model becomes - -\begin{equation} - g(\vec{s}_i) = \sum_{j=1}^{m} \omega_j \frac{d}{ | \vec{s}_i - \vec{t}_j |^3} \hat{f}(\vec{t}_j), \quad i = 1, \dots, n, \quad j = 1, \dots, m -\end{equation} - -where $\omega_j = \frac{1}{m}$ are quadrature weights, $\hat{f}(\vec{t}_j)$ is the approximate subsurface mass at quadrature points $j = 1, \dots, m$, and $g(\vec{s}_i)$ is surface measurements at collocation points $i = 1, \dots, n$. Hence when $n > m$, we are dealing with an overdetermined problem and vice versa. - -This results in a linear system $\mathbf{Ax = b}$, where -\begin{equation} - a_{ij} = \omega_j \frac{d}{ | \vec{s}_i - \vec{t}_j |^3}, \quad x_j = \hat{f}(\vec{t}_j), \quad b_i = g(\vec{s}_i). -\end{equation} -In this particular problem, the matrix $\mathbf{A}$ has a very high condition number, leading to an ill-posed inverse problem, which entails numerical instability and spurious, often oscillatory, solutions for noisy right hand sides. These types of problems are traditionally solved by way of some manner of *regularisation*, but they can be handled in a natural and elegant fashion in the context of a Bayesian inverse problem. - -### Mass Distribution as a Gaussian Random Process -We model the unknown mass distribution as a Gaussian Random Process with a Matern 5/2 covariance kernel (Rasmussen and Williams, 2006): -\begin{equation} - C_{5/2}(\vec{t}, \vec{t}') = \sigma^2 \left( 1 + \frac{\sqrt{5} | \vec{t}-\vec{t}' | }{l} + \frac{5 | \vec{t}-\vec{t}' |^2}{3l^2} \right) \exp \left( - \frac{\sqrt{5} | \vec{t}-\vec{t}' | }{l} \right) -\end{equation} -where $l$ is the covariance length scale and $\sigma^2$ is the variance. - -### Comparison -Within this notebook, a simple MLDA sampler is compared to a Metropolis and a DEMetropolisZ sampler. The example demonstrates that MLDA is more efficient than the other samplers when measured by the Effective Samples per Second they can generate from the posterior. - -### References - -Dodwell, Tim & Ketelsen, Chris & Scheichl, Robert & Teckentrup, Aretha. (2019). Multilevel Markov Chain Monte Carlo. SIAM Review. 61. 509-545. https://doi.org/10.1137/19M126966X - -Per Christian Hansen. *Discrete Inverse Problems: Insight and Algorithms*. Society for Industrial and Applied Mathematics, January 2010. - -Carl Edward Rasmussen and Christopher K. I. Williams. *Gaussian processes for machine learning*. Adaptive computation and machine learning. 2006. 
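Before defining the model classes, the following minimal, self-contained sketch (grid sizes and the test density are arbitrary choices for illustration, and the imports are repeated here on purpose) assembles the discretised forward operator $\mathbf{A}$ from the quadrature formula above and applies it to a test density. The `Gravity` classes below do the same thing with more bookkeeping.

```{code-cell} ipython3
import numpy as np

from scipy.spatial import distance_matrix

# Sketch of the midpoint-quadrature forward model g = A f described above.
m, n, d = 16, 32, 0.1  # quadrature points per axis, data points per axis, depth

# Quadrature (midpoint) and collocation grids on [0, 1]^2.
t1d = (np.arange(m) + 0.5) / m
s1d = np.linspace(0, 1, n)
TX, TY = np.meshgrid(t1d, t1d)
SX, SY = np.meshgrid(s1d, s1d)
T = np.c_[TX.ravel(), TY.ravel(), np.zeros(m**2)]     # subsurface quadrature points
S = np.c_[SX.ravel(), SY.ravel(), d * np.ones(n**2)]  # surface measurement points

w = 1.0 / m**2                                        # midpoint quadrature weights
A = w * d / distance_matrix(S, T) ** 3                # a_ij = w_j * d / |s_i - t_j|^3

f_test = np.ones(m**2)                                # arbitrary test density f(t_j)
g_test = A @ f_test                                   # surface signal b_i = g(s_i)
print(A.shape, g_test.shape)                          # (n^2, m^2) and (n^2,)
```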
- -+++ - -## Import modules - -```{code-cell} ipython3 -import os as os -import warnings - -os.environ["OPENBLAS_NUM_THREADS"] = "1" # Set environment variable - -import sys as sys -import time as time - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import theano -import theano.tensor as tt - -from numpy.linalg import inv -from scipy.interpolate import RectBivariateSpline -from scipy.linalg import eigh -from scipy.spatial import distance_matrix -``` - -```{code-cell} ipython3 -warnings.simplefilter(action="ignore", category=FutureWarning) -``` - -```{code-cell} ipython3 -RANDOM_SEED = 123446 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -# Checking versions -print(f"Theano version: {theano.__version__}") -print(f"PyMC3 version: {pm.__version__}") -``` - -## Define Matern52 kernel for modelling Gaussian Random Field -This is utility code which is necessary for defining the model later - you are free to ignore it or place it in an external file - -```{code-cell} ipython3 -class SquaredExponential: - def __init__(self, coords, mkl, lamb): - """ - This class sets up a random process - on a grid and generates - a realisation of the process, given - parameters or a random vector. - """ - - # Internalise the grid and set number of vertices. - self.coords = coords - self.n_points = self.coords.shape[0] - self.eigenvalues = None - self.eigenvectors = None - self.parameters = None - self.random_field = None - - # Set some random field parameters. - self.mkl = mkl - self.lamb = lamb - - self.assemble_covariance_matrix() - - def assemble_covariance_matrix(self): - """ - Create a snazzy distance-matrix for rapid - computation of the covariance matrix. - """ - dist = distance_matrix(self.coords, self.coords) - - # Compute the covariance between all - # points in the space. - self.cov = np.exp(-0.5 * dist**2 / self.lamb**2) - - def plot_covariance_matrix(self): - """ - Plot the covariance matrix. - """ - plt.figure(figsize=(10, 8)) - plt.imshow(self.cov, cmap="binary") - plt.colorbar() - plt.show() - - def compute_eigenpairs(self): - """ - Find eigenvalues and eigenvectors using Arnoldi iteration. - """ - eigvals, eigvecs = eigh(self.cov, eigvals=(self.n_points - self.mkl, self.n_points - 1)) - - order = np.flip(np.argsort(eigvals)) - self.eigenvalues = eigvals[order] - self.eigenvectors = eigvecs[:, order] - - def generate(self, parameters=None): - """ - Generate a random field, see - Scarth, C., Adhikari, S., Cabral, P. H., - Silva, G. H. C., & Prado, A. P. do. (2019). - Random field simulation over curved surfaces: - Applications to computational structural mechanics. - Computer Methods in Applied Mechanics and Engineering, - 345, 283–301. https://doi.org/10.1016/j.cma.2018.10.026 - """ - - if parameters is None: - self.parameters = np.random.normal(size=self.mkl) - else: - self.parameters = np.array(parameters).flatten() - - self.random_field = np.linalg.multi_dot( - (self.eigenvectors, np.sqrt(np.diag(self.eigenvalues)), self.parameters) - ) - - def plot(self, lognormal=True): - """ - Plot the random field. 
- """ - - if lognormal: - random_field = self.random_field - contour_levels = np.linspace(min(random_field), max(random_field), 20) - else: - random_field = np.exp(self.random_field) - contour_levels = np.linspace(min(random_field), max(random_field), 20) - - plt.figure(figsize=(12, 10)) - plt.tricontourf( - self.coords[:, 0], - self.coords[:, 1], - random_field, - levels=contour_levels, - cmap="plasma", - ) - plt.colorbar() - plt.show() - - -class Matern52(SquaredExponential): - def assemble_covariance_matrix(self): - """ - This class inherits from RandomProcess and creates a Matern 5/2 covariance matrix. - """ - - # Compute scaled distances. - dist = np.sqrt(5) * distance_matrix(self.coords, self.coords) / self.lamb - - # Set up Matern 5/2 covariance matrix. - self.cov = (1 + dist + dist**2 / 3) * np.exp(-dist) -``` - -## Define the Gravity model and generate data -This is a bit lengthy due to the model used in this case, it contains class definitions and also instantiation of class objects and data generation. - -```{code-cell} ipython3 -# Set the model parameters. -depth = 0.1 -n_quad = 64 -n_data = 64 - -# noise level -noise_level = 0.02 - -# Set random process parameters. -lamb = 0.1 -mkl = 14 - -# Set the quadrature degree for each model level (coarsest first) -n_quadrature = [16, 64] -``` - -```{code-cell} ipython3 -class Gravity: - """ - Gravity is a class that implements a simple gravity surveying problem, - as described in Hansen, P. C. (2010). Discrete Inverse Problems: Insight and Algorithms. - Society for Industrial and Applied Mathematics. - It uses midpoint quadrature to evaluate a Fredholm integral of the first kind. - """ - - def __init__(self, f_function, depth, n_quad, n_data): - - # Set the function describing the distribution of subsurface density. - self.f_function = f_function - - # Set the depth of the density (distance to the surface measurements). - self.depth = depth - - # Set the quadrature degree along one dimension. - self.n_quad = n_quad - - # Set the number of data points along one dimension - self.n_data = n_data - - # Set the quadrature points. - x = np.linspace(0, 1, self.n_quad + 1) - self.tx = (x[1:] + x[:-1]) / 2 - y = np.linspace(0, 1, self.n_quad + 1) - self.ty = (y[1:] + y[:-1]) / 2 - TX, TY = np.meshgrid(self.tx, self.ty) - - # Set the measurement points. - self.sx = np.linspace(0, 1, self.n_data) - self.sy = np.linspace(0, 1, self.n_data) - SX, SY = np.meshgrid(self.sx, self.sy) - - # Create coordinate vectors. - self.T_coords = np.c_[TX.ravel(), TY.ravel(), np.zeros(self.n_quad**2)] - self.S_coords = np.c_[SX.ravel(), SY.ravel(), self.depth * np.ones(self.n_data**2)] - - # Set the quadrature weights. - self.w = 1 / self.n_quad**2 - - # Compute a distance matrix - dist = distance_matrix(self.S_coords, self.T_coords) - - # Create the Fremholm kernel. - self.K = self.w * self.depth / dist**3 - - # Evaluate the density function on the quadrature points. - self.f = self.f_function(TX, TY).flatten() - - # Compute the surface density (noiseless measurements) - self.g = np.dot(self.K, self.f) - - def plot_model(self): - - # Plot the density and the signal. 
- fig, axes = plt.subplots(1, 2, figsize=(16, 6)) - axes[0].set_title("Density") - f = axes[0].imshow( - self.f.reshape(self.n_quad, self.n_quad), - extent=(0, 1, 0, 1), - origin="lower", - cmap="plasma", - ) - fig.colorbar(f, ax=axes[0]) - axes[1].set_title("Signal") - g = axes[1].imshow( - self.g.reshape(self.n_data, self.n_data), - extent=(0, 1, 0, 1), - origin="lower", - cmap="plasma", - ) - fig.colorbar(g, ax=axes[1]) - plt.show() - - def plot_kernel(self): - - # Plot the kernel. - plt.figure(figsize=(8, 6)) - plt.imshow(self.K, cmap="plasma") - plt.colorbar() - plt.show() -``` - -```{code-cell} ipython3 -# This is a function describing the subsurface density. -def f(TX, TY): - f = np.sin(np.pi * TX) + np.sin(3 * np.pi * TY) + TY + 1 - f = f / f.max() - return f -``` - -```{code-cell} ipython3 -# Initialise a model -model_true = Gravity(f, depth, n_quad, n_data) -``` - -```{code-cell} ipython3 -model_true.plot_model() -``` - -```{code-cell} ipython3 -# Add noise to the data. -np.random.seed(123) -noise = np.random.normal(0, noise_level, n_data**2) -data = model_true.g + noise - -# Plot the density and the signal. -fig, axes = plt.subplots(1, 2, figsize=(16, 6)) -axes[0].set_title("Noiseless Signal") -g = axes[0].imshow( - model_true.g.reshape(n_data, n_data), - extent=(0, 1, 0, 1), - origin="lower", - cmap="plasma", -) -fig.colorbar(g, ax=axes[0]) -axes[1].set_title("Noisy Signal") -d = axes[1].imshow(data.reshape(n_data, n_data), extent=(0, 1, 0, 1), origin="lower", cmap="plasma") -fig.colorbar(d, ax=axes[1]) -plt.show() -``` - -```{code-cell} ipython3 -class Gravity_Forward(Gravity): - """ - Gravity forward is a class that implements the gravity problem, - but computation of signal and density is delayed to the "solve" - method, since it relied on a Gaussian Random Field to model - the (unknown) density. - """ - - def __init__(self, depth, n_quad, n_data): - - # Set the depth of the density (distance to the surface measurements). - self.depth = depth - - # Set the quadrature degree along one axis. - self.n_quad = n_quad - - # Set the number of data points along one axis. - self.n_data = n_data - - # Set the quadrature points. - x = np.linspace(0, 1, self.n_quad + 1) - self.tx = (x[1:] + x[:-1]) / 2 - y = np.linspace(0, 1, self.n_quad + 1) - self.ty = (y[1:] + y[:-1]) / 2 - TX, TY = np.meshgrid(self.tx, self.ty) - - # Set the measurement points. - self.sx = np.linspace(0, 1, self.n_data) - self.sy = np.linspace(0, 1, self.n_data) - SX, SY = np.meshgrid(self.sx, self.sy) - - # Create coordinate vectors. - self.T_coords = np.c_[TX.ravel(), TY.ravel(), np.zeros(self.n_quad**2)] - self.S_coords = np.c_[SX.ravel(), SY.ravel(), self.depth * np.ones(self.n_data**2)] - - # Set the quadrature weights. - self.w = 1 / self.n_quad**2 - - # Compute a distance matrix - dist = distance_matrix(self.S_coords, self.T_coords) - - # Create the Fremholm kernel. - self.K = self.w * self.depth / dist**3 - - def set_random_process(self, random_process, lamb, mkl): - - # Set the number of KL modes. - self.mkl = mkl - - # Initialise a random process on the quadrature points. - # and compute the eigenpairs of the covariance matrix, - self.random_process = random_process(self.T_coords, self.mkl, lamb) - self.random_process.compute_eigenpairs() - - def solve(self, parameters): - - # Internalise the Random Field parameters - self.parameters = parameters - - # Create a realisation of the random process, given the parameters. 
- self.random_process.generate(self.parameters) - mean = 0.0 - stdev = 1.0 - - # Set the density. - self.f = mean + stdev * self.random_process.random_field - - # Compute the signal. - self.g = np.dot(self.K, self.f) - - def get_data(self): - - # Get the data vector. - return self.g -``` - -```{code-cell} ipython3 -# We project the eigenmodes of the fine model to the quadrature points -# of the coarse model using linear interpolation. -def project_eigenmodes(model_coarse, model_fine): - model_coarse.random_process.eigenvalues = model_fine.random_process.eigenvalues - for i in range(model_coarse.mkl): - interpolator = RectBivariateSpline( - model_fine.tx, - model_fine.ty, - model_fine.random_process.eigenvectors[:, i].reshape( - model_fine.n_quad, model_fine.n_quad - ), - ) - model_coarse.random_process.eigenvectors[:, i] = interpolator( - model_coarse.tx, model_coarse.ty - ).ravel() -``` - -```{code-cell} ipython3 -# Initialise the models, according the quadrature degree. -my_models = [] -for i, n_quad in enumerate(n_quadrature): - my_models.append(Gravity_Forward(depth, n_quad, n_data)) - my_models[i].set_random_process(Matern52, lamb, mkl) - -# Project the eigenmodes of the fine model to the coarse model. -for m in my_models[:-1]: - project_eigenmodes(m, my_models[-1]) -``` - -## Solve and plot models to demonstrate coarse/fine difference - -```{code-cell} ipython3 -# Plot the same random realisation for each level, and the corresponding signal, -# to validate that the levels are equivalents. -for i, m in enumerate(my_models): - print(f"Level {i}:") - np.random.seed(2) - m.solve(np.random.normal(size=mkl)) - m.plot_model() -``` - -```{code-cell} ipython3 -plt.title(f"Largest {mkl} KL eigenvalues of GP prior") -plt.plot(my_models[-1].random_process.eigenvalues) -plt.show() -``` - -## Compare computation cost of coarse and fine model solve -The bigger the difference in time, the more MLDA has potential to increase efficiency - -```{code-cell} ipython3 -%%timeit -my_models[0].solve(np.random.normal(size=mkl)) -``` - -```{code-cell} ipython3 -%%timeit -my_models[-1].solve(np.random.normal(size=mkl)) -``` - -## Set MCMC parameters for inference - -```{code-cell} ipython3 -# Number of draws from the distribution -ndraws = 15000 - -# Number of burn-in samples -nburn = 10000 - -# MLDA and Metropolis tuning parameters -tune = True -tune_interval = 100 # Set high to prevent tuning. -discard_tuning = True - -# Number of independent chains. -nchains = 3 - -# Subsampling rate for MLDA -nsub = 5 - -# Set prior parameters for multivariate Gaussian prior distribution. -mu_prior = np.zeros(mkl) -cov_prior = np.eye(mkl) - -# Set the sigma for inference. -sigma = 1.0 - -# Sampling seed -sampling_seed = RANDOM_SEED -``` - -## Define a Theano Op for the likelihood -This creates the theano op needed to pass the above model to the PyMC3 sampler - -```{code-cell} ipython3 -def my_loglik(my_model, theta, data, sigma): - """ - This returns the log-likelihood of my_model given theta, - datapoints, the observed data and sigma. It uses the - model_wrapper function to do a model solve. - """ - my_model.solve(theta) - output = my_model.get_data() - return -(0.5 / sigma**2) * np.sum((output - data) ** 2) - - -class LogLike(tt.Op): - """ - Theano Op that wraps the log-likelihood computation, necessary to - pass "black-box" code into pymc3. 
- Based on the work in: - https://docs.pymc.io/notebooks/blackbox_external_likelihood.html - https://docs.pymc.io/Advanced_usage_of_Theano_in_PyMC3.html - """ - - # Specify what type of object will be passed and returned to the Op when it is - # called. In our case we will be passing it a vector of values (the parameters - # that define our model and a model object) and returning a single "scalar" - # value (the log-likelihood) - itypes = [tt.dvector] # expects a vector of parameter values when called - otypes = [tt.dscalar] # outputs a single scalar value (the log likelihood) - - def __init__(self, my_model, loglike, data, sigma): - """ - Initialise the Op with various things that our log-likelihood function - requires. Below are the things that are needed in this particular - example. - - Parameters - ---------- - my_model: - A Model object (defined in model.py) that contains the parameters - and functions of out model. - loglike: - The log-likelihood function we've defined, in this example it is - my_loglik. - data: - The "observed" data that our log-likelihood function takes in. These - are the true data generated by the finest model in this example. - x: - The dependent variable (aka 'x') that our model requires. This is - the datapoints in this example. - sigma: - The noise standard deviation that our function requires. - """ - # add inputs as class attributes - self.my_model = my_model - self.likelihood = loglike - self.data = data - self.sigma = sigma - - def perform(self, node, inputs, outputs): - # the method that is used when calling the Op - theta = inputs # this will contain my variables - - # call the log-likelihood function - logl = self.likelihood(self.my_model, theta, self.data, self.sigma) - - outputs[0][0] = np.array(logl) # output the log-likelihood -``` - -```{code-cell} ipython3 -# create Theano Ops to wrap likelihoods of all model levels and store them in list -logl = [] -for i, m_i in enumerate(my_models): - logl.append(LogLike(m_i, my_loglik, data, sigma)) -``` - -## Create coarse model in PyMC3 - -```{code-cell} ipython3 -# Set up models in pymc3 for each level - excluding finest model level -coarse_models = [] -for j in range(len(my_models) - 1): - with pm.Model() as model: - - # Multivariate normal prior. - theta = pm.MvNormal("theta", mu=mu_prior, cov=cov_prior, shape=mkl) - - # Use the Potential class to evaluate likelihood - pm.Potential("likelihood", logl[j](theta)) - - coarse_models.append(model) -``` - -## Create fine model and perform inference -Note that we sample using all three methods and that we use the MAP as the starting point for sampling - -```{code-cell} ipython3 -# Set up finest model and perform inference with PyMC3, using the MLDA algorithm -# and passing the coarse_models list created above. -method_names = [] -traces = [] -runtimes = [] - -with pm.Model() as model: - - # Multivariate normal prior. 
- theta = pm.MvNormal("theta", mu=mu_prior, cov=cov_prior, shape=mkl) - - # Use the Potential class to evaluate likelihood - pm.Potential("likelihood", logl[-1](theta)) - - # Find the MAP estimate which is used as the starting point for sampling - MAP = pm.find_MAP() - - # Initialise a Metropolis, DEMetropolisZ and MLDA step method objects (passing the subsampling rate and - # coarse models list for the latter) - step_metropolis = pm.Metropolis(tune=tune, tune_interval=tune_interval) - step_demetropolisz = pm.DEMetropolisZ(tune_interval=tune_interval) - step_mlda = pm.MLDA( - coarse_models=coarse_models, subsampling_rates=nsub, base_tune_interval=tune_interval - ) - - # Inference! - # Metropolis - t_start = time.time() - method_names.append("Metropolis") - traces.append( - pm.sample( - draws=ndraws, - step=step_metropolis, - chains=nchains, - tune=nburn, - discard_tuned_samples=discard_tuning, - random_seed=sampling_seed, - start=MAP, - cores=1, - mp_ctx="forkserver", - ) - ) - runtimes.append(time.time() - t_start) - - # DEMetropolisZ - t_start = time.time() - method_names.append("DEMetropolisZ") - traces.append( - pm.sample( - draws=ndraws, - step=step_demetropolisz, - chains=nchains, - tune=nburn, - discard_tuned_samples=discard_tuning, - random_seed=sampling_seed, - start=MAP, - cores=1, - mp_ctx="forkserver", - ) - ) - runtimes.append(time.time() - t_start) - - # MLDA - t_start = time.time() - method_names.append("MLDA") - traces.append( - pm.sample( - draws=ndraws, - step=step_mlda, - chains=nchains, - tune=nburn, - discard_tuned_samples=discard_tuning, - random_seed=sampling_seed, - start=MAP, - cores=1, - mp_ctx="forkserver", - ) - ) - runtimes.append(time.time() - t_start) -``` - -## Get post-sampling stats and diagnostics - -+++ - -#### Print MAP estimate and pymc3 sampling summary - -```{code-cell} ipython3 -with model: - print( - f"\nDetailed summaries and plots:\nMAP estimate: {MAP['theta']}. Not used as starting point." - ) - for i, trace in enumerate(traces): - print(f"\n{method_names[i]} Sampler:\n") - display(az.summary(trace)) -``` - -#### Show ESS and ESS/sec for all samplers - -```{code-cell} ipython3 -acc = [] -ess = [] -ess_n = [] -performances = [] - -# Get some more statistics. -with model: - for i, trace in enumerate(traces): - acc.append(trace.get_sampler_stats("accepted").mean()) - ess.append(np.array(az.ess(trace).to_array())) - ess_n.append(ess[i] / len(trace) / trace.nchains) - performances.append(ess[i] / runtimes[i]) - print( - f"\n{method_names[i]} Sampler: {len(trace)} drawn samples in each of " - f"{trace.nchains} chains." - f"\nRuntime: {runtimes[i]} seconds" - f"\nAcceptance rate: {acc[i]}" - f"\nESS list: {np.round(ess[i][0], 3)}" - f"\nNormalised ESS list: {np.round(ess_n[i][0], 3)}" - f"\nESS/sec: {np.round(performances[i][0], 3)}" - ) - - # Plot the effective sample size (ESS) and relative ESS (ES/sec) of each of the sampling strategies. 
- colors = ["firebrick", "darkgoldenrod", "darkcyan", "olivedrab"] - - fig, axes = plt.subplots(1, 2, figsize=(16, 5)) - - axes[0].set_title("ESS") - for i, e in enumerate(ess): - axes[0].bar( - [j + i * 0.2 for j in range(mkl)], - e.ravel(), - width=0.2, - color=colors[i], - label=method_names[i], - ) - axes[0].set_xticks([i + 0.3 for i in range(mkl)]) - axes[0].set_xticklabels([f"theta_{i}" for i in range(mkl)]) - axes[0].legend() - - axes[1].set_title("ES/sec") - for i, p in enumerate(performances): - axes[1].bar( - [j + i * 0.2 for j in range(mkl)], - p.ravel(), - width=0.2, - color=colors[i], - label=method_names[i], - ) - axes[1].set_xticks([i + 0.3 for i in range(mkl)]) - axes[1].set_xticklabels([f"theta_{i}" for i in range(mkl)]) - axes[1].legend() - plt.show() -``` - -#### Plot distributions and trace. -Vertical grey lines represent the MAP estimate of each parameter. - -```{code-cell} ipython3 -with model: - lines = (("theta", {}, MAP["theta"].tolist()),) - for i, trace in enumerate(traces): - az.plot_trace(trace, lines=lines) - - # Ugly hack to get some titles in. - x_offset = -0.1 * ndraws - y_offset = trace.get_values("theta").max() + 0.25 * ( - trace.get_values("theta").max() - trace.get_values("theta").min() - ) - plt.text(x_offset, y_offset, "{} Sampler".format(method_names[i])) -``` - -#### Plot true and recovered densities -This is useful for verification, i.e. to compare the true model density and signal to the estimated ones from the samplers. - -```{code-cell} ipython3 -print("True Model") -model_true.plot_model() -with model: - print("MAP estimate:") - my_models[-1].solve(MAP["theta"]) - my_models[-1].plot_model() - for i, t in enumerate(traces): - print(f"Recovered by: {method_names[i]}") - my_models[-1].solve(az.summary(t)["mean"].values) - my_models[-1].plot_model() -``` - -```{code-cell} ipython3 -# Show trace of lowest energy mode for Metropolis sampler -plt.figure(figsize=(8, 3)) -plt.plot(traces[0]["theta"][:5000, -1]) -plt.show() -``` - -```{code-cell} ipython3 -# Show trace of lowest energy mode for MLDA sampler -plt.figure(figsize=(8, 3)) -plt.plot(traces[2]["theta"][:5000:, -1]) -plt.show() -``` - -```{code-cell} ipython3 -# Make sure samplers have converged -assert all(az.rhat(traces[0]) < 1.03) -assert all(az.rhat(traces[1]) < 1.03) -assert all(az.rhat(traces[2]) < 1.03) -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/MLDA_introduction.myst.md b/myst_nbs/samplers/MLDA_introduction.myst.md deleted file mode 100644 index df197e3e9..000000000 --- a/myst_nbs/samplers/MLDA_introduction.myst.md +++ /dev/null @@ -1,70 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# MLDA sampler: Introduction and resources - -This notebook contains an introduction to the Multi-Level Delayed Acceptance MCMC algorithm (MLDA) proposed in [1]. It explains the main idea behind the method, gives an overview of the problems it is good for and points to specific notebooks with examples of how to use it within PyMC. - -[1] Dodwell, Tim & Ketelsen, Chris & Scheichl, Robert & Teckentrup, Aretha. (2019). Multilevel Markov Chain Monte Carlo. SIAM Review. 61. 509-545. 
https://doi.org/10.1137/19M126966X - -+++ - -### The MLDA sampler - -The MLDA sampler is designed to deal with computationally intensive problems where we have access not only to the desired (fine) posterior distribution but also to a set of approximate (coarse) posteriors of decreasing accuracy and decreasing computational cost (we need at least one of those). - -Its main idea is to run multiple chains, where each chain samples from a different version of the posterior (the lower the level, the coarser the posterior). Each chain generates a number of samples and the last of them is passed as a proposal to the chain one level up. The latter accepts or rejects the sample and then gives back control to the first chain. This strategy is applied recursively so that each chain uses the chain below as a source of proposed samples. - -MLDA improves the effective sample size of the finest chain compared to standard samplers (e.g. Metropolis) and this allows us to reduce the number of expensive fine-chain likelihood evaluations while still exploring the posterior adequately. Note that the bottom level sampler is a standard MCMC sampler like Metropolis or DEMetropolisZ. - -+++ - -### Problems it is good for - -In many real-world problems, we use models to represent spatially or temporally varying quantities. We often have the ability to modify the spatial and/or temporal granularity of the models. For example, when representing a high-resolution image, we can use a coarse 64x64 grid but we can also use a much finer 512x512 grid, which is more accurate and more computationally demanding when working with it. In those cases it is often possible to apply multilevel modeling to infer unknown quantities in the model. In multilevel modeling, a hierarchy of models of increasing accuracy/resolution and increasing computational cost are used together to perform inference more efficiently than doing inference using the finest model only. - -Example applications include inverse problems for physical, natural or other systems, e.g. subsurface fluid transportation, predator-prey models in ecology, impedance imaging, ultrasound imaging, emission tomography, flow field of ocean circulation. - -In many of those inverse problems, evaluating the Bayesian likelihood requires solving a partial differential equation (PDE) numerically. Doing this on a fine-resolution model can be orders of magnitude slower than doing it on a coarse-resolution model. This is the ideal scenario for multilevel modeling and MLDA as MLDA allows us to get away with only calculating a fraction of the expensive fine-resolution likelihoods in exchange for many cheap coarse-resolution likelihoods. - -+++ - -### PyMC implementation - -MLDA is one of the MCMC inference methods available in PyMC. You can instantiate an MLDA sampler using the `pm.MLDA(coarse_models=...)`, where you need to pass at least one coarse model within a list. - -The PyMC implementation of MLDA supports any number of levels, tuning parameterization for the bottom-level sampler, separate subsampling rates for each level, choice between blocked and compound sampling for the bottom-level sampler, two types of bottom-level samplers (Metropolis, DEMetropolisZ), adaptive error correction and variance reduction. - -For more details about the MLDA sampler and the way it should be used and parameterised, the user can refer to the docstrings in the code and to the other example notebooks (links below) which deal with more complex problem settings and more advanced MLDA features. 
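To make that API concrete, here is a minimal sketch of an MLDA run (toy data and arbitrary priors, purely for illustration; the notebooks linked below show realistic use). The coarse posterior here is simply built from a thinned copy of the data.

```{code-cell} ipython3
import numpy as np
import pymc as pm

# Toy data for a straight-line fit.
x = np.linspace(0, 1, 200)
y = 1.0 + 2.0 * x + np.random.normal(0, 0.1, size=x.size)

# A cheaper, approximate posterior (here: every 2nd data point).
with pm.Model() as coarse_model:
    a = pm.Normal("a", 0, sigma=10)
    b = pm.Normal("b", 0, sigma=10)
    pm.Normal("y", mu=a + b * x[::2], sigma=0.1, observed=y[::2])

# The posterior we actually want to sample from.
with pm.Model() as fine_model:
    a = pm.Normal("a", 0, sigma=10)
    b = pm.Normal("b", 0, sigma=10)
    pm.Normal("y", mu=a + b * x, sigma=0.1, observed=y)

with fine_model:
    step = pm.MLDA(coarse_models=[coarse_model], subsampling_rates=[5])
    trace = pm.sample(draws=2000, tune=1000, step=step)
```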
- -Please note that the MLDA sampler is new in PyMC. The user should be extra critical about the results and report any problems as issues in the PyMC GitHub repository. - -+++ - -### Notebooks with example code - - -[Simple linear regression](./MLDA_simple_linear_regression.ipynb): This notebook demonstrates the workflow for using MLDA within PyMC. It employs a very simple toy model. - -[Gravity surveying](./MLDA_gravity_surveying.ipynb): In this notebook, we use MLDA to solve a 2-dimensional gravity surveying inverse problem. Evaluating the likelihood requires solving a PDE, which we do using [scipy](https://www.scipy.org/). We also compare the performance of MLDA with other PyMC samplers (Metropolis, DEMetropolisZ). - -[Variance reduction 1](./MLDA_variance_reduction_linear_regression.ipynb) and [Variance reduction 2](https://github.com/alan-turing-institute/pymc/blob/mlda_all_notebooks/docs/source/notebooks/MLDA_variance_reduction_groundwater.ipynb) (external link): These two notebooks demonstrate the variance reduction feature in a linear regression model and a groundwater flow model. This feature allows the user to define a quantity of interest that they need to estimate using the MCMC samples. It then collects those quantities of interest, as well as differences of these quantities between levels, during MLDA sampling. The collected quantities can then be used to produce an estimate which has lower variance than a standard estimate that uses samples from the fine chain only. The first notebook does not have external dependencies, while the second one requires FEniCS. Note that the second notebook is outside the core PyMC repository because FEniCS is not a PyMC dependency. - -[Adaptive error model](https://github.com/alan-turing-institute/pymc/blob/mlda_all_notebooks/docs/source/notebooks/MLDA_adaptive_error_model.ipynb) (external link): In this notebook we use MLDA to tackle another inverse problem: groundwater flow modeling. The aim is to infer the posterior distribution of model parameters (hydraulic conductivity) given data (measurements of hydraulic head). In this example we make use of Aesara Ops in order to define a "black box" likelihood, i.e. a likelihood that uses external code. Specifically, our likelihood uses the [FEniCS](https://fenicsproject.org/) library to solve a PDE. This is a common scenario, as PDEs of this type are slow to solve with scipy or other standard libraries. Note that this notebook is outside the core PyMC repository because FEniCS is not a PyMC dependency. We employ the adaptive error model (AEM) feature and compare the performance of basic MLDA with AEM-enhanced MLDA. The idea of the Adaptive Error Model (AEM) is to estimate the mean and variance of the forward-model error between adjacent levels, i.e. to estimate the bias of the coarse forward model compared to the fine forward model, and to use those estimates to correct the coarse model. Using the technique should improve ESS/sec on the fine level. - -[Benchmarks and tuning](https://github.com/alan-turing-institute/pymc/blob/mlda_all_notebooks/docs/source/notebooks/MLDA_benchmarks_tuning.ipynb) (external link): In this notebook we benchmark MLDA against other samplers using different parameterizations of the groundwater flow model. We also give some advice on tuning MLDA. Note that this notebook is outside the core PyMC repository because FEniCS is not a PyMC dependency.
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/MLDA_simple_linear_regression.myst.md b/myst_nbs/samplers/MLDA_simple_linear_regression.myst.md deleted file mode 100644 index 0e59afa29..000000000 --- a/myst_nbs/samplers/MLDA_simple_linear_regression.myst.md +++ /dev/null @@ -1,211 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC Dev) - language: python - name: pymc-dev ---- - -### The MLDA sampler -This notebook is a good starting point to understand the basic usage of the Multi-Level Delayed Acceptance MCMC algorithm (MLDA) proposed in [1], as implemented within PyMC. - -It uses a simple linear regression model (and a toy coarse model counterpart) to show the basic workflow when using MLDA. The model is similar to the one used in https://docs.pymc.io/notebooks/GLM-linear.html. - -The MLDA sampler is designed to deal with computationally intensive problems where we have access not only to the desired (fine) posterior distribution but also to a set of approximate (coarse) posteriors of decreasing accuracy and decreasing computational cost (we need at least one of those). Its main idea is that coarser chains' samples are used as proposals for the finer chains. A coarse chain runs for a fixed number of iterations and the last sample is used as a proposal for the finer chain. This has been shown to improve the effective sample size of the finest chain and this allows us to reduce the number of expensive fine-chain likelihood evaluations. - -The PyMC implementation supports: -- Any number of levels -- Two types of bottom-level samplers (Metropolis, DEMetropolisZ) -- Various tuning parameters for the bottom-level samplers -- Separate subsampling rates for each level -- A choice between blocked and compound sampling for bottom-level Metropolis. -- An adaptive error model to correct bias between coarse and fine models -- A variance reduction technique that utilizes samples from all chains to reduce the variance of an estimated quantity of interest. - -For more details about the MLDA sampler and the way it should be used and parameterised, the user can refer to the docstrings in the code and to the other example notebooks which deal with more complex problem settings and more advanced MLDA features. - -Please note that the MLDA sampler is new in PyMC. The user should be extra critical about the results and report any problems as issues in the PyMC's github repository. - -[1] Dodwell, Tim & Ketelsen, Chris & Scheichl, Robert & Teckentrup, Aretha. (2019). Multilevel Markov Chain Monte Carlo. SIAM Review. 61. 509-545. https://doi.org/10.1137/19M126966X - -+++ - -### Work flow - -MLDA is used in a similar way as most step method in PyMC. It has the special requirement that the user need to provide at least one coarse model to allow it to work. - -The basic flow to use MLDA consists of four steps, which we demonstrate here using a simple linear regression model with a toy coarse model counterpart. - -+++ - -##### Step 1: Generate some data - -Here, we generate a vector `x` of 200 points equally spaced between 0.0 and 1.0. Then we project those onto a straight line with intercept 1.0 and slope 2.0, adding some random noise, resulting in a vector `y`. The goal is to infer the intercept and slope from `x` and `y`, i.e. a very simple linear regression problem. 
- -```{code-cell} ipython3 -# Import libraries -import time as time -import warnings - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -warnings.filterwarnings("ignore") -``` - -```{code-cell} ipython3 -# Generate data -RANDOM_SEED = 915623497 -np.random.seed(RANDOM_SEED) - -true_intercept = 1 -true_slope = 2 -sigma = 1 - -size = 200 -x = np.linspace(0, 1, size) -y = true_intercept + true_slope * x + np.random.normal(0, sigma**2, size) -``` - -```{code-cell} ipython3 -# Plot the data -fig = plt.figure(figsize=(7, 7)) -ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model") -ax.plot(x, y, "x", label="sampled data") -ax.plot(x, true_intercept + true_slope * x, label="true regression line", lw=2.0) -plt.legend(loc=0); -``` - -##### Step 2: Define the fine model - -In this step we use the PyMC model definition language to define the priors and the likelihood. We choose non-informative Normal priors for both intercept and slope and a Normal likelihood, where we feed in `x` and `y`. - -```{code-cell} ipython3 -# Constructing the fine model -with pm.Model() as fine_model: - # Define priors - intercept = pm.Normal("intercept", 0, sigma=20) - slope = pm.Normal("slope", 0, sigma=20) - - # Define likelihood - likelihood = pm.Normal("y", mu=intercept + slope * x, sigma=sigma, observed=y) -``` - -##### Step 3: Define a coarse model - -Here, we define a toy coarse model where coarseness is introduced by using fewer data in the likelihood compared to the fine model, i.e. we only use every 2nd data point from the original data set. - -```{code-cell} ipython3 -# Thinning the data set -x_coarse = x[::2] -y_coarse = y[::2] -``` - -```{code-cell} ipython3 -# Constructing the coarse model -with pm.Model() as coarse_model: - # Define priors - intercept = pm.Normal("intercept", 0, sigma=20) - slope = pm.Normal("slope", 0, sigma=20) - - # Define likelihood - likelihood = pm.Normal("y", mu=intercept + slope * x_coarse, sigma=sigma, observed=y_coarse) -``` - -##### Step 4: Draw MCMC samples from the posterior using MLDA - -We feed `coarse_model` to the MLDA instance and we also set `subsampling_rate` to 10. The subsampling rate is the number of samples drawn in the coarse chain to construct a proposal for the fine chain. In this case, MLDA draws 10 samples in the coarse chain and uses the last one as a proposal for the fine chain. This is accepted or rejected by the fine chain and then control goes back to the coarse chain which generates another 10 samples, etc. Note that `pm.MLDA` has many other tuning arguments which can be found in the documentation. - -Next, we use the universal `pm.sample` method, passing the MLDA instance to it. This runs MLDA and returns a `trace`, containing all MCMC samples and various by-products. Here, we also run standard Metropolis and DEMetropolisZ samplers for comparison, which return separate traces. We time the runs to compare later. - -Finally, PyMC provides various functions to visualise the trace and print summary statistics (two of them are shown below). 
- -```{code-cell} ipython3 -with fine_model: - # Initialise step methods - step = pm.MLDA(coarse_models=[coarse_model], subsampling_rates=[10]) - step_2 = pm.Metropolis() - step_3 = pm.DEMetropolisZ() - - # Sample using MLDA - t_start = time.time() - trace = pm.sample(draws=6000, chains=4, tune=2000, step=step, random_seed=RANDOM_SEED) - runtime = time.time() - t_start - - # Sample using Metropolis - t_start = time.time() - trace_2 = pm.sample(draws=6000, chains=4, tune=2000, step=step_2, random_seed=RANDOM_SEED) - runtime_2 = time.time() - t_start - - # Sample using DEMetropolisZ - t_start = time.time() - trace_3 = pm.sample(draws=6000, chains=4, tune=2000, step=step_3, random_seed=RANDOM_SEED) - runtime_3 = time.time() - t_start -``` - -```{code-cell} ipython3 -# Trace plots -az.plot_trace(trace) -az.plot_trace(trace_2) -az.plot_trace(trace_3) -``` - -```{code-cell} ipython3 -# Summary statistics for MLDA -az.summary(trace) -``` - -```{code-cell} ipython3 -# Summary statistics for Metropolis -az.summary(trace_2) -``` - -```{code-cell} ipython3 -# Summary statistics for DEMetropolisZ -az.summary(trace_3) -``` - -```{code-cell} ipython3 -# Make sure samplers have converged -assert all(az.rhat(trace) < 1.03) -assert all(az.rhat(trace_2) < 1.03) -assert all(az.rhat(trace_3) < 1.03) -``` - -```{code-cell} ipython3 -# Display runtimes -print(f"Runtimes: MLDA: {runtime}, Metropolis: {runtime_2}, DEMetropolisZ: {runtime_3}") -``` - -##### Comments - -**Performance:** - -You can see from the summary statistics above that MLDA's ESS is ~13x higher than Metropolis and ~2.5x higher than DEMetropolisZ. The runtime of MLDA is ~3.5x larger than either Metropolis or DEMetropolisZ. Therefore in this toy example MLDA is almost an overkill (especially compared to DEMetropolisZ). For more complex problems, where the difference in computational cost between the coarse and fine models/likelihoods is orders of magnitude, MLDA is expected to outperform the other two samplers, as long as the coarse model is reasonably close to the fine one. This case is often encountered in inverse problems in engineering, ecology, imaging, etc where a forward model can be defined with varying coarseness in space and/or time (e.g. subsurface water flow, predator prey models, etc). For an example of this, please see the `MLDA_gravity_surveying.ipynb notebook` in the same folder. - -**Subsampling rate:** - -The MLDA sampler is based on the assumption that the coarse proposal samples (i.e. the samples proposed from the coarse chain to the fine one) are independent (or almost independent) from each other. In order to generate independent samples, it is necessary to run the coarse chain for an adequate number of iterations to get rid of autocorrelation. Therefore, the higher the autocorrelation in the coarse chain, the more iterations are needed and the larger the subsampling rate should be. - -Values larger than the minimum for beating autocorreletion can further improve the proposal (as the distribution is explored better and the proposal are imptoved), and thus ESS. But at the same time more steps cost more computationally. Users are encouraged to do test runs with different subsampling rates to understand which gives the best ESS/sec. - -Note that in cases where you have more than one coarse model/level, MLDA allows you to choose a different subsampling rate for each coarse level (as a list of integers when you instantiate the stepper). 
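As an illustration of that, here is a sketch (not executed as part of this notebook) that adds a hypothetical third level built from every 4th data point. The coarse models are listed coarsest first, as in the gravity surveying notebook, and the `subsampling_rates` values below are arbitrary placeholders.

```{code-cell} ipython3
# Sketch only: a hypothetical even coarser level using every 4th data point.
with pm.Model() as coarsest_model:
    intercept = pm.Normal("intercept", 0, sigma=20)
    slope = pm.Normal("slope", 0, sigma=20)
    pm.Normal("y", mu=intercept + slope * x[::4], sigma=sigma, observed=y[::4])

with fine_model:
    # One subsampling rate per coarse level, assumed here to follow the same
    # coarsest-to-finest ordering as `coarse_models`.
    step_multi = pm.MLDA(
        coarse_models=[coarsest_model, coarse_model],
        subsampling_rates=[20, 10],
    )
    # trace_multi = pm.sample(draws=6000, chains=4, tune=2000, step=step_multi, random_seed=RANDOM_SEED)
```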
- -```{code-cell} ipython3 -# Show packages' and Python's versions -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/MLDA_variance_reduction_linear_regression.myst.md b/myst_nbs/samplers/MLDA_variance_reduction_linear_regression.myst.md deleted file mode 100644 index 99e5cf7f2..000000000 --- a/myst_nbs/samplers/MLDA_variance_reduction_linear_regression.myst.md +++ /dev/null @@ -1,443 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python PyMC3 (Dev) - language: python - name: pymc3-dev-py38 ---- - -# Variance reduction in MLDA - Linear regression - -+++ - -MLDA is based on the idea of running multiple chains which sample from approximations of the true posterior (where the approximation normally becomes coarser when going from the top level to the bottom level). Due to this characteristic, MLDA draws MCMC samples from all those levels. These samples, apart from improving the mixing of the top-level chain can serve another purpose; we can use them to apply a variance reduction technique when estimating a quantity of interest from the drawn samples. - -In this example, we demonstrate this technique using a linear model example similar to the `MLDA_simple_linear_regression.ipynb` notebook in the same folder. - -#### Typical quantity of interest estimation in MCMC -Specifically, here we are interested in cases where we have a forward model $F$ which is a function of an unknown vector of random variables $\theta$, i.e. $F = F(\theta)$. $F$ is a model of some physical process or phenomenon and $\theta$ is usually a set of unknown parameters in the model. We want to estimate a quantity of interest $Q$ which depends on the forward model $F$, i.e. $Q = Q(F(\theta))$. In order to do that, we draw samples from the posterior of $P(\theta | D)$, where $D$ are our data, and we use the samples to construct an estimator $E_P[Q] = {1\over N} \Sigma_{n}Q(F(\theta_n))$ where $\theta_n$ is the $n-th$ sample drawn from the posterior $P$ using MCMC. - -In this notebook, where we work with a linear regression model, we can use simply one of the values in the theta vector or the mean of all y outputs of the model. - -#### Quantity of interest estimation using variance reduction in MLDA -In a usual MCMC algorithm we would sample from the posterior and use the samples to get the estimate above. In MLDA, we have the extra advantage that we do not only draw samples from the correct/fine posterior $P$; we also draw samples from approximations of it. We can use those samples to reduce the variance of the estimator of $Q$ (and thus require fewer samples to achieve the same variance). - -The technique we use is similar to the idea of a telescopic sum. Instead of estimating $Q$ directly, we estimate differences of $Q$-estimates between levels and add those differences (i.e. we estimate the correction with respect to the next lower level). - -Specifically, we have a set of approximate forward models $F_l$ and posteriors $P_l, l \in \{0,1,...,L-1\}$, where $L$ is the number of levels in MLDA, $F_{L-1} = F$ and $P_{L-1} = P$. MLDA in level $l$ produces the samples $\theta_{1:N_l}^l$ from posterior $P_l$, where $N_l$ is the number of samples at that level (each level generates a different number of samples, with $N_l$ decreasing with $l$). This also results in the quantity of interest functions $Q_l = Q(F_l(\theta))$ for each level $l$ (where $\theta$ indexes are omitted. 
We use the following equation to estimate the quantity of interest (by combining the above functions): -$E_{VR}[Q] = E_{P_0}[Q_0] + \Sigma_{l=1}^{L-1} (E_{P_l}[Q_l] - E_{P_{l-1}}[Q_{l-1}])$. - -The first term on the right-hand side can be estimated using the samples from level 0. The second term on the right-hand side, which contains all the differences, is estimated using the following process: In level $l$, and for each sample $\theta_n^l$ in that level where $n \in \{1,...,N_l\}$, we use the sample $\theta_{s+R}^{l-1}$ from level $l-1$, which is a random sample in the block of $K$ samples generated in level $l-1$ to propose a sample for level $l$, where $s$ is the starting sample of the block. In other words $K$ is the subsampling rate at level $l$ and $R$ is the index of the randomly selected sample ($R$ can range from 1 to $K$). Having this sample, we calculate the following quantity: $Y_n^l = Q_l(F_l(\theta_n^l)) - Q_{l-1}(F_{l-1}(\theta_{s+R}^{l-1}))$. We do the same thing for all $N_l$ samples in level $l$ and finally use them to calculate $E_{P_l}[Q_l] - E_{P_{l-1}}[Q_{l-1}] = {1 \over N_l} \Sigma Y_n^l$. We do the same to estimate the remaining differences and add them all together to get $E_{VR}[Q]$. - -#### Note on asymptotic variance results -$E_{VR}[Q]$ is shown to have asymptotically lower variance than $E_P[Q]$ in [1], as long as the subsampling rate $K$ in level $l$ is larger than the MCMC autocorrelation length in level $l-1$ (and if this is true for all levels). When this condition does not hold, we still see reasonably good variance reduction in experiments, although there is no theoretical guarantee of asymptotically lower variance. Users are advised to do pre-runs to detect the autocorrelation length of all chains in MLDA and then set the subsampling rates accordingly. - -#### Using variance reduction in PyMC3 -The code in this notebook demonstrates how the user can employ the variance reduction technique within the PyMC3 implementation of MLDA. We run two samplers, one with VR and one without, and calculate the resulting variances in the estimates. - -In order to use variance reduction, the user needs to pass the argument `variance_reduction=True` when instantiating the MLDA stepper. Also, they need to do two things when defining the PyMC3 model: -- Include a `pm.Data()` variable with the name `Q` in the model description of all levels, as shown in the code. -- Use a Theano Op to calculate the forward model (or the combination of a forward model and a likelihood). This Op should have a `perform()` method which (in addition to all the other calculations) calculates the quantity of interest and stores it to the variable `Q` of the PyMC3 model, using the `set_value()` function. An example is shown in the code. - -By doing the above, the user provides MLDA with the quantity of interest in each MCMC step. MLDA then internally stores and manages the values and returns all the terms necessary to calculate $E_{VR}[Q]$ (i.e. all $Q_0$ values and all $Y_n^l$ differences/corrections) within the stats of the generated trace. The user can extract them using the `get_sampler_stats()` function of the trace object, as shown at the end of the notebook. - - -### Dependencies -The code has been developed and tested with Python 3.6. You will need to have pymc3 installed and also [FEniCS](https://fenicsproject.org/) for your system.
FEniCS is a popular, open-source, [well documented](https://fenicsproject.org/documentation/), high-performance computing framework for solving Partial Differential Equations. FEniCS can be [installed](https://fenicsproject.org/download/) either through their prebuilt Docker images, from their Ubuntu PPA, or from Anaconda. - - -### References -[1] Dodwell, Tim & Ketelsen, Chris & Scheichl, Robert & Teckentrup, Aretha. (2019). Multilevel Markov Chain Monte Carlo. SIAM Review. 61. 509-545. https://doi.org/10.1137/19M126966X - -+++ - -### Import modules - -```{code-cell} ipython3 -import os as os -import sys as sys -import time as time - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt - -from matplotlib.ticker import ScalarFormatter -``` - -```{code-cell} ipython3 -RANDOM_SEED = 4555 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -### Set parameters and generate data using a linear model - -```{code-cell} ipython3 -# set up the model and data -np.random.seed(RANDOM_SEED) -size = 100 -true_intercept = 1 -true_slope = 2 -sigma = 0.2 -x = np.linspace(0, 1, size) - -# y = a + b*x -true_regression_line = true_intercept + true_slope * x - -# add noise -y = true_regression_line + np.random.normal(0, sigma**2, size) -s = sigma - -# reduced datasets -# We use fewer data in the coarse models compared to the fine model in order to make them less accurate -x_coarse_0 = x[::3] -y_coarse_0 = y[::3] -x_coarse_1 = x[::2] -y_coarse_1 = y[::2] - -# MCMC parameters -ndraws = 3000 -ntune = 1000 -nsub = 5 -nchains = 2 -``` - -```{code-cell} ipython3 -# Plot the data -fig = plt.figure(figsize=(7, 7)) -ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model") -ax.plot(x, y, "x", label="sampled data") -ax.plot(x, true_regression_line, label="true regression line", lw=2.0) -plt.legend(loc=0); -``` - -### Create a theano op that implements the likelihood -In order to use variance reduction, the user needs to define a Theano Op that calculates the forward model -(or both the forward model and the likelihood). -Also, this Op needs to save the quantity of interest to a model variable with the name `Q`. -Here we use a Theano Op that contains both the forward model (i.e. the linear equation in this case) and the likelihood calculation. The quantity of interest is calculated with the perform() function and it is the mean of linear predictions given theta from all data points. - -```{code-cell} ipython3 -class Likelihood(tt.Op): - # Specify what type of object will be passed and returned to the Op when it is - # called. In our case we will be passing it a vector of values (the parameters - # that define our model) and returning a scalar (likelihood) - itypes = [tt.dvector] - otypes = [tt.dscalar] - - def __init__(self, x, y, pymc3_model): - """ - Initialise the Op with various things that our likelihood requires. - Parameters - ---------- - x: - The x points. - y: - The y points. - pymc3_model: - The pymc3 model. 
- """ - self.x = x - self.y = y - self.pymc3_model = pymc3_model - - def perform(self, node, inputs, outputs): - intercept = inputs[0][0] - x_coeff = inputs[0][1] - - # this uses the linear model to calculate outputs - temp = intercept + x_coeff * self.x - # this saves the quantity of interest to the pymc3 model variable Q - self.pymc3_model.Q.set_value(temp.mean()) - # this calculates the likelihood value - outputs[0][0] = np.array(-(0.5 / s**2) * np.sum((temp - self.y) ** 2)) -``` - -### Define the coarse models -Here we create the coarse models for MLDA. -We need to include a `pm.Data()` variable `Q` in each one of those models, instantiated at `0.0`. These variables are set during sampling when the Op code under `perform()` runs. - -```{code-cell} ipython3 -mout = [] -coarse_models = [] - -# Set up models in pymc3 for each level - excluding finest model level -# Level 0 (coarsest) -with pm.Model() as coarse_model_0: - # A variable Q has to be defined if you want to use the variance reduction feature - # Q can be of any dimension - here it a scalar - Q = pm.Data("Q", np.float(0.0)) - - # Define priors - intercept = pm.Normal("Intercept", 0, sigma=20) - x_coeff = pm.Normal("x", 0, sigma=20) - - # convert thetas to a tensor vector - theta = tt.as_tensor_variable([intercept, x_coeff]) - - # Here we instantiate a Likelihood object using the class defined above - # and we add to the mout list. We pass the coarse data x_coarse_0 and y_coarse_0 - # and the coarse pymc3 model coarse_model_0. This creates a coarse likelihood. - mout.append(Likelihood(x_coarse_0, y_coarse_0, coarse_model_0)) - - # This uses the likelihood object to define the likelihood of the model, given theta - pm.Potential("likelihood", mout[0](theta)) - - coarse_models.append(coarse_model_0) - -# Level 1 -with pm.Model() as coarse_model_1: - # A variable Q has to be defined if you want to use the variance reduction feature - # Q can be of any dimension - here it a scalar - Q = pm.Data("Q", np.float64(0.0)) - - # Define priors - intercept = pm.Normal("Intercept", 0, sigma=20) - x_coeff = pm.Normal("x", 0, sigma=20) - - # convert thetas to a tensor vector - theta = tt.as_tensor_variable([intercept, x_coeff]) - - # Here we instantiate a Likelihood object using the class defined above - # and we add to the mout list. We pass the coarse data x_coarse_1 and y_coarse_1 - # and the coarse pymc3 model coarse_model_1. This creates a coarse likelihood. - mout.append(Likelihood(x_coarse_1, y_coarse_1, coarse_model_1)) - - # This uses the likelihood object to define the likelihood of the model, given theta - pm.Potential("likelihood", mout[1](theta)) - - coarse_models.append(coarse_model_1) -``` - -### Define the fine model and sample -Here we define the fine (i.e. correct) model and sample from it using MLDA (with and without variance reduction). -Note that `Q` is used here too. - -We create two MLDA samplers, one with VR activated and one without. - -```{code-cell} ipython3 -with pm.Model() as model: - # A variable Q has to be defined if you want to use the variance reduction feature - # Q can be of any dimension - here it a scalar - Q = pm.Data("Q", np.float64(0.0)) - - # Define priors - intercept = pm.Normal("Intercept", 0, sigma=20) - x_coeff = pm.Normal("x", 0, sigma=20) - - # convert thetas to a tensor vector - theta = tt.as_tensor_variable([intercept, x_coeff]) - - # Here we instantiate a Likelihood object using the class defined above - # and we add to the mout list. 
We pass the fine data x and y -# and the fine pymc3 model model. This creates a fine likelihood. - mout.append(Likelihood(x, y, model)) - - # This uses the likelihood object to define the likelihood of the model, given theta - pm.Potential("likelihood", mout[-1](theta)) - - # MLDA with variance reduction - step_with = pm.MLDA( - coarse_models=coarse_models, subsampling_rates=nsub, variance_reduction=True - ) - - # MLDA without variance reduction - step_without = pm.MLDA( - coarse_models=coarse_models, - subsampling_rates=nsub, - variance_reduction=False, - store_Q_fine=True, - ) - - # sample - trace1 = pm.sample( - draws=ndraws, - step=step_with, - chains=nchains, - tune=ntune, - discard_tuned_samples=True, - random_seed=RANDOM_SEED, - cores=1, - ) - - trace2 = pm.sample( - draws=ndraws, - step=step_without, - chains=nchains, - tune=ntune, - discard_tuned_samples=True, - random_seed=RANDOM_SEED, - cores=1, - ) -``` - -### Show stats summary - -```{code-cell} ipython3 -with model: - trace1_az = az.from_pymc3(trace1) -az.summary(trace1_az) -``` - -```{code-cell} ipython3 -with model: - trace2_az = az.from_pymc3(trace2) -az.summary(trace2_az) -``` - -### Show traceplots - -```{code-cell} ipython3 -az.plot_trace(trace1_az) -``` - -```{code-cell} ipython3 -az.plot_trace(trace2_az) -``` - -### Estimate standard error of two methods -Compare standard error of Q estimation between: -- Standard approach: Using only Q values from the fine chain (Q_2) - samples from MLDA without VR -- Collapsing sum (VR) approach: Using Q values from the coarsest chain (Q_0), plus all estimates of differences between levels (in this case Q_1_0 and Q_2_1) - samples from MLDA with VR - -+++ - -#### 0) Convenience function for quantity of interest estimate -The easiest way to extract the quantity of interest expectation and standard error for the collapsing sum (VR) approach directly from the trace is to use the `extract_Q_estimate(...)` function as shown here. - -In the remaining part of the notebook we demonstrate the extraction in detail without using this convenience function. - -```{code-cell} ipython3 -Q_expectation, Q_SE = pm.step_methods.mlda.extract_Q_estimate(trace=trace1, levels=3) -print(Q_expectation, Q_SE) -``` - -#### 1) Extract quantities of interest from the traces -This requires some reshaping with numpy. Note that we append the samples from all chains into one long array.
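-For reference, with the settings used above (`nchains=2`, `ndraws=3000`, `nsub=5`) the reshapes below expect $2 \times 3000 \times 5^2 = 150000$ values of `Q_0`, $2 \times 3000 \times 5 = 30000$ values of `Q_1_0`, and $2 \times 3000 = 6000$ values each of `Q_2_1` and `Q_2`: every level above the coarsest one stores a factor of `nsub` fewer quantities, because its chain takes that many fewer steps.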
- -```{code-cell} ipython3 -# MLDA without VR -Q_2 = trace2.get_sampler_stats("Q_2").reshape((1, nchains * ndraws)) - -# MLDA with VR -Q_0 = np.concatenate(trace1.get_sampler_stats("Q_0")).reshape((1, nchains * ndraws * nsub * nsub)) -Q_1_0 = np.concatenate(trace1.get_sampler_stats("Q_1_0")).reshape((1, nchains * ndraws * nsub)) -Q_2_1 = np.concatenate(trace1.get_sampler_stats("Q_2_1")).reshape((1, nchains * ndraws)) - -# Estimates -Q_mean_standard = Q_2.mean() -Q_mean_vr = Q_0.mean() + Q_1_0.mean() + Q_2_1.mean() - -print(f"Q_0 mean = {Q_0.mean()}") -print(f"Q_1_0 mean = {Q_1_0.mean()}") -print(f"Q_2_1 mean = {Q_2_1.mean()}") -print(f"Q_2 mean = {Q_2.mean()}") -print(f"Standard method: Mean: {Q_mean_standard}") -print(f"VR method: Mean: {Q_mean_vr}") -``` - -#### Calculate variances of Q quantity samples -This shows that the variances of the differences is orders of magnitude smaller than the variance of any of the chains - -```{code-cell} ipython3 -Q_2.var() -``` - -```{code-cell} ipython3 -Q_0.var() -``` - -```{code-cell} ipython3 -Q_1_0.var() -``` - -```{code-cell} ipython3 -Q_2_1.var() -``` - -#### Calculate standard error of each term using ESS - -```{code-cell} ipython3 -ess_Q0 = az.ess(np.array(Q_0, np.float64)) -ess_Q_1_0 = az.ess(np.array(Q_1_0, np.float64)) -ess_Q_2_1 = az.ess(np.array(Q_2_1, np.float64)) -ess_Q2 = az.ess(np.array(Q_2, np.float64)) -``` - -```{code-cell} ipython3 -# note that the chain in level 2 has much fewer samples than the chain in level 0 (because of the subsampling rates) -print(ess_Q2, ess_Q0, ess_Q_1_0, ess_Q_2_1) -``` - -Standard errors are estimated by $Var(Q) \over ESS(Q)$. -It is clear that the differences have standard errors much lower than levels 0 and 2 - -```{code-cell} ipython3 -Q_2.var() / ess_Q2 -``` - -```{code-cell} ipython3 -Q_0.var() / ess_Q0 -``` - -```{code-cell} ipython3 -Q_1_0.var() / ess_Q_1_0 -``` - -```{code-cell} ipython3 -Q_2_1.var() / ess_Q_2_1 -``` - -#### Calculate total standard errors of the two competing estimates with different chunks of the sample -The graph shows how the errors decay when we collect more samples, demonstrating the gains of the VR technique in terms of standard error reduction. 
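-Concretely, the two competing estimates plotted below combine these terms as
-
-$$SE_{\mathrm{standard}} = \sqrt{\frac{Var(Q_2)}{ESS(Q_2)}}, \qquad SE_{\mathrm{VR}} = \sqrt{\frac{Var(Q_0)}{ESS(Q_0)} + \frac{Var(Q_{1,0})}{ESS(Q_{1,0})} + \frac{Var(Q_{2,1})}{ESS(Q_{2,1})}},$$
-
-where $Q_{1,0}$ and $Q_{2,1}$ are the between-level correction terms stored as `Q_1_0` and `Q_2_1`, and each term is evaluated on successively larger chunks of its chain (the chunk sizes are scaled by the subsampling rate, so that all terms correspond to the same point of the run).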
- -```{code-cell} ipython3 -step = 100 - -Q2_SE = np.zeros(int(nchains * ndraws / step)) -Q0_SE = np.zeros(int(nchains * ndraws / step)) -Q_1_0_SE = np.zeros(int(nchains * ndraws / step)) -Q_2_1_SE = np.zeros(int(nchains * ndraws / step)) -E_standard_SE = np.zeros(int(nchains * ndraws / step)) -E_VR_SE = np.zeros(int(nchains * ndraws / step)) -k = 0 - -for i in np.arange(step, nchains * ndraws + 1, step): - Q2_SE[k] = Q_2[0, 0:i].var() / az.ess(np.array(Q_2[0, 0:i], np.float64)) - Q0_SE[k] = Q_0[0, 0 : i * (nsub**2)].var() / az.ess( - np.array(Q_0[0, 0 : i * (nsub**2)], np.float64) - ) - Q_1_0_SE[k] = Q_1_0[0, 0 : i * nsub].var() / az.ess( - np.array(Q_1_0[0, 0 : i * nsub], np.float64) - ) - Q_2_1_SE[k] = Q_2_1[0, 0:i].var() / az.ess(np.array(Q_2_1[0, 0:i], np.float64)) - E_standard_SE[k] = np.sqrt(Q2_SE[k]) - E_VR_SE[k] = np.sqrt(Q0_SE[k] + Q_1_0_SE[k] + Q_2_1_SE[k]) - k += 1 - -fig = plt.figure() -ax = fig.gca() - -for axis in [ax.yaxis]: - axis.set_major_formatter(ScalarFormatter()) - -ax.plot(np.arange(step, nchains * ndraws + 1, step), E_standard_SE) -ax.plot(np.arange(step, nchains * ndraws + 1, step), E_VR_SE) -plt.xlabel("Samples drawn", fontsize=18) -plt.ylabel("Standard error", fontsize=18) -ax.legend(["Standard estimator", "Variance reduction estimator"]) -plt.show() -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/samplers/SMC-ABC_Lotka-Volterra_example.myst.md b/myst_nbs/samplers/SMC-ABC_Lotka-Volterra_example.myst.md deleted file mode 100644 index 236ce5e3b..000000000 --- a/myst_nbs/samplers/SMC-ABC_Lotka-Volterra_example.myst.md +++ /dev/null @@ -1,226 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.7 ('base') - language: python - name: python3 ---- - -(ABC_introduction)= -# Approximate Bayesian Computation -:::{post} May 31, 2022 -:tags: SMC, ABC -:category: beginner, explanation -::: - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -%load_ext watermark -az.style.use("arviz-darkgrid") -``` - -# Sequential Monte Carlo - Approximate Bayesian Computation - -+++ - -Approximate Bayesian Computation methods (also called likelihood-free inference methods) are a group of techniques developed for inferring posterior distributions in cases where the likelihood function is intractable or costly to evaluate. This does not mean that the likelihood function is not part of the analysis; it is just that we are approximating it, hence the name of the ABC methods. - -ABC is useful when modeling complex phenomena in certain fields of study, like systems biology. Such models often contain unobservable random quantities, which make the likelihood function hard to specify, but data can be simulated from the model. - -These methods follow a general form: - -1- Sample a parameter $\theta^*$ from a prior/proposal distribution $\pi(\theta)$. - -2- Simulate a data set $y^*$ using a function that takes $\theta$ and returns a data set of the same dimensions as the observed data set $y_0$ (simulator). - -3- Compare the simulated dataset $y^*$ with the experimental data set $y_0$ using a distance function $d$ and a tolerance threshold $\epsilon$ (a minimal sketch of these three steps is shown below).
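-Here is that sketch: a NumPy-only rejection-ABC loop for a toy problem, estimating the mean of Gaussian data. The variable names, the choice of the sample mean as summary statistic and the tolerance are illustrative only; this is not how PyMC implements ABC.
-
-```{code-cell} ipython3
-# Toy rejection ABC (sketch only): estimate the mean of Gaussian data.
-rng_abc = np.random.default_rng(27)
-y_0 = rng_abc.normal(0.5, 1.0, size=200)  # "observed" data
-
-epsilon_toy = 0.1
-accepted = []
-while len(accepted) < 500:
-    theta = rng_abc.normal(0, 5)  # 1. sample from the prior
-    y_star = rng_abc.normal(theta, 1.0, size=200)  # 2. simulate a data set
-    if abs(y_star.mean() - y_0.mean()) < epsilon_toy:  # 3. keep theta if d(y_0, y*) <= epsilon
-        accepted.append(theta)
-
-print(f"accepted {len(accepted)} draws, posterior mean approx. {np.mean(accepted):.2f}")
-```
-
-Using the sample mean here is already an example of comparing summary statistics rather than full datasets, which is the point made next.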
- -In some cases a distance function is computed between two summary statistics $d(S(y_0), S(y^*))$, avoiding the issue of computing distances for entire datasets. - -As a result we obtain a sample of parameters from a distribution $\pi(\theta | d(y_0, y^*) \leqslant \epsilon)$. - -If $\epsilon$ is sufficiently small this distribution will be a good approximation of the posterior distribution $\pi(\theta | y_0)$. - -+++ - -[Sequential Monte Carlo](https://docs.pymc.io/notebooks/SMC2_gaussians.html?highlight=smc) ABC is a method that iteratively morphs the prior into a posterior by propagating the sampled parameters through a series of proposal distributions $\phi(\theta^{(i)})$, weighting the accepted parameters $\theta^{(i)}$ like: - -$$ w^{(i)} \propto \frac{\pi(\theta^{(i)})}{\phi(\theta^{(i)})} $$ - -It combines the advantages of traditional SMC, i.e. the ability to sample from distributions with multiple peaks, but without the need for evaluating the likelihood function. - - -_(Lintusaari, 2016), (Toni, T., 2008), (Nuñez, Prangle, 2015)_ - -+++ - -# Good old Gaussian fit - -To illustrate how to use ABC within PyMC3 we are going to start with a very simple example estimating the mean and standard deviation of Gaussian data. - -```{code-cell} ipython3 -data = np.random.normal(loc=0, scale=1, size=1000) -``` - -Clearly, under normal circumstances, using a Gaussian likelihood will do the job very well. But that would defeat the purpose of this example; the notebook would end here and everything would be very boring. So, instead of that, we are going to define a simulator. A very straightforward simulator for normal data is a pseudo-random number generator; in real life our simulator will most likely be something fancier. - -```{code-cell} ipython3 -def normal_sim(rng, a, b, size=1000): - return rng.normal(a, b, size=size) -``` - -Defining an ABC model in PyMC3 is, in general, very similar to defining other PyMC3 models. The two important differences are: we need to define a `Simulator` _distribution_ and we need to use `sample_smc` with `kernel="ABC"`. The `Simulator` works as a generic interface to pass the synthetic data generating function (_normal_sim_ in this example), its parameters, the observed data and optionally a distance function and a summary statistic. In the following code we are using the default distance, `gaussian_kernel`, and the `sort` summary statistic. As the name suggests, `sort` sorts the data before computing the distance. - -Finally, SMC-ABC offers the option to store the simulated data. This can be handy, as simulators can be expensive to evaluate and we may want to use the simulated data, for example, for posterior predictive checks. - -```{code-cell} ipython3 -with pm.Model() as example: - a = pm.Normal("a", mu=0, sigma=5) - b = pm.HalfNormal("b", sigma=1) - s = pm.Simulator("s", normal_sim, params=(a, b), sum_stat="sort", epsilon=1, observed=data) - - idata = pm.sample_smc() - idata.extend(pm.sample_posterior_predictive(idata)) -``` - -Judging by `plot_trace` the sampler did its job very well, which is not surprising given this is a very simple model. Anyway, it is always reassuring to look at a flat rank plot :-) - -```{code-cell} ipython3 -az.plot_trace(idata, kind="rank_vlines"); -``` - -```{code-cell} ipython3 -az.summary(idata, kind="stats") -``` - -The posterior predictive check shows that we have an overall good fit, but the synthetic data has heavier tails than the observed one. You may want to decrease the value of epsilon, and see if you can get a tighter fit.
- -```{code-cell} ipython3 -az.plot_ppc(idata, num_pp_samples=500); -``` - -## Lotka–Volterra - -The Lotka-Volterra model is a well-known biological model describing how the numbers of individuals of two species change when there is a predator/prey interaction (A Biologist’s Guide to Mathematical Modeling in Ecology and Evolution, Otto and Day, 2007). For example, rabbits and foxes. Given an initial population number for each species, the integration of this system of ordinary differential equations (ODEs) describes curves for the progression of both populations. The system takes four parameters: - -* a is the natural growth rate of rabbits, when there's no fox. -* b is the natural dying rate of rabbits, due to predation. -* c is the natural dying rate of foxes, when there are no rabbits. -* d is the factor describing how many caught rabbits create a new fox. - -Notice that there is nothing intrinsically special about SMC-ABC and ODEs. In principle a simulator can be any piece of code able to generate fake data given a set of parameters. - -```{code-cell} ipython3 -from scipy.integrate import odeint - -# Definition of parameters -a = 1.0 -b = 0.1 -c = 1.5 -d = 0.75 - -# initial population of rabbits and foxes -X0 = [10.0, 5.0] -# size of data -size = 100 -# time lapse -time = 15 -t = np.linspace(0, time, size) - -# Lotka - Volterra equation -def dX_dt(X, t, a, b, c, d): - """Return the growth rate of fox and rabbit populations.""" - - return np.array([a * X[0] - b * X[0] * X[1], -c * X[1] + d * b * X[0] * X[1]]) - - -# simulator function -def competition_model(rng, a, b, size=None): - return odeint(dX_dt, y0=X0, t=t, rtol=0.01, args=(a, b, c, d)) -``` - -Using the simulator function, we will obtain a dataset with some noise added, to be used as observed data. - -```{code-cell} ipython3 -# function for generating noisy data to be used as observed data. -def add_noise(a, b): - noise = np.random.normal(size=(size, 2)) - simulated = competition_model(None, a, b) + noise - return simulated -``` - -```{code-cell} ipython3 -# plotting observed data. -observed = add_noise(a, b) -_, ax = plt.subplots(figsize=(12, 4)) -ax.plot(observed[:, 0], "x", label="prey") -ax.plot(observed[:, 1], "x", label="predator") -ax.set_xlabel("time") -ax.set_ylabel("population") -ax.set_title("Observed data") -ax.legend(); -``` - -As with the first example, instead of specifying a likelihood function, we use `pm.Simulator()`.
- -```{code-cell} ipython3 -with pm.Model() as model_lv: - a = pm.HalfNormal("a", 1.0) - b = pm.HalfNormal("b", 1.0) - - sim = pm.Simulator("sim", competition_model, params=(a, b), epsilon=10, observed=observed) - - idata_lv = pm.sample_smc() -``` - -```{code-cell} ipython3 -az.plot_trace(idata_lv, kind="rank_vlines"); -``` - -```{code-cell} ipython3 -az.plot_posterior(idata_lv); -``` - -```{code-cell} ipython3 -# plot results -_, ax = plt.subplots(figsize=(14, 6)) -posterior = idata_lv.posterior.stack(samples=("draw", "chain")) -ax.plot(observed[:, 0], "o", label="prey", c="C0", mec="k") -ax.plot(observed[:, 1], "o", label="predator", c="C1", mec="k") -ax.plot(competition_model(None, posterior["a"].mean(), posterior["b"].mean()), linewidth=3) -for i in np.random.randint(0, size, 75): - sim = competition_model(None, posterior["a"][i], posterior["b"][i]) - ax.plot(sim[:, 0], alpha=0.1, c="C0") - ax.plot(sim[:, 1], alpha=0.1, c="C1") -ax.set_xlabel("time") -ax.set_ylabel("population") -ax.legend(); -``` - -## References - -:::{bibliography} -:filter: docname in docnames - -martin2021bayesian -::: - -```{code-cell} ipython3 -%watermark -n -u -v -iv -w -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/samplers/SMC2_gaussians.myst.md b/myst_nbs/samplers/SMC2_gaussians.myst.md deleted file mode 100644 index 8d9b02910..000000000 --- a/myst_nbs/samplers/SMC2_gaussians.myst.md +++ /dev/null @@ -1,213 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.7 ('base') - language: python - name: python3 ---- - -# Sequential Monte Carlo - -:::{post} Oct 19, 2021 -:tags: SMC -:category: beginner -::: - -```{code-cell} ipython3 -import aesara.tensor as at -import arviz as az -import numpy as np -import pymc as pm - -print(f"Running on PyMC v{pm.__version__}") -``` - -```{code-cell} ipython3 -az.style.use("arviz-darkgrid") -``` - -Sampling from distributions with multiple peaks with standard MCMC methods can be difficult, if not impossible, as the Markov chain often gets stuck in either of the minima. A Sequential Monte Carlo sampler (SMC) is a way to ameliorate this problem. - -As there are many SMC flavors, in this notebook we will focus on the version implemented in PyMC. - -SMC combines several statistical ideas, including [importance sampling](https://en.wikipedia.org/wiki/Importance_sampling), tempering and MCMC. By tempering we mean the use of an auxiliary _temperature_ parameter to control the sampling process. To see how tempering can help let's write the posterior as: - -$$p(\theta \mid y)_{\beta} \propto p(y \mid \theta)^{\beta} \; p(\theta)$$ - -When $\beta=0$ we have that $p(\theta \mid y)_{\beta=0}$ is the prior distribution and when $\beta=1$ we recover the _true_ posterior. We can think of $\beta$ as a knob we can use to gradually _fade up_ the likelihood. This can be useful as in general sampling from the prior is easier than sampling from the posterior distribution. Thus we can use $\beta$ to control the transition from an easy to sample distribution to a harder one. - -A summary of the algorithm is: - -1. Initialize $\beta$ at zero and stage at zero. -2. Generate N samples $S_{\beta}$ from the prior (because when $\beta = 0$ the tempered posterior is the prior). -3. Increase $\beta$ in order to make the effective sample size equals some predefined value (we use $Nt$, where $t$ is 0.5 by default). -4. Compute a set of N importance weights $W$. 
The weights are computed as the ratio of the likelihoods of a sample at stage $i+1$ and stage $i$. -5. Obtain $S_{w}$ by re-sampling according to $W$. -6. Use $W$ to compute the mean and covariance for the proposal distribution, a MVNormal. -7. For stages other than 0 use the acceptance rate from the previous stage to estimate `n_steps`. -8. Run N independent Metropolis-Hastings (IMH) chains (each one of length `n_steps`), starting each one from a different sample in $S_{w}$. Samples are IMH as the proposal mean is the mean of the previous posterior stage and not the current point in parameter space. -9. Repeat from step 3 until $\beta \ge 1$. -10. The final result is a collection of $N$ samples from the posterior. - -The algorithm is summarized in the next figure: the first subplot shows 5 samples (orange dots) at some particular stage. The second subplot shows how these samples are reweighted according to their posterior density (blue Gaussian curve). The third subplot shows the result of running a certain number of IMH steps, starting from the reweighted samples $S_{w}$ in the second subplot. Notice how the two samples with the lower posterior density (smaller circles) are discarded and not used to seed new Markov chains. - -![SMC stages](smc.png) - - -SMC samplers can also be interpreted in the light of genetic algorithms, which are biologically-inspired algorithms that can be summarized as follows: - -1. Initialization: set a population of individuals -2. Mutation: individuals are somehow modified or perturbed -3. Selection: individuals with high _fitness_ have a higher chance of generating _offspring_. -4. Iterate by using individuals from 3 to set the population in 1. - -If each _individual_ is a particular solution to a problem, then a genetic algorithm will eventually produce good solutions to that problem. One key aspect is to generate enough diversity (mutation step) in order to explore the solution space and hence avoid getting trapped in local minima. Then we perform a _selection_ step to _probabilistically_ keep reasonable solutions while also keeping some diversity. Being too greedy and short-sighted could be problematic; _bad_ solutions at a given moment could lead to _good_ solutions in the future. - -For the SMC version implemented in PyMC we set the number of parallel Markov chains $N$ with the `draws` argument. At each stage SMC will use independent Markov chains to explore the _tempered posterior_ (the black arrow in the figure). The final samples, _i.e._ those stored in the `trace`, will be taken exclusively from the final stage ($\beta = 1$), i.e. the _true_ posterior ("true" in the mathematical sense). - -The successive values of $\beta$ are determined automatically (step 3). The harder the distribution is to sample, the closer two successive values of $\beta$ will be, and the larger the number of stages SMC will take. SMC computes the next $\beta$ value by keeping the effective sample size (ESS) between two stages at a constant predefined value of half the number of draws. This can be adjusted if necessary by the `threshold` parameter (in the interval [0, 1]); the current default of 0.5 is generally considered a good value. The larger this value, the higher the target ESS and the closer two successive values of $\beta$ will be. These ESS values are computed from the importance weights (step 4) and not from the autocorrelation like those from ArviZ (for example using `az.ess` or `az.summary`).
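-To make step 3 a little more tangible, here is a rough sketch (not PyMC's actual code) of how the next $\beta$ can be found from the importance weights: bisect on the proposed value until the ESS computed from the normalised incremental weights matches the target fraction of the number of draws. `log_like` stands for a hypothetical array holding the log-likelihood of each of the $N$ samples at the current stage.
-
-```{code-cell} ipython3
-# Sketch: choose the next beta so that ESS(incremental weights) ~ threshold * N.
-def next_beta(log_like, beta, threshold=0.5):
-    n = len(log_like)
-    lo, hi = beta, 2.0  # allow the search to overshoot 1 and clip at the end
-    while hi - lo > 1e-6:
-        mid = 0.5 * (lo + hi)
-        log_w = (mid - beta) * log_like  # incremental importance weights
-        w = np.exp(log_w - log_w.max())
-        w /= w.sum()
-        ess = 1.0 / np.sum(w**2)  # ESS from the normalised weights
-        if ess / n > threshold:
-            lo = mid  # weights still even enough, beta can grow further
-        else:
-            hi = mid
-    return min(1.0, 0.5 * (lo + hi))
-```
-
-Starting from $\beta = 0$ and calling such a function repeatedly traces out the tempering schedule: the harder the posterior, the smaller the increments and the more stages are needed, exactly as described above.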
- -Two more parameters that are automatically determined are: - -* The number of steps each Markov chain takes to explore the _tempered posterior_ `n_steps`. This is determined from the acceptance rate from the previous stage. -* The covariance of the MVNormal proposal distribution is also adjusted adaptively based on the acceptance rate at each stage. - -As with other sampling methods, running a sampler more than one time is useful to compute diagnostics, SMC is no exception. PyMC will try to run at least two **SMC _chains_** (do not confuse with the $N$ Markov chains inside each SMC chain). - -Even when SMC uses the Metropolis-Hasting algorithm under the hood, it has several advantages over it: - -* It can sample from distributions with multiple peaks. -* It does not have a burn-in period, it starts by sampling directly from the prior and then at each stage the starting points are already _approximately_ distributed according to the tempered posterior (due to the re-weighting step). -* It is inherently parallel. - -+++ - -## Solving a PyMC model with SMC - -To see an example of how to use SMC inside PyMC let's define a multivariate Gaussian of dimension $n$ with two modes, the weights of each mode and the covariance matrix. - -```{code-cell} ipython3 -n = 4 - -mu1 = np.ones(n) * (1.0 / 2) -mu2 = -mu1 - -stdev = 0.1 -sigma = np.power(stdev, 2) * np.eye(n) -isigma = np.linalg.inv(sigma) -dsigma = np.linalg.det(sigma) - -w1 = 0.1 # one mode with 0.1 of the mass -w2 = 1 - w1 # the other mode with 0.9 of the mass - - -def two_gaussians(x): - log_like1 = ( - -0.5 * n * at.log(2 * np.pi) - - 0.5 * at.log(dsigma) - - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1) - ) - log_like2 = ( - -0.5 * n * at.log(2 * np.pi) - - 0.5 * at.log(dsigma) - - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2) - ) - return pm.math.logsumexp([at.log(w1) + log_like1, at.log(w2) + log_like2]) -``` - -```{code-cell} ipython3 -with pm.Model() as model: - X = pm.Uniform( - "X", - shape=n, - lower=-2.0 * np.ones_like(mu1), - upper=2.0 * np.ones_like(mu1), - initval=-1.0 * np.ones_like(mu1), - ) - llk = pm.Potential("llk", two_gaussians(X)) - idata_04 = pm.sample_smc(2000) -``` - -We can see from the message that PyMC is running four **SMC chains** in parallel. As explained before this is useful for diagnostics. As with other samplers one useful diagnostics is the `plot_trace`, here we use `kind="rank_vlines"` as rank plots as generally more useful than the classical "trace" - -```{code-cell} ipython3 -ax = az.plot_trace(idata_04, compact=True, kind="rank_vlines") -ax[0, 0].axvline(-0.5, 0, 0.9, color="k") -ax[0, 0].axvline(0.5, 0, 0.1, color="k") -f'Estimated w1 = {np.mean(idata_04.posterior["X"] < 0).item():.3f}' -``` - -From the KDE we can see that we recover the modes and even the relative weights seems pretty good. The rank plot on the right looks good too. One SMC chain is represented in blue and the other in orange. The vertical lines indicate deviation from the ideal expected value, which is represented with a black dashed line. If a vertical line is above the reference black dashed line we have more samples than expected, if the vertical line is below the sampler is getting less samples than expected. Deviations like the ones in the figure above are fine and not a reason for concern. - -As previously said SMC internally computes an estimation of the ESS (from importance weights). Those ESS values are not useful for diagnostics as they are a fixed target value. 
We can compute the ESS values from the trace returned by `sample_smc`, but this is also not a very useful diagnostics, as the computation of this ESS value takes autocorrelation into account and each SMC run/chain has low autocorrelation by construction, for most problems the values of ESS will be either very close to the number of total samples (i.e. draws x chains). In general it will only be a low number if each SMC chain explores a different mode, in that case the value of ESS will be close to the number of modes. - -+++ - -## Kill your darlings - -SMC is not free of problems, sampling can deteriorate as the dimensionality of the problem increases, in particular for multimodal posterior or _weird_ geometries as in hierarchical models. To some extent increasing the number of draws could help. Increasing the value of the argument `p_acc_rate` is also a good idea. This parameter controls how the number of steps is computed at each stage. To access the number of steps per stage you can check `trace.report.nsteps`. Ideally SMC will take a number of steps lower than `n_steps`. But if the actual number of steps per stage is `n_steps`, for a few stages, this may be signaling that we should also increase `n_steps`. - -Let's see the performance of SMC when we run the same model as before, but increasing the dimensionality from 4 to 80. - -```{code-cell} ipython3 -n = 80 - -mu1 = np.ones(n) * (1.0 / 2) -mu2 = -mu1 - -stdev = 0.1 -sigma = np.power(stdev, 2) * np.eye(n) -isigma = np.linalg.inv(sigma) -dsigma = np.linalg.det(sigma) - -w1 = 0.1 # one mode with 0.1 of the mass -w2 = 1 - w1 # the other mode with 0.9 of the mass - - -def two_gaussians(x): - log_like1 = ( - -0.5 * n * at.log(2 * np.pi) - - 0.5 * at.log(dsigma) - - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1) - ) - log_like2 = ( - -0.5 * n * at.log(2 * np.pi) - - 0.5 * at.log(dsigma) - - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2) - ) - return pm.math.logsumexp([at.log(w1) + log_like1, at.log(w2) + log_like2]) -``` - -```{code-cell} ipython3 -with pm.Model() as model: - X = pm.Uniform( - "X", - shape=n, - lower=-2.0 * np.ones_like(mu1), - upper=2.0 * np.ones_like(mu1), - initval=-1.0 * np.ones_like(mu1), - ) - llk = pm.Potential("llk", two_gaussians(X)) - idata_80 = pm.sample_smc(2000) -``` - -We see that SMC recognizes this is a harder problem and increases the number of stages. We can see that SMC still sample from both modes but now the model with higher weight is being oversampled (we get a relative weight of 0.99 instead of 0.9). Notice how the rank plot looks worse than when n=4. - -```{code-cell} ipython3 -ax = az.plot_trace(idata_80, compact=True, kind="rank_vlines") -ax[0, 0].axvline(-0.5, 0, 0.9, color="k") -ax[0, 0].axvline(0.5, 0, 0.1, color="k") -f'Estimated w1 = {np.mean(idata_80.posterior["X"] < 0).item():.3f}' -``` - -You may want to repeat the SMC sampling for n=80, and change one or more of the default parameters too see if you can improve the sampling and how much time the sampler takes to compute the posterior. 
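-As a starting point for such experiments (a sketch only; the exact keywords accepted by `sample_smc` can vary between PyMC versions), one could increase `draws` together with the ESS `threshold` discussed earlier and compare the rank plots and run time against the run above:
-
-```{code-cell} ipython3
-# Sketch: re-run the 80-dimensional model with more draws and a higher threshold.
-with model:
-    idata_80_tuned = pm.sample_smc(draws=4000, threshold=0.7)
-
-ax = az.plot_trace(idata_80_tuned, compact=True, kind="rank_vlines")
-f'Estimated w1 = {np.mean(idata_80_tuned.posterior["X"] < 0).item():.3f}'
-```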
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` diff --git a/myst_nbs/survival_analysis/bayes_param_survival_pymc3.myst.md b/myst_nbs/survival_analysis/bayes_param_survival_pymc3.myst.md deleted file mode 100644 index ac26ca444..000000000 --- a/myst_nbs/survival_analysis/bayes_param_survival_pymc3.myst.md +++ /dev/null @@ -1,420 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -# Bayesian Parametric Survival Analysis with PyMC3 - -```{code-cell} ipython3 -import warnings - -import arviz as az -import numpy as np -import pymc3 as pm -import scipy as sp -import seaborn as sns - -from matplotlib import pyplot as plt -from matplotlib.ticker import StrMethodFormatter -from statsmodels import datasets -from theano import shared -from theano import tensor as tt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -warnings.filterwarnings("ignore") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -[Survival analysis](https://en.wikipedia.org/wiki/Survival_analysis) studies the distribution of the time between when a subject comes under observation and when that subject experiences an event of interest. One of the fundamental challenges of survival analysis (which also makes is mathematically interesting) is that, in general, not every subject will experience the event of interest before we conduct our analysis. In more concrete terms, if we are studying the time between cancer treatment and death (as we will in this post), we will often want to analyze our data before every subject has died. This phenomenon is called censoring and is fundamental to survival analysis. - -I have previously [written](http://austinrochford.com/posts/2015-10-05-bayes-survival.html) about Bayesian survival analysis using the [semiparametric](https://en.wikipedia.org/wiki/Semiparametric_model) [Cox proportional hazards model](https://en.wikipedia.org/wiki/Proportional_hazards_model#The_Cox_model). Implementing that semiparametric model in PyMC3 involved some fairly complex `numpy` code and nonobvious probability theory equivalences. This post illustrates a parametric approach to Bayesian survival analysis in PyMC3. Parametric models of survival are simpler to both implement and understand than semiparametric models; statistically, they are also more [powerful](https://en.wikipedia.org/wiki/Statistical_power) than non- or semiparametric methods _when they are correctly specified_. This post will not further cover the differences between parametric and nonparametric models or the various methods for choosing between them. - -As in the previous post, we will analyze [mastectomy data](https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/mastectomy.html) from `R`'s [`HSAUR`](https://cran.r-project.org/web/packages/HSAUR/index.html) package. First, we load the data. 
- -```{code-cell} ipython3 -sns.set() -blue, green, red, purple, gold, teal = sns.color_palette(n_colors=6) - -pct_formatter = StrMethodFormatter("{x:.1%}") -``` - -```{code-cell} ipython3 -df = datasets.get_rdataset("mastectomy", "HSAUR", cache=True).data.assign( - metastized=lambda df: 1.0 * (df.metastized == "yes"), event=lambda df: 1.0 * df.event -) -``` - -```{code-cell} ipython3 -df.head() -``` - -The column `time` represents the survival time for a breast cancer patient after a mastectomy, measured in months. The column `event` indicates whether or not the observation is censored. If `event` is one, the patient's death was observed during the study; if `event` is zero, the patient lived past the end of the study and their survival time is censored. The column `metastized` indicates whether the cancer had [metastized](https://en.wikipedia.org/wiki/Metastasis) prior to the mastectomy. In this post, we will use Bayesian parametric survival regression to quantify the difference in survival times for patients whose cancer had and had not metastized. - -+++ - -## Accelerated failure time models - -[Accelerated failure time models](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) are the most common type of parametric survival regression models. The fundamental quantity of survival analysis is the [survival function](https://en.wikipedia.org/wiki/Survival_function); if $T$ is the random variable representing the time to the event in question, the survival function is $S(t) = P(T > t)$. Accelerated failure time models incorporate covariates $\mathbf{x}$ into the survival function as - -$$S(t\ |\ \beta, \mathbf{x}) = S_0\left(\exp\left(\beta^{\top} \mathbf{x}\right) \cdot t\right),$$ - -where $S_0(t)$ is a fixed baseline survival function. These models are called "accelerated failure time" because, when $\beta^{\top} \mathbf{x} > 0$, $\exp\left(\beta^{\top} \mathbf{x}\right) \cdot t > t$, so the effect of the covariates is to accelerate the _effective_ passage of time for the individual in question. The following plot illustrates this phenomenon using an exponential survival function. - -```{code-cell} ipython3 -S0 = sp.stats.expon.sf -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -t = np.linspace(0, 10, 100) - -ax.plot(t, S0(5 * t), label=r"$\beta^{\top} \mathbf{x} = \log\ 5$") -ax.plot(t, S0(2 * t), label=r"$\beta^{\top} \mathbf{x} = \log\ 2$") -ax.plot(t, S0(t), label=r"$\beta^{\top} \mathbf{x} = 0$ ($S_0$)") -ax.plot(t, S0(0.5 * t), label=r"$\beta^{\top} \mathbf{x} = -\log\ 2$") -ax.plot(t, S0(0.2 * t), label=r"$\beta^{\top} \mathbf{x} = -\log\ 5$") - -ax.set_xlim(0, 10) -ax.set_xlabel(r"$t$") - -ax.yaxis.set_major_formatter(pct_formatter) -ax.set_ylim(-0.025, 1) -ax.set_ylabel(r"Survival probability, $S(t\ |\ \beta, \mathbf{x})$") - -ax.legend(loc=1) -ax.set_title("Accelerated failure times"); -``` - -Accelerated failure time models are equivalent to log-linear models for $T$, - -$$Y = \log T = \beta^{\top} \mathbf{x} + \varepsilon.$$ - -A choice of distribution for the error term $\varepsilon$ determines baseline survival function, $S_0$, of the accelerated failure time model. The following table shows the correspondence between the distribution of $\varepsilon$ and $S_0$ for several common accelerated failure time models. - -
-| Log-linear error distribution ($\varepsilon$) | Baseline survival function ($S_0$) |
-| --- | --- |
-| [Normal](https://en.wikipedia.org/wiki/Normal_distribution) | [Log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) |
-| Extreme value ([Gumbel](https://en.wikipedia.org/wiki/Gumbel_distribution)) | [Weibull](https://en.wikipedia.org/wiki/Weibull_distribution) |
-| [Logistic](https://en.wikipedia.org/wiki/Logistic_distribution) | [Log-logistic](https://en.wikipedia.org/wiki/Log-logistic_distribution) |
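-The correspondence in this table comes from exponentiating the log-linear model: $T = e^{Y} = e^{\beta^{\top} \mathbf{x}} e^{\varepsilon}$, so, up to the scale factor $e^{\beta^{\top} \mathbf{x}}$, the distribution of $e^{\varepsilon}$ fixes the distribution of the survival time. A normal error therefore yields a log-normal baseline, an extreme-value (Gumbel) error a Weibull baseline, and a logistic error a log-logistic baseline.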
- -Accelerated failure time models are conventionally named after their baseline survival function, $S_0$. The rest of this post will show how to implement Weibull and log-logistic survival regression models in PyMC3 using the mastectomy data. - -+++ - -### Weibull survival regression - -In this example, the covariates are $\mathbf{x}_i = \left(1\ x^{\textrm{met}}_i\right)^{\top}$, where - -$$ -\begin{align*} -x^{\textrm{met}}_i - & = \begin{cases} - 0 & \textrm{if the } i\textrm{-th patient's cancer had not metastized} \\ - 1 & \textrm{if the } i\textrm{-th patient's cancer had metastized} - \end{cases}. -\end{align*} -$$ - -We construct the matrix of covariates $\mathbf{X}$. - -```{code-cell} ipython3 -n_patient, _ = df.shape - -X = np.empty((n_patient, 2)) -X[:, 0] = 1.0 -X[:, 1] = df.metastized -``` - -We place independent, vague normal prior distributions on the regression coefficients, - -$$\beta \sim N(0, 5^2 I_2).$$ - -```{code-cell} ipython3 -VAGUE_PRIOR_SD = 5.0 -``` - -```{code-cell} ipython3 -with pm.Model() as weibull_model: - β = pm.Normal("β", 0.0, VAGUE_PRIOR_SD, shape=2) -``` - -The covariates, $\mathbf{x}$, affect value of $Y = \log T$ through $\eta = \beta^{\top} \mathbf{x}$. - -```{code-cell} ipython3 -X_ = shared(X) - -with weibull_model: - η = β.dot(X_.T) -``` - -For Weibull regression, we use - -$$ -\begin{align*} - \varepsilon - & \sim \textrm{Gumbel}(0, s) \\ - s - & \sim \textrm{HalfNormal(5)}. -\end{align*} -$$ - -```{code-cell} ipython3 -with weibull_model: - s = pm.HalfNormal("s", 5.0) -``` - -We are nearly ready to specify the likelihood of the observations given these priors. Before doing so, we transform the observed times to the log scale and standardize them. - -```{code-cell} ipython3 -y = np.log(df.time.values) -y_std = (y - y.mean()) / y.std() -``` - -The likelihood of the data is specified in two parts, one for uncensored samples, and one for censored samples. Since $Y = \eta + \varepsilon$, and $\varepsilon \sim \textrm{Gumbel}(0, s)$, $Y \sim \textrm{Gumbel}(\eta, s)$. For the uncensored survival times, the likelihood is implemented as - -```{code-cell} ipython3 -cens = df.event.values == 0.0 -``` - -```{code-cell} ipython3 -cens_ = shared(cens) - -with weibull_model: - y_obs = pm.Gumbel("y_obs", η[~cens_], s, observed=y_std[~cens]) -``` - -For censored observations, we only know that their true survival time exceeded the total time that they were under observation. This probability is given by the survival function of the Gumbel distribution, - -$$P(Y \geq y) = 1 - \exp\left(-\exp\left(-\frac{y - \mu}{s}\right)\right).$$ - -This survival function is implemented below. - -```{code-cell} ipython3 -def gumbel_sf(y, μ, σ): - return 1.0 - tt.exp(-tt.exp(-(y - μ) / σ)) -``` - -We now specify the likelihood for the censored observations. - -```{code-cell} ipython3 -with weibull_model: - y_cens = pm.Potential("y_cens", gumbel_sf(y_std[cens], η[cens_], s)) -``` - -We now sample from the model. - -```{code-cell} ipython3 -SEED = 845199 # from random.org, for reproducibility - -SAMPLE_KWARGS = {"chains": 3, "tune": 1000, "random_seed": [SEED, SEED + 1, SEED + 2]} -``` - -```{code-cell} ipython3 -with weibull_model: - weibull_trace = pm.sample(**SAMPLE_KWARGS) -``` - -The energy plot and Bayesian fraction of missing information give no cause for concern about poor mixing in NUTS. - -```{code-cell} ipython3 -az.plot_energy(weibull_trace); -``` - -```{code-cell} ipython3 -az.bfmi(weibull_trace) -``` - -The Gelman-Rubin statistics also indicate convergence. 
- -```{code-cell} ipython3 -max(np.max(gr_stats) for gr_stats in az.rhat(weibull_trace).values()) -``` - -Below we plot posterior distributions of the parameters. - -```{code-cell} ipython3 -az.plot_posterior(weibull_trace, lw=0, alpha=0.5); -``` - -These are somewhat interesting (espescially the fact that the posterior of $\beta_1$ is fairly well-separated from zero), but the posterior predictive survival curves will be much more interpretable. - -The advantage of using [`theano.shared`](http://deeplearning.net/software/theano_versions/dev/library/compile/shared.html) variables is that we can now change their values to perform posterior predictive sampling. For posterior prediction, we set $X$ to have two rows, one for a subject whose cancer had not metastized and one for a subject whose cancer had metastized. Since we want to predict actual survival times, none of the posterior predictive rows are censored. - -```{code-cell} ipython3 -X_pp = np.empty((2, 2)) -X_pp[:, 0] = 1.0 -X_pp[:, 1] = [0, 1] -X_.set_value(X_pp) - -cens_pp = np.repeat(False, 2) -cens_.set_value(cens_pp) -``` - -```{code-cell} ipython3 -with weibull_model: - pp_weibull_trace = pm.sample_posterior_predictive(weibull_trace, samples=1500) -``` - -The posterior predictive survival times show that, on average, patients whose cancer had not metastized survived longer than those whose cancer had metastized. - -```{code-cell} ipython3 -t_plot = np.linspace(0, 230, 100) - -weibull_pp_surv = np.greater_equal.outer( - np.exp(y.mean() + y.std() * pp_weibull_trace["y_obs"]), t_plot -) -weibull_pp_surv_mean = weibull_pp_surv.mean(axis=0) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - - -ax.plot(t_plot, weibull_pp_surv_mean[0], c=blue, label="Not metastized") -ax.plot(t_plot, weibull_pp_surv_mean[1], c=red, label="Metastized") - -ax.set_xlim(0, 230) -ax.set_xlabel("Weeks since mastectomy") - -ax.set_ylim(top=1) -ax.yaxis.set_major_formatter(pct_formatter) -ax.set_ylabel("Survival probability") - -ax.legend(loc=1) -ax.set_title("Weibull survival regression model"); -``` - -### Log-logistic survival regression - -Other accelerated failure time models can be specified in a modular way by changing the prior distribution on $\varepsilon$. A log-logistic model corresponds to a [logistic](https://en.wikipedia.org/wiki/Logistic_distribution) prior on $\varepsilon$. Most of the model specification is the same as for the Weibull model above. - -```{code-cell} ipython3 -X_.set_value(X) -cens_.set_value(cens) - -with pm.Model() as log_logistic_model: - β = pm.Normal("β", 0.0, VAGUE_PRIOR_SD, shape=2) - η = β.dot(X_.T) - - s = pm.HalfNormal("s", 5.0) -``` - -We use the prior $\varepsilon \sim \textrm{Logistic}(0, s)$. The survival function of the logistic distribution is - -$$P(Y \geq y) = 1 - \frac{1}{1 + \exp\left(-\left(\frac{y - \mu}{s}\right)\right)},$$ - -so we get the likelihood - -```{code-cell} ipython3 -def logistic_sf(y, μ, s): - return 1.0 - pm.math.sigmoid((y - μ) / s) -``` - -```{code-cell} ipython3 -with log_logistic_model: - y_obs = pm.Logistic("y_obs", η[~cens_], s, observed=y_std[~cens]) - y_cens = pm.Potential("y_cens", logistic_sf(y_std[cens], η[cens_], s)) -``` - -We now sample from the log-logistic model. - -```{code-cell} ipython3 -with log_logistic_model: - log_logistic_trace = pm.sample(**SAMPLE_KWARGS) -``` - -All of the sampling diagnostics look good for this model. 
- -```{code-cell} ipython3 -az.plot_energy(log_logistic_trace); -``` - -```{code-cell} ipython3 -az.bfmi(log_logistic_trace) -``` - -```{code-cell} ipython3 -max(np.max(gr_stats) for gr_stats in az.rhat(log_logistic_trace).values()) -``` - -Again, we calculate the posterior expected survival functions for this model. - -```{code-cell} ipython3 -X_.set_value(X_pp) -cens_.set_value(cens_pp) - -with log_logistic_model: - pp_log_logistic_trace = pm.sample_posterior_predictive(log_logistic_trace, samples=1500) -``` - -```{code-cell} ipython3 -log_logistic_pp_surv = np.greater_equal.outer( - np.exp(y.mean() + y.std() * pp_log_logistic_trace["y_obs"]), t_plot -) -log_logistic_pp_surv_mean = log_logistic_pp_surv.mean(axis=0) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.plot(t_plot, weibull_pp_surv_mean[0], c=blue, label="Weibull, not metastized") -ax.plot(t_plot, weibull_pp_surv_mean[1], c=red, label="Weibull, metastized") - -ax.plot(t_plot, log_logistic_pp_surv_mean[0], "--", c=blue, label="Log-logistic, not metastized") -ax.plot(t_plot, log_logistic_pp_surv_mean[1], "--", c=red, label="Log-logistic, metastized") - -ax.set_xlim(0, 230) -ax.set_xlabel("Weeks since mastectomy") - -ax.set_ylim(top=1) -ax.yaxis.set_major_formatter(pct_formatter) -ax.set_ylabel("Survival probability") - -ax.legend(loc=1) -ax.set_title("Weibull and log-logistic\nsurvival regression models"); -``` - -This post has been a short introduction to implementing parametric survival regression models in PyMC3 with a fairly simple data set. The modular nature of probabilistic programming with PyMC3 should make it straightforward to generalize these techniques to more complex and interesting data set. - -+++ - -## Authors - -- Originally authored as a blog post by [Austin Rochford](https://austinrochford.com/posts/2017-10-02-bayes-param-survival.html) on October 2, 2017. -- Updated by [George Ho](https://eigenfoo.xyz/) on July 18, 2018. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/survival_analysis/censored_data.myst.md b/myst_nbs/survival_analysis/censored_data.myst.md deleted file mode 100644 index 677b112dd..000000000 --- a/myst_nbs/survival_analysis/censored_data.myst.md +++ /dev/null @@ -1,244 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc_env - language: python - name: pymc_env ---- - -(censored_data)= -# Censored Data Models - -:::{post} May, 2022 -:tags: censored, survival analysis -:category: intermediate, how-to -:author: Luis Mario Domenzain -::: - -```{code-cell} ipython3 -from copy import copy - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import seaborn as sns - -from numpy.random import default_rng -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -rng = default_rng(1234) -az.style.use("arviz-darkgrid") -``` - -[This example notebook on Bayesian survival -analysis](http://docs.pymc.io/notebooks/survival_analysis.html) touches on the -point of censored data. _Censoring_ is a form of missing-data problem, in which -observations greater than a certain threshold are clipped down to that -threshold, or observations less than a certain threshold are clipped up to that -threshold, or both. These are called right, left and interval censoring, -respectively. In this example notebook we consider interval censoring. 
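-Writing $f$ and $F$ for the density and CDF of the latent (uncensored) variable, and $c_{\mathrm{low}}$, $c_{\mathrm{high}}$ for the censoring bounds, each observation $y_i$ contributes
-
-$$p(y_i) =
-\begin{cases}
-F(c_{\mathrm{low}}) & \text{if } y_i \text{ was clipped up to } c_{\mathrm{low}}, \\
-f(y_i) & \text{if } c_{\mathrm{low}} < y_i < c_{\mathrm{high}}, \\
-1 - F(c_{\mathrm{high}}) & \text{if } y_i \text{ was clipped down to } c_{\mathrm{high}},
-\end{cases}$$
-
-to the likelihood. This is the bookkeeping that the censored-data models later in this notebook have to reproduce, either by imputing the clipped values or by integrating them out.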
- -Censored data arises in many modelling problems. Two common examples are: - -1. _Survival analysis:_ when studying the effect of a certain medical treatment - on survival times, it is impossible to prolong the study until all subjects - have died. At the end of the study, the only data collected for many patients - is that they were still alive for a time period $T$ after the treatment was - administered: in reality, their true survival times are greater than $T$. - -2. _Sensor saturation:_ a sensor might have a limited range and the upper and - lower limits would simply be the highest and lowest values a sensor can - report. For instance, many mercury thermometers only report a very narrow - range of temperatures. - -This example notebook presents two different ways of dealing with censored data -in PyMC3: - -1. An imputed censored model, which represents censored data as parameters and - makes up plausible values for all censored values. As a result of this - imputation, this model is capable of generating plausible sets of made-up - values that would have been censored. Each censored element introduces a - random variable. - -2. An unimputed censored model, where the censored data are integrated out and - accounted for only through the log-likelihood. This method deals more - adequately with large amounts of censored data and converges more quickly. - -To establish a baseline we compare to an uncensored model of the uncensored -data. - -```{code-cell} ipython3 -# Produce normally distributed samples -size = 500 -true_mu = 13.0 -true_sigma = 5.0 -samples = rng.normal(true_mu, true_sigma, size) - -# Set censoring limits -low = 3.0 -high = 16.0 - - -def censor(x, low, high): - x = copy(x) - x[x <= low] = low - x[x >= high] = high - return x - - -# Censor samples -censored = censor(samples, low, high) -``` - -```{code-cell} ipython3 -# Visualize uncensored and censored data -_, ax = plt.subplots(figsize=(10, 3)) -edges = np.linspace(-5, 35, 30) -ax.hist(samples, bins=edges, density=True, histtype="stepfilled", alpha=0.2, label="Uncensored") -ax.hist(censored, bins=edges, density=True, histtype="stepfilled", alpha=0.2, label="Censored") -[ax.axvline(x=x, c="k", ls="--") for x in [low, high]] -ax.legend(); -``` - -## Uncensored Model - -```{code-cell} ipython3 -def uncensored_model(data): - with pm.Model() as model: - mu = pm.Normal("mu", mu=((high - low) / 2) + low, sigma=(high - low)) - sigma = pm.HalfNormal("sigma", sigma=(high - low) / 2.0) - observed = pm.Normal("observed", mu=mu, sigma=sigma, observed=data) - return model -``` - -We should predict that running the uncensored model on uncensored data, we will get reasonable estimates of the mean and variance. - -```{code-cell} ipython3 -uncensored_model_1 = uncensored_model(samples) -with uncensored_model_1: - idata = pm.sample() - -az.plot_posterior(idata, ref_val=[true_mu, true_sigma], round_to=3); -``` - -And that is exactly what we find. - -The problem however, is that in censored data contexts, we do not have access to the true values. If we were to use the same uncensored model on the censored data, we would anticipate that our parameter estimates will be biased. If we calculate point estimates for the mean and std, then we can see that we are likely to underestimate the mean and std for this particular dataset and censor bounds. 
- -```{code-cell} ipython3 -print(f"mean={np.mean(censored):.2f}; std={np.std(censored):.2f}") -``` - -```{code-cell} ipython3 -uncensored_model_2 = uncensored_model(censored) -with uncensored_model_2: - idata = pm.sample() - -az.plot_posterior(idata, ref_val=[true_mu, true_sigma], round_to=3); -``` - -The figure above confirms this. - -## Censored data models - -The models below show 2 approaches to dealing with censored data. First, we need to do a bit of data pre-processing to count the number of observations that are left or right censored. We also also need to extract just the non-censored data that we observe. - -+++ - -(censored_data/model1)= -### Model 1 - Imputed Censored Model of Censored Data - -In this model, we impute the censored values from the same distribution as the uncensored data. Sampling from the posterior generates possible uncensored data sets. - -```{code-cell} ipython3 -n_right_censored = sum(censored >= high) -n_left_censored = sum(censored <= low) -n_observed = len(censored) - n_right_censored - n_left_censored -uncensored = censored[(censored > low) & (censored < high)] -assert len(uncensored) == n_observed -``` - -```{code-cell} ipython3 -with pm.Model() as imputed_censored_model: - mu = pm.Normal("mu", mu=((high - low) / 2) + low, sigma=(high - low)) - sigma = pm.HalfNormal("sigma", sigma=(high - low) / 2.0) - right_censored = pm.Normal( - "right_censored", - mu, - sigma, - transform=pm.distributions.transforms.Interval(high, None), - shape=int(n_right_censored), - initval=np.full(n_right_censored, high + 1), - ) - left_censored = pm.Normal( - "left_censored", - mu, - sigma, - transform=pm.distributions.transforms.Interval(None, low), - shape=int(n_left_censored), - initval=np.full(n_left_censored, low - 1), - ) - observed = pm.Normal("observed", mu=mu, sigma=sigma, observed=uncensored, shape=int(n_observed)) - idata = pm.sample() - -az.plot_posterior(idata, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma], round_to=3); -``` - -We can see that the bias in the estimates of the mean and variance (present in the uncensored model) have been largely removed. - -+++ - -### Model 2 - Unimputed Censored Model of Censored Data - -Here we can make use of `pm.Censored`. - -```{code-cell} ipython3 -with pm.Model() as unimputed_censored_model: - mu = pm.Normal("mu", mu=0.0, sigma=(high - low) / 2.0) - sigma = pm.HalfNormal("sigma", sigma=(high - low) / 2.0) - y_latent = pm.Normal.dist(mu=mu, sigma=sigma) - obs = pm.Censored("obs", y_latent, lower=low, upper=high, observed=censored) -``` - -Sampling - -```{code-cell} ipython3 -with unimputed_censored_model: - idata = pm.sample() - -az.plot_posterior(idata, var_names=["mu", "sigma"], ref_val=[true_mu, true_sigma], round_to=3); -``` - -Again, the bias in the estimates of the mean and variance (present in the uncensored model) have been largely removed. - -+++ - -## Discussion - -As we can see, both censored models appear to capture the mean and variance of the underlying distribution as well as the uncensored model! In addition, the imputed censored model is capable of generating data sets of censored values (sample from the posteriors of `left_censored` and `right_censored` to generate them), while the unimputed censored model scales much better with more censored data, and converges faster. - -+++ - -## Authors - -- Originally authored by [Luis Mario Domenzain](https://github.com/domenzain) on Mar 7, 2017. -- Updated by [George Ho](https://github.com/eigenfoo) on Jul 14, 2018. 
-- Updated by [Benjamin Vincent](https://github.com/drbenvincent) in May 2021. -- Updated by [Benjamin Vincent](https://github.com/drbenvincent) in May 2022 to PyMC v4. - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl -``` diff --git a/myst_nbs/survival_analysis/survival_analysis.myst.md b/myst_nbs/survival_analysis/survival_analysis.myst.md deleted file mode 100644 index 90fbf09a9..000000000 --- a/myst_nbs/survival_analysis/survival_analysis.myst.md +++ /dev/null @@ -1,512 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Bayesian Survival Analysis - -Author: Austin Rochford - -[Survival analysis](https://en.wikipedia.org/wiki/Survival_analysis) studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This tutorial shows how to fit and analyze a Bayesian survival model in Python using PyMC3. - -We illustrate these concepts by analyzing a [mastectomy data set](https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/mastectomy.html) from `R`'s [HSAUR](https://cran.r-project.org/web/packages/HSAUR/index.html) package. - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc3 as pm -import theano - -%matplotlib inline -from matplotlib import pyplot as plt -from pymc3.distributions.timeseries import GaussianRandomWalk -from theano import tensor as T -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -try: - df = pd.read_csv("../data/mastectomy.csv") -except FileNotFoundError: - df = pd.read_csv(pm.get_data("mastectomy.csv")) - -df.event = df.event.astype(np.int64) -df.metastasized = (df.metastasized == "yes").astype(np.int64) -n_patients = df.shape[0] -patients = np.arange(n_patients) -``` - -```{code-cell} ipython3 -df.head() -``` - -```{code-cell} ipython3 -n_patients -``` - -Each row represents observations from a woman diagnosed with breast cancer that underwent a mastectomy. The column `time` represents the time (in months) post-surgery that the woman was observed. The column `event` indicates whether or not the woman died during the observation period. The column `metastasized` represents whether the cancer had [metastasized](https://en.wikipedia.org/wiki/Metastatic_breast_cancer) prior to surgery. - -This tutorial analyzes the relationship between survival time post-mastectomy and whether or not the cancer had metastasized. - -+++ - -#### A crash course in survival analysis - -First we introduce a (very little) bit of theory. If the random variable $T$ is the time to the event we are studying, survival analysis is primarily concerned with the survival function - -$$S(t) = P(T > t) = 1 - F(t),$$ - -where $F$ is the [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function) of $T$. It is mathematically convenient to express the survival function in terms of the [hazard rate](https://en.wikipedia.org/wiki/Survival_analysis#Hazard_function_and_cumulative_hazard_function), $\lambda(t)$. The hazard rate is the instantaneous probability that the event occurs at time $t$ given that it has not yet occurred. 
That is, - -$$\begin{align*} -\lambda(t) - & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t\ |\ T > t)}{\Delta t} \\ - & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t \cdot P(T > t)} \\ - & = \frac{1}{S(t)} \cdot \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t} - = -\frac{S'(t)}{S(t)}. -\end{align*}$$ - -Solving this differential equation for the survival function shows that - -$$S(t) = \exp\left(-\int_0^s \lambda(s)\ ds\right).$$ - -This representation of the survival function shows that the cumulative hazard function - -$$\Lambda(t) = \int_0^t \lambda(s)\ ds$$ - -is an important quantity in survival analysis, since we may concisely write $S(t) = \exp(-\Lambda(t)).$ - -An important, but subtle, point in survival analysis is [censoring](https://en.wikipedia.org/wiki/Survival_analysis#Censoring). Even though the quantity we are interested in estimating is the time between surgery and death, we do not observe the death of every subject. At the point in time that we perform our analysis, some of our subjects will thankfully still be alive. In the case of our mastectomy study, `df.event` is one if the subject's death was observed (the observation is not censored) and is zero if the death was not observed (the observation is censored). - -```{code-cell} ipython3 -df.event.mean() -``` - -Just over 40% of our observations are censored. We visualize the observed durations and indicate which observations are censored below. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.hlines( - patients[df.event.values == 0], 0, df[df.event.values == 0].time, color="C3", label="Censored" -) - -ax.hlines( - patients[df.event.values == 1], 0, df[df.event.values == 1].time, color="C7", label="Uncensored" -) - -ax.scatter( - df[df.metastasized.values == 1].time, - patients[df.metastasized.values == 1], - color="k", - zorder=10, - label="Metastasized", -) - -ax.set_xlim(left=0) -ax.set_xlabel("Months since mastectomy") -ax.set_yticks([]) -ax.set_ylabel("Subject") - -ax.set_ylim(-0.25, n_patients + 0.25) - -ax.legend(loc="center right"); -``` - -When an observation is censored (`df.event` is zero), `df.time` is not the subject's survival time. All we can conclude from such a censored observation is that the subject's true survival time exceeds `df.time`. - -This is enough basic survival analysis theory for the purposes of this tutorial; for a more extensive introduction, consult Aalen et al.^[Aalen, Odd, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.] - -+++ - -#### Bayesian proportional hazards model - -The two most basic estimators in survival analysis are the [Kaplan-Meier estimator](https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator) of the survival function and the [Nelson-Aalen estimator](https://en.wikipedia.org/wiki/Nelson%E2%80%93Aalen_estimator) of the cumulative hazard function. However, since we want to understand the impact of metastization on survival time, a risk regression model is more appropriate. Perhaps the most commonly used risk regression model is [Cox's proportional hazards model](https://en.wikipedia.org/wiki/Proportional_hazards_model). In this model, if we have covariates $\mathbf{x}$ and regression coefficients $\beta$, the hazard rate is modeled as - -$$\lambda(t) = \lambda_0(t) \exp(\mathbf{x} \beta).$$ - -Here $\lambda_0(t)$ is the baseline hazard, which is independent of the covariates $\mathbf{x}$. 
In this example, the covariates are the one-dimensional vector `df.metastasized`. - -Unlike in many regression situations, $\mathbf{x}$ should not include a constant term corresponding to an intercept. If $\mathbf{x}$ includes a constant term corresponding to an intercept, the model becomes [unidentifiable](https://en.wikipedia.org/wiki/Identifiability). To illustrate this unidentifiability, suppose that - -$$\lambda(t) = \lambda_0(t) \exp(\beta_0 + \mathbf{x} \beta) = \lambda_0(t) \exp(\beta_0) \exp(\mathbf{x} \beta).$$ - -If $\tilde{\beta}_0 = \beta_0 + \delta$ and $\tilde{\lambda}_0(t) = \lambda_0(t) \exp(-\delta)$, then $\lambda(t) = \tilde{\lambda}_0(t) \exp(\tilde{\beta}_0 + \mathbf{x} \beta)$ as well, making the model with $\beta_0$ unidentifiable. - -In order to perform Bayesian inference with the Cox model, we must specify priors on $\beta$ and $\lambda_0(t)$. We place a normal prior on $\beta$, $\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2),$ where $\mu_{\beta} \sim N(0, 10^2)$ and $\sigma_{\beta} \sim U(0, 10)$. - -A suitable prior on $\lambda_0(t)$ is less obvious. We choose a semiparametric prior, where $\lambda_0(t)$ is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints $0 \leq s_1 < s_2 < \cdots < s_N$. With this partition, $\lambda_0 (t) = \lambda_j$ if $s_j \leq t < s_{j + 1}$. With $\lambda_0(t)$ constrained to have this form, all we need to do is choose priors for the $N - 1$ values $\lambda_j$. We use independent vague priors $\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2}).$ For our mastectomy example, we make each interval three months long. - -```{code-cell} ipython3 -interval_length = 3 -interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length) -n_intervals = interval_bounds.size - 1 -intervals = np.arange(n_intervals) -``` - -We see how deaths and censored observations are distributed in these intervals. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.hist( - df[df.event == 0].time.values, - bins=interval_bounds, - lw=0, - color="C3", - alpha=0.5, - label="Censored", -) - -ax.hist( - df[df.event == 1].time.values, - bins=interval_bounds, - lw=0, - color="C7", - alpha=0.5, - label="Uncensored", -) - -ax.set_xlim(0, interval_bounds[-1]) -ax.set_xlabel("Months since mastectomy") - -ax.set_yticks([0, 1, 2, 3]) -ax.set_ylabel("Number of observations") - -ax.legend(); -``` - -With the prior distributions on $\beta$ and $\lambda_0(t)$ chosen, we now show how the model may be fit using MCMC simulation with `pymc3`. The key observation is that the piecewise-constant proportional hazard model is [closely related](http://data.princeton.edu/wws509/notes/c7s4.html) to a Poisson regression model. (The models are not identical, but their likelihoods differ by a factor that depends only on the observed data and not the parameters $\beta$ and $\lambda_j$. For details, see Germán Rodríguez's WWS 509 [course notes](http://data.princeton.edu/wws509/notes/c7s4.html).) 
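-
-To sketch why (writing $d_{i, j}$, $t_{i, j}$ and $\lambda_{i, j}$ for the event indicator, exposure time and subject-interval hazard defined just below), the piecewise-exponential likelihood contribution of subject $i$ in interval $j$ is
-
-$$\lambda_{i, j}^{d_{i, j}} \exp(-\lambda_{i, j} t_{i, j}),$$
-
-while the Poisson likelihood with mean $\mu_{i, j} = t_{i, j} \lambda_{i, j}$ is
-
-$$\frac{(t_{i, j} \lambda_{i, j})^{d_{i, j}}}{d_{i, j}!} \exp(-t_{i, j} \lambda_{i, j}).$$
-
-The two differ only by the factor $t_{i, j}^{d_{i, j}} / d_{i, j}!$, which depends on the observed data but not on $\beta$ or the $\lambda_j$, so the posteriors for the parameters coincide.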
- -We define indicator variables based on whether the $i$-th subject died in the $j$-th interval, - -$$d_{i, j} = \begin{cases} - 1 & \textrm{if subject } i \textrm{ died in interval } j \\ - 0 & \textrm{otherwise} -\end{cases}.$$ - -```{code-cell} ipython3 -last_period = np.floor((df.time - 0.01) / interval_length).astype(int) - -death = np.zeros((n_patients, n_intervals)) -death[patients, last_period] = df.event -``` - -We also define $t_{i, j}$ to be the amount of time the $i$-th subject was at risk in the $j$-th interval. - -```{code-cell} ipython3 -exposure = np.greater_equal.outer(df.time.to_numpy(), interval_bounds[:-1]) * interval_length -exposure[patients, last_period] = df.time - interval_bounds[last_period] -``` - -Finally, denote the risk incurred by the $i$-th subject in the $j$-th interval as $\lambda_{i, j} = \lambda_j \exp(\mathbf{x}_i \beta)$. - -We may approximate $d_{i, j}$ with a Poisson random variable with mean $t_{i, j}\ \lambda_{i, j}$. This approximation leads to the following `pymc3` model. - -```{code-cell} ipython3 -coords = {"intervals": intervals} - -with pm.Model(coords=coords) as model: - - lambda0 = pm.Gamma("lambda0", 0.01, 0.01, dims="intervals") - - beta = pm.Normal("beta", 0, sigma=1000) - - lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastasized), lambda0)) - mu = pm.Deterministic("mu", exposure * lambda_) - - obs = pm.Poisson("obs", mu, observed=death) -``` - -We now sample from the model. - -```{code-cell} ipython3 -n_samples = 1000 -n_tune = 1000 -``` - -```{code-cell} ipython3 -with model: - idata = pm.sample( - n_samples, - tune=n_tune, - target_accept=0.99, - return_inferencedata=True, - random_seed=RANDOM_SEED, - ) -``` - -We see that the hazard rate for subjects whose cancer has metastasized is about one and a half times the rate of those whose cancer has not metastasized. - -```{code-cell} ipython3 -np.exp(idata.posterior["beta"]).mean() -``` - -```{code-cell} ipython3 -az.plot_posterior(idata, var_names=["beta"]); -``` - -```{code-cell} ipython3 -az.plot_autocorr(idata, var_names=["beta"]); -``` - -We now examine the effect of metastization on both the cumulative hazard and on the survival function. 
- -```{code-cell} ipython3 -base_hazard = idata.posterior["lambda0"] -met_hazard = idata.posterior["lambda0"] * np.exp(idata.posterior["beta"]) -``` - -```{code-cell} ipython3 -def cum_hazard(hazard): - return (interval_length * hazard).cumsum(axis=-1) - - -def survival(hazard): - return np.exp(-cum_hazard(hazard)) - - -def get_mean(trace): - return trace.mean(("chain", "draw")) -``` - -```{code-cell} ipython3 -fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6)) - -az.plot_hdi( - interval_bounds[:-1], - cum_hazard(base_hazard), - ax=hazard_ax, - smooth=False, - color="C0", - fill_kwargs={"label": "Had not metastasized"}, -) -az.plot_hdi( - interval_bounds[:-1], - cum_hazard(met_hazard), - ax=hazard_ax, - smooth=False, - color="C1", - fill_kwargs={"label": "Metastasized"}, -) - -hazard_ax.plot(interval_bounds[:-1], get_mean(cum_hazard(base_hazard)), color="darkblue") -hazard_ax.plot(interval_bounds[:-1], get_mean(cum_hazard(met_hazard)), color="maroon") - -hazard_ax.set_xlim(0, df.time.max()) -hazard_ax.set_xlabel("Months since mastectomy") -hazard_ax.set_ylabel(r"Cumulative hazard $\Lambda(t)$") -hazard_ax.legend(loc=2) - -az.plot_hdi(interval_bounds[:-1], survival(base_hazard), ax=surv_ax, smooth=False, color="C0") -az.plot_hdi(interval_bounds[:-1], survival(met_hazard), ax=surv_ax, smooth=False, color="C1") - -surv_ax.plot(interval_bounds[:-1], get_mean(survival(base_hazard)), color="darkblue") -surv_ax.plot(interval_bounds[:-1], get_mean(survival(met_hazard)), color="maroon") - -surv_ax.set_xlim(0, df.time.max()) -surv_ax.set_xlabel("Months since mastectomy") -surv_ax.set_ylabel("Survival function $S(t)$") - -fig.suptitle("Bayesian survival model"); -``` - -We see that the cumulative hazard for metastasized subjects increases more rapidly initially (through about seventy months), after which it increases roughly in parallel with the baseline cumulative hazard. - -These plots also show the pointwise 95% high posterior density interval for each function. One of the distinct advantages of the Bayesian model fit with `pymc3` is the inherent quantification of uncertainty in our estimates. - -+++ - -##### Time varying effects - -Another of the advantages of the model we have built is its flexibility. From the plots above, we may reasonable believe that the additional hazard due to metastization varies over time; it seems plausible that cancer that has metastasized increases the hazard rate immediately after the mastectomy, but that the risk due to metastization decreases over time. We can accommodate this mechanism in our model by allowing the regression coefficients to vary over time. In the time-varying coefficient model, if $s_j \leq t < s_{j + 1}$, we let $\lambda(t) = \lambda_j \exp(\mathbf{x} \beta_j).$ The sequence of regression coefficients $\beta_1, \beta_2, \ldots, \beta_{N - 1}$ form a normal random walk with $\beta_1 \sim N(0, 1)$, $\beta_j\ |\ \beta_{j - 1} \sim N(\beta_{j - 1}, 1)$. - -We implement this model in `pymc3` as follows. - -```{code-cell} ipython3 -coords = {"intervals": intervals} - -with pm.Model(coords=coords) as time_varying_model: - - lambda0 = pm.Gamma("lambda0", 0.01, 0.01, dims="intervals") - beta = GaussianRandomWalk("beta", tau=1.0, dims="intervals") - - lambda_ = pm.Deterministic("h", lambda0 * T.exp(T.outer(T.constant(df.metastasized), beta))) - mu = pm.Deterministic("mu", exposure * lambda_) - - obs = pm.Poisson("obs", mu, observed=death) -``` - -We proceed to sample from this model. 
- -```{code-cell} ipython3 -with time_varying_model: - time_varying_idata = pm.sample( - n_samples, - tune=n_tune, - return_inferencedata=True, - target_accept=0.99, - random_seed=RANDOM_SEED, - ) -``` - -```{code-cell} ipython3 -az.plot_forest(time_varying_idata, var_names=["beta"]); -``` - -We see from the plot of $\beta_j$ over time below that initially $\beta_j > 0$, indicating an elevated hazard rate due to metastization, but that this risk declines as $\beta_j < 0$ eventually. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -beta_eti = time_varying_idata.posterior["beta"].quantile((0.025, 0.975), dim=("chain", "draw")) -beta_eti_low = beta_eti.sel(quantile=0.025) -beta_eti_high = beta_eti.sel(quantile=0.975) - -ax.fill_between(interval_bounds[:-1], beta_eti_low, beta_eti_high, color="C0", alpha=0.25) - -beta_hat = time_varying_idata.posterior["beta"].mean(("chain", "draw")) - -ax.step(interval_bounds[:-1], beta_hat, color="C0") - -ax.scatter( - interval_bounds[last_period[(df.event.values == 1) & (df.metastasized == 1)]], - beta_hat.isel(intervals=last_period[(df.event.values == 1) & (df.metastasized == 1)]), - color="C1", - zorder=10, - label="Died, cancer metastasized", -) - -ax.scatter( - interval_bounds[last_period[(df.event.values == 0) & (df.metastasized == 1)]], - beta_hat.isel(intervals=last_period[(df.event.values == 0) & (df.metastasized == 1)]), - color="C0", - zorder=10, - label="Censored, cancer metastasized", -) - -ax.set_xlim(0, df.time.max()) -ax.set_xlabel("Months since mastectomy") -ax.set_ylabel(r"$\beta_j$") -ax.legend(); -``` - -The coefficients $\beta_j$ begin declining rapidly around one hundred months post-mastectomy, which seems reasonable, given that only three of twelve subjects whose cancer had metastasized lived past this point died during the study. - -The change in our estimate of the cumulative hazard and survival functions due to time-varying effects is also quite apparent in the following plots. 
- -```{code-cell} ipython3 -tv_base_hazard = time_varying_idata.posterior["lambda0"] -tv_met_hazard = time_varying_idata.posterior["lambda0"] * np.exp( - time_varying_idata.posterior["beta"] -) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(8, 6)) - -ax.step( - interval_bounds[:-1], - cum_hazard(base_hazard.mean(("chain", "draw"))), - color="C0", - label="Had not metastasized", -) - -ax.step( - interval_bounds[:-1], - cum_hazard(met_hazard.mean(("chain", "draw"))), - color="C1", - label="Metastasized", -) - -ax.step( - interval_bounds[:-1], - cum_hazard(tv_base_hazard.mean(("chain", "draw"))), - color="C0", - linestyle="--", - label="Had not metastasized (time varying effect)", -) - -ax.step( - interval_bounds[:-1], - cum_hazard(tv_met_hazard.mean(dim=("chain", "draw"))), - color="C1", - linestyle="--", - label="Metastasized (time varying effect)", -) - -ax.set_xlim(0, df.time.max() - 4) -ax.set_xlabel("Months since mastectomy") -ax.set_ylim(0, 2) -ax.set_ylabel(r"Cumulative hazard $\Lambda(t)$") -ax.legend(loc=2); -``` - -```{code-cell} ipython3 -fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6)) - -az.plot_hdi( - interval_bounds[:-1], - cum_hazard(tv_base_hazard), - ax=hazard_ax, - color="C0", - smooth=False, - fill_kwargs={"label": "Had not metastasized"}, -) - -az.plot_hdi( - interval_bounds[:-1], - cum_hazard(tv_met_hazard), - ax=hazard_ax, - smooth=False, - color="C1", - fill_kwargs={"label": "Metastasized"}, -) - -hazard_ax.plot(interval_bounds[:-1], get_mean(cum_hazard(tv_base_hazard)), color="darkblue") -hazard_ax.plot(interval_bounds[:-1], get_mean(cum_hazard(tv_met_hazard)), color="maroon") - -hazard_ax.set_xlim(0, df.time.max()) -hazard_ax.set_xlabel("Months since mastectomy") -hazard_ax.set_ylim(0, 2) -hazard_ax.set_ylabel(r"Cumulative hazard $\Lambda(t)$") -hazard_ax.legend(loc=2) - -az.plot_hdi(interval_bounds[:-1], survival(tv_base_hazard), ax=surv_ax, smooth=False, color="C0") -az.plot_hdi(interval_bounds[:-1], survival(tv_met_hazard), ax=surv_ax, smooth=False, color="C1") - -surv_ax.plot(interval_bounds[:-1], get_mean(survival(tv_base_hazard)), color="darkblue") -surv_ax.plot(interval_bounds[:-1], get_mean(survival(tv_met_hazard)), color="maroon") - -surv_ax.set_xlim(0, df.time.max()) -surv_ax.set_xlabel("Months since mastectomy") -surv_ax.set_ylabel("Survival function $S(t)$") -fig.suptitle("Bayesian survival model with time varying effects"); -``` - -We have really only scratched the surface of both survival analysis and the Bayesian approach to survival analysis. More information on Bayesian survival analysis is available in Ibrahim et al. (2005). (For example, we may want to account for individual frailty in either or original or time-varying models.) - -This tutorial is available as an [IPython](http://ipython.org/) notebook [here](https://gist.github.com/AustinRochford/4c6b07e51a2247d678d6). It is adapted from a blog post that first appeared [here](http://austinrochford.com/posts/2015-10-05-bayes-survival.html). 
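-
-As a taste of the frailty extension mentioned above, one possible (unfitted) sketch adds a per-subject log-normal frailty, i.e. a multiplicative random effect on each subject's hazard. It reuses the `intervals`, `patients`, `exposure` and `death` arrays built earlier; the prior scale on the log-frailty is an arbitrary choice.
-
-```{code-cell} ipython3
-# Sketch only (not fitted here): a shared per-subject frailty multiplies each subject's hazard.
-coords = {"intervals": intervals, "patients": patients}
-
-with pm.Model(coords=coords) as frailty_model:
-    lambda0 = pm.Gamma("lambda0", 0.01, 0.01, dims="intervals")
-    beta = pm.Normal("beta", 0, sigma=1000)
-
-    # log-frailty per subject; the HalfNormal scale is an arbitrary choice
-    sigma_z = pm.HalfNormal("sigma_z", 1.0)
-    z = pm.Normal("z", 0.0, sigma_z, dims="patients")
-
-    lambda_ = pm.Deterministic("lambda_", T.outer(T.exp(beta * df.metastasized + z), lambda0))
-    mu = pm.Deterministic("mu", exposure * lambda_)
-
-    obs = pm.Poisson("obs", mu, observed=death)
-```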
- -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` - -```{code-cell} ipython3 - -``` diff --git a/myst_nbs/survival_analysis/weibull_aft.myst.md b/myst_nbs/survival_analysis/weibull_aft.myst.md deleted file mode 100644 index 18e082acd..000000000 --- a/myst_nbs/survival_analysis/weibull_aft.myst.md +++ /dev/null @@ -1,182 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Reparameterizing the Weibull Accelerated Failure Time Model - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pymc3 as pm -import statsmodels.api as sm -import theano.tensor as tt - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## Dataset - -The [previous example notebook on Bayesian parametric survival analysis](https://docs.pymc.io/notebooks/bayes_param_survival.html) introduced two different accelerated failure time (AFT) models: Weibull and log-linear. In this notebook, we present three different parameterizations of the Weibull AFT model. - -The data set we'll use is the `flchain` R data set, which comes from a medical study investigating the effect of serum free light chain (FLC) on lifespan. Read the full documentation of the data by running: - -`print(sm.datasets.get_rdataset(package='survival', dataname='flchain').__doc__)`. - -```{code-cell} ipython3 -# Fetch and clean data -data = ( - sm.datasets.get_rdataset(package="survival", dataname="flchain") - .data.sample(500) # Limit ourselves to 500 observations - .reset_index(drop=True) -) -``` - -```{code-cell} ipython3 -y = data.futime.values -censored = ~data["death"].values.astype(bool) -``` - -```{code-cell} ipython3 -y[:5] -``` - -```{code-cell} ipython3 -censored[:5] -``` - -## Using `pm.Potential` - -We have an unique problem when modelling censored data. Strictly speaking, we don't have any _data_ for censored values: we only know the _number_ of values that were censored. How can we include this information in our model? - -One way do this is by making use of `pm.Potential`. The [PyMC2 docs](https://pymc-devs.github.io/pymc/modelbuilding.html#the-potential-class) explain its usage very well. Essentially, declaring `pm.Potential('x', logp)` will add `logp` to the log-likelihood of the model. - -+++ - -## Parameterization 1 - -This parameterization is an intuitive, straightforward parameterization of the Weibull survival function. This is probably the first parameterization to come to one's mind. 
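-
-Concretely, with the parameterization used here the Weibull survival function is
-
-$$S(t) = \exp\left(-\left(\frac{t}{\beta}\right)^{\alpha}\right),$$
-
-so the log complementary CDF that each censored observation contributes through `pm.Potential` is simply $-(t / \beta)^{\alpha}$, which is exactly what the helper below computes.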
- -```{code-cell} ipython3 -def weibull_lccdf(x, alpha, beta): - """Log complementary cdf of Weibull distribution.""" - return -((x / beta) ** alpha) -``` - -```{code-cell} ipython3 -with pm.Model() as model_1: - alpha_sd = 10.0 - - mu = pm.Normal("mu", mu=0, sigma=100) - alpha_raw = pm.Normal("a0", mu=0, sigma=0.1) - alpha = pm.Deterministic("alpha", tt.exp(alpha_sd * alpha_raw)) - beta = pm.Deterministic("beta", tt.exp(mu / alpha)) - - y_obs = pm.Weibull("y_obs", alpha=alpha, beta=beta, observed=y[~censored]) - y_cens = pm.Potential("y_cens", weibull_lccdf(y[censored], alpha, beta)) -``` - -```{code-cell} ipython3 -with model_1: - # Change init to avoid divergences - data_1 = pm.sample(target_accept=0.9, init="adapt_diag", return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(data_1, var_names=["alpha", "beta"]) -``` - -```{code-cell} ipython3 -az.summary(data_1, var_names=["alpha", "beta"], round_to=2) -``` - -## Parameterization 2 - -Note that, confusingly, `alpha` is now called `r`, and `alpha` denotes a prior; we maintain this notation to stay faithful to the original implementation in Stan. In this parameterization, we still model the same parameters `alpha` (now `r`) and `beta`. - -For more information, see [this Stan example model](https://github.com/stan-dev/example-models/blob/5e9c5055dcea78ad756a6fb9b3ff9a77a0a4c22b/bugs_examples/vol1/kidney/kidney.stan) and [the corresponding documentation](https://www.mrc-bsu.cam.ac.uk/wp-content/uploads/WinBUGS_Vol1.pdf). - -```{code-cell} ipython3 -with pm.Model() as model_2: - alpha = pm.Normal("alpha", mu=0, sigma=10) - r = pm.Gamma("r", alpha=1, beta=0.001, testval=0.25) - beta = pm.Deterministic("beta", tt.exp(-alpha / r)) - - y_obs = pm.Weibull("y_obs", alpha=r, beta=beta, observed=y[~censored]) - y_cens = pm.Potential("y_cens", weibull_lccdf(y[censored], r, beta)) -``` - -```{code-cell} ipython3 -with model_2: - # Increase target_accept to avoid divergences - data_2 = pm.sample(target_accept=0.9, return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(data_2, var_names=["r", "beta"]) -``` - -```{code-cell} ipython3 -az.summary(data_2, var_names=["r", "beta"], round_to=2) -``` - -## Parameterization 3 - -In this parameterization, we model the log-linear error distribution with a Gumbel distribution instead of modelling the survival function directly. For more information, see [this blog post](http://austinrochford.com/posts/2017-10-02-bayes-param-survival.html). - -```{code-cell} ipython3 -logtime = np.log(y) - - -def gumbel_sf(y, mu, sigma): - """Gumbel survival function.""" - return 1.0 - tt.exp(-tt.exp(-(y - mu) / sigma)) -``` - -```{code-cell} ipython3 -with pm.Model() as model_3: - s = pm.HalfNormal("s", tau=5.0) - gamma = pm.Normal("gamma", mu=0, sigma=5) - - y_obs = pm.Gumbel("y_obs", mu=gamma, beta=s, observed=logtime[~censored]) - y_cens = pm.Potential("y_cens", gumbel_sf(y=logtime[censored], mu=gamma, sigma=s)) -``` - -```{code-cell} ipython3 -with model_3: - # Change init to avoid divergences - data_3 = pm.sample(init="adapt_diag", return_inferencedata=True) -``` - -```{code-cell} ipython3 -az.plot_trace(data_3) -``` - -```{code-cell} ipython3 -az.summary(data_3, round_to=2) -``` - -## Authors - -- Originally collated by [Junpeng Lao](https://junpenglao.xyz/) on Apr 21, 2018. See original code [here](https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/65447fdb431c78b15fbeaef51b8c059f46c9e8d6/PyMC3QnA/discourse_1107.ipynb). 
-- Authored and ported to Jupyter notebook by [George Ho](https://eigenfoo.xyz/) on Jul 15, 2018. - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/time_series/AR.myst.md b/myst_nbs/time_series/AR.myst.md deleted file mode 100644 index 5b061850c..000000000 --- a/myst_nbs/time_series/AR.myst.md +++ /dev/null @@ -1,187 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: 'Python 3.8.2 64-bit (''pymc3'': conda)' - language: python - name: python38264bitpymc3conda8b8223a2f9874eff9bd3e12d36ed2ca2 ---- - -+++ {"ein.tags": ["worksheet-0"], "slideshow": {"slide_type": "-"}} - -# Analysis of An $AR(1)$ Model in PyMC3 - -```{code-cell} ipython3 ---- -ein.tags: [worksheet-0] -slideshow: - slide_type: '-' ---- -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -np.random.seed(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -+++ {"ein.tags": ["worksheet-0"], "slideshow": {"slide_type": "-"}} - -Consider the following AR(1) process, initialized in the -infinite past: -$$ - y_t = \theta y_{t-1} + \epsilon_t, -$$ -where $\epsilon_t \overset{iid}{\sim} {\cal N}(0,1)$. Suppose you'd like to learn about $\theta$ from a a sample of observations $Y^T = \{ y_0, y_1,\ldots, y_T \}$. - -First, let's generate some synthetic sample data. We simulate the 'infinite past' by generating 10,000 samples from an AR(1) process and then discarding the first 5,000: - -```{code-cell} ipython3 ---- -ein.tags: [worksheet-0] -slideshow: - slide_type: '-' ---- -T = 10000 -y = np.zeros((T,)) - -# true stationarity: -true_theta = 0.95 -# true standard deviation of the innovation: -true_sigma = 2.0 -# true process mean: -true_center = 0.0 - -for t in range(1, T): - y[t] = true_theta * y[t - 1] + np.random.normal(loc=true_center, scale=true_sigma) - -y = y[-5000:] -plt.plot(y, alpha=0.8) -plt.xlabel("Timestep") -plt.ylabel("$y$"); -``` - -+++ {"ein.tags": ["worksheet-0"], "slideshow": {"slide_type": "-"}} - -This generative process is quite straight forward to implement in PyMC3: - -```{code-cell} ipython3 ---- -ein.tags: [worksheet-0] -slideshow: - slide_type: '-' ---- -with pm.Model() as ar1: - # assumes 95% of prob mass is between -2 and 2 - theta = pm.Normal("theta", 0.0, 1.0) - # precision of the innovation term - tau = pm.Exponential("tau", 0.5) - # process mean - center = pm.Normal("center", mu=0.0, sigma=1.0) - - likelihood = pm.AR1("y", k=theta, tau_e=tau, observed=y - center) - - trace = pm.sample(1000, tune=2000, init="advi+adapt_diag", random_seed=RANDOM_SEED) - idata = az.from_pymc3(trace) -``` - -We can see that even though the sample data did not start at zero, the true center of zero is captured rightly inferred by the model, as you can see in the trace plot below. Likewise, the model captured the true values of the autocorrelation parameter 𝜃 and the innovation term $\epsilon_t$ (`tau` in the model) -- 0.95 and 1 respectively). - -```{code-cell} ipython3 -az.plot_trace( - idata, - lines=[ - ("theta", {}, true_theta), - ("tau", {}, true_sigma**-2), - ("center", {}, true_center), - ], -); -``` - -+++ {"ein.tags": ["worksheet-0"], "slideshow": {"slide_type": "-"}} - -## Extension to AR(p) -We can instead estimate an AR(2) model using PyMC3. - -$$ - y_t = \theta_1 y_{t-1} + \theta_2 y_{t-2} + \epsilon_t. 
-$$ - -The `AR` distribution infers the order of the process thanks to the size the of `rho` argmument passed to `AR`. - -We will also use the standard deviation of the innovations (rather than the precision) to parameterize the distribution. - -```{code-cell} ipython3 ---- -ein.tags: [worksheet-0] -slideshow: - slide_type: '-' ---- -with pm.Model() as ar2: - theta = pm.Normal("theta", 0.0, 1.0, shape=2) - sigma = pm.HalfNormal("sigma", 3) - likelihood = pm.AR("y", theta, sigma=sigma, observed=y) - - trace = pm.sample( - 1000, - tune=2000, - random_seed=RANDOM_SEED, - ) - idata = az.from_pymc3(trace) -``` - -```{code-cell} ipython3 -az.plot_trace( - idata, - lines=[ - ("theta", {"theta_dim_0": 0}, true_theta), - ("theta", {"theta_dim_0": 1}, 0.0), - ("sigma", {}, true_sigma), - ], -); -``` - -+++ {"ein.tags": ["worksheet-0"], "slideshow": {"slide_type": "-"}} - -You can also pass the set of AR parameters as a list. - -```{code-cell} ipython3 ---- -ein.tags: [worksheet-0] -slideshow: - slide_type: '-' ---- -with pm.Model() as ar2_bis: - beta0 = pm.Normal("theta0", mu=0.0, sigma=1.0) - beta1 = pm.Uniform("theta1", -1, 1) - sigma = pm.HalfNormal("sigma", 3) - likelhood = pm.AR("y", [beta0, beta1], sigma=sigma, observed=y) - - trace = pm.sample( - 1000, - tune=2000, - random_seed=RANDOM_SEED, - ) - idata = az.from_pymc3(trace) -``` - -```{code-cell} ipython3 -az.plot_trace( - idata, - lines=[("theta0", {}, true_theta), ("theta1", {}, 0.0), ("sigma", {}, true_sigma)], -); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/time_series/Air_passengers-Prophet_with_Bayesian_workflow.myst.md b/myst_nbs/time_series/Air_passengers-Prophet_with_Bayesian_workflow.myst.md deleted file mode 100644 index 1a7ec7fc4..000000000 --- a/myst_nbs/time_series/Air_passengers-Prophet_with_Bayesian_workflow.myst.md +++ /dev/null @@ -1,376 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(Air_passengers-Prophet_with_Bayesian_workflow)= -# Air passengers - Prophet-like model - -:::{post} April, 2022 -:tags: time series, prophet -:category: intermediate -:author: Marco Gorelli, Danh Phan -::: - -+++ - -We're going to look at the "air passengers" dataset, which tracks the monthly totals of a US airline passengers from 1949 to 1960. We could fit this using the [Prophet](https://facebook.github.io/prophet/) model {cite:p}`taylor2018forecasting` (indeed, this dataset is one of the examples they provide in their documentation), but instead we'll make our own Prophet-like model in PyMC3. This will make it a lot easier to inspect the model's components and to do prior predictive checks (an integral component of the [Bayesian workflow](https://arxiv.org/abs/2011.01808) {cite:p}`gelman2020bayesian`). 
- -```{code-cell} ipython3 -import arviz as az -import matplotlib.dates as mdates -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -%config InlineBackend.figure_format = 'retina' -``` - -```{code-cell} ipython3 -# get data -try: - df = pd.read_csv("../data/AirPassengers.csv", parse_dates=["Month"]) -except FileNotFoundError: - df = pd.read_csv(pm.get_data("AirPassengers.csv"), parse_dates=["Month"]) -``` - -## Before we begin: visualise the data - -```{code-cell} ipython3 -df.plot.scatter(x="Month", y="#Passengers", color="k"); -``` - -There's an increasing trend, with multiplicative seasonality. We'll fit a linear trend, and "borrow" the multiplicative seasonality part of it from Prophet. - -+++ - -## Part 0: scale the data - -+++ - -First, we'll scale time to be between 0 and 1: - -```{code-cell} ipython3 -t = (df["Month"] - pd.Timestamp("1900-01-01")).dt.days.to_numpy() -t_min = np.min(t) -t_max = np.max(t) -t = (t - t_min) / (t_max - t_min) -``` - -Next, for the target variable, we divide by the maximum. We do this, rather than standardising, so that the sign of the observations in unchanged - this will be necessary for the seasonality component to work properly later on. - -```{code-cell} ipython3 -y = df["#Passengers"].to_numpy() -y_max = np.max(y) -y = y / y_max -``` - -## Part 1: linear trend - -The model we'll fit, for now, will just be - -$$\text{Passengers} \sim \alpha + \beta\ \text{time}$$ - -First, let's try using the default priors set by prophet, and we'll do a prior predictive check: - -```{code-cell} ipython3 -with pm.Model(check_bounds=False) as linear: - α = pm.Normal("α", mu=0, sigma=5) - β = pm.Normal("β", mu=0, sigma=5) - σ = pm.HalfNormal("σ", sigma=5) - trend = pm.Deterministic("trend", α + β * t) - pm.Normal("likelihood", mu=trend, sigma=σ, observed=y) - - linear_prior = pm.sample_prior_predictive() - -fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_prior, group="prior_predictive", num_samples=100)["likelihood"] - * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Prior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_prior, group="prior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Prior trend lines"); -``` - -We can do better than this. These priors are evidently too wide, as we end up with implausibly many passengers. Let's try setting tighter priors. 
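-
-Before re-fitting, a rough back-of-the-envelope check of a candidate scale (Normal(0, 0.5) here, chosen by eye and matching the priors used in the next cell) can be done by hand, pushing a few draws through the trend and back onto the passenger scale:
-
-```{code-cell} ipython3
-# Back-of-the-envelope check of a candidate prior scale (0.5 is an assumed value)
-candidate_sigma = 0.5
-alpha_draws = rng.normal(0, candidate_sigma, size=1000)
-beta_draws = rng.normal(0, candidate_sigma, size=1000)
-implied_passengers = (alpha_draws[:, None] + beta_draws[:, None] * t[None, :]) * y_max
-
-np.quantile(implied_passengers, [0.05, 0.95]).round()
-```
-
-The bulk of the implied trend now sits in the hundreds of passengers rather than many thousands, which is at least the right order of magnitude.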
- -```{code-cell} ipython3 -with pm.Model(check_bounds=False) as linear: - α = pm.Normal("α", mu=0, sigma=0.5) - β = pm.Normal("β", mu=0, sigma=0.5) - σ = pm.HalfNormal("σ", sigma=0.1) - trend = pm.Deterministic("trend", α + β * t) - pm.Normal("likelihood", mu=trend, sigma=σ, observed=y) - - linear_prior = pm.sample_prior_predictive(samples=100) - -fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_prior, group="prior_predictive", num_samples=100)["likelihood"] - * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Prior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_prior, group="prior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Prior trend lines"); -``` - -Cool. Before going on to anything more complicated, let's try conditioning on the data and doing a posterior predictive check: - -```{code-cell} ipython3 -with linear: - linear_trace = pm.sample(return_inferencedata=True) - linear_prior = pm.sample_posterior_predictive(trace=linear_trace) - -fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_prior, group="posterior_predictive", num_samples=100)["likelihood"] - * y_max, - color="blue", - alpha=0.01, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Posterior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_trace, group="posterior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.01, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Posterior trend lines"); -``` - -## Part 2: enter seasonality - -To model seasonality, we'll "borrow" the approach taken by Prophet - see [the Prophet paper](https://peerj.com/preprints/3190/) {cite:p}`taylor2018forecasting` for details, but the idea is to make a matrix of Fourier features which get multiplied by a vector of coefficients. As we'll be using multiplicative seasonality, the final model will be - -$$\text{Passengers} \sim (\alpha + \beta\ \text{time}) (1 + \text{seasonality})$$ - -```{code-cell} ipython3 -n_order = 10 -periods = (df["Month"] - pd.Timestamp("1900-01-01")).dt.days / 365.25 - -fourier_features = pd.DataFrame( - { - f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order) - for order in range(1, n_order + 1) - for func in ("sin", "cos") - } -) -fourier_features -``` - -Again, let's use the default Prophet priors, just to see what happens. 
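-
-Before that, it can help to see what the Fourier term actually produces: the seasonal effect at each time point is just the dot product of the feature matrix with a coefficient vector. Here is one arbitrary, made-up coefficient draw (the 0.1 scale is an assumption, not a fitted value):
-
-```{code-cell} ipython3
-# Illustration only: the seasonal pattern implied by one arbitrary coefficient draw
-example_coefs = rng.normal(0, 0.1, size=2 * n_order)
-example_seasonality = fourier_features.to_numpy() @ example_coefs
-
-fig, ax = plt.subplots(figsize=(8, 3))
-ax.plot(df["Month"].iloc[:24], example_seasonality[:24] * 100)
-ax.set_ylabel("Percent change")
-ax.set_title("Seasonal pattern implied by one arbitrary coefficient draw");
-```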
- -```{code-cell} ipython3 -coords = {"fourier_features": np.arange(2 * n_order)} -with pm.Model(check_bounds=False, coords=coords) as linear_with_seasonality: - α = pm.Normal("α", mu=0, sigma=0.5) - β = pm.Normal("β", mu=0, sigma=0.5) - σ = pm.HalfNormal("σ", sigma=0.1) - β_fourier = pm.Normal("β_fourier", mu=0, sigma=10, dims="fourier_features") - seasonality = pm.Deterministic( - "seasonality", pm.math.dot(β_fourier, fourier_features.to_numpy().T) - ) - trend = pm.Deterministic("trend", α + β * t) - μ = trend * (1 + seasonality) - pm.Normal("likelihood", mu=μ, sigma=σ, observed=y) - - linear_seasonality_prior = pm.sample_prior_predictive() - -fig, ax = plt.subplots(nrows=3, ncols=1, sharex=False, figsize=(8, 6)) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_seasonality_prior, group="prior_predictive", num_samples=100)[ - "likelihood" - ] - * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Prior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_seasonality_prior, group="prior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Prior trend lines") -ax[2].plot( - df["Month"].iloc[:12], - az.extract_dataset(linear_seasonality_prior, group="prior", num_samples=100)["seasonality"][:12] - * 100, - color="blue", - alpha=0.05, -) -ax[2].set_title("Prior seasonality") -ax[2].set_ylabel("Percent change") -formatter = mdates.DateFormatter("%b") -ax[2].xaxis.set_major_formatter(formatter); -``` - -Again, this seems implausible. The default priors are too wide for our use-case, and it doesn't make sense to use them when we can do prior predictive checks to set more sensible ones. Let's try with some narrower ones: - -```{code-cell} ipython3 -coords = {"fourier_features": np.arange(2 * n_order)} -with pm.Model(check_bounds=False, coords=coords) as linear_with_seasonality: - α = pm.Normal("α", mu=0, sigma=0.5) - β = pm.Normal("β", mu=0, sigma=0.5) - trend = pm.Deterministic("trend", α + β * t) - - β_fourier = pm.Normal("β_fourier", mu=0, sigma=0.1, dims="fourier_features") - seasonality = pm.Deterministic( - "seasonality", pm.math.dot(β_fourier, fourier_features.to_numpy().T) - ) - - μ = trend * (1 + seasonality) - σ = pm.HalfNormal("σ", sigma=0.1) - pm.Normal("likelihood", mu=μ, sigma=σ, observed=y) - - linear_seasonality_prior = pm.sample_prior_predictive() - -fig, ax = plt.subplots(nrows=3, ncols=1, sharex=False, figsize=(8, 6)) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_seasonality_prior, group="prior_predictive", num_samples=100)[ - "likelihood" - ] - * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Prior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_seasonality_prior, group="prior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Prior trend lines") -ax[2].plot( - df["Month"].iloc[:12], - az.extract_dataset(linear_seasonality_prior, group="prior", num_samples=100)["seasonality"][:12] - * 100, - color="blue", - alpha=0.05, -) -ax[2].set_title("Prior seasonality") -ax[2].set_ylabel("Percent change") -formatter = mdates.DateFormatter("%b") -ax[2].xaxis.set_major_formatter(formatter); -``` - -Seems a lot more believable. 
Time for a posterior predictive check: - -```{code-cell} ipython3 -with linear_with_seasonality: - linear_seasonality_trace = pm.sample(return_inferencedata=True) - linear_seasonality_posterior = pm.sample_posterior_predictive(trace=linear_seasonality_trace) - -fig, ax = plt.subplots(nrows=3, ncols=1, sharex=False, figsize=(8, 6)) -ax[0].plot( - df["Month"], - az.extract_dataset(linear_seasonality_posterior, group="posterior_predictive", num_samples=100)[ - "likelihood" - ] - * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[0]) -ax[0].set_title("Posterior predictive") -ax[1].plot( - df["Month"], - az.extract_dataset(linear_trace, group="posterior", num_samples=100)["trend"] * y_max, - color="blue", - alpha=0.05, -) -df.plot.scatter(x="Month", y="#Passengers", color="k", ax=ax[1]) -ax[1].set_title("Posterior trend lines") -ax[2].plot( - df["Month"].iloc[:12], - az.extract_dataset(linear_seasonality_trace, group="posterior", num_samples=100)["seasonality"][ - :12 - ] - * 100, - color="blue", - alpha=0.05, -) -ax[2].set_title("Posterior seasonality") -ax[2].set_ylabel("Percent change") -formatter = mdates.DateFormatter("%b") -ax[2].xaxis.set_major_formatter(formatter); -``` - -Neat! - -+++ - -## Conclusion - -We saw how we could implement a Prophet-like model ourselves and fit it to the air passengers dataset. Prophet is an awesome library and a net-positive to the community, but by implementing it ourselves, however, we can take whichever components of it we think are relevant to our problem, customise them, and carry out the Bayesian workflow {cite:p}`gelman2020bayesian`). Next time you have a time series problem, I hope you will try implementing your own probabilistic model rather than using Prophet as a "black-box" whose arguments are tuneable hyperparameters. 
- -For reference, you might also want to check out: -- [TimeSeeers](https://github.com/MBrouns/timeseers), a hierarchical Bayesian Time Series model based on Facebooks Prophet, written in PyMC3 -- [PM-Prophet](https://github.com/luke14free/pm-prophet), a Pymc3-based universal time series prediction and decomposition library inspired by Facebook Prophet - -+++ - -## Authors -* Authored by [Marco Gorelli](https://github.com/MarcoGorelli) in June, 2021 ([pymc-examples#183](https://github.com/pymc-devs/pymc-examples/pull/183)) -* Updated by Danh Phan in May, 2022 ([pymc-examples#320](https://github.com/pymc-devs/pymc-examples/pull/320)) - -+++ - -## References -:::{bibliography} -:filter: docname in docnames -::: - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/time_series/Euler-Maruyama_and_SDEs.myst.md b/myst_nbs/time_series/Euler-Maruyama_and_SDEs.myst.md deleted file mode 100644 index 677fbc539..000000000 --- a/myst_nbs/time_series/Euler-Maruyama_and_SDEs.myst.md +++ /dev/null @@ -1,363 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "slide"}} - -# Inferring parameters of SDEs using a Euler-Maruyama scheme - -_This notebook is derived from a presentation prepared for the Theoretical Neuroscience Group, Institute of Systems Neuroscience at Aix-Marseile University._ - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false -slideshow: - slide_type: '-' ---- -%pylab inline -import arviz as az -import pymc3 as pm -import scipy -import theano.tensor as tt - -from pymc3.distributions.timeseries import EulerMaruyama -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -+++ {"button": false, "nbpresent": {"id": "2325c7f9-37bd-4a65-aade-86bee1bff5e3"}, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "slide"}} - -## Toy model 1 - -Here's a scalar linear SDE in symbolic form - -$ dX_t = \lambda X_t + \sigma^2 dW_t $ - -discretized with the Euler-Maruyama scheme - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -# parameters -λ = -0.78 -σ2 = 5e-3 -N = 200 -dt = 1e-1 - -# time series -x = 0.1 -x_t = [] - -# simulate -for i in range(N): - x += dt * λ * x + sqrt(dt) * σ2 * randn() - x_t.append(x) - -x_t = array(x_t) - -# z_t noisy observation -z_t = x_t + randn(x_t.size) * 5e-3 -``` - -```{code-cell} ipython3 ---- -button: false -nbpresent: - id: 0994bfef-45dc-48da-b6bf-c7b38d62bf11 -new_sheet: false -run_control: - read_only: false -slideshow: - slide_type: subslide ---- -figure(figsize=(10, 3)) -subplot(121) -plot(x_t[:30], "k", label="$x(t)$", alpha=0.5), plot(z_t[:30], "r", label="$z(t)$", alpha=0.5) -title("Transient"), legend() -subplot(122) -plot(x_t[30:], "k", label="$x(t)$", alpha=0.5), plot(z_t[30:], "r", label="$z(t)$", alpha=0.5) -title("All time") -tight_layout() -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}} - -What is the inference we want to make? 
Since we've made a noisy observation of the generated time series, we need to estimate both $x(t)$ and $\lambda$. - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -First, we rewrite our SDE as a function returning a tuple of the drift and diffusion coefficients - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -def lin_sde(x, lam): - return lam * x, σ2 -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -Next, we describe the probability model as a set of three stochastic variables, `lam`, `xh`, and `zh`: - -```{code-cell} ipython3 ---- -button: false -nbpresent: - id: 4f90230d-f303-4b3b-a69e-304a632c6407 -new_sheet: false -run_control: - read_only: false -slideshow: - slide_type: '-' ---- -with pm.Model() as model: - - # uniform prior, but we know it must be negative - lam = pm.Flat("lam") - - # "hidden states" following a linear SDE distribution - # parametrized by time step (det. variable) and lam (random variable) - xh = EulerMaruyama("xh", dt, lin_sde, (lam,), shape=N, testval=x_t) - - # predicted observation - zh = pm.Normal("zh", mu=xh, sigma=5e-3, observed=z_t) -``` - -+++ {"button": false, "nbpresent": {"id": "287d10b5-0193-4ffe-92a7-362993c4b72e"}, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -Once the model is constructed, we perform inference, i.e. sample from the posterior distribution, in the following steps: - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -with model: - trace = pm.sample(2000, tune=1000) -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -Next, we plot some basic statistics on the samples from the posterior, - -```{code-cell} ipython3 ---- -button: false -nbpresent: - id: 925f1829-24cb-4c28-9b6b-7e9c9e86f2fd -new_sheet: false -run_control: - read_only: false ---- -figure(figsize=(10, 3)) -subplot(121) -plot(percentile(trace[xh], [2.5, 97.5], axis=0).T, "k", label=r"$\hat{x}_{95\%}(t)$") -plot(x_t, "r", label="$x(t)$") -legend() - -subplot(122) -hist(trace[lam], 30, label=r"$\hat{\lambda}$", alpha=0.5) -axvline(λ, color="r", label=r"$\lambda$", alpha=0.5) -legend(); -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -A model can fit the data precisely and still be wrong; we need to use _posterior predictive checks_ to assess if, under our fit model, the data our likely. - -In other words, we -- assume the model is correct -- simulate new observations -- check that the new observations fit with the original data - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -# generate trace from posterior -ppc_trace = pm.sample_posterior_predictive(trace, model=model) - -# plot with data -figure(figsize=(10, 3)) -plot(percentile(ppc_trace["zh"], [2.5, 97.5], axis=0).T, "k", label=r"$z_{95\% PP}(t)$") -plot(z_t, "r", label="$z(t)$") -legend() -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}} - -Note that - -- inference also estimates the initial conditions -- the observed data $z(t)$ lies fully within the 95% interval of the PPC. 
-- there are many other ways of evaluating fit
-
-+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "slide"}}
-
-### Toy model 2
-
-As the next model, let's use a 2D deterministic oscillator,
-\begin{align}
-\dot{x} &= \tau (x - x^3/3 + y) \\
-\dot{y} &= \frac{1}{\tau} (a - x)
-\end{align}
-
-with noisy observation $z(t) = m x + (1 - m) y + N(0, 0.1^2)$.
-
-```{code-cell} ipython3
----
-button: false
-new_sheet: false
-run_control:
-  read_only: false
----
-N, τ, a, m, σ2 = 200, 3.0, 1.05, 0.2, 1e-1
-xs, ys = [0.0], [1.0]
-for i in range(N):
-    x, y = xs[-1], ys[-1]
-    dx = τ * (x - x**3.0 / 3.0 + y)
-    dy = (1.0 / τ) * (a - x)
-    xs.append(x + dt * dx + sqrt(dt) * σ2 * randn())
-    ys.append(y + dt * dy + sqrt(dt) * σ2 * randn())
-xs, ys = array(xs), array(ys)
-zs = m * xs + (1 - m) * ys + randn(xs.size) * 0.1
-
-figure(figsize=(10, 2))
-plot(xs, label="$x(t)$")
-plot(ys, label="$y(t)$")
-plot(zs, label="$z(t)$")
-legend()
-```
-
-+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}}
-
-Now, estimate the hidden states $x(t)$ and $y(t)$, as well as the parameters $\tau$, $a$ and $m$.
-
-As before, we rewrite our SDE as a function that returns the drift and diffusion coefficients:
-
-```{code-cell} ipython3
----
-button: false
-new_sheet: false
-run_control:
-  read_only: false
----
-def osc_sde(xy, τ, a):
-    x, y = xy[:, 0], xy[:, 1]
-    dx = τ * (x - x**3.0 / 3.0 + y)
-    dy = (1.0 / τ) * (a - x)
-    dxy = tt.stack([dx, dy], axis=0).T
-    return dxy, σ2
-```
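-
-For reference, the update that the `EulerMaruyama` distribution encodes for a drift–diffusion pair $(f, g)$, such as the one returned by `osc_sde`, is
-
-$$X_{i + 1} = X_i + f(X_i)\, \mathrm{d}t + g \sqrt{\mathrm{d}t}\; w_i, \qquad w_i \sim N(0, 1),$$
-
-so each latent state is Gaussian, centred on a one-step deterministic prediction from the previous state.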
- -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -We can now write our statistical model as before, with uninformative priors on $\tau$, $a$ and $m$: - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -xys = c_[xs, ys] - -with pm.Model() as model: - τh = pm.Uniform("τh", lower=0.1, upper=5.0) - ah = pm.Uniform("ah", lower=0.5, upper=1.5) - mh = pm.Uniform("mh", lower=0.0, upper=1.0) - xyh = EulerMaruyama("xyh", dt, osc_sde, (τh, ah), shape=xys.shape, testval=xys) - zh = pm.Normal("zh", mu=mh * xyh[:, 0] + (1 - mh) * xyh[:, 1], sigma=0.1, observed=zs) -``` - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -with model: - trace = pm.sample(2000, tune=1000) -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -Again, the result is a set of samples from the posterior, including our parameters of interest but also the hidden states - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -figure(figsize=(10, 6)) -subplot(211) -plot(percentile(trace[xyh][..., 0], [2.5, 97.5], axis=0).T, "k", label=r"$\hat{x}_{95\%}(t)$") -plot(xs, "r", label="$x(t)$") -legend(loc=0) -subplot(234), hist(trace["τh"]), axvline(τ), xlim([1.0, 4.0]), title("τ") -subplot(235), hist(trace["ah"]), axvline(a), xlim([0, 2.0]), title("a") -subplot(236), hist(trace["mh"]), axvline(m), xlim([0, 1]), title("m") -tight_layout() -``` - -+++ {"button": false, "new_sheet": false, "run_control": {"read_only": false}, "slideshow": {"slide_type": "subslide"}} - -Again, we can perform a posterior predictive check, that our data are likely given the fit model - -```{code-cell} ipython3 ---- -button: false -new_sheet: false -run_control: - read_only: false ---- -# generate trace from posterior -ppc_trace = pm.sample_posterior_predictive(trace, model=model) - -# plot with data -figure(figsize=(10, 3)) -plot(percentile(ppc_trace["zh"], [2.5, 97.5], axis=0).T, "k", label=r"$z_{95\% PP}(t)$") -plot(zs, "r", label="$z(t)$") -legend() -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/time_series/Forecasting_with_structural_timeseries.myst.md b/myst_nbs/time_series/Forecasting_with_structural_timeseries.myst.md deleted file mode 100644 index 0984b40e6..000000000 --- a/myst_nbs/time_series/Forecasting_with_structural_timeseries.myst.md +++ /dev/null @@ -1,763 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.6 ('prod_streamlit') - language: python - name: python3 ---- - -(forecasting_with_ar)= -# Forecasting with Structural AR Timeseries - -:::{post} Oct 20, 2022 -:tags: forecasting, autoregressive, bayesian structural timeseries -:category: intermediate -:author: Nathaniel Forde -::: - -+++ - -Bayesian structural timeseries models are an interesting way to learn about the structure inherent in any observed timeseries data. It also gives us the ability to project forward the implied predictive distribution granting us another view on forecasting problems. We can treat the learned characteristics of the timeseries data observed to-date as informative about the structure of the unrealised future state of the same measure. 
- -In this notebook we'll see how to fit and predict a range of auto-regressive structural timeseries models and, importantly, how to predict future observations of the models. - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pandas as pd -import pymc as pm - -from matplotlib import pyplot as plt -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8929 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -## Generate Fake Autoregressive Data - -First we will generate a simple autoregressive timeseries. We will show how to specify a model to fit this data and then add a number of complexities to the data and show how they too can be captured with an autoregressive model and used to predict the shape of the future. - -```{code-cell} ipython3 -def simulate_ar(intercept, coef1, coef2, noise=0.3, *, warmup=10, steps=200): - # We sample some extra warmup steps, to let the AR process stabilize - draws = np.zeros(warmup + steps) - # Initialize first draws at intercept - draws[:2] = intercept - for step in range(2, warmup + steps): - draws[step] = ( - intercept - + coef1 * draws[step - 1] - + coef2 * draws[step - 2] - + np.random.normal(0, noise) - ) - # Discard the warmup draws - return draws[warmup:] - - -# True parameters of the AR process -ar1_data = simulate_ar(10, -0.9, 0) - -fig, ax = plt.subplots(figsize=(10, 3)) -ax.set_title("Generated Autoregressive Timeseries", fontsize=15) -ax.plot(ar1_data); -``` - -## Specifying the Model - -We'll walk through the model step by step and then generalise the pattern into a function that can be used to take increasingly complex structural combinations of components. - -```{code-cell} ipython3 -## Set up a dictionary for the specification of our priors -## We set up the dictionary to specify size of the AR coefficients in -## case we want to vary the AR lags. -priors = { - "coefs": {"mu": [10, 0.2], "sigma": [0.1, 0.1], "size": 2}, - "sigma": 8, - "init": {"mu": 9, "sigma": 0.1, "size": 1}, -} - -## Initialise the model -with pm.Model() as AR: - pass - -## Define the time interval for fitting the data -t_data = list(range(len(ar1_data))) -## Add the time interval as a mutable coordinate to the model to allow for future predictions -AR.add_coord("obs_id", t_data, mutable=True) - -with AR: - ## Data containers to enable prediction - t = pm.MutableData("t", t_data, dims="obs_id") - y = pm.MutableData("y", ar1_data, dims="obs_id") - - # The first coefficient will be the constant term but we need to set priors for each coefficient in the AR process - coefs = pm.Normal("coefs", priors["coefs"]["mu"], priors["coefs"]["sigma"]) - sigma = pm.HalfNormal("sigma", priors["sigma"]) - # We need one init variable for each lag, hence size is variable too - init = pm.Normal.dist( - priors["init"]["mu"], priors["init"]["sigma"], size=priors["init"]["size"] - ) - # Steps of the AR model minus the lags required - ar1 = pm.AR( - "ar", - coefs, - sigma=sigma, - init_dist=init, - constant=True, - steps=t.shape[0] - (priors["coefs"]["size"] - 1), - dims="obs_id", - ) - - # The Likelihood - outcome = pm.Normal("likelihood", mu=ar1, sigma=sigma, observed=y, dims="obs_id") - ## Sampling - idata_ar = pm.sample_prior_predictive() - idata_ar.extend(pm.sample(2000, random_seed=100, target_accept=0.95)) - idata_ar.extend(pm.sample_posterior_predictive(idata_ar)) -``` - -```{code-cell} ipython3 -idata_ar -``` - -Lets check the model structure with plate notation and then examine the convergence diagnostics. 
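The plate diagram can be rendered directly from the model object; the same call is repeated further below once the prediction step has been added to the model.

```{code-cell} ipython3
pm.model_to_graphviz(AR)
```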
- -```{code-cell} ipython3 -az.plot_trace(idata_ar, figsize=(10, 6), kind="rank_vlines"); -``` - -Next we'll check the summary estimates for the to AR coefficients and the sigma term. - -```{code-cell} ipython3 -az.summary(idata_ar, var_names=["~ar"]) -``` - -We can see here that the model fit has fairly correctly estimated the true parameters of the data generating process. We can also see this if we plot the posterior ar distribution against our observed data. - -```{code-cell} ipython3 -fig, ax = plt.subplots(figsize=(10, 4)) -idata_ar.posterior.ar.mean(["chain", "draw"]).plot(ax=ax, label="Posterior Mean AR level") -ax.plot(ar1_data, "o", color="black", markersize=2, label="Observed Data") -ax.legend() -ax.set_title("Fitted AR process\nand observed data"); -``` - -## Prediction Step - -The next step works somewhat differently from generating posterior predictive observations for new data in a GLM model. Since we are forecasting from a learned posterior distribution of structural parameters we must condition on the learned parameters. Or put another way, we have to tell the model how many prediction steps we want to impute with the model we have just fit and from what basis to impute those values. - -So for the shape handling purposes we have to feed our model new data for prediction and specify how to incorporate the learned parameters of the AR process. To do so, we initialise a new AR process for the future and feed it a set of initialising values we have learned when fitting our model to data. To make this as precise as can be use use the Dirac distribution to constrain the initial AR values very tightly around the learned posterior parameters. - -```{code-cell} ipython3 -prediction_length = 250 -n = prediction_length - ar1_data.shape[0] -obs = list(range(prediction_length)) -with AR: - ## We need to have coords for the observations minus the lagged term to correctly centre the prediction step - AR.add_coords({"obs_id_fut_1": range(ar1_data.shape[0] - 1, 250, 1)}) - AR.add_coords({"obs_id_fut": range(ar1_data.shape[0], 250, 1)}) - # condition on the learned values of the AR process - # initialise the future AR process precisely at the last observed value in the AR process - # using the special feature of the dirac delta distribution to be 0 everywhere else. - ar1_fut = pm.AR( - "ar1_fut", - init_dist=pm.DiracDelta.dist(ar1[..., -1]), - rho=coefs, - sigma=sigma, - constant=True, - dims="obs_id_fut_1", - ) - yhat_fut = pm.Normal("yhat_fut", mu=ar1_fut[1:], sigma=sigma, dims="obs_id_fut") - # use the updated values and predict outcomes and probabilities: - idata_preds = pm.sample_posterior_predictive( - idata_ar, var_names=["likelihood", "yhat_fut"], predictions=True, random_seed=100 - ) -``` - -It's important to understand the conditional nature of the autoregressive forecast and the manner in which it depends on the observed data. -In our two-step model fit and predict process we have learned the posterior distribution for the parameters of an AR process, and then used those parameters to centre our forecasts. - -```{code-cell} ipython3 -pm.model_to_graphviz(AR) -``` - -```{code-cell} ipython3 -idata_preds -``` - -## Inspecting model fit and forecast - -We can look at the standard posterior predictive fits but since our data is timeseries data we have to also look how draws from the posterior predictive distribution vary over time. 
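As a quick first look before building the fuller plotting helper below, we can pull the forecast draws straight out of the `predictions` group and summarise them per timestep. This minimal sketch uses the same extraction pattern as the helper that follows.

```{code-cell} ipython3
# Minimal sketch: per-timestep median and 95% interval of the forecast draws.
yhat = az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["yhat_fut"]
lower, mid, upper = np.percentile(yhat, [2.5, 50, 97.5], axis=1)
x_fut = idata_preds["predictions"].coords["obs_id_fut"].data

fig, ax = plt.subplots(figsize=(10, 3))
ax.fill_between(x_fut, lower, upper, alpha=0.3, label="95% interval")
ax.plot(x_fut, mid, label="Median forecast")
ax.legend()
ax.set_title("Forecast draws over time");
```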
- -```{code-cell} ipython3 -def plot_fits(idata_ar, idata_preds): - palette = "plasma" - cmap = plt.get_cmap(palette) - percs = np.linspace(51, 99, 100) - colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs)) - mosaic = """AABB - CCCC""" - fig, axs = plt.subplot_mosaic(mosaic, sharex=False, figsize=(20, 10)) - axs = [axs[k] for k in axs.keys()] - for i, p in enumerate(percs[::-1]): - upper = np.percentile( - az.extract_dataset(idata_ar, group="prior_predictive", num_samples=1000)["likelihood"], - p, - axis=1, - ) - lower = np.percentile( - az.extract_dataset(idata_ar, group="prior_predictive", num_samples=1000)["likelihood"], - 100 - p, - axis=1, - ) - color_val = colors[i] - axs[0].fill_between( - x=idata_ar["constant_data"]["t"], - y1=upper.flatten(), - y2=lower.flatten(), - color=cmap(color_val), - alpha=0.1, - ) - - axs[0].plot( - az.extract_dataset(idata_ar, group="prior_predictive", num_samples=1000)["likelihood"].mean( - axis=1 - ), - color="cyan", - label="Prior Predicted Mean Realisation", - ) - - axs[0].scatter( - x=idata_ar["constant_data"]["t"], - y=idata_ar["constant_data"]["y"], - color="k", - label="Observed Data points", - ) - axs[0].set_title("Prior Predictive Fit", fontsize=20) - axs[0].legend() - - for i, p in enumerate(percs[::-1]): - upper = np.percentile( - az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["likelihood"], - p, - axis=1, - ) - lower = np.percentile( - az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["likelihood"], - 100 - p, - axis=1, - ) - color_val = colors[i] - axs[2].fill_between( - x=idata_preds["predictions_constant_data"]["t"], - y1=upper.flatten(), - y2=lower.flatten(), - color=cmap(color_val), - alpha=0.1, - ) - - upper = np.percentile( - az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["yhat_fut"], - p, - axis=1, - ) - lower = np.percentile( - az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["yhat_fut"], - 100 - p, - axis=1, - ) - color_val = colors[i] - axs[2].fill_between( - x=idata_preds["predictions"].coords["obs_id_fut"].data, - y1=upper.flatten(), - y2=lower.flatten(), - color=cmap(color_val), - alpha=0.1, - ) - - axs[2].plot( - az.extract_dataset(idata_preds, group="predictions", num_samples=1000)["likelihood"].mean( - axis=1 - ), - color="cyan", - ) - idata_preds.predictions.yhat_fut.mean(["chain", "draw"]).plot( - ax=axs[2], color="cyan", label="Predicted Mean Realisation" - ) - axs[2].scatter( - x=idata_ar["constant_data"]["t"], - y=idata_ar["constant_data"]["y"], - color="k", - label="Observed Data", - ) - axs[2].set_title("Posterior Predictions Plotted", fontsize=20) - axs[2].axvline(np.max(idata_ar["constant_data"]["t"]), color="black") - axs[2].legend() - axs[2].set_xlabel("Time in Days") - axs[0].set_xlabel("Time in Days") - az.plot_ppc(idata_ar, ax=axs[1]) - - -plot_fits(idata_ar, idata_preds) -``` - -Here we can see that although the model converged and ends up with a reasonable fit to the existing data, and a **plausible projection** for future values. However, we have set the prior specification very poorly in allowing an absurdly broad range of values due to the kind of compounding logic of the auto-regressive function. For this reason it's very important to be able to inspect and tailor your model with prior predictive checks. - -Secondly, the mean forecast fails to capture any long lasting structure, quickly dying down to a stable baseline. 
To account for these kind of short-lived forecasts, we can add more structure to our model, but first, let's complicate the picture. - -## Complicating the Picture - -Often our data will involve more than one latent process, and might have more complex factors which drive the outcomes. To see one such complication let's add a trend to our data. By adding more structure to our forecast we are telling our model that we expect certain patterns or trends to remain in the data out into the future. The choice of which structures to add are at the discretion of the creative modeller - here we'll demonstrate some simple examples. - -```{code-cell} ipython3 -y_t = -0.3 + np.arange(200) * -0.2 + np.random.normal(0, 10, 200) -y_t = y_t + ar1_data - -fig, ax = plt.subplots(figsize=(10, 4)) -ax.plot(y_t) -ax.set_title("AR Process + Trend data"); -``` - -### Wrapping our model into a function - -```{code-cell} ipython3 -def make_latent_AR_model(ar_data, priors, prediction_steps=250, full_sample=True, samples=2000): - with pm.Model() as AR: - pass - - t_data = list(range(len(ar_data))) - AR.add_coord("obs_id", t_data, mutable=True) - - with AR: - ## Data containers to enable prediction - t = pm.MutableData("t", t_data, dims="obs_id") - y = pm.MutableData("y", ar_data, dims="obs_id") - # The first coefficient will be the intercept term - coefs = pm.Normal("coefs", priors["coefs"]["mu"], priors["coefs"]["sigma"]) - sigma = pm.HalfNormal("sigma", priors["sigma"]) - # We need one init variable for each lag, hence size is variable too - init = pm.Normal.dist( - priors["init"]["mu"], priors["init"]["sigma"], size=priors["init"]["size"] - ) - # Steps of the AR model minus the lags required given specification - ar1 = pm.AR( - "ar", - coefs, - sigma=sigma, - init_dist=init, - constant=True, - steps=t.shape[0] - (priors["coefs"]["size"] - 1), - dims="obs_id", - ) - - # The Likelihood - outcome = pm.Normal("likelihood", mu=ar1, sigma=sigma, observed=y, dims="obs_id") - ## Sampling - idata_ar = pm.sample_prior_predictive() - if full_sample: - idata_ar.extend(pm.sample(samples, random_seed=100, target_accept=0.95)) - idata_ar.extend(pm.sample_posterior_predictive(idata_ar)) - else: - return idata_ar - - n = prediction_steps - ar_data.shape[0] - - with AR: - AR.add_coords({"obs_id_fut_1": range(ar1_data.shape[0] - 1, 250, 1)}) - AR.add_coords({"obs_id_fut": range(ar1_data.shape[0], 250, 1)}) - # condition on the learned values of the AR process - # initialise the future AR process precisely at the last observed value in the AR process - # using the special feature of the dirac delta distribution to be 0 probability everywhere else. - ar1_fut = pm.AR( - "ar1_fut", - init_dist=pm.DiracDelta.dist(ar1[..., -1]), - rho=coefs, - sigma=sigma, - constant=True, - dims="obs_id_fut_1", - ) - yhat_fut = pm.Normal("yhat_fut", mu=ar1_fut[1:], sigma=sigma, dims="obs_id_fut") - # use the updated values and predict outcomes and probabilities: - idata_preds = pm.sample_posterior_predictive( - idata_ar, var_names=["likelihood", "yhat_fut"], predictions=True, random_seed=100 - ) - - return idata_ar, idata_preds, AR -``` - -Next we'll cycle through a number of prior specifications to show how that impacts the prior predictive distribution i.e. the implied distribution of our outcome if we were to forward sample from the model specified by our priors. 
- -```{code-cell} ipython3 -priors_0 = { - "coefs": {"mu": [-4, 0.2], "sigma": 0.1, "size": 2}, - "sigma": 8, - "init": {"mu": 9, "sigma": 0.1, "size": 1}, -} - -priors_1 = { - "coefs": {"mu": [-2, 0.2], "sigma": 0.1, "size": 2}, - "sigma": 12, - "init": {"mu": 8, "sigma": 0.1, "size": 1}, -} - -priors_2 = { - "coefs": {"mu": [0, 0.2], "sigma": 0.1, "size": 2}, - "sigma": 15, - "init": {"mu": 8, "sigma": 0.1, "size": 1}, -} - -models = {} -for i, p in enumerate([priors_0, priors_1, priors_2]): - models[i] = {} - idata = make_latent_AR_model(y_t, p, full_sample=False) - models[i]["idata"] = idata -``` - -```{code-cell} ipython3 -fig, axs = plt.subplots(1, 3, figsize=(10, 4), sharey=True) -axs = axs.flatten() -for i, p in zip(range(3), [priors_0, priors_1, priors_2]): - axs[i].plot( - az.extract_dataset(models[i]["idata"], group="prior_predictive", num_samples=100)[ - "likelihood" - ], - color="blue", - alpha=0.1, - ) - axs[i].plot(y_t, "o", color="black", markersize=2) - axs[i].set_title( - "$y_{t+1}$" + f'= N({p["coefs"]["mu"][0]} + {p["coefs"]["mu"][1]}y$_t$, {p["sigma"]})' - ) -plt.suptitle("Prior Predictive Specifications", fontsize=20); -``` - -We can see the manner in which the model struggles to capture the trend line. Increasing the variability of the model will never capture the directional pattern we know to be in the data. - -```{code-cell} ipython3 -priors_0 = { - "coefs": {"mu": [-4, 0.2], "sigma": [0.5, 0.03], "size": 2}, - "sigma": 8, - "init": {"mu": -4, "sigma": 0.1, "size": 1}, -} - -idata_no_trend, preds_no_trend, model = make_latent_AR_model(y_t, priors_0) -``` - -```{code-cell} ipython3 -plot_fits(idata_no_trend, preds_no_trend) -``` - -Forecasting with this model is somewhat hopeless because, while the model fit adjusts well with observed data, but it completely fails to capture the structural trend in the data. So without some structural constraint when we seek to make predictions with this simple AR model, it reverts to the mean level forecast very quickly. - -+++ - -### Specifying a Trend Model - -We will define a model to account for the trend in our data and combine this trend in an additive model with the autoregressive components. Again the model is much as before, but now we add additional latent features. These are to be combined in a simple additive combination but we can be more creative here if it would suit our model. 
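Schematically, the mean of the likelihood becomes the sum of the two latent components,

$$ \mu_t = \underbrace{\alpha + \beta t}_{\text{linear trend}} + \underbrace{ar_t}_{\text{latent AR process}}, \qquad y_t \sim N(\mu_t, \sigma), $$

which corresponds to the `mu = ar1 + trend` line in the function below.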
- -```{code-cell} ipython3 -def make_latent_AR_trend_model( - ar_data, priors, prediction_steps=250, full_sample=True, samples=2000 -): - with pm.Model() as AR: - pass - - t_data = list(range(len(ar_data))) - AR.add_coord("obs_id", t_data, mutable=True) - - with AR: - ## Data containers to enable prediction - t = pm.MutableData("t", t_data, dims="obs_id") - y = pm.MutableData("y", ar_data, dims="obs_id") - # The first coefficient will be the intercept term - coefs = pm.Normal("coefs", priors["coefs"]["mu"], priors["coefs"]["sigma"]) - sigma = pm.HalfNormal("sigma", priors["sigma"]) - # We need one init variable for each lag, hence size is variable too - init = pm.Normal.dist( - priors["init"]["mu"], priors["init"]["sigma"], size=priors["init"]["size"] - ) - # Steps of the AR model minus the lags required given specification - ar1 = pm.AR( - "ar", - coefs, - sigma=sigma, - init_dist=init, - constant=True, - steps=t.shape[0] - (priors["coefs"]["size"] - 1), - dims="obs_id", - ) - - ## Priors for the linear trend component - alpha = pm.Normal("alpha", priors["alpha"]["mu"], priors["alpha"]["sigma"]) - beta = pm.Normal("beta", priors["beta"]["mu"], priors["beta"]["sigma"]) - trend = pm.Deterministic("trend", alpha + beta * t, dims="obs_id") - - mu = ar1 + trend - - # The Likelihood - outcome = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y, dims="obs_id") - ## Sampling - idata_ar = pm.sample_prior_predictive() - if full_sample: - idata_ar.extend(pm.sample(samples, random_seed=100, target_accept=0.95)) - idata_ar.extend(pm.sample_posterior_predictive(idata_ar)) - else: - return idata_ar - - n = prediction_steps - ar_data.shape[0] - - with AR: - AR.add_coords({"obs_id_fut_1": range(ar1_data.shape[0] - 1, prediction_steps, 1)}) - AR.add_coords({"obs_id_fut": range(ar1_data.shape[0], prediction_steps, 1)}) - t_fut = pm.MutableData("t_fut", list(range(ar1_data.shape[0], prediction_steps, 1))) - # condition on the learned values of the AR process - # initialise the future AR process precisely at the last observed value in the AR process - # using the special feature of the dirac delta distribution to be 0 probability everywhere else. - ar1_fut = pm.AR( - "ar1_fut", - init_dist=pm.DiracDelta.dist(ar1[..., -1]), - rho=coefs, - sigma=sigma, - constant=True, - dims="obs_id_fut_1", - ) - trend = pm.Deterministic("trend_fut", alpha + beta * t_fut, dims="obs_id_fut") - mu = ar1_fut[1:] + trend - - yhat_fut = pm.Normal("yhat_fut", mu=mu, sigma=sigma, dims="obs_id_fut") - # use the updated values and predict outcomes and probabilities: - idata_preds = pm.sample_posterior_predictive( - idata_ar, var_names=["likelihood", "yhat_fut"], predictions=True, random_seed=100 - ) - - return idata_ar, idata_preds, AR -``` - -We will fit this model by specifying priors on the negative trend and the range of the standard deviation to respect the direction of the data drift. - -```{code-cell} ipython3 -priors_0 = { - "coefs": {"mu": [0.2, 0.2], "sigma": [0.5, 0.03], "size": 2}, - "alpha": {"mu": -4, "sigma": 0.1}, - "beta": {"mu": -0.1, "sigma": 0.2}, - "sigma": 8, - "init": {"mu": -4, "sigma": 0.1, "size": 1}, -} - - -idata_trend, preds_trend, model = make_latent_AR_trend_model(y_t, priors_0, full_sample=True) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -We can see the structure more clearly with the plate notation, and this additional structure has helped to appropriately fit the directional trend of the timeseries data. 
- -```{code-cell} ipython3 -plot_fits(idata_trend, preds_trend); -``` - -```{code-cell} ipython3 -az.summary(idata_trend, var_names=["coefs", "sigma", "alpha", "beta"]) -``` - -## Complicating the picture further - -Next we'll add a seasonal component to our data and see how we can recover this aspect of the data with a bayesian structural timeseries model. Again, this is is because in reality our data is often the result of multiple converging influences. These influences can be capture in an additive bayesian structural model where our inferential model ensures that we allocate appropriate weight to each of the components. - -```{code-cell} ipython3 -t_data = list(range(200)) -n_order = 10 -periods = np.array(t_data) / 7 - -fourier_features = pd.DataFrame( - { - f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order) - for order in range(1, n_order + 1) - for func in ("sin", "cos") - } -) - -y_t_s = y_t + 20 * fourier_features["sin_order_1"] - -fig, ax = plt.subplots(figsize=(10, 4)) -ax.plot(y_t_s) -ax.set_title("AR + Trend + Seasonality"); -``` - -The key to fitting this model is to understand that we're now passing in synthetic fourier features to help account for seasonality effects. This works because (roughly speaking) we're trying to fit a complex oscillating phenomena using a weighted combination of sine and cosine waves. So we add these sine waves and consine waves like we would add any other feature variables in a regression model. - -However, since we're using this weighted sum to fit the observed data, the model now expects a linear combination of those synthetic features **also** in the prediction step. As such we need to be able to supply those features even out into the future. This fact remains key for any other type of predictive feature we might want to add e.g. day of the week, holiday dummy variable or any other. If a feature is required to fit the observed data the feature must be available in the prediction step too. - -### Specifying the Trend + Seasonal Model - -```{code-cell} ipython3 -def make_latent_AR_trend_seasonal_model( - ar_data, ff, priors, prediction_steps=250, full_sample=True, samples=2000 -): - with pm.Model() as AR: - pass - - ff = ff.to_numpy().T - t_data = list(range(len(ar_data))) - AR.add_coord("obs_id", t_data, mutable=True) - ## The fourier features must be mutable to allow for addition fourier features to be - ## passed in the prediction step. 
- AR.add_coord("fourier_features", np.arange(len(ff)), mutable=True) - - with AR: - ## Data containers to enable prediction - t = pm.MutableData("t", t_data, dims="obs_id") - y = pm.MutableData("y", ar_data, dims="obs_id") - # The first coefficient will be the intercept term - coefs = pm.Normal("coefs", priors["coefs"]["mu"], priors["coefs"]["sigma"]) - sigma = pm.HalfNormal("sigma", priors["sigma"]) - # We need one init variable for each lag, hence size is variable too - init = pm.Normal.dist( - priors["init"]["mu"], priors["init"]["sigma"], size=priors["init"]["size"] - ) - # Steps of the AR model minus the lags required given specification - ar1 = pm.AR( - "ar", - coefs, - sigma=sigma, - init_dist=init, - constant=True, - steps=t.shape[0] - (priors["coefs"]["size"] - 1), - dims="obs_id", - ) - - ## Priors for the linear trend component - alpha = pm.Normal("alpha", priors["alpha"]["mu"], priors["alpha"]["sigma"]) - beta = pm.Normal("beta", priors["beta"]["mu"], priors["beta"]["sigma"]) - trend = pm.Deterministic("trend", alpha + beta * t, dims="obs_id") - - ## Priors for seasonality - beta_fourier = pm.Normal( - "beta_fourier", - mu=priors["beta_fourier"]["mu"], - sigma=priors["beta_fourier"]["sigma"], - dims="fourier_features", - ) - fourier_terms = pm.MutableData("fourier_terms", ff) - seasonality = pm.Deterministic( - "seasonality", pm.math.dot(beta_fourier, fourier_terms), dims="obs_id" - ) - - mu = ar1 + trend + seasonality - - # The Likelihood - outcome = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=y, dims="obs_id") - ## Sampling - idata_ar = pm.sample_prior_predictive() - if full_sample: - idata_ar.extend(pm.sample(samples, random_seed=100, target_accept=0.95)) - idata_ar.extend(pm.sample_posterior_predictive(idata_ar)) - else: - return idata_ar - - n = prediction_steps - ar_data.shape[0] - n_order = 10 - periods = (ar_data.shape[0] + np.arange(n)) / 7 - - fourier_features_new = pd.DataFrame( - { - f"{func}_order_{order}": getattr(np, func)(2 * np.pi * periods * order) - for order in range(1, n_order + 1) - for func in ("sin", "cos") - } - ) - - with AR: - AR.add_coords({"obs_id_fut_1": range(ar1_data.shape[0] - 1, prediction_steps, 1)}) - AR.add_coords({"obs_id_fut": range(ar1_data.shape[0], prediction_steps, 1)}) - t_fut = pm.MutableData( - "t_fut", list(range(ar1_data.shape[0], prediction_steps, 1)), dims="obs_id_fut" - ) - ff_fut = pm.MutableData("ff_fut", fourier_features_new.to_numpy().T) - # condition on the learned values of the AR process - # initialise the future AR process precisely at the last observed value in the AR process - # using the special feature of the dirac delta distribution to be 0 probability everywhere else. 
- ar1_fut = pm.AR( - "ar1_fut", - init_dist=pm.DiracDelta.dist(ar1[..., -1]), - rho=coefs, - sigma=sigma, - constant=True, - dims="obs_id_fut_1", - ) - trend = pm.Deterministic("trend_fut", alpha + beta * t_fut, dims="obs_id_fut") - seasonality = pm.Deterministic( - "seasonality_fut", pm.math.dot(beta_fourier, ff_fut), dims="obs_id_fut" - ) - mu = ar1_fut[1:] + trend + seasonality - - yhat_fut = pm.Normal("yhat_fut", mu=mu, sigma=sigma, dims="obs_id_fut") - # use the updated values and predict outcomes and probabilities: - idata_preds = pm.sample_posterior_predictive( - idata_ar, var_names=["likelihood", "yhat_fut"], predictions=True, random_seed=743 - ) - - return idata_ar, idata_preds, AR -``` - -```{code-cell} ipython3 -priors_0 = { - "coefs": {"mu": [0.2, 0.2], "sigma": [0.5, 0.03], "size": 2}, - "alpha": {"mu": -4, "sigma": 0.1}, - "beta": {"mu": -0.1, "sigma": 0.2}, - "beta_fourier": {"mu": 0, "sigma": 2}, - "sigma": 8, - "init": {"mu": -4, "sigma": 0.1, "size": 1}, -} - - -idata_t_s, preds_t_s, model = make_latent_AR_trend_seasonal_model(y_t_s, fourier_features, priors_0) -``` - -```{code-cell} ipython3 -pm.model_to_graphviz(model) -``` - -```{code-cell} ipython3 -az.summary(idata_t_s, var_names=["alpha", "beta", "coefs", "beta_fourier"]) -``` - -```{code-cell} ipython3 -plot_fits(idata_t_s, preds_t_s) -``` - -We can see here how the model fit again recovers the broad structure and trend of the data, but in addition we have captured the oscillation of the seasonal effect and projected that into the future. - -# Closing Remarks - -The strength of a Bayesian model is largely the flexibility it offers for each modelling task. Hopefully this notebook gives a flavour of the variety of combinations worth considering when building a model to suit your use-case. We've seen how the Bayesian structural timeseries approach to forecasting can reveal the structure underlying our data, and be used to project that structure forward in time. We've seen how to encode different assumptions in the data generating model and calibrate our models against the observed data with posterior predictive checks. - -Notably in the case of Auto-regressive modelling we've explicitly relied on the learned posterior distribution of the structural components. In this aspect we think the above is a kind of pure (neatly contained) example of Bayesian learning. - -+++ - -## Authors - -Adapted from Nathaniel Forde's [Examined Algorithms Blog](https://nathanielf.github.io/post/bayesian_structural_timeseries/) by Nathaniel Forde in Oct 2022. 
- -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/time_series/MvGaussianRandomWalk_demo.myst.md b/myst_nbs/time_series/MvGaussianRandomWalk_demo.myst.md deleted file mode 100644 index 77b375a46..000000000 --- a/myst_nbs/time_series/MvGaussianRandomWalk_demo.myst.md +++ /dev/null @@ -1,241 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -# Multivariate Gaussian Random Walk -:::{post} Sep 25, 2021 -:tags: linear model, regression, time series -:category: beginner -::: - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano - -from scipy.linalg import cholesky - -%matplotlib inline -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -This notebook shows how to [fit a correlated time series](https://en.wikipedia.org/wiki/Curve_fitting) using multivariate [Gaussian random walks](https://en.wikipedia.org/wiki/Random_walk#Gaussian_random_walk) (GRWs). In particular, we perform a Bayesian [regression](https://en.wikipedia.org/wiki/Regression_analysis) of the time series data against a model dependent on GRWs. - -We generate data as the 3-dimensional time series - -$$ -\mathbf y = \alpha_{i[\mathbf t]} +\beta_{i[\mathbf t]} *\frac{\mathbf t}{300} +\xi_{\mathbf t},\quad \mathbf t = [0,1,...,299], -$$ (eqn:model) - -where -- $i\mapsto\alpha_{i}$ and $i\mapsto\beta_{i}$, $i\in\{0,1,2,3,4\}$, are two 3-dimensional Gaussian random walks for two correlation matrices $\Sigma_\alpha$ and $\Sigma_\beta$, -- we define the index -$$ -i[t]= j\quad\text{for}\quad t = 60j,60j+1,...,60j+59, \quad\text{and}\quad j = 0,1,2,3,4, -$$ -- $*$ means that we multiply the $j$-th column of the $3\times300$ matrix with the $j$-th entry of the vector for each $j=0,1,...,299$, and -- $\xi_{\mathbf t}$ is a $3\times300$ matrix with iid normal entries $N(0,\sigma^2)$. - - -So the series $\mathbf y$ changes due to the GRW $\alpha$ in five occasions, namely steps $0,60,120,180,240$. Meanwhile $\mathbf y$ changes at steps $1,60,120,180,240$ due to the increments of the GRW $\beta$ and at every step due to the weighting of $\beta$ with $\mathbf t/300$. Intuitively, we have a noisy ($\xi$) system that is shocked five times over a period of 300 steps, but the impact of the $\beta$ shocks gradually becomes more significant at every step. - -## Data generation - -Let's generate and plot the data. 
- -```{code-cell} ipython3 -D = 3 # Dimension of random walks -N = 300 # Number of steps -sections = 5 # Number of sections -period = N / sections # Number steps in each section - -Sigma_alpha = rng.standard_normal((D, D)) -Sigma_alpha = Sigma_alpha.T.dot(Sigma_alpha) # Construct covariance matrix for alpha -L_alpha = cholesky(Sigma_alpha, lower=True) # Obtain its Cholesky decomposition - -Sigma_beta = rng.standard_normal((D, D)) -Sigma_beta = Sigma_beta.T.dot(Sigma_beta) # Construct covariance matrix for beta -L_beta = cholesky(Sigma_beta, lower=True) # Obtain its Cholesky decomposition - -# Gaussian random walks: -alpha = np.cumsum(L_alpha.dot(rng.standard_normal((D, sections))), axis=1).T -beta = np.cumsum(L_beta.dot(rng.standard_normal((D, sections))), axis=1).T -t = np.arange(N)[:, None] / N -alpha = np.repeat(alpha, period, axis=0) -beta = np.repeat(beta, period, axis=0) -# Correlated series -sigma = 0.1 -y = alpha + beta * t + sigma * rng.standard_normal((N, 1)) - -# Plot the correlated series -plt.figure(figsize=(12, 5)) -plt.plot(t, y, ".", markersize=2, label=("y_0 data", "y_1 data", "y_2 data")) -plt.title("Three Correlated Series") -plt.xlabel("Time") -plt.legend() -plt.show(); -``` - -## Model -First we introduce a scaling class to rescale our data and the time parameter before the sampling and then rescale the predictions to match the unscaled data. - -```{code-cell} ipython3 -class Scaler: - def __init__(self): - mean_ = None - std_ = None - - def transform(self, x): - return (x - self.mean_) / self.std_ - - def fit_transform(self, x): - self.mean_ = x.mean(axis=0) - self.std_ = x.std(axis=0) - return self.transform(x) - - def inverse_transform(self, x): - return x * self.std_ + self.mean_ -``` - -We now construct the regression model in {eq}`eqn:model` imposing priors on the GRWs $\alpha$ and $\beta$, on the standard deviation $\sigma$ and hyperpriors on the Cholesky matrices. We use the LKJ prior {cite:p}`lewandowski2009generating` for the Cholesky matrices (see this {func}`link for the documentation ` and also the PyMC notebook {doc}`/case_studies/LKJ` for some usage examples.) 
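Before looking at the full model, here is a minimal standalone sketch of just that hyperprior: the packed lower-triangular entries receive the LKJ prior and are expanded back into a $D \times D$ Cholesky factor $L$, so that the implied covariance is $LL^T$. The `inference` function below uses exactly this construction for both $\Sigma_\alpha$ and $\Sigma_\beta$.

```{code-cell} ipython3
# Standalone sketch of the Cholesky hyperprior (illustrative only).
with pm.Model() as lkj_sketch:
    packed_L = pm.LKJCholeskyCov("packed_L", n=D, eta=2.0, sd_dist=pm.HalfCauchy.dist(2.5))
    L = pm.expand_packed_triangular(D, packed_L)
    # The implied covariance matrix is L @ L.T
    Sigma = pm.Deterministic("Sigma", L.dot(L.T))
```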
- -```{code-cell} ipython3 -def inference(t, y, sections, n_samples=100): - N, D = y.shape - - # Standardies y and t - y_scaler = Scaler() - t_scaler = Scaler() - y = y_scaler.fit_transform(y) - t = t_scaler.fit_transform(t) - # Create a section index - t_section = np.repeat(np.arange(sections), N / sections) - - # Create theano equivalent - t_t = theano.shared(np.repeat(t, D, axis=1)) - y_t = theano.shared(y) - t_section_t = theano.shared(t_section) - - coords = {"y_": ["y_0", "y_1", "y_2"], "steps": np.arange(N)} - with pm.Model(coords=coords) as model: - # Hyperpriors on Cholesky matrices - packed_L_alpha = pm.LKJCholeskyCov( - "packed_L_alpha", n=D, eta=2.0, sd_dist=pm.HalfCauchy.dist(2.5) - ) - L_alpha = pm.expand_packed_triangular(D, packed_L_alpha) - packed_L_beta = pm.LKJCholeskyCov( - "packed_L_beta", n=D, eta=2.0, sd_dist=pm.HalfCauchy.dist(2.5) - ) - L_beta = pm.expand_packed_triangular(D, packed_L_beta) - - # Priors on Gaussian random walks - alpha = pm.MvGaussianRandomWalk("alpha", shape=(sections, D), chol=L_alpha) - beta = pm.MvGaussianRandomWalk("beta", shape=(sections, D), chol=L_beta) - - # Deterministic construction of the correlated random walk - alpha_r = alpha[t_section_t] - beta_r = beta[t_section_t] - regression = alpha_r + beta_r * t_t - - # Prior on noise ξ - sigma = pm.HalfNormal("sigma", 1.0) - - # Likelihood - likelihood = pm.Normal("y", mu=regression, sigma=sigma, observed=y_t, dims=("steps", "y_")) - - # MCMC sampling - trace = pm.sample(n_samples, cores=4, return_inferencedata=True) - - # Posterior predictive sampling - trace.extend(az.from_pymc3(posterior_predictive=pm.sample_posterior_predictive(trace))) - - return trace, y_scaler, t_scaler, t_section -``` - -## Inference -We now sample from our model and we return the trace, the scaling functions for space and time and the scaled time index. - -```{code-cell} ipython3 -trace, y_scaler, t_scaler, t_section = inference(t, y, sections) -``` - -We now display the energy plot using {func}`arviz.plot_energy` for a visual check for the model's convergence. Then, using {func}`arviz.plot_ppc`, we plot the distribution of the {doc}`posterior predictive samples ` against the observed data $\mathbf y$. This plot provides a general idea of the accuracy of the model (note that the values of $\mathbf y$ actually correspond to the scaled version of $\mathbf y$). - -```{code-cell} ipython3 -az.plot_energy(trace) -az.plot_ppc(trace); -``` - -+++ {"jupyter": {"outputs_hidden": true}, "tags": []} - -## Posterior visualisation -The graphs above look good. Now we plot the observed 3-dimensional series against the average predicted 3-dimensional series, or in other words, we plot the data against the estimated regression curve from the model {eq}`eqn:model`. 
- -```{code-cell} ipython3 -# Compute the predicted mean of the multivariate GRWs -alpha_mean = trace.posterior["alpha"].mean(dim=("chain", "draw")) -beta_mean = trace.posterior["beta"].mean(dim=("chain", "draw")) - -# Compute the predicted mean of the correlated series -y_pred = y_scaler.inverse_transform( - alpha_mean[t_section].values + beta_mean[t_section].values * t_scaler.transform(t) -) - -# Plot the predicted mean -fig, ax = plt.subplots(1, 1, figsize=(12, 6)) -ax.plot(t, y, ".", markersize=2, label=("y_0 data", "y_1 data", "y_2 data")) -plt.gca().set_prop_cycle(None) -ax.plot(t, y_pred, label=("y_0 pred", "y_1 pred", "y_2 pred")) -ax.set_xlabel("Time") -ax.legend() -ax.set_title("Predicted Mean of Three Correlated Series"); -``` - -Finally, we plot the data against the posterior predictive samples. - -```{code-cell} ipython3 -:tags: [] - -# Rescale the posterior predictive samples -ppc_y = y_scaler.inverse_transform(trace.posterior_predictive["y"].mean("chain")) - -fig, ax = plt.subplots(1, 1, figsize=(12, 6)) -# Plot the data -ax.plot(t, y, ".", markersize=3, label=("y_0 data", "y_1 data", "y_2 data")) -# Plot the posterior predictive samples -ax.plot(t, ppc_y.sel(y_="y_0").T, color="C0", alpha=0.003) -ax.plot(t, ppc_y.sel(y_="y_1").T, color="C1", alpha=0.003) -ax.plot(t, ppc_y.sel(y_="y_2").T, color="C2", alpha=0.003) -ax.set_xlabel("Time") -ax.legend() -ax.set_title("Posterior Predictive Samples and the Three Correlated Series"); -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p theano,xarray -``` diff --git a/myst_nbs/time_series/bayesian_var_model.myst.md b/myst_nbs/time_series/bayesian_var_model.myst.md deleted file mode 100644 index 732953cf3..000000000 --- a/myst_nbs/time_series/bayesian_var_model.myst.md +++ /dev/null @@ -1,769 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3.9.0 ('pymc_ar_ex') - language: python - name: python3 ---- - -(Bayesian Vector Autoregressive Models)= -# Bayesian Vector Autoregressive Models - -:::{post} November, 2022 -:tags: time series, vector autoregressive model, hierarchical model -:category: intermediate -:author: Nathaniel Forde -::: - -```{code-cell} ipython3 -import os - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc as pm -import statsmodels.api as sm - -from pymc.sampling_jax import sample_blackjax_nuts -``` - -```{code-cell} ipython3 -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -%config InlineBackend.figure_format = 'retina' -``` - -## V(ector)A(uto)R(egression) Models - -In this notebook we will outline an application of the Bayesian Vector Autoregressive Modelling. We will draw on the work in the PYMC Labs [blogpost](https://www.pymc-labs.io/blog-posts/bayesian-vector-autoregression/) (see {cite:t}`vieira2022BVAR`). This will be a three part series. In the first we want to show how to fit Bayesian VAR models in PYMC. In the second we will show how to extract extra insight from the fitted model with Impulse Response analysis and make forecasts from the fitted VAR model. In the third and final post we will show in some more detail the benefits of using hierarchical priors with Bayesian VAR models. 
Specifically, we'll outline how and why there are actually a range of carefully formulated industry standard priors which work with Bayesian VAR modelling. - -In this post we will (i) demonstrate the basic pattern on a simple VAR model on fake data and show how the model recovers the true data generating parameters and (ii) we will show an example applied to macro-economic data and compare the results to those achieved on the same data with statsmodels MLE fits and (iii) show an example of estimating a hierarchical bayesian VAR model over a number of countries. - -## Autoregressive Models in General - -The idea of a simple autoregressive model is to capture the manner in which past observations of the timeseries are predictive of the current observation. So in traditional fashion, if we model this as a linear phenomena we get simple autoregressive models where the current value is predicted by a weighted linear combination of the past values and an error term. - -$$ y_t = \alpha + \beta_{y0} \cdot y_{t-1} + \beta_{y1} \cdot y_{t-2} ... + \epsilon $$ - -for however many lags are deemed appropriate to the predict the current observation. - -A VAR model is kind of generalisation of this framework in that it retains the linear combination approach but allows us to model multiple timeseries at once. So concretely this mean that $\mathbf{y}_{t}$ as a vector where: - -$$ \mathbf{y}_{T} = \nu + A_{1}\mathbf{y}_{T-1} + A_{2}\mathbf{y}_{T-2} ... A_{p}\mathbf{y}_{T-p} + \mathbf{e}_{t} $$ - -where the As are coefficient matrices to be combined with the past values of each individual timeseries. For example consider an economic example where we aim to model the relationship and mutual influence of each variable on themselves and one another. - -$$ \begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T} = \nu + A_{1}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-1} + - A_{2}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-2} ... A_{p}\begin{bmatrix} gdp \\ inv \\ con \end{bmatrix}_{T-p} + \mathbf{e}_{t} $$ - -This structure is compact representation using matrix notation. The thing we are trying to estimate when we fit a VAR model is the A matrices that determine the nature of the linear combination that best fits our timeseries data. Such timeseries models can have an auto-regressive or a moving average representation, and the details matter for some of the implication of a VAR model fit. - -We'll see in the next notebook of the series how the moving-average representation of a VAR lends itself to the interpretation of the covariance structure in our model as representing a kind of impulse-response relationship between the component timeseries. - -### A Concrete Specification with Two lagged Terms - -The matrix notation is convenient to suggest the broad patterns of the model, but it is useful to see the algebra is a simple case. Consider the case of Ireland's GDP and consumption described as: - -$$ gdp_{t} = \beta_{gdp1} \cdot gdp_{t-1} + \beta_{gdp2} \cdot gdp_{t-2} + \beta_{cons1} \cdot cons_{t-1} + \beta_{cons2} \cdot cons_{t-2} + \epsilon_{gdp}$$ -$$ cons_{t} = \beta_{cons1} \cdot cons_{t-1} + \beta_{cons2} \cdot cons_{t-2} + \beta_{gdp1} \cdot gdp_{t-1} + \beta_{gdp2} \cdot gdp_{t-2} + \epsilon_{cons}$$ - -In this way we can see that if we can estimate the $\beta$ terms we have an estimate for the bi-directional effects of each variable on the other. This is a useful feature of the modelling. 
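Stacking the two series into a vector shows how this example is just the general matrix form above restricted to two variables and two lags,

$$ \begin{bmatrix} gdp \\ cons \end{bmatrix}_{t} = \nu + A_{1}\begin{bmatrix} gdp \\ cons \end{bmatrix}_{t-1} + A_{2}\begin{bmatrix} gdp \\ cons \end{bmatrix}_{t-2} + \mathbf{e}_{t}, $$

where each $A_{i}$ is a $2 \times 2$ coefficient matrix whose rows hold the (in general distinct) coefficients of the GDP and consumption equations respectively.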
In what follows i should stress that i'm not an economist and I'm aiming to show only the functionality of these models not give you a decisive opinion about the economic relationships determining Irish GDP figures. - -### Creating some Fake Data - -```{code-cell} ipython3 -def simulate_var( - intercepts, coefs_yy, coefs_xy, coefs_xx, coefs_yx, noises=(1, 1), *, warmup=100, steps=200 -): - draws_y = np.zeros(warmup + steps) - draws_x = np.zeros(warmup + steps) - draws_y[:2] = intercepts[0] - draws_x[:2] = intercepts[1] - for step in range(2, warmup + steps): - draws_y[step] = ( - intercepts[0] - + coefs_yy[0] * draws_y[step - 1] - + coefs_yy[1] * draws_y[step - 2] - + coefs_xy[0] * draws_x[step - 1] - + coefs_xy[1] * draws_x[step - 2] - + rng.normal(0, noises[0]) - ) - draws_x[step] = ( - intercepts[1] - + coefs_xx[0] * draws_x[step - 1] - + coefs_xx[1] * draws_x[step - 2] - + coefs_yx[0] * draws_y[step - 1] - + coefs_yx[1] * draws_y[step - 2] - + rng.normal(0, noises[1]) - ) - return draws_y[warmup:], draws_x[warmup:] -``` - -First we generate some fake data with known parameters. - -```{code-cell} ipython3 -var_y, var_x = simulate_var( - intercepts=(18, 8), - coefs_yy=(-0.8, 0), - coefs_xy=(0.9, 0), - coefs_xx=(1.3, -0.7), - coefs_yx=(-0.1, 0.3), -) - -df = pd.DataFrame({"x": var_x, "y": var_y}) -df.head() -``` - -```{code-cell} ipython3 -fig, axs = plt.subplots(2, 1, figsize=(10, 3)) -axs[0].plot(df["x"], label="x") -axs[0].set_title("Series X") -axs[1].plot(df["y"], label="y") -axs[1].set_title("Series Y"); -``` - -## Handling Multiple Lags and Different Dimensions - -When Modelling multiple timeseries and accounting for potentially any number lags to incorporate in our model we need to abstract some of the model definition to helper functions. An example will make this a bit clearer. - -```{code-cell} ipython3 -### Define a helper function that will construct our autoregressive step for the marginal contribution of each lagged -### term in each of the respective time series equations -def calc_ar_step(lag_coefs, n_eqs, n_lags, df): - ars = [] - for j in range(n_eqs): - ar = pm.math.sum( - [ - pm.math.sum(lag_coefs[j, i] * df.values[n_lags - (i + 1) : -(i + 1)], axis=-1) - for i in range(n_lags) - ], - axis=0, - ) - ars.append(ar) - beta = pm.math.stack(ars, axis=-1) - - return beta - - -### Make the model in such a way that it can handle different specifications of the likelihood term -### and can be run for simple prior predictive checks. This latter functionality is important for debugging of -### shape handling issues. Building a VAR model involves quite a few moving parts and it is handy to -### inspect the shape implied in the prior predictive checks. 
-def make_model(n_lags, n_eqs, df, priors, mv_norm=True, prior_checks=True): - coords = { - "lags": np.arange(n_lags) + 1, - "equations": df.columns.tolist(), - "cross_vars": df.columns.tolist(), - "time": [x for x in df.index[n_lags:]], - } - - with pm.Model(coords=coords) as model: - lag_coefs = pm.Normal( - "lag_coefs", - mu=priors["lag_coefs"]["mu"], - sigma=priors["lag_coefs"]["sigma"], - dims=["equations", "lags", "cross_vars"], - ) - alpha = pm.Normal( - "alpha", mu=priors["alpha"]["mu"], sigma=priors["alpha"]["sigma"], dims=("equations",) - ) - data_obs = pm.Data("data_obs", df.values[n_lags:], dims=["time", "equations"], mutable=True) - - betaX = calc_ar_step(lag_coefs, n_eqs, n_lags, df) - betaX = pm.Deterministic( - "betaX", - betaX, - dims=[ - "time", - ], - ) - mean = alpha + betaX - - if mv_norm: - n = df.shape[1] - ## Under the hood the LKJ prior will retain the correlation matrix too. - noise_chol, _, _ = pm.LKJCholeskyCov( - "noise_chol", - eta=priors["noise_chol"]["eta"], - n=n, - sd_dist=pm.HalfNormal.dist(sigma=priors["noise_chol"]["sigma"]), - ) - obs = pm.MvNormal( - "obs", mu=mean, chol=noise_chol, observed=data_obs, dims=["time", "equations"] - ) - else: - ## This is an alternative likelihood that can recover sensible estimates of the coefficients - ## But lacks the multivariate correlation between the timeseries. - sigma = pm.HalfNormal("noise", sigma=priors["noise"]["sigma"], dims=["equations"]) - obs = pm.Normal( - "obs", mu=mean, sigma=sigma, observed=data_obs, dims=["time", "equations"] - ) - - if prior_checks: - idata = pm.sample_prior_predictive() - return model, idata - else: - idata = pm.sample_prior_predictive() - idata.extend(pm.sample(draws=2000, random_seed=130)) - pm.sample_posterior_predictive(idata, extend_inferencedata=True, random_seed=rng) - return model, idata -``` - -The model has a deterministic component in the auto-regressive calculation which is required at each timestep, but the key point here is that we model the likelihood of the VAR as a multivariate normal distribution with a particular covariance relationship. The estimation of these covariance relationship gives the main insight in the manner in which our component timeseries relate to one another. - -We will inspect the structure of a VAR with 2 lags and 2 equations - -```{code-cell} ipython3 -n_lags = 2 -n_eqs = 2 -priors = { - "lag_coefs": {"mu": 0.3, "sigma": 1}, - "alpha": {"mu": 15, "sigma": 5}, - "noise_chol": {"eta": 1, "sigma": 1}, - "noise": {"sigma": 1}, -} - -model, idata = make_model(n_lags, n_eqs, df, priors) -pm.model_to_graphviz(model) -``` - -Another VAR with 3 lags and 2 equations. - -```{code-cell} ipython3 -n_lags = 3 -n_eqs = 2 -model, idata = make_model(n_lags, n_eqs, df, priors) -for rv, shape in model.eval_rv_shapes().items(): - print(f"{rv:>11}: shape={shape}") -pm.model_to_graphviz(model) -``` - -We can inspect the correlation matrix between our timeseries which is implied by the prior specification, to see that we have allowed a flat uniform prior over their correlation. 
- -```{code-cell} ipython3 -ax = az.plot_posterior( - idata, - var_names="noise_chol_corr", - hdi_prob="hide", - group="prior", - point_estimate="mean", - grid=(2, 2), - kind="hist", - ec="black", - figsize=(10, 4), -) -``` - -Now we will fit the VAR with 2 lags and 2 equations - -```{code-cell} ipython3 -n_lags = 2 -n_eqs = 2 -model, idata_fake_data = make_model(n_lags, n_eqs, df, priors, prior_checks=False) -``` - -We'll now plot some of the results to see that the parameters are being broadly recovered. The alpha parameters match well, but the individual lag coefficients show differences. - -```{code-cell} ipython3 -az.summary(idata_fake_data, var_names=["alpha", "lag_coefs", "noise_chol_corr"]) -``` - -```{code-cell} ipython3 -az.plot_posterior(idata_fake_data, var_names=["alpha"], ref_val=[18, 8]); -``` - -Next we'll plot the posterior predictive distribution to check that the fitted model can capture the patterns in the observed data. This is the primary test of goodness of fit. - -```{code-cell} ipython3 -def shade_background(ppc, ax, idx, palette="cividis"): - palette = palette - cmap = plt.get_cmap(palette) - percs = np.linspace(51, 99, 100) - colors = (percs - np.min(percs)) / (np.max(percs) - np.min(percs)) - for i, p in enumerate(percs[::-1]): - upper = np.percentile( - ppc[:, idx, :], - p, - axis=1, - ) - lower = np.percentile( - ppc[:, idx, :], - 100 - p, - axis=1, - ) - color_val = colors[i] - ax[idx].fill_between( - x=np.arange(ppc.shape[0]), - y1=upper.flatten(), - y2=lower.flatten(), - color=cmap(color_val), - alpha=0.1, - ) - - -def plot_ppc(idata, df, group="posterior_predictive"): - fig, axs = plt.subplots(2, 1, figsize=(25, 15)) - df = pd.DataFrame(idata_fake_data["observed_data"]["obs"].data, columns=["x", "y"]) - axs = axs.flatten() - ppc = az.extract_dataset(idata, group=group, num_samples=100)["obs"] - # Minus the lagged terms and the constant - shade_background(ppc, axs, 0, "inferno") - axs[0].plot(np.arange(ppc.shape[0]), ppc[:, 0, :].mean(axis=1), color="cyan", label="Mean") - axs[0].plot(df["x"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed") - axs[0].set_title("VAR Series 1") - axs[0].legend() - shade_background(ppc, axs, 1, "inferno") - axs[1].plot(df["y"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed") - axs[1].plot(np.arange(ppc.shape[0]), ppc[:, 1, :].mean(axis=1), color="cyan", label="Mean") - axs[1].set_title("VAR Series 2") - axs[1].legend() - - -plot_ppc(idata_fake_data, df) -``` - -Again we can check the learned posterior distribution for the correlation parameter. - -```{code-cell} ipython3 -ax = az.plot_posterior( - idata_fake_data, - var_names="noise_chol_corr", - hdi_prob="hide", - point_estimate="mean", - grid=(2, 2), - kind="hist", - ec="black", - figsize=(10, 6), -) -``` - -## Applying the Theory: Macro Economic Timeseries - -The data is from the World Bank’s World Development Indicators. In particular, we're pulling annual values of GDP, consumption, and gross fixed capital formation (investment) for all countries from 1970. Timeseries models in general work best when we have a stable mean throughout the series, so for the estimation procedure we have taken the first difference and the natural log of each of these series. - -```{code-cell} ipython3 -try: - gdp_hierarchical = pd.read_csv( - os.path.join("..", "data", "gdp_data_hierarchical_clean.csv"), index_col=0 - ) -except FileNotFoundError: - gdp_hierarchical = pd.read_csv(pm.get_data("gdp_data_hierarchical_clean.csv"), ...) 
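For reference, that transformation is just the first difference of the natural log, which gives an approximate period-on-period growth rate. The sketch below is illustrative only: the cleaned CSV loaded next already contains the transformed columns, and the raw column name `"gdp"` is an assumed stand-in for the underlying World Bank series.

```{code-cell} ipython3
# Illustrative sketch of the transformation applied when preparing the data.
def diff_log(series: pd.Series) -> pd.Series:
    """First difference of the natural log, i.e. approximate growth rates."""
    return np.log(series).diff()


# e.g. raw_df["dl_gdp"] = raw_df.groupby("country")["gdp"].transform(diff_log)
```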
- -gdp_hierarchical -``` - -```{code-cell} ipython3 -fig, axs = plt.subplots(3, 1, figsize=(20, 10)) -for country in gdp_hierarchical["country"].unique(): - temp = gdp_hierarchical[gdp_hierarchical["country"] == country].reset_index() - axs[0].plot(temp["dl_gdp"], label=f"{country}") - axs[1].plot(temp["dl_cons"], label=f"{country}") - axs[2].plot(temp["dl_gfcf"], label=f"{country}") -axs[0].set_title("Differenced and Logged GDP") -axs[1].set_title("Differenced and Logged Consumption") -axs[2].set_title("Differenced and Logged Investment") -axs[0].legend() -axs[1].legend() -axs[2].legend() -plt.suptitle("Macroeconomic Timeseries"); -``` - -## Ireland's Economic Situation - -Ireland is somewhat infamous for its GDP numbers that are largely the product of foreign direct investment and inflated beyond expectation in recent years by the investment and taxation deals offered to large multi-nationals. We'll look here at just the relationship between GDP and consumption. We just want to show the mechanics of the VAR estimation, you shouldn't read too much into the subsequent analysis. - -```{code-cell} ipython3 -ireland_df = gdp_hierarchical[gdp_hierarchical["country"] == "Ireland"] -ireland_df.reset_index(inplace=True, drop=True) -ireland_df.head() -``` - -```{code-cell} ipython3 -n_lags = 2 -n_eqs = 2 -priors = { - ## Set prior for expected positive relationship between the variables. - "lag_coefs": {"mu": 0.3, "sigma": 1}, - "alpha": {"mu": 0, "sigma": 0.1}, - "noise_chol": {"eta": 1, "sigma": 1}, - "noise": {"sigma": 1}, -} -model, idata_ireland = make_model( - n_lags, n_eqs, ireland_df[["dl_gdp", "dl_cons"]], priors, prior_checks=False -) -idata_ireland -``` - -```{code-cell} ipython3 -az.plot_trace(idata_ireland, var_names=["lag_coefs", "alpha", "betaX"], kind="rank_vlines"); -``` - -```{code-cell} ipython3 -def plot_ppc_macro(idata, df, group="posterior_predictive"): - df = pd.DataFrame(idata["observed_data"]["obs"].data, columns=["dl_gdp", "dl_cons"]) - fig, axs = plt.subplots(2, 1, figsize=(20, 10)) - axs = axs.flatten() - ppc = az.extract_dataset(idata, group=group, num_samples=100)["obs"] - - shade_background(ppc, axs, 0, "inferno") - axs[0].plot(np.arange(ppc.shape[0]), ppc[:, 0, :].mean(axis=1), color="cyan", label="Mean") - axs[0].plot(df["dl_gdp"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed") - axs[0].set_title("Differenced and Logged GDP") - axs[0].legend() - shade_background(ppc, axs, 1, "inferno") - axs[1].plot(df["dl_cons"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed") - axs[1].plot(np.arange(ppc.shape[0]), ppc[:, 1, :].mean(axis=1), color="cyan", label="Mean") - axs[1].set_title("Differenced and Logged Consumption") - axs[1].legend() - - -plot_ppc_macro(idata_ireland, ireland_df) -``` - -```{code-cell} ipython3 -ax = az.plot_posterior( - idata_ireland, - var_names="noise_chol_corr", - hdi_prob="hide", - point_estimate="mean", - grid=(2, 2), - kind="hist", - ec="black", - figsize=(10, 6), -) -``` - -### Comparison with Statsmodels - -It's worthwhile comparing these model fits to the one achieved by Statsmodels just to see if we can recover a similar story. - -```{code-cell} ipython3 -VAR_model = sm.tsa.VAR(ireland_df[["dl_gdp", "dl_cons"]]) -results = VAR_model.fit(2, trend="c") -``` - -```{code-cell} ipython3 -results.params -``` - -The intercept parameters broadly agree with our Bayesian model with some differences in the implied relationships defined by the estimates for the lagged terms. 
- -```{code-cell} ipython3 -corr = pd.DataFrame(results.resid_corr, columns=["dl_gdp", "dl_cons"]) -corr.index = ["dl_gdp", "dl_cons"] -corr -``` - -The residual correlation estimates reported by statsmodels agree quite closely with the multivariate gaussian correlation between the variables in our Bayesian model. - -```{code-cell} ipython3 -az.summary(idata_ireland, var_names=["alpha", "lag_coefs", "noise_chol_corr"]) -``` - -We plot the alpha parameter estimates against the Statsmodels estimates - -```{code-cell} ipython3 -az.plot_posterior(idata_ireland, var_names=["alpha"], ref_val=[0.034145, 0.006996]); -``` - -```{code-cell} ipython3 -az.plot_posterior( - idata_ireland, - var_names=["lag_coefs"], - ref_val=[0.330003, -0.053677], - coords={"equations": "dl_cons", "lags": [1, 2], "cross_vars": "dl_gdp"}, -); -``` - -We can see here again how the Bayesian VAR model recovers much of the same story. Similar magnitudes in the estimates for the alpha terms for both equations and a clear relationship between the first lagged GDP numbers and consumption along with a very similar covariance structure. - -+++ - -## Adding a Bayesian Twist: Hierarchical VARs - -In addition we can add some hierarchical parameters if we want to model multiple countries and the relationship between these economic metrics at the national level. This is a useful technique in the cases where we have reasonably short timeseries data because it allows us to "borrow" information across the countries to inform the estimates of the key parameters. - -```{code-cell} ipython3 -def make_hierarchical_model(n_lags, n_eqs, df, group_field, prior_checks=True): - cols = [col for col in df.columns if col != group_field] - coords = {"lags": np.arange(n_lags) + 1, "equations": cols, "cross_vars": cols} - - groups = df[group_field].unique() - - with pm.Model(coords=coords) as model: - ## Hierarchical Priors - rho = pm.Beta("rho", alpha=2, beta=2) - alpha_hat_location = pm.Normal("alpha_hat_location", 0, 0.1) - alpha_hat_scale = pm.InverseGamma("alpha_hat_scale", 3, 0.5) - beta_hat_location = pm.Normal("beta_hat_location", 0, 0.1) - beta_hat_scale = pm.InverseGamma("beta_hat_scale", 3, 0.5) - omega_global, _, _ = pm.LKJCholeskyCov( - "omega_global", n=n_eqs, eta=1.0, sd_dist=pm.Exponential.dist(1) - ) - - for grp in groups: - df_grp = df[df[group_field] == grp][cols] - z_scale_beta = pm.InverseGamma(f"z_scale_beta_{grp}", 3, 0.5) - z_scale_alpha = pm.InverseGamma(f"z_scale_alpha_{grp}", 3, 0.5) - lag_coefs = pm.Normal( - f"lag_coefs_{grp}", - mu=beta_hat_location, - sigma=beta_hat_scale * z_scale_beta, - dims=["equations", "lags", "cross_vars"], - ) - alpha = pm.Normal( - f"alpha_{grp}", - mu=alpha_hat_location, - sigma=alpha_hat_scale * z_scale_alpha, - dims=("equations",), - ) - - betaX = calc_ar_step(lag_coefs, n_eqs, n_lags, df_grp) - betaX = pm.Deterministic(f"betaX_{grp}", betaX) - mean = alpha + betaX - - n = df_grp.shape[1] - noise_chol, _, _ = pm.LKJCholeskyCov( - f"noise_chol_{grp}", eta=10, n=n, sd_dist=pm.Exponential.dist(1) - ) - omega = pm.Deterministic(f"omega_{grp}", rho * omega_global + (1 - rho) * noise_chol) - obs = pm.MvNormal(f"obs_{grp}", mu=mean, chol=omega, observed=df_grp.values[n_lags:]) - - if prior_checks: - idata = pm.sample_prior_predictive() - return model, idata - else: - idata = pm.sample_prior_predictive() - idata.extend(sample_blackjax_nuts(2000, random_seed=120)) - pm.sample_posterior_predictive(idata, extend_inferencedata=True) - return model, idata -``` - -The model design allows for a non-centred 
parameterisation of the key likeihood for each of the individual country components by allowing the us to shift the country specific estimates away from the hierarchical mean. This is done by `rho * omega_global + (1 - rho) * noise_chol` line. The parameter `rho` determines the share of impact each country's data contributes to the estimation of the covariance relationship among the economic variables. Similar country specific adjustments are made with the `z_alpha_scale` and `z_beta_scale` parameters. - -```{code-cell} ipython3 -df_final = gdp_hierarchical[["country", "dl_gdp", "dl_cons", "dl_gfcf"]] -model_full_test, idata_full_test = make_hierarchical_model( - 2, - 3, - df_final, - "country", - prior_checks=False, -) -``` - -```{code-cell} ipython3 -idata_full_test -``` - -```{code-cell} ipython3 -az.plot_trace( - idata_full_test, - var_names=["rho", "alpha_hat_location", "beta_hat_location", "omega_global"], - kind="rank_vlines", -); -``` - -Next we'll look at some of the summary statistics and how they vary across the countries. - -```{code-cell} ipython3 - -``` - -```{code-cell} ipython3 -az.summary( - idata_full_test, - var_names=[ - "rho", - "alpha_hat_location", - "alpha_hat_scale", - "beta_hat_location", - "beta_hat_scale", - "z_scale_alpha_Ireland", - "z_scale_alpha_United States", - "z_scale_beta_Ireland", - "z_scale_beta_United States", - "alpha_Ireland", - "alpha_United States", - "omega_global_corr", - "lag_coefs_Ireland", - "lag_coefs_United States", - ], -) -``` - -```{code-cell} ipython3 -ax = az.plot_forest( - idata_full_test, - var_names=[ - "alpha_Ireland", - "alpha_United States", - "alpha_Australia", - "alpha_Chile", - "alpha_New Zealand", - "alpha_South Africa", - "alpha_Canada", - "alpha_United Kingdom", - ], - kind="ridgeplot", - combined=True, - ridgeplot_truncate=False, - ridgeplot_quantiles=[0.25, 0.5, 0.75], - ridgeplot_overlap=0.7, - figsize=(10, 10), -) - -ax[0].axvline(0, color="red") -ax[0].set_title("Intercept Parameters for each country \n and Economic Measure"); -``` - -```{code-cell} ipython3 -ax = az.plot_forest( - idata_full_test, - var_names=[ - "lag_coefs_Ireland", - "lag_coefs_United States", - "lag_coefs_Australia", - "lag_coefs_Chile", - "lag_coefs_New Zealand", - "lag_coefs_South Africa", - "lag_coefs_Canada", - "lag_coefs_United Kingdom", - ], - kind="ridgeplot", - ridgeplot_truncate=False, - figsize=(10, 10), - coords={"equations": "dl_cons", "lags": 1, "cross_vars": "dl_gdp"}, -) -ax[0].axvline(0, color="red") -ax[0].set_title("Lag Coefficient for the first lag of GDP on Consumption \n by Country"); -``` - -Next we'll examine the correlation between the three variables and see what we've learned by including the hierarchical structure. 
- -```{code-cell} ipython3 -corr = pd.DataFrame( - az.summary(idata_full_test, var_names=["omega_global_corr"])["mean"].values.reshape(3, 3), - columns=["GDP", "CONS", "GFCF"], -) -corr.index = ["GDP", "CONS", "GFCF"] -corr -``` - -```{code-cell} ipython3 -ax = az.plot_posterior( - idata_full_test, - var_names="omega_global_corr", - hdi_prob="hide", - point_estimate="mean", - grid=(3, 3), - kind="hist", - ec="black", - figsize=(10, 7), -) -titles = [ - "GDP/GDP", - "GDP/CONS", - "GDP/GFCF", - "CONS/GDP", - "CONS/CONS", - "CONS/GFCF", - "GFCF/GDP", - "GFCF/CONS", - "GFCF/GFCF", -] -for ax, t in zip(ax.ravel(), titles): - ax.set_xlim(0.6, 1) - ax.set_title(t, fontsize=10) -plt.suptitle("The Posterior Correlation Estimates", fontsize=20); -``` - -We can see these estimates of the correlations between the 3 economic variables differ markedly from the simple case where we examined Ireland alone. In particular we can see that the correlation between GDF and CONS is now much higher. Which suggests that we have learned something about the relationship between these variables which would not be clear examining the Irish case alone. - -Next we'll plot the model fits for each country to ensure that the predictive distribution can recover the observed data. It is important for the question of model adequacy that we can recover both the outlier case of Ireland and the more regular countries such as Australia and United States. - -```{code-cell} ipython3 -az.plot_ppc(idata_full_test); -``` - -And to see the development of these model fits over time: - -```{code-cell} ipython3 -countries = gdp_hierarchical["country"].unique() - - -fig, axs = plt.subplots(8, 3, figsize=(20, 40)) -for ax, country in zip(axs, countries): - temp = pd.DataFrame( - idata_full_test["observed_data"][f"obs_{country}"].data, - columns=["dl_gdp", "dl_cons", "dl_gfcf"], - ) - ppc = az.extract_dataset(idata_full_test, group="posterior_predictive", num_samples=100)[ - f"obs_{country}" - ] - if country == "Ireland": - color = "viridis" - else: - color = "inferno" - for i in range(3): - shade_background(ppc, ax, i, color) - ax[0].plot(np.arange(ppc.shape[0]), ppc[:, 0, :].mean(axis=1), color="cyan", label="Mean") - ax[0].plot(temp["dl_gdp"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed") - ax[0].set_title(f"Posterior Predictive GDP: {country}") - ax[0].legend(loc="lower left") - ax[1].plot( - temp["dl_cons"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed" - ) - ax[1].plot(np.arange(ppc.shape[0]), ppc[:, 1, :].mean(axis=1), color="cyan", label="Mean") - ax[1].set_title(f"Posterior Predictive Consumption: {country}") - ax[1].legend(loc="lower left") - ax[2].plot( - temp["dl_gfcf"], "o", mfc="black", mec="white", mew=1, markersize=7, label="Observed" - ) - ax[2].plot(np.arange(ppc.shape[0]), ppc[:, 2, :].mean(axis=1), color="cyan", label="Mean") - ax[2].set_title(f"Posterior Predictive Investment: {country}") - ax[2].legend(loc="lower left") -plt.suptitle("Posterior Predictive Checks on Hierarchical VAR", fontsize=20); -``` - -Here we can see that the model appears to have recovered reasonable posterior predictions for the observed data and the volatility of the Irish GDP figures is clear next to the other countries. Whether this is a cautionary tale about data quality or the corruption of metrics we leave to the economists to figure out. 
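Before concluding, here is a tiny, purely numerical illustration of the `rho`-weighted pooling used in the hierarchical model above. The matrices and the value of `rho` below are made up for illustration and are not taken from the posterior.

```{code-cell} ipython3
# Illustrative only: the convex combination rho * omega_global + (1 - rho) * noise_chol
# shrinks each country's Cholesky factor towards the shared global factor.
rho_example = 0.7  # hypothetical value of rho
chol_global = np.array([[1.0, 0.0], [0.5, 1.0]])  # hypothetical global factor
chol_country = np.array([[2.0, 0.0], [-0.3, 1.5]])  # hypothetical country-specific factor
rho_example * chol_global + (1 - rho_example) * chol_country
```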
- -+++ - -## Conclusion - -VAR modelling is a rich and interesting area of research within economics, and there is a range of challenges and pitfalls that come with the interpretation and understanding of these models. We hope this example encourages you to continue exploring the potential of this kind of VAR modelling in the Bayesian framework. Whether you're interested in questions of grand economic theory or in simpler questions about the impact of poor app performance on customer feedback, VAR models give you a powerful tool for interrogating these relationships over time. As we've seen, hierarchical VARs further enable the precise quantification of outliers within a cohort, without throwing away information because of odd accounting practices engendered by international capitalism. - -In the next post in this series we will spend some time digging into the implied relationships between the timeseries which result from fitting our VAR models. - -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors -* Adapted from the PYMC labs [Blog post](https://www.pymc-labs.io/blog-posts/bayesian-vector-autoregression/) and Jim Savage's discussion [here](https://rpubs.com/jimsavage/hierarchical_var) by [Nathaniel Forde](https://nathanielf.github.io/) in November 2022 ([pymc-examples#456](https://github.com/pymc-devs/pymc-examples/pull/456)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,aeppl,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/variational_inference/GLM-hierarchical-advi-minibatch.myst.md b/myst_nbs/variational_inference/GLM-hierarchical-advi-minibatch.myst.md deleted file mode 100644 index 387e7d506..000000000 --- a/myst_nbs/variational_inference/GLM-hierarchical-advi-minibatch.myst.md +++ /dev/null @@ -1,176 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# GLM: Mini-batch ADVI on hierarchical regression model - -:::{post} Sept 23, 2021 -:tags: generalized linear model, hierarchical model, variational inference -:category: intermediate -::: - -+++ - -Unlike Gaussian mixture models, (hierarchical) regression models have independent variables. These variables affect the likelihood function, but are not random variables. When using mini-batches, we need to take care of that.
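Concretely, when only a random mini-batch $\mathcal{B}$ of size $B$ out of $N$ observations enters the likelihood, the log-likelihood term of the ELBO has to be rescaled by $N/B$ so that it remains an (approximately) unbiased estimate of the full-data term; this is what the `total_size` argument used further below takes care of. Schematically (a rough sketch, not the notebook's exact objective):

$$
\mathrm{ELBO} \approx \mathbb{E}_{q(\theta)}\!\left[\log p(\theta) - \log q(\theta) + \frac{N}{B} \sum_{i \in \mathcal{B}} \log p(y_i \mid x_i, \theta)\right].
$$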
- -```{code-cell} ipython3 -%env THEANO_FLAGS=device=cpu, floatX=float32, warn_float64=ignore - -import os - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pandas as pd -import pymc3 as pm -import seaborn as sns -import theano -import theano.tensor as tt - -from scipy import stats - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -RANDOM_SEED = 8927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -try: - data = pd.read_csv(os.path.join("..", "data", "radon.csv")) -except FileNotFoundError: - data = pd.read_csv(pm.get_data("radon.csv")) - -data -``` - -```{code-cell} ipython3 -county_idx = data["county_code"].values -floor_idx = data["floor"].values -log_radon_idx = data["log_radon"].values - -coords = {"counties": data.county.unique()} -``` - -Here, `log_radon_idx_t` is a dependent variable, while `floor_idx_t` and `county_idx_t` determine the independent variable. - -```{code-cell} ipython3 -log_radon_idx_t = pm.Minibatch(log_radon_idx, 100) -floor_idx_t = pm.Minibatch(floor_idx, 100) -county_idx_t = pm.Minibatch(county_idx, 100) -``` - -```{code-cell} ipython3 -with pm.Model(coords=coords) as hierarchical_model: - # Hyperpriors for group nodes - mu_a = pm.Normal("mu_alpha", mu=0.0, sigma=100**2) - sigma_a = pm.Uniform("sigma_alpha", lower=0, upper=100) - mu_b = pm.Normal("mu_beta", mu=0.0, sigma=100**2) - sigma_b = pm.Uniform("sigma_beta", lower=0, upper=100) -``` - -Intercept for each county, distributed around group mean `mu_a`. Above we just set `mu` and `sd` to a fixed value while here we plug in a common group distribution for all `a` and `b` (which are vectors with the same length as the number of unique counties in our example). - -```{code-cell} ipython3 -with hierarchical_model: - - a = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, dims="counties") - # Intercept for each county, distributed around group mean mu_a - b = pm.Normal("beta", mu=mu_b, sigma=sigma_b, dims="counties") -``` - -Model prediction of radon level `a[county_idx]` translates to `a[0, 0, 0, 1, 1, ...]`, we thus link multiple household measures of a county to its coefficients. - -```{code-cell} ipython3 -with hierarchical_model: - - radon_est = a[county_idx_t] + b[county_idx_t] * floor_idx_t -``` - -Finally, we specify the likelihood: - -```{code-cell} ipython3 -with hierarchical_model: - - # Model error - eps = pm.Uniform("eps", lower=0, upper=100) - - # Data likelihood - radon_like = pm.Normal( - "radon_like", mu=radon_est, sigma=eps, observed=log_radon_idx_t, total_size=len(data) - ) -``` - -Random variables `radon_like`, associated with `log_radon_idx_t`, should be given to the function for ADVI to denote that as observations in the likelihood term. - -+++ - -On the other hand, `minibatches` should include the three variables above. - -+++ - -Then, run ADVI with mini-batch. - -```{code-cell} ipython3 -with hierarchical_model: - approx = pm.fit(100000, callbacks=[pm.callbacks.CheckParametersConvergence(tolerance=1e-4)]) - idata_advi = az.from_pymc3(approx.sample(500)) -``` - -Check the trace of ELBO and compare the result with MCMC. - -```{code-cell} ipython3 -plt.plot(approx.hist); -``` - -```{code-cell} ipython3 -# Inference button (TM)! 
-with pm.Model(coords=coords): - - mu_a = pm.Normal("mu_alpha", mu=0.0, sigma=100**2) - sigma_a = pm.Uniform("sigma_alpha", lower=0, upper=100) - mu_b = pm.Normal("mu_beta", mu=0.0, sigma=100**2) - sigma_b = pm.Uniform("sigma_beta", lower=0, upper=100) - - a = pm.Normal("alpha", mu=mu_a, sigma=sigma_a, dims="counties") - b = pm.Normal("beta", mu=mu_b, sigma=sigma_b, dims="counties") - - # Model error - eps = pm.Uniform("eps", lower=0, upper=100) - - radon_est = a[county_idx] + b[county_idx] * floor_idx - - radon_like = pm.Normal("radon_like", mu=radon_est, sigma=eps, observed=log_radon_idx) - - # essentially, this is what init='advi' does - step = pm.NUTS(scaling=approx.cov.eval(), is_cov=True) - hierarchical_trace = pm.sample( - 2000, step, start=approx.sample()[0], progressbar=True, return_inferencedata=True - ) -``` - -```{code-cell} ipython3 -az.plot_density( - [idata_advi, hierarchical_trace], var_names=["~alpha", "~beta"], data_labels=["ADVI", "NUTS"] -); -``` - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` diff --git a/myst_nbs/variational_inference/bayesian_neural_network_advi.myst.md b/myst_nbs/variational_inference/bayesian_neural_network_advi.myst.md deleted file mode 100644 index 17cdb6387..000000000 --- a/myst_nbs/variational_inference/bayesian_neural_network_advi.myst.md +++ /dev/null @@ -1,353 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 (ipykernel) - language: python - name: python3 ---- - -(bayesian_neural_network_advi)= -# Variational Inference: Bayesian Neural Networks - -+++ - -:::{post} May 30, 2022 -:tags: neural networks, perceptron, variational inference, minibatch -:category: intermediate -:author: Thomas Wiecki, updated by Chris Fonnesbeck -::: - -+++ - -## Current trends in Machine Learning - -**Probabilistic Programming**, **Deep Learning** and "**Big Data**" are among the biggest topics in machine learning. Inside of PP, a lot of innovation is focused on making things scale using **Variational Inference**. In this example, I will show how to use **Variational Inference** in PyMC to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research. - -### Probabilistic Programming at scale -**Probabilistic Programming** allows very flexible creation of custom probabilistic models and is mainly concerned with **inference** and learning from your data. The approach is inherently **Bayesian** so we can specify **priors** to inform and constrain our models and get uncertainty estimation in form of a **posterior** distribution. Using {ref}`MCMC sampling algorithms ` we can draw samples from this posterior to very flexibly estimate these models. PyMC, [NumPyro](https://github.com/pyro-ppl/numpyro), and [Stan](http://mc-stan.org/) are the current state-of-the-art tools for consructing and estimating these models. One major drawback of sampling, however, is that it's often slow, especially for high-dimensional models and large datasets. That's why more recently, **variational inference** algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms instead fit a distribution (*e.g.* normal) to the posterior turning a sampling problem into and optimization problem. 
Automatic Differentiation Variational Inference {cite:p}`kucukelbir2015automatic` is implemented in several probabilistic programming packages including PyMC, NumPyro and Stan. - -Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) (e.g. [random forests](https://en.wikipedia.org/wiki/Random_forest) or [gradient boosted regression trees](https://en.wikipedia.org/wiki/Boosting_(machine_learning))). - -### Deep Learning - -Now in its third renaissance, neural networks have been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games {cite:p}`mnih2013playing`, and beating the world-champion Lee Sedol at Go {cite:p}`silver2016masteringgo`. From a statistical point of view, neural networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders {cite:p}`kingma2014autoencoding` and in all sorts of other interesting ways (e.g. [Recurrent Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network), or [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html) to estimate multimodal distributions). Why do they work so well? No one really knows, as their statistical properties are still not fully understood. - -A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars: -* Speed: running computations on the GPU allows for much faster processing. -* Software: frameworks like [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/) allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU. -* Learning algorithms: training on sub-sets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting. -* Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for [MDNs](http://cbonnett.github.io/MDN_EDWARD_KERAS_TF.html). - -### Bridging Deep Learning and Probabilistic Programming -On one hand we have Probabilistic Programming which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also [Dustin Tran's](https://twitter.com/dustinvtran) [blog post](http://dustintran.com/blog/a-quick-update-edward-and-some-motivations/). - -While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are: -* **Uncertainty in predictions**: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions.
I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about. -* **Uncertainty in representations**: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network. -* **Regularization with priors**: Weights are often L2-regularized to avoid overfitting; this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm). -* **Transfer learning with informed priors**: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet {cite:p}`szegedy2014going`. -* **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of what was learned on sub-groups towards the overall population (see [Hierarchical Linear Regression in PyMC3](https://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/)). Applied to Neural Networks, in hierarchical data sets we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, because the individual networks are connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- *e.g.* early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data. -* **Other hybrid architectures**: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge. - -+++ - -## Bayesian Neural Networks in PyMC - -+++ - -### Generating data - -First, let's generate some toy data -- a simple binary classification problem that's not linearly separable.
- -```{code-cell} ipython3 -import aesara -import aesara.tensor as at -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc as pm -import seaborn as sns - -from sklearn.datasets import make_moons -from sklearn.model_selection import train_test_split -from sklearn.preprocessing import scale -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -floatX = aesara.config.floatX -RANDOM_SEED = 9927 -rng = np.random.default_rng(RANDOM_SEED) -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000) -X = scale(X) -X = X.astype(floatX) -Y = Y.astype(floatX) -X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -ax.scatter(X[Y == 0, 0], X[Y == 0, 1], color="C0", label="Class 0") -ax.scatter(X[Y == 1, 0], X[Y == 1, 1], color="C1", label="Class 1") -sns.despine() -ax.legend() -ax.set(xlabel="X", ylabel="Y", title="Toy binary classification data set"); -``` - -### Model specification - -A neural network is quite simple. The basic unit is a [perceptron](https://en.wikipedia.org/wiki/Perceptron) which is nothing more than [logistic regression](http://pymc-devs.github.io/pymc3/notebooks/posterior_predictive.html#Prediction). We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem. - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -def construct_nn(ann_input, ann_output): - n_hidden = 5 - - # Initialize random weights between each layer - init_1 = rng.standard_normal(size=(X_train.shape[1], n_hidden)).astype(floatX) - init_2 = rng.standard_normal(size=(n_hidden, n_hidden)).astype(floatX) - init_out = rng.standard_normal(size=n_hidden).astype(floatX) - - coords = { - "hidden_layer_1": np.arange(n_hidden), - "hidden_layer_2": np.arange(n_hidden), - "train_cols": np.arange(X_train.shape[1]), - # "obs_id": np.arange(X_train.shape[0]), - } - with pm.Model(coords=coords) as neural_network: - ann_input = pm.Data("ann_input", X_train, mutable=True) - ann_output = pm.Data("ann_output", Y_train, mutable=True) - - # Weights from input to hidden layer - weights_in_1 = pm.Normal( - "w_in_1", 0, sigma=1, initval=init_1, dims=("train_cols", "hidden_layer_1") - ) - - # Weights from 1st to 2nd layer - weights_1_2 = pm.Normal( - "w_1_2", 0, sigma=1, initval=init_2, dims=("hidden_layer_1", "hidden_layer_2") - ) - - # Weights from hidden layer to output - weights_2_out = pm.Normal("w_2_out", 0, sigma=1, initval=init_out, dims="hidden_layer_2") - - # Build neural-network using tanh activation function - act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1)) - act_2 = pm.math.tanh(pm.math.dot(act_1, weights_1_2)) - act_out = pm.math.sigmoid(pm.math.dot(act_2, weights_2_out)) - - # Binary classification -> Bernoulli likelihood - out = pm.Bernoulli( - "out", - act_out, - observed=ann_output, - total_size=Y_train.shape[0], # IMPORTANT for minibatches - ) - return neural_network - - -neural_network = construct_nn(X_train, Y_train) -``` - -That's not so bad. The `Normal` priors help regularize the weights. Usually we would add a constant `b` to the inputs but I omitted it here to keep the code cleaner. 
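As a side note on the bias remark above, here is a minimal sketch (not used in the rest of this notebook) of how explicit bias terms could be added. The single-hidden-layer architecture and all names (`construct_nn_with_bias`, `b_in_1`, `b_out`) are purely illustrative, and it assumes the `X_train`/`Y_train` arrays defined earlier.

```{code-cell} ipython3
# Illustrative sketch only -- a single hidden layer with explicit bias terms.
def construct_nn_with_bias(ann_input, ann_output, n_hidden=5):
    coords = {
        "hidden_layer_1": np.arange(n_hidden),
        "train_cols": np.arange(ann_input.shape[1]),
    }
    with pm.Model(coords=coords) as nn:
        x = pm.Data("ann_input", ann_input, mutable=True)
        y = pm.Data("ann_output", ann_output, mutable=True)

        # Weights and bias of the hidden layer
        w_in_1 = pm.Normal("w_in_1", 0, sigma=1, dims=("train_cols", "hidden_layer_1"))
        b_in_1 = pm.Normal("b_in_1", 0, sigma=1, dims="hidden_layer_1")

        # Weights and bias of the output unit
        w_out = pm.Normal("w_out", 0, sigma=1, dims="hidden_layer_1")
        b_out = pm.Normal("b_out", 0, sigma=1)

        act_1 = pm.math.tanh(pm.math.dot(x, w_in_1) + b_in_1)
        p = pm.math.sigmoid(pm.math.dot(act_1, w_out) + b_out)

        pm.Bernoulli("out", p, observed=y, total_size=ann_output.shape[0])
    return nn
```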
- -+++ - -### Variational Inference: Scaling model complexity - -We could now just run a MCMC sampler like {class}`pymc.NUTS` which works pretty well in this case, but was already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers. - -Instead, we will use the {class}`pymc.ADVI` variational inference algorithm. This is much faster and will scale better. Note, that this is a mean-field approximation so we ignore correlations in the posterior. - -```{code-cell} ipython3 -%%time - -with neural_network: - approx = pm.fit(n=30_000) -``` - -Plotting the objective function (ELBO) we can see that the optimization iteratively improves the fit. - -```{code-cell} ipython3 -plt.plot(approx.hist, alpha=0.3) -plt.ylabel("ELBO") -plt.xlabel("iteration"); -``` - -```{code-cell} ipython3 -trace = approx.sample(draws=5000) -``` - -Now that we trained our model, lets predict on the hold-out set using a posterior predictive check (PPC). We can use {func}`~pymc.sample_posterior_predictive` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation). - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -with neural_network: - pm.set_data(new_data={"ann_input": X_test}) - ppc = pm.sample_posterior_predictive(trace) - trace.extend(ppc) -``` - -We can average the predictions for each observation to estimate the underlying probability of class 1. - -```{code-cell} ipython3 -pred = ppc.posterior_predictive["out"].mean(("chain", "draw")) > 0.5 -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots() -ax.scatter(X_test[pred == 0, 0], X_test[pred == 0, 1], color="C0") -ax.scatter(X_test[pred == 1, 0], X_test[pred == 1, 1], color="C1") -sns.despine() -ax.set(title="Predicted labels in testing set", xlabel="X", ylabel="Y"); -``` - -```{code-cell} ipython3 -print(f"Accuracy = {(Y_test == pred.values).mean() * 100}%") -``` - -Hey, our neural network did all right! - -+++ - -## Lets look at what the classifier has learned - -For this, we evaluate the class probability predictions on a grid over the whole input space. - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -grid = pm.floatX(np.mgrid[-3:3:100j, -3:3:100j]) -grid_2d = grid.reshape(2, -1).T -dummy_out = np.ones(grid.shape[1], dtype=np.int8) -``` - -```{code-cell} ipython3 ---- -jupyter: - outputs_hidden: true ---- -with neural_network: - pm.set_data(new_data={"ann_input": grid_2d, "ann_output": dummy_out}) - ppc = pm.sample_posterior_predictive(trace) -``` - -```{code-cell} ipython3 -y_pred = ppc.posterior_predictive["out"] -``` - -### Probability surface - -```{code-cell} ipython3 -cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True) -fig, ax = plt.subplots(figsize=(16, 9)) -contour = ax.contourf( - grid[0], grid[1], y_pred.mean(("chain", "draw")).values.reshape(100, 100), cmap=cmap -) -ax.scatter(X_test[pred == 0, 0], X_test[pred == 0, 1], color="C0") -ax.scatter(X_test[pred == 1, 0], X_test[pred == 1, 1], color="C1") -cbar = plt.colorbar(contour, ax=ax) -_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel="X", ylabel="Y") -cbar.ax.set_ylabel("Posterior predictive mean probability of class label = 0"); -``` - -### Uncertainty in predicted value - -Note that we could have done everything above with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. 
However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like: - -```{code-cell} ipython3 -cmap = sns.cubehelix_palette(light=1, as_cmap=True) -fig, ax = plt.subplots(figsize=(16, 9)) -contour = ax.contourf( - grid[0], grid[1], y_pred.squeeze().values.std(axis=0).reshape(100, 100), cmap=cmap -) -ax.scatter(X_test[pred == 0, 0], X_test[pred == 0, 1], color="C0") -ax.scatter(X_test[pred == 1, 0], X_test[pred == 1, 1], color="C1") -cbar = plt.colorbar(contour, ax=ax) -_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel="X", ylabel="Y") -cbar.ax.set_ylabel("Uncertainty (posterior predictive standard deviation)"); -``` - -We can see that very close to the decision boundary, our uncertainty as to which label to predict is highest. You can imagine that associating predictions with uncertainty is a critical property for many applications like health care. To further maximize accuracy, we might want to train the model primarily on samples from that high-uncertainty region. - -+++ - -## Mini-batch ADVI - -So far, we have trained our model on all data at once. Obviously this won't scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence. - -Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up: - -```{code-cell} ipython3 -minibatch_x = pm.Minibatch(X_train, batch_size=50) -minibatch_y = pm.Minibatch(Y_train, batch_size=50) -neural_network_minibatch = construct_nn(minibatch_x, minibatch_y) -with neural_network_minibatch: - approx = pm.fit(40000, method=pm.ADVI()) -``` - -```{code-cell} ipython3 -plt.plot(approx.hist) -plt.ylabel("ELBO") -plt.xlabel("iteration"); -``` - -As you can see, mini-batch ADVI's running time is much lower. It also seems to converge faster. - -For fun, we can also look at the trace. The point is that we also get uncertainty of our Neural Network weights. - -```{code-cell} ipython3 -az.plot_trace(trace); -``` - -You might argue that the above network isn't really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets. - -## Acknowledgements - -[Taku Yoshioka](https://github.com/taku-y) did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to the thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft. 
- -+++ - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors - -- This notebook was originally authored as a [blog post](https://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/) by Thomas Wiecki in 2016 -- Updated by Chris Fonnesbeck for PyMC v4 in 2022 - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/variational_inference/convolutional_vae_keras_advi.myst.md b/myst_nbs/variational_inference/convolutional_vae_keras_advi.myst.md deleted file mode 100644 index 54d7a4b2e..000000000 --- a/myst_nbs/variational_inference/convolutional_vae_keras_advi.myst.md +++ /dev/null @@ -1,409 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Convolutional variational autoencoder with PyMC3 and Keras - -+++ - -In this document, I will show how autoencoding variational Bayes (AEVB) works in PyMC3's automatic differentiation variational inference (ADVI). The example here is borrowed from [Keras example](https://github.com/fchollet/keras/blob/master/examples/variational_autoencoder_deconv.py), where convolutional variational autoencoder is applied to the MNIST dataset. The network architecture of the encoder and decoder are the same. However, PyMC3 allows us to define a probabilistic model, which combines the encoder and decoder, in the same way as other probabilistic models (e.g., generalized linear models), rather than directly implementing of Monte Carlo sampling and the loss function, as is done in the Keras example. Thus the framework of AEVB in PyMC3 can be extended to more complex models such as [latent dirichlet allocation](https://taku-y.github.io/notebook/20160928/lda-advi-ae.html). - -+++ - -- Notebook Written by Taku Yoshioka (c) 2016 - -+++ - -To use Keras with PyMC3, we need to choose [Theano](http://deeplearning.net/software/theano/) as the backend for Keras. - -```{code-cell} ipython3 -%autosave 0 -%env KERAS_BACKEND=theano -%env THEANO_FLAGS=device=cuda3,floatX=float32,optimizer=fast_run - -import os -import sys - -from collections import OrderedDict - -import arviz as az -import keras -import matplotlib -import matplotlib.gridspec as gridspec -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano.tensor as tt - -from keras import backend as K -from keras.layers import ( - Activation, - BatchNormalization, - Conv2D, - Deconv2D, - Dense, - Flatten, - InputLayer, - Reshape, -) -from theano import clone, config, function, pp, shared - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -K.set_image_dim_ordering("th") -``` - -## Load images -MNIST dataset can be obtained by [scikit-learn API](http://scikit-learn.org/stable/datasets/) or from [Keras datasets](https://keras.io/datasets/). The dataset contains images of digits. - -```{code-cell} ipython3 -from keras.datasets import mnist - -(x_train, y_train), (x_test, y_test) = mnist.load_data() -data = pm.floatX(x_train.reshape(-1, 1, 28, 28)) -data /= np.max(data) -``` - -## Use Keras -We define a utility function to get parameters from Keras models. Since we have set the backend to Theano, parameter objects are obtained as shared variables of Theano. 
- -In the code, 'updates' are expected to include update objects (dictionary of pairs of shared variables and update equation) of scaling parameters of batch normalization. While not using batch normalization in this example, if we want to use it, we need to pass these update objects as an argument of `theano.function()` inside the PyMC3 ADVI function. The current version of PyMC3 does not support it, it is easy to modify (I want to send PR in future). - -The learning phase below is used for Keras to known the learning phase, training or test. This information is important also for batch normalization. - -```{code-cell} ipython3 -from keras.layers import BatchNormalization, Dense -from keras.models import Sequential - - -def get_params(model): - """Get parameters and updates from Keras model""" - shared_in_updates = list() - params = list() - updates = dict() - - for l in model.layers: - attrs = dir(l) - # Updates - if "updates" in attrs: - updates.update(l.updates) - shared_in_updates += [e[0] for e in l.updates] - - # Shared variables - for attr_str in attrs: - attr = getattr(l, attr_str) - if isinstance(attr, tt.compile.SharedVariable): - if attr is not model.get_input_at(0): - params.append(attr) - - return list(set(params) - set(shared_in_updates)), updates - - -# This code is required when using BatchNormalization layer -keras.backend.theano_backend._LEARNING_PHASE = shared(np.uint8(1), name="keras_learning_phase") -``` - -## Encoder and decoder - -+++ - -First, we define the convolutional neural network for encoder using the Keras API. This function returns a CNN model given the shared variable representing observations (images of digits), the dimension of latent space, and the parameters of the model architecture. - -```{code-cell} ipython3 -def cnn_enc(xs, latent_dim, nb_filters=64, nb_conv=3, intermediate_dim=128): - """Returns a CNN model of Keras. - - Parameters - ---------- - xs : theano.TensorVariable - Input tensor. - latent_dim : int - Dimension of latent vector. - """ - input_layer = InputLayer(input_tensor=xs, batch_input_shape=xs.tag.test_value.shape) - model = Sequential() - model.add(input_layer) - - cp1 = {"padding": "same", "activation": "relu"} - cp2 = {"padding": "same", "activation": "relu", "strides": (2, 2)} - cp3 = {"padding": "same", "activation": "relu", "strides": (1, 1)} - cp4 = cp3 - - model.add(Conv2D(1, (2, 2), **cp1)) - model.add(Conv2D(nb_filters, (2, 2), **cp2)) - model.add(Conv2D(nb_filters, (nb_conv, nb_conv), **cp3)) - model.add(Conv2D(nb_filters, (nb_conv, nb_conv), **cp4)) - model.add(Flatten()) - model.add(Dense(intermediate_dim, activation="relu")) - model.add(Dense(2 * latent_dim)) - - return model -``` - -Then we define a utility class for encoders. This class does not depend on the architecture of the encoder except for input shape (`tensor4` for images), so we can use this class for various encoding networks. - -```{code-cell} ipython3 -class Encoder: - """Encode observed images to variational parameters (mean/std of Gaussian). - - Parameters - ---------- - xs : theano.tensor.sharedvar.TensorSharedVariable - Placeholder of input images. - dim_hidden : int - The number of hidden variables. 
- net : Function - Returns - """ - - def __init__(self, xs, dim_hidden, net): - model = net(xs, dim_hidden) - - self.model = model - self.xs = xs - self.out = model.get_output_at(-1) - self.means = self.out[:, :dim_hidden] - self.rhos = self.out[:, dim_hidden:] - self.params, self.updates = get_params(model) - self.enc_func = None - self.dim_hidden = dim_hidden - - def _get_enc_func(self): - if self.enc_func is None: - xs = tt.tensor4() - means = clone(self.means, {self.xs: xs}) - rhos = clone(self.rhos, {self.xs: xs}) - self.enc_func = function([xs], [means, rhos]) - - return self.enc_func - - def encode(self, xs): - # Used in test phase - keras.backend.theano_backend._LEARNING_PHASE.set_value(np.uint8(0)) - - enc_func = self._get_enc_func() - means, _ = enc_func(xs) - - return means - - def draw_samples(self, xs, n_samples=1): - """Draw samples of hidden variables based on variational parameters encoded. - - Parameters - ---------- - xs : numpy.ndarray, shape=(n_images, 1, height, width) - Images. - """ - # Used in test phase - keras.backend.theano_backend._LEARNING_PHASE.set_value(np.uint8(0)) - - enc_func = self._get_enc_func() - means, rhos = enc_func(xs) - means = np.repeat(means, n_samples, axis=0) - rhos = np.repeat(rhos, n_samples, axis=0) - ns = np.random.randn(len(xs) * n_samples, self.dim_hidden) - zs = means + pm.distributions.dist_math.rho2sd(rhos) * ns - - return zs -``` - -In a similar way, we define the decoding network and a utility class for decoders. - -```{code-cell} ipython3 -def cnn_dec(zs, nb_filters=64, nb_conv=3, output_shape=(1, 28, 28)): - """Returns a CNN model of Keras. - - Parameters - ---------- - zs : theano.tensor.var.TensorVariable - Input tensor. - """ - minibatch_size, dim_hidden = zs.tag.test_value.shape - input_layer = InputLayer(input_tensor=zs, batch_input_shape=zs.tag.test_value.shape) - model = Sequential() - model.add(input_layer) - - model.add(Dense(dim_hidden, activation="relu")) - model.add(Dense(nb_filters * 14 * 14, activation="relu")) - - cp1 = {"padding": "same", "activation": "relu", "strides": (1, 1)} - cp2 = cp1 - cp3 = {"padding": "valid", "activation": "relu", "strides": (2, 2)} - cp4 = {"padding": "same", "activation": "sigmoid"} - - output_shape_ = (minibatch_size, nb_filters, 14, 14) - model.add(Reshape(output_shape_[1:])) - model.add(Deconv2D(nb_filters, (nb_conv, nb_conv), data_format="channels_first", **cp1)) - model.add(Deconv2D(nb_filters, (nb_conv, nb_conv), data_format="channels_first", **cp2)) - output_shape_ = (minibatch_size, nb_filters, 29, 29) - model.add(Deconv2D(nb_filters, (2, 2), data_format="channels_first", **cp3)) - model.add(Conv2D(1, (2, 2), **cp4)) - - return model -``` - -```{code-cell} ipython3 -class Decoder: - """Decode hidden variables to images. - - Parameters - ---------- - zs : Theano tensor - Hidden variables. - """ - - def __init__(self, zs, net): - model = net(zs) - self.model = model - self.zs = zs - self.out = model.get_output_at(-1) - self.params, self.updates = get_params(model) - self.dec_func = None - - def _get_dec_func(self): - if self.dec_func is None: - zs = tt.matrix() - xs = clone(self.out, {self.zs: zs}) - self.dec_func = function([zs], xs) - - return self.dec_func - - def decode(self, zs): - """Decode hidden variables to images. - - An image consists of the mean parameters of the observation noise. - - Parameters - ---------- - zs : numpy.ndarray, shape=(n_samples, dim_hidden) - Hidden variables. 
- """ - # Used in test phase - keras.backend.theano_backend._LEARNING_PHASE.set_value(np.uint8(0)) - - return self._get_dec_func()(zs) -``` - -## Generative model -We can construct the generative model with the PyMC3 API and the functions and classes defined above. We set the size of mini-batches to 100 and the dimension of the latent space to 2 for visualization. - -```{code-cell} ipython3 -# Constants -minibatch_size = 100 -dim_hidden = 2 -``` - -We require a placeholder for images, into which mini-batches of images will be placed during ADVI inference. It is also the input for the encoder. Below, `enc.model` is a Keras model of the encoder network and we can check the model architecture using the method `summary()`. - -```{code-cell} ipython3 -# Placeholder of images -xs_t = tt.tensor4(name="xs_t") -xs_t.tag.test_value = np.zeros((minibatch_size, 1, 28, 28)).astype("float32") -# Encoder -enc = Encoder(xs_t, dim_hidden, net=cnn_enc) -enc.model.summary() -``` - -The probabilistic model involves only two random variables; latent variable $\mathbf{z}$ and observation $\mathbf{x}$. We put a Normal prior on $\mathbf{z}$, decode the variational parameters of $q(\mathbf{z}|\mathbf{x})$ and define the likelihood of the observation $\mathbf{x}$. - -```{code-cell} ipython3 -with pm.Model() as model: - # Hidden variables - zs = pm.Normal( - "zs", - mu=0, - sigma=1, - shape=(minibatch_size, dim_hidden), - dtype="float32", - total_size=len(data), - ) - - # Decoder and its parameters - dec = Decoder(zs, net=cnn_dec) - - # Observation model - xs_ = pm.Normal( - "xs_", mu=dec.out, sigma=0.1, observed=xs_t, dtype="float32", total_size=len(data) - ) -``` - -In the generative model above, we do not know how the decoded variational parameters are passed to $q(\mathbf{z}|\mathbf{x})$. To do this, we will set the argument `local_RVs` in the ADVI function of PyMC3. - -```{code-cell} ipython3 -local_RVs = OrderedDict({zs: dict(mu=enc.means, rho=enc.rhos)}) -``` - -This argument is an `OrderedDict` whose keys are random variables to which the decoded variational parameters are set (`zs` in this model). Each value of the dictionary contains two Theano expressions representing variational mean (`enc.means`) and rhos (`enc.rhos`). A scaling constant (`len(data) / float(minibatch_size)`) is set automatically (as we specified it in the model saying what's the `total_size`) to compensate for the size of mini-batches of the corresponding log probability terms in the evidence lower bound (ELBO), the objective of the variational inference. - -The scaling constant for the observed random variables is set in the same way. - -+++ - -We can also check the architecture of the decoding network, as we did for the encoding network. - -```{code-cell} ipython3 -dec.model.summary() -``` - -## Inference - -+++ - -Let's use ADVI to fit the model. - -```{code-cell} ipython3 -# In memory Minibatches for better speed -xs_t_minibatch = pm.Minibatch(data, minibatch_size) - -with model: - approx = pm.fit( - 15000, - local_rv=local_RVs, - more_obj_params=enc.params + dec.params, - obj_optimizer=pm.rmsprop(learning_rate=0.001), - more_replacements={xs_t: xs_t_minibatch}, - ) -``` - -## Results - -+++ - -We can plot the trace of the negative ELBO obtained during optimization, to verify convergence. - -```{code-cell} ipython3 -plt.plot(approx.hist); -``` - -Finally, we can plot the distribution of the images in the latent space. To do this, we make a 2-dimensional grid of points and feed them into the decoding network. 
The mean of $p(\mathbf{x}|\mathbf{z})$ is the image corresponding to the samples on the grid. - -```{code-cell} ipython3 -nn = 10 -zs = np.array([(z1, z2) for z1 in np.linspace(-2, 2, nn) for z2 in np.linspace(-2, 2, nn)]).astype( - "float32" -) -xs = dec.decode(zs)[:, 0, :, :] -xs = np.bmat([[xs[i + j * nn] for i in range(nn)] for j in range(nn)]) -matplotlib.rc("axes", **{"grid": False}) -plt.figure(figsize=(10, 10)) -plt.imshow(xs, interpolation="none", cmap="gray") -plt.show() -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/variational_inference/empirical-approx-overview.myst.md b/myst_nbs/variational_inference/empirical-approx-overview.myst.md deleted file mode 100644 index 999b5495c..000000000 --- a/myst_nbs/variational_inference/empirical-approx-overview.myst.md +++ /dev/null @@ -1,171 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python PyMC3 (Dev) - language: python - name: pymc3-dev-py38 ---- - -# Empirical Approximation overview - -For most models we use sampling MCMC algorithms like Metropolis or NUTS. In PyMC3 we got used to store traces of MCMC samples and then do analysis using them. There is a similar concept for the variational inference submodule in PyMC3: *Empirical*. This type of approximation stores particles for the SVGD sampler. There is no difference between independent SVGD particles and MCMC samples. *Empirical* acts as a bridge between MCMC sampling output and full-fledged VI utils like `apply_replacements` or `sample_node`. For the interface description, see [variational_api_quickstart](variational_api_quickstart.ipynb). Here we will just focus on `Emprical` and give an overview of specific things for the *Empirical* approximation - -```{code-cell} ipython3 -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano - -from pandas import DataFrame - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -np.random.seed(42) -pm.set_tt_rng(42) -``` - -## Multimodal density -Let's recall the problem from [variational_api_quickstart](variational_api_quickstart.ipynb) where we first got a NUTS trace - -```{code-cell} ipython3 -w = pm.floatX([0.2, 0.8]) -mu = pm.floatX([-0.3, 0.5]) -sd = pm.floatX([0.1, 0.1]) - -with pm.Model() as model: - x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd, dtype=theano.config.floatX) - trace = pm.sample(50000) -``` - -```{code-cell} ipython3 -az.plot_trace(trace); -``` - -Great. First having a trace we can create `Empirical` approx - -```{code-cell} ipython3 -print(pm.Empirical.__doc__) -``` - -```{code-cell} ipython3 -with model: - approx = pm.Empirical(trace) -``` - -```{code-cell} ipython3 -approx -``` - -This type of approximation has it's own underlying storage for samples that is `theano.shared` itself - -```{code-cell} ipython3 -approx.histogram -``` - -```{code-cell} ipython3 -approx.histogram.get_value()[:10] -``` - -```{code-cell} ipython3 -approx.histogram.get_value().shape -``` - -It has exactly the same number of samples that you had in trace before. In our particular case it is 50k. Another thing to notice is that if you have multitrace with **more than one chain** you'll get much **more samples** stored at once. We flatten all the trace for creating `Empirical`. 
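As a rough illustration of that flattening (assuming the `trace` and `approx` objects created above; the exact counts depend on how many chains `pm.sample` ran):

```{code-cell} ipython3
# The flattened storage holds one row per draw per chain.
print("chains:", trace.nchains, "draws per chain:", len(trace))
print("histogram shape:", approx.histogram.get_value().shape)
```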
- -This *histogram* is about *how* we store samples. The structure is pretty simple: `(n_samples, n_dim)` The order of these variables is stored internally in the class and in most cases will not be needed for end user - -```{code-cell} ipython3 -approx.ordering -``` - -Sampling from posterior is done uniformly with replacements. Call `approx.sample(1000)` and you'll get again the trace but the order is not determined. There is no way now to reconstruct the underlying trace again with `approx.sample`. - -```{code-cell} ipython3 -new_trace = approx.sample(50000) -``` - -```{code-cell} ipython3 -%timeit new_trace = approx.sample(50000) -``` - -After sampling function is compiled sampling bacomes really fast - -```{code-cell} ipython3 -az.plot_trace(new_trace); -``` - -You see there is no order any more but reconstructed density is the same. - -## 2d density - -```{code-cell} ipython3 -mu = pm.floatX([0.0, 0.0]) -cov = pm.floatX([[1, 0.5], [0.5, 1.0]]) -with pm.Model() as model: - pm.MvNormal("x", mu=mu, cov=cov, shape=2) - trace = pm.sample(1000) -``` - -```{code-cell} ipython3 -with model: - approx = pm.Empirical(trace) -``` - -```{code-cell} ipython3 -az.plot_trace(approx.sample(10000)); -``` - -```{code-cell} ipython3 -import seaborn as sns -``` - -```{code-cell} ipython3 -kdeViz_df = DataFrame( - data=approx.sample(1000)["x"], columns=["First Dimension", "Second Dimension"] -) -``` - -```{code-cell} ipython3 -sns.kdeplot(data=kdeViz_df, x="First Dimension", y="Second Dimension") -plt.show() -``` - -Previously we had a `trace_cov` function - -```{code-cell} ipython3 -with model: - print(pm.trace_cov(trace)) -``` - -Now we can estimate the same covariance using `Empirical` - -```{code-cell} ipython3 -print(approx.cov) -``` - -That's a tensor itself - -```{code-cell} ipython3 -print(approx.cov.eval()) -``` - -Estimations are very close and differ due to precision error. We can get the mean in the same way - -```{code-cell} ipython3 -print(approx.mean.eval()) -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/variational_inference/gaussian-mixture-model-advi.myst.md b/myst_nbs/variational_inference/gaussian-mixture-model-advi.myst.md deleted file mode 100644 index 5945001bf..000000000 --- a/myst_nbs/variational_inference/gaussian-mixture-model-advi.myst.md +++ /dev/null @@ -1,312 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Gaussian Mixture Model with ADVI - -+++ - -Here, we describe how to use ADVI for inference of Gaussian mixture model. First, we will show that inference with ADVI does not need to modify the stochastic model, just call a function. Then, we will show how to use mini-batch, which is useful for large dataset. In this case, where the model should be slightly changed. - -First, create artificial data from a mixuture of two Gaussian components. 
- -```{code-cell} ipython3 -%env THEANO_FLAGS=device=cpu,floatX=float32 - -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import seaborn as sns -import theano.tensor as tt - -from pymc3 import ( - NUTS, - DensityDist, - Dirichlet, - Metropolis, - MvNormal, - Normal, - Slice, - find_MAP, - sample, -) -from pymc3.math import logsumexp -from theano.tensor.nlinalg import det - -print(f"Running on PyMC3 v{pm.__version__}") -``` - -```{code-cell} ipython3 -%config InlineBackend.figure_format = 'retina' -az.style.use("arviz-darkgrid") -``` - -```{code-cell} ipython3 -n_samples = 100 -rng = np.random.RandomState(123) -ms = np.array([[-1, -1.5], [1, 1]]) -ps = np.array([0.2, 0.8]) - -zs = np.array([rng.multinomial(1, ps) for _ in range(n_samples)]).T -xs = [ - z[:, np.newaxis] * rng.multivariate_normal(m, np.eye(2), size=n_samples) for z, m in zip(zs, ms) -] -data = np.sum(np.dstack(xs), axis=2) - -plt.figure(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], c="g", alpha=0.5) -plt.scatter(ms[0, 0], ms[0, 1], c="r", s=100) -plt.scatter(ms[1, 0], ms[1, 1], c="b", s=100); -``` - -Gaussian mixture models are usually constructed with categorical random variables. However, any discrete rvs does not fit ADVI. Here, class assignment variables are marginalized out, giving weighted sum of the probability for the gaussian components. The log likelihood of the total probability is calculated using logsumexp, which is a standard technique for making this kind of calculation stable. - -In the below code, DensityDist class is used as the likelihood term. The second argument, logp_gmix(mus, pi, np.eye(2)), is a python function which receives observations (denoted by 'value') and returns the tensor representation of the log-likelihood. - -```{code-cell} ipython3 -from pymc3.math import logsumexp - - -# Log likelihood of normal distribution -def logp_normal(mu, tau, value): - # log probability of individual samples - k = tau.shape[0] - delta = lambda mu: value - mu - return (-1 / 2.0) * ( - k * tt.log(2 * np.pi) - + tt.log(1.0 / det(tau)) - + (delta(mu).dot(tau) * delta(mu)).sum(axis=1) - ) - - -# Log likelihood of Gaussian mixture distribution -def logp_gmix(mus, pi, tau): - def logp_(value): - logps = [tt.log(pi[i]) + logp_normal(mu, tau, value) for i, mu in enumerate(mus)] - - return tt.sum(logsumexp(tt.stacklists(logps)[:, :n_samples], axis=0)) - - return logp_ - - -with pm.Model() as model: - mus = [ - MvNormal("mu_%d" % i, mu=pm.floatX(np.zeros(2)), tau=pm.floatX(0.1 * np.eye(2)), shape=(2,)) - for i in range(2) - ] - pi = Dirichlet("pi", a=pm.floatX(0.1 * np.ones(2)), shape=(2,)) - xs = DensityDist("x", logp_gmix(mus, pi, np.eye(2)), observed=data) -``` - -For comparison with ADVI, run MCMC. - -```{code-cell} ipython3 -with model: - start = find_MAP() - step = Metropolis() - trace = sample(1000, step, start=start) -``` - -Check posterior of component means and weights. We can see that the MCMC samples of the component mean for the lower-left component varied more than the upper-right due to the difference of the sample size of these clusters. 
- -```{code-cell} ipython3 -plt.figure(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], alpha=0.5, c="g") -mu_0, mu_1 = trace["mu_0"], trace["mu_1"] -plt.scatter(mu_0[:, 0], mu_0[:, 1], c="r", s=10) -plt.scatter(mu_1[:, 0], mu_1[:, 1], c="b", s=10) -plt.xlim(-6, 6) -plt.ylim(-6, 6) -``` - -```{code-cell} ipython3 -sns.barplot([1, 2], np.mean(trace["pi"][:], axis=0), palette=["red", "blue"]) -``` - -We can use the same model with ADVI as follows. - -```{code-cell} ipython3 -with pm.Model() as model: - mus = [ - MvNormal("mu_%d" % i, mu=pm.floatX(np.zeros(2)), tau=pm.floatX(0.1 * np.eye(2)), shape=(2,)) - for i in range(2) - ] - pi = Dirichlet("pi", a=pm.floatX(0.1 * np.ones(2)), shape=(2,)) - xs = DensityDist("x", logp_gmix(mus, pi, np.eye(2)), observed=data) - -with model: - %time approx = pm.fit(n=4500, obj_optimizer=pm.adagrad(learning_rate=1e-1)) - -means = approx.bij.rmap(approx.mean.eval()) -cov = approx.cov.eval() -sds = approx.bij.rmap(np.diag(cov) ** 0.5) -``` - -The function returns three variables. 'means' and 'sds' are the mean and standard deviations of the variational posterior. Note that these values are in the transformed space, not in the original space. For random variables in the real line, e.g., means of the Gaussian components, no transformation is applied. Then we can see the variational posterior in the original space. - -```{code-cell} ipython3 -from copy import deepcopy - -mu_0, sd_0 = means["mu_0"], sds["mu_0"] -mu_1, sd_1 = means["mu_1"], sds["mu_1"] - - -def logp_normal_np(mu, tau, value): - # log probability of individual samples - k = tau.shape[0] - delta = lambda mu: value - mu - return (-1 / 2.0) * ( - k * np.log(2 * np.pi) - + np.log(1.0 / np.linalg.det(tau)) - + (delta(mu).dot(tau) * delta(mu)).sum(axis=1) - ) - - -def threshold(zz): - zz_ = deepcopy(zz) - zz_[zz < np.max(zz) * 1e-2] = None - return zz_ - - -def plot_logp_normal(ax, mu, sd, cmap): - f = lambda value: np.exp(logp_normal_np(mu, np.diag(1 / sd**2), value)) - g = lambda mu, sd: np.arange(mu - 3, mu + 3, 0.1) - xx, yy = np.meshgrid(g(mu[0], sd[0]), g(mu[1], sd[1])) - zz = f(np.vstack((xx.reshape(-1), yy.reshape(-1))).T).reshape(xx.shape) - ax.contourf(xx, yy, threshold(zz), cmap=cmap, alpha=0.9) - - -fig, ax = plt.subplots(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], alpha=0.5, c="g") -plot_logp_normal(ax, mu_0, sd_0, cmap="Reds") -plot_logp_normal(ax, mu_1, sd_1, cmap="Blues") -plt.xlim(-6, 6) -plt.ylim(-6, 6) -``` - -TODO: We need to backward-transform 'pi', which is transformed by 'stick_breaking'. - -+++ - -'elbos' contains the trace of ELBO, showing stochastic convergence of the algorithm. - -```{code-cell} ipython3 -plt.plot(approx.hist) -``` - -To demonstrate that ADVI works for large dataset with mini-batch, let's create 100,000 samples from the same mixture distribution. - -```{code-cell} ipython3 -n_samples = 100000 - -zs = np.array([rng.multinomial(1, ps) for _ in range(n_samples)]).T -xs = [ - z[:, np.newaxis] * rng.multivariate_normal(m, np.eye(2), size=n_samples) for z, m in zip(zs, ms) -] -data = np.sum(np.dstack(xs), axis=2) - -plt.figure(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], c="g", alpha=0.5) -plt.scatter(ms[0, 0], ms[0, 1], c="r", s=100) -plt.scatter(ms[1, 0], ms[1, 1], c="b", s=100) -plt.xlim(-6, 6) -plt.ylim(-6, 6) -``` - -MCMC took 55 seconds, 20 times longer than the small dataset. 
- -```{code-cell} ipython3 -with pm.Model() as model: -    mus = [ -        MvNormal("mu_%d" % i, mu=pm.floatX(np.zeros(2)), tau=pm.floatX(0.1 * np.eye(2)), shape=(2,)) -        for i in range(2) -    ] -    pi = Dirichlet("pi", a=pm.floatX(0.1 * np.ones(2)), shape=(2,)) -    xs = DensityDist("x", logp_gmix(mus, pi, np.eye(2)), observed=data) - -    start = find_MAP() -    step = Metropolis() -    trace = sample(1000, step, start=start) -``` - -Posterior samples are concentrated on the true means, so they look like a single point for each component. - -```{code-cell} ipython3 -plt.figure(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], alpha=0.5, c="g") -mu_0, mu_1 = trace["mu_0"], trace["mu_1"] -plt.scatter(mu_0[-500:, 0], mu_0[-500:, 1], c="r", s=50) -plt.scatter(mu_1[-500:, 0], mu_1[-500:, 1], c="b", s=50) -plt.xlim(-6, 6) -plt.ylim(-6, 6) -``` - -For ADVI with mini-batches, pass a `pm.Minibatch` tensor as the observed data; at every iteration it is replaced with a fresh mini-batch. Because a mini-batch is smaller than the full dataset, the log-likelihood term must be rescaled appropriately. This is done by passing the size of the full dataset through the `total_size` keyword argument, as in the cell below. - -```{code-cell} ipython3 -minibatch_size = 200 -# In memory Minibatches for better speed -data_t = pm.Minibatch(data, minibatch_size) - -with pm.Model() as model: -    mus = [ -        MvNormal("mu_%d" % i, mu=pm.floatX(np.zeros(2)), tau=pm.floatX(0.1 * np.eye(2)), shape=(2,)) -        for i in range(2) -    ] -    pi = Dirichlet("pi", a=pm.floatX(0.1 * np.ones(2)), shape=(2,)) -    xs = DensityDist("x", logp_gmix(mus, pi, np.eye(2)), observed=data_t, total_size=len(data)) -``` - -Run ADVI. It's much faster than MCMC, though the problem here is simple and it's not a fair comparison. - -```{code-cell} ipython3 -# Used only to write the function call in a single line for %time -# (is there a smarter way?) - - -def f(): -    approx = pm.fit(n=1500, obj_optimizer=pm.adagrad(learning_rate=1e-1), model=model) -    means = approx.bij.rmap(approx.mean.eval()) -    sds = approx.bij.rmap(approx.std.eval()) -    return means, sds, approx.hist - - -%time means, sds, elbos = f() -``` - -The result is almost the same. - -```{code-cell} ipython3 -from copy import deepcopy - -mu_0, sd_0 = means["mu_0"], sds["mu_0"] -mu_1, sd_1 = means["mu_1"], sds["mu_1"] - -fig, ax = plt.subplots(figsize=(5, 5)) -plt.scatter(data[:, 0], data[:, 1], alpha=0.5, c="g") -plt.scatter(mu_0[0], mu_0[1], c="r", s=50) -plt.scatter(mu_1[0], mu_1[1], c="b", s=50) -plt.xlim(-6, 6) -plt.ylim(-6, 6) -``` - -The variance of the ELBO trace is larger than without mini-batches because of the subsampling from the full dataset.
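Besides the raw trace plotted in the next cell, a running average of `elbos` can make the underlying trend easier to see through the extra mini-batch noise. A minimal sketch (the window size is an arbitrary choice):

```{code-cell} ipython3
# Moving average of the noisy mini-batch ELBO trace
window = 100
smoothed = np.convolve(elbos, np.ones(window) / window, mode="valid")
plt.plot(np.arange(window - 1, len(elbos)), smoothed)
plt.xlabel("iteration")
plt.ylabel("smoothed loss (negative ELBO)");
```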
- -```{code-cell} ipython3 -plt.plot(elbos); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/variational_inference/lda-advi-aevb.myst.md b/myst_nbs/variational_inference/lda-advi-aevb.myst.md deleted file mode 100644 index 1a1a40bc4..000000000 --- a/myst_nbs/variational_inference/lda-advi-aevb.myst.md +++ /dev/null @@ -1,423 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Automatic autoencoding variational Bayes for latent dirichlet allocation with PyMC3 - -For probabilistic models with latent variables, autoencoding variational Bayes (AEVB; Kingma and Welling, 2014) is an algorithm which allows us to perform inference efficiently for large datasets with an encoder. In AEVB, the encoder is used to infer variational parameters of approximate posterior on latent variables from given samples. By using tunable and flexible encoders such as multilayer perceptrons (MLPs), AEVB approximates complex variational posterior based on mean-field approximation, which does not utilize analytic representations of the true posterior. Combining AEVB with ADVI (Kucukelbir et al., 2015), we can perform posterior inference on almost arbitrary probabilistic models involving continuous latent variables. - -I have implemented AEVB for ADVI with mini-batch on PyMC3. To demonstrate flexibility of this approach, we will apply this to latent dirichlet allocation (LDA; Blei et al., 2003) for modeling documents. In the LDA model, each document is assumed to be generated from a multinomial distribution, whose parameters are treated as latent variables. By using AEVB with an MLP as an encoder, we will fit the LDA model to the 20-newsgroups dataset. - -In this example, extracted topics by AEVB seem to be qualitatively comparable to those with a standard LDA implementation, i.e., online VB implemented on scikit-learn. Unfortunately, the predictive accuracy of unseen words is less than the standard implementation of LDA, it might be due to the mean-field approximation. However, the combination of AEVB and ADVI allows us to quickly apply more complex probabilistic models than LDA to big data with the help of mini-batches. I hope this notebook will attract readers, especially practitioners working on a variety of machine learning tasks, to probabilistic programming and PyMC3. - -```{code-cell} ipython3 -%matplotlib inline -import os -import sys - -from collections import OrderedDict -from copy import deepcopy -from time import time - -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import seaborn as sns -import theano -import theano.tensor as tt - -from pymc3 import Dirichlet -from pymc3 import math as pmmath -from sklearn.datasets import fetch_20newsgroups -from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer -from theano import shared -from theano.sandbox.rng_mrg import MRG_RandomStreams - -# unfortunately I was not able to run it on GPU due to overflow problems -%env THEANO_FLAGS=device=cpu,floatX=float64 - -plt.style.use("seaborn-darkgrid") -``` - -## Dataset -Here, we will use the 20-newsgroups dataset. This dataset can be obtained by using functions of scikit-learn. The below code is partially adopted from an example of scikit-learn (http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html). 
We set the number of words in the vocabulary to 1000. - -```{code-cell} ipython3 -# The number of words in the vocabulary -n_words = 1000 - -print("Loading dataset...") -t0 = time() -dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=("headers", "footers", "quotes")) -data_samples = dataset.data -print("done in %0.3fs." % (time() - t0)) - -# Use tf (raw term count) features for LDA. -print("Extracting tf features for LDA...") -tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_words, stop_words="english") - -t0 = time() -tf = tf_vectorizer.fit_transform(data_samples) -feature_names = tf_vectorizer.get_feature_names() -print("done in %0.3fs." % (time() - t0)) -``` - -Each document is represented by 1000-dimensional term-frequency vector. Let's check the data. - -```{code-cell} ipython3 -plt.plot(tf[:10, :].toarray().T); -``` - -We split the whole documents into training and test sets. The number of tokens in the training set is 480K. Sparsity of the term-frequency document matrix is 0.025%, which implies almost all components in the term-frequency matrix is zero. - -```{code-cell} ipython3 -n_samples_tr = 10000 -n_samples_te = tf.shape[0] - n_samples_tr -docs_tr = tf[:n_samples_tr, :] -docs_te = tf[n_samples_tr:, :] -print("Number of docs for training = {}".format(docs_tr.shape[0])) -print("Number of docs for test = {}".format(docs_te.shape[0])) - -n_tokens = np.sum(docs_tr[docs_tr.nonzero()]) -print(f"Number of tokens in training set = {n_tokens}") -print( - "Sparsity = {}".format(len(docs_tr.nonzero()[0]) / float(docs_tr.shape[0] * docs_tr.shape[1])) -) -``` - -## Log-likelihood of documents for LDA -For a document $d$ consisting of tokens $w$, the log-likelihood of the LDA model with $K$ topics is given as -\begin{eqnarray} - \log p\left(d|\theta_{d},\beta\right) & = & \sum_{w\in d}\log\left[\sum_{k=1}^{K}\exp\left(\log\theta_{d,k} + \log \beta_{k,w}\right)\right]+const, -\end{eqnarray} -where $\theta_{d}$ is the topic distribution for document $d$ and $\beta$ is the word distribution for the $K$ topics. We define a function that returns a tensor of the log-likelihood of documents given $\theta_{d}$ and $\beta$. - -```{code-cell} ipython3 -def logp_lda_doc(beta, theta): - """Returns the log-likelihood function for given documents. - - K : number of topics in the model - V : number of words (size of vocabulary) - D : number of documents (in a mini-batch) - - Parameters - ---------- - beta : tensor (K x V) - Word distributions. - theta : tensor (D x K) - Topic distributions for documents. - """ - - def ll_docs_f(docs): - dixs, vixs = docs.nonzero() - vfreqs = docs[dixs, vixs] - ll_docs = ( - vfreqs * pmmath.logsumexp(tt.log(theta[dixs]) + tt.log(beta.T[vixs]), axis=1).ravel() - ) - - # Per-word log-likelihood times num of tokens in the whole dataset - return tt.sum(ll_docs) / (tt.sum(vfreqs) + 1e-9) * n_tokens - - return ll_docs_f -``` - -In the inner function, the log-likelihood is scaled for mini-batches by the number of tokens in the dataset. - -+++ - -## LDA model -With the log-likelihood function, we can construct the probabilistic model for LDA. `doc_t` works as a placeholder to which documents in a mini-batch are set. - -For ADVI, each of random variables $\theta$ and $\beta$, drawn from Dirichlet distributions, is transformed into unconstrained real coordinate space. To do this, by default, PyMC3 uses an isometric logratio transformation. 
Since these random variables are on a simplex, the dimension of the unconstrained coordinate space is the original dimension minus 1. For example, the dimension of $\theta_{d}$ is the number of topics (`n_topics`) in the LDA model, thus the transformed space has dimension `(n_topics - 1)`. - -The variational posterior on these transformed parameters is represented by spherical Gaussian distributions (mean-field approximation). Thus, the number of variational parameters of $\theta_{d}$, the latent variable for each document, is `2 * (n_topics - 1)` for means and standard deviations. - -In the last line of the below cell, the `DensityDist` class is used to define the log-likelihood function of the model. The second argument is a Python function which takes observations (a document matrix in this example) and returns the log-likelihood value. This function is given as a return value of `logp_lda_doc(beta, theta)`, which has been defined above. - -```{code-cell} ipython3 -n_topics = 10 -# we have sparse dataset. It's better to have dense batch so that all words accrue there -minibatch_size = 128 - -# defining minibatch -doc_t_minibatch = pm.Minibatch(docs_tr.toarray(), minibatch_size) -doc_t = shared(docs_tr.toarray()[:minibatch_size]) -with pm.Model() as model: -    theta = Dirichlet( -        "theta", -        a=pm.floatX((1.0 / n_topics) * np.ones((minibatch_size, n_topics))), -        shape=(minibatch_size, n_topics), -        # do not forget scaling -        total_size=n_samples_tr, -    ) -    beta = Dirichlet( -        "beta", -        a=pm.floatX((1.0 / n_topics) * np.ones((n_topics, n_words))), -        shape=(n_topics, n_words), -    ) -    # Note, that we defined likelihood with scaling, so here we need no additional `total_size` kwarg -    doc = pm.DensityDist("doc", logp_lda_doc(beta, theta), observed=doc_t) -``` - -## Encoder -Given a document, the encoder calculates variational parameters of the (transformed) latent variables, more specifically, parameters of Gaussian distributions in the unconstrained real coordinate space. The `encode()` method outputs the variational means and (transformed) standard deviations, returned here as a dict with keys `mu` and `rho`, as shown in the following code. As explained above, the number of variational parameters is `2 * (n_topics - 1)`. Specifically, the shape of `zs_mean` (or `zs_rho`) in the method is `(minibatch_size, n_topics - 1)`. Note that `zs_rho` is the parameter $\rho = \log(\exp(std) - 1)$ used by `ADVI`. The inverse parametrization, $std = \log(1 + \exp(\rho))$, keeps the standard deviation positive and is considered numerically stable. - -To enhance generalization ability to unseen words, a Bernoulli corruption process is applied to the input documents. Unfortunately, I have never seen a significant improvement from this.
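For intuition, the corruption step simply zeroes out a random subset of the non-zero counts. Below is a small self-contained NumPy sketch of that idea on toy data (the names `toy_docs`, `p_corr` and `rng_np` are ours and independent of the Theano implementation in the next cell):

```{code-cell} ipython3
# Toy illustration of Bernoulli corruption on a tiny term-frequency matrix
rng_np = np.random.RandomState(0)
toy_docs = np.array([[2, 0, 1, 0, 3], [0, 1, 0, 4, 0]])
p_corr = 0.5

mask = np.ones_like(toy_docs)
nz = toy_docs.nonzero()
# each non-zero count survives with probability 1 - p_corr
mask[nz] = rng_np.binomial(n=1, p=1 - p_corr, size=len(nz[0]))
print(toy_docs * mask)
```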
- -```{code-cell} ipython3 -class LDAEncoder: - """Encode (term-frequency) document vectors to variational means and (log-transformed) stds.""" - - def __init__(self, n_words, n_hidden, n_topics, p_corruption=0, random_seed=1): - rng = np.random.RandomState(random_seed) - self.n_words = n_words - self.n_hidden = n_hidden - self.n_topics = n_topics - self.w0 = shared(0.01 * rng.randn(n_words, n_hidden).ravel(), name="w0") - self.b0 = shared(0.01 * rng.randn(n_hidden), name="b0") - self.w1 = shared(0.01 * rng.randn(n_hidden, 2 * (n_topics - 1)).ravel(), name="w1") - self.b1 = shared(0.01 * rng.randn(2 * (n_topics - 1)), name="b1") - self.rng = MRG_RandomStreams(seed=random_seed) - self.p_corruption = p_corruption - - def encode(self, xs): - if 0 < self.p_corruption: - dixs, vixs = xs.nonzero() - mask = tt.set_subtensor( - tt.zeros_like(xs)[dixs, vixs], - self.rng.binomial(size=dixs.shape, n=1, p=1 - self.p_corruption), - ) - xs_ = xs * mask - else: - xs_ = xs - - w0 = self.w0.reshape((self.n_words, self.n_hidden)) - w1 = self.w1.reshape((self.n_hidden, 2 * (self.n_topics - 1))) - hs = tt.tanh(xs_.dot(w0) + self.b0) - zs = hs.dot(w1) + self.b1 - zs_mean = zs[:, : (self.n_topics - 1)] - zs_rho = zs[:, (self.n_topics - 1) :] - return {"mu": zs_mean, "rho": zs_rho} - - def get_params(self): - return [self.w0, self.b0, self.w1, self.b1] -``` - -To feed the output of the encoder to the variational parameters of $\theta$, we set an OrderedDict of tuples as below. - -```{code-cell} ipython3 -encoder = LDAEncoder(n_words=n_words, n_hidden=100, n_topics=n_topics, p_corruption=0.0) -local_RVs = OrderedDict([(theta, encoder.encode(doc_t))]) -local_RVs -``` - -`theta` is the random variable defined in the model creation and is a key of an entry of the `OrderedDict`. The value `(encoder.encode(doc_t), n_samples_tr / minibatch_size)` is a tuple of a theano expression and a scalar. The theano expression `encoder.encode(doc_t)` is the output of the encoder given inputs (documents). The scalar `n_samples_tr / minibatch_size` specifies the scaling factor for mini-batches. - -ADVI optimizes the parameters of the encoder. They are passed to the function for ADVI. - -```{code-cell} ipython3 -encoder_params = encoder.get_params() -encoder_params -``` - -## AEVB with ADVI - -```{code-cell} ipython3 -η = 0.1 -s = shared(η) - - -def reduce_rate(a, h, i): - s.set_value(η / ((i / minibatch_size) + 1) ** 0.7) - - -with model: - approx = pm.MeanField(local_rv=local_RVs) - approx.scale_cost_to_minibatch = False - inference = pm.KLqp(approx) -inference.fit( - 10000, - callbacks=[reduce_rate], - obj_optimizer=pm.sgd(learning_rate=s), - more_obj_params=encoder_params, - total_grad_norm_constraint=200, - more_replacements={doc_t: doc_t_minibatch}, -) -``` - -```{code-cell} ipython3 -print(approx) -``` - -```{code-cell} ipython3 -plt.plot(approx.hist[10:]); -``` - -## Extraction of characteristic words of topics based on posterior samples -By using estimated variational parameters, we can draw samples from the variational posterior. To do this, we use function `sample_vp()`. Here we use this function to obtain posterior mean of the word-topic distribution $\beta$ and show top-10 words frequently appeared in the 10 topics. - -+++ - -To apply the above function for the LDA model, we redefine the probabilistic model because the number of documents to be tested changes. Since variational parameters have already been obtained, we can reuse them for sampling from the approximate posterior distribution. 
- -```{code-cell} ipython3 -def print_top_words(beta, feature_names, n_top_words=10): - for i in range(len(beta)): - print( - ("Topic #%d: " % i) - + " ".join([feature_names[j] for j in beta[i].argsort()[: -n_top_words - 1 : -1]]) - ) - - -doc_t.set_value(docs_tr.toarray()) -samples = pm.sample_approx(approx, draws=100) -beta_pymc3 = samples["beta"].mean(axis=0) - -print_top_words(beta_pymc3, feature_names) -``` - -We compare these topics to those obtained by a standard LDA implementation on scikit-learn, which is based on an online stochastic variational inference (Hoffman et al., 2013). We can see that estimated words in the topics are qualitatively similar. - -```{code-cell} ipython3 -from sklearn.decomposition import LatentDirichletAllocation - -lda = LatentDirichletAllocation( - n_components=n_topics, - max_iter=5, - learning_method="online", - learning_offset=50.0, - random_state=0, -) -%time lda.fit(docs_tr) -beta_sklearn = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis] - -print_top_words(beta_sklearn, feature_names) -``` - -## Predictive distribution -In some papers (e.g., Hoffman et al. 2013), the predictive distribution of held-out words was proposed as a quantitative measure for goodness of the model fitness. The log-likelihood function for tokens of the held-out word can be calculated with posterior means of $\theta$ and $\beta$. The validity of this is explained in (Hoffman et al. 2013). - -```{code-cell} ipython3 -def calc_pp(ws, thetas, beta, wix): - """ - Parameters - ---------- - ws: ndarray (N,) - Number of times the held-out word appeared in N documents. - thetas: ndarray, shape=(N, K) - Topic distributions for N documents. - beta: ndarray, shape=(K, V) - Word distributions for K topics. - wix: int - Index of the held-out word - - Return - ------ - Log probability of held-out words. - """ - return ws * np.log(thetas.dot(beta[:, wix])) - - -def eval_lda(transform, beta, docs_te, wixs): - """Evaluate LDA model by log predictive probability. - - Parameters - ---------- - transform: Python function - Transform document vectors to posterior mean of topic proportions. - wixs: iterable of int - Word indices to be held-out. - """ - lpss = [] - docs_ = deepcopy(docs_te) - thetass = [] - wss = [] - total_words = 0 - for wix in wixs: - ws = docs_te[:, wix].ravel() - if 0 < ws.sum(): - # Hold-out - docs_[:, wix] = 0 - - # Topic distributions - thetas = transform(docs_) - - # Predictive log probability - lpss.append(calc_pp(ws, thetas, beta, wix)) - - docs_[:, wix] = ws - thetass.append(thetas) - wss.append(ws) - total_words += ws.sum() - else: - thetass.append(None) - wss.append(None) - - # Log-probability - lp = np.sum(np.hstack(lpss)) / total_words - - return {"lp": lp, "thetass": thetass, "beta": beta, "wss": wss} -``` - -`transform()` function is defined with `sample_vp()` function. This function is an argument to the function for calculating log predictive probabilities. - -```{code-cell} ipython3 -inp = tt.matrix(dtype="int64") -sample_vi_theta = theano.function( - [inp], approx.sample_node(approx.model.theta, 100, more_replacements={doc_t: inp}).mean(0) -) - - -def transform_pymc3(docs): - return sample_vi_theta(docs) -``` - -```{code-cell} ipython3 -%time result_pymc3 = eval_lda(\ - transform_pymc3, beta_pymc3, docs_te.toarray(), np.arange(100)\ - ) -print("Predictive log prob (pm3) = {}".format(result_pymc3["lp"])) -``` - -We compare the result with the scikit-learn LDA implemented. 
The log predictive probability is comparable with AEVB-ADVI, and it shows good set of words in the estimated topics. - -```{code-cell} ipython3 -def transform_sklearn(docs): - thetas = lda.transform(docs) - return thetas / thetas.sum(axis=1)[:, np.newaxis] - - -%time result_sklearn = eval_lda(\ - transform_sklearn, beta_sklearn, docs_te.toarray(), np.arange(100)\ - ) -print("Predictive log prob (sklearn) = {}".format(result_sklearn["lp"])) -``` - -## Summary -We have seen that PyMC3 allows us to estimate random variables of LDA, a probabilistic model with latent variables, based on automatic variational inference. Variational parameters of the local latent variables in the probabilistic model are encoded from observations. The parameters of the encoding model, MLP in this example, are optimized with variational parameters of the global latent variables. Once the probabilistic and the encoding models are defined, parameter optimization is done just by invoking an inference (`ADVI()`) without need to derive complex update equations. - -This notebook shows that even mean field approximation can perform as well as sklearn implementation, which is based on the conjugate priors and thus not relying on the mean field approximation. - -+++ - -## References -* Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. stat, 1050, 1. -* Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (2015). Automatic variational inference in Stan. In Advances in neural information processing systems (pp. 568-576). -* Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022. -* Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. W. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(1), 1303-1347. -* Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770. -* Salimans, T., Kingma, D. P., & Welling, M. (2015). Markov chain Monte Carlo and variational inference: Bridging the gap. In International Conference on Machine Learning (pp. 1218-1226). - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/variational_inference/normalizing_flows_overview.myst.md b/myst_nbs/variational_inference/normalizing_flows_overview.myst.md deleted file mode 100644 index a8de69e9b..000000000 --- a/myst_nbs/variational_inference/normalizing_flows_overview.myst.md +++ /dev/null @@ -1,468 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python 3 - language: python - name: python3 ---- - -# Normalizing Flows Overview - -+++ - -Normalizing Flows is a rich family of distributions. They were described by [Rezende and Mohamed](https://arxiv.org/abs/1505.05770), and their experiments proved the importance of studying them further. Some extensions like that of [Tomczak and Welling](https://arxiv.org/abs/1611.09630) made partially/full rank Gaussian approximations for high dimensional spaces computationally tractable. - -This notebook reveals some tips and tricks for using normalizing flows effectively in PyMC3. 
- -```{code-cell} ipython3 -%matplotlib inline -from collections import Counter - -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import seaborn as sns -import theano -import theano.tensor as tt - -pm.set_tt_rng(42) -np.random.seed(42) -``` - -## Theory - -A normalizing flow is a series of invertible transformations applied to an initial distribution. - -$$z_K = f_K \circ \dots \circ f_2 \circ f_1(z_0) $$ - -In this case, we can compute a tractable density for the flow. - -$$\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K}\ln \left|\det\frac{\partial f_k}{\partial z_{k-1}}\right|$$ - -Here, every $f_k$ is a parametric function with a well-defined Jacobian determinant. The transformation used is up to the user; for example, the simplest flow is an affine transform: - -$$z = loc(scale(z_0)) = \mu + \sigma * z_0 $$ - -In this case, we get a mean field approximation if $z_0 \sim \mathcal{N}(0, 1)$. - -## Flow Formulas - -In PyMC3 there are flexible ways to define flows with formulas. There are currently 5 types defined: - -* Loc (`loc`): $z' = z + \mu$ -* Scale (`scale`): $z' = \sigma * z$ -* Planar (`planar`): $z' = z + u * \tanh(w^T z + b)$ -* Radial (`radial`): $z' = z + \beta (\alpha + ||z-z_r||)^{-1}(z-z_r)$ -* Householder (`hh`): $z' = H z$ - -+++ - -Formulae can be composed as a string, e.g. `'scale-loc'`, `'scale-hh*4-loc'`, `'planar*10'`. Each step is separated with `'-'`, and repeated flows are defined with `'*'` in the form of `'*<#repeats>'`. - -Flow-based approximations in PyMC3 are based on the `NormalizingFlow` class, with corresponding inference classes named using the `NF` abbreviation (analogous to how `ADVI` and `SVGD` are treated in PyMC3). - -Concretely, an approximation is represented by: - -```{code-cell} ipython3 -pm.NormalizingFlow -``` - -While an inference class is: - -```{code-cell} ipython3 -pm.NFVI -``` - -## Flow patterns - -Composing flows requires some understanding of the target output. Flows that are too complex might not converge, whereas if they are too simple, they may not accurately estimate the posterior. - -Let's start simply: - -```{code-cell} ipython3 -with pm.Model() as dummy: - -    N = pm.Normal("N", shape=(100,)) -``` - -### Mean Field connectivity - -Let's apply the transformation corresponding to the mean-field family to begin with: - -```{code-cell} ipython3 -pm.NormalizingFlow("scale-loc", model=dummy) -``` - -### Full Rank Normal connectivity - -We can get a full rank model with a dense covariance matrix using **householder flows** (hh). One `hh` flow adds exactly one rank to the covariance matrix, so for a full rank matrix we need `K=ndim` householder flows. hh flows are volume-preserving, so we need to change the scaling if we want our posterior to have unit variance for the latent variables. - -After we specify the covariance with a combination of `'scale-hh*K'`, we then add a location shift with the `loc` flow. We now have a full-rank analog: - -```{code-cell} ipython3 -pm.NormalizingFlow("scale-hh*100-loc", model=dummy) -``` - -A more interesting case is when we do not expect a lot of interactions within the posterior. In this case, where our covariance is expected to be sparse, we can constrain it by defining a *low rank* approximation family. - -This has the additional benefit of reducing the computational cost of approximating the model. - -```{code-cell} ipython3 -pm.NormalizingFlow("scale-hh*10-loc", model=dummy) -``` - -Parameters can be initialized randomly, using the `jitter` argument to specify the scale of the randomness.
- -```{code-cell} ipython3 -pm.NormalizingFlow("scale-hh*10-loc", model=dummy, jitter=0.001) # LowRank -``` - -### Planar and Radial Flows - -* Planar (`planar`): $z' = z + u * \tanh(w^T z + b)$ -* Radial (`radial`): $z' = z + \beta (\alpha + ||z-z_r||)^{-1}(z-z_r)$ - -Planar flows are useful for splitting the incoming distribution into two parts, which allows multimodal distributions to be modeled. - -Similarly, a radial flow changes density around a specific reference point. - -+++ - -## Simulated data example - -+++ - -There were 4 potential functions illustrated in the [original paper](https://arxiv.org/abs/1505.05770), which we can replicate here. Inference can be unstable in multimodal cases, but there are strategies for dealing with them. - -First, let's specify the potential functions: - -```{code-cell} ipython3 -def w1(z): - return tt.sin(2.0 * np.pi * z[0] / 4.0) - - -def w2(z): - return 3.0 * tt.exp(-0.5 * ((z[0] - 1.0) / 0.6) ** 2) - - -def w3(z): - return 3.0 * (1 + tt.exp(-(z[0] - 1.0) / 0.3)) ** -1 - - -def pot1(z): - z = z.T - return 0.5 * ((z.norm(2, axis=0) - 2.0) / 0.4) ** 2 - tt.log( - tt.exp(-0.5 * ((z[0] - 2.0) / 0.6) ** 2) + tt.exp(-0.5 * ((z[0] + 2.0) / 0.6) ** 2) - ) - - -def pot2(z): - z = z.T - return 0.5 * ((z[1] - w1(z)) / 0.4) ** 2 + 0.1 * tt.abs_(z[0]) - - -def pot3(z): - z = z.T - return -tt.log( - tt.exp(-0.5 * ((z[1] - w1(z)) / 0.35) ** 2) - + tt.exp(-0.5 * ((z[1] - w1(z) + w2(z)) / 0.35) ** 2) - ) + 0.1 * tt.abs_(z[0]) - - -def pot4(z): - z = z.T - return -tt.log( - tt.exp(-0.5 * ((z[1] - w1(z)) / 0.4) ** 2) - + tt.exp(-0.5 * ((z[1] - w1(z) + w3(z)) / 0.35) ** 2) - ) + 0.1 * tt.abs_(z[0]) - - -z = tt.matrix("z") -z.tag.test_value = pm.floatX([[0.0, 0.0]]) -pot1f = theano.function([z], pot1(z)) -pot2f = theano.function([z], pot2(z)) -pot3f = theano.function([z], pot3(z)) -pot4f = theano.function([z], pot4(z)) -``` - -```{code-cell} ipython3 -def contour_pot(potf, ax=None, title=None, xlim=5, ylim=5): - grid = pm.floatX(np.mgrid[-xlim:xlim:100j, -ylim:ylim:100j]) - grid_2d = grid.reshape(2, -1).T - cmap = plt.get_cmap("inferno") - if ax is None: - _, ax = plt.subplots(figsize=(12, 9)) - pdf1e = np.exp(-potf(grid_2d)) - contour = ax.contourf(grid[0], grid[1], pdf1e.reshape(100, 100), cmap=cmap) - if title is not None: - ax.set_title(title, fontsize=16) - return ax -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(2, 2, figsize=(12, 12)) -ax = ax.flatten() -contour_pot( - pot1f, - ax[0], - "pot1", -) -contour_pot(pot2f, ax[1], "pot2") -contour_pot(pot3f, ax[2], "pot3") -contour_pot(pot4f, ax[3], "pot4") -fig.tight_layout() -``` - -## Reproducing first potential function - -```{code-cell} ipython3 -from pymc3.distributions.dist_math import bound - - -def cust_logp(z): - # return bound(-pot1(z), z>-5, z<5) - return -pot1(z) - - -with pm.Model() as pot1m: - pm.DensityDist("pot1", logp=cust_logp, shape=(2,)) -``` - -### NUTS -Let's use NUTS first. Just to have a look how good is it's approximation. - -> Note you may need to rerun the model a couple of times, as the sampler/estimator might not fully explore function due to multimodality. 
- -```{code-cell} ipython3 -pm.set_tt_rng(42) -np.random.seed(42) -with pot1m: - trace = pm.sample( - 1000, - init="auto", - cores=2, - start=[dict(pot1=np.array([-2, 0])), dict(pot1=np.array([2, 0]))], - ) -``` - -```{code-cell} ipython3 -dftrace = pm.trace_to_dataframe(trace) -sns.jointplot(dftrace.iloc[:, 0], dftrace.iloc[:, 1], kind="kde") -``` - -### Normalizing flows - -As a first (naive) try with flows, we will keep things simple: Let's use just 2 planar flows and see what we get: - -```{code-cell} ipython3 -with pot1m: - inference = pm.NFVI("planar*2", jitter=1) - -## Plotting starting distribution -dftrace = pm.trace_to_dataframe(inference.approx.sample(1000)) -sns.jointplot(dftrace.iloc[:, 0], dftrace.iloc[:, 1], kind="kde"); -``` - -#### Tracking gradients - -It is illustrative to track gradients as well as parameters. In this setup, different sampling points can give different gradients because a single sampled point tends to collapse to a mode. - -Here are the parameters of the model: - -```{code-cell} ipython3 -inference.approx.params -``` - -We also require an objective: - -```{code-cell} ipython3 -inference.objective(nmc=None) -``` - -Theano can be used to calculate the gradient of the objective with respect to the parameters: - -```{code-cell} ipython3 -with theano.configparser.change_flags(compute_test_value="off"): - grads = tt.grad(inference.objective(None), inference.approx.params) -grads -``` - -If we want to keep track of the gradient changes during the inference, we warp them in a pymc3 callback: - -```{code-cell} ipython3 -from collections import OrderedDict, defaultdict -from itertools import count - - -@theano.configparser.change_flags(compute_test_value="off") -def get_tracker(inference): - numbers = defaultdict(count) - params = inference.approx.params - grads = tt.grad(inference.objective(None), params) - names = ["%s_%d" % (v.name, next(numbers[v.name])) for v in inference.approx.params] - return pm.callbacks.Tracker( - **OrderedDict( - [(name, v.eval) for name, v in zip(names, params)] - + [("grad_" + name, v.eval) for name, v in zip(names, grads)] - ) - ) - - -tracker = get_tracker(inference) -``` - -```{code-cell} ipython3 -tracker.whatchdict -``` - -```{code-cell} ipython3 -inference.fit(30000, obj_optimizer=pm.adagrad_window(learning_rate=0.01), callbacks=[tracker]) -``` - -```{code-cell} ipython3 -dftrace = pm.trace_to_dataframe(inference.approx.sample(1000)) -sns.jointplot(dftrace.iloc[:, 0], dftrace.iloc[:, 1], kind="kde") -``` - -```{code-cell} ipython3 -plt.plot(inference.hist); -``` - -As you can see, the objective history is not very informative here. This is where the gradient tracker can be more informative. 
- -```{code-cell} ipython3 -# fmt: off -trackername = ['u_0', 'w_0', 'b_0', 'u_1', 'w_1', 'b_1', - 'grad_u_0', 'grad_w_0', 'grad_b_0', 'grad_u_1', 'grad_w_1', 'grad_b_1'] -# fmt: on - - -def plot_tracker_results(tracker): - fig, ax = plt.subplots(len(tracker.hist) // 2, 2, figsize=(16, len(tracker.hist) // 2 * 2.3)) - ax = ax.flatten() - # names = list(tracker.hist.keys()) - names = trackername - gnames = names[len(names) // 2 :] - names = names[: len(names) // 2] - pairnames = zip(names, gnames) - - def plot_params_and_grads(name, gname): - i = names.index(name) - left = ax[i * 2] - right = ax[i * 2 + 1] - grads = np.asarray(tracker[gname]) - if grads.ndim == 1: - grads = grads[:, None] - grads = grads.T - params = np.asarray(tracker[name]) - if params.ndim == 1: - params = params[:, None] - params = params.T - right.set_title("Gradient of %s" % name) - left.set_title("Param trace of %s" % name) - s = params.shape[0] - for j, (v, g) in enumerate(zip(params, grads)): - left.plot(v, "-") - right.plot(g, "o", alpha=1 / s / 10) - left.legend([name + "_%d" % j for j in range(len(names))]) - right.legend([gname + "_%d" % j for j in range(len(names))]) - - for vn, gn in pairnames: - plot_params_and_grads(vn, gn) - fig.tight_layout() -``` - -```{code-cell} ipython3 -plot_tracker_results(tracker); -``` - -Inference **is often unstable**, some parameters are not well fitted as they poorly influence the resulting posterior. - -In a multimodal setting, the dominant mode might well change from run to run. - -+++ - -### Going deeper - -We can try to improve our approximation by adding flows; in the original paper they used both 8 and 32. Let's try using 8 here. - -```{code-cell} ipython3 -with pot1m: - inference = pm.NFVI("planar*8", jitter=1.0) - -dftrace = pm.trace_to_dataframe(inference.approx.sample(1000)) -sns.jointplot(dftrace.iloc[:, 0], dftrace.iloc[:, 1], kind="kde"); -``` - -We can try for a more robust fit by allocating more samples to `obj_n_mc` in `fit`, which controls the number of Monte Carlo samples used to approximate the gradient. - -```{code-cell} ipython3 -inference.fit( - 25000, - obj_optimizer=pm.adam(learning_rate=0.01), - obj_n_mc=100, - callbacks=[pm.callbacks.CheckParametersConvergence()], -) -``` - -```{code-cell} ipython3 -dftrace = pm.trace_to_dataframe(inference.approx.sample(1000)) -sns.jointplot(dftrace.iloc[:, 0], dftrace.iloc[:, 1], kind="kde") -``` - -This is a noticeable improvement. Here, we see that flows are able to characterize the multimodality of a given posterior, but as we have seen, they are hard to fit. The initial point of the optimization matters in general for the multimodal case. 
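One rough way to gauge this run-to-run variability is to refit the same flow under a few different seeds and compare where the approximate posterior mass ends up. The loop below is only a sketch reusing the fitting settings from above (it takes a while to run, and the exact numbers will differ between runs):

```{code-cell} ipython3
# Refit the planar*8 flow under a few seeds and compare posterior means
seed_means = []
for seed in (1, 2, 3):
    pm.set_tt_rng(seed)
    np.random.seed(seed)
    with pot1m:
        inf = pm.NFVI("planar*8", jitter=1.0)
    inf.fit(25000, obj_optimizer=pm.adam(learning_rate=0.01), obj_n_mc=100)
    seed_means.append(inf.approx.sample(1000)["pot1"].mean(axis=0))

print(np.round(seed_means, 2))
```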
- -+++ - -### MCMC vs NFVI - -Let's use another potential function, and compare the sampling using NUTS to what we get with NF: - -```{code-cell} ipython3 -def cust_logp(z): - return -pot4(z) - - -with pm.Model() as pot_m: - pm.DensityDist("pot_func", logp=cust_logp, shape=(2,)) -``` - -```{code-cell} ipython3 -with pot_m: - traceNUTS = pm.sample(3000, tune=1000, target_accept=0.9, cores=2) -``` - -```{code-cell} ipython3 -formula = "planar*10" -with pot_m: - inference = pm.NFVI(formula, jitter=0.1) - -inference.fit(25000, obj_optimizer=pm.adam(learning_rate=0.01), obj_n_mc=10) - -traceNF = inference.approx.sample(5000) -``` - -```{code-cell} ipython3 -fig, ax = plt.subplots(1, 3, figsize=(18, 6)) -contour_pot(pot4f, ax[0], "Target Potential Function") - -ax[1].scatter(traceNUTS["pot_func"][:, 0], traceNUTS["pot_func"][:, 1], c="r", alpha=0.05) -ax[1].set_xlim(-5, 5) -ax[1].set_ylim(-5, 5) -ax[1].set_title("NUTS") - -ax[2].scatter(traceNF["pot_func"][:, 0], traceNF["pot_func"][:, 1], c="b", alpha=0.05) -ax[2].set_xlim(-5, 5) -ax[2].set_ylim(-5, 5) -ax[2].set_title("NF with " + formula); -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -``` diff --git a/myst_nbs/variational_inference/pathfinder.myst.md b/myst_nbs/variational_inference/pathfinder.myst.md deleted file mode 100644 index 743a79026..000000000 --- a/myst_nbs/variational_inference/pathfinder.myst.md +++ /dev/null @@ -1,97 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: pymc4 - language: python - name: pymc4 ---- - -(pathfinder)= - -# Pathfinder Variational Inference - -:::{post} Sept 30, 2022 -:tags: variational inference, jax -:category: advanced, how-to -:author: Thomas Wiecki -::: - -+++ - -Pathfinder {cite:p}`zhang2021pathfinder` is a variational inference algorithm that produces samples from the posterior of a Bayesian model. It compares favorably to the widely used ADVI algorithm. On large problems, it should scale better than most MCMC algorithms, including dynamic HMC (i.e. NUTS), at the cost of a more biased estimate of the posterior. For details on the algorithm, see the [arxiv preprint](https://arxiv.org/abs/2108.03782). - -This algorithm is [implemented](https://github.com/blackjax-devs/blackjax/pull/194) in [BlackJAX](https://github.com/blackjax-devs/blackjax), a library of inference algorithms for [JAX](https://github.com/google/jax). Through PyMC's JAX-backend (through [aesara](https://github.com/aesara-devs/aesara)) we can run BlackJAX's pathfinder on any PyMC model with some simple wrapper code. - -This wrapper code is implemented in [pymcx](https://github.com/pymc-devs/pymcx/). This tutorial shows how to run Pathfinder on your PyMC model. - -You first need to install `pymcx`: - -`pip install git+https://github.com/pymc-devs/pymcx` - -```{code-cell} ipython3 -import arviz as az -import numpy as np -import pymc as pm -import pymcx as pmx - -print(f"Running on PyMC v{pm.__version__}") -``` - -First, define your PyMC model. Here, we use the 8-schools model. 
- -```{code-cell} ipython3 -# Data of the Eight Schools Model -J = 8 -y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]) -sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]) - -with pm.Model() as model: - mu = pm.Normal("mu", mu=0.0, sigma=10.0) - tau = pm.HalfCauchy("tau", 5.0) - - theta = pm.Normal("theta", mu=0, sigma=1, shape=J) - theta_1 = mu + tau * theta - obs = pm.Normal("obs", mu=theta, sigma=sigma, shape=J, observed=y) -``` - -Next, we call `pmx.fit()` and pass in the algorithm we want it to use. - -```{code-cell} ipython3 -with model: - idata = pmx.fit(method="pathfinder") -``` - -Just like `pymc.sample()`, this returns an idata with samples from the posterior. Note that because these samples do not come from an MCMC chain, convergence can not be assessed in the regular way. - -```{code-cell} ipython3 -az.plot_trace(idata); -``` - -## References - -:::{bibliography} -:filter: docname in docnames -::: - -+++ - -## Authors - -* Authored by Thomas Wiecki on Oct 11 2022 ([pymc-examples#429](https://github.com/pymc-devs/pymc-examples/pull/429)) - -+++ - -## Watermark - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -p aesara,xarray -``` - -:::{include} ../page_footer.md -::: diff --git a/myst_nbs/variational_inference/variational_api_quickstart.myst.md b/myst_nbs/variational_inference/variational_api_quickstart.myst.md deleted file mode 100644 index 0cc900a00..000000000 --- a/myst_nbs/variational_inference/variational_api_quickstart.myst.md +++ /dev/null @@ -1,539 +0,0 @@ ---- -jupytext: - text_representation: - extension: .md - format_name: myst - format_version: 0.13 - jupytext_version: 1.13.7 -kernelspec: - display_name: Python (PyMC3 Dev) - language: python - name: pymc3-dev ---- - -# Variational API quickstart - -The variational inference (VI) API is focused on approximating posterior distributions for Bayesian models. Common use cases to which this module can be applied include: - -* Sampling from model posterior and computing arbitrary expressions -* Conduct Monte Carlo approximation of expectation, variance, and other statistics -* Remove symbolic dependence on PyMC3 random nodes and evaluate expressions (using `eval`) -* Provide a bridge to arbitrary Theano code - -Sounds good, doesn't it? - -The module provides an interface to a variety of inference methods, so you are free to choose what is most appropriate for the problem. - -```{code-cell} ipython3 -%matplotlib inline -import arviz as az -import matplotlib.pyplot as plt -import numpy as np -import pymc3 as pm -import theano - -np.random.seed(42) -pm.set_tt_rng(42) -``` - -## Basic setup - -We do not need complex models to play with the VI API; let's begin with a simple mixture model: - -```{code-cell} ipython3 -w = pm.floatX([0.2, 0.8]) -mu = pm.floatX([-0.3, 0.5]) -sd = pm.floatX([0.1, 0.1]) - -with pm.Model() as model: - x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd, dtype=theano.config.floatX) - x2 = x**2 - sin_x = pm.math.sin(x) -``` - -We can't compute analytical expectations for this model. However, we can obtain an approximation using Markov chain Monte Carlo methods; let's use NUTS first. 
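Before running NUTS, note that the mixture itself is trivial to sample directly, so a plain NumPy Monte Carlo estimate of the two expectations gives a useful baseline to compare the NUTS and VI results against. A sketch reusing the `w`, `mu` and `sd` arrays defined above:

```{code-cell} ipython3
# Direct Monte Carlo baseline for E[x^2] and E[sin(x)], independent of PyMC3
rng = np.random.RandomState(0)
n = 1_000_000
comp = (rng.rand(n) < float(w[1])).astype(int)  # component 1 with probability w[1]
draws = rng.normal(np.asarray(mu)[comp], np.asarray(sd)[comp])
print("E[x^2]   ~", (draws**2).mean())
print("E[sin x] ~", np.sin(draws).mean())
```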
- -To allow samples of the expressions to be saved, we need to wrap them in `Deterministic` objects: - -```{code-cell} ipython3 -with model: - pm.Deterministic("x2", x2) - pm.Deterministic("sin_x", sin_x) -``` - -```{code-cell} ipython3 -with model: - trace = pm.sample(50000) -``` - -```{code-cell} ipython3 -az.plot_trace(trace); -``` - -Above are traces for $x^2$ and $sin(x)$. We can see there is clear multi-modality in this model. One drawback, is that you need to know in advance what exactly you want to see in trace and wrap it with `Deterministic`. - -The VI API takes an alternate approach: You obtain inference from model, then calculate expressions based on this model afterwards. - -Let's use the same model: - -```{code-cell} ipython3 -with pm.Model() as model: - - x = pm.NormalMixture("x", w=w, mu=mu, sigma=sd, dtype=theano.config.floatX) - x2 = x**2 - sin_x = pm.math.sin(x) -``` - -Here we will use automatic differentiation variational inference (ADVI). - -```{code-cell} ipython3 -with model: - mean_field = pm.fit(method="advi") -``` - -```{code-cell} ipython3 -az.plot_posterior(mean_field.sample(1000), color="LightSeaGreen"); -``` - -Notice that ADVI has failed to approximate the multimodal distribution, since it uses a Gaussian distribution that has a single mode. - -## Checking convergence - -```{code-cell} ipython3 -help(pm.callbacks.CheckParametersConvergence) -``` - -Let's use the default arguments for `CheckParametersConvergence` as they seem to be reasonable. - -```{code-cell} ipython3 -from pymc3.variational.callbacks import CheckParametersConvergence - -with model: - mean_field = pm.fit(method="advi", callbacks=[CheckParametersConvergence()]) -``` - -We can access inference history via `.hist` attribute. - -```{code-cell} ipython3 -plt.plot(mean_field.hist); -``` - -This is not a good convergence plot, despite the fact that we ran many iterations. The reason is that the mean of the ADVI approximation is close to zero, and therefore taking the relative difference (the default method) is unstable for checking convergence. - -```{code-cell} ipython3 -with model: - mean_field = pm.fit( - method="advi", callbacks=[pm.callbacks.CheckParametersConvergence(diff="absolute")] - ) -``` - -```{code-cell} ipython3 -plt.plot(mean_field.hist); -``` - -That's much better! We've reached convergence after less than 5000 iterations. - -+++ - -## Tracking parameters - -+++ - -Another useful callback allows users to track parameters. It allows for the tracking of arbitrary statistics during inference, though it can be memory-hungry. Using the `fit` function, we do not have direct access to the approximation before inference. However, tracking parameters requires access to the approximation. We can get around this constraint by using the object-oriented (OO) API for inference. - -```{code-cell} ipython3 -with model: - advi = pm.ADVI() -``` - -```{code-cell} ipython3 -advi.approx -``` - -Different approximations have different hyperparameters. In mean-field ADVI, we have $\rho$ and $\mu$ (inspired by [Bayes by BackProp](https://arxiv.org/abs/1505.05424)). - -```{code-cell} ipython3 -advi.approx.shared_params -``` - -There are convenient shortcuts to relevant statistics associated with the approximation. This can be useful, for example, when specifying a mass matrix for NUTS sampling: - -```{code-cell} ipython3 -advi.approx.mean.eval(), advi.approx.std.eval() -``` - -We can roll these statistics into the `Tracker` callback. 
- -```{code-cell} ipython3 -tracker = pm.callbacks.Tracker( - mean=advi.approx.mean.eval, # callable that returns mean - std=advi.approx.std.eval, # callable that returns std -) -``` - -Now, calling `advi.fit` will record the mean and standard deviation of the approximation as it runs. - -```{code-cell} ipython3 -approx = advi.fit(20000, callbacks=[tracker]) -``` - -We can now plot both the evidence lower bound and parameter traces: - -```{code-cell} ipython3 -fig = plt.figure(figsize=(16, 9)) -mu_ax = fig.add_subplot(221) -std_ax = fig.add_subplot(222) -hist_ax = fig.add_subplot(212) -mu_ax.plot(tracker["mean"]) -mu_ax.set_title("Mean track") -std_ax.plot(tracker["std"]) -std_ax.set_title("Std track") -hist_ax.plot(advi.hist) -hist_ax.set_title("Negative ELBO track"); -``` - -Notice that there are convergence issues with the mean, and that lack of convergence does not seem to change the ELBO trajectory significantly. As we are using the OO API, we can run the approximation longer until convergence is achieved. - -```{code-cell} ipython3 -advi.refine(100000) -``` - -Let's take a look: - -```{code-cell} ipython3 -fig = plt.figure(figsize=(16, 9)) -mu_ax = fig.add_subplot(221) -std_ax = fig.add_subplot(222) -hist_ax = fig.add_subplot(212) -mu_ax.plot(tracker["mean"]) -mu_ax.set_title("Mean track") -std_ax.plot(tracker["std"]) -std_ax.set_title("Std track") -hist_ax.plot(advi.hist) -hist_ax.set_title("Negative ELBO track"); -``` - -We still see evidence for lack of convergence, as the mean has devolved into a random walk. This could be the result of choosing a poor algorithm for inference. At any rate, it is unstable and can produce very different results even using different random seeds. - -Let's compare results with the NUTS output: - -```{code-cell} ipython3 -import seaborn as sns - -ax = sns.kdeplot(trace["x"], label="NUTS") -sns.kdeplot(approx.sample(10000)["x"], label="ADVI"); -``` - -Again, we see that ADVI is not able to cope with multimodality; we can instead use SVGD, which generates an approximation based on a large number of particles. - -```{code-cell} ipython3 -with model: - svgd_approx = pm.fit( - 300, - method="svgd", - inf_kwargs=dict(n_particles=1000), - obj_optimizer=pm.sgd(learning_rate=0.01), - ) -``` - -```{code-cell} ipython3 -ax = sns.kdeplot(trace["x"], label="NUTS") -sns.kdeplot(approx.sample(10000)["x"], label="ADVI") -sns.kdeplot(svgd_approx.sample(2000)["x"], label="SVGD"); -``` - -That did the trick, as we now have a multimodal approximation using SVGD. - -With this, it is possible to calculate arbitrary functions of the parameters with this variational approximation. For example we can calculate $x^2$ and $sin(x)$, as with the NUTS model. - -```{code-cell} ipython3 -# recall x ~ NormalMixture -a = x**2 -b = pm.math.sin(x) -``` - -To evaluate these expressions with the approximation, we need `approx.sample_node`. - -```{code-cell} ipython3 -help(svgd_approx.sample_node) -``` - -```{code-cell} ipython3 -a_sample = svgd_approx.sample_node(a) -a_sample.eval() -``` - -```{code-cell} ipython3 -a_sample.eval() -``` - -```{code-cell} ipython3 -a_sample.eval() -``` - -Every call yields a different value from the same theano node. This is because it is **stochastic**. - -By applying replacements, we are now free of the dependence on the PyMC3 model; instead, we now depend on the approximation. 
Changing it will change the distribution for stochastic nodes: - -```{code-cell} ipython3 -sns.kdeplot(np.array([a_sample.eval() for _ in range(2000)])) -plt.title("$x^2$ distribution"); -``` - -There is a more convenient way to get lots of samples at once: `sample_node` - -```{code-cell} ipython3 -a_samples = svgd_approx.sample_node(a, size=1000) -``` - -```{code-cell} ipython3 -sns.kdeplot(a_samples.eval()) -plt.title("$x^2$ distribution"); -``` - -The `sample_node` function includes an additional dimension, so taking expectations or calculating variance is specified by `axis=0`. - -```{code-cell} ipython3 -a_samples.var(0).eval() # variance -``` - -```{code-cell} ipython3 -a_samples.mean(0).eval() # mean -``` - -A symbolic sample size can also be specified: - -```{code-cell} ipython3 -i = theano.tensor.iscalar("i") -i.tag.test_value = 1 -a_samples_i = svgd_approx.sample_node(a, size=i) -``` - -```{code-cell} ipython3 -a_samples_i.eval({i: 100}).shape -``` - -```{code-cell} ipython3 -a_samples_i.eval({i: 10000}).shape -``` - -Unfortunately the size must be a scalar value. - -+++ - -### Converting a Trace to an Approximation - -We can convert a MCMC trace into an Approximation. It will have the same API as approximations above with same `sample_node` methods: - -```{code-cell} ipython3 -trace_approx = pm.Empirical(trace, model=model) -trace_approx -``` - -We can then draw samples from the `Emipirical` object: - -```{code-cell} ipython3 -az.plot_posterior(trace_approx.sample(10000)); -``` - -## Multilabel logistic regression - -Let's illustrate the use of `Tracker` with the famous Iris dataset. We'll attempy multi-label classification and compute the expected accuracy score as a diagnostic. - -```{code-cell} ipython3 -import pandas as pd -import theano.tensor as tt - -from sklearn.datasets import load_iris -from sklearn.model_selection import train_test_split - -X, y = load_iris(True) -X_train, X_test, y_train, y_test = train_test_split(X, y) -``` - -![](http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png) - -+++ - -A relatively simple model will be sufficient here because the classes are roughly linearly separable; we are going to fit multinomial logistic regression. - -```{code-cell} ipython3 -Xt = theano.shared(X_train) -yt = theano.shared(y_train) - -with pm.Model() as iris_model: - - # Coefficients for features - β = pm.Normal("β", 0, sigma=1e2, shape=(4, 3)) - # Transoform to unit interval - a = pm.Flat("a", shape=(3,)) - p = tt.nnet.softmax(Xt.dot(β) + a) - - observed = pm.Categorical("obs", p=p, observed=yt) -``` - -### Applying replacements in practice -PyMC3 models have symbolic inputs for latent variables. To evaluate an espression that requires knowledge of latent variables, one needs to provide fixed values. We can use values approximated by VI for this purpose. The function `sample_node` removes the symbolic dependenices. - -`sample_node` will use the whole distribution at each step, so we will use it here. We can apply more replacements in single function call using the `more_replacements` keyword argument in both replacement functions. 
- -> **HINT:** You can use `more_replacements` argument when calling `fit` too: -> * `pm.fit(more_replacements={full_data: minibatch_data})` -> * `inference.fit(more_replacements={full_data: minibatch_data})` - -```{code-cell} ipython3 -with iris_model: - - # We'll use SVGD - inference = pm.SVGD(n_particles=500, jitter=1) - - # Local reference to approximation - approx = inference.approx - - # Here we need `more_replacements` to change train_set to test_set - test_probs = approx.sample_node(p, more_replacements={Xt: X_test}, size=100) - - # For train set no more replacements needed - train_probs = approx.sample_node(p) -``` - -By applying the code above, we now have 100 sampled probabilities (default number for `sample_node` is `None`) for each observation. - -+++ - -Next we create symbolic expressions for sampled accuracy scores: - -```{code-cell} ipython3 -test_ok = tt.eq(test_probs.argmax(-1), y_test) -train_ok = tt.eq(train_probs.argmax(-1), y_train) -test_accuracy = test_ok.mean(-1) -train_accuracy = train_ok.mean(-1) -``` - -Tracker expects callables so we can pass `.eval` method of theano node that is function itself. - -Calls to this function are cached so they can be reused. - -```{code-cell} ipython3 -eval_tracker = pm.callbacks.Tracker( - test_accuracy=test_accuracy.eval, train_accuracy=train_accuracy.eval -) -``` - -```{code-cell} ipython3 -inference.fit(100, callbacks=[eval_tracker]); -``` - -```{code-cell} ipython3 -_, ax = plt.subplots(1, 1) -df = pd.DataFrame(eval_tracker["test_accuracy"]).T.melt() -sns.lineplot(x="variable", y="value", data=df, color="red", ax=ax) -ax.plot(eval_tracker["train_accuracy"], color="blue") -ax.set_xlabel("epoch") -plt.legend(["test_accuracy", "train_accuracy"]) -plt.title("Training Progress"); -``` - -Training does not seem to be working here. Let's use a different optimizer and boost the learning rate. - -```{code-cell} ipython3 -inference.fit(400, obj_optimizer=pm.adamax(learning_rate=0.1), callbacks=[eval_tracker]); -``` - -```{code-cell} ipython3 -_, ax = plt.subplots(1, 1) -df = pd.DataFrame(np.asarray(eval_tracker["test_accuracy"])).T.melt() -sns.lineplot(x="variable", y="value", data=df, color="red", ax=ax) -ax.plot(eval_tracker["train_accuracy"], color="blue") -ax.set_xlabel("epoch") -plt.legend(["test_accuracy", "train_accuracy"]) -plt.title("Training Progress"); -``` - -This is much better! - -So, `Tracker` allows us to monitor our approximation and choose good training schedule. - -+++ - -## Minibatches -When dealing with large datasets, using minibatch training can drastically speed up and improve approximation performance. Large datasets impose a hefty cost on the computation of gradients. - -There is a nice API in pymc3 to handle these cases, which is available through the `pm.Minibatch` class. The minibatch is just a highly specialized Theano tensor: - -```{code-cell} ipython3 -issubclass(pm.Minibatch, theano.tensor.TensorVariable) -``` - -To demonstrate, let's simulate a large quantity of data: - -```{code-cell} ipython3 -# Raw values -data = np.random.rand(40000, 100) -# Scaled values -data *= np.random.randint(1, 10, size=(100,)) -# Shifted values -data += np.random.rand(100) * 10 -``` - -For comparison, let's fit a model without minibatch processing: - -```{code-cell} ipython3 -with pm.Model() as model: - mu = pm.Flat("mu", shape=(100,)) - sd = pm.HalfNormal("sd", shape=(100,)) - lik = pm.Normal("lik", mu, sd, observed=data) -``` - -Just for fun, let's create a custom special purpose callback to halt slow optimization. 
Here we define a callback that causes a hard stop when approximation runs too slowly: - -```{code-cell} ipython3 -def stop_after_10(approx, loss_history, i): - if (i > 0) and (i % 10) == 0: - raise StopIteration("I was slow, sorry") -``` - -```{code-cell} ipython3 -with model: - advifit = pm.fit(callbacks=[stop_after_10]) -``` - -Inference is too slow, taking several seconds per iteration; fitting the approximation would have taken hours! - -Now let's use minibatches. At every iteration, we will draw 500 random values: - -> Remember to set `total_size` in observed - -**total_size** is an important parameter that allows pymc3 to infer the right way of rescaling densities. If it is not set, you are likely to get completely wrong results. For more information please refer to the comprehensive documentation of `pm.Minibatch`. - -```{code-cell} ipython3 -X = pm.Minibatch(data, batch_size=500) - -with pm.Model() as model: - - mu = pm.Flat("mu", shape=(100,)) - sd = pm.HalfNormal("sd", shape=(100,)) - likelihood = pm.Normal("likelihood", mu, sd, observed=X, total_size=data.shape) -``` - -```{code-cell} ipython3 -with model: - advifit = pm.fit() -``` - -```{code-cell} ipython3 -plt.plot(advifit.hist); -``` - -Minibatch inference is dramatically faster. Multidimensional minibatches may be needed for some corner cases where you do matrix factorization or model is very wide. - -Here is the docstring for `Minibatch` to illustrate how it can be customized. - -```{code-cell} ipython3 -print(pm.Minibatch.__doc__) -``` - -```{code-cell} ipython3 -%load_ext watermark -%watermark -n -u -v -iv -w -```