Skip to content

Add sample plot to Interpolated Distribution docs #3859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
canyon289 opened this issue Mar 29, 2020 · 32 comments
Closed

Add sample plot to Interpolated Distribution docs #3859

canyon289 opened this issue Mar 29, 2020 · 32 comments

Comments

@canyon289
Copy link
Member

Description of your problem

Interpolated Docs are missing sample plot. One should be added
https://docs.pymc.io/api/distributions/continuous.html#pymc3.distributions.continuous.Interpolated

image

Please provide any additional information below.
See example from Normal plot for how how this can be done
https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L433

@Dpananos
Copy link
Contributor

Dpananos commented Apr 1, 2020

This looks fun, I'll take it on. Looks like this is meant to interpolate a set of points on a grid and constrict a distribution out of those points. The other plots show distributions under various parameterizations, which I don't think this is amenable to.

Would it be sufficient to construct a lattice, sample from it, and then plot a histogram and lattice points? Here is an example of what I'm thinking

image

@canyon289
Copy link
Member Author

Sounds good to me :)

@Rishabh261998
Copy link

Hi, can I take this issue? Thanks

@mjhajharia
Copy link
Member

so, im a little confused about something, the input into the interpolated function - lattice points and their pdf, are those supposed to be sampled from a pymc3 distribution(intentionally picking them from uneven intervals), or they can be any set of monotonically increasing random points. also i'm still trying to make sense of what would pdf at a point mean, as in wouldnt probability distribution at that point be a better measure. i might be absolutely wrong, but i guess probability density function is sort of measured around a point right, so maybe distribution would make more sense, its just a lil ambiguous in my head. also is it any different from the scipy interpolation thing, or is it the same thing.

@mjhajharia
Copy link
Member

so, im a little confused about something, the input into the interpolated function - lattice points and their pdf, are those supposed to be sampled from a pymc3 distribution(intentionally picking them from uneven intervals), or they can be any set of monotonically increasing random points. also i'm still trying to make sense of what would pdf at a point mean, as in wouldnt probability distribution at that point be a better measure. i might be absolutely wrong, but i guess probability density function is sort of measured around a point right, so maybe distribution would make more sense, its just a lil ambiguous in my head. also is it any different from the scipy interpolation thing, or is it the same thing.

okay got the density thing, that was my bad. still confused about some of it

@canyon289
Copy link
Member Author

@Rishabh261998 Since you were first you're welcome to work on this, although it looks like @almostmeenal is interested as well.

If you're still interested could you comment in the next 3 days indicating your interest

@mjhajharia
Copy link
Member

I couldn't figure out a way to plot points which aren't a part of a pymc3 distribution in arviz, so i just did matplotlib to get an idea, on a small sample(100) , it was terrible lol, so I'm still wondering if it makes sense to sample a pymc3 distribution randomly as input for the interpolation or not, because it needs a static input.

so, im a little confused about something, the input into the interpolated function - lattice points and their pdf, are those supposed to be sampled from a pymc3 distribution(intentionally picking them from uneven intervals), or they can be any set of monotonically increasing random points. also i'm still trying to make sense of what would pdf at a point mean, as in wouldnt probability distribution at that point be a better measure. i might be absolutely wrong, but i guess probability density function is sort of measured around a point right, so maybe distribution would make more sense, its just a lil ambiguous in my head. also is it any different from the scipy interpolation thing, or is it the same thing.

okay got the density thing, that was my bad. still confused about some of it

@OriolAbril
Copy link
Member

OriolAbril commented Mar 17, 2021

I think this example plot needs to be significantly more complicated than the plots for other distributions.

The plots for all the other distributions show the pdf (probability density function) for several values of the input parameters. The plot for the normal distribution for example shows the pdf of 4 different normal distributions, each with different mean and standard deviation.

image

In the case of the Interpolated distribution, we don't have the pdf, so there is no way to reproduce these kind of plots, we have to generate a completely different plot customized to the Interpolated case. The main use of the interpolated distribution is to be able to use distributions whose pdf is not known in PyMC3 models. Instead of knowing the pdf like we do for the other cases, here we only have available some evaluations of the pdf on a grid (not necessarily uniform). With this info we can create an Interpolated distribution that will behave like this unknown distribution of which we only know the pdf at some given points.

In order to showcase this conceptual difference, I think we can follow these steps:

  1. Define a mixture pdf from two normals, which will act as our unknown pdf
  2. Evaluate this pdf on a uneven grid, this is the info that we will use to try and reproduce the original pdf with Interpolated
  3. Create the Interpolated distribution
  4. Plot something

Note that the Interpolated distribution defines the logp and the random method that should both approximate the original and unknown pdf, so I think it is a good idea to have the following elements in the plot:

  1. Original (unkown to Interpolate pdf) plotted on a fine grid (uniform grid too to make things easy) -> lineplot
  2. Known (to Interpolate) points of the pdf evaluated on the nonuniform grid (rough grid) -> scatterplot
  3. Interpolated distribution plotted on the fine grid -> lineplot
  4. Histogram of ~100/500 samples from the interpolated distribution -> histogram

The example image provided above by Dpananos has elements 2 and 4 only. We could also use ArviZ for 4 and plot a KDE instead of a histogram, but I think it is an overkill and its better to keep the example with minimal requirements.

Implementation proposals

# preprocessing: definitions taken from https://github.com/pymc-devs/pymc3/blob/master/pymc3/distributions/continuous.py#L445
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st
plt.style.use('seaborn-darkgrid')
x = np.linspace(-5, 5, 1000)

# step 1: define mixture
# use scipy.stats (like in other pdf plots) to define this unkown pdf
# I propose this two normals of different loc and scale
original_pdf = lambda x: st.norm.pdf(x, -2, 1) / 2 + st.norm.pdf(x, 3, .6) / 2

# step 2: evaluation of rough grid
# as the normals have different scale, we "need" more points on >0 because changes are steeper
# so we use a non uniform grid
xgrid = np.concatenate((np.linspace(-5 ,0, 10, endpoint=False), np.linspace(0, 5, 20)))
# evaluate pdf on xgrid

# step 3: create Interpolated
# note this is outside a model, so you'll need to use pm.Interpolated.dist(xgrid, pdf_points), not pm.Interpolated(...)
# evaluate the interp_dist at x array, you can use `.logp(x).eval()` to get the log, then exponentiate before plotting
# generate random samples from interp_dist

# step 4: plot
# add the 4 plots, with a legend, x/ylabel...

@mjhajharia
Copy link
Member

In the case of the Interpolated distribution, we don't have the pdf, so there is no way to reproduce these kind of plots, we have to generate a completely different plot customized to the Interpolated case. The main use of the interpolated distribution is to be able to use distributions whose pdf is not known in PyMC3 models. Instead of knowing the pdf like we do for the other cases, here we only have available some evaluations of the pdf on a grid (not necessarily uniform). With this info we can create an Interpolated distribution that will behave like this unknown distribution of which we only know the pdf at some given points.

okay yeah that makes sense, so essentially its line fitting, except the linear limitation is removed, I'll see if I can get it done today, also I guess it's cool if I play around with the input as long as the final figure makes sense? I mean I was trying random stuff and often the interpolation wasn't that good for some randomized points, so yeah.

@OriolAbril
Copy link
Member

I guess it's cool if I play around with the input as long as the final figure makes sense?

Play with it as much as you want, I added this toy example because I figured it would help in making the comment and idea on how to move forward clear, not as the final example to be used in the docs

@mjhajharia
Copy link
Member

mjhajharia commented Mar 21, 2021

hi, does this work? I'm not sure if i did it right, also should I adhere to the color theme in previous plots?
image

here's the code:

x = np.linspace(gamma.ppf(0.01, a),gamma.ppf(0.99, a), 10000)
rv = gamma(a)
dictionary = dict(zip(x, rv.pdf(x)))
sampled_dict = {k: v for k, v in random.sample(dictionary.items(),50)}
input_dict = OrderedDict(sorted(sampled_dict.items()))
points = np.array(list(input_dict.keys()))
pdf = np.array(list(input_dict.values()))
interpolated = pm.Interpolated.dist(points, pdf)

plt.style.use('seaborn-darkgrid')
fig, ax = plt.subplots(1, 1)
ax.plot(x, rv.pdf(x), color='deepskyblue', linestyle = '--', lw=4, label='Original Gamma pdf',alpha=0.8)
ax.plot(points, pdf, color='black', marker='o', label='Lattice Points',alpha=0.5,linestyle='')
ax.plot(interpolated.x_points, interpolated.pdf_points,color='forestgreen',lw=4,label='Interpolated pdf',alpha=0.8,)
r = interpolated.random(size=1000)
ax.hist(r, density=True, histtype='stepfilled', alpha=0.4,bins = 25,align ='right',color='grey')
ax.legend(loc='best', frameon=False)
plt.show()

@mjhajharia
Copy link
Member

like ofc the plot isnt very nice and I think playing with other distributions would be better, I'll fine tune the plot later, but as a rough idea of what we're trying to do does this make sense

@OriolAbril
Copy link
Member

This looks good, I will add some comments here, but I'd encourage you to already submit a PR with these changes as it will be better to review and comment on specific parts of the code. You can add [WIP] or use github to mark it as draft and work at your own pace until it's finished and can be merged.

x = np.linspace(gamma.ppf(0.01, a),gamma.ppf(0.99, a), 10000)
rv = gamma(a)

This is great, using the ppf is clever to get the right limits for the linspace. I only have two very minor comments. First, using 10000 points is probably a bit too much, all the other plots use 1000 and do already look perfectly smooth. Second, I'd define rv before x and use rv.ppf instead of generating a new gamma with parameter a every time.

dictionary = dict(zip(x, rv.pdf(x)))
sampled_dict = {k: v for k, v in random.sample(dictionary.items(),50)}
input_dict = OrderedDict(sorted(sampled_dict.items()))
points = np.array(list(input_dict.keys()))
pdf = np.array(list(input_dict.values()))

I strongly discourage using a dictionary for this. All the code above can be reduced to two steps: defining the rough grid points and evaluating the pdf on these points to create pdf array, like so:

points = np.random.choice(x, size=50, replace=False)
pdf = rv.pdf(points)

Moreover, I'd personally not use a random grid, here is why. It is true that Interpolated accepts nonuniform grids, but it is also doing a linear interpolation between the points of the grid, with random points, these can be extremely close (as it happens a few times by seeing that there are black dots even with the alpha set) and therefore, interpolating between them doesn't really make sense. If this were a real case and we had no other option than to use a random grid, then I'd say go for it, but as this is the official docummentation and there is only one example, I think that the example should illustrate best practices. Otherwise, I fear users will use random grids as defaults because it's what they have seen in the docs, even when a uniform grid could be used and do a much better job.

ax.plot(interpolated.x_points, interpolated.pdf_points,color='forestgreen',lw=4,label='Interpolated pdf',alpha=0.8,)

The values used for plotting should not be x_points and pdf_points, we have already plotted them in the previous line. here we should be evaluating the pdf of the interpolated distribution on all the points in x. As I commented in the sample code above, you can do this with logpdf(x).eval():

ax.plot(x, np.exp(interpolated.logpdf(x).eval()), ...)

As for general aesthetics of the plot, I think it would be great to try to keep the style similar to the other plots of the section. The black dots and grey histogram are already great as is, for the other two lines, I'd use C0 and C1 as colors, and I would not set the lw manually so it falls back to the style default. The dashed linestyle, alpha and so on are already great, no need to change them

@mjhajharia
Copy link
Member

so i made these changes as you suggested:
1000 to 1000 change
replaced gamma with rv
changed pdf_points to logp.eval(x)
these changes worked fine thankyou for suggesting.
I didn't change the dictionary because when i use your lines of code, I get an error because the values of x are randomized and not monotonically increasing, Interpolated needs them monotonically increasing, I used a dict to ensure that the x, pdf pairs were monotinically increasing along the dict keys (x).
I changed the colors to c1 and c2, but I feel changing the line width affects the clarity of the plot somewhat
image

@mjhajharia
Copy link
Member

it doesn't make much difference but, personally I find this is way more visually interpretable. Also, I don't know if its my place to say this, but I'd strongly suggest using accessible colors for the entire Library, for color blind people. This is something that has gained a lot of traction recently in the entire data visualization community and I think it's a step forward towards inclusivity, I'd be more than happy to do it myself for all the plots, it would take time but its worth it in my honest opinion. Thanks for the suggestions again ;)
image

@OriolAbril
Copy link
Member

OriolAbril commented Mar 22, 2021

I didn't change the dictionary because when i use your lines of code, I get an error because the values of x are randomized and not monotonically increasing, Interpolated needs them monotonically increasing, I used a dict to ensure that the x, pdf pairs were monotinically increasing along the dict keys (x).

Try using points = np.random.choice(x, size=50, replace=False).sort() then. But again, personally I would not use a random grid. In the 2nd image you shared, already with some of the changes, there are many points >6 whereas around 1 there is quite a big gap that ends up with a significant portion of the pdf being linear and not doing a very good job. I understand wanting to show that non-uniform grids work fine with Interpolated, but I'd do so with a non-uniform grid adapted to the pdf at hand. In this particular case for example, we could use a grid that is denser between 0-3 and has more space between points after 3.

I am also now seeing that the histogram changes significantly between runs, this could be due to having too many bins, how many bins do you get with bins="auto"?

Also, I don't know if its my place to say this, but I'd strongly suggest using accessible colors for the entire Library, for color blind people.

It is definitely your place, and I think it's a great idea. I though that seaborn-darkgrhttps://github.com/arviz-devs/arviz/blob/main/arviz/plots/styles/arviz-darkgrid.mplstyleid used a template accessible to colorblind people. If that is not the case, we should switch to arviz-darkgrid, when you submit your PR it would be great if you could switch the styles of the other plots too. We should keep using C0, C1 to keep all the plots having the same style, and by updating the style we change the color in all of them at once

@mjhajharia
Copy link
Member

Okayy, I'm totally sorry I forgot to update the grid, your point is fair, documentation should have the ideal scenario, because random grids might not (always) work. And as for color, I believe seaborn has a style - seaborn-colorblind I don't know much about arviz styles but I'll have a look, and bins default works better. also maybe its a silly question but to make a uniform grid should i pick points from x at uniform intervals or actually ensure the numerical values of x are uniformly increasing. and numpy sort won't work because then the pairwise x and pdf relation would be distorted we need to preserve that i guess, then every point in x would be mapped to a different pdf

@OriolAbril
Copy link
Member

also maybe its a silly question but to make a uniform grid should i pick points from x at uniform intervals or actually ensure the numerical values of x are uniformly increasing

Any of points = x[::step] or points = np.linspace(x[0], x[-1], n_points) will work, and points will already be ordered

and numpy sort won't work because then the pairwise x and pdf relation would be distorted we need to preserve that i guess, then every point in x would be mapped to a different pdf

When doing:

points = np.random.choice(x, size=50, replace=False).sort()
pdf = rv.pdf(points)

the x<->pdf relation is defined after sorting, each position of the pdf array will correspond to the pdf evaluated at the same position always and independently of how points is generated. The only way to break that would be modifying points once pdf has already been created.

@mjhajharia
Copy link
Member

also maybe its a silly question but to make a uniform grid should i pick points from x at uniform intervals or actually ensure the numerical values of x are uniformly increasing

Any of points = x[::step] or points = np.linspace(x[0], x[-1], n_points) will work, and points will already be ordered

and numpy sort won't work because then the pairwise x and pdf relation would be distorted we need to preserve that i guess, then every point in x would be mapped to a different pdf

When doing:

points = np.random.choice(x, size=50, replace=False).sort()
pdf = rv.pdf(points)

the x<->pdf relation is defined after sorting, each position of the pdf array will correspond to the pdf evaluated at the same position always and independently of how points is generated. The only way to break that would be modifying points once pdf has already been created.

okayy got it my bad, I'm sorry for it, I assumed we were doing that after mapping, so I'll do the uniform grid and add a PR, for now I'm using the seaborn-darkgrid, later we can change it to seaborn-colorblind maybe

@OriolAbril
Copy link
Member

And as for color, I believe seaborn has a style - seaborn-colorblind I don't know much about arviz styles but I'll have a look, and bins default works better.

arviz-darkgrid should be very similar to seaborn-darkgrid but uses a colorblind friendly pallette, I did not know about seaborn-colorblind, I guess both should work fine and will be a great improvement to the documentation. Out of curiosity, are there any sample plots to see how seaborn-colorblind looks like? Some examples of arviz-darkgrid are available here: https://arviz-devs.github.io/arviz/examples/styles.html

@mjhajharia
Copy link
Member

I tried seaborn-colorblind on this plot itself (C0 and C1) and I think it looks alright?
image

@OriolAbril
Copy link
Member

Looks great yeah, thanks

@mjhajharia
Copy link
Member

mjhajharia commented Mar 22, 2021

okay so i tried seaborn and matplotlib palettes and themes, I think using sns theme seaborn-darkgrid from matplotlib doesn't leave us an option to use darkgrid with colorblind colors
if we replace:
plt.style.use('seaborn-darkgrid')

with
sns.set_style("darkgrid")
sns.set_palette("colorblind")
we'd get desired results

@mjhajharia
Copy link
Member

but i understand that would create an additional dependency of installing seaborn, whereas now we can leverage sns functions through matplotlib itself

okay so i tried seaborn and matplotlib palettes and themes, I think using sns theme seaborn-darkgrid from matplotlib doesn't leave us an option to use darkgrid with colorblind colors
if we replace:
plt.style.use('seaborn-darkgrid')

with
sns.set_style("darkgrid")
sns.set_palette("colorblind")
we'd get desired results

@mjhajharia
Copy link
Member

I tried seaborn-colorblind on this plot itself (C0 and C1) and I think it looks alright?
image

i think the darkgrid came here from the previous compilation or something, i restarted my kernel and tried only 'seaborn-colorblind', which gave a whitegrid.

@OriolAbril
Copy link
Member

We can use

plt.style.use("seaborn-darkgrid")
plt.style.use("seaborn-colorblind")

or

import arviz as az
plt.style.use("arviz-darkgrid")

seaborn is a dependency on some example notebooks which is fine, but we should not make it a dependency to generate the api docs, I think the first alternative will achieve the same result, after all, seaborn is defining custom matplotlib styles that became popular and were then included to the main matplotlib library with the seaborn prefix, but it's not adding much functionality.

ArviZ is already a dependency on pymc3 so using it does not alter the doc generating dependencies. The example notebooks use arviz-darkgrid if I remember correctly

@mjhajharia
Copy link
Member

mjhajharia commented Mar 22, 2021

yeah example docs use arviz-darkgrid and seaborn darkgrid both, it definitely makes sense to not use sns. If you just check it once, the palette used in plt.style.use("seaborn-colorblind") is different from sns.color_palette("colorblind") , probably because sns changed their palette and matplotlib did not make the updates. I found a few options to go about:

  1. I found this colormap in matplotlib, which a number of color-blind people actually found helpful https://gist.github.com/thriveth/8560036, if we use this we don't need sns as a dependency

  2. This is a color-blind friendly matplotlib style
    plt.style.use('seaborn-darkgrid')
    plt.style.use('tableau-colorblind10')

image

  1. This is a diverging colormap, as per matplotlib doc - "diverging blue-gray-red, meant to avoid issues with 3D shading, color blindness, and ordering of colors"
    plt.set_cmap('coolwarm')

image

I'm sorry for making so many comments, I learned a bit about color palettes too today, finally I'm going for this figure, lemme know if I should use a different palette, I'll make the PR tonight, so before that it'd be great if you'd tell me which one to go for.

image

Also, when i went through plots in continuous.py , none of them imported pymc3 and instead used scipy or something, so should I also use scipy spline instead of the actual interpolated function for the plot?

@canyon289
Copy link
Member Author

canyon289 commented Mar 22, 2021

@almostmeenal Dont worry too much about the comments, my recommendation at this point is to pick what you think is best and open a PR! The reason I say this is with a PR its much easier for us to see what the specific changes and effects in a precise way, and then we can also clone the code locally and test it out ourselves. Issues comments are great for general discussion (and youve been using them fantastically) but they do leave some ambiguity.

It is also not a requirement to have a PR "perfect" before opening one, Quite often we open PRs that that are partially done to get feedback and advice from others.

So I should ask, any hesitation in opening a PR and moving the discussion there? If so just let us know and we can keep conversing here as well. Whatever works for you :)

@OriolAbril
Copy link
Member

I'm sorry for making so many comments

don't worry about that at all

I learned a bit about color palettes too today

Glad it was useful, plot styling can easily get you down an unending path of color palettes and colormaps.

This is a color-blind friendly matplotlib style
plt.style.use('seaborn-darkgrid')
plt.style.use('tableau-colorblind10')

Would there be any drawbacks to using `az.style.use("arviz-darkgrid"). After your comment on seaborn colorblind not being updated, I have seen that matplotlib styles have not been updated in ~5 years, so using ArviZ would give us better control and the ability to easily update the style if needed.

This is a diverging colormap, as per matplotlib doc - "diverging blue-gray-red, meant to avoid issues with 3D shading, color blindness, and ordering of colors"
plt.set_cmap('coolwarm')

this is not affecting the plot unluckily, the plot is using the default matplotlib palette which i think is not colorblind friendly. set_cmap is modifying the default colormap, not the default palette (or color cycle as it is generally called in matplotlib). The palette/color cycle is used for multiple lines/patches in the same axis, the colormap is used to encode a 3d dimension in a 2d plot. If you take a look at: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html you'll see the default colormap is shown in the 2nd column, whereas all the others generate objects following the palette defined by the color cycle.

I also agree with Ravin on encouraging you to open a PR, we'll review it and we can comment directly on the code, it will probably keep the conversation more structured.

@mjhajharia
Copy link
Member

It is also not a requirement to have a PR "perfect" before opening one, Quite often we open PRs that that are partially done to get feedback and advice from others.

thanks! I was skeptical of a PR because it looked far from done to me and so I was unsure.

@mjhajharia
Copy link
Member

I'm sorry for making so many comments

don't worry about that at all

I learned a bit about color palettes too today

Glad it was useful, plot styling can easily get you down an unending path of color palettes and colormaps.

This is a color-blind friendly matplotlib style
plt.style.use('seaborn-darkgrid')
plt.style.use('tableau-colorblind10')

Would there be any drawbacks to using `az.style.use("arviz-darkgrid"). After your comment on seaborn colorblind not being updated, I have seen that matplotlib styles have not been updated in ~5 years, so using ArviZ would give us better control and the ability to easily update the style if needed.

This is a diverging colormap, as per matplotlib doc - "diverging blue-gray-red, meant to avoid issues with 3D shading, color blindness, and ordering of colors"
plt.set_cmap('coolwarm')

this is not affecting the plot unluckily, the plot is using the default matplotlib palette which i think is not colorblind friendly. set_cmap is modifying the default colormap, not the default palette (or color cycle as it is generally called in matplotlib). The palette/color cycle is used for multiple lines/patches in the same axis, the colormap is used to encode a 3d dimension in a 2d plot. If you take a look at: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html you'll see the default colormap is shown in the 2nd column, whereas all the others generate objects following the palette defined by the color cycle.

I also agree with Ravin on encouraging you to open a PR, we'll review it and we can comment directly on the code, it will probably keep the conversation more structured.

okayy, yeah the whole color cycle and colormap thing was confusing, i was trying color cycles initially but that needed another import (from cycle import cycle) so i dropped that. yes, using arviz makes sense, that would mean changing inference data to arviz format right? also I've made a preliminary draft PR

@canyon289
Copy link
Member Author

Fixed via #4557

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants