Violin plots #2116

Merged
merged 26 commits into from
Nov 1, 2017
Contributor

@etpinard etpinard commented Oct 24, 2017

Violin plots are coming to plotly.js 🎉 🎻 🎉

Python users can already create violins using @cldougl's create_violin figure factory (example); this PR will bring violin plots to all plotly.js consumers with an API very similar to the box trace type.


IMPORTANT: After the first push of 2017/10/24, this PR is very much a WIP. Several API decisions remain to be made. See the first few comments on commit 3438eae for more info.

@etpinard etpinard added this to the v1.32.0 milestone Oct 24, 2017
description: [
'Sets the bandwidth used to compute the kernel density estimate.',
'By default, the bandwidth is determined by Silverman\'s rule of thumb.'
].join(' ')
Contributor Author

@etpinard etpinard Oct 24, 2017

There is no hard default for bandwidth. This rule of thumb, which depends on the sample length and standard deviation, seems pretty popular in other libraries.
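
As a rough illustration of a rule of thumb that depends only on sample length and standard deviation, here is a minimal sketch of the simplified Silverman formula, h = 1.06 * sd * n^(-1/5). The function names are hypothetical and the constant is the textbook one; the code in this PR may use a different variant (for example one that also considers the interquartile range).

```javascript
// Sample standard deviation (n - 1 denominator).
function sampleStdDev(values) {
    var n = values.length;
    var mean = values.reduce(function(a, b) { return a + b; }, 0) / n;
    var sumSq = values.reduce(function(a, b) {
        return a + (b - mean) * (b - mean);
    }, 0);
    return Math.sqrt(sumSq / (n - 1));
}

// Simplified Silverman rule of thumb (an assumption, not the PR's code):
// h = 1.06 * sd * n^(-1/5)
function silvermanBandwidth(values) {
    return 1.06 * sampleStdDev(values) * Math.pow(values.length, -0.2);
}
```

With this form, the bandwidth shrinks slowly as the sample grows and scales linearly with the spread of the data.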

description: [
'Sets the span in data space for which the density function will be computed.',
'By default, the span goes from the minimum value to maximum value in the sample.'
].join(' ')
Contributor Author

To address @alexcjohnson's concern (from a private convo):

mmm, looking through example images I don’t actually see the behavior I was imagining… might be something to play with at some point but not for now, unless some standard existing package does it.

One thing I see a lot in examples that seems to me a bad idea is truncating the violin at a value where it has a finite width - particularly if that value is actually a data value. Unfortunately one of the examples I see that does exactly that is on our own site… https://plot.ly/python/basic-statistics/ out[5]

unless there’s some physical limit to the variable… that seems like it would justify doing this, but aren’t you still throwing away area (and therefore visual weight) from the points at the end?

at any rate, seems like if one is going to truncate like that it should be done explicitly, and should not be the default.

Ideally I’d like you to only be able to truncate by value (per the argument about physical limits) rather than “truncate at the ends of the data range” but I imagine people will complain if they can’t do both…

Contributor Author

@etpinard etpinard Oct 24, 2017

Note that span was inspired by seaborn's cut argument and ggplot's trim.

Collaborator

I love that this is two numbers (or date strings, I guess), one for each end. That's way more flexible than seaborn's one number or ggplot2's boolean. I also love that it's data values, rather than a delta from the ends of the distribution. Between the two of those you could do what seems to be really the right thing with seaborn's cut example and lop off unphysical values < 0 but not restrict the upper end.

Ideally I would prefer if the default were no clipping at all (which, as discussed, probably means data bounds extended by 2 or 3 times the bandwidth), rather than clipped at the data bounds. The challenge then is to make it easy for people who do want to clip tightly to the data bounds - particularly if we think about the distribution changing over time, having to manually update the bounds for either of these cases seems awkward.

I suppose we could define two special values of span[i] - one for "clip tightly to this end of the data" and another for "do not clip this end"? 'tight' and 'loose' perhaps? Alternatively there could be another attribute for this, which would be nice as we wouldn't need to mix numbers and strings (or whatever the special value is) but it seems tricky to cover all cases this way, like if you want the low end tight and the upper end loose.

Contributor Author

@etpinard etpinard Oct 26, 2017

I suppose we could define two special values of span[i]

That would be dangerous, as we want the span items to be set in d coordinates. So a violin trace with 'tight' and 'loose' categories would totally break this.

@alexcjohnson's 'tight' and 'loose' suggestions made me think of #1876 where the specs axis.bounds and axis.boundsmode are written down.

For consistency, I'm thinking about adding a spanmode attribute alongside span with possible values 'hard', 'soft', 'manual'

{
   spanmode: 'soft',
   span: [null, 10]
   // where the null means pick the 'soft' default value
   // i.e. data min minus 2 bandwidths
}

{
   spanmode: 'hard',
   span: [0, null]
   // where the null means pick the 'hard' default value
   // i.e. the data max.
}

We could also add 'tight-soft' and 'soft-tight' spanmode values to allow users to pick hard and soft defaults for each end without having to set span at all.
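
To make the proposal above concrete, here is a minimal sketch of how null entries in span could fall back to a spanmode-dependent default. The helper name `resolveSpan` and its signature are hypothetical, and treating 'manual' nulls like 'soft' is an assumption; this illustrates the proposal in this comment, not the code that was eventually merged.

```javascript
// Hypothetical helper, not plotly.js source.
// spanmode: 'soft' | 'hard' | 'manual'
// span: [low, high], where null means "use the spanmode default"
function resolveSpan(spanmode, span, dataMin, dataMax, bandwidth) {
    var softPad = 2 * bandwidth;
    var defaults = {
        // 'soft': extend the data bounds by two bandwidths
        soft: [dataMin - softPad, dataMax + softPad],
        // 'hard': clip exactly at the data bounds
        hard: [dataMin, dataMax]
    };
    // Assumption: 'manual' (or anything unknown) falls back to 'soft'
    var fallback = defaults[spanmode] || defaults.soft;
    var userSpan = Array.isArray(span) ? span : [null, null];

    return [0, 1].map(function(i) {
        var v = userSpan[i];
        return (v === null || v === undefined) ? fallback[i] : v;
    });
}
```

For example, with data in [1, 8] and a bandwidth of 0.5, `resolveSpan('soft', [null, 10], 1, 8, 0.5)` picks the soft lower bound (data min minus 2 bandwidths) and keeps the explicit upper bound.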

Contributor Author

Done in ad51966

'By default, the bandwidth is determined by Silverman\'s rule of thumb.'
].join(' ')
},
scaleby: {
Contributor Author

inspired by seaborn's scale attribute.

Contributor Author

Replaced by scalegroup (taken from pie traces) and scalemode in ad51966

'By default, the span goes from the minimum value to maximum value in the sample.'
].join(' ')
},
side: {
Contributor Author

To remake seaborn's:

[image]

Collaborator

nice! also 'top' and 'bottom' for horizontal violins.

Contributor Author

As of ad51966, side is now an enumerated attribute with values 'both', 'negative' and 'positive', so that the same set of values works with both horizontal and vertical violins.
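
As an illustration of the side values described above, here is a minimal sketch of two one-sided violin traces overlaid into a single split violin. The sample values, the `halfViolin` helper and the `'graph'` element id are made up for the example; `side` and `violinmode` are the attributes discussed in this PR.

```javascript
// Made-up sample data for two groups.
var groupA = [4.1, 4.8, 5.2, 5.9, 6.3];
var groupB = [3.2, 4.0, 4.4, 5.1, 5.5];

// Hypothetical helper that builds a one-sided violin trace.
function halfViolin(name, values, side) {
    return {
        type: 'violin',
        name: name,
        // Put both traces at the same position so the halves line up.
        x: values.map(function() { return 'sample'; }),
        y: values,
        side: side // 'both', 'negative' or 'positive'
    };
}

var data = [
    halfViolin('A', groupA, 'negative'),
    halfViolin('B', groupB, 'positive')
];

// Overlay the traces instead of placing them side by side.
var layout = { violinmode: 'overlay' };

// In a page that loads a plotly.js build with violin support:
// Plotly.newPlot('graph', data, layout);
```

Because the values are 'negative' and 'positive' rather than 'left' and 'right', the same two traces should work unchanged if the violins are drawn horizontally.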

@@ -161,7 +161,8 @@ function plotPoints(sel, plotinfo, trace, t) {
     var bdPos = t.bdPos;
     var bPos = t.bPos;

-    var mode = trace.boxpoints;
+    // TODO ... unfortunately
+    var mode = trace.boxpoints || trace.points;
Contributor Author

We might want to rename boxpoints -> points as well as boxmean (maybe to showmean + showsd ?)

Contributor Author

I thought about combining the display attributes for the box, mean line and standard deviation line inside the violins into one flaglist attribute, e.g. stats: 'box+mean+sd' or innermode: /* */, but considering #1774 it might be best to have separate booleans, e.g. showbox, showmean, showsd, to keep things consistent with features like axis lines and grids.

So, unless someone opposes:

showmeanline: true,
meanlinecolor: 'blue',
meanlinewidth: 1,
meanlinedash: 'dot',

showbox: true,
boxlinecolor: 'black',
boxlinewidth: 2,
boxfillcolor: 'red'

Contributor Author

@etpinard etpinard Oct 31, 2017

Done in 6ffc379 with the help of Lib.findPointOnPath (for meanline) added in 4a40fc7

basePlotModule: require('../../plots/cartesian'),
// TODO
// - should maybe rename 'box' category to something more general
categories: ['cartesian', 'symbols', 'oriented', 'box', 'showLegend'],
Contributor Author

this box category is currently used:

image

Contributor Author

... which for the most part are common places for violin traces too.

Contributor Author

Replaced by box-violin in ea43b25

module.exports = {
violinmode: boxLayoutAttrs.boxmode,
violingap: boxLayoutAttrs.boxgap,
violingroupgap: boxLayoutAttrs.boxgroupgap
Contributor Author

Important: Should box and violin share the same gap, groupgap and even mode attributes?

In other words, should violin be thought of as a different mode for box?

Member

I could see users wanting to use the gap attributes on violins the same way they would with boxes. imo it seems pretty intuitive for these charts to behave the same way and have the same grouping attributes available.

Contributor Author

Violin-specific violinmode, violingap and violingroupgap are implemented in 7769f20


var kernels = {
gaussian: function(v) {
return (1 / Math.sqrt(2 * Math.PI)) * Math.exp(-0.5 * v * v);
Collaborator

Good thinking to make this extensible down the line. There would be something nice about using one of the polynomial kernels that goes smoothly to zero at a finite position but to start just gaussian seems fine. That's the ggplot2 default anyway, and I can't find anything to say which one seaborn uses.

Contributor Author

I added the epanechnikov kernel in ad51966

Contributor Author

... and of course, I could add a few more if desired. Note that other kernels make the violins look a little less smooth than the gaussian. Perhaps this is why other libraries (e.g. seaborn and ggplot) only use the gaussian kernel for violin plots 🤔
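
For readers unfamiliar with the two kernels mentioned in this thread, here is a minimal sketch of a kernel density estimator built around a `kernels` map like the one in the diff above. The gaussian entry copies the formula from the diff; the epanechnikov formula is the standard textbook definition, and `makeKDE` is a hypothetical helper, so none of this should be read as the exact code in the PR.

```javascript
var kernels = {
    // Same formula as in the diff above.
    gaussian: function(v) {
        return (1 / Math.sqrt(2 * Math.PI)) * Math.exp(-0.5 * v * v);
    },
    // Textbook Epanechnikov kernel: goes to exactly zero for |v| > 1.
    epanechnikov: function(v) {
        return Math.abs(v) <= 1 ? 0.75 * (1 - v * v) : 0;
    }
};

// Hypothetical helper: returns a function that estimates the density at x,
// f(x) = (1 / (n * h)) * sum over i of K((x - x_i) / h)
function makeKDE(sample, kernel, bandwidth) {
    var factor = 1 / (sample.length * bandwidth);
    return function(x) {
        var sum = 0;
        for(var i = 0; i < sample.length; i++) {
            sum += kernel((x - sample[i]) / bandwidth);
        }
        return factor * sum;
    };
}
```

The Epanechnikov kernel has finite support, which is the "goes smoothly to zero at a finite position" property mentioned earlier in the thread, while the gaussian kernel never reaches zero.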

traceLayerClasses: [
'imagelayer',
'maplayer',
'barlayer',
'carpetlayer',
'violinlayer',
Contributor Author

Making sure violins are under boxes always.

setPositions: require('../box/set_positions'),
plot: require('./plot'),
style: require('./style'),
hoverPoints: require('../box/hover'),
Contributor Author

Should violin show hover labels about the kernel density curve?

Collaborator

It's tricky to figure out what someone would want to see labeled on the curve. The only thing I can think of that might be cool is a label that moves continuously (along the distribution axis) with the mouse, so you can look at a peak or a valley or something and see exactly what data value it's at... would help you read quantitative differences off several violins. Would a label like that get any value reported for the density? It wouldn't mean much on its own so could be omitted, though it would have meaning relative to other such values on the same or different violins.

Contributor Author

Here's what I came up with:

peek 2017-10-27 14-36

Collaborator

In my excitement at how beautiful this effect is, I missed an important piece: you should see both the kde and the y value in that label, so you can use it to read out the exact peak/valley locations for example. Should show just enough digits that each pixel is a different y value.
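
One way to read "just enough digits that each pixel is a different y value" is to derive the number of decimals from how many data units one pixel covers. The sketch below is only an illustration of that idea; both function names are hypothetical, it assumes a linear axis with a non-zero range and pixel length, and it is not necessarily how the hover label was implemented.

```javascript
// Hypothetical helper: number of decimals needed so that two adjacent
// pixels along the axis get different labels.
function decimalsPerPixel(rangeMin, rangeMax, lengthPx) {
    var unitsPerPx = Math.abs(rangeMax - rangeMin) / lengthPx;
    return Math.max(0, Math.ceil(-Math.log10(unitsPerPx)));
}

// Hypothetical helper: format a hovered data value with that precision.
function formatHoverValue(value, rangeMin, rangeMax, lengthPx) {
    return value.toFixed(decimalsPerPixel(rangeMin, rangeMax, lengthPx));
}
```

For an axis spanning 0 to 10 over 500 pixels, one pixel covers 0.02 data units, so two decimals are needed; for an axis spanning 0 to 1000 over the same length, whole numbers are enough.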

Contributor Author

@etpinard etpinard Oct 31, 2017

Implemented in bc6bc02 with the help of Lib.findPointOnPath added in 4a40fc7

Contributor Author

... now looks like:

peek 2017-10-30 23-38

module.exports = {
supplyDefaults: supplyDefaults,
handleSampleDefaults: handleSampleDefaults,
handlePointsDefaults: handlePointsDefaults
Contributor Author

I made a separate reorganisation commit, to show a new way to factor out common trace module blocks. I think this method is a little more consistent with ES6 modules. For example here, supplyDefaults would be the default export.

description: [
'Determines which side of the position line the density function making up',
'one half of a is plotting.',
'Useful when comparing two violin traces under *overlay* mode, where one trace.'
Collaborator

where one trace what?

Contributor Author

Improved description in ad51966

editType: 'style',
description: 'Sets the width (in px) of line bounding the violin(s).'
},
smoothing: scatterAttrs.line.smoothing,
Collaborator

Is smoothing needed? Seems weird, especially since we're just kind of making up points for the path.

Contributor Author

line.smoothing was 🔪 in ad51966

x: d.pos + bPos - (density[i].v / scale),
y: density[i].t
});
}
Collaborator

If the ends are cut off (and even if they aren't, if using the gaussian kernel that doesn't really go to zero) we should make two separate smoothed lines and connect them with straight segments, rather than one long curved line. Should be able to do this just with Drawing.smoothopen, tweaking the first and/or last characters, and concatenating them.
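
The shape of that suggestion can be sketched with plain SVG path strings: build each half of the violin as its own open line, then join the two halves with straight `L` segments and close the path. The helper below is hypothetical and deliberately leaves out the curve smoothing that Drawing.smoothopen would apply to each half; it also drops the `bPos` offset from the diff above for brevity.

```javascript
// Hypothetical helper, not plotly.js source.
// density: array of {t: position along the distribution axis, v: density value}
// pos: position of the violin's center line; scale: density-to-width divisor
function violinOutline(density, pos, scale) {
    // Negative-side half, traced from the low end to the high end.
    var left = density.map(function(d) {
        return (pos - d.v / scale) + ',' + d.t;
    });
    // Positive-side half, traced back from the high end to the low end.
    var right = density.map(function(d) {
        return (pos + d.v / scale) + ',' + d.t;
    }).reverse();

    // The 'L' between the halves and the final 'Z' are the straight
    // segments across the (possibly truncated) ends of the violin.
    return 'M' + left.join('L') + 'L' + right.join('L') + 'Z';
}
```

When the span cuts the violin off at a finite width, the connecting segments show up as flat ends instead of being rounded off by a single smoothed curve.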

Contributor Author

Better drawing algo in ad51966

- to make them reusable in violin/plot.js
- add support for asymmetric bdPos to support one-sided violins
- totally independent from their box* counterpart
- add 'kernel' enumerated
- implement 'scalegroup' & 'scalemode' (instead of 'scaleby')
- implement 'spanmode' & 'span'
- make 'side' an enumerated w/ vals 'both', 'negative' and 'positive'
  to be general enough for horizontal and vertical violin
- 🔪 'line.smoothing'
- to be used to find pt on violin bezier curves
- i.e. showinnerbox & showmeanline and friends.
- in preparation for violin 'kde' hover handling
- with three flags: 'violins', 'points' (similar to box traces)
  and 'kde' which shows the point on the kde line along
  with the line that crosses the hovered-on violin
@etpinard
Contributor Author

etpinard commented Oct 31, 2017

@alexcjohnson tagging this thing as reviewable. 🏁

I still need to fill in a few descriptions ✏️ , add a few tests (especially for one-sided violins) 🔒 and I think something is off in the way I'm computing scalegroup fields 💻 , but oh well, this PR should be in a good enough state to start reviewing 🔎

- as non-gaussian kernels may require us to make soft spanmode bounds
  kernel-dependent, which will require some trial-and-error
- as most other libraries don't support non-gaussian kernels, let's
  defer this.
- use nested attribute style for box and meanline settings
- update test and mocks
- that 🔒 custom bandwidth and some box style settings.
@alexcjohnson
Collaborator

Beautiful! 🎵 Now we really have some music for our 💃 !

@etpinard etpinard merged commit 81ddaca into master Nov 1, 2017
@etpinard etpinard deleted the violins-dev branch November 1, 2017 17:39
@etpinard etpinard mentioned this pull request Jan 30, 2018
Labels
feature something new
5 participants