On new or unified plot/trace types relating to `parcoords` and `sankey` #2229

monfera · 2018-01-02T18:02:52Z

On multidimensional explorers: made a separate issue as this topic is deep and the topic of "New charts" will likely be quite broad.

We've been planning multidimensional extensions eg. SPLOM. The above directions also make a lot of sense, and as usual there's always the tradeoff among implementation time (ultimately $), payload size and functionality. A nice option would be a kind of unification of plots, eg. parcoords and sankey, both of which are relatively compact as they do one thing without much configuration. Another option is a new plot type, or a new trace type eg. on the substrate of parcoords.

Here's a quick note on the expected challenges of integrating the new thing with either:

- parcoords had been written for the express purpose of performance, so it uses WebGL for rendering. It uses GL.LINES as that's the most compact geometry ie. fastest to do the interactions with (there were response time criteria). To make lines thicker, we'd need to convert to GL.TRIANGLES or similar, ie. 2-3x work in the vertex shader (which also does GPU crossfiltering). WebGL does have a lineWidth method but the standard permits that implementations cap line width to 1px, which browsers increasingly opted for over time... If Sankey-like splines are also needed, that's another layer of performance hit and implementational complexity. Yet it'd be possible to add an SVG or polygonal WebGL trace type for those cases where data points are not in the multiple 10k range and more diverse geometry eg. thicker lines are needed. parcoords is already multilayer, the axes are in SVG and there could be more SVG layers.
- sankey is an alternative target. After all, it internally works as a layered graph, ie. the internal representation is already close to the axis cadence of parallel coordinates and their ilk. But our implementation uses the heuristic as it is in d3-sankey (we only added support for multiedges via a PR), and it's free to arrange the edges, resulting in the observed line discontinuity that's definitely not in the style of parallel coordinates. This freedom yields a more optimal arrangement in the case of general Sankey work, ie. minimizing the line crossings, not a concern for parcoords-like work. We'd need to somehow add configuration for bypassing the heuristic or the entire d3-sankey in favor of a parcoords-like continuous line layout.
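
To illustrate the GL.LINES versus GL.TRIANGLES point above, here is a minimal geometry sketch in plain Python (not shader code, and not taken from the parcoords source). It expands one segment into the two triangles a thick line would need. The function name `segment_to_triangles` and the vertex ordering are assumptions made for illustration only.

```python
import math


def segment_to_triangles(p0, p1, width):
    """Expand the segment p0 -> p1 into two triangles forming a quad of the given width.

    Hypothetical helper: it only shows why a thick line costs more vertices
    than a 1px line primitive.
    """
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("degenerate segment")
    # Normal to the segment, scaled to half the requested width.
    nx = -dy / length * width / 2
    ny = dx / length * width / 2
    a = (x0 + nx, y0 + ny)
    b = (x0 - nx, y0 - ny)
    c = (x1 + nx, y1 + ny)
    d = (x1 - nx, y1 - ny)
    # Two triangles (a, b, c) and (c, b, d) cover the quad.
    return [a, b, c, c, b, d]


tris = segment_to_triangles((0.0, 0.0), (10.0, 0.0), 2.0)
print(len(tris), "vertices instead of 2")
```

A plain line segment needs two vertices, while this unindexed quad needs six (four with indexed drawing), which is presumably where an estimate in the 2-3x range comes from.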

There's the option that these two, and the new functionality, be unified, which is doable but sounds like even more work. Also, the chart design space is enormously large and the way plotly.js is set up, for better or worse, it's geared more for specific, configurable chart types rather than a Grammar of Graphics-like, fluid or low-level way of building up toward a desired chart type. The reason it comes up is that there are a lot of possibly useful additions and improvements that can be made to the charts we speak of. The implementationally easiest thing on the other hand is a completely new chart type but that probably adds the largest JS payload (not sure if it's a concern, @alexcjohnson or @etpinard could tell). Btw. there's a parallel sets implementation that interacts not unlike our parcoords and sankey with drag & drop: https://www.jasondavies.com/parallel-sets/ which shows line continuity, but I'm not sure if it supports multiedges.


monfera · 2018-01-02T18:05:42Z

@alexcjohnson re your question on categorical variables (all lines focusing in one point), in theory the Y screenspace value could be enhanced with random jitter on parcoords to reduce the overlap while keeping the category assignment clear, provided there are not too many categories.
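
A minimal sketch of the jitter idea, assuming each category sits at an integer position and the random offset stays inside a band narrower than the spacing between categories. The function name, the `band` parameter and the fixed seed are illustrative assumptions, not anything that exists in parcoords.

```python
import random


def jittered_positions(categories, order, band=0.6, seed=0):
    """Return one y position per row: the category index plus a bounded random offset.

    `band` is the total vertical extent used per category; keeping it below 1
    leaves a gap between neighboring categories so the assignment stays readable.
    """
    rng = random.Random(seed)
    index = {cat: i for i, cat in enumerate(order)}
    half = band / 2
    return [index[c] + rng.uniform(-half, half) for c in categories]


rows = ["a", "b", "a", "c", "b", "a"]
order = ["a", "b", "c"]
ys = jittered_positions(rows, order)
for cat, y in zip(rows, ys):
    print(cat, round(y, 3))
```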

jonmmease · 2018-01-02T19:32:04Z

Responding to @alexcjohnson's #2221 (comment) in this new thread.

> @jmmease does this seem like it would handle your use case?

The idea of mixing categorical support into parcoords via density regions is pretty interesting. I've pondered some similar ideas in the past, but I've gotten stuck on two points.

1. If the sample size is not large enough then there may be significant visual gaps between lines in a single category, which would not look as clean and pleasing as smooth sankey-style paths.
2. I worry a bit about the mixed vertical encoding between continuous and categorical variables. For continuous variables, the vertical dimension would encode a position while for the categorical variables it would encode a density. My concern is that it might be tempting to slip into interpreting the lines through the continuous variables as a density as well.

For improving the representation of categorical variables in the parcoords trace, I think I would prefer the random jitter approach @monfera mentioned. This would keep the interpretation of density consistent across continuous and categorical dimensions, and there is visualization precedent for it in things like seaborn's stripplot.

For purely categorical data, I would like something cleaner that doesn't rely on jitter.

Regarding Parallel Sets. This diagram is very close to what I'm after as it is a representation of multi-dimensional categorical data with path continuity between categories. The only issue is that the coloring of the paths between categories is linked to the first dimension (the top in the example linked above). This restriction means that you don't need to draw all of the multi-paths separately for the early dimensions, one wide path can be used instead. But it also means that you can't color paths based on external criteria as is needed to support brushing/cross-filtering.

Another way to think about what I'm proposing/developing as parcats is that it's simply a Parallel Sets diagram (flipped horizontally to look more like a traditional parcoords diagram) with more flexible path coloring. If paths are colored based on the first category, then it is a Parallel Sets diagram. But paths can also be colored based on an arbitrary color array.

Any thoughts on this @alexcjohnson @monfera?

alexcjohnson · 2018-01-02T21:09:36Z

Thanks for the detailed writeup @monfera

You're right that there are a lot of possible extensions and convergence opportunities, but before we resort to those heavier (in terms of development) solutions, let's see what we can get out of extending the existing plot types.

To me, @jmmease's problem is fairly simple, and extremely common: parcoords is basically the right data model (and sankey is not), but it doesn't show the weight of entries passing through (and connecting) categorical variables.

What I had in mind for the parcoords extension is a decidedly non-random offset for category dimensions - so that each line through the same category gets offset by a small amount from the next. My gut reaction is that we should (at least by default) keep the line width at 1px but allow the per-point offsets to scale as needed (to larger than 1px so there are gaps between the lines, OR smaller than 1px so the lines merge into a band but its width is still proportional to its weight) to fill up a pleasant fraction of the available space. Then we also would want to do some sorting to try and minimize crossings in category-category connections. I'm thinking this could be as simple as taking the middle category, sort first by that, next by its neighbors, and so on until we get to the end of category dimensions. The result would be something like this (in the >1px offset case - pardon the sketchiness, it could certainly get niceties like horizontal segments within the category bars but hopefully this gets the idea across):

Then as a second extension, we could implement fat-line rendering, possibly also with per-line thicknesses, which would end up looking exactly like parallel categories/sets. @monfera this would only be a relevant option when the total row count is low, so performance should not be an issue. It would also reduce precision when you're trying to interpret a continuous dimension, so I'd keep it optional anyway. Or perhaps fat lines for categorical dimensions, dropping to 1px for continuous? That might actually be a pleasing effect...

@jmmease perhaps what I've drawn is similar to what you had in mind with your two sticking points? (1) I think is a matter of taste, that we can deal with in various ways as I already mentioned. (2) is a good point, though with sufficient visual cues (ie boxes instead of an axis line) it feels to me like it's easy to intuitively distinguish, and the flip side is that when you're exploring via selection the category dimensions can help bring out density information pertaining to the continuous variables. My worry with using random jitter is that it still wouldn't necessarily indicate density, particularly as the sample size gets larger.
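
The non-random offset and the "middle category first, then its neighbors" sort described in this comment can be sketched as follows. This is one possible reading of the idea rather than a settled design: the function names are hypothetical, every dimension is assumed to be categorical, and category labels are simply compared lexicographically.

```python
from collections import defaultdict


def dimension_priority(n_dims):
    """Sort priority: the middle dimension first, then alternate outward to its neighbors."""
    mid = n_dims // 2
    order = [mid]
    for step in range(1, n_dims):
        for d in (mid - step, mid + step):
            if 0 <= d < n_dims:
                order.append(d)
    return order


def fixed_offsets(rows, step=1.0):
    """rows: list of tuples holding one category label per dimension.

    Returns offsets[row][dim]: the row's slot inside its category box, times `step`.
    A `step` above 1px leaves gaps between lines; below 1px the lines merge into a band.
    """
    if not rows:
        return []
    n_dims = len(rows[0])
    priority = dimension_priority(n_dims)
    # Rank rows once, using the middle-out dimension priority as the sort key.
    ranked = sorted(range(len(rows)), key=lambda i: tuple(rows[i][d] for d in priority))
    offsets = [[0.0] * n_dims for _ in rows]
    for d in range(n_dims):
        next_slot = defaultdict(int)  # next free slot per category in this dimension
        for i in ranked:
            cat = rows[i][d]
            offsets[i][d] = next_slot[cat] * step
            next_slot[cat] += 1
    return offsets


example_rows = [("x", "p"), ("y", "p"), ("x", "q"), ("x", "p")]
example_offsets = fixed_offsets(example_rows)
for row, off in zip(example_rows, example_offsets):
    print(row, off)
```

Because every row is ranked once and that ranking is reused in each dimension, rows with identical categories end up in adjacent slots.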

monfera · 2018-01-02T21:44:20Z

Fwiw I really like the aesthetic of the fixed cadence offsetting, and once we have this kind of deliberate use of vertical space, it opens up both options and maybe more. With too many lines our current WebGL might give Moiré-like patterns but we can do a couple of things to reduce the effect, or switching to a solid shape would be useful.

jonmmease · 2018-01-03T01:38:09Z

Thanks for posting the mockup @alexcjohnson , that's really helpful and I think it alleviates my reservations.

If the category-to-category lines can optionally be made thicker or covered by a patch in the sparse case then, as you said, this could reproduce the appearance of Parallel Categories. In addition to sorting the lines by category, I think they should also be sorted by color so that lines of the same color that have the same values of the categorical dimensions will cluster together.
Seeing the categorical dimensions with the boxes and without the axis lines does help make them look fundamentally different than the continuous dimensions.

Here's a list of some of the other features that an enhanced parcoords would need in order to replace the use-cases I'm building parcats for.

- Drag dimension title horizontally to reorder categorical dimensions just like continuous dimensions
- Drag category boxes vertically to reorder categories within a dimension
- Hover over category boxes to display a tooltip that may contain the count and relative frequency of the samples in that category (similar to Sankey node tooltip)
- Hover over bundles of lines to display a tooltip that may contain the count and relative frequency of the samples in the bundle (similar to Sankey link tooltip). By a 'bundle' I mean a collection of lines that have identical values in all dimensions.
- Constraint support that works similarly to (and alongside) the constraintrange property for continuous dimensions. Maybe a constraintcats property on categorical dimensions that accepts a list of categories to constrain on. When set, lines through all other categories are grayed out. When constraints are active, the tooltips also provide the constrained count and constrained relative frequency.
- plotly_restyle events would need to be emitted for interactive dimension reorder, category reorder, and constraint changes. And the events would need to contain the property information necessary to synchronize to other figures.
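
As a rough illustration of the constrained count and constrained relative frequency mentioned in the list above, here is a sketch in plain Python. The `constraintcats` shape used here (a mapping from dimension name to a list of allowed categories) is purely hypothetical, since the list only floats the property name, and the sample data is made up for the example.

```python
def constrained_stats(rows, constraintcats):
    """rows: list of dicts mapping dimension name -> category.

    constraintcats: dict mapping dimension name -> list of allowed categories
    (hypothetical shape). Rows failing any constraint would be grayed out.
    """
    def passes(row):
        return all(row[dim] in allowed for dim, allowed in constraintcats.items())

    selected = [row for row in rows if passes(row)]
    total = len(rows)
    return {
        "count": len(selected),
        "relative_frequency": len(selected) / total if total else 0.0,
    }


sample = [
    {"class": "1st", "survived": "yes"},
    {"class": "3rd", "survived": "no"},
    {"class": "1st", "survived": "no"},
    {"class": "2nd", "survived": "yes"},
]
stats = constrained_stats(sample, {"class": ["1st"]})
print(stats)
```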

And a few stylistic nice-to-haves that might not be feasible with WebGL in the mix.

Curved Sankey-style lines/paths.
Animated transitions from final drag position (of dimensions or categories) back to the neutral position.

I do think this combination would make for a really powerful and flexible multi-dimensional visualization/analysis tool. Is this something the core team would be interested in working on soon? @alexcjohnson @monfera @etpinard

From my side, I plan to finish up my parcats trace for our current internal needs and then I'll work on getting approval to open source it (this is probably a month or two out). I think it's pretty useful given the current set of Plotly.js traces, but its future utility will depend on how far you all are interested/able to push parcoords in the categorical direction. Thanks for taking the time to discuss this!

jonmmease · 2018-01-03T14:13:03Z

Oh, and one other consideration comes to mind. With the parallel categories approach, it's possible to specify a count for each point in the dataset. This is just like the freq argument to the R alluvial diagram here.

This allows the grouping logic that identifies the counts for the unique paths to happen outside of the browser. One of our use cases is to generate these diagrams in Python from larger-than-memory datasets. To do this we can first perform a groupby-count using dask or spark and then feed the unique paths and their counts into the parallel categories diagram.
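
A small sketch of that pre-aggregation step, using `collections.Counter` from the standard library as an in-memory stand-in for the dask or spark groupby-count (an assumption made to keep the example self-contained). The function name and the sample rows are illustrative.

```python
from collections import Counter


def unique_paths_with_counts(rows):
    """Collapse raw rows into unique category paths plus a count per path.

    The returned pair is what would be sent to the browser instead of the raw rows.
    """
    counts = Counter(tuple(r) for r in rows)
    paths = list(counts)  # unique paths, in first-seen order
    return paths, [counts[p] for p in paths]


raw = [
    ("1st", "yes"),
    ("3rd", "no"),
    ("1st", "yes"),
    ("3rd", "no"),
    ("3rd", "yes"),
]
paths, counts = unique_paths_with_counts(raw)
for path, n in zip(paths, counts):
    print(path, n)
```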

Do you have any thoughts on how a sample count could be supported in an enhanced parcoords trace? @alexcjohnson

alexcjohnson · 2018-01-03T14:35:36Z

> Do you have any thoughts on how a sample count could be supported in an enhanced parcoords trace?

Yes, that's a great use case, and we have done similar things in, for example, histogram traces (histfunc: 'sum'). I suspect we would only want to do this in conjunction with fat-line rendering, so that we don't need to restrict the data to small-ish integers: the weights could be any (positive) numbers.
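
To make the weighting idea concrete, here is a hedged sketch of how arbitrary positive weights could map to band thicknesses inside one category box. The function name and the pixel-based layout are assumptions for illustration and do not describe any existing plotly.js behavior.

```python
def band_extents(weights, box_height):
    """Stack one band per path inside a category box, with thickness proportional to weight.

    Returns a (top, bottom) pair per weight, measured from the top of the box.
    Weights may be any positive numbers, not just integer counts.
    """
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    total = sum(weights)
    extents = []
    top = 0.0
    for w in weights:
        height = box_height * w / total
        extents.append((top, top + height))
        top += height
    return extents


extents = band_extents([2.5, 0.5, 1.0], box_height=100.0)
for ext in extents:
    print(ext)
```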

VelizarVESSELINOV · 2019-09-20T01:24:09Z

It would be great to mix categorical and numeric channels.
The challenge with Sankey is the missing interactive selection that is available in parallel coordinates.
The challenge with parallel coordinates is that lines through a categorical channel overlap at the category's value point, losing the third dimension of how many lines enter or exit a given category. Selection is also very limited because you can't select between two vertical lines.

gvwilson · 2024-06-10T21:05:19Z

Hi - this issue has been sitting for a while, so as part of our effort to tidy up our public repositories I'm going to close it. If it's still a concern, we'd be grateful if you could open a new issue (with a short reproducible example if appropriate) so that we can add it to our stack. Cheers - @gvwilson

monfera mentioned this issue Jan 2, 2018

New charts wishlist #2221

Closed

38 tasks

etpinard added feature something new type: new trace type labels Jul 27, 2018

jonmmease mentioned this issue Jul 27, 2018

Parallel Categories trace type for multi dimensional categorical data jonmmease/plotly.js#1

Closed

4 tasks

jonmmease mentioned this issue Sep 1, 2018

Parallel Categories (parcats) trace type for multi dimensional categorical data #2963

Merged

4 tasks

gvwilson closed this as completed Jun 10, 2024
