Transform input data: groupby, filter #917

Closed
2 tasks
monfera opened this issue Sep 8, 2016 · 10 comments
@monfera
Contributor

monfera commented Sep 8, 2016

Previously discussed (some lists are from @chriddyp) :

A groupby transform should split a trace into separate traces, one per unique value or bin of the groupby dimension. Example:

groupby: ['a', 'b', 'a', 'b']
x: [1, 2, 1, 2]
y: [10, 20, 30, 40]

should generate two traces:

trace 1 (group 'a'):
x: [1, 1]
y: [10, 30]

trace 2 (group 'b'):
x: [2, 2]
y: [20, 40]
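
The split described above can be sketched in a few lines. This is only an illustration, assuming a trace with plain `x` and `y` arrays; `splitByGroup` is a hypothetical helper name, not a plotly.js function, and the sample data is made up for the sketch.

```javascript
// Hypothetical helper, not plotly.js code: split x/y by the labels in `groups`.
function splitByGroup(groups, trace) {
    const byGroup = new Map(); // a Map keeps groups in first-seen order
    groups.forEach(function (label, i) {
        if (!byGroup.has(label)) {
            byGroup.set(label, { name: String(label), x: [], y: [] });
        }
        const out = byGroup.get(label);
        out.x.push(trace.x[i]);
        out.y.push(trace.y[i]);
    });
    return Array.from(byGroup.values());
}

// Made-up sample data: three points labelled 'n', one labelled 's'.
const traces = splitByGroup(
    ['n', 'n', 's', 'n'],
    { x: [1, 2, 3, 4], y: [5, 6, 7, 8] }
);
console.log(JSON.stringify(traces));
```

Each point lands in the trace whose label matches the entry at the same index of the groupby array, so the output traces can have different lengths.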

Static groupby as a means of splitting spatially and/or aesthetically

  • distinct categorical values: numbers, strings or datetime strings
  • evenly spaced bins based on numerical data or time (datetime strings) in the groupby attribute, reusing logic of the preexisting plotly algorithm for histograms
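
For the binned variant, one possible reading is that each numeric value is first mapped to a bin label, and the labels then act as the groupby values. The sketch below assumes the bin `start` and `size` are already known; it does not reproduce plotly's histogram autobinning, and `binLabels` is a hypothetical name.

```javascript
// Hypothetical helper: map each numeric value to the label of its evenly spaced bin.
// `start` and `size` are assumed inputs; choosing them automatically is out of scope.
function binLabels(values, start, size) {
    return values.map(function (v) {
        const index = Math.floor((v - start) / size);
        const lo = start + index * size;
        return '[' + lo + ', ' + (lo + size) + ')'; // half-open interval
    });
}

const labels = binLabels([0.5, 3, 9.9, 10, 14], 0, 5);
console.log(labels);
```

The resulting labels could be passed as the groupby array, so binned and categorical grouping would share the same splitting step.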

image

Functional aspects:

  1. groupby needs to work across numbers, dates, and categories (@chriddyp in the JS context, meaning strings, correct?)

  2. groupby needs to split across all of the arrays or array-like specifications in a trace, not just x and y. For example, marker.color or marker.line.color. Not all array-like specifications in a trace are actual arrays (consider colorscale)

  3. There must be a way of specifying distinct styles for the split apart traces so that they're discernible - example:

    transform:
        groupby: ['a', 'b', 'a', 'b']
        marker:
            color:
                a: 'blue'
                b: 'red'
    
  4. @etpinard found some issues with legend items as he wrote an initial version of transforms: Introducing transform plugins #499 (comment). We'll probably need to modify some of the transforms and API. That's OK - transforms was made for groupby

  5. All relevant denotations for groupby, and the related animation split use (see below) need to be in the JSON format for serializability, fitting in the current declarative structure

  6. The transforms such as groupby must work in the restyle and relayout steps, not just the initial plot step

  7. gd.data is expected to preserve the single trace and the groupby spec as the user supplied them; _fullData, on the other hand, has the individual (split) traces and no longer has the groupby attribute

  8. We must ID traces in _fullData back to groups or styles in data. Styling controls will be populated with the defaults from _fullData (e.g. _fullData[4].marker.color) but they’ll need to update the attributes in the data object (e.g. data[0].transform.marker.color.d). That’s because we serialize and save data, not _fullData.

Preliminary work

Related PR, containing the initial, analogous filter work by @timelyportfolio : #859
groupby: https://github.com/plotly/plotly.js/blob/master/test/jasmine/assets/transforms/groupby.js

Planned groupby coverage of the initial sprint

  1. It would cover a positive list of attributes for groupby, such as x and y, but not all attributes at once. HOWEVER, the preferred solution aims for generality, because other transforms (e.g. filter) will need a similar approach, and future arraylike attributes should be covered without coupling code to transformations. Consequences: we'll have to check whether there's enough attribute metadata to tell if an attribute is arraylike, or whether we need further metadata; we'll also have to check whether there's a programmatic way of separating arraylike data (e.g. colorscale) that's not represented as an array at input, otherwise we need to handle such attributes one by one. We'll have to come back to this topic after a first round of work.
    Initial attributes at least: x, y, marker.color, marker.size (scatter, bar, histogram, box)
    Then lat, lon (maps), a, b, c (ternary), ‘z’ (scatter3d), error_y.array
  2. It would cover a set of (initially, non-WebGL) traces
  3. First goalpost is separation by category (JS number or string)

It is expected that the trace separation (and transformations in general) is performed in the supply defaults step.

Subsequent goal: splitting data for animations

Instead of generating n different paths as described above, plotly would arrive at a temporal sequence of n frames
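
A rough sketch of the animation variant, under the assumption that each group becomes one frame holding a single trace. `framesByGroup` is a hypothetical helper, and the `{ name, data }` frame shape is an assumption modeled on plotly.js animation frames, not something this issue specifies.

```javascript
// Hypothetical helper, not part of plotly.js: build one frame per group value.
// The { name, data } frame shape is an assumption, not taken from this issue.
function framesByGroup(groups, trace) {
    const frames = new Map(); // keeps groups in first-seen order
    groups.forEach(function (label, i) {
        if (!frames.has(label)) {
            frames.set(label, { name: String(label), data: [{ x: [], y: [] }] });
        }
        const frameTrace = frames.get(label).data[0];
        frameTrace.x.push(trace.x[i]);
        frameTrace.y.push(trace.y[i]);
    });
    return Array.from(frames.values());
}

// Made-up sample data: two points per year.
const groupFrames = framesByGroup(
    [2015, 2015, 2016, 2016],
    { x: [1, 2, 1, 2], y: [10, 20, 30, 40] }
);
console.log(groupFrames.map(function (f) { return f.name; }));
```

The grouping step is the same as for static traces; only the container differs (a sequence of frames instead of a list of sibling traces).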

Possible future items:

  1. Incremental recalculation (e.g. of bins, upon newly arriving data points)
  2. Combine this with a subplots transform for rendering the traces into separate subplots (as small multiples plots)
@monfera monfera changed the title Transform input data: groupBy, filter Transform input data: groupBy, filter Sep 8, 2016
@etpinard etpinard added the feature something new label Sep 8, 2016
@monfera monfera changed the title Transform input data: groupBy, filter Transform input data: groupby, filter Sep 8, 2016
@etpinard etpinard added this to the v1.18.0 milestone Sep 8, 2016
@monfera
Contributor Author

monfera commented Sep 15, 2016

A quick update on progress:

As styling can be hierarchical, such as `{marker: {line: {color: "cyan"}}}`, and users already make a big investment in learning these attributes, and since we also seek to avoid property-by-property handling of styles (attribute metadata extension or manual additions), we agreed that the styling definitions for groups would look like normal trace attributes. Here's an example:

    var mockData02 = [{
        mode: 'markers',
        x: [1, -1, -2, 0, 1, 2, 3],
        y: [0, 1, 2, 3, 4, 5, 6],
        transforms: [{
            type: 'groupby',
            groups: ['a', 'a', 'b', 'a', 'b', 'b', 'a'],
            styles: {
                a: {
                    marker: {
                        color: "orange",
                        size: 20,
                        line: {
                            color: "red",
                            width: 1
                        }
                    }
                },
                b: {
                   // heterogeneous attributes are OK:
                   // group "a" needn't define e.g. `mode` if defaults are alright
                    mode: "markers+lines", 
                    marker: {
                        color: "cyan",
                        size: 15,
                        line: {
                            color: "purple",
                            width: 4
                        },
                        opacity: 0.5,
                        symbol: "triangle-up"
                    },
                    line: {
                        width: 1,
                        color: "purple"
                    }
                }
            }
        }]
    }];

This is what the result looks like. OK, it's decidedly outré, but it serves the point:
image

The benefit of the solution is that

  • it's very compact to implement (basically one line change in current groupby)
  • rather powerful - basically anything goes that could go with manual separation
  • robust - nothing is expected to break
  • conceptually simple to users:
    • attributes are what users already know, use and have documented anyway
    • definition is natural
  • doesn't tamper with the existing implementation structures

Its drawback stems from the same properties:

  • it can be a bit verbose in JS
  • it can't pry apart array-like palettes e.g. "Greens"
    • however, this might be handled in a subsequent iteration
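
To illustrate why the change can be so small, here is a minimal sketch in which the per-group style object is deep-merged over an already split trace. `deepMerge` and `applyGroupStyle` are hypothetical names written for this sketch; the actual groupby transform may use a different utility.

```javascript
// Hypothetical helpers, not plotly.js code.
function isPlainObject(v) {
    return v !== null && typeof v === 'object' && !Array.isArray(v);
}

// Recursively copy `source` onto `target`; nested objects are merged, other values overwrite.
function deepMerge(target, source) {
    Object.keys(source || {}).forEach(function (key) {
        if (isPlainObject(source[key])) {
            if (!isPlainObject(target[key])) target[key] = {};
            deepMerge(target[key], source[key]);
        } else {
            target[key] = source[key];
        }
    });
    return target;
}

// Overlay the style of `group` on a copy of the split trace; a missing style is a no-op.
function applyGroupStyle(splitTrace, styles, group) {
    const copy = JSON.parse(JSON.stringify(splitTrace)); // JSON-safe data only
    return deepMerge(copy, styles[group]);
}

const styled = applyGroupStyle(
    { mode: 'markers', x: [1, -1], y: [0, 1], marker: { color: 'darkred', line: { width: 8 } } },
    { a: { marker: { size: 30, line: { color: 'red' } } } },
    'a'
);
console.log(JSON.stringify(styled));
```

Because the style object has the same shape as a trace, no per-attribute handling is needed: anything a manually separated trace could set, a group style can set too.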

@rreusser
Contributor

rreusser commented Sep 15, 2016

As in the related PR, one additional note that, in general, scatter traces can now have ids in addition to x and y data arrays, which can be very useful for these sorts of operations.

@monfera
Contributor Author

monfera commented Sep 15, 2016

@etpinard @rreusser Here's another example, for these things:

  1. Define styling at a super-group level - it can work per group, or the group can override it
  2. Arrays at the super-group level are interpreted per group element
    var mockData03 = [{
        mode: 'markers',
        x: [1, -1, -2, 0, 1, 2, 3],
        y: [0, 1, 2, 3, 5, 4, 6],
        marker: {
            color: "darkred", // general "default" color
            line: {
                width: 8,
                // a general, not overridden array will be interpreted per group
                color: ["orange", "red", "green", "cyan"]
            }
        },
        transforms: [{
            type: 'groupby',
            groups: ['a', 'a', 'b', 'a', 'b', 'b', 'a'],
            styles: {
                a: {marker: {size: 30}, mode: "markers+lines"},
                b: {marker: {size: 15, color: "lightblue"}, mode: "markers+lines"} // override general color
            }
        }]
    }];

Result:
image

@rreusser
Contributor

I like it. Transforms in general are kinda free-form and extremely flexible, which means it's probably good to develop a set of conventions (like styles the way you've defined it) so that it's clear how to write a new transform that conforms to the conventions used in the rest of the transforms.

@etpinard
Contributor

@monfera your API looks great.

I'd vote for transforms[i].style instead of transforms[i].styles as we like to keep plurals for Array containers.

One thing that we should attempt to handle better is the findArrayAttributes step. What we need to do is something similar to what Plotly.PlotSchema.get() does here where it looks for data_array and arrayOk attributes (which e.g. correctly skips over colorscale and domain) by looking into the fullData[i]._module.attributes

The more I think about it, the more I think that finding the list of all data_array + arrayOk attributes in a given trace will be common to almost all transforms (including possible transforms written by community users). So I suggest we find that list somewhere in plots.js and pass it as an argument to the transform methods here.
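
A sketch of what such a lookup might look like. The attribute tree below is hand-written for illustration and is not the real scatter attribute declaration; only the `valType: 'data_array'` and `arrayOk: true` markers mirror what the comment describes. `findArrayAttributes` here is a hypothetical standalone version, not the function in the plotly.js source.

```javascript
// Hypothetical, hand-written attribute tree (not the real plotly.js declarations).
const scatterLikeAttributes = {
    x: { valType: 'data_array' },
    y: { valType: 'data_array' },
    mode: { valType: 'flaglist' },
    marker: {
        color: { valType: 'color', arrayOk: true },
        size: { valType: 'number', arrayOk: true },
        colorscale: { valType: 'colorscale' }, // array-shaped value, but not per-datum
        line: {
            color: { valType: 'color', arrayOk: true }
        }
    }
};

// Collect dotted paths of attributes declared as data_array or arrayOk.
function findArrayAttributes(attributes, prefix) {
    let paths = [];
    Object.keys(attributes).forEach(function (key) {
        const attr = attributes[key];
        const path = prefix ? prefix + '.' + key : key;
        if (attr === null || typeof attr !== 'object') return;
        if (typeof attr.valType === 'string') {
            // Leaf attribute: keep it only if it can hold per-datum values.
            if (attr.valType === 'data_array' || attr.arrayOk === true) {
                paths.push(path);
            }
        } else {
            // Container: recurse into nested attributes.
            paths = paths.concat(findArrayAttributes(attr, path));
        }
    });
    return paths;
}

const arrayPaths = findArrayAttributes(scatterLikeAttributes);
console.log(arrayPaths);
```

Driving the lookup from declared metadata rather than from `Array.isArray` on the user's values is what lets it skip attributes like colorscale.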

@monfera
Contributor Author

monfera commented Sep 16, 2016

@etpinard @rreusser Do I understand that anything that's data_array and arrayOk must split by group just like x and y now? I.e. is it the only condition? I'd have thought there are array attributes that represent some value extent [from, to] or whatever in an array such that they must not be split by groupby trace.

Assuming the answer is yes: probably I can make (or plug into) code that crawls the entire set of attributes and distinguish between splittable arrays and non-splittable arrays. But there's the issue that the attribute tree can differ by plot type, and according to other values. I'm concerned that some attribute locations in a mother of all JSON attribute dictionary will be group-splitting arrays under some circumstances and non-splitting arrays under others.

@etpinard
Contributor

Do I understand that anything that's data_array and arrayOk must split by group just like x and y now?

Yes. When an arrayOk attribute is set to an array, it should be interpreted as per-datum specifications (e.g. just like ids[i] that @rreusser mentioned earlier).

But there's the issue that the attribute tree can differ by plot type, and according to other values

That's correct. The list of data_array + arrayOk attributes should be given per plot type.
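
Put together, the rule could be sketched as follows: given a per-plot-type list of splittable attribute paths, split an attribute only when the user actually set it to an array. All names here (`getPath`, `setPath`, `splitTrace`) and the sample path list are hypothetical and written only for this illustration.

```javascript
// Hypothetical helpers, not plotly.js code.
function getPath(obj, path) {
    return path.split('.').reduce(function (node, key) {
        return node === undefined || node === null ? undefined : node[key];
    }, obj);
}

function setPath(obj, path, value) {
    const keys = path.split('.');
    const last = keys.pop();
    let node = obj;
    keys.forEach(function (key) {
        if (node[key] === null || typeof node[key] !== 'object') node[key] = {};
        node = node[key];
    });
    node[last] = value;
}

// Split `trace` into one trace per distinct label in `groups`.
// Only paths listed in `arrayPaths` AND currently set to an array are split.
function splitTrace(trace, groups, arrayPaths) {
    const labels = groups.filter(function (g, i) { return groups.indexOf(g) === i; });
    return labels.map(function (label) {
        const out = JSON.parse(JSON.stringify(trace)); // JSON-safe data only
        arrayPaths.forEach(function (path) {
            const value = getPath(trace, path);
            if (Array.isArray(value)) {
                setPath(out, path, value.filter(function (_, i) {
                    return groups[i] === label;
                }));
            }
        });
        return out;
    });
}

const parts = splitTrace(
    {
        x: [1, 2, 3],
        y: [4, 5, 6],
        marker: {
            color: ['red', 'green', 'blue'],    // arrayOk, set to an array: split
            size: 10,                           // arrayOk, set to a scalar: copied
            colorscale: [[0, 'red'], [1, 'blue']] // not in the path list: copied
        }
    },
    ['a', 'b', 'a'],
    ['x', 'y', 'marker.color', 'marker.size']
);
console.log(JSON.stringify(parts));
```

Note that `marker.size` stays a scalar in both outputs and `marker.colorscale` is never touched, even though its value is an array.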

@monfera
Contributor Author

monfera commented Sep 16, 2016

@etpinard Awesome, thanks! With this answer, @rreusser's answers and your examples I feel there's enough nooks and crannies to continue rock climbing :-)

@rreusser
Contributor

Climb on!

@etpinard
Contributor

closed in #936 and #978
