Skip to content

Filter and groupby transforms in main bundle #978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Oct 6, 2016
Merged

Conversation

etpinard
Copy link
Contributor

@etpinard etpinard added this to the v1.18.0 milestone Sep 26, 2016
- instead of calling it for every transform trace
- call findArrayAttributes in transform function

- make findArrayAttributes 'aware' of array attribute in trace
  so that it lists only the array attributes present in given trace
- to allow calcTransform methods where fullTrace / or calcdata
  items can be modified during the calc step.
- this allows writing transforms that are applied after much
  of the 'computed' fields are filled e.g. axis (auto) 'type'
- calcTransforms can only manipulate trace object 1-to-1 e.g.
  grouping operation should be done at the defaults step as previously.
- rewrite transform as a calcTransform
- add 'active' (like in groupby)
- add 'strictinterval' (boolean) to determine e.g. `>` vs `>=`
- makes filter works with numerical, categorical and dates!!!
- extend original traces with extendDeepNoArrays
  (no need to copy arrays - they are filled during the grouping operation)
- swap .forEach for good old for loop
- move some more general cases our of filter_test -> multi_test
- add some filter default cases (around 'active' and 'strictinterval')
- add a bunch of calcTransform test (numeric, categorical and dates
  are now covered!)
@etpinard
Copy link
Contributor Author

@bpostlethwaite @alexcjohnson

This PR adds another layer of functionality to transform plugins ➡️ transforms can now be applied during the calc step. In details, the plotting code will apply transforms:

  • at the defaults step if the registered transform module has transform method, and
  • at the calc step if the registered transform module has a calcTransform method.

Applying transforms at the calc has a few important advantages:

  • transforms have access to several computed fields such as axis the type (that the calc step comes after the auto-type routine) and the very useful data-to-coordinate methods (d2c)
  • the calc step is not invoked as often as the defaults step on redraws. Loosely speaking, the calc is (or should be) invoked only when the data arrays change. For example, zooming on a cartesian plot triggers a supply-defaults call but no calc step.

Important calcTransform are expected to mutate a lone trace object. That is, they are only meant for 1-trace-to-1-trace transforms. For example, groupby transforms should be applied at the defaults as is the case currently.

For implementation details, see commit 1238236

In turn the filter transform was mostly rewritten to handle numerical, categorical and dates. See 0e1e9a7

* @return {array} arrayAttributes
* list of array attributes for the given trace
*/
exports.findArrayAttributes = function(trace) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monfera FYI I decided to make findArrayAttributes a lib function instead meant to be called on-demand by the transform modules.

That way, this same function can now find the array attributes per trace module and filter them to match the array attribute in the user trace.

exports.attributes = {
active: {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an active attribute similar to the active in the groupby transform.

BUT, if no one objects, I'll change active -> visible for consistency with our other components. Moreover, active is already used in layout.updatemenus where it an 'integer' attribute, having both a boolean and an integer active attribute in our schema would be pretty confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine to me!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit bothered by visible=true exactly causing things to become _in_visible.
How about enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

active -> enabled in f4b9dcc

'Has only an effect for `operation` *>*, *<*, *within* and *notwithin*',
'When `operation` is set to *within* and *notwithin*,',
'`strictinterval` is expected to be a 2-item array where the first (second)',
'item determines strictness for the lower (second) bound.'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super happy about this attribute name.

Any suggestion would be appreciated.

Copy link
Contributor

@rreusser rreusser Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked to wikipedia for guidance, but it suggested 'strict inequality', so that that's perhaps where you already looked. Otherwise I usually mentally call it an 'open' or 'closed' interval. Wouldn't terminology from mathematics be the most universally understood and transparent nomenclature? 😬

Srsly though, what about grouping this into the operator? Would <= be more straightforward than setting < and applying strictinterval: false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it suggested 'strict inequality', so that that's perhaps where you already looked.

Nop, I just remembered that term from Math classes. 😄

Would <= be more straightforward than setting < and applying strictinterval: false

I thought about that. But > + >= doesn't extend to within and notwithin.

To make it really mathy, maybe we could change operation: 'within' for operation: '[]' - where [] is a closed interval () would an open interval and similarly for [) and (] ?

@alexcjohnson @cldougl want to chip in on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe operation: 'in' should be operation: '{}' as in a set.

Copy link
Contributor

@rreusser rreusser Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like fewer attributes (see orientation + openreverse vs. just direction discussion), but I have no objection to the status quo, even if it's not the most beautiful, satisfying, elegant API ever. As long as it's straightforward and usable and correct, that's probably sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just muddying it further but for lots of charting, e.g. binning, time series etc. a frequent one is left closed, right open...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for collapsing these two into one attribute. Having > sometimes mean >= would be strange, just define them both to mean what they always mean.

I actually kinda like [], (), [), and (]! Would it be going too far to turn notwithin into ][, ](, )[, and )(?

But if people think that would be too confusing, I'd also be happy to simplify it and say within is always inclusive, and notwithin is always exclusive (so it's the precise complement of within) and if a user wants something else done with the boundaries they can manually tweak them, similar to how in auto mode we try to adjust histogram bins so there actually aren't any samples on the boundaries and this distinction is moot.

That said, I think in and notin should stay as they are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah my main concern was if we use [], (), [), and (] instead of within what would happen to the notwithin option?
I was thinking a similar thing to what @alexcjohnson suggested - if we go with [], (), [), and (] for within I think there should be a parallel syntax for notwithin. I don't think ][, ](, )[, and )( is too confusing (as long as we write a thorough description).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 104123e

would love to get a review.

// as another round of supplyDefaults is done on the transformed traces
newTrace = Lib.extendDeep(newTrace, style[groupName] || {});
newTrace = Lib.extendDeepNoArrays(newTrace, style[groupName] || {});
Copy link
Contributor Author

@etpinard etpinard Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monfera no need to extend arrays here, right?

This could be a solid performance boost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it should be fine, haven't tested it myself.

xaxis: { type: 'category' }
});

_assert(out, [-2, -3], [3, 1], [0.3, 0.4]);
Copy link
Contributor Author

@etpinard etpinard Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bpostlethwaite this is a very important test case.

Here, a user inputs numerical data on a category axis - meaning that the data items are cast to strings and drawn in the ordered that they are input. That is,

// x
[1, 2, 3, 0, -1, -2, -3] 

// is plotted with x coordinates
[0, 1, 2, 3, 4, 5, 6] 

The calcTransform filter using fullLayout._xaxis.d2c performs admirably. ⛵

},
filtersrc: {
valType: 'enumerated',
values: ['x', 'y', 'z', 'ids'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we keeping this explicit list of attributes, or is it obsoleted by crawling the attribute tree for filterable arrays?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we keeping this explicit list of attributes, or is it obsoleted by crawling the attribute tree for filterable arrays?

👍 would be awesome to filter on marker.size, or text, anything that has an array...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on request and code sample from @etpinard, the groupby transform already uses the attribute crawler which had been extracted out of plot_schema.js and now lives in Lib/coerce.js. I was asked during the groupby PR review to elevate the crawler call to transform so I think there's plan to eventually use it in filter, maybe as a subsequent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in ad77818

var cv = d2c(v);
return cv <= coercedValue[0] || cv > coercedValue[1];
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice run of all combinations, and the notation looks good and math

Plotly.register([
require('@lib/pointcloud')
]);

// Test utilities
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

case '][':
return function(v) {
var cv = d2c(v);
return cv < coercedValue[0] || cv > coercedValue[1];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't what I thought ][ would mean... shouldn't it be cv <= coercedValue[0] || cv >= coercedValue[1]? ie ] always means <= whether it's on the left side or the right side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I guess that's easier to remember than ][ meaning not in the closed interval.

Copy link
Contributor

@monfera monfera Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of context, the ][ notation would be ambiguous but one of its meanings is the complement interval, and current code is consistent with that meaning (since there's also the () symbols, it's actually disambiguated)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 7cfa4ff

Copy link
Contributor

@monfera monfera Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw ][ means 'not in the open interval' and code is correctly implemented that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Some authors use ]a, b[ to denote the complement of the interval (a, b); namely, the set of all real numbers that are either less than or equal to a, or greater than or equal to b." - wiki

I think it makes intuitive sense also, in that the retained part, i.e. the part outside the interval, has closed bounds

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@etpinard has already changed it per my suggestion (thanks!) so not to beat a dead horse but I find it much more persuasive to think of these as a combination of two independent operators (] always means <=) than to try to relate them to the corresponding "within" forms. Likewise, think about histograms: if you have a bin with an interval like [), then to meet it exactly the next bin must start with [, hence [ is the complement of )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexcjohnson I'm not requesting anything one way or another, it just looked like @etpinard's original code was in line with the standard math notation which is why I mentioned this on my initial review. I may be mistaken or we may not want to follow it, I have less experience than you figuring out what's expected by users.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monfera no problem - also somehow I didn't see your previous comment including "]a, b[" before writing mine - which I think ends up being exactly what we ended up with ⭐

Copy link
Contributor

@monfera monfera Sep 30, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexcjohnson just to clarify myself, when I wrote 'complement' it's meant in the set operations sense, for example, the complement of [8, 12] is the union of (-∞, 8) and (12, ∞) which, at least according to my comment you quote, from the wiki, can be denoted as )8, 12( though again, whichever notation you guys settle will work out just fine.

'*)[* filters items outside `value[0]` to value[1]` and not equal to `value[1]`',

'*{}* filters items present in a set of values',
'*}{* filters items not present in a set of values'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the verb "filters" is a bit ambiguous - could be taken to mean "filters out" which is the opposite of what you mean. How about "keeps" or "shows"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point - done in 7cfa4ff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I also like "keeps" or "retains"

- add a missing test cases for semi-open intervals
- replace verb 'filter' by 'keep' in operation description
@alexcjohnson
Copy link
Collaborator

alexcjohnson commented Sep 30, 2016

From an off-line chat with @etpinard - calcTransform transforms always happen after transform (supplyDefaults) transforms, even if you put the calc ones first in the transform list. Within our current structure there isn't really any way around this. The situation will get even more complicated if a transform has both methods. This doesn't mean we should change anything about what we're building here, but we need to make it crystal clear to transform users what this means. Also note that it's irrelevant for transforms that commute (which groupBy and filter do)

To be concrete: lets imagine a transform truncate(N) that just passes on the first N points. This could be implemented either way (and both would have limitations) but let's imagine we make it with calcTransform. Conceptually, [truncate(100), groupBy] should result in grouping just the first 100 points of the trace into separate traces (each with <100 points) but it will actually result in the first 100 points of each group, which is what we expect from [groupBy, truncate(100)], and there would be no way to get the former behavior without making a new implementation of truncate using transform only. However, if you combine truncate and filter, the result will respect the order you give them.

  • Make this limitation crystal clear to users (through appropriate docs / comments)

assertStyle([1], ['rgb(255, 0, 0)'], [1]);

done();
it('+ *within*', function() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names could be updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 0aa4958

@monfera
Copy link
Contributor

monfera commented Sep 30, 2016

Similar to @alexcjohnson's above comment (excerpt from discussion on truncate vs filter with @etpinard) I had concerns about the meaning and interaction between groupby and filter, e.g. would a [groupby, filter] sequence be analogous to the SQL having, and does it make sense to have e.g. [groupby, filter, groupby, filter, groupby, filter, groupby, filter] or just thinking of it, just consecutive groupby steps.

A possibility for now is to limit what the API caller can do to well-understood, test covered sequences e.g. [filter], [groupby] or [filter, groupby].

- transform module are now required to have a transfrom OR
  calcTransform method
- so that filters can be applied to ALL possible data array
  e.g `filtersrc: 'marker.color'`
- add tests for geo 'lon' / 'lat' and 'marker.color'
- using axisIds.getFromTrace !
@@ -120,6 +120,23 @@ describe('filter transforms calc:', function() {
expect(out[0].y).toEqual(base.y);
});

it('filters should handle 3D *z* data', function() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @bpostlethwaite this works 🎉 !!!

@etpinard
Copy link
Contributor Author

etpinard commented Oct 3, 2016

@alexcjohnson

Make this limitation crystal clear to users (through appropriate docs / comments)

With commit 2a7738e registering a transform module with both a transform and calcTransform method will log info about execution order. Were you looking for something even clearer?

Maybe we could add a src/transforms/_example.js file or a src/transforms/README.md with more info about transforms?

@alexcjohnson
Copy link
Collaborator

With commit 2a7738e registering a transform module with both a transform and calcTransform method will log info about execution order. Were you looking for something even clearer?

Maybe we could add a src/transforms/_example.js file or a src/transforms/README.md with more info about transforms?

If someone goes to the trouble of making a single transform that has both components, they probably already know what this will do. The issue I wanted to highlight comes when people (who probably weren't the transform author) use multiple transforms on the same plot that are split between the two modalities, so that, it seems to me, should be mentioned in two places: 1) wherever users will first learn about transforms, and 2) a log message would be great, but the best place for it would be when transforms are found that may not do what you expect. Specifically, if the first transform that has a calcTransform is earlier in the list than the last one that has a transform.

@rreusser
Copy link
Contributor

rreusser commented Oct 3, 2016

Make this limitation crystal clear to users (through appropriate docs / comments)

Regarding making things clear through documentation, just a side note on docs philosophy. It was expressed to me that the goal of the plotly docs is to communicate through examples rather than through explanation. I think this can make it difficult to just explain some things in a straightforward manner though, so I added a markdown_content field that allows you to preface examples with a brief markdown-formatted explanation. I'm absolutely open to different approaches and thoughts here; I just found it difficult to communicate subtle/important details through code comments alone.

See the animation docs for an example.

@etpinard
Copy link
Contributor Author

etpinard commented Oct 5, 2016

@alexcjohnson

  1. wherever users will first learn about transforms

We don't have a place for that yet unfortunately. I don't think we advertise about writing custom transforms to users just yet - until transform stabilized a little more.

  1. a log message would be great, but the best place for it would be when transforms are found that may not do what you expect. Specifically, if the first transform that has a calcTransform is earlier in the list than the last one that has a transform

That sounds like a lot of logic just to log a message. Ok to differ until transforms stabilize?

Ok to merge?

@alexcjohnson
Copy link
Collaborator

@etpinard I see. Right, people don't have a lot to go on yet... but that may not stop people from trying! How about just a warning here https://github.com/plotly/plotly.js/pull/978/files#diff-6d186b954a58d5bb740f73d84fe39073R34 - where anyone who writes a transform will have to list it - that this feature is still alpha.

@alexcjohnson
Copy link
Collaborator

Thx 💃

@etpinard etpinard merged commit 987f1aa into master Oct 6, 2016
@etpinard etpinard deleted the transforms-primetime branch October 6, 2016 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature something new
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants