ENH: plotting methods can unpack labeled data [MOVED TO #4829] #4787
After discussions with Brian Granger, Fernando Perez, Peter Wang, Matthew Rocklin and Jake VanderPlas, this is a proposal for how to deal with labeled data in matplotlib. The approach taken is that if the optional kwarg 'data' is passed in, any other string args/kwargs to the function are replaced by the result `data[k]` if the key exists in `data`; otherwise the value is left as-is.
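To make the proposed lookup rule concrete, here is a minimal stand-alone sketch. The helper name `replace_strings` is made up for illustration and is not part of the PR; it only mirrors the rule described above (string values are looked up in `data`, everything else passes through untouched).

```python
def replace_strings(data, args, kwargs):
    """Swap every string arg/kwarg for data[key] when the key exists (illustrative helper)."""
    def lookup(value):
        # Non-strings are never treated as keys.
        if not isinstance(value, str):
            return value
        try:
            return data[value]
        except KeyError:
            # Not a key in `data`: leave the value as-is.
            return value

    new_args = tuple(lookup(a) for a in args)
    new_kwargs = {k: lookup(v) for k, v in kwargs.items()}
    return new_args, new_kwargs


data = {"t": [0, 1, 2], "x": [5, 6, 7]}
args, kwargs = replace_strings(data, ("t", "x"), {"color": "red"})
print(args)    # ([0, 1, 2], [5, 6, 7])
print(kwargs)  # {'color': 'red'}
```

Note that `color="red"` survives only because `"red"` happens not to be a key in `data`; that is the string-overloading concern raised later in the thread.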
Nice! |
+1 on this from my perspective as well. IIRC this is the protocol we discussed to have pandas internally dispatch to matplotlib as well. |
This sounds like getting closer to R's plotting functions. With the function call mechanism in R, you can do
and the plot function sees the expressions passed in as arguments and controls how they get evaluated. People will naturally ask for expression support next, and if we use strings as placeholders for values, we'll need to implement a parser for some expression language. Alternative design: let users pass in sympy symbols or expressions, as in
These won't clash with string values of keyword arguments, and the generalization to expressions is simple. To avoid a dependency on sympy, we can deliver a simple version of symbol objects ourselves. |
Will this work with line attributes? I would love a simple, consistent way to specify per-point colors, marker styles and marker sizes, and being able to specify that data in a record array and associate each keyword sounds like it would do the job very neatly.
-- Russell
|
@jkseppan You hit the inspiration on the head 😉 The reason that it checks if the place holder is a string instead of just trying I am not super excited about adding that sort of computation into core of mpl. I think it is important to (at the low level) keep computation/analysis logic separated from the plotting logic. @r-owen There are some limitations to that due to details of the underlying artists work (all markers from |
Will/could this label axes, and do other smart things with the labels? That is (IMO) a major motivation for this style of invocation. |
Thanks for this @tacaswell! One other feature that would be nice, though it would complicate the logic a bit: I'd love to have the created plot elements be automatically labeled. So then these two things would be equivalent:
plt.plot(data['t'], data['x'], label='x')
plt.plot(data['t'], data['y'], label='y')
plt.legend()
and
plt.plot('t', 'x', data=data)
plt.plot('t', 'y', data=data)
plt.legend()
It would involve inferring which is the y value in any appropriate function, and automatically setting the |
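A rough sketch of the auto-labeling idea, written without matplotlib so it can be run on its own. `resolve_plot_args` is a hypothetical helper, and treating the second positional argument as "the y value" is an assumption made only for this example.

```python
def resolve_plot_args(x, y, data=None, label=None):
    """Resolve x/y through `data` and default the label to the y key (hypothetical)."""
    if data is not None:
        # Only infer a label when the user did not pass one explicitly.
        if label is None and isinstance(y, str) and y in data:
            label = y
        x = data[x] if isinstance(x, str) and x in data else x
        y = data[y] if isinstance(y, str) and y in data else y
    return x, y, label


data = {"t": [0, 1], "x": [2, 3], "y": [4, 5]}
print(resolve_plot_args("t", "x", data=data))               # ([0, 1], [2, 3], 'x')
print(resolve_plot_args("t", "y", data=data, label="mine"))  # ([0, 1], [4, 5], 'mine')
```

With something like this in place, the explicit `label='x'` in the first form above would become optional in the second.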
if rcParams['unpack_labeled']:
    args = tuple(_replacer(data, a) for a in args)
    kwargs = dict((k, _replacer(data, v))
                  for k, v in six.iteritems(kwargs))
Wouldn't it make sense to implement a positive/negative list of at least the kwargs which should be replaced and some which should not be replaced?
Like
@unpack_labeled_data([1], ["labels", "colors"]) # the second arg and two kwargs should be replaced
def pie(self, x, explode=None, labels=None, colors=None,
autopct=None, pctdistance=0.6, shadow=False, labeldistance=1.1,
startangle=None, radius=None, counterclock=True,
[...]
@jakevdp asked the same question. I didn't go down that route because it is a bit more subtle, as you would have to do
@unpack_labeled_data([1, 3, 4], ['x', 'labels', 'colors'])
which, now that I type out what it would look like, isn't so bad.
Probably will have to special-case plot and maybe a few others with overly permissive APIs.
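For illustration, a whitelist decorator of the shape discussed here (positions plus keyword names) could look roughly like the following. This is a sketch under assumptions, not the code from the PR: the toy `pie` signature is shortened, positions are counted including the leading `ax` argument, and `_lookup` is a helper invented for the example.

```python
import functools


def _lookup(data, value):
    # Replace only strings that are actual keys of `data`.
    if isinstance(value, str) and value in data:
        return data[value]
    return value


def unpack_labeled_data(replace_positions, replace_names):
    """Only whitelisted positions / keyword names are looked up in `data` (sketch)."""
    positions = set(replace_positions)
    names = set(replace_names)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, data=None, **kwargs):
            if data is not None:
                args = tuple(_lookup(data, a) if i in positions else a
                             for i, a in enumerate(args))
                kwargs = {k: (_lookup(data, v) if k in names else v)
                          for k, v in kwargs.items()}
            return func(*args, **kwargs)
        return wrapper
    return decorator


@unpack_labeled_data([1], ["x", "labels", "colors"])
def pie(ax, x, labels=None, colors=None, title=None):
    return x, labels, colors, title


data = {"share": [60, 40], "who": ["a", "b"], "title": "oops"}
result = pie(None, "share", labels="who", title="title", data=data)
print(result)  # ([60, 40], ['a', 'b'], None, 'title')
```

The `title="title"` argument is left alone even though `"title"` is a key in `data`, because `title` is not on the whitelist.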
erk, didn't know that :-(
Not sure if that helps, but:
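import inspect
def func(x, y=1):
    print("x: %s, y: %s" % (x,y))
# note: getargspec was removed in Python 3.11; inspect.getfullargspec is its replacement
inspect.getargspec(func)
ArgSpec(args=['x', 'y'], varargs=None, keywords=None, defaults=(1,))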
- only pass in the names of the args in the decorator as replace_names
- in the decorator: cache the list of arg names via inspect: cached_names
- in the wrapper:
  - for each arg, use the pos to get the name (cached_names[pos]) and use that name in the replacement if it is in replace_names
  - process all kwargs like before if they are in replace_names
Unfortunately this won't work for plot_func(*arg, **kwarg) :-(
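The steps above can be sketched with `inspect.signature` on current Python versions. This is only an illustration of the name-based idea; `unpack_by_name` is a made-up name, and refusing `*args`/`**kwargs` functions outright is a simplification of the limitation just mentioned.

```python
import functools
import inspect


def unpack_by_name(replace_names):
    """Replace whitelisted arguments by name, using the wrapped signature (sketch)."""
    replace_names = set(replace_names)

    def decorator(func):
        sig = inspect.signature(func)  # cached once, at decoration time
        for p in sig.parameters.values():
            if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
                # Names cannot be inferred for *args / **kwargs.
                raise TypeError("cannot infer argument names for *args/**kwargs")

        @functools.wraps(func)
        def wrapper(*args, data=None, **kwargs):
            if data is None:
                return func(*args, **kwargs)
            bound = sig.bind(*args, **kwargs)
            for name, value in list(bound.arguments.items()):
                if name in replace_names and isinstance(value, str) and value in data:
                    bound.arguments[name] = data[value]
            return func(*bound.args, **bound.kwargs)
        return wrapper
    return decorator


@unpack_by_name(["x", "y"])
def plot_func(ax, x, y, ls="x", label=None):
    return x, y, ls, label


data = {"x": [1, 2], "y": [8, 9]}
res = plot_func(None, "x", "y", ls="x", data=data)
print(res)  # ([1, 2], [8, 9], 'x', None)
```

Here `ls="x"` keeps its string value even though `"x"` is a key of `data`, since only `x` and `y` are whitelisted.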
import cycler as cy
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

def simple_plot(ax, x, y, **kwargs):
    return ax.plot(x, y, **kwargs)

th = np.linspace(0, 2*np.pi, 128)
df = pd.DataFrame({'sin': np.sin(th), 'cos': np.cos(th),
                   'sin2': .5 * np.sin(2 * th), 'cos2': .5 * np.cos(2 * th)})

def easy_facet(df):
    cycleX = (cy.cycler('x', df.keys()) + cy.cycler('linestyle', ['-', '--', ':', '']))
    cycleY = (cy.cycler('y', df.keys()) + cy.cycler('marker', 'xos*'))
    kw_cycle = cycleX * cycleY
    fig, axes = plt.subplots(len(df.keys()), len(df.keys()), sharex=True, sharey=True,
                             figsize=(10, 10))
    lines = []
    for ax, kwargs in zip(axes.ravel(), kw_cycle):
        ln, = simple_plot(ax, markevery=5, data=df, **kwargs)
        ax.set_title('{x} vs {y}'.format(**kwargs))
        lines.append(ln)

easy_facet(df)
|
This is really a fantastic addition. This simple lookup based approach will couple equally well with other labeled data libraries, e.g., xray. @mwaskom I don't think there's a clean way to automatically handle axis labeling. The way to get that info from a pandas DataFrame is very pandas specific. |
I don't think I understand why it would be specific to the input type. I would think it is just logic that needs to be associated with the particular matplotlib function. In other words,
ax.scatter("foo", "bar", data=df)
should be the same as
ax.scatter(df.foo, df.bar)
ax.set(xlabel="foo", ylabel="bar")
To be clear I'm not saying matplotlib needs to extract a |
Associating the column names with the axis labels only makes sense for the simplest of the use cases. For example
ax.scatter('x', 'foo', data=df)
ax.scatter('x', 'bar', data=df)
ax.plot('x', 'baz', data=df)
would end up with whatever the last call was setting the axis labels, which may not be right. I think it is better to err on the side of making the users be more explicit rather than giving the users something that is wrong. |
@shoyer That was definitely part of the discussion. I have also tested that it works with h5py files/groups and dicts of things that quack like arrays. |
Pandas solves this by labeling the x axis (which is almost always the same for multiple lines) and making a legend based on the y labels. That's why I suggested above automatically labeling the objects, so that a simple |
+1 from me as well! |
Big +1 here, this looks great. The one thing I see, from a pandas perspective, is that we typically plot a column against an index, e.g.
In [10]: df = pd.DataFrame({'A': range(10), 'B': np.arange(10)**2})
In [11]: df.A.plot()
Will plot the
>>> ax.plot(x='index', y='A', data=df.reset_index())
It'd be nice to avoid that |
ax.scatter('x', 'foo', data=df)
ax.scatter('x', 'bar', data=df)
ax.plot('x', 'baz', data=df)
Pandas does just overwrite the axis label here. It's not ideal, but there is a precedent here. I guess this counts as one of those foot-cannons you mentioned @tacaswell. |
I think this discussion is why long-form data is better than wide-form data. But I could see the argument that matplotlib should remain agnostic about data format. That said, long-form datasets are probably > 5% of what's out there, so I'm not sure this is accurate:
|
    kwargs = dict((k, _replacer(data, v))
                  for k, v in six.iteritems(kwargs))
else:
    raise ValueError("Trying to unpack labeled data, but "
I don't understand the strategy here. What's the point of the rcParam? You don't seem to be letting it turn off the unpacking behavior. You are always popping the data kwarg; and if it is there, the only effect of the rcParam seems to be to generate an exception.
The idea was just to be able to turn this feature off in a guaranteed way. It is better to catch it here and raise rather than letting it fall through to set_data. The other thing I thought about was making this rcparam a define-time check: if it is false, just return func.
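A define-time check of that kind might look like the sketch below. The `enabled` flag stands in for the rcParam and `make_unpack_decorator` is an invented name; none of this is the PR's actual code.

```python
import functools


def _lookup(data, value):
    # Replace only strings that are actual keys of `data`.
    if isinstance(value, str) and value in data:
        return data[value]
    return value


def make_unpack_decorator(enabled):
    """`enabled` plays the role of the rcParam, read once when the function is defined."""
    def decorator(func):
        if not enabled:
            # Feature off: hand back the original function, no wrapper, no 'data' kwarg.
            return func

        @functools.wraps(func)
        def wrapper(*args, data=None, **kwargs):
            if data is not None:
                args = tuple(_lookup(data, a) for a in args)
                kwargs = {k: _lookup(data, v) for k, v in kwargs.items()}
            return func(*args, **kwargs)
        return wrapper
    return decorator


def plot(x, y):
    return x, y


plain = make_unpack_decorator(False)(plot)
wrapped = make_unpack_decorator(True)(plot)
print(plain is plot)                                           # True
print(wrapped("t", "x", data={"t": [0, 1], "x": [2, 3]}))      # ([0, 1], [2, 3])
```

The trade-off is that with the feature disabled, passing `data=` fails with the ordinary "unexpected keyword argument" `TypeError` rather than a purpose-written message.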
I still don't see the use case for this. Under what circumstances would a user want to set the rcParam to False?
I am a bit nervous about adding this so close to release, if you are not worried I will get rid of this bit of complexity.
Try to grab `y.index` before returning `np.arange(len(y))`
If white list is provided, only try to replace those.
Not tested or used.
Where I have landed on all of these issues is:
The calculus on the last two points changes greatly if anyone else steps up to work on this. My goal here is to get an MVP of a labeled-data-aware API out the door with 1.5. I think dropping some of the safety (the input whitelisting) and convenience (artist label lookup) is worth it to get a version out where we are clear on what the limitations are (if you have a column named 'g' bad things might happen), so we can get it used and see where the limitations/pain points are. |
And to address @mwaskom's comment about long vs wide data: at this level of the API, I think we have to take wide data. There needs to be a layer built on top of this that will take the long data, do the selection/filtering/aggregation and call out to this layer with wide data. There is an interesting discussion that needs to happen about what that higher-level API should look like. |
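As a toy picture of what such a higher layer would have to do before calling the wide-data API, here is a pure-Python long-to-wide reshaping. The function `long_to_wide` and the record layout are assumptions made for the example; in practice a library such as pandas would do this step.

```python
def long_to_wide(rows, index, column, value):
    """Pivot long-form records into a dict of equal-length columns (illustrative only)."""
    index_values = []
    names = []
    cells = {}
    for row in rows:
        if row[index] not in index_values:
            index_values.append(row[index])
        if row[column] not in names:
            names.append(row[column])
        cells[(row[index], row[column])] = row[value]

    wide = {index: index_values}
    for name in names:
        # Missing combinations become None.
        wide[name] = [cells.get((i, name)) for i in index_values]
    return wide


rows = [
    {"t": 0, "var": "x", "val": 1.0},
    {"t": 0, "var": "y", "val": 2.0},
    {"t": 1, "var": "x", "val": 3.0},
    {"t": 1, "var": "y", "val": 4.0},
]
wide = long_to_wide(rows, "t", "var", "val")
print(wide)  # {'t': [0, 1], 'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```

The resulting `wide` mapping is the kind of object that could then be passed as `data=` with `'t'`, `'x'` and `'y'` as string keys.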
except AttributeError:
    y = np.atleast_1d(y)
    return np.arange(y.shape[0], dtype=float), y
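The hunk above is the fallback branch of the "try to grab `y.index` first" logic. A dependency-free sketch of the whole idea follows; `LabeledSeries` is a stand-in class invented here for anything that exposes `.index` and `.values`, and plain lists replace the NumPy arrays used in the real code.

```python
class LabeledSeries:
    """Tiny stand-in for an object with .index and .values (hypothetical)."""
    def __init__(self, index, values):
        self.index = list(index)
        self.values = list(values)


def index_of(y):
    """Return (x, y): use y's own index if it has one, else 0..len(y)-1 as floats."""
    try:
        index, values = y.index, y.values
    except AttributeError:
        # Plain sequences have no .values attribute, so we land here.
        values = list(y)
        return [float(i) for i in range(len(values))], values
    return list(index), list(values)


print(index_of([5, 6, 7]))                          # ([0.0, 1.0, 2.0], [5, 6, 7])
print(index_of(LabeledSeries(["a", "b"], [1, 2])))  # (['a', 'b'], [1, 2])
```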
Not sure how this will handle a dataframe with a MultiIndex
By the time the code path gets here it should be no bigger than a Series. Can you have a multi-index on a Series?
Most definitely. Not sure how to check the type without importing pandas. I guess you could import pandas inside the try block, but that's probably not desirable.
Calling series.index.values on a Series with a MultiIndex will return a 1d array of tuples.
👍 so excited about this! |
Responding to @phobson s inline comment: No |
        kwargs['label'] is None)):
    if len(args) > label_arg:
        try:
            kwargs['label'] = args[label_arg].name
Why not use @mwaskom's suggestion of using the text label instead of the .name attribute? That seems safer:
To be clear I'm not saying matplotlib needs to extract a .name attribute from a vector, just that if semantic names are used to draw the plot, they should end up as labels too.
.name will also work with xray, but the smaller we can make the labeled data spec, the better.
It should probably be changed to use either. I was trying to make this work for cases where the user is currently doing plt.plot(df['foo']), which, while cutting against my long message, is what I was thinking when I wrote this.
If it wasn't clear above, if anyone wants to take this and run with it (or start from scratch) go for it, I have no personal attachment to this code. If this PR mostly serves to annoy someone enough to do it right I will be happy 😄 . |
So, RDF schemas and representation formats (e.g. JSON-LD, CSVW) define metadata fields like 'rdfs:label' (@en) and 'schema:name'. pandas-dev/pandas#3402 "ENH: Linked Datasets (RDF)" For linked data, I don't see why there would be a need to create a different format for expressing this metadata. CSV -> arrays <- metadata (RDF, JSON-LD)
|
... vega visualization grammar also solves for axes labels: https://github.com/vega/vega/wiki |
About the kwarg whitelisting / automatic labeling. @tacaswell I certainly understand the careful approach of not doing too much in a first iteration. But simply using the provided string key as the This would help a lot in the following case. Suppose this example:
If there is no automatic labeling, you have to provide a |
@jorisvandenbossche Good catch re |
    pass
elif label_kwarg in kwargs:
    try:
        kwargs['label'] = args[label_kwarg].name
This should probably be s/args/kwargs/:
kwargs['label'] = kwargs[label_kwarg].name
Here is an alternative decorator, which uses inspect to get the names of arguments instead of needing to specify both position and names. [updated] If The main benefit is that you only need to specify three arguments: the "replace_names" list of args which should be replaced, the "label_namer" and, only if varargs are used, the full list of arguments. I find that this is easier to maintain than using both position and name in all cases. This version also uses both the label_namer value (aka
import functools
import six
import inspect

def _replacer(data, key):
    # if key isn't a string don't bother
    if not isinstance(key, six.string_types):
        return key
    # try to use __getitem__
    try:
        return data[key]
    # key does not exist, silently fall back to key
    except KeyError:
        return key
def unpack_labeled_data_names(replace_names=None, label_namer="x", full_argument_names=None):
    """
    A decorator to add a 'data' kwarg to any function. The signature
    of the input function must be ::

        def foo(ax, *args, **kwargs)

    so this is suitable for use with Axes methods.
    """
    if replace_names is not None:
        replace_names = set(replace_names)

    def _wanted(name):
        # no whitelist given: every argument is a candidate for replacement
        return replace_names is None or name in replace_names

    def param(func):
        # getfullargspec replaces getargspec, which was removed in Python 3.11
        arg_spec = inspect.getfullargspec(func)
        if ((arg_spec.varkw is None) and (arg_spec.varargs is None)):
            # remove the first "ax" arg
            arg_names = arg_spec.args[1:]
        else:
            # in this case we need a supplied list of arguments
            if full_argument_names is None:
                raise Exception("Wrapped function uses *args or **kwargs, need full_argument_names!")
            arg_names = full_argument_names[1:]
        if label_namer:
            if label_namer not in arg_names:
                raise Exception("label namer: no arg with name %s | %s" % (label_namer, arg_names))
            label_namer_pos = arg_names.index(label_namer)
        else:
            label_namer_pos = 9999  # bigger than all "possible" arg lists

        @functools.wraps(func)
        def inner(ax, *args, **kwargs):
            data = kwargs.pop('data', None)
            xlabel = None
            if data is not None:
                # save the current label_namer value so that it can be used as a label
                if label_namer_pos < len(args):
                    xlabel = args[label_namer_pos]
                else:
                    xlabel = kwargs.get(label_namer, None)
                if not isinstance(xlabel, six.string_types):
                    xlabel = None
                # An arg is replaced if the arg_name of that position is in replace_names
                try:
                    args = tuple(_replacer(data, a) if _wanted(arg_names[j]) else a
                                 for j, a in enumerate(args))
                except IndexError:
                    raise Exception("Got more args than function expects")
                kwargs = dict((k, _replacer(data, v) if _wanted(k) else v)
                              for k, v in six.iteritems(kwargs))
            # replace the label if this func has a label arg and the user didn't set one
            # (the original tested `index < len(args)` joined with `or`, which would
            # overwrite a label that was passed positionally)
            if (("label" in arg_names) and
                    (arg_names.index("label") >= len(args)) and  # not in args
                    ('label' not in kwargs or kwargs['label'] is None)):  # not in kwargs
                if label_namer_pos < len(args):
                    try:
                        kwargs['label'] = args[label_namer_pos].name
                    except AttributeError:
                        kwargs['label'] = xlabel
                elif label_namer in kwargs:
                    try:
                        kwargs['label'] = kwargs[label_namer].name
                    except AttributeError:
                        kwargs['label'] = xlabel
            return func(ax, *args, **kwargs)
        return inner
    return param
@unpack_labeled_data_names(replace_names=["x","y"])
def plot_func(ax, x, y, ls="x", label=None, w="xyz"):
    return "x: %s, y: %s, ls: %s, w: %s, label: %s" % (list(x),list(y),ls, w, label)

## or
@unpack_labeled_data_names(replace_names=["x","y"], full_argument_names=["ax", "x", "y", "ls", "label", "w"])
def plot_func(ax, *args, **kwargs):
    all_args = [None, None, "x", None, "xyz"]
    for i, v in enumerate(args):
        all_args[i] = v
    for i, k in enumerate(["x", "y", "ls", "label", "w"]):
        if k in kwargs:
            all_args[i] = kwargs[k]
    x, y, ls, label, w = all_args
    return "x: %s, y: %s, ls: %s, w: %s, label: %s" % (list(x),list(y),ls, w, label)
# Tests (work for both plot_func versions):
assert plot_func(None, "x","y") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: None"
assert plot_func(None, x="x",y="y") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: None"
assert plot_func(None, "x","y", label="") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: "
assert plot_func(None, "x","y", label="text") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: text"
assert plot_func(None, x="x",y="y", label="") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: "
assert plot_func(None, x="x",y="y", label="text") == "x: ['x'], y: ['y'], ls: x, w: xyz, label: text"
data = {"a":[1,2],"b":[8,9]}
assert plot_func(None, "a","b", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: a"
assert plot_func(None, x="a",y="b", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: a"
assert plot_func(None, "a","b", label="", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: "
assert plot_func(None, "a","b", label="text", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: text"
assert plot_func(None, x="a",y="b", label="", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: "
assert plot_func(None, x="a",y="b", label="text", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: text"
import pandas as pd
data = pd.DataFrame({"a":[1,2],"b":[8,9]})
assert plot_func(None, "a","b", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: a"
assert plot_func(None, x="a",y="b", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: a"
assert plot_func(None, "a","b", label="", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: "
assert plot_func(None, "a","b", label="text", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: text"
assert plot_func(None, x="a",y="b", label="", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: "
assert plot_func(None, x="a",y="b", label="text", data=data) == "x: [1, 2], y: [8, 9], ls: x, w: xyz, label: text"
|
I wrote an expanded version of my earlier comment on the mailing list: http://article.gmane.org/gmane.comp.python.matplotlib.devel/13643 The code I'm referring to is on a branch based on this one: https://github.com/jkseppan/matplotlib/commits/label-with-nonstrings |
@jkseppan That is very nice, but is out of scope for right now. The goal is to get an MVP out the door for 1.5 in a way that does not paint us into a corner. Having very clear edges of what we will be providing (limited to single-index string-labeled tables) is a feature. This is an interesting enough idea that I think we should not try to rush it in and should probably work closely with the pandas/xray folks to define how that is going to work. |
@JanSchulz That looks good. label_namer should probably default to |
@tacaswell The problem I'm trying to get at is that strings are overloaded. In the matplotlib API they can mean at least colors (with multiple different syntaxes), line styles, marker styles, and text. I would argue that using strings for yet another purpose is a way of painting ourselves into a corner. While my branch has a longish demo in the test case, it's just an example of what the user could do with the API. The one change I'd like to make to this PR is b4709b3, just the part that inverts the string check to check that the keys aren't numbers or anything unhashable. Or, if allowing any object is too much, we could provide an abstract base class whose descendants we allow in addition to strings:
|
It would be good to have a test of the new functionality. There's a beginning of a test in d186a80. |
@jkseppan – your approach is interesting, but I think it would be much better suited for an extension library than the core of matplotlib itself. I agree with @tacaswell on his initial approach, though it probably needs some whitelisting mechanism as well. |
In ggplot, |
I've put up a PR with my version of the decorator: #4829 |
Closing in favor of #4829 The above discussion has convinced me that whitelisting is essential and |
After discussions with Brian Granger, Fernando Perez, Peter Wang, Matthew Rocklin and Jake VanderPlas, this is a proposal for how to deal with labeled data in matplotlib.
The approach taken is that if the optional kwarg 'data' is passed in, any other string args/kwargs to the function are replaced by the result data[k] if the key exists in data; otherwise the value is left as-is.
Fernando made a compelling case that this needs to go in ASAP.
This still needs docs + tests + a bit more thought on how to deal with functions where we do some internal broadcasting (mostly plot). Maybe pass in names as a comma-separated list? I would prefer to, long term, simplify the low-level plot and have either the users do the looping or provide higher-level plotting functions which do the looping.
There is the possibility that some of the string args/kwargs we already take may conflict with names in the labeled data (e.g. ha='center' would not work with a data structure where 'center' in data).
@pzwang expressed concern that we may be painting ourselves into a corner with this API as it is mostly just the difference between
vs
The unpacking attempts can be disabled via an rcparam. That could also be implemented as an import-time rcparam which disables the decorator altogether.
This should work with any data object that supports __getitem__ and returns something that np.asarray works on.
attn @matplotlib/developers @jakevdp @fperez @mrocklin @ellisonbg @pzwang @mwaskom @jreback @andrewcollette
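The name-collision caveat mentioned in the description can be reproduced in a few lines. This is a sketch of the unrestricted "replace every string" rule only; `replace_all_strings` is a made-up helper, not a matplotlib function.

```python
def replace_all_strings(data, kwargs):
    """Unrestricted rule: any string value that is a key in `data` gets replaced."""
    return {k: (data[v] if isinstance(v, str) and v in data else v)
            for k, v in kwargs.items()}


data = {"x": [1, 2, 3], "center": [10, 20, 30]}
resolved = replace_all_strings(data, {"ha": "center", "color": "g"})

# 'center' was meant as a horizontal-alignment value, but it is also a column name,
# so it is silently swapped for the column contents.
print(resolved["ha"])     # [10, 20, 30]
print(resolved["color"])  # g
```

Restricting the lookup to a whitelist of data-carrying arguments, as argued elsewhere in the thread, avoids this for keywords such as `ha` and `color`.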