API: engine kw to .plot to enable selectable backends #14130
Comments
@jorisvandenbossche |
The way that So, AFAIK, to have something like this work, we would either have to implement the other engines in pandas as well (which means having more plotting-related code, not something we want?), or expect each engine to implement some kind of |
With mpl we have been working to better support pandas input natively to all of our plotting routines (the It is not too hard now to write dataframe aware functions that do mostly sensible things (ex) with matplotlib. I have a suspicion that if you started from scratch with mpl 1.5+, the mpl version of the pandas plotting code would be much shorter and clearer. My suggestion would be to pull the current pandas plotting code out into its own project and refactor it into functions that look like
def some_chart_type(df, optional=backend, independent=input, *, backend=dependent, keyword=args):
and use that as a reference implementation of the plotting API that backends need to expose to pandas for use in the plot accessor. |
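A minimal sketch of the signature convention suggested above, assuming a plain column mapping as a stand-in for a DataFrame. The function name `line_chart` and its parameters are hypothetical; the point is only that backend-independent inputs come first and backend-dependent styling is keyword-only.

```python
from typing import Any, Mapping, Optional, Sequence

# Stand-in for a DataFrame: column name -> values (an assumption for this sketch).
Table = Mapping[str, Sequence[Any]]


def line_chart(table: Table, x: Optional[str] = None, y: Optional[str] = None,
               *, color: Optional[str] = None,
               linewidth: Optional[float] = None) -> dict:
    """Hypothetical chart function: data selection first, styling keyword-only."""
    if y is None:
        raise ValueError("y must name a column")
    # Backend-independent part: pick the columns to draw.
    ys = list(table[y])
    xs = list(table[x]) if x is not None else list(range(len(ys)))
    # Backend-dependent part: only forward the styling that was actually given.
    style = {key: value
             for key, value in {"color": color, "linewidth": linewidth}.items()
             if value is not None}
    return {"kind": "line", "x": xs, "y": ys, "style": style}
```

Because everything after `*` is keyword-only, a call such as `line_chart(table, "t", "v", "red")` raises `TypeError`, which keeps styling arguments from being confused with column names.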
This may also be of interest to @santosjorge |
Some quick thoughts follow. Curious to hear others'.
Pandas Plotting
Goal: define a system for multiple backends (matplotlib, Bokeh, Plotly, Altair
Note: libraries can already achieve this end, to an extent, with
Overview of the implementation
(scatter and hexbin are DataFrame-only; the rest are also defined on
User API
A user-configurable
Would be the main point for users. Users would set this globally
Or use a context manager
Backend API
Now for the tough part.
Changes to Pandas
We'll refactor the current
So
class FramePlotMethods:
    def line(self, x=None, y=None, **kwds):
        backend = self.get_backend()
        # _data is the DataFrame calling .plot.line
        return backend.line(self._data, x=x, y=y, **kwds)
At that point, things are entirely up to the backend. The various backends would
Challenges
API consistency: How much should pandas care that backends accept similar keywords, behavior
Global State: Matplotlib has the notion of a "currently active figure", and some plotting
with pd.option_context('plotting.backend', 'bokeh'):
    df.plot()
with pd.option_context('plotting.backend', 'matplotlib'):
    df.plot()
# Any difference here?
with pd.option_context('plotting.backend', 'bokeh'):
    df.plot()
I don't think so (aside from the extra matplotlib plot; the bokeh plots would be
Fortunately for us, pandas messed this up terribly at some point, so that
registration I've been trying to improve pandas import time recently. Part of that involved
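Pandas doesn't want to try / except each of the backends known to have an implementation. Do we require users to import bokeh.pandas, which calls a register_backend? That seems not great from the user's standpoint, but maybe necessary? |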
Agreed with @tacaswell here that the current implementation should be moved to the plugin system I outlined above. That would be a good test case for what other backends would need. |
Personally I don't think it really makes sense to consider seaborn a "backend" for pandas plotting. Seaborn seems higher in the stack than pandas, relative to the other backends. Are there particular plotting functions you had in mind for delegating to? |
Agreed for the most part. We could implement That brings up another point: we would want to allow backends to implement additional methods, e.g. |
@TomAugspurger great summary! I agree with pretty much everything you write.
> Pandas doesn't want to try / except each of the backends known to have an implementation. Do we require users to import bokeh.pandas, which calls a register_backend? That seems not great from the user's standpoint, but maybe necessary?
There are basically three options:
1. pandas tries importing other packages
2. other packages import pandas, and register a plotting method
3. pandas is aware of other packages, so it can define a lazy importing stub. The actual implementation can be somewhere else.
1 is off the table for the reason you mention, and 2 is not attractive for the same reason (matplotlib doesn't want to import pandas, either, and needing to explicitly write import matplotlib.pandas is annoying).
My suggestion is that we do some variant of option 3. Some backends, e.g., matplotlib, might remain bundled in pandas for now, but in general it would be nice for backends to be de-coupled. So let's define a protocol of some sort based on the value of pandas.options.plotting.backend.
For example, we could try importing the module given by the string value of the backend, and then call backend._pandas_plot_(pandas_obj) as the equivalent to pandas_obj.plot. If the backend doesn't want a hard dependency on pandas, they can put their PandasPlotMethods subclass in a separate module that is imported inside their _pandas_plot_ function.
@mwaskom Agreed, I don't see Seaborn as a "backend" (and I don't think Tom does either, based on his post). |
Honestly, I would probably prefer to have pandas plotting retired, unless there are particular plots that other libraries (Altair, seaborn, bokeh, Matplotlib) can't do. If there are still some special things that these other libraries can't do, then it would probably be easier to just implement those things in those other libraries. But it totally depends on your philosophy about breaking APIs. I tend to lean towards breaking things to innovate, but I understand that not all libraries can do that...
--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
|
The two biggest attractions of dataframe plot accessors are a) discoverability b) easy swapping of backends (if you really want them to be interchangeable, you need someone (pandas) to own the API). |
FWIW I have been working on adding a few more "basic" plots to seaborn (mwaskom/seaborn#1285), which would help to fill a "higher-level, matplotlib-based" hole that would otherwise open up if pandas dropped plotting altogether. |
nice!
|
Re: retiring pandas plotting, I definitely disagree. I am personally excited and engaged by innovation, but some conversations last week at Strata reminded me that one tool, or style, will never suit all users and use cases. Some people prioritize absolute immediacy, simple expectations, and lack of friction. They just want to do Regarding the decoupling: I think pandas should own the API, and if people want to do something beyond that, they should look to using the native plotting APIs. That comports with my observation that people who most want
Pandas would own all of that. It requires a commitment from known backends to maintain the "real" registration function (that lives in the respective projects) in a stable place so that |
In general, it is brittle and painful to standardize architecture and extensibility around Python APIs. We have seen this many, many times in building different parts of Jupyter. The right way to do this is to build a declarative formal JSON schema that serves as the common exchange format and contract between pandas and different libraries which render visualizations. I would advocate for using Vega-Lite as that JSON schema, but that point is much less important than the bigger idea of using a JSON schema for this. Some of the benefits of this approach:
ping @rgbkrk who is an advocate of "JSON schema all the things" |
As we have found out and finally rectified after a long time with Bokeh, "JSON for everything" is inordinately slow for many use cases. I'm definitely not personally interested in expending very-limited bandwidth on a JSON-only solution. WRT difficulties around standardizing APIs, I am not certain the specific issues with Jupyter history generalize everywhere. |
@ellisonbg you bring up an interesting point, that if the Pandas devs want to "own plotting", then outputting a JSON-based visualization spec would be the most flexible and accurate approach to doing that. However, the roundtrip through JSON-land is nontrivial - not merely from a logical mapping perspective, but also from the perspective of performance. In the most common case, directly calling matplotlib on a large dataframe is extremely fast. Similarly, there's no reason why Datashader or Bokeh server can't also be similarly fast on large dataframes. However, round-tripping those datasets through an encode/decode process to JSON would be quite painful. (And that's not even considering the use cases of e.g. GeoPandas, with tons of shape geometry data.) My understanding is that the Pandas devs already have a plotting API on the plot object, namely, the ['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'line', 'pie', 'scatter'] methods, which define their expectations of the API that the plotting backends must adhere to. At that point, it's on the viz library developers to properly implement those functions. |
I spoke with @bryevdv about this at some length during the Strata conference. There would be a great benefit to standardizing on a flexible binary zero-copy protocol for moving data (and column types) from pandas to JS libraries. Apache Arrow is the obvious candidate for this task, as we can already emit Arrow binary streams from Python and receive them in JavaScript (though what's been implemented on the JS side as far as glue with other frameworks is very limited at the moment). We have some other invested parties who may be able to assist with some of the development work to make this easy for us to do (@trxcllnt, @lmeyerov, and others). The Arrow metadata is designed to accommodate user-defined types, so we could conceivably (with a bit of elbow grease) embed the geo data in an Arrow table column and send that as a first-class citizen. I am not sure what all would be required from here to make this work seamlessly, but to have a list of requirements and next steps would be useful and give the community a chance to get to work. |
Sorry I wasn't clear - I would keep the data separate and only specify the visual encodings, marks, etc in the JSON. The actual data transfer could be done with either arrow or full pandas data frames. The rendering libraries could deal with the combination of JSON viz spec + DataFrame
|
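A small sketch of the "JSON spec plus separate data" split described in the comment above. The key names (`mark`, `encoding`, `field`) are loosely modeled on Vega-Lite but are not validated against its schema; `build_spec` and `render` are hypothetical helpers, and a dict of lists stands in for the DataFrame.

```python
import json


def build_spec(mark, x, y):
    """Describe a chart as JSON; the data itself is deliberately left out."""
    spec = {
        "mark": mark,
        "encoding": {"x": {"field": x}, "y": {"field": y}},
    }
    return json.dumps(spec, sort_keys=True)


def render(columns, spec_json):
    """Toy renderer: check the spec against data that arrives separately."""
    spec = json.loads(spec_json)
    fields = [channel["field"] for channel in spec["encoding"].values()]
    missing = [name for name in fields if name not in columns]
    if missing:
        raise KeyError(f"spec refers to unknown columns: {missing}")
    n_rows = len(columns[fields[0]])
    return f"{spec['mark']} chart of {fields[1]} vs {fields[0]} ({n_rows} rows)"


columns = {"t": [0, 1, 2], "v": [5.0, 6.5, 7.1]}  # stand-in for a DataFrame
spec_json = build_spec("line", x="t", y="v")
summary = render(columns, spec_json)
```

Only the small spec string goes through JSON here; the column values never do, which is the point of keeping the two apart.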
To both Brians -- I may surprise you. I'd generally say that I'm a fan of format specifications. Versioned, specified in a manner that someone could write an implementation of the spec and convert it to other formats. At least in Jupyter, we need to be able to work solely with JSON because of the notebook format and messaging spec being a JSON based protocol (*). My primary care for specifications is to be able to support more than one language, which means being able to plot in non-python environments. I'd be more than happy if there were an agreed-upon binary format (even with arrow we can work with it on a web based frontend or serverside with node). (*) Caveat: kernels can send arbitrary binary blobs with messages, they're not well specced for use on the protocol though (they do get used by ipywidgets, since they intercept messages and send their own). Pandas luckily can return a standardized table schema thanks to @TomAugspurger and others, so I'm pretty happy in this regard for having something interoperable that isn't tied to a particular visualization. It ticks some basic boxes for the small cases that JSON formats are totally fine for. |
@rgbkrk heads up, our Arrow lib is now part of the official Apache/Arrow project. The package name on npm will stay the same, intending to release 0.1.3 in the next few days 🎉 |
niiiiice |
I've started a (super hacky) version of this over at master...TomAugspurger:plotting-plugin For engine authors, there's a base class where you can override the various |
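For readers who want a picture of what "a base class where you override the methods" could look like, here is a rough sketch. It is not the code from the linked branch; the class names, the `kinds` tuple, and the toy `TextEngine` are all hypothetical.

```python
class BasePlotMethods:
    """Hypothetical base class: pandas owns the method names, engines fill them in."""

    kinds = ("line", "bar", "scatter")

    def __init__(self, data):
        self._data = data

    def __call__(self, kind="line", **kwds):
        # Dispatch df.plot(kind=...) style calls to the matching method.
        if kind not in self.kinds:
            raise ValueError(f"{kind!r} is not a valid plot kind")
        return getattr(self, kind)(**kwds)

    def line(self, **kwds):
        raise NotImplementedError(f"{type(self).__name__} does not implement line")

    def bar(self, **kwds):
        raise NotImplementedError(f"{type(self).__name__} does not implement bar")

    def scatter(self, **kwds):
        raise NotImplementedError(f"{type(self).__name__} does not implement scatter")


class TextEngine(BasePlotMethods):
    """Toy engine that overrides only the line plot."""

    def line(self, x=None, y=None, **kwds):
        return f"line plot of {y} against {x}"
```

An engine that implements only part of the API still fails loudly, with `NotImplementedError`, on the kinds it has not overridden.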
@TomAugspurger:
Challenges:
|
@rs2 yeah that sounds about right (I wouldn't call them callbacks though, and it won't be using Any library wishing to take over |
Could pandas also provide some helper functions for down-selecting the data frames to just the columns of interest / doing aggregations? I think it would also make sense for the API to provide a semantic set of inputs (ex Can this be py3 only so we can use required keyword arguments? @ellisonbg I don't see a big difference between a python api with fixed kwargs and a json schema which embeds the function name as one of the keys. If you need it in json format, it should be up to the plotting library to do that translation and export as json if required. |
@tacaswell - great question. A few things we have observed in building things like this:
However, if there isn't support for a JSON schema based approach, I would love to see this be python3 only so at least the python api can be strongly typed and use required kw args. |
def do_boxplot(data, **kwargs):
    spec = build_my_json_of_boxplot_and_validate(kwargs)
    return data, spec
For those of us with native python plotting libraries (well, fine me ;) ), this seems natural to restrict the json related things to the json libraries. It also lets you deal with any schema differences between different JSON based plotting libraries in python and give libraries a chance to do any data-preprocessing before exporting.
Fundamentally I think the two approaches are functionally equivalent (I'm less worried about static typing because I render in the same process in python so I get nice error messages rather than it rendering who knows where in a browser that happily eats all exceptions 😈 ). Expressing the API with a schema is reasonable (and auto-generating the pandas side of the API?), but I am not convinced that the value add is worth the effort of just writing the API to begin with. If this goes the JSON route mpl will just write functions that look like
def do_boxplot(data, json_kwargs):
    json_kwargs = json.loads(json_kwargs)  # ok, I may be being pedantic here
    return really_do_boxplot(data, **json_kwargs)
but it seems odd to me to run an API for python libraries to talk to each other through JSON. I am 100% on board with this being python3 only 👍
I should also be clear, I very much like JSON / JSON schema in general, I am just not convinced that it is the right thing to do in this case.
@TomAugspurger Have you considered using a |
@jreback @datapythonista I can't get it to work:
pd.set_option('plotting.backend', 'plotly.plotly')
df.plot(x='created_at', y='updated_at', kind='scatter')
gives me
and
pd.set_option('plotting.backend', 'plotly')
df.plot(x='created_at', y='updated_at', kind='scatter')
gives me
There is no doc on how it works, so I'm stuck here. Tested with pandas 0.25.0 against plotly 3.10.0 and plotly 4.0.0 |
Thanks @flavianh for reporting. Unfortunately it's not feasible to plot with arbitrary libraries, we can just plot with libraries that implement our interface. Plotly has plans to work on it, but I think the development hasn't started yet, I guess it will take a few months. There is work being done in hvplot and altair to make these libraries compatible with the new API, but that is not available now. Afaik the only library you can use at the moment is the latest version of: https://github.com/PatrikHlobil/Pandas-Bokeh I don't think it is as mature as the matplotlib plotting we provide, and I wouldn't use it in production code, but I think it should be helpful for interactive plots in a notebook. What would be very useful is if you can open a pull request to clarify all this in the documentation you were following. So, we don't mislead other users as we did with you. Thank you in advance for it! |
I think I read the changelog which mentions this very issue. I may have fast-read through the issue and thought that I could try plotly. I looked around in the documentation and it seems quite clear, especially here: "plotting.backend : str The plotting backend to use. The default value is “matplotlib”, the backend provided with pandas. Other backends can be specified by providing the name of the module that implements the backend. [default: matplotlib] [currently: matplotlib]". |
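The option text quoted above says a backend is named by the module that implements it. A stdlib-only sketch of that lookup follows. It assumes, and this thread does not spell it out, that the module is expected to expose a top-level `plot(data, kind, **kwargs)` function; `plot_with_backend` and `demo_backend` are hypothetical names, and this is not pandas' own code.

```python
import importlib
import sys
import types


def plot_with_backend(data, backend, kind="line", **kwargs):
    """Resolve a backend by module name and hand the call to its plot()."""
    module = importlib.import_module(backend)
    if not hasattr(module, "plot"):
        raise ValueError(
            f"module {backend!r} does not define a top-level plot function"
        )
    return module.plot(data, kind=kind, **kwargs)


# Hypothetical backend module, created in memory so the example is self-contained.
demo_backend = types.ModuleType("demo_backend")


def _demo_plot(data, kind, **kwargs):
    return f"{kind} plot of {len(data)} points"


demo_backend.plot = _demo_plot
sys.modules["demo_backend"] = demo_backend

result = plot_with_backend([1, 2, 3], "demo_backend", kind="scatter")
```

Under that assumption, an importable module without such a function (the stdlib `json` module, for instance) is rejected with an error instead of producing a plot, so naming an arbitrary library is not enough on its own.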
We have had some conversations in the past regarding an .plot(....., engine=) kw. in #8018
This would allow pandas to redirect the plotting to a user selectable back-end, keeping matplotlib as the default.
see chartpy here, for a way to selectively enable matplotlib, bokeh and plotly.
and generically via altair.
Missing from here is when to re-direct to seaborn.
So this issue is for discussion:
pandas-plot)?
and this actually should be the default (rather than matplotlib), which is of course the dependency. This might allow simply removing the vast majority of the custom plotting code.