Speed up DatetimeConverter for plotting #6636
Comments
I think they have the same effect, so this should be good. @TomAugspurger @agijsberts pls do a pull-request and we can get this in. I think we may need to manually validate that the graphs are correct, as we don't do comparison graphs per se (more of a validation that they plot and that the returned objects are 'ok'). pls add a vbench for this as well. good catch |
https://github.com/pydata/pandas/wiki section on how-to do the PR |
Applying your change didn't seem to break any tests so this should be good. @jreback do you know if there will be any problems on 32-bit systems? @agijsberts a pull request would be great for this. Let me know if you have any trouble. |
@agijsberts yep let's give a try on this |
Sorry, things are moving a bit slow since it's my first time preparing a PR (setting up git, virtualenv etc.). I'm manually checking the correctness of the plots at the moment (at least w.r.t. the current implementation). Expect a PR later today. Note by the way that pandas' plotting functions (e.g., DataFrame.plot) do not benefit from this patch, as they do other trickery with time axes. The patch is therefore unfortunately only helpful in use-cases where matplotlib's functions are used directly with the time index. |
@agijsberts can you elaborate on this point, e.g. about DataFrame.plot not using this new speedup? |
@jreback The function As far as I can tell, My 2 cents: the optimal and unifying solution would be for MPL to support numpy's datetime64, which would ideally also allow plotting on the scale of nanoseconds. There doesn't seem to be any ongoing work in that direction though (matplotlib/matplotlib#1097). |
ok, since it looks like MPL is somewhat behind the curve here: a datetime to period index conversion is fast, so is there a problem, speed or otherwise? aside from this inelegance (which prob has a reason behind it - haven't delved into the plotting code) - can u see why this is done? and does it warrant change (eg unify to one or the other)? |
My guess is that the two converters came from different backgrounds: the PeriodConverter and plotting code seem to originate from scikits.timeseries, while DatetimeConverter comes instead from MPL's plot_date function and dates.DateFormatter. Conversion with PeriodConverter is actually quite fast, but DataFrame.plot does quite a bit more than just converting and plotting. For instance, conversion is done dynamically when zooming/panning. Though a nice feature, it does seem to make zooming and panning somewhat laggy for large data (say >1M points). I do not think it really warrants change, as both approaches (DataFrame.plot vs pyplot.plot) seem to have clear benefits. The former is easier and more feature rich, while the latter is simply faster (and obviously gives more low-level control to the user). In this context, the DatetimeConverter is actually just a small convenience for those users who prefer to use matplotlib directly. |
ok that sounds good would u put a small PR together to put something like that into the plotting docs (not sure exactly where) so a user would know when it 'pays' to go to a low-level approach |
Just prepared PR #6660 . Is this more or less what you had in mind? |
I've recently started using pandas (impressed so far!) and found that plotting large data (from around 100k samples) is quite slow. I traced the bottleneck to the _dt_to_float_ordinal helper function called by DatetimeConverter (https://github.com/pydata/pandas/blob/master/pandas/tseries/converter.py#L144).
More specifically, this function uses matplotlib's date2num, which converts arrays and iterables using a slow list comprehension. Since pandas seems to natively store datetimes as epoch+nanoseconds in an int64 array, it would be much faster to use matplotlib's vectorized epoch2num instead. In a testcase with 1 million points, using epoch2num is about 100 times faster than date2num:
Unfortunately, I am not familiar enough with pandas to know whether date2num is used intentionally or to implement a proper patch myself that works in all cases.
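To illustrate why the epoch-based route can be purely arithmetic, here is a minimal standard-library sketch. It is not the pandas or matplotlib implementation, and the helper names `per_object_ordinal` and `epoch_ns_to_ordinal` are made up for this example. It assumes the day-ordinal convention in which day 1 is 0001-01-01, and it only checks that both routes give the same float for one naive timestamp. With NumPy, the same arithmetic could be applied to a whole int64 array in one step instead of once per datetime object.

```python
from datetime import datetime

# Ordinal of 1970-01-01 in the proleptic Gregorian calendar (day 1 = 0001-01-01).
EPOCH_ORDINAL = datetime(1970, 1, 1).toordinal()
NS_PER_DAY = 86_400 * 10**9


def per_object_ordinal(dt):
    """Slow route (hypothetical helper): convert one naive datetime object."""
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds_into_day = (dt - midnight).total_seconds()
    return dt.toordinal() + seconds_into_day / 86_400


def epoch_ns_to_ordinal(ns):
    """Fast route (hypothetical helper): arithmetic on nanoseconds since the Unix epoch."""
    return EPOCH_ORDINAL + ns / NS_PER_DAY


# One sample timestamp, expressed both as an object and as epoch nanoseconds.
stamp = datetime(2014, 3, 14, 12, 0, 0)
stamp_ns = int((stamp - datetime(1970, 1, 1)).total_seconds()) * 10**9

slow = per_object_ordinal(stamp)
fast = epoch_ns_to_ordinal(stamp_ns)
```

The slow route has to touch every datetime object individually, while the fast route needs only one division and one addition per value, which is the kind of operation that vectorizes well.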