Performance issue with timeseries plotting on py3? #11831

Closed
jorisvandenbossche opened this issue Dec 12, 2015 · 14 comments
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Visualization plotting
Comments

@jorisvandenbossche
Member

I noticed a performance issue with plotting timeseries. After trying different environments (different pandas, matplotlib and Python versions), the problem appears to be specific to Python 3, with up to a 10x slowdown compared to Python 2.7:

Python 2 and pandas 0.16.2 and 0.17.1:

In [2]: sys.version
Out[2]: '2.7.10 |Anaconda 1.7.0 (64-bit)| (default, Oct 21 2015, 19:35:23) [MSC v.1500 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.16.2'

In [4]: matplotlib.__version__
Out[4]: '1.4.3'

In [6]: N = 2000

In [7]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [8]: %timeit df.plot()
1 loops, best of 3: 228 ms per loop

In [10]: df = pd.DataFrame(np.random.randn(N, 5))

In [11]: %timeit df.plot()
10 loops, best of 3: 110 ms per loop

In [1]: import sys

In [2]: sys.version
Out[2]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: u'0.17.1'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 269 ms per loop

In [8]: df = pd.DataFrame(np.random.randn(N, 5))

In [9]: %timeit df.plot()
10 loops, best of 3: 139 ms per loop

With python 3, pandas 0.16.2 and 0.17.1:

In [2]: sys.version
Out[2]: '3.5.0 |Continuum Analytics, Inc.| (default, Dec  1 2015, 11:46:22) [MSC v.1900 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.16.2'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975',  periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 1.02 s per loop

In [9]: df = pd.DataFrame(np.random.randn(N, 5))

In [10]: %timeit df.plot()
10 loops, best of 3: 143 ms per loop

In [1]: import sys

In [2]: sys.version
Out[2]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [3]: pd.__version__
Out[3]: '0.17.1'

In [4]: matplotlib.__version__
Out[4]: '1.5.0'

In [5]: N = 2000

In [6]: df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975', periods=N))

In [7]: %timeit df.plot()
1 loops, best of 3: 2.37 s per loop    <------------ !!!! 10x slower than on py2.7

In [8]: df = pd.DataFrame(np.random.randn(N, 5))

In [9]: %timeit df.plot()
10 loops, best of 3: 132 ms per loop
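
The timings above come from IPython's %timeit, which reports the best of 3 runs. Outside IPython, a comparable measurement could be sketched with the standard-library timeit module. This is only an illustration: `best_of` and `workload` are hypothetical names, and `workload` is a stand-in that would be replaced by the real `df.plot()` call in an environment with pandas and matplotlib installed.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the best (smallest) time per call, in seconds, over `repeat` runs."""
    # timeit.repeat returns one total time per run; each run makes `number` calls.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


def workload():
    # Hypothetical stand-in for df.plot(); swap in the real call to
    # reproduce the measurements from this issue.
    return sum(i * i for i in range(10000))


if __name__ == "__main__":
    seconds = best_of(workload, repeat=3, number=10)
    print("best of 3: {:.3f} ms per loop".format(seconds * 1e3))
```

Running the same script under both interpreters, with the same pandas and matplotlib versions, keeps the comparison limited to the Python version.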
@jreback
Contributor

jreback commented Dec 12, 2015

this is the period index conversion, which IIRC you fixed?

@jorisvandenbossche
Member Author

You are referring to #11194 I think. This fixed a perf regression before 0.17.0 was released (introduced between 0.16.2 and 0.17.0). Do you see a possible reason that this fix would not work on py3?

@RolandRitt

Has this issue been fixed or are there workarounds?
I have the same problem (Python 3 is about 16 times slower).

Are things moving on this issue?

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Visualization plotting labels Jan 7, 2016
@jorisvandenbossche
Member Author

@rittmeister123 Not that I know of (I haven't had time myself to dive into it).

If you want to try to profile, to see where the slowdown is coming from, that would be very welcome!

@jorisvandenbossche jorisvandenbossche added this to the 0.18.0 milestone Jan 7, 2016
@RolandRitt

I'm new here (my first post :) ), so please excuse any formatting issues or other mistakes.

@jorisvandenbossche I did some profiling with the example from above:

My Setup for Python 3:

In [1]: sys.version
Out[1]: '3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 15:00:12) [MSC v.1900 64 bit (AMD64)]'

In[2]: pd.__version__
Out[2]: '0.17.1'

In[3]: matplotlib.__version__
Out[3]: '1.5.0'

In[4]: np.__version__
Out[4]: '1.10.1'

My Setup for Python 2:

In [1]: sys.version
Out[1]: '2.7.11 |Continuum Analytics, Inc.| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]'

In[2]: pd.__version__
Out[2]: u'0.17.1'

In[3]: matplotlib.__version__
Out[3]: '1.5.0'

In[4]: np.__version__
Out[4]: '1.10.1'

The dataframe is generated with:

In [1]: N = 2000
        df = pd.DataFrame(np.random.randn(N, 5), index=pd.date_range('1/1/1975', periods=N))

Timeit on python2:

In [1]: %timeit df.plot()
Out[1]: 1 loops, best of 3: 111 ms per loop

Timeit on python3:

In [1]: %timeit df.plot()
Out[1]: 1 loops, best of 3: 1.01 s per loop

Attached you can find the Profiling files generated with

In[1]: %prun -D python2_df_plot.prof df.plot()

and

In[1]: %prun -D python3_df_plot.prof df.plot()

profiling_pandas_plot.zip

Please have a look at them, since I have no experience with profiling.
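
Files written by `%prun -D` are in the format produced by the profiler's `dump_stats`, so they can be read with the standard-library pstats module. A minimal sketch of how such a file could be inspected follows. The helper names and the `workload` stand-in are hypothetical; the sketch generates its own small profile so that it is self-contained, and a real session would point `top_functions` at `python3_df_plot.prof` instead. Profile files are not guaranteed to load across Python versions, so it is safest to read each one with the interpreter that produced it.

```python
import cProfile
import os
import pstats
import tempfile


def workload():
    # Hypothetical stand-in for df.plot(); any callable is profiled the same way.
    return sorted(str(i) for i in range(20000))


def profile_to_file(func, path):
    """Profile a single call of `func` and write the raw stats to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    profiler.dump_stats(path)


def top_functions(path, limit=10):
    """Print the `limit` entries with the largest cumulative time."""
    stats = pstats.Stats(path)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


if __name__ == "__main__":
    prof_path = os.path.join(tempfile.mkdtemp(), "example.prof")
    profile_to_file(workload, prof_path)
    top_functions(prof_path)
```

Sorting by cumulative time tends to surface the high-level call that dominates the run, which is a reasonable first place to compare the two interpreters.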

@jreback
Contributor

jreback commented Jan 30, 2016

@rittmeister123 any luck with this?

@RolandRitt

Unfortunately not!
I'm quite new to Python and have not had time to take a close look at what's going on or to read the profiling output.

@jreback
Contributor

jreback commented Feb 10, 2016

@jorisvandenbossche any thoughts on this?

@RolandRitt

FYI:
A few days ago I tried plotting a dataframe with the use_index parameter set to False and noticed a speed-up, but I haven't had time to verify this properly.

@jorisvandenbossche
Member Author

I did some quick profiling, and one contributing factor is in any case the difference in performance of PeriodIndex._mpl_repr(), which is just a call to PeriodIndex._get_object_array:

In [8]: sys.version
Out[8]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [9]: pd.__version__
Out[9]: '0.17.1'

In [10]: pidx = pd.period_range('1975-01-01', periods=2000)

In [11]: %timeit pidx._mpl_repr()
1 loops, best of 3: 461 ms per loop

vs

In [6]: sys.version
Out[6]: '2.7.11 |Anaconda 1.7.0 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]'

In [7]: pd.__version__
Out[7]: '0.16.2'

In [8]: pidx = pd.period_range('1975-01-01', periods=2000)

In [9]: %timeit pidx._mpl_repr()
100 loops, best of 3: 5.25 ms per loop

In turn, this boils down to calls to Period._from_ordinal():

In [12]: sys.version
Out[12]: '3.5.0 |Anaconda 2.4.0 (64-bit)| (default, Nov  7 2015, 13:15:24) [MSC v.1900 64 bit (AMD64)]'

In [13]: pd.__version__
Out[13]: '0.17.1'

In [14]: %timeit pd.Period._from_ordinal(ordinal=1, freq='D')
1000 loops, best of 3: 476 µs per loop

vs

In [6]: sys.version
Out[6]: '2.7.11 |Anaconda 1.7.0 (64-bit)| (default, Jan 29 2016, 14:26:21) [MSC v.1500 64 bit (AMD64)]'

In [7]: pd.__version__
Out[7]: '0.17.1+315.g62363d2'

In [8]: %timeit pd.Period._from_ordinal(ordinal=1, freq='D')
10000 loops, best of 3: 42.1 µs per loop

Now, what would cause the dramatic difference in performance of the _from_ordinal method between Python 2 and 3 is still a mystery to me (and I also don't have time to look into it further).
@jreback any idea?

@blbradley you did some work on the Period cython code. Do you by any chance have an idea where this performance difference could be coming from?

@jreback
Contributor

jreback commented Feb 11, 2016

I think https://github.com/pydata/pandas/blob/master/pandas/src/period.pyx#L658

needs this instead of doing the string interpretation each time

if isinstance(freq, offsets.DateOffset):
    return freq

also the import can be replaced by offsets.to_offset (below)

as when _mpl_repr is called an already constructed freq object is passed (and not the string as in the case above).

There may be something else going on in the actual to_offset to explain the py2/3 diff though (as string conversion should be similar perf)

@jorisvandenbossche
Member Author

I actually timed the wrong thing, as the freq is already an offset object in the case of plotting:

# python 3
In [24]: %timeit pd.Period._from_ordinal(ordinal=1, freq=pidx.freq)
1000 loops, best of 3: 233 µs per loop

In [25]: type(pidx.freq)
Out[25]: pandas.tseries.offsets.Day

vs python 2:

In [11]: %timeit pd.Period._from_ordinal(ordinal=1, freq=pidx.freq)
The slowest run took 6.19 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 5.5 µs per loop

which makes the difference even larger
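
As a rough consistency check, the per-call timings can be scaled up to the 2000 periods used in the example. The sketch below assumes one `_from_ordinal` call per element of the index, which is an assumption and not something the thread states explicitly; the numbers are copied from the timings in this thread.

```python
n_periods = 2000            # length of pidx in the timings above
per_call_py3 = 233e-6       # seconds per Period._from_ordinal call on Python 3
per_call_py2 = 5.5e-6       # seconds per call on Python 2

# Estimated time to build all Period objects, one call per element.
estimated_py3 = n_periods * per_call_py3
estimated_py2 = n_periods * per_call_py2
slowdown = per_call_py3 / per_call_py2

print("Python 3 estimate: {:.0f} ms".format(estimated_py3 * 1e3))
print("Python 2 estimate: {:.0f} ms".format(estimated_py2 * 1e3))
print("per-call slowdown: {:.0f}x".format(slowdown))
```

The Python 3 estimate comes to about 466 ms, close to the 461 ms measured for `pidx._mpl_repr()` on Python 3, which suggests the per-element construction accounts for most of that call.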

@jorisvandenbossche
Member Author

Strange, I don't see a difference in performance of to_offset, while there is in _maybe_convert_freq:

python 3:

In [43]: %timeit pd.Period._maybe_convert_freq(pidx.freq)
1000 loops, best of 3: 200 µs per loop

In [44]: %timeit to_offset(pidx.freq)
The slowest run took 10.28 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 479 ns per loop

vs python 2:

In [19]:  %timeit pd.Period._maybe_convert_freq(pidx.freq)
The slowest run took 6.87 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 4.42 µs per loop

In [20]: %timeit to_offset(pidx.freq)
The slowest run took 13.08 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 502 ns per loop

@jreback
Contributor

jreback commented Feb 11, 2016

try removing the import in _maybe_convert_freq (use offsets.to_offset instead)
would still add the check for DateOffset as it's doing extra work
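
One way to test whether an import inside a function contributes to per-call cost is to time the same function with the import at function level and at module level. The sketch below does this with the standard-library `fractions` module as a stand-in for the pandas import, so it says nothing about the size of the effect inside the compiled `_maybe_convert_freq`; the thread does not confirm that the import is the cause. The function names are hypothetical.

```python
import timeit
from fractions import Fraction as _module_level_fraction


def convert_with_local_import(value):
    # The import statement executes on every call. After the first call it
    # is a module cache lookup plus a name binding rather than a full load.
    from fractions import Fraction
    return Fraction(value)


def convert_with_module_import(value):
    # The name was bound once, when this module was loaded.
    return _module_level_fraction(value)


if __name__ == "__main__":
    number = 100000
    for func in (convert_with_local_import, convert_with_module_import):
        best = min(timeit.repeat(lambda: func(3), repeat=3, number=number))
        print("{}: {:.3f} us per call".format(func.__name__, best / number * 1e6))
```

Running the script under both Python 2 and Python 3 would show whether the function-level import behaves differently between the two interpreters in this simplified setting.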
