
PERF: compare freq strings (timeseries plotting perf) #11194

Merged: 1 commit merged into pandas-dev:master on Oct 1, 2015

Conversation

jorisvandenbossche
Member

See overview in #11084

@jorisvandenbossche added the Datetime, Visualization, and Performance labels Sep 26, 2015
@jorisvandenbossche added this to the 0.17.0 milestone Sep 26, 2015
@jorisvandenbossche
Member Author

@sinhrks You thought this was OK, right?

@sinhrks
Member

sinhrks commented Sep 27, 2015

Thanks for this. I think we should call freq = Period._maybe_convert_freq(freq).freqstr at the top of extract_ordinals to standardize the freq string, because the following are all valid constructions.

import pandas as pd

idx = pd.period_range('2011-01', periods=3, freq='A')
idx.asobject.values
# array([Period('2011', 'A-DEC'), Period('2012', 'A-DEC'), Period('2013', 'A-DEC')], dtype=object)

idx.freqstr
# 'A-DEC'

# freq = 'A'
pd.PeriodIndex(idx.asobject.values, freq='A')

# freq = <YearEnd: month=12>
pd.PeriodIndex(idx.asobject.values, freq=idx.freq)
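
The idea of standardizing the freq before comparing can be illustrated without pandas internals. What follows is a minimal sketch, not the code in this PR: the alias table, normalize_freq, same_freq and FakeYearEnd are all hypothetical names, standing in for what Period._maybe_convert_freq(freq).freqstr is described as providing above, namely one canonical string per frequency so that 'A' and 'A-DEC' compare equal.

```python
# Hypothetical alias table: short spellings mapped to a canonical freq string.
# This is NOT pandas' real table, only enough entries to show the idea.
_CANONICAL = {
    "A": "A-DEC",
    "Q": "Q-DEC",
    "W": "W-SUN",
}


def normalize_freq(freq):
    """Return a canonical freq string for a string or an offset-like object."""
    # Offset-like objects are assumed to expose a ``freqstr`` attribute;
    # plain strings fall through unchanged.
    freqstr = getattr(freq, "freqstr", freq)
    return _CANONICAL.get(freqstr, freqstr)


class FakeYearEnd:
    """Hypothetical stand-in for an offset object such as <YearEnd: month=12>."""

    freqstr = "A-DEC"


def same_freq(left, right):
    # Normalize both sides first, then do a cheap string comparison.
    return normalize_freq(left) == normalize_freq(right)


print(same_freq("A", "A-DEC"))        # True
print(same_freq(FakeYearEnd(), "A"))  # True
print(same_freq("A", "M"))            # False
```

Normalizing once at the top of a function keeps every later comparison a plain string equality check, whichever spelling the caller passed in.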

@jreback
Contributor

jreback commented Sep 27, 2015

@jorisvandenbossche if you can update

freq attribute changed to return Offset object instead of string,
making this comparison much slower
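
To see why comparing objects instead of strings can matter in a hot loop, here is a rough sketch. SlowOffset is a hypothetical class, loosely standing in for an offset object whose equality check runs Python code on every call; it is not pandas code, and the timings it prints say nothing about pandas itself.

```python
import timeit


class SlowOffset:
    """Hypothetical offset-like object; not a pandas class."""

    def __init__(self, name, month):
        self.name = name
        self.month = month
        # Precomputed string form, analogous in spirit to a freq string.
        self.freqstr = "{}-{}".format(name, month)

    def __eq__(self, other):
        # A type check plus building two tuples on every comparison.
        if not isinstance(other, SlowOffset):
            return NotImplemented
        return (self.name, self.month) == (other.name, other.month)

    def __hash__(self):
        return hash((self.name, self.month))


def count_mismatches_by_object(freqs, target):
    # One Python-level __eq__ call per element.
    return sum(1 for f in freqs if not (f == target))


def count_mismatches_by_string(freqs, target):
    # Read the target's string once, then compare plain strings.
    target_str = target.freqstr
    return sum(1 for f in freqs if f.freqstr != target_str)


freqs = [SlowOffset("A", "DEC") for _ in range(1000)] + [SlowOffset("A", "JAN")]
target = SlowOffset("A", "DEC")

# Both strategies must agree on the answer; only their cost differs.
print(count_mismatches_by_object(freqs, target))  # 1
print(count_mismatches_by_string(freqs, target))  # 1

# Timings vary by machine, so no expected numbers are shown here.
print(timeit.timeit(lambda: count_mismatches_by_object(freqs, target), number=50))
print(timeit.timeit(lambda: count_mismatches_by_string(freqs, target), number=50))
```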
@jorisvandenbossche
Member Author

@sinhrks Thanks for the catch. It won't be needed for the plotting case, I suppose, but in general the examples you give are indeed valid. I should probably add some tests for these, as they were not caught by the test suite.

@jreback
Contributor

jreback commented Sep 30, 2015

@jorisvandenbossche which benchmark does this close?

@jorisvandenbossche
Member Author

At least the plotting.plot_timeseries_period.time_plot_timeseries_period one (didn't test if it would improve for some of the others as well)

@jreback
Contributor

jreback commented Sep 30, 2015

this doesn't seem to fix that one?

and it's using date_range, not period_range (but same either way)

@jorisvandenbossche
Member Author

Did you test with this branch?

import numpy as np
import pandas as pd

N = 2000
M = 5
df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

With pandas 0.16.2:

In [18]: %timeit df.plot()
1 loops, best of 3: 196 ms per loop

With 0.17.0rc:

In [18]: %timeit df.plot()
1 loops, best of 3: 778 ms per loop

With this branch:

In [18]: %timeit df.plot()
1 loops, best of 3: 235 ms per loop

So this does not fully restore the previous performance, but it at least removes the big slowdown.

It indeed uses date_range and not period_range, but the time series plotting machinery converts the datetimes internally to periods for plotting.
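
Outside IPython, a comparison like the one above can be approximated with the standard library. This is a sketch only: best_of is a hypothetical helper that mimics the "best of 3" figure reported by %timeit, and workload is a placeholder for where df.plot() would go.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the best (smallest) per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def workload():
    # Placeholder workload; in the thread above this would be ``df.plot()``.
    return sum(i * i for i in range(10000))


elapsed = best_of(workload)
print("best of 3: {:.3f} ms per loop".format(elapsed * 1e3))
```

Taking the minimum rather than the mean reduces the influence of one-off noise such as other processes competing for the CPU, which is useful when comparing two builds on the same machine.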

@jreback
Contributor

jreback commented Oct 1, 2015

In [1]: N = 2000

In [2]: M = 5

In [3]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [4]: %timeit df.plot()
1 loops, best of 3: 451 ms per loop

In [5]: pd.__version__
Out[5]: '0.17.0rc1+119.g9707214'

In [1]: pd.__version__
Out[1]: u'0.17.0rc1'

In [2]: N = 2000

In [3]: M = 5

In [4]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [5]: %timeit df.plot()
1 loops, best of 3: 438 ms per loop

In [7]: import matplotlib

In [8]: matplotlib.__version__
Out[8]: '1.4.3'

@jreback
Contributor

jreback commented Oct 1, 2015

In [6]: import matplotlib

In [7]: matplotlib.__version__
Out[7]: '1.5.0rc1'

In [1]: M = 5

In [2]: N = 2000

In [3]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [4]: %timeit df.plot()
1 loops, best of 3: 499 ms per loop

In [5]: 2015-10-01 06:50:53.161 python[25071:8630508] setCanCycle: is deprecated.  Please use setCollectionBehavior instead
2015-10-01 06:50:53.163 python[25071:8630508] setCanCycle: is deprecated.  Please use setCollectionBehavior instead
2015-10-01 06:50:53.164 python[25071:8630508] setCanCycle: is deprecated.  Please use setCollectionBehavior instead
2015-10-01 06:50:53.166 python[25071:8630508] setCanCycle: is deprecated.  Please use setCollectionBehavior instead

In [5]: pd.__version__
Out[5]: '0.17.0rc1+119.g9707214'

@TomAugspurger can we get rid of this deprecated function?

@jorisvandenbossche
Member Author

@jreback Strange... I just rebuilt both versions (rc1 and this branch) and tested again:

In [1]: N = 2000

In [2]: M = 5

In [3]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [4]: %timeit df.plot()
1 loops, best of 3: 734 ms per loop

In [5]: pd.__version__
Out[6]: '0.17.0rc1'

In [1]: N = 2000

In [2]: M = 5

In [3]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [4]: %timeit df.plot()
1 loops, best of 3: 217 ms per loop

In [5]: pd.__version__
Out[5]: '0.17.0rc1+111.ga6379b6'

Did you test it with this branch or with current master? I notice you have a different commit hash in the version string.
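
When comparing timings across builds, it can help to check the version strings programmatically. The sketch below assumes the strings in this thread follow the pattern release, then '+', then the number of commits since that release, then '.g' and a short commit hash; that reading of the format is an assumption, and parse_version and same_build are hypothetical helpers, not part of pandas.

```python
import re

# Pattern for version strings of the form '<release>+<distance>.g<short hash>',
# as seen in the outputs above (e.g. '0.17.0rc1+111.ga6379b6').
_DEV_VERSION = re.compile(
    r"^(?P<release>[^+]+)\+(?P<distance>\d+)\.g(?P<sha>[0-9a-f]+)$"
)


def parse_version(version):
    """Split a version string into (release, commits since release, short hash).

    A plain release such as '0.17.0rc1' yields (release, 0, None).
    """
    match = _DEV_VERSION.match(version)
    if match is None:
        return version, 0, None
    return match.group("release"), int(match.group("distance")), match.group("sha")


def same_build(version_a, version_b):
    # Two builds are treated as the same only if every parsed field agrees.
    return parse_version(version_a) == parse_version(version_b)


print(parse_version("0.17.0rc1+111.ga6379b6"))
# ('0.17.0rc1', 111, 'a6379b6')
print(same_build("0.17.0rc1+111.ga6379b6", "0.17.0rc1+119.g9707214"))
# False
```

This only compares the reported strings. It cannot tell whether compiled extensions were rebuilt for the checked-out commit.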

@jreback
Contributor

jreback commented Oct 1, 2015

ok cardinal sin of not rebuilding the extensions....i must have had an older version...

thanks!

In [1]: M = 5

In [2]: N = 2000

In [3]: df = pd.DataFrame(np.random.randn(N, M), index=pd.date_range('1/1/1975', periods=N))

In [4]: pd.__version__
Out[4]: '0.17.0rc1+118.g8ecec2d'

In [5]: %timeit df.plot()
1 loops, best of 3: 167 ms per loop

jreback added a commit that referenced this pull request Oct 1, 2015
PERF: compare freq strings (timeseries plotting perf)
@jreback merged commit 6ab626f into pandas-dev:master Oct 1, 2015
@jreback mentioned this pull request Oct 1, 2015
13 tasks
3 participants