Skip to content

Performance regression in gil.ParallelDatetimeFields.time_period_to_datetime #33919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue May 1, 2020 · 5 comments · Fixed by #35171
Closed

Performance regression in gil.ParallelDatetimeFields.time_period_to_datetime #33919

TomAugspurger opened this issue May 1, 2020 · 5 comments · Fixed by #35171
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@TomAugspurger
Copy link
Contributor

import pandas as pd
import numpy as np
from pandas._testing import test_parallel

N = 10 ** 6
dti = pd.date_range("1900-01-01", periods=N, freq="T")
period = dti.to_period("D")



@test_parallel(num_threads=2)
def run(period):
    period.to_timestamp()

%timeit run(period)
# 1.0.2
96.8 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
129 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

https://pandas.pydata.org/speed/pandas/index.html#gil.ParallelDatetimeFields.time_period_to_datetime?commits=d106b81ce532bc71ec6cced944ddb751a4b0e5a3-577de1c6b5cda7f5ae0e4832c2bc3f97ca186e9b points to d106b81...577de1c, perhaps #33491 or #33047 (cc @jbrockmendel)

@TomAugspurger TomAugspurger added Datetime Datetime data dtype Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels May 1, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone May 1, 2020
@jbrockmendel
Copy link
Member

using %prun -s cumtime for n in range(100): period.to_timestamp() the only thing that stands out is periodarr_to_dt64. It's plausible that #33491 is responsible.

@TomAugspurger
Copy link
Contributor Author

One note: my code above used test_parallel, but the single-threaded version is also slower:

# 1.0.4
%timeit period.to_timestamp()
53 ms ± 740 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
In [10]: %timeit period.to_timestamp()
71.1 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the current implementation is likely just slower.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 7, 2020

It is periodarr_to_dt64 that's slower

In [1]: import numpy as np
   ...: from pandas._libs.tslibs import period as libperiod
   ...:
   ...: i8vals = np.arange(0, 100_000)
   ...: %timeit libperiod.periodarr_to_dt64arr(i8vals, 6000)
   ...:
   ...:
# 1.0.4
3.74 ms ± 73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
5.35 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Any thoughts on how to progress @jbrockmendel? I'm not familiar with how to profile C/Cython code.

@jbrockmendel
Copy link
Member

I've got a couple of ideas that might optimize this a bit, will give it a go. If those don't work, then I think we're back to reverting #33491

@jbrockmendel
Copy link
Member

yep, 2.44ms on the example above, will make a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants