PERF: Implement date_range for business day in terms of daily #16463


Open
TomAugspurger opened this issue May 23, 2017 · 11 comments
Labels: Datetime, Performance

Comments

@TomAugspurger (Contributor) commented May 23, 2017

Code Sample, a copy-pastable example if possible

@dsm054 noticed that business-day date_range performance is slow relative to daily.

These two are equivalent:

In [23]: %timeit pd.date_range("1956-01-31", "2017-05-16", freq="B")
278 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %timeit i2 = pd.date_range("1956-01-31", "2017-05-16", freq="D"); i2=i2[i2.dayofweek < 5]
1.44 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If we do this, we'll have to special-case the handling when periods is passed, over-generating the daily range by enough to cover the weekends that get dropped. But that's a pretty nice performance boost for a pretty common operation.
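A minimal sketch of the idea (a hypothetical helper, not the actual implementation):

import pandas as pd

def bdate_range_fast(start=None, end=None, periods=None):
    # Hypothetical fast path: build a daily range, then drop weekends.
    if periods is not None and end is None:
        # Over-generate: 7 calendar days contain at most 5 business days;
        # add a week of slack for wherever `start` falls.
        n_daily = periods * 7 // 5 + 7
        daily = pd.date_range(start=start, periods=n_daily, freq="D")
        return daily[daily.dayofweek < 5][:periods]
    daily = pd.date_range(start=start, end=end, freq="D")
    return daily[daily.dayofweek < 5]

With start/end given, the mask alone reproduces the freq="B" result; with only periods, the over-generation bounds how many daily stamps we need before truncating.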

@TomAugspurger added the Difficulty Intermediate, Performance, and Datetime labels May 23, 2017
@TomAugspurger added this to the Next Major Release milestone May 23, 2017
@dsm054 (Contributor) commented May 23, 2017

pd.date_range is just a convenience wrapper around DatetimeIndex, so the same problem affects resamples to "B" as well: building the binner calls DatetimeIndex along the way. There will be lots of secondary benefits to fixing this.
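For example, an ordinary daily-to-business resample builds exactly such a binner (illustrative snippet):

import numpy as np
import pandas as pd

# Resampling to "B" constructs a business-day DatetimeIndex for the bins,
# which is where the slow path gets hit.
idx = pd.date_range("1956-01-31", "2017-05-16", freq="D")
s = pd.Series(np.random.randn(len(idx)), index=idx)
result = s.resample("B").mean()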

@chris-b1 (Contributor)

xref #11214, a slightly more generic issue. In general, any kind of bin generation coarser than D is not performant right now.

@dsm054 (Contributor) commented May 23, 2017

@chris-b1: wow, when you're right, you're right.

This is all kinds of crazysauce.

In [58]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min')
10 loops, best of 3: 40.4 ms per loop

In [59]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M')
10 loops, best of 3: 34.9 ms per loop

In [60]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min'))
Out[60]: 10745281

In [61]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M'))
Out[61]: 736

That's a wild ratio.

I think it should be possible to provide a vectorized (not even cythonized) version of each offset that is just a filter on daily, both anchored and not, without too much trouble. But even if not, we should at least fix the big ones. I grepped through just one of my codebases and found hundreds of D->B resamples on Series or few-column frames where this is apparently the bottleneck. :-/
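For instance, the anchored cases reduce to the same kind of mask (a sketch; dayofweek runs Mon=0 through Sun=6):

import pandas as pd

i = pd.date_range("1956-01-31", "2017-05-16", freq="D")

bday = i[i.dayofweek < 5]    # unanchored business day
w_wed = i[i.dayofweek == 2]  # anchored weekly, W-WED: every Wednesday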

@rohanp (Contributor) commented May 23, 2017

I can try giving this a go!

@jreback (Contributor) commented May 23, 2017

@chris-b1 this is actually not that surprising; the fixed-frequency ('3min') case is just an np.arange under the hood.
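That is, for a fixed ("Tick") frequency the stamps come out of one vectorized call, roughly like this (a sketch of the idea, not the actual internals):

import numpy as np
import pandas as pd

start = pd.Timestamp("1956-01-31").value  # ns since the epoch
end = pd.Timestamp("2017-05-16").value
step = 3 * 60 * 1_000_000_000             # 3min in nanoseconds

# One vectorized shot, no per-element Python calls.
idx = pd.to_datetime(np.arange(start, end + 1, step))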

@rohanp (Contributor) commented May 24, 2017

From what I can tell, cythonizing generate_range from offsets.py would solve this problem?

@dsm054 (Contributor) commented May 24, 2017

I think that's where the slowness lives, but I don't know how much of it is due to the Python-level looping in generate_range versus the Python-level operations in the offset's .apply that get called at each step. I suspect we're only getting a few milliseconds of overhead from the Python generator itself, in which case we'd need to cythonize a lot of other code too to see benefits.
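One rough way to apportion the blame (a sketch: roughly 16,000 business days fall in the range above, and each generator step does one Python-level offset addition, which is what .apply amounts to):

import timeit
import pandas as pd
from pandas.tseries.offsets import BDay

ts = pd.Timestamp("2017-05-16")
n_steps = 16_000  # about the number of business days in the OP's range

# Per-element offset arithmetic, i.e. the work done at each generator step.
step_cost = timeit.timeit(lambda: ts + BDay(), number=n_steps)

# The full index build, for comparison.
full_cost = timeit.timeit(
    lambda: pd.date_range("1956-01-31", "2017-05-16", freq="B"), number=1
)
print(f"offset steps: {step_cost:.3f}s, full build: {full_cost:.3f}s")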

Whether that's a better direction than just taking advantage of preexisting fast code, I don't know.

@chris-b1 (Contributor)

Yeah, @dsm054 is right: there is a whole chain of pure-Python stuff behind the offsets, so there's not much benefit to cythonizing the existing code without a major rewrite.

Much easier would be to define some fastpaths for "easy" offsets, using existing vectorized datetime methods, as suggested at the top:

i = pd.date_range('2010', '2015', freq='D')

bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]  # etc.

The hardest part is probably getting the periods/start/end logic correct; a sketch of one way to organize the dispatch follows.
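Something like this (hypothetical names; only the start/end case, since the periods case needs extra logic to over-generate before truncating):

import pandas as pd

# Hypothetical: map "easy" freq aliases to vectorized masks on a daily index.
_DAILY_MASKS = {
    "B": lambda i: i.dayofweek < 5,
    "M": lambda i: i.is_month_end,
    "MS": lambda i: i.is_month_start,
}

def fast_date_range(start, end, freq):
    daily = pd.date_range(start, end, freq="D")
    return daily[_DAILY_MASKS[freq](daily)]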

@dsm054 (Contributor) commented May 25, 2017

@rohanp: are you still interested in looking at this even though it may not involve cython? ;-) If not, I have a prototype that looks promising, although my testing of it has been pretty shallow so far (and, as usual, it'll probably take longer to test than to implement...)

@rohanp (Contributor) commented May 25, 2017 via email

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel (Member)

generate_range and the relevant DateOffset.apply code have all been moved into cython, so some of this has improved. Generating with date_range(..., freq="D") and then masking as in the OP is fine for "B", where you're not discarding much; it would get expensive as you discard more. Instead, I think we can now use period_range:

In [3]: %timeit z = pd.date_range(start='1956-01-31', end='2017-05-16', freq='M')
5.46 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit pd.period_range(start="1956-01-31", end="2017-05-16", freq="M").to_timestamp()
839 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
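The same trick should cover the "B" case from the OP (a sketch; to_timestamp() defaults to the start of each period, which for business-day periods is the day's stamp itself, though business-day Periods have since been deprecated in later pandas versions):

import pandas as pd

bdays = pd.period_range(start="1956-01-31", end="2017-05-16", freq="B").to_timestamp()

# Expected to match the existing code path (worth checking explicitly).
expected = pd.date_range("1956-01-31", "2017-05-16", freq="B")
assert bdays.equals(expected)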
