PERF: Implement date_range for business day in terms of daily #16463


Open
TomAugspurger opened this issue May 23, 2017 · 11 comments
Labels: Datetime, Performance

Comments

@TomAugspurger (Contributor) commented May 23, 2017

Code Sample, a copy-pastable example if possible

@dsm054 noticed that business-day date_range performance is slow relative to daily.

These two are equivalent:

In [23]: %timeit pd.date_range("1956-01-31", "2017-05-16", freq="B")
278 ms ± 4.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [24]: %timeit i2 = pd.date_range("1956-01-31", "2017-05-16", freq="D"); i2=i2[i2.dayofweek < 5]
1.44 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If we do this, we'll have to special-case the handling when periods is passed, over-generating the daily range by enough to cover the weekends that get dropped. But that's a pretty nice performance boost for a pretty common operation.
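A minimal sketch of the idea (a hypothetical helper, not the actual implementation):

import pandas as pd

def bdate_range_fast(start=None, end=None, periods=None):
    # Hypothetical fast path: build a daily range, then drop weekends.
    if periods is not None and end is None:
        # Over-generate: 7 calendar days contain at most 5 business days;
        # add a week of slack for wherever `start` falls.
        n_daily = periods * 7 // 5 + 7
        daily = pd.date_range(start=start, periods=n_daily, freq="D")
        return daily[daily.dayofweek < 5][:periods]
    daily = pd.date_range(start=start, end=end, freq="D")
    return daily[daily.dayofweek < 5]

With start/end given, the mask alone reproduces the freq="B" result; with only periods, the over-generation bounds how many daily stamps we need before truncating.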

@TomAugspurger added the Difficulty Intermediate, Performance, and Datetime labels May 23, 2017
@TomAugspurger added this to the Next Major Release milestone May 23, 2017
@dsm054 (Contributor) commented May 23, 2017

pd.date_range is just a convenience wrapper around DatetimeIndex, so the same problem affects resamples to "B" as well: building the binner calls DatetimeIndex along the way. There will be lots of secondary benefits to fixing this.
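For example, an ordinary daily-to-business resample builds exactly such a binner (illustrative snippet):

import numpy as np
import pandas as pd

# Resampling to "B" constructs a business-day DatetimeIndex for the bins,
# which is where the slow path gets hit.
idx = pd.date_range("1956-01-31", "2017-05-16", freq="D")
s = pd.Series(np.random.randn(len(idx)), index=idx)
result = s.resample("B").mean()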

@chris-b1 (Contributor)

xref #11214, a slightly more generic issue. In general, any kind of bin generation coarser than D is not performant right now.

@dsm054 (Contributor) commented May 23, 2017

@chris-b1: wow, when you're right, you're right.

This is all kinds of crazysauce.

In [58]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min')
10 loops, best of 3: 40.4 ms per loop

In [59]: %timeit z = pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M')
10 loops, best of 3: 34.9 ms per loop

In [60]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='3min'))
Out[60]: 10745281

In [61]: len(pd.DatetimeIndex(start='1956-01-31', end='2017-05-16', freq='M'))
Out[61]: 736

That's a wild ratio.

I think it should be possible to provide a vectorized (not even cythonized) version of each offset that is just a filter on daily, both anchored and not, without too much trouble. But even if not, we should at least fix the big ones. I grepped through just one of my codebases and found hundreds of D->B resamples on Series or few-column frames where this is apparently the bottleneck. :-/
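For instance, the anchored cases reduce to the same kind of mask (a sketch; dayofweek runs Mon=0 through Sun=6):

import pandas as pd

i = pd.date_range("1956-01-31", "2017-05-16", freq="D")

bday = i[i.dayofweek < 5]    # unanchored business day
w_wed = i[i.dayofweek == 2]  # anchored weekly, W-WED: every Wednesday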

@rohanp (Contributor) commented May 23, 2017

I can try giving this a go!

@jreback (Contributor) commented May 23, 2017

@chris-b1 this is actually not that surprising; the fixed-frequency ('3min') case is just an np.arange under the hood.
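That is, for a fixed ("Tick") frequency the stamps come out of one vectorized call, roughly like this (a sketch of the idea, not the actual internals):

import numpy as np
import pandas as pd

start = pd.Timestamp("1956-01-31").value  # ns since the epoch
end = pd.Timestamp("2017-05-16").value
step = 3 * 60 * 1_000_000_000             # 3min in nanoseconds

# One vectorized shot, no per-element Python calls.
idx = pd.to_datetime(np.arange(start, end + 1, step))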

@rohanp (Contributor) commented May 24, 2017

From what I can tell, cythonizing generate_range from offsets.py would solve this problem?

@dsm054 (Contributor) commented May 24, 2017

I think that's where the slowness lives, but I don't know how much of it is due to the Python-level looping in generate_range versus the Python-level operations in the offset's .apply that get called at each step. I suspect we're only getting a few milliseconds of overhead from the Python generator itself, in which case we'd need to cythonize a lot of other code too to see benefits.
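One rough way to apportion the blame (a sketch: roughly 16,000 business days fall in the range above, and each generator step does one Python-level offset addition, which is what .apply amounts to):

import timeit
import pandas as pd
from pandas.tseries.offsets import BDay

ts = pd.Timestamp("2017-05-16")
n_steps = 16_000  # about the number of business days in the OP's range

# Per-element offset arithmetic, i.e. the work done at each generator step.
step_cost = timeit.timeit(lambda: ts + BDay(), number=n_steps)

# The full index build, for comparison.
full_cost = timeit.timeit(
    lambda: pd.date_range("1956-01-31", "2017-05-16", freq="B"), number=1
)
print(f"offset steps: {step_cost:.3f}s, full build: {full_cost:.3f}s")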

Whether that's a better direction than just taking advantage of preexisting fast code, I don't know.

@chris-b1 (Contributor)

Yeah, @dsm054 is right: there is a whole chain of pure-Python stuff behind the offsets, so there's not much benefit to cythonizing the existing code without a major rewrite.

Much easier would be to define some fastpaths for "easy" offsets, using existing vectorized datetime methods, as suggested at the top:

i = pd.date_range('2010', '2015', freq='D')

bday = i[i.dayofweek < 5]
month_end = i[i.is_month_end]
month_begin = i[i.is_month_start]  # etc.

The hardest part is probably getting the periods/start/end logic correct; a sketch of one way to organize the dispatch follows.
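Something like this (hypothetical names; only the start/end case, since the periods case needs extra logic to over-generate before truncating):

import pandas as pd

# Hypothetical: map "easy" freq aliases to vectorized masks on a daily index.
_DAILY_MASKS = {
    "B": lambda i: i.dayofweek < 5,
    "M": lambda i: i.is_month_end,
    "MS": lambda i: i.is_month_start,
}

def fast_date_range(start, end, freq):
    daily = pd.date_range(start, end, freq="D")
    return daily[_DAILY_MASKS[freq](daily)]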

@dsm054 (Contributor) commented May 25, 2017

@rohanp: are you still interested in looking at this even though it may not involve cython? ;-) If not, I have a prototype that looks promising, although my testing of it has been pretty shallow so far (and, as usual, it'll probably take longer to test than to implement...)

@rohanp (Contributor) commented May 25, 2017 via email

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel (Member)

generate_range and the relevant DateOffset.apply code have all been moved into cython, so some of this has improved. Generating with date_range(..., freq="D") and then masking as in the OP is fine for "B", where you're not discarding much; it would get expensive as you discard more. Instead, I think we can now use period_range:

In [3]: %timeit z = pd.date_range(start='1956-01-31', end='2017-05-16', freq='M')
5.46 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit pd.period_range(start="1956-01-31", end="2017-05-16", freq="M").to_timestamp()
839 µs ± 7.72 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
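The same trick should cover the "B" case from the OP (a sketch; to_timestamp() defaults to the start of each period, which for business-day periods is the day's stamp itself, though business-day Periods have since been deprecated in later pandas versions):

import pandas as pd

bdays = pd.period_range(start="1956-01-31", end="2017-05-16", freq="B").to_timestamp()

# Expected to match the existing code path (worth checking explicitly).
expected = pd.date_range("1956-01-31", "2017-05-16", freq="B")
assert bdays.equals(expected)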
