
Implement fast Cython Series iterator, for speeding up DataFrame.apply #309


Closed
wesm opened this issue Oct 31, 2011 · 7 comments


wesm commented Oct 31, 2011

Having tons of calls to Series.__new__ seriously degrades performance because most of the logic isn't necessary. Could play tricks in Cython with the data pointers to avoid this.

@natekupp

Hey Wes - any way I can help on this? I just ran into this on my own, then came here and found your open issue. Some code that demonstrates the performance issue:

import time

import numpy as np
import pandas

data = pandas.DataFrame(np.random.random((10000, 100)))
fn = lambda x: len(np.unique(x)) > 100

start = time.time()
data.apply(fn, axis=0)
print(time.time() - start)

start = time.time()
np.apply_along_axis(fn, 0, data)
print(time.time() - start)

## Output
4.69282603264
0.103554964066

My use case is similar to the above example, so it'd be great to close the performance gap between DataFrame.apply and np.apply_along_axis. From your comment I'm guessing that, at the moment, DataFrame.apply calls Series.__new__ for every Series in the DataFrame? Thanks!
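The gap described here can be seen as pure wrapper overhead: the default path constructs a Series for every column before calling the function, while np.apply_along_axis hands the function bare ndarray slices. A minimal sketch of the equivalence (the sizes mirror the benchmark above; both paths should agree on the result):

```python
import numpy as np
import pandas as pd

arr = np.random.random((10000, 100))
df = pd.DataFrame(arr)
fn = lambda x: len(np.unique(x)) > 100

# Default path: each column is wrapped in a Series before fn is called,
# which is where the per-call construction cost comes from.
default_result = df.apply(fn, axis=0)

# ndarray path: fn receives plain array slices, no Series construction.
raw_result = np.apply_along_axis(fn, 0, arr)

# Both paths compute the same answer; only the wrapper cost differs.
assert (default_result.values == raw_result).all()
```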


wesm commented Nov 12, 2011

Low-hanging fruit would be an option in apply that calls np.apply_along_axis. The reason it doesn't already is that apply by default assumes each slice is a Series, whereas in your case that may not be strictly necessary.


wesm commented Nov 12, 2011

maybe like

df.apply(f, axis=0, raw=True)
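That raw option did land in pandas: in current versions, raw=True passes each slice to the function as a plain ndarray instead of a Series, which is exactly the shortcut proposed here. A short sketch, assuming a recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10000, 100)))
fn = lambda x: len(np.unique(x)) > 100

# raw=True hands fn a bare ndarray per column, skipping Series
# construction entirely; the result is the same as the default path.
res_raw = df.apply(fn, axis=0, raw=True)
res_default = df.apply(fn, axis=0)
assert res_raw.equals(res_default)
```

raw=True only helps when the function works on plain arrays; if it relies on Series features like the index, the default path is required.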


wesm commented Nov 13, 2011

What version of pandas are you using? I fixed a performance problem that was causing np.unique to be very slow.

@natekupp

I was on the latest version from PyPI. I just installed from GitHub source and it looks much better:

## Output
0.256111860275
0.103078842163

Thanks!


wesm commented Nov 13, 2011

OK, I made some further tweaks, so apply now actually beats apply_along_axis by quite a bit in the axis=1 case with your example (most of the time is spent calling unique in the axis=0 case):

In [6]: timeit data.apply(fn, axis=1, raw=True)
1 loops, best of 3: 288 ms per loop

In [7]: timeit data.apply(fn, axis=0, raw=True)
10 loops, best of 3: 82 ms per loop

In [8]: timeit np.apply_along_axis(fn, 1, data.values)
1 loops, best of 3: 518 ms per loop

In [9]: timeit np.apply_along_axis(fn, 0, data.values)
10 loops, best of 3: 82.7 ms per loop

@wesm wesm closed this as completed Nov 13, 2011
@natekupp

Thanks Wes!
