
Implement fast Cython Series iterator, for speeding up DataFrame.apply #309


Closed
wesm opened this issue Oct 31, 2011 · 7 comments


wesm commented Oct 31, 2011

Having tons of calls to Series.__new__ seriously degrades performance because most of the logic isn't necessary. Could play tricks in Cython with the data pointers to avoid this.

@natekupp

Hey Wes - any way I can help on this? I just ran into this on my own, then came here and found your open issue. Some code that demonstrates the performance issue:

import time

import numpy as np
import pandas

data = pandas.DataFrame(np.random.random((10000, 100)))
fn = lambda x: len(np.unique(x)) > 100

start = time.time()
data.apply(fn, axis=0)
print(time.time() - start)

start = time.time()
np.apply_along_axis(fn, 0, data)
print(time.time() - start)

## Output
4.69282603264
0.103554964066

My use case is similar to the above example, so it'd be great to close the performance gap between DataFrame.apply and np.apply_along_axis. From your comment I'm guessing that, at the moment, DataFrame.apply calls Series.__new__ for every Series in the DataFrame? Thanks!
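The gap described here can be seen as pure wrapper overhead: the default path constructs a Series for every column before calling the function, while np.apply_along_axis hands the function bare ndarray slices. A minimal sketch of the equivalence (the sizes mirror the benchmark above; both paths should agree on the result):

```python
import numpy as np
import pandas as pd

arr = np.random.random((10000, 100))
df = pd.DataFrame(arr)
fn = lambda x: len(np.unique(x)) > 100

# Default path: each column is wrapped in a Series before fn is called,
# which is where the per-call construction cost comes from.
default_result = df.apply(fn, axis=0)

# ndarray path: fn receives plain array slices, no Series construction.
raw_result = np.apply_along_axis(fn, 0, arr)

# Both paths compute the same answer; only the wrapper cost differs.
assert (default_result.values == raw_result).all()
```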


wesm commented Nov 12, 2011

Low-hanging fruit would be an option in apply that calls np.apply_along_axis. The reason it doesn't already is that apply by default assumes each slice is a Series, whereas in your case that may not be strictly necessary.


wesm commented Nov 12, 2011

maybe like

df.apply(f, axis=0, raw=True)
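That raw option did land in pandas: in current versions, raw=True passes each slice to the function as a plain ndarray instead of a Series, which is exactly the shortcut proposed here. A short sketch, assuming a recent pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10000, 100)))
fn = lambda x: len(np.unique(x)) > 100

# raw=True hands fn a bare ndarray per column, skipping Series
# construction entirely; the result is the same as the default path.
res_raw = df.apply(fn, axis=0, raw=True)
res_default = df.apply(fn, axis=0)
assert res_raw.equals(res_default)
```

raw=True only helps when the function works on plain arrays; if it relies on Series features like the index, the default path is required.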


wesm commented Nov 13, 2011

What version of pandas are you using? I fixed a performance problem that was causing np.unique to be very slow.

@natekupp

I was on the latest version from PyPI. I just installed from GitHub source and it looks much better:

## Output
0.256111860275
0.103078842163

Thanks!


wesm commented Nov 13, 2011

OK, I made some further tweaks, so apply now actually beats apply_along_axis by quite a bit in the axis=1 case with your example (most of the time is spent calling unique in the axis=0 case):

In [6]: timeit data.apply(fn, axis=1, raw=True)
1 loops, best of 3: 288 ms per loop

In [7]: timeit data.apply(fn, axis=0, raw=True)
10 loops, best of 3: 82 ms per loop

In [8]: timeit np.apply_along_axis(fn, 1, data.values)
1 loops, best of 3: 518 ms per loop

In [9]: timeit np.apply_along_axis(fn, 0, data.values)
10 loops, best of 3: 82.7 ms per loop

@wesm wesm closed this as completed Nov 13, 2011
@natekupp

Thanks Wes!
