performance of DataFrame.corrwith vs raw apply #5671

Closed
rosnfeld opened this issue Dec 9, 2013 · 8 comments
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance

@rosnfeld
Contributor

rosnfeld commented Dec 9, 2013

For the following example I see corrwith performing considerably more slowly than doing a "raw" apply of np.corrcoef, with the tip code as of 2013-12-08:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(4096.))
df = pd.DataFrame({i: i + np.arange(3360.) for i in range(4096)})

%timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 942 ms per loop

%timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 1.64 s per loop

From comments on #5654 and the forums I would not have expected this to be slower, but it seems corrwith is just turning around and doing an "apply" itself, with some extra conditioning that seems to add significant overhead.
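A scaled-down version of the comparison above can be sketched as follows. The sizes (6 rows by 16 columns) are an assumption chosen only to keep the check fast; the two calls are the ones timed above. It only confirms that both approaches return the same values on clean data and makes no claim about timings.

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for the 3360 x 4096 example (sizes are an assumption)
n_rows, n_cols = 6, 16
s = pd.Series(np.arange(float(n_cols)))
df = pd.DataFrame({i: i + np.arange(float(n_rows)) for i in range(n_cols)})

# "Raw" approach: apply np.corrcoef to each row against the series values
raw = df.apply(lambda x: np.corrcoef(x.values, s.values)[0, 1], axis=1)

# pandas approach: correlate each row with the series, aligning on labels
via_pandas = df.corrwith(s, axis=1)

print(np.allclose(raw, via_pandas))
```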

@rosnfeld
Contributor Author

rosnfeld commented Dec 9, 2013

(I should also note that this extends to much larger cases than this toy example; I first found it while trying to migrate much larger real-world code to corrwith.)

@jreback
Contributor

jreback commented Dec 9, 2013

When a frame is correlated with a series, corrwith does a direct apply (as you mentioned).

If you create a repeated frame of the series and align it, you will see much better performance.

Alignment is generally the responsibility of the user; in this case, however, it should be done inside the function (rather than passing to apply). Would appreciate a PR for this. Thanks.

In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [27]: df2 = pd.DataFrame({i:s for i in df.columns})

In [29]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4096 entries, 0 to 4095
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [30]: left,right=df.align(df2,join='inner')

timings

In [32]: %timeit -n 3 left.corrwith(right)
3 loops, best of 3: 767 ms per loop

In [33]: %timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 930 ms per loop

In [34]: %timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 2 s per loop

@rosnfeld
Contributor Author

Hmm, if I'm understanding you correctly, I think "df2" should contain one copy of "s" per row, not per column:

df2 = pd.DataFrame({i:s for i in range(len(df))}).T

as I'm trying to take the correlation of each 4096-point row with a given "signature" Series (and there can be a variable number of rows).

In [81]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [82]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [83]: left, right = df.align(df2, join='inner')

Unfortunately, I am then not observing any performance benefit from the frame.corrwith(frame) approach (and we also haven't counted the cost of setting up df2, left, and right).

In [89]: %timeit -n 3 left.corrwith(right, axis=1)
3 loops, best of 3: 1.59 s per loop

In [90]: %timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 733 ms per loop

In [91]: %timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 1.62 s per loop
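The row-replicated frame described above can be sketched at a small scale as follows. Two things here are assumptions and not taken from the thread: the sizes, and the use of np.tile in place of the dict-plus-transpose construction, chosen only because it builds the same one-copy-of-s-per-row frame directly. The sketch checks that the frame-to-frame and frame-to-series calls agree; it says nothing about speed.

```python
import numpy as np
import pandas as pd

# Scaled-down stand-in for the 3360 x 4096 example (sizes are an assumption)
n_rows, n_cols = 6, 16
s = pd.Series(np.arange(float(n_cols)))
df = pd.DataFrame({i: i + np.arange(float(n_rows)) for i in range(n_cols)})

# One copy of s per row, reusing df's row and column labels
df2 = pd.DataFrame(np.tile(s.values, (n_rows, 1)), index=df.index, columns=df.columns)

# Align explicitly, then correlate row by row
left, right = df.align(df2, join="inner")
frame_result = left.corrwith(right, axis=1)

# The frame-to-series call from the original report, for comparison
series_result = df.corrwith(s, axis=1)

print(np.allclose(frame_result, series_result))
```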

@jreback
Contributor

jreback commented Dec 13, 2013

OK, would welcome a PR on this; you might be able to just change corrwith to do the apply the way you have it, for certain cases.
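One way to picture a "fast path for certain cases" is the sketch below. It is not the pandas implementation: rowwise_corr and corrwith_rows are hypothetical helper names, and the fast path uses a vectorized numpy formula instead of an apply of np.corrcoef. It assumes the fast path is only safe when the column labels already match the series index and no NaN is present; otherwise it falls back to corrwith.

```python
import numpy as np
import pandas as pd


def rowwise_corr(values, vec):
    """Pearson correlation of each row of a 2-D array with a 1-D vector (no NaN handling)."""
    row_dem = values - values.mean(axis=1, keepdims=True)
    vec_dem = vec - vec.mean()
    num = row_dem @ vec_dem
    den = np.sqrt((row_dem ** 2).sum(axis=1) * (vec_dem ** 2).sum())
    return num / den


def corrwith_rows(df, s):
    """Hypothetical wrapper: numpy fast path for clean, pre-aligned data, else corrwith."""
    labels_match = df.columns.equals(s.index)
    has_nan = df.isna().to_numpy().any() or s.isna().any()
    if labels_match and not has_nan:
        fast = rowwise_corr(df.to_numpy(dtype=float), s.to_numpy(dtype=float))
        return pd.Series(fast, index=df.index)
    # Fall back to pandas, which aligns labels and skips NaN pairs
    return df.corrwith(s, axis=1)


# Scaled-down stand-in for the example in this issue (sizes are an assumption)
n_rows, n_cols = 6, 16
s = pd.Series(np.arange(float(n_cols)))
df = pd.DataFrame({i: i + np.arange(float(n_rows)) for i in range(n_cols)})
fast_result = corrwith_rows(df, s)

# A single missing value forces the fallback branch
df_nan = df.copy()
df_nan.iloc[0, 0] = np.nan
fallback_result = corrwith_rows(df_nan, s)
```

The checks in front of the fast path are themselves a full pass over the data, so whether such a dispatch pays off would need to be measured.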

@rosnfeld
Contributor Author

Sounds good, I'll try. I haven't contributed to pandas before, so it may take me a while to get set up.

@jreback
Contributor

jreback commented Dec 14, 2013

np

this would be a nice intro that is pretty straightforward

@rosnfeld
Contributor Author

Apologies for letting this sit open so long. I did take a look at it, and concluded that pandas is adding significant functionality on top of raw numpy corrcoef, through data alignment and avoidance of nan values. This functionality has a non-trivial cost (sadly I don't seem to have my profiling notes anymore), but also significant benefit, which I've taken advantage of on some newer projects. I think the moral of the story is "use raw numpy if you have clean, complete data (i.e. you've pre-aligned and checked for/handled nan), otherwise use pandas".
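The difference in behaviour can be illustrated with a small sketch. The data below is made up for illustration: one frame has a single missing value, and one copy of the series has its index reversed. Raw np.corrcoef works purely on positions and propagates NaN, while corrwith aligns on labels and skips missing pairs.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(8.0))
df = pd.DataFrame(np.arange(24.0).reshape(3, 8))

# Case 1: a missing value in the first row
df_nan = df.copy()
df_nan.iloc[0, 2] = np.nan
raw_nan = df_nan.apply(lambda x: np.corrcoef(x.values, s.values)[0, 1], axis=1)
pandas_nan = df_nan.corrwith(s, axis=1)
print(raw_nan.iloc[0], pandas_nan.iloc[0])  # numpy propagates the NaN, pandas skips it

# Case 2: same series, but stored in reverse label order
s_rev = s.iloc[::-1]
raw_rev = df.apply(lambda x: np.corrcoef(x.values, s_rev.values)[0, 1], axis=1)
pandas_rev = df.corrwith(s_rev, axis=1)
print(raw_rev.iloc[0], pandas_rev.iloc[0])  # numpy pairs by position, pandas by label
```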

As I was the person to open this issue, I assume it's okay for me to close it as I don't think it's relevant anymore?

(and I do hope to make some 'real' code contributions to pandas in the coming months, I have a few sketches of minor improvements)

@jreback
Contributor

jreback commented Jan 28, 2014

@rosnfeld np.....

this motivated #6013, which uses your example as a benchmark and gets a decent perf improvement in 0.13.1 (releasing next week).
