performance of DataFrame.corrwith vs raw apply #5671

rosnfeld · 2013-12-09T19:29:45Z

For the following example I see corrwith performing considerably more slowly than doing a "raw" apply of np.corrcoef, with the tip code as of 2013-12-08:

import numpy as np
import pandas as pd
s = pd.Series(np.arange(4096.))
df = pd.DataFrame({i:i + np.arange(3360.) for i in range(4096)})

%timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 942 ms per loop

%timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 1.64 s per loop

From comments on #5654 and the forums I would not have expected this to be slower, but it seems corrwith is just turning around and doing an "apply" itself, with some extra conditioning that seems to add significant overhead.
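A minimal sketch for checking that the two approaches agree before comparing their timings. It assumes a recent pandas and NumPy; the smaller shapes are an assumption chosen so the check runs quickly, not the sizes timed above.

```python
import numpy as np
import pandas as pd

# Smaller shapes than in the report (assumption), same construction otherwise.
n_rows, n_cols = 50, 64
s = pd.Series(np.arange(float(n_cols)))
df = pd.DataFrame({i: i + np.arange(float(n_rows)) for i in range(n_cols)})

# "Raw" approach: one np.corrcoef call per row, on plain ndarrays.
raw = df.apply(lambda x: np.corrcoef(x.values, s.values)[0, 1], axis=1)

# pandas approach: each row is correlated with s through pandas' own
# correlation code, which handles index alignment and NaN.
via_corrwith = df.corrwith(s, axis=1)

# Both should give one correlation per row, and the values should match.
same = np.allclose(raw.values, via_corrwith.values)
```

If `same` is true, any timing gap between the two calls comes from overhead rather than from a different result.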


rosnfeld · 2013-12-09T19:42:11Z

(I should also note that this extends to much larger cases than this toy example; I first found it while trying to migrate much larger real-world code to corrwith.)

jreback · 2013-12-09T20:17:00Z

The case of a frame with a series is doing a direct apply (as you mentioned).

If you create a repeated frame of the series and align it, you will see much greater perf.

Alignment is generally the responsibility of the user, however in this case it should be done inside the function (rather than passing to apply). Would appreciate a PR for this. thanks.

In [28]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [27]: df2 = DataFrame({i:s for i in df.columns})

In [29]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4096 entries, 0 to 4095
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [30]: left,right=df.align(df2,join='inner')

timings

In [32]: %timeit -n 3 left.corrwith(right)
3 loops, best of 3: 767 ms per loop

In [33]: %timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 930 ms per loop

In [34]: %timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 2 s per loop

rosnfeld · 2013-12-12T22:14:51Z

Hmm, if I'm understanding you correctly, I think "df2" should contain one copy of "s" per row, not per column:

df2 = pd.DataFrame({i:s for i in range(len(df))}).T

as I'm trying to take the correlation of each 4096-point row with a given "signature" Series (and there can be a variable number of rows).

In [81]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [82]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3360 entries, 0 to 3359
Columns: 4096 entries, 0 to 4095
dtypes: float64(4096)

In [83]: left, right = df.align(df2, join='inner')

And then unfortunately I am not observing any benefit to using the frame.corrwith(frame) approach, performance-wise. (and we also haven't counted the cost of setting up df2, left, and right)

In [89]: %timeit -n 3 left.corrwith(right, axis=1)
3 loops, best of 3: 1.59 s per loop

In [90]: %timeit -n 3 df.apply((lambda x: np.corrcoef(x.values, s.values)[0, 1]), axis=1)
3 loops, best of 3: 733 ms per loop

In [91]: %timeit -n 3 df.corrwith(s, axis=1)
3 loops, best of 3: 1.62 s per loop

jreback · 2013-12-13T13:30:46Z

ok..would welcome a PR on this; you might be able to just change corrwith to do the apply like you have it for certain cases then
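One way to read that suggestion, as a rough sketch only: `corr_rows_with` is a hypothetical helper written for illustration, not pandas API and not the change that was eventually made. It takes the raw `np.corrcoef` path only when the labels already match and no NaN is present, and otherwise defers to `corrwith`.

```python
import numpy as np
import pandas as pd

def corr_rows_with(df, s):
    """Hypothetical helper: correlate each row of df with the Series s."""
    # The raw path is only safe when nothing needs aligning or dropping.
    clean = (
        df.columns.equals(s.index)
        and not df.isna().values.any()
        and not s.isna().any()
    )
    if not clean:
        # Fall back to pandas, which aligns labels and skips NaN.
        return df.corrwith(s, axis=1)
    values = s.values
    return df.apply(lambda x: np.corrcoef(x.values, values)[0, 1], axis=1)

# Small example frame (assumption: same construction as the report, smaller shapes).
s = pd.Series(np.arange(32.0))
df = pd.DataFrame({i: i + np.arange(10.0) for i in range(32)})
fast = corr_rows_with(df, s)

# The same frame with one missing value goes through the fallback branch.
df_nan = df.copy()
df_nan.iloc[0, 0] = np.nan
slow = corr_rows_with(df_nan, s)
```

The checks in `clean` scan the whole frame once, so whether this wins overall would still need to be measured.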

rosnfeld · 2013-12-14T01:10:48Z

Sounds good, I'll try. I haven't contributed to pandas before, so it may take me a while to get set up.

jreback · 2013-12-14T01:17:33Z

np

this would be a nice intro that is pretty straightforward

rosnfeld · 2014-01-28T16:49:31Z

Apologies for letting this sit open so long. I did take a look at it, and concluded that pandas is adding significant functionality on top of raw numpy corrcoef, through data alignment and avoidance of nan values. This functionality has a non-trivial cost (sadly I don't seem to have my profiling notes anymore), but also significant benefit, which I've taken advantage of on some newer projects. I think the moral of the story is "use raw numpy if you have clean, complete data (i.e. you've pre-aligned and checked for/handled nan), otherwise use pandas".
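A small illustration of the NaN handling mentioned above, using made-up values (an assumption for the example, not data from this issue). It contrasts raw `np.corrcoef` with `Series.corr` on input that has one missing value.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0, 11.0])

# Raw numpy: a single NaN propagates, so the result is NaN.
raw = np.corrcoef(a.values, b.values)[0, 1]

# pandas: Series.corr aligns on the index and ignores pairs with a NaN.
handled = a.corr(b)

# The raw-numpy equivalent once the incomplete pair is removed by hand.
mask = a.notna() & b.notna()
manual = np.corrcoef(a[mask].values, b[mask].values)[0, 1]
```

Here `handled` and `manual` agree, while `raw` is NaN. The masking step is the kind of work pandas does on every call.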

As I was the person to open this issue, I assume it's okay for me to close it as I don't think it's relevant anymore?

(and I do hope to make some 'real' code contributions to pandas in the coming months, I have a few sketches of minor improvements)

jreback · 2014-01-28T17:09:24Z

@rosnfeld np.....

this motivated #6013, which uses your example as a benchmark... gets a decent perf improvement in 0.13.1 (releasing next week)...

rosnfeld closed this as completed Jan 28, 2014
