performance of DataFrame.corrwith vs raw apply #5671
(I should also note that this extends to much larger cases than this toy example; I first found it while trying to migrate much larger real-world code to corrwith.)
When a frame is correlated with a Series, corrwith does a direct apply (as you mentioned). If you create a frame that repeats the Series and align it, you will see much better performance. Alignment is generally the responsibility of the user; in this case, however, it should be done inside the function (rather than passing to apply). Would appreciate a PR for this. Thanks.
timings
Hmm, if I'm understanding you correctly, I think "df2" should contain one copy of "s" per row, not per column: df2 = pd.DataFrame({i:s for i in range(len(df))}).T, since I'm trying to take the correlation of each 4096-point row with a given "signature" Series (and there can be a variable number of rows).
Unfortunately, even with that change I am not observing any performance benefit from the frame.corrwith(frame) approach (and that is without counting the cost of setting up df2, left, and right).
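For reference, a minimal sketch of the frame-versus-frame setup being discussed. The shapes (20 rows of 4096 points) and the random data are assumptions for illustration only, and the names left and right are assumed here to come from DataFrame.align; the original timing code is not shown in this thread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 4096)))  # assumed: 20 rows of 4096 points
s = pd.Series(rng.standard_normal(4096))            # assumed "signature" Series

# One copy of s per row of df, using the expression quoted above.
df2 = pd.DataFrame({i: s for i in range(len(df))}).T

# Explicit alignment step; the labels already match here, so nothing is dropped.
left, right = df.align(df2, join="inner")

# Row-wise correlation, frame against frame and frame against Series.
frame_vs_frame = left.corrwith(right, axis=1)
frame_vs_series = df.corrwith(s, axis=1)

# Both routes should give the same correlations on this clean data.
assert np.allclose(frame_vs_frame, frame_vs_series)
```

Wrapping each corrwith call in %timeit (IPython) or timeit is one way to compare the two routes on a given pandas version.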
ok..would welcome a PR on this; you might be able to just change |
Sounds good, I'll try. I haven't contributed to pandas before, so it may take me a while to get set up.
np, this would be a nice intro that is pretty straightforward.
Apologies for letting this sit open so long. I did take a look at it, and concluded that pandas adds significant functionality on top of raw numpy corrcoef, through data alignment and the handling of NaN values. This functionality has a non-trivial cost (sadly I don't seem to have my profiling notes anymore), but also a significant benefit, which I've taken advantage of on some newer projects. I think the moral of the story is "use raw numpy if you have clean, complete data (i.e. you've pre-aligned it and checked for or handled NaN), otherwise use pandas". As I was the person who opened this issue, I assume it's okay for me to close it, since I don't think it's relevant anymore? (I do hope to make some 'real' code contributions to pandas in the coming months; I have a few sketches of minor improvements.)
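To make that trade-off concrete, here is a small sketch under assumed data. The helper rowwise_corr is hypothetical (it is not part of numpy or pandas); it computes the Pearson correlation of every row with the signature in plain numpy, which matches corrwith on clean data but propagates NaN where pandas skips the incomplete pair.

```python
import numpy as np
import pandas as pd


def rowwise_corr(values: np.ndarray, signature: np.ndarray) -> np.ndarray:
    """Pearson correlation of each row of `values` with `signature`.

    Hypothetical helper: assumes no NaNs and columns already aligned.
    """
    v = values - values.mean(axis=1, keepdims=True)  # center each row
    t = signature - signature.mean()                 # center the signature
    return (v @ t) / np.sqrt((v * v).sum(axis=1) * (t @ t))


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 4096)))  # assumed shapes
s = pd.Series(rng.standard_normal(4096))

# Clean, pre-aligned data: raw numpy and pandas agree.
fast = rowwise_corr(df.to_numpy(), s.to_numpy())
assert np.allclose(fast, df.corrwith(s, axis=1))

# One missing value: the raw route yields NaN for that row,
# while pandas drops the incomplete pair and still returns a number.
dirty = df.copy()
dirty.iloc[0, 0] = np.nan
raw_dirty = rowwise_corr(dirty.to_numpy(), s.to_numpy())
pandas_dirty = dirty.corrwith(s, axis=1)
assert np.isnan(raw_dirty[0])
assert not np.isnan(pandas_dirty.iloc[0])
```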
For the following example I see corrwith performing considerably more slowly than doing a "raw" apply of np.corrcoef, with the tip code as of 2013-12-08:
From comments on #5654 and the forums I would not have expected this to be slower, but it seems corrwith is just turning around and doing an "apply" itself, with some extra conditioning that seems to add significant overhead.
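A minimal sketch of the kind of comparison described, not the original example: the shapes and random data below are assumptions, and only public pandas and numpy calls are used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed shapes: 50 rows of 4096 points each, plus one 4096-point Series.
df = pd.DataFrame(rng.standard_normal((50, 4096)))
s = pd.Series(rng.standard_normal(4096))

# pandas route: correlate every row of df with s (axis=1 means row-wise).
via_corrwith = df.corrwith(s, axis=1)

# "Raw" route: apply np.corrcoef to each row and take the off-diagonal entry.
via_apply = df.apply(lambda row: np.corrcoef(row, s)[0, 1], axis=1)

# On clean, pre-aligned data the two routes agree.
assert np.allclose(via_corrwith, via_apply)
```

Timing each expression with %timeit (IPython) or timeit shows how the two compare on a given pandas version.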