corrwith in 0.24 is much slower than 0.23 (especially if corr axis is smaller than other axis) #26368
Can you profile things to see where the slowdown is?
The only PR referencing corrwith in the 0.24.0 release notes is
#22375.
On Mon, May 13, 2019 at 8:58 AM yavitzour wrote:
Hi,
I've noticed that corrwith on pandas 0.24 is much slower than in 0.23,
especially when trying to correlate dataframes where the length of the axis
of correlation is much smaller than the length of the other axis.
Example:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.rand(10000, 100))
df2 = pd.DataFrame(np.random.rand(10000, 100))
df1.corrwith(df2, axis=1)
With pandas 0.23.4 the snippet above finishes in about 0.1 sec, whereas
with pandas 0.24.1 it takes about 10 seconds (a 100 times slower...).
If we increase the length of the correlation axis, 0.23.4 still performs
much better, but the results are a bit less dramatic, for example with
10000 on both axes:
df1 = pd.DataFrame(np.random.rand(10000, 10000))
df2 = pd.DataFrame(np.random.rand(10000, 10000))
df1.corrwith(df2, axis=1)
Pandas 0.23.4 finishes in ~10 seconds whereas pandas 0.24.1 finishes in
about ~30 seconds ("only" 3 times slower)
Thanks!
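A minimal timing harness for the reproduction above (a sketch; absolute numbers will vary by machine and pandas version, so only the relative comparison between environments is meaningful):

```python
import time

import numpy as np
import pandas as pd

# Reproduction from the report: correlate along the short axis (axis=1).
df1 = pd.DataFrame(np.random.rand(10000, 100))
df2 = pd.DataFrame(np.random.rand(10000, 100))

start = time.perf_counter()
result = df1.corrwith(df2, axis=1)
elapsed = time.perf_counter() - start

print(f"pandas {pd.__version__}: corrwith(axis=1) took {elapsed:.2f}s")
```

Running this once in each virtual environment gives a direct version-to-version comparison without any profiling overhead.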
I can try, though I have no familiarity with the internals of pandas, so I doubt I could get much out of it. It's very easy to replicate: just run the code above in two clean virtual environments, one with pandas 0.24.1 and one with 0.23.4 (or any other 0.23 release). I just ran it now with cProfile (on a different computer, just for the fun of it). Here are the first few lines of the output; hope you can make something out of it.

For 0.23 I get:

python -m cProfile -s cumtime scr.py
235002 function calls (229040 primitive calls) in 0.974 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
403/1 0.006 0.000 0.974 0.974 {built-in method builtins.exec}
1 0.004 0.004 0.974 0.974 scr.py:1(<module>)
615/2 0.006 0.000 0.751 0.375 <frozen importlib._bootstrap>:978(_find_and_load)
615/2 0.003 0.000 0.751 0.375 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
417/2 0.003 0.000 0.749 0.375 <frozen importlib._bootstrap>:663(_load_unlocked)
341/2 0.002 0.000 0.749 0.374 <frozen importlib._bootstrap_external>:722(exec_module)
650/2 0.001 0.000 0.748 0.374 <frozen importlib._bootstrap>:211(_call_with_frames_removed)
3 0.000 0.000 0.747 0.249 __init__.py:5(<module>)
442/41 0.001 0.000 0.598 0.015 {built-in method builtins.__import__}
2147/1194 0.003 0.000 0.264 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
1 0.000 0.000 0.211 0.211 api.py:5(<module>)
1 0.000 0.000 0.209 0.209 __init__.py:106(<module>)
28 0.001 0.000 0.188 0.007 __init__.py:1(<module>)
4 0.000 0.000 0.182 0.046 __init__.py:2(<module>)
1 0.003 0.003 0.178 0.178 frame.py:6649(corrwith)
341 0.006 0.000 0.165 0.000 <frozen importlib._bootstrap_external>:793(get_code)
1 0.000 0.000 0.162 0.162 groupby.py:1(<module>)
545 0.006 0.000 0.159 0.000 <frozen importlib._bootstrap>:882(_find_spec)
2333 0.150 0.000 0.150 0.000 {built-in method nt.stat}
525 0.001 0.000 0.150 0.000 <frozen importlib._bootstrap_external>:1272(find_spec)
525 0.003 0.000 0.149 0.000 <frozen importlib._bootstrap_external>:1240(_get_spec)
851 0.011 0.000 0.134 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
6 0.000 0.000 0.131 0.022 frame.py:6845(_reduce)
6 0.000 0.000 0.130 0.022 frame.py:6856(f)
8/6 0.001 0.000 0.130 0.022 nanops.py:69(_f)
416/390 0.001 0.000 0.126 0.000 <frozen importlib._bootstrap>:576(module_from_spec)
1741 0.002 0.000 0.121 0.000 <frozen importlib._bootstrap_external>:74(_path_stat)
55/39 0.000 0.000 0.109 0.003 <frozen importlib._bootstrap_external>:1040(create_module)
55/39 0.064 0.001 0.109 0.003 {built-in method _imp.create_dynamic}
2 0.000 0.000 0.108 0.054 __init__.py:9(<module>)
250 0.001 0.000 0.105 0.000 {method 'extend' of 'list' objects}
1 0.000 0.000 0.105 0.105 lazy.py:97(_lazy)
593 0.000 0.000 0.104 0.000 __init__.py:1098(<genexpr>)
592 0.002 0.000 0.104 0.000 __init__.py:111(resource_exists)

whereas for 0.24 I get:

python -m cProfile -s cumtime scr.py
25185450 function calls (24818715 primitive calls) in 18.990 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
422/1 0.007 0.000 18.990 18.990 {built-in method builtins.exec}
1 0.004 0.004 18.990 18.990 scr.py:1(<module>)
1 0.003 0.003 17.996 17.996 frame.py:7151(corrwith)
7 0.000 0.000 17.853 2.550 ops.py:2015(f)
4 0.000 0.000 17.796 4.449 ops.py:1102(dispatch_to_series)
40011/11 0.067 0.000 15.818 1.438 expressions.py:192(evaluate)
40011/11 0.130 0.000 15.815 1.438 expressions.py:63(_evaluate_standard)
40004 0.343 0.000 9.213 0.000 ops.py:1536(wrapper)
2 0.000 0.000 8.995 4.497 ops.py:1891(_combine_series_frame)
2 0.018 0.009 8.995 4.497 frame.py:5111(_combine_match_columns)
2 0.017 0.009 8.837 4.419 frame.py:5118(_combine_const)
2 0.000 0.000 7.970 3.985 ops.py:1142(column_op)
2 0.121 0.060 7.970 3.985 ops.py:1143(<dictcomp>)
2 0.000 0.000 7.832 3.916 ops.py:1126(column_op)
2 0.097 0.049 7.832 3.916 ops.py:1127(<dictcomp>)
60012 0.197 0.000 6.324 0.000 indexing.py:1485(__getitem__)
40000 0.047 0.000 5.297 0.000 indexing.py:2141(_getitem_tuple)
80028 0.482 0.000 4.815 0.000 series.py:152(__init__)
40003/20003 0.166 0.000 4.795 0.000 {built-in method _operator.mul}
40001/20001 0.157 0.000 4.470 0.000 {built-in method _operator.sub}
40004 0.098 0.000 4.436 0.000 ops.py:1468(_construct_result)
40000 0.387 0.000 4.132 0.000 indexing.py:960(_getitem_lowerdim)
60012 0.202 0.000 3.393 0.000 indexing.py:2205(_getitem_axis)
4579482 1.595 0.000 2.873 0.000 {built-in method builtins.isinstance}
80040 0.310 0.000 2.826 0.000 blocks.py:3034(get_block_type)
60012 0.061 0.000 2.400 0.000 indexing.py:143(_get_loc)
40000 0.200 0.000 2.235 0.000 frame.py:2829(_ixs)
80028 0.294 0.000 2.140 0.000 managers.py:1443(__init__)
80043 0.145 0.000 2.002 0.000 blocks.py:3080(make_block)
25 0.001 0.000 1.992 0.080 frame.py:378(__init__)
4 0.008 0.002 1.989 0.497 construction.py:170(init_dict)
4 0.000 0.000 1.944 0.486 construction.py:43(arrays_to_mgr)
40004 0.146 0.000 1.891 0.000 ops.py:1512(safe_na_op)
I personally don't plan to look into this. If you're not planning to work on it either, anything you can do to help another contributor identify the issue and propose a solution is helpful! Nothing in that cProfile output looks wrong at a glance. A line profile of DataFrame.corrwith would be the next place I look.
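Short of a true line profile, one stdlib-only way to narrow this down is to profile just the corrwith call and restrict the printed stats to pandas frames (a sketch; the filter string, row count, and frame sizes here are arbitrary choices, not anything from the thread):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(1000, 100))
df2 = pd.DataFrame(np.random.rand(1000, 100))

# Profile only the corrwith call, excluding import-time noise.
prof = cProfile.Profile()
prof.enable()
df1.corrwith(df2, axis=1)
prof.disable()

buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf).sort_stats("cumulative")
stats.print_stats("pandas", 15)  # keep only pandas frames, top 15 by cumtime
output = buf.getvalue()
print(output)
```

Unlike `python -m cProfile scr.py`, this keeps the import machinery (visible at the top of the 0.23 trace above) out of the report entirely.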
Confirmed that the problem persists in pandas 0.25.0.
I don't believe that anyone has started working on this, if you're still interested.
I've encountered the same issue: the corrwith function is 30x slower in versions later than 0.23.x when calculating on two data frames of shape (4000, 4). In my case I had to downgrade pandas in the production environment.
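For anyone stuck on an affected version, row-wise Pearson correlation can be computed directly with NumPy as a workaround for the axis=1 case (a hypothetical helper, not a pandas API; assumes both frames are aligned, all-numeric, and NaN-free):

```python
import numpy as np
import pandas as pd


def rowwise_corr(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.Series:
    """Vectorized equivalent of df1.corrwith(df2, axis=1) for aligned,
    all-numeric frames without NaNs."""
    a = df1.to_numpy(dtype=float)
    b = df2.to_numpy(dtype=float)
    # Center each row, then apply the Pearson formula row-wise.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    cov = (a * b).sum(axis=1)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return pd.Series(cov / denom, index=df1.index)


df1 = pd.DataFrame(np.random.rand(1000, 4))
df2 = pd.DataFrame(np.random.rand(1000, 4))
res = rowwise_corr(df1, df2)
```

Because this is a handful of whole-array NumPy operations rather than one pandas operation per row, its cost does not blow up when the correlation axis is much shorter than the other axis.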
Happy to report that while the problem still persisted up to 1.0.5, pandas 1.1.0 solves the problem for me. Thanks!
Thanks @yavitzour, I can confirm the recent improvement.
We also have benchmarks for this in |
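The benchmark location is not given above; for reference, an asv-style benchmark covering this pattern would look roughly like the following (class and method names are assumed for illustration, not the actual pandas benchmark file):

```python
import numpy as np
import pandas as pd


class CorrwithAxis1:
    """asv-style benchmark sketch: corrwith along the short axis,
    mirroring the shapes from the original report."""

    def setup(self):
        self.df1 = pd.DataFrame(np.random.rand(10000, 100))
        self.df2 = pd.DataFrame(np.random.rand(10000, 100))

    def time_corrwith_axis1(self):
        self.df1.corrwith(self.df2, axis=1)
```

asv discovers `time_*` methods, calls `setup` before each timing run, and tracks the result across commits, which is how a regression like this one would surface automatically.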