PERF: Spearman correlation #14239

Closed
tantrev opened this issue Sep 17, 2016 · 7 comments
Labels
Performance Memory or execution speed performance
Comments

@tantrev

tantrev commented Sep 17, 2016

I'm trying to compute Spearman correlation for a relatively small DataFrame (500x8000), but the calculation appears to be orders of magnitude slower than Pearson correlation.

Any help to make Spearman's speed comparable to Pearson's would be greatly appreciated.

@tantrev
Author

tantrev commented Sep 17, 2016

For what it's worth, scipy.stats.spearmanr appears to be much faster.
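For readers who want to try the SciPy route, here is a minimal sketch on synthetic data. The frame, its column names, and the seed are made up for illustration; the only library calls used are `scipy.stats.spearmanr` and `DataFrame.corr`. Note that `spearmanr` returns a square matrix only when more than two variables are passed, and this sketch assumes the data contain no NaNs.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, NaN-free data (illustrative only; not the frame from this issue).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["w", "x", "y", "z"])

# Columns are treated as variables (axis=0 is the default).
# The first element of the result is the correlation matrix.
rho, _pvalues = stats.spearmanr(df.to_numpy())

# Wrap the raw array so it carries the same labels as df.corr().
scipy_corr = pd.DataFrame(rho, index=df.columns, columns=df.columns)
pandas_corr = df.corr(method="spearman")

# Without NaNs the two approaches should agree up to floating-point error.
agree = np.allclose(scipy_corr, pandas_corr)
print("results agree:", agree)
```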

@jreback
Contributor

jreback commented Sep 17, 2016

Please show the actual details: the output of pd.show_versions(), and df.info() for a portion of the frame.

@jreback added the Performance (memory or execution speed performance) and Difficulty Advanced labels on Sep 19, 2016
@jreback
Contributor

jreback commented Sep 19, 2016

Spearman corr re-ranks on every iteration. This is to accommodate the NaNs changing from one pair of columns to the next. If you don't have NaNs, then the scipy method is great. I am sure this could be sped up in lots of cases.

Pull requests are welcome.
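The point about NaNs can be illustrated with a small sketch. `spearman_rank_once` below is a hypothetical helper written for this example, not pandas' internal implementation: it ranks every column a single time and then takes the Pearson correlation of the ranks. On NaN-free data this matches `df.corr('spearman')`; once a column has a missing value, the ranks of the surviving rows change from pair to pair, and ranking only once gives a different answer.

```python
import numpy as np
import pandas as pd


def spearman_rank_once(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical shortcut: rank each column once, then Pearson on the ranks."""
    return frame.rank().corr(method="pearson")


# Case 1: no NaNs. Ranking once is enough.
rng = np.random.default_rng(0)
complete = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
same_without_nans = np.allclose(
    spearman_rank_once(complete), complete.corr(method="spearman")
)

# Case 2: one NaN in "a". For the pair (a, b) only rows 0-3 are usable,
# so "b" has to be re-ranked over those four rows.
gappy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, np.nan],
        "b": [1.0, 3.0, 2.0, 10.0, 2.5],
    }
)
pairwise = gappy.corr(method="spearman").loc["a", "b"]   # about 0.8
rank_once = spearman_rank_once(gappy).loc["a", "b"]      # about 0.707

print("no NaNs, same result:", same_without_nans)
print("with a NaN:", pairwise, "vs", rank_once)
```

In the second case, the value 2.5 in the dropped row sits in the middle of column "b", so it shifts the ranks of the remaining rows unevenly. That is why the pairwise result cannot be recovered from ranks computed once over the full column.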

@jreback changed the title from "ENH: Spearman correlation - slow performance" to "PERF: Spearman correlation" on Sep 19, 2016
@CaselIT

CaselIT commented Mar 13, 2019

I came across this issue today, so I'll add some timing info:

> df.shape 
< (10000, 30)
> %%timeit df.corr('spearman')
< 949 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit pd.DataFrame(scipy.stats.spearmanr(df)[0], columns=df.columns, index=df.columns)
< 28.8 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.15.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@Liam3851
Contributor

Perhaps we can close this due to #28151?

@TomAugspurger
Contributor

Thanks @Liam3851, I think you're right.

@TomAugspurger added this to the 1.0 milestone on Sep 10, 2019
@CaselIT

CaselIT commented Sep 10, 2019

Just retried the snippet I posted. Still not as fast as scipy, but it's at least a lot better than before 🎉

In [8]: df.shape
Out[8]: (10000, 30)
In [9]: %timeit df.corr('spearman')
67.5 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit pd.DataFrame(scipy.stats.spearmanr(df)[0], columns=df.columns, index=df.columns)
30.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
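To repeat this comparison outside IPython, a plain-script sketch along the following lines may help. The data are synthetic, with the same (10000, 30) shape as above, and `best_ms` is a hypothetical helper built on the standard-library `timeit` module. Absolute numbers will depend on the machine and on the installed pandas and SciPy versions, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, NaN-free frame with the same shape as in the timings above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 30)))


def pandas_spearman() -> pd.DataFrame:
    return df.corr(method="spearman")


def scipy_spearman() -> pd.DataFrame:
    # First element of the result is the correlation matrix.
    rho = stats.spearmanr(df.to_numpy())[0]
    return pd.DataFrame(rho, index=df.columns, columns=df.columns)


def best_ms(func, repeat: int = 3, number: int = 3) -> float:
    """Hypothetical helper: best per-call time in milliseconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1e3


timings = {
    "pandas": best_ms(pandas_spearman),
    "scipy": best_ms(scipy_spearman),
}
for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms per call")
```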

5 participants