PERF: Spearman correlation #14239

tantrev · 2016-09-17T07:45:24Z

I'm trying to compute Spearman correlation for a relatively small DataFrame (~500x8000), but the calculation appears to be orders of magnitude slower than Pearson correlation.

Any help to make Spearman's speed comparable to Pearson's would be greatly appreciated.

tantrev · 2016-09-17T07:48:56Z

For what it's worth, scipy.stats.spearmanr appears to be much faster.

jreback · 2016-09-17T10:06:27Z

Please show actual detail: the output of pd.show_versions(), and df.info() for a portion of the frame.

jreback · 2016-09-19T10:51:59Z

Spearman corr re-ranks on every iteration, i.e. for every pair of columns. This is to accommodate the NaNs changing from pair to pair. If you don't have NaNs, then the scipy method is great. I am sure this could be sped up in lots of cases.

pull-requests are welcome.
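
For readers following along, here is a minimal sketch of the point above. It is not the pandas implementation. It assumes a frame without missing values, and `spearman_rank_once` is a hypothetical helper name. With no NaNs, ranking each column once and taking the Pearson correlation of the ranks should reproduce `df.corr(method="spearman")`. The second half uses a tiny made-up frame to show why the shortcut stops being valid once NaNs appear: the rows that survive depend on the column pair, so the ranks do too.

```python
import numpy as np
import pandas as pd


def spearman_rank_once(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rank every column once, then take Pearson.

    Only equivalent to df.corr(method="spearman") when df has no NaNs.
    """
    if df.isna().any().any():
        raise ValueError("rank-once shortcut assumes no missing values")
    # Average ranks for ties, which is the usual convention for Spearman.
    return df.rank(method="average").corr(method="pearson")


# Synthetic data, no NaNs: the shortcut and the built-in method should agree.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
fast = spearman_rank_once(df)
reference = df.corr(method="spearman")
agree = np.allclose(fast.to_numpy(), reference.to_numpy())

# Made-up frame with one NaN: ranks depend on which rows survive the pair.
with_nan = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, np.nan, 1.0, 3.0]}
)
# Built-in method: drops the incomplete row, then ranks the remaining rows.
pairwise = with_nan.corr(method="spearman").loc["a", "b"]
# Naive shortcut: ranks computed over the full columns, NaN row dropped later.
naive = with_nan.rank(method="average").corr(method="pearson").loc["a", "b"]
differs = not np.isclose(pairwise, naive)
```

The helper raises rather than silently returning a different answer, since the whole reason for the per-pair re-ranking is that the shortcut is not safe when values are missing.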

CaselIT · 2019-03-13T12:48:10Z

I've come across this issue today. I'll just add some timing info

> df.shape 
< (10000, 30)
> %%timeit df.corr('spearman')
< 949 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %%timeit pd.DataFrame(scipy.stats.spearmanr(df)[0], columns=df.columns, index=df.columns)
< 28.8 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.15.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.4
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Liam3851 · 2019-09-10T13:18:28Z

Perhaps we can close this due to #28151?

TomAugspurger · 2019-09-10T20:12:10Z

Thanks @Liam3851, I think you're right.

CaselIT · 2019-09-10T21:42:06Z

Just retried the snippet I posted. Still not as fast as scipy, but it's at least a lot better than before 🎉

In [8]: df.shape
Out[8]: (10000, 30)
In [9]: %timeit df.corr('spearman')
67.5 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit pd.DataFrame(scipy.stats.spearmanr(df)[0], columns=df.columns, index=df.columns)
30.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

jreback added the Performance and Difficulty Advanced labels Sep 19, 2016

jreback changed the title ~~ENH: Spearman correlation - slow performance~~ PERF: Spearman correlation Sep 19, 2016

TomAugspurger closed this as completed Sep 10, 2019

TomAugspurger added this to the 1.0 milestone Sep 10, 2019
