Commit 5efb570

PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)
1 parent 4e3826f commit 5efb570

2 files changed, +30 -2 lines changed

doc/source/whatsnew/v1.5.0.rst

+1 -1
@@ -289,6 +289,7 @@ Other Deprecations
 
 Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement in :meth:`DataFrame.corrwith` for column-wise (axis=0) Pearson and Spearman correlation when other is a :class:`Series` (:issue:`46174`)
 - Performance improvement in :meth:`.GroupBy.transform` for some user-defined DataFrame -> Series functions (:issue:`45387`)
 - Performance improvement in :meth:`DataFrame.duplicated` when subset consists of only one column (:issue:`45236`)
 - Performance improvement in :meth:`.GroupBy.diff` (:issue:`16706`)
@@ -299,7 +300,6 @@ Performance improvements
 - Performance improvement in :meth:`DataFrame.join` when left and/or right are empty (:issue:`46015`)
 - Performance improvement in :func:`factorize` (:issue:`46109`)
 - Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`)
--
 
 .. ---------------------------------------------------------------------------
 .. _whatsnew_150.bug_fixes:
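
For readers unfamiliar with the call that the whatsnew entry describes, the following is a minimal usage sketch. The data, the column names, and the variable names are made up for illustration; only `DataFrame.corrwith`, `DataFrame.apply`, and `Series.corr` are pandas APIs. The `reference` line reproduces the per-column `Series.corr` path that the diff keeps as a fallback, so the two results should agree on this input.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 4)), columns=list("abcd"))
other = pd.Series(rng.standard_normal(1000))

# Column-wise (axis=0) correlation of every column of df with the Series.
pearson = df.corrwith(other, axis=0, method="pearson")
spearman = df.corrwith(other, axis=0, method="spearman")

# Reference: the per-column Series.corr path via DataFrame.apply.
reference = df.apply(lambda col: other.corr(col, method="pearson"))
```

Both results are Series indexed by the column labels of `df`.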

pandas/core/frame.py

+29 -1
@@ -9836,8 +9836,36 @@ def corrwith(self, other, axis: Axis = 0, drop=False, method="pearson") -> Serie
         axis = self._get_axis_number(axis)
         this = self._get_numeric_data()
 
+        # GH46174: when other is a Series object and axis=0, we achieve a speedup over
+        # passing .corr() to .apply() by taking the columns as ndarrays and iterating
+        # over the transposition row-wise. Then we delegate the correlation coefficient
+        # computation and null-masking to np.corrcoef and np.isnan respectively,
+        # which are much faster. We exploit the fact that the Spearman correlation
+        # of two vectors is equal to the Pearson correlation of their ranks to use
+        # substantially the same method for Pearson and Spearman,
+        # just with intermediate argsorts on the latter.
         if isinstance(other, Series):
-            return this.apply(lambda x: other.corr(x, method=method), axis=axis)
+            if axis == 0 and method in ["pearson", "spearman"]:
+                corrs = {}
+                numeric_cols = self.select_dtypes(include=np.number).columns
+                ndf = self[numeric_cols].values.transpose()
+                k = other.values
+                if method == "pearson":
+                    for i, r in enumerate(ndf):
+                        nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
+                        corrs[numeric_cols[i]] = np.corrcoef(
+                            r[nonnull_mask], k[nonnull_mask]
+                        )[0, 1]
+                else:
+                    for i, r in enumerate(ndf):
+                        nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
+                        corrs[numeric_cols[i]] = np.corrcoef(
+                            r[nonnull_mask].argsort().argsort(),
+                            k[nonnull_mask].argsort().argsort(),
+                        )[0, 1]
+                return Series(corrs)
+            else:
+                return this.apply(lambda x: other.corr(x, method=method), axis=axis)
 
         other = other._get_numeric_data()
         left, right = this.align(other, join="inner", copy=False)
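
The technique described in the GH46174 comment can be sketched outside pandas using NumPy alone. This is an illustrative sketch, not the pandas implementation: `corr_with_vector` and `average_ranks` are hypothetical helper names, and the inputs are assumed to be 1-D float arrays of equal length. The second helper is included because a double `argsort` assigns distinct ordinal positions, so it matches average ranks only when the data contain no ties; whether that matters depends on the data.

```python
import numpy as np


def corr_with_vector(columns, k, method="pearson"):
    """Correlate each 1-D float array in `columns` with the vector `k`.

    Hypothetical helper mirroring the masking + np.corrcoef approach.
    """
    out = []
    for r in columns:
        # Keep only positions where both vectors are non-null.
        mask = ~np.isnan(r) & ~np.isnan(k)
        x, y = r[mask], k[mask]
        if method == "spearman":
            # Double argsort yields ordinal positions (0..n-1); this equals
            # the ranks (up to an offset) only when there are no ties.
            x = x.argsort().argsort()
            y = y.argsort().argsort()
        out.append(np.corrcoef(x, y)[0, 1])
    return np.array(out)


def average_ranks(a):
    """1-based ranks in which tied values share their average rank."""
    order = a.argsort(kind="stable")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    # Average the ordinal ranks within each group of equal values.
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return (sums / counts)[inverse]


k = np.array([2.0, 4.0, 6.0, 8.0])
cols = [np.array([1.0, 2.0, np.nan, 4.0]), np.array([4.0, 3.0, 2.0, 1.0])]
pearson = corr_with_vector(cols, k, method="pearson")
spearman = corr_with_vector(cols, k, method="spearman")

no_ties = np.array([3.0, 1.0, 2.0])
with_ties = np.array([1.0, 1.0, 2.0])
same_without_ties = np.array_equal(
    no_ties.argsort().argsort() + 1, average_ranks(no_ties)
)
same_with_ties = np.array_equal(
    with_ties.argsort().argsort() + 1, average_ranks(with_ties)
)
```

Here `same_without_ties` is true and `same_with_ties` is false, which shows where the double-argsort shortcut and average ranking diverge. The sketch also pairs values by position, as `other.values` does in the diff, rather than by index label.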
