Skip to content

PERF: Speed up Spearman calculation #28151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Sep 7, 2019
27 changes: 23 additions & 4 deletions pandas/_libs/algos.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -296,6 +296,7 @@ def nancorr_spearman(const float64_t[:, :] mat, Py_ssize_t minp=1):
cdef:
Py_ssize_t i, j, xi, yi, N, K
ndarray[float64_t, ndim=2] result
ndarray[float64_t, ndim=2] ranked_mat
ndarray[float64_t, ndim=1] maskedx
ndarray[float64_t, ndim=1] maskedy
ndarray[uint8_t, ndim=2] mask
Expand All @@ -307,26 +308,44 @@ def nancorr_spearman(const float64_t[:, :] mat, Py_ssize_t minp=1):
result = np.empty((K, K), dtype=np.float64)
mask = np.isfinite(mat).view(np.uint8)

ranked_mat = np.empty((N, K), dtype=np.float64)
cached_ranks = set({})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the whole caching thing you have here is to avoid the memory overhead of ranking columns up front right? If so I think it's a premature optimization so would probably just prefer if everything was ranked and we didn't do caching.

Copy link
Member Author

@dsaxton dsaxton Aug 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this approach would accept a hit on memory by preallocating ranked_mat (even this is probably not necessary), but save time by not calculating ranks up front that we end up not reusing. Instead we wait until we need a full set of ranks for some calculation before computing and saving them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much of a time savings is that? In any case your PR should offer a significant speed up by not ranking in each loop so I think this might still be OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was actually fairly negligible based on some local testing I did (maybe 5% slowdown), but it's probably worthwhile to add some more cases to stat_ops.py (wide tables, possibly a worst case missing pattern for this change, etc.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see any perf difference here? Again would rather simplify and not have this if there's no major perf impact

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually seeing essentially the same performance from the previous implementation (ranking everything up front) on the "wide with nans" benchmark, so maybe just revert to that approach since the code is a lot easier to understand?


for xi in range(K):
for yi in range(xi + 1):
nobs = 0
# Keep track of whether the two columns have the same
# missing pattern, if so save off the ranks
all_ranks = True
for i in range(N):
all_ranks &= not (mask[i, xi] ^ mask[i, yi])
if mask[i, xi] and mask[i, yi]:
nobs += 1

if all_ranks:
if xi not in cached_ranks:
ranked_mat[:, xi] = rank_1d_float64(mat[:, xi])
cached_ranks.add(xi)
if yi not in cached_ranks:
ranked_mat[:, yi] = rank_1d_float64(mat[:, yi])
cached_ranks.add(yi)

if nobs < minp:
result[xi, yi] = result[yi, xi] = NaN
else:
maskedx = np.empty(nobs, dtype=np.float64)
maskedy = np.empty(nobs, dtype=np.float64)
j = 0

for i in range(N):
if mask[i, xi] and mask[i, yi]:
maskedx[j] = mat[i, xi]
maskedy[j] = mat[i, yi]
maskedx[j] = ranked_mat[i, xi] if all_ranks else mat[i, xi]
maskedy[j] = ranked_mat[i, yi] if all_ranks else mat[i, yi]
j += 1
maskedx = rank_1d_float64(maskedx)
maskedy = rank_1d_float64(maskedy)

if not all_ranks:
maskedx = rank_1d_float64(maskedx)
maskedy = rank_1d_float64(maskedy)

mean = (nobs + 1) / 2.

Expand Down