-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Improve numerical stability for Pearson corr() and cov() #37453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
But what if you're trying to get the correlation of series with very small floats? [1.0e-20, 2.0e-20, 3.0e-20] is perfectly correlated with itself, but has a variance smaller than 1e-15 ... |
Hm yes, that a good point. Have to add tests for these cases and change my fix. |
Welford does the trick. Also saves us one for loop. Running asvs |
asv results
|
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -404,6 +404,7 @@ Numeric | |||
- Bug in :class:`IntervalArray` comparisons with :class:`Series` not returning :class:`Series` (:issue:`36908`) | |||
- Bug in :class:`DataFrame` allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ``ValueError`` (:issue:`36702`) | |||
- Bug in :meth:`DataFrame.std`` with ``timedelta64`` dtype and ``skipna=False`` (:issue:`37392`) | |||
- Bug in :meth:`DataFrame.corr` returned inconsistent results for constant columns (:issue:`37448`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can i put this near the other corr precision fix (or combine them is ok)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a new point directly below.
@@ -283,37 +284,27 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None): | |||
with nogil: | |||
for xi in range(K): | |||
for yi in range(xi + 1): | |||
nobs = sumxx = sumyy = sumx = sumy = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u add in the welford reference link somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Failure seems unrelated |
pandas/_libs/algos.pyx
Outdated
@@ -268,7 +268,8 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None): | |||
ndarray[float64_t, ndim=2] result | |||
ndarray[uint8_t, ndim=2] mask | |||
int64_t nobs = 0 | |||
float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor | |||
float64_t vx, vy, meanx, meany, divisor, prev_meany, prev_meanx, ssqdmx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra comma at the end :-<
can you run a quick asv to make sure that you typed all of the variables (not sure how else to check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, missed that somehow. Asv is running, I will report the results when I get them. Result should be similar as mentioned above, because I made no substantial changes in algos after this run.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Floating number issues when summing the same number often enough...