ENH: Improve numerical stability for Pearson corr() and cov() #37453

phofl · 2020-10-27T22:19:43Z

closes BUG: Inconsistent correlation between constant series (varies with number of rows) #37448
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Floating-point rounding issues show up when the same number is summed often enough...
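A minimal sketch of the effect in plain Python, not the pandas code itself. The helper names `accumulate` and `sum_sq_dev` are made up for illustration; they assume a two-pass approach in which the mean is obtained as sum / n and the squared deviations are summed afterwards.

```python
def accumulate(values):
    """Add the values one by one with a plain loop, like a running sum."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_sq_dev(values):
    """Sum of squared deviations from a mean computed as sum / n."""
    mean = accumulate(values) / len(values)
    return accumulate([(v - mean) ** 2 for v in values])


constant = [0.1] * 10
# Ten additions of 0.1 do not land exactly on 1.0.
print(accumulate(constant) == 1.0)  # False
# Because the mean is slightly off, the result can be tiny but nonzero.
print(sum_sq_dev(constant))
```

A constant column whose value and sum are exactly representable (for example 2.0 repeated 8 times) does not show the problem, which is one way the result can end up depending on the values and on the number of rows.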

da-wad · 2020-10-28T08:29:54Z

But what if you're trying to get the correlation of series with very small floats? [1.0e-20, 2.0e-20, 3.0e-20] is perfectly correlated with itself, but has a variance smaller than 1e-15 ...
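To make the objection concrete, here is a small sketch using only the standard library. The 1e-15 cutoff comes from the comment above; the helper `looks_constant` is hypothetical and stands in for any fixed-threshold check, it is not taken from the pandas source.

```python
import statistics

CUTOFF = 1e-15  # threshold mentioned in the comment above


def looks_constant(values, cutoff=CUTOFF):
    """Hypothetical check: call a series constant if its variance is below `cutoff`."""
    return statistics.variance(values) < cutoff


tiny = [1.0e-20, 2.0e-20, 3.0e-20]
print(statistics.variance(tiny))  # on the order of 1e-40
print(looks_constant(tiny))       # True, although the values clearly vary
```

Any absolute cutoff has this problem, because variance scales with the square of the data.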

phofl · 2020-10-28T10:03:57Z

Hm yes, that's a good point. I have to add tests for these cases and change my fix.

phofl · 2020-10-28T22:28:00Z

Welford's algorithm does the trick. It also saves us one for loop. Running the asvs now.
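For readers unfamiliar with it, a rough pure-Python sketch of a Welford-style single-pass update for Pearson correlation follows. This is an illustration under assumptions, not the Cython code in `algos.pyx`: the function name `welford_corr` and its variable names are made up here, and NaN masking and `minp` handling are left out.

```python
import math


def welford_corr(xs, ys):
    """Pearson correlation with running means and co-moments in one pass.

    Illustrative sketch only; NaN handling is omitted.
    """
    n = 0
    mean_x = mean_y = 0.0
    ssqd_x = ssqd_y = co_moment = 0.0
    for x, y in zip(xs, ys):
        n += 1
        dx = x - mean_x  # deviation from the previous mean of x
        dy = y - mean_y  # deviation from the previous mean of y
        mean_x += dx / n
        mean_y += dy / n
        # Previous-mean deviation times updated-mean deviation.
        ssqd_x += dx * (x - mean_x)
        ssqd_y += dy * (y - mean_y)
        co_moment += dx * (y - mean_y)
    denom = math.sqrt(ssqd_x * ssqd_y)
    if denom == 0.0:
        return math.nan  # at least one input has no spread
    return co_moment / denom


tiny = [1.0e-20, 2.0e-20, 3.0e-20]
print(welford_corr(tiny, tiny))                      # close to 1.0
print(welford_corr([0.1] * 10, list(range(1, 11))))  # nan
```

For a constant input the running mean equals the value after the first step, so every later deviation is exactly zero and no threshold is needed. Only one loop over the data is required because the means are updated alongside the sums of squared deviations.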

phofl · 2020-10-28T22:50:33Z

asv results

       before           after         ratio
     [d321be6e]       [11e8f250]
     <20816~1^2>       <37448~1> 
+      7.93±0.1ms      9.00±0.03ms     1.14  stat_ops.Correlation.time_corr_wide('pearson')
-     13.5±0.04ms       10.5±0.3ms     0.77  stat_ops.Correlation.time_corr_wide_nans('pearson')

jreback · 2020-10-29T01:59:09Z

doc/source/whatsnew/v1.2.0.rst

@@ -404,6 +404,7 @@ Numeric
 - Bug in :class:`IntervalArray` comparisons with :class:`Series` not returning :class:`Series` (:issue:`36908`)
 - Bug in :class:`DataFrame` allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ``ValueError`` (:issue:`36702`)
 - Bug in :meth:`DataFrame.std`` with ``timedelta64`` dtype and ``skipna=False`` (:issue:`37392`)
+- Bug in :meth:`DataFrame.corr` returned inconsistent results for constant columns (:issue:`37448`)


Can you put this near the other corr precision fix? (Combining them is OK too.)

Made a new point directly below.

jreback · 2020-10-29T01:59:48Z

pandas/_libs/algos.pyx

@@ -283,37 +284,27 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
    with nogil:
        for xi in range(K):
            for yi in range(xi + 1):
-                nobs = sumxx = sumyy = sumx = sumy = 0


can u add in the welford reference link somewhere

phofl · 2020-10-29T23:44:43Z

Failure seems unrelated

jreback · 2020-10-30T16:25:44Z

pandas/_libs/algos.pyx

@@ -268,7 +268,8 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
        ndarray[float64_t, ndim=2] result
        ndarray[uint8_t, ndim=2] mask
        int64_t nobs = 0
-        float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor
+        float64_t vx, vy, meanx, meany, divisor, prev_meany, prev_meanx, ssqdmx,


extra comma at the end :-<

can you run a quick asv to make sure that you typed all of the variables (not sure how else to check)

Thanks, missed that somehow. Asv is running; I will report the results when I get them. They should be similar to the results above, because I made no substantial changes in algos after that run.

phofl · 2020-10-30T17:42:29Z

cc @jreback

asv results

       before           after         ratio
     [0647f02]       [1659881]
     <29485^2>        <37448>
      8.77±0.1ms       10.4±0.3ms     1.18  stat_ops.Correlation.time_corr_wide('pearson')


BUG: Inconsisten result for corr with constant columns

061a51f

phofl added the Numeric Operations Arithmetic, Comparison, and Logical operations label Oct 27, 2020

Fix pattern

daaaabd

phofl changed the title ~~BUG: Inconsisten result for corr with constant columns~~ BUG: Inconsistent result for corr with constant columns Oct 27, 2020

phofl marked this pull request as draft October 28, 2020 10:04

Use welford to calculate corr

168d78d

phofl added 2 commits October 28, 2020 23:30

Delete inline comment

11e8f25

Fix flake8

7b1089d

phofl marked this pull request as ready for review October 28, 2020 22:50

phofl changed the title ~~BUG: Inconsistent result for corr with constant columns~~ ENH: Improve numerical stability for Pearson corr() and cov() Oct 28, 2020

jreback requested changes Oct 29, 2020

View reviewed changes

jreback added this to the 1.2 milestone Oct 29, 2020

Adress review

9bef42f

jreback reviewed Oct 30, 2020

View reviewed changes

Delte comma

1659881

jreback approved these changes Oct 30, 2020

View reviewed changes

jreback merged commit d4cd068 into pandas-dev:master Oct 30, 2020

phofl deleted the 37448 branch October 30, 2020 20:03

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

ENH: Improve numerical stability for Pearson corr() and cov() (pandas…

28de4ec

…-dev#37453)

ukarroum pushed a commit to ukarroum/pandas that referenced this pull request Nov 2, 2020

ENH: Improve numerical stability for Pearson corr() and cov() (pandas…

0c4012d

…-dev#37453)
