Skip to content

ENH: Improve numerical stability for Pearson corr() and cov() #37453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 30, 2020

Conversation

phofl
Copy link
Member

@phofl phofl commented Oct 27, 2020

Floating number issues when summing the same number often enough...

@phofl phofl added the Numeric Operations Arithmetic, Comparison, and Logical operations label Oct 27, 2020
@phofl phofl changed the title BUG: Inconsisten result for corr with constant columns BUG: Inconsistent result for corr with constant columns Oct 27, 2020
@da-wad
Copy link

da-wad commented Oct 28, 2020

But what if you're trying to get the correlation of series with very small floats? [1.0e-20, 2.0e-20, 3.0e-20] is perfectly correlated with itself, but has a variance smaller than 1e-15 ...

@phofl
Copy link
Member Author

phofl commented Oct 28, 2020

Hm yes, that a good point. Have to add tests for these cases and change my fix.

@phofl phofl marked this pull request as draft October 28, 2020 10:04
@phofl
Copy link
Member Author

phofl commented Oct 28, 2020

Welford does the trick. Also saves us one for loop. Running asvs

@phofl
Copy link
Member Author

phofl commented Oct 28, 2020

asv results

       before           after         ratio
     [d321be6e]       [11e8f250]
     <20816~1^2>       <37448~1> 
+      7.93±0.1ms      9.00±0.03ms     1.14  stat_ops.Correlation.time_corr_wide('pearson')
-     13.5±0.04ms       10.5±0.3ms     0.77  stat_ops.Correlation.time_corr_wide_nans('pearson')

@phofl phofl marked this pull request as ready for review October 28, 2020 22:50
@phofl phofl changed the title BUG: Inconsistent result for corr with constant columns ENH: Improve numerical stability for Pearson corr() and cov() Oct 28, 2020
@@ -404,6 +404,7 @@ Numeric
- Bug in :class:`IntervalArray` comparisons with :class:`Series` not returning :class:`Series` (:issue:`36908`)
- Bug in :class:`DataFrame` allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ``ValueError`` (:issue:`36702`)
- Bug in :meth:`DataFrame.std`` with ``timedelta64`` dtype and ``skipna=False`` (:issue:`37392`)
- Bug in :meth:`DataFrame.corr` returned inconsistent results for constant columns (:issue:`37448`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i put this near the other corr precision fix (or combine them is ok)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a new point directly below.

@@ -283,37 +284,27 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
with nogil:
for xi in range(K):
for yi in range(xi + 1):
nobs = sumxx = sumyy = sumx = sumy = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add in the welford reference link somewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jreback jreback added this to the 1.2 milestone Oct 29, 2020
@phofl
Copy link
Member Author

phofl commented Oct 29, 2020

Failure seems unrelated

@@ -268,7 +268,8 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
ndarray[float64_t, ndim=2] result
ndarray[uint8_t, ndim=2] mask
int64_t nobs = 0
float64_t vx, vy, sumx, sumy, sumxx, sumyy, meanx, meany, divisor
float64_t vx, vy, meanx, meany, divisor, prev_meany, prev_meanx, ssqdmx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra comma at the end :-<

can you run a quick asv to make sure that you typed all of the variables (not sure how else to check)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, missed that somehow. Asv is running, I will report the results when I get them. Result should be similar as mentioned above, because I made no substantial changes in algos after this run.

@phofl
Copy link
Member Author

phofl commented Oct 30, 2020

cc @jreback

before after ratio
[0647f02] [1659881]
<29485^2> <37448>

  •  8.77±0.1ms       10.4±0.3ms     1.18  stat_ops.Correlation.time_corr_wide('pearson')
    

asv results

@jreback jreback merged commit d4cd068 into pandas-dev:master Oct 30, 2020
@phofl phofl deleted the 37448 branch October 30, 2020 20:03
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent correlation between constant series (varies with number of rows)
3 participants