numpy vs pandas: different estimation of covariance in presence of nan values #16837
Comments
Thanks for the detailed example. This seems very closely related to #3513 (possibly a duplicate).
FWIW, np.random.seed(42)
<...>
In [58]: M
Out[58]:
array([[ nan, -0.1382643 ],
[ 0.64768854, nan],
[-0.23415337, -0.23413696],
[ 1.57921282, 0.76743473],
[-0.46947439, 0.54256004],
[-0.46341769, -0.46572975],
[ 0.24196227, -1.91328024],
[-1.72491783, -0.56228753],
[-1.01283112, 0.31424733],
[-0.90802408, -1.4123037 ]])
In [59]: cov_pd
Out[59]: 0.22724929787316234
R:
a = c(NA, 0.64768854, -0.23415337, 1.57921282, -0.46947439,
-0.46341769, 0.24196227, -1.72491783, -1.01283112, -0.90802408)
b = c(-0.1382643 , NA, -0.23413696, 0.76743473, 0.54256004, -0.46572975,
-1.91328024, -0.56228753, 0.31424733, -1.4123037 )
cov(a, b, use='pairwise')
# [1] 0.2272493
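To make the pairwise calculation explicit, here is a minimal sketch that recomputes the off-diagonal covariance by hand from the `M` values printed above. The names `both` and `cov_manual` are illustrative only, and the sketch assumes `DataFrame.cov` uses pairwise-complete rows with an `n - 1` denominator, which is what the match with R's `use='pairwise'` suggests.

```python
import numpy as np
import pandas as pd

# Values copied from the M array printed above (rounded to 8 decimals).
M = np.array([
    [np.nan, -0.1382643],
    [0.64768854, np.nan],
    [-0.23415337, -0.23413696],
    [1.57921282, 0.76743473],
    [-0.46947439, 0.54256004],
    [-0.46341769, -0.46572975],
    [0.24196227, -1.91328024],
    [-1.72491783, -0.56228753],
    [-1.01283112, 0.31424733],
    [-0.90802408, -1.4123037],
])

a, b = M[:, 0], M[:, 1]
both = ~np.isnan(a) & ~np.isnan(b)  # rows where neither value is missing
a_c, b_c = a[both], b[both]

# Means are taken over the pairwise-complete rows only; denominator is n - 1.
cov_manual = ((a_c - a_c.mean()) * (b_c - b_c.mean())).sum() / (both.sum() - 1)

# Off-diagonal entry of the pandas covariance matrix for the same data.
cov_pd = pd.DataFrame(M).cov().iloc[0, 1]

print(cov_manual, cov_pd)
```

Both numbers should agree with the `cov_pd` value and the R result quoted above, up to the rounding of the printed inputs.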
Since this has a concrete example, I will close #3513.
I think there are two different (but closely related) issues in here:
Note that
For this very same example Numpy does this:
which, again, is not positive semi-definite:
BTW, Matlab function cov handles three cases via the
and does, by far, the best job documenting the difference. Calculation matches Pandas':
It is not clear to me, at this moment, which implementation is more reasonable since Numpy's may be more precise.
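The positive semi-definiteness point can be illustrated with a small made-up data set (not the one from this issue). Each pair of columns overlaps on different rows, so the pairwise estimates are mutually inconsistent. This is only a sketch; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical data: every pair of columns shares exactly two rows.
df = pd.DataFrame({
    "A": [1.0, 2.0, nan, nan, 1.0, 2.0],
    "B": [1.0, 2.0, 1.0, 2.0, nan, nan],
    "C": [nan, nan, 1.0, 2.0, 2.0, 1.0],
})

# Pairwise-complete covariance: A/B and B/C move together on their shared
# rows, while A/C move in opposite directions on theirs.
cov = df.cov()

# Eigenvalues of the symmetric matrix, in ascending order.
eigvals = np.linalg.eigvalsh(cov.to_numpy())

print(cov)
print(eigvals)
```

The smallest eigenvalue is negative here, so the matrix is not a valid covariance matrix even though every individual entry is a reasonable pairwise estimate.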
Thanks - not something I'm deeply knowledgeable about, but at minimum would definitely take some expanded docs warning that
This may give slightly different results from `pd.DataFrame.cov` in the presence of missing data, because the mean is calculated differently; see pandas-dev/pandas#16837. As both implementations are correct, we switch to the more performant one here.
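As a rough illustration of what "the calculation of the mean is done differently" can mean, the sketch below computes the same off-diagonal entry two ways on the `M` values from this thread: once centring with means over the pairwise-complete rows, and once centring each column with its own mean over all of its non-missing values. It assumes the NumPy path in question is `np.ma.cov` on a masked array, which is not confirmed by the text here; the variable names are illustrative.

```python
import numpy as np

# Values copied from the M array printed in this thread.
M = np.array([
    [np.nan, -0.1382643],
    [0.64768854, np.nan],
    [-0.23415337, -0.23413696],
    [1.57921282, 0.76743473],
    [-0.46947439, 0.54256004],
    [-0.46341769, -0.46572975],
    [0.24196227, -1.91328024],
    [-1.72491783, -0.56228753],
    [-1.01283112, 0.31424733],
    [-0.90802408, -1.4123037],
])

a, b = M[:, 0], M[:, 1]
both = ~np.isnan(a) & ~np.isnan(b)  # rows usable for the (a, b) pair
n = both.sum()

# Variant 1: means taken over the pairwise-complete rows only.
cov_pairwise_mean = (
    (a[both] - a[both].mean()) * (b[both] - b[both].mean())
).sum() / (n - 1)

# Variant 2: each column centred with the mean of all its non-missing values,
# then products summed over the pairwise-complete rows.
cov_column_mean = (
    (a[both] - np.nanmean(a)) * (b[both] - np.nanmean(b))
).sum() / (n - 1)

# Assumption: masked-array covariance is the NumPy calculation being compared.
cov_ma = float(np.ma.cov(np.ma.masked_invalid(M), rowvar=False)[0, 1])

print(cov_pairwise_mean, cov_column_mean, cov_ma)
```

The first variant reproduces the pairwise value quoted in the thread, while the second gives a slightly different number because the two rows with a single missing entry still contribute to the column means.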
Code Sample, a copy-pastable example if possible
Problem description
I am trying to calculate the covariance matrix in the presence of missing values, and I've noticed that numpy and pandas return different matrices, and that the difference grows as the proportion of missing values increases. I have included a snippet of both implementations above. For me the numpy approach is more useful; it seems to be more robust in the presence of missing values.