numpy vs pandas: different estimation of covariance in presence of nan values #16837


Open
asdf8601 opened this issue Jul 6, 2017 · 5 comments
Labels
Bug · cov/corr · Missing-data · Reduction Operations

Comments


asdf8601 commented Jul 6, 2017

Code Sample, a copy-pastable example if possible

# =============================================================================
# NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES
# =============================================================================
import numpy as np
import pandas as pd

# data with nan values
M = np.random.randn(10, 2)
# add missing values
M[0, 0] = np.nan
M[1, 1] = np.nan

# Covariance matrix calculations
# ==============================
# numpy
# -----
masked_arr = np.ma.array(M, mask=np.isnan(M))
cov_numpy = np.ma.cov(masked_arr, rowvar=0, allow_masked=True, ddof=1).data

# pandas
# ------
cov_pandas = pd.DataFrame(M).cov(min_periods=0).values

# Homemade covariance coefficient calculation
# (what each of them is actually doing)
# =============================================
# select the columns used to estimate element (0, 1) of the covariance matrix
x = M[:, 0]
y = M[:, 1]

mask_x = ~np.isnan(x)
mask_y = ~np.isnan(y)
mask_common = mask_x & mask_y

# numpy: each column is centered with the mean of its own non-nan values
# -----
xn = x - np.mean(x[mask_x])
yn = y - np.mean(y[mask_y])
cov_np = np.sum(xn[mask_common] * yn[mask_common]) / (np.sum(mask_common) - 1)

# pandas: both columns are centered with the mean over the common (pairwise) mask
# ------
xn = x - np.mean(x[mask_common])
yn = y - np.mean(y[mask_common])
cov_pd = np.sum(xn[mask_common] * yn[mask_common]) / (np.sum(mask_common) - 1)

Problem description

I am trying to calculate the covariance matrix in the presence of missing values, and I have noticed that numpy and pandas return different matrices, and that the difference grows as the proportion of missing values increases. The snippet above shows both implementations. The numpy approach is more useful to me; it seems to be more robust in the presence of missing values.
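
A quick check (a minimal sketch; it assumes the snippet above has already been run and relies on my reading of what each library actually computes) confirms that the two hand-rolled coefficients match element (0, 1) of the corresponding library results:

# hypothetical verification of the homemade calculations above
assert np.isclose(cov_np, cov_numpy[0, 1])
assert np.isclose(cov_pd, cov_pandas[0, 1])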

@asdf8601 asdf8601 changed the title NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES numpy vs pandas: different estimation of covariance in presence of nan values Jul 6, 2017

chris-b1 commented Jul 6, 2017

Thanks for the detailed example; this seems very closely related to #3513 (possibly a duplicate).

@chris-b1
Copy link
Contributor

chris-b1 commented Jul 6, 2017

FWIW, R seems to take a similar approach

np.random.seed(42)
<...>
In [58]: M
Out[58]: 
array([[        nan, -0.1382643 ],
       [ 0.64768854,         nan],
       [-0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473],
       [-0.46947439,  0.54256004],
       [-0.46341769, -0.46572975],
       [ 0.24196227, -1.91328024],
       [-1.72491783, -0.56228753],
       [-1.01283112,  0.31424733],
       [-0.90802408, -1.4123037 ]])


In [59]: cov_pd
Out[59]: 0.22724929787316234

R

a = c(NA,  0.64768854, -0.23415337,  1.57921282, -0.46947439, 
      -0.46341769,  0.24196227, -1.72491783, -1.01283112, -0.90802408)
b = c(-0.1382643 ,   NA, -0.23413696,  0.76743473,  0.54256004,  -0.46572975,
      -1.91328024, -0.56228753,  0.31424733, -1.4123037 )
cov(a, b, use='pairwise')

# [1] 0.2272493

https://stats.stackexchange.com/q/110719
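
For reference, the pandas number can be reproduced end to end (a minimal sketch; it assumes the M above came from np.random.seed(42) followed by np.random.randn(10, 2), with the two nans inserted as in the original report):

import numpy as np
import pandas as pd

np.random.seed(42)
M = np.random.randn(10, 2)
M[0, 0] = np.nan
M[1, 1] = np.nan

# pairwise-deletion covariance, pandas' default behavior
print(pd.DataFrame(M).cov().iloc[0, 1])  # 0.22724929787316234, matching R's 0.2272493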


jreback commented Jul 6, 2017

since this has a concrete example, will close #3513

@jreback jreback added the Missing-data and Numeric Operations labels Jul 6, 2017

masdeseiscaracteres commented Jul 7, 2017

I think there are two different (but closely related) issues in here:

  1. The one referred to in "wrong" covariance matrix returned in the presence of nans (#3513): the functions in both pandas and numpy use the so-called "pairwise deletion" strategy, which can lead to covariance matrices that are not positive semi-definite. An even simpler example than the one above demonstrates the problem:
import pandas as pd
A = pd.DataFrame([[1, 2],[None, 4],[5, None],[7, 8]])
cov_pd = A.cov()

           0          1
0   9.333333  18.000000
1  18.000000   9.333333

Note that cov_pd is not positive semi-definite:

import numpy as np
np.linalg.eigvals(cov_pd)
    array([ 27.33333333,  -8.66666667])

For this very same example Numpy does this:

masked_A = np.ma.MaskedArray(A.values, np.isnan(A))
cov_np = np.ma.cov(masked_A, rowvar=0).data
    array([[  9.33333333,  17.77777778],
           [ 17.77777778,   9.33333333]])

which, again, is not positive semi-definite:

np.linalg.eigvals(cov_np)
    array([ 27.11111111,  -8.44444444])

BTW, Matlab function cov handles three cases via the nanflag argument:

  • nanflag = 'includenan'
  • nanflag = 'omitrows'
  • nanflag = 'partialrows'

and does, by far, the best job of documenting the difference. Its calculation matches pandas' (a rough Python mapping of the three strategies is sketched at the end of this comment):

nanflag = 'partialrows'
cov_m = cov(A, 0, nanflag)
    cov_m =

        9.3333   18.0000
       18.0000    9.3333
  2. A different issue is the one on which numpy and the others (pandas, R, Matlab) disagree: how many elements are used to estimate the mean of each column. This raises concerns about estimation error and bias (and, as we have seen above, positive semi-definiteness is lost either way).

It is not clear to me at this moment which implementation is more reasonable; numpy's may well be the more precise one.
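
A rough Python mapping of Matlab's three strategies (a minimal sketch; 'omitrows' and 'partialrows' are approximated with standard pandas/numpy calls, not exact reimplementations of Matlab):

import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2], [None, 4], [5, None], [7, 8]])

# 'includenan': nans propagate into the result (every entry is nan here)
cov_include = np.cov(A.values, rowvar=False)

# 'omitrows': listwise (complete-case) deletion; the only one of the three
# that is guaranteed to produce a positive semi-definite matrix
cov_omit = A.dropna().cov()
print(np.linalg.eigvals(cov_omit))  # [36., 0.] up to floating-point noise; none negative

# 'partialrows': pairwise deletion, pandas' default behavior
cov_partial = A.cov()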


chris-b1 commented Jul 7, 2017

Thanks - not something I'm deeply knowledgeable about, but at a minimum we would definitely take some expanded docs warning that cov in the presence of missing data should be interpreted carefully.

@mroeschke mroeschke added the Bug label Jun 12, 2021
@jbrockmendel jbrockmendel added the cov/corr and Reduction Operations labels and removed the Numeric Operations label Mar 28, 2023
HippocampusGirl added a commit to HALFpipe/RAMP that referenced this issue Aug 30, 2024
This may give slightly different results from `pd.DataFrame.cov` in
the presence of missing data, because the calculation of the mean is
done differently, see pandas-dev/pandas#16837
As both implementations are correct, we switch to the more performant
one here