numpy vs pandas: different estimation of covariance in presence of nan values #16837


Open
asdf8601 opened this issue Jul 6, 2017 · 5 comments
Labels
Bug · cov/corr · Missing-data · Reduction Operations

Comments


asdf8601 commented Jul 6, 2017

Code Sample, a copy-pastable example if possible

# =============================================================================
# NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES
# =============================================================================
import numpy as np
import pandas as pd

# data with nan values
M = np.random.randn(10, 2)
# add missing values
M[0, 0] = np.nan
M[1, 1] = np.nan

# Covariance matrix calculations
# ==============================
# numpy
# -----
masked_arr = np.ma.array(M, mask=np.isnan(M))
cov_numpy = np.ma.cov(masked_arr, rowvar=0, allow_masked=True, ddof=1).data

# pandas
# ------
cov_pandas = pd.DataFrame(M).cov(min_periods=0).values

# Homemade covariance coefficient calculation
# (what each of them is actually doing)
# =============================================
# select the columns used to estimate element (0, 1) of the covariance matrix
x = M[:, 0]
y = M[:, 1]

mask_x = ~np.isnan(x)
mask_y = ~np.isnan(y)
mask_common = mask_x & mask_y

# numpy: each column is centered with the mean of its own non-nan values
# -----
xn = x - np.mean(x[mask_x])
yn = y - np.mean(y[mask_y])
cov_np = np.sum(xn[mask_common] * yn[mask_common]) / (np.sum(mask_common) - 1)

# pandas: both columns are centered with the mean over the common (pairwise) mask
# ------
xn = x - np.mean(x[mask_common])
yn = y - np.mean(y[mask_common])
cov_pd = np.sum(xn[mask_common] * yn[mask_common]) / (np.sum(mask_common) - 1)

Problem description

I am trying to calculate the covariance matrix in the presence of missing values, and I have noticed that numpy and pandas return different matrices, and that the difference grows as the proportion of missing values increases. The snippet above shows both implementations. The numpy approach is more useful to me; it seems to be more robust in the presence of missing values.
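
A quick check (a minimal sketch; it assumes the snippet above has already been run and relies on my reading of what each library actually computes) confirms that the two hand-rolled coefficients match element (0, 1) of the corresponding library results:

# hypothetical verification of the homemade calculations above
assert np.isclose(cov_np, cov_numpy[0, 1])
assert np.isclose(cov_pd, cov_pandas[0, 1])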

@asdf8601 asdf8601 changed the title NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES numpy vs pandas: different estimation of covariance in presence of nan values Jul 6, 2017

chris-b1 commented Jul 6, 2017

Thanks for the detailed example; this seems very closely related to #3513 (possibly a duplicate).

@chris-b1
Copy link
Contributor

chris-b1 commented Jul 6, 2017

FWIW, R seems to take a similar approach

np.random.seed(42)
<...>
In [58]: M
Out[58]: 
array([[        nan, -0.1382643 ],
       [ 0.64768854,         nan],
       [-0.23415337, -0.23413696],
       [ 1.57921282,  0.76743473],
       [-0.46947439,  0.54256004],
       [-0.46341769, -0.46572975],
       [ 0.24196227, -1.91328024],
       [-1.72491783, -0.56228753],
       [-1.01283112,  0.31424733],
       [-0.90802408, -1.4123037 ]])


In [59]: cov_pd
Out[59]: 0.22724929787316234

R

a = c(NA,  0.64768854, -0.23415337,  1.57921282, -0.46947439, 
      -0.46341769,  0.24196227, -1.72491783, -1.01283112, -0.90802408)
b = c(-0.1382643 ,   NA, -0.23413696,  0.76743473,  0.54256004,  -0.46572975,
      -1.91328024, -0.56228753,  0.31424733, -1.4123037 )
cov(a, b, use='pairwise')

# [1] 0.2272493

https://stats.stackexchange.com/q/110719
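
For reference, the pandas number can be reproduced end to end (a minimal sketch; it assumes the M above came from np.random.seed(42) followed by np.random.randn(10, 2), with the two nans inserted as in the original report):

import numpy as np
import pandas as pd

np.random.seed(42)
M = np.random.randn(10, 2)
M[0, 0] = np.nan
M[1, 1] = np.nan

# pairwise-deletion covariance, pandas' default behavior
print(pd.DataFrame(M).cov().iloc[0, 1])  # 0.22724929787316234, matching R's 0.2272493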


jreback commented Jul 6, 2017

since this has a concrete example, will close #3513

@jreback jreback added the Missing-data and Numeric Operations labels Jul 6, 2017

masdeseiscaracteres commented Jul 7, 2017

I think there are two different (but closely related) issues in here:

  1. The one referred to in "wrong" covariance matrix returned in the presence of nans (#3513): the functions in both pandas and numpy use the so-called "pairwise deletion" strategy, which can lead to covariance matrices that are not positive semi-definite. An even simpler example than the one above demonstrates the problem:
import pandas as pd
A = pd.DataFrame([[1, 2],[None, 4],[5, None],[7, 8]])
cov_pd = A.cov()

           0          1
0   9.333333  18.000000
1  18.000000   9.333333

Note that cov_pd is not positive semi-definite:

import numpy as np
np.linalg.eigvals(cov_pd)
    array([ 27.33333333,  -8.66666667])

For this very same example Numpy does this:

masked_A = np.ma.MaskedArray(A.values, np.isnan(A))
cov_np = np.ma.cov(masked_A, rowvar=0).data
    array([[  9.33333333,  17.77777778],
           [ 17.77777778,   9.33333333]])

which, again, is not positive semi-definite:

np.linalg.eigvals(cov_np)
    array([ 27.11111111,  -8.44444444])

BTW, Matlab function cov handles three cases via the nanflag argument:

  • nanflag = 'includenan'
  • nanflag = 'omitrows'
  • nanflag = 'partialrows'

and does, by far, the best job of documenting the difference. Its calculation matches pandas' (a rough Python mapping of the three strategies is sketched at the end of this comment):

nanflag = 'partialrows'
cov_m = cov(A, 0, nanflag)
    cov_m =

        9.3333   18.0000
       18.0000    9.3333
  2. A different issue is the one on which numpy and the others (pandas, R, Matlab) disagree: how many elements are used to estimate the mean of each column. This raises concerns about estimation error and bias (and, as we have seen above, positive semi-definiteness is lost either way).

It is not clear to me at this moment which implementation is more reasonable; numpy's may well be the more precise one.
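
A rough Python mapping of Matlab's three strategies (a minimal sketch; 'omitrows' and 'partialrows' are approximated with standard pandas/numpy calls, not exact reimplementations of Matlab):

import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 2], [None, 4], [5, None], [7, 8]])

# 'includenan': nans propagate into the result (every entry is nan here)
cov_include = np.cov(A.values, rowvar=False)

# 'omitrows': listwise (complete-case) deletion; the only one of the three
# that is guaranteed to produce a positive semi-definite matrix
cov_omit = A.dropna().cov()
print(np.linalg.eigvals(cov_omit))  # [36., 0.] up to floating-point noise; none negative

# 'partialrows': pairwise deletion, pandas' default behavior
cov_partial = A.cov()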


chris-b1 commented Jul 7, 2017

Thanks - not something I'm deeply knowledgeable about, but at a minimum we would definitely take some expanded docs warning that cov in the presence of missing data should be interpreted carefully.

@mroeschke mroeschke added the Bug label Jun 12, 2021
@jbrockmendel jbrockmendel added the cov/corr and Reduction Operations labels and removed the Numeric Operations label Mar 28, 2023
HippocampusGirl added a commit to HALFpipe/RAMP that referenced this issue Aug 30, 2024
This may give slightly different results from `pd.DataFrame.cov` in
the presence of missing data, because the calculation of the mean is
done differently, see pandas-dev/pandas#16837
As both implementations are correct, we switch to the more performant
one here