Commit 8d277e3

Author: Chang She
ENH: min_periods for corr/cov #2002 and TST tweak to use better sys.stderr idiom 5645be2
1 parent 541c9a8 commit 8d277e3

9 files changed, +185 -66 lines changed

RELEASE.rst

+3 -1
@@ -31,6 +31,9 @@ pandas 0.10.0

   - Add error handling to Series.str.encode/decode (#2276)
   - Add ``where`` and ``mask`` to Series (#2337)
+  - Grouped histogram via `by` keyword in Series/DataFrame.hist (#2186)
+  - Support optional ``min_periods`` keyword in ``corr`` and ``cov``
+    for both Series and DataFrame (#2002)

 **API Changes**

@@ -42,7 +45,6 @@ pandas 0.10.0

 **Improvements to existing features**

-  - Grouped histogram via `by` keyword in Series/DataFrame.hist (#2186)
   - Add ``nrows`` option to DataFrame.from_records for iterators (#1794)
   - Unstack/reshape algorithm rewrite to avoid high memory use in cases where
     the number of observed key-tuples is much smaller than the total possible

doc/source/computation.rst

+43 -15
@@ -62,6 +62,21 @@ among the series in the DataFrame, also excluding NA/null values.
    frame = DataFrame(randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
    frame.cov()

+``DataFrame.cov`` also supports an optional ``min_periods`` keyword that
+specifies the required minimum number of observations for each column pair
+in order to have a valid result.
+
+.. ipython:: python
+
+   frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])
+   frame.ix[:5, 'a'] = np.nan
+   frame.ix[5:10, 'b'] = np.nan
+
+   frame.cov()
+
+   frame.cov(min_periods=12)
+
+
 .. _computation.correlation:

 Correlation
@@ -97,6 +112,19 @@ All of these are currently computed using pairwise complete observations.
 Note that non-numeric columns will be automatically excluded from the
 correlation calculation.

+Like ``cov``, ``corr`` also supports the optional ``min_periods`` keyword:
+
+.. ipython:: python
+
+   frame = DataFrame(randn(20, 3), columns=['a', 'b', 'c'])
+   frame.ix[:5, 'a'] = np.nan
+   frame.ix[5:10, 'b'] = np.nan
+
+   frame.corr()
+
+   frame.corr(min_periods=12)
+
+
 A related method ``corrwith`` is implemented on DataFrame to compute the
 correlation between like-labeled Series contained in different DataFrame
 objects.
@@ -290,9 +318,9 @@ columns using ``ix`` indexing:

 Expanding window moment functions
 ---------------------------------
-A common alternative to rolling statistics is to use an *expanding* window,
-which yields the value of the statistic with all the data available up to that
-point in time. As these calculations are a special case of rolling statistics,
+A common alternative to rolling statistics is to use an *expanding* window,
+which yields the value of the statistic with all the data available up to that
+point in time. As these calculations are a special case of rolling statistics,
 they are implemented in pandas such that the following two calls are equivalent:

 .. ipython:: python
@@ -301,7 +329,7 @@ they are implemented in pandas such that the following two calls are equivalent:

    expanding_mean(df)[:5]

-Like the ``rolling_`` functions, the following methods are included in the
+Like the ``rolling_`` functions, the following methods are included in the
 ``pandas`` namespace or can be located in ``pandas.stats.moments``.

 .. csv-table::
@@ -324,12 +352,12 @@ Like the ``rolling_`` functions, the following methods are included in the
     ``expanding_corr``, Correlation (binary)
     ``expanding_corr_pairwise``, Pairwise correlation of DataFrame columns

-Aside from not having a ``window`` parameter, these functions have the same
-interfaces as their ``rolling_`` counterpart. Like above, the parameters they
+Aside from not having a ``window`` parameter, these functions have the same
+interfaces as their ``rolling_`` counterpart. Like above, the parameters they
 all accept are:

-  - ``min_periods``: threshold of non-null data points to require. Defaults to
-    minimum needed to compute statistic. No ``NaNs`` will be output once
+  - ``min_periods``: threshold of non-null data points to require. Defaults to
+    minimum needed to compute statistic. No ``NaNs`` will be output once
     ``min_periods`` non-null data points have been seen.
   - ``freq``: optionally specify a :ref:`frequency string <timeseries.alias>`
     or :ref:`DateOffset <timeseries.offsets>` to pre-conform the data to.
@@ -338,15 +366,15 @@ all accept are:

 .. note::

-   The output of the ``rolling_`` and ``expanding_`` functions do not return a
-   ``NaN`` if there are at least ``min_periods`` non-null values in the current
-   window. This differs from ``cumsum``, ``cumprod``, ``cummax``, and
-   ``cummin``, which return ``NaN`` in the output wherever a ``NaN`` is
+   The output of the ``rolling_`` and ``expanding_`` functions do not return a
+   ``NaN`` if there are at least ``min_periods`` non-null values in the current
+   window. This differs from ``cumsum``, ``cumprod``, ``cummax``, and
+   ``cummin``, which return ``NaN`` in the output wherever a ``NaN`` is
    encountered in the input.

-An expanding window statistic will be more stable (and less responsive) than
-its rolling window counterpart as the increasing window size decreases the
-relative impact of an individual data point. As an example, here is the
+An expanding window statistic will be more stable (and less responsive) than
+its rolling window counterpart as the increasing window size decreases the
+relative impact of an individual data point. As an example, here is the
 ``expanding_mean`` output for the previous time series dataset:

 .. ipython:: python
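
The documented ``min_periods`` behavior can be tried out with a sketch along the following lines. It is written against a current pandas release rather than the 0.10.0 code in this commit, so it assumes ``.loc`` in place of the ``.ix`` indexer used in the docs example (``.ix`` has since been removed). The seeded generator is an assumption added only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability only
frame = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])

# Label-based slices include both endpoints: rows 0-5 and rows 5-10.
frame.loc[:5, "a"] = np.nan
frame.loc[5:10, "b"] = np.nan

# Columns 'a' and 'b' are both non-null only on rows 11-19.
shared = int((frame["a"].notna() & frame["b"].notna()).sum())

cov_default = frame.cov()                # default threshold: every pair qualifies
cov_min12 = frame.cov(min_periods=12)    # pairs with fewer than 12 observations -> NaN
corr_min12 = frame.corr(min_periods=12)  # same threshold applied to Pearson correlation

print(shared)
print(cov_min12)
```

Only the ``a``/``b`` pair falls below the threshold here, so it is the only entry (in both triangle positions) that turns into ``NaN``; every other pair keeps at least 14 shared observations.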

pandas/core/frame.py

+23 -6
@@ -4241,7 +4241,7 @@ def merge(self, right, how='inner', on=None, left_on=None, right_on=None,
     #----------------------------------------------------------------------
     # Statistical methods, etc.

-    def corr(self, method='pearson'):
+    def corr(self, method='pearson', min_periods=None):
         """
         Compute pairwise correlation of columns, excluding NA/null values

@@ -4251,6 +4251,10 @@ def corr(self, method='pearson'):
             pearson : standard correlation coefficient
             kendall : Kendall Tau correlation coefficient
             spearman : Spearman rank correlation
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result. Currently only available for pearson
+            correlation

         Returns
         -------
@@ -4261,8 +4265,10 @@ def corr(self, method='pearson'):
         mat = numeric_df.values

         if method == 'pearson':
-            correl = lib.nancorr(com._ensure_float64(mat))
+            correl = lib.nancorr(com._ensure_float64(mat), minp=min_periods)
         else:
+            if min_periods is None:
+                min_periods = 1
             mat = mat.T
             corrf = nanops.get_corr_func(method)
             K = len(cols)
@@ -4271,7 +4277,7 @@ def corr(self, method='pearson'):
             for i, ac in enumerate(mat):
                 for j, bc in enumerate(mat):
                     valid = mask[i] & mask[j]
-                    if not valid.any():
+                    if valid.sum() < min_periods:
                         c = NA
                     elif not valid.all():
                         c = corrf(ac[valid], bc[valid])
@@ -4282,10 +4288,16 @@ def corr(self, method='pearson'):

         return self._constructor(correl, index=cols, columns=cols)

-    def cov(self):
+    def cov(self, min_periods=None):
         """
         Compute pairwise covariance of columns, excluding NA/null values

+        Parameters
+        ----------
+        min_periods : int, optional
+            Minimum number of observations required per pair of columns
+            to have a valid result.
+
         Returns
         -------
         y : DataFrame
@@ -4298,9 +4310,14 @@ def cov(self):
         mat = numeric_df.values

         if notnull(mat).all():
-            baseCov = np.cov(mat.T)
+            if min_periods is not None and min_periods > len(mat):
+                baseCov = np.empty((mat.shape[1], mat.shape[1]))
+                baseCov.fill(np.nan)
+            else:
+                baseCov = np.cov(mat.T)
         else:
-            baseCov = lib.nancorr(com._ensure_float64(mat), cov=True)
+            baseCov = lib.nancorr(com._ensure_float64(mat), cov=True,
+                                  minp=min_periods)

         return self._constructor(baseCov, index=cols, columns=cols)

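
The pairwise loop in the non-Pearson branch of ``corr`` can be read as the following standalone sketch. The names ``pairwise_matrix`` and ``func`` are hypothetical, and ``np.corrcoef`` stands in for whatever ``nanops.get_corr_func`` returns; this illustrates the thresholding idea and is not the pandas implementation.

```python
import numpy as np


def pairwise_matrix(mat, func, min_periods=None):
    """Apply ``func`` to every column pair of ``mat`` using only shared finite rows.

    Hypothetical helper: rows of ``mat`` are observations, columns are variables.
    """
    if min_periods is None:
        min_periods = 1  # same default as in the diff
    cols = np.asarray(mat, dtype=float).T
    mask = np.isfinite(cols)
    k = len(cols)
    out = np.empty((k, k), dtype=float)
    for i in range(k):
        for j in range(i, k):
            valid = mask[i] & mask[j]
            if valid.sum() < min_periods:
                c = np.nan  # too few shared observations for this pair
            else:
                c = func(cols[i][valid], cols[j][valid])
            out[i, j] = out[j, i] = c
    return out


def pearson(x, y):
    # Stand-in correlation function (an assumption for this sketch).
    return np.corrcoef(x, y)[0, 1]


data = np.array([[1.0, 2.0],
                 [2.0, 4.0],
                 [3.0, np.nan],
                 [4.0, 8.0]])

r_default = pairwise_matrix(data, pearson)
r_min4 = pairwise_matrix(data, pearson, min_periods=4)
```

With ``min_periods=4`` the second column has only three finite values, so both its diagonal entry and its pairing with the first column come back as ``NaN``, while the first column's diagonal entry is still computed.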

pandas/core/nanops.py

+10 -4
@@ -384,19 +384,22 @@ def _zero_out_fperr(arg):
     return 0 if np.abs(arg) < 1e-14 else arg


-def nancorr(a, b, method='pearson'):
+def nancorr(a, b, method='pearson', min_periods=None):
     """
     a, b: ndarrays
     """
     if len(a) != len(b):
         raise AssertionError('Operands to nancorr must have same size')

+    if min_periods is None:
+        min_periods = 1
+
     valid = notnull(a) & notnull(b)
     if not valid.all():
         a = a[valid]
         b = b[valid]

-    if len(a) == 0:
+    if len(a) < min_periods:
         return np.nan

     f = get_corr_func(method)
@@ -427,16 +430,19 @@ def _spearman(a, b):
     return _cor_methods[method]


-def nancov(a, b):
+def nancov(a, b, min_periods=None):
     if len(a) != len(b):
         raise AssertionError('Operands to nancov must have same size')

+    if min_periods is None:
+        min_periods = 1
+
     valid = notnull(a) & notnull(b)
     if not valid.all():
         a = a[valid]
         b = b[valid]

-    if len(a) == 0:
+    if len(a) < min_periods:
         return np.nan

     return np.cov(a, b)[0, 1]
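
For readers without the surrounding module at hand, the ``nancov`` change can be approximated with the standard library alone. The function name ``pairwise_cov`` is hypothetical, and the explicit sample-covariance formula replaces the ``np.cov`` call in the diff. Returning ``NaN`` for fewer than two complete pairs is an assumption of this sketch.

```python
import math


def pairwise_cov(a, b, min_periods=None):
    """Sample covariance over positions where both inputs are non-NaN.

    Hypothetical stand-in for the patched ``nancov``; plain lists of floats in.
    """
    if len(a) != len(b):
        raise ValueError("Operands must have the same size")
    if min_periods is None:
        min_periods = 1  # same default as in the diff

    # Keep only complete pairs, mirroring ``valid = notnull(a) & notnull(b)``.
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]

    n = len(pairs)
    if n < min_periods:
        return math.nan  # not enough observations for a valid result
    if n < 2:
        return math.nan  # sample covariance is undefined for a single pair

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


a = [1.0, 2.0, 3.0, math.nan]
b = [2.0, 4.0, 6.0, 8.0]

print(pairwise_cov(a, b))                 # three complete pairs are used
print(pairwise_cov(a, b, min_periods=4))  # threshold not met
```

The second call returns ``NaN`` because only three complete pairs remain after the missing value is dropped.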

pandas/core/series.py

+14 -4
@@ -1529,7 +1529,8 @@ def pretty_name(x):

         return Series(data, index=names)

-    def corr(self, other, method='pearson'):
+    def corr(self, other, method='pearson',
+             min_periods=None):
         """
         Compute correlation two Series, excluding missing values

@@ -1540,21 +1541,29 @@ def corr(self, other, method='pearson'):
             pearson : standard correlation coefficient
             kendall : Kendall Tau correlation coefficient
             spearman : Spearman rank correlation
+        min_periods : int, optional
+            Minimum number of observations needed to have a valid result
+

         Returns
         -------
         correlation : float
         """
         this, other = self.align(other, join='inner', copy=False)
-        return nanops.nancorr(this.values, other.values, method=method)
+        if len(this) == 0:
+            return np.nan
+        return nanops.nancorr(this.values, other.values, method=method,
+                              min_periods=min_periods)

-    def cov(self, other):
+    def cov(self, other, min_periods=None):
         """
         Compute covariance with Series, excluding missing values

         Parameters
         ----------
         other : Series
+        min_periods : int, optional
+            Minimum number of observations needed to have a valid result

         Returns
         -------
@@ -1565,7 +1574,8 @@ def cov(self, other):
         this, other = self.align(other, join='inner')
         if len(this) == 0:
             return np.nan
-        return nanops.nancov(this.values, other.values)
+        return nanops.nancov(this.values, other.values,
+                             min_periods=min_periods)

     def diff(self, periods=1):
         """

pandas/sparse/tests/test_sparse.py

+5 -2
@@ -822,9 +822,12 @@ def test_sparse_to_dense(self):
     def test_sparse_series_ops(self):
         import sys
         buf = StringIO()
+        tmp = sys.stderr
         sys.stderr = buf
-        self._check_all(self._check_frame_ops)
-        sys.stderr = sys.__stderr__
+        try:
+            self._check_all(self._check_frame_ops)
+        finally:
+            sys.stderr = tmp

     def _check_frame_ops(self, frame):
         fill = frame.default_fill_value
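
The "better sys.stderr idiom" from the commit message saves whatever stream was installed before the test and restores it in a ``finally`` block. The old code reset to ``sys.__stderr__``, which would discard any redirection a test runner had set up. The same pattern can be packaged as a context manager; ``captured_stderr`` below is a hypothetical helper, not something this commit adds.

```python
import sys
from contextlib import contextmanager
from io import StringIO


@contextmanager
def captured_stderr():
    """Temporarily route sys.stderr into a buffer, then restore the prior stream."""
    buf = StringIO()
    saved = sys.stderr  # may already be a redirected stream, not sys.__stderr__
    sys.stderr = buf
    try:
        yield buf
    finally:
        sys.stderr = saved  # restored even if the body raises


with captured_stderr() as buf:
    print("warning: something noisy", file=sys.stderr)

captured = buf.getvalue()
```

The standard library's ``contextlib.redirect_stderr`` offers equivalent save-and-restore behavior on Python 3.5 and later.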

pandas/src/moments.pyx

+5 -2
@@ -300,7 +300,7 @@ def ewma(ndarray[double_t] input, double_t com, int adjust):

 @cython.boundscheck(False)
 @cython.wraparound(False)
-def nancorr(ndarray[float64_t, ndim=2] mat, cov=False):
+def nancorr(ndarray[float64_t, ndim=2] mat, cov=False, minp=None):
     cdef:
         Py_ssize_t i, j, xi, yi, N, K
         ndarray[float64_t, ndim=2] result
@@ -310,6 +310,9 @@ def nancorr(ndarray[float64_t, ndim=2] mat, cov=False):

     N, K = (<object> mat).shape

+    if minp is None:
+        minp = 1
+
     result = np.empty((K, K), dtype=np.float64)
     mask = np.isfinite(mat).view(np.uint8)

@@ -324,7 +327,7 @@ def nancorr(ndarray[float64_t, ndim=2] mat, cov=False):
                     sumx += vx
                     sumy += vy

-            if nobs == 0:
+            if nobs < minp:
                 result[xi, yi] = result[yi, xi] = np.NaN
             else:
                 meanx = sumx / nobs
