API: df.rolling(..).corr()/cov() when pairwise=True to return MI DataFrame #15677

Closed
wants to merge 3 commits into from
20 changes: 15 additions & 5 deletions doc/source/computation.rst
@@ -505,13 +505,18 @@ two ``Series`` or any combination of ``DataFrame/Series`` or
- ``DataFrame/DataFrame``: by default compute the statistic for matching column
names, returning a DataFrame. If the keyword argument ``pairwise=True`` is
passed then computes the statistic for each pair of columns, returning a
``Panel`` whose ``items`` are the dates in question (see :ref:`the next section
``MultiIndexed DataFrame`` whose first ``index`` level holds the dates in question (see :ref:`the next section
<stats.moments.corr_pairwise>`).

For example:

.. ipython:: python

df = pd.DataFrame(np.random.randn(1000, 4),
index=pd.date_range('1/1/2000', periods=1000),
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()

df2 = df[:20]
df2.rolling(window=5).corr(df2['B'])
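
As a rough illustration of the ``DataFrame/Series`` case shown in this hunk, the sketch below reruns the example with a fixed seed. The ``RandomState(0)`` seed and the name ``by_series`` are assumptions added here for reproducibility and are not part of the PR. Each column of the frame is correlated with the single series, so the result keeps the frame's shape.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption, the docs use unseeded random data).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(1000, 4),
                  index=pd.date_range('1/1/2000', periods=1000),
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()

df2 = df.iloc[:20]

# DataFrame/Series: every column of df2 is correlated with df2['B'].
by_series = df2.rolling(window=5).corr(df2['B'])

print(by_series.shape)   # same shape as df2
print(by_series.tail())
```

The first ``window - 1`` rows are ``NaN`` because the window is not yet full, and the ``B`` column is numerically 1 afterwards since it is correlated with itself.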

@@ -520,11 +525,16 @@ For example:
Computing rolling pairwise covariances and correlations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

Prior to version 0.20.0, if ``pairwise=True`` was passed, a ``Panel`` would be returned.
This now returns a 2-level MultiIndexed DataFrame; see the whatsnew :ref:`here <whatsnew_0200.api_breaking.rolling_pairwise>`.

In financial data analysis and other fields it's common to compute covariance
and correlation matrices for a collection of time series. Often one is also
interested in moving-window covariance and correlation matrices. This can be
done by passing the ``pairwise`` keyword argument, which in the case of
``DataFrame`` inputs will yield a ``Panel`` whose ``items`` are the dates in
``DataFrame`` inputs will yield a ``MultiIndexed DataFrame`` whose first ``index`` level holds the dates in
question. In the case of a single DataFrame argument the ``pairwise`` argument
can even be omitted:

@@ -539,12 +549,12 @@ can even be omitted:
.. ipython:: python

covs = df[['B','C','D']].rolling(window=50).cov(df[['A','B','C']], pairwise=True)
covs[df.index[-50]]
covs.unstack(-1).iloc[-50]

.. ipython:: python

correls = df.rolling(window=50).corr()
correls[df.index[-50]]
correls.unstack(-1).iloc[-50]

You can efficiently retrieve the time series of correlations between two
columns using ``.loc`` indexing:
@@ -557,7 +567,7 @@ columns using ``.loc`` indexing:
.. ipython:: python

@savefig rolling_corr_pairwise_ex.png
correls.loc[:, 'A', 'C'].plot()
correls.unstack(-1)[('A', 'C')].plot()
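
The access patterns in this section can be sketched against a small frame as follows. This is a minimal example written for a pandas version in which ``Panel`` is no longer available; the seed, the frame size, and the names ``last_matrix``, ``ac_unstacked`` and ``ac_xs`` are assumptions made for illustration. It uses a single frame on both sides, so the correlation matrices are symmetric and the pair ``('A', 'C')`` is unambiguous.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(200, 4),
                  index=pd.date_range('2000-01-01', periods=200),
                  columns=list('ABCD')).cumsum()

# Rows are (date, column) pairs; columns are the frame's columns.
correls = df.rolling(window=50).corr()

# Correlation matrix for a single date: select on the first index level.
last_matrix = correls.loc[df.index[-1]]

# Time series for one pair, two equivalent routes:
# 1) move the inner index level into the columns, then pick the pair;
ac_unstacked = correls.unstack(-1)[('A', 'C')]
# 2) take the cross-section for one inner label, then pick a column.
ac_xs = correls.xs('C', level=1)['A']

print(last_matrix)
print(ac_unstacked.tail())
```

Both routes return one value per date. ``xs`` avoids building the wide unstacked frame, which may matter when there are many columns.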

.. _stats.aggregate:

46 changes: 46 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -12,11 +12,13 @@ Highlights include:
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
- Improved user API when accessing levels in ``.groupby()``, see :ref:`here <whatsnew_0200.enhancements.groupby_access>`
- Improved support for UInt64 dtypes, see :ref:`here <whatsnew_0200.enhancements.uint64_support>`
- Window Binary Corr/Cov operations return a MultiIndex DataFrame rather than a Panel, see :ref:`here <whatsnew_0200.api_breaking.rolling_pairwise>`
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
- Support for S3 handling now uses ``s3fs``, see :ref:`here <whatsnew_0200.api_breaking.s3>`
- Google BigQuery support now uses the ``pandas-gbq`` library, see :ref:`here <whatsnew_0200.api_breaking.gbq>`
- Switched the test framework to use `pytest <http://doc.pytest.org/en/latest>`__ (:issue:`13097`)


Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations <whatsnew_0200.deprecations>` before updating.

.. contents:: What's new in v0.20.0
@@ -766,6 +768,50 @@ New Behavior:

df.groupby('A').agg([np.mean, np.std, np.min, np.max])

.. _whatsnew_0200.api_breaking.rolling_pairwise:

Window Binary Corr/Cov operations return a MultiIndex DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A binary window operation, like ``.corr()`` or ``.cov()``, when operating on a ``.rolling(..)``, ``.expanding(..)``, or ``.ewm(..)`` object,
will now return a 2-level ``MultiIndexed DataFrame`` rather than a ``Panel``. These are equivalent in function,
but MultiIndexed DataFrames enjoy more support in pandas.
See the section on :ref:`Windowed Binary Operations <stats.moments.binary>` for more information. (:issue:`15677`)

.. ipython:: python

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(100, 2),
columns=pd.Index(['A', 'B'], name='bar'),
index=pd.date_range('20160101',
periods=100, freq='D', name='foo'))
df

Old Behavior:

.. code-block:: ipython

In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B

New Behavior:

.. ipython:: python

res = df.rolling(12).corr()
res

Retrieving a correlation matrix for a cross-section:

.. ipython:: python

df.rolling(12).corr().loc['2016-04-07']
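
For code that previously read one ``Panel`` item per date, a migration along these lines may be a reasonable starting point. It is only a sketch: the names ``day``, ``matrix`` and ``per_date`` are hypothetical, and a ``Timestamp`` is used for the lookup instead of the string in the example above so that the result does not depend on partial string indexing.

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.rand(100, 2),
                  columns=pd.Index(['A', 'B'], name='bar'),
                  index=pd.date_range('20160101', periods=100,
                                      freq='D', name='foo'))

res = df.rolling(12).corr()

# One 2x2 correlation matrix for a date, like reading one Panel item.
day = pd.Timestamp('2016-04-07')
matrix = res.loc[day]

# A dict of per-date matrices, a rough stand-in for iterating Panel items.
per_date = {date: block.droplevel(0)
            for date, block in res.groupby(level=0)}

print(matrix)
print(len(per_date))
```

Grouping on the first index level keeps the dates whose windows are not yet full; their matrices simply contain ``NaN``.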

.. _whatsnew_0200.api_breaking.hdfstore_where:

HDFStore where string comparison
29 changes: 27 additions & 2 deletions pandas/core/window.py
@@ -1652,7 +1652,8 @@ def _cov(x, y):


def _flex_binary_moment(arg1, arg2, f, pairwise=False):
from pandas import Series, DataFrame, Panel
from pandas import Series, DataFrame

if not (isinstance(arg1, (np.ndarray, Series, DataFrame)) and
isinstance(arg2, (np.ndarray, Series, DataFrame))):
raise TypeError("arguments to moment function must be of type "
@@ -1703,12 +1704,36 @@ def dataframe_from_int_dict(data, frame_template):
else:
results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
arg2.iloc[:, j]))

# TODO: not the most efficient (perf-wise)
# though not bad code-wise
from pandas import Panel, MultiIndex, Index
p = Panel.from_dict(results).swapaxes('items', 'major')
if len(p.major_axis) > 0:
p.major_axis = arg1.columns[p.major_axis]
if len(p.minor_axis) > 0:
p.minor_axis = arg2.columns[p.minor_axis]
return p

if len(p.items):
result = pd.concat(
[p.iloc[i].T for i in range(len(p.items))],
keys=p.items)
else:

result = DataFrame(
index=MultiIndex(levels=[arg1.index, arg1.columns],
labels=[[], []]),
columns=arg2.columns,
dtype='float64')

# reset our names to arg1 names
# careful not to mutate the original names
result.columns = Index(result.columns).set_names(None)
result.index = result.index.set_names(
[arg1.index.name, arg1.columns.name])

return result

else:
raise ValueError("'pairwise' is not True/False")
else:
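
The result layout this hunk builds (rows indexed by date and ``arg1`` column, columns taken from ``arg2``) can be sketched without ``Panel``, which is not available in current pandas. The helper ``pairwise_rolling_corr`` below is hypothetical and is not the PR's implementation: it computes each pair with ``Series.rolling(...).corr(...)``, stacks the per-column frames with ``pd.concat`` keys, and reorders the index. The layout produced by an installed pandas for mixed ``arg1``/``arg2`` frames may differ from the one described in this diff.

```python
import numpy as np
import pandas as pd


def pairwise_rolling_corr(left, right, window):
    """Hypothetical helper: rolling corr for every (left, right) column pair."""
    per_left = {}
    for lcol in left.columns:
        # One frame per left column: its rolling corr with each right column.
        per_left[lcol] = pd.DataFrame(
            {rcol: left[lcol].rolling(window).corr(right[rcol])
             for rcol in right.columns})
    # concat uses the dict keys as an outer index level: (left column, date).
    stacked = pd.concat(per_left)
    # Reorder to (date, left column), grouped by date.
    target = pd.MultiIndex.from_product([left.index, left.columns])
    return stacked.swaplevel(0, 1).reindex(target)


rng = np.random.RandomState(1)
dates = pd.date_range('2020-01-01', periods=30)
left = pd.DataFrame(rng.randn(30, 2), index=dates, columns=['A', 'B'])
right = pd.DataFrame(rng.randn(30, 3), index=dates, columns=['X', 'Y', 'Z'])

out = pairwise_rolling_corr(left, right, window=5)
# Direct computation for one pair, used as a cross-check.
direct = left['B'].rolling(5).corr(right['Y'])
print(out.tail())
```

Reindexing against an explicit product of dates and column labels keeps the original column order within each date rather than relying on a sort.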
20 changes: 7 additions & 13 deletions pandas/indexes/multi.py
@@ -2069,20 +2069,14 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
else:

loc = level_index.get_loc(key)
if level > 0 or self.lexsort_depth == 0:
if isinstance(loc, slice):
return loc
elif level > 0 or self.lexsort_depth == 0:
return np.array(labels == loc, dtype=bool)
else:
# sorted, so can return slice object -> view
try:
loc = labels.dtype.type(loc)
except TypeError:
# this occurs when loc is a slice (partial string indexing)
# but the TypeError raised by searchsorted in this case
# is catched in Index._has_valid_type()
pass
i = labels.searchsorted(loc, side='left')
j = labels.searchsorted(loc, side='right')
return slice(i, j)

i = labels.searchsorted(loc, side='left')
j = labels.searchsorted(loc, side='right')
return slice(i, j)
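
The branch structure after this change can be illustrated with plain NumPy. The function ``locate_code`` and its ``is_sorted`` flag are hypothetical simplifications: in the diff the decision depends on ``level`` and ``self.lexsort_depth``, which are collapsed into one boolean here. A slice from ``get_loc`` is passed through, unsorted codes yield a boolean mask, and sorted codes yield a slice found by binary search.

```python
import numpy as np


def locate_code(codes, loc, is_sorted):
    """Hypothetical sketch of the three cases in the hunk above."""
    if isinstance(loc, slice):
        # e.g. partial string indexing already produced a slice
        return loc
    if not is_sorted:
        # cannot binary-search: fall back to an elementwise comparison
        return np.array(codes == loc, dtype=bool)
    # sorted codes: the matching run is contiguous, so return a slice
    i = codes.searchsorted(loc, side='left')
    j = codes.searchsorted(loc, side='right')
    return slice(int(i), int(j))


sorted_codes = np.array([0, 0, 1, 1, 1, 2])
unsorted_codes = np.array([1, 0, 1, 2])

as_slice = locate_code(sorted_codes, 1, is_sorted=True)
as_mask = locate_code(unsorted_codes, 1, is_sorted=False)
passed_through = locate_code(sorted_codes, slice(1, 3), is_sorted=True)
print(as_slice, as_mask, passed_through)
```

Returning a slice in the sorted case lets the caller take a view instead of copying through a boolean mask.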

def get_locs(self, tup):
"""