DOC: add documentation to core.window.corr #20268

Merged
merged 8 commits into from
Jul 8, 2018
Changes from 3 commits
111 changes: 101 additions & 10 deletions pandas/core/window.py
@@ -1028,19 +1028,112 @@ def _get_cov(X, Y):
_get_cov, pairwise=bool(pairwise))

_shared_docs['corr'] = dedent("""
%(name)s sample correlation
Calculate %(name)s correlation.

This function uses Pearson's definition of correlation.
Contributor

Can you add a link to Wikipedia or similar here?


Parameters
----------
other : Series, DataFrame, or ndarray, optional
if not supplied then will default to self and produce pairwise output
If not supplied then will default to self.
pairwise : bool, default None
If False then only matching columns between self and other will be
used and the output will be a DataFrame.
If True then all pairwise combinations will be calculated and the
output will be a MultiIndex DataFrame in the case of DataFrame inputs.
In the case of missing elements, only complete pairwise observations
will be used.""")
Calculate pairwise combinations of columns within a
DataFrame. If other is not specified, defaults to True,
Member

Put backticks around parameters and built-ins, so `other`, `True`, and `False` here.

otherwise defaults to False. Not relevant for Series.
Member

:class:`~pandas.Series`

See notes.
**kwargs
Contributor

IIRC we remove this. @TomAugspurger

Contributor

This line is fine.

For the explanation, you can put "For compatibility with other %(name)s methods. Not used."

Under Review.

Returns
-------
Series or DataFrame
Returned object type is determined by the caller of the
%(name)s calculation.

See Also
--------
Series.%(name)s : Calling object with Series data
DataFrame.%(name)s : Calling object with DataFrames
Series.corr : Equivalent method for Series
DataFrame.corr : Equivalent method for DataFrame
%(name)s.cov : Similar method to calculate covariance
numpy.corrcoef : NumPy Pearson's correlation calculation

Notes
-----
Other should always be specified, except for DataFrame inputs with
Member

`other` and :class:`~pandas.DataFrame`

pairwise set to `True`. All other input combinations will return all 1's.
Contributor

"pairwise" in single back-ticks


Function will return `NaN`s for correlations of equal valued sequences;
Contributor

I don't understand this. If the sequences are equally valued, as in the case of not specifying `other` with `pairwise=False`, the correlation of each column with itself should be all 1's. Am I wrong?

Contributor Author

This is true, but trivial. I'm rewording for clarity.

this is the result of a 0/0 division error.

When pairwise is set to `False`, only matching columns between self and
Member

`self` and `other`

other will be used.

When pairwise is set to `True`, the output will be a MultiIndex DataFrame
with the original index on the first level, and the "other" DataFrame
Contributor

other in single back-ticks I believe

columns on the second level.

In the case of missing elements, only complete pairwise observations
Contributor

I understand the correlation of "non-complete" elements will be set to NaN? Can we write this in the explanation if so?

Contributor Author

This is not correct as currently implemented. I agree that this would be the desired behavior, but would require a separate pull request.

will be used.

Examples
--------
The below example shows a rolling calculation with a window size of
four matching the equivalent function call using `numpy.corrcoef`.

>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> fmt = "{0:.6f}" # limit the printed precision to 6 digits
>>> import numpy as np
Contributor

You don't need to import numpy, it's automatically imported for each docstring, see https://python-sprints.github.io/pandas/guide/pandas_docstring.html#conventions-for-the-examples

>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(fmt.format(np.corrcoef(v1[:-1], v2[:-1])[0][1]))
0.333333
>>> print(fmt.format(np.corrcoef(v1[1:], v2[1:])[0][1]))
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0 NaN
1 NaN
2 NaN
3 0.333333
4 0.916949
dtype: float64

The below example shows a similar rolling calculation on a
DataFrame using the pairwise option.

>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.],\
[46., 31.], [50., 36.]])
>>> print(np.corrcoef(matrix[:-1,0], matrix[:-1,1]).round(7))
[[1. 0.6263001]
[0.6263001 1. ]]
>>> print(np.corrcoef(matrix[1:,0], matrix[1:,1]).round(7))
[[1. 0.5553681]
[0.5553681 1. ]]
>>> df = pd.DataFrame(matrix, columns=['X','Y'])
>>> df
X Y
0 51.0 35.0
1 49.0 30.0
2 47.0 32.0
3 46.0 31.0
4 50.0 36.0
>>> df.rolling(4).corr(pairwise=True)
X Y
0 X NaN NaN
Y NaN NaN
1 X NaN NaN
Y NaN NaN
2 X NaN NaN
Y NaN NaN
3 X 1.000000 0.626300
Y 0.626300 1.000000
4 X 1.000000 0.555368
Y 0.555368 1.000000
""")

def corr(self, other=None, pairwise=None, **kwargs):
if other is None:
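
The Notes in the docstring above say that a window of equal-valued data yields `NaN` because of a 0/0 division. A minimal sketch of that case follows, assuming a recent pandas release; the series `s1` and `s2` are made-up illustration data, not taken from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: s1 is constant over its first four values,
# so the first full window of size 4 has zero variance in s1.
s1 = pd.Series([3.0, 3.0, 3.0, 3.0, 8.0])
s2 = pd.Series([3.0, 4.0, 4.0, 4.0, 8.0])

result = s1.rolling(4).corr(s2)

# Rows 0-2 have fewer than 4 observations, so they are NaN.
# Row 3 covers s1 = [3, 3, 3, 3]: Pearson's formula divides by the
# standard deviation of s1, which is 0 here, hence NaN.
constant_window_is_nan = bool(np.isnan(result.iloc[3]))
print(result)
print("constant window gives NaN:", constant_window_is_nan)
```

This is distinct from the case raised in review, where a column is correlated with itself: there the variance is generally non-zero and the result is 1.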
@@ -1288,7 +1381,6 @@ def cov(self, other=None, pairwise=None, ddof=1, **kwargs):
ddof=ddof, **kwargs)

@Substitution(name='rolling')
@Appender(_doc_template)
@Appender(_shared_docs['corr'])
def corr(self, other=None, pairwise=None, **kwargs):
return super(Rolling, self).corr(other=other, pairwise=pairwise,
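
The `%(name)s` placeholders in the shared docstring are filled per window type by the `@Substitution(name=...)` decorator seen in this hunk. The sketch below illustrates only the underlying idea with plain `%`-formatting; `render` and the trimmed `shared_doc` template are hypothetical stand-ins, not pandas' actual decorator code.

```python
from textwrap import dedent

# Hypothetical, trimmed-down version of the shared template.
shared_doc = dedent("""
    Calculate %(name)s correlation.

    See Also
    --------
    Series.%(name)s : Calling object with Series data.
    """)


def render(template, **params):
    """Fill %(key)s placeholders; a stand-in for what Substitution achieves."""
    return template % params


# One template yields a docstring for each window type.
rolling_doc = render(shared_doc, name="rolling")
expanding_doc = render(shared_doc, name="expanding")
print(rolling_doc)
```

Because the template is formatted with `%`, any literal percent sign added to the shared docstring would need to be written as `%%`.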
@@ -1527,7 +1619,6 @@ def cov(self, other=None, pairwise=None, ddof=1, **kwargs):
ddof=ddof, **kwargs)

@Substitution(name='expanding')
@Appender(_doc_template)
@Appender(_shared_docs['corr'])
def corr(self, other=None, pairwise=None, **kwargs):
return super(Expanding, self).corr(other=other, pairwise=pairwise,
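
The pairwise example in the docstring returns a MultiIndex DataFrame, with the original index on the first level and the column names on the second. A small sketch of how one entry can be selected and cross-checked against `numpy.corrcoef`, assuming a recent pandas release and reusing the `X`/`Y` data from the docstring example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[51., 35.], [49., 30.], [47., 32.], [46., 31.], [50., 36.]],
    columns=["X", "Y"],
)

pairwise = df.rolling(4).corr(pairwise=True)

# Selecting the first index level gives the 2x2 correlation matrix
# for the window that ends at row label 3.
window_matrix = pairwise.loc[3]

# A single coefficient: row label 3, correlation of X with Y.
xy = pairwise.loc[(3, "X"), "Y"]

# The same window computed directly with NumPy (rows 0..3).
expected = np.corrcoef(df["X"].iloc[:4], df["Y"].iloc[:4])[0, 1]
print(window_matrix)
print(np.isclose(xy, expected))
```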