DOC: add documentation to core.window.corr #20268

Merged
8 commits merged into pandas-dev:master on Jul 8, 2018

Conversation

theandygross
Contributor

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
################# Docstring (pandas.core.window.Rolling.corr)  #################
################################################################################

Calculate rolling correlation.

This function uses Pearson's definition of correlation.

Parameters
----------
other : Series, DataFrame, or ndarray, optional
    If not supplied then will default to self.
pairwise : bool, default None
    Calculate pairwise combinations of columns within a
    DataFrame. If other is not specified, defaults to True,
    otherwise defaults to False. Not relevant for Series.
    See notes.
**kwargs
    Under Review.

Returns
-------
Series or DataFrame
    Returned object type is determined by the caller of the
    rolling calculation.

See Also
--------
Series.rolling : Calling object with Series data
DataFrame.rolling : Calling object with DataFrames
Series.corr : Equivalent method for Series
DataFrame.corr : Equivalent method for DataFrame
rolling.cov : Similar method to calculate covariance
numpy.corrcoef : NumPy Pearson's correlation calculation

Notes
-----
Other should always be specified, except for DataFrame inputs with
pairwise set to `True`. All other input combinations will return all 1's.

Function will return `NaN`s for correlations of equal valued sequences;
this is the result of a 0/0 division error.

When pairwise is set to `False`, only matching columns between self and
other will be used.

When pairwise is set to `True`, the output will be a MultiIndex DataFrame
with the original index on the first level, and the "other" DataFrame
columns on the second level.

In the case of missing elements, only complete pairwise observations
will be used.

Examples
--------
The below example shows a rolling calculation with a window size of
four matching the equivalent function call using `numpy.corrcoef`.

>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> fmt = "{0:.6f}"  # limit the printed precision to 6 digits
>>> import numpy as np
>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(fmt.format(np.corrcoef(v1[:-1], v2[:-1])[0][1]))
0.333333
>>> print(fmt.format(np.corrcoef(v1[1:], v2[1:])[0][1]))
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0         NaN
1         NaN
2         NaN
3    0.333333
4    0.916949
dtype: float64

The below example shows a similar rolling calculation on a
DataFrame using the pairwise option.

>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.],
...                    [46., 31.], [50., 36.]])
>>> print(np.corrcoef(matrix[:-1,0], matrix[:-1,1]).round(7))
[[1.         0.6263001]
 [0.6263001  1.       ]]
>>> print(np.corrcoef(matrix[1:,0], matrix[1:,1]).round(7))
[[1.         0.5553681]
 [0.5553681  1.        ]]
>>> df = pd.DataFrame(matrix, columns=['X','Y'])
>>> df
      X     Y
0  51.0  35.0
1  49.0  30.0
2  47.0  32.0
3  46.0  31.0
4  50.0  36.0
>>> df.rolling(4).corr(pairwise=True)
            X         Y
0 X       NaN       NaN
  Y       NaN       NaN
1 X       NaN       NaN
  Y       NaN       NaN
2 X       NaN       NaN
  Y       NaN       NaN
3 X  1.000000  0.626300
  Y  0.626300  1.000000
4 X  1.000000  0.555368
  Y  0.555368  1.000000

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Errors in parameters section
		Parameters {'kwargs'} not documented
		Unknown parameters {'**kwargs'}
		Parameter "**kwargs" has no type

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.

Checklist for other PRs (remove this part if you are doing a PR for the pandas documentation sprint):

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

In the case of missing elements, only complete pairwise observations
will be used.""")
Calculate pairwise combinations of columns within a
DataFrame. If other is not specified, defaults to True,
Member

Put back ticks around parameters and built ins, so `other`, `True`, and `False` here.

will be used.""")
Calculate pairwise combinations of columns within a
DataFrame. If other is not specified, defaults to True,
otherwise defaults to False. Not relevant for Series.
Member

:class:`~pandas.Series`


Notes
-----
Other should always be specified, except for DataFrame inputs with
Member

`other` and :class:`~pandas.DataFrame`

Function will return `NaN`s for correlations of equal valued sequences;
this is the result of a 0/0 division error.

When pairwise is set to `False`, only matching columns between self and
Member

`self` and `other`

%(name)s sample correlation
Calculate %(name)s correlation.

This function uses Pearson's definition of correlation.
Contributor

Can you add a link to Wikipedia or similar here?
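
For readers without such a link at hand: Pearson's r is the covariance of the two sequences divided by the product of their standard deviations. Below is a minimal pure-Python sketch of that definition applied over a sliding window. `pearson` and `rolling_pearson` are hypothetical helper names, not pandas functions, and the code in `pandas/core/window.py` may well compute the same quantity differently. The sketch reproduces the values from the Series example in the docstring.

```python
import math


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences; NaN when undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom else math.nan


def rolling_pearson(xs, ys, window):
    """NaN until a full window is available, then Pearson's r per window."""
    out = []
    for end in range(1, len(xs) + 1):
        if end < window:
            out.append(math.nan)
        else:
            out.append(pearson(xs[end - window:end], ys[end - window:end]))
    return out


v1 = [3, 3, 3, 5, 8]
v2 = [3, 4, 4, 4, 8]
result = rolling_pearson(v1, v2, 4)
print(["{0:.6f}".format(r) for r in result])
# ['nan', 'nan', 'nan', '0.333333', '0.916949']
```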

Notes
-----
Other should always be specified, except for DataFrame inputs with
pairwise set to `True`. All other input combinations will return all 1's.
Contributor

"pairwise" in single back-ticks

Other should always be specified, except for DataFrame inputs with
pairwise set to `True`. All other input combinations will return all 1's.

Function will return `NaN`s for correlations of equal valued sequences;
Contributor

I don't understand this. If the sequences are equally valued, as when `other` is not specified and pairwise=False, the correlation of each column with itself should be all 1's. Am I wrong?

Contributor Author

This is true, but trivial. I'm rewording for clarity.
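
The two readings of "equal valued" in this exchange can be told apart with a small sketch. A sequence correlated with itself gives 1 as long as it varies, while a window whose values are all the same has zero variance, so the ratio becomes 0/0. The helper below is a hypothetical pure-Python illustration of Pearson's formula, not the pandas code path. It returns NaN explicitly because plain Python raises `ZeroDivisionError` where floating-point array code would yield NaN.

```python
import math


def pearson(xs, ys):
    """Pearson's r; returns NaN when either sequence has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(ss_x * ss_y)
    # Covariance and denominator are both 0 for a constant sequence (0/0).
    return cov / denom if denom else math.nan


varying = [3, 3, 5, 8]
constant = [2, 2, 2, 2]

r_self = pearson(varying, varying)       # a column against itself
r_constant = pearson(constant, varying)  # a window with no variation

print(r_self)      # 1.0
print(r_constant)  # nan
```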

other will be used.

When pairwise is set to `True`, the output will be a MultiIndex DataFrame
with the original index on the first level, and the "other" DataFrame
Contributor

other in single back-ticks I believe
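
As a rough picture of the pairwise layout described in that note, the sketch below builds one correlation per (index label, `self` column, `other` column) triple, using the X/Y data from the DataFrame example in the docstring. It is plain Python with a dict standing in for the MultiIndex result; `pearson` and the `pairwise` dict are illustrative names only, and rows for incomplete windows (which the docstring shows as NaN) are simply left out here.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences; NaN when undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom else math.nan


columns = {
    "X": [51.0, 49.0, 47.0, 46.0, 50.0],
    "Y": [35.0, 30.0, 32.0, 31.0, 36.0],
}
window = 4
n_rows = 5

# Key: (original index label, column of self, column of other).
pairwise = {}
for end in range(window, n_rows + 1):
    label = end - 1
    for a, b in product(columns, repeat=2):
        pairwise[(label, a, b)] = pearson(
            columns[a][end - window:end], columns[b][end - window:end]
        )

for key in sorted(pairwise):
    print(key, "{0:.6f}".format(pairwise[key]))
```

The off-diagonal entries for labels 3 and 4 should match the 0.626300 and 0.555368 shown in the docstring output.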

with the original index on the first level, and the "other" DataFrame
columns on the second level.

In the case of missing elements, only complete pairwise observations
Contributor

I understand the correlation of "non-complete" elements will be set to NaN? Can we write this in the explanation if so?

Contributor Author

This is not correct as currently implemented. I agree that this would be the desired behavior, but would require a separate pull request.
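
One way to read "only complete pairwise observations will be used" is sketched below: a pair is kept only when both values are present, and the correlation is computed from whatever pairs remain. This is an assumption about the wording of the docstring, written in plain Python with a hypothetical helper `complete_pairs`. It does not claim to mirror how pandas treats missing values inside a window, which this thread leaves to a separate pull request.

```python
import math


def complete_pairs(xs, ys):
    """Keep only positions where neither value is NaN."""
    return [
        (x, y)
        for x, y in zip(xs, ys)
        if not (math.isnan(x) or math.isnan(y))
    ]


def pearson_from_pairs(pairs):
    """Pearson's r over a list of (x, y) pairs; NaN when undefined."""
    if not pairs:
        return math.nan
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    denom = math.sqrt(ss_x * ss_y)
    return cov / denom if denom else math.nan


xs = [3.0, math.nan, 3.0, 5.0, 8.0]
ys = [3.0, 4.0, 4.0, math.nan, 8.0]

pairs = complete_pairs(xs, ys)
print(pairs)  # [(3.0, 3.0), (3.0, 4.0), (8.0, 8.0)]
r = pearson_from_pairs(pairs)
print(r)
```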

>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> fmt = "{0:.6f}" # limit the printed precision to 6 digits
>>> import numpy as np
Contributor

You don't need to import numpy, it's automatically imported for each docstring, see https://python-sprints.github.io/pandas/guide/pandas_docstring.html#conventions-for-the-examples

pep8speaks commented Apr 15, 2018

Hello @theandygross! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 08, 2018 at 14:31 Hours UTC

%(name)s sample correlation
Calculate %(name)s correlation.

This function uses Pearson's definition of correlation
Contributor

can you move to Notes

DataFrame. If `other` is not specified, defaults to `True`,
otherwise defaults to `False`. Not relevant for :class:`~pandas.Series`.
See notes.
**kwargs
Contributor

IIRC we remove this. @TomAugspurger

Contributor

This line is fine.

For the explanation, you can put "For compatibility with other %(name)s methods. Not used."
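
The `%(name)s` placeholder in the suggested wording implies the docstring is a shared template filled in once per window type. A minimal sketch of that idea with plain printf-style formatting is shown below. The template text is abridged from the lines quoted in this thread, the name "expanding" is an assumed second window type for illustration, and the mechanism pandas actually uses to apply the substitution is not shown in this conversation.

```python
# Shared docstring template; %(name)s is replaced per window type.
template = """Calculate %(name)s correlation.

Parameters
----------
**kwargs
    For compatibility with other %(name)s methods. Not used.
"""

rendered = {}
for name in ("rolling", "expanding"):  # "expanding" is an assumed example
    rendered[name] = template % {"name": name}

print(rendered["rolling"])
```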

codecov bot commented Jul 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@13febab).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master   #20268   +/-   ##
=========================================
  Coverage          ?   91.84%           
=========================================
  Files             ?      153           
  Lines             ?    49275           
  Branches          ?        0           
=========================================
  Hits              ?    45255           
  Misses            ?     4020           
  Partials          ?        0
Flag        Coverage Δ
#multiple   90.23% <100%> (?)
#single     41.9% <27.27%> (?)

Impacted Files          Coverage Δ
pandas/core/window.py   96.25% <100%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 13febab...4c322b0.

WillAyd merged commit 7d58ce6 into pandas-dev:master on Jul 8, 2018

WillAyd (Member) commented Jul 8, 2018

Thanks @theandygross !

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018