Skip to content

DOC: update the pandas.DataFrame.cummax docstring #20336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Mar 17, 2018
Merged
Changes from 14 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
5ccedc2
DOC: Improve the docstring of DataFrame.cummax
arminv Mar 13, 2018
04f70dd
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 13, 2018
aec6084
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 13, 2018
4acf753
DOC: Improve the docstring of DataFrame.cummax()
arminv Mar 13, 2018
1214c93
DOC: Improve the docstring of pandas.DataFrame.cummax
arminv Mar 13, 2018
a88e95a
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 14, 2018
fe94dad
DOC: Improve the docstring of DataFrame.cummax
arminv Mar 14, 2018
f73b52f
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 14, 2018
33e5337
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 15, 2018
15b38dd
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 15, 2018
3c30d18
Improved examples
arminv Mar 16, 2018
0cb3168
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 16, 2018
9d46623
Addressed PEP8 issues
arminv Mar 16, 2018
5d502cb
Addressed PEP 8 issues
arminv Mar 16, 2018
e1e190f
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 17, 2018
aa34ea0
Made See also of Series consistent
arminv Mar 17, 2018
94fc1b3
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 17, 2018
657feac
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 17, 2018
77789a8
Improved example wording. Addressed PEP8
arminv Mar 17, 2018
463eef7
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 17, 2018
b03c32a
More templating in See also.Fixed typos
arminv Mar 17, 2018
9b05313
Merge remote-tracking branch 'upstream/master' into docstring_cummax
arminv Mar 17, 2018
1147a0d
Improved templating of See also section
arminv Mar 17, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
310 changes: 296 additions & 14 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -8408,17 +8408,19 @@ def compound(self, axis=None, skipna=None, level=None):
cls.cummin = _make_cum_function(
cls, 'cummin', name, name2, axis_descr, "cumulative minimum",
lambda y, axis: np.minimum.accumulate(y, axis), "min",
np.inf, np.nan)
np.inf, np.nan, _cummin_examples)
cls.cumsum = _make_cum_function(
cls, 'cumsum', name, name2, axis_descr, "cumulative sum",
lambda y, axis: y.cumsum(axis), "sum", 0., np.nan)
lambda y, axis: y.cumsum(axis), "sum", 0.,
np.nan, _cumsum_examples)
cls.cumprod = _make_cum_function(
cls, 'cumprod', name, name2, axis_descr, "cumulative product",
lambda y, axis: y.cumprod(axis), "prod", 1., np.nan)
lambda y, axis: y.cumprod(axis), "prod", 1.,
np.nan, _cumprod_examples)
cls.cummax = _make_cum_function(
cls, 'cummax', name, name2, axis_descr, "cumulative max",
cls, 'cummax', name, name2, axis_descr, "cumulative maximum",
lambda y, axis: np.maximum.accumulate(y, axis), "max",
-np.inf, np.nan)
-np.inf, np.nan, _cummax_examples)

cls.sum = _make_min_count_stat_function(
cls, 'sum', name, name2, axis_descr,
Expand Down Expand Up @@ -8680,24 +8682,304 @@ def _doc_parms(cls):
"""

_cnum_doc = """
Return %(desc)s over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the %(desc)s.

Parameters
----------
axis : %(axis_descr)s
axis : {0 or 'index', 1 or 'columns'}, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not technically right that default is 0, I think it's None, which I guess it's equivalent to 0.

Can you double check, and and change it if that's right. Something like {0 or 'index', 1 or 'columns'} or None, default None would probably be the most standard way if that's right. And a description about the axis would be useful (pointing out that None means index if that's the case).

If you check recent PRs there are some with a an axis parameter that you can check for reference.

Copy link
Contributor Author

@arminv arminv Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is right, cum_func (i.e. function corresponding to all cumulative methods) is defined with axis=None as default argument.

I also found this regarding the correct format of axis parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it is technically None, in practice it is 0 for Series/DataFrame, so I would keep the documentation like this.
The technical reason is because for Panel it is 1, but Panel is deprecated and I think we should not care about them in the documentation.

The index or the name of the axis. 0 is equivalent to None or 'index'.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA
will be NA.
*args, **kwargs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here an explanation for args/kwargs: "Additional arguments have no effect but might be accepted for compatibility with NumPy."


Returns
-------
%(outname)s : %(name1)s\n


%(outname)s : %(name1)s or %(name2)s\n
%(examples)s
See also
--------
pandas.core.window.Expanding.%(accum_func_name)s : Similar functionality
but ignores ``NaN`` values.
Series.%(outname)s : Return %(desc)s over Series axis.
DataFrame.%(accum_func_name)s : Return the %(accum_func_name)s over
DataFrame axis.
Copy link
Contributor Author

@arminv arminv Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I used templating and %(accum_func_name)s to add DataFrame.prod to 'See also' section of DataFrame.cumprod, etc. All methods look good except DataFrame.prod where it reads: 'Return the prod over DataFrame axis' instead of 'Return the product over...'. Any suggestions on this? Should we even worry about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One relatively easy solution would be to change the desc param to _make_cum_function from "cumulative produce" to "product", because we can put the "cumulative" (which is the same for each of the 4 functions" in the template itself. Then you can use that name in the see also description.

DataFrame.cummax : Return cumulative maximum over DataFrame axis.
Copy link
Contributor Author

@arminv arminv Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista @jorisvandenbossche Is it a good idea to also add: DataFrame.sum as a relevant method in the 'See also' section?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that might be a good idea, if you can automatically make it link in DataFrame.cumsum to DataFrame.sum, in DataFrame.cummin to DataFrame.min, etc (using the templating)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also %(name2)s like below ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I implement these changes, we won't get a reference to any corresponding Series in 'See also' of DataFrame methods and vice versa. That's why I left them like this (at the expense of repeating one of them).

I agree that we don't really need them because we have examples of both Series & DataFrame in all docstrings and this should be informative enough.

Just out of curiosity, is there a simple way to determine whether a Series or DataFrame method is being accessed at any one time? For example, if we were generating the doc for Series.cummax, we would determine it is a Series method. Based on this, we know that we need to show cummax but for a DataFrame.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the docstring for both Series and DataFrame (of the same method) will be exactly the same (apart from some of the links here in see also). So linking to the same method but on the other object, is not that important I think, as the other one does not give you more information

Just out of curiosity, is there a simple way to determine whether a Series or DataFrame method is being accessed at any one time? For example, if we were generating the doc for Series.cummax, we would determine it is a Series method. Based on this, we know that we need to show cummax but for a DataFrame.

Yes, the cls that is passed to _make_cum_function is either the Series or DataFrame class. So you can pass cls.__name__ as a variable to the template, but I think this is already done as name2 ? Based on that you can know the other.

DataFrame.cummin : Return cumulative minimum over DataFrame axis.
DataFrame.cumsum : Return cumulative sum over DataFrame axis.
DataFrame.cumprod : Return cumulative product over DataFrame axis.
"""

_cummin_examples = """\
Examples
--------
**Series**

>>> s = pd.Series([2,np.nan,5,-1,0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8, spaces after ,

>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64

skipna=True : Default value, ignores NaN values during operation:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we mostly use prose for the intermittent text in examples. Perhaps, say

By default, NA values are ignored.

>>> s.cummin()
...

To include NA values in the operation, use ``skipna=False``

>>> s.cummin(skipna=False) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, what you have is great and if the others are OK with it I'm +1 as well. Maybe wait to hear feedback from @datapythonista and @jorisvandenbossche before updating.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think indeed your prose alternative reads a bit nicer.


>>> s.cummin()
0 2.0
1 NaN
2 2.0
3 -1.0
4 -1.0
dtype: float64

skipna=False : Includes NaN values:

>>> s.cummin(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64

**DataFrame**

>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0

skipna : Works in the same way as for Series.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably be removed.


axis=0 : Default value, equivalent to axis=None or axis='index'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about the prose.

Iterates over rows and finds the minimum in each column.
If value is smaller than the previous one, updates it:

>>> df.cummin()
A B
0 2.0 1.0
1 2.0 NaN
2 1.0 0.0

axis=1 : Iterates over columns and finds the minimum in each row.
If value is smaller than the previous one, updates it:

>>> df.cummin(axis=1)
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
"""

_cumsum_examples = """\
Examples
--------
**Series**

>>> s = pd.Series([2,np.nan,5,-1,0])
>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64

skipna=True : Default value, ignores NaN values during operation:

>>> s.cumsum()
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64

skipna=False : Includes NaN values:

>>> s.cumsum(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64

**DataFrame**

>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0

skipna : Works in the same way as for Series.

axis=0 : Default value, equivalent to axis=None or axis='index'.
Iterates over rows and finds the cumulative sum of values in each column.

>>> df.cumsum()
A B
0 2.0 1.0
1 5.0 NaN
2 6.0 1.0

axis=1 : Iterates over columns and finds the cumulative sum of
values in each row.

>>> df.cumsum(axis=1)
A B
0 2.0 3.0
1 3.0 NaN
2 1.0 1.0
"""

_cumprod_examples = """\
Examples
--------
**Series**

>>> s = pd.Series([2,np.nan,5,-1,0])
>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64

skipna=True : Default value, ignores NaN values during operation:

>>> s.cumprod()
0 2.0
1 NaN
2 10.0
3 -10.0
4 -0.0
dtype: float64

skipna=False : Includes NaN values:

>>> s.cumprod(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64

**DataFrame**

>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0

skipna : Works in the same way as for Series.

axis=0 : Default value, equivalent to axis=None or axis='index'.
Iterates over rows and finds the cumulative product of values in each column.

>>> df.cumprod()
A B
0 2.0 1.0
1 6.0 NaN
2 6.0 0.0

axis=1 : Iterates over columns and finds the cumulative product
of values in each row.

>>> df.cumprod(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 0.0
"""

_cummax_examples = """\
Examples
--------
**Series**

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with all the functions this is getting very long, I'd probably avoid having examples for Series, Or may be just one.

If you keep them, personally I'd have Series first, and start the example from the simplest to the more complext

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to have separate string examples for each method, we can keep the examples for Series.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to have separate string examples for each method, we can keep the examples for Series.

+ 1

Another suggestion would be to start with Series to just illustrate the concept of "cumulative max", as this will make the examples a little bit easier, and show the effect of NaNs. And then show DataFrame, saying that by default the same happens for each column of the DataFrame, and optionally use axis=1 to take cumulative max for each row.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I agree, it is definitely easier to see what is going on with NaNs if we use a Series example instead of DataFrame. I will change it this way.

>>> s = pd.Series([2,np.nan,5,-1,0])
>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64

skipna=True : Default value, ignores NaN values during operation:

>>> s.cummax()
0 2.0
1 NaN
2 5.0
3 5.0
4 5.0
dtype: float64

skipna=False : Includes NaN values:

>>> s.cummax(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64

**DataFrame**

>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0

skipna : Works in the same way as for Series.

axis=0 : Default value, equivalent to axis=None or axis='index'.
Iterates over rows and finds the maximum in each column.
If value is larger than the previous one, updates it:

>>> df.cummax()
A B
0 2.0 1.0
1 3.0 NaN
2 3.0 1.0

axis=1 : Iterates over columns and finds the maximum in each row.
If value is larger than the previous one, updates it:

>>> df.cummax(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 1.0
"""

_any_see_also = """\
Expand Down Expand Up @@ -8894,11 +9176,11 @@ def stat_func(self, axis=None, skipna=None, level=None, ddof=1,


def _make_cum_function(cls, name, name1, name2, axis_descr, desc,
accum_func, accum_func_name, mask_a, mask_b):
accum_func, accum_func_name, mask_a, mask_b, examples):
@Substitution(outname=name, desc=desc, name1=name1, name2=name2,
axis_descr=axis_descr, accum_func_name=accum_func_name)
@Appender("Return {0} over requested axis.".format(desc) +
_cnum_doc)
axis_descr=axis_descr, accum_func_name=accum_func_name,
examples=examples)
@Appender(_cnum_doc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all right like this, but may be it'd be simpler to leave this as it was, and have the examples in _cnum_doc, instead of in a separate variable. As they're the same for all methods, there is not much value in having them separate.

Another option would be to have a different string for each method example, in that case, something similar to this would make more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having separate string examples for each method makes everything clearer, especially when showing examples for use of skipna & axis. It also helps with keeping the docstring concise. For instance, now we can have a Series example for each method.

The disadvantage is user will only see examples for the method they’re checking, but I think this is ok because we are referencing all methods in the ‘See also’ section, which comes before 'Examples'.

In these PRs #20216 and #20217 examples for DataFrame.all and DataFrame.any are separate even though they are similar methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am also in favor of splitting up the examples.

def cum_func(self, axis=None, skipna=True, *args, **kwargs):
skipna = nv.validate_cum_func_with_skipna(skipna, args, kwargs, name)
if axis is None:
Expand Down