DOC: Improve the docstring of DataFrame.describe() #20222

nehiljain · 2018-03-10T19:23:53Z

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

PR title is "DOC: update the docstring"
The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
The html version looks good: python doc/make.py --single <your-function-or-method>
It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.DataFrame.describe)  ####################
################################################################################

Generate descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
``NaN`` values.

Analyzes both numeric and object series, as well
as ``DataFrame`` column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should
    fall between 0 and 1. The default is
    ``[.25, .5, .75]``, which returns the 25th, 50th, and
    75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

    - 'all' : All columns of the input will be included in the output.
    - A list-like of dtypes : Limits the results to the
      provided data types.
      To limit the result to numeric types submit
      ``numpy.number``. To limit it instead to object columns submit
      the ``numpy.object`` data type. Strings
      can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      select pandas categorical columns, use ``'category'``
    - None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

    - A list-like of dtypes : Excludes the provided data types
      from the result. To exclude numeric types submit
      ``numpy.number``. To exclude object columns submit the data
      type ``numpy.object``. Strings can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(exclude=['O'])``). To
      exclude pandas categorical columns, use ``'category'``
    - None (default) : The result will exclude nothing.

Returns
-------
Series or DataFrame
    Summary statistics of the Series or DataFrame provided.

See Also
--------
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding
    columns based on their dtype.

Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.

For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the
``top`` result will be arbitrarily chosen from
among those with the highest count.

For mixed data types provided via a ``DataFrame``, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If ``include='all'`` is provided as an option, the result
will include a union of attributes of each type.

The `include` and `exclude` parameters can be used to limit
which columns in a ``DataFrame`` are analyzed for the output.
The parameters are ignored when analyzing a ``Series``.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.describe" correct. :)

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.
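
The `percentiles` parameter is documented above, but none of the docstring examples exercise it. A minimal sketch of how it might be used, assuming a small made-up numeric ``Series`` (the data and variable names are illustrative, not part of the PR):

```python
import pandas as pd

# Illustrative data only; not taken from the docstring under review.
s = pd.Series([1, 2, 3, 4, 5])

# Ask for the 10th, 50th and 90th percentiles instead of the
# default quartiles. Values must fall between 0 and 1.
summary = s.describe(percentiles=[.1, .5, .9])

# The percentile rows are labelled with a trailing percent sign.
labels = list(summary.index)
print(summary)
```

The requested percentiles replace the default ``25%`` and ``75%`` rows in the result's index.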

pep8speaks · 2018-03-10T19:23:55Z

Hello @nehiljain! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 08, 2018 at 04:52 Hours UTC

jreback · 2018-03-10T21:58:34Z

pandas/core/generic.py

-        top              f      NaN      c
-        freq             1      NaN      1
-        mean           NaN      2.0    NaN
-        std            NaN      1.0    NaN


Is there a reason these are being changed? They are already in alphabetical order. I suppose you could supply columns on construction to guarantee the order.

Thanks for the suggestion. Updated it with your recommendation.
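
A minimal sketch of the suggestion above, assuming the same three example columns as the docstring: passing `columns` to the constructor pins the column order regardless of how the dict is ordered. This illustrates the idea only and is not necessarily the exact code that was committed.

```python
import pandas as pd

# The dict is deliberately written in a different order than the
# one we want in the resulting frame.
data = {
    'object': ['a', 'b', 'c'],
    'numeric': [1, 2, 3],
    'categorical': pd.Categorical(['d', 'e', 'f']),
}

# `columns` fixes the order explicitly.
df = pd.DataFrame(data, columns=['categorical', 'numeric', 'object'])

column_order = list(df.columns)
print(column_order)
```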

datapythonista

Looks good, couple of comments.

The constructor of the DataFrame with 3 columns has some non PEP-8 spaces. Besides that, it'd be good to create it in a way that guarantees the order of columns, as Jeff says.

The Returns section should be type+description, not name+type.

datapythonista · 2018-03-10T22:16:45Z

pandas/core/generic.py

-        DataFrame.mean
-        DataFrame.std
-        DataFrame.select_dtypes
+        DataFrame.count : Count number of non-NA/null observations


Codecov Report

Merging #20222 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #20222   +/-   ##
=======================================
  Coverage   91.95%   91.95%           
=======================================
  Files         160      160           
  Lines       49858    49858           
=======================================
  Hits        45845    45845           
  Misses       4013     4013

Flag	Coverage Δ
#multiple	`90.33% <ø> (ø)`	⬆️
#single	`42.08% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/generic.py	`96.45% <ø> (ø)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5cb5880...c570d24.

datapythonista · 2018-03-11T19:33:37Z

It'd be great if you could also change the Returns to follow this standard:
https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-4-returns-or-yields

I think for the type we're using Series or DataFrame instead of Series/DataFrame.

Besides that, lgtm.
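
For illustration, a Returns section in the type+description form requested here might look like the sketch below. The function `summarize` is hypothetical and exists only to carry the docstring; it is not part of pandas.

```python
def summarize(values):
    """
    Compute the mean of a list of numbers.

    Parameters
    ----------
    values : list of float
        The numbers to average.

    Returns
    -------
    float
        Mean of ``values``.
    """
    # Hypothetical helper used only to demonstrate the docstring layout.
    return sum(values) / len(values)
```

The first line under the Returns header gives only the type; the indented line below it gives the description.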

datapythonista

lgtm

nehiljain · 2018-03-12T03:03:49Z

@jreback please take another look

TomAugspurger · 2018-03-21T19:59:49Z

pandas/core/generic.py

@@ -7393,7 +7405,7 @@ def describe(self, percentiles=None, include=None, exclude=None):
        Excluding object columns from a ``DataFrame`` description.

        >>> df.describe(exclude=[np.object])
-                categorical  numeric
+               categorical  numeric


When running the validation script, I occasionally get a failure

    Line 210, in pandas.DataFrame.describe
    Failed example:
        df.describe(exclude=[np.number])
    Expected:
               categorical object
        count            3      3
        unique           3      3
        top              f      c
        freq             1      1
    Got:
               categorical object
        count            3      3
        unique           3      3
        top              f      a
        freq             1      1

Did you see this at all? This likely is an issue in the method itself, and not the docstring.

Yeah, I do see this error, but it's flaky.

To be clear, it's probably some kind of non-stable sorting inside the describe method, and nothing wrong with the docstring. It may be best to just include the docstring, and open a new issue.

The strange thing is that just doing

pd.DataFrame({"A": pd.Categorical(['d', 'e', 'f']), "B": ['a', 'b', 'c'], 'C': [1, 2, 3]}).describe(exclude=['number'])

seems deterministic.
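
To make the tie concrete: in the example column every value occurs exactly once, so any of them is a valid `top`. The sketch below shows the tie and one possible deterministic tie-break. `stable_top` is a hypothetical helper written for this illustration; it is not how `describe` is implemented, and the actual cause of the flakiness is not established in this thread.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Every value occurs once, so all three tie for the highest count.
counts = s.value_counts()
tied = sorted(counts[counts == counts.max()].index)


def stable_top(series):
    """Hypothetical helper: break ties by taking the smallest value."""
    vc = series.value_counts()
    return min(vc[vc == vc.max()].index)


# Any of the tied values is an acceptable `top` from describe().
top = s.describe()['top']
print(tied, top, stable_top(s))
```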

TomAugspurger · 2018-03-21T20:00:13Z

pandas/core/generic.py

@@ -7305,9 +7317,9 @@ def describe(self, percentiles=None, include=None, exclude=None):
        Describing a ``DataFrame``. By default only numeric fields
        are returned.

-        >>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
+        >>> df = pd.DataFrame({ 'categorical': pd.Categorical(['d','e','f']),


PEP8 formatting here. No space after the {, spaces after the , in the Categorical.

TomAugspurger · 2018-03-21T20:00:45Z

Only concern is #20222 (comment), which isn't an issue with the docstring. LGTM otherwise.

jreback · 2018-07-08T12:59:51Z

thanks @nehiljain and @mroeschke for the fixup!

DOC: Improve the docstring of DataFrame.describe()

916624c

jreback requested changes Mar 10, 2018

jreback added the Docs and Numeric Operations labels Mar 10, 2018

datapythonista reviewed Mar 10, 2018

nehiljain added 3 commits March 11, 2018 15:13

fixed order of columns

d365098

more comments incorporated

aa13b25

return documentation changed and see also moved so section 5 as per d…

8da3c9a

…ocumenation conventions

datapythonista approved these changes Mar 11, 2018

TomAugspurger reviewed Mar 21, 2018

nehiljain and others added 3 commits March 21, 2018 16:11

missed a pep-8 related comment

1277860

Merge remote-tracking branch 'upstream/master' into docstrings_datafr…

0f4e8ed

…ame_describe

mroeschke added this to the 0.24.0 milestone Jul 7, 2018

Merge remote-tracking branch 'upstream/master' into docstrings_datafr…

c570d24

…ame_describe

jreback approved these changes Jul 8, 2018

jreback merged commit e2800fa into pandas-dev:master Jul 8, 2018

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

DOC: Improve the docstring of DataFrame.describe() (pandas-dev#20222)

ca4ec32
