Skip to content

DOC: Improve the docstring of DataFrame.describe() #20222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jul 8, 2018

Conversation

nehiljain
Copy link
Contributor

@nehiljain nehiljain commented Mar 10, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.DataFrame.describe)  ####################
################################################################################

Generate descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
``NaN`` values.

Analyzes both numeric and object series, as well
as ``DataFrame`` column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list-like of numbers, optional
    The percentiles to include in the output. All should
    fall between 0 and 1. The default is
    ``[.25, .5, .75]``, which returns the 25th, 50th, and
    75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
    A white list of data types to include in the result. Ignored
    for ``Series``. Here are the options:

    - 'all' : All columns of the input will be included in the output.
    - A list-like of dtypes : Limits the results to the
      provided data types.
      To limit the result to numeric types submit
      ``numpy.number``. To limit it instead to object columns submit
      the ``numpy.object`` data type. Strings
      can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      select pandas categorical columns, use ``'category'``
    - None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional,
    A black list of data types to omit from the result. Ignored
    for ``Series``. Here are the options:

    - A list-like of dtypes : Excludes the provided data types
      from the result. To exclude numeric types submit
      ``numpy.number``. To exclude object columns submit the data
      type ``numpy.object``. Strings can also be used in the style of
      ``select_dtypes`` (e.g. ``df.describe(include=['O'])``). To
      exclude pandas categorical columns, use ``'category'``
    - None (default) : The result will exclude nothing.

Returns
-------
Series or DataFrame
    Summary statistics of the Series or Dataframe provided.

See Also
--------
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the obersvations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding
    columns based on their dtype.

Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is the
same as the median.

For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple object values have the highest count, then the
``count`` and ``top`` results will be arbitrarily chosen from
among those with the highest count.

For mixed data types provided via a ``DataFrame``, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If ``include='all'`` is provided as an option, the result
will include a union of attributes of each type.

The `include` and `exclude` parameters can be used to limit
which columns in a ``DataFrame`` are analyzed for the output.
The parameters are ignored when analyzing a ``Series``.

Examples
--------
Describing a numeric ``Series``.

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical ``Series``.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp ``Series``.

>>> s = pd.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({ 'categorical': pd.Categorical(['d','e','f']),
...                     'numeric': [1, 2, 3],
...                     'object': ['a', 'b', 'c']
...                   })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a ``DataFrame`` regardless of data type.

>>> df.describe(include='all')
        categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a ``DataFrame`` by accessing it as
an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a ``DataFrame`` description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a ``DataFrame`` description.

>>> df.describe(include=[np.object])
       object
count       3
unique      3
top         c
freq        1

Including only categorical columns from a ``DataFrame`` description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              f
freq             1

Excluding numeric columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      c
freq             1      1

Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.describe" correct. :)

If the validation script still gives errors, but you think there is a good reason
to deviate in this case (and there are certainly such cases), please state this
explicitly.

@pep8speaks
Copy link

pep8speaks commented Mar 10, 2018

Hello @nehiljain! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 08, 2018 at 04:52 Hours UTC

top f NaN c
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason these are being changed? they are already in alphabetical order. I suppose you could supply columns on construction to guarantee the order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the suggestion. Updated it with your recommendation.

@jreback jreback added Docs Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 10, 2018
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, couple of comments.

The constructor of the DataFrame with 3 columns has some non PEP-8 spaces. Besides that, it'd be good to create it in a way that guarantees the order of columns, as Jeff says.

The Returns method should be type+description, not name+type.

DataFrame.mean
DataFrame.std
DataFrame.select_dtypes
DataFrame.count : Count number of non-NA/null observations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See Also goes before Examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed thanks

…ame_describe

* upstream/master: (25 commits)
  DOC: Improved pandas.plotting.bootstrap_plot docstring (pandas-dev#20166)
  DOC: update the Index.get_values docstring (pandas-dev#20231)
  DOC: update the pandas.DataFrame.all docstring (pandas-dev#20216)
  DOC: update the Series.view docstring (pandas-dev#20220)
  DOC: update the docstring of pandas.DataFrame.from_dict (pandas-dev#20259)
  DOC: add docstring for Index.get_duplicates (pandas-dev#20223)
  Docstring pandas.series.diff (pandas-dev#20238)
  DOC: update `pandas/core/ops.py` docstring template to accept examples (pandas-dev#20246)
  DOC: update the DataFrame.iat[] docstring (pandas-dev#20219)
  DOC: update the pandas.DataFrame.diff docstring (pandas-dev#20227)
  DOC: pd.core.window.Expanding.kurt docstring (split from pd.core.Rolling.kurt) (pandas-dev#20064)
  DOC: update the pandas.date_range() docstring (pandas-dev#20143)
  DOC: update DataFrame.to_records (pandas-dev#20191)
  DOC: Improved the docstring of pandas.plotting.radviz (pandas-dev#20169)
  DOC: Update pandas.DataFrame.tail docstring (pandas-dev#20225)
  DOC: update the DataFrame.cov docstring (pandas-dev#20245)
  DOC: update pandas.DataFrame.head docstring (pandas-dev#20262)
  DOC: Improve pandas.Series.plot.kde docstring and kwargs rewording for whole file (pandas-dev#20041)
  DOC: update the DataFrame.head()  docstring (pandas-dev#20206)
  DOC: update the Index.shift docstring (pandas-dev#20192)
  ...
@codecov
Copy link

codecov bot commented Mar 11, 2018

Codecov Report

Merging #20222 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #20222   +/-   ##
=======================================
  Coverage   91.95%   91.95%           
=======================================
  Files         160      160           
  Lines       49858    49858           
=======================================
  Hits        45845    45845           
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.33% <ø> (ø) ⬆️
#single 42.08% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 96.45% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb5880...c570d24. Read the comment docs.

@datapythonista
Copy link
Member

It'd be great if you could also change the Returns to follow this standard:
https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-4-returns-or-yields

I think for the type we're using Series or DataFrame instead of Series/DataFrame.

Besides that, lgtm.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@nehiljain
Copy link
Contributor Author

@jreback please take another look

@@ -7393,7 +7405,7 @@ def describe(self, percentiles=None, include=None, exclude=None):
Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
categorical numeric
categorical numeric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running the validation script, I occasionally get a failure

Line 210, in pandas.DataFrame.describe
Failed example:
    df.describe(exclude=[np.number])
Expected:
           categorical object
    count            3      3
    unique           3      3
    top              f      c
    freq             1      1
Got:
           categorical object
    count            3      3
    unique           3      3
    top              f      a
    freq             1      1

Did you see this at all? This likely is an issue in the method itself, and not the docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i do see this error but its flaky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, it's probably some kind of non-stable sorting inside the describe method, and nothing wrong with the docstring. It may be best to just include the docstring, and open a new issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strange thing is that just doing

pd.DataFrame({"A": pd.Categorical(['d', 'e', 'f']), "B": ['a', 'b', 'c'], 'C': [1, 2, 3]}).describe(exclude=['number'])

seems deterministic.

@@ -7305,9 +7317,9 @@ def describe(self, percentiles=None, include=None, exclude=None):
Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
>>> df = pd.DataFrame({ 'categorical': pd.Categorical(['d','e','f']),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 formatting here. No space after the {, spaces after the , in the Categorical.

@TomAugspurger
Copy link
Contributor

Only concern is #20222 (comment), which isn't an issue with the docstring. LGTM otherwise.

nehiljain and others added 3 commits March 21, 2018 16:11
…ame_describe

* upstream/master: (158 commits)
  Add link to "Craft Minimal Bug Report" blogpost (pandas-dev#20431)
  BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399)
  BUG: ExtensionArray.fillna for scalar values (pandas-dev#20412)
  DOC" update the Pandas core window rolling count docstring" (pandas-dev#20264)
  DOC: update the pandas.DataFrame.plot.hist docstring (pandas-dev#20155)
  DOC: Only use ~ in class links to hide prefixes. (pandas-dev#20402)
  Bug: Allow np.timedelta64 objects to index TimedeltaIndex (pandas-dev#20408)
  DOC: add disallowing of Series construction of len-1 list with index to whatsnew (pandas-dev#20392)
  MAINT: Remove weird pd file
  DOC: update the Index.isin docstring (pandas-dev#20249)
  BUG: Handle all-NA blocks in concat (pandas-dev#20382)
  DOC: update the pandas.core.resample.Resampler.fillna docstring (pandas-dev#20379)
  BUG: Don't raise exceptions splitting a blank string (pandas-dev#20067)
  DOC: update the pandas.DataFrame.cummax docstring (pandas-dev#20336)
  DOC: update the pandas.core.window.x.mean docstring (pandas-dev#20265)
  DOC: update the api.types.is_number docstring (pandas-dev#20196)
  Fix linter (pandas-dev#20389)
  DOC: Improved the docstring of pandas.Series.dt.to_pytimedelta (pandas-dev#20142)
  DOC: update the pandas.Series.dt.is_month_end docstring (pandas-dev#20181)
  DOC: update the window.Rolling.min docstring (pandas-dev#20263)
  ...
@mroeschke mroeschke added this to the 0.24.0 milestone Jul 7, 2018
@jreback jreback merged commit e2800fa into pandas-dev:master Jul 8, 2018
@jreback
Copy link
Contributor

jreback commented Jul 8, 2018

thanks @nehiljain and @mroeschke for the fixup!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants