DOC: Improve the docstring of DataFrame.describe() #20222

Merged
merged 9 commits into from
Jul 8, 2018
33 changes: 18 additions & 15 deletions pandas/core/generic.py
@@ -8348,7 +8348,7 @@ def abs(self):

def describe(self, percentiles=None, include=None, exclude=None):
"""
Generates descriptive statistics that summarize the central tendency,
Generate descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
``NaN`` values.

@@ -8392,7 +8392,18 @@ def describe(self, percentiles=None, include=None, exclude=None):

Returns
-------
summary: Series/DataFrame of summary statistics
Series or DataFrame
Summary statistics of the Series or DataFrame provided.

See Also
--------
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding
columns based on their dtype.

Notes
-----
@@ -8436,6 +8447,7 @@ def describe(self, percentiles=None, include=None, exclude=None):
50% 2.0
75% 2.5
max 3.0
dtype: float64

Describing a categorical ``Series``.

@@ -8466,9 +8478,9 @@ def describe(self, percentiles=None, include=None, exclude=None):
Describing a ``DataFrame``. By default only numeric fields
are returned.

>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
... 'numeric': [1, 2, 3],
... 'categorical': pd.Categorical(['d','e','f'])
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
... 'numeric': [1, 2, 3],
... 'object': ['a', 'b', 'c']
... })
>>> df.describe()
numeric
@@ -8554,7 +8566,7 @@ def describe(self, percentiles=None, include=None, exclude=None):
Excluding object columns from a ``DataFrame`` description.

>>> df.describe(exclude=[np.object])
categorical numeric
categorical numeric
Contributor

When running the validation script, I occasionally get a failure:

Line 210, in pandas.DataFrame.describe
Failed example:
    df.describe(exclude=[np.number])
Expected:
           categorical object
    count            3      3
    unique           3      3
    top              f      c
    freq             1      1
Got:
           categorical object
    count            3      3
    unique           3      3
    top              f      a
    freq             1      1

Did you see this at all? This likely is an issue in the method itself, and not the docstring.

Contributor Author

Yeah, I do see this error, but it's flaky.

Contributor

To be clear, it's probably some kind of non-stable sorting inside the describe method, and nothing wrong with the docstring. It may be best to just include the docstring, and open a new issue.
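
A minimal sketch of why the example can be flaky, assuming the cause is tie-breaking among equally frequent values (the thread suspects this but does not confirm it). When every value occurs once, `count`, `unique` and `freq` are well defined, but any value is a valid `top`. Data with a single most frequent value avoids the ambiguity. The variable names are illustrative only.

```python
import pandas as pd

# Every value occurs exactly once, so all three are tied for "most frequent".
tied = pd.Series(['a', 'b', 'c'])
tied_summary = tied.describe()

# These three statistics do not depend on how ties are broken.
print(tied_summary['count'], tied_summary['unique'], tied_summary['freq'])

# `top` can legitimately be any of the tied values, so a doctest that
# hard-codes one of them may fail intermittently.
print(tied_summary['top'] in {'a', 'b', 'c'})

# Assumption for illustration: give the data a unique mode so that
# `top` is unambiguous regardless of tie-breaking.
clear = pd.Series(['a', 'b', 'b'])
clear_summary = clear.describe()
print(clear_summary['top'], clear_summary['freq'])
```

Using a unique mode in the docstring example is only one possible workaround; it does not address whatever ordering issue may exist inside `describe` itself.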

Contributor

The strange thing is that just doing

pd.DataFrame({"A": pd.Categorical(['d', 'e', 'f']), "B": ['a', 'b', 'c'], 'C': [1, 2, 3]}).describe(exclude=['number'])

seems deterministic.
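
One way to probe this, sketched here as a hypothetical check rather than anything from the PR: repeat the call and record every `top` value seen per column. The `tops_seen` dictionary and the repeat count of 20 are assumptions for illustration. A single result per column within one run would match the observation above, but it would not rule out differences between separate interpreter sessions.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(['d', 'e', 'f']),
    "B": ['a', 'b', 'c'],
    "C": [1, 2, 3],
})

# Hypothetical probe: collect each distinct `top` reported per column.
tops_seen = {"A": set(), "B": set()}
for _ in range(20):
    summary = df.describe(exclude=['number'])
    for column in tops_seen:
        tops_seen[column].add(summary.loc['top', column])

# More than one entry in a set would mean `top` varied between calls.
print(tops_seen)
```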

count 3 3.0
unique 3 NaN
top f NaN
@@ -8566,15 +8578,6 @@ def describe(self, percentiles=None, include=None, exclude=None):
50% NaN 2.0
75% NaN 2.5
max NaN 3.0

See Also
--------
DataFrame.count
DataFrame.max
DataFrame.min
DataFrame.mean
DataFrame.std
DataFrame.select_dtypes
"""
if self.ndim >= 3:
msg = "describe is not implemented on Panel objects."