DOC: Clarified and expanded describe documentation #14995

Merged
merged 10 commits into from
Jan 2, 2017
225 changes: 191 additions & 34 deletions pandas/core/generic.py
@@ -5201,60 +5201,217 @@ def abs(self):
"""
return np.abs(self)

Contributor

why are you redefining things here???

this is just a very small edit to the _shared_docs['describe']

Member

@palewire Typically we reuse a docstring in several places, e.g. for the Series/DataFrame/Panel definitions; that's the reason for using _shared_docs.

But, @jreback, I was just looking at this specific case: this is the only place where this docstring is used, so I think it is not really necessary to put it in _shared_docs? (Maybe a leftover from when the definitions were in multiple places.)

Contributor

this is used in both Series & DataFrame, so needs to stay as shared docs

Member

But it is only defined here (the function is not redefined in series or dataframe, so the shared docstring is not used anywhere else).

Contributor

oh, ok then.

Contributor Author

I'd be happy to try that correction, though I think it's worth pointing out that this was a pre-existing bug in the describe documentation, not something introduced by this pull request. Could you point me to an example of a similar shared method I could model the fix on?

Contributor

as I said, pretty much any function in series or dataframe that has a shared doc

Member

I am not sure it is worth including describe defs in both Series and DataFrame (which of course would simply pass the args to the super method) just to customize this single word. For docstrings that include more variables to be changed, that would be OK. But IMO in this case it is not worth it.

It's a bit of a problem with how our handling of shared docstrings currently works: it does not work perfectly for all the cases we use it for. But finding a better approach for functions like this (i.e. functions that are only defined in generic, and not in series/frame.py) is a separate, larger issue that can be left for another issue/PR to discuss.

Contributor Author

@jorisvandenbossche, if that's how you feel I can hold off on pursuing that route. Are there other modifications you'd like to see?

Member

@jreback are you OK with this in its current form (so not using the _shared_docs)? I agree that we should try to have accurate docstrings for both Series and DataFrame by making use of our decorator machinery, but in this case the code did not make use of that machinery.
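For readers unfamiliar with the pattern under discussion, here is a minimal sketch of how a shared docstring template with a `%(klass)s` placeholder can be filled in per class. The `appender` decorator and the `ToySeries`/`ToyFrame` classes are hypothetical stand-ins written for illustration; they are not the pandas `Appender` implementation.

```python
# Hypothetical, simplified stand-in for a shared-docstring mechanism.
_shared_docs = {}
_shared_docs['describe'] = """
Generate summary statistics.

Returns
-------
summary : %(klass)s of summary statistics
"""


def appender(text):
    """Return a decorator that appends `text` to a function's docstring."""
    def decorate(func):
        func.__doc__ = (func.__doc__ or "") + text
        return func
    return decorate


class ToySeries:
    # The template is filled in with this class's name at definition time.
    @appender(_shared_docs['describe'] % {'klass': 'Series'})
    def describe(self):
        return None


class ToyFrame:
    @appender(_shared_docs['describe'] % {'klass': 'DataFrame'})
    def describe(self):
        return None
```

The sketch also shows the trade-off raised above: customizing the placeholder requires a separate `describe` definition in each class, even when each one would only forward to a common implementation.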

_shared_docs['describe'] = """
Generate various summary statistics, excluding NaN values.
def describe(self, percentiles=None, include=None, exclude=None):
"""
Generates descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
``NaN`` values.

Member

I would add here the sentence from the notes, something like "Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types.", plus a mention that the output depends on the data type and a reference to the notes for more details.

Contributor Author

Will do.

Parameters
----------
percentiles : array-like, optional
The percentiles to include in the output. Should all
be in the interval [0, 1]. By default `percentiles` is
[.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, 'all', or None (default)
Specify the form of the returned result. Either:

- None to both (default). The result will include only
numeric-typed columns or, if none are, only categorical columns.
- A list of dtypes or strings to be included/excluded.
To select all numeric types use numpy numpy.number. To select
categorical objects use type object. See also the select_dtypes
documentation. eg. df.describe(include=['O'])
- If include is the string 'all', the output column-set will
match the input one.
percentiles : list of numbers, optional
Member

Strictly speaking, the change from array-like to list of numbers is not really a correction, as arrays are also accepted, but I agree that list is clearer to users. Maybe "list-like of numbers"?

Contributor Author

Will do.
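As a quick illustration of the point that both lists and arrays are accepted for `percentiles`, the sketch below passes the same values in both forms. The sample data is made up for this example, and the exact set of rows returned may differ between pandas versions, so only the requested percentile labels are checked.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
s = pd.Series([1, 2, 3, 4, 5])

# The same percentiles supplied as a list and as a NumPy array.
from_list = s.describe(percentiles=[.1, .9])
from_array = s.describe(percentiles=np.array([.1, .9]))

# Both calls label the requested rows '10%' and '90%'.
print(from_list.loc[['10%', '90%']])
print(from_list.equals(from_array))
```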

The percentiles to include in the output. All should
fall between 0 and 1. The default is
``[.25, .5, .75]``, which returns the 25th, 50th, and
75th percentiles.
include : None (default), 'all', or list of dtypes or strings, optional
Member

We typically put the "default None" at the end.

Contributor Author

I will flip them to the bottom.

A white list of data types to include in the result. Ignored
for `Series`. Here are the options:

- None (default). The result will include all numeric columns.
- 'all'. All columns on the input will be included in the output.
Member

" on the input" -> "of the input" ?

Contributor Author

That is my mistake. I will correct it.

- A list of dtypes or strings. Limits the results to the
provided data types.
To limit the result to numeric types submit
``numpy.number``. To limit it instead to categorical
objects submit the ``numpy.object`` data type. Strings
Member

NDFrame -> Series/DataFrame

Contributor Author

Will do.

can also be used in the style of
``select_dtypes`` (e.g. ``df.describe(include=['O'])``)
exclude : None (default) or a list of dtypes or strings, optional
Member

Not a strong opinion here, but I also kind of liked the combined explanation. Deciding which is clearer (less repetition vs. being more explicit) is always a difficult, subjective balance.

Contributor Author

I agree this is a balancing act. I decided to separate them after consulting the source code, where I found that include and exclude actually have slightly different behavior. From what I can tell, all is not an acceptable input for exclude. Separating them also allowed me space to try to write more explicit descriptions that contrast the two keywords.
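The contrast between the two keywords can be sketched as follows. The DataFrame is made-up sample data mirroring the docstring's examples; the sketch deliberately does not pass `'all'` to `exclude`, since the comment above only reports that it does not appear to be accepted.

```python
import numpy as np
import pandas as pd

# Made-up sample data: one numeric column and one string column.
df = pd.DataFrame({'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})

# `include` acts as a whitelist: only numeric columns are summarized.
only_numeric = df.describe(include=[np.number])

# `exclude` acts as a blacklist: numeric columns are dropped,
# leaving the string column.
without_numeric = df.describe(exclude=[np.number])

print(list(only_numeric.columns))
print(list(without_numeric.columns))
```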

A black list of data types to omit from the result. Ignored
for Series. Here are the options:

- None (default). The result will exclude nothing.
- A list of dtypes or strings. Excludes the provided data types
from the result. To exclude numeric types submit
``numpy.number``. To exclude categorical objects submit the data
type ``numpy.object``. Strings can also be used in the style of
``select_dtypes`` (e.g. ``df.describe(exclude=['O'])``)

Returns
-------
summary: %(klass)s of summary statistics
summary: NDFrame of summary statistics

Notes
-----
The output DataFrame index depends on the requested dtypes:

For numeric dtypes, it will include: count, mean, std, min,
max, and lower, 50, and upper percentiles.
Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types.

For object dtypes (e.g. timestamps or strings), the index
will include the count, unique, most common, and frequency of the
most common. Timestamps also include the first and last items.
For numeric data, the result's index will include ``count``,
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and
upper percentiles. By default the lower percentile is ``25`` and the
upper percentile is ``75``. The ``50`` percentile is typically the
same as the median.
Member

"The 50 percentile is typically the same as the median." -> when is this not the case?

Contributor Author

I was thinking of the alternative methods of computing a median when there is an even number of values, which might lead to differing expectations among users. But that's probably unnecessary. I will remove the qualification.
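The even-length case mentioned here can be checked directly. In this small sketch, which uses made-up data and pandas' default settings, the ``50%`` row and `Series.median` agree because both average the two middle values.

```python
import pandas as pd

# An even number of values, so the median falls between 2 and 3.
s = pd.Series([1, 2, 3, 4])

summary = s.describe()

# Both results are 2.5 with pandas' default settings.
print(summary['50%'], s.median())
```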


For mixed dtypes, the index will be the union of the corresponding
output types. Non-applicable entries will be filled with NaN.
Note that mixed-dtype outputs can only be returned from mixed-dtype
inputs and appropriate use of the include/exclude arguments.
For object data (e.g. strings or timestamps), the result's index
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top``
is the most common value. The ``freq`` is the most common value's
frequency. Timestamps also include the ``first`` and ``last`` items.

If multiple values have the highest count, then the
`count` and `most common` pair will be arbitrarily chosen from
If multiple object values have the highest count, then the
``count`` and ``top`` results will be arbitrarily chosen from
among those with the highest count.

The include, exclude arguments are ignored for Series.
For mixed data types provided via a DataFrame, the result will
include a union of attributes of each type.
Member

Maybe mention first that the default is to show only numeric columns, and that if different types are included by using include='all', the index is the union of the attributes of each type.

Contributor Author

Good point. I think that will make things more clear. I will change it.
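The behavior described in this exchange can be sketched as follows, using the same made-up two-column DataFrame as the docstring's examples: the default call summarizes only the numeric column, while `include='all'` returns the union of the row labels and fills non-applicable cells with NaN.

```python
import pandas as pd

# Made-up sample data: one numeric column and one string column.
df = pd.DataFrame({'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})

# Default: only the numeric column is described.
default = df.describe()

# include='all': every column is described and the index is the
# union of the numeric and object statistics.
mixed = df.describe(include='all')

print(list(default.columns))
# 'mean' does not apply to strings and 'unique' does not apply to
# numbers, so both cells are NaN.
print(mixed.loc['mean', 'object'], mixed.loc['unique', 'numeric'])
```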


The `include` and `exclude` parameters can be used to limit
which columns in a DataFrame are analyzed for the output.
The parameters are ignored when analyzing a Series.

Examples
--------
Describing a numeric Series.

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count 4
unique 3
top a
freq 2
dtype: object

Describing a timestamp Series.

>>> import numpy as np
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe()
count 3
unique 2
top 2010-01-01 00:00:00
freq 2
first 2000-01-01 00:00:00
last 2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame(
... [[1, 'a'], [2, 'b'], [3, 'c']],
... columns=['numeric', 'object']
... )
>>> df.describe()
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')
numeric object
count 3.0 3
unique NaN 3
top NaN b
freq NaN 1
mean 2.0 NaN
std 1.0 NaN
min 1.0 NaN
25% 1.5 NaN
50% 2.0 NaN
75% 2.5 NaN
max 3.0 NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[np.object])
object
count 3
unique 3
top b
freq 1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
object
count 3
unique 3
top b
freq 1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[np.object])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0

See Also
--------
DataFrame.count
DataFrame.max
DataFrame.min
DataFrame.mean
DataFrame.std
DataFrame.select_dtypes
"""

@Appender(_shared_docs['describe'] % _shared_doc_kwargs)
def describe(self, percentiles=None, include=None, exclude=None):
if self.ndim >= 3:
msg = "describe is not implemented on Panel or PanelND objects."
raise NotImplementedError(msg)