DOC: Clarified and expanded describe documentation #14995
Changes from 3 commits
1d6aa0e
a445d2a
55cf4ec
0161a57
38015ce
86dd44a
8880a89
dff88bb
d97df49
a61dda1
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5201,60 +5201,209 @@ def abs(self): | |
""" | ||
return np.abs(self) | ||
|
||
_shared_docs['describe'] = """ | ||
Generate various summary statistics, excluding NaN values. | ||
def describe(self, percentiles=None, include=None, exclude=None): | ||
""" | ||
Generates descriptive statistics that summarize the central tendency, | ||
dispersion and shape of a dataset's distribution, excluding | ||
``NaN`` values. | ||
|
||
Review comment: I would add here the sentence from the notes with something like "Analyzes both numeric and object series, as well |
Review comment: Will do. |
||
Parameters | ||
---------- | ||
percentiles : array-like, optional | ||
The percentiles to include in the output. Should all | ||
be in the interval [0, 1]. By default `percentiles` is | ||
[.25, .5, .75], returning the 25th, 50th, and 75th percentiles. | ||
include, exclude : list-like, 'all', or None (default) | ||
Specify the form of the returned result. Either: | ||
|
||
- None to both (default). The result will include only | ||
numeric-typed columns or, if none are, only categorical columns. | ||
- A list of dtypes or strings to be included/excluded. | ||
To select all numeric types use numpy numpy.number. To select | ||
categorical objects use type object. See also the select_dtypes | ||
documentation. eg. df.describe(include=['O']) | ||
- If include is the string 'all', the output column-set will | ||
match the input one. | ||
Analyzes both ``numeric`` and ``object`` series, as well | ||
Review comment: 'numeric' and 'object' do not really refer to actual code or keywords in this case, so I would leave them in 'normal' text (not backtick-quoted). |
Review comment: For DataFrame (line below), you can also leave out the quotes I think (for consistency, although we certainly are not consistent in all places). |
Review comment: Got it. I struggled with how to properly use the single and double backticks. I tried to model my behavior on the recommendations of the numpy HOWTO. I'll take another pass through and try to be more sparing with their use. |
Review comment: We follow the suggestion of the numpy HOWTO to refer to keywords with single backticks, but typically use double backticks for 'code fragments' or other variables/functions apart from keywords. |
Review comment: Got it. I will take another pass to try to tighten things up to meet your standard. |
||
as `DataFrame` column sets of mixed data types. | ||
|
||
Returns | ||
------- | ||
summary: %(klass)s of summary statistics | ||
For ``numeric`` data, the result's index will include ``count``, | ||
Review comment: I would leave these details in the notes section as it was before. The reason is that otherwise it 'takes a long time' before you get to the 'Parameters'. But maybe we could include in the previous sentence something like "Output format depends on the data type, see Notes for more details". |
Review comment: Got it. I'll move it back down. I had decided to move it up after consulting numpy's HOWTO, which says that Notes should be reserved for background information. But if pandas is consistent about having the extended description there instead of at the top of the docstring, that seems a-okay to me. I'll move it down. |
Review comment: Yes, to follow the numpy docstring standard, the first sentence should be shorter. It can then be followed by a longer description. But ideally this longer description is only a couple of lines, and IMO if it is longer, it should go into Notes (as you typically want to show parameters rather prominently as well). |
||
``mean``, ``std``, ``min``, ``max`` as well as lower, ``50`` and | ||
upper percentiles. By default the lower percentile is ``25`` and the | ||
upper percentile is ``75``. The ``50`` percentile is typically the | ||
same as the median. | ||
|
||
Notes | ||
----- | ||
The output DataFrame index depends on the requested dtypes: | ||
For ``object`` data (e.g. strings or timestamps), the result's index | ||
will include ``count``, ``unique``, ``top``, and ``freq``. The ``top`` | ||
is the most common value. The ``freq`` is the most common value's | ||
frequency. Timestamps also include the ``first`` and ``last`` items. | ||
|
||
For numeric dtypes, it will include: count, mean, std, min, | ||
max, and lower, 50, and upper percentiles. | ||
If multiple ``object`` values have the highest count, then the | ||
``count`` and ``top`` results will be arbitrarily chosen from | ||
among those with the highest count. | ||
|
||
Review comment: NDFrame -> Series/DataFrame |
Review comment: Will do. |
||
For object dtypes (e.g. timestamps or strings), the index | ||
will include the count, unique, most common, and frequency of the | ||
most common. Timestamps also include the first and last items. | ||
For mixed data types provided via a `DataFrame`, the result will | ||
include a union of attributes of each type. | ||
|
||
For mixed dtypes, the index will be the union of the corresponding | ||
output types. Non-applicable entries will be filled with NaN. | ||
Note that mixed-dtype outputs can only be returned from mixed-dtype | ||
inputs and appropriate use of the include/exclude arguments. | ||
The `include` and `exclude` parameters can be used to limit | ||
which columns in a `DataFrame` are analyzed for the output. | ||
The parameters are ignored when analyzing a `Series`. | ||
|
||
If multiple values have the highest count, then the | ||
`count` and `most common` pair will be arbitrarily chosen from | ||
among those with the highest count. | ||
Parameters | ||
---------- | ||
percentiles : list of numbers, optional | ||
Review comment: Strictly speaking, the change from array-like to list of numbers is not really a correction, as arrays are also accepted, but I agree list is clearer to users. Maybe "list-like of numbers"? |
Review comment: Will do. |
||
The percentiles to include in the output. All should | ||
fall between 0 and 1. The default is | ||
``[.25, .5, .75]``, which returns the 25th, 50th, and | ||
75th percentiles. | ||
include : None (default), 'all', or list of dtypes or strings, optional | ||
Review comment: We typically put the "default None" at the end. |
Review comment: I will flip them to the bottom. |
||
A white list of data types to include in the result. Ignored | ||
for `Series`. Here are the options: | ||
|
||
- None (default). The result will include all ``numeric`` columns. | ||
- 'all'. All columns on the input will be included in the output. | ||
Review comment: " on the input" -> "of the input" ? |
Review comment: That is my mistake. I will correct it. |
||
- A list of dtypes or strings. Limits the results to the | ||
provided data types. | ||
To limit the result to numeric types submit | ||
``numpy.number``. To limit it instead to categorical | ||
objects submit the data type ``object``. Strings | ||
can also be used in the style of | ||
`select_dtypes` (e.g. df.describe(include=['O'])) | ||
exclude : None (default) or a list of dtypes or strings, optional | |
Review comment: Not a strong opinion here, but I also kind of liked the combined explanation. What is most clear (less repetition vs being more explicit) is always a bit of a difficult/subjective balance. |
Review comment: I agree this is a balancing act. I decided to separate them after consulting the source code, where I found that include and exclude actually have slightly different behavior. From what I can tell, |
||
A black list of data types to omit from the result. Ignored | ||
for `Series`. Here are the options: | ||
|
||
- None (default). The result will exclude nothing. | ||
- A list of dtypes or strings. Excludes the provided data types | ||
from the result. To select numeric types submit | ||
``numpy.number``. To select categorical objects submit the data | |
type ``object``. Strings can also be used in the style of | |
`select_dtypes` (e.g. df.describe(exclude=['O'])) | |
|
||
Returns | ||
------- | ||
summary: NDFrame of summary statistics | ||
|
||
Examples | ||
-------- | ||
Describing a numeric `Series`. | ||
|
||
The include, exclude arguments are ignored for Series. | ||
>>> import pandas as pd | ||
>>> s = pd.Series([1, 2, 3]) | ||
>>> s.describe() | ||
count 3.0 | ||
mean 2.0 | ||
std 1.0 | ||
min 1.0 | ||
25% 1.5 | ||
50% 2.0 | ||
75% 2.5 | ||
max 3.0 | ||
|
||
Describing a categorical `Series`. | ||
|
||
>>> s = pd.Series(['a', 'a', 'b', 'c']) | ||
>>> s.describe() | ||
count 4 | ||
unique 3 | ||
top a | ||
freq 2 | ||
dtype: object | ||
|
||
Describing a timestamp `Series`. | ||
|
||
>>> import numpy as np | ||
>>> s = pd.Series([ | ||
...     np.datetime64("2000-01-01"), | |
...     np.datetime64("2010-01-01"), | |
...     np.datetime64("2010-01-01") | |
... ]) | |
>>> s.describe() | ||
count 3 | ||
unique 2 | ||
top 2010-01-01 00:00:00 | ||
freq 2 | ||
first 2000-01-01 00:00:00 | ||
last 2010-01-01 00:00:00 | ||
dtype: object | ||
|
||
Describing a `DataFrame`. By default only numeric fields are returned. | ||
|
||
>>> df = pd.DataFrame( | ||
...     [[1, 'a'], [2, 'b'], [3, 'c']], | |
...     columns=['numeric', 'object'] | |
... ) | |
>>> df.describe() | ||
numeric | ||
count 3.0 | ||
mean 2.0 | ||
std 1.0 | ||
min 1.0 | ||
25% 1.5 | ||
50% 2.0 | ||
75% 2.5 | ||
max 3.0 | ||
|
||
Describing all columns of a `DataFrame` regardless of data type. | ||
|
||
>>> df.describe(include='all') | ||
numeric object | ||
count 3.0 3 | ||
unique NaN 3 | ||
top NaN b | ||
freq NaN 1 | ||
mean 2.0 NaN | ||
std 1.0 NaN | ||
min 1.0 NaN | ||
25% 1.5 NaN | ||
50% 2.0 NaN | ||
75% 2.5 NaN | ||
max 3.0 NaN | ||
|
||
Describing a column from a `DataFrame` by accessing it as an attribute. | ||
|
||
>>> df.numeric.describe() | ||
count 3.0 | ||
mean 2.0 | ||
std 1.0 | ||
min 1.0 | ||
25% 1.5 | ||
50% 2.0 | ||
75% 2.5 | ||
max 3.0 | ||
Name: numeric, dtype: float64 | ||
|
||
Including only ``numeric`` columns in a `DataFrame` description. | ||
|
||
>>> df.describe(include=[np.number]) | ||
numeric | ||
count 3.0 | ||
mean 2.0 | ||
std 1.0 | ||
min 1.0 | ||
25% 1.5 | ||
50% 2.0 | ||
75% 2.5 | ||
max 3.0 | ||
|
||
Including only ``string`` columns in a `DataFrame` description. | ||
|
||
>>> df.describe(include=[np.object]) | ||
object | ||
count 3 | ||
unique 3 | ||
top b | ||
freq 1 | ||
|
||
Excluding ``numeric`` columns from a `DataFrame` description. | ||
|
||
>>> df.describe(exclude=[np.number]) | ||
object | ||
count 3 | ||
unique 3 | ||
top b | ||
freq 1 | ||
|
||
Excluding ``object`` columns from a `DataFrame` description. | ||
|
||
>>> df.describe(exclude=[np.object]) | ||
numeric | ||
count 3.0 | ||
mean 2.0 | ||
std 1.0 | ||
min 1.0 | ||
25% 1.5 | ||
50% 2.0 | ||
75% 2.5 | ||
max 3.0 | ||
|
||
See Also | ||
-------- | ||
DataFrame.select_dtypes | ||
""" | ||
|
||
@Appender(_shared_docs['describe'] % _shared_doc_kwargs) | ||
def describe(self, percentiles=None, include=None, exclude=None): | ||
if self.ndim >= 3: | ||
msg = "describe is not implemented on Panel or PanelND objects." | ||
raise NotImplementedError(msg) | ||
|
why are you redefining things here??? this is just a very small edit to the `_shared_docs['describe']`
@palewire Typically we reuse docstrings in several places, e.g. for Series/DataFrame/Panel definitions; that's the reason for the use of `_shared_docs`. But, @jreback, I was just looking at this specific case: this is the only place where this docstring is used, so it is actually not really needed to put it in `_shared_docs`, I think? (maybe a leftover from when the definitions were in multiple places)
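For readers unfamiliar with the pattern being discussed, here is a minimal sketch of a shared docstring template filled in per class. It is a simplified stand-in, not the pandas implementation: `shared_docs`, `append_doc`, and `Container` are hypothetical names chosen for illustration, and the real `Appender` decorator may differ in its details.

```python
# Simplified stand-in for a shared-docstring setup. `shared_docs`,
# `append_doc`, and `Container` are hypothetical names, not pandas internals.
shared_docs = {}

shared_docs["describe"] = """
Generate descriptive statistics.

Returns
-------
summary : %(klass)s of summary statistics
"""


def append_doc(text):
    """Return a decorator that appends `text` to a function's docstring."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + text
        return func
    return decorator


class Container:
    # The template is filled in once, when the class body is executed.
    @append_doc(shared_docs["describe"] % {"klass": "Container"})
    def describe(self):
        return "summary"
```

Because the substitution happens where the method is defined, a method defined only once on a base class carries a single `klass` value for every subclass that inherits it, which appears to be the situation this thread is debating.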
this is used in both Series & DataFrame, so needs to stay as shared docs
But it is only defined here (the function is not redefined in series or dataframe, so the shared docstrings is not used anywhere else)
oh, ok then.
I'd be happy to try that correction, though I think it's worth pointing out that was a pre-existing bug in the describe documentation and nothing introduced by this pull request. Could you point me to example of a similar shared method I could model the fix on?
as I said, pretty much any function in series or dataframe that has a shared doc
I am not sure if it is worth including `describe` defs in both Series and DataFrame (which of course would simply pass args to the super method), just for customizing this single word. For docstrings that include more variables to be changed, that would be OK. But IMO in this case it is not worth it.

It's a bit of a problem with how our handling of shared docstrings currently works, as it does not work perfectly for all cases that we use it for. But having a better approach for functions like this (i.e. functions that have only a definition in generic, and not in series/frame.py) is a whole other/larger issue that can be left for another issue/PR to discuss.
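The alternative weighed in this comment can be sketched as follows, assuming a hypothetical base class `Generic` and two hypothetical subclasses. Each subclass redefines `describe` only to carry its own docstring and forwards its arguments to the parent; this illustrates the trade-off and is not code from the pull request.

```python
class Generic:
    """Hypothetical base class holding the single real implementation."""

    def describe(self, percentiles=None):
        """Return summary statistics as a Generic."""
        return {"percentiles": percentiles or [0.25, 0.5, 0.75]}


class SeriesLike(Generic):
    def describe(self, percentiles=None):
        """Return summary statistics as a SeriesLike."""
        # No new behavior: the override exists only for the docstring.
        return super().describe(percentiles=percentiles)


class FrameLike(Generic):
    def describe(self, percentiles=None):
        """Return summary statistics as a FrameLike."""
        # No new behavior: the override exists only for the docstring.
        return super().describe(percentiles=percentiles)
```

The cost is two pass-through methods that must stay in sync with the base signature, in exchange for changing one word in the rendered documentation.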
@jorisvandenbossche, if that's how you feel I can hold off on pursuing that route. Are there other modifications you'd like to see?
@jreback are you OK with this in its current form (so not using the `_shared_docs`)? I agree that we should try to have accurate docstrings for both Series and DataFrame making use of our decorator machinery, but in this case it did not make use of that machinery.
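The docstring in the diff documents the `percentiles` parameter but its Examples section does not exercise it. A short sketch of that usage follows, assuming pandas is installed; the sample values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Arbitrary sample data, used only to illustrate the parameter.
s = pd.Series([1, 2, 3, 4, 5])

# Request the 10th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.9])

# Print the row labels of the resulting summary.
print(list(summary.index))
```

The 50% row is still reported even though it was not requested, so the index contains `10%`, `50%`, and `90%` between `min` and `max`.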