
Commit 843aa60

Author: TomAugspurger (committed)
ENH/API: accept percentiles in describe
1 parent d505d23 · commit 843aa60


9 files changed: +337, -196 lines changed


doc/source/basics.rst (+9, -1)

@@ -454,6 +454,7 @@ non-null values:
    series[10:20] = 5
    series.nunique()
 
+.. _basics.describe:
 
 Summarizing data: describe
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -471,7 +472,13 @@ course):
    frame.ix[::2] = np.nan
    frame.describe()
 
-.. _basics.describe:
+You can select specific percentiles to include in the output:
+
+.. ipython:: python
+
+   series.describe(percentiles=[.05, .25, .75, .95])
+
+By default, the median is always included.
 
 For a non-numerical Series object, `describe` will give a simple summary of the
 number of unique values and most frequently occurring values:
@@ -482,6 +489,7 @@ number of unique values and most frequently occurring values:
    s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
    s.describe()
 
+
 There also is a utility function, ``value_range`` which takes a DataFrame and
 returns a series with the minimum/maximum values in the DataFrame.
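Editor's note: a minimal sketch of what the new documentation example produces. The `series` here is a stand-in for the randomly generated Series built earlier in basics.rst, so the numbers are illustrative; the point is that the requested percentiles are merged with the always-included median and appear as percentage labels in the result's index:

    import numpy as np
    import pandas as pd

    series = pd.Series(np.random.randn(1000))
    series[::2] = np.nan

    summary = series.describe(percentiles=[.05, .25, .75, .95])
    # The index holds the fixed statistics plus the requested percentiles,
    # with the 50% (median) row inserted even though it was not asked for:
    # count, mean, std, min, 5%, 25%, 50%, 75%, 95%, max
    print(summary)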

doc/source/release.rst (+7)

@@ -204,6 +204,8 @@ API Changes
 - Produce :class:`~pandas.io.parsers.ParserWarning` on fallback to python
   parser when no options are ignored (:issue:`6607`)
 - Added ``factorize`` functions to ``Index`` and ``Series`` to get indexer and unique values (:issue:`7090`)
+- :meth:`DataFrame.describe` on a DataFrame with a mix of Timestamp and string like objects
+  returns a different Index (:issue:`7088`). Previously the index was unintentionally sorted.
 
 Deprecations
 ~~~~~~~~~~~~
@@ -250,6 +252,10 @@ Deprecations
 - The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated.
   MySQL will be further supported with SQLAlchemy engines (:issue:`6900`).
 
+- The `percentile_width` keyword argument in :meth:`~DataFrame.describe` has been deprecated.
+  Use the `percentiles` keyword instead, which takes a list of percentiles to display. The
+  default output is unchanged.
+
 Prior Version Deprecations/Changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -339,6 +345,7 @@ Improvements to existing features
 - ``boxplot`` now supports ``layout`` keyword (:issue:`6769`)
 - Regression in the display of a MultiIndexed Series with ``display.max_rows`` is less than the
   length of the series (:issue:`7101`)
+- :meth:`~DataFrame.describe` now accepts an array of percentiles to include in the summary statistics (:issue:`4196`)
 
 .. _release.bug_fixes-0.14.0:
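Editor's note: a minimal sketch of the migration implied by the deprecation note above; the frame `df` is made up for illustration. The old `percentile_width=50` call and the new default `percentiles` call describe the same interval, so the default output is unchanged:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': np.random.randn(100),
                       'b': np.random.randn(100)})

    # Deprecated spelling: a 50% interval means lower=25%, upper=75%.
    # As of this commit it emits a FutureWarning suggesting the equivalent list.
    # df.describe(percentile_width=50)

    # New spelling; [.25, .5, .75] is also the default, so the output is unchanged.
    df.describe(percentiles=[.25, .5, .75])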

doc/source/v0.14.0.txt (+7)

@@ -196,6 +196,8 @@ API changes
 - accept ``TextFileReader`` in ``concat``, which was affecting a common user idiom (:issue:`6583`), this was a regression
   from 0.13.1
 - Added ``factorize`` functions to ``Index`` and ``Series`` to get indexer and unique values (:issue:`7090`)
+- ``describe`` on a DataFrame with a mix of Timestamp and string like objects returns a different Index (:issue:`7088`).
+  Previously the index was unintentionally sorted.
 
 .. _whatsnew_0140.display:
 
@@ -509,6 +511,10 @@ Deprecations
 - The support for the 'mysql' flavor when using DBAPI connection objects has been deprecated.
   MySQL will be further supported with SQLAlchemy engines (:issue:`6900`).
 
+- The `percentile_width` keyword argument in :meth:`~DataFrame.describe` has been deprecated.
+  Use the `percentiles` keyword instead, which takes a list of percentiles to display. The
+  default output is unchanged.
+
 .. _whatsnew_0140.enhancements:
 
 Enhancements
@@ -575,6 +581,7 @@ Enhancements
 - ``CustomBuisnessMonthBegin`` and ``CustomBusinessMonthEnd`` are now available (:issue:`6866`)
 - :meth:`Series.quantile` and :meth:`DataFrame.quantile` now accept an array of
   quantiles.
+- :meth:`~DataFrame.describe` now accepts an array of percentiles to include in the summary statistics (:issue:`4196`)
 - ``pivot_table`` can now accept ``Grouper`` by ``index`` and ``columns`` keywords (:issue:`6913`)
 
 .. ipython:: python
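Editor's note: to make the Index change above concrete, here is an illustrative (made-up) frame mixing timestamps and strings. Under the new code path the summary rows come back in the fixed order count/unique/first/last/top/freq rather than being alphabetically sorted:

    import pandas as pd

    df = pd.DataFrame({'when': pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-02']),
                       'who': ['a', 'b', 'b']})

    # Index is now ['count', 'unique', 'first', 'last', 'top', 'freq'];
    # previously the labels were unintentionally sorted.
    df.describe()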

pandas/core/frame.py (-48)

@@ -3805,54 +3805,6 @@ def corrwith(self, other, axis=0, drop=False):
 
         return correl
 
-    def describe(self, percentile_width=50):
-        """
-        Generate various summary statistics of each column, excluding
-        NaN values. These include: count, mean, std, min, max, and
-        lower%/50%/upper% percentiles
-
-        Parameters
-        ----------
-        percentile_width : float, optional
-            width of the desired uncertainty interval, default is 50,
-            which corresponds to lower=25, upper=75
-
-        Returns
-        -------
-        DataFrame of summary statistics
-        """
-        numdata = self._get_numeric_data()
-
-        if len(numdata.columns) == 0:
-            return DataFrame(dict((k, v.describe())
-                                  for k, v in compat.iteritems(self)),
-                             columns=self.columns)
-
-        lb = .5 * (1. - percentile_width / 100.)
-        ub = 1. - lb
-
-        def pretty_name(x):
-            x *= 100
-            if x == int(x):
-                return '%.0f%%' % x
-            else:
-                return '%.1f%%' % x
-
-        destat_columns = ['count', 'mean', 'std', 'min',
-                          pretty_name(lb), '50%', pretty_name(ub),
-                          'max']
-
-        destat = []
-
-        for i in range(len(numdata.columns)):
-            series = numdata.iloc[:, i]
-            destat.append([series.count(), series.mean(), series.std(),
-                           series.min(), series.quantile(lb), series.median(),
-                           series.quantile(ub), series.max()])
-
-        return self._constructor(lmap(list, zip(*destat)),
-                                 index=destat_columns, columns=numdata.columns)
-
     #----------------------------------------------------------------------
     # ndarray-like stats methods

pandas/core/generic.py (+149, -1)

@@ -19,7 +19,7 @@
 import pandas.core.common as com
 import pandas.core.datetools as datetools
 from pandas import compat, _np_version_under1p7
-from pandas.compat import map, zip, lrange, string_types, isidentifier
+from pandas.compat import map, zip, lrange, string_types, isidentifier, lmap
 from pandas.core.common import (isnull, notnull, is_list_like,
                                 _values_from_object, _maybe_promote, _maybe_box_datetimelike,
                                 ABCSeries, SettingWithCopyError, SettingWithCopyWarning)
@@ -3478,6 +3478,154 @@ def _convert_timedeltas(x):
 
         return np.abs(self)
 
+    _shared_docs['describe'] = """
+        Generate various summary statistics, excluding NaN values.
+
+        Parameters
+        ----------
+        percentile_width : float, deprecated
+            The ``percentile_width`` argument will be removed in a future
+            version. Use ``percentiles`` instead.
+            width of the desired uncertainty interval, default is 50,
+            which corresponds to lower=25, upper=75
+        percentiles : array-like, optional
+            The percentiles to include in the output. Should all
+            be in the interval [0, 1]. By default `percentiles` is
+            [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
+
+        Returns
+        -------
+        summary: %(klass)s of summary statistics
+
+        Notes
+        -----
+        For numeric dtypes the index includes: count, mean, std, min,
+        max, and lower, 50, and upper percentiles.
+
+        If self is of object dtypes (e.g. timestamps or strings), the output
+        will include the count, unique, most common, and frequency of the
+        most common. Timestamps also include the first and last items.
+
+        If multiple values have the highest count, then the
+        `count` and `most common` pair will be arbitrarily chosen from
+        among those with the highest count.
+        """
+
+    @Appender(_shared_docs['describe'] % _shared_doc_kwargs)
+    def describe(self, percentile_width=None, percentiles=None):
+        if self.ndim >= 3:
+            msg = "describe is not implemented on on Panel or PanelND objects."
+            raise NotImplementedError(msg)
+
+        if percentile_width is not None and percentiles is not None:
+            msg = "Cannot specify both 'percentile_width' and 'percentiles.'"
+            raise ValueError(msg)
+        if percentiles is not None:
+            # get them all to be in [0, 1]
+            percentiles = np.asarray(percentiles)
+            if (percentiles > 1).any():
+                percentiles = percentiles / 100.0
+                msg = ("percentiles should all be in the interval [0, 1]. "
+                       "Try {0} instead.")
+                raise ValueError(msg.format(list(percentiles)))
+        else:
+            # only warn if they change the default
+            if percentile_width is not None:
+                do_warn = True
+            else:
+                do_warn = False
+            percentile_width = percentile_width or 50
+            lb = .5 * (1. - percentile_width / 100.)
+            ub = 1. - lb
+            percentiles = np.array([lb, 0.5, ub])
+            if do_warn:
+                msg = ("The `percentile_width` keyword is deprecated. "
+                       "Use percentiles={0} instead".format(list(percentiles)))
+                warnings.warn(msg, FutureWarning)
+
+        # median should always be included
+        if (percentiles != 0.5).all():  # median isn't included
+            lh = percentiles[percentiles < .5]
+            uh = percentiles[percentiles > .5]
+            percentiles = np.hstack([lh, 0.5, uh])
+
+        # dtypes: numeric only, numeric mixed, objects only
+        data = self._get_numeric_data()
+        if self.ndim > 1:
+            if len(data._info_axis) == 0:
+                is_object = True
+            else:
+                is_object = False
+        else:
+            is_object = not self._is_numeric_mixed_type
+
+        def pretty_name(x):
+            x *= 100
+            if x == int(x):
+                return '%.0f%%' % x
+            else:
+                return '%.1f%%' % x
+
+        def describe_numeric_1d(series, percentiles):
+            return ([series.count(), series.mean(), series.std(),
+                     series.min()] +
+                    [series.quantile(x) for x in percentiles] +
+                    [series.max()])
+
+        def describe_categorical_1d(data):
+            if data.dtype == object:
+                names = ['count', 'unique']
+                objcounts = data.value_counts()
+                result = [data.count(), len(objcounts)]
+                if result[1] > 0:
+                    names += ['top', 'freq']
+                    top, freq = objcounts.index[0], objcounts.iloc[0]
+                    result += [top, freq]
+
+            elif issubclass(data.dtype.type, np.datetime64):
+                names = ['count', 'unique']
+                asint = data.dropna().values.view('i8')
+                objcounts = compat.Counter(asint)
+                result = [data.count(), len(objcounts)]
+                if result[1] > 0:
+                    top, freq = objcounts.most_common(1)[0]
+                    names += ['first', 'last', 'top', 'freq']
+                    result += [lib.Timestamp(asint.min()),
+                               lib.Timestamp(asint.max()),
+                               lib.Timestamp(top), freq]
+
+            return pd.Series(result, index=names)
+
+        if is_object:
+            if data.ndim == 1:
+                return describe_categorical_1d(self)
+            else:
+                result = pd.DataFrame(dict((k, describe_categorical_1d(v))
+                                           for k, v in compat.iteritems(self)),
+                                      columns=self._info_axis,
+                                      index=['count', 'unique', 'first', 'last',
+                                             'top', 'freq'])
+                # just objects, no datime
+                if pd.isnull(result.loc['first']).all():
+                    result = result.drop(['first', 'last'], axis=0)
+                return result
+        else:
+            stat_index = (['count', 'mean', 'std', 'min'] +
+                          [pretty_name(x) for x in percentiles] +
+                          ['max'])
+            if data.ndim == 1:
+                return pd.Series(describe_numeric_1d(data, percentiles),
+                                 index=stat_index)
+            else:
+                destat = []
+                for i in range(len(data._info_axis)):  # BAD
+                    series = data.iloc[:, i]
+                    destat.append(describe_numeric_1d(series, percentiles))
+
+                return self._constructor(lmap(list, zip(*destat)),
+                                         index=stat_index,
+                                         columns=data._info_axis)
+
     _shared_docs['pct_change'] = """
         Percent change over given number of periods.
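Editor's note: a short usage sketch of the paths in the new shared implementation (data made up; behaviour as implemented in this commit). Percentiles given in [0, 1] are accepted as-is, the median is injected if missing, and values above 1 are rejected with a hint to rescale:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(100, dtype=float))

    # 0.5 is inserted automatically, so the result gains a 50% row
    # alongside the requested 10% and 90% rows.
    s.describe(percentiles=[.1, .9])

    # Values outside [0, 1] raise, suggesting the rescaled list.
    try:
        s.describe(percentiles=[10, 90])
    except ValueError as err:
        print(err)  # "percentiles should all be in the interval [0, 1]. Try [0.1, 0.9] instead."

    # The deprecated keyword still works but warns with the equivalent percentiles.
    s.describe(percentile_width=80)  # FutureWarning: use percentiles=[0.1, 0.5, 0.9] instead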

pandas/core/series.py (-61)

@@ -1267,67 +1267,6 @@ def multi(values, qs):
     def ptp(self, axis=None, out=None):
         return _values_from_object(self).ptp(axis, out)
 
-    def describe(self, percentile_width=50):
-        """
-        Generate various summary statistics of Series, excluding NaN
-        values. These include: count, mean, std, min, max, and
-        lower%/50%/upper% percentiles
-
-        Parameters
-        ----------
-        percentile_width : float, optional
-            width of the desired uncertainty interval, default is 50,
-            which corresponds to lower=25, upper=75
-
-        Returns
-        -------
-        desc : Series
-        """
-        from pandas.compat import Counter
-
-        if self.dtype == object:
-            names = ['count', 'unique']
-            objcounts = Counter(self.dropna().values)
-            data = [self.count(), len(objcounts)]
-            if data[1] > 0:
-                names += ['top', 'freq']
-                top, freq = objcounts.most_common(1)[0]
-                data += [top, freq]
-
-        elif issubclass(self.dtype.type, np.datetime64):
-            names = ['count', 'unique']
-            asint = self.dropna().values.view('i8')
-            objcounts = Counter(asint)
-            data = [self.count(), len(objcounts)]
-            if data[1] > 0:
-                top, freq = objcounts.most_common(1)[0]
-                names += ['first', 'last', 'top', 'freq']
-                data += [lib.Timestamp(asint.min()),
-                         lib.Timestamp(asint.max()),
-                         lib.Timestamp(top), freq]
-        else:
-
-            lb = .5 * (1. - percentile_width / 100.)
-            ub = 1. - lb
-
-            def pretty_name(x):
-                x *= 100
-                if x == int(x):
-                    return '%.0f%%' % x
-                else:
-                    return '%.1f%%' % x
-
-            names = ['count']
-            data = [self.count()]
-            names += ['mean', 'std', 'min', pretty_name(lb), '50%',
-                      pretty_name(ub), 'max']
-            data += [self.mean(), self.std(), self.min(),
-                     self.quantile(lb), self.median(), self.quantile(ub),
-                     self.max()]
-
-        return self._constructor(data, index=names).__finalize__(self)
-
     def corr(self, other, method='pearson',
              min_periods=None):
         """
