Commit 6d3803d

bthyreau authored and jreback committed
More flexible describe() via include/exclude type filtering
This enhances describe()'s output via new include/exclude list arguments, letting the user specify the dtypes to be summarized. This provides a simple way to override the automatic type filtering done by default; it is also convenient with groupby(). Also includes documentation and changelog entries.
1 parent 72a051c commit 6d3803d
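
The behaviour described in the commit message can be sketched roughly as follows. This is only an illustration: the frame and its column names are made up, it uses the `pd.` namespace rather than the bare `DataFrame` seen in the diff, and it assumes a pandas version that includes this change.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one string column, two numeric columns.
df = pd.DataFrame({
    "catA": ["foo", "foo", "bar"] * 8,
    "numC": np.arange(24),
    "numD": np.arange(24.0) + 0.5,
})

# No arguments: numeric columns are present, so only they are summarized.
default = df.describe()

# Explicit type filters, forwarded to select_dtypes.
numeric = df.describe(include=["number"])
non_numeric = df.describe(exclude=["number"])

# 'all' keeps every input column; rows that do not apply hold NaN.
everything = df.describe(include="all")

print(everything)
```

The `exclude=["number"]` call is used here instead of naming the string dtype directly, so the sketch does not depend on how a given pandas version labels string columns.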

4 files changed: +221 -59 lines changed

doc/source/basics.rst

+18
@@ -490,6 +490,24 @@ number of unique values and most frequently occurring values:
    s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
    s.describe()

+Note that on a mixed-type DataFrame object, `describe` will restrict the summary to
+include only numerical columns or, if none are, only categorical columns:
+
+.. ipython:: python
+
+   frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
+   frame.describe()
+
+This behaviour can be controlled by providing a list of types as ``include``/``exclude``
+arguments. The special value ``all`` can also be used:
+
+.. ipython:: python
+
+   frame.describe(include=['object'])
+   frame.describe(include=['number'])
+   frame.describe(include='all')
+
+That feature relies on :ref:`select_dtypes <basics.selectdtypes>`. Refer to there for details about accepted inputs.

 There also is a utility function, ``value_range`` which takes a DataFrame and
 returns a series with the minimum/maximum values in the DataFrame.
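
The default fallback documented in this hunk (numeric columns if any exist, otherwise the remaining columns) can be checked with a small sketch. It reuses the `frame` from the documentation example but builds it through `pd.DataFrame`; the variable names `mixed_default` and `strings_only` are introduced here for illustration only.

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# A numeric column exists, so the string column 'a' is left out.
mixed_default = frame.describe()

# With no numeric column available, the string column is summarized instead.
strings_only = frame[["a"]].describe()

print(mixed_default.columns.tolist())
print(strings_only.index.tolist())
```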

doc/source/v0.15.0.txt

+18
@@ -57,6 +57,24 @@ users upgrade to this version.

 API changes
 ~~~~~~~~~~~
+- :func:`describe` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the ``include``/``exclude`` arguments (:issue:`8164`).
+
+  .. ipython:: python
+
+     df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                     'catB': ['a', 'b', 'c', 'd'] * 6,
+                     'numC': np.arange(24),
+                     'numD': np.arange(24.) + .5})
+     df.describe(include=["object"])
+     df.describe(include=["number", "object"], exclude=["float"])
+
+  Requesting all columns is possible with the shorthand 'all'
+
+  .. ipython:: python
+
+     df.describe(include='all')
+
+  Without those arguments, 'describe` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the :ref:`docs <basics.describe>`

 - Passing multiple levels to :meth:`~pandas.DataFrame.stack()` will now work when multiple level
   numbers are passed (:issue:`7660`), and will raise a ``ValueError`` when the
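
The changelog example that combines ``include`` and ``exclude`` can be sketched as below. The frame is a trimmed version of the one in the changelog, and `dtype=object` is spelled out on the string column as an assumption, so that the ``"object"`` selector matches even on pandas versions that may give strings a dedicated dtype by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # Explicit object dtype (assumption: guards against newer string dtypes).
    "catA": pd.Series(["foo", "foo", "bar"] * 8, dtype=object),
    "numC": np.arange(24),
    "numD": np.arange(24.0) + 0.5,
})

# Keep numeric and object columns, then drop the float ones:
# the integer column numC survives, the float column numD does not.
mixed = df.describe(include=["number", "object"], exclude=["float"])

print(mixed)
```

Because the result mixes an object column with a numeric one, its index is the union of both sets of statistics, and entries that do not apply to a column are NaN.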

pandas/core/generic.py

+73-52
@@ -3654,27 +3654,51 @@ def abs(self):
             The percentiles to include in the output. Should all
             be in the interval [0, 1]. By default `percentiles` is
             [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
+        include, exclude : list-like, 'all', or None (default)
+            Specify the form of the returned result. Either:
+
+            - None to both (default). The result will include only numeric-typed
+              columns or, if none are, only categorical columns.
+            - A list of dtypes or strings to be included/excluded.
+              To select all numeric types use numpy numpy.number. To select
+              categorical objects use type object. See also the select_dtypes
+              documentation. eg. df.describe(include=['O'])
+            - If include is the string 'all', the output column-set will
+              match the input one.

         Returns
         -------
         summary: %(klass)s of summary statistics

         Notes
         -----
-        For numeric dtypes the index includes: count, mean, std, min,
+        The output DataFrame index depends on the requested dtypes:
+
+        For numeric dtypes, it will include: count, mean, std, min,
         max, and lower, 50, and upper percentiles.

-        If self is of object dtypes (e.g. timestamps or strings), the output
+        For object dtypes (e.g. timestamps or strings), the index
         will include the count, unique, most common, and frequency of the
         most common. Timestamps also include the first and last items.

+        For mixed dtypes, the index will be the union of the corresponding
+        output types. Non-applicable entries will be filled with NaN.
+        Note that mixed-dtype outputs can only be returned from mixed-dtype
+        inputs and appropriate use of the include/exclude arguments.
+
         If multiple values have the highest count, then the
         `count` and `most common` pair will be arbitrarily chosen from
         among those with the highest count.
+
+        The include, exclude arguments are ignored for Series.
+
+        See also
+        --------
+        DataFrame.select_dtypes
         """

     @Appender(_shared_docs['describe'] % _shared_doc_kwargs)
-    def describe(self, percentile_width=None, percentiles=None):
+    def describe(self, percentile_width=None, percentiles=None, include=None, exclude=None ):
         if self.ndim >= 3:
             msg = "describe is not implemented on on Panel or PanelND objects."
             raise NotImplementedError(msg)
@@ -3711,16 +3735,6 @@ def describe(self, percentile_width=None, percentiles=None):
             uh = percentiles[percentiles > .5]
             percentiles = np.hstack([lh, 0.5, uh])

-        # dtypes: numeric only, numeric mixed, objects only
-        data = self._get_numeric_data()
-        if self.ndim > 1:
-            if len(data._info_axis) == 0:
-                is_object = True
-            else:
-                is_object = False
-        else:
-            is_object = not self._is_numeric_mixed_type
-
         def pretty_name(x):
             x *= 100
             if x == int(x):
@@ -3729,10 +3743,12 @@ def pretty_name(x):
                 return '%.1f%%' % x

         def describe_numeric_1d(series, percentiles):
-            return ([series.count(), series.mean(), series.std(),
-                     series.min()] +
-                    [series.quantile(x) for x in percentiles] +
-                    [series.max()])
+            stat_index = (['count', 'mean', 'std', 'min'] +
+                          [pretty_name(x) for x in percentiles] + ['max'])
+            d = ([series.count(), series.mean(), series.std(), series.min()] +
+                 [series.quantile(x) for x in percentiles] + [series.max()])
+            return pd.Series(d, index=stat_index, name=series.name)
+

         def describe_categorical_1d(data):
             names = ['count', 'unique']
@@ -3745,44 +3761,49 @@ def describe_categorical_1d(data):
                     names += ['top', 'freq']
                     result += [top, freq]

-                elif issubclass(data.dtype.type, np.datetime64):
+                elif com.is_datetime64_dtype(data):
                     asint = data.dropna().values.view('i8')
-                    names += ['first', 'last', 'top', 'freq']
-                    result += [lib.Timestamp(asint.min()),
-                               lib.Timestamp(asint.max()),
-                               lib.Timestamp(top), freq]
-
-            return pd.Series(result, index=names)
-
-        if is_object:
-            if data.ndim == 1:
-                return describe_categorical_1d(self)
+                    names += ['top', 'freq', 'first', 'last']
+                    result += [lib.Timestamp(top), freq,
+                               lib.Timestamp(asint.min()),
+                               lib.Timestamp(asint.max())]
+
+            return pd.Series(result, index=names, name=data.name)
+
+        def describe_1d(data, percentiles):
+            if com.is_numeric_dtype(data):
+                return describe_numeric_1d(data, percentiles)
+            elif com.is_timedelta64_dtype(data):
+                return describe_numeric_1d(data, percentiles)
             else:
-                result = pd.DataFrame(dict((k, describe_categorical_1d(v))
-                                           for k, v in compat.iteritems(self)),
-                                      columns=self._info_axis,
-                                      index=['count', 'unique', 'first', 'last',
-                                             'top', 'freq'])
-                # just objects, no datime
-                if pd.isnull(result.loc['first']).all():
-                    result = result.drop(['first', 'last'], axis=0)
-                return result
-        else:
-            stat_index = (['count', 'mean', 'std', 'min'] +
-                          [pretty_name(x) for x in percentiles] +
-                          ['max'])
-            if data.ndim == 1:
-                return pd.Series(describe_numeric_1d(data, percentiles),
-                                 index=stat_index)
+                return describe_categorical_1d(data)
+
+        if self.ndim == 1:
+            return describe_1d(self, percentiles)
+        elif (include is None) and (exclude is None):
+            if len(self._get_numeric_data()._info_axis) > 0:
+                # when some numerics are found, keep only numerics
+                data = self.select_dtypes(include=[np.number, np.bool])
             else:
-                destat = []
-                for i in range(len(data._info_axis)): # BAD
-                    series = data.iloc[:, i]
-                    destat.append(describe_numeric_1d(series, percentiles))
-
-                return self._constructor(lmap(list, zip(*destat)),
-                                         index=stat_index,
-                                         columns=data._info_axis)
+                data = self
+        elif include == 'all':
+            if exclude != None:
+                msg = "exclude must be None when include is 'all'"
+                raise ValueError(msg)
+            data = self
+        else:
+            data = self.select_dtypes(include=include, exclude=exclude)
+
+        ldesc = [describe_1d(s, percentiles) for _, s in data.iteritems()]
+        # set a convenient order for rows
+        names = []
+        ldesc_indexes = sorted([x.index for x in ldesc], key=len)
+        for idxnames in ldesc_indexes:
+            for name in idxnames:
+                if name not in names:
+                    names.append(name)
+        d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
+        return d

     _shared_docs['pct_change'] = """
         Percent change over given number of periods.
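
The row-ordering step near the end of the new `describe` body (the loop that builds `names`) can be read in isolation. The sketch below restates it in plain Python, without pandas; the function name `merge_row_labels` and the sample label lists are introduced here for illustration and do not appear in the commit.

```python
def merge_row_labels(indexes):
    """Union of per-column row labels, shortest label list first.

    Mirrors the loop in the diff: label lists are sorted by length (the
    sort is stable), then labels are appended in first-seen order.
    """
    names = []
    for labels in sorted(indexes, key=len):
        for name in labels:
            if name not in names:
                names.append(name)
    return names


# Row labels each kind of column would contribute, per the diff above.
numeric = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]
categorical = ["count", "unique", "top", "freq"]
datetime_like = ["count", "unique", "top", "freq", "first", "last"]

print(merge_row_labels([numeric, categorical, datetime_like]))
```

Sorting by length puts the short categorical labels first, so a mixed result leads with count, unique, top and freq, followed by the datetime-only and then the numeric-only rows.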

pandas/tests/test_generic.py

+112-7
@@ -1005,18 +1005,17 @@ def test_describe_objects(self):
         df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D')})
         df.loc[4] = pd.Timestamp('2010-01-04')
         result = df.describe()
-        expected = DataFrame({"C1": [5, 4, pd.Timestamp('2010-01-01'),
-                                     pd.Timestamp('2010-01-04'),
-                                     pd.Timestamp('2010-01-04'), 2]},
-                             index=['count', 'unique', 'first', 'last', 'top',
-                                    'freq'])
+        expected = DataFrame({"C1": [5, 4, pd.Timestamp('2010-01-04'), 2,
+                                     pd.Timestamp('2010-01-01'),
+                                     pd.Timestamp('2010-01-04')]},
+                             index=['count', 'unique', 'top', 'freq',
+                                    'first', 'last'])
         assert_frame_equal(result, expected)

         # mix time and str
         df['C2'] = ['a', 'a', 'b', 'c', 'a']
         result = df.describe()
-        # when mix of dateimte / obj the index gets reordered.
-        expected['C2'] = [5, 3, np.nan, np.nan, 'a', 3]
+        expected['C2'] = [5, 3, 'a', 3, np.nan, np.nan]
         assert_frame_equal(result, expected)

         # just str
@@ -1036,6 +1035,112 @@ def test_describe_objects(self):
         assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe())
         assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe())

+    def test_describe_typefiltering(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24, dtype='int64'),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+
+        descN = df.describe()
+        expected_cols = ['numC', 'numD',]
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(descN, expected)
+
+        desc = df.describe(include=['number'])
+        assert_frame_equal(desc, descN)
+        desc = df.describe(exclude=['object', 'datetime'])
+        assert_frame_equal(desc, descN)
+        desc = df.describe(include=['float'])
+        assert_frame_equal(desc, descN.drop('numC',1))
+
+        descC = df.describe(include=['O'])
+        expected_cols = ['catA', 'catB']
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(descC, expected)
+
+        descD = df.describe(include=['datetime'])
+        assert_series_equal( descD.ts, df.ts.describe())
+
+        desc = df.describe(include=['object','number', 'datetime'])
+        assert_frame_equal(desc.loc[:,["numC","numD"]].dropna(), descN)
+        assert_frame_equal(desc.loc[:,["catA","catB"]].dropna(), descC)
+        descDs = descD.sort_index() # the index order change for mixed-types
+        assert_frame_equal(desc.loc[:,"ts":].dropna().sort_index(), descDs)
+
+        desc = df.loc[:,'catA':'catB'].describe(include='all')
+        assert_frame_equal(desc, descC)
+        desc = df.loc[:,'numC':'numD'].describe(include='all')
+        assert_frame_equal(desc, descN)
+
+        desc = df.describe(percentiles = [], include='all')
+        cnt = Series(data=[4,4,6,6,6], index=['catA','catB','numC','numD','ts'])
+        assert_series_equal( desc.count(), cnt)
+        self.assertTrue('count' in desc.index)
+        self.assertTrue('unique' in desc.index)
+        self.assertTrue('50%' in desc.index)
+        self.assertTrue('first' in desc.index)
+
+        desc = df.drop("ts", 1).describe(percentiles = [], include='all')
+        assert_series_equal( desc.count(), cnt.drop("ts"))
+        self.assertTrue('first' not in desc.index)
+        desc = df.drop(["numC","numD"], 1).describe(percentiles = [], include='all')
+        assert_series_equal( desc.count(), cnt.drop(["numC","numD"]))
+        self.assertTrue('50%' not in desc.index)
+
+    def test_describe_typefiltering_category_bool(self):
+        df = DataFrame({'A_cat': pd.Categorical(['foo', 'foo', 'bar'] * 8),
+                        'B_str': ['a', 'b', 'c', 'd'] * 6,
+                        'C_bool': [True] * 12 + [False] * 12,
+                        'D_num': np.arange(24.) + .5,
+                        'E_ts': tm.makeTimeSeries()[:24].index})
+
+        # bool is considered numeric in describe, although not an np.number
+        desc = df.describe()
+        expected_cols = ['C_bool', 'D_num']
+        expected = DataFrame(dict((k, df[k].describe())
+                                  for k in expected_cols),
+                             columns=expected_cols)
+        assert_frame_equal(desc, expected)
+
+        desc = df.describe(include=["category"])
+        self.assertTrue(desc.columns.tolist() == ["A_cat"])
+
+        # 'all' includes numpy-dtypes + category
+        desc1 = df.describe(include="all")
+        desc2 = df.describe(include=[np.generic, "category"])
+        assert_frame_equal(desc1, desc2)
+
+    def test_describe_timedelta(self):
+        df = DataFrame({"td": pd.to_timedelta(np.arange(24)%20,"D")})
+        self.assertTrue(df.describe().loc["mean"][0] == pd.to_timedelta("8d4h"))
+
+    def test_describe_typefiltering_dupcol(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+        s = df.describe(include='all').shape[1]
+        df = pd.concat([df, df], axis=1)
+        s2 = df.describe(include='all').shape[1]
+        self.assertTrue(s2 == 2 * s)
+
+    def test_describe_typefiltering_groupby(self):
+        df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+                        'catB': ['a', 'b', 'c', 'd'] * 6,
+                        'numC': np.arange(24),
+                        'numD': np.arange(24.) + .5,
+                        'ts': tm.makeTimeSeries()[:24].index})
+        G = df.groupby('catA')
+        self.assertTrue(G.describe(include=['number']).shape == (16, 2))
+        self.assertTrue(G.describe(include=['number', 'object']).shape == (22, 3))
+        self.assertTrue(G.describe(include='all').shape == (26, 4))
+
     def test_no_order(self):
         tm._skip_if_no_scipy()
         s = Series([0, 1, np.nan, 3])
