Commit 0f83679

Merge pull request #8413 from JanSchulz/CategoricalFixups3
Categorical doc fixups
2 parents 170cf12 + 5fc1daa commit 0f83679

5 files changed: +97 -38 lines changed

doc/source/10min.rst (+24 -7)

@@ -640,27 +640,44 @@ Categoricals
 ------------

 Since version 0.15, pandas can include categorical data in a ``DataFrame``. For full docs, see the
-:ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>` .
+:ref:`categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.

 .. ipython:: python

 df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

-# convert the raw grades to a categorical
-df["grade"] = pd.Categorical(df["raw_grade"])
+Convert the raw grades to a categorical data type.

-# Alternative: df["grade"] = df["raw_grade"].astype("category")
+.. ipython:: python
+
+df["grade"] = df["raw_grade"].astype("category")
 df["grade"]

-# Rename the categories inplace
+Rename the categories to more meaningful names (assigning to ``Series.cat.categories`` is inplace!)
+
+.. ipython:: python
+
 df["grade"].cat.categories = ["very good", "good", "very bad"]

-# Reorder the categories and simultaneously add the missing categories
+Reorder the categories and simultaneously add the missing categories (methods under ``Series
+.cat`` return a new ``Series`` per default).
+
+.. ipython:: python
+
 df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
 df["grade"]
+
+Sorting is per order in the categories, not lexical order.
+
+.. ipython:: python
+
 df.sort("grade")
-df.groupby("grade").size()

+Grouping by a categorical column shows also empty categories.
+
+.. ipython:: python
+
+df.groupby("grade").size()


 Plotting
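
Reviewer note: the walkthrough in this hunk can be sketched as a single script for a recent pandas release. This is not part of the commit. `rename_categories`, `sort_values` and the `observed=False` argument are substitutions assumed here, because later pandas versions no longer accept the `cat.categories` assignment or `df.sort` calls shown in the diff.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories (assumed modern replacement for assigning to
# Series.cat.categories); methods under .cat return a new Series.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in the same call.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting follows the category order, not lexical order.
sorted_df = df.sort_values("grade")

# Grouping by a categorical column also reports the empty categories.
counts = df.groupby("grade", observed=False).size()

print(sorted_df)
print(counts)
```

The empty categories "bad" and "medium" appear in `counts` with a size of zero, which is the behaviour the new prose in the hunk describes.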

doc/source/categorical.rst (+2)

@@ -611,6 +611,8 @@ available ("missing value") or `np.nan` is a valid category.
 pd.isnull(s)
 s.fillna("a")

+.. _categorical.rfactor:
+
 Differences to R's `factor`
 ---------------------------

doc/source/comparison_with_r.rst (+39 -14)

@@ -6,7 +6,7 @@

 import pandas as pd
 import numpy as np
-options.display.max_rows=15
+pd.options.display.max_rows=15

 Comparison with R / R libraries
 *******************************
@@ -51,7 +51,7 @@ Selecting multiple columns by name in ``pandas`` is straightforward

 .. ipython:: python

-df = DataFrame(np.random.randn(10, 3), columns=list('abc'))
+df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))
 df[['a', 'c']]
 df.loc[:, ['a', 'c']]

@@ -63,7 +63,7 @@ with a combination of the ``iloc`` indexer attribute and ``numpy.r_``.
 named = list('abcdefg')
 n = 30
 columns = named + np.arange(len(named), n).tolist()
-df = DataFrame(np.random.randn(n, n), columns=columns)
+df = pd.DataFrame(np.random.randn(n, n), columns=columns)

 df.iloc[:, np.r_[:10, 24:30]]

@@ -88,8 +88,7 @@ function.

 .. ipython:: python

-from pandas import DataFrame
-df = DataFrame({
+df = pd.DataFrame({
 'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
 'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
@@ -166,7 +165,7 @@ In ``pandas`` we may use :meth:`~pandas.pivot_table` method to handle this:
 import random
 import string

-baseball = DataFrame({
+baseball = pd.DataFrame({
 'team': ["team %d" % (x+1) for x in range(5)]*5,
 'player': random.sample(list(string.ascii_lowercase),25),
 'batting avg': np.random.uniform(.200, .400, 25)
@@ -197,7 +196,7 @@ index/slice as well as standard boolean indexing:

 .. ipython:: python

-df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
+df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
 df.query('a <= b')
 df[df.a <= df.b]
 df.loc[df.a <= df.b]
@@ -225,7 +224,7 @@ In ``pandas`` the equivalent expression, using the

 .. ipython:: python

-df = DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
+df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
 df.eval('a + b')
 df.a + df.b # same as the previous expression

@@ -283,7 +282,7 @@ In ``pandas`` the equivalent expression, using the

 .. ipython:: python

-df = DataFrame({
+df = pd.DataFrame({
 'x': np.random.uniform(1., 168., 120),
 'y': np.random.uniform(7., 334., 120),
 'z': np.random.uniform(1.7, 20.7, 120),
@@ -317,7 +316,7 @@ In Python, since ``a`` is a list, you can simply use list comprehension.
 .. ipython:: python

 a = np.array(list(range(1,24))+[np.NAN]).reshape(2,3,4)
-DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])
+pd.DataFrame([tuple(list(x)+[val]) for x, val in np.ndenumerate(a)])

 |meltlist|_
 ~~~~~~~~~~~~
@@ -336,7 +335,7 @@ In Python, this list would be a list of tuples, so
 .. ipython:: python

 a = list(enumerate(list(range(1,5))+[np.NAN]))
-DataFrame(a)
+pd.DataFrame(a)

 For more details and examples see :ref:`the Into to Data Structures
 documentation <basics.dataframe.from_items>`.
@@ -361,7 +360,7 @@ In Python, the :meth:`~pandas.melt` method is the R equivalent:

 .. ipython:: python

-cheese = DataFrame({'first' : ['John', 'Mary'],
+cheese = pd.DataFrame({'first' : ['John', 'Mary'],
 'last' : ['Doe', 'Bo'],
 'height' : [5.5, 6.0],
 'weight' : [130, 150]})
@@ -394,7 +393,7 @@ In Python the best way is to make use of :meth:`~pandas.pivot_table`:

 .. ipython:: python

-df = DataFrame({
+df = pd.DataFrame({
 'x': np.random.uniform(1., 168., 12),
 'y': np.random.uniform(7., 334., 12),
 'z': np.random.uniform(1.7, 20.7, 12),
@@ -426,7 +425,7 @@ using :meth:`~pandas.pivot_table`:

 .. ipython:: python

-df = DataFrame({
+df = pd.DataFrame({
 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
 'Animal2', 'Animal3'],
 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
@@ -444,6 +443,30 @@ The second approach is to use the :meth:`~pandas.DataFrame.groupby` method:
 For more details and examples see :ref:`the reshaping documentation
 <reshaping.pivot>` or :ref:`the groupby documentation<groupby.split>`.

+|factor|_
+~~~~~~~~
+
+.. versionadded:: 0.15
+
+pandas has a data type for categorical data.
+
+.. code-block:: r
+
+cut(c(1,2,3,4,5,6), 3)
+factor(c(1,2,3,2,2,3))
+
+In pandas this is accomplished with ``pd.cut`` and ``astype("category")``:
+
+.. ipython:: python
+
+pd.cut(pd.Series([1,2,3,4,5,6]), 3)
+pd.Series([1,2,3,2,2,3]).astype("category")
+
+For more details and examples see :ref:`categorical introduction <categorical>` and the
+:ref:`API documentation <api.categorical>`. There is also a documentation regarding the
+:ref:`differences to R's factor <categorical.rfactor>`.
+
+
 .. |c| replace:: ``c``
 .. _c: http://stat.ethz.ch/R-manual/R-patched/library/base/html/c.html

@@ -477,3 +500,5 @@ For more details and examples see :ref:`the reshaping documentation
 .. |cast| replace:: ``cast``
 .. cast: http://www.inside-r.org/packages/cran/reshape2/docs/cast

+.. |factor| replace:: ``factor``
+.. _factor: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
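
Reviewer note: the new `factor` section pairs R's `cut` and `factor` with `pd.cut` and `astype("category")`. A minimal sketch of what the two pandas calls return is given below. The variable names are illustrative only, and the integer codes are inspected through the `.cat` accessor rather than through any R-style helper.

```python
import pandas as pd

# Equal-width binning of six values into three bins, as in the added example.
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

# Counterpart of R's factor(): the unique values become the categories.
factor_like = pd.Series([1, 2, 3, 2, 2, 3]).astype("category")

# Both results are Series of dtype "category"; .cat exposes the codes
# (the bin or category index of each element) and the categories themselves.
print(binned.dtype, binned.cat.codes.tolist())
print(factor_like.dtype, list(factor_like.cat.categories))
```

Because the input to `pd.cut` is a `Series`, the result is a `Series` of type category rather than a bare `Categorical`, which matches the return-type description in the `tile.py` docstring changes of this commit.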

doc/source/v0.15.0.txt (+3 -6)

@@ -540,21 +540,18 @@ Categoricals in Series/DataFrame
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 :class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new
-methods to manipulate. Thanks to Jan Schultz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
+methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
 :issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`,
 :issue:`8075`, :issue:`8076`, :issue:`8143`).

-For full docs, see the :ref:`Categorical introduction <categorical>` and the
+For full docs, see the :ref:`categorical introduction <categorical>` and the
 :ref:`API documentation <api.categorical>`.

 .. ipython:: python

 df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

-# convert the raw grades to a categorical
-df["grade"] = pd.Categorical(df["raw_grade"])
-
-# Alternative: df["grade"] = df["raw_grade"].astype("category")
+df["grade"] = df["raw_grade"].astype("category")
 df["grade"]

 # Rename the categories

pandas/tools/tile.py (+29 -11)

@@ -34,7 +34,8 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
 right == True (the default), then the bins [1,2,3,4] indicate
 (1,2], (2,3], (3,4].
 labels : array or boolean, default None
-Labels to use for bins, or False to return integer bin labels.
+Used as labels for the resulting bins. Must be of the same length as the resulting
+bins. If False, return only integer indicators of the bins.
 retbins : bool, optional
 Whether to return the bins or not. Can be useful if bins is given
 as a scalar.
@@ -47,7 +48,8 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
 -------
 out : Categorical or Series or array of integers if labels is False
 The return type (Categorical or Series) depends on the input: a Series of type category if
-input is a Series else Categorical.
+input is a Series else Categorical. Bins are represented as categories when categorical
+data is returned.
 bins : ndarray of floats
 Returned only if `retbins` is True.

@@ -63,12 +65,15 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,

 Examples
 --------
->>> cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
-(array([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
-(6.533, 9.7], (0.191, 3.367]], dtype=object),
-array([ 0.1905 , 3.36666667, 6.53333333, 9.7 ]))
->>> cut(np.ones(5), 4, labels=False)
-array([2, 2, 2, 2, 2])
+>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
+([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533], (6.533, 9.7], (0.191, 3.367]]
+Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
+array([ 0.1905 , 3.36666667, 6.53333333, 9.7 ]))
+>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, labels=["good","medium","bad"])
+[good, good, good, medium, bad, good]
+Categories (3, object): [good < medium < bad]
+>>> pd.cut(np.ones(5), 4, labels=False)
+array([1, 1, 1, 1, 1], dtype=int64)
 """
 # NOTE: this binning code is changed a bit from histogram for var(x) == 0
 if not np.iterable(bins):
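
Reviewer note: the three forms of the `labels` argument described in the revised `cut` docstring can be compared side by side. The sketch below reuses the input from the docstring examples. The variable names are illustrative and not part of the commit, and the exact repr of the interval categories differs between pandas versions, so only labels, codes and edges are printed.

```python
import numpy as np
import pandas as pd

values = np.array([0.2, 1.4, 2.5, 6.2, 9.7, 2.1])

# labels=None (default): bins become interval categories;
# retbins=True additionally returns the computed bin edges.
binned, edges = pd.cut(values, 3, retbins=True)

# labels given as a list: must have one entry per resulting bin.
labelled = pd.cut(values, 3, labels=["good", "medium", "bad"])

# labels=False: only the integer indicator of each value's bin is returned.
codes = pd.cut(values, 3, labels=False)

print(len(binned.categories), edges)
print(list(labelled))
print(codes.tolist())
```

With three equal-width bins over this range, only 6.2 and 9.7 fall outside the first bin, which is why the labelled result in the docstring reads good, good, good, medium, bad, good.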
@@ -126,7 +131,8 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
 Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
 array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
 labels : array or boolean, default None
-Labels to use for bin edges, or False to return integer bin labels
+Used as labels for the resulting bins. Must be of the same length as the resulting
+bins. If False, return only integer indicators of the bins.
 retbins : bool, optional
 Whether to return the bins or not. Can be useful if bins is given
 as a scalar.
@@ -135,15 +141,27 @@ def qcut(x, q, labels=None, retbins=False, precision=3):

 Returns
 -------
-cat : Categorical or Series
-Returns a Series of type category if input is a Series else Categorical.
+out : Categorical or Series or array of integers if labels is False
+The return type (Categorical or Series) depends on the input: a Series of type category if
+input is a Series else Categorical. Bins are represented as categories when categorical
+data is returned.
+bins : ndarray of floats
+Returned only if `retbins` is True.

 Notes
 -----
 Out of bounds values will be NA in the resulting Categorical object

 Examples
 --------
+>>> pd.qcut(range(5), 4)
+[[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
+Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]
+>>> pd.qcut(range(5), 3, labels=["good","medium","bad"])
+[good, good, medium, bad, bad]
+Categories (3, object): [good < medium < bad]
+>>> pd.qcut(range(5), 4, labels=False)
+array([0, 0, 1, 2, 3], dtype=int64)
 """
 if com.is_integer(q):
 quantiles = np.linspace(0, 1, q + 1)
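
Reviewer note: the return-type rule added to the `qcut` docstring (a `Series` of type category for `Series` input, a `Categorical` otherwise, integers when `labels=False`) can be checked with a short sketch. `np.arange(5)` is used here in place of the docstring's `range(5)` as a conservative assumption about accepted input types. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = np.arange(5)

# Array input: the result is a Categorical whose categories are the bins.
as_categorical = pd.qcut(data, 4)

# Series input: the result is a Series of dtype "category".
as_series = pd.qcut(pd.Series(data), 4)

# Custom labels need one entry per quantile bin.
labelled = pd.qcut(data, 3, labels=["good", "medium", "bad"])

# labels=False returns only the integer bin indicators.
codes = pd.qcut(data, 4, labels=False)

print(type(as_categorical).__name__, type(as_series).__name__)
print(list(labelled))
print(codes.tolist())
```

The first quantile bin is closed on both sides, so 0 and 1 share bin 0 in the quartile case, as in the docstring output `[0, 0, 1, 2, 3]`.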
