
Commit b8972bb

jankatins authored and jreback committed
Categorical: Thanks for Jan Schulz for much of the work on Categoricals
Doc: Add Release notes for pandas-dev#7217
1 parent 30233fd commit b8972bb

17 files changed (+451, -287 lines changed)

doc/source/api.rst

+11-5
Original file line number | Diff line number | Diff line change
@@ -485,18 +485,26 @@ and has the following usable methods and properties (all available as
485485
:toctree: generated/
486486

487487
Categorical
488+
Categorical.from_codes
488489
Categorical.levels
489490
Categorical.ordered
490491
Categorical.reorder_levels
491492
Categorical.remove_unused_levels
492493
Categorical.min
493494
Categorical.max
494495
Categorical.mode
496+
Categorical.describe
497+
498+
``np.asarray(categorical)`` works by implementing the array interface. Be aware that this converts
499+
the Categorical back to a numpy array, so levels and order information are not preserved!
500+
501+
.. autosummary::
502+
:toctree: generated/
503+
504+
Categorical.__array__
495505

496506
To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
497-
are also introduced. Apart from these methods, ``np.asarray(categorical)`` works by implementing the
498-
array interface (`Categorical.__array__()`). Be aware, that this converts the
499-
Categorical back to a numpy array, so levels and order information is not preserved!
507+
are also introduced.
500508

501509
.. autosummary::
502510
:toctree: generated/
@@ -507,13 +515,11 @@ Categorical back to a numpy array, so levels and order information is not preser
507515
Categorical.dtype
508516
Categorical.ndim
509517
Categorical.sort
510-
Categorical.describe
511518
Categorical.equals
512519
Categorical.unique
513520
Categorical.order
514521
Categorical.argsort
515522
Categorical.fillna
516-
Categorical.__array__
517523

518524

519525
Plotting
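The note added to api.rst in this hunk says that ``np.asarray(categorical)`` drops level and order information. A minimal sketch of that behaviour follows. It assumes a current pandas release, where the ``levels`` keyword shown in this diff is spelled ``categories``; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: current pandas, where ``levels`` is spelled ``categories``.
cat = pd.Categorical(["a", "b", "a"], categories=["b", "a"], ordered=True)

# The conversion goes through Categorical.__array__ and yields a plain ndarray.
arr = np.asarray(cat)

print(type(arr))                   # a plain numpy.ndarray
print(arr.tolist())                # the values survive the conversion
print(hasattr(arr, "categories"))  # the category metadata does not
```

To keep the category information, keep working with the ``Categorical`` (or a ``Series`` wrapping it) and convert only at the boundary where a plain array is required.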

doc/source/categorical.rst

+90-35
@@ -27,8 +27,8 @@ Categorical
2727
`Categorical` data in `Series` and `DataFrame` is new.
2828

2929

30-
This is a short introduction to pandas `Categorical` type, including a short comparison with R's
31-
`factor`.
30+
This is an introduction to the pandas :class:`pandas.Categorical` type, including a short comparison
31+
with R's `factor`.
3232

3333
`Categoricals` are a pandas data type, which correspond to categorical variables in
3434
statistics: a variable, which can take on only a limited, and usually fixed,
@@ -108,7 +108,7 @@ By using some special functions:
108108
creation time. Use `levels` to change the levels after creation time.
109109

110110
To get back to the original Series or `numpy` array, use ``Series.astype(original_dtype)`` or
111-
``Categorical.get_values()``:
111+
``np.asarray(categorical)``:
112112

113113
.. ipython:: python
114114
@@ -118,7 +118,33 @@ To get back to the original Series or `numpy` array, use ``Series.astype(origina
118118
s2
119119
s3 = s2.astype('string')
120120
s3
121-
s2.cat.get_values()
121+
np.asarray(s2.cat)
122+
123+
If you already have `codes` and `levels`, you can use the :func:`~pandas.Categorical.from_codes`
124+
constructor to save the factorize step during normal constructor mode:
125+
126+
.. ipython:: python
127+
128+
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
129+
pd.Categorical.from_codes(splitter, levels=["train", "test"])
130+
131+
Description
132+
-----------
133+
134+
Using ``.describe()`` on a ``Categorical(...)`` or a ``Series(Categorical(...))`` will show
135+
different output.
136+
137+
138+
As part of a `DataFrame` or as a `Series`, output similar to that for a `Series` of type ``string`` is
139+
shown. Calling ``Categorical.describe()`` will show the frequencies for each level, with NA for
140+
unused levels.
141+
142+
.. ipython:: python
143+
144+
cat = pd.Categorical(["a","c","c",np.nan], levels=["b","a","c",np.nan] )
145+
df = pd.DataFrame({"cat":cat, "s":["a","c","c",np.nan]})
146+
df.describe()
147+
cat.describe()
122148
123149
Working with levels
124150
-------------------
@@ -153,7 +179,8 @@ It's also possible to pass in the levels in a specific order:
153179
154180
.. note::
155181

156-
Passing in a `levels` argument implies ``ordered=True``.
182+
Passing in a `levels` argument implies ``ordered=True``. You can of course override that by
183+
passing in an explicit ``ordered=False``.
157184

158185
Any value omitted in the levels argument will be replaced by `np.nan`:
159186

@@ -178,8 +205,7 @@ Renaming levels is done by assigning new values to the ``Category.levels`` or
178205
179206
.. note::
180207

181-
I contrast to R's `factor` function, a `Categorical` can have levels of other types than
182-
string.
208+
In contrast to R's `factor`, a `Categorical` can have levels of other types than string.
183209

184210
Levels must be unique or a `ValueError` is raised:
185211

@@ -190,14 +216,16 @@ Levels must be unique or a `ValueError` is raised:
190216
except ValueError as e:
191217
print("ValueError: " + str(e))
192218
193-
Appending a level can be done by assigning a levels list longer than the current levels:
219+
Appending levels can be done by assigning a levels list longer than the current levels:
194220

195221
.. ipython:: python
196222
197223
s.cat.levels = [1,2,3,4]
198224
s.cat.levels
199225
s
200226
227+
.. note::
228+
Adding levels in other positions can be done with ``.reorder_levels(<levels_including_new>)``.
201229

202230
Removing a level is also possible, but only the last level(s) can be removed by assigning a
203231
shorter list than current levels. Values which are omitted are replaced by `np.nan`.
@@ -236,8 +264,8 @@ Ordered or not...
236264
-----------------
237265

238266
If a `Categoricals` is ordered (``cat.ordered == True``), then the order of the levels has a
239-
meaning and certain operations are possible. If the the categorical is unordered,
240-
a `TypeError` is raised.
267+
meaning and certain operations are possible. If the categorical is unordered, a `TypeError` is
268+
raised.
241269

242270
.. ipython:: python
243271
@@ -268,7 +296,8 @@ This is even true for strings and numeric data:
268296
print(s.min(), s.max())
269297
270298
Reordering the levels is possible via the ``Categorical.reorder_levels(new_levels)`` or
271-
``Series.cat.reorder_levels(new_levels)`` methods:
299+
``Series.cat.reorder_levels(new_levels)`` methods. All old levels must be included in the new
300+
levels.
272301

273302
.. ipython:: python
274303
@@ -287,6 +316,15 @@ Reordering the levels is possible via the ``Categorical.reorder_levels(new_level
287316
way values are sorted is different afterwards, but not that individual values in the
288317
`Series` are changed.
289318

319+
You can also add new levels with :func:`Categorical.reorder_levels`, as long as you include all
320+
old levels:
321+
322+
.. ipython:: python
323+
324+
s3 = pd.Series(pd.Categorical(["a","b","d"]))
325+
s3.cat.reorder_levels(["a","b","c","d"])
326+
s3
327+
290328
291329
Operations
292330
----------
@@ -317,8 +355,8 @@ The mode:
317355
.. note::
318356
319357
Numeric operations like ``+``, ``-``, ``*``, ``/`` and operations based on them (e.g.
320-
``Categorical.median()``, which would need to compute the mean between two values if the
321-
length of an array is even) do not work and raise a `TypeError`.
358+
``.median()``, which would need to compute the mean between two values if the length of an
359+
array is even) do not work and raise a `TypeError`.
322360
323361
`Series` methods like `Series.value_counts()` will use all levels, even if some levels are not
324362
present in the data:
@@ -353,7 +391,7 @@ Pivot tables:
353391
Data munging
354392
------------
355393
356-
The optimized pandas data access methods ``.loc``, ``.iloc`` ``ix`` ``.at``, and``.iat``,
394+
The optimized pandas data access methods ``.loc``, ``.iloc``, ``.ix``, ``.at``, and ``.iat``
357395
work as normal, the only difference is the return type (for getting) and
358396
that only values already in the levels can be assigned.
359397
@@ -393,7 +431,7 @@ of length "1".
393431
df.at["h","cats"] # returns a string
394432
395433
.. note::
396-
Note that this is a difference to R's `factor` function, where ``factor(c(1,2,3))[1]``
434+
This is a difference to R's `factor` function, where ``factor(c(1,2,3))[1]``
397435
returns a single value `factor`.
398436
399437
To get a single value `Series` of type ``category`` pass in a single value list:
@@ -455,7 +493,9 @@ but the levels of these `Categoricals` need to be the same:
455493
cat = pd.Categorical(["a","b"], levels=["a","b"])
456494
vals = [1,2]
457495
df = pd.DataFrame({"cats":cat, "vals":vals})
458-
pd.concat([df,df])
496+
res = pd.concat([df,df])
497+
res
498+
res.dtypes
459499
460500
df_different = df.copy()
461501
df_different["cats"].cat.levels = ["a","b","c"]
@@ -501,27 +541,34 @@ store does not yet work.
501541
502542
503543
Writing to a csv file will convert the data, effectively removing any information about the
504-
`Categorical` (`levels` and ordering). So if you read back the csv file you have to convert the
505-
relevant columns back to `category` and assign the right `levels` and level ordering.
544+
`Categorical` (levels and ordering). So if you read back the csv file you have to convert the
545+
relevant columns back to `category` and assign the right levels and level ordering.
506546
507547
.. ipython:: python
508548
:suppress:
509549
510550
from pandas.compat import StringIO
511-
csv_file = StringIO
551+
csv_file = StringIO()
512552
513553
.. ipython:: python
514554
515-
s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'c'], levels=['a','b','c','d']))
555+
s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))
556+
# rename the levels
557+
s.cat.levels = ["very good", "good", "bad"]
558+
# add new levels at the end
559+
s.cat.levels = list(s.cat.levels) + ["medium", "very bad"]
560+
# reorder the levels
561+
s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
516562
df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
517563
df.to_csv(csv_file)
518564
df2 = pd.read_csv(csv_file)
519-
df2.dtype
565+
df2.dtypes
520566
df2["vals"]
521567
# Redo the category
522568
df2["vals"] = df2["vals"].astype("category")
523-
df2["vals"].cat.levels = ['a','b','c','d']
524-
df2.dtype
569+
df2["vals"].cat.levels = list(df2["vals"].cat.levels) + ["medium", "very bad"]
570+
df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
571+
df2.dtypes
525572
df2["vals"]
526573
527574
@@ -576,8 +623,8 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
576623
dtype == np.str_
577624
np.str_ == dtype
578625
579-
Using ``numpy`` functions on a `Series` of type ``category`` should not work as `Categoricals`
580-
are not numeric data (even in the case that levels is numeric).
626+
Using `numpy` functions on a `Series` of type ``category`` should not work as `Categoricals`
627+
are not numeric data (even in the case that ``.levels`` is numeric).
581628
582629
.. ipython:: python
583630
@@ -612,36 +659,40 @@ means that changes to the `Series` will in most cases change the original `Categ
612659
Use ``copy=True`` to prevent such a behaviour:
613660
614661
.. ipython:: python
662+
615663
cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
616664
s = pd.Series(cat, name="cat", copy=True)
617665
cat
618666
s.iloc[0:2] = 10
619667
cat
620668
621669
.. note::
622-
This also happens in some cases when you supply a `numpy` array: using an int array
623-
(e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, but using a string
624-
array (e.g. ``np.array(["a","b","c","a"])``) will not.
670+
This also happens in some cases when you supply a `numpy` array instead of a `Categorical`:
671+
using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, but using
672+
a string array (e.g. ``np.array(["a","b","c","a"])``) will not.
625673
626674
627675
Danger of confusion
628676
~~~~~~~~~~~~~~~~~~~
629677
630-
Both `Series` and `Categorical` have a method ``.reorder_levels()`` . For Series of type
631-
``category`` this means that there is some danger to confuse both methods.
678+
Both `Series` and `Categorical` have a method ``.reorder_levels()``, but they do different things. For
679+
Series of type ``category`` this means that there is some danger of confusing the two methods.
632680
633681
.. ipython:: python
634682
635683
s = pd.Series(pd.Categorical([1,2,3,4]))
684+
print(s.cat.levels)
636685
# wrong and raises an error:
637686
try:
638687
s.reorder_levels([4,3,2,1])
639688
except Exception as e:
640689
print("Exception: " + str(e))
641690
# right
642-
print(s.cat.levels)
643-
print([4,3,2,1])
644691
s.cat.reorder_levels([4,3,2,1])
692+
print(s.cat.levels)
693+
694+
See also the API documentation for :func:`pandas.Series.reorder_levels` and
695+
:func:`pandas.Categorical.reorder_levels`.
645696
646697
Old style constructor usage
647698
~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -665,8 +716,8 @@ In the default case (``compat=False``) the first argument is interpreted as valu
665716
666717
.. warning::
667718
Using Categorical with precomputed level_codes and levels is deprecated and a `FutureWarning`
668-
is raised. Please change your code to use one of the proper constructor modes instead of
669-
adding ``compat=False``.
719+
is raised. Please change your code to use the :func:`~pandas.Categorical.from_codes`
720+
constructor instead of adding ``compat=False``.
670721
671722
No categorical index
672723
~~~~~~~~~~~~~~~~~~~~
@@ -682,9 +733,13 @@ ordering of the levels:
682733
values = [4,2,3,1]
683734
df = pd.DataFrame({"strings":strings, "values":values}, index=cats)
684735
df.index
685-
# This should sort by levels but doesn't!
736+
# This should sort by levels but does not as there is no CategoricalIndex!
686737
df.sort_index()
687738
739+
.. note::
740+
This could change if a `CategoricalIndex` is implemented (see
741+
https://github.com/pydata/pandas/issues/7629)
742+
688743
dtype in apply
689744
~~~~~~~~~~~~~~
690745
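The ``from_codes`` constructor documented in this file's diff can be sketched as below. This is an illustration under assumptions, not the code from the commit: it targets a current pandas release, where the ``levels`` keyword is spelled ``categories``, and it uses fixed codes instead of the random ``splitter`` array so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Integer codes index into the list of categories; -1 would mark a missing value.
codes = np.array([0, 1, 1, 0, 1])

# Assumption: current pandas, where the ``levels`` keyword is spelled ``categories``.
cat = pd.Categorical.from_codes(codes, categories=["train", "test"])

print(cat.tolist())        # the values looked up from the codes
print(cat.codes.tolist())  # the codes are stored as given
```

Because the codes are supplied directly, the constructor does not need to factorize the values first, which is the saving the documentation text refers to.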

doc/source/v0.15.0.txt

+36-5
@@ -30,11 +30,42 @@ users upgrade to this version.
3030
API changes
3131
~~~~~~~~~~~
3232

33-
34-
35-
36-
37-
33+
- `pandas.core.group_agg` and `pandas.core.factor_agg` were removed. As an alternative, construct
34+
a dataframe and use `df.groupby(<group>).agg(<func>)`.
35+
36+
- Supplying "codes/labels and levels" to the `pandas.Categorical` constructor is deprecated and does
37+
not work without supplying ``compat=True``. The default mode now uses "values and levels".
38+
Please change your code to use the ``Categorical.from_codes(...)`` constructor.
39+
40+
- The `pandas.Categorical.labels` attribute was renamed to `pandas.Categorical.codes` and is read
41+
only. If you want to manipulate the `Categorical`, please use one of the
42+
:ref:`API methods on Categoricals<api.categorical>`.
43+
44+
45+
46+
Categoricals in Series/DataFrame
47+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
48+
49+
:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new
50+
methods for manipulating it.
51+
52+
.. ipython:: python
53+
import pandas as pd
54+
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
55+
# convert the raw grades to a categorical
56+
df["grade"] = pd.Categorical(df["raw_grade"])
57+
# Alternative: df["grade"] = df["raw_grade"].astype("category")
58+
df["grade"]
59+
# Rename the levels
60+
df["grade"].cat.levels = ["very good", "good", "very bad"]
61+
# Reorder the levels and simultaneously add the missing levels
62+
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
63+
df["grade"]
64+
df.sort("grade")
65+
df.groupby("grade").size()
66+
67+
See the :ref:`Categorical introduction<categorical>` and the
68+
:ref:`API documentation<api.categorical>`.
3869

3970

4071
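The release-note example above (convert, rename, reorder, sort, group) can be approximated on a current pandas release with the sketch below. The method names ``rename_categories``, ``set_categories``, ``sort_values`` and the ``observed`` keyword are assumptions taken from current pandas, not from this commit, which uses ``.cat.levels``, ``.cat.reorder_levels`` and ``df.sort``.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical column.
df["grade"] = df["raw_grade"].astype("category")

# Rename the three existing categories (a, b, e) in order.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in the same step.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"], ordered=True)

sorted_df = df.sort_values("grade")                 # sorts by category order
sizes = df.groupby("grade", observed=False).size()  # keeps empty categories
print(sorted_df)
print(sizes)
```

Sorting follows the category order rather than the alphabetical order of the labels, and the group sizes include the unused categories ``bad`` and ``medium`` with a count of zero.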
