@@ -27,8 +27,8 @@ Categorical
  `Categorical` data in `Series` and `DataFrame` is new.


- This is a short introduction to pandas `Categorical` type, including a short comparison with R's
- `factor`.
+ This is an introduction to pandas :class:`pandas.Categorical` type, including a short comparison
+ with R's `factor`.

  `Categoricals` are a pandas data type, which correspond to categorical variables in
  statistics: a variable, which can take on only a limited, and usually fixed,
@@ -108,7 +108,7 @@ By using some special functions:
  creation time. Use `levels` to change the levels after creation time.

  To get back to the original Series or `numpy` array, use ``Series.astype(original_dtype)`` or
- ``Categorical.get_values()``:
+ ``np.asarray(categorical)``:

  .. ipython:: python

@@ -118,7 +118,33 @@ To get back to the original Series or `numpy` array, use ``Series.astype(origina
      s2
      s3 = s2.astype('string')
      s3
-     s2.cat.get_values()
+     np.asarray(s2.cat)
+
+ If you already have `codes` and `levels`, you can use the :func:`~pandas.Categorical.from_codes`
+ constructor to save the factorize step during normal constructor mode:
+
+ .. ipython:: python
+
+     splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
+     pd.Categorical.from_codes(splitter, levels=["train", "test"])
+
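The idea behind ``from_codes`` can be sketched in plain Python: each integer code is a position in the list of levels, so the levels do not have to be discovered by factorizing the values. This is only an illustration and not the pandas implementation; the helper name ``values_from_codes`` is hypothetical, and treating ``-1`` as a missing value is an assumption of the sketch.

```python
def values_from_codes(codes, levels):
    """Map integer codes to their level labels (hypothetical helper)."""
    values = []
    for code in codes:
        if code == -1:
            # Assumption of this sketch: -1 marks a missing value.
            values.append(None)
        elif 0 <= code < len(levels):
            values.append(levels[code])
        else:
            raise ValueError("code %d is out of range" % code)
    return values


codes = [0, 1, 1, 0, 1]
labels = values_from_codes(codes, ["train", "test"])
print(labels)
```

Because the codes are taken as given, the caller is responsible for making sure they match the levels.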
+ Description
+ -----------
+
+ Using ``.describe()`` on a ``Categorical(...)`` or a ``Series(Categorical(...))`` will show
+ different output.
+
+
+ As part of a `DataFrame` or as a `Series` a similar output as for a `Series` of type ``string`` is
+ shown. Calling ``Categorical.describe()`` will show the frequencies for each level, with NA for
+ unused levels.
+
+ .. ipython:: python
+
+     cat = pd.Categorical(["a","c","c",np.nan], levels=["b","a","c",np.nan])
+     df = pd.DataFrame({"cat":cat, "s":["a","c","c",np.nan]})
+     df.describe()
+     cat.describe()
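As a rough plain-Python picture of "frequencies for each level, with NA for unused levels": the counts are reported per level rather than per observed value, so a level that never occurs still gets a row. The helper ``level_counts`` is hypothetical, and using ``None`` for both missing values and unused levels is an assumption of this sketch, not the pandas output format.

```python
from collections import Counter


def level_counts(values, levels):
    """Count occurrences per level; unused levels map to None (hypothetical helper)."""
    counts = Counter(v for v in values if v is not None)
    return {level: counts.get(level) for level in levels}


summary = level_counts(["a", "c", "c", None], ["b", "a", "c"])
print(summary)
```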


  Working with levels
  -------------------
@@ -153,7 +179,8 @@ It's also possible to pass in the levels in a specific order:

  .. note::

-     Passing in a `levels` argument implies ``ordered=True``.
+     Passing in a `levels` argument implies ``ordered=True``. You can of course override that by
+     passing in an explicit ``ordered=False``.

  Any value omitted in the levels argument will be replaced by `np.nan`:

@@ -178,8 +205,7 @@ Renaming levels is done by assigning new values to the ``Category.levels`` or

  .. note::

-     I contrast to R's `factor` function, a `Categorical` can have levels of other types than
-     string.
+     In contrast to R's `factor`, a `Categorical` can have levels of other types than string.

  Levels must be unique or a `ValueError` is raised:

@@ -190,14 +216,16 @@ Levels must be unique or a `ValueError` is raised:
      except ValueError as e:
          print("ValueError: " + str(e))

- Appending a level can be done by assigning a levels list longer than the current levels:
+ Appending levels can be done by assigning a levels list longer than the current levels:

  .. ipython:: python

      s.cat.levels = [1,2,3,4]
      s.cat.levels
      s

+ .. note::
+     Adding levels in other positions can be done with ``.reorder_levels(<levels_including_new>)``.

  Removing a level is also possible, but only the last level(s) can be removed by assigning a
  shorter list than current levels. Values which are omitted are replaced by `np.nan`.
@@ -236,8 +264,8 @@ Ordered or not...
  -----------------

  If a `Categorical` is ordered (``cat.ordered == True``), then the order of the levels has a
- meaning and certain operations are possible. If the the categorical is unordered,
- a `TypeError` is raised.
+ meaning and certain operations are possible. If the categorical is unordered, a `TypeError` is
+ raised.

  .. ipython:: python

@@ -268,7 +296,8 @@ This is even true for strings and numeric data:
      print(s.min(), s.max())

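To illustrate what "the order of the levels has a meaning" implies for ``min`` and ``max``, here is a plain-Python sketch that compares values by their position in the level list instead of lexically. The variable names are hypothetical and this is not how pandas computes the result internally.

```python
# Levels listed from "smallest" to "largest"; deliberately not in lexical order.
levels = ["c", "b", "a"]
values = ["a", "b", "c", "a"]

# Position of each level in the declared order.
position = {level: i for i, level in enumerate(levels)}

smallest = min(values, key=position.__getitem__)
largest = max(values, key=position.__getitem__)
print(smallest, largest)
```

With lexical ordering the result would be reversed, which is why the declared level order matters.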
  Reordering the levels is possible via the ``Categorical.reorder_levels(new_levels)`` or
- ``Series.cat.reorder_levels(new_levels)`` methods:
+ ``Series.cat.reorder_levels(new_levels)`` methods. All old levels must be included in the new
+ levels.

  .. ipython:: python

@@ -287,6 +316,15 @@ Reordering the levels is possible via the ``Categorical.reorder_levels(new_level
  way values are sorted is different afterwards, but not that individual values in the
  `Series` are changed.

+ You can also add new levels with :func:`Categorical.reorder_levels`, as long as you include all
+ old levels:
+
+ .. ipython:: python
+
+     s3 = pd.Series(pd.Categorical(["a","b","d"]))
+     s3.cat.reorder_levels(["a","b","c","d"])
+     s3
+
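The requirement that all old levels be present can be pictured with a plain-Python sketch: reordering only changes which position each level occupies, so every stored code has to be remapped to the new position of the same label. The helper ``remap_codes`` is hypothetical and is not the pandas implementation.

```python
def remap_codes(codes, old_levels, new_levels):
    """Return codes pointing at the same labels in new_levels (hypothetical helper)."""
    missing = [lvl for lvl in old_levels if lvl not in new_levels]
    if missing:
        raise ValueError("new levels must include all old levels, missing: %r" % (missing,))
    new_position = {lvl: i for i, lvl in enumerate(new_levels)}
    return [new_position[old_levels[c]] for c in codes]


old_levels = ["a", "b", "d"]
new_levels = ["a", "b", "c", "d"]
codes = [0, 1, 2]  # the values "a", "b", "d"

new_codes = remap_codes(codes, old_levels, new_levels)
print(new_codes)
print([new_levels[c] for c in new_codes])
```

The values themselves stay the same; only their codes, and therefore their sort order, change.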

  Operations
  ----------
@@ -317,8 +355,8 @@ The mode:
  .. note::

      Numeric operations like ``+``, ``-``, ``*``, ``/`` and operations based on them (e.g.
-     ``Categorical.median()``, which would need to compute the mean between two values if the
-     length of an array is even) do not work and raise a `TypeError`.
+     ``.median()``, which would need to compute the mean between two values if the length of an
+     array is even) do not work and raise a `TypeError`.

  `Series` methods like `Series.value_counts()` will use all levels, even if some levels are not
  present in the data:
@@ -353,7 +391,7 @@ Pivot tables:
  Data munging
  ------------

- The optimized pandas data access methods ``.loc``, ``.iloc`` ``ix`` ``.at``, and``.iat``,
+ The optimized pandas data access methods ``.loc``, ``.iloc``, ``.ix``, ``.at``, and ``.iat``,
  work as normal, the only difference is the return type (for getting) and
  that only values already in the levels can be assigned.

@@ -393,7 +431,7 @@ of length "1".
      df.at["h","cats"] # returns a string

  .. note::
-     Note that this is a difference to R's `factor` function, where ``factor(c(1,2,3))[1]``
+     This is a difference to R's `factor` function, where ``factor(c(1,2,3))[1]``
      returns a single value `factor`.

  To get a single value `Series` of type ``category`` pass in a single value list:
@@ -455,7 +493,9 @@ but the levels of these `Categoricals` need to be the same:
      cat = pd.Categorical(["a","b"], levels=["a","b"])
      vals = [1,2]
      df = pd.DataFrame({"cats":cat, "vals":vals})
-     pd.concat([df,df])
+     res = pd.concat([df,df])
+     res
+     res.dtypes

      df_different = df.copy()
      df_different["cats"].cat.levels = ["a","b","c"]
@@ -501,27 +541,34 @@ store does not yet work.


  Writing to a csv file will convert the data, effectively removing any information about the
- `Categorical` (`levels` and ordering). So if you read back the csv file you have to convert the
- relevant columns back to `category` and assign the right `levels` and level ordering.
+ `Categorical` (levels and ordering). So if you read back the csv file you have to convert the
+ relevant columns back to `category` and assign the right levels and level ordering.

  .. ipython:: python
      :suppress:

      from pandas.compat import StringIO
-     csv_file = StringIO
+     csv_file = StringIO()

  .. ipython:: python

-     s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'c'], levels=['a','b','c','d']))
+     s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))
+     # rename the levels
+     s.cat.levels = ["very good", "good", "bad"]
+     # add new levels at the end
+     s.cat.levels = list(s.cat.levels) + ["medium", "very bad"]
+     # reorder the levels
+     s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
      df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
      df.to_csv(csv_file)
      df2 = pd.read_csv(csv_file)
-     df2.dtype
+     df2.dtypes
      df2["vals"]
      # Redo the category
      df2["vals"] = df2["vals"].astype("category")
-     df2["vals"].cat.levels = ['a','b','c','d']
-     df2.dtype
+     df2["vals"].cat.levels = list(df2["vals"].cat.levels) + ["medium", "very bad"]
+     df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
+     df2.dtypes
      df2["vals"]


@@ -576,8 +623,8 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
      dtype == np.str_
      np.str_ == dtype

- Using ``numpy`` functions on a `Series` of type ``category`` should not work as `Categoricals`
- are not numeric data (even in the case that levels is numeric).
+ Using `numpy` functions on a `Series` of type ``category`` should not work as `Categoricals`
+ are not numeric data (even in the case that ``.levels`` is numeric).

  .. ipython:: python

@@ -612,36 +659,40 @@ means that changes to the `Series` will in most cases change the original `Categ
  Use ``copy=True`` to prevent such a behaviour:

  .. ipython:: python
+
      cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
      s = pd.Series(cat, name="cat", copy=True)
      cat
      s.iloc[0:2] = 10
      cat

  .. note::
-     This also happens in some cases when you supply a `numpy` array: using an int array
-     (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, but using a string
-     array (e.g. ``np.array(["a","b","c","a"])``) will not.
+     This also happens in some cases when you supply a `numpy` array instead of a `Categorical`:
+     using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, but using
+     a string array (e.g. ``np.array(["a","b","c","a"])``) will not.


  Danger of confusion
  ~~~~~~~~~~~~~~~~~~~

- Both `Series` and `Categorical` have a method ``.reorder_levels()``. For Series of type
- ``category`` this means that there is some danger to confuse both methods.
+ Both `Series` and `Categorical` have a method ``.reorder_levels()`` but for different things. For
+ Series of type ``category`` this means that there is some danger to confuse both methods.

  .. ipython:: python

      s = pd.Series(pd.Categorical([1,2,3,4]))
+     print(s.cat.levels)
      # wrong and raises an error:
      try:
          s.reorder_levels([4,3,2,1])
      except Exception as e:
          print("Exception: " + str(e))
      # right
-     print(s.cat.levels)
-     print([4,3,2,1])
      s.cat.reorder_levels([4,3,2,1])
+     print(s.cat.levels)
+
+ See also the API documentation for :func:`pandas.Series.reorder_levels` and
+ :func:`pandas.Categorical.reorder_levels`

  Old style constructor usage
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -665,8 +716,8 @@ In the default case (``compat=False``) the first argument is interpreted as valu

  .. warning::
      Using Categorical with precomputed level_codes and levels is deprecated and a `FutureWarning`
-     is raised. Please change your code to use one of the proper constructor modes instead of
-     adding ``compat=False``.
+     is raised. Please change your code to use the :func:`~pandas.Categorical.from_codes`
+     constructor instead of adding ``compat=False``.

  No categorical index
  ~~~~~~~~~~~~~~~~~~~~
@@ -682,9 +733,13 @@ ordering of the levels:
      values = [4,2,3,1]
      df = pd.DataFrame({"strings":strings, "values":values}, index=cats)
      df.index
-     # This should sort by levels but doesn't!
+     # This should sort by levels but does not as there is no CategoricalIndex!
      df.sort_index()

+ .. note::
+     This could change if a `CategoricalIndex` is implemented (see
+     https://github.com/pydata/pandas/issues/7629)
+
  dtype in apply
  ~~~~~~~~~~~~~~
