Commit 6a4399d

Merge pull request #8007 from JanSchulz/categorical_fixups
CLN/DOC/TST: Categorical fixups (GH7768)
2 parents a5bb77e + 0165d14 commit 6a4399d

12 files changed

+851
-130
lines changed

doc/source/10min.rst

+28-1
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,8 @@ Creating a ``DataFrame`` by passing a dict of objects that can be converted to s
6666
'B' : pd.Timestamp('20130102'),
6767
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
6868
'D' : np.array([3] * 4,dtype='int32'),
69-
'E' : 'foo' })
69+
'E' : pd.Categorical(["test","train","test","train"]),
70+
'F' : 'foo' })
7071
df2
7172
7273
Having specific :ref:`dtypes <basics.dtypes>`
@@ -635,6 +636,32 @@ the quarter end:
635636
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
636637
ts.head()
637638
639+
Categoricals
640+
------------
641+
642+
Since version 0.15, pandas can include categorical data in a ``DataFrame``. For full docs, see the
643+
:ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.
644+
645+
.. ipython:: python
646+
647+
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
648+
649+
# convert the raw grades to a categorical
650+
df["grade"] = pd.Categorical(df["raw_grade"])
651+
652+
# Alternative: df["grade"] = df["raw_grade"].astype("category")
653+
df["grade"]
654+
655+
# Rename the levels
656+
df["grade"].cat.levels = ["very good", "good", "very bad"]
657+
658+
# Reorder the levels and simultaneously add the missing levels
659+
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
660+
df["grade"]
661+
df.sort("grade")
662+
df.groupby("grade").size()
663+
664+
638665
639666
Plotting
640667
--------
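
Note on the ``10min.rst`` hunk above: ``cat.levels``, ``cat.reorder_levels`` and ``df.sort`` are the spellings used at the time of this commit. The following is a minimal sketch of the same walkthrough on a current pandas release, assuming the later renames (``levels`` became ``categories``, ``df.sort`` became ``df.sort_values``). The method names ``rename_categories`` and ``set_categories`` are not part of this commit.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical column.
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories (called "levels" in this commit).
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in a single step.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting follows the category order, not the lexical order of the labels.
sorted_df = df.sort_values("grade")

# Grouping reports empty categories too when observed=False is passed.
counts = df.groupby("grade", observed=False).size()
```

Here ``counts`` contains one entry per category, including ``"bad"`` and ``"medium"`` with a size of zero.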

doc/source/api.rst

+9-27
@@ -521,51 +521,33 @@ Categorical
521521
.. currentmodule:: pandas.core.categorical
522522

523523
If the Series is of dtype ``category``, ``Series.cat`` can be used to access the underlying
524-
``Categorical``. This data type is similar to the otherwise underlying numpy array
525-
and has the following usable methods and properties (all available as
526-
``Series.cat.<method_or_property>``).
527-
524+
``Categorical``. This accessor is similar to ``Series.dt`` or ``Series.str`` and has the
525+
following usable methods and properties (all available as ``Series.cat.<method_or_property>``).
528526

529527
.. autosummary::
530528
:toctree: generated/
531529

532-
Categorical
533-
Categorical.from_codes
534530
Categorical.levels
535531
Categorical.ordered
536532
Categorical.reorder_levels
537533
Categorical.remove_unused_levels
538-
Categorical.min
539-
Categorical.max
540-
Categorical.mode
541-
Categorical.describe
542534

543-
``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts
544-
the Categorical back to a numpy array, so levels and order information is not preserved!
535+
The following methods are considered API when using ``Categorical`` directly:
545536

546537
.. autosummary::
547538
:toctree: generated/
548539

549-
Categorical.__array__
540+
Categorical
541+
Categorical.from_codes
542+
Categorical.codes
550543

551-
To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
552-
are also introduced.
544+
``np.asarray(categorical)`` works by implementing the array interface. Be aware that this converts
545+
the Categorical back to a numpy array, so levels and order information are not preserved!
553546

554547
.. autosummary::
555548
:toctree: generated/
556549

557-
Categorical.from_array
558-
Categorical.get_values
559-
Categorical.copy
560-
Categorical.dtype
561-
Categorical.ndim
562-
Categorical.sort
563-
Categorical.equals
564-
Categorical.unique
565-
Categorical.order
566-
Categorical.argsort
567-
Categorical.fillna
568-
550+
Categorical.__array__
569551

570552
Plotting
571553
~~~~~~~~
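
Note on the ``api.rst`` hunk above: a small sketch of how ``codes``, ``from_codes`` and ``np.asarray`` relate, written against a current pandas release. It assumes the ``categories=`` keyword, which later replaced the ``levels=`` keyword used in this commit.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "a", "b"], categories=["b", "a", "c"], ordered=True)

# codes holds the integer position of each value within the categories.
codes = cat.codes

# from_codes rebuilds a Categorical from those positions plus the categories.
rebuilt = pd.Categorical.from_codes(codes, categories=cat.categories, ordered=True)

# np.asarray goes through the array interface and returns a plain ndarray;
# the category list and the ordering flag are not carried over.
arr = np.asarray(cat)
```

The unused category ``"c"`` survives the round trip through ``from_codes`` but is no longer visible in ``arr``.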

doc/source/categorical.rst

+93-18
@@ -90,6 +90,7 @@ By using some special functions:
9090
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
9191
df.head(10)
9292
93+
See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.
9394

9495
`Categoricals` have a specific ``category`` :ref:`dtype <basics.dtypes>`:
9596

@@ -331,6 +332,57 @@ Operations
331332

332333
The following operations are possible with categorical data:
333334

335+
Comparing `Categoricals` with other objects is possible in two cases:
336+
337+
* comparing a `Categorical` to another `Categorical`, when `levels` and `ordered` are the same, or
338+
* comparing a `Categorical` to a scalar.
339+
340+
All other comparisons will raise a TypeError.
341+
342+
.. ipython:: python
343+
344+
cat = pd.Series(pd.Categorical([1,2,3], levels=[3,2,1]))
345+
cat_base = pd.Series(pd.Categorical([2,2,2], levels=[3,2,1]))
346+
cat_base2 = pd.Series(pd.Categorical([2,2,2]))
347+
348+
cat
349+
cat_base
350+
cat_base2
351+
352+
Comparing to a categorical with the same levels and ordering or to a scalar works:
353+
354+
.. ipython:: python
355+
356+
cat > cat_base
357+
cat > 2
358+
359+
This doesn't work because the levels are not the same:
360+
361+
.. ipython:: python
362+
363+
try:
364+
cat > cat_base2
365+
except TypeError as e:
366+
print("TypeError: " + str(e))
367+
368+
.. note::
369+
370+
Comparisons with `Series`, `np.array` or a `Categorical` with different levels or ordering
371+
will raise a `TypeError` because custom level ordering would allow two valid results:
372+
one taking the ordering into account and one ignoring it. If you want to compare a `Categorical`
373+
with such a type, you need to be explicit and convert the `Categorical` to values:
374+
375+
.. ipython:: python
376+
377+
base = np.array([1,2,3])
378+
379+
try:
380+
cat > base
381+
except TypeError as e:
382+
print("TypeError: " + str(e))
383+
384+
np.asarray(cat) > base
385+
334386
Getting the minimum and maximum, if the categorical is ordered:
335387

336388
.. ipython:: python
@@ -489,34 +541,38 @@ but the levels of these `Categoricals` need to be the same:
489541

490542
.. ipython:: python
491543
492-
cat = pd.Categorical(["a","b"], levels=["a","b"])
493-
vals = [1,2]
494-
df = pd.DataFrame({"cats":cat, "vals":vals})
495-
res = pd.concat([df,df])
496-
res
497-
res.dtypes
544+
cat = pd.Categorical(["a","b"], levels=["a","b"])
545+
vals = [1,2]
546+
df = pd.DataFrame({"cats":cat, "vals":vals})
547+
res = pd.concat([df,df])
548+
res
549+
res.dtypes
498550
499-
df_different = df.copy()
500-
df_different["cats"].cat.levels = ["a","b","c"]
551+
In this case the levels are not the same and so an error is raised:
501552

502-
try:
503-
pd.concat([df,df])
504-
except ValueError as e:
505-
print("ValueError: " + str(e))
553+
.. ipython:: python
554+
555+
df_different = df.copy()
556+
df_different["cats"].cat.levels = ["a","b","c"]
557+
try:
558+
pd.concat([df,df_different])
559+
except ValueError as e:
560+
print("ValueError: " + str(e))
506561
507562
The same applies to ``df.append(df)``.
508563

509564
Getting Data In/Out
510565
-------------------
511566

512-
Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently raise ``NotImplementedError``.
567+
Writing data (`Series`, `Frames`) to an HDF store that contains a ``category`` dtype will currently
568+
raise ``NotImplementedError``.
513569

514570
Writing to a CSV file will convert the data, effectively removing any information about the
515571
`Categorical` (levels and ordering). So if you read back the CSV file you have to convert the
516572
relevant columns back to `category` and assign the right levels and level ordering.
517573

518574
.. ipython:: python
519-
:suppress:
575+
:suppress:
520576
521577
from pandas.compat import StringIO
522578
@@ -548,7 +604,7 @@ default not included in computations. See the :ref:`Missing Data section
548604
<missing_data>`
549605

550606
There are two ways a `np.nan` can be represented in `Categorical`: either the value is not
551-
available or `np.nan` is a valid level.
607+
available ("missing value") or `np.nan` is a valid level.
552608

553609
.. ipython:: python
554610
@@ -560,9 +616,25 @@ available or `np.nan` is a valid level.
560616
s2.cat.levels = [1,2,np.nan]
561617
s2
562618
# three levels, np.nan included
563-
# Note: as int arrays can't hold NaN the levels were converted to float
619+
# Note: as int arrays can't hold NaN the levels were converted to object
564620
s2.cat.levels
565621
622+
.. note::
623+
Missing value methods like ``isnull`` and ``fillna`` will take both missing values and
624+
`np.nan` levels into account:
625+
626+
.. ipython:: python
627+
628+
c = pd.Categorical(["a","b",np.nan])
629+
c.levels = ["a","b",np.nan]
630+
# will be inserted as a NA level:
631+
c[0] = np.nan
632+
s = pd.Series(c)
633+
s
634+
pd.isnull(s)
635+
s.fillna("a")
636+
637+
566638
Gotchas
567639
-------
568640

@@ -579,15 +651,18 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
579651
try:
580652
np.dtype("category")
581653
except TypeError as e:
582-
print("TypeError: " + str(e))
654+
print("TypeError: " + str(e))
583655
584656
dtype = pd.Categorical(["a"]).dtype
585657
try:
586658
np.dtype(dtype)
587659
except TypeError as e:
588660
print("TypeError: " + str(e))
589661
590-
# dtype comparisons work:
662+
Dtype comparisons work:
663+
664+
.. ipython:: python
665+
591666
dtype == np.str_
592667
np.str_ == dtype
593668
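
Note on the comparison rules added to ``categorical.rst`` above: a minimal sketch of the same three cases on a current pandas release. It assumes ``categories=`` in place of ``levels=`` and passes ``ordered=True`` explicitly, since current releases only allow ``>`` and ``<`` on ordered categoricals.

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
cat_base = pd.Series(pd.Categorical([2, 2, 2], categories=[3, 2, 1], ordered=True))
cat_base2 = pd.Series(pd.Categorical([2, 2, 2], ordered=True))

# Same categories and ordering: compared by category position (3 < 2 < 1).
same_levels = (cat > cat_base).tolist()

# A scalar that is one of the categories is compared the same way.
against_scalar = (cat > 2).tolist()

# Different categories: the comparison is refused with a TypeError.
try:
    cat > cat_base2
    mismatch_error = None
except TypeError as e:
    mismatch_error = e

# Being explicit: convert to plain values first, then compare numerically.
plain_values = (np.asarray(cat) > np.array([1, 2, 3])).tolist()
```

With the custom ordering, ``1`` ranks above ``2``, so only the first element of ``same_levels`` and ``against_scalar`` is ``True``. The plain-value comparison ignores that ordering.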

doc/source/reshaping.rst

+7
@@ -505,3 +505,10 @@ handling of NaN:
505505
506506
pd.factorize(x, sort=True)
507507
np.unique(x, return_inverse=True)[::-1]
508+
509+
.. note::
510+
If you just want to handle one column as a categorical variable (like R's factor),
511+
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
512+
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
513+
see the :ref:`Categorical introduction <categorical>` and the
514+
:ref:`API documentation <api.categorical>`. This feature was introduced in version 0.15.
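
Note on the ``reshaping.rst`` hunk above: the two conversions named in the added note can be sketched as follows. The frame and the column names ``cat_a`` and ``cat_b`` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one plain string column.
df = pd.DataFrame({"col": ["x", "y", "x", "z"]})

# Option 1: build a Categorical from the column.
df["cat_a"] = pd.Categorical(df["col"])

# Option 2: cast the column to the category dtype.
df["cat_b"] = df["col"].astype("category")
```

Both columns end up with dtype ``category`` and the same inferred categories.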

doc/source/v0.15.0.txt

+3-2
@@ -288,9 +288,10 @@ Categoricals in Series/DataFrame
288288

289289
:class:`~pandas.Categorical` can now be included in `Series` and `DataFrames` and gained new
290290
methods to manipulate. Thanks to Jan Schulz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
291-
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`).
291+
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`, :issue:`7768`, :issue:`8006`, :issue:`3678`).
292292

293-
For full docs, see the :ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.
293+
For full docs, see the :ref:`Categorical introduction <categorical>` and the
294+
:ref:`API documentation <api.categorical>`.
294295

295296
.. ipython:: python
296297
