CLN/DOC/TST: categorical fixups (GH7768) #8006

Closed
wants to merge 1 commit into from
29 changes: 28 additions & 1 deletion doc/source/10min.rst
@@ -66,7 +66,8 @@ Creating a ``DataFrame`` by passing a dict of objects that can be converted to s
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : 'foo' })
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2

Having specific :ref:`dtypes <basics.dtypes>`
@@ -635,6 +636,32 @@ the quarter end:
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

Categoricals
------------

Since version 0.15, pandas can include categorical data in a `DataFrame`. For full docs, see the
:ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.

.. ipython:: python

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

# convert the raw grades to a categorical
df["grade"] = pd.Categorical(df["raw_grade"])

# Alternative: df["grade"] = df["raw_grade"].astype("category")
df["grade"]

# Rename the levels
df["grade"].cat.levels = ["very good", "good", "very bad"]

# Reorder the levels and simultaneously add the missing levels
df["grade"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
df.sort("grade")
df.groupby("grade").size()
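For readers on a current pandas release, the same workflow can be sketched as follows. This is a translation under assumptions, not the code from this PR: later releases call levels "categories", so the sketch uses `rename_categories`, `set_categories`, and `sort_values` instead of the 0.15 pre-release names shown above.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert the raw grades to a categorical column.
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories (called "levels" in the text above).
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the missing ones in a single step.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting follows the category order, not lexical order.
sorted_df = df.sort_values("grade")

# Grouping also reports the empty categories (observed=False keeps them).
counts = df.groupby("grade", observed=False).size()
```

The unused categories "bad" and "medium" show up in `counts` with a size of zero, which is the point of adding them before grouping.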



Plotting
--------
13 changes: 10 additions & 3 deletions doc/source/api.rst
@@ -528,11 +528,17 @@ and has the following usable methods and properties (all available as
:toctree: generated/

Categorical
Categorical.from_codes
Categorical.levels
Categorical.ordered
Categorical.reorder_levels
Categorical.remove_unused_levels

The following methods are considered API when using ``Categorical`` directly:

.. autosummary::
:toctree: generated/

Categorical.from_codes
Categorical.min
Categorical.max
Categorical.mode
@@ -547,7 +553,7 @@ the Categorical back to a numpy array, so levels and order information is not pr
Categorical.__array__

To create compatibility with `pandas.Series` and `numpy` arrays, the following (non-API) methods
are also introduced.
are also introduced and available when ``Categorical`` is used directly.

.. autosummary::
:toctree: generated/
@@ -563,7 +569,8 @@ are also introduced.
Categorical.order
Categorical.argsort
Categorical.fillna

Categorical.notnull
Categorical.isnull

Plotting
~~~~~~~~
137 changes: 85 additions & 52 deletions doc/source/categorical.rst
@@ -90,6 +90,7 @@ By using some special functions:
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)

See :ref:`documentation <reshaping.tile.cut>` for :func:`~pandas.cut`.
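As a self-contained illustration of the `pd.cut` call in this hunk, the sketch below bins a few hypothetical values (the sample data is an assumption; the bin edges and label format mirror the snippet above) and checks that the result has the ``category`` dtype.

```python
import pandas as pd

values = pd.Series([5, 15, 25, 99])  # hypothetical sample data
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
groups = pd.cut(values, list(range(0, 105, 10)), right=False, labels=labels)
```

Eleven bin edges produce ten bins, so the ten labels line up one-to-one with the intervals.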

`Categoricals` have a specific ``category`` :ref:`dtype <basics.dtypes>`:

@@ -210,11 +211,9 @@ Renaming levels is done by assigning new values to the ``Category.levels`` or
Levels must be unique or a `ValueError` is raised:

.. ipython:: python
:okexcept:

try:
s.cat.levels = [1,1,1]
except ValueError as e:
print("ValueError: " + str(e))
s.cat.levels = [1,1,1]
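The uniqueness rule can also be checked on a current pandas release. The sketch below assumes the later `rename_categories` method rather than the `levels` assignment used in this PR, and captures the exception instead of letting it propagate.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

try:
    # Three categories cannot all be renamed to the same value.
    s.cat.rename_categories([1, 1, 1])
    error = None
except ValueError as e:
    error = e
```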

Appending levels can be done by assigning a levels list longer than the current levels:

@@ -268,12 +267,11 @@ meaning and certain operations are possible. If the categorical is unordered, a
raised.

.. ipython:: python
:okexcept:

s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
try:
s.sort()
except TypeError as e:
print("TypeError: " + str(e))
s.sort()

s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=True))
s.sort()
s
@@ -331,6 +329,44 @@ Operations

The following operations are possible with categorical data:

Comparing `Categoricals` with other objects is possible in two cases:

* comparing a `Categorical` to another `Categorical`, when `levels` and `ordered` are the same, or
* comparing a `Categorical` to a scalar.

All other comparisons will raise a `TypeError`.

.. ipython:: python

cat = pd.Series(pd.Categorical([1,2,3], levels=[3,2,1]))
cat
cat_base = pd.Series(pd.Categorical([2,2,2], levels=[3,2,1]))
cat_base
cat_base2 = pd.Series(pd.Categorical([2,2,2]))
cat_base2

cat > cat_base
cat > 2

This doesn't work because the levels are not the same:

.. ipython:: python
:okexcept:

cat > cat_base2
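The comparison rules can be sketched on a current pandas release as follows. Two things here are assumptions beyond the PR text: the keyword is `categories` rather than `levels`, and `ordered=True` is passed explicitly because recent releases only allow `<` and `>` on ordered categoricals.

```python
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
cat_base = pd.Series(pd.Categorical([2, 2, 2], categories=[3, 2, 1], ordered=True))
cat_base2 = pd.Series(pd.Categorical([2, 2, 2], ordered=True))

# Same categories and ordering: values are compared by position in [3, 2, 1].
same = cat > cat_base
scalar = cat > 2

try:
    # Different categories: the comparison is rejected.
    cat > cat_base2
    mismatch_error = None
except TypeError as e:
    mismatch_error = e
```

Because the custom order is 3 < 2 < 1, only the value 1 compares as greater than 2.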

.. note::

Comparisons with a `Series`, an `np.array` or a `Categorical` with different levels or ordering
will raise a `TypeError`, because a custom level ordering would allow two valid results:
one that takes the ordering into account and one that does not. If you want to compare a
`Categorical` with such a type, you need to be explicit and convert the `Categorical` to values:

.. ipython:: python
:okexcept:

base = np.array([1,2,3])
cat > base
np.asarray(cat) > base
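A minimal sketch of the explicit conversion, written against a current pandas release (an assumption: `categories` replaces `levels`, and the values in `base` are hypothetical):

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
base = np.array([0, 2, 5])  # hypothetical values

try:
    # Ambiguous: should the custom order 3 < 2 < 1 apply or not?
    cat > base
    array_error = None
except TypeError as e:
    array_error = e

# Converting to a plain array compares the values numerically.
plain = np.asarray(cat) > base
```

After the conversion the custom ordering no longer plays a role, so the comparison is an ordinary element-wise numeric one.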

Getting the minimum and maximum, if the categorical is ordered:

.. ipython:: python
@@ -454,21 +490,22 @@ Setting values in a categorical column (or `Series`) works as long as the value

df.iloc[2:4,:] = [["b",2],["b",2]]
df
try:
df.iloc[2:4,:] = [["c",3],["c",3]]
except ValueError as e:
print("ValueError: " + str(e))

Here the value is not included in the levels, so a `ValueError` is raised:

.. ipython:: python
:okexcept:

df.iloc[2:4,:] = [["c",3],["c",3]]

Setting values by assigning a `Categorical` will also check that the `levels` match:

.. ipython:: python
:okexcept:

df.loc["j":"k","cats"] = pd.Categorical(["a","a"], levels=["a","b"])
df
try:
df.loc["j":"k","cats"] = pd.Categorical(["b","b"], levels=["a","b","c"])
except ValueError as e:
print("ValueError: " + str(e))
df.loc["j":"k","cats"] = pd.Categorical(["b","b"], levels=["a","b","c"])
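The assignment rule can be reduced to a small sketch on a single categorical `Series`. It assumes a current pandas release (`categories` instead of `levels`), and it catches both `TypeError` and `ValueError` because the exception type may not match the `ValueError` described in this PR.

```python
import pandas as pd

cats = pd.Series(pd.Categorical(["a", "b", "b", "a"], categories=["a", "b"]))

cats.iloc[0] = "b"  # fine: "b" is an existing category

try:
    cats.iloc[1] = "c"  # "c" is not among the categories
    set_error = None
except (TypeError, ValueError) as e:
    # Assumption: the exact exception type depends on the pandas version.
    set_error = e
```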

Assigning a `Categorical` to parts of a column of other types will use the values:

@@ -489,27 +526,30 @@ but the levels of these `Categoricals` need to be the same:

.. ipython:: python

cat = pd.Categorical(["a","b"], levels=["a","b"])
vals = [1,2]
df = pd.DataFrame({"cats":cat, "vals":vals})
res = pd.concat([df,df])
res
res.dtypes
cat = pd.Categorical(["a","b"], levels=["a","b"])
vals = [1,2]
df = pd.DataFrame({"cats":cat, "vals":vals})
res = pd.concat([df,df])
res
res.dtypes

df_different = df.copy()
df_different["cats"].cat.levels = ["a","b","c"]
df_different = df.copy()
df_different["cats"].cat.levels = ["a","b","c"]

try:
pd.concat([df,df])
except ValueError as e:
print("ValueError: " + str(e))
If the levels are not the same, a `ValueError` is raised:

.. ipython:: python
:okexcept:

pd.concat([df,df_different])

The same applies to ``df.append(df)``.
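The matching case can be sketched as below, assuming a current pandas release where the keyword is `categories`. The mismatched case is left out on purpose, since its behaviour depends on the pandas version.

```python
import pandas as pd

cat = pd.Categorical(["a", "b"], categories=["a", "b"])
df = pd.DataFrame({"cats": cat, "vals": [1, 2]})

# Identical categories on both sides: the result stays categorical.
res = pd.concat([df, df])
```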

Getting Data In/Out
-------------------

Writing data (`Series`, `Frames`) to a HDF store that contains a ``category`` dtype will currently raise ``NotImplementedError``.
Writing data (`Series`, `DataFrames`) that contains a ``category`` dtype to an HDF store will
currently raise ``NotImplementedError``.

Writing to a CSV file will convert the data, effectively removing any information about the
`Categorical` (levels and ordering). So if you read back the CSV file you have to convert the
@@ -575,33 +615,26 @@ object and not as a low level `numpy` array dtype. This leads to some problems.
`numpy` itself doesn't know about the new `dtype`:

.. ipython:: python
:okexcept:

try:
np.dtype("category")
except TypeError as e:
print("TypeError: " + str(e))
np.dtype("category")
dtype = pd.Categorical(["a"]).dtype
np.dtype(dtype)

dtype = pd.Categorical(["a"]).dtype
try:
np.dtype(dtype)
except TypeError as e:
print("TypeError: " + str(e))

# dtype comparisons work:
dtype == np.str_
np.str_ == dtype
# dtype comparisons work:
dtype == np.str_
np.str_ == dtype
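A short sketch of the point about `numpy` not knowing the dtype, capturing the error instead of letting it propagate (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

try:
    # numpy has no dtype registered under this name.
    np.dtype("category")
    numpy_error = None
except TypeError as e:
    numpy_error = e

dtype = pd.Categorical(["a"]).dtype

# Comparing against a numpy type is allowed and simply yields False.
is_str = dtype == np.str_
# Comparing against the dtype's own name works as expected.
is_category = dtype == "category"
```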

Using `numpy` functions on a `Series` of type ``category`` should not work as `Categoricals`
are not numeric data (even in the case that ``.levels`` is numeric).

.. ipython:: python
:okexcept:

s = pd.Series(pd.Categorical([1,2,3,4]))
try:
np.sum(s)
#same with np.log(s),..
except TypeError as e:
print("TypeError: " + str(e))
s = pd.Series(pd.Categorical([1,2,3,4]))

# the same applies to np.log(s), etc.
np.sum(s)
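The following sketch shows the failing reduction next to an explicit conversion. Converting with `np.asarray` first is one possible workaround assumed here, not something this PR prescribes.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

try:
    # Categorical data is not numeric, even with numeric categories.
    np.sum(s)
    sum_error = None
except TypeError as e:
    sum_error = e

# Converting to a plain array first makes the intent explicit.
total = np.asarray(s).sum()
```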

.. note::
If such a function works, please file a bug at https://github.com/pydata/pandas!
@@ -647,14 +680,14 @@ Both `Series` and `Categorical` have a method ``.reorder_levels()`` but for diff
Series of type ``category`` this means that there is some danger of confusing the two methods.

.. ipython:: python
:okexcept:

s = pd.Series(pd.Categorical([1,2,3,4]))
print(s.cat.levels)

# wrong and raises an error:
try:
s.reorder_levels([4,3,2,1])
except Exception as e:
print("Exception: " + str(e))
s.reorder_levels([4,3,2,1])

# right
s.cat.reorder_levels([4,3,2,1])
print(s.cat.levels)
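On a current pandas release the two methods no longer share a name, which removes most of the confusion. The sketch below assumes the later `reorder_categories` accessor method and captures the error from the index-oriented `Series.reorder_levels`.

```python
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

try:
    # Series.reorder_levels rearranges MultiIndex levels, so it fails here.
    s.reorder_levels([4, 3, 2, 1])
    index_error = None
except Exception as e:
    index_error = e

# The categorical counterpart lives under the .cat accessor.
reordered = s.cat.reorder_categories([4, 3, 2, 1])
```

Reordering changes only the category order; the values in the `Series` stay where they are.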
7 changes: 7 additions & 0 deletions doc/source/reshaping.rst
@@ -503,3 +503,10 @@ handling of NaN:

pd.factorize(x, sort=True)
np.unique(x, return_inverse=True)[::-1]

.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`. This feature was introduced in version 0.15.
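To make the relationship between `factorize` and a categorical column concrete, the sketch below compares the integer codes from both on hypothetical data. The `.cat.codes` accessor is assumed from current pandas releases and is not part of this PR.

```python
import pandas as pd

x = pd.Series(["b", "a", None, "b"])  # hypothetical sample data

# factorize returns integer codes plus the sorted unique values.
codes, uniques = pd.factorize(x, sort=True)

# A categorical column stores the same information internally.
as_cat = x.astype("category")
```

In both representations the missing value is encoded as -1 and does not appear among the unique values or categories.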
3 changes: 2 additions & 1 deletion doc/source/v0.15.0.txt
@@ -226,7 +226,8 @@ Categoricals in Series/DataFrame
methods to manipulate. Thanks to Jan Schultz for much of this API/implementation. (:issue:`3943`, :issue:`5313`, :issue:`5314`,
:issue:`7444`, :issue:`7839`, :issue:`7848`, :issue:`7864`, :issue:`7914`).

For full docs, see the :ref:`Categorical introduction <categorical>` and the :ref:`API documentation <api.categorical>`.
For full docs, see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.

.. ipython:: python
