Commit 3b58f48

ivirshup authored and WillAyd committed
DOC: Fix docs on merging categoricals. (pandas-dev#28185)
1 parent 62f6a42 commit 3b58f48

2 files changed: +36 -61 lines changed

doc/source/user_guide/categorical.rst

+35 -60
@@ -797,37 +797,52 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val
     df.dtypes
 
 .. _categorical.merge:
+.. _categorical.concat:
 
-Merging
-~~~~~~~
+Merging / Concatenation
+~~~~~~~~~~~~~~~~~~~~~~~
 
-You can concat two ``DataFrames`` containing categorical data together,
-but the categories of these categoricals need to be the same:
+By default, combining ``Series`` or ``DataFrames`` which contain the same
+categories results in ``category`` dtype, otherwise results will depend on the
+dtype of the underlying categories. Merges that result in non-categorical
+dtypes will likely have higher memory usage. Use ``.astype`` or
+``union_categoricals`` to ensure ``category`` results.
 
 .. ipython:: python
 
-    cat = pd.Series(["a", "b"], dtype="category")
-    vals = [1, 2]
-    df = pd.DataFrame({"cats": cat, "vals": vals})
-    res = pd.concat([df, df])
-    res
-    res.dtypes
+    from pandas.api.types import union_categoricals
 
-In this case the categories are not the same, and therefore an error is raised:
+    # same categories
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
+    pd.concat([s1, s2])
 
-.. ipython:: python
+    # different categories
+    s3 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s3])
 
-    df_different = df.copy()
-    df_different["cats"].cat.categories = ["c", "d"]
-    try:
-        pd.concat([df, df_different])
-    except ValueError as e:
-        print("ValueError:", str(e))
+    # Output dtype is inferred based on categories values
+    int_cats = pd.Series([1, 2], dtype="category")
+    float_cats = pd.Series([3.0, 4.0], dtype="category")
+    pd.concat([int_cats, float_cats])
+
+    pd.concat([s1, s3]).astype('category')
+    union_categoricals([s1.array, s3.array])
 
-The same applies to ``df.append(df_different)``.
+The following table summarizes the results of merging ``Categoricals``:
 
-See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about preserving merge dtypes and performance.
++-------------------+------------------------+----------------------+-----------------------------+
+| arg1              | arg2                   | identical            | result                      |
++===================+========================+======================+=============================+
+| category          | category               | True                 | category                    |
++-------------------+------------------------+----------------------+-----------------------------+
+| category (object) | category (object)      | False                | object (dtype is inferred)  |
++-------------------+------------------------+----------------------+-----------------------------+
+| category (int)    | category (float)       | False                | float (dtype is inferred)   |
++-------------------+------------------------+----------------------+-----------------------------+
 
+See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about
+preserving merge dtypes and performance.
 
 .. _categorical.union:
 
@@ -918,46 +933,6 @@ the resulting array will always be a plain ``Categorical``:
     # "b" is coded to 0 throughout, same as c1, different from c2
     c.codes
 
-.. _categorical.concat:
-
-Concatenation
-~~~~~~~~~~~~~
-
-This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for general description.
-
-By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories
-results in ``category`` dtype, otherwise results in ``object`` dtype.
-Use ``.astype`` or ``union_categoricals`` to get ``category`` result.
-
-.. ipython:: python
-
-    # same categories
-    s1 = pd.Series(['a', 'b'], dtype='category')
-    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
-    pd.concat([s1, s2])
-
-    # different categories
-    s3 = pd.Series(['b', 'c'], dtype='category')
-    pd.concat([s1, s3])
-
-    pd.concat([s1, s3]).astype('category')
-    union_categoricals([s1.array, s3.array])
-
-
-Following table summarizes the results of ``Categoricals`` related concatenations.
-
-+----------+--------------------------------------------------------+----------------------------+
-| arg1     | arg2                                                   | result                     |
-+==========+========================================================+============================+
-| category | category (identical categories)                        | category                   |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, both not ordered)      | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | category (different categories, either one is ordered) | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-| category | not category                                           | object (dtype is inferred) |
-+----------+--------------------------------------------------------+----------------------------+
-
 
 Getting data in/out
 -------------------
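
The dtype outcomes summarized in the new table can be checked directly. The following is a minimal sketch and not part of the commit: it assumes a reasonably recent pandas, reuses the series names from the doc example, and the `big` and `other` series are hypothetical data added only to illustrate the memory-usage remark.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Series taken from the doc example in the diff above.
s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "b", "a"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")

# Identical categories: the result stays categorical.
same = pd.concat([s1, s2])

# Different categories: the result falls back to the dtype of the
# underlying categories instead of staying categorical.
different = pd.concat([s1, s3])
numeric = pd.concat([int_cats, float_cats])

# Two ways to get a categorical result anyway.
forced = different.astype("category")
unioned = union_categoricals([s1.array, s3.array])

for name, obj in [("same", same), ("different", different),
                  ("numeric", numeric), ("forced", forced)]:
    print(name, obj.dtype)
print("unioned categories:", list(unioned.categories))

# Hypothetical larger data to compare memory use of the fallback
# result against its categorical equivalent.
big = pd.Series(["a", "b"] * 10_000, dtype="category")
other = pd.Series(["b", "c"] * 10_000, dtype="category")
fallback = pd.concat([big, other], ignore_index=True)
as_category = fallback.astype("category")
print(fallback.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

The exact fallback dtype for string categories may differ between pandas versions, so the sketch prints dtypes rather than hard-coding them.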

doc/source/user_guide/merging.rst

+1 -1
@@ -881,7 +881,7 @@ The merged result:
 .. note::
 
    The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute.
-   Otherwise the result will coerce to ``object`` dtype.
+   Otherwise the result will coerce to the categories' dtype.
 
 .. note::
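
The corrected note can be illustrated with `pd.merge`. This is a minimal sketch under stated assumptions, not code from the commit: the frames `left`, `right_same` and `right_diff`, their column names, and the `shared` dtype are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames; only the dtype of the "key" column matters here.
left = pd.DataFrame({
    "key": pd.Categorical(["a", "b"], categories=["a", "b"]),
    "lval": [1, 2],
})
right_same = pd.DataFrame({
    "key": pd.Categorical(["b", "a"], categories=["a", "b"]),
    "rval": [3, 4],
})
right_diff = pd.DataFrame({
    "key": pd.Categorical(["b", "c"], categories=["b", "c"]),
    "rval": [5, 6],
})

# Same categories and ordered flag on both sides: category is preserved.
merged_same = pd.merge(left, right_same, on="key")

# Different categories: the key is coerced to the categories' dtype.
merged_diff = pd.merge(left, right_diff, on="key")

# Casting both keys to one shared CategoricalDtype before merging
# keeps the result categorical.
shared = pd.CategoricalDtype(["a", "b", "c"])
merged_shared = pd.merge(
    left.astype({"key": shared}),
    right_diff.astype({"key": shared}),
    on="key",
)

print(merged_same["key"].dtype)
print(merged_diff["key"].dtype)
print(merged_shared["key"].dtype)
```

Casting to a shared dtype up front is one option; `union_categoricals` could be used to derive the combined categories instead of listing them by hand.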
