
Commit ab4bd36

sinhrks authored and jorisvandenbossche committed
ENH: concat and append now can handle unordered categories (#13767)
Concatenating categoricals with non-matching categories will now return object dtype instead of raising an error.

* ENH: concat and append now can handle unordered categories
* remove union_categoricals kw from concat
1 parent 3f3839b commit ab4bd36

9 files changed: +473 -184 lines

doc/source/categorical.rst (+53 -5)

@@ -675,12 +675,60 @@ be lexsorted, use ``sort_categories=True`` argument.
 
     union_categoricals([a, b], sort_categories=True)
 
-.. note::
+``union_categoricals`` also works with the "easy" case of combining two
+categoricals of the same categories and order information
+(e.g. what you could also ``append`` for).
+
+.. ipython:: python
+
+    a = pd.Categorical(["a", "b"], ordered=True)
+    b = pd.Categorical(["a", "b", "a"], ordered=True)
+    union_categoricals([a, b])
+
+The following raises ``TypeError`` because the categories are ordered and not identical.
+
+.. code-block:: ipython
+
+    In [1]: a = pd.Categorical(["a", "b"], ordered=True)
+    In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
+    In [3]: union_categoricals([a, b])
+    Out[3]:
+    TypeError: to union ordered Categoricals, all categories must be the same
+
+.. _categorical.concat:
+
+Concatenation
+~~~~~~~~~~~~~
+
+This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects<merging.concat>` for a general description.
+
+By default, concatenating ``Series`` or ``DataFrame`` which contain the same categories
+results in ``category`` dtype, otherwise in ``object`` dtype.
+Use ``.astype`` or ``union_categoricals`` to get a ``category`` result.
+
+.. ipython:: python
+
+    # same categories
+    s1 = pd.Series(['a', 'b'], dtype='category')
+    s2 = pd.Series(['a', 'b', 'a'], dtype='category')
+    pd.concat([s1, s2])
+
+    # different categories
+    s3 = pd.Series(['b', 'c'], dtype='category')
+    pd.concat([s1, s3])
+
+    pd.concat([s1, s3]).astype('category')
+    union_categoricals([s1.values, s3.values])
+
+The following table summarizes the results of ``Categorical``-related concatenations.
 
-In addition to the "easy" case of combining two categoricals of the same
-categories and order information (e.g. what you could also ``append`` for),
-``union_categoricals`` only works with unordered categoricals and will
-raise if any are ordered.
+| arg1     | arg2                                                    | result                     |
+|----------|---------------------------------------------------------|----------------------------|
+| category | category (identical categories)                         | category                   |
+| category | category (different categories, both not ordered)       | object (dtype is inferred) |
+| category | category (different categories, either one is ordered)  | object (dtype is inferred) |
+| category | not category                                            | object (dtype is inferred) |
 
 Getting Data In/Out
 -------------------
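The table above can be verified directly by checking result dtypes. This sketch is not part of the commit; it simply illustrates the documented rules and assumes pandas 0.19 or later, where mixed-category concatenation no longer raises:

    import pandas as pd

    s1 = pd.Series(['a', 'b'], dtype='category')
    s2 = pd.Series(['a', 'b', 'a'], dtype='category')  # identical categories
    s3 = pd.Series(['b', 'c'], dtype='category')        # different categories
    s4 = pd.Series(['b', 'c'])                          # not category

    print(pd.concat([s1, s2]).dtype)  # category (identical categories)
    print(pd.concat([s1, s3]).dtype)  # object (different categories)
    print(pd.concat([s1, s4]).dtype)  # object (category + non-category)

    # converting after the fact restores a categorical result
    print(pd.concat([s1, s3]).astype('category').dtype)  # category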

doc/source/merging.rst (+14 -13)

@@ -78,34 +78,35 @@ some configurable handling of "what to do with the other axes":
 ::
 
     pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
-              keys=None, levels=None, names=None, verify_integrity=False)
+              keys=None, levels=None, names=None, verify_integrity=False,
+              copy=True)
 
-- ``objs``: a sequence or mapping of Series, DataFrame, or Panel objects. If a
+- ``objs`` : a sequence or mapping of Series, DataFrame, or Panel objects. If a
   dict is passed, the sorted keys will be used as the `keys` argument, unless
   it is passed, in which case the values will be selected (see below). Any None
   objects will be dropped silently unless they are all None in which case a
   ValueError will be raised.
-- ``axis``: {0, 1, ...}, default 0. The axis to concatenate along.
-- ``join``: {'inner', 'outer'}, default 'outer'. How to handle indexes on
+- ``axis`` : {0, 1, ...}, default 0. The axis to concatenate along.
+- ``join`` : {'inner', 'outer'}, default 'outer'. How to handle indexes on
   other axis(es). Outer for union and inner for intersection.
-- ``join_axes``: list of Index objects. Specific indexes to use for the other
+- ``ignore_index`` : boolean, default False. If True, do not use the index
+  values on the concatenation axis. The resulting axis will be labeled 0, ...,
+  n - 1. This is useful if you are concatenating objects where the
+  concatenation axis does not have meaningful indexing information. Note
+  the index values on the other axes are still respected in the join.
+- ``join_axes`` : list of Index objects. Specific indexes to use for the other
   n - 1 axes instead of performing inner/outer set logic.
-- ``keys``: sequence, default None. Construct hierarchical index using the
+- ``keys`` : sequence, default None. Construct hierarchical index using the
   passed keys as the outermost level. If multiple levels passed, should
   contain tuples.
 - ``levels`` : list of sequences, default None. Specific levels (unique values)
   to use for constructing a MultiIndex. Otherwise they will be inferred from the
   keys.
-- ``names``: list, default None. Names for the levels in the resulting
+- ``names`` : list, default None. Names for the levels in the resulting
   hierarchical index.
-- ``verify_integrity``: boolean, default False. Check whether the new
+- ``verify_integrity`` : boolean, default False. Check whether the new
   concatenated axis contains duplicates. This can be very expensive relative
   to the actual data concatenation.
-- ``ignore_index`` : boolean, default False. If True, do not use the index
-  values on the concatenation axis. The resulting axis will be labeled 0, ...,
-  n - 1. This is useful if you are concatenating objects where the
-  concatenation axis does not have meaningful indexing information. Note
-  the index values on the other axes are still respected in the join.
 - ``copy`` : boolean, default True. If False, do not copy data unnecessarily.
 
 Without a little bit of context and example many of these arguments don't make
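As a quick illustration of two of the arguments listed above (not part of the commit, just a sketch using standard ``pd.concat`` behavior): ``keys`` builds an outer MultiIndex level, while ``ignore_index`` discards the labels on the concatenation axis.

    import pandas as pd

    df1 = pd.DataFrame({'A': [1, 2]})
    df2 = pd.DataFrame({'A': [3, 4]})

    # keys adds an outermost index level per input object
    with_keys = pd.concat([df1, df2], keys=['first', 'second'])
    print(with_keys.index.tolist())   # [('first', 0), ('first', 1), ('second', 0), ('second', 1)]

    # ignore_index relabels the concatenation axis 0..n-1
    renumbered = pd.concat([df1, df2], ignore_index=True)
    print(renumbered.index.tolist())  # [0, 1, 2, 3]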

doc/source/whatsnew/v0.19.0.txt (+49 -16)
@@ -15,6 +15,8 @@ Highlights include:
 
 - :func:`merge_asof` for asof-style time-series joining, see :ref:`here <whatsnew_0190.enhancements.asof_merge>`
 - ``.rolling()`` are now time-series aware, see :ref:`here <whatsnew_0190.enhancements.rolling_ts>`
+- :func:`read_csv` now supports parsing ``Categorical`` data, see :ref:`here <whatsnew_0190.enhancements.read_csv_categorical>`
+- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`here <whatsnew_0190.enhancements.union_categoricals>`
 - pandas development api, see :ref:`here <whatsnew_0190.dev_api>`
 - ``PeriodIndex`` now has its own ``period`` dtype, and changed to be more consistent with other ``Index`` classes. See :ref:`here <whatsnew_0190.api.period>`
 - Sparse data structures now gained enhanced support of ``int`` and ``bool`` dtypes, see :ref:`here <whatsnew_0190.sparse>`
@@ -218,7 +220,7 @@ they are in the file or passed in as the ``names`` parameter (:issue:`7160`, :is
 
    data = '0,1,2\n3,4,5'
    names = ['a', 'b', 'a']
 
-Previous behaviour:
+Previous Behavior:
 
 .. code-block:: ipython
 
@@ -231,7 +233,7 @@ Previous behaviour:
 The first ``a`` column contains the same data as the second ``a`` column, when it should have
 contained the values ``[0, 3]``.
 
-New behaviour:
+New Behavior:
 
 .. ipython :: python
 
@@ -277,6 +279,38 @@ Individual columns can be parsed as a ``Categorical`` using a dict specification
    df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
    df['col3']
 
+.. _whatsnew_0190.enhancements.union_categoricals:
+
+Categorical Concatenation
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- A function :func:`union_categoricals` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`13763`, :issue:`13846`)
+
+  .. ipython:: python
+
+     from pandas.types.concat import union_categoricals
+     a = pd.Categorical(["b", "c"])
+     b = pd.Categorical(["a", "b"])
+     union_categoricals([a, b])
+
+- ``concat`` and ``append`` now can concat ``category`` dtypes with different
+  ``categories`` as ``object`` dtype (:issue:`13524`)
+
+  Previous Behavior:
+
+  .. code-block:: ipython
+
+     In [1]: s1 = pd.Series(['a', 'b'], dtype='category')
+     In [2]: s2 = pd.Series(['b', 'c'], dtype='category')
+     In [3]: pd.concat([s1, s2])
+     ValueError: incompatible categories in categorical concat
+
+  New Behavior:
+
+  .. ipython:: python
+
+     pd.concat([s1, s2])
+
 .. _whatsnew_0190.enhancements.semi_month_offsets:
 
 Semi-Month Offsets
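The hunk above demonstrates ``concat`` only; per its first bullet, ``append`` follows the same rule after this change. A minimal sketch (not part of the diff), assuming the pandas 0.19-era API in which ``Series.append`` still exists (it was removed in pandas 2.0):

    import pandas as pd

    s1 = pd.Series(['a', 'b'], dtype='category')
    s2 = pd.Series(['b', 'c'], dtype='category')

    # mismatched categories now fall back to object dtype instead of raising
    combined = s1.append(s2, ignore_index=True)
    print(combined.dtype)  # object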
@@ -378,11 +412,11 @@ get_dummies dtypes
 
 The ``pd.get_dummies`` function now returns dummy-encoded columns as small integers, rather than floats (:issue:`8725`). This should provide an improved memory footprint.
 
-Previous behaviour:
+Previous Behavior:
 
 .. code-block:: ipython
 
-   In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
+   In [1]: pd.get_dummies(['a', 'b', 'a', 'c']).dtypes
 
    Out[1]:
    a    float64
@@ -404,7 +438,7 @@ Other enhancements
 
 - The ``.get_credentials()`` method of ``GbqConnector`` can now first try to fetch `the application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__. See the :ref:`docs <io.bigquery_authentication>` for more details (:issue:`13577`).
 
-- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behaviour remains to raising a ``NonExistentTimeError`` (:issue:`13057`)
+- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behavior remains to raise a ``NonExistentTimeError`` (:issue:`13057`)
 - ``pd.to_numeric()`` now accepts a ``downcast`` parameter, which will downcast the data if possible to smallest specified numerical dtype (:issue:`13352`)
 
 .. ipython:: python
@@ -448,7 +482,6 @@ Other enhancements
 - ``DataFrame`` has gained the ``.asof()`` method to return the last non-NaN values according to the selected subset (:issue:`13358`)
 - The ``DataFrame`` constructor will now respect key ordering if a list of ``OrderedDict`` objects are passed in (:issue:`13304`)
 - ``pd.read_html()`` has gained support for the ``decimal`` option (:issue:`12907`)
-- A function :func:`union_categorical` has been added for combining categoricals, see :ref:`Unioning Categoricals<categorical.union>` (:issue:`13361`, :issue:`:13763`, :issue:`13846`)
 - ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)
 - ``DataFrame.to_sql()`` now allows a single value as the SQL type for all columns (:issue:`11886`).
 - ``Series.append`` now supports the ``ignore_index`` option (:issue:`13677`)
@@ -512,7 +545,7 @@ API changes
 ``Series.tolist()`` will now return Python types
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-``Series.tolist()`` will now return Python types in the output, mimicking NumPy ``.tolist()`` behaviour (:issue:`10904`)
+``Series.tolist()`` will now return Python types in the output, mimicking NumPy ``.tolist()`` behavior (:issue:`10904`)
 
 
 .. ipython:: python
@@ -547,7 +580,7 @@ including ``DataFrame`` (:issue:`1134`, :issue:`4581`, :issue:`13538`)
 
 .. warning::
    Until 0.18.1, comparing ``Series`` with the same length, would succeed even if
-   the ``.index`` are different (the result ignores ``.index``). As of 0.19.0, this will raises ``ValueError`` to be more strict. This section also describes how to keep previous behaviour or align different indexes, using the flexible comparison methods like ``.eq``.
+   the ``.index`` are different (the result ignores ``.index``). As of 0.19.0, this will raise ``ValueError`` to be more strict. This section also describes how to keep previous behavior or align different indexes, using the flexible comparison methods like ``.eq``.
 
 
 As a result, ``Series`` and ``DataFrame`` operators behave as below:
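A short sketch of the stricter comparison described in the hunk above (not part of the diff; it assumes the 0.19 behavior that same-length ``Series`` with different indexes raise, while flexible methods such as ``.eq`` align):

    import pandas as pd

    s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
    s2 = pd.Series([1, 2, 3], index=[1, 2, 3])

    try:
        s1 == s2                 # same length, different index
    except ValueError as exc:
        print(exc)               # comparison now raises instead of silently ignoring the index

    print(s1.eq(s2))             # flexible comparison aligns the indexes instead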
@@ -615,7 +648,7 @@ Logical operators
 
 Logical operators align both ``.index``.
 
-Previous Behavior (``Series``), only left hand side ``index`` is kept:
+Previous behavior (``Series``), only left hand side ``index`` is kept:
 
 .. code-block:: ipython
 
@@ -935,7 +968,7 @@ Index ``+`` / ``-`` no longer used for set operations
 Addition and subtraction of the base Index type and of DatetimeIndex
 (not the numeric index types)
 previously performed set operations (set union and difference). This
-behaviour was already deprecated since 0.15.0 (in favor using the specific
+behavior was already deprecated since 0.15.0 (in favor of using the specific
 ``.union()`` and ``.difference()`` methods), and is now disabled. When
 possible, ``+`` and ``-`` are now used for element-wise operations, for
 example for concatenating strings or subtracting datetimes
@@ -956,13 +989,13 @@ The same operation will now perform element-wise addition:
    pd.Index(['a', 'b']) + pd.Index(['a', 'c'])
 
 Note that numeric Index objects already performed element-wise operations.
-For example, the behaviour of adding two integer Indexes:
+For example, the behavior of adding two integer Indexes:
 
 .. ipython:: python
 
    pd.Index([1, 2, 3]) + pd.Index([2, 3, 4])
 
-is unchanged. The base ``Index`` is now made consistent with this behaviour.
+is unchanged. The base ``Index`` is now made consistent with this behavior.
 
 Further, because of this change, it is now possible to subtract two
 DatetimeIndex objects resulting in a TimedeltaIndex:
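The hunk ends just before the example that the final sentence promises; a small sketch of the subtraction it refers to (not part of the diff, assuming pandas 0.19 element-wise ``Index`` arithmetic):

    import pandas as pd

    idx1 = pd.DatetimeIndex(['2016-01-02', '2016-01-03'])
    idx2 = pd.DatetimeIndex(['2016-01-01', '2016-01-01'])

    # element-wise subtraction of two DatetimeIndex objects yields a TimedeltaIndex
    print(idx1 - idx2)  # TimedeltaIndex(['1 days', '2 days'], dtype='timedelta64[ns]', freq=None)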
@@ -1130,7 +1163,7 @@ the result of calling :func:`read_csv` without the ``chunksize=`` argument.
 
    data = 'A,B\n0,1\n2,3\n4,5\n6,7'
 
-Previous behaviour:
+Previous Behavior:
 
 .. code-block:: ipython
 
@@ -1142,7 +1175,7 @@ Previous behaviour:
    0 4 5
    1 6 7
 
-New behaviour:
+New Behavior:
 
 .. ipython :: python
 
@@ -1268,7 +1301,7 @@ These types are the same on many platform, but for 64 bit python on Windows,
 ``np.int_`` is 32 bits, and ``np.intp`` is 64 bits. Changing this behavior improves performance for many
 operations on that platform.
 
-Previous behaviour:
+Previous Behavior:
 
 .. code-block:: ipython
 
@@ -1277,7 +1310,7 @@ Previous behaviour:
    In [2]: i.get_indexer(['b', 'b', 'c']).dtype
    Out[2]: dtype('int32')
 
-New behaviour:
+New Behavior:
 
 .. code-block:: ipython
 
pandas/core/internals.py (+3 -4)

@@ -4787,10 +4787,9 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
         [get_mgr_concatenation_plan(mgr, indexers)
          for mgr, indexers in mgrs_indexers], concat_axis)
 
-    blocks = [make_block(concatenate_join_units(join_units, concat_axis,
-                                                copy=copy),
-                         placement=placement)
-              for placement, join_units in concat_plan]
+    blocks = [make_block(
+        concatenate_join_units(join_units, concat_axis, copy=copy),
+        placement=placement) for placement, join_units in concat_plan]
 
     return BlockManager(blocks, axes)

pandas/tests/series/test_combine_concat.py (+2 -2)

@@ -185,9 +185,9 @@ def test_concat_empty_series_dtypes(self):
                          'category')
         self.assertEqual(pd.concat([Series(dtype='category'),
                                     Series(dtype='float64')]).dtype,
-                         np.object_)
+                         'float64')
         self.assertEqual(pd.concat([Series(dtype='category'),
-                                    Series(dtype='object')]).dtype, 'category')
+                                    Series(dtype='object')]).dtype, 'object')
 
         # sparse
         result = pd.concat([Series(dtype='float64').to_sparse(), Series(
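The two updated assertions correspond to the following observable results; this is an illustrative sketch run outside the test suite (not part of the diff), assuming pandas 0.19 as targeted by this commit:

    import pandas as pd
    from pandas import Series

    # empty category + empty float64 now resolves to float64 rather than object
    print(pd.concat([Series(dtype='category'), Series(dtype='float64')]).dtype)  # float64

    # empty category + empty object now resolves to object rather than category
    print(pd.concat([Series(dtype='category'), Series(dtype='object')]).dtype)   # object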
